
Title: Protein structure prediction in 1D, 2D, and 3D
Author: Burkhard Rost
Quote: P. von Rague-Schleyer, N. L. Allinger, T. Clark, J. Gasteiger, P. A. Kollman, H. F. Schaefer (eds.), 'Encyclopedia of Computational Chemistry', John Wiley: Sussex, (1998), 2242-2255

Introduction for 'Protein structure prediction in 1D, 2D, and 3D'

Proteins are the machinery of life. The information for life is stored in the genes (DNA) in a four-letter alphabet. Proteins are among the macromolecules that perform all important tasks in organisms, such as catalysis of biochemical reactions, transport of nutrients, and recognition and transmission of signals. Thus, genes are the blueprints, or library, and proteins are the machinery of life. Proteins are formed by joining amino acids through peptide bonds into an extended chain. The protein sequence is thus a translation of the four-letter DNA alphabet into a 20-letter alphabet of native amino acids. Proteins differ in length (from 30 to over 30,000 amino acids) and in the arrangement of the amino acids (dubbed residues when joined in proteins). In water, the chain folds up into a unique three-dimensional (3D) structure. The main driving force is the need to pack residues for which contact with water is energetically unfavourable (hydrophobic residues) into the interior of the molecule. A detailed analysis of the underlying chemistry shows that this is only possible if the protein forms regular patterns of a macroscopic substructure called secondary structure (Fig. 1).
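The translation from the four-letter DNA alphabet into the 20-letter amino acid alphabet can be illustrated with a minimal sketch. It assumes the standard genetic code (which the text above does not spell out), reads a coding strand in frame from its first base, and stops at the first stop codon; the names BASES, AMINO_ACIDS, CODON_TABLE, and translate are hypothetical and chosen only for this illustration.

```python
import itertools

# Assumption: the standard genetic code, listed in the conventional
# T, C, A, G base order (one letter per codon, '*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)

# Map each of the 64 three-letter codons to a one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding strand into a protein sequence.

    Reads codons in frame from the first base and stops at the first
    stop codon; trailing bases that do not fill a codon are ignored.
    """
    dna = dna.upper()
    residues = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)


if __name__ == "__main__":
    # Hypothetical toy sequence: start codon, two more codons, stop codon.
    print(translate("ATGGCCTGGTAA"))
```

The 64 codons map onto only 20 amino acids plus a stop signal, so the translation is many-to-one: the protein sequence follows from the gene, but not the other way round.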

Sequence determines structure determines function. Protein three-dimensional (3D) structure (i.e. the co-ordinates of all atoms) determines protein function. But what determines 3D structure? The hypothesis that structure (also referred to as 'the fold') is uniquely determined by the specificity of the sequence has been verified for many proteins. While it is now known that particular proteins (chaperones) often play a rôle in the folding pathway and in correcting misfolds, it is still generally assumed that the final structure is at the free-energy minimum. Thus, all information about the native structure of a protein is coded in the amino acid sequence, plus its native solution environment. Can the code be deciphered, i.e. can 3D structure be predicted from sequence? In principle, the code could be deciphered from physico-chemical principles using, for example, molecular dynamics methods. In practice, however, such approaches are frustrated by two principal obstacles. Firstly, energy differences between native and unfolded proteins are extremely small (order of 1 kcal/mol). Secondly, the high complexity (i.e. co-operativity) of protein folding requires several orders of magnitude more computing time than we expect to have over the next decades. Thus, the inaccuracy in experimentally determining the basic parameters and the limited computing resources become fatal for predicting protein structure from first principles. The only successful structure prediction tools are knowledge-based, using a combination of statistical theory and empirical rules.
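To give a feeling for why an energy difference of the order of 1 kcal/mol is so demanding, the following sketch computes Boltzmann populations for a simple two-state (native versus unfolded) model. The two-state simplification, the temperature of 298.15 K, and the function names population_ratio and fraction_native are assumptions made for this illustration only; they are not taken from the review.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def population_ratio(delta_g_kcal, temperature_k=298.15):
    """Native:unfolded population ratio in a two-state model.

    delta_g_kcal is the stability of the native state, i.e.
    G(unfolded) - G(native), in kcal/mol (assumed sign convention).
    """
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


def fraction_native(delta_g_kcal, temperature_k=298.15):
    """Fraction of molecules in the native state for the same model."""
    ratio = population_ratio(delta_g_kcal, temperature_k)
    return ratio / (1.0 + ratio)


if __name__ == "__main__":
    # An error of 1 kcal/mol in either direction changes the prediction
    # from 'mostly native' to 'mostly unfolded'.
    for delta_g in (-1.0, 0.0, 1.0):
        print(delta_g, round(fraction_native(delta_g), 3))
```

Under these assumptions, 1 kcal/mol corresponds to a population ratio of roughly five to one at room temperature, so an error of that size in a computed energy is enough to reverse the predicted preference between the native and the unfolded state.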

The sequence-structure gap is rapidly increasing. Currently, databases for protein sequences (e.g. SWISS-PROT) are expanding rapidly, largely due to large-scale genome sequencing projects. The first four entire genome sequences have been published; they represent all three terrestrial kingdoms: (1) prokaryotes: Haemophilus influenzae and Mycoplasma genitalium; (2) eukaryotes: yeast; and (3) archaea: Methanococcus jannaschii. At least another dozen genomes will be completely sequenced before the end of 1997 (Terry Gaasterland, priv. communication); the entire human genome is likely to be known in the year 2003. This implies that the explosion of genome, and hence protein, sequences is supposedly the only field outgrowing the pace of development of computer hardware. It also implies that, despite significant improvements in structure determination techniques, the gap between the number of proteins for which a structure is deposited in public databases (PDB) and the number of proteins for which sequences are known is increasing.

Can the egg be unboiled? When an egg is boiled, the proteins it contains unfold. Can this procedure be reversed in theory? Can the encrypted code of protein structure be deciphered? Or, can theory help to bridge the sequence-structure gap? Indeed, for over 30 years, there has been an ardent search for methods to predict protein structure from sequence. Many methods initially looked very promising, but the hope has always been dashed. How well do we do?

No general prediction of structure from sequence yet. An important experiment has been initiated by John Moult (CARB, Washington): those who determine protein structures submitted the sequences of proteins whose structures they were about to solve to a 'to-be-predicted' database; for each entry in that database, predictors could send in their predictions before a given deadline (the public release of the structure); finally, the results were compared and discussed during a workshop (in Asilomar, California). Two such experiments have been completed: in December 1994 (Proteins special issue, Vol. 23, 1995) and in December 1996 (to be published in Proteins, 1997). The results of both experiments demonstrated clearly that the goal of predicting structure from sequence has not yet been reached. So, has there been no improvement, despite ardent attempts and the explosion of knowledge deposited in databases?

Indeed, there is a flood of literature on protein structure prediction attempting to keep pace with the expanding databases. This review focuses on recent prediction methods that actually contribute to bridging the sequence-structure gap, in particular with a view to analysing entire genomes. The first section provides a brief sketch of where we are today in protein structure prediction. The following chapters sketch the problems, and some of the solutions, in database searches and in the prediction of protein structure in 1D, 2D, and 3D (Fig. 1).


