Protein structure prediction in 1D, 2D, and 3D

In: P. von Rague-Schleyer, N. L. Allinger, T. CClark, J. Gasteiger, P. A. Kollman, H. F. Schaefer 'Encyclopedia of Computational Chemistry', John Wiley: Sussex, 1998, 2242-2255

Burkhard Rost

EMBL, 69 012 Heidelberg, Germany

In: P. von Rague-Schleyer, N. L. Allinger, T. CClark, J. Gasteiger, P. A. Kollman, H. F. Schaefer 'Encyclopedia of Computational Chemistry', John Wiley: Chisester, 2242-2255

Table of contents

List of Abbreviations

3D, three-dimensional; 3D structure, three-dimensional (co-ordinates of protein structure); 2D, two-dimensional; 2D structure, two-dimensional (e.g. inter-residue distances); 1D, one-dimensional; 1D structure, one-dimensional (e.g. sequence or string of secondary structure); U, protein sequence of unknown 3D structure (e.g. search sequence); T, target used for homology modelling (protein of known 3D structure); PDB, Protein Data Bank of experimentally determined 3D structures of proteins; SWISS-PROT, data base of protein sequences

Key words

protein structure prediction, threading, remote homology detection, fold recognition, secondary structure, relative solvent accessibility, multiple alignments, dynamic programming, neural networks.

1 Introduction

Proteins are the machinery of life. The information for life is stored by a four-letter alphabet in the genes (DNA). Proteins are, among others, the macromolecules that perform all important tasks in organisms, such as catalysis of biochemical reactions, transport of nutrients, recognition, and transmission of signals. Thus, genes are the blueprints or library, and proteins are the machinery of life. Proteins are formed by joining amino acids by peptide bonds into a stretched chain. This protein sequence comprises a translation of the four-letter DNA alphabet into a 20-letter alphabet of native amino acids. Proteins differ in length (from 30 to over 30,000 amino acids), and in the arrangement of the amino acids (dubbed residues, when joined in proteins). In water, the chain folds up into a unique three-dimensional (3D) structure. The main driving force is the need to pack residues for which a contact with water is energetically unfavourable (hydrophobic residues) into the interior of the molecule. A detailed analysis of the underlying chemistry shows that this is only possible if the protein forms regular patterns of a macroscopic substructure called secondary structure (Fig. 1; for an excellent introduction into protein structure: [1]; for a short review of the basic principles of folding: [2]).

Sequence determines structure determines function. Protein three-dimensional (3D) structure (i.e. the co-ordinates of all atoms) determines protein function. But what determines 3D structure? The hypothesis that structure (also referred to as 'the fold') is uniquely determined by the specificity of the sequence, has been verified for many proteins [3]. While it is now known that particular proteins (chaperones) often play a rôle in the folding pathway, and in correcting misfolds [4], it is still generally assumed that the final structure is at the free-energy minimum. Thus, all information about the native structure of a protein is coded in the amino acid sequence, plus its native solution environment. Can the code be deciphered, i.e. can 3D structure be predicted from sequence? In principle, the code could by deciphered from physico-chemical principles using, for example, molecular dynamics methods [5]. In practice, however, such approaches are frustrated by two principle obstacles. Firstly, energy differences between native and unfolded proteins are extremely small (order of 1 kcal/mol). Secondly, the high complexity (i.e. co-operativity) of protein folding requires several orders of magnitudes more computing time than we anticipate to have over the next decades. Thus, the inaccuracy in experimentally determining the basic parameters, and the limited computing resources become fatal for predicting protein structure from first principles [6]. The only successful structure prediction tools are knowledge-based, using a combination of statistical theory and empirical rules.

The sequence-structure gap is rapidly increasing. Currently, databases for protein sequences (e.g. SWISS-PROT [7]) are expanding rapidly, largely due to large-scale genome sequencing projects. The first four entire genome sequences have been published; they represent all three terrestrial kingdoms: (1) prokaryotes: haemophilus influenzae [8], and mycoplasma genitalium [9]; (2) eucaryotes: yeast [10]; and (3) archeans: methanococcus jannaschii [11]. At least, another dozen of genomes will be completely sequenced before the end of 1997 (Terry Gaasterland, priv. communication); the entire human genome is likely to be known in the year 2003. This implies that the explosion of genome, and hence, protein sequences is supposedly the only field outgrowing the speed in development of computer hardware. It also implies, that despite significant improvements of structure determination techniques the gap between the number of proteins for which structure is deposited in public databases (PDB [12]), and the number of proteins for which sequences are known is increasing.

Can the egg be unboiled? When an egg is boiled, the proteins it contains unfold. Can this procedure be reversed in theory? Can the encrypted code of protein structure be deciphered? Or, can theory help to bridge the sequence-structure gap? Indeed, for over 30 years, there has been an ardent search for methods to predict protein structure from the sequence. Many methods were found which looked initially very promising - but always the hope has been dashed. How well do we do?

No general prediction of structure from sequence, yet. An important experiment has been initiated by John Moult (CARB, Washington): those who determine protein structures submitted the sequences of proteins for which they were about to solve the structure to a 'to-be-predicted' database; for each entry in that database predictors could send in their predictions before a given deadline (the public release of the structure); finally, the results were compared, and discussed during a workshop (in Asilomar, California). Two such experiments have been completed: in December 1994 (Proteins special issue, Vol. 23, 1995), and in December 1996 (to be published in Proteins, 1997). The results of both experiments demonstrated clearly that the goal to predict structure from sequence has not been reached, yet. So, no improvement despite ardent attempts, and the explosion of knowledge deposited in databases?

Indeed, there is a flood of literature on protein structure prediction attempting to keep track with the expanding databases (reviews: [13, 14]; books: [15, 16]; a practical approach to structure prediction and sequence analysis: [17-19]. In this review focus will be laid on recent prediction methods that do actually contribute to bridging the sequence structure gap in particular in view of analysing entire genomes. The first section will provide a brief sketch about where we are today in protein structure prediction. The following chapters will sketch the problems, and some of the solutions in database searches, and the prediction of protein structure in 1D, 2D, and 3D (Fig. 1).

Fig. 1

Representation of HIV-1 protease monomer (Protein Data Bank code 1HHP) in one, two and three dimensions. Each of the representations gives rise to a different type of prediction problem.

1D - predict secondary structure and solvent accessibility. From left to right: amino acids for the first 33 residues (one letter code, first column); alignment exemplified by 5 sequences (second column); secondary structure [80] (H, helix; E, strand; blank, other: third column), solvent accessibility (measured in Å, fourth column, [80]), and a typical prediction by the neural network program PHD [39] for secondary structure and solvent accessibility (in italics, fifth and sixth column).

2D - predict contact map. The 3D structure can be projected onto a two-dimensional matrix of inter-residue distances or contacts (as shown here). The entry at position ij of the matrix gives the contact strength between residue i and residue j . The stronger a contact, the darker the marker. Horizontal and vertical lines give borders of secondary structure segments. Graph made with CONAN [81].

3D - predict three-dimensional coordinates. The trace of the protein chain in 3D is plotted schematically as a ribbon Ca-trace. Strands are indicated by arrows, the short helix is on the right towards the end (C-term) of the protein. Graph made with MOLSCRIPT [82]. Prediction not shown.

2 State-of-the-art in protein structure prediction

Bridging the sequence-structure gap for 10 - 30% of all sequences. The gap between the number of known sequences (>170,000 [20]) and the number of known structures (about 5,000 [12]) is widening rapidly. The most successful theoretical approach to bridging this gap is homology modelling. The principle idea bases on the following observation. Each native protein sequence adopts a unique structure. However, many different sequences can adopt the same basic fold. In other words, proteins with similar sequences tend to fold into similar structures. Indeed, for a pair of naturally evolved proteins levels of 25-30% pairwise sequence identity (percentage of residues identical between the two proteins) are sufficient to assure that the two proteins fold into similar structures [21-23]. Thus, if a sequence of unknown structure (denote U) has significant sequence similarity to a protein of known structure (T), it is possible to construct an approximate 3D model for U based on the assumption that U simply has basically the same structure as T . This technique is referred to as homology modelling. It effectively raises the number of 'known' 3D structures from 5,000 to over 50,000 ([24]; Fig. 2).

Fig. 2

Scope of structure prediction. Given any expressed protein, how likely can theory predict its 3D structure? For example, for 30% of the proteins in the current SWISS-PROT database we can find regions for which homology modelling is applicable [24], but for the first four entirely sequenced genomes (shown is yeast) this is true for less than 10% of all proteins [83]. Thus, SWISS-PROT contains a bias introduced, e.g., by limitations of previous sequencing techniques. Estimating the contribution of fold recognition or threading techniques is problematic. Margins given are certainly over-estimated in terms of the accuracy of current threading methods, and supposedly under-estimated in terms of the number of remote homologues that could be detected. (Note however: today threading techniques are not accurate enough for any large-scale prediction of 3D structure!) The remaining region (50-80%) is occupied by unknown folds for which no somehow accurate predictions in 3D can be obtained.

Widening the bridge by threading. Homology modelling allows prediction of 3D structure for 10-30% of all protein sequences. However, there is evidence that most pairs of proteins with similar structure are remote homologues with less than 25% pairwise sequence identity [25]. These remote homologues cannot usually be recognised by conventional sequence alignments, as this level of sequence identity is not significant for structural similarity in the following sense. If one were to collect all pairwise alignments of < 25% sequence identity that result from a search with U against a database of protein sequences, than the vast majority of these pairs would be entirely unrelated proteins. Thus, most similar structures appear to be remote homologues, but most possible pairs at low levels of sequence identity are, in fact, unrelated. Consequently, searching for remote homologues is similar to the task of finding a needle in a haystack [26]. Techniques to manage this difficult task are referred to as 'threading techniques'. Most of these techniques are applicable if and only if the remote homologue to U has known structure. Once a remote homology is detected, remote homology modelling may be used to construct a 3D model. This could potentially reduce the sequence-structure gap by an additional 10,000 - 50,000 proteins (Fig. 2). Given a sequence U from one of the complete genome sequences which have recently become available; what is the likelihood that the 3D structure can be predicted for U by homology modelling or remote homology modelling? A conservative answer is: 10%, based on the success of sequence alignment-based homology modelling (Fig. 50%, assuming all remote homologues could be recognised (Fig. 2).

Accurate prediction for 1D aspects of 3D structure. If no remote homologue can be detected for U , we are forced to simplify the prediction problem. There is a pay-off from making this simplification: using the rich diversity of information in current databases, it is possible to make very accurate 1D predictions from the sequence alone. Automatic prediction services are readily available for secondary structure, solvent accessibility, location and topology for transmembrane helices [27], and the location of helices for the special class of coiled-coil proteins [28].

3 Sequence alignments

Basic concept. The principle problem of sequence alignments is to find the optimal superposition between two strings of amino (or nucleic) acid sequences, i.e. to optimally align the two strings. The most simple objective is to optimise the percentage of residues that are identical between the two sequences. A dynamic programming algorithm is guaranteed to find the optimal solution for the problem in an algorithmic time quadratic in the length of both sequences [29] (Fig. 3). For the alignment of protein sequences this simple approach is not sufficient as finding the best alignments, usually, requires to introduce gaps in one sequence, or insertions in the other [30] (Fig. 3: rather than placing two dots into sequence T , residues A and E could be deleted in sequence U ; note: the gap increases the score from 2 to 4 identical residues). The introduction of such gaps is mathematically treated by adding a constant (gap open penalty) to the final score (here number of identical residues). However, to sensitively align protein sequences, this still is not sufficient. The major addition to the simple approach described so far is to evaluate scores not based on residue identities but based on bio-chemical properties of amino acid. For example, aligning two hydrophobic residues (I and L) is more beneficial than aligning a hydrophobic and a charged residue (L and K; note: when treating hydrophobic residues as identical, the score for the best gapped alignment in Fig. 3 increases from 4 to 6).

Fig. 3

The simple dynamic programming algorithm simply proceeds in the following way. The two sequences to be aligned (U and T ) are written into a matrix. Starting from the first element in the matrix, identities are counted and summed along the diagonal. The two best paths are marked by grey lines. The two best alignments match only two identical residues for the example given. However, if insertions (marked by dots) were allowed, the best alignment, actually, matches four residues.

Evolution distinguishes signal from noise. At the level of protein molecules, selective pressure results from the need to maintain function, which in turn requires maintenance of the specific 3D structure. This evolutionary history is the basis for the success in aligning protein (or nucleotide) sequences. Accordingly, conservation and mutation patterns observed in alignments contain very specific information about 3D structure. How much variation is tolerated without loss of structure? Two naturally evolved proteins with more than 25% identical residues (length > 80 residues) are very likely to be similar in 3D structure[23].

Task trivial for high levels of sequence identity. Any sequence analysis starts with database searches: all known databases are scanned by sequence alignment procedures for proteins homologous to the search sequence U . When the pairwise sequence identity between U and a putative homologue H is over 25-30% (for more than 80 residues), alignment procedures are usually straightforward [31-35]. For less similar protein pairs, alignments may fail.

Routine database searches by simplified procedures. Aligning two sequences by dynamic programming is a matter of seconds on a modern workstation. However, database searches require to repeat this many times, and since the databases grow, CPU time becomes a constraint in everyday sequence analysis. This bottleneck is opened by methods that start to find 'identical words' (sub-strings), and then grow the alignment around such blocks. The most widely used programs of this sort are BLAST and FASTA [32, 34]. In practice, advanced alignment algorithms typically proceed by first running a fast scan with BLAST and/or FASTA, and by then applying the full dynamic programming algorithm. To illustrate sequence analysis in practice: aligning the 6,000 sequences of yeast against all known proteins was recently accomplished in 72 hours on 64 SGI 10 000 processors [36].

Multiple alignments improve as data banks grow. The most advanced sequence alignment tools base the alignment on profiles derived from databases or particular sequence families [37, 14]. One new generation of alignment methods is based on Hidden Markov Models, another on genetic algorithms. These new methods may be more successful in intruding into the twilight zone of sequence alignments (20-30% sequence identity; [22]) than advanced profile-based methods. However, this remains to be proven.

Drawback: lack of sufficiently tested cut-off criteria. There are many different alignment methods available for those who need to run database searches for their everyday's work. Which method is best? One of the difficulties in comparing different alignment procedures is the lack of well-defined criteria for measuring the alignment quality. Very few papers have attempted to define such measures for the comparison of various methods [38]. The second problem for users is that most methods do not supply a cut-off criterion for distinguishing between homologous and non-homologous sequences (i.e., false positives). For some large sequence families remote homologues can be aligned correctly, but for most cases sequences aligned to the search protein U at levels below 25% pairwise sequence identity will be false positives, i.e., will have no structural or functional similarity to U . A simple length-dependent cut-off based on sequence identity is provided by the program MAXHOM [23]. However, this threshold neither quantifies the influence of bio-chemical similarities between amino acids, nor the occurrence of gaps.

4 Prediction in 1D

4.1 Secondary structure

Basic concept. The principal idea underlying most secondary structure prediction methods is the fact that segments of consecutive residues have preferences for certain secondary structure states [1, 39]. Thus, the prediction problem becomes a pattern-classification problem tractable by pattern recognition algorithms. The goal is to predict whether the residue at the centre of a segment of typically 13-21 adjacent residues is in a helix, strand or in none of the two (no regular secondary structure, often referred to as the 'coil' or 'loop' state). Many different algorithms have been applied to tackle this simplest version of the protein structure prediction problem: physico-chemical principles, rule-based devices, expert systems, graph theory, linear and multi-linear statistics, nearest-neighbour algorithms, molecular dynamics, and neural networks [39]. However, until recently performance accuracy seemed to have been limited to about 60% (percentage of residues correctly predicted in either helix, strand, or other). The limited accuracy was argued to result from the fact that all methods used only information local in sequence (window of less than 20 adjacent residues). Local information was estimated to account for roughly 65% of the secondary structure formation. Two additional problems were common to all methods developed from 1957 to 1993: (1) strands were predicted at levels of accuracy only slightly superior to random predictions, and (2) predicted secondary structure segments were, on average, only half as long as observed segments. The later two shortcomings could be surmounted by using a particular combination of neural networks [39].

Evolutionary information key to significantly improved predictions. On the one hand, about 75 out of 100 residues can be exchanged in a protein without changing structure. On the other hand, exchanges of 1-5 residues can already destabilise a protein structure. These statements may appear contradictory. However, the explanation is simple: evolution has explored exactly the unlikely exchanges of particular amino acids at particular positions that do not change structure, as a change of structure usually results in a loss of function (and thus would not survive). Thus, the residue exchange patterns extracted from a protein family (i.e. alignments of similar sequences) are highly indicative of the specific structural details for that family. The first method that reached a sustained level of a three-state prediction accuracy above 70% was the profile-based neural network system PHD which uses exactly such evolutionary information derived from multiple sequence alignments as input [39]. By stepwise incorporation of particular evolutionary information, prediction accuracy (Fig. 72% accuracy [39]. An interesting, technical detail of this network system is that the use of a global 'descriptor', namely the overall amino acid composition (percentage of occurrence of each of the 20 amino acids) does not affect the local score for accuracy as measured by the percentage of correctly predicted single residues. Using amino acid composition, however, improves the accuracy in terms of a more global score, such as the difference between the percentage of observed and predicted secondary structure [39]. Was the neural network an essential tool for the most accurate secondary structure prediction? A nearest-neighbour algorithm can be used to incorporate evolutionary information in a similar manner as the neural network system; the result is a similar performance [40]. Methods combining statistics, and multiple alignment information have been clearly less successful, so far. In comparison to methods using single sequence information only, methods making use of the growing databases are 6-14 percentage points more accurate. Thus, using evolutionary information secondary structure can now be predicted more accurately and reliably than other features of protein structure.

Fig. 4

PHDsec: profile-based neural network system for secondary structure prediction [39]. From the multiple alignment (here guide sequence SH3 plus 4 other proteins a1-a4, note: lower case letters indicate deletions in the aligned sequence) a profile of amino acid occurrences is compiled. To the resulting 20 values at one particular position m in the protein (one column) three values are added: the number of deletions and insertions, and the conservation weight (CW). 13 adjacent columns are used as input. The whole network system for secondary structure prediction consists of 3 layers: 2 network layers and 1 layer averaging over independently trained networks.

Secondary structure predictions now extremely useful, in practice. How good is a prediction accuracy of 72% in practice? It is certainly reasonably good compared with the prediction of secondary structure by homology modelling [41]. However, prediction accuracy varies between different proteins, i.e., prediction accuracy is 72%±9% (one standard deviation) [39]. For applications this implies that predictions can be as good as > 95%, but also as bad as <54%. Can users distinguish one from the other? A few methods successfully use reliability indices allowing to label residues for which predictions are, on average, likely to be more accurate. Indeed, for the neural network system PHD the correlation between such a reliability index and accuracy is linear [39]. Thus, the reliability index effectively becomes a means to predict prediction accuracy, and hence to assess to which class a protein of unknown structure (U) belongs: to the well predicted, or to the badly predicted ones. Various methods successfully use secondary structure predictions as a first step, e.g., prediction-based threading (indeed one of the problems of the Asilomar 1996 prediction contest was that many developers of threading algorithms used the same PHD secondary structure predictions as a first step), inter-strand, and inter-residue distance predictions. However, the use of secondary structure predictions is not limited to structure prediction. Instead, the results of, for instance, the public prediction service (PredictProtein [39]) have been used to assist the determination of protein structures (chain tracing in X-ray crystallography), as well as to formulate hypothesis about protein structure, and function that guided experiments in molecular biology, in general (in particular: prediction of binding sites, homologous proteins, design of residue mutations).

Separate prediction of secondary structure content not very useful. Proteins have been partitioned into various structural classes, e.g., based on the percentage of residues assigned to helix, strand, and other [42]. However, such a coarse-grained classification is not well-defined [43]. Consequently, given a protein sequence U of unknown structure, attempts to first predict the secondary structure content for U and then to use the result to predict the secondary structural class (i.e., all-a, all-b or intermediates) is of limited practical use. How do alignment-based predictions compare to experimental means of determining the content in secondary structure? For example, PHD is, on average surprisingly, about as accurate as circular dichroism spectroscopy [43]. Of course, this does not imply that predictions can replace experiments. In particular, variation of secondary structure as a result of changes in environmental conditions (e.g. solvent) is generally only accessible experimentally.

4.2 Solvent accessibility

Basic concept. It has long been argued that if the segments of secondary structure could be accurately predicted, the 3D structure could be predicted by simply trying different arrangements of the segments in space [44]. One criterion for assessing each arrangement could be to use predictions of residue solvent accessibility [45, 46]. The principal goal is to predict the extent to which a residue embedded in a protein structure is accessible to solvent (Fig. 5). Solvent accessibility can be described in several ways [45, 46]. The simplest is a two-state description distinguishing between residues that are buried (relative solvent accessibility < 16%) and exposed (relative solvent accessibility ³ 16%). The classical method to predict accessibility is to assign either of the two states, buried or exposed, according to residue hydrophobicity. However, a neural network prediction of accessibility has been shown to be superior to simple hydrophobicity analyses [47].

Fig. 5

Residue solvent accessibility is usually measured by rolling a spherical water molecule over a protein surface and summing the area that can be accessed by this molecule on each residue (typical values range from 0-300 Å2). To allow comparisons between the accessibility of long extended and spherical amino acids, typically relative values are compiled (actual area as percentage of maximally accessible area). A more simplified descriptions distinguishes two states: buried (here residues C and D) and exposed (here residues A, B, E, F, and G) residues. Since the packing density of native proteins resembles that of crystals, values for solvent accessibility provide upper and lower limits to the number of possible inter-residue contacts.

Evolutionary information improves prediction accuracy. Solvent accessibility at each position of the protein structure is evolutionarily conserved within sequence families. This fact has been used to develop methods for predicting accessibility using multiple alignment information [48]. Prediction accuracy is about 75±7%, four percentage points higher than for methods not using alignment information. Predictions of solvent accessibility have also been used successfully for prediction-based threading, as a second criterion towards 3D prediction by packing secondary structure segments according to upper and lower bounds provided by accessibility predictions, and as basis for predicting functional sites [48]. More recently, predictions of accessibility were also used successfully to predict sub-cellular location (Andrade & O'Donoghue, priv. communication).

4.3 Transmembrane helices

Basic concept. Even in the optimistic scenario that in the near future most protein structures will be either experimentally determined, one class of proteins will still represent a challenge for experimental determination of 3D structure: transmembrane proteins. The major obstacle with these proteins is that they do not crystallise, and are hardly tractable by NMR spectroscopy. Consequently, for this class of proteins structure prediction methods are even more needed than for globular water-soluble proteins. Fortunately, the prediction task is simplified by strong environmental constraints on transmembrane proteins: the lipid bilayer of the membrane reduces the degrees of freedom to such an extent that 3D structure formation becomes almost a 2D problem. Two major classes of membrane proteins are known: proteins which insert helices into the lipid bilayer (Fig. 6), and proteins that form pores by a barrel of 16 strands (the only known cases of this type are porins [49]). Since there is not much experimental information available on different porin-like membrane proteins, we can hardly estimate prediction accuracy for this class. The situation is quite different for helical membrane proteins. Once the location of transmembrane segments is known for helical transmembrane proteins, 3D structure can be predicted by exploring all possible conformations [50]. Additionally, predicting the locations of these transmembrane helices is a much simpler problem than is the prediction of secondary structure for soluble proteins. Elaborated combinations of expert-rules, hydrophobicity analyses and statistics yields a two-state per-residue accuracy of about 90% (residues predicted correctly as either transmembrane helix, or other).

Fig. 6

Topology of helical transmembrane proteins. In one class of membrane proteins, typically apolar helical segments are embedded in the lipid bilayer oriented perpendicular to the surface of the membrane. The helices can be regarded as more or less rigid cylinders. The orientation of the helical axes, i.e. the topology of the transmembrane protein, can be defined by the orientation of the first N-terminal residues with respect to the cell. Topology is defined as out when the protein N-term (first residue) starts on the extra-cytoplasmic region (protein A), and as in if the N-term starts on the intra-cytoplasmic side (proteins B and C).

Evolutionary information improves prediction accuracy. For two methods the use of multiple alignment information is reported to clearly improve the accuracy of predicting transmembrane helices [51, 39]. The best current prediction methods have a similar high accuracy around 95%. One such method uses a system of neural networks similar to the one sketched in Fig. 4. In order to predict the orientation of the helices (i.e. the topology Fig. 6) a simple rule is applied: positively charged residues occur more often in intra-cytoplasmic than in extra-cytoplasmic regions. The advanced neural network system has been improved significantly by adding a dynamic programming algorithm to the neural network output. The principle idea is to use the neural network output as an energy landscape and to find the optimal path through this landscape [27]. As reliable data for the locations of transmembrane helices exists only for a few proteins, data used for deriving these methods originate predominantly from experiments in cell biology and gene-fusion techniques. Different experimental groups often report different locations for transmembrane regions. Thus, the level of 95% accuracy is not verifiable. Despite this uncertainty in detail, the prediction of transmembrane helices is a valuable tool to quickly scan entire chromosomes [27]. The classification into membrane/not-membrane proteins has an expected error rate of less than two percent, i.e., about two percent of the proteins predicted to contain transmembrane regions will probably be false positives. The predictions of transmembrane helices has provided a lower bound to approach the question of how many proteins organisms need for, e.g., communication: the percentage of proteins with transmembrane helices has been estimated to be about 25% for yeast and haemophilus influenzae, and around 10-15% for mycoplasma genitalium and methanococcus jannaschii (Rost, manuscript in preparation).

5 Prediction in 2D

5.1 Inter-residue contacts

Prediction problem is a hard one, but the stakes are high. Given all inter-residue contacts or distances (Fig. 3D structure can be reconstructed by distance geometry or molecular dynamics. This is used for the determination of 3D structures by nuclear magnetic resonance (NMR) spectroscopy which produces experimental data of distances between protons [52]. Can inter-residue contacts be predicted? Obviously, some fraction of these contacts can be: helices and strands can be assigned based on hydrogen-bonding pattern between residues. Thus, a successful prediction of secondary structure implies a successful prediction of some fraction of all the contacts. However, contacts predicted from secondary structure assignment are short-ranged, i.e., between residues nearby in sequence. For a successful application of distance geometry, long-range contacts have to be predicted, i.e., contacts between residues far apart in sequence. A few methods have been proposed for the prediction of long-range inter-residue contacts. Two questions surround such methods: first, how accurate are these prediction methods on average; and second, are all important contacts predicted?

Correlated mutations can imply spatial proximity. In sequence alignments, some pairs of positions appear to co-vary in a physico-chemically plausible manner, i.e., a 'loss of function' point mutation is often rescued by an additional mutation that compensates for the change [53]. One hypothesis is that compensations would be most effective in maintaining a structural motif if the mutated residues were spatial neighbours. Attempts have been made to quantify such a hypothesis and to use it for contact predictions [54-56]. In general, prediction accuracy is rather poor, with a direct trade-off between predicting enough contacts, and predicting only correct ones, e.g., taking 5% of the best-predicted long-range contacts (sequence separation above 10 residues) the accuracy prediction is about 50% (A. Valencia, priv. communication).

Distinction between different models, no prediction of 3D, yet. Analysing correlated mutations is only one way to predict long-range inter-residue contacts. Other methods use statistics, mean-force potentials, or neural networks. So far none of the methods appears to find a path between the Scylla of missing too many true contacts and the Charibdis of predicting too many false contacts. However, some of the methods provide sufficient information to distinguish between alternative models of 3D structure (Valencia, manuscript in preparation). The ambitious goal to predict long-range inter-residue contacts sufficiently accurately will hopefully continue to attract intellectual resources.

5.2 Inter-strand contacts

Simplifying the contact prediction problem. One simplification of the problem to predict inter-residue contacts focuses on predicting the contacts between residues in adjacent strands (Fig. 1). Such an attempt is motivated by the hope that such interactions are more specific than are sequence-distant (long-range) contacts in general, and hence are easier to predict.

Identifying the correct b-strand alignment. The only method published for predicting inter-strand contacts is based on potentials of mean-force [57] similar to those used in the evaluation of strand-strand threading [58]. Propensities are compiled by database counts for 2 ¥ 2 ¥ 2 classes (parallel/anti-parallel, H-bonded/not H-bonded, N-/C-terminal). Each of the eight classes is divided further into five sub-classes in the following way. Suppose the two strand residues at positions i and j are in close in space. Then the following five residue pairs are counted in separate tables: i/j-2, i/j-1, i/j, i/j+1, i/j+2 . Such pseudo-potentials identify the correct b-strand alignment in 35-45% of the cases.

Using evolutionary information to predict inter-strand contacts. Even if the locations of strands in the sequence are known exactly, the pseudo-potentials cannot predict the correct inter-strand contacts in most cases [57]. However, when using multiple alignment information, the signal-to-noise ratio increases such that inter-strand contacts have been predicted correctly for most of the strands inspected in some test cases [57]. For the purpose of reliable contact prediction, this result is inadequate, especially as the locations of the strands are not known precisely. Can the pseudo-potentials handle errors resulting from incorrect prediction of strands? Various test examples using predictions by PHDsec [39] as input to the strand pseudo-potentials indicate that the accuracy in predicting inter-strand contacts drops (T Hubbard, unpublished), but in some cases is still high enough to be useful for approximate modelling of 3D structure [59].

6 Prediction in 3D

6.1 Known folds: homology modelling

Basic concept. An analysis of PDB reveals that all protein pairs with more than 30% pairwise sequence identity (for alignment length > 80; [23]) have homologous 3D structures, i.e., the essential fold of the two proteins is identical, details such as additional loop regions - regions not in helices or strands - may vary. Structure is more conserved than is sequence. This is the pillar for the success of homology modelling. The principal idea is to model the structure of U (protein of unknown structure) based on the template of a sequence homologue of known structure (T). Consequently, the precondition for homology modelling is that a sequence homologue of known structure is found in PDB. Since homology modelling is currently the only theoretical means to successfully predict 3D structure, this has two implications. First, homology modelling is applicable to 'only' one quarter of the known protein sequences (Fig. 2). Second, as the template of a homologue is required, no unique 3D structure can yet be predicted, i.e., no structure that has no similarity to any experimentally determined 3D structure. Suppose, there is a protein with a sequence similar to U in PDB (say T), is homology modelling straightforward?

High level of sequence identity: atomic resolution. The basic assumption of homology modelling is that U and T have identical backbones (main chain C). The task is to correctly place the side chains of U into the backbone of T . For very high levels of sequence identity between U and T (ideally differing by one residue only), side chains can be 'grown' during molecular dynamics simulations [60]. For slightly lower levels (still of high sequence similarity), side chains are built based on similar environments in known structures [61, 62]. Rotamer libraries (libraries containing all side-chain orientations observed in known structures) are used in the following way. (1) Rotamer distributions are extracted from a database of non-redundant sequences. (2) Fragments of seven (helix, strand) or five residues (other) are compiled. (3) Fragments of the same length are successively shifted through the backbone of U . (4) For modelling the side chains of U only those fragments from the rotamer library are accepted which have the same amino acid in the centre as U , and for which the local backbone is similar to that around the evaluated position). Over the whole range of sequence identity between U and T for which homology modelling is applicable, the accuracy of the model drops with decreasing similarity. For levels of at least 60% sequence identity, the resulting models are quite accurate [62, 63], for even higher values, the models are as accurate as is experimental structure determination. The limiting factor is the computation time required. How accurate is homology modelling for lower levels of sequence identity?

Low level of sequence identity: loop regions sometimes correct. With decreasing sequence identity between the known structure H and the query protein U the number of loops that have to be inserted to align the two grows. An accurate modelling of loop regions, however, implies solving the structure prediction problem. The problem is simplified in two ways. First, loop regions are often relatively short and can thus be simulated by molecular dynamics (note the CPU time required for molecular dynamics simulations grows exponentially with the number of residues of the polypeptide to be modelled). Second, the ends of the loop regions are fixed by the backbone of the template structure. Various methods are employed to model loop regions. The best have the orientation of the loop regions correct in some cases [63]. This illustrates the current limitations of molecular dynamics: not even short loop regions can be predicted from sequence. Furthermore, for experimental structure refinement (use of molecular dynamics to improve consistency, and accuracy of experimental data) molecular dynamics is successfully applied to find a better solution when starting from an almost correct structure. However, for homology modelling, molecular dynamics refinement usually reduces prediction accuracy [63]. Below about 40% sequence identity the accuracy of the sequence alignment used as basis for homology modelling becomes an additional problem. Nevertheless, even down to levels of 25-30% sequence identity, homology modelling produces coarse-grained models for the overall fold of proteins of unknown structure.

6.2 Known folds: remote homology modelling (threading)

Basic concept. As noted in the previous section, naturally evolved sequences with more than 30% pairwise sequence identity have homologous 3D structures [23]. Are all others non-homologous? Not at all. In the current PDB database there are thousands of pairs of structurally homologous pairs of proteins with less than 25% pairwise sequence identity (remote homologues). Actually, most similar protein structures are such remote homologues (Rost, in press). If a correct alignment between U (sequence of unknown structure) and a remote homologue T (pairwise sequence identity to U < 25%) is given, one could build the 3D structure of U by (remote) homology modelling based on the template of T . A successful remote homology modelling must solve three different tasks. (1) The remote homologue (T) has to be detected. (2) U and T have to be correctly aligned. (3) The homology modelling procedure has to be tailored to the harder problem of extremely low sequence identity (with many loop regions to be modelled). Most methods developed so far have been primarily addressed to solve the first, and the second problem. The basic idea is to thread the sequence of U into the known structure of T and to evaluate the fitness of sequence for structure by some kind of environment-based or knowledge-based potential [64, 65]. Threading is in some respects a harder problem than is the prediction of 3D structure (NP-complete: [66]; no physical connection between remote homologues, as many remotely homologous protein pairs may have originated from different ancestors, Rost, in preparation; [67]). However, the stakes are high: solving the threading problem could enable the prediction of thousands of protein structures. Indeed, threading has evolved to become one of the most active fields in the arena of protein structure prediction (with well over 100 annual publications).

Variety of threading techniques. The optimism generated by one of the first papers on threading published in the 90s [68] has boosted attempts to develop threading methods. The principle idea has been to use structural propensities of amino acids (such as preferences for secondary structure formation, hydrophobicity, and polarity), and to then assess whether or not a given sequence with its structural preferences fits into the structural environment of a given structure [65]. A principally different approach has been pushed by Manfred Sippl [69, 67]. The idea is to use the rich knowledge deposited in the database of protein structures (PDB) by extracting mean-force potentials. Such potentials monitor the observed distances between residue pairs of particular amino acids, with a particular sequence separation (number of residues between the two). Until 1995, most threading methods have used mean-force-potentials [64, 37, 67]. A more recent generation of threading methods is based on 1D predictions [48]: first 1D structure (secondary structure and solvent accessibility) is predicted for a sequence of unknown structure, then the 1D structure is extracted for a library of known structures, and finally the observed and the predicted 1D structure strings are aligned by typical dynamic programming algorithms [30]. Has all this effort achieved to crack the hard nut threading?

Remote homologues can often be detected. First the good news: since the different mean-force-potentials which have been proposed capture different aspects of protein structure, the correct remote homologue is likely to be found by at least one of them [70]. Now the bad news: so far, no single method has been able to detect the correct remote homologue for more than half of all test cases [70]. For the methods which have been rigorously evaluated using large test sets, the correct remote homologue is detected in less than 40% of all cases [48]. However, this performance is clearly superior to that of traditional sequence alignments at this low level (<25%) of sequence identity. Furthermore, the success of the last Asilomar experiment on structure prediction (forthcoming issue of Proteins) suggests that the likelihood to detect the correct remote homologue is reasonably high when the choice is refined by experts.

3D prediction by threading is still not reliable. Detecting the remote homology is only the first of the three obstacles. It appears that the second obstacle (correct alignment between U and T ) is much more difficult and, unfortunately, there is no general solution so far. Thus the final step, building a 3D model, usually fails since the modelling procedures available today cannot correct the mistakes in the alignments. Although the last Asilomar experiment on structure prediction (forthcoming issue of Proteins) suggested that major improvements have been accomplished over the last two years, there are still very few publications to date which report accurate 3D predictions from threading methods. Currently, the successful use of threading methods requires sceptical, expert user intervention to spot wrong hits and false alignments. It is still possible that threading method will become the most successful structure prediction method, but a lot of detailed work lies ahead.

6.3 Unknown folds: ab initio prediction of structure?

Recent breakthrough in structure prediction? In the 1994 Asilomar meeting, none of the 3D ab initio methods were able to predict the correct protein structure [63]. Since that time, new methods have been proposed which indicate possible directions for the future. Several groups have obtained promising results using distance geometry methods [48]. Simplified force-fields in combination with dynamic optimisation strategies have yielded promising, but still relatively inaccurate results [71, 72]. Srinivasan and Rose have reported very encouraging results with their hierarchical search method [73]. However, the second Asilomar experiment on structure prediction (forthcoming issue of Proteins) concluded similarly to the first: no prediction of 3D structure from sequence, yet.

Accurate prediction of 3D structure for coiled-coil proteins. A particular class of proteins are coiled-coils. These are proteins can be defined by a rather simple geometry of long helices, of which two or more wind around one another [28]. Nilges and Brünger [74] have achieved atomic-accuracy in an ab initio prediction of the GCN4 leucine zipper using a hybrid molecular dynamics/simulated annealing search strategy. Recently, equally accurate models for three leucine zippers were obtained with faster calculations based on mean-force-potentials (O'Donoghue, in press).

Recognising incorrect structures. The single most important theoretical advance in 3D prediction in recent years may have been the development of mean-force-potentials. Before these potentials, structure prediction was normally done with 'physical' potentials, i.e., bonds, angles, torsion angles, and van der Waals as well as electrostatic non-bonded terms which describe the internal energy of the molecule [6]. In contrast, the mean-force-potentials, derived from databases of protein structure [75], attempt to describe the free-energy of the molecule. The physical potentials have been used very successfully to refine experimentally determined structures [52]. However, these terms cannot distinguish between a native fold and a grossly misfolded structure [75]. In contrast, mean-force-potentials of pairwise residue distances are quite successful in fold recognition, as well as remote homology modelling [67]. It remains to be seen how best to combine these two different potentials. In one pilot study on the use of mean-force-potentials for 3D structure prediction, best results where obtained by combining both potentials (O'Donoghue and Nilges, private communication).

Extracting principles about structure formation from structures? The mean-force-potential approach has recently been extended to study protein folding. Both the physical basis and the general characteristics of protein folding remain controversial [76]. Simulations and other studies indicate that the free energy balance of hydrogen bond formation is close to zero, or slightly unfavourable [77, 78], and that a specific fold is selected primarily by side-chain interactions [76]. Recently, Sippl et al. have extended the concept of deriving mean-force potentials to a formalism of describing Helmholtz free energies of atom-pair interactions [79]. The formalism starts with the following two assumptions: (1) that protein structures can be described by Helmholtz free energies (or mean-force-potentials), and (2) that the distribution of intra-molecular distances in experimentally determined protein structures does, on average, not deviate substantially from the corresponding distribution in native proteins. To normalise the absolute free energy contributions, the ideal gas is chosen (no internal interactions). Without any further assumptions or approximations, atom-atom mean-force-potentials are derived from a data set of known protein structures. The resulting Helmholtz mean-force-potentials unravel interesting principles about protein structure formation. (1) Backbone H-bonds (except for the a-helix interaction Oi ... Ni+4) do not contribute to the thermodynamic stability of native folds. (2) H-bond formation (except for Oi ... Ni+4) requires energy input that is regained when H-bonds are formed. Once formed, H-bonds are locked in a deep, narrow minimum. (3) The energy gain of forming one ionic or two hydrophobic contacts can provide roughly the activation energy required for forming a H-bond. Both the eloquence and the conclusions of the approach have prompted strong criticism, even unanimous rejection of these findings. Do we witness an error in a method laid out to spot errors, or the begin of a new era of force fields? Further applications of these mean-force-potentials will be needed to answer this question.

7 Conclusions

Native 3D structures of proteins are encoded by a linear sequence of amino acid residues. To predict 3D structure from sequence is a task challenging enough to have occupied a generation of researchers. Have they finally succeeded in their goal? The bad news is: no, we still cannot predict structure for any sequence. The good news are: we have come closer, and growing databases facilitate the task.

Prediction in 3D: theory bridges the sequence-structure gap. The only source for new, unique protein structures (structures for which no homologue exists in the database) are experiments. However, given the amount of time needed to determine a protein structure experimentally, more non-unique structures can be predicted at atomic resolution by homology modelling in a month than have been determined by experiment over the last three decades. Homology derived models are frequently accurate at the level of atomic resolution. Unfortunately, most models typically have considerable co-ordinate errors in loop regions. Coarse-grained homology derived models are available for almost one third of the sequences deposited in the SWISS-PROT database [24]. Threading techniques could increase this ratio considerably by finding more distant homologues. However, for large scale sequence analyses threading techniques are not yet reliable.

Predictions in 1D: significant improvement by larger databases. The rich information contained in the growing sequence and structure databases has been used to improve the accuracy of predictions of some aspects of protein structure. Evolutionary information is successfully used for predictions of secondary structure, solvent accessibility, and transmembrane helices. These predictions of protein structure in 1D are significantly more accurate, and more useful than five years ago. Some methods have indicated that 1D predictions can be useful as an intermediate step on the way to predicting 3D structure (inter-strand contacts; prediction-based threading). Another advantage of predictions in 1D is that they are not very CPU-intensive, i.e., 1D structure can be predicted for the protein sequence of, for example, entire yeast chromosomes overnight.

Predictions in 2D: so far of limited success. The prediction accuracy of chain-distant inter-residue contacts is so far relatively limited. Analysis of correlated mutations can be used to distinguish between alternative models (e.g. for threading techniques). The prediction of inter-strand contacts appears to be useful in some cases. An accurate method for the automatic prediction of contacts between residues not close in sequence remains to be developed.

Most breakthroughs in protein structure prediction were achieved over the last six years. Thus, although we still cannot solve the general prediction problem, progress has been made. In general, however, we could ask the question - is it worth persevering with structure prediction, given that it is clearly such a difficult task? The answer is: yes. The methods which have spun off from structure prediction have already given us considerable insight into the first four complete genomes. Perseverance with structure prediction will yield fruit in about five years time when the human genome will be known.