Rising accuracy of protein secondary structure prediction

Burkhard Rost

CUBIC, Columbia University

Columbia University, Department of Biochemistry and Molecular Biophysics, 650 West 168th Street, New York, NY 10032, USA
rost@columbia.edu, http://cubic.bioc.columbia.edu/ Tel: +1-212-305-3773, fax: +1-212-305-7932


Quote: in: 'Protein structure determination, analysis, and modeling for drug discovery' (ed. D Chasman), New York: Dekker, pp. 207-249



We still cannot predict protein 3D structure from sequence, in general. But bioinformatics continuously improves methods predicting simplified aspects of structure. In particular, the field of secondary structure prediction has achieved a breakthrough by combining algorithms from artificial intelligence with evolutionary information. PHD, the first third generation method, surmounted the 'magic' line of predicting more than 70% of all residues correctly in one of three states (helix, strand, other). Furthermore, β-strands were predicted correctly almost twice as often as by methods of the first and second generation. Finally, predicted segments look like those observed. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height around 77%. Divergent evolutionary profiles contain enough information not only to substantially improve prediction accuracy, but even to correctly predict long stretches of identical residues observed in alternative secondary structure states depending on non-local conditions. An example is a method automatically identifying structural switches, and thus finding a remarkable connection between predicted secondary structure and aspects of function. Due to their remarkable success, secondary structure predictions have become the workhorse for numerous methods aiming at predicting protein structure and function. Moreover, performance can be improved even further by using these methods in an 'expert' rather than in an 'automatic' mode. Have we now reached the limit of prediction accuracy? Time will tell.

Key words: genome sequence analysis, predicting globularity, protein domains, protein structure prediction, solvent accessibility, multiple alignments, trans-membrane helices.


The sequence-structure gap is rapidly increasing. Currently, databases for protein sequences (e.g. SWISS-PROT/TrEMBL [14]) are expanding rapidly, largely due to large-scale genome sequencing projects: at the beginning of 2001, we know all sequences for more than 40 entire genomes [15, 16, 17]. This implies that the gap between known structures and known sequences is rapidly increasing, despite significant improvements of structure determination techniques (PDB [18, 6]). The most successful theoretical approach to bridging this gap is comparative modelling. It effectively raises the number of 'known' 3D structures from 10,000 to over 100,000 [19, 20]. In fact, our ability to find the appropriate template so that we can apply comparative modelling has risen continuously over the last decade. Now, we can predict 3D structure through comparative modelling for more than twice as many proteins as in 1993. However, after four decades of ardent research, we still cannot predict structure from sequence [21, 22, 23]. Nevertheless, the field has had its successes: the best methods now frequently get some features of the fold right [24].

Simplifying the structure prediction problem. The rapidly growing sequence-structure gap has enticed theoreticians to solve simplified prediction problems [25]. An extreme simplification is the prediction of protein structure in one dimension (1D), as represented by strings of, e.g., secondary structure, or residue solvent accessibility. Theoreticians are lucky in that simplified predictions in 1D (e.g. secondary structure, or solvent accessibility [26, 25, 27]) - even when only partially correct - are often useful, e.g., for predicting protein function, or functional sites.

Topics left out here. This review focuses on methods predicting secondary structure for globular proteins, in general. At the infancy of analysing the proteome of entirely sequenced organisms, the most useful structure prediction methods are those that focus on particular classes of proteins, such as proteins containing membrane helices and coiled-coil regions [28, 29, 30, 31]. For predicting the topology of helical membrane proteins, a number of new methods add interesting new facets [32, 33, 34, 35, 36, 37]. However, no method has really utilised the flood of recent experimental information about membrane proteins [38]. Overall, membrane helices can be predicted much more accurately than globular helices. The current state of the art is to correctly predict the complete membrane helix topology for more than 80% of the proteins, and to falsely predict membrane helices for less than four percent of all globular proteins. We have recently come across evidence suggesting that this figure over-estimates performance (Rost, unpublished). Clearly, methods developed to predict helices in globular proteins go completely wrong for membrane helices! In contrast, porins appear to be predicted relatively accurately by methods developed for globular proteins [39, 40]. Few methods specifically predicting coiled-coil regions have recently been published (older review in: [41]). Two interesting developments are the prediction of the dimeric state of coiled-coils [42], and a method predicting 3D structure for coiled-coil regions [43]. In fact, the latter is the only existing method predicting 3D structure below 2 Ångstrøm main chain deviation over more than 30 residues. Another example of successful specialised secondary structure prediction methods is the focus on beta-turns [44, 45]. The method from the Thornton group appears to be the most accurate current means of predicting turns. Successful methods specialised to predicting alpha-helix propensities have resulted from the experimental studies of short peptides in solution [46, 47]. Neither the turn, nor the helix-in-solution methods have yet been combined with other secondary structure prediction methods.


Secondary structure assigned by DSSP. Secondary structure is most often assigned automatically based on the hydrogen bonding pattern between the backbone carbonyl and NH groups (e.g. by DSSP [48]). DSSP distinguishes eight secondary structure states that are often grouped into three classes: H = helix, E = strand, and L = non-regular structure. Typically the grouping is as follows: 'H' (α-helix) -> H, 'G' (3₁₀-helix) -> H, 'I' (π-helix) -> H, 'E' (extended strand) -> E, and 'B' (residue in isolated β-bridge) -> E, 'T' (turn) -> L, 'S' (bend) -> L, ' ' (blank = other) -> L, with the 'corrections': 'B ' -> EE, but 'B_B' -> LLL. Note that some developers use different projections of the eight DSSP classes onto three predicted classes; most of these yield seemingly higher levels of prediction accuracy. For example, short helices are more difficult to predict ([49], see also Fig. 5). Hence, converting 'GGG' to 'LLL' lets authors report higher numbers.
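The basic eight-to-three projection listed above can be written as a simple lookup table. The following is a minimal sketch, not code from DSSP or from any prediction method: the names `DSSP_TO_THREE` and `reduce_dssp` are made up for this illustration, treating unexpected symbols as L is an assumption, and the 'corrections' for isolated bridges are deliberately left out.

```python
# Illustrative sketch: project the eight DSSP states onto three classes.
# The mapping follows the grouping listed in the text; the names and the
# treatment of unexpected symbols are assumptions made for this example.
DSSP_TO_THREE = {
    "H": "H",  # alpha-helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi-helix
    "E": "E",  # extended strand
    "B": "E",  # residue in isolated beta-bridge
    "T": "L",  # turn
    "S": "L",  # bend
    " ": "L",  # blank = other
}


def reduce_dssp(dssp_string):
    """Return the three-state string (H, E, L) for a DSSP assignment string."""
    # Any symbol outside the eight DSSP states is treated as non-regular (L).
    return "".join(DSSP_TO_THREE.get(state, "L") for state in dssp_string)
```

Swapping a single entry of the table (for instance mapping 'G' to 'L') is all it takes to switch to one of the more lenient projections mentioned above, which is why reported accuracies are only comparable when the projection is stated.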

Per-residue prediction accuracy. The simplest and most widely used score for secondary structure prediction is the three-state per-residue accuracy, giving the percentage of residues predicted correctly in any of the three states helix, strand, and other:

$$ Q_3 = 100 \cdot \frac{1}{N} \sum_{i \in \{H, E, L\}} c_i \qquad \text{(eqn. 1)} $$

where c_i is the number of residues predicted correctly in state i (H, E, L), and N is the number of residues in the protein (or in a given data set). As typical data sets contain about 32% H (helix), 21% E (strand), and 47% L (other), correct prediction of the non-regular class (L) tends to dominate the three-state accuracy. More fine-grained measures that avoid this shortcoming are defined in detail elsewhere [50, 51].
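The three-state per-residue accuracy defined above is straightforward to compute from two strings of equal length. The sketch below is only an illustration under stated assumptions: the function name `three_state_accuracy` is hypothetical, and both strings are assumed to use the three-letter alphabet H, E, L.

```python
def three_state_accuracy(observed, predicted):
    """Percentage of residues predicted in their observed state (H, E or L)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings differ in length")
    if not observed:
        raise ValueError("cannot score an empty protein")
    # correct[i] plays the role of c_i: residues predicted correctly in state i.
    correct = {"H": 0, "E": 0, "L": 0}
    for obs, pred in zip(observed, predicted):
        if obs == pred and obs in correct:
            correct[obs] += 1
    # N is the total number of residues.
    return 100.0 * sum(correct.values()) / len(observed)
```

For a data set with the typical composition quoted above (32% H, 21% E, 47% L), a 'prediction' consisting only of L already scores 47% with this function, which illustrates how the non-regular class dominates the score.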

Per-segment prediction accuracy. Measures for single-residue accuracy do not completely reflect the quality of a prediction [52, 53, 54, 55, 51, 56]. Three simple measures assess the quality of predicting segments: (1) the number of correctly predicted segments, (2) the predicted vs. observed average segment length, and (3) the predicted vs. observed distributions of segments with length L [57]. All these measures can, e.g., identify methods with fairly high per-residue accuracy, yet an unrealistic distribution of segments. More elaborate scores are based on the overlap between predicted and observed segments (SOV: [51, 58]).
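The second and third of these simple measures can be sketched in a few lines. This is a minimal illustration, not the published SOV score; the function names are hypothetical, and a 'segment' is taken to be any maximal run of identical states.

```python
from collections import Counter
from itertools import groupby


def segments(states, state):
    """Lengths of all maximal runs of `state` in a three-state string."""
    return [len(list(run)) for key, run in groupby(states) if key == state]


def average_segment_length(states, state):
    """Mean run length of `state`, or 0.0 if the state never occurs."""
    lengths = segments(states, state)
    return sum(lengths) / len(lengths) if lengths else 0.0


def length_distribution(states, state):
    """Map each segment length to the number of segments of that length."""
    return Counter(segments(states, state))
```

As a usage example, take the observed string `LLHHHHHHHHLLLEEEEELL` and the prediction `LLHHHLHHHHLLLEELEELL`. Only two of twenty residues differ, yet the observed helix of length 8 is split into two helices of lengths 3 and 4, and the strand into two strands of length 2; comparing `average_segment_length` for the two strings exposes this fragmentation.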

Conditions for evaluating sustained performance. A systematic testing of performance is a pre-condition for any prediction to become reliably useful. For example, the history of secondary structure prediction has partly been a hunt for highest accuracy scores, with over-optimistic claims by predictors seeding the scepticism of potential users. Given a separation of a data set into a training set (used to derive the method) and a test set (or cross-validation set, used to evaluate performance), a proper evaluation (or cross-validation) of prediction methods needs to meet four requirements. (1) No significant pairwise sequence identity between proteins used for training and test set, i.e., < 25% (length-dependent cut-off [59]). (2) All available unique proteins should be used for testing, since proteins vary considerably in structural complexity; certain features are easier to predict, others harder. (3) No matter which data sets are used for a particular evaluation, results should also always be reported for a standard set. (4) Methods should never be optimised with respect to the data set chosen for final evaluation. In other words, the test set should never be used before the method is set up.

Number of cross-validation experiments of NO meaning. Most methods are evaluated in n-fold cross-validation experiments (splitting the data set into n different training and test sets). How many separations should be used, i.e., which value of n yields the best evaluation? A misunderstanding is often spread in the literature: the more separations (the larger n) the better. However, the exact value of n is not important provided the test set is representative and comprehensive, and the cross-validation results are NOT misused to again change parameters. In other words, the choice of n is of no meaning for the user.


Dinosaurs of secondary structure prediction still alive!

1st generation: single residue statistics. The first experimentally determined 3D structures of haemoglobin and myoglobin were published in 1960 [60, 61]. Almost a decade earlier, Pauling and Corey suggested an explanation for the formation of certain local conformational patterns like α-helices and β-strands [62, 63]. Shortly thereafter (and still prior to the first experimental structure), the first attempt was made to correlate the content of certain amino acids (e.g. Proline) with the content of α-helix [64]. The idea was expanded by correlating the content for all amino acids with that of α-helix and β-strand [65, 66]. The field of secondary structure prediction had been opened. Most early methods were first generation methods in that they were based on single residue statistics. Preferences of particular amino acids for particular secondary structure states were extracted from the small databases available at the time [67, 68, 69, 70, 71, 72, 73, 74, 75]. By 1983, it became clear that the accuracy of these methods had been over-estimated [76] (Fig. 1).

Fig. 1

Fig. 1. Three-state per-residue accuracy of various prediction methods. I included only methods for which I could run independent tests. Unfortunately, for most old methods this was not possible. However, for each method I had independent results from PHD [50, 78, 7] available. I normalised the differences between data sets by simply compiling levels of accuracy with respect to PHD. For comparison, I added the worst possible prediction (random), and the best possible one (through comparative modelling of a close homologue). The methods were: C+F Chou & Fasman (1st generation) [73, 242]; Lim (1st) [74]; GORI (1st) [83]; Schneider (2nd) [117]; ALB (2nd) [92]; GORIII (2nd) [84]; COMBINE (2nd) [243]; S83 (2nd) [116]; LPAG (3rd) [152]; NSSP (3rd) [114]; PHDpsi (3rd) [8]; JPred2 (3rd) [5]; SSpro (3rd) [13]; PSIPRED (3rd) [11]; PROF (3rd) [244].

2nd generation: segment statistics. The principal improvement of the 2nd generation profited from the growth of experimental information about protein structure. These data made it possible to parameterise the information contained in consecutive segments of residues. Typically 11-21 adjacent residues are taken from a protein and statistics are compiled to evaluate how likely the residue central in that segment is to be in a particular secondary structure state. Similar segments of adjacent residues were also used to base predictions on more elaborate algorithms, some of which were spun off from artificial intelligence. Since then, almost every algorithm has been applied to the problem of predicting secondary structure; all were limited to accuracy levels around 60% (Fig. 1). Reports of higher levels of accuracy were usually based on too small, or non-representative data sets [50, 77, 54, 78]. The main algorithms were based on: (i) statistical information [79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91]; (ii) physico-chemical properties [92]; (iii) sequence patterns [93, 94, 95]; (iv) multi-layered (or neural) networks [96, 97, 98, 99, 100, 101, 102, 103]; (v) graph-theory [104, 105]; (vi) multivariate statistics [106, 107]; (vii) expert rules [108, 109, 110, 105, 111, 112]; and (viii) nearest-neighbour algorithms [113, 114, 115].
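The common starting point of these segment-based methods, pairing a window of adjacent residues with the observed state of its central residue, can be sketched as follows. This is an illustration only: the function name `window_examples`, the default width of 17 (an arbitrary value within the 11-21 range quoted above), and the padding symbol at the chain ends are all assumptions, not details of any published method.

```python
def window_examples(sequence, states, width=17, pad="-"):
    """Yield (window, state) pairs, one per residue.

    `window` is the segment of `width` adjacent residues centred on the
    residue; `state` is the observed secondary structure of that residue.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states differ in length")
    if width % 2 == 0:
        raise ValueError("width must be odd so that a central residue exists")
    half = width // 2
    # Pad both ends so that terminal residues also get a full-width window.
    padded = pad * half + sequence + pad * half
    for i, state in enumerate(states):
        yield padded[i:i + width], state
```

Any of the algorithm families listed above could, in principle, be trained on the pairs produced this way; what differs between them is how the window is turned into a prediction for the central residue.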

Problems with 1st and 2nd generation methods. All methods from the first and second generation shared at least two of the following problems (most had all three):

(1) three-state per-residue accuracy was below 70%,

(2) β-strands were predicted at levels of 28-48%, i.e., only slightly better than random,

(3) predicted helices and strands were too short.

The first problem (<100% accuracy) is commonly linked to two features. (A) Secondary structure formation is partially determined by long-range interactions, i.e., by contacts between residues that are not visible to any method based on segments of 11-21 adjacent residues. (B) Secondary structure assignments vary by 5-12% even between different crystals of the same protein. Hence, 100% identical assignments are an unrealistic and unreasonable aim. The second problem (β-strands < 50% accuracy) has been explained by the fact that β-sheet formation is determined by more non-local contacts than is α-helix formation. The third problem (too short segments predicted) was basically overlooked by most developers (exceptions: [116, 117]). This problem makes predictions very difficult to use in practice (Fig. 2). Many of the recent third generation prediction methods address all three problems simultaneously, and are clearly superior to the old methods (Fig. 1). Nevertheless, many of the secondary structure prediction methods available today (e.g. in GCG [118], or from internet services [119]) unfortunately still use the dinosaurs of secondary structure prediction.

Fig. 2

Fig. 2. Example for typical secondary structure prediction of the 2nd generation. The protein sequence (SEQ ) given was the SH3 structure [184] . The observed secondary structure (OBS ) was assigned by DSSP [48] (H = helix; E = strand; blank = non-regular structure; the dashes indicate the continuation of the 2nd strand that was missed by DSSP). The typical prediction of too short segments (TYP ) poses the following problems in practice. (i) Are the residues predicted to be strand in segments 1, 5, and 6 errors, or should the helices be elongated? (ii) Should the 2nd and 3rd strand be joined, or should one of them be ignored, or does the prediction indicate two strands, here? Note: the three-state per-residue accuracy is 60% for the prediction given.

Quantum leap through using pairwise evolutionary information

Evolutionary odyssey informative?

Variation in sequence space. The exchange of a few residues can already destabilise a protein [120]. This implies that the majority of the 20^N possible sequences of length N form different structures. Has evolution really created such an immense variety? Random errors in the DNA sequence lead to a different translation of protein sequences. These 'errors' are the basis for evolution. Mutations resulting in a structural change are not likely to survive, since the protein can no longer function appropriately. Furthermore, the universe of stable structures is not continuous: minor changes on the level of the 3D structure may destabilise the structure. Thus, residue exchanges conserving structure are statistically extremely unlikely. However, the evolutionary pressure to conserve structure and function has led to a record of this unlikely event: structure is more conserved than sequence [121, 122, 123]. Indeed, all naturally evolved protein pairs that have 35 of 100 pairwise identical residues have similar structures [124, 59]. However, the attractors of protein structures are even larger: the majority of protein pairs of similar structure have levels below 15% pairwise sequence identity [125, 59, 126].

Long-range information in multiple sequence alignments. The residue substitution patterns observed between proteins of a particular structural family, i.e., changes that conserved structure, are highly specific for the structure of that family. Furthermore, multiple alignments of sequence families implicitly also contain information about long-range interactions. Suppose residues i and i + 100 are close in 3D; then the types of amino acids that can be exchanged (without changing structure) at position i are constrained, because their physico-chemical characteristics have to fit the amino acid types at position i + 100 [127, 128].

Can evolutionary information be used?

Expert predictions: visual use of alignment information. The first method that used information from family alignments was proposed in the 70's [129]. Since then, experts have successfully based single-case predictions on multiple alignments [130, 129, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146]. In fact, analysing the conservation patterns in sequence families is the first step of any expert wanting to learn anything about a particular protein. Conversely, proteins without homologues constitute the dead-end road of sequence analysis.

Automatic use of pairwise alignment information. The simplest way to use alignment information automatically was first proposed by Maxfield & Scheraga and by Zvelebil et al. [147, 148]: predictions were compiled for each protein in an alignment, and then averaged over all proteins. A slightly more elaborate way of automatically using evolutionary information is to directly base prediction on a profile compiled from the multiple sequence alignment [50, 78, 7]. The following steps are applied in particular for the PHD methods [149, 7] (Fig. 3).
(1) A sequence of unknown structure (U) is quickly (typically by Blast [150]) aligned against the data base of known sequences (i.e. no information of structure required!). (2) Proteins with sufficient sequence identity to U to assure structural similarity are extracted and re-aligned by a multiple alignment algorithm MaxHom [151]. (3) For each position the profile of residue exchanges in the final multiple alignment is compiled, and used as input to a neural network.
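Step (3) above, compiling a per-position profile from a finished multiple alignment, can be illustrated with a short sketch. It does not reproduce the actual PHD input encoding: the function name `alignment_profile`, the use of plain relative frequencies, and the decision to skip gaps and unknown symbols are assumptions made for this example, and the database search and re-alignment of steps (1) and (2) are taken as already done.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def alignment_profile(aligned_sequences):
    """Per-column relative frequencies of the 20 amino acids.

    `aligned_sequences` is a list of equal-length strings, one per aligned
    protein. Gaps and unknown symbols are skipped (an assumption).
    """
    if not aligned_sequences:
        raise ValueError("alignment is empty")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    # zip(*...) walks through the alignment column by column.
    for column in zip(*aligned_sequences):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile
```

For the toy alignment `["ACD", "ACE", "A-D", "GCD"]`, the first column yields a frequency of 0.75 for A and 0.25 for G, while the second column is fully conserved (C at 1.0). It is this kind of position-specific exchange pattern, rather than the single query sequence, that is fed to the network.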

Fig. 3

Fig. 3. Using evolutionary information to predict secondary structure. Starting from a sequence of unknown structure (SEQUENCE) the following steps are required to finally feed evolutionary information into the PHD neural networks (upper right): (1) a data base search for homologues through iterated PSI-Blast [150] (protocol from [8]), (2) a decision for which proteins will be considered as homologues (BLAST score or length-dependent cut-off for pairwise sequence identity [124, 59]), (3) a reduction of redundancy (purging proteins that are too similar to one another), and (4) a final refinement, and extraction of the resulting multiple alignment. Numbers 1-5 illustrate where users of the PredictProtein server [7, 119] can interfere to improve prediction accuracy without changes made to the actual prediction method PHD.

3rd generation: evolution to better predictions

Example chosen: PHD. In the following, I illustrate the principal concepts of 3rd generation methods based on the particular neural network-based method PHD because it has been the most accurate method for many years, and because most of these concepts were introduced by PHD [50, 78]. Meanwhile, several other methods have reported and/or achieved similar levels of performance [152, 50, 144, 78, 114, 153, 154, 155, 156, 157, 158, 7, 159, 160]. More recent methods will be discussed in more detail below.

Multiple levels of computations. PHD processes the input information on multiple levels (neural network in Fig. 3 ). The first level is a feed-forward neural network with three layers of units (input, hidden, and output). Input to this first level sequence-to-structure network consists of two contributions: one from the local sequence, i.e., taken from a window of 13 adjacent residues, and another from the global sequence. Output of the first level network is the 1D structural state of the residue at the centre of the input window. The second level is a structure-to-structure network. The next level consists of an arithmetic average over independently trained networks (jury decision). The final level is a simple filter.
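The jury decision, i.e. the arithmetic average over independently trained networks, can be sketched as below. This is a hedged illustration and not the PHD implementation: the function name `jury_decision`, the representation of network outputs as (helix, strand, loop) triples, and the winner-takes-all read-out are assumptions; the final filter is not specified in the text and is therefore not sketched.

```python
STATES = "HEL"


def jury_decision(network_outputs):
    """Average the outputs of several networks and pick the winning state.

    `network_outputs` holds one entry per network; each entry is a list of
    (helix, strand, loop) output triples, one triple per residue.
    """
    if not network_outputs:
        raise ValueError("need the output of at least one network")
    n_networks = len(network_outputs)
    prediction = []
    # zip(*...) groups the triples of all networks for the same residue.
    for residue_outputs in zip(*network_outputs):
        averaged = [sum(values) / n_networks for values in zip(*residue_outputs)]
        # Winner-takes-all: the state with the highest averaged output.
        prediction.append(STATES[averaged.index(max(averaged))])
    return "".join(prediction)
```

In a two-network example, one network may favour strand for a residue with outputs (0.4, 0.5, 0.1) while the other favours helix with (0.6, 0.1, 0.3); the averaged outputs (0.5, 0.3, 0.2) give helix, so a single network's marginal preference is overruled by the jury.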

Balanced predictions by balanced training. The distribution of the training examples (known structures) is rather uneven: about 32% of the residues are observed in helix, 21% in strand, and 47% in loop. Choosing the training examples in proportion to their occurrence in the data set (unbalanced training) results in a prediction accuracy that mirrors this distribution, e.g., strands are predicted less accurately than helix or loop [50, 78, 49]. A simple way around the data base bias is balanced training: at each time step one example is chosen from each class, i.e., one window with the central residue in a helix, one with the central residue in a strand and one representing the loop class. This training results in a performance well balanced between the output states (Fig. 4).
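The sampling scheme behind balanced training can be sketched independently of any particular network. The code below is an illustration under assumptions: the name `balanced_batches`, the (window, state) representation of a training example, and the fixed random seed are all hypothetical choices, and the network update itself is not shown.

```python
import random


def balanced_batches(examples, n_steps, seed=0):
    """Yield, at each time step, one example per class (H, E, L).

    `examples` is a list of (window, state) pairs, where `state` is the
    observed class (H, E or L) of the central residue of the window.
    """
    rng = random.Random(seed)
    by_state = {"H": [], "E": [], "L": []}
    for window, state in examples:
        by_state[state].append((window, state))
    if any(not pool for pool in by_state.values()):
        raise ValueError("need at least one example for each of H, E and L")
    for _ in range(n_steps):
        # One helix, one strand and one loop example, regardless of how
        # frequent each class is in the data set.
        yield [rng.choice(by_state[state]) for state in "HEL"]
```

Because every step presents the three classes equally often, the roughly 32/21/47 composition of the data base no longer determines how often the network sees each state during training.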

Fig. 4

Fig. 4. Prediction balanced between three secondary structure states. The pies were valid for a simple neural network prediction not using evolutionary information (2nd generation). The entire pies represented 100% of (A + D ) all correctly predicted residues, (B ) all residues in a representative subset of PDB, and (C ) all residues presented during balanced training. The basic message is that the prediction of strand is not inferior to the one for helix for 2nd generation methods (A ) because strand formation is more dominated by long-range interactions (as previously argued), but because the data base distributions differ between the three states (B ). Simply skewing the distribution (C ) resulted in an equally accurate prediction for all three states (D ).

Better segment prediction by structure-to-structure networks. The first level sequence-to-structure networks use as input the following information from 13 adjacent residues: (1) profile of amino acid substitutions for all 13 residues; (2) conservation weights compiled for each column of the multiple alignment; (3) number of insertions and deletions in each column; (4) position of the current segment of 13 residues with respect to the N- and C-term; (5) amino acid composition; and (6) length of the protein. The network output consists of three units for helix, strand, and non-regular structure. The second level structure-to-structure networks use the same output. The major input for the second level structure-to-structure networks are the output values of the first level sequence-to-structure networks. The reason for introducing a second level is the following. Networks are trained by changing the connections between the units such that the error is reduced for each of the examples successively presented to the network during training. The examples are chosen at random. Therefore, the examples taken at time step t and at time step t+1 are usually not adjacent in sequence. This implies that the network cannot learn that, e.g., helices contain at least three residues. The second level structure-to-structure network introduces a correlation between adjacent residues with the effect that predicted secondary structure segments have length distributions similar to the ones observed [57]. The problem that remains after the second level networks is the under-prediction of short segments (Fig. 5).

Fig. 5

Fig. 5. Distribution of segment lengths. The number of secondary structure segments observed (thick black line; according to DSSP [48]) and predicted is plotted against their length. All methods miss short helices and strands. However, short regions lacking regular secondary structure were also under-predicted by all methods. Overall, most methods predict segments around the lengths of those observed. All results are based on a data set of 201 proteins taken from the EVA server [180, 181] that contained no protein used for training of any of the methods (also used for results in Table 6).

Continuous advance through profile searches in growing databases

Automatically aligning protein families based on profiles: the PSI-BLAST wonder. Just as experts have been using alignment information to predict aspects of structure and function, they have intruded into the twilight zone of sequence alignments [122] using profile-based alignment techniques. The idea of profile-based searches is simply to use the fact that profiles of evolutionary conservation are highly specific for every protein family. For example, Glycines can often be mutated without major changes. However, in particular families, the conservation of some Glycines may be crucial to maintain mobility. Many groups have successfully implemented semi-automatic profile-based database searches [150, 161, 162, 163, 164, 165, 166]. However, the breakthrough to large-scale routine searches has been achieved by the development of PSI-BLAST [10] and Hidden Markov models [167, 12]. In particular, the gapped, profile-based, and iterated search tool PSI-BLAST continues to revolutionise the field of protein sequence analysis through its unique combination of speed and accuracy. More distant relationships are found through iteration, starting from the safe zone of comparisons and intruding deeply and reliably into the twilight zone (Fig. 6).

Fig. 6

Fig. 6. Profile-based searches extend evolutionary information. The cloud signifies a protein structural family for the query protein U, i.e. all proteins that have a similar 3D structure. A simple pairwise comparison of U with all other proteins covers the 'safe zone' of sequence alignment (blue circle around U). This zone can be defined, e.g., by BLAST scores below 10^-10, or by more than 35% pairwise identical residues over long alignments. Assume that there are only five other proteins (small white circles) in the safe zone falling all on the same side of U. Now, PSI-BLAST starts the next iteration with the family-specific profile given by the proteins found in the safe zone. Searching the database again with this profile reaches safely into the twilight zone (zone reached marked by double-lined egg indicated in figure). However, no current method generally reaches all members of family U. Furthermore, in particular for PSI-BLAST the new region may fall outside of the initial safe zone (black/yellow moon left of safe zone). Finally, the regions that could have been reached by sequence-space hopping or intermediate sequence searches (light blue circles around five initial hits; [245, 246, 59]) are not entirely covered by the profile-based search. The tricky bit is to avoid that the profile picks up unrelated proteins (transparent egg), and thus connects two separate structural families (U and X). The three circles on the right hand side signify the three zones of database searches: safe (no error), twilight (some to many errors), midnight (hardly any correct hit). Their sizes are proportional to the 'real' sizes of the regions they occupy. The graph shows that the number of false positives explodes within a narrow region of the twilight zone (0 error at 33% sequence identity, > 90% error at 25%). At the same time, more than half the true homologues are only found in the midnight zone. Conclusions: (1) Iterated PSI-BLAST searches can safely identify fairly divergent family members. (2) Close homologues may be lost during the extension of the family. (3) The advanced search can lead astray.

Jones broke through by using PSI-BLAST searches of large databases. David Jones pioneered using iterated PSI-BLAST searches automatically [11]. The most important step climbed by the resulting method PSIPRED has been the detailed strategy to avoid polluting the profile with unrelated proteins (Fig. 6). To avoid this trap, the database searched has to be filtered first [11]. At the CASP meeting at which David Jones introduced PSIPRED, Kevin Karplus and colleagues presented their prediction method (SAM-T99sec), which finds more diverged profiles through Hidden Markov models [168, 169]. Recently, Cuff & Barton also successfully used PSI-BLAST alignments for JPred2 [170]. Jennings et al. [171] explored an alternative to increasing divergence: they started with a safe zone alignment through ClustalW [163] and HMMer [167], and iteratively refined the alignment using the secondary structure prediction from DSC [157]. The resulting alignment is reported to be more accurate and to yield higher prediction accuracy than the initial ClustalW / HMMer alignments [171].

SSpro: advanced recursive neural network system. The only method published recently that appears to improve prediction accuracy significantly not through more divergent profiles but through the particular algorithm is SSpro [13]. The major idea of the method aims at solving the problem of predicting too short segments. PHD addressed this problem by a second level structure-to-structure network [50]. Most authors have since implemented this idea (in particular PSIPRED and JPred2). Pierre Baldi and colleagues deviated substantially from this concept. Instead of using an additional network, they embedded the correlation into one single recursive neural network. In principle, the idea of a recursive network had been implemented before [172]. However, the particular details of the algorithm implemented in SSpro are novel and - as Table 1 illustrates - prove highly successful. Interestingly, SSpro is less successful at improving the prediction of segment lengths than at improving overall accuracy (Fig. 5).

HMMSTR: hidden Markov models for connecting library of structure fragments. Can we predict secondary structure for protein U by local sequence similarity to segments of known structures {S} even when overall U differs from any of the known structures {S}? Yes, as shown by many nearest-neighbour-based prediction methods, the most successful of which seems to be NSSP [160] . A conceptually quite different realisation of the same concept has been implemented in HMMSTR by Chris Bystroff, David Baker and colleagues [2] . Firstly, build a library of local stretches (3-19) of residues with 'basic structural motifs' (I-sites). Secondly, assemble these local motifs through Hidden Markov models introducing structural context on the level of super-secondary structure. Thus, the goal is to predict protein structure through identification of 'grammatical units of protein structure formation'. Although HMMSTR intrinsically aims at predicting higher order aspects of 3D structure, a side-result is the prediction of 1D secondary structure. I find two results surprising. (1) The authors do not find any significant effect of 'over-optimising' their method, i.e. HMMSTR appears as accurate in predicting secondary structure for proteins known today as it will be for those known next year. (2) Three-state per-residue accuracy is reported to be about 74% [2] . This value may be over-estimated. Nevertheless, HMMSTR is clearly one of the better prediction methods.

Plethora of new concepts for secondary structure prediction explored recently. The following five methods are a small subset of new ideas explored to improve secondary structure prediction. (1) Ouali & King [173] combine neural networks and rule-based statistics in a cascade of classifiers. Based on a similar data set they estimate a level of prediction accuracy comparable to that of JPred2 (see Table 1). (2) Chandonia & Karplus [174] combined simplified output schemes (two output states) with networks trained on different tasks and a particular variant of early stopping; the input consists of non-divergent alignments picked from the safe zone (Fig. 1). Based on a protocol similar to the one applied by the Danish group [175], the authors estimate a level of > 76% accuracy, i.e. a level that, if it holds up, is similar to that of SSpro (Table 1). (3) Supposedly the simplest new method that claims to almost approach the performance of PHD combines the information for secondary structure formation contained in amino acid singlets, doublets, and triplets. (4) Schmidler et al. [176] use a simple statistical model; the novel aspect is to replace compiling statistics over fixed stretches of N residues by compiling them over segments of regular secondary structure (helix, strand). The underlying formalism resembles a hidden semi-Markov model, which allows particular propensities such as helix caps to be incorporated explicitly [177]. Based on non-comparable data sets the authors estimated prediction accuracy to be around 69%; if correct, this value is extremely impressive for a 2nd generation method. (5) Without claims to surprising levels of accuracy, Figureau et al. [178] combine cleverly chosen pentapeptides from the database to obtain the final prediction.

Caution: over-optimism has become even more likely!

Seemingly improve accuracy by ignoring short segments. There are many ways to publish higher levels of accuracy. Amongst the simplest for secondary structure prediction is to convert 3-10 helices and beta-bulges assigned by DSSP [48] to non-regular structure. This yields higher levels of accuracy since all methods - on average - are better at predicting the middle of helices and strands than their caps, and hence are more accurate for longer regular secondary structure segments [49, 174]. When using predicted secondary structure to predict 3D structure, short helices are important. Thus, I suggest keeping to the more conservative conversion strategy, i.e., the one that retains these short segments as regular structure.
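The effect of the conversion choice can be illustrated with a minimal sketch. The DSSP state letters (H, G, I, E, B, T, S, blank) are standard, but the two mapping tables below are assumptions made for illustration: `CONSERVATIVE` keeps 3-10 helices (G) and isolated bridges (B) as regular structure, whereas `LENIENT` sends them to non-regular structure. Neither table is claimed to be the exact conversion used by any of the methods discussed here.

```python
# Two hypothetical DSSP 8-state -> 3-state conversion tables (illustrative only).
CONSERVATIVE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
LENIENT = {"H": "H", "I": "H", "E": "E"}  # G and B fall through to non-regular 'L'

def to_three_states(dssp: str, table: dict) -> str:
    """Map a DSSP string to helix (H), strand (E), or other (L)."""
    return "".join(table.get(state, "L") for state in dssp)

dssp = "  GGG  HHHHHHHH TT EEEEE  B  "  # invented example string
print(to_three_states(dssp, CONSERVATIVE))
print(to_three_states(dssp, LENIENT))
```

Under the lenient table the short 3-10 helix and the isolated bridge simply vanish from the observed structure, so a method can no longer be penalised for missing them.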

Comparing apples and oranges, or too few apples with one another. To overstate the point: there is NO value in comparing methods evaluated on different data sets. Most secondary structure prediction methods are publicly available. Thus, developers may want to compare their results to public methods based on the same data set (one not previously used for either of the two). Many methods predicting aspects of protein structure and function have to fight with limited data availability. This is not at all the case for secondary structure prediction: hundreds of new protein structures are added every year [6]. If, for some reason or other, small data sets have to be used, developers should painstakingly try to estimate what 'significant difference' means for their data set. For example, about 20 new protein structures are clearly too few! This is the number of proteins that were available for the CASP4 meeting. Based on that set all 3rd generation methods were equal!
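A rough feel for what 'too few' means can be obtained from the standard error of a mean. The sketch below assumes that per-protein accuracies are independent and scatter with a standard deviation of about 10 percentage points (the spread quoted for PROF elsewhere in this chapter). The function name and the use of the plain standard error are illustrative assumptions, not the procedure applied by EVA.

```python
import math

def standard_error(sigma: float, n_proteins: int) -> float:
    """Standard error of the mean per-protein accuracy for n independent proteins."""
    return sigma / math.sqrt(n_proteins)

SIGMA = 10.0  # assumed per-protein standard deviation, in percentage points
for n in (20, 201):
    print(f"N = {n:3d}: standard error = {standard_error(SIGMA, n):.1f} points")
```

Under these assumptions, a set of about 20 proteins gives a standard error above two percentage points, which is of the same size as the differences between the 3rd generation methods in Table 1.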

Seemingly achieve 100% accuracy by using correlated sets. Many publications on predicting secondary structural class from amino acid composition allowed correlations between 'training' and testing sets. Consequently, the published levels of prediction accuracy exceeded by far the theoretically possible margins [179]. A very simple operational definition for 'independent sets' is the following: two proteins A and B are correlated if the sequence similarity between A and B suffices to predict the structure of B knowing A's structure. Assume we have two un-correlated sets of proteins S1 and S2. Can we train the method on set S1 and develop it on set S2 without further ado? While developing PROF, I realised that the answer is negative. In fact, I trained neural networks on about 2000 structures that had no significant level of sequence similarity to our original set of 126 proteins [50]. I used the 126 only after I had completed developing the method and found a prediction accuracy exceeding 80% (unpublished). When testing PROF on a set of about 200 new structures that had been added to PDB in the meantime (different from that given in Table 1), prediction accuracy dropped. Do the 126 differ from the set used for Table 1? I failed to answer this question.

EVA: automatic evaluation of automatic prediction servers. In collaboration with Volker Eyrich (Columbia), Marc Marti-Renom, Andrej Sali (both Rockefeller), Florencio Pazos, and Alfonso Valencia (both CNB Madrid), we have started to address the above problems through the automatic server EVA [180, 181]. Leszek Rychlewski (IIMCB Warsaw) and Dani Fischer (Ben-Gurion Univ.) are implementing similar ideas in LiveBench [182]. The simple concept is the following: take the N newest experimental structures added to PDB, send the sequences to all prediction servers, collect the results, and accumulate a continuous evaluation of prediction accuracy every week. EVA has been evaluating secondary structure prediction methods for more than six months now. I found it instructive to see how the 'ranking' of methods initially changed from week to week because the sets were too small. Currently, EVA also provides results for evaluating comparative modelling (Sali group), and residue-residue contacts (Valencia group). We hope that EVA will eventually simplify life for developers, referees, editors and users.

State-of-the-art secondary structure prediction

What does 77% accuracy mean, in practice?

Prediction accuracy peaks at 77%. The currently best methods reach a level around 77% three-state per-residue accuracy (Table 1). This constitutes a sustained level about five percentage points above last century's best method not using diverged profiles (PHD in Table 1). Fortunately, the improvement is valid for helix, strand and non-regular regions (information and correlation indices in Table 1). Furthermore, significantly fewer residues are confused between the states helix and strand (BAD score, Table 1). Finally, some new methods also improve in a more global sense by improving the accuracy of assigning the secondary structural class (all-alpha, all-beta, alpha/beta, other) based on the predicted content of regular secondary structure (Class score, Table 1).

Tab. 1

Difference between 60% and 70% accuracy may matter a lot! Some of the 3rd generation methods for secondary structure prediction are clearly superior to previous methods: β-strands are predicted more accurately; predicted segments look like those observed; and the overall accuracy is about ten percentage points higher. The advantage in practice is illustrated in Fig. 7. Not only does the 3rd generation method (here PHD) get most segments right, it also enables users to focus on the more reliably predicted residues. The reliability index (Rel in Fig. 7) is compiled as the difference between the output unit with the highest value (winner unit) and the output unit with the next highest value (normalised to a scale from 0 (low) to 9 (high)). All strongly predicted residues (* in Fig. 7) are predicted correctly.
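The definition of the reliability index can be written down in a few lines. The following is a minimal sketch, not the PHD implementation: it assumes that the three output units take values between 0 and 1, and that the normalisation is the integer part of ten times the difference, capped at 9. Both the function name and this particular scaling are assumptions.

```python
def reliability_index(outputs: dict) -> tuple:
    """Return (predicted state, reliability 0-9) for one residue.

    `outputs` maps each state ('H', 'E', 'L') to a network output in [0, 1].
    The scaling (integer part of 10 * difference, capped at 9) is an assumption.
    """
    ranked = sorted(outputs.items(), key=lambda item: item[1], reverse=True)
    (winner, best), (_, second) = ranked[0], ranked[1]
    return winner, min(9, int(10 * (best - second)))

print(reliability_index({"H": 0.85, "E": 0.05, "L": 0.10}))  # strongly predicted helix
print(reliability_index({"H": 0.40, "E": 0.35, "L": 0.25}))  # weakly predicted helix
```

Both toy residues are predicted as helix, but only the first would be marked as strongly predicted.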

Fig. 7

Fig. 7. Example for secondary structure prediction of 1st-3rd generation. TOP panel: SH3 structure [184] . The dashes indicated the continuation of the 2nd strand that was missed by DSSP. The methods are 1st generation: C+F [73] ; 2nd generation: GOR [243] (= GORIII), and 3rd generation: PHD [7] . The levels of three-state accuracy were: C+F = 59%; GOR = 65%; and PHD = 72%. Whereas the 1st and 2nd generation methods performed above their average accuracy (Fig. 1) for this protein, the PHD prediction was average (Fig. 1; Fig. 7). The strength of the PHD prediction was reflected in the one-digit reliability index (Rel , 0 = low, 9 = high) correlated with prediction accuracy. All residues predicted at values of Rel > 4 (marked by *) were predicted correctly. LOWER panel: translation elongation factor beta-1 [247] : shown are examples for methods exploring extended profile searches (Table 1 for abbreviations). An N-terminal strand and helix (not shown) were correctly predicted by all methods. Although the combination of various methods (EVA-4) is better on average (Table 1), it is debatable which prediction is most useful here.

Values for expected prediction accuracy are distributions. Statements such as 'secondary structure is about 90% conserved within sequence families' [51] refer to averages over distributions. The same holds for the expected prediction accuracy ( Fig. 8 ). Such distributions explain why some developers have over-estimated the performance of their tools using data sets of only tens of proteins (or even fewer). In general, single sequences yield accuracy values about ten percentage points lower than multiple alignments [50, 54, 78] . Note that for most proteins some helix and strand residues are confused (BAD predictions in Fig. 8 ).

Fig. 8

Fig. 8. Expected variation of prediction accuracy with protein chain for PHD. (A) Three-state per-residue accuracy (eq. 1; PDB identifier given for the proteins predicted worst); (B) percentage of BAD predictions, i.e., residues either predicted in helix and observed in strand, or predicted in strand and observed in helix (introduced by [56]); (B inset) cumulative percentage of proteins with BADly predicted residues (e.g. for 80% of the proteins the percentage of confused helix and strand residues is < 7%; however, for only 30% of all proteins such a confusion never happened). Given: distributions (over 721 unique protein chains), averages, and one standard deviation. Distributions of all other third generation methods given in Table 1 are qualitatively similar.
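The two per-protein scores shown in Fig. 8 are simple to compute once observed and predicted states are given as strings of H (helix), E (strand), and L (other). The sketch below is illustrative: the function names and the toy strings are invented, and normalising the BAD score by the total number of residues is an assumption.

```python
def q3(observed: str, predicted: str) -> float:
    """Three-state per-residue accuracy: percentage of residues predicted correctly."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

def bad(observed: str, predicted: str) -> float:
    """Percentage of residues with helix and strand confused (in either direction).

    Normalising by the total number of residues is an assumption of this sketch.
    """
    confused = sum({o, p} == {"H", "E"} for o, p in zip(observed, predicted))
    return 100.0 * confused / len(observed)

obs = "LHHHHHHLLEEEELLHHHHL"  # invented observed states
prd = "LHHHHHLLLEEELLLEEHHL"  # invented predicted states
print(f"Q3 = {q3(obs, prd):.1f}%, BAD = {bad(obs, prd):.1f}%")
```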

Reliability of prediction correlates with accuracy. For the user interested in a particular protein U, the fact that prediction accuracy varies with the protein ( Fig. 8 ) implies a rather unfortunate message: the accuracy for U could be lower than 40%, or it could be higher than 90% ( Fig. 8 ). Is there any way to provide an estimate at which end of the distribution the accuracy for U is likely to be? Indeed, the reliability index correlates with accuracy. In other words, residues with higher reliability index are predicted with higher accuracy [50, 78, 7] . Thus, the reliability index offers an excellent tool to focus on some key regions predicted at high levels of expected accuracy. Furthermore, the reliability index averaged over an entire protein correlates with the overall prediction accuracy for this protein ( Fig. 9 ). (Note however, that the reliability indices tend to be unusually high for alignments of sequence families without very divergent sequences.) Plotting the reliability of the prediction against accuracy ( Fig. 9 ) also reveals that minor differences in overall accuracy may matter. For example, JPred2 and PROF differ by only two percentage points ( Table 1 ), however, JPred2 reaches 88% accuracy for 'only' 45% of all residues whereas PROF reaches that level for more than 60% of all residues ( Fig. 9 ).
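The cumulative view of reliability against accuracy can be sketched as follows: for a given threshold, keep only residues predicted at or above that reliability index and report how many residues remain and how accurately they are predicted. The function name and all data below are invented for illustration.

```python
def accuracy_at_threshold(observed, predicted, reliability, threshold):
    """Return (coverage %, accuracy %) for residues with reliability >= threshold."""
    selected = [(o, p) for o, p, r in zip(observed, predicted, reliability)
                if r >= threshold]
    if not selected:
        return 0.0, float("nan")  # no residue reaches the threshold
    coverage = 100.0 * len(selected) / len(observed)
    accuracy = 100.0 * sum(o == p for o, p in selected) / len(selected)
    return coverage, accuracy

obs = "LHHHHHHLLEEEELLHHHHL"
prd = "LHHHHHLLLEEELLLEEHHL"
rel = [5, 8, 9, 9, 9, 7, 2, 6, 7, 8, 9, 8, 3, 6, 5, 1, 2, 7, 6, 4]  # toy values
for t in (0, 5, 8):
    cov, acc = accuracy_at_threshold(obs, prd, rel, t)
    print(f"Rel >= {t}: {cov:.0f}% of residues at {acc:.0f}% accuracy")
```

Raising the threshold trades coverage for accuracy, which is the trade-off a user makes when focusing on strongly predicted regions.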

Fig. 9

Fig. 9. Correlation between reliability and accuracy. Residues predicted at higher reliability are predicted more accurately [50, 78, 7]. In fact, proteins with higher average reliability index are predicted above average (A, method: PROF). For example, no protein predicted at an average reliability ≥ 6 has less than 76% accuracy, and only 3 out of 201 are below 70% accuracy for an average index ≥ 5. PROF predictions were ≥ 5 on average for one fourth of all proteins; for these the prediction accuracy was 83%. Reliability indices are now being used by most methods (B). They also enable users to spot particular regions predicted more accurately than others. For example, PROF and PSIPRED reach a level of accuracy similar to comparative modelling (around 88%, dotted line) for about 60% of all residues, and more than 93% of the quarter of the residues predicted at highest indices are correctly predicted. Note: the values in B are cumulative; e.g. 100% of all residues for PROF are predicted at 77.4% accuracy (Table 1).

Table 1

Table 1: Accuracy of secondary structure prediction methods (A)

Method (B)   Q3 (C)   Q3 Claim (D)    BAD (E)   Info (F)   CorrH (G)   CorrE (H)   SOV (I)   Class (K)
PHD          71.7     71.6            -         -          -           0.60        68        78
JPred2       75.0     76.4            2.4       0.34       0.64        0.63        70        77
PHDpsi       75.0     -               -         -          -           0.62        70        81
PROF         77.0     -               2.1       0.37       0.67        0.65        73        83
PSIPRED      76.7     76.5-78.3 (M)   2.4       0.37       0.66        0.64        73        81
SSpro        76.2     76              2.6       0.36       0.67        0.65        71        83
EVA-4        77.8     -               2.0       0.38       0.69        0.67        -         83

A: Data set and sorting: the results are compiled by EVA [248]. All methods for which details are listed have been tested on 201 different new protein structures (EVA version Feb 2001). None of these proteins was similar to any protein used to develop the respective method. This set comprised the largest such set by Feb 7, 2001 for which we had results. Sorting and grouping reflects the following concept: if the data set is too small to distinguish between two methods, these two are grouped. For the given set of 195 proteins this yielded four groups. Inside each group, results are sorted alphabetically. Due to a lack of data, I could not add the performance of SAM-T99sec [168]; on a set of 105 proteins SAM-T99sec appears comparable to the best three: PSIPRED, SSpro, and PROF. Another method that appeared at least as accurate when tested on an earlier EVA set is missing since it is not publicly available [175].
B: Method: see abbreviations on top of article; EVA-4 refers to a simple average over the binary prediction output from PHDpsi, PSIPRED, SSpro, and PROF.
C: Q3: three-state per-residue accuracy (eq. 1).
D: Q3 Claim: three-state per-residue accuracy published in the original publication of the method: PSIPRED [11], SSpro [13], JPred2 [5], PHD [78].
E: BAD: percentage of helical residues predicted as strand, and of strand residues predicted as helix [56].
F: Info: per-residue information content [50].
G: CorrH: Matthews correlation coefficient for state helix [249].
H: CorrE: Matthews correlation coefficient for state strand [249].
I: SOV: three-state per-segment score averaged over the three-state segment overlap between predicted and observed segments [51, 58].
K: Class: percentage of proteins correctly sorted into one of the four classes: all-alpha (length > 60, helix > 45%, strand < 5%), all-beta (length > 60, helix < 5%, strand > 45%), alpha/beta (length > 60, helix > 30%, strand > 20%), other (thresholds for classification from: [99, 250, 78]).
M: accuracy range: PSIPRED results were published for different conversions of the eight DSSP states to three states.
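Two of the scores defined in the footnotes translate directly into code. The sketch below applies the class thresholds of footnote K to a three-state string (H, E, L) and computes the Matthews correlation coefficient for one state. Treating chains of 60 residues or fewer as 'other', and returning 0.0 when the coefficient is undefined, are assumptions of this sketch; the function names are invented.

```python
import math

def structural_class(ss: str) -> str:
    """Sort a protein into all-alpha, all-beta, alpha/beta or other
    from its three-state string, using the thresholds of footnote K."""
    length = len(ss)
    if length <= 60:  # assumption: short chains fall into 'other'
        return "other"
    helix = 100.0 * ss.count("H") / length
    strand = 100.0 * ss.count("E") / length
    if helix > 45 and strand < 5:
        return "all-alpha"
    if helix < 5 and strand > 45:
        return "all-beta"
    if helix > 30 and strand > 20:
        return "alpha/beta"
    return "other"

def matthews(observed: str, predicted: str, state: str) -> float:
    """Matthews correlation coefficient for one state (e.g. 'H' or 'E')."""
    pairs = list(zip(observed, predicted))
    tp = sum(o == state and p == state for o, p in pairs)
    tn = sum(o != state and p != state for o, p in pairs)
    fp = sum(o != state and p == state for o, p in pairs)
    fn = sum(o == state and p != state for o, p in pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0.0 for an undefined coefficient is a convention chosen here.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

print(structural_class("H" * 50 + "L" * 50))
print(matthews("HHHHLLLL", "HHHLLLLL", "H"))
```

The Class score of Table 1 would then be the percentage of proteins for which the class derived from the prediction equals the class derived from the observed structure.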

Understandable why certain proteins are predicted poorly? For some of the worst predicted proteins, the low level of accuracy could be anticipated from their unusual features, e.g., for crambin, or the antifreeze glycoprotein type III. However, this procedure turned out to be rather arbitrary. First, some proteins with the same 'unusual features' are predicted at high levels of accuracy. Second, occasionally similar proteins are predicted at very different levels of accuracy, e.g. the phosphatidylinositol 3-kinase [183] and the Src-homology domain of cytoskeletal spectrin have homologous structures [184], but prediction accuracy varies between less than 40% (pik) and more than 70% (spectrin). None of the conclusions from studying poor predictions has yielded a way to better predictions, yet. Nevertheless, two observations may be added. First, bad alignments (i.e. non-informative and/or falsely aligned residues) result in bad predictions. Second, the BAD predictions (Fig. 8, Table 1), i.e., the confusions of helix and strand, are frequently observed in regions that are stabilised by long-range interactions. For example, the peptide around the fourth strand of SH3 (Fig. 7) forms a helix in solution (Luis Serrano, personal communication). Furthermore, helices and strands that are confused despite a high reliability index often have functional properties, or are correlated to disease states (Rost, unpublished data). Regions predicted with equal propensity in two different states often correlate with 'structural switches' (see ASP below).

What is at the base of the recent improvement?

Sources of improvement: 4 parts database growth, 3 extended search, 2 other. Jones proposed two causes for the improved accuracy: (1) training and (2) testing the method on PSI-BLAST profiles. Cuff & Barton examined in detail how different alignment methods improve prediction accuracy [5]. However, which fraction of the improvement results from the mere growth of the database, which from using more diverged profiles, and which from training on larger profiles? Using PHD from 1994 to separate the effects [8], we first compared a non-iterative standard BLAST [150] search against SWISS-PROT [14] with one against SWISS-PROT + TrEMBL [14] + PDB [6]. The larger database improves performance by about two percentage points [8]. Secondly, we compared the standard BLAST against the big database with an iterative PSI-BLAST search. This yielded less than two percentage points additional improvement [8]. Thus, overall, the more divergent profile search against today's databases supposedly improves any method using alignment information by almost four percentage points (PHDpsi in Table 1). The improvement through using PSI-BLAST profiles to develop the method is relatively small: PHDpsi was trained on a small database of not very divergent profiles in 1994, whereas PROF was trained on PSI-BLAST profiles of a 20 times larger database in 2000. The two differ by only one percentage point (Table 1), and part of this difference resulted from implementing new concepts into PROF (Rost, unpublished; [9]).

Combination improves on non-systematic errors. Any prediction method has two sources of errors: (1) systematic errors, e.g., through non-local effects, and (2) white noise errors caused by, e.g., the succession of the examples during training neural networks. Theoretically, combining any number of methods improves accuracy as long as the errors of the individual methods are mutually independent and are not only systematic [185]. PHD - and more recently others [174, 5, 175] - utilised this fact by combining different neural networks. The idea of combining different prediction methods has been around in secondary structure prediction for a long time [186]; Cuff & Barton [3, 4] implemented it in JPred for different third generation methods. In particular, JPred uses a simple expert-rule for compiling the final average. Ross King et al. [187] have tested a variety of different combination strategies. Selbig et al. [188] have compiled the jury through an elaborate decision-tree based system. Guermeur et al. [189] have used a more refined variant of the JPred idea of weighting methods. Overall, combinations of independent prediction methods seem to yield levels of accuracy higher than that of the single best method. In particular, combining the four current best methods (PROF, PSIPRED, SSpro, and PHDpsi) improved prediction accuracy to 77.8% (EVA-4 in Table 1). However, for every protein one method tends to be clearly superior to the combined prediction. Is it really wise to include significantly inferior methods into combined prediction? No: averaging over all methods used for EVA decreased accuracy over the best individual methods, although averaging over the better ones was better than averaging the best ones (data not shown). Is there any criterion for when to include a method and when not? Concepts weighting the individual methods based on their accuracy and 'entropy' [175] appear successful only for large numbers of methods ([175], Rost, unpublished).
Nevertheless, methods that are significantly over-trained can improve when combined (Krogh, unpublished). More rigorous studies for the optimal combination may provide a better picture. The technical problem of utilising many methods in a public server is that the field is advancing too fast: today's methods are more accurate than averages over yesterday's methods (hence the JPred server now returns JPred2 results by default).
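The simplest combination, an average over the binary (one-hot) outputs of several methods, amounts to a per-residue plurality vote. The following sketch illustrates that idea only; it is not the code behind EVA-4 or JPred, and breaking ties in favour of the method listed first is an assumption. The prediction strings are invented.

```python
from collections import Counter

def combine(predictions: list) -> str:
    """Average the one-hot outputs of several methods per residue and return
    the state with the highest average (ties go to the earliest listed method)."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        votes = Counter(column)
        top = max(votes.values())
        # First state, in method order, that reaches the top vote count.
        consensus.append(next(s for s in column if votes[s] == top))
    return "".join(consensus)

methods = ["LHHHHLLEEEL",   # hypothetical method 1
           "LHHHLLLEEEL",   # hypothetical method 2
           "LLHHHLEEEEL",   # hypothetical method 3
           "LHHHHLLEELL"]   # hypothetical method 4
print(combine(methods))
```

Ordering the list by expected accuracy lets the best single method decide whenever the vote is tied.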

Availability of methods

Internet prediction services for secondary structure, in general. Programs for the prediction of secondary structure available as internet services have mushroomed since the first prediction service PredictProtein went on line in 1992 [149, 119] (a list of links in [190] ). Our META-PredictProtein server [191] enables users to access a number of the best prediction methods through one single interface. Unfortunately, not all methods available have been sufficiently tested, and some are not very accurate. We try to address this problem by maintaining EVA for the automatic evaluation of prediction servers [180, 181] . In general, prediction accuracy is significantly superior if predictions are based on multiple alignments [192, 155, 25] .

Completely vs. almost automatic. The PHD/PROF prediction methods are automatically available via the internet service PredictProtein [119] (use the web interface at http://cubic.bioc.columbia.edu/predictprotein or send the word help to PredictProtein@columbia.edu). Users have the choice between the fully automatic procedure taking the query sequence through the entire cycle, or expert intervention into the generation of the alignment. Indeed, without spending much time users typically can improve prediction accuracy easily by choosing 'good' alignments.

Are secondary structure predictions useful, in practice?

Regions likely to undergo structural change predicted successfully. Young, Kirshenbaum, Dill & Highsmith [1] have unravelled an impressive correlation between local secondary structure predictions and global conditions. The authors monitor regions for which secondary structure prediction methods give equally strong preferences for two different states. Such regions are processed combining simple statistics and expert-rules. The final method is tested on 16 proteins known to undergo structural rearrangements, and on a number of other proteins. The authors report no false positives, and identify most known structural switches. Subsequently, the group applied the method to the myosin family, identifying putative switching regions that were not known before but appeared to be reasonable candidates [193]. I find this method most remarkable in two ways: (1) it is the most general method using predictions of protein structure to predict some aspects of function, and (2) it illustrates that predictions may be useful even when structures are known (as in the case of the myosin family).

Classifying proteins based on secondary structure predictions in the context of genome analysis. Proteins can be classified into families based on predicted and observed secondary structure [28, 194] . However, such procedures have been limited to a very coarse-grained grouping only exceptionally useful to infer function. Nevertheless, in particular predictions of membrane helices and coiled-coil regions are crucial for genome analysis. Recently, we came across an observation that may have important implications for structural genomics, in particular: More than one fifth of all eukaryotic proteins appeared to have regions longer than 60 residues apparently lacking any regular secondary structure [195] . Most of these regions were not of low-complexity, i.e. not composition-biased. Surprisingly, these regions appeared evolutionarily as conserved as all other regions in the respective proteins. This application of secondary structure prediction may aid in classifying proteins, and in separating domains, possibly even in identifying particular functional motifs.

Aspects of protein function predicted based on expert-analysis of secondary structure. The typical scenario in which secondary structure predictions help to learn about function is that of experts combining predictions with their intuition, most often to find similarities to proteins of known function but insignificant sequence similarity [196, 197, 40, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207]. Usually, such applications are based on very specific details about predicted secondary structure. Thus, these successful correlations of secondary structure and function appear difficult to incorporate into automatic methods.

Exploring secondary structure predictions to improve database searches. Initially, three groups independently applied secondary structure predictions for fold recognition, i.e., the detection of structural similarities between proteins of unrelated sequences [208, 209, 210]. A few years later, almost every other fold recognition/threading method has adopted this concept [211, 212, 213, 214, 215, 216, 217, 218, 219, 220]. Two recent methods extended the concept by not only refining the database search, but by actually refining the quality of the alignment through an iterative procedure [221, 171]. A related strategy has been employed by Ng and the Henikoffs to improve predictions and alignments for membrane proteins [222].

From 1D predictions to 2D, and 3D structure. Are secondary structure predictions accurate enough to help predicting higher order aspects of protein structure automatically? 2D (inter-residue contacts) predictions: Baldi, Pollastri, Andersen & Brunak [223] have recently improved the level of accuracy in predicting beta-strand pairings over earlier work [153] through using another elaborate neural network system. 3D predictions: the following list of five groups exemplifies that secondary structure predictions have now become a popular first step toward predicting 3D structure. (1) Ortiz et al. [224] successfully use secondary structure predictions as one component of their 3D structure prediction method. (2) Eyrich et al. [225, 226] minimise the energy of arranging predicted rigid secondary structure segments. (3) Lomize et al. [227] also start from secondary structure segments. (4) Chen et al. [228] suggest using secondary structure predictions to reduce the complexity of molecular dynamics simulations. (5) Levitt et al. [229, 230] combine secondary structure-based simplified representations with a particular lattice simulation attempting to enumerate all possible folds.

And what is the limit of prediction accuracy?

88% is a limit, but shall we ever reach close to there? Protein secondary structure formation is influenced by long-range interactions [231, 46, 47] and by the environment [1, 232] . Consequently, stretches of up to 11 adjacent residues (dubbed chameleon after [231] ) can be found in different secondary structure states [233, 234, 235] . Implicitly, such non-local effects are contained in the exchange patterns of protein families. This is reflected by the fact that strand is predicted almost as accurately as helix ( Table 1 ), although sheets are stabilised by more non-local interactions than helices. Local profiles can even suffice to identify structural switches [193, 1] . Surprisingly, we can find some traces of folding events in secondary structure predictions [236] . Even more amazing is a study suggesting that alignment-based methods achieve similar levels of accuracy for chameleon regions as for all other regions [234] . Secondary structure assignments may vary for two versions of the same structure. One reason is that protein structures are no rocks but dynamic objects with some regions more mobile than others. Another reason is that any assignment method has to choose particular thresholds (e.g. DSSP chooses a cut-off in the Coulomb energy of a hydrogen bond). Consequently, assignments differ by about 5-15 percentage points between different X-ray versions or different NMR models for the same protein (Andersen & Rost, unpublished), and by about 12 percentage points between structural homologues [51] . The latter number provides the upper limit for secondary structure prediction of error-free comparative modelling. I doubt that ab initio predictions of secondary structure will ever become more accurate than that. Hence, I believe a value around 88% constitutes an operational upper limit for prediction accuracy. After the advances over the last two years we reached above 76%. Thus, we need to mount another twelve percentage points (or even less). 
What is the major obstacle to getting another six percentage points higher? The size of the experimental database, as suggested [233]? I doubt this, since PHDpsi trained on only 200 proteins using PSI-BLAST input is almost as accurate as PSIPRED trained on 2000 proteins (Table 1). Will the current explosion of sequences boost accuracy? In fact, current databases have fewer than 10 homologues for more than one third of the 150 tested proteins (Table 1), and more than 100 for only 20% of the proteins. Although based on too small a set for conclusions, for these 20% highly populated families the accuracy of PROF was four percentage points above average (data not shown). Thus, larger databases may get us six percentage points higher, or they may not. The answer remains nebulous.


The following notes have resulted from nine years of experience with running the PredictProtein server [119] and from various structure prediction workshops [237] . Some comments apply in particular to the PHD/PROF methods [7, 238] . However, most hold also for using other secondary structure prediction methods (a detailed list of 'Hints for users' is given on our WWW pages [119] ).

What can you expect from secondary structure prediction?

How accurate are the predictions? The expected levels of accuracy (PROF Q3 = 77±10%) are valid for typical globular, water-soluble proteins when the multiple alignment contains many and diverse sequences. High values for the reliability indices indicate more accurate predictions (Fig. 9). However, for alignments with little variation in the sequences, the reliability indices adopt misleadingly high values. PHD/PROF predictions tend to be relatively accurate for porins [7] ; however, for helical membrane proteins other programs ought to be used [7, 238, 39] .

Confusion between strand and helix? PHD (as well as other methods) focuses on predicting hydrogen bonds. Consequently, occasionally strongly predicted (high reliability index) helices are observed as strands and vice versa (Fig. 8, Table 1). In fact, some of these BAD predictions correspond to structural switching regions.

Strong signal from secondary structure caps? The ends of helices and strands contain a strong signal. However, on average PHD predicts the core of helices and strands more accurately than the caps [49]. This also holds for the other methods listed in Table 1 (data not shown).

Internal helices predicted poorly? Steven Benner has indicated that internal helices are difficult to predict [137, 53] . On average, this is not the case for PHD predictions [239] .

What about protein design and synthesised peptides? The PHD networks are trained on naturally evolved proteins. However, the predictions have been useful in some cases to investigate the influence of single mutations (e.g. for Chameleon [231, 240] , or for Janus [241] , Rost, unpublished). For short poly-peptides, users should bear in mind that the network input consists of 17 adjacent residues. Thus, shorter sequences may be dominated by the ends (which are treated as solvent by the current version of PHD).
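Why short peptides are dominated by the chain ends becomes obvious when the input windows are written out. The sketch below assumes one window of 17 adjacent residues centred on each residue, with positions beyond the termini filled by a placeholder symbol; the symbol `-` and the function name are hypothetical, and the actual input coding of PHD is more involved.

```python
WINDOW = 17  # number of adjacent residues in the network input (from the text)
PAD = "-"    # hypothetical symbol for positions beyond the chain ends

def windows(sequence: str, width: int = WINDOW):
    """Yield one fixed-width window per residue, centred on that residue."""
    half = width // 2
    padded = PAD * half + sequence + PAD * half
    for i in range(len(sequence)):
        yield padded[i:i + width]

peptide = "ACDEFGHIKL"  # invented 10-residue peptide
for w in windows(peptide):
    print(w, f"padding: {w.count(PAD)}/{WINDOW}")
```

For a peptide shorter than the window, every single input contains end positions, so no residue is predicted from a purely internal context.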

How can you avoid pitfalls?

70% correct implies 30% incorrect. The most accurate methods for predicting secondary structure reach sustained levels of about 70% accuracy. When interpreting predictions for a particular protein it is often instructive to mark the 30% of the residues you suspect to be falsely predicted.

Special classes of proteins. Prediction methods are usually derived from knowledge contained in proteins from subsets of current databases. Consequently, they should not be applied to classes of proteins not included in these subsets, e.g., methods for predicting helices in globular proteins are likely to fail when applied to predict transmembrane helices. In general, results should be taken with caution for proteins with unusual features, such as proline-rich regions, unusually many cysteine bonds, or for domain interfaces.

Better alignments yield better predictions. Multiple alignment-based predictions are substantially more accurate than single sequence-based predictions. How many sequences do you need in your alignment for an improvement, and how sensitive are prediction methods to errors in the alignment? The more divergent sequences the alignment contains, the better (two distantly related sequences often improve secondary structure predictions by several percentage points). Regions with few aligned sequences yield less reliable predictions. The sensitivity to alignment errors depends on the method, e.g., secondary structure prediction is less sensitive to alignment errors than accessibility prediction.
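Regions with few aligned sequences can be located by counting the non-gap residues in each alignment column. The following is a minimal sketch under stated assumptions: the alignment is given as equal-length strings, gaps are written as `-` or `.`, and the names `column_depth` and `thin_regions` as well as the threshold `min_depth` are hypothetical choices, not part of any prediction method described here.

```python
def column_depth(alignment, gap_chars="-."):
    """Number of non-gap residues in each column of the alignment."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [sum(row[col] not in gap_chars for row in alignment)
            for col in range(length)]

def thin_regions(alignment, min_depth=2):
    """Columns covered by fewer than `min_depth` sequences."""
    return [col for col, depth in enumerate(column_depth(alignment))
            if depth < min_depth]

# A made-up three-sequence alignment.
alignment = ["MKV-LA", "MRV-LG", "M--TLA"]
print(column_depth(alignment))
print(thin_regions(alignment))
```

Predictions for the columns reported by `thin_regions` rest on little evolutionary information and should be read with correspondingly more caution.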

1D structure may or may not be sufficient to infer 3D structure. Say you obtain the following prediction for regular secondary structure: helix-strand-strand-helix-strand-strand (H-E-E-H-E-E). Assume you find a protein of known structure with the same motif (H-E-E-H-E-E). Can you conclude that the two proteins have the same fold? Yes and no: your guess may be correct, but completely different structures can realise the given motif in various ways. For example, at least 16 structurally unrelated proteins contain the secondary structure motif 'H-E-E-H-E-E'.
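Reducing a per-residue prediction to a segment motif discards segment lengths and loop lengths, which is part of the reason why identical motifs do not imply identical folds. The sketch below collapses a three-state string into such a motif; the function name `segment_motif`, the use of `L` for 'other', and both example strings are assumptions for illustration.

```python
from itertools import groupby

def segment_motif(states, min_len=1):
    """Collapse a per-residue string (H = helix, E = strand, anything else
    = other) into the ordered list of helix/strand segments."""
    motif = []
    for state, run in groupby(states):
        if state in ("H", "E") and len(list(run)) >= min_len:
            motif.append(state)
    return "-".join(motif)

# Two made-up predictions with very different segment lengths.
pred_a = "LLHHHHHLLEEEELLEEEELLHHHHHLEEELLEEEL"
pred_b = "HHHHHHHHHHLELELLHHLELE"
print(segment_motif(pred_a))
print(segment_motif(pred_b))
```

Both strings reduce to the same motif although their helices and strands differ greatly in length, so motif identity alone is weak evidence for a common fold.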


Particular thanks to Volker Eyrich for his crucial help with setting up the META-PP and EVA servers, without which most of the results presented here would not exist. Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.


  1. Young, M., Kirshenbaum, K., Dill, K. A. & Highsmith, S. (1999). Predicting conformational switches in proteins. Prot. Sci.,8, 1752-1764.
  2. Bystroff, C., Thorsson, V. & Baker, D. (2000). HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol.,301, 173-190.
  3. Cuff, J. A., Clamp, M. E., Siddiqui, A. S., Finlay, M. & Barton, G. J. (1998). JPred: a consensus secondary structure prediction server. Bioinformatics,14, 892-893.
  4. Cuff, J. A. & Barton, G. J. (1999). Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins,34, 508-519.
  5. Cuff, J. A. & Barton, G. J. (2000). Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins,40, 502-511.
  6. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res.,28, 235-242.
  7. Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol.,266, 525-539.
  8. Przybylski, D. & Rost, B. (2001). PSI-BLAST for structure prediction: plug-in and win. Columbia University.

  9. Rost WWW, B. (2000). Better secondary structure prediction through more data. Columbia University, WWW document (http://cubic.bioc.columbia.edu/predictprotein).
  10. Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucl. Acids Res.,25, 3389-3402.
  11. <
  12. Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol.,292, 195-202.
  13. Karplus, K., Barrett, C. & Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics,14, 846-856.
  14. Baldi, P., Brunak, S., Frasconi, P., Soda, G. & Pollastri, G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics,15, 937-946.
  15. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res.,28, 45-48.
  16. Gaasterland, T. & Sensen, C. W. (1996). Fully automated genome analysis that reflects user needs and preferences - a detailed introduction to the MAGPIE system architecture. Biochimie,78, 302-310.
  17. Gaasterland, T. & Sensen, C. W. (1996). MAGPIE: automated genome interpretation. TIGS,12, 76-78.
  18. Liu WWW, J. & Rost, B. (2000). Analysing all proteins in entire genomes: distribution of protein length.
  19. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D. et al. (1977). The Protein Data Bank: a computer based archival file for macromolecular structures. J. Mol. Biol.,112, 535-542.
  20. Sanchez, R., Pieper, U., Mirkovic, N., de Bakker, P. I., Wittenstein, E. et al. (2000). MODBASE, a database of annotated comparative protein structure models. Nucl. Acids Res.,28, 250-3.
  21. Liu, J. & Rost, B. (2001). Similar percentages of helical membrane proteins in all organisms. Prot. Sci.,in submission.
  22. CASP1 (1995). Special issue: First Meeting on Critical Assessment of Protein Structure prediction (CASP). Proteins,23.
  23. CASP2 (1997). Special issue: Second Meeting on Critical Assessment of Protein Structure prediction (CASP). Proteins,Suppl. 2.
  24. CASP3 (1999). Special issue: Third Meeting on Critical Assessment of Protein Structure prediction (CASP). Proteins,Suppl. 2.
  25. CASP4WWW (2000). Fourth meeting on the critical assessment of techniques for protein structure prediction. Prediction Center, Lawrence Livermore National Lab, WWW document: http://PredictionCenter.llnl.gov/casp4/Casp4.html.
  26. Rost, B. & Sander, C. (1996). Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct.,25, 113-136.
  27. Rost, B. & Sander, C. (1994). Structure prediction of proteins - where are we now? Curr. Opin. Biotech.,5, 372-380.
  28. Rost, B. (1998). Protein structure prediction in 1D, 2D, and 3D. In The Encyclopaedia of Computational Chemistry (Schleyer, P. v. R., Allinger, N. L., Clark, T., Gasteiger, J., Kollman, P. A. et al., eds.), pp. 2242-2255, John Wiley & Sons, Chichester.
  29. Gerstein, M. & Levitt, M. (1997). A structural census of the current population of protein sequences. Proc. Natl. Acad. Sci. U.S.A.,94, 11911-11916.
  30. Teichmann, S. A., Chothia, C. & Gerstein, M. (1999). Advances in structural genomics. Curr. Opin. Str. Biol.,9, 390-399.
  31. Frishman, D. (2000). PEDANT: protein extraction, description, and analysis tool. Max-Planck-Institute, Munich.
  32. Liu WWW, J. & Rost, B. (2000). Analysing all proteins in entire genomes.
  33. Chou, K. C. & Elrod, D. W. (1999). Prediction of membrane protein types and subcellular locations. Proteins,34, 137-153.
  34. Monne, M., Hermansson, M. & von Heijne, G. (1999). A turn propensity scale for transmembrane helices. J. Mol. Biol.,288, 141-145.
  35. Pappu, R. V., Marshall, G. R. & Ponder, J. W. (1999). A potential smoothing algorithm accurately predicts transmembrane helix packing [published erratum appears in Nat Struct Biol 1999 Feb;6(2):199]. Nat. Struct. Biol.,6, 50-55.
  36. Pasquier, C., Promponas, V. J., Palaios, G. A., Hamodrakas, J. S. & Hamodrakas, S. J. (1999). A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm. Prot. Engin.,12, 381-385.
  37. Pilpel, Y., Ben-Tal, N. & Lancet, D. (1999). kPROT: a knowledge-based scale for the propensity of residue orientation in transmembrane segments. Application to membrane protein structure prediction. J. Mol. Biol.,294, 921-935.
  38. Lio, P. & Vannucci, M. (2000). Wavelet change-point prediction of transmembrane proteins. Bioinformatics,16, 376-82.
  39. Kühlbrandt, W. & Gouaux, E. (1999). Membrane proteins. Curr. Opin. Str. Biol.,9, 445-447.
  40. Rost, B. & O'Donoghue, S. I. (1997). Sisyphus and prediction of protein structure. CABIOS,13, 345-356.
  41. de Fays, K., Tibor, A., Lambert, C., Vinals, C., Denoel, P. et al. (1999). Structure and function prediction of the Brucella abortus P39 protein by comparative modeling with marginal sequence similarities. Prot. Engin.,12, 217-223.
  42. Lupas, A. (1997). Predicting coiled-coil regions in proteins. Curr. Opin. Str. Biol.,7, 388-393.
  43. Wolf, E., Kim, P. S. & Berger, B. (1997). MultiCoil: a program for predicting two- and three-stranded coiled coils. Prot. Sci.,6, 1179-1189.
  44. O'Donoghue, S. I. & Nilges, M. (1997). Tertiary structure prediction using mean-force potentials and internal energy functions: successful prediction for coiled-coil geometries. Folding & Design,2, S47-S52.
  45. Kolinski, A., Skolnick, J., Godzik, A. & Hu, W. P. (1997). A method for the prediction of surface "U"-turns and transglobular connections in small proteins. Proteins,27, 290-308.
  46. Shepherd, A. J., Gorse, D. & Thornton, J. M. (1999). Prediction of the location and type of beta-turns in proteins using neural networks. Prot. Sci.,8, 1045-55.
  47. Muñoz, V., Cronet, P., López-Hernández, E. & Serrano, L. (1996). Analysis of the effect of local interactions on protein stability. Folding & Design,1, 167-178.
  48. Villegas, V., Zurdo, J., Filimonov, V. V., Aviles, F. X., Dobson, C. M. et al. (2000). Protein engineering as a strategy to avoid formation of amyloid fibrils. Prot. Sci.,9, 1700-8.
  49. Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers,22, 2577-2637.
  50. Rost, B. & Sander, C. (1994). 1D secondary structure prediction through evolutionary profiles. In Protein Structure by Distance Analysis (Bohr, H. & Brunak, S., eds.), pp. 257-276, IOS Press, Amsterdam, Oxford, Washington.
  51. Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol.,232, 584-599.
  52. Rost, B., Sander, C. & Schneider, R. (1994). Redefining the goals of protein secondary structure prediction. J. Mol. Biol.,235, 13-26.
  53. Thornton, J. M., Flores, T. P., Jones, D. T. & Swindells, M. B. (1992). Prediction of progress at last. Nature,354, 105-106.
  54. Benner, S. A. & Gerloff, D. L. (1993). Predicting the conformation of proteins: man versus machine. FEBS Lett.,325, 29-33.
  55. Rost, B., Sander, C. & Schneider, R. (1993). Progress in protein structure prediction? TIBS,18, 120-123.
  56. Russell, R. B. & Barton, G. J. (1993). The limits of protein secondary structure prediction accuracy from multiple sequence alignment. J. Mol. Biol.,234, 951-957.
  57. Defay, T. & Cohen, F. E. (1995). Evaluation of current techniques for ab initio protein structure prediction. Proteins,23, 431-445.
  58. Rost, B. & Sander, C. (1993). Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proc. Natl. Acad. Sci. U.S.A.,90, 7558-7562.
  59. Zemla, A., Venclovas, C., Fidelis, K. & Rost, B. (1999). A modified definition of SOV, a segment-based measure for protein secondary structure prediction assessment. Proteins,34, 220-223.
  60. Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin.,12, 85-94.

  61. Kendrew, J. C., Dickerson, R. E., Strandberg, B. E., Hart, R. J., Davies, D. R. et al. (1960). Structure of myoglobin: a three-dimensional Fourier synthesis at 2 Å resolution. Nature,185, 422-427.
  62. Perutz, M. F., Rossmann, M. G., Cullis, A. F., Muirhead, G., Will, G. et al. (1960). Structure of haemoglobin: a three-dimensional Fourier synthesis at 5.5 Å resolution, obtained by X-ray analysis. Nature,185, 416-422.
  63. Pauling, L. & Corey, R. B. (1951). Configurations of Polypeptide Chains with Favored Orientations Around Single Bonds: Two New Pleated Sheets. Proc. Natl. Acad. Sci. U.S.A.,37, 729-740.
  64. Pauling, L., Corey, R. B. & Branson, H. R. (1951). The Structure of Proteins: Two Hydrogen-bonded Helical Configurations of the Polypeptide Chain. Proc. Natl. Acad. Sci. U.S.A.,37, 205-234.
  65. Szent-Györgyi, A. G. & Cohen, C. (1957). Role of proline in polypeptide chain configuration of proteins. Science,126, 697.
  66. Blout, E. R., de Lozé, C., Bloom, S. M. & Fasman, G. D. (1960). Dependence of the conformation of synthetic polypeptides on amino acid composition. J. Am. Chem. Soc.,82, 3787-3789.
  67. Blout, E. R. (1962). The dependence of the conformation of polypeptides and proteins upon amino acid composition. In Polyamino Acids, Polypeptides, and Proteins (Stahman, M., eds.), pp. 275-279, Univ. of Wisconsin Press, Madison.
  68. Scheraga, H. A. (1960). Structural studies of ribonuclease III. A model for the secondary and tertiary structure. J. Am. Chem. Soc.,82, 3847-3852.
  69. Davies, D. R. (1964). A correlation between amino acid composition and protein structure. J. Mol. Biol.,9, 605-609.
  70. Schiffer, M. & Edmundson, A. B. (1967). Use of helical wheels to represent the structures of proteins and to identify segments with helical potential. Biophys. J.,7, 121.
  71. Pain, R. H. & Robson, B. (1970). Analysis of the Code Relating Sequence to Secondary Structure in Proteins. Nature,227, 62-63; Finkelstein, A. V. & Ptitsyn, O. B. (1971). Statistical analysis of the correlation among amino acid residues in helical, β-structural and non-regular regions of globular proteins. J. Mol. Biol.,62, 613-624.
  72. Robson, B. & Pain, R. H. (1971). Analysis of the Code Relating Sequence to Conformation in Proteins: Possible Implications for the Mechanism of Formation of Helical Regions. J. Mol. Biol.,58, 237-259.
  73. Chou, P. Y. & Fasman, G. D. (1974). Prediction of protein conformation. Biochem.,13, 211-215.
  74. Lim, V. I. (1974). Structural Principles of the Globular Organization of Protein Chains. A Stereochemical Theory of Globular Protein Secondary Structure. J. Mol. Biol.,88, 857-872.
  75. Rose, G. D. (1978). Prediction of chain turns in globular proteins on a hydrophobic basis. Nature,272, 586-90.
  76. Kabsch, W. & Sander, C. (1983). How good are predictions of protein secondary structure? FEBS Lett.,155, 179-182.
  77. Rost, B. & Sander, C. (1993). Secondary structure prediction of all-helical proteins in two states. Prot. Engin.,6, 831-836.
  78. Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins,19, 55-72.
  79. Kabat, E. A. & Wu, T. T. (1973). The influence of nearest-neighbor amino acids on the conformation of the middle amino acid in proteins: Comparison of predicted and experimental determination of β-sheets in concanavalin A. Proc. Natl. Acad. Sci. U.S.A.,70, 1473-1477.
  80. Maxfield, F. R. & Scheraga, H. A. (1976). Status of Empirical Methods for the Prediction of Protein Backbone Topography. Biochem.,15, 5138-5153.
  81. Robson, B. (1976). Conformational properties of amino acid residues in globular proteins. J. Mol. Biol.,107, 327-56.
  82. Nagano, K. (1977). Triplet Information in Helix Prediction Applied to the Analysis of Super-secondary Structures. J. Mol. Biol.,109, 251-274.
  83. Garnier, J., Osguthorpe, D. J. & Robson, B. (1978). Analysis of the accuracy and Implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol.,120, 97-120.
  84. Gibrat, J.-F., Garnier, J. & Robson, B. (1987). Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J. Mol. Biol.,198, 425-443.
  85. Biou, V., Gibrat, J. F., Levin, J. M., Robson, B. & Garnier, J. (1988). Secondary structure prediction: combination of three different methods. Prot. Engin.,2, 185-91.
  86. Gascuel, O. & Golmard, J. L. (1988). A simple method for predicting the secondary structure of globular proteins: implications and accuracy. CABIOS,4, 357-365.
  87. Lupas, A., Van Dyke, M. & Stock, J. (1991). Predicting coiled coils from protein sequences. Science,252, 1162-1164.
  88. Viswanadhan, V. N., Denckla, B. & Weinstein, J. N. (1991). New joint prediction algorithm (Q7-JASEP) improves the prediction of protein secondary structure. Biochem.,30, 11164-11172.
  89. Juretic, D., Lee, B., Trinajstic, N. & Williams, R. W. (1993). Conformational preference functions for predicting helices in membrane proteins. Biopolymers,33, 255-273.
  90. Mamitsuka, H. & Yamanishi, K. (1993). Protein α-helix region prediction based on stochastic-rule learning. In 26th Annual Hawaii International Conference on System Sciences, pp. 659-668, IEEE Computer Society, Maui, HI, U.S.A.
  91. Donnelly, D., Overington, J. P. & Blundell, T. L. (1994). The prediction and orientation of α-helices from sequence alignments: the combined use of environment-dependent substitution tables, Fourier transform methods and helix capping rules. Prot. Engin.,7, 645-653.
  92. Ptitsyn, O. B. & Finkelstein, A. V. (1983). Theory of protein secondary structure and algorithm of its prediction. Biopolymers,22, 15-25.
  93. Taylor, W. R. & Thornton, J. M. (1983). Prediction of super-secondary structure in proteins. Nature,301, 540-542.
  94. Cohen, F. E. & Kuntz, I. D. (1989). Tertiary Structure Prediction. In Prediction of protein structure and the principles of protein conformation (Fasman, G. D., eds.), pp. 647-706, Plenum Press, New York, London.
  95. Rooman, M. J., Kocher, J. P. & Wodak, S. J. (1991). Prediction of protein backbone conformation based on seven structure assignments: influence of local interactions. J. Mol. Biol.,221, 961-979.
  96. Bohr, H., Bohr, J., Brunak, S., Cotterill, R. M. J., Lautrup, B. et al. (1988). Protein secondary structure and homology by neural networks. FEBS Lett.,241, 223-228.
  97. Qian, N. & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol.,202, 865-884.
  98. Holley, H. L. & Karplus, M. (1989). Protein secondary structure prediction with a neural network. Proc. Natl. Acad. Sci. U.S.A.,86, 152-156.
  99. Kneller, D. G., Cohen, F. E. & Langridge, R. (1990). Improvements in Protein Secondary Structure Prediction by an Enhanced Neural Network. J. Mol. Biol.,214, 171-182.
  100. Stolorz, P., Lapedes, A. & Xia, Y. (1992). Predicting protein secondary structure using neural net and statistical methods. J. Mol. Biol.,225, 363-377.
  101. Zhang, X., Mesirov, J. P. & Waltz, D. L. (1992). Hybrid system for protein secondary structure prediction. J. Mol. Biol.,225, 1049-63.
  102. Maclin, R. & Shavlik, J. W. (1993). Using knowledge-based neural networks to improve algorithms: refining the Chou-Fasman algorithm for protein folding. Machine Learning,11, 195-215.
  103. Chandonia, J.-M. & Karplus, M. (1995). Neural networks for secondary structure and structural class predictions. Prot. Sci.,4, 275-285.
  104. Mitchell, E. M., Artymiuk, P. J., Rice, D. W. & Willett, P. (1992). Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol.,212, 151-166.
  105. Geourjon, C. & Deléage, G. (1995). SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments. CABIOS,11, 681-684.
  106. Kanehisa, M. (1988). A multivariate analysis method for discriminating protein secondary structural segments. Prot. Engin.,2, 87-92.
  107. Munson, P. J. & Singh, R. K. (1997). Multi-body interactions within the graph of protein structure. In Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. et al., eds.), pp. 198-201, AAAI Press, Halkidiki, Greece.
  108. King, R. D., Muggleton, S., Lewis, R. A. & Sternberg, M. J. E. (1992). Drug design by machine learning: The use of inductive logic programming to model the structure-activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proc. Natl. Acad. Sci. U.S.A.,89, 11322-11326.
  109. Muggleton, S., King, R. D. & Sternberg, M. J. E. (1992). Protein secondary structure prediction using logic-based machine learning. Prot. Engin.,5, 647-657.
  110. Frishman, D. & Argos, P. (1995). Knowledge-based protein secondary structure assignment. Proteins,23, 566-579.
  111. Zhu, Z.-Y. & Blundell, T. L. (1996). The use of amino acid patterns of classified helices and strands in secondary structure prediction. J. Mol. Biol.,260, 261-276.
  112. Asogawa, M. (1997). Beta-sheet prediction using inter-strand residue pairs and refinement with Hopfield neural network. Ismb,5, 48-51.
  113. Yi, T.-M. & Lander, E. S. (1993). Protein Secondary Structure Prediction Using Nearest-neighbor Methods. J. Mol. Biol.,232, 1117-1129.
  114. Solovyev, V. V. & Salamov, A. A. (1994). Predicting α-helix and β-strand segments of globular proteins. CABIOS,10, 661-669.
  115. Salamov, A. A. & Solovyev, V. V. (1995). Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignment. J. Mol. Biol.,247, 11-15.
  116. Kabsch, W. & Sander, C. (1983). Segment83. unpublished.
  117. Schneider, R. (1989). Sekundärstrukturvorhersage von Proteinen unter Berücksichtigung von Tertiärstrukturaspekten. Department of Biology, Univ. Heidelberg, FRG, Diploma thesis.
  118. Devereux, J., Haeberli, P. & Smithies, O. (1984). GCG package. Nucl. Acids Res.,12, 387-395.
  119. Rost WWW, B. (2000). PredictProtein - internet prediction service.
  120. Dao-pin, S., Sauer, U., Nicholson, H. & Matthews, B. W. (1991). Contributions of surface salt bridges to the stability of bacteriophage T4 lysozyme determined by directed mutagenesis. Biochem.,30, 7142-7153.
  121. Chothia, C. & Lesk, A. M. (1986). The relation between the divergence of sequence and structure in proteins. EMBO J.,5, 823-826.
  122. Doolittle, R. F. (1986). Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley California.
  123. Lesk, A. M. (1991). Protein Architecture - A Practical Approach. Oxford University Press, Oxford, New York, Tokyo.
  124. Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins,9, 56-68.
  125. Rost, B. (1997). Protein structures sustain evolutionary drift. Folding & Design,2, S19-S24.
  126. Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J. Mol. Biol.,301, 679-689.
  127. Lichtarge, O., Bourne, H. R. & Cohen, F. E. (1996). An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol.,257, 342-358.
  128. Pazos, F., Sanchez-Pulido, L., Garcia-Ranea, J. A., Andrade, M. A., Atrian, S. et al. (1997). Comparative analysis of different methods for the detection of specificity regions in protein families. In BCEC97: Bio-Computing and Emergent Computation (Olsson, B., Lundh, D. & Narayanan, A., eds.), pp. 132-145, World Scientific, Skövde, Sweden.
  129. Dickerson, R. E., Timkovich, R. & Almassy, R. J. (1976). The cytochrome fold and the evolution of bacterial energy metabolism. J. Mol. Biol.,100, 473-491.
  130. Dickerson, R. E. (1971). The structure of cytochrome c and the rates of molecular evolution. J. Mol. Evol.,1, 26-45.
  131. Benner, S. A. (1989). Patterns of divergence in homologous proteins as indicators of tertiary and quaternary structure. Adv. Enzyme Regul.,28, 219-236.
  132. Frampton, J., Leutz, A., Gibson, T. J. & Graf, T. (1989). DNA-binding domain ancestry. Nature,342, 134.
  133. Bazan, J. F. (1990). Structural design and molecular evolution of a cytokine receptor superfamily. Proc. Natl. Acad. Sci. U.S.A.,87, 6934-6938.
  134. Benner, S. A. & Gerloff, D. (1990). Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure of the catalytic domain of protein kinases. Adv. Enz. Reg.,31, 121-181.
  135. Niermann, T. & Kirschner, K. (1990). Improving the prediction of secondary structure of 'TIM-barrel' enzymes. Protein Eng.,4, 137-147.
  136. Barton, G. J., Newman, R. H., Freemont, P. S. & Crumpton, M. J. (1991). Amino acid sequence analysis of the annexin super-gene family of proteins. Eur. J. Biochem.,198, 749-760.
  137. Benner, S. A. (1992). Predicting de novo the folded structure of proteins. Curr. Opin. Str. Biol.,2, 402-412.
  138. Gibson, T. J. (1992). Assignment of α-helices in multiply aligned protein sequences - applications to DNA binding motifs. In Patterns in Protein Sequence and Structure (Taylor, W. R., eds.), pp. 99-110, Springer-Verlag, Berlin-Heidelberg.
  139. Musacchio, A., Gibson, T., Lehto, V.-P. & Saraste, M. (1992). SH3 - an abundant protein domain in search of a function. FEBS Lett.,307, 55-61.
  140. Barton, G. J. & Russell, R. B. (1993). Protein structure prediction. Nature,361, 505-506.
  141. Boscott, P. E., Barton, G. J. & Richards, W. G. (1993). Secondary structure prediction for homology modelling. Prot. Engin.,6, 261-266.
  142. Gerloff, D. L., Jenny, T. F., Knecht, L. J., Gonnet, G. H. & Benner, S. A. (1993). The nitrogenase MoFe protein. FEBS Lett.,318, 118-124.
  143. Gibson, T. J., Thompson, J. D. & Abagyan, R. A. (1993). Proposed structure for the DNA-binding domain of the Helix-Loop-Helix family of eukaryotic gene regulatory proteins. Prot. Engin.,6, 41-50.
  144. Livingstone, C. D. & Barton, G. J. (1994). Secondary structure prediction from multiple sequence data: blood clotting factor XIII and Yersinia protein-tyrosine phosphatase. Int. J. Peptide Protein Res.,44, 239-244.
  145. Valencia, A., Hubbard, T. J., Muga, A., Bañuelos, S., Llorca, O. et al. (1995). Prediction of the structure of GroES and its interaction with GroEL. Proteins,22, 199-209.
  146. Hansen, J. E., Lund, O., Nielsen, J. O., Brunak, S. & Hansen, J.-E. S. (1996). Prediction of the secondary structure of HIV-1 gp120. Proteins,25, 1-11.
  147. Maxfield, F. R. & Scheraga, H. A. (1979). Improvements in the Prediction of Protein Topography by Reduction of Statistical Errors. Biochem.,18, 697-704.
  148. Zvelebil, M. J., Barton, G. J., Taylor, W. R. & Sternberg, M. J. E. (1987). Prediction of protein secondary structure and active sites using alignment of homologous sequences. J. Mol. Biol.,195, 957-961.
  149. Rost, B., Sander, C. & Schneider, R. (1994). PHD - an automatic server for protein secondary structure prediction. CABIOS,10, 53-60.
  150. Altschul, S. F. & Gish, W. (1996). Local alignment statistics. Meth. Enzymol.,266, 460-480.
  151. Schneider, R. (1994). Sequenz und Sequenz-Struktur Vergleiche und deren Anwendung für die Struktur- und Funktionsvorhersage von Proteinen. Univ. of Heidelberg, PhD.
  152. Levin, J. M., Pascarella, S., Argos, P. & Garnier, J. (1993). Quantification of secondary structure prediction improvement using multiple alignment. Prot. Engin.,6, 849-854.
  153. Hubbard, T. J. P. & Park, J. (1995). Fold recognition and ab initio structure predictions using Hidden Markov models and β-strand pair potentials. Proteins,23, 398-402.
  154. Mehta, P. K., Heringa, J. & Argos, P. (1995). A simple and fast approach to prediction of protein secondary structure from multiply aligned sequences with accuracy above 70%. Prot. Sci.,4, 2517-2525.
  155. Di Francesco, V., Garnier, J. & Munson, P. J. (1996). Improving protein secondary structure prediction with aligned homologous sequences. Prot. Sci.,5, 106-113.
  156. Gerloff, D. L. & Cohen, F. E. (1996). Secondary structure prediction and unrefined tertiary structure prediction for cyclin A, B, and D. Proteins,24, 18-34.
  157. King, R. D. & Sternberg, M. J. (1996). Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Prot. Sci.,5, 2298-2310.
  158. Riis, S. K. & Krogh, A. (1996). Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J. Comp. Biol.,3, 163-183.
  159. Frishman, D. & Argos, P. (1997). 75% accuracy in protein secondary structure prediction. Proteins,27, 329-335.
  160. Salamov, A. A. & Solovyev, V. V. (1997). Protein secondary structure prediction using local alignments. J. Mol. Biol.,268, 31-36.
  161. Barton, G. J. (1996). Protein sequence alignment and database scanning. In Protein structure prediction (Sternberg, M. J. E., eds.), pp. 31-64, Oxford Univ. Press, Oxford.
  162. Gribskov, M. & Veretnik, S. (1996). Identification of sequence patterns with profile analysis. Meth. Enzymol.,266, 198-227.
  163. Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). Using CLUSTAL for multiple sequence alignments. Meth. Enzymol.,266, 383-402.
  164. Hughey, R. & Krogh, A. (1996). Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS,12, 95-107.
  165. Orengo, C. A. & Taylor, W. R. (1996). SSAP: Sequential structure alignment program for protein structure comparison. Meth. Enzymol.,266, 617-635.
  166. Taylor, W. R. (1996). Multiple protein sequence alignment: algorithms and gap insertion. Meth. Enzymol.,266, 343-367.
  167. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics,14, 755-763.
  168. Karplus, K., Barrett, C., Cline, M., Diekhans, M., Grate, L. et al. (1999). Predicting protein structure using only sequence information. Proteins,S3, 121-125.
  169. Orengo, C. A., Bray, J. E., Hubbard, T., LoConte, L. & Sillitoe, I. (1999). Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction. Proteins,37, 149-170.
  170. Cuff, J. A., Birney, E., Clamp, M. E. & Barton, G. J. (2000). ProtEST: protein multiple sequence alignments from expressed sequence tags. Bioinformatics,16, 111-6.
  171. Jennings, A. J., Edge, C. M. & Sternberg, M. J. E. (2000). An approach to improve multiple alignments of protein sequences using predicted secondary structure. Prot. Engin.,in press.
  172. Reczko, M. (1993). Protein secondary structure prediction with partially recurrent neural networks. In First International Workshop on Neural Networks Applied to Chemistry and Environmental Sciences, pp. 153-159, Gordon and Breach Science Publ., Lyon, France.
  173. Ouali, M. & King, R. D. (2000). Cascaded multiple classifiers for secondary structure prediction. Prot. Sci.,9, 1162-1176.
  174. Chandonia, J. M. & Karplus, M. (1999). New methods for accurate prediction of protein secondary structure. Proteins,35, 293-306.
  175. Petersen, T. N., Lundegaard, C., Nielsen, M., Bohr, H., Bohr, J. et al. (2000). Prediction of protein secondary structure at 80% accuracy. Proteins,41, 17-20.
  176. Schmidler, S. C., Liu, J. S. & Brutlag, D. L. (2000). Bayesian segmentation of protein secondary structure. J. Comp. Biol.,7, 233-248.
  177. Aurora, R. & Rose, G. D. (1998). Helix capping. Prot. Sci.,7, 21-38.

  178. Figureau, A., Angelica Soto, M. & Toha, J. (1999). Secondary structure of proteins and three-dimensional pattern recognition. J. Theor. Biol.,201, 103-11.
  179. Wang, Z.-X. & Yuan, Z. (2000). How good is prediction of protein structural class by the component-coupled method? Proteins,38, 165-175.
  180. Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics,in submission.
  181. Eyrich WWW, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. 2001.
  182. Rychlewski WWW, L. & Fischer, D. (2000). LiveBench: continuous benchmarking of prediction servers. IIMCB Warsaw, WWW document (http://BioInfo.PL/LiveBench/).
  183. Koyama, S., Yu, H., Dalgarno, D. C., Shin, T. B., Zydowsky, L. D. et al. (1993). Structure of the PI3K SH3 Domain and Analysis of the SH3 Family. Cell,72, 945-952.
  184. Musacchio, A., Noble, M., Pauptit, R., Wierenga, R. & Saraste, M. (1992). Crystal structure of a Src-homology 3 (SH3) domain. Nature,359, 851-855.
  185. Hansen, L. K. & Salamon, P. (1990). Neural Network Ensembles. IEEE Trans. Pattern Anal. Machine Intel.,12, 993-1001.
  186. Rost, B. & Sander, C. (2000). Third generation prediction of secondary structure. In Protein structure prediction: methods and protocols (Webster, D., eds.), pp. 71-95, Humana Press, Totowa, NJ.
  187. King, R. D., Ouali, M., Strong, A. T., Aly, A., Elmaghraby, A. et al. (2000). Is it better to combine predictions? Prot. Engin.,13, 15-19.
  188. Selbig, J., Mevissen, T. & Lengauer, T. (1999). Decision tree-based formation of consensus protein secondary structure prediction. Bioinformatics,15, 1039-1046.
  189. Guermeur, Y., Geourjon, C., Gallinari, P. & Deleage, G. (1999). Improved performance in protein secondary structure prediction by inhomogeneous score combination. Bioinformatics,15, 413-421.
  190. Rost WWW, B. (2001). WWW services for sequence analysis. EMBL, WWW document (http://cubic.bioc.columbia.edu/doc/links_index.html).
  191. Eyrich WWW, V. & Rost, B. (2000). The META-PredictProtein server.
  192. Barton, G. J. (1995). Protein secondary structure prediction. Curr. Opin. Str. Biol.,5, 372-376.
  193. Kirshenbaum, K., Young, M. & Highsmith, S. (1999). Predicting allosteric switches in myosins. Prot. Sci.,8, 1806-1815.
  194. Przytycka, T., Aurora, R. & Rose, G. D. (1999). A protein taxonomy based on secondary structure. Nat. Struct. Biol.,6, 672-682.
  195. Liu T, J., Tan, H. & Rost, B. (2000). Genomes full of proteins with long non-structured regions? Columbia University.
  196. Brautigam, C., Steenbergen-Spanjers, G. C., Hoffmann, G. F., Dionisi-Vici, C., van den Heuvel, L. P. et al. (1999). Biochemical and molecular genetic characteristics of the severe form of tyrosine hydroxylase deficiency. Clin. Chem.,45, 2073-2078.
  197. Davies, G. P., Martin, I., Sturrock, S. S., Cronshaw, A., Murray, N. E. et al. (1999). On the structure and operation of type I DNA restriction enzymes. J. Mol. Biol.,290, 565-579.
  198. Di Stasio, E., Sciandra, F., Maras, B., Di Tommaso, F., Petrucci, T. C. et al. (1999). Structural and functional analysis of the N-terminal extracellular region of beta-dystroglycan. Biochem. Biophys. Res. Commun.,266, 274-278.
  199. Gerloff, D. L., Cannarozzi, G. M., Joachimiak, M., Cohen, F. E., Schreiber, D. et al. (1999). Evolutionary, mechanistic, and predictive analyses of the hydroxymethyldihydropterin pyrophosphokinase family of proteins. Biochem. Biophys. Res. Commun.,254, 70-76.
  200. Juan, H. F., Hung, C. C., Wang, K. T. & Chiou, S. H. (1999). Comparison of three classes of snake neurotoxins by homology modeling and computer simulation graphics. Biochem. Biophys. Res. Commun.,257, 500-510.
  201. Laval, V., Chabannes, M., Carriere, M., Canut, H., Barre, A. et al. (1999). A family of Arabidopsis plasma membrane receptors presenting animal beta-integrin domains. Biochim. Biophys. Ac.,1435, 61-70.
  202. Seto, M. H., Liu, H. L., Zajchowski, D. A. & Whitlow, M. (1999). Protein fold analysis of the B30.2-like domain. Proteins,35, 235-249.
  203. Xu, H., Aurora, R., Rose, G. D. & White, R. H. (1999). Identifying two ancient enzymes in Archaea using predicted secondary structure alignment. Nat. Struct. Biol.,6, 750-754.
  204. Jackson, R. M. & Russell, R. B. (2000). The serine protease inhibitor canonical loop conformation: examples found in extracellular hydrolases, toxins, cytokines and viral proteins. J. Mol. Biol.,296, 325-334.
  205. Paquet, J. Y., Vinals, C., Wouters, J., Letesson, J. J. & Depiereux, E. (2000). Topology prediction of Brucella abortus Omp2b and Omp2a porins after critical assessment of transmembrane beta strands prediction by several secondary structure prediction methods. J. Biomol. Struct. Dyn.,17, 747-757.
  206. Shah, P. S., Bizik, F., Dukor, R. K. & Qasba, P. K. (2000). Active site studies of bovine alpha1→3-galactosyltransferase and its secondary structure prediction. Biochim. Biophys. Ac.,1480, 222-234.
  207. Stawiski, E. W., Baucom, A. E., Lohr, S. C. & Gregoret, L. M. (2000). Predicting protein function from structure: unique structural features of proteases. Proc. Natl. Acad. Sci. U.S.A.,97, 3954-3958.
  208. Rost, B. (1995). TOPITS: Threading One-dimensional Predictions Into Three-dimensional Structures. In Third International Conference on Intelligent Systems for Molecular Biology (Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T. et al., eds.), pp. 314-321, Menlo Park, CA: AAAI Press, Cambridge, England.
  209. Fischer, D. & Eisenberg, D. (1996). Fold recognition using sequence-derived properties. Prot. Sci.,5, 947-955.
  210. Russell, R. B., Copley, R. R. & Barton, G. J. (1996). Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol.,259, 349-365.
  211. Ayers, D. J., Gooley, P. R., Widmer-Cooper, A. & Torda, A. E. (1999). Enhanced protein fold recognition using secondary structure information from NMR. Prot. Sci.,8, 1127-1133.
  212. de la Cruz, X. & Thornton, J. M. (1999). Factors limiting the performance of prediction-based fold recognition methods. Prot. Sci.,8, 750-759.
  213. Di Francesco, V., Munson, P. J. & Garnier, J. (1999). FORESST: fold recognition from secondary structure predictions of proteins. Bioinformatics,15, 131-140.
  214. Hargbo, J. & Elofsson, A. (1999). Hidden Markov models that use predicted secondary structures for fold recognition. Proteins,36, 68-76.
  215. Jones, D. T. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol.,287, 797-815.
  216. Jones, D. T., Tress, M., Bryson, K. & Hadley, C. (1999). Successful recognition of protein folds using threading methods biased by sequence similarity and predicted secondary structure. Proteins,37, 104-111.
  217. Koretke, K. K., Russell, R. B., Copley, R. R. & Lupas, A. N. (1999). Fold recognition using sequence and secondary structure information. Proteins,37, 141-148.
  218. Ota, M., Kawabata, T., Kinjo, A. R. & Nishikawa, K. (1999). Cooperative approach for the protein fold recognition. Proteins,37, 126-132.
  219. Panchenko, A., Marchler-Bauer, A. & Bryant, S. H. (1999). Threading with explicit models for evolutionary conservation of structure and sequence. Proteins,Suppl 3, 133-140.
  220. Kelley, L. A., MacCallum, R. M. & Sternberg, M. J. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol.,299, 499-520.
  221. Heringa, J. (1999). Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem.,23, 341-364.
  222. Ng, P., Henikoff, J. & Henikoff, S. (2000). PHAT: a transmembrane-specific substitution matrix. Bioinformatics,16, in press.
  223. Baldi, P., Pollastri, G., Andersen, C. A. & Brunak, S. (2000). Matching protein beta-sheet partners by feedforward and recurrent neural networks. ISMB,8, 25-36.
  224. Ortiz, A. R., Kolinski, A., Rotkiewicz, P., Ilkowski, B. & Skolnick, J. (1999). Ab initio folding of proteins using restraints derived from evolutionary information. Proteins,Suppl 3, 177-185.
  225. Eyrich, V. A., Standley, D. M., Felts, A. K. & Friesner, R. A. (1999). Protein tertiary structure prediction using a branch and bound algorithm. Proteins,35, 41-57.
  226. Eyrich, V. A., Standley, D. M. & Friesner, R. A. (1999). Prediction of protein tertiary structure to low resolution: performance for a large and structurally diverse test set. J. Mol. Biol.,288, 725-742.
  227. Lomize, A. L., Pogozheva, I. D. & Mosberg, H. I. (1999). Prediction of protein structure: the problem of fold multiplicity. Proteins,Suppl 3, 199-203.
  228. Chen, C. C., Singh, J. P. & Altman, R. B. (1999). Using imperfect secondary structure predictions to improve molecular structure computations. Bioinformatics,15, 53-65.
  229. Samudrala, R., Xia, Y., Huang, E. & Levitt, M. (1999). Ab initio protein structure prediction using a combined hierarchical approach. Proteins,Suppl 3, 194-198.
  230. Samudrala, R., Huang, E. S., Koehl, P. & Levitt, M. (2000). Constructing side chains on near-native main chains for ab initio protein structure prediction. Prot. Engin.,13, 453-457.
  231. Minor, D. L. J. & Kim, P. S. (1996). Context-dependent secondary structure formation of a designed protein sequence. Nature,380, 730-734.
  232. Krittanai, C. & Johnson, W. C. J. (2000). The relative order of helical propensity of amino acids changes with solvent environment. Proteins,39, 132-141.
  233. Pan, X. M., Niu, W. D. & Wang, Z. X. (1999). What is the minimum number of residues to determine the secondary structural state? J. Prot. Chem.,18, 579-584.
  234. Jacoboni, I., Martelli, P. L., Fariselli, P., Compiani, M. & Casadio, R. (2000). Predictions of protein segments with the same aminoacid sequence and different secondary structure: A benchmark for predictive methods. Proteins,41, 535-544.
  235. Zhou, X., Alber, F., Folkers, G., Gonnet, G. H. & Chelvanayagam, G. (2000). An analysis of the helix-to-strand transition between peptides with identical sequence. Proteins,41, 248-256.
  236. Compiani, M., Fariselli, P., Martelli, P. L. & Casadio, R. (1999). Neural networks to study invariant features of protein folding. Theoretical Chemistry Accounts,101, 21-26.
  237. Rost, B. & Valencia, A. (1996). Pitfalls of protein sequence analysis. Curr. Opin. Biotech.,7, 457-461.
  238. Rost, B., Casadio, R. & Fariselli, P. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Prot. Sci.,5, 1704-1718.
  239. Rost WWW, B. (1996). Accuracy of predicting buried helices by PHDsec. EMBL Heidelberg, Germany, WWW document (http://www.embl-heidelberg.de/~rost/Res/96B-PredBuriedHelices.html).
  240. Rost WWW, B. (1996). 1D structure prediction for Chameleon (IgG binding domain of protein G). EMBL Heidelberg, Germany, WWW document (http://www.embl-heidelberg.de/~rost/Res/96C-PredChameleon.html).
  241. Dalal, S., Balasubramanian, S. & Regan, L. (1997). Protein alchemy: changing β-sheet into α-helix. Nat. Struct. Biol.,4, 548-552.
  242. Chou, P. Y. & Fasman, G. D. (1978). Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol.,47, 45-148.
  243. Garnier, J., Gibrat, J.-F. & Robson, B. (1996). GOR method for predicting protein secondary structure from amino acid sequence. Meth. Enzymol.,266, 540-553.
  244. Rost, B. (2001). Predicting protein structure: better data, better results! J. Mol. Biol.,in submission.
  245. Abagyan, R. A. & Batalov, S. (1997). Do aligned sequences share the same fold? J. Mol. Biol.,273, 355-368.
  246. Park, J., Teichmann, S. A., Hubbard, T. & Chothia, C. (1997). Intermediate sequences increase the detection of distant sequence homologies. J. Mol. Biol.,273, 349-354.
  247. Kozlov, G., Ekiel, I., Beglova, N., Yee, A., Dharamsi, A. et al. (2000). Rapid fold and structure determination of the archaeal translation elongation factor 1beta from Methanobacterium thermoautotrophicum. J. Biomol. NMR,17, 187-194.
  248. Rost WWW, B., Eyrich, V. A., Przybylski, D., Pazos, F., Valencia, A. et al. (2000). EVA - Evaluation of automatic protein structure prediction services. Columbia University / Rockefeller University / CNB Madrid, WWW document (http://cubic.bioc.columbia.edu/eva).
  249. Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Ac.,405, 442-451.
  250. Zhang, C.-T. & Chou, K.-C. (1992). An optimization approach to predicting protein structural class from amino acid composition. Prot. Sci.,1, 401-408.