EVA: large-scale analysis of secondary structure prediction

Burkhard Rost 1 * & Volker A. Eyrich 1,2

1 CUBIC, Columbia University, Department of Biochemistry and Molecular Biophysics, 650 West 168th Street, New York, NY 10032, USA

2 Columbia Univ., Dept. of Chemistry, 3000 Broadway MC 3136, New York, NY 10027, USA

* Corresponding author: rost@columbia.edu, http://cubic.bioc.columbia.edu/
Tel: +1-212-305-3773, fax: +1-212-305-7932



Quote: Proteins, 2001, 45 Suppl 5:S192-S199



Abstract

EVA is a web-based server that evaluates automatic structure prediction servers continuously and objectively. Since June 2000, EVA has collected more than 20,000 secondary structure predictions. The EVA sets sufficed to conclude that the field of secondary structure prediction has advanced again. Accuracy increased substantially in the 1990s through the use of evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height of around 76% of all residues predicted correctly in one of the three states helix, strand, other. The best current methods solved most of the problems raised at earlier CASP meetings: all good methods now get segments right and perform well on strands. Is the recent increase in accuracy significant enough to make predictions even more useful? We believe the answer is affirmative. What is the limit of prediction accuracy? We shall see.

Availability: all data is available through the EVA web site at {cubic.bioc.columbia.edu/eva/}. The raw data for the results presented are available at {eva}/sec/bup_common/2001_02_22/.

Key words: automatic evaluation, large-scale assessment, and protein structure prediction

Abbreviations used: 3D, three-dimensional; 1D, one-dimensional (e.g. string of secondary structure); DSSP, programs and database assigning secondary structure from 3D coordinates [1]; JPred2, divergent profile (PSI-BLAST) based neural network prediction [2]; PHD, profile-based neural network prediction of secondary structure (PHDsec), solvent accessibility (PHDacc), and transmembrane helices (PHDhtm) [3]; PHDpsi, divergent profile (PSI-BLAST) based neural network prediction [4]; PSI-BLAST, position-specific iterated database search [5]; PDB, Protein Data Bank of experimentally determined 3D structures of proteins; PROFphd, advanced profile-based neural network prediction of secondary structure (Rost, unpublished); PROFking, cascaded statistic based secondary structure prediction method [6]; PSIPRED, divergent profile (PSI-BLAST) based neural network prediction [7]; SAM-T99sec, neural network prediction, using Hidden Markov models as input [8]; SSpro, profile-based advanced neural network prediction method [9].


Introduction

Secondary structure is at the heart of structure prediction. The rapidly growing sequence-structure gap (number of known protein structures vs. number of known protein sequences) has enticed theoreticians to solve simplified prediction problems [10, 11] . An extreme simplification is the prediction of protein structure in one dimension (1D), as represented by strings of secondary structure. Theoreticians are lucky in that this relatively simple task comprises a goal relevant for prediction of protein structure and function, in general. Almost any imaginable algorithm has been applied to this task. The result is that we have come a long way since the first method published 44 years ago [12] . The most important aspect of third generation methods demonstrating their break-through at the first CASP meetings was the automatic use of evolutionary information [13, 14] . In fact, secondary structure prediction may have been the most successful discipline of protein structure prediction over the last 40 years. Is the field still alive?

Here, we focused on presenting various aspects of the performance of recent secondary structure prediction methods. We analysed automatic methods based on large data sets. The machinery allowing such a large-scale assessment is the automatic, continuous, and objective web server EVA [15, 16] .


Methods

Current implementation of EVA

EVA assessment in four prediction categories. Currently, EVA [15] evaluates four different categories of structure prediction servers (URLs at [16]): (1) comparative modelling, (2) fold recognition and threading, (3) secondary structure prediction, and (4) inter-residue contact predictions. The following groups agreed to let their public secondary structure prediction servers be evaluated by EVA: James Cuff & Geoff Barton (JPred2) [17, 2]; Mohammed Ouali & Ross King (PROFking) [6]; David Jones (PSIPRED) [7, 18]; Gajendra Raghava (PSSP, unpublished); Kevin Karplus (SAM-T99sec) [8]; Pierre Baldi & Gianluca Pollastri (SSpro) [9]. Our group contributed PHDsec [19, 20, 3], PHDpsi [4], and PROFphd (Rost, unpublished).

Results are updated every week. Every day, EVA obtains the latest experimentally determined structures from the PDB [21] web-site. These structures are parsed into chains using the DSSP program [1] . The sequence of each chain is submitted immediately to the prediction servers using META-PP [22] . Predictions are collected and sent for evaluation to the EVA-satellites: to Rockefeller University for comparative modelling, to CNB Madrid for contact predictions, and to CUBIC at Columbia University for all other predictions. Depending on the category, the assessments are made available within hours to days. The central EVA site at Columbia downloads all HTML pages produced by the satellites, and builds up the 'latest week' results that are then mirrored at the Rockefeller University and at the CNB Madrid.

Measuring secondary structure prediction accuracy

Selection of data sets. Currently, EVAsec uses only proteins with new structures to evaluate secondary structure prediction. We use the following operational definition: if a pairwise alignment search detects the similarity at a level at which it detects 50% false positives, the sequence similarity is deemed 'not significant'. This concept translates to a threshold above 28 identical in 100 aligned residues [23, 24] . This implied that we presented only results from the 'fold recognition' and 'new fold' categories in CASP. We reported results for two sets; the first set_218 contained all 218 protein chains with new structures added to PDB between Jun 2000 and Feb 2001 for which we had results for 6 methods. The second set_99 was a subset of set_218 with 99 chains for which we had predictions for all 9 methods evaluated.

Assigning secondary structure from 3D coordinates. EVAsec uses secondary structure assignments from DSSP [1]. The eight DSSP states are converted to three states using the following transformation: DSSP [HGI] -> helix (H), DSSP [EB] -> strand (E), all other DSSP states [TS ] -> other (L). Note: occasionally developers convert 3-10 helices (DSSP G), pi-helices (DSSP I), or beta-bridges (DSSP B) to the 'other' state. Such a conversion seemingly increases accuracy, since these states are more difficult to predict [17].
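The eight-to-three state conversion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used by EVAsec; the function name `dssp_to_three_state` is hypothetical, and we assume that any character not listed for helix or strand (including the blank DSSP state) maps to 'other'.

```python
# Conversion stated in the text:
# H, G, I -> helix (H); E, B -> strand (E); all other states -> other (L).
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def dssp_to_three_state(dssp: str) -> str:
    """Reduce a string of DSSP states to the three states H, E, L.

    Hypothetical helper: any character that is not H, G, I, E or B
    (e.g. T, S or a blank) is treated as 'other' (L).
    """
    return "".join(DSSP_TO_3.get(state, "L") for state in dssp)


if __name__ == "__main__":
    print(dssp_to_three_state("  HHHHGGGTT EEEEB S II"))
```

Mapping G, I, and B to 'other' instead, as some developers do, would only require changing the dictionary; results obtained with the two conventions are not directly comparable.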

Scoring per-residue accuracy. The three-state per-residue accuracy (Q3) is the most widely used score for evaluating secondary structure predictions. Q3 gives the percentage of residues correctly predicted in one of the three states: helix, strand, other. Most residues are observed in the 'other' state. Hence, Q3 can be high even for methods predicting helices and strands inaccurately. One way around this problem is to measure, for each state i (helix, strand, other), the percentage of residues observed in state i that are predicted correctly, and the percentage of residues predicted in state i that are predicted correctly. Other ways around it are the Matthews correlation coefficients [25] and the information index [19, 26]. Some methods predict 3D structure starting from rigid body secondary structure segments. These methods need predictions with a low percentage of residues confused between strand and helix as measured by the BAD score [27].
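The per-residue scores above can be illustrated with a short Python sketch. It assumes that observed and predicted secondary structure are given as equal-length strings over H, E, L; the function names are hypothetical, and the Matthews correlation is written in its standard form from the confusion counts for one state. It is not the EVAsec implementation.

```python
import math


def q3(observed: str, predicted: str) -> float:
    """Percentage of residues predicted in the correct one of three states."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_state_scores(observed: str, predicted: str, state: str):
    """Return three scores for one state (H, E or L).

    (1) correct residues as percentage of residues observed in the state,
    (2) correct residues as percentage of residues predicted in the state,
    (3) Matthews correlation coefficient for the state.
    A score is None when its denominator is zero.
    """
    tp = fp = fn = tn = 0
    for o, p in zip(observed, predicted):
        if o == state and p == state:
            tp += 1
        elif o != state and p == state:
            fp += 1
        elif o == state and p != state:
            fn += 1
        else:
            tn += 1
    pct_obs = 100.0 * tp / (tp + fn) if tp + fn else None
    pct_pred = 100.0 * tp / (tp + fp) if tp + fp else None
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    matthews = (tp * tn - fp * fn) / denom if denom else None
    return pct_obs, pct_pred, matthews


if __name__ == "__main__":
    obs = "HHHHLLEEEELL"
    prd = "HHHLLLEEELLL"
    print(round(q3(obs, prd), 1))
    print(per_state_scores(obs, prd, "H"))
```

In this toy example, one helix residue and one strand residue are missed: every predicted helix residue is correct, but only three of the four observed helix residues are found, which is exactly the asymmetry that Q3 alone hides.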

Scoring per-segment accuracy. In practice, methods that get most of the segment cores right are more useful than those that get some of the entire segments right. Per-residue scores cannot distinguish between these two. Many segment-based measures have been proposed [26] ; the one that appears to distinguish best between good and bad predictions is the average overlap between segments (SOV) [26, 28] .

Scoring accuracy in predicting secondary structure class. A coarse-grained classification of protein structures is based on secondary structure composition [29, 30]. Hence, secondary structure predictions also imply predictions of secondary structural class. EVAsec reports the percentage of proteins correctly predicted in one of the following four classes: all-alpha (length > 60, helix > 45%, strand < 5%), all-beta (length > 60, helix < 5%, strand > 45%), alpha/beta (length > 60, helix > 30%, strand > 20%), other. The thresholds were chosen by intuition [31, 32, 20], since these simplified structural classes are not separated well [33]. EVAsec also reports differences between observed and predicted overall content in order to measure the accuracy in predicting secondary structure composition independently of thresholds.
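The class thresholds listed above translate directly into a small decision function. The following Python sketch is an illustration under two assumptions of ours: the input is a three-state string over H, E, L, and chains of 60 residues or fewer fall into 'other' (the text gives the length condition only for the three named classes). The function name is hypothetical.

```python
def secondary_structure_class(three_state: str) -> str:
    """Assign one of four classes from a three-state string (H, E, L).

    Thresholds are those quoted in the text; treating short chains
    (length <= 60) as 'other' is an assumption of this sketch.
    """
    length = len(three_state)
    if length <= 60:
        return "other"
    helix = 100.0 * three_state.count("H") / length
    strand = 100.0 * three_state.count("E") / length
    if helix > 45 and strand < 5:
        return "all-alpha"
    if helix < 5 and strand > 45:
        return "all-beta"
    if helix > 30 and strand > 20:
        return "alpha/beta"
    return "other"


if __name__ == "__main__":
    print(secondary_structure_class("H" * 50 + "L" * 50))
    print(secondary_structure_class("H" * 35 + "E" * 25 + "L" * 40))
```

Applying the same function to the observed and to the predicted string of a protein, and counting agreements over all proteins, gives the percentage of correctly predicted classes.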

Ranking methods

Methods are not ranked based on too few test proteins! EVAsec does not rank prediction methods based on too few test proteins. Because the accuracy of secondary structure prediction varies between proteins, accuracy estimates typically constitute averages over many test sequences, with standard deviations usually above ten percentage points. We use this standard deviation to estimate the error of the average accuracy as a function of the test set size. A significant difference (ΔQ) between two methods, or the error of the accuracy estimate for one method, is:

ΔQ = σ(Q, NprotLarge) / sqrt(Nprot)

where Q is the measure for accuracy, Nprot the number of proteins used in the test set, NprotLarge the number of proteins used in a larger, representative set (>100 proteins), and σ(Q, NprotLarge) is the standard deviation of variable Q in a large, representative test set (assuming a Gaussian distribution of variable Q). The observation that different prediction methods typically have similar standard deviations provides a necessary justification for this approach. For example, when a method correctly predicts 75% of the residues in a test set of 16 proteins with a standard deviation of 10%, a difference relative to another method that is smaller than 2.5% (i.e., ΔQ = 10/sqrt(16)) is not significant. Thus, we cannot distinguish between two methods that predict correctly 75% and 73% of all residues, respectively. EVAsec uses this estimate to rank methods in the following way. Assume four methods have accuracy levels of A=75, B=73, C=71, and D=68. D can be distinguished from all other methods (ΔQ > 2.5 to all). Hence, it ranks last. C can be distinguished from A (ΔQ = 4 > 2.5). However, A cannot be distinguished from B (ΔQ = 2 < 2.5), and B cannot be distinguished from C (ΔQ = 2 < 2.5). This situation results in a dilemma that has four different possible solutions: (1) A, B and C get the same rank, ascertaining that no two methods are ranked differently that cannot be distinguished. (2) A and B get rank 1, and C rank 2, assuring that no two methods are ranked equally that can be distinguished. (3) A gets rank 1, B rank 2 and C rank 3, ignoring that we cannot distinguish between A and B, nor between B and C. (4) Do not rank. None of these solutions is 'correct'. During the first three CASP experiments, solution 3 was practised (higher average results in higher rank). The evaluation of secondary structure prediction performance for CASP4 effectively implemented a concept more similar to solution 1 [34]. The first solution is also realised by EVAsec. For the example given this implies that A, B, and C are ranked 1; D is ranked 2.
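The error estimate and the grouping of solution 1 can be reproduced with a short Python sketch. The function names are hypothetical, and the grouping rule is our reading of solution 1: methods are sorted by accuracy, and a new rank starts only where the gap to the next better method is at least ΔQ, so that chains of pairwise indistinguishable methods share a rank. EVAsec may implement the details differently.

```python
import math


def significant_difference(sigma_large: float, n_prot: int) -> float:
    """Error estimate: standard deviation on a large set / sqrt(test set size)."""
    return sigma_large / math.sqrt(n_prot)


def rank_methods(accuracy: dict, delta_q: float) -> dict:
    """Group methods by rank following 'solution 1' (assumed chaining rule).

    Methods are sorted by decreasing accuracy; a method opens a new rank
    only if it differs from its better neighbour by at least delta_q.
    """
    ordered = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    rank = 0
    previous = None
    for name, value in ordered:
        if previous is None or previous - value >= delta_q:
            rank += 1
        ranks[name] = rank
        previous = value
    return ranks


if __name__ == "__main__":
    delta_q = significant_difference(10.0, 16)
    print(delta_q)
    print(rank_methods({"A": 75, "B": 73, "C": 71, "D": 68}, delta_q))
```

With the numbers of the example (standard deviation 10, 16 proteins), the threshold is 2.5, and A, B, and C end up in rank 1 while D is alone in rank 2.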

Comparing prediction methods

Bootstrap experiment to test effect of small data sets. CASP4 had 53 targets; for 43 of these, structures were available at the meeting in Dec. 2000. 29 of these 43 proteins constituted fold recognition/novel fold targets at CASP (Methods). For 11 of these 29, all methods participating at CAFASP predicted secondary structure [35]. Are these sets sufficient to rank methods? Although such ranking based on too small sets was carefully avoided at CASP4 [34], we still wanted to address this question by the following bootstrap experiment. (1) Take a data set of proteins predicted by all M methods (here 99 protein chains from 10 methods). (2) Select at random (i) a set of 11 proteins predicted by all methods (common sets) and (ii) another M sets of 29 proteins that may differ between the methods (incomparable sets). (3) Measure the average performance on each set.
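One random draw of the experiment in steps (1) to (3) can be sketched as follows. The function name, the data layout (a dictionary mapping each method to its per-protein accuracies), and the synthetic scores in the demonstration are all assumptions made for illustration; the accuracies are random numbers, not EVA results.

```python
import random
import statistics


def bootstrap_draw(scores, n_common=11, n_separate=29, rng=None):
    """One random draw: averages on a common subset and on per-method subsets.

    scores: {method: {protein_id: accuracy}}, with the same protein ids
    for every method (step 1).
    """
    rng = rng or random.Random()
    methods = sorted(scores)
    proteins = sorted(scores[methods[0]])
    # Step 2(i): one subset shared by all methods.
    common = rng.sample(proteins, n_common)
    common_avg = {m: statistics.mean(scores[m][p] for p in common)
                  for m in methods}
    # Step 2(ii): an independent subset for each method.
    separate_avg = {}
    for m in methods:
        own = rng.sample(proteins, n_separate)
        separate_avg[m] = statistics.mean(scores[m][p] for p in own)
    # Step 3: average performance on each set.
    return common_avg, separate_avg


# Synthetic demonstration data (NOT EVA results): 3 methods, 99 chains.
_demo_rng = random.Random(0)
scores = {
    name: {f"chain{i}": _demo_rng.gauss(mean, 10.0) for i in range(99)}
    for name, mean in [("method1", 76.0), ("method2", 75.0), ("method3", 72.0)]
}

if __name__ == "__main__":
    for draw in range(3):
        common_avg, separate_avg = bootstrap_draw(scores, rng=_demo_rng)
        print(draw, common_avg, separate_avg)
```

Repeating the draw many times and recording the rank of each method per draw gives the spread of averages and ranks discussed in the following paragraphs.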

Methods best evaluated on identical subsets. The first observation from such an experiment was that methods differed considerably between different random draws, i.e. between different hypothetical CASP experiments ( Fig. 1 ). Remarkably, using 29 proteins from incomparable subsets ( Fig. 1 A) resulted in slightly higher variation than using 11 proteins from common subsets ( Fig. 1 B). This suggested that - on average - it is a better strategy to base comparisons of methods on identical common subsets rather than on all available predictions even if the constraint to have a common subset reduces the available data from 29 to 11.

Ranking methods based on small subsets can be misleading. The variation between different random draws of data sets became even more dramatic when we ranked the methods based on the average performance: most methods did rank best AND worst in one of the random draws. Ranks varied more for the 29 proteins from incomparable sets than for the 11 proteins from the common sets. Assuming that the 99 proteins constitute a 'representative set' (which is wrong, as indicated below and in Table 1 ), we can compile a cumulative average of one method over many random draws. How many draws do we need before the methods will reach the averages of the original set? About 30 draws for the common subsets of 11 proteins, and more than 60 for the incomparable subsets of 29 proteins (data not shown). In our experiment, we forced all methods to predict equal numbers of proteins. Most methods provide estimates for the reliability of the prediction for each residue ( Fig. 3 ). What if methods submitted only their seemingly best predictions? We used such an index to select only the most reliable predictions from one method while forcing all other methods to predict for all proteins. Note that this selection was realised without knowing the accuracy of a particular prediction. Surprisingly, we could make EVERY method for which we had such an index become the winner at ANY of the random draws by submitting only the most reliable predictions! Our bootstrap experiments underlined what most CASP evaluators practised: ranking methods based on small subsets may not be appropriate.




Fig. 1. Significance of averages and ranks from small numbers. CAFASP2 had 29 sequence-unique proteins to evaluate secondary structure prediction; for 11 of these we had CAFASP2 results for all methods. Each point in the graphs is an average over 29 (A) and over 11 (B) proteins for one method. Proteins are selected at random from a set of 99 protein chains. The x-axes give the number of different random draws from the set of 99 proteins. The subsets of 11 proteins are constrained to be identical between all methods for every subset (B), while the 29 (A) are not subjected to this constraint, i.e. they differ on average between the methods. Below, the spread of ranks is given that would result from the respective averages. For example, JPred2 would be the winner at one random draw and come in last at another when using 29 incomparable proteins (A: 1-8); it would never rank 'last' when using 11 identical proteins (B: 1-7). Conclusions: firstly, 11 proteins are clearly not enough to rank methods. Secondly, if we have results for all methods for 29 proteins but these are not the identical 29, results are even less significant. Hence, using identical subsets is the better strategy even if it implies discarding most possible targets.




Results

Better alignments improved secondary structure predictions significantly. The set of 99 new protein chains for which EVA collected results for all methods did not suffice to distinguish between all methods. However, some trends became apparent: A number of methods in 2001 predict secondary structure more accurately than did the best method of 1996. In fact, all methods using alignments not restricted to pairwise comparisons performed significantly better than PHD using only pairwise alignment information (Table 1). Simply replacing the pairwise alignments input to PHD by PSI-BLAST profiles [5] , made the resulting PHDpsi rank in the 'winner' group. (Upon closer look: most of the improvement of PHDpsi over PHD resulted from using larger databases, rather than from using PSI-BLAST [4] .) Did this imply that nothing has changed but the databases and the search methods? 99 protein chains did not suffice to tell.



Table 1: Accuracy on a common set of 99 new protein chains A

Method       CASP    Q3   SOV                          BAD  Info    CH    CE  Class   ΔH   ΔE
JPred2        102  75.5  67.6  72.2  84.7  58.0  75.0  1.9  0.34  0.67  0.59   80.8  6.7  5.6
PHDpsi        385  74.5  67.9  78.7  81.9  63.8  69.0  2.9  0.23  0.68  0.58   80.8  6.6  4.8
PROFphd       402  77.0  71.6  80.5  84.4  68.8  71.6  2.3  0.36  0.72  0.63   81.8  5.9  3.9
PROFking      214  74.4  67.4  74.0  86.4  70.9  66.6  2.7  0.32  0.69  0.61   84.8  7.6  6.7
PSIPRED       258  76.8  72.2  81.9  82.9  68.5  72.3  2.5  0.37  0.71  0.63   84.8  5.5  4.6
SAM-T99sec    111  76.1  70.8  84.7  80.0  62.0  76.9  1.9  0.35  0.71  0.62   80.8  6.7  4.8
SSpro         115    76  69.1  81.2  82.1  63.2  72.6  2.4  0.35  0.70  0.60   77.8  6.5  5.8

PHD           142  71.7  67.3  75.9  77.7  61.1  62.9  3.8  0.25  0.62  0.53   77.8  7.9  5.7

PSSP          510  64.3  58.7  64.4  71.5  55.6  50.0  5.9  0.20  0.50  0.40   74.7  9.8  7.8

A: Data set and sorting: All methods have been tested on the same set of 99 new protein chains (EVA version Feb 2001). None of these structures was similar to any protein used to develop the respective method. This set comprised the largest such set by Feb 23, 2001 for which we had results. Sorting and grouping reflects the following concept: if the data set is too small to distinguish between two methods, these two are grouped. For the given set of 99 proteins this yielded three groups. Inside each group, results are sorted alphabetically. Note: groups are separated by an empty line; 99 proteins did not suffice to separate between the first 7 methods (see Table 2 for a larger set).
Method: see abbreviations on top of article.
Scores [19, 58]: Q3: three-state per-residue accuracy, i.e., percentage of residues predicted correctly in either of the three states helix, strand, other; SOV: three-state per-segment score measuring the overlap between predicted and observed segments [26, 28]; the four columns between SOV and BAD: residues predicted correctly in helix (or strand) as percentage of residues observed in helix (or strand), and residues predicted correctly in helix (or strand) as percentage of residues predicted in helix (or strand); BAD: percentage of helical residues predicted as strand, and of strand residues predicted as helix [27]; Info: per-residue information content [19]; CH: Matthews correlation coefficient for state helix [59]; CE: Matthews correlation coefficient for state strand [59]; Class: percentage of proteins correctly sorted into one of the four classes: all-alpha, all-beta, alpha/beta, other; ΔH: difference between predicted and observed secondary structure content in helix; ΔE: difference between predicted and observed secondary structure content in strand.



And the winners are … For a few of the methods EVA had 2-3 times larger data sets. In particular, 218 protein chains sufficed to distinguish between some of the methods indistinguishable when evaluated on 99 chains: PROFphd, PSIPRED, and SSpro were significantly more accurate than JPred2 and PHDpsi (Table 2). All three 'winners' were equally balanced in predicting strand and helix, and had a similar level of performance in predicting secondary structure content. Unlike the bulk of methods presented during the early CASP meetings, all 7 methods shown in Table 1 predicted beta strands on average more accurately than residues in non-regular structure. While SSpro was significantly less accurate in predicting segments than PROFphd, all three best methods predicted segments much better than the two second-best methods JPred2 and PHDpsi.



Table 2: Accuracy on a set of 218 identical proteins A

Method B        Q3   SOV                          BAD  info    CH    CE  Class   ΔH   ΔE
PROFsec       76.8  72.8  80.5  84.4  68.8  71.6  2.2  0.36  0.72  0.63   82.1  5.6  4.0
PSIPRED       76.4  72.0  81.9  82.9  68.5  72.3  2.5  0.37  0.71  0.63   79.8  5.4  4.4
SSpro         76.1  71.2  81.2  82.1  63.2  72.6  2.5  0.35  0.70  0.60   81.2  6.0  5.2

JPred2        74.8  69.3  72.2  84.7  58.0  75.0  2.4  0.34  0.67  0.59   76.1  7.6  5.7
PHDpsi        74.7  69.6  78.7  81.9  63.8  69.0    3  0.29  0.68  0.58   79.8  6.2  4.9

PHD           71.4  67.4  75.9  77.7  61.1  62.9  4.2  0.25  0.62  0.53   76.1  7.7  5.9

A: symbols as in Table 1; results based on 218 proteins not used for developing the methods.



Some proteins predicted well by all methods. The per-protein average over the 7 methods that performed best on the set of 99 chains (Table 1) varied as strongly between proteins as did each of the methods. Did this finding confirm the notion that some proteins were easier to predict than others? On average, the worst prediction method for a particular protein chain had a higher accuracy when the average over all methods was higher (Fig. 2). In other words, some chains were predicted by all methods more accurately than others; in particular, the average over all methods was below 70% (Q3) for 20% of the chains, above 80% for another 20% of the chains, and between 70% and 80% for all other chains (Fig. 2). However, for many chains the average over all methods was high, although some method had a very low accuracy. The best methods reached about 77% accuracy. For more than 80% of all proteins one method performed better than this. Hence, most proteins were predicted above average by at least one method. For only two of the 99 chains, all methods reached less than 68% accuracy (Appendix). The worst predictions were obtained for the short peptide of the human apolipoprotein II (1by6:A), the structure of which was determined by NMR. By default, we used the first NMR model to determine prediction errors although other models correlated better with the predictions (data not shown). The other bad prediction was for the endonuclease I-PPOI complexed to DNA (1evw:A). The major problems were that most methods missed the two helices reaching into the DNA on opposite sides of the molecule, and over-predicted a long strand for two parallel non-hydrogen bonded stretches on the side opposite the DNA-binding site. For 66 of the 99 protein chains, one method had more than 80% of the residues predicted correctly, while the average over all methods reached this level for only 25 out of 99. Could we anticipate which method did best on which protein without knowing the structure?

Prediction accuracy did not correlate well with experimental resolution. We used our largest set of 218 chains to analyse whether or not prediction accuracy correlated with the resolution of the respective structure. Surprisingly, we could not find a strong correlation between accuracy and resolution. Nevertheless, prediction accuracy was about three percentage points higher for the X-ray structures than for the first NMR models. Furthermore, when averaging all predictions for the quarter of all chains with highest resolution, we found levels of accuracy about four percentage points above the average over the quarter with lowest resolution.




Fig. 2. Best, worst and average predictions for each protein. Each cross is the average over all 7 prediction methods that ranked best on 99 protein chains (Table 1); triangles give the lowest and highest accuracy for a particular protein. The x-axis describes for which fraction of the 99 proteins the average was above a certain value. For example, the average accuracy was below 70% for 20% of all proteins, while only for 6 of 99 the best method did not reach 70% accuracy. Furthermore, counting the filled triangles below the 70% line revealed that for half the proteins (48 of the 99) even the worst method surpassed 70% accuracy.



Reliability indices estimated prediction accuracy accurately. Prediction methods typically use three output states for helix, strand, other, and predict the state with the highest value as the secondary structure of the respective residue. Assume the output for one residue is (0.3, 0.4, 0.3); that for another (0.1, 0.8, 0.1); both residues are predicted as strand. However, the second prediction is much stronger. This difference can be carved into an index describing the reliability of a secondary structure prediction for each residue [36, 19] . After the successes of such indices at the first CASP meeting, almost all methods now implement estimates for the reliability of the prediction for each residue. For all methods tested, these indices correlated surprisingly well with accuracy ( Fig. 3 ). For example, PSIPRED and PROFphd reached levels above 90% for the half of the residues predicted most strongly.
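The idea of a per-residue reliability index can be made concrete with a small Python sketch. The text does not spell out a formula; here we assume one common definition, the difference between the two largest output values scaled to an integer between 0 and 9. The function names are hypothetical, and the outputs are assumed to be ordered as helix, strand, other.

```python
def predict_state(outputs, states="HEL"):
    """Return the state whose output value is largest (order: helix, strand, other)."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return states[best]


def reliability_index(outputs):
    """Assumed definition: 10 x (largest - second largest output), as integer 0-9.

    The small constant guards against floating-point round-off
    just below an integer boundary.
    """
    top, second = sorted(outputs, reverse=True)[:2]
    return min(9, int(10 * (top - second) + 1e-9))


def accuracy_above(reliabilities, correct, threshold):
    """Cumulative accuracy (%) over residues with reliability >= threshold."""
    chosen = [c for r, c in zip(reliabilities, correct) if r >= threshold]
    return 100.0 * sum(chosen) / len(chosen) if chosen else None


if __name__ == "__main__":
    for outputs in [(0.3, 0.4, 0.3), (0.1, 0.8, 0.1)]:
        print(predict_state(outputs), reliability_index(outputs))
```

Both example residues from the text are predicted as strand, but under this definition the second one receives a much higher index than the first. A function such as `accuracy_above`, evaluated for each threshold, yields cumulative curves of accuracy against reliability.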




Fig. 3. Prediction strength correlated well with accuracy. Residues predicted at higher reliability are predicted more accurately [19, 3] . Reliability indices are now used by most methods. Shown are cumulative values, i.e. the accuracy (Q3) for all residues predicted above a given reliability. For example, for all methods 90% of the 40% most strongly predicted residues are predicted correctly. Results were based on the set_99 also used for Table 1.



Secondary structural class predicted almost as accurately as by experiment. Grouping proteins into secondary structure classes (all-alpha, all-beta, alpha/beta, other) appears to be a useful initial approach toward classifying proteins [37, 38]. Such classes can be predicted successfully based merely on the overall amino acid composition of a protein [39, 40, 41]. Increasingly complex and ingenious methods address this reduced goal; reported levels of prediction accuracy approach 100%. Recently, Wang & Yuan explained these high values by insufficient testing schemes, and argued that a four-state accuracy of around 60% constitutes the maximum for methods based solely on composition [41]. Obviously, it is much easier to predict class starting from the detailed information about evolutionary profiles for the entire sequence than by restricting the input to composition. In fact, today's best general prediction methods also predict secondary structure class better (Table 1, Table 2). The differences between observed and predicted secondary structure composition are now below 6% for helix and strand. This performance is similar to what experimental low-resolution methods (circular dichroism, Fourier transform infrared spectroscopy) achieve at their best [20, 42].

Homologues of known structures predicted marginally better. All current top-of-the-line methods somehow learn the secondary structure for proteins of known structure. In particular, no method relies entirely on 'first principles' as did, e.g., ALB, one of the best methods of the second generation [43]. Consequently, today's methods somehow depend on residual sequence similarity between target and known structures. Did this imply that prediction accuracy was significantly higher for proteins with homologues of known structure? For example, PSIPRED reached a level close to 80% accuracy for a set of 223 protein chains with significant sequence similarity to known structures (data not shown) compared to about 76-77% for new structures (Table 2). A similar trend persisted for all prediction methods analysed (data not shown). How would these values compare to inferring secondary structure through comparative modelling? We did not yet compile data to address this question explicitly. However, for structural alignments the respective secondary structure assignments agree for more than 88% of all residues [26, 44]. Hence, when we know a protein of known structure that is similar to a target, we supposedly still best use comparative modelling to predict secondary structure. A similar conclusion was suggested by analysing the CASP4 results for comparative modelling [34].


Conclusions

CASP and EVA: both are needed. For CAFASP all methods predicted secondary structure for the same 11 proteins, and some methods for another 18. Our bootstrap experiment provided some numbers illustrating problems with ranking methods based on too small data sets. Can we conclude anything from small sets? Certainly, but the level of detail depends on the data set. For example, 99 proteins sufficed to conclude that 7 methods were more accurate than was pairwise PHD (Table 1). However, we needed more than 200 proteins to distinguish between some of the 7 best (Table 2). EVA continues to assess prediction methods automatically on as many proteins every month as does CASP every two years [15]. Should we then have CASP without, e.g., secondary structure prediction? We perceive that the advance in prediction methods over the last 8 years has been influenced strongly by CASP. Further advances might stall with no secondary structure prediction present at CASP. We suggest comparing expert predictions and automatic methods based on the limited number of CASP targets and relating the performance to the larger EVA sets. What do secondary structure predictions teach us about protein function? This is a kind of problem that EVA cannot address. We need expert evaluations to learn what to measure. Finally, all tables given here are available through the EVA web site; the interpretation of the data is not.

The field advanced significantly. Growing databases and improved search techniques yielded a substantial improvement in secondary structure prediction over the last four years. The best methods now reach sustained levels of 76% (Table 1, Table 2). For almost every second protein even the worst of the 7 best methods (Table 1) surpassed 70% accuracy, and for less than 10% of the proteins the 70% level was not reached by the best method (Fig. 2). Even more impressively, about 60% of all residues are predicted at levels similar to structural alignments of homologues (Fig. 3).

88% is a limit, but shall we ever reach close to there? Protein secondary structure formation is influenced by long-range interactions [45, 46, 47] and by the environment [48, 49]. Consequently, stretches of up to 11 adjacent residues (dubbed chameleon after [45]) can be found in different secondary structure states [50, 51, 52]. Implicitly, such non-local effects are contained in the exchange patterns of protein families. This is reflected by the fact that strand is predicted almost as accurately as helix (Table 1), although sheets are stabilised by more non-local interactions than helices. Local evolutionary profiles can even suffice to identify structural switches [53, 48]. Surprisingly, we can find some traces of folding events in secondary structure predictions [54]. Even more amazing is a study suggesting that alignment-based methods achieve similar levels of accuracy for chameleon regions as for all other regions [51]. Secondary structure assignments may vary for two versions of the same structure. One reason is that protein structures are not rocks but dynamic objects with some regions more mobile than others. Another reason is that any assignment method has to choose particular thresholds. Consequently, assignments differ by about 5-10 percentage points between different NMR models for the same protein [44], and by about 12 percentage points between structural homologues [26]. The latter number provides an upper limit for secondary structure prediction of error-free comparative modelling. After the recent advances we have reached above 76%. Thus, we need to mount another twelve percentage points (or even less). What is the major obstacle: the size of the experimental database, as suggested by Pan et al. [50]? PHDpsi was trained on 200 proteins; when using PSI-BLAST input it was almost as accurate as PSIPRED trained on 2000 proteins (Table 2). Hence, the database growth may not suffice. Will the current explosion of sequences boost accuracy? In fact, current databases have less than 10 homologues for more than one third of the proteins, and more than 100 for only 20% of the proteins. Although based on a set too small for conclusions, for these 20% highly populated families the accuracy of PROFphd was four percentage points above average (data not shown). Thus, larger databases may get us six percentage points higher, or they may not. The answer remains nebulous.

What are the major problems of the field? Most major problems prominent in many of the predictions submitted to the first two CASP meetings have been solved. The most important task may now be to correlate predicted secondary structure with aspects of protein function. One method has successfully related secondary structure predictions automatically to functional aspects [53, 48]. However, secondary structure based identification of binding sites or other functional aspects is still restricted to single-case expert analyses. Other than this, a number of loose ends remain. (1) All methods still have problems predicting the precise termini of regular secondary structure segments. (2) Frequently, the number of helices and strands is not predicted correctly. (3) We know that evolutionary information improves prediction accuracy. However, we have not yet succeeded in correlating the 'information' contained in a particular alignment with the resulting improvement in prediction accuracy. (4) Rather than improving existing methods even further, the field should possibly attempt to expand the concept of secondary structure by predicting other states (e.g. turns [55]) or different descriptions of super-secondary structure (e.g. as used in I-sites [56]).
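
Loose ends (1) and (2) can be made concrete with a small sketch that extracts the helix and strand segments from an observed and a predicted state string, so that their number and their termini can be compared. The sketch assumes one-letter states H (helix), E (strand), and L (other). The function name and the toy strings are hypothetical, and this is not the segment-based measure used by EVA.

```python
from itertools import groupby


def segments(states: str) -> list:
    """List (state, start, end) for each helix (H) or strand (E) segment.

    Positions are 0-based and `end` is exclusive. Runs of any other
    letter (e.g. L for 'other') are skipped.
    """
    result = []
    position = 0
    for state, run in groupby(states):
        length = len(list(run))
        if state in ("H", "E"):
            result.append((state, position, position + length))
        position += length
    return result


# Toy example (invented strings, not EVA data).
observed = "LLHHHHHHLLEEEELLEEEELL"
predicted = "LHHHHHHLLLEEEEEEEEELLL"

obs_segments = segments(observed)
pred_segments = segments(predicted)
print("observed :", obs_segments)
print("predicted:", pred_segments)
print("segment counts:", len(obs_segments), "observed vs", len(pred_segments), "predicted")
```

In this toy case the predicted helix is shifted by one residue at both termini, and two observed strands are merged into a single predicted strand, so the per-residue agreement remains fairly high although the number of strands is wrong.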

And now we run human? The field has advanced considerably; more improvement appears to lie ahead. Prediction methods are fast enough to analyse entire genomes, and for particular examples the resulting classifications are relevant to structural and functional genomics [38, 57]. Nevertheless, to play the devil's advocate: we still lack a variety of approaches that relate secondary structure predictions explicitly to function. Obviously, this remark may apply to bioinformatics in general: the new millennium began with the publication of the entire human genome; we must rush to get ready for the data flood.

Acknowledgements

Particular thanks to the EVA teams at the Rockefeller University (Marc A. Martí-Renom, András Fiser & Andrej Sali) and at the CNB in Madrid (Florencio Pazos & Alfonso Valencia) for joining us in pursuing a laborious idea. Thanks also to Jinfeng Liu and Dariusz Przybylski (CUBIC, Columbia) for helping with soft- and hardware. Furthermore, we are grateful to Phil Bourne (UCSD) for his support and to Kevin Karplus (UCSC) for numerous suggestions. Last but not least, we thank all the developers who accepted EVA submissions: Pierre Baldi (Irvine), Phil Bourne (UCSD), Søren Brunak (Copenhagen), James Cuff (London), Piero Fariselli (Bologna), Mitsuo Iwadate (Kitasato), Ross King (Aberystwyth), Ole Lund (Copenhagen), Jarek Meller (Ithaca), David Jones (Uxbridge), Kevin Karplus (UCSC), Osvaldo Olmea (Madrid), Gianluca Pollastri (Irvine), Gajendra Raghava (Chandigarh), Kristoffer Rapacki (Copenhagen), Torsten Schwede (Geneva).


References

  1. Kabsch, W. and Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers 22:2577-2637, 1983.
  2. Cuff, J. A. and Barton, G. J. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 34:508-519, 1999.
  3. Rost, B. PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol. 266:525-539, 1996.
  4. Przybylski, D. and Rost, B. Alignments grow, secondary structure prediction improves. Proteins in press, 2001.
  5. Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucl. Acids Res. 25:3389-3402, 1997.
  6. Ouali, M. and King, R. D. Cascaded multiple classifiers for secondary structure prediction. Prot. Sci. 9:1162-1176, 2000.
  7. Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292:195-202, 1999.
  8. Karplus, K., Barrett, C., Cline, M., Diekhans, M., Grate, L. and Hughey, R. Predicting protein structure using only sequence information. Proteins S3:121-125, 1999.
  9. Baldi, P., Brunak, S., Frasconi, P., Soda, G. and Pollastri, G. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 15:937-946, 1999.
  10. Rost, B. and Sander, C. Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct. 25:113-136, 1996.
  11. Rost, B. and O'Donoghue, S. I. Sisyphus and prediction of protein structure. CABIOS 13:345-356, 1997.
  12. Szent-Györgyi, A. G. and Cohen, C. Role of proline in polypeptide chain configuration of proteins. Science 126:697, 1957.
  13. Rost, B. and Sander, C. Progress of 1D protein structure prediction at last. Proteins 23:295-300, 1995.
  14. Rost, B. Better 1D predictions by experts with machines. Proteins Suppl. 1:192-197, 1997.
  15. Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F., Valencia, A., Sali, A. and Rost, B. EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics in press, 2001.
  16. Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F., Valencia, A., Sali, A. and Rost, B. EVA: continuous automatic evaluation of protein structure prediction servers. WWW document (http://cubic.bioc.columbia.edu/eva): Columbia University, 2001.
  17. Cuff, J. A., Clamp, M. E., Siddiqui, A. S., Finlay, M. and Barton, G. J. JPred: a consensus secondary structure prediction server. Bioinformatics 14:892-893, 1998.
  18. McGuffin, L. J., Bryson, K. and Jones, D. T. The PSIPRED protein structure prediction server. Bioinformatics 16:404-405, 2000.
  19. Rost, B. and Sander, C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232:584-599, 1993.
  20. Rost, B. and Sander, C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19:55-72, 1994.
  21. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. and Bourne, P. E. The Protein Data Bank. Nucl. Acids Res. 28:235-242, 2000.
  22. Eyrich, V. and Rost, B. The META-PredictProtein server. WWW document (http://cubic.bioc.columbia.edu/predictprotein/submit_meta.html): CUBIC, Columbia University, Dept. of Biochemistry & Molecular Biophysics, 2000.
  23. Sander, C. and Schneider, R. Database of homology-derived structures and the structural meaning of sequence alignment. Proteins 9:56-68, 1991.
  24. Rost, B. Twilight zone of protein sequence alignments. Prot. Engin. 12:85-94, 1999.
  25. Mathews, F. S. The structure, function and evolution of cytochromes. Prog. Biophys. Mol. Biol. 45:1-56, 1985.
  26. Rost, B., Sander, C. and Schneider, R. Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 235:13-26, 1994.
  27. Defay, T. and Cohen, F. E. Evaluation of current techniques for ab initio protein structure prediction. Proteins 23:431-445, 1995.
  28. Zemla, A., Venclovas, C., Fidelis, K. and Rost, B. A modified definition of SOV, a segment-based measure for protein secondary structure prediction assessment. Proteins 34:220-223, 1999.
  29. Levitt, M. A simplified representation of protein conformations for rapid simulation of protein folding. J. Mol. Biol. 104:59-107, 1976.
  30. Levitt, M. and Chothia, C. Structural patterns in globular proteins. Nature 261:552-558, 1976.
  31. Kneller, D. G., Cohen, F. E. and Langridge, R. Improvements in Protein Secondary Structure Prediction by an Enhanced Neural Network. J. Mol. Biol. 214:171-182, 1990.
  32. Zhang, C.-T. and Chou, K.-C. An optimization approach to predicting protein structural class from amino acid composition. Prot. Sci. 1:401-408, 1992.
  33. Rost, B. Observed secondary structure content for 721 proteins. WWW document (http://cubic.bioc.columbia.edu/results/1996/SecStrContent.html): EMBL Heidelberg, Germany, 1996.
  34. Lesk, A. M., Lo Conte, L. and Hubbard, T. J. P. Assessment of novel folds targets in CASP4: Predictions of three-dimensional structures, secondary structures, and interresidue contacts. Proteins in press, 2001.
  35. Fischer, D., Elofsson, A., Rychlewski, L., Pazos, F., Valencia, A., Rost, B., Ortiz, A. R. and Dunbrack, R. L. CAFASP2: the second critical assessment of fully automated structure prediction methods. Proteins submitted, 2001.
  36. Rost, B. and Sander, C. Jury returns on structure prediction. Nature 360:540, 1992.
  37. Gerstein, M. and Levitt, M. A structural census of the current population of protein sequences. Proc. Natl. Acad. Sci. U.S.A. 94:11911-11916, 1997.
  38. Przytycka, T., Aurora, R. and Rose, G. D. A protein taxonomy based on secondary structure. Nat. Struct. Biol. 6:672-682, 1999.
  39. Liu, W. and Chou, K. C. Prediction of protein secondary structure content. Prot. Engin. 12:1041-1050, 1999.
  40. Zhang, C. T. and Zhang, R. Skewed distribution of protein secondary structure contents over the conformational triangle. Prot. Engin. 12:807-810, 1999.
  41. Wang, Z.-X. and Yuan, Z. How good is prediction of protein structural class by the component-coupled method? Proteins 38:165-175, 2000.
  42. Chandonia, J. M. and Karplus, M. New methods for accurate prediction of protein secondary structure. Proteins 35:293-306, 1999.
  43. Ptitsyn, O. B. and Finkelstein, A. V. Theory of protein secondary structure and algorithm of its prediction. Biopolymers 22:15-25, 1983.
  44. Andersen, C. A. F., Palmer, A. G., Brunak, S. and Rost, B. Continuous secondary structure assignment correlates with protein flexibility. Structure submitted, 2001.
  45. Minor, D. L. J. and Kim, P. S. Context-dependent secondary structure formation of a designed protein sequence. Nature 380:730-734, 1996.
  46. Muñoz, V., Cronet, P., López-Hernández, E. and Serrano, L. Analysis of the effect of local interactions on protein stability. Folding & Design 1:167-178, 1996.
  47. Villegas, V., Zurdo, J., Filimonov, V. V., Aviles, F. X., Dobson, C. M. and Serrano, L. Protein engineering as a strategy to avoid formation of amyloid fibrils. Prot. Sci. 9:1700-1708, 2000.
  48. Young, M., Kirshenbaum, K., Dill, K. A. and Highsmith, S. Predicting conformational switches in proteins. Prot. Sci. 8:1752-1764, 1999.
  49. Krittanai, C. and Johnson, W. C. J. The relative order of helical propensity of amino acids changes with solvent environment. Proteins 39:132-141, 2000.
  50. Pan, X. M., Niu, W. D. and Wang, Z. X. What is the minimum number of residues to determine the secondary structural state? J. Prot. Chem. 18:579-584, 1999.
  51. Jacoboni, I., Martelli, P. L., Fariselli, P., Compiani, M. and Casadio, R. Predictions of protein segments with the same aminoacid sequence and different secondary structure: A benchmark for predictive methods. Proteins 41:535-544, 2000.
  52. Zhou, X., Alber, F., Folkers, G., Gonnet, G. H. and Chelvanayagam, G. An analysis of the helix-to-strand transition between peptides with identical sequence. Proteins 41:248-256, 2000.
  53. Kirshenbaum, K., Young, M. and Highsmith, S. Predicting allosteric switches in myosins. Prot. Sci. 8:1806-1815, 1999.
  54. Compiani, M., Fariselli, P., Martelli, P. L. and Casadio, R. Neural networks to study invariant features of protein folding. Theoretical Chemistry Accounts 101:21-26, 1999.
  55. Shepherd, A. J., Gorse, D. and Thornton, J. M. Prediction of the location and type of beta-turns in proteins using neural networks. Prot. Sci. 8:1045-1055, 1999.
  56. Bystroff, C., Thorsson, V. and Baker, D. HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol. 301:173-190, 2000.
  57. Teichmann, S. A., Chothia, C. and Gerstein, M. Advances in structural genomics. Curr. Opin. Str. Biol. 9:390-399, 1999.
  58. Rost, B. EVA measures of secondary structure prediction accuracy. WWW document (http://cubic.bioc.columbia.edu/eva/doc/measure_sec.html): EMBL, 2001.
  59. Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405:442-451, 1975.