1 CUBIC, Columbia University, Department of Biochemistry and Molecular Biophysics, 650 West 168th Street, New York, NY 10032, USA
2 Columbia Univ., Dept. of Chemistry, 3000 Broadway MC 3136, New York, NY 10027, USA
* Corresponding author: rost@columbia.edu, http://cubic.bioc.columbia.edu/
Tel: +1-212-305-3773, fax: +1-212-305-7932
| Title: | EVA: large-scale analysis of secondary structure prediction |
| Author: | Burkhard Rost & Volker A Eyrich |
| Quote: | Proteins, 2001, 45 Suppl 5:S192-S199 |
EVA is a web-based server that evaluates automatic structure prediction servers continuously and objectively. Since June 2000, EVA has collected more than 20,000 secondary structure predictions. The EVA sets sufficed to conclude that the field of secondary structure prediction has advanced again. Accuracy increased substantially in the 1990s through using evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points, to its current level of around 76% of all residues predicted correctly in one of the three states helix, strand, other. The best current methods solved most of the problems raised at earlier CASP meetings: all good methods now get segments right and perform well on strands. Is the recent increase in accuracy significant enough to make predictions even more useful? We believe the answer is affirmative. What is the limit of prediction accuracy? We shall see.
Availability: all data is available through the EVA web site at
{cubic.bioc.columbia.edu/eva/}. The raw data for the results presented
are available at {eva}/sec/bup_common/2001_02_22/.
Key words: automatic evaluation, large-scale assessment,
and protein structure prediction
Abbreviations used: 3D, three-dimensional; 1D, one-dimensional
(e.g. string of secondary structure); DSSP, programs and
data base assigning secondary structure from 3D coordinates [1] ;
JPred2, divergent profile (PSI-BLAST) based neural network
prediction [2] ; PHD, Profile based neural network prediction
of secondary structure (PHDsec), solvent accessibility (PHDacc),
and transmembrane helices (PHDhtm) [3] ; PHDpsi, divergent
profile (PSI-BLAST) based neural network prediction [4] ; PSI-BLAST,
position specific iterated database search [5] ; PDB,
Protein Data Bank of experimentally determined 3D structures of
proteins; PROFphd, advanced profile-based neural network
prediction of secondary structure (Rost, unpublished); PROFking,
cascaded statistic based secondary structure prediction method
[6] ; PSIPRED, divergent profile (PSI-BLAST) based neural
network prediction [7] ; SAM-T99sec, neural network
prediction, using Hidden Markov models as input [8] ; SSpro,
profile-based advanced neural network prediction method [9] .
Secondary structure is at the heart of structure prediction. The rapidly growing sequence-structure gap (number of known protein structures vs. number of known protein sequences) has enticed theoreticians to solve simplified prediction problems [10, 11] . An extreme simplification is the prediction of protein structure in one dimension (1D), as represented by strings of secondary structure. Theoreticians are lucky in that this relatively simple task comprises a goal relevant for prediction of protein structure and function, in general. Almost any imaginable algorithm has been applied to this task. The result is that we have come a long way since the first method published 44 years ago [12] . The most important aspect of third generation methods demonstrating their break-through at the first CASP meetings was the automatic use of evolutionary information [13, 14] . In fact, secondary structure prediction may have been the most successful discipline of protein structure prediction over the last 40 years. Is the field still alive?
Here, we focused on presenting various aspects of the performance of recent secondary structure prediction methods. We analysed automatic methods based on large data sets. The machinery allowing such a large-scale assessment is the automatic, continuous, and objective web server EVA [15, 16] .
EVA assessment in four prediction categories. Currently, EVA [15] evaluates four different categories of structure prediction servers (URLs at [16] ): (1) comparative modelling, (2) fold recognition and threading, (3) secondary structure prediction, and (4) inter-residue contact predictions. The following groups agreed to let their public secondary structure prediction servers be evaluated by EVA: James Cuff & Geoff Barton (JPred2) [17, 2] ; Mohammed Ouali & Ross King (PROFking) [6] ; David Jones (PSIPRED) [7, 18] , Gajendra Raghava (PSSP, unpublished), Kevin Karplus (SAM-T99sec) [8] , Pierre Baldi & Gianluca Pollastri (SSpro) [9] . Our group contributed PHDsec [19, 20, 3] , PHDpsi [4] , and PROFphd (Rost, unpublished).
Results are updated every week. Every day, EVA obtains the latest experimentally determined structures from the PDB [21] web-site. These structures are parsed into chains using the DSSP program [1] . The sequence of each chain is submitted immediately to the prediction servers using META-PP [22] . Predictions are collected and sent for evaluation to the EVA-satellites: to Rockefeller University for comparative modelling, to CNB Madrid for contact predictions, and to CUBIC at Columbia University for all other predictions. Depending on the category, the assessments are made available within hours to days. The central EVA site at Columbia downloads all HTML pages produced by the satellites, and builds up the 'latest week' results that are then mirrored at the Rockefeller University and at the CNB Madrid.
Selection of data sets. Currently, EVAsec uses only proteins with new structures to evaluate secondary structure prediction. We use the following operational definition: if a pairwise alignment search detects the similarity at a level at which it detects 50% false positives, the sequence similarity is deemed 'not significant'. This concept translates to a threshold above 28 identical in 100 aligned residues [23, 24] . This implied that we presented only results from the 'fold recognition' and 'new fold' categories in CASP. We reported results for two sets; the first set_218 contained all 218 protein chains with new structures added to PDB between Jun 2000 and Feb 2001 for which we had results for 6 methods. The second set_99 was a subset of set_218 with 99 chains for which we had predictions for all 9 methods evaluated.
Assigning secondary structure from 3D coordinates. EVAsec uses secondary structure assignments from DSSP [1]. The eight DSSP states are converted to three states using the following transformation: DSSP [HGI] -> helix (H), DSSP [EB] -> strand (E), all other DSSP states [TS ] -> other (L). Note: occasionally developers convert 3₁₀-helices (DSSP G), pi-helices (DSSP I), or beta-bridges (DSSP B) to the 'other' state. Such a conversion seemingly increases accuracy, since these states are more difficult to predict [17].
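The eight-to-three state reduction described above can be sketched in a few lines. This is a minimal illustration, not the EVAsec code; the function name `reduce_dssp` and the example string are hypothetical.

```python
# Reduce DSSP's eight states to three, following the mapping in the text:
# H, G, I -> helix (H); E, B -> strand (E); T, S, blank -> other (L).
DSSP_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp_string: str) -> str:
    """Convert a per-residue DSSP string into a three-state string (H/E/L)."""
    # Any symbol not listed in the table (T, S, blank) falls into 'other'.
    return "".join(DSSP_TO_THREE.get(state, "L") for state in dssp_string)


# Hypothetical DSSP string, for illustration only.
print(reduce_dssp("HHHHGGG TTEEEEB SS"))
```

Mapping G, I and B to 'L' instead would implement the alternative conversion mentioned in the note, which is why the table is kept as an explicit dictionary.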
Scoring per-residue accuracy. The three-state per-residue
accuracy (Q3) is the most widely used score for evaluating
secondary structure predictions. Q3 gives the percentage
of residues correctly predicted in one of the three states: helix,
strand, other. Most residues are observed in the 'other' state.
Hence, Q3 can be high even for methods predicting helices
and strands inaccurately. One way around this problem is to measure
the percentage of residues observed in state i (i = H, E, L) that are predicted
correctly in state i ($Q_i^{\%obs}$), and the percentage of residues
predicted in state i that are predicted correctly ($Q_i^{\%prd}$).
Other ways around it are the Matthews correlation coefficients [25]
and the information index [19, 26] . Some methods predict 3D
structure starting from rigid body secondary structure segments.
These methods need predictions with a low percentage of residues
confused between strand and helix as measured by the BAD score
[27] .
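The per-residue scores above can be made concrete with a short sketch. It assumes observed and predicted structures are given as equal-length three-state strings over H, E, L; the function names and the example strings are hypothetical, and the Matthews coefficient is written in its standard two-class form applied to one state against the rest.

```python
import math

STATES = "HEL"


def confusion(observed: str, predicted: str) -> dict:
    """Count residues by (observed state, predicted state)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings differ in length")
    counts = {(o, p): 0 for o in STATES for p in STATES}
    for o, p in zip(observed, predicted):
        counts[(o, p)] += 1
    return counts


def q3(counts: dict) -> float:
    """Percentage of residues predicted correctly in any of the three states."""
    correct = sum(counts[(s, s)] for s in STATES)
    return 100.0 * correct / sum(counts.values())


def per_state(counts: dict, state: str) -> tuple:
    """Return (% of observed residues in `state` predicted correctly,
    % of residues predicted in `state` that are correct)."""
    correct = counts[(state, state)]
    n_obs = sum(counts[(state, p)] for p in STATES)
    n_prd = sum(counts[(o, state)] for o in STATES)
    pct_obs = 100.0 * correct / n_obs if n_obs else float("nan")
    pct_prd = 100.0 * correct / n_prd if n_prd else float("nan")
    return pct_obs, pct_prd


def matthews(counts: dict, state: str) -> float:
    """Matthews correlation coefficient for one state versus the other two."""
    tp = counts[(state, state)]
    fn = sum(counts[(state, p)] for p in STATES) - tp
    fp = sum(counts[(o, state)] for o in STATES) - tp
    tn = sum(counts.values()) - tp - fn - fp
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical 12-residue example, for illustration only.
c = confusion("HHHHLLEEEELL", "HHHLLLEEELLH")
print(q3(c), per_state(c, "H"), per_state(c, "E"), matthews(c, "H"))
```

Keeping the full 3x3 confusion table means all scores are derived from one pass over the residues, which avoids inconsistencies between the individual measures.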
Scoring per-segment accuracy. In practice, methods that get most of the segment cores right are more useful than those that get some of the entire segments right. Per-residue scores cannot distinguish between these two. Many segment-based measures have been proposed [26] ; the one that appears to distinguish best between good and bad predictions is the average overlap between segments (SOV) [26, 28] .
Scoring accuracy in predicting secondary structure class. A coarse-grained classification of protein structures is based on secondary structure composition [29, 30]. Hence, secondary structure predictions also imply predictions of secondary structural class. EVAsec reports the percentage of proteins correctly predicted in one of the following four classes: all-alpha (length > 60, helix > 45%, strand < 5%), all-beta (length > 60, helix < 5%, strand > 45%), alpha/beta (length > 60, helix > 30%, strand > 20%), other. The thresholds were chosen by intuition [31, 32, 20], since these simplified structural classes are not separated well [33]. EVAsec also reports differences between observed and predicted overall content in order to measure the accuracy in predicting secondary structure composition independently of thresholds.
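The four-class assignment can be written directly from the thresholds quoted above. This is a sketch under the assumption that helix and strand content are percentages of all residues in a three-state string; the function names are hypothetical.

```python
def structural_class(length: int, helix_pct: float, strand_pct: float) -> str:
    """Assign one of the four classes using the thresholds quoted in the text."""
    if length > 60:
        if helix_pct > 45 and strand_pct < 5:
            return "all-alpha"
        if helix_pct < 5 and strand_pct > 45:
            return "all-beta"
        if helix_pct > 30 and strand_pct > 20:
            return "alpha/beta"
    # Short chains and every remaining composition fall into 'other'.
    return "other"


def class_from_string(three_state: str) -> str:
    """Derive the class from a three-state (H/E/L) string."""
    n = len(three_state)
    if n == 0:
        return "other"
    helix_pct = 100.0 * three_state.count("H") / n
    strand_pct = 100.0 * three_state.count("E") / n
    return structural_class(n, helix_pct, strand_pct)


# A protein counts as correctly predicted when both strings give the same class.
observed, predicted = "H" * 50 + "L" * 50, "H" * 46 + "L" * 54
print(class_from_string(observed) == class_from_string(predicted))
```

Because the classes depend on hard thresholds, a small error in predicted content can flip the class, which is one reason the text also reports content differences independently of thresholds.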
Methods are not ranked based on too few test proteins! EVAsec does not rank prediction methods based on too few test proteins. Because the accuracy of secondary structure prediction varies between proteins, accuracy estimates typically constitute averages over many test sequences, with standard deviations usually above ten percentage points. We use this standard deviation to estimate the error of the average accuracy as a function of the test set size. A significant difference (ΔQ) between two methods, or the error of the accuracy estimate for one method, is:
$$\Delta Q = \frac{\sigma(Q, N_{protLarge})}{\sqrt{N_{prot}}}$$
where Q is the measure for accuracy, Nprot the number of proteins used in the test set, NprotLarge the number of proteins used in a larger, representative set (>100 proteins), and σ(Q, NprotLarge) is the standard deviation of variable Q in a large, representative test set (assuming a Gaussian distribution of variable Q). The observation that different prediction methods typically have similar standard deviations provides a necessary justification for this approach. For example, when a method correctly predicts 75% of the residues in a test set of 16 proteins with a standard deviation of 10%, a difference relative to another method that is smaller than 2.5 percentage points (i.e., ΔQ = 10/sqrt(16)) is not significant. Thus, we cannot distinguish between two methods that predict correctly 75% and 73% of all residues, respectively. EVAsec uses this estimate to rank methods in the following way. Assume four methods have accuracy levels of A=75, B=73, C=71, and D=68. D can be distinguished from all other methods (ΔQ > 2.5 to all). Hence, it ranks last. C can be distinguished from A (ΔQ = 4 > 2.5). However, A cannot be distinguished from B (ΔQ = 2 < 2.5), and B cannot be distinguished from C (ΔQ = 2 < 2.5). This situation results in a dilemma that has four different possible solutions: (1) A, B and C get the same rank, ensuring that no two methods that cannot be distinguished are ranked differently. (2) A and B get rank 1, and C rank 2, ensuring that no two methods that can be distinguished are ranked equally. (3) A gets rank 1, B rank 2 and C rank 3, ignoring that we cannot distinguish between A and B, nor between B and C. (4) Do not rank. None of these solutions is 'correct'. During the first three CASP experiments, solution 3 was practised (higher average results in higher rank). The evaluation of secondary structure prediction performance for CASP4 effectively implemented a concept more similar to solution 1 [34]. The first solution is also realised by EVAsec.
For the example given this implies that A, B, and C are ranked 1; D is ranked 2.
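The error estimate and ranking solution 1 can be sketched as follows. This is one possible reading of the rule, not the EVAsec implementation: methods are sorted by accuracy and share a rank whenever neighbours in the sorted list differ by less than the threshold. The function names are hypothetical, and treating a difference exactly equal to the threshold as significant is an assumption.

```python
import math


def significance_threshold(sd_large: float, n_prot: int) -> float:
    """Standard deviation from a large set divided by sqrt(test-set size)."""
    return sd_large / math.sqrt(n_prot)


def rank_methods(accuracy: dict, threshold: float) -> dict:
    """Solution 1: methods linked by non-significant differences share a rank."""
    ordered = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    rank = 0
    previous = None
    for name, value in ordered:
        # Open a new rank only when the gap to the next-better method
        # reaches the threshold (assumption: equality counts as significant).
        if previous is None or previous - value >= threshold:
            rank += 1
        ranks[name] = rank
        previous = value
    return ranks


threshold = significance_threshold(10.0, 16)  # the example from the text
print(threshold)
print(rank_methods({"A": 75, "B": 73, "C": 71, "D": 68}, threshold))
```

Comparing only neighbours in the sorted list is sufficient here: if two methods differ by less than the threshold, every method between them does too, so they end up in the same group.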
Bootstrap experiment to test effect of small data sets. CASP4 had 53 targets; for 43 of these, structures were available at the meeting in Dec. 2000. 29 of these 43 proteins constituted fold recognition/novel fold targets at CASP (Methods). For 11 of these 29, all methods participating at CAFASP predicted secondary structure [35]. Are these sets sufficient to rank methods? Although such ranking based on sets that are too small was carefully avoided at CASP4 [34], we still wanted to address this question by the following bootstrap experiment. (1) Take a data set of proteins predicted by all M methods (here 99 protein chains from 10 methods). (2) Select at random (i) a set of 11 proteins predicted by all methods (common sets) and (ii) another M sets of 29 proteins that may differ between the methods (incomparable sets). (3) Measure the average performance on each set.
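One random draw of this experiment can be sketched as below. The sketch assumes per-protein accuracies are stored as one list per method in the same protein order and that proteins are drawn without replacement; the function name, the method names, and the toy scores are hypothetical and are not EVA data.

```python
import random
import statistics


def bootstrap_draw(scores, n_common=11, n_incomparable=29, rng=None):
    """One draw: averages on a common subset and on per-method subsets.

    scores maps method name -> list of per-protein accuracies, with the same
    protein order for every method.
    """
    rng = rng or random.Random()
    methods = sorted(scores)
    n_prot = len(scores[methods[0]])
    # (i) one subset shared by all methods (common set)
    common_idx = rng.sample(range(n_prot), n_common)
    common = {m: statistics.mean([scores[m][i] for i in common_idx])
              for m in methods}
    # (ii) an independent subset for each method (incomparable sets)
    incomparable = {}
    for m in methods:
        idx = rng.sample(range(n_prot), n_incomparable)
        incomparable[m] = statistics.mean([scores[m][i] for i in idx])
    return common, incomparable


# Synthetic toy scores (99 proteins, two made-up methods), illustration only.
toy_rng = random.Random(0)
toy = {name: [toy_rng.gauss(mean, 10.0) for _ in range(99)]
       for name, mean in [("method_a", 76.0), ("method_b", 74.0)]}
for draw in range(3):
    common, incomparable = bootstrap_draw(toy, rng=toy_rng)
    print(draw, common, incomparable)
```

Repeating the draw many times and recording each method's average, or its rank, gives the spread that the experiment is designed to expose.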
Methods best evaluated on identical subsets. The first observation from such an experiment was that methods differed considerably between different random draws, i.e. between different hypothetical CASP experiments ( Fig. 1 ). Remarkably, using 29 proteins from incomparable subsets ( Fig. 1 A) resulted in slightly higher variation than using 11 proteins from common subsets ( Fig. 1 B). This suggested that - on average - it is a better strategy to base comparisons of methods on identical common subsets rather than on all available predictions even if the constraint to have a common subset reduces the available data from 29 to 11.
Ranking methods based on small subsets can be misleading.
The variation between different random draws of data sets became
even more dramatic when we ranked the methods based on the average
performance: most methods did rank best AND worst in one of the
random draws. Ranks varied more for the 29 proteins from incomparable
sets than for the 11 proteins from the common sets. Assuming that
the 99 proteins constitute a 'representative set' (which is wrong,
as indicated below and in Table 1 ), we can compile a cumulative
average of one method over many random draws. How many draws do
we need before the methods will reach the averages of the original
set? About 30 draws for the common subsets of 11 proteins, and
more than 60 for the incomparable subsets of 29 proteins (data
not shown). In our experiment, we forced all methods to predict
equal numbers of proteins. Most methods provide estimates for
the reliability of the prediction for each residue ( Fig. 3 ). What
if methods submitted only their seemingly best predictions? We
used such an index to select only the most reliable predictions
from one method while forcing all other methods to predict for
all proteins. Note that this selection was realised without knowing
the accuracy of a particular prediction. Surprisingly, we could
make EVERY method for which we had such an index become the winner
at ANY of the random draws by submitting only the most reliable
predictions! Our bootstrap experiments underlined what most CASP
evaluators practised: ranking methods based on small subsets may
not be appropriate.
Fig. 1. Significance of averages and ranks from small numbers. CAFASP2 had 29 sequence-unique proteins to evaluate secondary structure prediction; for 11 of these we had CAFASP2 results for all methods. Each point in the graphs is an average over 29 (A) and over 11 (B) proteins for one method. Proteins are selected at random from a set of 99 protein chains. The x-axes give the number of different random draws from the set of 99 proteins. The subsets of 11 proteins are constrained to be identical between all methods for every subset (B), while the 29 (A) are not subjected to this constraint, i.e. they differ on average between the methods. Below, the spread of ranks is given that would result from the respective averages. For example, JPred2 would be the winner at one random draw and come in last at another when using 29 incomparable proteins (A: 1-8); it would never rank 'last' when using 11 identical proteins (B: 1-7). Conclusions: firstly, 11 proteins are clearly not enough to rank methods. Secondly, if we have results for all methods for 29 proteins but these are not the identical 29, results are even less significant. Hence, using identical subsets is the better strategy even if it implies discarding most possible targets.
Better alignments improved secondary structure predictions
significantly. The set of 99 new protein chains for which
EVA collected results for all methods did not suffice to distinguish
between all methods. However, some trends became apparent: A number
of methods in 2001 predict secondary structure more accurately
than did the best method of 1996. In fact, all methods using alignments
not restricted to pairwise comparisons performed significantly
better than PHD using only pairwise alignment information (Table
1). Simply replacing the pairwise alignments input to PHD by PSI-BLAST
profiles [5] , made the resulting PHDpsi rank in the 'winner'
group. (Upon closer look: most of the improvement of PHDpsi over
PHD resulted from using larger databases, rather than from using
PSI-BLAST [4] .) Did this imply that nothing has changed but
the databases and the search methods? 99 protein chains did not
suffice to tell.
| Method | CASP | Q3 | SOV | $Q_H^{\%obs}$ | $Q_H^{\%prd}$ | $Q_E^{\%obs}$ | $Q_E^{\%prd}$ | BAD | Info | CH | CE | Class | ΔH | ΔE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JPred2 | 102 | 75.5 | 67.6 | 72.2 | 84.7 | 58.0 | 75.0 | 1.9 | 0.34 | 0.67 | 0.59 | 80.8 | 6.7 | 5.6 |
| PHDpsi | 385 | 74.5 | 67.9 | 78.7 | 81.9 | 63.8 | 69.0 | 2.9 | 0.23 | 0.68 | 0.58 | 80.8 | 6.6 | 4.8 |
| PROFphd | 402 | 77.0 | 71.6 | 80.5 | 84.4 | 68.8 | 71.6 | 2.3 | 0.36 | 0.72 | 0.63 | 81.8 | 5.9 | 3.9 |
| PROFking | 214 | 74.4 | 67.4 | 74.0 | 86.4 | 70.9 | 66.6 | 2.7 | 0.32 | 0.69 | 0.61 | 84.8 | 7.6 | 6.7 |
| PSIPRED | 258 | 76.8 | 72.2 | 81.9 | 82.9 | 68.5 | 72.3 | 2.5 | 0.37 | 0.71 | 0.63 | 84.8 | 5.5 | 4.6 |
| SAM-T99sec | 111 | 76.1 | 70.8 | 84.7 | 80.0 | 62.0 | 76.9 | 1.9 | 0.35 | 0.71 | 0.62 | 80.8 | 6.7 | 4.8 |
| SSpro | 115 | 76 | 69.1 | 81.2 | 82.1 | 63.2 | 72.6 | 2.4 | 0.35 | 0.70 | 0.60 | 77.8 | 6.5 | 5.8 |
| | | | | | | | | | | | | | | |
| PHD | 142 | 71.7 | 67.3 | 75.9 | 77.7 | 61.1 | 62.9 | 3.8 | 0.25 | 0.62 | 0.53 | 77.8 | 7.9 | 5.7 |
| | | | | | | | | | | | | | | |
| PSSP | 510 | 64.3 | 58.7 | 64.4 | 71.5 | 55.6 | 50.0 | 5.9 | 0.20 | 0.50 | 0.40 | 74.7 | 9.8 | 7.8 |
A:
Data set and sorting: All methods have been tested on the same set of 99 new protein chains (EVA version Feb 2001). None of these structures was similar to any protein used to develop the respective method. This set comprised the largest such set by Feb 23, 2001 for which we had results. Sorting and grouping reflects the following concept: if the data set is too small to distinguish between two methods, these two are grouped. For the given set of 99 proteins this yielded three groups. Inside each group, results are sorted alphabetically. Note: groups are separated by an empty line; 99 proteins did not suffice to separate between the first 7 methods (see Table 2 for a larger set).
Method: see abbreviations on top of article.
Scores [19, 58] : Q3: three-state per-residue accuracy, i.e., percentage of residues predicted correctly in any of the three states helix, strand, other; SOV: three-state per-segment score measuring the overlap between predicted and observed segments [26, 28] ; $Q_H^{\%obs}$ ($Q_E^{\%obs}$): residues predicted correctly in helix (or strand) as percentage of residues observed in helix (or strand); $Q_H^{\%prd}$ ($Q_E^{\%prd}$): residues predicted correctly in helix (or strand) as percentage of residues predicted in helix (or strand); BAD: percentage of helical residues predicted as strand, and of strand residues predicted as helix [27] ; Info: per-residue information content [19] ; CH: Matthews correlation coefficient for state helix [59] ; CE: Matthews correlation coefficient for state strand [59] ; Class: percentage of proteins correctly sorted into one of the four classes: all-alpha, all-beta, alpha/beta, other; ΔH: difference between predicted and observed secondary structure content in helix; ΔE: difference between predicted and observed secondary structure content in strand.
And the winners are
For a few of the methods EVA
had 2-3 times larger data sets. In particular, 218 protein chains
sufficed to distinguish between some of the methods indistinguishable
when evaluated on 99 chains: PROFphd, PSIPRED, and SSpro were
significantly more accurate than JPred2 and PHDpsi ( Table 2 ).
All three 'winners' were equally balanced in predicting strand
and helix, and had a similar level of performance in predicting
secondary structure content. Unlike the bulk of methods presented
during the early CASP meetings all 7 methods shown in Table 1
predicted beta strands on average more accurately than residues
in non-regular structure. While SSpro was significantly less accurate
in predicting segments than PROFphd, all three best methods predicted
segments much better than the two second best methods JPred2 and
PHDpsi.
| Method B | Q3 | SOV | $Q_H^{\%obs}$ | $Q_H^{\%prd}$ | $Q_E^{\%obs}$ | $Q_E^{\%prd}$ | BAD | Info | CH | CE | Class | ΔH | ΔE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PROFsec | 76.8 | 72.8 | 80.5 | 84.4 | 68.8 | 71.6 | 2.2 | 0.36 | 0.72 | 0.63 | 82.1 | 5.6 | 4.0 |
| PSIPRED | 76.4 | 72.0 | 81.9 | 82.9 | 68.5 | 72.3 | 2.5 | 0.37 | 0.71 | 0.63 | 79.8 | 5.4 | 4.4 |
| SSpro | 76.1 | 71.2 | 81.2 | 82.1 | 63.2 | 72.6 | 2.5 | 0.35 | 0.70 | 0.60 | 81.2 | 6.0 | 5.2 |
| JPred2 | 74.8 | 69.3 | 72.2 | 84.7 | 58.0 | 75.0 | 2.4 | 0.34 | 0.67 | 0.59 | 76.1 | 7.6 | 5.7 |
| PHDpsi | 74.7 | 69.6 | 78.7 | 81.9 | 63.8 | 69.0 | 3 | 0.29 | 0.68 | 0.58 | 79.8 | 6.2 | 4.9 |
| PHD | 71.4 | 67.4 | 75.9 | 77.7 | 61.1 | 62.9 | 4.2 | 0.25 | 0.62 | 0.53 | 76.1 | 7.7 | 5.9 |
A: symbols as in Table 1; results based on 218
proteins not used for developing the methods.
Some proteins predicted well by all methods. The per-protein average over the 7 methods that performed best on the set of 99 chains (Table 1) varied as strongly between proteins as did each of the methods. Did this finding confirm the notion that some proteins were easier to predict than others? On average, the worst prediction method for a particular protein chain had a higher accuracy when the average over all methods was higher (Fig. 2). In other words, some chains were predicted by all methods more accurately than others. In particular, the average over all methods was below 70% (Q3) for 20% of the chains, above 80% for another 20% of the chains, and between 70% and 80% for all other chains (Fig. 2). However, for many chains the average over all methods was high, although some method had a very low accuracy. The best methods reached about 77% accuracy. For more than 80% of all proteins one method performed better than this. Hence, most proteins were predicted above average by at least one method. For only two of the 99 chains, all methods reached less than 68% accuracy (Appendix). The worst predictions were obtained for the short peptide of the human apolipoprotein II (1by6:A), the structure of which was determined by NMR. By default, we used the first NMR model to determine prediction errors although other models correlated better with the predictions (data not shown). The other bad predictions were for the endonuclease I-PPOI complexed to DNA (1evw:A). The major problems were that most methods missed the two helices reaching into the DNA on opposite sides of the molecule, and over-predicted a long strand for two parallel non-hydrogen bonded stretches on the opposite side of the DNA-binding site. For 66 of the 99 protein chains, one method had more than 80% of the residues predicted correctly, while the average over all methods reached this level for only 25 out of 99. Could we anticipate which method did best on which protein without knowing the structure?
Prediction accuracy did not correlate well with experimental
resolution. We used our largest set of 218 chains to analyse
whether or not prediction accuracy correlated with the resolution
of the respective structure. Surprisingly, we could not find a
strong correlation between accuracy and resolution. Nevertheless,
prediction accuracy was about three percentage points higher for
the X-ray structures than for the first NMR models. Furthermore,
when averaging all predictions for the quarter of all chains with
highest resolution, we found levels of accuracy about four percentage
points above the average over the quarter with lowest resolution.
Fig. 2. Best, worst and average predictions for each protein. Each cross is the average over all 7 prediction methods that ranked best on 99 protein chains (Table 1); triangles give the lowest and highest accuracy for a particular protein. The x-axis describes for which fraction of the 99 proteins the average was above a certain value. For example, the average accuracy was below 70% for 20% of all proteins, while only for 6 of 99 the best method did not reach 70% accuracy. Furthermore, counting the filled triangles below the 70% line revealed that for half the proteins (48 of the 99) even the worst method surpassed 70% accuracy.
Reliability indices estimated prediction accuracy accurately.
Prediction methods typically use three output states for
helix, strand, other, and predict the state with the highest value
as the secondary structure of the respective residue. Assume the
output for one residue is (0.3, 0.4, 0.3); that for another (0.1,
0.8, 0.1); both residues are predicted as strand. However, the
second prediction is much stronger. This difference can be carved
into an index describing the reliability of a secondary structure
prediction for each residue [36, 19] . After the successes
of such indices at the first CASP meeting, almost all methods
now implement estimates for the reliability of the prediction
for each residue. For all methods tested, these indices correlated
surprisingly well with accuracy ( Fig. 3 ). For example, PSIPRED
and PROFphd reached levels above 90% for the half of the residues
predicted most strongly.
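A reliability index of this kind, and the cumulative accuracy it supports, can be sketched as follows. The scaling of the difference between the two highest outputs to an integer from 0 to 9 is an assumption made for illustration, and methods may define their indices differently; the function names are hypothetical.

```python
def reliability_index(outputs) -> int:
    """Scale the gap between the two highest output values to 0..9.

    Assumption: index = floor(10 * (highest - second highest)), capped at 9.
    """
    top, second = sorted(outputs, reverse=True)[:2]
    # The small epsilon guards against floating-point results such as 6.999...
    return min(9, int(10 * (top - second) + 1e-9))


def cumulative_accuracy(observed: str, predicted: str, reliability) -> list:
    """For each threshold t, return (t, % residues kept, % of kept correct).

    A residue is kept when its reliability index is at least t.
    """
    rows = []
    n = len(observed)
    for t in range(10):
        kept = [(o, p) for o, p, r in zip(observed, predicted, reliability)
                if r >= t]
        if not kept:
            break
        correct = sum(o == p for o, p in kept)
        rows.append((t, 100.0 * len(kept) / n, 100.0 * correct / len(kept)))
    return rows


# The two residues from the text: both predicted as strand, at different strength.
print(reliability_index((0.3, 0.4, 0.3)), reliability_index((0.1, 0.8, 0.1)))
```

Plotting the third column of `cumulative_accuracy` against the second gives an accuracy-versus-coverage curve for one method.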
Fig. 3. Prediction strength correlated well with accuracy.
Residues predicted at higher reliability are predicted more
accurately [19, 3] . Reliability indices are now used by most
methods. Shown are cumulative values, i.e. the accuracy (Q3) for
all residues predicted above a given reliability. For example,
for all methods 90% of the 40% most strongly predicted residues
are predicted correctly. Results were based on the set_99 also
used for Table 1.
Secondary structural class predicted almost as accurately as by experiment. Grouping proteins into secondary structure classes (all-alpha, all-beta, alpha/beta, other) appears a useful initial approach toward classifying proteins [37, 38]. Such classes can be predicted successfully based merely on the overall amino acid composition of a protein [39, 40, 41]. Increasingly complex and ingenious methods address this reduced goal; reported levels of prediction accuracy approach 100%. Recently, Wang & Yuan explained these high values by insufficient testing schemes, and argued that a four-state accuracy of around 60% is the maximum for methods based solely on composition [41]. Obviously, it is much easier to predict class starting from the detailed information about evolutionary profiles for the entire sequence than by restricting the input to composition. In fact, today's best general prediction methods also predict secondary structure class better (Table 1, Table 2). The differences between observed and predicted secondary structure composition are now below 6% for helix and strand. This performance is similar to what experimental low-resolution methods (circular dichroism, Fourier transform infrared spectroscopy) achieve at their best [20, 42].
Homologues of known structures predicted marginally better.
All current top-of-the-line methods somehow learn the secondary
structure for proteins of known structure. In particular, no method
relies entirely on 'first principles' like, e.g. one of the best
methods of the second generation ALB did [43] . Consequently,
today's methods somehow depend on residual sequence similarity
between target and known structures. Did this imply that prediction
accuracy was significantly higher for proteins with homologues
of known structure? For example, PSIPRED reached a level close
to 80% accuracy for a set of 223 protein chains with significant
sequence similarity to known structures (data not shown) compared
to about 76-77% for new structures ( Table 2 ). A similar trend
persisted for all prediction methods analysed (data not shown).
How would these values compare to inferring secondary structure
through comparative modelling? We did not yet compile data to
address this question explicitly. However, for structural alignments
the respective secondary structure assignments agree for more
than 88% of all residues [26, 44] . Hence, when we know a protein
of known structure that is similar to a target, we supposedly
still best use comparative modelling to predict secondary structure.
A similar conclusion was suggested by analysing the CASP4 results
for comparative modelling [34] .
CASP and EVA: both are needed. For CAFASP all methods predicted secondary structure for the same 11 proteins, and some methods for another 18. Our bootstrap experiment provided some numbers illustrating problems with ranking methods based on data sets that are too small. Can we conclude anything from small sets? Certainly, but the level of detail depends on the data set. For example, 99 proteins sufficed to conclude that 7 methods were more accurate than was pairwise PHD (Table 1). However, we needed more than 200 proteins to distinguish between some of the 7 best (Table 2). EVA continues to assess prediction methods automatically on as many proteins every month as does CASP every two years [15]. Should we then have CASP without, e.g. secondary structure prediction? We perceive that the advance in prediction methods over the last 8 years has been influenced strongly by CASP. Further advances might stall with no secondary structure prediction present at CASP. We suggest comparing expert predictions and automatic methods based on the limited number of CASP targets, and relating the performance to the larger EVA sets. What do secondary structure predictions teach us about protein function? This is a kind of problem that EVA cannot address. We need expert evaluations to learn what to measure. Finally, all tables given here are available through the EVA web site; the interpretation of the data is not.
The field advanced significantly. Growing databases and improved search techniques yielded a substantial improvement in secondary structure prediction over the last four years. The best methods now reach sustained levels of 76% (Table 1, Table 2). For almost every second protein even the worst of the 7 best methods (Table 1) surpassed 70% accuracy, and for fewer than 10% of the proteins the 70% level was not reached by the best method (Fig. 2). Even more impressively, about 60% of all residues are predicted at levels similar to structural alignments of homologues (Fig. 3).
88% is a limit, but shall we ever reach close to there? Protein secondary structure formation is influenced by long-range interactions [45, 46, 47] and by the environment [48, 49] . Consequently, stretches of up to 11 adjacent residues (dubbed chameleon after [45] ) can be found in different secondary structure states [50, 51, 52] . Implicitly, such non-local effects are contained in the exchange patterns of protein families. This is reflected by the fact that strand is predicted almost as accurately as helix ( Table 1 ), although sheets are stabilised by more non-local interactions than helices. Local evolutionary profiles can even suffice to identify structural switches [53, 48] . Surprisingly, we can find some traces of folding events in secondary structure predictions [54] . Even more amazing is a study suggesting that alignment-based methods achieve similar levels of accuracy for chameleon regions as for all other regions [51] . Secondary structure assignments may vary for two versions of the same structure. One reason is that protein structures are no rocks but dynamic objects with some regions more mobile than others. Another reason is that any assignment method has to choose particular thresholds. Consequently, assignments differ by about 5-10 percentage points between different NMR models for the same protein [44] , and by about 12 percentage points between structural homologues [26] . The latter number provides an upper limit for secondary structure prediction of error-free comparative modelling. After the recent advances we have reached above 76%. Thus, we need to mount another twelve percentage points (or even less). What is the major obstacle, the size of the experimental database as suggested by Pan et al. [50] ? PHDpsi was trained on 200 proteins; when using PSI-BLAST input it was almost as accurate as PSIPRED trained on 2000 proteins ( Table 2 ). Hence, the database growth may not suffice. Will the current explosion of sequences boost accuracy? 
In fact, current databases have fewer than 10 homologues for more than one third of the proteins, and more than 100 for only 20% of the proteins. Although the set is too small for firm conclusions, the accuracy of PROFphd was four percentage points above average for these 20% highly populated families (data not shown). Thus, larger databases may get us six percentage points higher, or they may not. The answer remains nebulous.
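The arithmetic behind the 88% limit can be made explicit with a small sketch. It is an illustration only, not EVA's evaluation code: the function names `q3` and `headroom`, the one-letter state alphabet (H for helix, E for strand, L for other) and the two toy strings are assumptions introduced here. The sketch computes three-state per-residue accuracy as the percentage of residues predicted in the correct state, and then the gap between a current accuracy level and the limit implied by the roughly 12 percentage points of disagreement between structural homologues.

```python
def q3(predicted: str, observed: str) -> float:
    """Three-state per-residue accuracy in percent.

    Both strings hold one state per residue (assumed alphabet: H, E, L).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count the positions at which the predicted state equals the observed one.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def headroom(current: float, homologue_disagreement: float = 12.0) -> float:
    """Percentage points left between `current` and the estimated limit.

    The limit is taken as 100 minus the disagreement between assignments
    for structural homologues (about 12 percentage points in the text).
    """
    limit = 100.0 - homologue_disagreement
    return limit - current


# Toy example (hypothetical strings, not EVA data): 18 of 20 residues agree.
observed = "HHHHHLLEEEELLLHHHHLL"
predicted = "HHHHLLLEEEELLLHHHLLL"
toy_accuracy = q3(predicted, observed)

# Gap between the sustained 76% level and the 88% limit.
remaining = headroom(76.0)
```

With the values quoted in the text, `headroom(76.0)` gives the twelve percentage points mentioned above; a smaller assumed disagreement would move the limit up and widen the gap accordingly.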
What are the major problems of the field? Most major problems prominent in many of the predictions submitted to the first two CASP meetings have been solved. The most important task may now be to correlate predicted secondary structure to aspects of protein function. One method has successfully related secondary structure predictions automatically to functional aspects [53, 48]. However, secondary structure based identification of binding sites or other functional aspects is still restricted to single-case expert analyses. Other than this, a number of loose ends remain. (1) All methods still have problems predicting the precise termini of regular secondary structure segments. (2) Frequently, the number of helices and strands is not predicted correctly. (3) We know that evolutionary information improves prediction accuracy. However, we still have not succeeded in correlating the 'information' contained in a particular alignment with the resulting improvement in prediction accuracy. (4) Rather than improving existing methods even further, the field should possibly attempt to expand the concept of secondary structure by predicting other states (e.g. turns [55]) or different descriptions of super-secondary structure (e.g. as used in I-sites [56]).
And now we run human? The field has advanced considerably;
more improvement appears to lie ahead. Prediction methods are
fast enough to analyse entire genomes, and for particular examples
the resulting classifications are relevant to structural and functional
genomics [38, 57]. Nevertheless, to play the devil's advocate:
we are missing a variety of approaches relating secondary structure
predictions explicitly to function. Obviously, this remark may
apply to bioinformatics in general: the new millennium began
with the publication of the entire human genome; we must rush
to get ready for the data flood.
Particular thanks to the EVA teams at the Rockefeller University
(Marc A. Martí-Renom, András Fiser & Andrej
Sali) and at the CNB in Madrid (Florencio Pazos & Alfonso
Valencia) for joining us in pursuing a laborious idea. Thanks
also to Jinfeng Liu and Dariusz Przybylski (CUBIC, Columbia) for
helping with soft- and hardware. Furthermore, we are grateful
to Phil Bourne (UCSD) for his support and to Kevin Karplus (UCSC)
for numerous suggestions. Last but not least, we thank all the developers
who accepted EVA submissions: Pierre Baldi (Irvine), Phil Bourne
(UCSD), Søren Brunak (Copenhagen), James Cuff (London),
Piero Fariselli (Bologna), Mitsuo Iwadate (Kitasato), Ross King
(Aberystwyth), Ole Lund (Copenhagen), Jarek Meller (Ithaca), David
Jones (Uxbridge), Kevin Karplus (UCSC), Osvaldo Olmea (Madrid),
Gianluca Pollastri (Irvine), Gajendra Raghava (Chandigarh), Kristoffer
Rapacki (Copenhagen), Torsten Schwede (Geneva).