Proteins, 1997, Supplement 1, 192-197

Better 1D predictions by experts with machines

Burkhard Rost

European Molecular Biology Laboratory

EMBL, 69 012 Heidelberg, Germany; rost@embl-heidelberg.de; http://dodo.bioc.columbia.edu/~rost/



Abstract

Accuracy of predicting protein secondary structure and solvent accessibility has been improved significantly by using evolutionary information contained in multiple sequence alignments. For the second Asilomar meeting, predictions were made automatically for all targets using the publicly available prediction service PredictProtein. Additionally, a semi-automatic procedure for generating more informative alignments was used in combination with the PHD prediction methods. Results confirmed the estimates for prediction accuracy. Furthermore, the more informative alignments yielded better predictions. The fairly accurate predictions of 1D structure were successfully used by various groups at the Asilomar meeting as a first step towards predicting higher dimensions of protein structure.

Key words: protein secondary structure prediction, residue solvent accessibility, multiple alignments, neural networks.


Introduction

Simplifying the structure prediction problem. The second Asilomar meeting has confirmed that after 40 years of ardent research, theory still cannot, in general, predict protein three-dimensional structure (3D) from sequence 1. However, the rapidly growing sequence-structure gap (number of known protein structures vs. number of known protein sequences) has enticed theoreticians to solve simplified prediction problems 2. An extreme simplification is the prediction of protein structure in one dimension (1D), as represented by strings of, e.g., secondary structure or residue solvent accessibility. Theoreticians are lucky: the 1D prediction problem is not only the task they can accomplish best, but even partially correct predictions of 1D structure are useful, e.g., for predicting protein function or functional sites.

Break-through of third generation prediction methods. The first generation of 1D prediction methods was based on physico-chemical principles, expert rules, and statistics of single residues 3. The second generation incorporated the influence of residues adjacent to the residue for which 1D structure was predicted (local information) 4. These secondary structure prediction methods shared three major shortcomings: (1) prediction accuracy was limited to about 60% (percentage of residues predicted correctly in one of the three states helix, strand, other), (2) β-strands were typically predicted at < 40% accuracy, (3) predicted secondary structure segments were, on average, only half as long as observed segments. Some methods were tailored to overcome one of these problems (long-range information: 5, 6; β-strand accuracy: 7; length: 8). However, only recently have automatic methods been developed that overcome most of these shortcomings. The most important trick of the third generation prediction tools of the 1990s is the use of evolutionary information contained in multiple alignments of protein families 9-20, 2, 21. To outsiders, the superiority of the third generation tools over their predecessors (which unfortunately are still being used by major sequence analysis packages, such as GCG 22) may appear marginal. However, the usefulness of the third generation methods was demonstrated at the second Asilomar meeting, at which these automated tools were routinely used by sequence analysis experts.


Materials and methods

From sequence to 1D structure. The major step in improving 1D predictions has been the use of evolutionary information contained in multiple sequence alignments. Generating the information fed into the neural network system PHD 20 required four steps: (1) a data base search for homologues (method BLAST; Altschul, 1996), (2) a refined profile-based dynamic-programming alignment of the most likely homologues (method MAXHOM; Schneider, 1994), (3) a decision about which proteins to consider as homologues (length-dependent cut-off for pairwise sequence identity; Sander & Schneider, 1991; Chothia & Lesk, 1986), and (4) a final refinement and extraction of the resulting multiple alignment. In general, prediction accuracy is better when predictions are based on better alignments. Better alignments are defined by: (i) fewer incorrectly aligned residues; (ii) greater divergence within the family of sequences. In practice, these two conditions conflict, in that less similar homologues are more likely to be mis-aligned.

Completely vs. almost automatic. The PHD prediction methods are automatically available via the internet service PredictProtein 20 (send the word help to PredictProtein@EMBL-Heidelberg.DE, or use the WWW interface 23). Users have the choice between the fully automatic procedure, which takes the query sequence through the entire cycle, and expert intervention in the generation of the alignment. For the Asilomar contest, both modes of operation were explored. The following changes were made with respect to the usual PredictProtein service: (1) rather than SWISS-PROT, a non-redundant data base of all known protein sequences was searched; (2) the cut-off for accepted homologues was lowered from 30% to 25% pairwise sequence identity; (3) the final list of putative homologues was inspected visually, and some proteins were excluded from the list. PredictProtein users typically continue with two additional time-consuming expert interventions: (4) visual correction of the final alignment, and (5) investigation of how the prediction of particular segments depends on the alignment. Steps 4-5 were not performed for the Asilomar targets.
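Changes (2) and (3) amount to a filter over the list of putative homologues. The following is a minimal sketch of that idea, not the PredictProtein code: the `Hit` record, the identifiers, and the identity values are hypothetical, and a single fixed percentage is assumed here in place of the length-dependent cut-off actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One putative homologue from the data base search (hypothetical record)."""
    identifier: str
    percent_identity: float  # pairwise sequence identity to the query, in percent


def select_homologues(hits, cutoff, excluded=()):
    """Keep hits at or above the identity cut-off, minus manually excluded ones."""
    banned = set(excluded)
    return [h for h in hits
            if h.percent_identity >= cutoff and h.identifier not in banned]


# Invented example values, for illustration only.
hits = [Hit("homA", 62.0), Hit("homB", 31.5), Hit("homC", 27.0), Hit("homD", 18.0)]

# Usual service: 30% cut-off, no manual exclusion.
default_set = select_homologues(hits, cutoff=30.0)

# Semi-automatic mode: 25% cut-off, one hit removed after visual inspection.
casp_set = select_homologues(hits, cutoff=25.0, excluded=["homB"])
```

Lowering the cut-off admits more divergent sequences (here `homC`), while the manual exclusion stands in for the visual inspection of the final list.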

Prediction targets. Secondary structure and residue solvent accessibility were predicted for all Asilomar targets. Here, results were compiled for the 15 targets for which structures were available. Secondary structure predictions comprised the predicted secondary structure state (helix, strand, other) and a reliability index for each residue; relative solvent accessibility predictions gave the percentage of predicted solvent accessibility (additionally projected onto a two-state model: buried ≤ 16%, exposed > 16%) and a reliability index for each residue.
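The two-state projection and the resulting per-residue score can be sketched as follows. This is an illustration under stated assumptions: the function names and the example values are hypothetical; only the 16% threshold is taken from the text.

```python
def accessibility_state(relative_accessibility, threshold=16.0):
    """Return 'b' (buried) for values <= threshold percent, otherwise 'e' (exposed)."""
    return "b" if relative_accessibility <= threshold else "e"


def q2(observed, predicted, threshold=16.0):
    """Percentage of residues whose two-state accessibility is predicted correctly."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(
        accessibility_state(o, threshold) == accessibility_state(p, threshold)
        for o, p in zip(observed, predicted)
    )
    return 100.0 * correct / len(observed)


# Invented relative accessibilities (percent) for five residues.
observed = [0.0, 12.0, 16.0, 40.0, 75.0]
predicted = [4.0, 25.0, 9.0, 36.0, 16.0]
score = q2(observed, predicted)
```

In this toy example three of the five residues fall into the same state in both lists.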


Results

What went well?

Prediction accuracy within expected range. (1) Secondary structure prediction: the accuracy was about 74% (three-state per-residue and per-segment scores). (2) Solvent accessibility prediction: the accuracy was about 69% (two-state score); the correlation between observed and predicted relative solvent accessibility was 0.5; predictions were best for residues observed in strands, and worst for residues with no regular secondary structure; predictions were best for the charged amino acids aspartic acid, glutamic acid, and lysine, and for methionine. Overall, secondary structure prediction accuracy using PHD on the Asilomar proteins was slightly higher than expected; solvent accessibility prediction accuracy was slightly lower than expected 20.
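For readers unfamiliar with the three-state per-residue score, a minimal sketch is given below. It assumes that observed and predicted states are given as equal-length strings over H (helix), E (strand), and L (other); the letter L and the example strings are assumptions made for illustration. The per-segment score is more involved and is not reproduced here.

```python
def q3(observed, predicted):
    """Three-state per-residue accuracy: percentage of identical positions."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Invented 16-residue example: a helix predicted too short, a strand too long.
observed = "LLHHHHHHLLEEEELL"
predicted = "LLHHHHLLLLEEEEEL"
score = q3(observed, predicted)
```

Thirteen of the sixteen positions agree in this example.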

Reliability of prediction correlated with accuracy. Prediction accuracy varied widely between proteins (Fig. 1A). However, the reliability of PHD predictions made it possible to estimate on which side of this distribution the prediction for a given protein was likely to fall (Fig. 1B). Furthermore, the reliability index correctly identified the individual residues for which the prediction was more likely to be accurate (Fig. 1C). For example, half of all residues predicted in a helix were predicted with the highest reliability index; 90% of these were correctly predicted (Fig. 1C).
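The non-cumulative statistic behind such statements can be sketched as follows: for each value of the reliability index, count the residues predicted in a given state and the fraction of them observed in that state. The function name, the letter L for "other", and the example data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_reliability(observed, predicted, reliability, state="H"):
    """For residues predicted in `state`, map each RI value to (count, percent correct)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for obs, pred, ri in zip(observed, predicted, reliability):
        if pred == state:
            totals[ri] += 1
            hits[ri] += obs == pred  # True counts as 1, False as 0
    return {ri: (totals[ri], 100.0 * hits[ri] / totals[ri]) for ri in sorted(totals)}


# Invented 14-residue example; RI runs from 0 (low) to 9 (high).
observed = "LHHHHHHLLLEEEL"
predicted = "LHHHHHHHHLEEEL"
reliability = [9, 9, 9, 9, 9, 9, 7, 4, 3, 8, 9, 9, 9, 9]
table = accuracy_by_reliability(observed, predicted, reliability)
```

In this toy case the two helix residues predicted at low reliability are the ones that are wrong, which is the pattern the reliability index is meant to expose.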



Fig. 1. Prediction accuracy for CASP2 targets. (A) 1D prediction accuracy was described by three scores: (1) per-segment accuracy in predicting secondary structure (Sov3, defined in 33), (2) per-residue accuracy in predicting secondary structure (Q3, defined in 33), and (3) per-residue accuracy in predicting residue solvent accessibility (Q2, defined in 13). Exceptionally bad predictions did not coincide, i.e., the lowest value for each score did not occur for the same protein. In fact, for the three proteins for which secondary structure prediction was worst (t31, t32, t38), accessibility predictions were rather accurate; and the worst predicted accessibility, for t16, coincided with an extremely good secondary structure prediction. (B) The reliability index, scaled from 0 (low) to 9 (high), reflects the strength of the prediction for each residue. Here, the reliability index was averaged over each protein. The protein average correlated with the overall per-residue accuracy of secondary structure prediction: the worst predicted protein (t32) had the lowest average reliability index; the best predicted ones (t08, t42) had the highest average reliability indices. (C+D) The expected prediction accuracy can be raised above the 90% level at the expense of not predicting secondary structure for regions with a low reliability index. The plots answer the question: how likely was a residue predicted in an α-helix with a reliability index of n to be predicted correctly? The two plots were derived from different test sets: (C) reflects the statistics on the 15 Asilomar targets, (D) the statistics on a roughly 50 times larger set of 705 sequence-unique proteins. For example, prediction accuracy tended to surpass the 70% level for residues predicted at RI ≥ 6; about two-thirds of all residues were predicted at that level. To illustrate the fraction of residues predicted at the highest reliability: for the set of 705, about 40% of the helical residues were predicted at RI = 9 (93% of these were correct); for the Asilomar set, about 50% of the helical residues were predicted at RI = 9 (90% of these were correct). Note that C+D show the non-cumulative values. The cumulative distributions, answering the question 'How high is the expected prediction accuracy for all residues predicted at higher reliability?', are given elsewhere 10, 13, 12, 20, 29.



More informative alignments yielded better predictions. By manually improving the multiple alignments used for the PredictProtein server (Fig. 1), the prediction of secondary structure improved from 70% (three-state per-residue accuracy for fully-automatic alignment selections from PredictProtein server 20) to 74% (semi-automatic alignment selection used for CASP2 submissions). Solvent accessibility predictions were improved from 67% (two-state per-residue accuracy for fully-automatic alignment selection) to 69% (CASP2 submissions).

What went wrong?

Confusing helices and strands. The two examples shown in Fig. 2 represented the worst cases for the secondary structure prediction in terms of per-residue and per-segment accuracy. However, even more fatal for using 1D predictions for further steps towards 3D prediction were cases for which the prediction confused helices and strands. On average, such bad predictions were made for about 8% of all residues. Particularly bad were the values for target t11 (19% of residues confused), and for target t32 (Fig. 2; 13% of residues confused). None of the confused segments was predicted with high reliability indices.
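The percentage of confused residues quoted above can be computed along the following lines. This is a minimal sketch assuming state strings over H (helix), E (strand), and L (other); the function name and the example are hypothetical.

```python
def helix_strand_confusion(observed, predicted):
    """Percentage of residues observed as helix but predicted as strand, or vice versa."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    confused = sum({o, p} == {"H", "E"} for o, p in zip(observed, predicted))
    return 100.0 * confused / len(observed)


# Invented 12-residue example: most of an observed strand is predicted as helix.
observed = "HHHHHLLEEEEL"
predicted = "HHHHHLLHHHEL"
confusion = helix_strand_confusion(observed, predicted)
```

Errors that swap a regular state with "other" are deliberately not counted; only helix/strand swaps contribute.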



Fig. 2. Examples of prediction errors. Two examples of errors in secondary structure prediction. Secondary structure prediction was worst for these two proteins (Fig. 1A; note: for t38 the prediction had to be based on a single sequence). Abbreviations used: AA, amino acid in one-letter code; Obs, secondary structure assignment based on 3D structure by DSSP 34; PHD, prediction by neural network system; RI, reliability of prediction (0 is low, 9 is high). Symbols for secondary structure assignments: H, α-helix; E (extended), β-strand; blank, other.



Helices too long, strands too short. Predicted helices were longer on average than observed ones (12.5 vs. 10.3 residues); predicted strands were shorter on average than observed ones (5.2 vs. 6.7 residues). These values were not representative of the PHD averages, and might have originated from the unusually high percentages of secondary structure in the 15 CASP2 targets (38% helix, 24% strand compared to about 32% helix and 21% strand in a representative subset of PDB 24, data not given).
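Average segment lengths of this kind follow directly from the state strings. A minimal sketch, assuming strings over H, E, and L (other); the function name and the example strings are hypothetical:

```python
from itertools import groupby


def mean_segment_length(states, state):
    """Average length of the uninterrupted runs of `state` (0.0 if there are none)."""
    lengths = [sum(1 for _ in run) for symbol, run in groupby(states) if symbol == state]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Invented example in which helices come out too long and the strand too short.
observed = "LHHHHHHHHLLEEEEEELLHHHHL"
predicted = "LHHHHHHHHHHLEEEELLLHHHHL"

helix_obs = mean_segment_length(observed, "H")
helix_pred = mean_segment_length(predicted, "H")
strand_obs = mean_segment_length(observed, "E")
strand_pred = mean_segment_length(predicted, "E")
```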

Structural class predicted at accuracy levels below average. The composition of secondary structure enables a rough classification of proteins into structural classes 25. On average, secondary structure predictions from PHDsec place 75% of all proteins correctly in one of the four classes: all-α, all-β, α/β, other 20, 26. For the CASP2 targets, the classification was correct for only 67% of the proteins. The dominant error was to predict strands for all-α proteins (and consequently to place those proteins into the class 'other' rather than into the class 'all-α'). However, the average content of secondary structure was predicted about as accurately as expected: differences between predicted and observed compositions were 7% for helix and 9% for strand. Thus, the difference between CASP2 and expected classification error could be attributed to the small data set.
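The dominant error can be illustrated with a content-based classifier. The sketch below is an assumption-laden illustration: the paper does not state the class thresholds, so the percentages used here are assumed values chosen for the example, and the function name and sequences are hypothetical.

```python
def structural_class(states, helix="H", strand="E"):
    """Classify a protein by helix and strand content (thresholds are assumptions)."""
    n = len(states)
    if n == 0:
        raise ValueError("empty state string")
    pct_h = 100.0 * states.count(helix) / n
    pct_e = 100.0 * states.count(strand) / n
    # ASSUMED thresholds, not taken from this paper.
    if pct_h > 45 and pct_e < 5:
        return "all-alpha"
    if pct_h < 5 and pct_e > 45:
        return "all-beta"
    if pct_h > 30 and pct_e > 20:
        return "alpha/beta"
    return "other"


# Invented 20-residue example: 60% helix observed, no strand.
observed = "H" * 12 + "L" * 8
# The prediction keeps most of the helix but adds a short, spurious strand.
predicted = "H" * 10 + "L" * 7 + "E" * 3

observed_class = structural_class(observed)
predicted_class = structural_class(predicted)
```

With thresholds of this kind, a few falsely predicted strand residues are enough to move an all-helical protein into the class 'other', even though the predicted helix content is close to the observed one.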

Overprediction of buried residues. Most buried residues (≤ 16% relative solvent accessibility) were predicted as buried (76%). However, this was accomplished by an overprediction of buried residues, as only 60% of the residues predicted to be buried were actually observed in that state. The dominant error was a strong overprediction of completely buried (0% accessible) residues. In general, residues were clearly overpredicted in the ranges 49-64% accessibility, and clearly underpredicted in the ranges 1-4% and 64-81%. (Note: the other side of the same coin was that exposed residues were underpredicted: 80% of the residues predicted to be exposed were observed in that state, however, only 64% of the residues observed in the exposed state were actually predicted.)
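The two percentages quoted for buried residues measure different things: the fraction of observed buried residues that were predicted as buried, and the fraction of predicted buried residues that were observed as buried. A minimal sketch of both, with hypothetical names and invented accessibility values (only the 16% threshold comes from the text):

```python
def buried_statistics(observed, predicted, threshold=16.0):
    """Return (percent of observed buried that are predicted buried,
    percent of predicted buried that are observed buried)."""
    obs_buried = [o <= threshold for o in observed]
    pred_buried = [p <= threshold for p in predicted]
    if not any(obs_buried) or not any(pred_buried):
        raise ValueError("need at least one observed and one predicted buried residue")
    both = sum(o and p for o, p in zip(obs_buried, pred_buried))
    coverage = 100.0 * both / sum(obs_buried)
    correctness = 100.0 * both / sum(pred_buried)
    return coverage, correctness


# Invented relative accessibilities (percent) for eight residues.
observed = [0, 5, 10, 16, 30, 50, 70, 20]
predicted = [0, 0, 20, 10, 5, 60, 80, 12]
coverage, correctness = buried_statistics(observed, predicted)
```

A method that labels too many residues as buried raises the first number at the expense of the second, which is the pattern described above.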

Why?

Correct alignment crucial for correct prediction. Alignments used as input to the PHD neural networks should be both informative (high level of diversity; many sequences) and correct. The semi-automatic generation of multiple alignments used for the CASP2 submissions clearly improved the information content of the alignments, and thus prediction accuracy. However, including proteins from the twilight zone 27 of 25-30% pairwise sequence identity may be fatal in two ways. Firstly, some of the included proteins may not be structurally similar to the protein for which 1D structure is predicted. Secondly, the lower the level of pairwise sequence identity, the higher the likelihood of mis-aligning some residues. This became particularly obvious when the alignments for the worst predicted proteins were changed (after the meeting). Secondary structure prediction accuracy could be increased by simply excluding some less likely family members: for target t31, Q3 (three-state per-residue accuracy) rose from 62% to 68%, and Sov3 (three-state per-segment accuracy) from 54% to 65%; for target t32, Q3 from 53% to 56%, Sov3 from 54% to 56%; for target t38, Q3 from 57% to 63%, Sov3 from 42% to 48%. Prediction accuracy was clearly below average for proteins for which no alignments were available (such as for t38, Fig. 2). The second effect, that of falsely aligned residues, was difficult to estimate. However, the extent of the first effect illustrated that alignment errors were fatal.

Prediction accuracy lower for unusual proteins. The PHD neural networks were trained on globular water-soluble proteins; predictions tend to be wrong for other proteins. One example from the CASP2 set was t32, a small protein (98 residues; Fig. 2) which is stabilised by three cysteine-bridges. Fundamental mistakes in the secondary structure prediction were around the cysteine-bridges (Fig. 2). However, on average proteins with cysteine-bridges were not predicted less accurately (Arthur Lesk, this issue). For the prediction of solvent accessibility another effect becomes crucial, namely the interaction between protein chains. Overall accessibility was predicted worst for t16. However, many of the residues 'falsely' predicted as buried were actually observed at interfaces between the three chains of the protein (data not shown). In general, residues predicted to be buried and observed to be exposed often indicate binding interfaces 28.

Confusing helices and strands partly due to using local information. A fatal error for prediction-based modelling is the confusion of helices and strands. Exactly this fatal error happens frequently for PHD predictions (for seven of the 15 CASP2 targets; statistics on a larger data set at: 29). Often the beginning and the end of the confused segments are correctly identified (target t02, strand 13; target t11, strand 1; target t14, strand 4, helix 8; target t16, helix 2; target t31, strand 8). For only two confused segments was the reliability of at least one residue above a value of RI = 6 (helix 8 in t14, and helix 1 in t16). Nevertheless, how can a segment be placed correctly if the type is confused? Secondary structure formation is partly determined by residue interactions non-local in sequence. Such information is captured by the PHD predictions only to some extent. A region may have a higher preference for forming a helix than a strand (or vice versa), but interactions non-local in sequence may render the formation of a β-sheet (α-helix) energetically more favourable. Indeed, the confusion between helices and strands can often be attributed to hydrogen bonds stabilised by non-local inter-residue contacts 30.

15 proteins are not representative. Some of the 'errors' were specific for the CASP2 targets. The major reason for that was that 15 proteins were not enough to comprise a representative sub-set of all proteins (difference between Fig. 1C and D).


Conclusions: what did we learn?

Easy to be wise afterwards? Inspecting the examples for which predictions went wrong tended to produce arguments for why that was so. However, such reasoning in some cases appeared rather premature: proteins for which secondary structure was predicted below average tended to differ from those for which solvent accessibility was predicted below average (Fig. 1A), although many of the arguments would apply to both prediction methods (such as the stabilisation by cysteine-bridges for target t32).

Generating more informative alignments is straightforward. The difference in prediction accuracy between the fully-automatic and the semi-automatic selection of the alignment (two to four percentage points) illustrated that prediction accuracy could be improved significantly without changing the final prediction method (PHD 20). The procedure used for the CASP2 submission could be automated. (The major technical problem, currently, is the lack of CPU resources available at EMBL for the PredictProtein service.) Another point was illustrated for the CASP2 targets: monitoring how predictions change in response to the alignment (including more or fewer proteins) is an excellent means of arriving at better expert-driven predictions.

CASP: good for testing methods, but not representative. 1D structure predictions are excellent test cases for prediction methods in general, since we have large data sets on which we can estimate prediction accuracy. Such tests reveal that prediction accuracy differs between proteins (with one standard deviation of about ten percentage points). How many proteins are representative? To approach the answer, consider the following experiment: first, the average prediction accuracy and its standard deviation are compiled for a set of 705 unique protein chains 29; second, 20 chains are picked at random from the set of 705, and this is repeated until the average accuracy and standard deviation match those of the set of 705 proteins. How many repeats would it take? The answer is: about five to ten. Thus, the following conclusions from 1D predictions in CASP2 emerge for users (and editors): do not place too much trust in methods that (1) were not tested in CASP, (2) revealed much lower values of accuracy than published, or (3) were successful in CASP, but never evaluated on larger data bases.
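The effect of evaluating on a small subset can be explored with a simple simulation. The sketch below does not reproduce the experiment described above: the per-protein accuracies are synthetic (drawn from a normal distribution with an assumed mean of 72% and the standard deviation of ten percentage points quoted in the text), and all names are hypothetical. It only shows how much the average over 20 randomly chosen proteins scatters around the average over all 705.

```python
import random
import statistics


def subset_means(accuracies, subset_size, repeats, seed=0):
    """Average accuracy of `repeats` random subsets drawn without replacement."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(accuracies, subset_size)) for _ in range(repeats)]


# Synthetic stand-in for per-protein accuracies of 705 chains (assumed values).
generator = random.Random(1)
population = [generator.gauss(72.0, 10.0) for _ in range(705)]

means = subset_means(population, subset_size=20, repeats=1000)
spread = statistics.stdev(means)  # scatter of the 20-protein averages
```

For a per-protein standard deviation of ten percentage points, the average over 20 proteins is expected to scatter by roughly 10/sqrt(20), i.e. about two percentage points, which is of the same order as the differences between methods discussed here.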

1D predictions now accurate enough as first step in structure prediction. Many of the third generation predictions of 1D structure are accurate enough to become a first step in predicting higher dimensions of protein structure (Arthur Lesk, this issue). A prominent application of PHD predictions was threading of the CASP2 targets (e.g. Murzin, or Fischer, Eisenberg et al., this issue). Even an automatic PHD-threading procedure 31, 32 yielded relatively good results for recognising the correct fold.

Acknowledgements

Thanks to Sean O'Donoghue (EMBL, Heidelberg) for helpful discussions and for critically reading the manuscript. Particular thanks to all those who contributed essentially to the CASP2 meeting by making their experimental structure determinations available prior to publication, to the assessors Arthur Lesk (LMB, Cambridge), and Michael Levitt (Stanford University), and, last, certainly not least, to all those who were involved in organising that meeting, to name a few: John Moult (CARB, Washington), Tim Hubbard (Sanger Centre, England), Stephen Bryant (NIH, Washington), Jan Pedersen (CARB, Washington), and Krystof Fidelis (LNL, Livermore).


References

  1. Rost, B. and O'Donoghue, S. I. Sisyphus and prediction of protein structure. CABIOS in press, 1997.
  2. Rost, B. and Sander, C. Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct. 25:113-136, 1996.
  3. Kabsch, W. and Sander, C. How good are predictions of protein secondary structure? FEBS Lett. 155:179-182, 1983.
  4. Fasman, G. D. Prediction of protein structure and the principles of protein conformation. New York, London: Plenum, 1989.
  5. Maxfield, F. R. and Scheraga, H. A. Improvements in the Prediction of Protein Topography by Reduction of Statistical Errors. Biochem. 18:697-704, 1979.
  6. Zvelebil, M. J., Barton, G. J., Taylor, W. R. and Sternberg, M. J. E. Prediction of protein secondary structure and active sites using alignment of homologous sequences. J. Mol. Biol. 195:957-961, 1987.
  7. Gascuel, O. and Golmard, J. L. A simple method for predicting the secondary structure of globular proteins: implications and accuracy. CABIOS 4:357-365, 1988.
  8. Kabsch, W. and Sander, C. Segment. unpublished 1983.
  9. Gerloff, D. L., Jenny, T. F., Knecht, L. J., Gonnet, G. H. and Benner, S. A. The nitrogenase MoFe protein. FEBS Lett. 318:118-124, 1993.
  10. Rost, B. and Sander, C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232:584-599, 1993.
  11. Benner, S. A., Badcoe, I., Cohen, M. A. and Gerloff, D. L. Bona Fide Prediction of Aspects of Protein Conformation. J. Mol. Biol. 235:926-958, 1994.
  12. Rost, B. and Sander, C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19:55-72, 1994.
  13. Rost, B. and Sander, C. Conservation and prediction of solvent accessibility in protein families. Proteins 20:216-226, 1994.
  14. Wako, H. and Blundell, T. L. Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins I. Solvent accessibility classes. J. Mol. Biol. 238:682-692, 1994.
  15. Wako, H. and Blundell, T. L. Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins II. Secondary structures. J. Mol. Biol. 238:693-708, 1994.
  16. Barton, G. J. Protein secondary structure prediction. Curr. Opin. Str. Biol. 5:372-376, 1995.
  17. Gerloff, D. L., Chelvanayagam, G. and Benner, S. A. A predicted consensus structure for the protein-kinase c2 homology (c2h) domain, the repeating unit of synaptotagmin. Proteins 22:299-310, 1995.
  18. Salamov, A. A. and Solovyev, V. V. Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignment. J. Mol. Biol. 247:11-15, 1995.
  19. Di Francesco, V., Garnier, J. and Munson, P. J. Improving protein secondary structure prediction with aligned homologous sequences. Prot. Sci. 5:106-113, 1996.
  20. Rost, B. PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol. 266:525-539, 1996.
  21. Thompson, M. J. and Goldstein, R. A. Predicting solvent accessibility: higher accuracy using Bayesian statistics and optimized residue substitution classes. Proteins 25:38-47, 1996.
  22. Devereux, J., Haeberli, P. and Smithies, O. GCG package. Nucl. Acids Res. 12:387-395, 1984.
  23. Rost, B. PredictProtein - internet prediction service. WWW document (http://dodo.bioc.columbia.edu/predictprotein): EMBL, 1997.
  24. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. and Tasumi, M. The Protein Data Bank: a computer based archival file for macromolecular structures. J. Mol. Biol. 112:535-542, 1977.
  25. Levitt, M. and Chothia, C. Structural patterns in globular proteins. Nature 261:552-558, 1976.
  26. Rost, B. Observed secondary structure content for 721 proteins. WWW document (http://dodo.bioc.columbia.edu/~rost/Res/96A-SecStrContent.html): EMBL Heidelberg, Germany, 1996.
  27. Doolittle, R. F. Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. Mill Valley California: University Science Books, 1986.
  28. Hubbard, T., Tramontano, A., Barton, G., Jones, D., Sippl, M., Valencia, A., Lesk, A., Moult, J., Rost, B., Sander, C., Schneider, R., Lahm, A., Leplae, R., Buta, C., Eisenstein, M., Fjellström, O., Floeckner, H., Grossmann, J. G., Hansen, J., Helmer-Citterich, M., Joergensen, F. S., Marchler-Bauer, A., Osuna, J., Park, J., Reinhardt, A., Ribas de Pouplana, L., Rojo-Dominguez, A., Saudek, V., Sinclair, J., Sturrock, S., Venclovas, C. and Vinals, C. Update on protein structure prediction: results of the 1995 IRBM workshop. Folding & Design 1:R55-R63, 1996.
  29. Rost, B. Expected prediction accuracy of PHD. WWW document (http://dodo.bioc.columbia.edu/~rost/Res/96D-ExpAccuracyPHD.html): EMBL Heidelberg, Germany, 1996.
  30. Rychlewski, L. and Godzik, A. Secondary structure predictions: in quest of forces that shape the local protein structure. Preprint: The Scripps Research Institute; 10666 N. Torrey Pines Road; La Jolla, CA 92037, USA, 1996.
  31. Rost, B. TOPITS: Threading One-dimensional Predictions Into Three-dimensional Structures. In: Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T. and Wodak, S. (eds.). Third International Conference on Intelligent Systems for Molecular Biology. Cambridge, England: Menlo Park, CA: AAAI Press, 1995:314-321.
  32. Rost, B., Schneider, R. and Sander, C. Protein fold recognition by prediction-based threading. J. Mol. Biol. 270:471-480, 1997.
  33. Rost, B., Sander, C. and Schneider, R. Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 235:13-26, 1994.
  34. Kabsch, W. and Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers 22:2577-2637, 1983.