PHD: predicting 1D protein structure by profile-based neural networks

Rost, Burkhard

Meth. in Enzym. 1996, 266, 525-539 (ed.: Russ Doolittle)


Contact: Burkhard Rost (rost@EMBL-Heidelberg.de)







Introduction

We still cannot predict protein three-dimensional (3D) structure from sequence alone. However, we can predict 3D structure for one fourth of the known protein sequences (SWISSPROT 1) by homology modelling based on significant sequence identity (>25%) to known 3D structures (PDB 2). 3 For the remaining roughly 30,000 known sequences, the prediction problem has to be simplified. An extreme simplification is to predict projections of 3D structure, e.g., 1D secondary structure, solvent accessibility, or transmembrane location assignments for each residue.

Despite the extreme simplification, the success of 1D predictions has been limited as segments from single sequences (used as input) do not contain sufficient global information about 3D structures. 4, 5 Patterns of amino acid substitutions within sequence families are highly specific for the 3D structure of that family. Using such evolutionary information is the key to a significant improvement of 1D predictions.

In this review I describe three prediction methods that use evolutionary information as input to neural network systems to predict secondary structure (PHDsec 6-8), relative solvent accessibility (PHDacc 9), and transmembrane helices (PHDhtm 10). I shall also illustrate the possibilities and limitations in practical applications of these methods with results from careful cross-validation experiments on large sets of unique protein structures.

All predictions are made available by an automatic email prediction service (see Availability). The baseline conclusion after some 30,000 requests to the service 11 is that 1D predictions have become accurate enough to be used as a starting point for expert-driven modelling of protein structure. 12-14

Methods


Generating the multiple sequence alignment

The first step in a PHD prediction is generating a multiple sequence alignment. The second step is feeding the alignment into a neural network system. For prediction accuracy, it is as crucial that the alignment is correct as that it contains a broad spectrum of homologous sequences. By default, PHD uses the program MaxHom (Fig. 1), which generates a pairwise profile-based multiple alignment. 15 A key feature of MaxHom is the compilation of a length-dependent cut-off for significant pairwise sequence identity (Fig. 1). 15


Fig. 1
First, for each protein, the SWISSPROT data base is searched for sequence homologues with a fast alignment method (BLAST, Altschul, et al., J. Mol. Biol., 215, 403-410, 1990). Second, the list of putative homologues found is re-examined with a more sensitive profile-based multiple alignment method (MaxHom, Sander and Schneider, Proteins, 9, 56-68, 1991). Third, a length-dependent cut-off for significant pairwise sequence identity is applied (25%+5%, where '+5%' reflects a safety margin in the twilight zone; R. F. Doolittle, "Of URFs and ORFs: a primer on how to analyze derived amino acid sequences", Univ. Science Books, Mill Valley, California, 1986).
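The third step of Fig. 1 can be sketched as a small filter. This is only an illustration: the 25%+5% rule comes from the text, whereas the shape of the length-dependent curve (the constants 290.15, -0.562, and 24.8) is an assumption based on the commonly quoted form of the Sander and Schneider threshold, and the function names and the tuple layout of a hit are hypothetical.

```python
import math


def identity_cutoff(aligned_length: int, margin: float = 5.0) -> float:
    """Minimal percentage identity accepted for an alignment of this length.

    ASSUMPTION: curve constants follow the commonly quoted form of the
    Sander and Schneider (1991) threshold; only '25% + 5%' is from the text.
    """
    if aligned_length < 10:
        return math.inf                      # hypothetical: too short to judge
    if aligned_length <= 80:
        base = 290.15 * aligned_length ** -0.562
    else:
        base = 24.8                          # long alignments: about 25%
    return base + margin                     # '+5%' safety margin


def select_homologues(hits, margin: float = 5.0):
    """Keep hits (name, aligned_length, percent_identity) above the cut-off."""
    return [hit for hit in hits if hit[2] >= identity_cutoff(hit[1], margin)]
```

Short alignments therefore need a much higher identity to be accepted than long ones, which is the point of making the cut-off length dependent.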




Multiple levels of computations

The PHD methods process the input information on multiple levels (Fig. 2). The first level is a feed-forward neural network with three layers of units (input, hidden, and output). Input to this first level sequence-to-structure network consists of two contributions: one from the local sequence, i.e., taken from a window of 13 adjacent residues, and another from the global sequence (Fig. 2). Output of the first level network is the 1D structural state of the residue at the centre of the input window. For PHDsec and PHDhtm the second level is a structure-to-structure network (see below). The next level consists of an arithmetic average over independently trained networks (jury decision). The final level is a simple filter.


Fig. 2
First, a window of w = 13 adjacent residues is chosen from the alignment (here I show only w = 7 for clarity). Second, for each residue the profile and global information is compiled from the protein. Third, the local and global information is fed into neural network systems. PHDsec and PHDhtm consist of two network levels. First level, sequence-to-structure networks: for each residue position 24 units are used, 20 for the amino acid types, one for a 'spacer' allowing the window to extend over the protein ends (so that the first and last residues in a protein can be at the centre of one input window), two for the numbers of insertions (ins) and deletions (del) in the alignment at that position, and one for the conservation weight (cons); the global information is coded by 20 units for the amino acid composition, four for the protein length, and eight for the distances of the window with respect to the protein ends. The output units code for the 1D structural state of the central residue. For PHDsec, three output units code for helix, strand, and rest; for PHDacc, ten units code for ten levels of relative solvent accessibility (e.g., if the fourth unit has the maximal value, then the prediction is a relative solvent accessibility ≥ 9% and < 16%); and for PHDhtm, two units code for transmembrane or not transmembrane helix. Second level, structure-to-structure networks: the output of the first level is fed into a second level of structure-to-structure network, which additionally uses global information and the conservation weight as input, e.g., for PHDsec: first level output = 3 units -> local input to second level = 3 + 1 (spacer) + 1 (cons). The output coding of the second level is the same as that of the first level.
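The unit counts in the caption can be checked with a short sketch of the local input coding. The counts (24 units per position, 32 global units, w = 13) are from the caption; the order of units within a position, the use of zeros beyond the chain ends, and all function names are assumptions made for illustration.

```python
UNITS_PER_POSITION = 20 + 1 + 2 + 1   # amino acids, spacer, ins/del, cons
GLOBAL_UNITS = 20 + 4 + 8             # composition, length, end distances


def encode_window(profile, ins, dels, cons, centre, w=13):
    """Local first-level input for the window centred on residue `centre`.

    profile: one 20-element frequency list per residue
    ins, dels, cons: one number per residue
    """
    half = w // 2
    units = []
    for pos in range(centre - half, centre + half + 1):
        if 0 <= pos < len(profile):
            units.extend(profile[pos])                  # 20 amino acid units
            units.append(0.0)                           # spacer off
            units.extend([ins[pos], dels[pos], cons[pos]])
        else:
            units.extend([0.0] * 20)                    # beyond the chain end
            units.append(1.0)                           # spacer on
            units.extend([0.0, 0.0, 0.0])
    return units


def total_input_units(w=13):
    """Local plus global input units of the first-level network."""
    return w * UNITS_PER_POSITION + GLOBAL_UNITS
```

Under these assumptions a first-level network with w = 13 has 13 x 24 + 32 = 344 input units.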




Number of output units determined by task

Secondary structure is coded by three units: helix, H (H, G, and I in DSSP 16); strand, E (E and B in DSSP 16); and none of the above, denoted loop, L. Transmembrane locations are coded by two units, one for residues being in a transmembrane helix, the other for non-membrane bound residues (assignments from SWISSPROT 1). For solvent accessibility the output coding is not as straightforward. Firstly, the value for accessibility is normalised to a relative accessibility (observed accessibility taken from DSSP 16 divided by maximal accessibility of a given residue type 9, 17) to enable a comparison between residues of different sizes. Secondly, the relative accessibility is projected onto ten states (for technical reasons; Fig. 2). 9
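The two output codings can be sketched as follows. The reduction of DSSP states to H, E, and L is as stated above. The boundaries of the ten accessibility states are an assumption: squares of the state index are used because this reproduces the one example given (fourth unit: at least 9% and below 16%). The maximal accessibility per residue type is passed in by the caller, since the values are not listed here.

```python
import math

# DSSP states H, G, I count as helix; E and B as strand; all others as loop.
DSSP_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp_states: str) -> str:
    """Map a string of DSSP states onto the three states H, E, L."""
    return "".join(DSSP_TO_THREE.get(state, "L") for state in dssp_states)


def relative_accessibility(observed: float, max_for_residue: float) -> float:
    """Observed accessibility as a percentage of the residue-type maximum."""
    return min(100.0, 100.0 * observed / max_for_residue)


def accessibility_state(rel_acc_percent: float) -> int:
    """Index (0-9) of the output unit for a relative accessibility.

    ASSUMPTION: state i covers [i*i, (i+1)*(i+1)) percent.
    """
    return min(9, int(math.sqrt(rel_acc_percent)))
```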

Better segment prediction by structure-to-structure networks

The output coding for the second level network is identical to the one for the first (Fig. 2). The dominant input contribution to the second level structure-to-structure network is the output of the first level sequence-to-structure network. The reason for introducing a second level is the following. Networks are trained by changing the connections between the units such that the error is reduced for each of the examples successively presented to the network during training. The examples are chosen at random. Therefore, the examples taken at time step t and at time step t+1 are usually not adjacent in sequence. This implies that the network cannot learn that, e.g., helices contain at least three residues. The second level structure-to-structure network introduces a correlation between adjacent residues with the effect that predicted secondary structure segments or transmembrane helices have length distributions similar to the ones observed. 6, 7

Balanced predictions by balanced training

For the prediction of secondary structure and transmembrane helices, the distribution of the examples is rather uneven: about 32% of the residues are observed in helix, 21% in strand, and 47% in loop; about 18% of the residues in integral transmembrane proteins are located in transmembrane helices. Choosing the training examples in proportion to their occurrence in the data set (unbalanced training) results in a prediction accuracy that mirrors this distribution, e.g., strands are predicted less accurately than helix or loop. 18-20 A simple way around the data base bias is balanced training: at each time step one example is chosen from each class, i.e., one window with the central residue in a helix, one with the central residue in a strand, and one representing the loop class. This training results in a prediction accuracy well balanced between the output states. 6, 7
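The difference between the two sampling schemes can be sketched in a few lines. The data layout (a dictionary from class label to a list of training windows) and the function names are hypothetical; only the rule 'one example from each class per time step' is taken from the text.

```python
import random


def balanced_batch(examples_by_class, rng):
    """Balanced training: pick one window per class at each time step."""
    return [(label, rng.choice(windows))
            for label, windows in sorted(examples_by_class.items())]


def unbalanced_sample(examples_by_class, rng):
    """Unbalanced training: pick one window in proportion to occurrence."""
    pool = [(label, window)
            for label, windows in examples_by_class.items()
            for window in windows]
    return rng.choice(pool)
```

With 32 helix, 21 strand, and 47 loop windows, the unbalanced scheme presents loop more than twice as often as strand, whereas every balanced batch contains exactly one window of each class.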

Compromise between over- and under-prediction by jury decision

Balanced training results in improved predictions for the less populated output states (e.g. strand). However, this is associated with less accurate predictions for more populated states (loop). Consequently, the overall accuracy is lower for the balanced than for the unbalanced prediction. To find a compromise between networks with balanced and those with unbalanced training, a final jury decision is performed (effectively a compromise between over- and under-prediction). The jury decision is a simple arithmetic average over, typically, four differently trained networks: all combinations of first level networks with balanced or unbalanced training, and with balanced or unbalanced training of second level networks (2 × 2). The final prediction is assigned to the unit with maximal output value ('winner takes all').
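A minimal sketch of the jury decision for one residue, assuming each network returns one value per output unit in the order helix, strand, loop (the function name and the state string are illustrative):

```python
def jury_decision(network_outputs, states="HEL"):
    """Average the outputs of several networks, then pick the winner unit.

    network_outputs: one list of output values per network, all in the
    same unit order as `states`.
    """
    n_networks = len(network_outputs)
    averaged = [sum(output[i] for output in network_outputs) / n_networks
                for i in range(len(states))]
    winner = max(range(len(states)), key=lambda i: averaged[i])
    return states[winner], averaged
```

For example, four networks with outputs (0.5, 0.25, 0.25), (0.25, 0.5, 0.25), (0.5, 0.25, 0.25), and (0.75, 0.0, 0.25) average to (0.5, 0.25, 0.25), so the residue is assigned to helix even though one network favoured strand.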

Correcting obvious errors by final filter

For secondary structure prediction (PHDsec), the filter affects only drastic, unrealistic predictions (e.g.: HEH -> HHH; EHE -> EEE; and LHL -> LLL). For accessibility prediction (PHDacc), the filter performs an average over neighbouring output units (i.e. not over adjacent residues). Only the filter used for predicting transmembrane helices (PHDhtm) is crucial for the performance. The currently implemented filter has been guided by previous experience. 21-24 Predicted transmembrane helices which are too long are either split or shortened. Predicted transmembrane helices which are too short are either elongated or deleted. All these decisions (split or shorten; elongate or delete) are based on the strength of the prediction and on the length of the transmembrane helix predicted. 10
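The three PHDsec rules quoted above can be written as a simple pass over the predicted string. This sketch covers only these three examples; whether the actual filter applies further rules is not stated here, and the function name is illustrative.

```python
# Triplets from the text: HEH -> HHH, EHE -> EEE, LHL -> LLL.
FILTER_RULES = {"HEH": "H", "EHE": "E", "LHL": "L"}


def filter_secondary_structure(predicted: str) -> str:
    """Replace the middle residue of the unrealistic triplets listed above.

    Triplets are looked up in the unfiltered prediction, so one correction
    never triggers another.
    """
    filtered = list(predicted)
    for i in range(1, len(predicted) - 1):
        triplet = predicted[i - 1:i + 2]
        if triplet in FILTER_RULES:
            filtered[i] = FILTER_RULES[triplet]
    return "".join(filtered)
```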

Avoiding the trap of over-estimating prediction accuracy

The three necessary conditions for an appropriate evaluation of prediction accuracy are: first, that training and testing sets are distinct; second, that the testing set is representative; and third, that free parameters are not optimised on the test set which is used for the final evaluation. To explain these conditions in more detail: first, the criterion for distinct sets is that no protein in one set has more than 25% pairwise sequence identity to any protein in the other. 15 Second, the test set has to be representative for the data base (ideally for all existing proteins), i.e., all known sequence families should be included, and they should be included only once. Third, no free parameter should be optimised with respect to the test set. A simple protocol for correct testing would be the following. (1) Choose a small test set ('pre-test', some 10 proteins), and adjust free parameters; (2) keeping the network fixed, compile the accuracy for all test proteins ('real test', > 100 proteins by cross-validation experiments; note that the number of splits between test and training sets for cross-validation is of no interest for the user); (3) apply the same network to another test set never used before ('pre-release test', e.g., protein structures experimentally determined after the project had started). A lower level of accuracy for the 'pre-release test' than for the 'real test' indicates an over-fitting of free parameters. Step three should be re-applied whenever a considerable number of new structures have been added to the data base (Table I).
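The first condition is easy to verify automatically. The sketch below assumes that pairwise sequence identities have already been computed (e.g., from alignments) and are supplied as a lookup function; the helper names and the table layout are hypothetical.

```python
def sets_are_distinct(train, test, identity, max_identity=25.0):
    """True if no training protein exceeds the identity limit to a test protein.

    identity: callable (a, b) -> percentage pairwise sequence identity.
    """
    return all(identity(a, b) <= max_identity for a in train for b in test)


def make_lookup(table):
    """Hypothetical helper: wrap a dict of precomputed identities."""
    def identity(a, b):
        # Pairs without an entry are treated as unrelated (0% identity).
        return table.get((a, b), table.get((b, a), 0.0))
    return identity
```

A split that fails this check should be repaired before any accuracy is reported, since homologues shared between the sets inflate the estimate.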


Table I






Results


Values for expected prediction accuracy are distributions

Statements such as 'secondary structure is about 90% conserved within sequence families' 25, or 'solvent accessibility is about 85% conserved within sequence families' 9 refer to averages of distributions. The same holds for the expected prediction accuracy (Fig. 3). Such distributions explain why some developers have over-estimated the performance of their tools using data sets of only tens of proteins (or even fewer). 5 For the user interested in a certain protein, the distributions imply a rather unfortunate message: for that protein, the accuracy could be lower than 40%, or it could be higher than 90% (Fig. 3). For some of the worst predicted proteins, the low level of accuracy could be anticipated from their unusual features, e.g., for crambin, or the antifreeze glycoprotein type III. However, for others the reasons for the failure of PHDsec are not obvious, e.g., both the phosphatidylinositol 3-kinase 26 and the Src-homology domain of cytoskeletal spectrin have homologous structure 27 but prediction accuracy varies between less than 40% (pik) and more than 70% (spectrin). Another possible reason for a bad prediction is a bad alignment. In general, single sequences yield accuracy values about ten percentage points lower than multiple alignments. 6 Indeed, the worst case for a prediction so far is pheromone (1erp), a short protein structurally dominated by a disulphide bridge, for which there is no sequence alignment available: only 32% of the residues are predicted correctly.


Fig. 3 The expected variation of prediction accuracy with protein chain. (a) Three-state per-residue overall accuracy for PHDsec (total of 337 chains). (b) Two-state per-residue overall accuracy for PHDacc (total of 318 chains). Given are the distributions, averages and one standard deviation. (c) Depicts the cumulative percentage of protein chains predicted at an error level lower than the value given (error = 100 - accuracy). Error values are percentages of falsely predicted residues (PHDsec, PHDacc) and falsely predicted segments (PHDhtm). E.g., for one half of all chains, PHDhtm predicts all segments correctly (note, total set for evaluating PHDhtm comprises only 69 chains), whereas PHDsec and PHDacc rate at about 25% falsely predicted residues.





Reliability of prediction correlates with accuracy

An estimate of where in the distributions (Fig. 3) a given prediction is to be expected is given by the prediction strength, i.e., the difference between the output unit with highest value (winner unit) and the output unit with the next highest value. This difference is used to define a reliability index for the prediction of each residue (normalised to a scale from 0 (low) to 9 (high)). Residues with higher reliability index are predicted with higher accuracy (Fig. 4). In practice, the reliability index offers an excellent tool to focus on some key regions predicted at high levels of expected accuracy. (Note, however, that the reliability indices tend to be unusually high for poor alignments.)
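The reliability index and its use as a cut-off can be sketched as follows. The definition as the difference between the two highest output values is from the text; the exact normalisation to the 0-9 scale (here: ten times the difference, truncated, for outputs between 0 and 1) is an assumption, as are the function names.

```python
def reliability_index(outputs):
    """Reliability index (0-9) from the two highest output values.

    ASSUMPTION: outputs lie between 0 and 1 and the index is the
    truncated tenfold difference, capped at 9.
    """
    ranked = sorted(outputs, reverse=True)
    difference = ranked[0] - ranked[1]
    return min(9, int(10 * difference))


def accuracy_above_cutoff(observed, predicted, indices, cutoff):
    """Accuracy and coverage (both in %) for residues with RI >= cutoff."""
    selected = [i for i, ri in enumerate(indices) if ri >= cutoff]
    if not selected:
        return None, 0.0
    correct = sum(observed[i] == predicted[i] for i in selected)
    accuracy = 100.0 * correct / len(selected)
    coverage = 100.0 * len(selected) / len(indices)
    return accuracy, coverage
```

Raising the cut-off trades coverage for accuracy: fewer residues are retained, but those retained are expected to be predicted more accurately.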


Fig. 4 (a) Expected per-residue accuracy for residues with a reliability index (RI) above a given cut-off, e.g., a level of accuracy comparable to homology modelling is reached for 49% of all residues by PHDacc (RI ≥ 4) and for 44% of all residues by PHDsec (RI ≥ 7). The small region covered by the reliability index of PHDhtm is dominated by strong predictions for non-transmembrane residues; the most accurately predicted residues in transmembrane helices reach a level of 'only' 96% accuracy. (b) Expected per-segment accuracy for secondary structure segments with an average reliability index (<RI>) above a given cut-off. For example, an average reliability of <RI> ≥ 7 is reached for: (i) 36% of all segments (SovSDO3(3) > 82%), (ii) 56% of all helices (SovSDO3(E) > 85%), and (iii) 24% of all strands (SovSDO3(H) > 86%). (Definitions of segment overlap in Rost, et al., J. Mol. Biol., 235, 13-26, 1994.)





Prediction of secondary structure at better than 72% accuracy

PHDsec was the first secondary structure prediction method to surpass a level of 70% overall three-state per-residue accuracy. 6 The most recent test set, compiled for this contribution, with more than 300 unique protein chains and a total of more than 70,000 residues, yielded a three-state per-residue accuracy better than 72% (Table I). Besides the high level of overall accuracy, predictions are well balanced (high value for I in Table I). Furthermore, PHDsec meets the demands for a reasonable prediction tool in that the accuracy measured in segment-based scores 25 is higher than per-residue scores: about 74% of the segments are correctly predicted (Sov in Table I).
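For reference, the three-state per-residue accuracy is simply the percentage of residues whose predicted state matches the observed one. A minimal sketch follows, with a per-state variant that shows how balanced a prediction is; the function names are illustrative, and the segment-based score (Sov) is not reproduced here.

```python
def per_residue_accuracy(observed: str, predicted: str) -> float:
    """Percentage of residues predicted in the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_state_accuracy(observed: str, predicted: str, states="HEL"):
    """For each state, percentage of observed residues predicted correctly."""
    result = {}
    for state in states:
        positions = [i for i, o in enumerate(observed) if o == state]
        if positions:
            hits = sum(predicted[i] == state for i in positions)
            result[state] = 100.0 * hits / len(positions)
    return result
```

For the observed string HHHEELLL and the prediction HHHLELLL, seven of eight residues are correct (87.5%), but only one of the two strand residues is found (50%).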

Structural class prediction comparable to experimental accuracy

Proteins can be sorted roughly into four structural classes based on secondary structure content: all-α (helix ≥ 45%, strand < 5%), all-β (strand ≥ 45%, helix < 5%), α/β (helix ≥ 30%, strand ≥ 20%), and all others. 28, 29 An experimental way to measure secondary structure content is circular dichroism spectroscopy. 30, 31 A simple alternative is to use the predictions of PHDsec to compile the overall prediction of secondary structure content. Based on the predicted content, proteins are sorted into one of the four structural classes. The result is that for about 74% of all protein chains, the class is correctly predicted (Table II). The correlation between observed and predicted content is 0.88 for helix and 0.75 for strand. These values are comparable to results from circular dichroism spectroscopy (helix: 0.84, strand: 0.37-0.41 31). 6 Of course, this does not imply that PHDsec can replace experiments. However, the high level of accuracy suggests using PHDsec predictions as a complement to experiments.
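The sorting into structural classes follows directly from the thresholds given above. A minimal sketch, assuming the prediction is a string over H, E, and L (the function name and class labels are illustrative):

```python
def structural_class(states: str) -> str:
    """Assign a structural class from secondary structure content."""
    n = len(states)
    helix = 100.0 * states.count("H") / n
    strand = 100.0 * states.count("E") / n
    if helix >= 45 and strand < 5:
        return "all-alpha"
    if strand >= 45 and helix < 5:
        return "all-beta"
    if helix >= 30 and strand >= 20:
        return "alpha/beta"
    return "other"
```

The same function can be applied to observed and to predicted strings, so that the fraction of chains with matching class can be compiled.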


Table II





Prediction of buried or exposed residues at 74% accuracy

Comparing the conservation of secondary structure with that of solvent accessibility (measured in three states), we find that solvent accessibility is less conserved (Table I). Consequently, PHDacc is less accurate than PHDsec. However, the accessibility prediction is relatively close to the optimum given by homology modelling: the correlation between predicted and observed relative accessibility is 0.54 for PHDacc and would be 0.68 for sequence alignments if homology modelling were possible (Table I). More than 74% of the residues are predicted correctly in either of the two states, buried or exposed. Entirely buried residues (< 4% accessible) are predicted best (data not shown). 9 PHDacc is, so far, superior to other methods (Table I). (Note: tested on a subset of 99 monomers, the two-state accuracy rose to over 77%. 9)

Transmembrane helices predicted at 95% accuracy

The problem in evaluating the performance of PHDhtm is the small set of proteins for which the locations of transmembrane helices have been determined reliably. Consequently, the results ought to be viewed with caution. The overall two-state per-residue accuracy of PHDhtm is 95% (Fig. 3); the per-segment accuracy is about 96% (only 15 out of 380 transmembrane helices in a set of 69 proteins were wrongly predicted). 10 Of further practical importance is the low level of false positives for PHDhtm. Out of a set of 278 globular water-soluble proteins with unique sequences, PHDhtm predicts only 14 incorrect transmembrane helices; these errors occur mostly for proteins with highly hydrophobic β-strands in the core. 10



Availability

PHD predictions (and MaxHom alignments) are available upon request by the automatic prediction service PredictProtein. 11 For detailed information send the word help as subject to the internet address: PredictProtein@EMBL-Heidelberg.DE or visit the World Wide Web (WWW) site: http://dodo.bioc.columbia.edu/predictprotein/predictprotein.html (Fig. 5). Since we sometimes have over 100 requests per day, returning a prediction may take a day or more. If you have no answer after two days, something has gone wrong (typical reasons: corrupted email connection of sender or hardware problems at EMBL). In such a case, simply resubmit the request. Should the answer not appear after another two days, send a note to: Predict-Help@EMBL-Heidelberg.DE . For further services (e.g. data bases) provided by the EMBL Protein Design Group, see http://www.sander.embl-heidelberg.de/desc/ or connect by anonymous ftp to ftp.embl-heidelberg.de


Fig. 5 Example for a request to the automatic protein structure prediction server PredictProtein. All network methods (PHDsec, PHDacc, PHDhtm) are available. The file to be submitted consists of two parts: a header (optional key words in italics), and the main body starting with a hash (#) in the first line and the one-letter code amino acid sequence in the following lines. Alternative options are the submission of a list of sequences, or a complete alignment. Details are given in the PredictProtein help file (see text).





Comments

How accurate are the predictions?

The expected levels of accuracy (PHDsec 72±9% (three states); PHDacc 75±7% (two states); PHDhtm 94±6%) are valid for typical globular, water-soluble (PHDsec, PHDacc), or helical transmembrane proteins (PHDhtm) when the multiple alignment contains many and diverse sequences. High values for the reliability indices indicate more accurate predictions. (Note: for alignments with little variation in the sequences, the reliability indices adopt misleadingly high values.)

How useful are the predictions?

The prediction of secondary structure can be accurate enough to assist chain tracing. Furthermore, predictions can be used as a starting point for modelling 3D structure and predicting function. 13, 32-34

Confusion between strand and helix?

PHDsec focuses on predicting hydrogen bonds. Consequently, occasionally strongly predicted (high reliability index) helices are observed as strands and vice versa.

Strong signal from secondary structure caps?

The ends of helices and strands contain a strong signal. However, on average PHDsec predicts the core of helices and strands more accurately than the caps. 19

Accessibility useful to provide upper limits for contacts?

The predicted solvent accessibility (PHDacc) can be translated into a prediction of the number of water molecules around a given residue. Consequently, PHDacc can be used to derive upper and lower limits for the number of inter-residue contacts of a certain residue (such an estimate could improve predictions of inter-residue contacts 35).

What about protein design and synthesised peptides?

The PHD networks are trained on naturally evolved proteins. However, the predictions have proven to be useful in some cases to investigate the influence of single mutations. For short poly-peptides, the following should be taken into account: the network input consists of 17 adjacent residues, thus, shorter sequences may be dominated by the ends (which are treated as solvent).

How to predict porins?

PHDhtm predicts only transmembrane helices, and PHDsec has been trained on globular, water-soluble proteins. How to predict 1D structure for porins then? As porins are partly accessible to solvent, prediction accuracy of PHDsec was relatively high (70%) for the known structures. Thus, PHDsec appears to be applicable.

How to use the prediction of transmembrane helices?

One possible application of PHDhtm is to scan, e.g., entire chromosomes for possible transmembrane proteins. The classification as transmembrane protein is not sufficient to infer function, but may shed some light on the puzzle of genome analyses. When using PHDhtm for this purpose, the user should keep in mind that on average about 5% of the globular proteins are falsely predicted to have transmembrane helices.

Acknowledgements

Gladly, I should like to express my gratitude to the colleagues from the EMBL who help(ed) in developing PHD. First of all, thanks to Chris Sander for his intellectual, emotional, and financial support. Second, thanks to Reinhard Schneider for valuable ideas, important discussions, and for his help in setting up the prediction server. Third, thanks to Antoine de Daruvar for having rewritten the server software and for now maintaining the server. Fourth, to Gerrit Vriend whose ideas paved the way for the first prediction above 70% accuracy. Fifth, thanks to Séan O'Donoghue for a thorough correction of the manuscript. Finally, thanks to all those who deposit data about protein structure in public data bases and thus enable the development of tools such as PHD.


References (text)





Abbreviations used:

3D, three-dimensional; 1D, one-dimensional; PDB, Protein Data Bank of experimentally determined 3D structures of proteins; SWISSPROT, data base of protein sequences; DSSP, data base containing the secondary structure and solvent accessibility for proteins of known 3D structure; HSSP, data base containing for each PDB protein of known 3D structure the alignments of all SWISSPROT sequences homologous to the known structure; MaxHom, profile-based multiple alignment program; PHD, Profile-based neural network prediction of secondary structure (PHDsec), solvent accessibility (PHDacc), and transmembrane helices (PHDhtm).