Pitfalls of protein sequence analysis
Burkhard Rost (EMBL, 69012 Heidelberg, Germany; rost@embl-heidelberg.de) & Alfonso Valencia (CNB-CSIC, Cantoblanco, 28049 Madrid, Spain; valencia@samba.cnb.uam.es)
What can theory predict of protein structure? In general, protein three-dimensional (3D) structure can NOT be predicted from sequence [2, 3]. However, 3D structure can be predicted by homology modelling, i.e., by using a sequence homologue (>25% sequence identity) with an experimentally determined 3D structure. If no sequence homologue is found in PDB [4], there is still a chance to predict 3D structure by threading, i.e., by remote homology modelling (<25% sequence identity). However, correct 3D models from threading, and even correct detection of remote homology, are rare [5, 6]. Nevertheless, theory can assist by predicting one-dimensional (1D) aspects of 3D structure, e.g., secondary structure, solvent accessibility, transmembrane helices, binding sites, sequence motifs, and aspects of protein function.
Ease of use bears an ease of misuse. Rapidly developing electronic communication (Internet, World Wide Web) facilitates the spread of prediction methods. Experimental biologists submit sequences; theoretical biologists configure automatic services that return predictions. The advantage is that users need not become experts in sequence analysis tools. However, the ease of offering and accessing predictions brings two problems. (1) Inaccurate (or insufficiently validated) methods are made available, bypassing selection systems such as referees. (2) Users may misinterpret results due to a lack of insight into the features of prediction methods.
More than 30% pairwise sequence identity. Sequence analysis usually starts by searching for homologues in databases [7, 4]. The success of alignment programs rests on evolutionary connections between homologous proteins: if 24 out of 80 aligned residues (i.e. 30%; more for shorter matches [8]) are identical between two naturally evolved proteins, the two have similar 3D structures and similar functions [9, 8, 10] (this may not be valid for engineered proteins). The level of sequence identity significant for homology is much higher for shorter regions; for very short motifs (e.g. 'RGD', 'KDEL') homology can NOT be inferred from sequence identity.
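The identity count above can be made concrete with a minimal sketch. It assumes the two sequences are already aligned and written with '-' for gaps; the function name `percent_identity` and the choice to count only columns without gaps are illustrative assumptions, not the convention of any particular alignment program.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (identical, aligned, percent) over columns without gaps.

    Assumes both strings come from the same pairwise alignment and
    therefore have the same length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    aligned = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are not counted as aligned residues
        aligned += 1
        if a == b:
            identical += 1
    percent = 100.0 * identical / aligned if aligned else 0.0
    return identical, aligned, percent


# Toy strings reproducing the 24-out-of-80 case from the text.
print(percent_identity("A" * 80, "A" * 24 + "C" * 56))  # (24, 80, 30.0)
```

Because the percentage alone hides the length of the match, the sketch returns the number of aligned residues as well; a 30% value over 20 residues does not carry the same weight as 30% over 80.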
Higher values for sequence similarity. If similarity scores (based on physico-chemical properties: D->E = 1) rather than identity scores (D->E = 0; D->D = 1) are used to select homologues, the pairwise similarity usually has to be higher than 30% to be significant. A rule of thumb is that, for true homologues, similarity scores are higher than identity scores. Similarity scores depend on the particular similarity metric used. Thus, results cannot be compared directly between different methods.
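The difference between the two scores can be illustrated with a small sketch. The residue groups in `SIMILAR_GROUPS` are a hypothetical grouping chosen for illustration only; they are not a published substitution matrix, and a different grouping would give different similarity values, which is exactly why such values cannot be compared between methods.

```python
# Hypothetical groups of physico-chemically similar residues
# (an assumption for illustration, not a published metric).
SIMILAR_GROUPS = [set("DE"), set("KR"), set("ILVM"),
                  set("FYW"), set("ST"), set("NQ")]


def is_similar(a, b):
    """Identical residues, or residues from the same group, score 1."""
    if a == b:
        return True
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, percent similarity) over ungapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = aligned = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        aligned += 1
        if a == b:
            identical += 1
        if is_similar(a, b):
            similar += 1
    if aligned == 0:
        return 0.0, 0.0
    return 100.0 * identical / aligned, 100.0 * similar / aligned


print(identity_and_similarity("DKIL", "ERIA"))  # (25.0, 75.0)
```

Since every identical pair also counts as similar, the similarity value can never fall below the identity value for the same alignment; the threshold for significance has to move up accordingly.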
Constraints to significant identity: composition bias and gaps. There are two possible errors in inferring homology from a given level of pairwise sequence identity. (1) Composition bias: if the two aligned proteins have regions with a high content of certain amino acids (e.g. arginine-rich regions in DNA-binding proteins), such regions may be important for protein function, and in many cases are indicative of functional class, but may be misleading for homology searches. Thus, composition-biased regions should be ignored when computing sequence identity, and be used only to confirm the presence of a similar composition bias in identified homologues. (2) Many gaps: if an alignment between two proteins contains too many insertions or deletions (gaps), even a relatively high value of sequence identity may not suffice to ascertain homology (typical structure alignments contain up to 10% gaps).
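Both checks can be sketched in a few lines. The window length and the threshold used to flag composition bias are hypothetical values chosen for the toy example; they are not taken from the literature, and real low-complexity filters are considerably more elaborate.

```python
from collections import Counter


def gap_fraction(aligned_a, aligned_b, gap="-"):
    """Fraction of alignment columns in which either sequence has a gap."""
    columns = list(zip(aligned_a, aligned_b))
    if not columns:
        return 0.0
    gapped = sum(1 for a, b in columns if a == gap or b == gap)
    return gapped / len(columns)


def biased_windows(sequence, window=20, threshold=0.4):
    """List (start, residue, fraction) for windows dominated by one residue.

    'window' and 'threshold' are illustrative assumptions.
    """
    hits = []
    last_start = max(len(sequence) - window, 0)
    for start in range(last_start + 1):
        segment = sequence[start:start + window]
        if not segment:
            continue
        residue, count = Counter(segment).most_common(1)[0]
        fraction = count / len(segment)
        if fraction >= threshold:
            hits.append((start, residue, fraction))
    return hits


print(gap_fraction("AC-DE", "ACG-E"))  # 0.4
# An arginine-rich N-terminus is flagged in the first three windows.
print(biased_windows("RRRRRRRRRRACDEFGHIKL", window=10, threshold=0.75))
```

Residues inside flagged windows would be left out of the identity count, and an alignment whose gap fraction is well above the roughly 10% typical of structure alignments would be treated with suspicion.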
Evolutionary patterns crucial for successful prediction of function. A typical mistake is to predict function by putative homology based on an over-interpreted level of sequence similarity. Functional and structural constraints are translated into sequence conservation in a particular way that depends on the particular protein structure and its evolution. The level of similarity required for identifying functionally equivalent proteins in two species depends on the overall divergence of the species and on the particular protein family.
Some databases use more reliable annotations than others. When predicting function based on similarity to proteins of known function (as annotated in databases), it is important to be aware of incomplete or wrong annotations. The annotations for the putative homologue ought to be verified in the original sources of the functional assignments (SWISS-PROT [7] is an example of a more reliably annotated database). A similar problem arises from errors in sequences, such as frame-shifts or sequencing errors (very frequent in ESTs).
Quality of alignment. Despite the central rôle that alignment programs play in sequence analysis, a thorough analysis of the quality of methods based on statistically significant numbers of proteins has yet to be accomplished. In general, alignments are more likely to be correct for higher levels of pairwise sequence identity; and are less likely to be correct in more variable regions.
Stability of alignment. Say you find three proteins you want to use to build a multiple alignment for your sequence of unknown structure (U). In experiment A, you align them in the order 1-U (1 to U), 2-U, 3-U; in experiment B, you reverse the order: 3-U, 2-U, 1-U. The alignments of A and B may differ in detail. An error of the program? Not necessarily: the reason may equally be that the alignment is simply not unique, i.e., the best and the second-best solutions to the alignment problem may have similar scores. In such cases, the alignment is less reliable in regions where the results from A and B differ.
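One way to locate the regions where A and B differ is to compare, residue by residue, which position of a given homologue each experiment pairs with U. The sketch below assumes that the pairing of U with that homologue has been extracted from each multiple alignment as two gapped strings; the function names are hypothetical.

```python
def residue_map(aligned_u, aligned_other, gap="-"):
    """Map each residue index of U to the paired residue index in the
    other sequence, or to None where U faces a gap."""
    mapping = {}
    i = j = 0  # running residue indices in U and in the other sequence
    for cu, co in zip(aligned_u, aligned_other):
        if cu != gap:
            mapping[i] = j if co != gap else None
            i += 1
        if co != gap:
            j += 1
    return mapping


def unstable_positions(map_a, map_b):
    """Residue indices of U that are paired differently in A and B."""
    positions = set(map_a) | set(map_b)
    return sorted(p for p in positions if map_a.get(p) != map_b.get(p))


# Experiment A places the gap opposite residue 2 of U, experiment B
# opposite residue 1; both alignments have the same number of identities.
map_a = residue_map("ACDEF", "AC-EF")
map_b = residue_map("ACDEF", "A-CEF")
print(unstable_positions(map_a, map_b))  # [1, 2]
```

Conclusions drawn from the flagged residues (conservation, predicted structure) deserve less confidence than those drawn from residues that are paired identically in both experiments.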
Sequence alignments reveal underlying evolutionary processes. Aligning protein sequences may appear to be purely a problem of matching letters. However, sequence alignments unravel information about structural and functional relations between residues in different proteins. Obviously it is not trivial to map the complexity of factors determining protein structure and function onto 1D relations between letters.
Evolutionary divergence within sequence families. In general, regions with many insertions and deletions in the alignment are less informative. To illustrate this, say your protein has 333 residues, 22 sequences are aligned in the N-terminal region, and only two near the C-terminus. Then you cannot draw firm conclusions about function and/or structure from the conservation patterns at the C-terminus. Do 20 sequences suffice for an informative alignment? Not necessarily. The information contained in a multiple alignment is determined by the divergence of the aligned sequences rather than by their number. Ideally, the entire range between 30 and 90% sequence identity should be covered, preferably with many sequences at the lower levels (30-50%).
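Both aspects, the number of sequences covering each region and the spread of pairwise identities, are easy to inspect before interpreting conservation patterns. The following is a minimal sketch; the bin edges follow the 30-90% range mentioned in the text, while the function names and the toy alignment are assumptions.

```python
def column_depth(rows, gap="-"):
    """Number of sequences with a residue (not a gap) in each column."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all alignment rows must have equal length")
    return [sum(1 for row in rows if row[col] != gap) for col in range(width)]


def identity_histogram(identities, edges=(30, 50, 70, 90)):
    """Count pairwise identities (in percent) per bin; the last bin
    includes its upper edge, values outside the range are ignored."""
    counts = [0] * (len(edges) - 1)
    for value in identities:
        for k in range(len(edges) - 1):
            is_last = k == len(edges) - 2
            if edges[k] <= value < edges[k + 1] or (is_last and value == edges[-1]):
                counts[k] += 1
                break
    return counts


# Three aligned sequences: the C-terminal columns are covered by one only.
print(column_depth(["ACDE", "AC--", "A---"]))              # [3, 2, 1, 1]
# Identities of the family members to U: the 30-50% bin is well populated.
print(identity_histogram([35, 42, 48, 65, 88, 95, 90]))    # [3, 1, 2]
```

Columns with low depth, and families whose identities cluster in the top bin, call for more cautious interpretation.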
Extending local sequence motifs to entire folds. The goal of database searches is to find a good alignment for a full folding domain. This task is often very difficult in the absence of 3D information. If you have found a local motif (e.g. by a BLAST search [11]), try to extend the alignment to cover the full core of the proteins. A good indication of a correct match is that the motifs (or local hits from the BLAST search) appear in the same order in the final alignment.
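The ordering criterion can be checked mechanically. The sketch below assumes each local hit has been reduced to a pair (start in your protein, start in the database protein); this representation and the function name are illustrative and do not correspond to the output format of any search program.

```python
def hits_are_collinear(hits):
    """True if the hits, sorted by position in the query, also appear in
    strictly increasing order in the database protein."""
    ordered = sorted(hits)  # sort by query start
    subject_starts = [subject for _, subject in ordered]
    return all(a < b for a, b in zip(subject_starts, subject_starts[1:]))


# Three motifs found in the same order in both proteins.
print(hits_are_collinear([(10, 15), (60, 70), (120, 140)]))   # True
# Same motifs, but in reversed order in the database protein.
print(hits_are_collinear([(10, 140), (60, 70), (120, 15)]))   # False
```

A failed check does not prove that the proteins are unrelated (domains can be rearranged), but it argues against extending the local hits into one alignment of the full domain.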
Aligning entire families rather than subsets of sequences. Another helpful criterion indicative of true homologues is that the 'full' protein family is described by the alignment rather than just a subset of sequences. In practice, searches often identify initially a few members of a given family. The aligned regions should then be investigated thoroughly: are local motifs compatible with the entire family? Are there other motifs that could be used to uncover further homologues by restricting the search to such motifs? Is the pattern symmetrical, i.e., if your protein U has been aligned to, e.g., the protein kinase family based on strong local motifs: do other patterns relevant for the kinase family match in U ? Incomplete family alignments are often indicative of a misleading local pattern and of falsely having aligned unrelated proteins.
Profile alignments may intrude into the twilight zone. In the twilight zone [10] of 20-30% pairwise sequence identity, sequence alignments become tricky. Only methods using profiles derived from the sequence family of your protein U may reliably intrude into that zone. The quality of such alignments depends crucially on the information contained in the alignment, i.e., the size (number of sequences) and divergence (levels of pairwise sequence identity) of the sequence family. In sparse regions (fewer sequences), alignments are generally less reliable. Penetrating the twilight zone therefore requires attention!
70% correct implies 30% incorrect. The most accurate methods for predicting secondary structure or solvent accessibility are based on multiple alignment information and reach levels of about 70% accuracy. This level of accuracy suffices to render useful predictions [12, 6] . However, in interpreting the predictions it is often instructive to spot the 30% of the residues you suspect to be falsely predicted.
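For a protein of known structure, the per-residue accuracy referred to here is simply the percentage of residues predicted in the correct state. A minimal sketch, assuming predicted and observed secondary structure are given as equally long strings over three states (the letters H, E and L for helix, strand and other are an assumed encoding):

```python
def per_residue_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have equal length")
    if not predicted:
        return 0.0
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(predicted)


def wrong_positions(predicted, observed):
    """Indices of residues predicted in the wrong state."""
    return [i for i, (p, o) in enumerate(zip(predicted, observed)) if p != o]


predicted = "HHHEEELLLL"
observed = "HHHHEELLLE"
print(per_residue_accuracy(predicted, observed))  # 80.0
print(wrong_positions(predicted, observed))       # [3, 9]
```

In this toy case both errors sit at the ends of segments; for a protein of unknown structure the wrong positions are of course not known, which is why other clues to the unreliable residues are needed.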
Spread of prediction accuracy. An expected accuracy of 70% does NOT imply that for your protein U 70% of all residues are correctly predicted. Instead, values published for prediction accuracy are averaged over hundreds of unique proteins. An expected accuracy of 70±10% (one standard deviation) implies that, on average, for two thirds of all proteins between 60 and 80% of the residues will be predicted correctly. Thus, prediction accuracy can be higher than 80% or lower than 60% for your protein.
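The 'two thirds' figure can be reproduced under the assumption, made here only for illustration, that per-protein accuracies follow a normal distribution with mean 70% and standard deviation 10%. Real distributions of accuracy need not be normal.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def fraction_between(low, high, mean=70.0, sd=10.0):
    """Expected fraction of proteins with accuracy between low and high,
    assuming normally distributed per-protein accuracies."""
    return normal_cdf(high, mean, sd) - normal_cdf(low, mean, sd)


print(round(fraction_between(60, 80), 3))        # 0.683
print(round(normal_cdf(60, 70.0, 10.0), 3))      # 0.159
```

Under this assumption roughly one protein in six would be predicted at below 60% accuracy, and about as many at above 80%.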
Special classes of proteins. Prediction methods are usually derived from knowledge contained in subsets of proteins from databases. Consequently, they should not be applied to classes of proteins which have not been included in the subsets. For example, methods for predicting helices in globular proteins are likely to fail when applied to predict transmembrane helices. In general, results should be taken with caution for proteins with unusual features, such as proline-rich regions, unusually many cysteine bonds, or for domain interfaces.
Better alignments yield better predictions. Multiple alignment-based predictions are substantially more accurate than single sequence-based predictions. How many sequences do you need in your alignment to expect an improvement; and how sensitive are prediction methods with respect to errors in the alignment? The more divergent sequences contained in the alignment, the better (two distantly related sequences often improve secondary structure predictions by several percentage points). Regions with few aligned sequences yield less reliable predictions. The sensitivity to alignment errors depends on the methods, e.g., secondary structure prediction is less sensitive to alignment errors than accessibility prediction.
Better + worse = even better? Today, several automatic services provide secondary structure predictions. Some users fall into the what-is-common-is-correct trap, i.e., they average over all prediction methods and consider regions predicted identically as more reliable. Exceptionally, such a majority vote may be beneficial. Frequently, however, the result will be the worst-of-all prediction. Often, it is preferable to use the reliability indices provided by some methods. Such indices answer the question: how reliably is the tryptophan at position 307 predicted in a surface loop? (Note: the correlation between such indices and prediction accuracy has been sufficiently tested for only a few methods.)
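Using a reliability index amounts to keeping only the residues predicted above some cutoff. The sketch below assumes one integer index per residue on a scale from 0 (unreliable) to 9 (reliable) and a cutoff of 5; both the scale and the cutoff are assumptions for illustration, and the appropriate values depend on the method and on how its index was validated.

```python
def mask_unreliable(prediction, reliability, cutoff=5, mask="."):
    """Replace predicted states whose reliability index is below the
    cutoff by a mask character."""
    if len(prediction) != len(reliability):
        raise ValueError("one reliability value per residue is required")
    return "".join(
        state if index >= cutoff else mask
        for state, index in zip(prediction, reliability)
    )


prediction = "HHHEEL"
reliability = [9, 8, 3, 7, 2, 9]
print(mask_unreliable(prediction, reliability))  # HH.E.L
```

The masked string separates the residues worth interpreting from those that are not, without mixing in predictions from weaker methods.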
1D structure may or may not be sufficient to infer 3D structure. Say you obtain the following prediction for regular secondary structure: helix-strand-strand-helix-strand-strand (H-E-E-H-E-E). Assume you find a protein of known structure with the same motif (H-E-E-H-E-E). Can you conclude that the two proteins have the same fold? Yes and no: your guess may be correct, but the given motif can be realised by completely different structures. For example, the secondary structure motif 'H-E-E-H-E-E' is contained in at least 16 structurally unrelated proteins.
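A segment motif such as H-E-E-H-E-E is what remains of a per-residue prediction once segment lengths and loops are discarded, which is why it carries so little information about the fold. A minimal sketch of that reduction, assuming a per-residue string with H for helix, E for strand and L for everything else:

```python
from itertools import groupby


def segment_motif(states, regular="HE"):
    """Collapse a per-residue secondary structure string into the order
    of its helix (H) and strand (E) segments."""
    segments = [state for state, _ in groupby(states) if state in regular]
    return "-".join(segments)


# Two strings with different segment lengths and loop lengths ...
a = "LLHHHHLLEEELLEEEELHHHHLEEELLEEE"
b = "HHHHHHHHHHHLEEEEELLLLLLEEELLLHHHLLEELEEEEEEL"
# ... reduce to the same motif.
print(segment_motif(a))  # H-E-E-H-E-E
print(segment_motif(b))  # H-E-E-H-E-E
```

Two proteins that share the collapsed motif may still differ in segment lengths, in loop lengths, and in how the strands pair in space.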
Accuracy of homology modelling at the level of ribbon plots. A common mistake in using homology-derived models of 3D structure is that the model is taken too literally. In general, the accuracy of homology modelling decreases with lower levels of pairwise sequence identity between your sequence U and the template structure. Models accurate enough to simulate ligand binding in detail require levels above 90% pairwise sequence identity. Furthermore, successful applications of homology modelling may be hampered by two other difficulties. (1) Loop regions are, in general, less reliable. (2) Accurate predictions for regions with insertions or deletions, i.e., regions where the template structure does not match U, are the exception [6].
Avoid over-interpreting the details of the model. The main purpose of homology modelling is to translate a given alignment into more intuitive 3D images. Such images often look temptingly 'real'. It is crucial to bear in mind which regions were unreliable in the alignment or in the original structure (e.g. flexible regions, high R factor, low number of NMR constraints). Homology modelling tends to yield sketches of structure rather than accurate co-ordinates. Is homology modelling of any use? Is any hypothesis about a structure better than no hypothesis? Guessing details about protein function is such a difficult task that any working hypothesis may help to guide experiments. This demand for hypotheses may result in over-interpretation both of the level of homology suggested by the sequence alignment and of the level of accuracy of the resulting model. In practice, modelling often strikes a balance between users pushing for more interpretations and models pin-pointing the limitations of the methods.
Remote homology modelling (threading) is extremely tricky! One problem of homology modelling at lower levels of pairwise sequence identity is to get the alignment between your sequence U and the template structure T correct. But even if the alignment were correct, a principal limitation is that T and U just do NOT have identical 3D structures. This problem is particularly severe for remote homology modelling (threading), i.e., prediction of 3D structure based on less than 25% pairwise sequence identity. However, current threading methods are even more limited: getting the alignment correct is the exception rather than the rule [6]. The basic message of this statement for you is NOT 'don't use threading programs', but 'use them with extreme caution and be aware that most resulting models are likely to be mostly wrong'.
Check the predicted model! No matter which homology modelling technique you use, you had better check the model with one of the various programs that detect errors in experimentally determined structures [2]. What if all checks reveal your model to resemble a known structure? You have a good starting point. But keep in mind that checking tools are tailored to spot errors in structures derived from NMR or X-ray crystallography. Your model may have been subject to extensive refinement that optimises exactly those variables that are checked (e.g. torsion angles, bond lengths). Furthermore, other features that help to spot errors in experimentally determined structures (e.g. insertion of gaps in secondary structure elements, proximity of functional or active site residues, hydrophobicity of core) may already have been avoided for your model by the alignment program. What if the checks reveal your model to be not native-like? Shifts of secondary structure segments are rather frequent in protein evolution. It may have been misleading in the first place to optimise bumps or side-chain packing with the modelling software when the deviations between the secondary structure elements of the template and your protein were substantial. Try to focus on more reliable regions and/or particular aspects of the model compatible with independently derived information.
Most mistakes are pitfalls that could have been avoided. Most of the examples listed can be traced in the literature. Is the take-home message: hands off prediction methods if you are not an expert? Certainly not: the pitfalls listed could have been avoided. The more difficult the prediction task, the more skills are required: threading and homology modelling are difficult; alignments and 1D structure prediction are more straightforward. Understanding the limitations of tools is almost the entire way to a successful application.
Is your protein a suitable target for prediction methods? A general message from a workshop organised by Anna Tramontano (IRBM, Rome) and Tim Hubbard (MRC Cambridge) was [12]: this depends on how much you are interested in finding answers to your questions! Theoretical biology still fails to predict 3D structure from sequence, but predictions of various simplified 1D aspects of structure become more accurate and more useful with every new sequence added to public databases.