In: P. von Rague-Schleyer, N. L. Allinger, T. CClark, J. Gasteiger, P. A. Kollman, H. F. Schaefer 'Encyclopedia of Computational Chemistry', John Wiley: Sussex, 1998, 2242-2255
Address
EMBL, 69 012 Heidelberg, Germany
http://dodo.bioc.columbia.edu/~rost/
contact e-mail:rost@embl-heidelberg.de
In: P. von Rague-Schleyer, N. L. Allinger, T. CClark, J. Gasteiger, P. A. Kollman, H. F. Schaefer 'Encyclopedia of Computational Chemistry', John Wiley: Chisester, 2242-2255
Proteins are the machinery of life. The information for life is stored by a four-letter alphabet in the genes (DNA). Proteins are, among others, the macromolecules that perform all important tasks in organisms, such as catalysis of biochemical reactions, transport of nutrients, recognition, and transmission of signals. Thus, genes are the blueprints or library, and proteins are the machinery of life. Proteins are formed by joining amino acids by peptide bonds into a stretched chain. This protein sequence comprises a translation of the four-letter DNA alphabet into a 20-letter alphabet of native amino acids. Proteins differ in length (from 30 to over 30,000 amino acids), and in the arrangement of the amino acids (dubbed residues, when joined in proteins). In water, the chain folds up into a unique three-dimensional (3D) structure. The main driving force is the need to pack residues for which a contact with water is energetically unfavourable (hydrophobic residues) into the interior of the molecule. A detailed analysis of the underlying chemistry shows that this is only possible if the protein forms regular patterns of a macroscopic substructure called secondary structure (Fig. 1; for an excellent introduction into protein structure: [1]; for a short review of the basic principles of folding: [2]).
Sequence determines structure determines function. Protein three-dimensional (3D) structure (i.e. the co-ordinates of all atoms) determines protein function. But what determines 3D structure? The hypothesis that structure (also referred to as 'the fold') is uniquely determined by the specificity of the sequence, has been verified for many proteins [3]. While it is now known that particular proteins (chaperones) often play a rôle in the folding pathway, and in correcting misfolds [4], it is still generally assumed that the final structure is at the free-energy minimum. Thus, all information about the native structure of a protein is coded in the amino acid sequence, plus its native solution environment. Can the code be deciphered, i.e. can 3D structure be predicted from sequence? In principle, the code could by deciphered from physico-chemical principles using, for example, molecular dynamics methods [5]. In practice, however, such approaches are frustrated by two principle obstacles. Firstly, energy differences between native and unfolded proteins are extremely small (order of 1 kcal/mol). Secondly, the high complexity (i.e. co-operativity) of protein folding requires several orders of magnitudes more computing time than we anticipate to have over the next decades. Thus, the inaccuracy in experimentally determining the basic parameters, and the limited computing resources become fatal for predicting protein structure from first principles [6]. The only successful structure prediction tools are knowledge-based, using a combination of statistical theory and empirical rules.
The sequence-structure gap is rapidly increasing. Currently, databases for protein sequences (e.g. SWISS-PROT [7]) are expanding rapidly, largely due to large-scale genome sequencing projects. The first four entire genome sequences have been published; they represent all three terrestrial kingdoms: (1) prokaryotes: haemophilus influenzae [8], and mycoplasma genitalium [9]; (2) eucaryotes: yeast [10]; and (3) archeans: methanococcus jannaschii [11]. At least, another dozen of genomes will be completely sequenced before the end of 1997 (Terry Gaasterland, priv. communication); the entire human genome is likely to be known in the year 2003. This implies that the explosion of genome, and hence, protein sequences is supposedly the only field outgrowing the speed in development of computer hardware. It also implies, that despite significant improvements of structure determination techniques the gap between the number of proteins for which structure is deposited in public databases (PDB [12]), and the number of proteins for which sequences are known is increasing.
Can the egg be unboiled? When an egg is boiled, the proteins it contains unfold. Can this procedure be reversed in theory? Can the encrypted code of protein structure be deciphered? Or, can theory help to bridge the sequence-structure gap? Indeed, for over 30 years, there has been an ardent search for methods to predict protein structure from the sequence. Many methods were found which looked initially very promising - but always the hope has been dashed. How well do we do?
No general prediction of structure from sequence, yet. An important experiment has been initiated by John Moult (CARB, Washington): those who determine protein structures submitted the sequences of proteins for which they were about to solve the structure to a 'to-be-predicted' database; for each entry in that database predictors could send in their predictions before a given deadline (the public release of the structure); finally, the results were compared, and discussed during a workshop (in Asilomar, California). Two such experiments have been completed: in December 1994 (Proteins special issue, Vol. 23, 1995), and in December 1996 (to be published in Proteins, 1997). The results of both experiments demonstrated clearly that the goal to predict structure from sequence has not been reached, yet. So, no improvement despite ardent attempts, and the explosion of knowledge deposited in databases?
Indeed, there is a flood of literature on protein structure prediction
attempting to keep track with the expanding databases (reviews:
[13, 14]; books: [15, 16]; a practical approach to structure prediction
and sequence analysis: [17-19]. In this review focus will be
laid on recent prediction methods that do actually contribute
to bridging the sequence structure gap in particular in view of
analysing entire genomes. The first section will provide a brief
sketch about where we are today in protein structure prediction.
The following chapters will sketch the problems, and some of
the solutions in database searches, and the prediction of protein
structure in 1D, 2D, and 3D (Fig. 1).
Representation of HIV-1 protease monomer (Protein Data Bank code 1HHP) in one, two and three dimensions. Each of the representations gives rise to a different type of prediction problem.
1D - predict secondary structure and solvent accessibility. From left to right: amino acids for the first 33 residues (one letter code, first column); alignment exemplified by 5 sequences (second column); secondary structure [80] (H, helix; E, strand; blank, other: third column), solvent accessibility (measured in Å, fourth column, [80]), and a typical prediction by the neural network program PHD [39] for secondary structure and solvent accessibility (in italics, fifth and sixth column).
2D - predict contact map. The 3D structure can be projected onto a two-dimensional matrix of inter-residue distances or contacts (as shown here). The entry at position ij of the matrix gives the contact strength between residue i and residue j . The stronger a contact, the darker the marker. Horizontal and vertical lines give borders of secondary structure segments. Graph made with CONAN [81].
3D - predict three-dimensional coordinates. The trace
of the protein chain in 3D is plotted schematically as a ribbon
Ca-trace. Strands are indicated
by arrows, the short helix is on the right towards the end (C-term)
of the protein. Graph made with MOLSCRIPT [82]. Prediction not
shown.
Bridging the sequence-structure gap for 10 - 30% of all sequences.
The gap between the number of known sequences (>170,000 [20])
and the number of known structures (about 5,000 [12]) is widening
rapidly. The most successful theoretical approach to bridging
this gap is homology modelling. The principle idea bases on the
following observation. Each native protein sequence adopts a
unique structure. However, many different sequences can adopt
the same basic fold. In other words, proteins with similar sequences
tend to fold into similar structures. Indeed, for a pair of naturally
evolved proteins levels of 25-30% pairwise sequence identity (percentage
of residues identical between the two proteins) are sufficient
to assure that the two proteins fold into similar structures [21-23].
Thus, if a sequence of unknown structure (denote U) has significant
sequence similarity to a protein of known structure (T), it is
possible to construct an approximate 3D model for U based
on the assumption that U simply has basically the same
structure as T . This technique is referred to as homology
modelling. It effectively raises the number of 'known' 3D structures
from 5,000 to over 50,000 ([24]; Fig. 2).
Scope of structure prediction. Given any expressed
protein, how likely can theory predict its 3D structure? For
example, for 30% of the proteins in the current SWISS-PROT database
we can find regions for which homology modelling is applicable
[24], but for the first four entirely sequenced genomes (shown
is yeast) this is true for less than 10% of all proteins [83].
Thus, SWISS-PROT contains a bias introduced, e.g., by limitations
of previous sequencing techniques. Estimating the contribution
of fold recognition or threading techniques is problematic. Margins
given are certainly over-estimated in terms of the accuracy of
current threading methods, and supposedly under-estimated in terms
of the number of remote homologues that could be detected. (Note
however: today threading techniques are not accurate enough for
any large-scale prediction of 3D structure!) The remaining region
(50-80%) is occupied by unknown folds for which no somehow accurate
predictions in 3D can be obtained.
Widening the bridge by threading. Homology modelling allows prediction of 3D structure for 10-30% of all protein sequences. However, there is evidence that most pairs of proteins with similar structure are remote homologues with less than 25% pairwise sequence identity [25]. These remote homologues cannot usually be recognised by conventional sequence alignments, as this level of sequence identity is not significant for structural similarity in the following sense. If one were to collect all pairwise alignments of < 25% sequence identity that result from a search with U against a database of protein sequences, than the vast majority of these pairs would be entirely unrelated proteins. Thus, most similar structures appear to be remote homologues, but most possible pairs at low levels of sequence identity are, in fact, unrelated. Consequently, searching for remote homologues is similar to the task of finding a needle in a haystack [26]. Techniques to manage this difficult task are referred to as 'threading techniques'. Most of these techniques are applicable if and only if the remote homologue to U has known structure. Once a remote homology is detected, remote homology modelling may be used to construct a 3D model. This could potentially reduce the sequence-structure gap by an additional 10,000 - 50,000 proteins (Fig. 2). Given a sequence U from one of the complete genome sequences which have recently become available; what is the likelihood that the 3D structure can be predicted for U by homology modelling or remote homology modelling? A conservative answer is: 10%, based on the success of sequence alignment-based homology modelling (Fig. 50%, assuming all remote homologues could be recognised (Fig. 2).
Accurate prediction for 1D aspects of 3D structure. If
no remote homologue can be detected for U , we are forced
to simplify the prediction problem. There is a pay-off from making
this simplification: using the rich diversity of information in
current databases, it is possible to make very accurate 1D predictions
from the sequence alone. Automatic prediction services are readily
available for secondary structure, solvent accessibility, location
and topology for transmembrane helices [27], and the location
of helices for the special class of coiled-coil proteins [28].
Basic concept. The principle problem of sequence alignments
is to find the optimal superposition between two strings of amino
(or nucleic) acid sequences, i.e. to optimally align the two strings.
The most simple objective is to optimise the percentage of residues
that are identical between the two sequences. A dynamic programming
algorithm is guaranteed to find the optimal solution for the problem
in an algorithmic time quadratic in the length of both sequences
[29] (Fig. 3). For the alignment of protein sequences this simple
approach is not sufficient as finding the best alignments, usually,
requires to introduce gaps in one sequence, or insertions in the
other [30] (Fig. 3: rather than placing two dots into sequence
T , residues A and E could be deleted in sequence U
; note: the gap increases the score from 2 to 4 identical residues).
The introduction of such gaps is mathematically treated by adding
a constant (gap open penalty) to the final score (here number
of identical residues). However, to sensitively align protein
sequences, this still is not sufficient. The major addition to
the simple approach described so far is to evaluate scores not
based on residue identities but based on bio-chemical properties
of amino acid. For example, aligning two hydrophobic residues
(I and L) is more beneficial than aligning a hydrophobic and a
charged residue (L and K; note: when treating hydrophobic residues
as identical, the score for the best gapped alignment in Fig.
3 increases from 4 to 6).
The simple dynamic programming algorithm simply
proceeds in the following way. The two sequences to be aligned
(U and T ) are written into a matrix. Starting
from the first element in the matrix, identities are counted and
summed along the diagonal. The two best paths are marked by grey
lines. The two best alignments match only two identical residues
for the example given. However, if insertions (marked by dots)
were allowed, the best alignment, actually, matches four residues.
Evolution distinguishes signal from noise. At the level of protein molecules, selective pressure results from the need to maintain function, which in turn requires maintenance of the specific 3D structure. This evolutionary history is the basis for the success in aligning protein (or nucleotide) sequences. Accordingly, conservation and mutation patterns observed in alignments contain very specific information about 3D structure. How much variation is tolerated without loss of structure? Two naturally evolved proteins with more than 25% identical residues (length > 80 residues) are very likely to be similar in 3D structure[23].
Task trivial for high levels of sequence identity. Any sequence analysis starts with database searches: all known databases are scanned by sequence alignment procedures for proteins homologous to the search sequence U . When the pairwise sequence identity between U and a putative homologue H is over 25-30% (for more than 80 residues), alignment procedures are usually straightforward [31-35]. For less similar protein pairs, alignments may fail.
Routine database searches by simplified procedures. Aligning two sequences by dynamic programming is a matter of seconds on a modern workstation. However, database searches require to repeat this many times, and since the databases grow, CPU time becomes a constraint in everyday sequence analysis. This bottleneck is opened by methods that start to find 'identical words' (sub-strings), and then grow the alignment around such blocks. The most widely used programs of this sort are BLAST and FASTA [32, 34]. In practice, advanced alignment algorithms typically proceed by first running a fast scan with BLAST and/or FASTA, and by then applying the full dynamic programming algorithm. To illustrate sequence analysis in practice: aligning the 6,000 sequences of yeast against all known proteins was recently accomplished in 72 hours on 64 SGI 10 000 processors [36].
Multiple alignments improve as data banks grow. The most advanced sequence alignment tools base the alignment on profiles derived from databases or particular sequence families [37, 14]. One new generation of alignment methods is based on Hidden Markov Models, another on genetic algorithms. These new methods may be more successful in intruding into the twilight zone of sequence alignments (20-30% sequence identity; [22]) than advanced profile-based methods. However, this remains to be proven.
Drawback: lack of sufficiently tested cut-off criteria.
There are many different alignment methods available for those
who need to run database searches for their everyday's work.
Which method is best? One of the difficulties in comparing different
alignment procedures is the lack of well-defined criteria for
measuring the alignment quality. Very few papers have attempted
to define such measures for the comparison of various methods
[38]. The second problem for users is that most methods do not
supply a cut-off criterion for distinguishing between homologous
and non-homologous sequences (i.e., false positives). For some
large sequence families remote homologues can be aligned correctly,
but for most cases sequences aligned to the search protein U
at levels below 25% pairwise sequence identity will be false
positives, i.e., will have no structural or functional similarity
to U . A simple length-dependent cut-off based on sequence
identity is provided by the program MAXHOM [23]. However, this
threshold neither quantifies the influence of bio-chemical similarities
between amino acids, nor the occurrence of gaps.
Basic concept. The principal idea underlying most secondary structure prediction methods is the fact that segments of consecutive residues have preferences for certain secondary structure states [1, 39]. Thus, the prediction problem becomes a pattern-classification problem tractable by pattern recognition algorithms. The goal is to predict whether the residue at the centre of a segment of typically 13-21 adjacent residues is in a helix, strand or in none of the two (no regular secondary structure, often referred to as the 'coil' or 'loop' state). Many different algorithms have been applied to tackle this simplest version of the protein structure prediction problem: physico-chemical principles, rule-based devices, expert systems, graph theory, linear and multi-linear statistics, nearest-neighbour algorithms, molecular dynamics, and neural networks [39]. However, until recently performance accuracy seemed to have been limited to about 60% (percentage of residues correctly predicted in either helix, strand, or other). The limited accuracy was argued to result from the fact that all methods used only information local in sequence (window of less than 20 adjacent residues). Local information was estimated to account for roughly 65% of the secondary structure formation. Two additional problems were common to all methods developed from 1957 to 1993: (1) strands were predicted at levels of accuracy only slightly superior to random predictions, and (2) predicted secondary structure segments were, on average, only half as long as observed segments. The later two shortcomings could be surmounted by using a particular combination of neural networks [39].
Evolutionary information key to significantly improved predictions.
On the one hand, about 75 out of 100 residues can be exchanged
in a protein without changing structure. On the other hand, exchanges
of 1-5 residues can already destabilise a protein structure.
These statements may appear contradictory. However, the explanation
is simple: evolution has explored exactly the unlikely exchanges
of particular amino acids at particular positions that do not
change structure, as a change of structure usually results in
a loss of function (and thus would not survive). Thus, the residue
exchange patterns extracted from a protein family (i.e. alignments
of similar sequences) are highly indicative of the specific structural
details for that family. The first method that reached a sustained
level of a three-state prediction accuracy above 70% was the profile-based
neural network system PHD which uses exactly such evolutionary
information derived from multiple sequence alignments as input
[39]. By stepwise incorporation of particular evolutionary information,
prediction accuracy (Fig. 72% accuracy
[39]. An interesting, technical detail of this network system
is that the use of a global 'descriptor', namely the overall amino
acid composition (percentage of occurrence of each of the 20 amino
acids) does not affect the local score for accuracy as measured
by the percentage of correctly predicted single residues. Using
amino acid composition, however, improves the accuracy in terms
of a more global score, such as the difference between the percentage
of observed and predicted secondary structure [39]. Was the neural
network an essential tool for the most accurate secondary structure
prediction? A nearest-neighbour algorithm can be used to incorporate
evolutionary information in a similar manner as the neural network
system; the result is a similar performance [40]. Methods combining
statistics, and multiple alignment information have been clearly
less successful, so far. In comparison to methods using single
sequence information only, methods making use of the growing databases
are 6-14 percentage points more accurate. Thus, using evolutionary
information secondary structure can now be predicted more accurately
and reliably than other features of protein structure.
PHDsec: profile-based neural network system for
secondary structure prediction [39]. From the multiple alignment
(here guide sequence SH3 plus 4 other proteins a1-a4, note: lower
case letters indicate deletions in the aligned sequence) a profile
of amino acid occurrences is compiled. To the resulting 20 values
at one particular position m
in the protein (one column) three values are added: the number
of deletions and insertions, and the conservation weight (CW).
13 adjacent columns are used as input. The whole network system
for secondary structure prediction consists of 3 layers: 2 network
layers and 1 layer averaging over independently trained networks.
Secondary structure predictions now extremely useful, in practice. How good is a prediction accuracy of 72% in practice? It is certainly reasonably good compared with the prediction of secondary structure by homology modelling [41]. However, prediction accuracy varies between different proteins, i.e., prediction accuracy is 72%±9% (one standard deviation) [39]. For applications this implies that predictions can be as good as > 95%, but also as bad as <54%. Can users distinguish one from the other? A few methods successfully use reliability indices allowing to label residues for which predictions are, on average, likely to be more accurate. Indeed, for the neural network system PHD the correlation between such a reliability index and accuracy is linear [39]. Thus, the reliability index effectively becomes a means to predict prediction accuracy, and hence to assess to which class a protein of unknown structure (U) belongs: to the well predicted, or to the badly predicted ones. Various methods successfully use secondary structure predictions as a first step, e.g., prediction-based threading (indeed one of the problems of the Asilomar 1996 prediction contest was that many developers of threading algorithms used the same PHD secondary structure predictions as a first step), inter-strand, and inter-residue distance predictions. However, the use of secondary structure predictions is not limited to structure prediction. Instead, the results of, for instance, the public prediction service (PredictProtein [39]) have been used to assist the determination of protein structures (chain tracing in X-ray crystallography), as well as to formulate hypothesis about protein structure, and function that guided experiments in molecular biology, in general (in particular: prediction of binding sites, homologous proteins, design of residue mutations).
Separate prediction of secondary structure content not very
useful. Proteins have been partitioned into various structural
classes, e.g., based on the percentage of residues assigned to
helix, strand, and other [42]. However, such a coarse-grained
classification is not well-defined [43]. Consequently, given
a protein sequence U of unknown structure, attempts to
first predict the secondary structure content for U and
then to use the result to predict the secondary structural class
(i.e., all-a, all-b
or intermediates) is of limited practical use. How do alignment-based
predictions compare to experimental means of determining the content
in secondary structure? For example, PHD is, on average surprisingly,
about as accurate as circular dichroism spectroscopy [43]. Of
course, this does not imply that predictions can replace experiments.
In particular, variation of secondary structure as a result of
changes in environmental conditions (e.g. solvent) is generally
only accessible experimentally.
Basic concept. It has long been argued that if the segments
of secondary structure could be accurately predicted, the 3D structure
could be predicted by simply trying different arrangements of
the segments in space [44]. One criterion for assessing each
arrangement could be to use predictions of residue solvent accessibility
[45, 46]. The principal goal is to predict the extent to which
a residue embedded in a protein structure is accessible to solvent
(Fig. 5). Solvent accessibility can be described in several ways
[45, 46]. The simplest is a two-state description distinguishing
between residues that are buried (relative solvent accessibility
< 16%) and exposed (relative solvent accessibility ³ 16%).
The classical method to predict accessibility is to assign either
of the two states, buried or exposed, according to residue hydrophobicity.
However, a neural network prediction of accessibility has been
shown to be superior to simple hydrophobicity analyses [47].
Residue solvent accessibility is usually measured
by rolling a spherical water molecule over a protein surface and
summing the area that can be accessed by this molecule on each
residue (typical values range from 0-300 Å2).
To allow comparisons between the accessibility of long extended
and spherical amino acids, typically relative values are compiled
(actual area as percentage of maximally accessible area). A more
simplified descriptions distinguishes two states: buried (here
residues C and D) and exposed (here residues A, B, E, F, and G)
residues. Since the packing density of native proteins resembles
that of crystals, values for solvent accessibility provide upper
and lower limits to the number of possible inter-residue contacts.
Evolutionary information improves prediction accuracy.
Solvent accessibility at each position of the protein structure
is evolutionarily conserved within sequence families. This fact
has been used to develop methods for predicting accessibility
using multiple alignment information [48]. Prediction accuracy
is about 75±7%, four percentage points higher than for methods
not using alignment information. Predictions of solvent accessibility
have also been used successfully for prediction-based threading,
as a second criterion towards 3D prediction by packing secondary
structure segments according to upper and lower bounds provided
by accessibility predictions, and as basis for predicting functional
sites [48]. More recently, predictions of accessibility were
also used successfully to predict sub-cellular location (Andrade
& O'Donoghue, priv. communication).
Basic concept. Even in the optimistic scenario that in
the near future most protein structures will be either experimentally
determined, one class of proteins will still represent a challenge
for experimental determination of 3D structure: transmembrane
proteins. The major obstacle with these proteins is that they
do not crystallise, and are hardly tractable by NMR spectroscopy.
Consequently, for this class of proteins structure prediction
methods are even more needed than for globular water-soluble proteins.
Fortunately, the prediction task is simplified by strong environmental
constraints on transmembrane proteins: the lipid bilayer of the
membrane reduces the degrees of freedom to such an extent that
3D structure formation becomes almost a 2D problem. Two major
classes of membrane proteins are known: proteins which insert
helices into the lipid bilayer (Fig. 6), and proteins that form
pores by a barrel of 16 strands (the only known cases of this
type are porins [49]). Since there is not much experimental information
available on different porin-like membrane proteins, we can hardly
estimate prediction accuracy for this class. The situation is
quite different for helical membrane proteins. Once the location
of transmembrane segments is known for helical transmembrane proteins,
3D structure can be predicted by exploring all possible conformations
[50]. Additionally, predicting the locations of these transmembrane
helices is a much simpler problem than is the prediction of secondary
structure for soluble proteins. Elaborated combinations of expert-rules,
hydrophobicity analyses and statistics yields a two-state per-residue
accuracy of about 90% (residues predicted correctly as either
transmembrane helix, or other).
Topology of helical transmembrane proteins.
In one class of membrane proteins, typically apolar helical segments
are embedded in the lipid bilayer oriented perpendicular to the
surface of the membrane. The helices can be regarded as more
or less rigid cylinders. The orientation of the helical axes,
i.e. the topology of the transmembrane protein, can be defined
by the orientation of the first N-terminal residues with respect
to the cell. Topology is defined as out when the protein
N-term (first residue) starts on the extra-cytoplasmic region
(protein A), and as in if the N-term starts on the intra-cytoplasmic
side (proteins B and C).
Evolutionary information improves prediction accuracy.
For two methods the use of multiple alignment information is
reported to clearly improve the accuracy of predicting transmembrane
helices [51, 39]. The best current prediction methods have a
similar high accuracy around 95%. One such method uses a system
of neural networks similar to the one sketched in Fig. 4. In
order to predict the orientation of the helices (i.e. the topology
Fig. 6) a simple rule is applied: positively charged residues
occur more often in intra-cytoplasmic than in extra-cytoplasmic
regions. The advanced neural network system has been improved
significantly by adding a dynamic programming algorithm to the
neural network output. The principle idea is to use the neural
network output as an energy landscape and to find the optimal
path through this landscape [27]. As reliable data for the locations
of transmembrane helices exists only for a few proteins, data
used for deriving these methods originate predominantly from experiments
in cell biology and gene-fusion techniques. Different experimental
groups often report different locations for transmembrane regions.
Thus, the level of 95% accuracy is not verifiable. Despite this
uncertainty in detail, the prediction of transmembrane helices
is a valuable tool to quickly scan entire chromosomes [27]. The
classification into membrane/not-membrane proteins has an expected
error rate of less than two percent, i.e., about two percent of
the proteins predicted to contain transmembrane regions will probably
be false positives. The predictions of transmembrane helices
has provided a lower bound to approach the question of how many
proteins organisms need for, e.g., communication: the percentage
of proteins with transmembrane helices has been estimated to be
about 25% for yeast and haemophilus influenzae, and around 10-15%
for mycoplasma genitalium and methanococcus jannaschii (Rost,
manuscript in preparation).
Prediction problem is a hard one, but the stakes are high. Given all inter-residue contacts or distances (Fig. 3D structure can be reconstructed by distance geometry or molecular dynamics. This is used for the determination of 3D structures by nuclear magnetic resonance (NMR) spectroscopy which produces experimental data of distances between protons [52]. Can inter-residue contacts be predicted? Obviously, some fraction of these contacts can be: helices and strands can be assigned based on hydrogen-bonding pattern between residues. Thus, a successful prediction of secondary structure implies a successful prediction of some fraction of all the contacts. However, contacts predicted from secondary structure assignment are short-ranged, i.e., between residues nearby in sequence. For a successful application of distance geometry, long-range contacts have to be predicted, i.e., contacts between residues far apart in sequence. A few methods have been proposed for the prediction of long-range inter-residue contacts. Two questions surround such methods: first, how accurate are these prediction methods on average; and second, are all important contacts predicted?
Correlated mutations can imply spatial proximity. In sequence alignments, some pairs of positions appear to co-vary in a physico-chemically plausible manner, i.e., a 'loss of function' point mutation is often rescued by an additional mutation that compensates for the change [53]. One hypothesis is that compensations would be most effective in maintaining a structural motif if the mutated residues were spatial neighbours. Attempts have been made to quantify such a hypothesis and to use it for contact predictions [54-56]. In general, prediction accuracy is rather poor, with a direct trade-off between predicting enough contacts, and predicting only correct ones, e.g., taking 5% of the best-predicted long-range contacts (sequence separation above 10 residues) the accuracy prediction is about 50% (A. Valencia, priv. communication).
Distinction between different models, no prediction of 3D,
yet. Analysing correlated mutations is only one way to predict
long-range inter-residue contacts. Other methods use statistics,
mean-force potentials, or neural networks. So far none of the
methods appears to find a path between the Scylla of missing too
many true contacts and the Charibdis of predicting too many false
contacts. However, some of the methods provide sufficient information
to distinguish between alternative models of 3D structure (Valencia,
manuscript in preparation). The ambitious goal to predict long-range
inter-residue contacts sufficiently accurately will hopefully
continue to attract intellectual resources.
Simplifying the contact prediction problem. One simplification of the problem to predict inter-residue contacts focuses on predicting the contacts between residues in adjacent strands (Fig. 1). Such an attempt is motivated by the hope that such interactions are more specific than are sequence-distant (long-range) contacts in general, and hence are easier to predict.
Identifying the correct b-strand alignment. The only method published for predicting inter-strand contacts is based on potentials of mean-force [57] similar to those used in the evaluation of strand-strand threading [58]. Propensities are compiled by database counts for 2 ¥ 2 ¥ 2 classes (parallel/anti-parallel, H-bonded/not H-bonded, N-/C-terminal). Each of the eight classes is divided further into five sub-classes in the following way. Suppose the two strand residues at positions i and j are in close in space. Then the following five residue pairs are counted in separate tables: i/j-2, i/j-1, i/j, i/j+1, i/j+2 . Such pseudo-potentials identify the correct b-strand alignment in 35-45% of the cases.
Using evolutionary information to predict inter-strand contacts.
Even if the locations of strands in the sequence are known exactly,
the pseudo-potentials cannot predict the correct inter-strand
contacts in most cases [57]. However, when using multiple alignment
information, the signal-to-noise ratio increases such that inter-strand
contacts have been predicted correctly for most of the strands
inspected in some test cases [57]. For the purpose of reliable
contact prediction, this result is inadequate, especially as the
locations of the strands are not known precisely. Can the pseudo-potentials
handle errors resulting from incorrect prediction of strands?
Various test examples using predictions by PHDsec [39] as input
to the strand pseudo-potentials indicate that the accuracy in
predicting inter-strand contacts drops (T Hubbard, unpublished),
but in some cases is still high enough to be useful for approximate
modelling of 3D structure [59].
Basic concept. An analysis of PDB reveals that all protein pairs with more than 30% pairwise sequence identity (for alignment length > 80; [23]) have homologous 3D structures, i.e., the essential fold of the two proteins is identical, details such as additional loop regions - regions not in helices or strands - may vary. Structure is more conserved than is sequence. This is the pillar for the success of homology modelling. The principal idea is to model the structure of U (protein of unknown structure) based on the template of a sequence homologue of known structure (T). Consequently, the precondition for homology modelling is that a sequence homologue of known structure is found in PDB. Since homology modelling is currently the only theoretical means to successfully predict 3D structure, this has two implications. First, homology modelling is applicable to 'only' one quarter of the known protein sequences (Fig. 2). Second, as the template of a homologue is required, no unique 3D structure can yet be predicted, i.e., no structure that has no similarity to any experimentally determined 3D structure. Suppose, there is a protein with a sequence similar to U in PDB (say T), is homology modelling straightforward?
High level of sequence identity: atomic resolution. The basic assumption of homology modelling is that U and T have identical backbones (main chain C). The task is to correctly place the side chains of U into the backbone of T . For very high levels of sequence identity between U and T (ideally differing by one residue only), side chains can be 'grown' during molecular dynamics simulations [60]. For slightly lower levels (still of high sequence similarity), side chains are built based on similar environments in known structures [61, 62]. Rotamer libraries (libraries containing all side-chain orientations observed in known structures) are used in the following way. (1) Rotamer distributions are extracted from a database of non-redundant sequences. (2) Fragments of seven (helix, strand) or five residues (other) are compiled. (3) Fragments of the same length are successively shifted through the backbone of U . (4) For modelling the side chains of U only those fragments from the rotamer library are accepted which have the same amino acid in the centre as U , and for which the local backbone is similar to that around the evaluated position). Over the whole range of sequence identity between U and T for which homology modelling is applicable, the accuracy of the model drops with decreasing similarity. For levels of at least 60% sequence identity, the resulting models are quite accurate [62, 63], for even higher values, the models are as accurate as is experimental structure determination. The limiting factor is the computation time required. How accurate is homology modelling for lower levels of sequence identity?
Low level of sequence identity: loop regions sometimes correct.
With decreasing sequence identity between the known structure
H and the query protein U the number of loops
that have to be inserted to align the two grows. An accurate
modelling of loop regions, however, implies solving the structure
prediction problem. The problem is simplified in two ways. First,
loop regions are often relatively short and can thus be simulated
by molecular dynamics (note the CPU time required for molecular
dynamics simulations grows exponentially with the number of residues
of the polypeptide to be modelled). Second, the ends of the loop
regions are fixed by the backbone of the template structure.
Various methods are employed to model loop regions. The best
have the orientation of the loop regions correct in some cases
[63]. This illustrates the current limitations of molecular dynamics:
not even short loop regions can be predicted from sequence. Furthermore,
for experimental structure refinement (use of molecular dynamics
to improve consistency, and accuracy of experimental data) molecular
dynamics is successfully applied to find a better solution when
starting from an almost correct structure. However, for homology
modelling, molecular dynamics refinement usually reduces prediction
accuracy [63]. Below about 40% sequence identity the accuracy
of the sequence alignment used as basis for homology modelling
becomes an additional problem. Nevertheless, even down to levels
of 25-30% sequence identity, homology modelling produces coarse-grained
models for the overall fold of proteins of unknown structure.
Basic concept. As noted in the previous section, naturally evolved sequences with more than 30% pairwise sequence identity have homologous 3D structures [23]. Are all others non-homologous? Not at all. In the current PDB database there are thousands of pairs of structurally homologous pairs of proteins with less than 25% pairwise sequence identity (remote homologues). Actually, most similar protein structures are such remote homologues (Rost, in press). If a correct alignment between U (sequence of unknown structure) and a remote homologue T (pairwise sequence identity to U < 25%) is given, one could build the 3D structure of U by (remote) homology modelling based on the template of T . A successful remote homology modelling must solve three different tasks. (1) The remote homologue (T) has to be detected. (2) U and T have to be correctly aligned. (3) The homology modelling procedure has to be tailored to the harder problem of extremely low sequence identity (with many loop regions to be modelled). Most methods developed so far have been primarily addressed to solve the first, and the second problem. The basic idea is to thread the sequence of U into the known structure of T and to evaluate the fitness of sequence for structure by some kind of environment-based or knowledge-based potential [64, 65]. Threading is in some respects a harder problem than is the prediction of 3D structure (NP-complete: [66]; no physical connection between remote homologues, as many remotely homologous protein pairs may have originated from different ancestors, Rost, in preparation; [67]). However, the stakes are high: solving the threading problem could enable the prediction of thousands of protein structures. Indeed, threading has evolved to become one of the most active fields in the arena of protein structure prediction (with well over 100 annual publications).
Variety of threading techniques. The optimism generated by one of the first papers on threading published in the 90s [68] has boosted attempts to develop threading methods. The principle idea has been to use structural propensities of amino acids (such as preferences for secondary structure formation, hydrophobicity, and polarity), and to then assess whether or not a given sequence with its structural preferences fits into the structural environment of a given structure [65]. A principally different approach has been pushed by Manfred Sippl [69, 67]. The idea is to use the rich knowledge deposited in the database of protein structures (PDB) by extracting mean-force potentials. Such potentials monitor the observed distances between residue pairs of particular amino acids, with a particular sequence separation (number of residues between the two). Until 1995, most threading methods have used mean-force-potentials [64, 37, 67]. A more recent generation of threading methods is based on 1D predictions [48]: first 1D structure (secondary structure and solvent accessibility) is predicted for a sequence of unknown structure, then the 1D structure is extracted for a library of known structures, and finally the observed and the predicted 1D structure strings are aligned by typical dynamic programming algorithms [30]. Has all this effort achieved to crack the hard nut threading?
Remote homologues can often be detected. First the good news: since the different mean-force-potentials which have been proposed capture different aspects of protein structure, the correct remote homologue is likely to be found by at least one of them [70]. Now the bad news: so far, no single method has been able to detect the correct remote homologue for more than half of all test cases [70]. For the methods which have been rigorously evaluated using large test sets, the correct remote homologue is detected in less than 40% of all cases [48]. However, this performance is clearly superior to that of traditional sequence alignments at this low level (<25%) of sequence identity. Furthermore, the success of the last Asilomar experiment on structure prediction (forthcoming issue of Proteins) suggests that the likelihood to detect the correct remote homologue is reasonably high when the choice is refined by experts.
3D prediction by threading is still not reliable. Detecting
the remote homology is only the first of the three obstacles.
It appears that the second obstacle (correct alignment between
U and T ) is much more difficult and, unfortunately,
there is no general solution so far. Thus the final step, building
a 3D model, usually fails since the modelling procedures available
today cannot correct the mistakes in the alignments. Although
the last Asilomar experiment on structure prediction (forthcoming
issue of Proteins) suggested that major improvements have been
accomplished over the last two years, there are still very few
publications to date which report accurate 3D predictions from
threading methods. Currently, the successful use of threading
methods requires sceptical, expert user intervention to spot wrong
hits and false alignments. It is still possible that threading
method will become the most successful structure prediction method,
but a lot of detailed work lies ahead.
Recent breakthrough in structure prediction? In the 1994 Asilomar meeting, none of the 3D ab initio methods were able to predict the correct protein structure [63]. Since that time, new methods have been proposed which indicate possible directions for the future. Several groups have obtained promising results using distance geometry methods [48]. Simplified force-fields in combination with dynamic optimisation strategies have yielded promising, but still relatively inaccurate results [71, 72]. Srinivasan and Rose have reported very encouraging results with their hierarchical search method [73]. However, the second Asilomar experiment on structure prediction (forthcoming issue of Proteins) concluded similarly to the first: no prediction of 3D structure from sequence, yet.
Accurate prediction of 3D structure for coiled-coil proteins. A particular class of proteins are coiled-coils. These are proteins can be defined by a rather simple geometry of long helices, of which two or more wind around one another [28]. Nilges and Brünger [74] have achieved atomic-accuracy in an ab initio prediction of the GCN4 leucine zipper using a hybrid molecular dynamics/simulated annealing search strategy. Recently, equally accurate models for three leucine zippers were obtained with faster calculations based on mean-force-potentials (O'Donoghue, in press).
Recognising incorrect structures. The single most important theoretical advance in 3D prediction in recent years may have been the development of mean-force-potentials. Before these potentials, structure prediction was normally done with 'physical' potentials, i.e., bonds, angles, torsion angles, and van der Waals as well as electrostatic non-bonded terms which describe the internal energy of the molecule [6]. In contrast, the mean-force-potentials, derived from databases of protein structure [75], attempt to describe the free-energy of the molecule. The physical potentials have been used very successfully to refine experimentally determined structures [52]. However, these terms cannot distinguish between a native fold and a grossly misfolded structure [75]. In contrast, mean-force-potentials of pairwise residue distances are quite successful in fold recognition, as well as remote homology modelling [67]. It remains to be seen how best to combine these two different potentials. In one pilot study on the use of mean-force-potentials for 3D structure prediction, best results where obtained by combining both potentials (O'Donoghue and Nilges, private communication).
Extracting principles about structure formation from structures?
The mean-force-potential approach has recently been extended
to study protein folding. Both the physical basis and the general
characteristics of protein folding remain controversial [76].
Simulations and other studies indicate that the free energy balance
of hydrogen bond formation is close to zero, or slightly unfavourable
[77, 78], and that a specific fold is selected primarily by side-chain
interactions [76]. Recently, Sippl et al. have extended the concept
of deriving mean-force potentials to a formalism of describing
Helmholtz free energies of atom-pair interactions [79]. The formalism
starts with the following two assumptions: (1) that protein structures
can be described by Helmholtz free energies (or mean-force-potentials),
and (2) that the distribution of intra-molecular distances in
experimentally determined protein structures does, on average,
not deviate substantially from the corresponding distribution
in native proteins. To normalise the absolute free energy contributions,
the ideal gas is chosen (no internal interactions). Without any
further assumptions or approximations, atom-atom mean-force-potentials
are derived from a data set of known protein structures. The
resulting Helmholtz mean-force-potentials unravel interesting
principles about protein structure formation. (1) Backbone H-bonds
(except for the a-helix interaction
Oi ... Ni+4) do not contribute
to the thermodynamic stability of native folds. (2) H-bond formation
(except for Oi ... Ni+4)
requires energy input that is regained when H-bonds are formed.
Once formed, H-bonds are locked in a deep, narrow minimum. (3)
The energy gain of forming one ionic or two hydrophobic contacts
can provide roughly the activation energy required for forming
a H-bond. Both the eloquence and the conclusions of the approach
have prompted strong criticism, even unanimous rejection of these
findings. Do we witness an error in a method laid out to spot
errors, or the begin of a new era of force fields? Further applications
of these mean-force-potentials will be needed to answer this question.
Native 3D structures of proteins are encoded by a linear sequence of amino acid residues. To predict 3D structure from sequence is a task challenging enough to have occupied a generation of researchers. Have they finally succeeded in their goal? The bad news is: no, we still cannot predict structure for any sequence. The good news are: we have come closer, and growing databases facilitate the task.
Prediction in 3D: theory bridges the sequence-structure gap. The only source for new, unique protein structures (structures for which no homologue exists in the database) are experiments. However, given the amount of time needed to determine a protein structure experimentally, more non-unique structures can be predicted at atomic resolution by homology modelling in a month than have been determined by experiment over the last three decades. Homology derived models are frequently accurate at the level of atomic resolution. Unfortunately, most models typically have considerable co-ordinate errors in loop regions. Coarse-grained homology derived models are available for almost one third of the sequences deposited in the SWISS-PROT database [24]. Threading techniques could increase this ratio considerably by finding more distant homologues. However, for large scale sequence analyses threading techniques are not yet reliable.
Predictions in 1D: significant improvement by larger databases. The rich information contained in the growing sequence and structure databases has been used to improve the accuracy of predictions of some aspects of protein structure. Evolutionary information is successfully used for predictions of secondary structure, solvent accessibility, and transmembrane helices. These predictions of protein structure in 1D are significantly more accurate, and more useful than five years ago. Some methods have indicated that 1D predictions can be useful as an intermediate step on the way to predicting 3D structure (inter-strand contacts; prediction-based threading). Another advantage of predictions in 1D is that they are not very CPU-intensive, i.e., 1D structure can be predicted for the protein sequence of, for example, entire yeast chromosomes overnight.
Predictions in 2D: so far of limited success. The prediction accuracy of chain-distant inter-residue contacts is so far relatively limited. Analysis of correlated mutations can be used to distinguish between alternative models (e.g. for threading techniques). The prediction of inter-strand contacts appears to be useful in some cases. An accurate method for the automatic prediction of contacts between residues not close in sequence remains to be developed.
Most breakthroughs in protein structure prediction were achieved
over the last six years. Thus, although we still cannot solve
the general prediction problem, progress has been made. In general,
however, we could ask the question - is it worth persevering with
structure prediction, given that it is clearly such a difficult
task? The answer is: yes. The methods which have spun off from
structure prediction have already given us considerable insight
into the first four complete genomes. Perseverance with structure
prediction will yield fruit in about five years time when the
human genome will be known.