
Prediction In 1D: secondary structure, membrane helices, and accessibility

Burkhard Rost 1, *

 

1 CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
 rost@columbia.edu,
 http://cubic.bioc.columbia.edu/
  Tel: +1-212-305-3773,   fax: +1-212-305-7932


Abbreviations used

1D structure: one-dimensional, e.g. sequence, or strings of secondary structure or solvent accessibility
2D structure: two-dimensional (e.g. inter-residue distances)
3D structure: three-dimensional co-ordinates of protein structure
ASP: method identifying regions of structure ambivalent in response to global changes [1; 2]
BLAST: fast sequence alignment method [3]
CASP: Critical Assessment of Protein Structure Prediction [4]
COILS: coiled-coil prediction [5]
DSSP: program and database assigning secondary structure and solvent accessibility for proteins of known 3D structure [6]
EVA: server automatically evaluating structure prediction methods [7; 8]
HMM: Hidden Markov Model
HMMSTR: Hidden Markov model-based prediction of secondary structure [9]
HMMTOP: Hidden Markov model predicting transmembrane helices [10]
JPred2: divergent profile (PSI-BLAST) based neural network prediction of secondary structure and solvent accessibility [11]
MEMSAT: dynamic-programming based prediction of transmembrane helices [12]
META-PP: internet service allowing to access a variety of bioinformatics tools through one single interface [13]
PHD: Profile based neural network prediction of secondary structure, solvent accessibility and transmembrane helices [14]
PHDpsi: divergent profile (PSI-BLAST) based neural network prediction [15]
PSI-BLAST: position specific iterated database search [16]
PROFphd: Advanced profile-based neural network prediction of secondary structure [17]
PSIPRED: divergent profile (PSI-BLAST) based neural network prediction [18]
SAM-T99sec: neural network prediction, using hidden Markov models as input [19]
SOSUI: hydrophobicity and amphiphilicity based transmembrane helix prediction [20]
SPLIT: transmembrane helix prediction [21]
SSpro: profile-based advanced neural network prediction method [22]
SSpro2: divergent profile-based advanced neural network prediction method [23]
TM: transmembrane
TMAP: alignment-based prediction of transmembrane helices [24]
TMH: transmembrane helix
TMHMM: Transmembrane prediction using cyclic hidden Markov models [25]
TMpred: prediction of transmembrane helices [26]
TopPred2: hydrophobicity-based membrane helix prediction [27; 28]
 
SYMBOLS USED:
 
secondary structure: H = helix; E = strand; L = other
transmembrane helix: T = transmembrane; N = globular
solvent accessibility: e = exposed (≥ 16% relative accessible surface); b = buried (< 16% relative accessible surface)

 




 


Summary

Predictions of simplified aspects of protein structure are often the first step towards gaining some insight into the function of a protein. Furthermore, proteome analysis and methods predicting 3D structure are increasingly based upon 1D predictions. Developing 1D prediction methods may be one of the most active and most successful disciplines of bioinformatics. Here, I summarise some of the major ideas behind the available methods. Particular focus is on evaluating the performance of methods. Recent advances are reviewed and some hints for using methods for sequence analysis are given.

 

Introduction

No general prediction of 3D structure from sequence, yet. The hypothesis that the 3D structure of a protein ('the fold') is uniquely determined by the specificity of the sequence has been verified for many proteins [29]. While it is now known that particular proteins (chaperones) often play an important rôle in folding [30, 31, 32], it is still generally assumed that the final structure is at the free-energy minimum [33]. Thus, all information about the native structure of a protein is coded in the amino acid sequence, plus its native solution environment. Can we decipher the code, i.e. can we predict 3D structure from sequence? In principle, the code could be deciphered from physico-chemical principles [34, 35]. In practice, the inaccuracy in experimentally determining the basic parameters and the limited computing resources prevent prediction of protein structure from first principles [36]. Hence, the only successful structure prediction tools are knowledge-based, using a combination of statistical theory and empirical rules. The field of protein structure prediction has advanced significantly over the last decade (see Chapter 27). However, we still cannot predict structure from sequence in general. Rather, the best methods now sometimes get the basic characteristics of a fold right [37, 38].

Structure prediction in 1D is becoming increasingly accurate and important. An extreme simplification of the prediction problem is to project 3D structure onto strings of structural assignments. For example, we can assign a secondary structure state, marked by one symbol, to each residue, or we can assign a number for the accessibility of that residue. Such strings of per-residue assignments are essentially one-dimensional (1D). In fact, arguably the most surprising improvements in bioinformatics over the last decade may have been achieved by methods predicting protein structure in 1D. The key to this breakthrough was the wealth of information about evolution contained in ever-growing databases. Moreover, prediction accuracy continues to rise [39]! This success is crucial for target selection in structural genomics, for using structure prediction to get clues about function, and for using simplified predictions for more sensitive database searches and predictions of higher-dimensional aspects of protein structure (see below).

Apologies to developers! This brief synopsis of methods predicting protein structure in 1D has no chance of being fair to all developers of such methods. Even a restricted MEDLINE search revealed over 200 publications in the last 12 months. Consequently, the review will be somewhat unfair to the majority of developers. Instead, the focus lies on the small subset of the most accurate or most widely used methods.

 

 

Methods



Secondary Structure Prediction Methods

Basic concept. The principal idea underlying most secondary structure prediction methods is that segments of consecutive residues have preferences for certain secondary structure states [40, 14]. Thus, the prediction problem becomes a pattern classification problem tractable by pattern recognition algorithms. The goal is to predict whether a residue is in a helix, in a strand, or in neither (no regular secondary structure, often referred to as the 'coil' or 'loop' state). The first-generation prediction methods of the 1960s and 1970s were all based on single amino acid propensities [41, 42, 43, 44, 45]. Basically, these methods compiled the probability of a particular amino acid for a particular secondary structure state. The second-generation methods, which dominated the scene until the early 1990s, extended this concept to propensities compiled for segments of adjacent residues, i.e. they took the local environment of the residues into consideration. Typically, methods used segments of 3-51 adjacent residues [46, 47, 48, 49, 50, 51, 52, 53, 54]. Almost any imaginable theoretical algorithm was applied to the problem of predicting secondary structure from sequence: physico-chemical principles, rule-based devices, expert systems, graph theory, linear and multi-linear statistics, nearest-neighbour algorithms, molecular dynamics, and neural networks [44; 45; 55; 56]. However, prediction accuracy seemed to stall at around 60% of all residues correctly predicted in one of the three states helix, strand, or other. It was argued that the limited accuracy resulted from the fact that all methods used only information local in sequence (input: about 3-51 consecutive residues). Local information was estimated to account for roughly 65% of the secondary structure formation. Two additional problems were common to most methods developed from 1957 to 1993. First, predicted secondary structure segments were, on average, only half as long as observed segments. Historically, this problem was first solved through a particular combination of neural networks [57, 58]. Second, strands were predicted at levels of accuracy only slightly superior to random predictions. The argument for this deficiency was that the hydrogen bonds determining the formation of sheets (note: paired strands form a sheet) are less local in sequence than the bonds responsible for helices (Chapter 17). Again, this problem was first solved through neural networks [57, 58]. The solution was rather simple: we realised that about 20% of the correctly predicted residues were in strands, about 30% in helices and about 50% in non-regular secondary structure. These values are similar to the percentages of the respective classes in proteins. This observation prompted us to simply bias the database used for training neural networks by presenting each class equally often. The result was a prediction well balanced between the three classes, i.e. about 60% of the strand residues were predicted correctly. In practice, this was an important advance. However, it also cast an important spotlight onto the explanation that secondary structure formation is partially determined by non-local interactions. Clearly, sheets are non-local structures. Nevertheless, the preferences for a segment to form a strand or a helix appear similarly strong, because both can be predicted at similar levels of accuracy when the prediction method is designed appropriately [58, 59, 14].
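The idea of presenting each class equally often during training can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the networks cited above: the function name balanced_sample, the sampling with replacement, and the toy data set are all hypothetical.

```python
import random
from collections import defaultdict

def balanced_sample(examples, n_per_class, seed=0):
    """Draw the same number of training examples from each class.

    examples: list of (features, label) pairs with label in {"H", "E", "L"}.
    Sampling is with replacement, so rare classes are over-sampled.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in examples:
        by_class[label].append((features, label))
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.choices(by_class[label], k=n_per_class))
    rng.shuffle(sample)
    return sample

# Toy data with roughly the class proportions quoted in the text:
# 30% helix, 20% strand, 50% other.
toy = [("x", "H")] * 30 + [("x", "E")] * 20 + [("x", "L")] * 50
balanced = balanced_sample(toy, n_per_class=40)
counts = {s: sum(1 for _, lab in balanced if lab == s) for s in "HEL"}
```

With such a sample, a classifier sees strand examples as often as helix or loop examples, which removes the incentive to under-predict the rarest class.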

Evolutionary information is key to significantly improved predictions. On the one hand, about 67 out of 100 residues can be exchanged in a protein without changing structure [60]. On the other hand, exchanges of very few residues often destabilise a protein structure. The explanation for this ostensible contradiction is simple: evolution has realised the unlikely by exploring all 'neutral' mutations that do not prevent structure formation (Footnote 1). Thus, the residue exchange patterns extracted from an aligned protein family are highly indicative of specific structural details. This also implies that a profile of N consecutive residues taken from alignments implicitly contains non-local information, since evolutionary selection acts on proteins as 3D objects rather than on sequences. Early on it was realised that this information can improve predictions [64, 65, 66]. However, the breakthrough of the third-generation methods to levels above 70% accuracy required a combination of larger databases with more advanced algorithms [58, 56]. It was also recognised very early on that information from the position-specific evolutionary exchange profile of a particular protein family facilitates discovering more distant members of that family [64]. Automatic database search methods successfully used position-specific profiles for searching [67]. However, the breakthrough to large-scale routine searches was achieved by the development of PSI-BLAST [16] and Hidden Markov models [68, 69]. Since the improvement of secondary structure prediction relies significantly on the information content of the family profile used, today's larger databases and better search techniques have pushed prediction accuracy even higher. The current top-of-the-line secondary structure prediction methods are all based on extended profiles [39, 15].
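To make the notion of a position-specific profile concrete, the following minimal sketch compiles relative amino acid frequencies per alignment column. It is an illustration only: real methods weight sequences and handle gaps more carefully, and the function name alignment_profile and the toy alignment are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def alignment_profile(aligned_seqs):
    """Relative frequency of each amino acid in every alignment column.

    Gap characters and unknown symbols are ignored; unweighted counting
    is a simplification chosen for this sketch.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(
            s[col].upper() for s in aligned_seqs if s[col].upper() in AMINO_ACIDS
        )
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile

# Hypothetical four-sequence family; '-' marks a gap.
family = ["ALKV", "AIKV", "GLRV", "AL-V"]
profile = alignment_profile(family)
```

A fully conserved column (the last one here) yields a single frequency of 1, whereas a variable column spreads its weight over the residues that evolution has tolerated at that position.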

The key players. PHD was the first program to surpass the level of 70% accuracy [58, 59]. It uses a system of neural networks to achieve a performance well balanced between all secondary structure classes (Fig. 1). Although still widely used, PHD is no longer the most accurate method [70, 15]. Similar in performance is JPred2 [11]; it combines the results from various prediction methods, in particular from JNet [11], NSSP [71], PREDATOR [72] and PHD [14]. David Jones pioneered the use of automated, iterative PSI-BLAST searches [18]. The most important advance of the resulting method, PSIPRED, is its detailed strategy for avoiding pollution of the profile by unrelated proteins. To avoid this trap, the database searched has to be filtered first [18]. Other than the advanced use of PSI-BLAST, PSIPRED achieves its success through a neural network system similar to that implemented in PHD. At the CASP meeting at which David Jones introduced PSIPRED, Kevin Karplus and colleagues presented their prediction method (SAM-T99sec), which finds more diverged profiles through Hidden Markov models [19]. The most important prediction method used by SAM-T99sec is a simple neural network with two layers of hidden units. However, the major strength of the method appears to be the quality of the alignment used. The only recently published method that improves prediction accuracy significantly through its algorithm, rather than through more divergent profiles, is SSpro [22]. The principal idea of the method is to overcome the limitations of feed-forward neural networks, whose input window is relatively small and of fixed size, by using bi-directional recurrent neural networks (BRNN) capable of taking the entire protein chain as input [22, 73]. The most recent improvement, realised in SSpro2, resulted from combining the advanced network architectures with PSI-BLAST profiles [23]. Quite a different route toward secondary structure prediction is taken by the HMMSTR/I-sites programs [9], described in more detail in Chapter 27.




Fig. 1: Neural network system for secondary structure prediction (PHDsec). From the multiple alignment (here guide sequence SH3 plus 4 other proteins a1-a4; note that lower-case letters indicate deletions in the aligned sequence) a profile of amino acid occurrences is compiled. To the resulting 20 values at one particular position m in the protein (one column) three values are added: the number of deletions, the number of insertions, and the conservation weight (CW). 13 adjacent columns are used as input. The whole network system for secondary structure prediction consists of 3 layers: 2 network layers and 1 layer averaging over independently trained networks.
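The input encoding described in the caption can be sketched as follows: 20 profile values plus 3 extra values per column, and 13 adjacent columns per input, i.e. 13 x 23 = 299 numbers. The sketch is hedged: padding positions beyond the chain ends with zeros is an assumption made here for simplicity, and window_input is a hypothetical helper, not part of PHD.

```python
def window_input(columns, centre, width=13):
    """Concatenate `width` adjacent profile columns centred on `centre`.

    columns: list of per-residue feature lists of equal length.
    Positions beyond the chain ends are padded with zeros (an assumption).
    """
    n_features = len(columns[0])
    half = width // 2
    flat = []
    for pos in range(centre - half, centre + half + 1):
        if 0 <= pos < len(columns):
            flat.extend(columns[pos])
        else:
            flat.extend([0.0] * n_features)
    return flat

# 20 amino acid frequencies + deletions + insertions + conservation weight
N_FEATURES = 20 + 3

# Dummy 50-residue protein: every feature of residue i is set to i + 1.
protein = [[float(i + 1)] * N_FEATURES for i in range(50)]
x_mid = window_input(protein, centre=25)
x_start = window_input(protein, centre=0)
```

Each residue thus receives its own fixed-length input vector, which is what allows a feed-forward network to be applied position by position along the chain.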




 

Specialised method: coiled-coil predictions. A coiled coil is a bundle of several helices assuming a side-chain packing geometry often referred to as 'knobs-into-holes' [74]. The 'knobs' are the side-chains of one helix that pack into the hole created by four side-chains surrounding the facing helix. This super-coil slightly alters the helix periodicity from 3.6 to 3.5 and results in the coiled-coil specific symmetry in which every seventh residue occupies a similar position on the helix surface. The first and fourth of the seven residues are typically hydrophobic; the others are typically hydrophilic and are frequently exposed to solvent. These specific sequence features are the basis of accurate predictions for coiled-coil helices [75, 76]. The most widely used program is COILS, which is based on amino acid preferences compiled for the few coiled-coil proteins that were known at high resolution a decade ago [5]. The program detects coiled-coil preferences in windows of 14, 21 and 28 residues. The longer the window, the better the distinction between proteins that have coiled-coil regions and those that do not [75]. If we know the precise location of the coiled-coil regions and the multimeric state, we can predict 3D structure for coiled-coil regions at levels of accuracy that resemble experimentally determined structures (below 2.5 Å) [77, 78]. O'Donoghue and Nilges used the experimentally known boundaries of the coiled-coil regions and their known multimeric state for prediction. It remains to be tested how sensitive that 3D prediction is to errors in predicting the coiled-coil regions. Recently, Wolf et al. developed a method predicting the multimeric state of a coiled-coil region [79]. When labelling all likely coiled-coil proteins in entire proteomes, we found that about 8-10% of all eukaryotic proteins and 2-10% of all proteins in archaea and prokaryotes contain at least one coiled-coil region [80].
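The heptad periodicity can be illustrated with a deliberately crude scan. To be clear, this is not the COILS algorithm, which uses position-specific amino acid preferences; the sketch merely counts hydrophobic residues at the first and fourth heptad positions. The hydrophobic residue set and the function names are assumptions chosen for illustration.

```python
HYDROPHOBIC = set("AILMFVWY")  # assumed set, not taken from COILS

def heptad_score(window):
    """Best fraction of hydrophobic residues at the first and fourth
    position of every heptad, maximised over the seven possible registers."""
    best = 0.0
    for register in range(7):
        hits = total = 0
        for i, aa in enumerate(window):
            if (i + register) % 7 in (0, 3):  # first and fourth position
                total += 1
                hits += aa in HYDROPHOBIC
        if total:
            best = max(best, hits / total)
    return best

def scan(sequence, width):
    """Score every window of `width` residues along the sequence."""
    return [heptad_score(sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]

# Idealised heptad repeat: leucine at positions 1 and 4 of every seven.
ideal = "LKELEDK" * 4
scores_14 = scan(ideal, 14)
scores_28 = scan(ideal, 28)
```

Scanning with widths of 14, 21 and 28 residues mirrors the window sizes mentioned above; a longer window averages over more heptads and is therefore less easily fooled by a short chance pattern.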



Solvent Accessibility Prediction Methods

Basic concept. It has long been argued that if the segments of secondary structure could be accurately predicted, the 3D structure could be predicted by simply trying different arrangements of the segments in space [81, 82, 83, 84]. One criterion for assessing each arrangement could be to use predictions of residue solvent accessibility [85, 86, 87]. The principal goal is to predict the extent to which a residue embedded in a protein structure is accessible to solvent. Solvent accessibility can be described in several ways [85, 86, 87]. The most detailed fast method compiles solvent accessibility by estimating the volume of a residue embedded in a structure that is exposed to solvent (Fig. 2; note: this method was developed by [87] and later implemented in DSSP [6]). Different residues have a different possible accessible area. The most extreme simplification for accessibility accounts for this by normalising (dividing the observed value by the maximally possible value) and reducing the result to a two-state description that distinguishes between residues that are buried (relative solvent accessibility < 16%) and exposed (relative solvent accessibility ≥ 16%). The precise choice of the threshold is not well-defined [88; 89]. The classical method to predict accessibility is to assign either of the two states, buried or exposed, according to residue hydrophobicity, i.e. very hydrophobic stretches are predicted to be buried. However, more advanced methods have been shown to be superior to simple hydrophobicity analyses [94; 95; 96; 97; 98]. Typically, these methods compile propensities of single residues or segments of residues to be solvent accessible in ways similar to those used by secondary structure prediction methods. For particular applications, such as using predicted solvent accessibility to predict glycosylation sites, it seems beneficial to train neural networks on different definitions of accessibility [99, 100]. In particular, Hansen, Brunak et al. realised alternative compilations by changing the size of the water molecule used in DSSP (Fig. 2). In contrast to the situation for secondary structure, most of the information needed to predict accessibility is contained in the preference of single residues [89]. Nevertheless, using windows of adjacent residues also improves solvent accessibility prediction significantly [89, 101].
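The normalisation and the two-state reduction amount to one division and one comparison. A minimal sketch follows; the areas in the example are hypothetical numbers for illustration, not measured or tabulated values.

```python
def relative_accessibility(observed_area, max_area):
    """Observed accessible area as a percentage of the maximally accessible area."""
    return 100.0 * observed_area / max_area

def two_state(rel_acc, threshold=16.0):
    """'b' (buried) below the threshold, 'e' (exposed) at or above it."""
    return "e" if rel_acc >= threshold else "b"

# Hypothetical (observed, maximal) areas in square Angstrom.
residues = [(10.0, 200.0), (32.0, 200.0), (150.0, 200.0)]
states = "".join(two_state(relative_accessibility(o, m)) for o, m in residues)
```

Because the threshold is a parameter, the same two functions can be reused to explore how sensitive a two-state assignment is to the choice of cut-off.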

 




Fig. 2: Measuring accessibility. Residue solvent accessibility is usually measured by rolling a spherical water molecule over a protein surface and summing the area that can be accessed by this molecule on each residue (typical values range from 0-300 Å²). To allow comparisons between the accessibility of long extended and spherical amino acids, relative values are typically compiled (actual area as a percentage of the maximally accessible area). A simpler description distinguishes two states: buried residues (here residues numbered 1-3 and 10-12) and exposed residues (here residues 4-9). Since the packing density of native proteins resembles that of crystals, values for solvent accessibility provide upper and lower limits to the number of possible inter-residue contacts.




 

Evolutionary information improves accessibility prediction. Solvent accessibility at each position of the protein structure is evolutionarily conserved within sequence families. This fact has been used to develop methods for predicting accessibility using multiple alignment information [89, 102, 14, 101, 11, 17]. The two-state (buried, exposed) prediction accuracy is above 75%, i.e. more than four percentage points higher than for methods not using alignment information. Predictions of solvent accessibility have also been used successfully for prediction-based threading, as a second criterion towards 3D prediction by packing secondary structure segments according to upper and lower bounds provided by accessibility predictions, and as a basis for predicting functional sites [103].

Available key players. Possibly, the exclusion of methods predicting solvent accessibility from the CASP meetings (see 'Practical Aspects') slowed down the progress of the field. In particular, few of the methods developed are readily available through public servers. Prominent exceptions are the solvent accessibility predictions by PHD [89] and PROFphd [17], available through the PredictProtein server [14, 104]. Both use systems of neural networks with alignment information. The improvement of PROFphd over PHD was achieved by (1) training the neural networks only on high-resolution structures and by (2) using predicted secondary structure as additional input [17]. Technically, PROFphd and PHD are the only available methods predicting real values for relative solvent accessibility, on a grid of 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 (percentage relative accessibility). Another method that improved prediction accuracy considerably over older programs is embedded in the JPred2 server [11]. It uses PSI-BLAST profiles as input to neural networks predicting accessibility in two states (buried/exposed).
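The grid values quoted above are simply the squares of 0 to 9. The sketch below maps an arbitrary relative accessibility onto that grid. How PHD and PROFphd assign values internally is not described in the text, so the rounding in square-root space used here is an assumption, and to_grid is a hypothetical helper.

```python
import math

GRID = [i * i for i in range(10)]  # 0, 1, 4, 9, ..., 81

def to_grid(rel_acc):
    """Map a relative accessibility (in percent) to the nearest grid value,
    measured in square-root space (assumed rounding rule)."""
    index = min(9, int(round(math.sqrt(max(rel_acc, 0.0)))))
    return GRID[index]

examples = [to_grid(v) for v in (0, 2, 16, 30, 100)]
```

A quadratic grid resolves small accessibilities finely and large ones coarsely, which matches the fact that the buried/exposed distinction is made at the low end of the scale.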



Transmembrane Helix Prediction Methods

The task. Even in the optimistic scenario that in the near future most protein structures will be experimentally determined, one class of proteins will still represent a challenge for experimental determination of 3D structure: transmembrane proteins. The major obstacle with these proteins is that they do not crystallise and are hardly tractable by NMR spectroscopy. Consequently, structure prediction methods are needed even more for this class of proteins than for globular water-soluble proteins. Fortunately, the prediction task is simplified by strong environmental constraints on transmembrane proteins: the lipid bilayer of the membrane reduces the degrees of freedom, making the prediction almost a 2D problem [105]. Two major classes of membrane proteins are known: proteins that insert helices into the lipid bilayer (Fig. 3), and proteins that form pores by β-strand barrels [106, 107, 108]. Since there is not much experimental information available on different porin-like (β-strand barrel) membrane proteins, we can hardly estimate prediction accuracy for this class. The situation is quite different for helical membrane proteins. Knowing the precise location of transmembrane helices, we can predict 3D structure by simply exploring all possible conformations [105]. While predicting transmembrane helices is simpler than predicting globular helices, there is ample evidence that prediction accuracy has been significantly over-estimated [109, 110].




Fig. 3: Topology of helical membrane proteins. In one class of membrane proteins, typically apolar helical segments are embedded in the lipid bilayer, oriented perpendicular to the surface of the membrane. The helices can be regarded as more or less rigid cylinders. The orientation of the helical axes, i.e. the topology of the transmembrane protein, can be defined by the orientation of the first N-terminal residues with respect to the cell. Topology is defined as out when the protein N-term (first residue) starts in the extra-cytoplasmic region (protein A), and as in if the N-term starts on the intra-cytoplasmic side (proteins B and C). The lower part explains the 'inside-out-rule'. The difference in the number of positive charges is compiled between all even and all odd non-membrane regions. If the even loops have more positive charges, the N-term of the protein is predicted outside. This rule holds for most proteins of known topology.
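The charge-difference rule from the caption can be written down directly. The sketch assumes that the non-membrane regions are already known and are given in sequence order, N-terminal region first; returning None for a tie is a choice made here, since the caption does not say how ties are handled. The function names and example loops are hypothetical.

```python
def count_positive(segment):
    """Number of lysine (K) and arginine (R) residues in a segment."""
    return sum(segment.count(aa) for aa in "KR")

def predict_topology(loops):
    """Predict the location of the N-terminus from loop charges.

    loops: non-membrane regions in sequence order, N-terminal region first.
    Returns 'out' if the even-numbered loops carry more positive charges,
    'in' if the odd-numbered loops do, and None for a tie (assumption).
    """
    odd = sum(count_positive(loop) for loop in loops[0::2])   # 1st, 3rd, ...
    even = sum(count_positive(loop) for loop in loops[1::2])  # 2nd, 4th, ...
    if even > odd:
        return "out"
    if odd > even:
        return "in"
    return None

# Hypothetical loop sequences for illustration.
topology_a = predict_topology(["GS", "KRRK", "DE"])
topology_b = predict_topology(["MKRK", "GSD", "RKKR", "ST"])
```

Since loops alternate between the two sides of the membrane, fixing the side of one loop fixes the orientation of every helix.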

 




 

Basic concept. We can use a number of observations that constrain the problem of predicting membrane helices. (1) TM helices are predominantly apolar and between 12 and 35 residues long [111]. (2) Globular regions between membrane helices are typically shorter than 60 residues [112; 80]. (3) Most TMH proteins have a specific distribution of the positively charged amino acids arginine and lysine, coined the 'positive-inside-rule' by Gunnar von Heijne [113; 114]. Connecting 'loop' regions at the inside of the membrane have more positive charges than 'loop' regions at the outside (Fig. 3). (4) Long globular regions (> 60 residues) differ in their composition from those globular regions subject to the 'inside-out-rule'. Most methods simply compile the hydrophobicity along the sequence and predict a segment to be a transmembrane helix if the respective hydrophobicity exceeds some given threshold [115, 116, 117, 118, 12, 119, 20, 120, 10, 121, 122]. Additionally, some methods also explore the hydrophobic moment [116, 106, 123], or other membrane-specific amino acid preferences [124, 125, 126, 127]. The most important step is to adequately average hydrophobicity values over windows of adjacent residues [27, 119]. One of the major problems of hydrophobicity-based methods appears to be the poor distinction between membrane and globular proteins [128, 109]. A number of methods use the positive-inside-rule to also predict the orientation of membrane helices [129, 12, 130, 25, 121, 131].
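A minimal version of the 'average hydrophobicity over a window and apply a threshold' approach is sketched below. It is not any of the published programs: the Kyte-Doolittle scale, the window of 19 residues and the threshold of 1.6 are conventional choices assumed here for illustration, and the 12-35 residue length filter is taken from observation (1) above.

```python
# Kyte-Doolittle hydropathy scale (an assumed choice of scale).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def window_means(seq, width=19):
    """Mean hydropathy of the window centred on each residue
    (windows are truncated at the chain ends)."""
    half = width // 2
    means = []
    for i in range(len(seq)):
        window = seq[max(0, i - half): i + half + 1]
        means.append(sum(KD[aa] for aa in window) / len(window))
    return means

def candidate_helices(seq, width=19, threshold=1.6, min_len=12, max_len=35):
    """Runs of residues whose window mean exceeds the threshold,
    kept only if their length lies between min_len and max_len."""
    means = window_means(seq, width)
    segments, start = [], None
    # The sentinel value closes a run that reaches the end of the chain.
    for i, m in enumerate(means + [float("-inf")]):
        if m > threshold and start is None:
            start = i
        elif m <= threshold and start is not None:
            if min_len <= i - start <= max_len:
                segments.append((start, i - 1))
            start = None
    return segments

# Hypothetical sequence: 20 leucines flanked by charged residues.
seq = "DEKR" * 5 + "L" * 20 + "DEKR" * 5
segments = candidate_helices(seq)
```

Note that such a scan says nothing about whether a protein is a membrane protein at all: a hydrophobic core strand of a globular protein can exceed the threshold just as well, which is the weakness mentioned above.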

Evolutionary information improves prediction accuracy. Using evolutionary information also significantly improves TMH predictions [132, 128, 133, 130]. However, the growth of the sequence databases seems to have reversed the advantage of using evolutionary information [110]. Until around 1997, most membrane helices were conserved in the following sense. Assume protein A has a TMH at positions N1-N2. Since the number of membrane helices is important for the function of the protein, we expect that all proteins A' that are found to be similar to A in a database search will also have a membrane helix at the corresponding positions N1-N2. However, precisely this assumption no longer proves correct [111]. The practical result is that alignment-based predictions are much less accurate when based on the large merger of SWISS-PROT and TrEMBL [134] than when based on the smaller SWISS-PROT alone [111]. Interestingly, we can still exploit the power of evolutionary information by carefully filtering the results from PSI-BLAST searches [110].

Available key players. TopPred2 is one of the classics in the field. It averages the GES-scale of hydrophobicity [118] using a trapezoid window [27, 129]. MEMSAT [12] introduced a dynamic programming optimisation to find the most likely prediction based on statistical preferences. TMAP [24] uses statistical preferences averaged over aligned profiles. PHD combines a neural network using evolutionary information with a dynamic programming optimisation of the final prediction [128, 133]. DAS optimises the use of hydrophobicity plots [28]. SOSUI [20] uses a combination of hydrophobicity and amphiphilicity preferences to predict membrane helices. TMHMM is the most advanced, and seemingly the most accurate, current method to predict membrane helices [25]. It embeds a number of statistical preferences and rules into a Hidden Markov model to optimise the prediction of the localisation of membrane helices and their orientation (note: similar concepts are used for HMMTOP [10]).

 

 

Programs and public servers

All methods described are available through public servers. A list of URLs and contact addresses is given in Table 1. Most of the programs listed, except HMMSTR and PSIPRED, are also available by a single click: META-PP allows you to fill out a form with the sequence and your email address once and to simultaneously submit your protein to a number of high-quality servers [13]. This concept of accessing many servers through one was pioneered by the BCM-Launcher [135], which supposedly accesses the largest number of different methods. Other combinations are given by NPSA [136], META-Poland [137], and ProSAL [138]. In contrast to all others, META-PP attempts to (i) return as few results as possible by filtering out technical messages and to (ii) combine only high-quality methods. Note that both the BCM launcher and the current GCG package [139] return predictions of secondary structure from methods that are neither 'state-of-the-art' nor competitive with the best method from a decade ago, without indicating this to the user. A generalisation of the 'common interface' idea is implemented in the sequence retrieval system SRS [140, 141], which enables simultaneous access to most existing databases. SRS is gradually starting to incorporate direct access to prediction methods as well.

 

 

Practical aspects



Evaluation of Prediction Methods

Correctly evaluating protein structure prediction is difficult. Developers of prediction methods in bioinformatics may significantly over-estimate their performance for the following reasons. First, it is difficult and time-consuming to correctly separate data sets used for developing and testing. Second, estimates of performance of the different methods are often based on different data sets. This problem frequently originates from the rapid growth of the sequence and structure databases. Third, single scores are usually not sufficient to describe the performance of a method. The lack of clarity is particularly unfortunate at a time when an increasing number of tools are made easily available through the Internet and many of the users are not experts in the field of protein structure prediction. Two prominent examples illustrate this problem. (1) Transmembrane helix predictions have been estimated to yield levels above 95% per-residue accuracy, more than 18 percentage points above what seems to hold up [111]. (2) Many publications on predicting the secondary structural class from amino acid composition allowed correlations between 'training' and testing sets. Consequently, the published levels of prediction accuracy (close to 100%) far exceeded the theoretically possible limit of around 60% [142].

CASP: how well do experts predict protein structure? The CASP experiments attempt to address the problem of over-estimated performance [143, 144, 145, 4]. The procedure used by CASP is the following. (1) Experimentalists who are about to determine the structure of a protein send the sequence to the CASP organisers [4]. (2) Sequences are distributed to the predictors. The deadline for returning results is given by the date on which the structure will be published. (3) All predictions are evaluated in a meeting at Asilomar. CASP resolves the bias resulting from using known protein structures as targets. However, it often cannot provide statistically significant evaluations since the number of proteins tested is too small [146, 70]. Nevertheless, CASP provides valuable insights into the performance of prediction methods, and has become the major source of development in the field of protein structure prediction. Because 'failing at CASP is bad for the CV', most predictions are submitted only after experts have studied the data in detail. Thus, CASP intrinsically evaluates how well the best experts in the field can predict structure.

CAFASP: how well do computers predict structure? CAFASP has recently extended CASP by testing automatic prediction servers on the CASP proteins [147]. Although CAFASP aimed at evaluating programs rather than experts, it is still limited to a small number of test proteins [148, 4]. In fact, in most categories (comparative modelling, fold recognition, novel folds) the number of test proteins did not suffice to distinguish between the top 10 methods in a statistically significant way (see also Chapter 27). Furthermore, for most categories, we could not even conclude whether or not the field had improved over a period of two years.

EVA and LiveBench: automatic, large-scale evaluation of performance. The limitations of CASP and CAFASP prompted two efforts at creating large-scale and continuously running tools that automatically assess protein structure prediction servers: EVA [7, 8] and LiveBench [149]. LiveBench specialises in the evaluation of fold recognition, while EVA analyses comparative modelling, contact prediction, fold recognition, and secondary structure prediction. The EVA results for secondary structure prediction methods were essential to conclude that these methods have improved significantly and to isolate the particular reasons for the improvements (mostly due to growing databases) [39, 70, 15].



Secondary Structure Prediction in Practice

77% right means 23% wrong! The best current methods (PSIPRED, PROFphd, SSpro) reach levels around 77% accuracy (percentage of residues predicted correctly in one of the three states helix, strand, or other) [8, 39, 70] (Footnote 2). Five observations are important for using prediction methods. (1) Levels of accuracy are averages over many proteins (Fig. 4 A). Hence, the accuracy for the prediction of your protein may be much lower, or much higher, than 77%. (2) Stronger predictions are usually more accurate (Fig. 4 B). This allows you, to some extent, to find out whether the prediction for your protein is more likely to be above or below average. (3) Predictions often go badly wrong, i.e. helices are incorrectly predicted as strands and vice versa. In fact, the best current methods confuse helices and strands for, on average, about 3% of all residues [8]. Encouragingly, some of these 'bad errors' are not so severe after all, since they occur in regions that can switch structural conformation in response to environmental changes (see below). (4) Prediction accuracy is rather sensitive to the information contained in the alignment used for the prediction: differences between single-sequence based predictions and optimal alignment-based predictions can exceed 25 percentage points [15]. (5) If on average 77% of the residues are correctly predicted, this trivially implies that 23% are wrong. Often it is extremely instructive to form an expert opinion about where these wrong predictions are [150].
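The three-state per-residue accuracy and the rate of 'bad errors' discussed above can be computed as follows. This is a minimal sketch, assuming observed and predicted states are given as equally long strings over the letters H (helix), E (strand) and L (other); the letter code and the function names are illustrative choices, not the conventions of any particular server.

```python
def q3_accuracy(observed: str, predicted: str) -> float:
    """Percentage of residues predicted in the correct one of three states."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def helix_strand_confusion(observed: str, predicted: str) -> float:
    """Percentage of residues with helix predicted as strand or vice versa."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("strings must be non-empty and of equal length")
    bad_pairs = {("H", "E"), ("E", "H")}
    bad = sum((o, p) in bad_pairs for o, p in zip(observed, predicted))
    return 100.0 * bad / len(observed)


if __name__ == "__main__":
    observed = "HHHHLLEEEE"
    predicted = "HHHLLLEEHE"
    accuracy = q3_accuracy(observed, predicted)
    print(accuracy, 100.0 - accuracy, helix_strand_confusion(observed, predicted))
```

The percentage of wrongly predicted residues is always 100 minus the per-residue accuracy; the confusion score isolates the subset of those errors that swap helix and strand.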




Fig. 4: Prediction accuracy varies but stronger predictions are better! All results are based on 150 novel protein structures not used to develop any of the methods shown [8, 39]. For all methods shown, the three-state per-residue accuracy varies significantly between these proteins, with one standard deviation in the order of 10% (A). This implies that it is difficult for users to estimate the 'actual' accuracy for their protein. However, most methods now provide an index measuring the reliability of the prediction for each residue. Shown in (B) is the accuracy versus the cumulative percentage of residues predicted at a given level of reliability (coverage vs. accuracy). For example, PSIPRED and PROFphd reach a level above 88% for about 60% of all residues (dashed line). This particular line is chosen since secondary structure assignments by DSSP agree to about 88% for proteins of similar structure. Although JPred2 is only marginally less accurate than PSIPRED and PROFphd, it reaches this level of accuracy for less than half of all residues.
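A coverage-versus-accuracy curve of the kind described in the caption can be compiled from any prediction that comes with a per-residue reliability index. The sketch below assumes an integer index in which larger values mean stronger predictions (the 0 to 9 scale in the example is an assumption), and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def accuracy_vs_coverage(
    observed: str, predicted: str, reliability: Sequence[int]
) -> List[Tuple[int, float, float]]:
    """For each reliability threshold, return (threshold, coverage %, accuracy %).

    Coverage is the percentage of residues with reliability >= threshold;
    accuracy is the percentage of those residues that are predicted correctly.
    """
    n = len(observed)
    if n == 0 or len(predicted) != n or len(reliability) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    curve = []
    for threshold in sorted(set(reliability), reverse=True):
        selected = [i for i in range(n) if reliability[i] >= threshold]
        correct = sum(observed[i] == predicted[i] for i in selected)
        curve.append(
            (threshold, 100.0 * len(selected) / n, 100.0 * correct / len(selected))
        )
    return curve


if __name__ == "__main__":
    observed = "HHHHLLEEEE"
    predicted = "HHHLLLEEHE"
    reliability = [9, 8, 7, 2, 6, 9, 8, 5, 1, 7]
    for threshold, coverage, accuracy in accuracy_vs_coverage(
        observed, predicted, reliability
    ):
        print(threshold, coverage, accuracy)
```

In this toy example the two wrong residues carry the lowest reliability values, so accuracy drops only once the weakest predictions are included, which is the behaviour a useful reliability index should show.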




 

Sources of latest improvement: 4 parts database growth, 3 extended search, 2 other. Jones suggested two causes for the improved accuracy of PSIPRED: (1) training and (2) testing the method on PSI-BLAST profiles. Cuff & Barton examined in detail how different alignment methods improve prediction accuracy [11]. However, which fraction of the improvement results from the mere growth of the database, which from using more diverged profiles, and which from training on larger profiles? Using the PHD version from 1994 to separate the effects [15], we first compared a non-iterative standard BLAST [3] search against SWISS-PROT [134] with one against SWISS-PROT + TrEMBL [134] + PDB [151]. The larger database improved performance by about two percentage points [15]. Secondly, we compared the standard BLAST search against the big database with an iterative PSI-BLAST search. This yielded less than two percentage points of additional improvement [15]. Thus, overall, the more divergent profile search against today's databases presumably improves any method using alignment information by almost four percentage points. The improvement from using PSI-BLAST profiles to develop the method is relatively small: PHDpsi was trained on a small database of not very divergent profiles in 1994, whereas PROFphd was trained on PSI-BLAST profiles of a 20 times larger database in 2000. The two differ by only one percentage point, and part of this difference resulted from implementing new concepts into PROF (Rost, unpublished).

Averaging over many methods may help. All methods predict some proteins at lower levels of accuracy than others [152, 38, 70]. Nevertheless, for most proteins there is a method that predicts secondary structure at a level higher than average [70]. This observation is exploited when averaging over prediction methods. In fact, such averages are helpful as long as they are compiled over 'good' methods [153]. Thus, using ALL available programs is a rather bad idea!
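One simple way of averaging over methods is to take the arithmetic mean of the per-residue state probabilities and then pick the state with the highest mean. The sketch below assumes each method reports a (helix, strand, other) probability triple per residue; this input format, the unweighted mean, and the function name are assumptions made for illustration and do not describe how any published consensus method works.

```python
from typing import List, Sequence, Tuple

STATES = "HEL"  # helix, strand, other

Probabilities = Tuple[float, float, float]


def average_prediction(per_method: Sequence[Sequence[Probabilities]]) -> str:
    """Average (pH, pE, pL) over methods and return the consensus state string."""
    if not per_method:
        raise ValueError("need at least one method")
    n_residues = len(per_method[0])
    if any(len(method) != n_residues for method in per_method):
        raise ValueError("all methods must cover the same residues")
    consensus: List[str] = []
    for i in range(n_residues):
        mean = [
            sum(method[i][s] for method in per_method) / len(per_method)
            for s in range(len(STATES))
        ]
        consensus.append(STATES[mean.index(max(mean))])
    return "".join(consensus)


if __name__ == "__main__":
    method_a = [(0.8, 0.1, 0.1), (0.4, 0.5, 0.1), (0.1, 0.2, 0.7)]
    method_b = [(0.6, 0.2, 0.2), (0.7, 0.2, 0.1), (0.2, 0.5, 0.3)]
    print(average_prediction([method_a, method_b]))
```

Because every method enters with equal weight, adding a poor method pulls the average toward its errors, which is why the input should be restricted to 'good' methods.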



Solvent Accessibility Prediction in Practice

Very few of the seemingly more accurate methods predicting solvent accessibility are publicly available. Furthermore, there is no EVA-like evaluation of methods based on large sets and identical conditions. The scenario in which methods are compared based on different data sets and different definitions of accessibility is the rule rather than the exception. Thus, published values should be viewed with caution. Most methods predict accessibility in two states (exposed and buried). Levels of prediction accuracy vary significantly according to the choice of the threshold used to distinguish between the two states [11]. If we define all residues that are less than 16% solvent accessible as 'buried' (and all others as 'exposed'), the best current methods reach levels around 75±10% accuracy [59, 154, 11, 15]. Using alignment information improves prediction accuracy significantly. However, accessibility predictions are more sensitive to alignment errors than are secondary structure predictions [59, 15]. A reason for this may be that accessibility is evolutionarily less well conserved than is secondary structure [15].
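How strongly the two-state score depends on the threshold can be seen in a small sketch. It assumes that relative accessibility is already given in percent (observed accessible surface divided by the maximal accessibility of that residue type, which must come from an external table and is not reproduced here), and that residues below the threshold count as buried and all others as exposed. The function names are hypothetical.

```python
from typing import Sequence


def two_state(relative_accessibility: float, threshold: float = 16.0) -> str:
    """Map relative accessibility (percent) to 'b' (buried) or 'e' (exposed)."""
    return "b" if relative_accessibility < threshold else "e"


def two_state_accuracy(
    observed: Sequence[float], predicted: Sequence[float], threshold: float = 16.0
) -> float:
    """Percentage of residues assigned to the same state in both lists."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    agree = sum(
        two_state(o, threshold) == two_state(p, threshold)
        for o, p in zip(observed, predicted)
    )
    return 100.0 * agree / len(observed)


if __name__ == "__main__":
    observed = [5.0, 20.0, 50.0, 10.0]
    predicted = [10.0, 12.0, 60.0, 30.0]
    for threshold in (16.0, 25.0):
        print(threshold, two_state_accuracy(observed, predicted, threshold))
```

The same pair of observed and predicted values yields different two-state accuracies at different thresholds, so published numbers are only comparable when the threshold and the accessibility definition are identical.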



Transmembrane Helix Prediction in Practice

Caution: no appropriate estimate of performance available! The appropriate evaluation of methods predicting membrane helices is even more difficult than the evaluation of other categories of structure prediction. Three major problems prevent adequate analyses. (1) We do not have enough high-resolution structures to allow a statistically significant analysis [111]. (2) Low-resolution experiments (gene fusion) differ from high-resolution experiments (crystallography) almost as much as prediction methods do [111]. Thus, low-resolution experiments do not suffice to evaluate prediction accuracy. (3) All methods optimise some parameters. Since there are so few high-resolution structures, all methods use as many of the known ones as possible. However, that methods perform much better on proteins for which they were developed than on new proteins was impressively demonstrated, and overlooked, in a recent analysis of prediction methods [109].

Crude estimates for where we are at in the field. The best current methods (HMMTOP2, PHDhtm, and TMHMM2) predict all helices correctly for about 70% of all proteins [111]. For more than 60% of the proteins, the topology is also predicted correctly [109, 111]. The most accurate per-residue prediction is achieved by PHDhtm, which gets about 70% of the observed TMH residues right [111]. All methods based on advanced algorithms tend to under-predict transmembrane helices; thus, about 86% of the TMH residues predicted by the best methods in this category (PHD and DAS) are correct [111]. Most methods tend to confuse signal peptides with membrane helices; the best separation is achieved by ALOM2, a system predicting sub-cellular localisation [155]. Almost as accurate are PHD and TopPred2 (followed by TMHMM) [109]. Surprisingly, most methods have also been over-estimated in their ability to distinguish between globular and helical membrane proteins; in particular, most methods based only on hydrophobicity scales incorrectly predict membrane helices in over 90% of a representative set of globular proteins [110, 111]. The most accurate hydrophobicity index appears to be the one recently developed in the Ben-Tal group [156]. All methods fail to distinguish membrane helices from signal peptides, to the extent that the best methods still falsely predict membrane helices for 25% (PHDhtm) to 34% (TMHMM2) of all signal peptides tested. The good news for practical applications is that we have an accurate method for detecting signal peptides [157], and that most incorrectly predicted membrane helices start closer than ten residues to the N-terminal methionine, i.e. could be corrected by experts.
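The two per-residue numbers quoted above answer different questions: which percentage of the observed TMH residues is predicted, and which percentage of the predicted TMH residues is correct. A minimal sketch, assuming observed and predicted annotations are equally long strings in which T marks a transmembrane helix residue (the letter code and function name are illustrative):

```python
from typing import Tuple


def tmh_residue_scores(observed: str, predicted: str, tm: str = "T") -> Tuple[float, float]:
    """Return (% of observed TMH residues predicted, % of predicted TMH residues correct)."""
    if len(observed) != len(predicted):
        raise ValueError("strings must be of equal length")
    n_observed = sum(o == tm for o in observed)
    n_predicted = sum(p == tm for p in predicted)
    if n_observed == 0 or n_predicted == 0:
        raise ValueError("both annotations must contain TMH residues")
    n_both = sum(o == tm and p == tm for o, p in zip(observed, predicted))
    return 100.0 * n_both / n_observed, 100.0 * n_both / n_predicted


if __name__ == "__main__":
    observed = "LL" + "T" * 10 + "LL"
    predicted = "LLLL" + "T" * 7 + "LLL"
    print(tmh_residue_scores(observed, predicted))
```

A method that predicts helices that are too short, as in this toy example, scores low on the first number and high on the second; reporting only one of the two hides that trade-off.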

Genome analysis: many proteins contain membrane helices. Despite the over-estimated performance, predictions of transmembrane helices are valuable tools to quickly scan entire genomes for membrane proteins. A few groups base their results only on hydrophobicity scales, known to have extremely high error rates in distinguishing globular and membrane proteins. Nevertheless, the averages published for entire genomes are surprisingly similar between different authors [158, 133, 159, 160, 161, 112, 80]. About 10-30% of all proteins appear to contain membrane helices. One crucial difference, however, is that more cautious estimates do not perceive a statistically significant difference in the percentage of TMH proteins across the three kingdoms: eukaryotes, prokaryotes and archaea [162]. However, the preferences for particular types of membrane proteins differ; in particular, eukaryotes have more 7TM proteins (receptors), while prokaryotes have more 6- and 12TM proteins (ABC transporters) [112, 80].

 

 

Emerging and future developments

Regions likely to undergo structural change predicted successfully. Young, Kirshenbaum, Dill & Highsmith [2] have unravelled an impressive correlation between local secondary structure predictions and global conditions. The authors monitor regions for which secondary structure prediction methods give equally strong preferences for two different states. Such regions are processed by combining simple statistics and expert rules. The final method was tested on 16 proteins known to undergo structural rearrangements, and on a number of other proteins. The authors report no false positives, and identify most known structural switches. Subsequently, the group applied the method to the myosin family, identifying putative switching regions that were not known before but appeared to be reasonable candidates [1]. I find this method most remarkable in two ways: (1) it is the most general method using predictions of protein structure to predict some aspects of function, and (2) it illustrates that predictions may be useful even when structures are known (as in the case of the myosin family).
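The first step described above, finding regions in which the prediction gives nearly equal preference to two states, can be sketched as follows. This is not the published ASP procedure, which adds statistics and expert rules; the input format (one probability triple per residue), the gap and length cut-offs, and the function name are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def ambivalent_regions(
    probabilities: Sequence[Tuple[float, float, float]],
    max_gap: float = 0.1,
    min_length: int = 3,
) -> List[Tuple[int, int]]:
    """Return 0-based inclusive (start, end) stretches in which the two
    highest state probabilities differ by at most max_gap."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, triple in enumerate(probabilities):
        top, second = sorted(triple, reverse=True)[:2]
        ambiguous = (top - second) <= max_gap
        if ambiguous and start is None:
            start = i
        elif not ambiguous and start is not None:
            if i - start >= min_length:
                regions.append((start, i - 1))
            start = None
    # Close a stretch that runs to the end of the sequence.
    if start is not None and len(probabilities) - start >= min_length:
        regions.append((start, len(probabilities) - 1))
    return regions


if __name__ == "__main__":
    example = (
        [(0.8, 0.1, 0.1)] * 2 + [(0.45, 0.45, 0.1)] * 4 + [(0.1, 0.1, 0.8)] * 2
    )
    print(ambivalent_regions(example))
```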

Classifying proteins based on secondary structure predictions in the context of genome analysis. Proteins can be classified into families based on predicted and observed secondary structure [163, 164]. However, such procedures have been limited to a very coarse-grained grouping that is only exceptionally useful for inferring function (Table 2). Nevertheless, predictions of membrane helices and coiled-coil regions in particular are crucial for genome analysis. Recently, we came across an observation that may have important implications, in particular for structural genomics: more than one fifth of all eukaryotic proteins appeared to have regions longer than 60 residues apparently lacking any regular secondary structure [162]. Most of these regions were not of low complexity, i.e. not composition-biased [165, 166, 162]. Surprisingly, these regions appeared evolutionarily as conserved as all other regions in the respective proteins. This application of secondary structure prediction may aid in classifying proteins, and in separating domains, possibly even in identifying particular functional motifs.

Aspects of protein function predicted based on expert-analysis of secondary structure. The typical scenario in which secondary structure predictions help us to learn more about function comes from experts combining predictions and their intuition, most often to find similarities to proteins of known function but insignificant sequence similarity [167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179]. Usually, such applications are based on very specific details about predicted secondary structure (some examples in Table 2). Thus, successful correlations of secondary structure and function appear difficult to incorporate into automatic methods.

Exploring secondary structure predictions to improve database searches. Initially, three groups independently applied secondary structure predictions to fold recognition, i.e., the detection of structural similarities between proteins of unrelated sequences [180, 181, 182]. A few years later, almost every other fold recognition/threading method had adopted this concept [183, 184, 185, 186, 187, 188, 189, 190, 191, 192]. Two recent methods extended the concept by not only refining the database search, but by actually refining the quality of the alignment through an iterative procedure [193, 194]. A related strategy has been employed by Ng and the Henikoffs to improve predictions and alignments for membrane proteins [195].

From 1D predictions to 2D, and 3D structure. Are secondary structure predictions accurate enough to help predicting higher order aspects of protein structure automatically? 2D (inter-residue contacts) predictions: Baldi, Pollastri, Andersen & Brunak [196] have recently improved the level of accuracy in predicting beta-strand pairings over earlier work [197] by using another elaborate neural network system. 3D predictions: the following list of five groups exemplifies that secondary structure predictions have now become a popular first step toward predicting 3D structure. (1) Ortiz et al. [198] successfully use secondary structure predictions as one component of their 3D structure prediction method. (2) Eyrich et al. [199, 200] minimise the energy of arranging predicted rigid secondary structure segments. (3) Lomize et al. [201] also start from secondary structure segments. (4) Chen et al. [202] suggest using secondary structure predictions to reduce the complexity of molecular dynamics simulations. (5) Levitt et al. [203, 204] combine secondary structure-based simplified representations with a particular lattice simulation attempting to enumerate all possible folds.

Using accessibility to predict aspects of function. Features of multiple alignments can reveal aspects of protein function [205, 206, 207, 208, 209, 210]. To simplify the story: residues can be conserved for structural or for functional reasons. If we could distinguish between the two, we could predict the functional residues. Obviously, residues that are exposed AND conserved are likely to reveal functional constraints. This suggests using predicted accessibility in combination with alignments to predict functional residues (B Rost, unpublished). Another possible application of predicted accessibility is the prediction of sub-cellular localisation: the surface compositions differ significantly between extra-cellular, cytoplasmic and nuclear proteins [211]. Currently, we use predicted accessibility to improve the prediction of sub-cellular localisation (R Nair & B Rost, unpublished).
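The 'exposed AND conserved' idea can be written down as a simple filter. The sketch is a toy illustration, not the unpublished method mentioned above: it assumes a per-residue conservation score between 0 and 1 derived from an alignment, a predicted relative accessibility in percent, and arbitrary cut-offs; all names and default values are hypothetical.

```python
from typing import List, Sequence


def candidate_functional_residues(
    conservation: Sequence[float],
    relative_accessibility: Sequence[float],
    min_conservation: float = 0.8,
    min_accessibility: float = 16.0,
) -> List[int]:
    """Return 0-based positions that are both conserved and predicted exposed."""
    if len(conservation) != len(relative_accessibility):
        raise ValueError("inputs must be of equal length")
    return [
        i
        for i, (cons, acc) in enumerate(zip(conservation, relative_accessibility))
        if cons >= min_conservation and acc >= min_accessibility
    ]


if __name__ == "__main__":
    conservation = [0.9, 0.95, 0.3, 0.85]
    accessibility = [5.0, 40.0, 60.0, 20.0]
    print(candidate_functional_residues(conservation, accessibility))
```

Conserved but buried positions (the first residue in the example) are left out because their conservation is more plausibly explained by structural constraints.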

Using 1D predictions for target selection in structural genomics. Structural genomics proposes to experimentally determine one high-resolution structure for every known protein [212, 213, 214, 215, 216, 217]. Obviously, this goal could be reached faster if we could avoid all proteins of known structure. This is relatively straightforward [214, 218, 162]. More difficult is the task of avoiding proteins that do not express, do not purify, or are longer than 200 residues and do not crystallise (or do not diffract well enough). One way toward this goal is to exclude all proteins with membrane helices [80, 218]. Can bioinformatics do more than that? In a preliminary analysis, we used predicted accessibility to predict the globularity of a protein [219]. Although prediction accuracy is rather low, we observe some correlation between the percentage of surface residues and the globularity of a protein. One important task in choosing structural genomics targets most effectively is the prediction of structural domains from sequence. Currently, we are exploring ways of using predicted accessibility and secondary structure toward this end (J Liu & B Rost, unpublished).

Eukaryotes full of floppy proteins? Recently it has been shown that regions of low-complexity – as predicted by the program SEG   [220] - are the rule rather than the exception in the protein universe   [221, 222, 223, 224, 225, 165, 166] . Using predictions of secondary structure, we found that there are many proteins that do not have low-complexity regions but nevertheless appear to have long (>70 residues) regions without regular secondary structure (helix/strand, dubbed 'NORS'). Such NORS proteins appear to be significantly more abundant in the eukaryotes than in all other kingdoms, reaching levels around 25% of the entire genome   [162] . We found many of the NORS to be evolutionarily conserved, suggesting that these may in fact be proteins with induced structure rather than without structure.
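Scanning predicted secondary structure for long stretches without helix or strand can be sketched as a run-length search. The example assumes a three-letter state string (H, E, and any other letter for non-regular structure) and takes 'more than 70 residues' literally as a minimum run length of 71; any further criteria used to define NORS are not modelled, and the function name is hypothetical.

```python
from typing import List, Tuple


def long_nonregular_regions(
    states: str, min_length: int = 71, regular: str = "HE"
) -> List[Tuple[int, int]]:
    """Return 0-based inclusive (start, end) runs of at least min_length
    consecutive residues that are neither helix nor strand."""
    regions: List[Tuple[int, int]] = []
    start = None
    # A regular-state sentinel appended at the end closes any trailing run.
    for i, state in enumerate(states + regular[0]):
        if state not in regular:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i - 1))
            start = None
    return regions


if __name__ == "__main__":
    predicted = "H" * 10 + "L" * 75 + "E" * 5 + "L" * 20
    print(long_nonregular_regions(predicted))
```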

 

 

Further Reading

Review articles:

· Secondary structure, old   [45] : 
Goldmine for finding citations for very early prediction methods.

· Secondary structure, new   [39] : 
Brief walk through the recent highlights in the field of protein secondary structure prediction.

· Coiled-coil helices   [76] : 
Analysis of the performance of various programs predicting coiled-coil helices based on new structures.

· Transmembrane helix predictions   [106] : 
Review of methods predicting transmembrane helices based on hydrophobicity.

 

Original articles (sorted by subject/year):

1) Secondary structure, PHDsec   [58, 59] : 
Original papers describing the first method that surpassed the threshold of 70% prediction accuracy through combining neural networks and evolutionary information.

2) Secondary structure, PSIPRED   [18] :
The alignments used by PHD are replaced by PSI-BLAST alignments. This improves prediction accuracy significantly. However, possibly the most important aspect is the description of ways to run PSI-BLAST automatically without finding too many wrong hits.

3) Secondary structure, SSpro   [22] : 
This paper presents the most complicated and seemingly most successful neural network architecture for predicting secondary structure.

4) Secondary structure, switches   [2] : 
A data set of 16 protein sequences whose functions involve substantial backbone rearrangements is analysed with respect to the ambivalence of predicted secondary structure. The authors find all segments involved in conformational switches to have ambivalent predictions, measured by the similarity in prediction probabilities for helix, sheet and loop, as reported by PHD.

5) Secondary structure applied to classify genomes   [164] : 
Does a protein's secondary structure determine its three-dimensional fold? This question is tested directly by analyzing proteins of known structure and constructing a taxonomy based solely on secondary structure. The taxonomy is generated automatically, and it takes the form of a tree in which proteins with similar secondary structure occupy neighboring leaves.

6) Solvent accessibility, PHDacc   [89] : 
Analysis of the evolutionary conservation of solvent accessibility and description of an alignment-based neural network prediction method.

7) Transmembrane helices, TMHMM   [25] : 
Most advanced and seemingly most accurate method predicting transmembrane helices through cyclic Hidden Markov models.

8) Transmembrane helices, genome analysis   [112] : 
The authors analyse all membrane helix predictions for a number of entirely sequenced genomes. They conclude that more complex organisms use more helical membrane proteins than simpler organisms.

9) Using 1D predictions for genome analysis  [80]:  
28 entirely sequenced genomes are compared based on predictions of coiled-coil proteins (COILS [75]), membrane helices (PHD [14]), and functional classes (EUCLID [226]). In contrast to many other publications, the correlation between the complexity of an organism and its use of helical membrane proteins is not confirmed.

 

 

Acknowledgements

Thanks to Jinfeng Liu and Volker Eyrich (Columbia) for computer assistance. Thanks also to the EVA team, which made it possible to quote some of the numbers given: Volker Eyrich (Columbia), Marc Marti-Renom & Andrej Sali (both Rockefeller), Florencio Pazos and Alfonso Valencia (Madrid). The work of BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health. Last, not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.

 

References

1. Kirshenbaum, K., Young, M. & Highsmith, S. (1999). Predicting allosteric switches in myosins. Prot. Sci., 8, 1806-1815.
2. Young, M., Kirshenbaum, K., Dill, K. A. & Highsmith, S. (1999). Predicting conformational switches in proteins. Prot. Sci., 8, 1752-1764.
3. Altschul, S. F. & Gish, W. (1996). Local alignment statistics. Meth. Enzymol., 266, 460-480.
4. Zemla, A., Venclovas, C. & Fidelis, K. (2001). Protein structure prediction center. 2001.
5. Lupas, A., Van Dyke, M. & Stock, J. (1991). Predicting coiled coils from protein sequences. Science, 252, 1162-1164.
6. Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577-2637.
7. Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, in press.
8. Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. 2001.
9. Bystroff, C., Thorsson, V. & Baker, D. (2000). HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol., 301, 173-190.
10. Tusnady, G. E. & Simon, I. (1998). Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J. Mol. Biol., 283, 489-506.
11. Cuff, J. A. & Barton, G. J. (2000). Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins, 40, 502-511.
12. Jones, D. T., Taylor, W. R. & Thornton, J. M. (1994). A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochem., 33, 3038-3049.
13. Eyrich, V. & Rost, B. (2000). The META-PredictProtein server.
14. Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539.
15. Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins, in press.
16. Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-3402.
17. Rost, B. (2001). Predicting protein structure: better data, better results! J. Mol. Biol., in submission.
18. Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202.
19. Karplus, K., Barrett, C., Cline, M., Diekhans, M., Grate, L. et al. (1999). Predicting protein structure using only sequence information. Proteins, S3, 121-125.
20. Hirokawa, T., Boon-Chieng, S. & Mitaku, S. (1998). SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics, 14, 378-379.
21. Juretic, D., Zucic, D., Lucic, B. & Trinajstic, N. (1998). Preference functions for prediction of membrane-buried helices in integral membrane proteins. Comput. Chem., 22, 279-94.
22. Baldi, P., Brunak, S., Frasconi, P., Soda, G. & Pollastri, G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15, 937-946.
23. Pollastri, G., Przybylski, D., Rost, B. & Baldi, P. (2001). Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, in press.
24. Persson, B. & Argos, P. (1996). Topology prediction of membrane proteins. Prot. Sci., 5, 363-371.
25. Sonnhammer, E. L. L., von Heijne, G. & Krogh, A. (1998). A hidden Markov model for predicting transmembrane helices in protein sequences. In Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB98), pp. 175-182.
26. Hofmann, K. & Stoffel, W. (1993). TMBASE - a database of membrane spanning protein segments. Biol. Chem. Hoppe-Seyler, 374, 166.
27. von Heijne, G. (1992). Membrane protein structure prediction. J. Mol. Biol., 225, 487-494.
28. Cserzö, M., Wallin, E., Simon, I., von Heijne, G. & Elofsson, A. (1997). Prediction of transmembrane α-helices in prokaryotic membrane proteins: the dense alignment surface method. Prot. Engin., 10, 673-676.
29. Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181, 223-230.
30. Corrales, F. J. & Fersht, A. R. (1996). Kinetic significance of GroEL14·(GroES7)2 complexes in molecular chaperone activity. Folding & Design, 1, 265-273.
31. Martin, J. & Hartl, F. U. (1997). Chaperone-assisted protein folding. Curr. Opin. Str. Biol., 7, 41-52.
32. Ellis, R. J., Dobson, C. & Hartl, U. (1998). Sequence does specify protein conformation. TIBS, 23, 468.
33. Dobson, C. M. & Karplus, M. (1999). The fundamentals of protein folding: bringing together theory and experiment. Curr. Opin. Str. Biol., 9, 92-101.
34. Levitt, M. & Warshel, A. (1975). Computer simulation of protein folding. Nature, 253, 694-698.
35. Hagler, A. T. & Honig, B. (1978). On the formation of protein tertiary structure on a computer. Proc. Natl. Acad. Sci. U.S.A., 75, 554-558.
36. van Gunsteren, W. F. (1993). Molecular dynamics studies of proteins. Curr. Opin. Str. Biol., 3, 167-174.
37. CASP4WWW (2000). Fourth meeting on the critical assessment of techniques for protein structure prediction. Prediction Center, Lawrence Livermore National Lab, WWW document: http://PredictionCenter.llnl.gov/casp4/Casp4.html.
38. Lesk, A. M., Lo Conte, L. & Hubbard, T. J. P. (2001). Assessment of novel folds targets in CASP4: Predictions of three-dimensional structures, secondary structures, and interresidue contacts. Proteins, in press.
39. Rost, B. (2001). Protein secondary structure prediction continues to rise. J. Struct. Biol., 134, 204-218.
40. Brändén, C. & Tooze, J. (1991). Introduction to Protein Structure. Garland Publ., New York, London.
41. Chou, P. Y. & Fasman, G. D. (1974). Prediction of protein conformation. Biochem., 13, 211-215.
42. Robson, B. (1976). Conformational properties of amino acid residues in globular proteins. J. Mol. Biol., 107, 327-56.
43. Garnier, J., Osguthorpe, D. J. & Robson, B. (1978). Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120, 97-120.
44. Schulz, G. E. & Schirmer, R. H. (1979). Prediction of secondary structure from the amino acid sequence. In Principles of protein structure, pp. 108-130, Springer-Verlag, Berlin.
45. Fasman, G. D. (1989). The development of the prediction of protein structure. In Prediction of protein structure and the principles of protein conformation (Fasman, G. D., ed.), pp. 193-303, Plenum Press, New York, London.
46. Nishikawa, K. & Ooi, T. (1982). Correlation of the amino acid composition of a protein to its structural and biological characteristics. J. Biochem., 91, 1821-1824.
47. Nishikawa, K. & Ooi, T. (1986). Amino acid sequence homology applied to the prediction of protein secondary structure, and joint prediction with existing methods. Biochim. Biophys. Ac., 871, 45-54.
48. Deleage, G. & Roux, B. (1987). An algorithm for protein secondary structure prediction based on class prediction. Prot. Engin., 1, 289-294.
49. Biou, V., Gibrat, J. F., Levin, J. M., Robson, B. & Garnier, J. (1988). Secondary structure prediction: combination of three different methods. Prot. Engin., 2, 185-91.
50. Bohr, H., Bohr, J., Brunak, S., Cotterill, R. M. J., Lautrup, B. et al. (1988). Protein secondary structure and homology by neural networks. FEBS Lett., 241, 223-228.
51. Gascuel, O. & Golmard, J. L. (1988). A simple method for predicting the secondary structure of globular proteins: implications and accuracy. CABIOS, 4, 357-365.
52. Levin, J. M. & Garnier, J. (1988). Improvements in a secondary structure prediction method based on a search for local sequence homologies and its use as a model building tool. Biochim. Biophys. Ac., 955, 283-295.
53. Qian, N. & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol., 202, 865-884.
54. Garnier, J. & Robson, B. (1989). The GOR method for predicting secondary structure in proteins. In Prediction of protein structure and the principles of protein conformation (Fasman, G. D., ed.), pp. 417-465, Plenum Press, New York.
55. Rost, B. & Sander, C. (1996). Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct., 25, 113-136.
56. Rost, B. & Sander, C. (2000). Third generation prediction of secondary structure. Methods in Molecular Biology, 143, 71-95.
57. Rost, B. & Sander, C. (1992). Exercising multi-layered networks on protein secondary structure. In Neural Networks: From Biology to High Energy Physics (Benhar, O., Brunak, S., DelGiudice, P. & Grandolfo, M., eds.), pp. 209-220, International Journal of Neural Systems, Elba, Italy.
58. Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.
59. Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72.
60. Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin., 12, 85-94.
61. Doolittle, R. F. (1986). Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley California.
62.Rost, B. (1997). Protein structures sustain evolutionary drift. Folding& Design, 2,S19-S24.
63.Yang, A. S. & Honig, B. (2000). An integrated approach to the analysisand modeling of protein sequences and structures. II. On the relationshipbetween sequence and structural similarity for proteins that are not obviouslyrelated in sequence. J. Mol. Biol., 301, 679-689.
64.Dickerson, R. E., Timkovich, R. & Almassy, R. J. (1976). The cytochromefold and the evolution of bacterial energy metabolism. J. Mol. Biol., 100, 473-491.
65.Maxfield, F. R. & Scheraga, H. A. (1979). Improvements in the predictionof protein topography by reduction of statistical errors. Biochem., 18, 697-704.
66.Zvelebil, M. J., Barton, G. J., Taylor, W. R. & Sternberg, M. J. E.(1987). Prediction of protein secondary structure and active sites usingalignment of homologous sequences. J. Mol. Biol.,195, 957-961.
67.Barton, G. J. (1996). Protein sequence alignment and database scanning. InProtein structure prediction (Sternberg, M. J. E., eds.), pp. 31-64, OxfordUniv. Press, Oxford.
68.Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14, 755-763.
69.Karplus, K., Barrett, C. & Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846-856.
70.Rost, B. & Eyrich, V. (2001). EVA: large-scale analysis of secondary structure prediction. Proteins, in press.
71.Salamov, A. A. & Solovyev, V. V. (1997). Protein secondary structure prediction using local alignments. J. Mol. Biol., 268, 31-36.
72.Frishman, D. & Argos, P. (1996). Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Prot. Engin., 9, 133-142.
73.Baldi, P. & Brunak, S. (2001). Bioinformatics: the machine learning approach. MIT Press, Cambridge.
74.Crick, F. H. C. (1953). The packing of α-helices: simple coiled-coils. Acta Crystallogr. Sect. A, 6, 689-697.
75.Lupas, A. (1996). Prediction and analysis of coiled-coil structures. Meth. Enzymol., 266, 513-525.
76.Lupas, A. (1997). Predicting coiled-coil regions in proteins. Curr. Opin. Str. Biol., 7, 388-393.
77.Nilges, M. & Brünger, A. T. (1993). Successful prediction of coiled coil geometry of the GCN4 leucine zipper domain by simulated annealing: comparison to the X-ray. Proteins, 15, 133-146.
78.O'Donoghue, S. I. & Nilges, M. (1997). Tertiary structure prediction using mean-force potentials and internal energy functions: successful prediction for coiled-coil geometries. Folding & Design, 2, S47-S52.
79.Wolf, E., Kim, P. S. & Berger, B. (1997). MultiCoil: a program for predicting two- and three-stranded coiled coils. Prot. Sci., 6, 1179-1189.
80.Liu, J. & Rost, B. (2001). Comparing function and structure between entire proteomes. Prot. Sci., 10, 1970-1979.
81.Cohen, F. E., Sternberg, M. J. E. & Taylor, W. R. (1981). Analysis of the tertiary structure of protein β-sheet sandwiches. J. Mol. Biol., 148, 253-272.
82.Monge, A., Friesner, R. A. & Honig, B. (1994). An algorithm to generate low-resolution protein tertiary structures from knowledge of secondary structure. Proc. Natl. Acad. Sci. U.S.A., 91, 5027-5029.
83.Mumenthaler, C. & Braun, W. (1995). Predicting the helix packing of globular proteins by self-correcting distance geometry. Prot. Sci., 4, 863-871.
84.Cohen, F. E. & Presnell, S. R. (1996). The combinatorial approach. In Protein structure prediction (Sternberg, M. J. E., eds.), pp. 207-228, Oxford Univ. Press, Oxford.
85.Lee, B. K. & Richards, F. M. (1971). The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol., 55, 379-400.
86.Chothia, C. (1976). The nature of the accessible and buried surfaces in proteins. J. Mol. Biol., 105, 1-12.
87.Connolly, M. L. (1983). Solvent-accessible surfaces of proteins and nucleic acids. Science, 221, 709-713.
88.Hubbard, T. J. P. & Blundell, T. L. (1987). Comparison of solvent-inaccessible cores of homologous proteins: definitions useful for protein modelling. Prot. Engin., 1, 159-171.
89.Rost, B. & Sander, C. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216-226.
90.Richards, F. M. (1977). Areas, volumes, packing, and protein structure. Annu. Rev. Biophys. Bioeng., 6, 151-176.
91.Tanford, C. (1978). The hydrophobic effect and the organization of living matter. Science, 200, 1012-1018.
92.Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105-132.
93.Sweet, R. M. & Eisenberg, D. (1983). Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure. J. Mol. Biol., 171, 479-488.
94.Holbrook, S. R., Muskal, S. M. & Kim, S.-H. (1990). Predicting surface exposure of amino acids from protein sequence. Prot. Engin., 3, 659-665.
95.Mucchielli-Giorgi, M. H., Hazout, S. & Tuffery, P. (1999). PredAcc: prediction of solvent accessibility. Bioinformatics, 15, 176-177.
96.Carugo, O. (2000). Predicting residue solvent accessibility from protein sequence by considering the sequence environment. Prot. Engin., 13, 607-609.
97.Li, X. & Pan, X. M. (2001). New method for accurate prediction of solvent accessibility from protein sequence. Proteins, 42, 1-5.
98.Naderi-Manesh, H., Sadeghi, M., Arab, S. & Moosavi Movahedi, A. A. (2001). Prediction of protein surface accessibility with information theory. Proteins, 42, 452-459.
99.Hansen, J., Lund, O., Tolstrup, N., Gooley, A. A., Williams, K. L. et al. (1998). NetOglyc: Prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconjugate Journal, 15, 115-130.
100.Gupta, R., Jung, E., Gooley, A. A., Williams, K. L., Brunak, S. et al. (1999). Scanning the available Dictyostelium discoideum proteome for O-linked GlcNAc glycosylation sites using neural networks. Glycobiology, 9, 1009-1022.
101.Thompson, M. J. & Goldstein, R. A. (1996). Predicting solvent accessibility: higher accuracy using Bayesian statistics and optimized residue substitution classes. Proteins, 25, 38-47.
102.Wako, H. & Blundell, T. L. (1994). Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins I. Solvent accessibility classes. J. Mol. Biol., 238, 682-692.
103.Rost, B. & O'Donoghue, S. I. (1997). Sisyphus and prediction of protein structure. CABIOS, 13, 345-356.
104.Rost, B. (2000). PredictProtein - internet prediction service.
105.Taylor, W. R., Jones, D. T. & Green, N. M. (1994). A method for α-helical integral membrane protein fold prediction. Proteins, 18, 281-294.
106.von Heijne, G. (1996). Prediction of transmembrane protein topology. In Protein structure prediction (Sternberg, M. J. E., eds.), pp. 101-110, Oxford Univ. Press, Oxford.
107.Seshadri, K., Garemyr, R., Wallin, E., von Heijne, G. & Elofsson, A. (1998). Architecture of beta-barrel membrane proteins: analysis of trimeric porins. Prot. Sci., 7, 2026-2032.
108.Buchanan, S. K. (1999). β-Barrel proteins from bacterial outer membranes: Structure, function and refolding. Curr. Opin. Str. Biol., 9, 455-461.
109.Möller, S., Croning, D. R. & Apweiler, R. (2001). Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics, 17, 646-653.
110.Chen, C. P., Kernytsky, A. & Rost, B. (2002). Myths of transmembrane helix predictions. Prot. Sci., submitted.
111.Chen, C. P. & Rost, B. (2002). State-of-the-art in membrane prediction. Applied Bioinformatics, submitted.
112.Wallin, E. & von Heijne, G. (1998). Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Prot. Sci., 7, 1029-1038.
113.von Heijne, G. (1986). The distribution of positively charged residues in bacterial inner membrane proteins correlates with the trans-membrane topology. EMBO J., 5, 3021-3027.
114.von Heijne, G. (1989). Control of topology and mode of assembly of a polytopic membrane protein by positively charged residues. Nature, 341, 456-458.
115.Tanford, C. (1980). The hydrophobic effect: formation of micelles and biological membranes. John Wiley & Sons, New York.
116.Eisenberg, D., Schwartz, E., Komaromy, M. & Wall, R. (1984). Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol., 179, 125-142.
117.Klein, P., Kanehisa, M. & De Lisi, C. (1985). The detection and classification of membrane-spanning proteins. Biochim. Biophys. Ac., 815, 468-476.
118.Engelman, D. M., Steitz, T. A. & Goldman, A. (1986). Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu. Rev. Biophys. Biophys. Chem., 15, 321-353.
119.von Heijne, G. (1994). Membrane proteins: from sequence to structure. Annu. Rev. Biophys. Biomol. Struct., 23, 167-192.
120.Phoenix, D. A., Stanworth, A. & Harris, F. (1998). The hydrophobic moment plot and its efficacy in the prediction and classification of membrane interactive proteins and peptides. Membr Cell Biol, 12, 101-110.
121.Harris, F., Wallace, J. & Phoenix, D. A. (2000). Use of hydrophobic moment plot methodology to aid the identification of oblique orientated alpha-helices. Molecular Membrane Biology, 17, 201-207.
122.Lio, P. & Vannucci, M. (2000). Wavelet change-point prediction of transmembrane proteins. Bioinformatics, 16, 376-82.
123.Liu, L. P. & Deber, C. M. (1999). Combining hydrophobicity and helicity: a novel approach to membrane protein structure prediction. Bioorg Med Chem, 7, 1-7.
124.Ben-Tal, N., Honig, B., Miller, C. & McLaughlin, S. (1997). Electrostatic binding of proteins to membranes. Theoretical predictions and experimental results with charybdotoxin and phospholipid vesicles. Biophys. J., 73, 1717-1727.
125.Monne, M., Hermansson, M. & von Heijne, G. (1999). A turn propensity scale for transmembrane helices. J. Mol. Biol., 288, 141-145.
126.Pasquier, C., Promponas, V. J., Palaios, G. A., Hamodrakas, J. S. & Hamodrakas, S. J. (1999). A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm. Prot. Engin., 12, 381-385.
127.Pilpel, Y., Ben-Tal, N. & Lancet, D. (1999). kPROT: a knowledge-based scale for the propensity of residue orientation in transmembrane segments. Application to membrane protein structure prediction. J. Mol. Biol., 294, 921-935.
128.Rost, B., Casadio, R., Fariselli, P. & Sander, C. (1995). Prediction of helical transmembrane segments at 95% accuracy. Prot. Sci., 4, 521-533.
129.Sipos, L. & von Heijne, G. (1993). Predicting the topology of eukaryotic membrane proteins. Eur. J. Biochem., 213, 1333-1340.
130.Persson, B. & Argos, P. (1997). Prediction of membrane protein topology utilizing multiple sequence alignments. J. Prot. Chem., 16, 453-457.
131.Tusnady, G. E. & Simon, I. (2001). Topology of membrane proteins. J Chem Inf Comput Sci, 41, 364-8.
132.Neuwald, A. F., Liu, J. S. & Lawrence, C. E. (1995). Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Prot. Sci., 4, 1618-1631.
133.Rost, B., Casadio, R. & Fariselli, P. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Prot. Sci., 5, 1704-1718.
134.Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-48.
135.Smith, R. F., Wiese, B. A., Wojzynski, M. K., Davison, D. B. & Worley, K. C. (1996). BCM Search Launcher--an integrated interface to molecular biology data base search and analysis services available on the World Wide Web. Genome Res., 6, 454-462.
136.Combet, C., Blanchet, C., Geourjon, C. & Deléage, G. (2000). NPS@: Network Protein Sequence Analysis. TIBS, 25, 147-150.
137.Rychlewski, L. (2000). META server. IIMCB Warsaw, WWW document (www.bioinfo.pl/meta/).
138.Kleywegt, G. (2001). ProSAL: Protein sequence analysis launcher. 2001.
139.Devereux, J., Haeberli, P. & Smithies, O. (1984). GCG package. Nucl. Acids Res., 12, 387-395.
140.Etzold, T. & Argos, P. (1993). SRS - an indexing and retrieval tool for flat file data libraries. Comput. Appl. Biosci., 9, 0-0.
141.Etzold, T., Ulyanov, A. & Argos, P. (1996). SRS: Information retrieval system for molecular biology data banks. Meth. Enzymol., 266, 114-128.
142.Wang, Z.-X. & Yuan, Z. (2000). How good is prediction of protein structural class by the component-coupled method? Proteins, 38, 165-175.
143.Moult, J., Pedersen, J. T., Judson, R. & Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins, 23, ii-iv.
144.Moult, J., Hubbard, T., Bryant, S. H., Fidelis, K. & Pedersen, J. T. (1997). Critical assessment of methods of protein structure prediction (CASP): Round II. Proteins, Suppl 1, 2-6.
145.Moult, J., Hubbard, T., Bryant, S. H., Fidelis, K. & Pedersen, J. T. (1999). Critical assessment of methods of protein structure prediction (CASP): round II. Proteins, Suppl 1, 2-6.
146.Marti-Renom, M. A., Madhusudhan, M. S., Fiser, A., Rost, B. & Sali, A. (2001). Reliability of assessment of protein structure prediction methods at CASP. Structure, in press.
147.Fischer, D., Barret, C., Bryson, K., Elofsson, A., Godzik, A. et al. (1999). CAFASP-1: critical assessment of fully automated structure prediction methods. Proteins, Suppl 3, 209-217.
148.Fischer, D., Elofsson, A., Rychlewski, L., Pazos, F., Valencia, A. et al. (2001). CAFASP2: the second critical assessment of fully automated structure prediction methods. Proteins, in press.
149.Bujnicki, J. M., Elofsson, A., Fischer, D. & Rychlewski, L. (2001). LiveBench-1: continuous benchmarking of protein structure prediction servers. Prot. Sci., 10, 352-361.
150.Hubbard, T., Tramontano, A., Barton, G., Jones, D., Sippl, M. et al. (1996). Update on protein structure prediction: results of the 1995 IRBM workshop. Folding & Design, 1, R55-R63.
151.Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242.
152.Rost, B., Sander, C. & Schneider, R. (1993). Progress in protein structure prediction? TIBS, 18, 120-123.
153.Rost, B., Baldi, P., Barton, G., Cuff, J., Eyrich, V. et al. (2001). Simple jury predicts protein secondary structure best. Proteins, submitted.
154.Lesk, A. M. (1997). CASP-2: Report on ab initio predictions. Proteins, 151-166.
155.Nakai, K. & Kanehisa, M. (1992). A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897-911.
156.Kessel, A. & Ben-Tal, N. (2002). Free energy determinants of peptide association with lipid bilayers. In Peptide-lipid interactions (Simon, S. & McIntosh, T., eds.), pp. in press, Academic Press, San Diego.
157.Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Prot. Engin., 10, 1-6.
158.Goffeau, A., Slonimski, P., Nakai, K. & Risler, J. L. (1993). How many yeast genes code for membrane-spanning proteins? Yeast, 9, 691-702.
159.Arkin, I. T., Brünger, A. T. & Engelman, D. M. (1997). Are there dominant membrane protein families with a given number of helices? Proteins, 28, 465-466.
160.Frishman, D. & Mewes, H. W. (1997). Protein structural classes in five complete genomes. Nat. Struct. Biol., 4, 626-628.
161.Jones, D. T. (1998). Do transmembrane protein superfolds exist? FEBS Lett., 423, 281-285.
162.Liu, J., Tan, H. & Rost, B. (2002). Eukaryotes full of loopy proteins? J. Mol. Biol., in preparation.
163.Gerstein, M. & Levitt, M. (1997). A structural census of the current population of protein sequences. Proc. Natl. Acad. Sci. U.S.A., 94, 11911-11916.
164.Przytycka, T., Aurora, R. & Rose, G. D. (1999). A protein taxonomy based on secondary structure. Nat. Struct. Biol., 6, 672-682.
165.Dunker, A. K. & Obradovic, Z. (2001). The protein trinity-linking function and disorder. Nat. Biotechnol., 19, 805-806.
166.Romero, P., Obradovic, Z., Li, X., Garner, E. C., Brown, C. J. et al. (2001). Sequence complexity of disordered protein. Proteins, 42, 38-48.
167.Brautigam, C., Steenbergen-Spanjers, G. C., Hoffmann, G. F., Dionisi-Vici, C., van den Heuvel, L. P. et al. (1999). Biochemical and molecular genetic characteristics of the severe form of tyrosine hydroxylase deficiency. Clin Chem, 45, 2073-2078.
168.Davies, G. P., Martin, I., Sturrock, S. S., Cronshaw, A., Murray, N. E. et al. (1999). On the structure and operation of type I DNA restriction enzymes. J. Mol. Biol., 290, 565-579.
169.de Fays, K., Tibor, A., Lambert, C., Vinals, C., Denoel, P. et al. (1999). Structure and function prediction of the Brucella abortus P39 protein by comparative modeling with marginal sequence similarities. Prot. Engin., 12, 217-223.
170.Di Stasio, E., Sciandra, F., Maras, B., Di Tommaso, F., Petrucci, T. C. et al. (1999). Structural and functional analysis of the N-terminal extracellular region of beta-dystroglycan. Biochem Biophys Res Commun, 266, 274-278.
171.Gerloff, D. L., Cannarozzi, G. M., Joachimiak, M., Cohen, F. E., Schreiber, D. et al. (1999). Evolutionary, mechanistic, and predictive analyses of the hydroxymethyldihydropterin pyrophosphokinase family of proteins. Biochem Biophys Res Commun, 254, 70-6.
172.Juan, H. F., Hung, C. C., Wang, K. T. & Chiou, S. H. (1999). Comparison of three classes of snake neurotoxins by homology modeling and computer simulation graphics. Biochem Biophys Res Commun, 257, 500-10.
173.Laval, V., Chabannes, M., Carriere, M., Canut, H., Barre, A. et al. (1999). A family of Arabidopsis plasma membrane receptors presenting animal beta-integrin domains. Biochim. Biophys. Ac., 1435, 61-70.
174.Seto, M. H., Liu, H. L., Zajchowski, D. A. & Whitlow, M. (1999). Protein fold analysis of the B30.2-like domain. Proteins, 35, 235-249.
175.Xu, H., Aurora, R., Rose, G. D. & White, R. H. (1999). Identifying two ancient enzymes in Archaea using predicted secondary structure alignment. Nat. Struct. Biol., 6, 750-4.
176.Jackson, R. M. & Russell, R. B. (2000). The serine protease inhibitor canonical loop conformation: examples found in extracellular hydrolases, toxins, cytokines and viral proteins. J. Mol. Biol., 296, 325-334.
177.Paquet, J. Y., Vinals, C., Wouters, J., Letesson, J. J. & Depiereux, E. (2000). Topology prediction of Brucella abortus Omp2b and Omp2a porins after critical assessment of transmembrane beta strands prediction by several secondary structure prediction methods. J Biomol Struct Dyn, 17, 747-757.
178.Shah, P. S., Bizik, F., Dukor, R. K. & Qasba, P. K. (2000). Active site studies of bovine alpha1-->3-galactosyltransferase and its secondary structure prediction. Biochim. Biophys. Ac., 1480, 222-234.
179.Stawiski, E. W., Baucom, A. E., Lohr, S. C. & Gregoret, L. M. (2000). Predicting protein function from structure: unique structural features of proteases. Proc Natl Acad Sci U S A, 97, 3954-8.
180.Rost, B. (1995). TOPITS: Threading One-dimensional Predictions Into Three-dimensional Structures. In Third International Conference on Intelligent Systems for Molecular Biology (Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T. et al., eds.), pp. 314-321, Menlo Park, CA: AAAI Press, Cambridge, England.
181.Fischer, D. & Eisenberg, D. (1996). Fold recognition using sequence-derived properties. Prot. Sci., 5, 947-955.
182.Russell, R. B., Copley, R. R. & Barton, G. J. (1996). Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol., 259, 349-365.
183.Ayers, D. J., Gooley, P. R., Widmer-Cooper, A. & Torda, A. E. (1999). Enhanced protein fold recognition using secondary structure information from NMR. Prot. Sci., 8, 1127-1133.
184.de la Cruz, X. & Thornton, J. M. (1999). Factors limiting the performance of prediction-based fold recognition methods. Prot. Sci., 8, 750-759.
185.Di Francesco, V., Munson, P. J. & Garnier, J. (1999). FORESST: fold recognition from secondary structure predictions of proteins. Bioinformatics, 15, 131-140.
186.Hargbo, J. & Elofsson, A. (1999). Hidden Markov models that use predicted secondary structures for fold recognition. Proteins, 36, 68-76.
187.Jones, D. T. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol., 287, 797-815.
188.Jones, D. T., Tress, M., Bryson, K. & Hadley, C. (1999). Successful recognition of protein folds using threading methods biased by sequence similarity and predicted secondary structure. Proteins, 37, 104-111.
189.Koretke, K. K., Russell, R. B., Copley, R. R. & Lupas, A. N. (1999). Fold recognition using sequence and secondary structure information. Proteins, 37, 141-148.
190.Ota, M., Kawabata, T., Kinjo, A. R. & Nishikawa, K. (1999). Cooperative approach for the protein fold recognition. Proteins, 37, 126-132.
191.Panchenko, A., Marchler-Bauer, A. & Bryant, S. H. (1999). Threading with explicit models for evolutionary conservation of structure and sequence. Proteins, Suppl 3, 133-140.
192.Kelley, L. A., MacCallum, R. M. & Sternberg, M. J. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol., 299, 499-520.
193.Heringa, J. (1999). Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.
194.Jennings, A. J., Edge, C. M. & Sternberg, M. J. (2001). An approach to improving multiple alignments of protein sequences using predicted secondary structure. Prot. Engin., 14, 227-231.
195.Ng, P., Henikoff, J. & Henikoff, S. (2000). PHAT: a transmembrane-specific substitution matrix. Bioinformatics, 16, 760-766.
196.Baldi, P., Pollastri, G., Andersen, C. A. & Brunak, S. (2000). Matching protein beta-sheet partners by feedforward and recurrent neural networks. Ismb, 8, 25-36.
197.Hubbard, T. J. P. & Park, J. (1995). Fold recognition and ab initio structure predictions using Hidden Markov models and β-strand pair potentials. Proteins, 23, 398-402.
198.Ortiz, A. R., Kolinski, A., Rotkiewicz, P., Ilkowski, B. & Skolnick, J. (1999). Ab initio folding of proteins using restraints derived from evolutionary information. Proteins, Suppl 3, 177-185.
199.Eyrich, V. A., Standley, D. M., Felts, A. K. & Friesner, R. A. (1999). Protein tertiary structure prediction using a branch and bound algorithm. Proteins, 35, 41-57.
200.Eyrich, V. A., Standley, D. M. & Friesner, R. A. (1999). Prediction of protein tertiary structure to low resolution: performance for a large and structurally diverse test set. J. Mol. Biol., 288, 725-742.
201.Lomize, A. L., Pogozheva, I. D. & Mosberg, H. I. (1999). Prediction of protein structure: the problem of fold multiplicity. Proteins, Suppl, 199-203.
202.Chen, C. C., Singh, J. P. & Altman, R. B. (1999). Using imperfect secondary structure predictions to improve molecular structure computations. Bioinformatics, 15, 53-65.
203.Samudrala, R., Xia, Y., Huang, E. & Levitt, M. (1999). Ab initio protein structure prediction using a combined hierarchical approach. Proteins, Suppl, 194-198.
204.Samudrala, R., Huang, E. S., Koehl, P. & Levitt, M. (2000). Constructing side chains on near-native main chains for ab initio protein structure prediction. Prot. Engin., 13, 453-457.
205.Casari, G., Sander, C. & Valencia, A. (1995). A method to predict functional residues in proteins. Nat. Struct. Biol., 2, 171-178.
206.Lichtarge, O., Bourne, H. R. & Cohen, F. E. (1996). Evolutionarily conserved Galphabetagamma binding surfaces support a model of the G protein-receptor complex. Proc. Natl. Acad. Sci. U.S.A., 93, 7507-7511.
207.Lichtarge, O., Yamamoto, K. R. & Cohen, F. E. (1997). Identification of functional surfaces of the zinc binding domains of intracellular receptors. J. Mol. Biol., 274, 325-337.
208.Pazos, F., Sanchez-Pulido, L., Garcia-Ranea, J. A., Andrade, M. A., Atrian, S. et al. (1997). Comparative analysis of different methods for the detection of specificity regions in protein families. In BCEC97: Bio-Computing and Emergent Computation (Olsson, B., Lundh, D. & Narayanan, A., eds.), pp. 132-145, World Scientific, Skövde, Sweden.
209.Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. et al. (1999). Detecting protein function and protein-protein interactions from genome sequences. Science, 285, 751-753.
210.Irving, J. A., Pike, R. N., Lesk, A. M. & Whisstock, J. C. (2000). Phylogeny of the serpin superfamily: implications of patterns of amino acid conservation for structure and function. Genome Res., 10, 1845-1864.
211.Andrade, M. A., O'Donoghue, S. I. & Rost, B. (1998). Adaptation of protein surfaces to subcellular location. J. Mol. Biol., 276, 517-525.
212.Gaasterland, T. (1998). Structural genomics taking shape. TIGS, 14, 135.
213.Rost, B. (1998). Marrying structure and genomics. Structure, 6, 259-263.
214.Sali, A. (1998). 100,000 protein structures for the biologist. Nat. Struct. Biol., 5, 1029-32.
215.Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R. et al. (1999). Structural genomics: beyond the human genome project. Nat. Gen., 23, 151-157.
216.Shapiro, L. & Harris, T. (2000). Finding function through structural genomics. Curr. Opin. Biotech., 11, 31-35.
217.Thornton, J. (2001). Structural genomics takes off. TIBS, 26, 88-89.
218.Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, submitted.
219.Rost, B. (1999). Short yeast ORFs: expressed protein or not? CUBIC, Columbia University, Dept. of Biochemistry & Mol. Biophysics, CUBIC preprint CUBIC-99-02.
220.Wootton, J. C. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Meth. Enzymol., 266, 554-571.
221.Saqi, M. (1995). An analysis of structural instances of low complexity sequence segments. Prot. Engin., 8, 1069-1073.
222.Garner, E., Cannon, P., Romero, P., Obradovic, Z. & Dunker, A. K. (1998). Predicting disordered regions from amino acid sequence: common themes despite differing structural characterization. Genome Inform., 9, 201-214.
223.Romero, P., Obradovic, Z., Kissinger, C., Villafranca, J. E., Garner, E. et al. (1998). Thousands of proteins likely to have long disordered regions. Pac. Symp. Biocomput., 3, 437-448.
224.Wright, P. E. & Dyson, H. J. (1999). Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol., 293, 321-331.
225.Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P. et al. (2001). Intrinsically disordered protein. J Mol Graph Model, 19, 26-59.
226.Tamames, J., Ouzounis, C., Casari, G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification of proteins in functional classes by their database annotations. Bioinformatics, 14, 542-3.

Footnotes



1 
Russell Doolittle coined the term 'twilight zone' for the region in which sequence similarity ceases to imply similarity in 3D structure [61]. Typically, this region begins around 33% pairwise sequence identity for proteins that align over 100 residues [60]. However, the vast majority of all proteins of similar structure have levels of sequence identity far below this mark; they populate the 'midnight zone', in which sequences have diverged to random levels of similarity [62, 63]. This observation may indicate that evolution has had enough time to reach an equilibrium at which we can no longer distinguish between two different events: the convergence of two different sequences to the same structure, and the divergence of sequences while maintaining structure [60, 62].

2 
By the time you read this, methods may already have improved. Consult the EVA WWW pages at cubic.bioc.columbia.edu for the latest statistics.




 

Table 1: Availability of prediction methods

| Method | Type | Server | Program (contact) |
|---|---|---|---|
| JPred2 | | jura.ebi.ac.uk:8888 | James Cuff james@ebi.ac.uk |
| PHD | acc | cubic.bioc.columbia.edu/predictprotein | Burkhard Rost rost@columbia.edu |
| PROFphd | acc | cubic.bioc.columbia.edu/predictprotein | Burkhard Rost rost@columbia.edu |
| ASP | sec+ | cubic.bioc.columbia.edu/predictprotein | Malin Young mmyoung@sandia.gov |
| COILS | sec | cubic.bioc.columbia.edu/predictprotein | Andrei Lupas andrei.lupas@tuebingen.mpg.de |
| HMMSTR | sec+ | | Chris Bystroff bystrc@rpi.edu |
| JPred2 | sec | jura.ebi.ac.uk:8888 | James Cuff james@ebi.ac.uk |
| PHDpsi | sec | cubic.bioc.columbia.edu/predictprotein | Burkhard Rost rost@columbia.edu |
| PHD | sec | cubic.bioc.columbia.edu/predictprotein | Burkhard Rost rost@columbia.edu |
| PROFking | sec | www.aber.ac.uk/~phiwww/prof | Ross King rdk@aber.ac.uk |
| PROFphd | sec | cubic.bioc.columbia.edu/predictprotein | Burkhard Rost rost@columbia.edu |
| PSIPRED | sec | insulin.brunel.ac.uk/psiform.html | David Jones d.jones@cs.ucl.ac.uk |
| SAM-T99sec | sec | www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html | Kevin Karplus karplus@cse.ucsc.edu |
| SSpro2 | sec | promoter.ics.uci.edu/BRNN-PRED | Pierre Baldi pfbaldi@ics.uci.edu |
| DAS | tmh | www.sbc.su.se/~miklos/DAS | |
| HMMTOP | tmh | www.enzim.hu/hmmtop | Gábor E. Tusnády tusi@enzim.hu |
| MEMSAT | tmh | insulin.brunel.ac.uk/psipred | David Jones d.jones@cs.ucl.ac.uk |
| PHD | tmh | cubic.bioc.columbia.edu/predictprotein | Burkhard Rost rost@columbia.edu |
| SOSUI | tmh | sosui.proteome.bio.tuat.ac.jp/sosuiframe0.html | Takatsugu Hirokawa sosui@biophys.bio.tuat.ac.jp |
| SPLIT | tmh | www.mbb.ki.se/tmap/index.html | Davor Juretic juretic@mapmf.pmfst.hr |
| TMAP | tmh | www.mbb.ki.se/tmap/index.html | Bengt Persson Bengt.Persson@ibp.vxu.se |
| TMHMM | tmh | www.cbs.dtu.dk/services/TMHMM-1.0 | Anders Krogh krogh@cbs.dtu.dk |
| TMpred | tmh | www.ch.embnet.org/software/TMPRED_form.html | Philipp Bucher pbucher@isrec-sun1.unil.ch |
| TopPred2 | tmh | www.sbc.su.se/~erikw/TopPred22 | Gunnar von Heijne gunnar@dbb.su.se |




Contact:    rost@columbia.edu Version:    Dec 20, 2001
