Protein fold recognition by prediction-based threading

Burkhard Rost, Reinhard Schneider & Chris Sander

contact e-mail:rost@embl-heidelberg.de


JMB, 1997, 270, 1-10

Abstract

In fold recognition by threading one takes the amino acid sequence of a protein and evaluates how well it fits into one of the known three-dimensional (3D) protein structures. The quality of sequence-structure fit is typically evaluated using inter-residue potentials of mean force or other statistical parameters. Here, we present an alternative approach to evaluating sequence-structure fitness. Starting from the amino acid sequence we first predict secondary structure and solvent accessibility for each residue. We then thread the resulting one-dimensional (1D) profile of predicted structure assignments into each of the known 3D structures. The optimal threading for each sequence-structure pair is obtained using dynamic programming. The overall best sequence-structure pair constitutes the predicted 3D structure for the input sequence. The method is fine-tuned by adding information from direct sequence-sequence comparison and applying a series of empirical filters. Although the method relies on reduction of 3D information into 1D structure profiles, its accuracy is, surprisingly, not clearly inferior to methods based on evaluation of residue interactions in 3D. We therefore hypothesise that existing 1D-3D threading methods essentially capture no more than the fitness of an amino acid sequence for a particular 1D succession of secondary structure segments and residue solvent accessibility. On average, the prediction-based threading method finds a structurally homologous region at first rank in 29% of the cases (when sequence information is included). For the 22% of first hits detected at the highest scores, the expected accuracy rose to 75%. The task of detecting entire folds rather than homologous fragments was managed much better: 45-75% of the first hits correctly recognised the fold.

Key words: protein structure prediction, threading, remote homology detection, fold recognition, secondary structure, relative solvent accessibility, multiple alignments, dynamic programming, neural networks.


Introduction

Reducing the sequence-structure gap by homology modelling. Large-scale gene-sequencing projects accumulate gene sequences, and hence protein sequences, at a breathtaking pace (Oliver et al., 1992; Fleischmann et al., 1995; Dujon, 1996; Johnston, 1996). However, information about three-dimensional (3D) structure is available for only a small fraction of the known proteins (Bernstein et al., 1977). Thus, although experimental structure determination has improved (Lattman, 1994), the sequence-structure gap continues to increase. One of the main tasks of theoretical biology is to reduce this gap by predictions. However, the only somewhat reliable way to predict 3D structure is homology modelling (Greer, 1991; Lesk & Boswell, 1992; May & Blundell, 1994): the structure of a protein of unknown structure (dubbed U) can be modelled by homology if a protein of known 3D structure is found that has more than 25-30% pairwise sequence identity to U (Chothia & Lesk, 1986; Doolittle, 1986; Sander & Schneider, 1991).

Possible scope of remote homology modelling. Two naturally evolved proteins can have rather different sequences and still fold into homologous structures. Currently there are thousands of remote homologues, i.e., homologues with less than 25% pairwise sequence identity, stored in a database of structurally aligned remote homologues (Holm et al., 1993; Holm & Sander, 1994). To illustrate the possible scope of remote homology modelling by numbers: homology modelling is currently applicable to over 11,000 SWISS-PROT (Bairoch & Apweiler, 1996) sequences with more than 25% pairwise sequence identity to a known structure. However, the majority of all homologous proteins supposedly have less than 25% pairwise sequence identity (Rost et al., 1996). Thus, for a significant fraction of currently known sequences remote homology modelling could yield 3D predictions.

Long way from fold recognition to remote homology modelling. The problem of detecting remote homologues is of the type 'needle in the haystack': aligning the unique folds (150) against the entire PDB (3,000) would yield 450,000 pairs, of which about 1,500 are remote homologues (Holm & Sander, 1994), i.e., the goal is to find the one true homologue among 100-300 decoys. A test of threading methods at the first meeting to evaluate structure prediction accuracy (Moult et al., 1995) suggested levels of 10-40% accuracy in correctly detecting the homologous fold (Lemer et al., 1995; Shortle, 1995). However, detection of the homologue is the simpler part of successful remote homology modelling. More problematic is to correctly align the homologous proteins and to correctly build the model (Bryant & Altschul, 1995; Lemer et al., 1995; Sippl, 1995). Only for a few cases has threading been shown to yield correct 3D models (Flöckner et al., 1995).

Here, we extend our previously proposed novel method for threading predictions of 1D structure into 3D structures (Rost, 1995a; Rost, 1995b). First, 1D structure profiles were predicted from multiple sequence alignments. Then, the 1D predictions were aligned to 1D projections of known structures. The novel aspect reported here is the combination of information from 1D predictions and sequences. We focus on the main aspects of the method; a detailed description of the algorithm is available electronically (Rost WWW, 1996c). The accuracy of the method in detecting remote homologues was evaluated on a data set of 89 unique protein folds. The ability to correctly build remote homologous models was investigated for all correctly detected remote homologues. Finally, we compared the performance of the method to other tools based on three different data sets.


Methods

Brief outline of the algorithm

The algorithm started from a protein sequence which was aligned by MaxHom (Sander & Schneider, 1991) against SWISS-PROT (Bairoch & Boeckmann, 1994) (Fig. 1). The resulting multiple sequence alignment was used as input to neural network systems predicting secondary structure (PHDsec, (Rost & Sander, 1994a)) and solvent accessibility (PHDacc, (Rost & Sander, 1994b)). The predictions were converted into 1D structural profiles. Up to this point the method was constrained to a straight prediction in 1D, i.e., without any reference to 3D structure or the final goal of threading. Effectively, the amino acid sequence had now been translated into a 1D string of structure symbols ('predicted structure profile'), with some cooperativity taken into account. The idea was now to find the 3D fold that had the most similar structure profile (in terms of secondary structure and accessibility). The next step was to represent each of the known folds in the database as an observed structure profile. Finally, predicted and observed 1D structure profiles were optimally aligned by a dynamic programming algorithm (MaxHom). The best hit of the alignment procedure was recorded, and the final best hit was taken as the predicted fold. The predicted 3D structure was modelled based on the alignment of the input sequence into the predicted fold.



Fig. 1. Threading predicted 1D structure profiles into known 3D structures. (1) A multiple sequence alignment is generated for a given sequence of unknown structure (U). (2) The alignment profile of U is used as input to a neural network system (PHD) that predicts secondary structure and relative solvent accessibility. (3) The resulting predicted 1D structure profile for U is aligned by dynamic programming (program MaxHom (Sander & Schneider, 1991)) to 1D structure strings assigned from known structures by the program DSSP (Kabsch & Sander, 1983). Abbreviations: H, helix; E, strand; L, rest; buried (<15% solvent accessible); o, exposed (≥15% solvent accessible).
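The translation of per-residue predictions into a six-state 1D structure profile (three secondary structure states times two accessibility states) can be sketched as follows. This is only an illustration under stated assumptions, not the PHD or MaxHom code: the function name encode_profile and the symbol 'b' for buried residues are hypothetical; the 15% threshold and the H/E/L/o symbols follow the Fig. 1 caption.

```python
from typing import List, Sequence, Tuple

BURIED_THRESHOLD = 15.0  # percent relative solvent accessibility (Fig. 1 caption)
SS_STATES = frozenset("HEL")  # helix, strand, rest


def encode_profile(secondary: str, rel_acc: Sequence[float]) -> List[Tuple[str, str]]:
    """Combine secondary structure (H/E/L) and relative accessibility (percent)
    into one six-state symbol per residue, e.g. ('H', 'b') or ('E', 'o').

    'b' (buried) is a hypothetical symbol chosen for this sketch.
    """
    if len(secondary) != len(rel_acc):
        raise ValueError("secondary structure and accessibility differ in length")
    profile = []
    for ss, acc in zip(secondary, rel_acc):
        if ss not in SS_STATES:
            raise ValueError(f"unexpected secondary structure symbol: {ss!r}")
        burial = "b" if acc < BURIED_THRESHOLD else "o"
        profile.append((ss, burial))
    return profile


if __name__ == "__main__":
    # Three residues: buried helix, exposed strand, exposed loop.
    print(encode_profile("HEL", [5.0, 15.0, 40.0]))
```

The same encoding would apply to observed profiles (DSSP assignments) and to predicted ones, which is what makes the two directly comparable by alignment.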


Alignment of 1D structure

Three alternatives for the aligned strings. For a practical application of the method, predicted 1D structure profiles were aligned to observed 1D structure profiles (PHD vs. PDB). To investigate the influence of the accuracy of 1D structure prediction, we performed the following calibration experiment: observed 1D structure profiles were aligned against observed 1D structure profiles (PDB vs. PDB). Another possible extension of the concept was the alignment of predicted against predicted 1D structure profiles (PHD vs. PHD). Such a search could yield a prediction of a fold identity between two proteins both of unknown structure. Alignments of 1D structure strings can reveal structural homologues as 1D structure is conserved between remote homologues (Rost WWW, 1996b).

Free parameters for dynamic programming. The predicted strings were aligned based on a Smith-Waterman type dynamic programming algorithm (Smith & Waterman, 1981). This algorithm was implemented in the program MaxHom (Sander & Schneider, 1991; Schneider, 1994). The following free parameters had to be adjusted: (i) the similarity matrix, and (ii) the penalties associated with the introduction of gaps in the alignment.

Similarity matrix for six states. Various strategies were explored to find the optimal matrix for weighting matches between 1D structure pairs (Rost, 1995a; Rost, 1995b). Here we used a matrix refined starting from database counts (Rost WWW, 1996c). Finally, we simplified the resulting matrix by making it symmetric and slightly more balanced.

Similarity for 120 states. The combination of information from 1D structure and sequence was accomplished by combining the 1D structure similarity matrix with a McLachlan (McLachlan et al., 1984) or a Blosum62 (Henikoff & Henikoff, 1992) exchange matrix:

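M_{ij} = \mu \times M_{ij}^{\mathrm{1D}} + (100 - \mu) \times M_{ij}^{\mathrm{seq}}    (1)

where M_{ij} determined the score for a match at a given position between state i in the first string and state j in the second string; M_{ij}^{\mathrm{1D}} and M_{ij}^{\mathrm{seq}} denote the 1D structure similarity matrix and the sequence exchange matrix, respectively; and µ = 0 - 100 tuned the percentage of 1D structure contribution to the final alignment score E (note that µ = 0 corresponded to a simple sequence alignment; µ = 100 marked an alignment based on 1D structure only).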
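As a minimal sketch of eq. (1), the score for a pair of combined states can be computed by weighting a 1D structure term with µ and a sequence term with 100 - µ. We assume here that each of the 120 states is a pair (structure state, amino acid), i.e., 6 x 20 combinations; the function names and the toy matrix values below are hypothetical and are not the matrices used in the paper.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def lookup(matrix: Dict[Pair, float], a: str, b: str) -> float:
    """Symmetric lookup: the matrix may store only one of (a, b) and (b, a)."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def mixed_score(state_a: Pair, state_b: Pair,
                struct_matrix: Dict[Pair, float],
                seq_matrix: Dict[Pair, float],
                mu: float) -> float:
    """Score two (structure_state, amino_acid) states:
    mu * structure term + (100 - mu) * sequence term."""
    if not 0 <= mu <= 100:
        raise ValueError("mu must lie between 0 and 100")
    (s_a, aa_a), (s_b, aa_b) = state_a, state_b
    return (mu * lookup(struct_matrix, s_a, s_b)
            + (100 - mu) * lookup(seq_matrix, aa_a, aa_b))


# Toy values for illustration only (hypothetical, not from the paper).
TOY_STRUCT = {("Hb", "Hb"): 3.0, ("Hb", "Eo"): -2.0, ("Eo", "Eo"): 3.0}
TOY_SEQ = {("A", "A"): 4.0, ("A", "L"): -1.0, ("L", "L"): 4.0}

if __name__ == "__main__":
    a, b = ("Hb", "A"), ("Eo", "L")
    for mu in (0, 50, 100):
        print(mu, mixed_score(a, b, TOY_STRUCT, TOY_SEQ, mu))
```

With µ = 0 only the sequence term contributes and with µ = 100 only the structure term, matching the two limiting cases described in the text.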

Gap open and gap elongation penalty. The optimal choice of gap penalties depends on the context, i.e., the particular alignment pair (Vingron & Waterman, 1994). For an alignment of a search sequence against a database, there is a trade-off between coverage (correct hits found vs. all possible correct hits) and accuracy (correct hits vs. all hits found) of detection for the choice of the gap parameters go (penalty for opening a gap) and ge (penalty for continuing an open gap). We compiled results for various gap open penalties. The relative values of the two were found to be of marginal importance; we used: ge = 0.1 × go.
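For illustration, a Smith-Waterman type local alignment with separate gap open and gap elongation penalties can be sketched as below. This is a generic textbook formulation (with the three-matrix recursion usually attributed to Gotoh), not the MaxHom implementation; the function name, the convention that the first gapped position costs go and every further one ge, and the toy scoring function are assumptions made for this sketch. Only the ratio ge = 0.1 x go is taken from the text.

```python
from typing import Callable, Sequence


def local_alignment_score(a: Sequence, b: Sequence,
                          score: Callable[[object, object], float],
                          gap_open: float,
                          gap_ext_ratio: float = 0.1) -> float:
    """Best local alignment score of a and b with affine gap penalties."""
    gap_ext = gap_ext_ratio * gap_open  # ge = 0.1 * go by default
    n, m = len(a), len(b)
    neg = float("-inf")
    # H: best score of an alignment ending at (i, j);
    # E / F: same, but ending with a gap in a / in b.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_ext)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_ext)
            diag = H[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            H[i][j] = max(0.0, diag, E[i][j], F[i][j])  # 0 restarts locally
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    # Toy identity scoring on secondary structure strings (hypothetical values).
    toy = lambda x, y: 2.0 if x == y else -2.0
    print(local_alignment_score("HHHHLLEEEE", "HHHHEEEE", toy, gap_open=2.0))
```

In practice the score callback would be the mixed 1D structure and sequence matrix, and a traceback would be added to recover the alignment itself rather than only its score.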

Evaluation of prediction accuracy

Cross validation and parameter optimisation. To ascertain that knowledge about structure was not used for the 1D prediction we used prediction networks that had been trained on proteins with less than 25% pairwise sequence identity to the predicted protein (cross validation). Furthermore, free parameters for the dynamic programming algorithm were optimised before the final results were compiled. This was achieved by varying free parameters based on a data set of 46 non-unique protein structures (list in (Rost, 1995b)).

Measuring accuracy of fold recognition. Prediction accuracy was defined as the cumulative percentage of correct predictions up to rank R, Q(R) (defined in eq. 6: (Rost WWW, 1996c)). To measure the accuracy obtained on subsets selected according to a fixed z-score (eq. 5: (Rost WWW, 1996c)) we defined Cor as the cumulative percentage of correct hits up to rank R for a given threshold z > θ (eq. 7: (Rost WWW, 1996c)). The corresponding coverage was Cov, defined as the percentage of hits found at R for θ (eq. 8: (Rost WWW, 1996c)). Cor(θ) and Cov(θ) determined the trade-off between accuracy and coverage. Results will be given for first ranks (R = 1) only. The definitions of accuracy and coverage at a given cut-off address the following questions. What is the expected accuracy of finding correct homologues if the hit list is cut at rank R and at a z-score above θ? And for which proportion of the proteins are predictions made at the given cut-off?
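The measures described above can be sketched as follows for the first-rank case (R = 1). The exact definitions are given in the cited equations, which are not reproduced here; the function names, the input layout, and the convention of returning None for the accuracy when no hit passes the cut-off are assumptions of this sketch.

```python
from typing import Optional, Sequence, Tuple


def q_of_r(correct_ranks: Sequence[Optional[int]], r: int) -> float:
    """Cumulative percentage of search proteins whose correct homologue
    appears at rank <= r (None means it was not found at all)."""
    if not correct_ranks:
        raise ValueError("no search proteins given")
    found = sum(1 for rank in correct_ranks if rank is not None and rank <= r)
    return 100.0 * found / len(correct_ranks)


def cor_cov(first_hits: Sequence[Tuple[float, bool]],
            threshold: float) -> Tuple[Optional[float], float]:
    """Accuracy and coverage of first hits with z-score above the threshold.

    first_hits holds one (z_score, is_correct) pair per search protein.
    Returns (Cor, Cov) in percent; Cor is None if no hit passes the cut-off.
    """
    if not first_hits:
        raise ValueError("no first hits given")
    selected = [ok for z, ok in first_hits if z > threshold]
    cov = 100.0 * len(selected) / len(first_hits)
    cor = 100.0 * sum(selected) / len(selected) if selected else None
    return cor, cov


if __name__ == "__main__":
    # Made-up example data, not results from the paper.
    hits = [(5.0, True), (4.0, True), (3.6, False), (3.8, True),
            (2.0, False), (1.0, True), (0.5, False), (3.0, False)]
    print(cor_cov(hits, 3.5))
    print(q_of_r([1, 3, None, 2], 2))
```

Raising the threshold typically raises the accuracy while lowering the coverage, which is the trade-off the two measures are meant to expose.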

Measuring accuracy of remote homology modelling. We measured alignment quality by (1) the percentage of pairwise sequence identity between the predicted and the structural alignment; (2) the average number of residues shifted between the predicted and the structural alignment; and (3) an alignment shift score (eq. 9: (Rost WWW, 1996c)). For the quality of the model we simply determined backbone root mean square deviations (rmsd, (Sippl, 1982)). The superposition was based on the sequence alignment obtained from the threading without any further optimisation (loop regions were included when compiling the rmsd values). We regarded the structural alignments taken from the FSSP database (Holm & Sander, 1994) as the correct 'standard-of-truth'. However, alignments between two structures are not always unique (Zu-Kang & Sippl, 1996). In some cases the alternative correct structural alignment might have better fit the prediction.
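The first two alignment measures can be illustrated with a small sketch. The precise definitions used in the paper are in the cited material; here we assume, hypothetically, that an alignment is stored as a mapping from residue positions of the search protein to positions of the template, that the identity is taken relative to the number of residue pairs in the structural alignment, and that the shift of a residue is the absolute difference between its two template positions.

```python
from typing import Dict, Tuple


def alignment_agreement(predicted: Dict[int, int],
                        reference: Dict[int, int]) -> Tuple[float, float]:
    """Compare a predicted alignment with a reference (structural) alignment.

    Returns (percent of reference pairs reproduced exactly,
             mean absolute shift over residues aligned in both).
    """
    if not reference:
        raise ValueError("reference alignment is empty")
    shared = [q for q in reference if q in predicted]
    identical = sum(1 for q in shared if predicted[q] == reference[q])
    pct_identical = 100.0 * identical / len(reference)
    if not shared:
        return pct_identical, float("nan")
    mean_shift = sum(abs(predicted[q] - reference[q]) for q in shared) / len(shared)
    return pct_identical, mean_shift


if __name__ == "__main__":
    # Made-up toy alignments: the second half is shifted by two positions.
    ref = {0: 0, 1: 1, 2: 2, 3: 3}
    pred = {0: 0, 1: 1, 2: 4, 3: 5}
    print(alignment_agreement(pred, ref))
```

A different choice of denominator (for example the number of residues aligned in both alignments) would change the percentages, so values from this sketch are not directly comparable to those reported in the tables.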

Data sets used for validation

Set of 89 unique folds. As of early 1996, there were more than 200 unique protein folds in PDB (Holm & Sander, 1994). These were used as a starting point to compile a set of 89 proteins used to evaluate the accuracy in detecting remote homologues (Rost WWW, 1996a). The resulting list of remote homologues comprised a rather difficult test set, as it included many cases for which the structural alignment covered only fragments of the two aligned proteins rather than extended over the entire 'folds'. Consequently, the results provided conservative estimates for the accuracy of 1D structure threading. The 'correct' remote homologues for the 89 search proteins and the 723 sequence-unique (<25% pairwise sequence identity) proteins used to search remote homologues are listed on the WWW (Rost WWW, 1996a).

Data sets for comparison with other methods. Finally, we compiled the results of our method based on three tiny sets of proteins for which results were published in the literature: (1) a set of 11 proteins used by Jones et al. (Jones et al., 1992) to evaluate the performance of the program THREADER (Tab. 4: (Rost WWW, 1996c)); (2) a set of 11 representative protein families used by Russell et al. (Russell et al., 1996) to evaluate the performance of the programs THREADER and MAP (Tab. 5: (Rost WWW, 1996c)); and (3) a set of 11 proteins used for the Asilomar 1994 prediction contest (Lemer et al., 1995; Tab. 6: (Rost WWW, 1996c)).


Results

Fold recognition

Loss of information by projection onto 1D limiting factor. When threading 1D structure profiles taken from the DSSP (Kabsch & Sander, 1983) assignments based on coordinates of known 3D structures (in other words completely correct 'predictions'), the first hit was correct in 35% of all test cases (PDB vs. PDB, µ = 100, Tab. 1). When using real predictions from PHD (at an average accuracy of about 70%), the first hit was correct in 23% of the cases (PHD vs. PDB, µ = 100, Tab. 1). Thus, the limited prediction accuracy of PHD (70%) reduced detection accuracy by 'only' 12 percentage points (35% - 23%), whereas the loss of information by projecting 3D structure onto 1D accounted for 65 percentage points (100% - 35%) in reducing detection accuracy.


Table 1

tab1.gif


Significant improvement by including sequence information. When 1D structure and sequence information were combined (eq. (1)), detection accuracy increased markedly: for a 50:50 mixture of 1D structure and sequence (µ = 50 in eq. (1)), 29% of the first hits were correct (Tab. 1), and in half of the test cases the correct homologue was detected among the first five alignment hits (Fig. 2). The choice of a particular sequence matrix (McLachlan vs. Blosum62) yielded different alignments (and most often different first hits). However, the overall accuracy for the entire test set was similar (Fig. 2). For a random prediction, the first hit would be correct in 2% of the cases. For a sequence alignment method (MaxHom with McLachlan matrix), the first hit was correct in about 15% of all cases (Tab. 1).



Fig. 2. Cumulative accuracy of detection vs. rank of hit. How many of the homologues were detected up to a certain rank R of the alignment list? For ranks R = 1-11, the cumulative percentages of correctly detected folds are shown (Q(R), for R = 1, ..., 11; see eq. 6: (Rost WWW, 1996c)). Thick lines: alignments mixing 1D structure and sequence 50:50 (µ = 50, eq. (1); dashed with full circles: McLachlan exchange matrix; solid with solid triangles: Blosum62 matrix); thin lines: alignments based on sequence (McLachlan matrix; dashed with crosses) and on 1D structure information only (full line with open squares). For all results, 1D structure information was obtained by cross-validated prediction (PHD), i.e., the knowledge about the 3D structure of the threaded sequence had been removed from the experiment and was used only to evaluate the results; the gap open penalty was chosen as 2. For example (arrow), the correct hit was found among the first five hits in more than 50% of the cases for an alignment including 1D structure, and in less than 15% of the cases for a simple sequence alignment. Or: 40% of the remote homologues were identified among the first two hits when combining 1D structure and sequence; among the first five when using only 1D structure; and among the first 15 (not shown) when using sequence alignment only.


Stronger hits more likely to be correct. When the alignment list was cut off at a z-score > 4.5 (eq. 5: (Rost WWW, 1996c)), the first hit was correct in 88% of the cases (Tab. 1). At this higher level of accuracy only 10 out of the 89 test proteins were detected (Fig. 3). The correlation between z-score and prediction accuracy illustrated, in particular, the strength of prediction-based threading, as opposed to simple sequence alignment. The sequence alignment used as reference resulted in relatively many correct first hits (15%), but it was very difficult to separate the chaff from the wheat: for 25% of the first hits the z-score was above 4.5, and of these only 30% were predicted correctly (Tab. 1). In other words, sequence alignments reached a similar level of accuracy as prediction-based threading for every fourth protein.



Fig. 3. Focusing on stronger predictions. The percentage of correct first hits can be increased by focusing on hits detected with higher z-scores. However, the increase of accuracy was at the expense of coverage. For example, at z > 3.5 75% of all first hits were correct, but only for 22% of all test proteins the first hit reached a z-score > 3.5. In other words, the fifth of the test cases predicted most strongly reached an accuracy of 75% (Q(1)).


Successful detection of remote homology in absence of 3D information. One of the features of prediction-based threading is that the detection of remote homology is not restricted to knowing the structure of the target. Instead, a sequence of unknown structure can be threaded through a library of predicted 1D structure assignments. The result was surprisingly not much inferior to the case of using known 3D structures: 27% of the hits were correctly detected at first rank (PHD vs. PHD; Tab. 1).

Better recognition of entire folds than of shorter fragments. The test set of 89 proteins was deliberately chosen to answer the question: how accurately can the method detect any remote homologous fragment in a library of protein structures (remote homology detection)? An easier task is to detect similarities between entire folds (fold detection). We generated subsets of our full test set by excluding all cases for which the structural alignments covered only a small fraction of the aligned pair. For example, if the goal was to detect similarities that cover at least 70% of the lengths of both proteins, the expected accuracy (correct first hit) rose to 50%. Thus, prediction-based threading was clearly more successful in capturing homologies between entire folds than in detecting homologies between local regions.

Remote homology modelling

Few correct predictions of 3D structure. Given a correctly detected remote homologue, how accurate was the alignment? This question was addressed in two ways. First, the predicted alignments were compared to the structural alignments. For the hits correctly detected at ranks 1 and 2, the average shift score (eq. (9): (Rost WWW, 1996c)) was 38%, the average identity of the residues between predicted and structural alignments was 33%, and the average shift was 11 residues (Tab. 3: (Rost WWW, 1996c)). More than half of the hits correctly detected at first rank reached an alignment shift score above 50% (15 out of 25); and one half (13 out of 25) had more than 50% of the residues identical to the structural alignment (Tab. 3: (Rost WWW, 1996c); three representative alignments are given in Fig. 7: (Rost WWW, 1996c)). For the second way to evaluate the alignment, we simply superimposed the backbone model resulting from the predicted alignment with the known structure of the search protein. For only six of the test cases correctly detected at first rank (total of 25) did the final model for the 3D structure of the threaded sequence deviate by less than 2 Å rmsd from the optimal superposition of the two structures (Tab. 3: (Rost WWW, 1996c)).

Comparison to other threading methods

A favourable set of 11 proteins. Russell et al. (1996) recently evaluated their prediction-based threading method (MAP) and the THREADER program of Jones et al. (1992) based on a small set of 11 proteins. For the first hit they reported an accuracy of 37-45% (depending on the threshold used for defining homologous structures) for MAP and of only 9-19% for THREADER (Jones et al., 1992). On the same 11 families, our prediction-based threading resulted in 78% correct first hits (Tab. 5: (Rost WWW, 1996c)). The reported quality of the alignments (percentage of identical residues between predicted and structural alignment) was 15% for MAP and 11% for THREADER (Russell et al., 1996). For our prediction-based threading the average percentage of correctly aligned residues was 27% (Tab. 5: (Rost WWW, 1996c)). Thus, although the set used by Russell et al. (1996) was much more conservative than the one used initially by Jones et al. (1992) (both THREADER and our method yielded 100% correct first hits on that set, Tab. 4: (Rost WWW, 1996c)), it still yielded very optimistic estimates for prediction accuracy when compared to the performance on our set of 89 proteins. Did we select a set that yielded too pessimistic estimates of performance accuracy?

The 11 Asilomar 1994 targets. A final test of our method on 11 proteins that were used as threading targets at the first Asilomar meeting for the evaluation of prediction methods (Lemer et al., 1995; Moult et al., 1995) suggested that the estimates derived on our initial set of 89 proteins might be closer to the 'reality' for using automated threading than those derived on favourable test sets. For the Asilomar 11 we correctly detected the remote homologues at first rank in four cases (i.e. 36%, Tab. 6: (Rost WWW, 1996c)). The average percentage of correctly aligned residues was 21%; the average shift 9 residues; and the alignment shift score on average AS = 26% (eq. 9: (Rost WWW, 1996c)). Thus, the alignments were mostly wrong. How did the results compare to the blind predictions made for the meeting? The best methods performed better than our method: (1) the expert-driven usage of THREADER by David Jones and colleagues (Jones et al., 1995) detected 5 out of 9 proteins correctly at first rank; and (2) the best alignments of the potential-based threading method perfected by Manfred Sippl and colleagues (Flöckner et al., 1995) were clearly better than our best ones.

Remote homology modelling. Correctness of the alignment, and consequently of the 3D model obtained by threading, has hardly been evaluated in the literature. One common example is the homology between the heat shock protein 70 (PDB code: 2hsc) and the A chain of the muscle protein actin (PDB code: 2atnA). Searching with 2hsc, the 1D-profile threading brought up 2atnA at first rank. The predicted alignment agrees for 44% of the residues with the structural alignment taken from FSSP (Holm & Sander, 1994). For a threading method based on energy calculations, Abagyan et al. (1994) published the predicted alignment for the last 232 residues of the same pair. They report that the alignment was wrong for the C-terminal part of the molecules; for the 232 aligned residues, their alignment is identical to the structural alignment for 14% of the residues. Interestingly, for the same region the prediction-based threading has 22% of the residues identical to the structural alignment, i.e., is clearly worse than the average for the entire protein.


Conclusion

Successful fold recognition by threading predicted 1D structure profiles. Fold motifs could be detected automatically by aligning predicted and known 1D structure profiles (secondary structure and solvent accessibility). However, even for an - in practice unrealistic - optimal prediction of 1D structure (assignment from known coordinates), the first hit was correct in only 35% of all test cases (Q(1), Tab. 1). A realistic prediction of 1D structure (obtained by cross-validated PHD predictions) yielded 23% detection accuracy. This result suggested two conclusions. (1) The loss of information by projecting 3D information onto 1D structure profiles was the bottle-neck of the method. To illustrate this problem: at least 16 unrelated structures contain the secondary structure motif 'H-E-E-H-E-E' (data not shown). An additional incorporation of information about inter-residue distances may open that bottle-neck. (2) Further improvements of 1D structure predictions could improve the accuracy of prediction-based threading significantly.

Better fold recognition by combining 1D structure profiles and sequence information. The novel step introduced here (combining 1D structure profiles with sequence information, eq. (1)) increased detection accuracy significantly: 29% of all first hits were correct (Tab. 1), and in about 53% of the test cases the correct homologue was found among the first five hits (Fig. 2). Thus, the prediction-based threading was clearly superior to sequence alignments (15% correct first hits, Fig. 2). Furthermore, accuracy could be increased by focusing on the subset of those hits which were predicted with higher z-scores. For example, for the 10% of all proteins predicted at z > 4.5 (eq. 5: Rost WWW, 1996c) the expected accuracy of correctly detecting the fold at first rank rose to 88% (Tab. 1, Fig. 3). Homologous folds were detected more accurately than homologous fragments. For example, for a test set with true homologues for which the alignment covered 70% of both aligned sequences, one half of the first hits were correct (Fig. 6: Rost WWW, 1996c). A feature of prediction-based threading that may become particularly interesting for applications in practice is that remote homology can successfully be detected between protein pairs without knowledge of 3D structure: when using 1D structure predictions as fold library, we correctly detected the remote homologue in 27% of the test cases at first rank (Tab. 1).

Prediction-based threading competitive to other threading techniques. A recent analysis based on a small set of 11 structure families (Russell et al., 1996) suggested a significantly lower detection accuracy (correct first hits), below 20%, for the potential-based threading program THREADER (Jones et al., 1992). The prediction-based threading method of Russell et al. reached 37% to 45% accuracy. For the same 11 families our method had 75% correct first hits (one standard deviation > 15%; Tab. 2). When - in retrospect - evaluating our method on the 11 threading targets used for the Asilomar 1994 prediction contest (Moult et al., 1995) we had four first hits correct (36%). However, other methods performed better (Lemer et al., 1995): an expert-driven usage of THREADER had more correct first hits (Jones et al., 1995), and the potential-based threading by Sippl and colleagues produced more accurate best alignments (Flöckner et al., 1995). Fischer and Eisenberg (Fischer & Eisenberg, 1996; Fischer et al., 1996) have recently developed a method for prediction-based threading that is very similar to the one presented here. They evaluated their method and previous potential-based threading methods based on a large set of 64 remote homologues and reported 31% correct hits for potential-based threading (Bowie et al., 1990; Bowie et al., 1991; Lüthy et al., 1991; Lüthy et al., 1992) and 48% correct hits for prediction-based threading (Fischer & Eisenberg, 1996). This confirms the conclusions suggested by the results presented here and previously (Rost, 1995a; Rost, 1995b): in correctly identifying the first hit, prediction-based threading is at least as accurate as potential-based threading.


Table 2

tab2.gif


Correct prediction of 3D structure by remote homology modelling for single cases. The correct detection of remote homology is the precondition for remote homology modelling. However, correct detection does not imply correct alignments. On the contrary, for most correctly detected remote homologues the alignment was, at least, partially wrong (for one half of the hits correctly predicted at first rank the identity between predicted and structural alignment was above 50%, Tab. 3: (Rost WWW, 1996c)). The same is true for most other threading techniques (Flöckner et al., 1995; Lemer et al., 1995; Shortle, 1995; Fischer & Eisenberg, 1996; Russell et al., 1996). How can a false alignment result in the detection of the true remote homologue among a huge set of decoys? The answer remains open.

Method available by automatic prediction service. The prediction-based threading of 1D structure profiles (PHDthreader) is available via an automatic prediction service (send the word help to the internet address PredictProtein@EMBL-Heidelberg.DE, or use the World Wide Web (WWW) site http://dodo.bioc.columbia.edu/predictprotein/). By default, input strings (1D structure profiles) are generated by a PHD prediction; however, users can also opt to provide their own predictions of secondary structure and solvent accessibility.

Will threading replace structure determination? The number of different protein folds is probably limited (Chothia, 1992). Thus, will threading eventually close the sequence-structure gap by remote homology modelling? Three reasons make this appear to be overoptimistic science fiction. (1) Correct alignments are still the exception rather than the rule. (2) Even when the alignments are correct, remote homology modelling at levels of less than 30% pairwise sequence identity is yet another unsolved problem (even for close homologues, modelling is not always successful). (3) The more unique folds are contained in the database, the more difficult the detection will become. This was illustrated by the following experiment. We aligned our 89 test proteins against three different 'fold libraries': (a) the largest set of sequence-unique proteins as of spring 1996 (723 chains (Rost WWW, 1996a)), (b) the largest set of 1995 (449 chains), and (c) a set of unique folds (plus the detectable homologues, 403 chains). The percentage of correctly detected first hits decreased as the size of the data set increased: 29% (a), 31% (b) and 33% (c). This result probably stems from the fact that the selection procedure is non-linear. Thus, the likelihood of random errors is increased by increasing the fold library. In other words, we doubt that threading is likely to close the sequence-structure gap in the future, but it can contribute to bridging it today.


Acknowledgements

First of all, thanks to Manfred Sippl (Univ. Salzburg) for discussions and help. Furthermore, thanks to Michael Braxenthaler (CARB, Washington, DC), Séan O'Donoghue (EMBL, Heidelberg), and Daniel Fischer and David Eisenberg (both UCLA, Los Angeles) for helpful dialogues; and to Rob Hooft (EMBL, Heidelberg) for software assistance. Last, not least, thanks to all those who deposit protein structures and protein sequences in public databases and those maintaining high quality databases - to mention, in particular, Amos Bairoch and his group (Basel) - thereby enabling the design of prediction methods.


References