Pedestrian guide to analysing sequence databases

Burkhard Rost 1,2 and Reinhard Schneider 1

in: Ashman K. (ed.): 'Core techniques in Biochemistry'. Heidelberg: Springer, 1997, in press.

1 EMBL, 69012 Heidelberg, Germany;

2EBI, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, England

e-mail: and


Over the past few years our means of communication have changed rapidly due to the growth of the World Wide Web (WWW). The Web enables molecular biologists to immediately access databases, scan literature, find information about related research and researchers, and trace cell cultures. Wet-lab biologists can uncover information about the protein of interest without having to become experts in sequence analysis. Here, we present a variety of tools, provide an overview of the state of the art in sequence analysis, and describe some of the principles of the methods.

1 Introduction

Theory can contribute to speeding up experiments. Imagine you have a protein sequence, either sequenced in your own lab or pulled down from genome projects or EST production. Can theoretical biology assist you, today, in finding a priori information about your protein that may be useful to accelerate and design experiments? A general message from a workshop organised by Anna Tramontano (IRBM, Rome) and Tim Hubbard (MRC, Cambridge) to explore the scope of current tools in sequence analysis was (Hubbard, et al. 1996): this depends on how much you are interested in finding answers to your questions! Theoretical biology still fails to predict 3D structure from sequence, but predictions of various simplified 1D aspects of structure, of functional residues and of sequence patterns become more accurate and more useful with every new sequence added to public databases.

How to access the services? Suppose you decide to let theoretical biology assist you. How can you access the tools? Rapidly developing electronic communication (Internet, World Wide Web) facilitates the spread of prediction methods. When we set up a service for the prediction of secondary structure in 1992, we happened to be among the first to offer sequence analysis via the internet (Henikoff 1993; Rost and Sander 1992; Rost, et al. 1994a). Four years later, the number of methods available via the net has exploded to the extent that the problem is not to find a service, but to select the 'best' among those on offer (e.g. 'Pedro's list' in Box 1).

In this review, we attempt to provide a brief overview of the availability of, and the principles used by, some of the publicly available methods (a comprehensive collection of tools and databases was recently edited by Russ Doolittle (Doolittle 1996); the first 1996 issue of Nucleic Acids Research was devoted to presenting databases). The boxes associated with sections 3 - 5 give WWW addresses (abbreviations and technical terms in Table 1; more detailed descriptions of websites in chapter 3). All addresses are available directly on the WWW via (convention: WWW addresses are typed in Courier font). The possible pitfalls of sequence analysis are numerous, including picking a poorly performing server or misinterpreting the results (Rost and Valencia 1996). In section 6, we indicate ways around such traps. The literature referenced comprises a tiny excerpt from an extremely active field of research.

Table 1: Abbreviations and technical terms

3D three-dimensional (co-ordinates of protein structure)
2D two-dimensional (e.g. inter-residue contacts)
1D one-dimensional (e.g. sequence or string of secondary structure)
db database
ORF open reading frame
U protein sequence of unknown 3D structure and/or unknown function (e.g. search sequence in alignment procedure)
T target used for homology modelling (protein of known 3D structure)

sequence identity percentage of residues identical (D->E = 0; D->D = 1) between two sequences aligned (insertions excluded)
sequence similarity percentage of residues similar (D->E = 1; D->D = 1) between two sequences aligned (insertions excluded); there are two ways to convert similarity into percentage values: (i) by normalising the similarity score by the maximal possible score (percentage residue similarity), and (ii) by setting an arbitrary threshold of the similarity score to distinguish similar from not similar and counting the percentage of residues that are similar according to this threshold (percentage of similar residues); note that different similarity matrices are used to account for physico-chemical properties of amino acids; consequently, levels of similarity are usually not directly comparable between different alignment methods.

browser program used to access the WWW
CPU Central Processing Unit, i.e., 'core' of the computer
ftp file transfer protocol, i.e., program to copy files via the internet
HTML HyperText Markup Language, i.e., format of a WWW document
Internet most important network connecting computers worldwide
plug-in module to be plugged into your WWW browser, provides special functionality, such as graphics to display 3D structures
URL Uniform Resource Locator, i.e., address of a WWW site (in other words the address you have to provide to open a location)
WWW (W3, Web) World Wide Web, i.e., vehicle for accessing and offering documents via the Internet
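
The definitions of sequence identity and sequence similarity in Table 1 can be illustrated with a minimal sketch. It uses variant (ii) for similarity, i.e. counting residue pairs that are 'similar' under a fixed rule. The similarity groups below are an arbitrary illustration chosen for this example (an assumption, not one of the published exchange matrices), and the function names are hypothetical.

```python
# Illustrative similarity groups (assumption, not a published matrix).
SIMILAR_GROUPS = [set("DE"), set("KRH"), set("ILVM"), set("FYW"), set("ST"), set("NQ")]


def aligned_pairs(seq_a, seq_b, gap="-"):
    """Return residue pairs of two aligned sequences, skipping insertions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]


def is_similar(a, b):
    """Identical residues, or residues sharing one of the groups above."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity(seq_a, seq_b):
    """Percentage of identical residues (D->E = 0; D->D = 1)."""
    pairs = aligned_pairs(seq_a, seq_b)
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def percent_similar(seq_a, seq_b):
    """Percentage of similar residues (D->E = 1; D->D = 1)."""
    pairs = aligned_pairs(seq_a, seq_b)
    if not pairs:
        return 0.0
    return 100.0 * sum(is_similar(a, b) for a, b in pairs) / len(pairs)
```

Because the result depends on the chosen groups (or matrix and threshold), similarity values obtained this way are only comparable within one method.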

2 State of the art in predicting protein structure and function

2.1 Sequence-structure gap

Large-scale sequencing projects produce data of gene, and hence protein, sequences at breathtaking pace (Fleischmann, et al. 1995; Johnston 1996). Although experimental determination of protein three-dimensional (3D) structure has become more efficient (Lattman 1994), the gap between the number of known sequences (>150,000; Bairoch and Apweiler 1996; Benson, et al. 1996) and the number of known structures (>4,000; Bernstein, et al. 1977) is rapidly increasing. Protein structure prediction aims at reducing this sequence-structure gap. Methods can be grouped by the level of abstraction of 3D structure that is addressed, namely one, two, and three dimensions (Fig. 1; Rost and Sander 1994c; Rost and Sander 1996). Given a protein sequence of unknown structure (dubbed U), what can we uncover about the structure of U by using theoretical tools, or what can theory contribute to bridging the sequence-structure gap?

Fig. 1. Representation of HIV protease (PDB code 1hiv) in 1D and 3D. Each of the representations gives rise to a different type of prediction. (1D) AA, sequence in one-letter alphabet; sec, secondary structure, with H for helix, E for strand, and blank for other (observed OBS taken from DSSP (Kabsch and Sander 1983) and predicted PHD (Rost 1996)); acc, relative solvent accessibility, with b for buried (<9% exposed), blank for intermediate (9%-25%), and e for exposed (>25%; observed: OBS = DSSP and predicted: PHD). (2D) inter-residue contact-map (omitted for simplicity, see e.g. (Rost and Sander 1996)). (3D) the trace of the protein chain in three dimensions is plotted schematically as a ribbon alpha-carbon trace (graph generated using the RIBBON interface in WHAT IF (Vriend and Sander 1993)).
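
The three accessibility states used in the caption of Fig. 1 can be written down as a small mapping. This is only a sketch of the thresholds quoted in the caption; the function names are hypothetical, and the input is assumed to be relative solvent accessibility in percent.

```python
def accessibility_state(relative_accessibility):
    """Map relative solvent accessibility (%) to 'b', ' ' or 'e'.

    Thresholds follow the figure caption: buried < 9%,
    intermediate 9%-25%, exposed > 25%.
    """
    if relative_accessibility < 9:
        return "b"
    if relative_accessibility <= 25:
        return " "
    return "e"


def accessibility_string(values):
    """Convert a list of per-residue accessibilities into a 1D string."""
    return "".join(accessibility_state(v) for v in values)
```

Applied residue by residue, this yields the kind of 1D string shown in the 'acc' rows of the figure.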

2.2 Predicting protein structure in 3D

Homology modelling applicable to 3-30% of the known protein sequences. The most successful tool for prediction of 3D structure is homology modelling. An approximate 3D model can be built for a sequence of unknown structure (U), if U has significant similarity to a protein of known structure (target T). The proportion of proteins deposited in SWISS-PROT for which structure can be inferred by homology modelling appears to be increasing (Fig. 2). Has homology modelling thus raised the number of 'known' 3D structures from 4,000 (June 1996) to over 16,000 (Schneider and Sander 1996)? This figure is slightly overoptimistic. The accuracy of homology modelling increases with the similarity between U and T. Above levels of 90% sequence identity, homology modelling is on average as accurate as experimental determination of protein structure (Chinea, et al. 1995; Overington, et al. 1992; Sali and Blundell 1994). However, at this level of accuracy 3D structure can be predicted for only about 1,900 sequences (a considerable fraction of which are myoglobins; Fig. 3). For lower levels of pairwise sequence identity, homology modelling, at best, is able to produce a correct sketch or ribbon plot of 3D structure (Fig. 1) (Moult, et al. 1995).

Fig. 2. Protein sequences for which 3D structure can be predicted. The full clock cycle corresponds to all protein sequences stored in SWISS-PROT (Bairoch and Apweiler 1996) in the year given. The black areas mark the percentage of proteins for which 3D structure could be predicted (for the entire protein or for fragments) by homology to a protein of known 3D structure (Schneider and Sander 1996). On average the alignments between the SWISS-PROT proteins U for which homology modelling could be applied and its homologue of known structure covered about 70% of the residues of the U's.

Fig. 3. Coverage of homology modelling vs. sequence identity. The accuracy of homology modelling depends on the level of pairwise sequence identity between the protein to be modelled (U) and the template (T) of known structure. Above 90% sequence identity homology modelling reaches the accuracy of structure determination; at this level about 4% of the SWISS-PROT proteins (i.e. 1,900) can be modelled. Down to levels of about 70% identity, i.e., for about 11% of SWISS-PROT (4,600 sequences), homology modelling can still provide rather accurate models.

Threading as a means to widen the scope of homology modelling. Part of the problem of homology modelling at lower levels of similarity is to correctly align U and T. Sequence alignments are more or less straightforward for levels above 30% pairwise sequence identity. The region between 20 and 30% sequence identity is frequently referred to as the twilight zone (Chothia and Lesk 1986; Doolittle 1986; Sander and Schneider 1991): experts may manage to produce good alignments, but automatic alignment methods have difficulty even distinguishing between correct homologues and false positives. Threading techniques are a means to automatically intrude into the twilight zone by detecting remote homologues (sequence identity < 25%) (Bryant and Altschul 1995; Shortle 1995; Sippl 1995; Wodak and Rooman 1993), i.e., they attempt to 'thread' a sequence of unknown structure onto a known structure and to assess the fitness of the sequence for that structure (4.3.2). Threading can produce correct alignments and consequently can lead to 3D predictions even for levels below 10% sequence identity. However, the accuracy of threading methods is rather limited (Lemer, et al. 1995): at least partially correct automatic predictions of 3D structure are, supposedly, to be expected for less than 10% of the proteins threaded (Lemer, et al. 1995; Rost, et al. 1996c).
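
The identity levels quoted in this section (90%, 70%, 30%, and the twilight zone between 20 and 30%) can be collected into a rough lookup. This is a sketch only: the boundaries are approximate, the handling of the exact boundary values is an arbitrary choice, and the function name and wording of the labels are hypothetical.

```python
def modelling_regime(percent_identity):
    """Rough expectation for a model of U built on T, by sequence identity."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("sequence identity must lie between 0 and 100")
    if percent_identity >= 90:
        return "about as accurate as experimental structure determination"
    if percent_identity >= 70:
        return "rather accurate model"
    if percent_identity >= 30:
        return "alignment fairly straightforward; model is at best a correct sketch"
    if percent_identity >= 20:
        return "twilight zone; automatic alignments unreliable"
    return "below the twilight zone; threading may detect remote homologues"
```

Such a lookup is no substitute for inspecting the alignment, in particular its length and the distribution of gaps.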

NO 3D predictions for proteins from sequence, yet! Claims that the structure prediction problem has been solved are constantly being issued in the public press (Brown 1995) or even in scientific journals (Holden 1995). However, so far not a single successful prediction of 3D structure from sequence alone has been published. And despite the advance of the field enabled by the growth of public databases (Rost and Sander 1994c), we probably have to work until the next millennium to solve the "structure prediction problem".

2.3 Predicting protein structure in 1D

If there is no homologue of known structure for U, we are forced to resort to simplifications of the prediction problem (Fig. 1). In the process, we can make use of the rich diversity of information in current databases. The pay-off from simplification is that predictions can be made for all proteins of known sequence. Examples readily available via automatic services are predictions of: secondary structure, solvent accessibility, location and topology for transmembrane helices, and coiled-coils (Henikoff 1993; Rost 1996; Rost, et al. 1994a). 1D predictions are most often correct. Predictions of inter-residue contacts (2D) are more difficult. Currently, methods are restricted to predictions of contacts between beta-strand residues (Hubbard 1994; Hubbard and Park 1995), or to proteins for which there are sufficiently informative multiple sequence alignments available (Goebel, et al. 1994; Neher 1994; Rodionov and Johnson 1994; Shindyalov, et al. 1994; Taylor and Hatrick 1994). For example, if the goal is to predict 5% of the long-range contacts (sequence separation above 10 residues) the expected accuracy for these contacts - given a reasonably informative sequence alignment - is about 50% (Goebel, et al. 1994).
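
To make the accuracy figure for contact predictions concrete, the following sketch evaluates the highest-scoring predicted long-range contacts against a set of observed contacts. It reads 'predicting 5% of the long-range contacts' as selecting as many predictions as 5% of the observed long-range contacts; this reading, the data layout (pairs of residue indices), and all names are assumptions made for illustration.

```python
def top_contact_accuracy(predicted_scores, true_contacts,
                         min_separation=11, fraction=0.05):
    """Fraction of the top-ranked long-range predictions that are correct.

    predicted_scores: dict mapping (i, j) with i < j to a prediction score
    true_contacts:    set of observed contacts (i, j) with i < j
    min_separation:   11 corresponds to 'sequence separation above 10'
    """
    long_range_true = {(i, j) for i, j in true_contacts
                       if j - i >= min_separation}
    candidates = [(score, pair) for pair, score in predicted_scores.items()
                  if pair[1] - pair[0] >= min_separation]
    # Number of predictions to evaluate: a fraction of the observed contacts.
    n_pick = max(1, round(fraction * len(long_range_true)))
    candidates.sort(reverse=True)
    picked = [pair for _, pair in candidates[:n_pick]]
    if not picked:
        return 0.0
    return sum(pair in long_range_true for pair in picked) / len(picked)
```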

2.4 Predicting protein function

Inferring function based on alignment. For a long time the assumption was that we would first have to predict 3D structure from sequence and then predict details of protein function based on the 3D structure. However, this concept appears not to hold up in practice. Currently, the most successful means to predict aspects of protein function is inferring function from alignments: if there is a protein in the database for which we have annotated experimental information and if this protein (T) is homologous to a protein of unknown function U, then U is predicted to have the same function as T. An extension of this concept is particularly useful when analysing entire genomes: rather than predicting the function for a particular protein only, entire metabolic pathways are built (Gaasterland and Selkov 1995; Karp, et al. 1996).

To what extent can we infer function? The first eukaryotic genome (that of yeast Saccharomyces cerevisiae) has been sequenced (Johnston 1996; Oliver 1996). For about 30% of all yeast sequences function is directly known from experiment (Dujon 1996). But for a large fraction, about 35% of the total, functional information was derived by homology transfer (Casari, et al. 1996). In some cases, the transferred information exquisitely describes the detailed biochemical and/or cellular function of a new gene, e.g., that of the yeast open reading frame (ORF) YCR14c on chromosome III as the functional cousin of the mammalian DNA polymerase beta (Casari, et al. 1996). In other cases, the power of prediction is very limited, because of strong functional divergence or because the homology is limited to a sequence fragment, e.g., the prediction of nucleic acid binding properties based on the presence of a zinc finger motif (Casari, et al. 1996).

3 Getting access to databases and services

3.1 Surfing on the World Wide Web

What you need to take off. Apart from motivation, you need a computer, an internet connection, and a program that can display documents on the WWW (a 'browser', e.g. Netscape, Mosaic). Should you have any problems, contact your local system manager. The initial barrier may be considerable, but once you are hooked in, the rest will be much easier.

What to do next? You typically begin by opening a document ('open location'). For example, you can try: to display the contents of Box 1. Documents are written in a format dubbed hypertext (or HTML: hypertext mark-up language). Some of the text will appear highlighted (or underlined) on your screen. Those are the 'links' (click!). Some introductions and general sites to start searching are given in Box 1. Search engines are programs that search through all WWW pages (world-wide!) for specific keywords (some sites enable the restriction of searches to certain subjects and/or certain sites).

Finding literature, reading journals and following pathways. You want to quickly scan the literature (Medline) or want to see the contents of the latest Nature issue? Click around! We have listed some of the journal sites, the NCBI Entrez Browser (Schuler, et al. 1996) for doing restricted Medline searches and other interesting sites in Box 2 (more in 4.1 and 5.1).

3.2 Public databases

Database access. In principle there are two ways to access public databases. One is to display the files on the WWW, the other to copy the files to your local machine (usually via ftp-servers). A program that appears like any other WWW site is SRS, the ingenious information retrieval system devised by Thure Etzold (Etzold, et al. 1996). SRS navigates through the troubled waters of most database formats facilitating the search with elaborated combinations of keywords (Rost WWW and Schneider 1996). Two sites maintaining copies (dubbed mirrors) of various databases are the NCBI and the EMBL-EBI (Shomer, et al. 1996).

Nucleotide databases. The two major databases for nucleotide sequences are maintained at the EMBL-EBI in England (EMBL Nucleotide db (Shomer, et al. 1996)) and at the NCBI in the USA (GenBank (Benson, et al. 1996)). Large-scale genome sequencing projects tend to maintain their own databases for specific species (e.g. TIGR (Fleischmann, et al. 1995)).

Protein databases. The major databases for protein sequences (Rost WWW and Schneider 1996) are SWISS-PROT (Basel, Switzerland) that has recently been expanded by the addition of TREMBL - the translation of the EMBL coding DNA to protein sequences (Bairoch and Apweiler 1996) - and PIR (George, et al. 1996). Protein sequence databases are complemented by databases holding sequence motifs or fingerprints, such as PROSITE (Bairoch, et al. 1996) or BLOCKS (Henikoff and Henikoff 1996), and by databases specific for, e.g., disease-related proteins, enzyme-classifications, proteins of immunological interest. Information about protein 3D structure is stored in PDB (Bernstein, et al. 1977) and PDB-derived databases (e.g. HSSP storing sequence alignments of all PDB proteins against all SWISS-PROT proteins (Schneider and Sander 1996), or FSSP storing structural alignments of PDB against PDB (Holm and Sander 1996); classification of protein structures by SCOP (Brenner, et al. 1996) and CATH (Orengo, et al. 1993); links in Rost WWW and Schneider 1996).

3.3 General services

Databases are not restricted to storage of nucleotide or protein sequences. Instead you can also access addresses and databases for obtaining cell lines and cultures, or for 2D gels. Furthermore, you can obtain various programs, or search through, e.g., metabolic pathways (Gaasterland and Selkov 1995; Karp, et al. 1996) (links in Rost WWW and Schneider 1996).

4 Searching for homologues

4.1 Getting alignments

4.1.1 Fast searches for similarities

Starting from the DNA. If you have a nucleotide sequence (not knowing the corresponding protein sequence) you would probably start with the NCSA Workbench, BioSCAN, the splice site and promoter predictions from CBS (Copenhagen, Denmark; (Brunak, et al. 1991; Engelbrecht, et al. 1992; Larsen, et al. 1995)), or the most comprehensive genome analysis program GRAIL that finds coding regions, splice sites, frameshift errors, CpG islands, promoters and repetitions (Uberbacher, et al. 1996) (links in Rost WWW and Schneider 1996). Alignments of nucleotide sequences can be obtained by, e.g., the standard tools BLAST (Basic Local Alignment Search Tool; (Altschul and Gish 1996)), or, more precisely, the nucleotide derivatives of that program (BLASTN, BLASTX, TBLASTN, TBLASTX; (Madden, et al. 1996)) or FASTA (Pearson 1996; Pearson and Lipman 1988) (links in Rost WWW and Schneider 1996).
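
Programs that compare a nucleotide sequence with protein sequences first translate the DNA in all six reading frames (three on each strand). A minimal sketch of that translation step, using the standard genetic code, is given below. It is not the code of any of the programs named above; all function names are hypothetical, and ambiguous bases are simply translated to 'X'.

```python
# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate consecutive codons; '*' marks stop, 'X' unknown codons."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(reverse[offset:])
    return frames
```

Long stretches without '*' in one of the six translations are candidate open reading frames; dedicated gene-finding programs use far more evidence than this.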

Initial search through protein databases. The most common tools for starting a database search are BLAST, FASTA and BLITZ. All are available via various WWW sites (EBI, BCM, NCSA Workbench; links in (Rost WWW and Schneider 1996)). If you have a long sequence or want to submit a large set of sequences to database servers, you may prefer to use email interfaces (Box 3). Some of the refined alignment programs will run fast searches automatically as a pre-filter (e.g. MAXHOM, links in Rost WWW and Schneider 1996).

Composition bias. Some of the search tools (e.g., BLAST (Altschul and Gish 1996)) automatically check for regions that are composition biased (e.g., Gly-Arg-Ala in DNA binding proteins) by running the SEG program (Wootton and Federhen 1996). To be sure, you may additionally submit your sequence to an analysis of composition bias (SAPS, links in Rost WWW and Schneider 1996).
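
The idea behind detecting composition-biased regions can be sketched with a sliding window and the Shannon entropy of the residue composition: windows dominated by a few residue types have low entropy. This is a simplified illustration and not the SEG algorithm itself; the window length and entropy threshold are illustrative defaults (assumptions), and the function names are hypothetical.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def low_complexity_positions(sequence, window=12, max_entropy=2.2):
    """Positions covered by at least one window with low entropy."""
    flagged = set()
    for start in range(len(sequence) - window + 1):
        if window_entropy(sequence[start:start + window]) <= max_entropy:
            flagged.update(range(start, start + window))
    return sorted(flagged)


def mask_low_complexity(sequence, window=12, max_entropy=2.2):
    """Replace residues in low-complexity regions by 'X'."""
    flagged = set(low_complexity_positions(sequence, window, max_entropy))
    return "".join("X" if i in flagged else aa
                   for i, aa in enumerate(sequence))
```

Masking such regions before a database search reduces the number of hits that are due to shared composition rather than to homology.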

Selecting the putative homologues from the hit list. Given a list of proteins aligned (hit list) to U, which ones are likely to be homologues? Where to set the threshold in terms of the score reported by the alignment program (e.g. the BLAST score)? Few programs help you with that decision (e.g. MAXHOM, links in Rost WWW and Schneider 1996; or the consistent alignment parser CAP in BLAST (Altschul and Gish 1996)). The best strategy may be to select a large number of proteins from the hit list (>100) and repeat a more accurate full dynamic programming alignment of U against this list (rather than against the entire database).

4.1.2 Finding sequence motifs

Tools that search for motifs, blocks or patterns complement alignment programs. Examples on the WWW are PROSITE (Bairoch, et al. 1996), PFSCAN (links in Rost WWW and Schneider 1996), or the tools associated with databases of block alignments (BLOCKS, PRINTS, PRODOM; links in Rost WWW and Schneider 1996). Motifs can be used to detect more distantly related homologues (<30% sequence identity) and to refine the alignment (e.g. by defining a family specific profile that can be given as input to, e.g., CLUSTALW).
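
Pattern searches of the PROSITE type can be sketched by converting the pattern syntax into a regular expression. The converter below covers only the basic elements (single residues, 'x' for any residue, '[...]' for allowed and '{...}' for forbidden residues, repeat counts, and the anchors '<' and '>'). The example pattern, a C2H2 zinc finger style pattern, is quoted from memory and should be treated as an illustration rather than as the current database entry; the function name is hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split 'core(n)' or 'core(n,m)' into the core and its repeat counts.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        repeat = ""
        if low is not None:
            repeat = "{%s,%s}" % (low, high) if high else "{%s}" % low
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Illustrative zinc finger style pattern (assumption, see text above).
ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
```

A sequence is then scanned with `re.search(prosite_to_regex(ZINC_FINGER), sequence)`. Short patterns match by chance rather often, so a hit is a hint and not proof of function.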

4.1.3 Refining the alignment

There are various ways to automatically refine alignments. One is simply running a more accurate dynamic programming algorithm (BIOACCELERATOR or SSEARCH, or MAXHOM). Another is to build a profile (4.2.3) and thus to generate a more family-specific alignment (e.g. by ClustalW (Higgins, et al. 1996)). Hidden Markov models (SAM, (Eddy 1995; Krogh, et al. 1994)) are an approach that works better the better the alignment you start with. An expert tool that enables you to investigate the final alignment in even more detail is TopAlign (links in Rost WWW and Schneider 1996).

4.2 Principles of alignment methods

4.2.1 Principle concept of alignments

Evolution distinguishes signal from noise. At the level of protein molecules, selective pressure results from the need to maintain function, which in turn requires maintenance of the specific 3D structure (Doolittle 1986; Doolittle 1994; Pastore and Lesk 1990). This is the basis for attempts to align protein sequences, i.e., to optimally detect equivalent positions in strings of amino-acid letters. Aligning protein sequences may appear to be purely a problem of matching letters. However, sequence alignments unravel information about structural and functional relations between residues in different proteins. Obviously it is not trivial to map the complexity of factors determining protein structure and function onto 1D relations between letters.

Finding the best match for two strings. The goal is to find the best match between two strings of letters (amino acids or nucleotides). The solution of this problem is in principle trivial. The main constraint is the computer time required when more than 150,000 pair alignments have to be explored (run of U against TREMBL and SWISS-PROT). This led to the development of a variety of fast alignment programs (4.2.2). Once the time-demanding task of scanning the entire database is accomplished, more refined dynamic programming-based (or hidden Markov model-based) alignment programs can be applied to refine the alignment for the hits found (4.2.4). For more sensitive searches, biological knowledge has to be included by basing the alignment on profiles for residue exchange probabilities (4.2.3).

Multiple alignments improve as databases grow. For high levels of pairwise sequence identity (say above 40%), alignment procedures are (more or less) straightforward. For less similar protein sequences, however, alignments may fail (Henikoff and Henikoff 1993; Vingron and Waterman 1994). The art of sequence alignment is to accurately align related sequence segments and to avoid aligning unrelated sequence stretches (Deperieux and Feytmans 1992; Eddy 1995; Henikoff and Henikoff 1994; Krogh, et al. 1994; Lawrence, et al. 1993; Livingstone and Barton 1993; Russell and Barton 1992; Sander and Schneider 1991; Thompson, et al. 1994). Alignment techniques can be improved by incorporating information derived from 3D structures (Henikoff and Henikoff 1993). Profile-based multiple alignments appear to be sensitive and fast enough to scan entire databases if implemented on parallel machines (Schneider 1994).

Drawback: lack of sufficiently tested cut-off criteria. One of the difficulties in comparing different alignment procedures is the lack of well-defined criteria for measuring the quality of an alignment. Very few papers have attempted to define such measures for the comparison of various methods (Henikoff and Henikoff 1993). The second problem is that most methods do not supply a cut-off criterion for distinguishing between homologous and non-homologous sequences (i.e., false positives). For some large sequence families remote homologues can be aligned correctly, but for most cases sequences with less than 25% sequence identity will be false positives, i.e., will have no structural or functional similarity to the guide sequence. A simple length-dependent cut-off based on sequence identity is provided by the program MAXHOM (Sander and Schneider 1991). However, this does not quantify the influence of (more subtle) similarities and of the occurrence of gaps.
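
A length-dependent identity cut-off of the kind mentioned above can be sketched as a simple function of alignment length. The functional form and constants below are quoted from memory of the curve usually attributed to Sander and Schneider (1991) and must be checked against the original publication before use; they are an assumption here, as are the function names and the optional safety margin.

```python
def identity_threshold(alignment_length):
    """Approximate minimal % identity implying structural homology.

    Assumed curve: 290.15 * L**-0.562 for short alignments and a
    constant 24.8% for L > 80 (constants to be verified, see text).
    """
    if alignment_length < 1:
        raise ValueError("alignment length must be positive")
    if alignment_length > 80:
        return 24.8
    # Very short alignments would need more than 100% identity: cap it.
    return min(100.0, 290.15 * alignment_length ** -0.562)


def is_putative_homologue(percent_identity, alignment_length, margin=0.0):
    """True if the hit lies above the threshold plus a safety margin."""
    return percent_identity >= identity_threshold(alignment_length) + margin
```

The shorter the aligned region, the higher the identity required; as noted above, such a cut-off ignores gaps and more subtle similarities.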

4.2.2 Fast, non-optimal alignments

The two most widely used fast alignment tools are FASTA (Pearson 1996; Pearson and Lipman 1988) and BLAST (Altschul and Gish 1996; Altschul, et al. 1990). The principal assumption is that most similarities between two sequences can be detected by local (Smith, et al. 1981) rather than global (Needleman and Wunsch 1970; Sellers 1974) alignments. FASTA and BLAST are based on slightly different concepts. FASTA first searches for 'words' of residues with a minimal length of two that are identical between the two sequences; then the aligned regions are widened based on profiles (4.2.3). BLAST first lists words of residues in one sequence, typically of length four, with high scoring information content; then the database is scanned for words identical to the words in the list of highly informative words; finally the words are expanded to segments (profile-based). An additional detail is that statistics are based on regions without composition bias (Pearson 1996; Wootton and Federhen 1996).
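
The common first step of such fast methods, finding identical words shared by two sequences, can be sketched with a word index. The code below illustrates only this seeding idea; it is not the FASTA or BLAST implementation, it omits the scoring of words and the extension step, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def word_index(sequence, k):
    """Map every word of length k to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def shared_words(query, subject, k=2):
    """Return (query_pos, subject_pos) for all identical words."""
    index = word_index(subject, k)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], ())]


def best_diagonal(hits):
    """Diagonal (subject_pos - query_pos) carrying the most word hits."""
    if not hits:
        return None
    return Counter(j - i for i, j in hits).most_common(1)[0]
```

Word hits that accumulate on the same diagonal indicate an ungapped region of similarity, which is then the starting point for the slower extension and scoring.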

4.2.3 Knowledge-based exchange matrices (profiles)

Some amino acid exchanges (e.g. L -> I) are more or less neutral in terms of maintaining structure and/or function. This suggests basing the alignment of two sequences on exchange matrices (or profiles) that capture physico-chemical properties either directly (Feng, et al. 1985; McLachlan 1972) or indirectly by database extraction (Dayhoff 1978). The latter was initially proposed by Dayhoff and her co-workers who measured evolutionary distance by the PAM (Percentage of Acceptable point Mutations per 10^6 years) matrix. For example, PAM 256 corresponds to an exchange of 80% of all amino acids (with different probabilities for different amino acids: S has a high mutability, W a low mutability). Recently other exchange matrices have been proposed (Bowie, et al. 1991; Gonnet, et al. 1992; Gribskov, et al. 1990; Henikoff and Henikoff 1994; Overington, et al. 1990; Risler, et al. 1988; Taylor 1986; Thompson, et al. 1994). Which exchange matrix to use? Jorja and Steven Henikoff have systematically compared performance for various exchange matrices (Henikoff and Henikoff 1993). The results favoured on average the BLOSUM62 matrix (derived from blocks of aligned sequences clustered at 62% sequence identity; Henikoff and Henikoff 1992). However, none of the matrices investigated performed worse than BLOSUM62 on all test examples (Henikoff and Henikoff 1993). Thus, given a particular search protein U, a priori there is no 'best' exchange matrix. A useful way around this problem is to re-align based on different exchange matrices.
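
The principle of extracting an exchange matrix from a database can be sketched as a log-odds calculation: the frequency with which two residues are observed aligned is compared with the frequency expected from their background frequencies alone. The code is a toy illustration of this principle under simplifying assumptions (no sequence weighting or clustering, no pseudocounts, scores left unrounded); it does not reproduce how any published matrix was built, and the function name is hypothetical.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Log-odds scores (bits) from a list of aligned residue pairs."""
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    n_pairs = sum(pair_counts.values())
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    n_residues = 2 * n_pairs
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / n_pairs
        p_a = residue_counts[a] / n_residues
        p_b = residue_counts[b] / n_residues
        # Unordered pairs: a mixed pair can arise in two ways.
        expected = p_a * p_b if a == b else 2 * p_a * p_b
        scores[(a, b)] = math.log2(observed / expected)
    return scores
```

Pairs observed more often than expected by chance receive positive scores and rarely observed pairs negative ones, so that conservative exchanges such as L -> I are penalised less than, e.g., D -> L.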

4.2.4 Slow, optimal alignments (dynamic programming)

What is the optimal alignment? Alignments are intended to unravel evolutionary pathways and/or structural homology between two proteins. These two objectives (functional/structural) may be mutually contradictory, i.e., the 'optimal' alignment may differ according to the objective. Yet another perspective is the 'mathematical' optimal alignment. This is the alignment that optimises a given objective function, e.g., to find the alignment with the highest number of pairwise identical residues. FASTA and BLAST are not guaranteed to find such a mathematically optimal alignment. In contrast, dynamic programming methods assure the optimal global (Needleman and Wunsch 1970; Sankoff and Kruskal 1983) or local (Smith, et al. 1981) alignment by implicitly scoring all possible alignments and choosing the best. An important problem is the treatment of gaps, i.e., residues inserted (or deleted) to optimise the objective function. Usually, gap penalties (cost of inserting and extending gaps) are chosen to be length dependent (Sellers 1974). Typically, the cost of extending a gap (gap elongation) is 5-10 times lower than the cost of introducing a gap (gap open). The optimal choice of gap penalties depends on the particular method and, in detail, on the particular sequence family (Altschul and Gish 1996; Barton and Sternberg 1987; Lesk, et al. 1986; Smith and Smith 1992; Taylor 1996; Thompson, et al. 1994; Vingron and Waterman 1994).
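
A minimal sketch of a local dynamic programming alignment with separate costs for opening and extending a gap is given below. It returns only the optimal score, not the alignment itself, and uses a simple match/mismatch score instead of an exchange matrix. The default parameters are arbitrary illustrations (the extension cost is five times lower than the opening cost), and the function name is hypothetical.

```python
def local_alignment_score(a, b, match=2, mismatch=-1,
                          gap_open=5, gap_extend=1):
    """Optimal local alignment score with affine gap costs.

    gap_open is the cost of the first gap position, gap_extend the
    cost of every further position in the same gap.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # best[i][j]: best local alignment ending at a[i-1], b[j-1]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    # gap_b / gap_a: best score ending with a gap in b / in a
    gap_b = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    gap_a = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    optimum = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gap_a[i][j] = max(gap_a[i][j - 1] - gap_extend,
                              best[i][j - 1] - gap_open)
            gap_b[i][j] = max(gap_b[i - 1][j] - gap_extend,
                              best[i - 1][j] - gap_open)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            best[i][j] = max(0, best[i - 1][j - 1] + pair,
                             gap_a[i][j], gap_b[i][j])
            optimum = max(optimum, best[i][j])
    return optimum
```

The three tables guarantee that every combination of matches and gaps is considered without enumerating the alignments explicitly; time and memory grow with the product of the two sequence lengths.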

4.2.5 Multiple alignments

Merging pairwise alignments into a multiple alignment. The concept of dynamic programming cannot be extended to align more than three sequences optimally (Murata 1990). A way around this problem is to first find optimal pairwise alignments and to then merge the pairs (Barton and Sternberg 1987; Boswell and McLachlan 1984; Feng and Doolittle 1987; Higgins and Sharp 1988; Taylor 1987; Vingron and Argos 1989). This procedure is illustrated by the program MAXHOM implemented for the PredictProtein prediction service (Rost 1996; Rost, et al. 1994a) or the generation of the HSSP database (Sander and Schneider 1991; Schneider and Sander 1996). (1) A fast algorithm (BLAST) is used to scan the database for possible homologues. (2) The list of putative homologues is filtered by a length-dependent cut-off threshold for structural homology (Sander and Schneider 1991). (3) All sequences that fall above the threshold are aligned consecutively to the guide sequence (U) by a standard dynamic programming algorithm (Smith, et al. 1981). (4) After each sequence has been added to the alignment an alignment profile is compiled and used to align the next sequence. (5) After all the sequences have been aligned the profile is recompiled and the dynamic programming algorithm starts once again to align consecutively the sequences, this time using the conservation profile as derived after completion of the first sweep. A slightly different concept is to additionally sort the aligned sequences according to evolutionary trees (Barton and Sternberg 1987; Feng and Doolittle 1987; Feng and Doolittle 1996; Higgins and Sharp 1988; Higgins, et al. 1996; Taylor 1996; Thompson, et al. 1994).
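
Step (4) of the procedure above, compiling a profile from the sequences aligned so far, can be sketched as a table of residue frequencies per alignment column. This is a simplified illustration, not the MAXHOM implementation: sequences are not weighted, gaps are ignored, and all names are hypothetical.

```python
from collections import Counter


def build_profile(aligned_sequences, gap="-"):
    """Per-column residue frequencies of a multiple alignment."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in range(length):
        residues = [seq[column] for seq in aligned_sequences
                    if seq[column] != gap]
        counts = Counter(residues)
        profile.append({aa: n / len(residues) for aa, n in counts.items()})
    return profile


def profile_score(profile, position, residue):
    """Frequency of a residue at one alignment position (0 if unseen)."""
    return profile[position].get(residue, 0.0)


def consensus(profile):
    """Most frequent residue per column ('-' for all-gap columns)."""
    return "".join(max(col, key=col.get) if col else "-" for col in profile)
```

In the iterative scheme, the next sequence is aligned against such a profile instead of against the guide sequence alone, so that positions conserved in the family weigh more heavily than variable ones.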

Hidden Markov models and genetic algorithms. Recently, a method different from dynamic programming has been gaining ground in the field of generating high quality multiple alignments: hidden Markov models (Baldi, et al. 1994; Bucher and Hofmann 1996; Eddy 1995; Krogh, et al. 1994). The principal idea is to deduce a family specific model reflecting evolutionary processes and to align new sequences according to the model derived. The idea is the same as for traditional family specific profiles used with dynamic programming; the details of the mathematics involved are, however, different (Bucher and Hofmann 1996; Gribskov and Veretnik 1996). Hidden Markov-based alignments appear to be particularly sensitive in detecting less obvious homologues (Bucher and Hofmann 1996; Hubbard and Park 1995). Another interesting method is the application of genetic algorithms to the multiple alignment problem; the particular advantage is that any objective function can be optimised by the genetic algorithm (Notredame and Higgins 1996). A large-scale analysis of the quality of different methods has yet to be accomplished.

5 Predicting protein structure

5.1 Getting predictions for protein structure

5.1.1 Prediction of 3D structure

If your database search with U picked a homologue that has known 3D structure (i.e. is deposited in PDB), you can predict 3D structure for U by homology modelling. The problem is that homology modelling is still not trivial (Moult, et al. 1995). There is currently only one WWW server for homology modelling (SWISS-MODEL, links in Rost WWW and Schneider 1996). If you venture along this path, be careful not to over-interpret results (Table 2)!

5.1.2 Prediction of 1D structure

Prediction of 1D aspects of 3D structure (e.g. secondary structure, solvent accessibility, transmembrane helices, coiled-coils) is a much simpler task than homology modelling. This is evident from the large number of services that have mushroomed since the first prediction service PredictProtein went on line in 1992 (Tim Hubbard and Jong Park at MRC, Cambridge, England collected some services; links in Rost WWW and Schneider 1996). Some of the services are also available via email (Box 3). Unfortunately, not all services are sufficiently tested. In general, prediction accuracy is significantly superior if predictions are based on multiple alignments (Barton 1995; Di Francesco, et al. 1996; Rost and Sander 1996). Another general feature that can be predicted via the WWW is the location of O-glycosylation sites (Hansen, et al. 1995).

5.2 Principles of structure prediction

5.2.1 Homology modelling

Basic concept. An analysis of sequence alignments for proteins of known structure reveals that all protein pairs with more than 30% pairwise sequence identity (for alignment length > 80; Sander and Schneider 1991) have homologous 3D structures, i.e., the essential fold of the two proteins is identical, although details such as additional loop regions may vary. This is the pillar for the success of homology modelling. The principal idea is to model the structure of U (protein of unknown structure) based on the template of a sequence homologue of known structure (T). The accuracy of homology modelling depends on the level of similarity between U and T (Fig. 3).

High level of sequence identity: atomic resolution. The basic assumption of homology modelling is that U and T have identical backbones. The task is to correctly place the side chains of U into the backbone of T. For very high levels of sequence identity between U and T (ideally differing by one residue only), side chains can be 'grown' during molecular dynamics simulations (Cornell, et al. 1991). For slightly lower levels (still of high sequence similarity), side chains are built based on similar environments in known structures (De Filippis, et al. 1994; Eisenmenger, et al. 1993; Levitt 1992; May and Blundell 1994; Sali and Blundell 1994; Summers and Karplus 1990; Vriend and Sander 1993). For levels of above 70-90% sequence identity, resulting models are quite accurate (De Filippis, et al. 1994). The limiting factor is the computation time required (Holm, et al. 1994).

Low level of sequence identity: ribbon plots of secondary structure. With decreasing sequence identity the number of loops inserted grows and the divergence between U and T becomes considerable. An accurate modelling of loop regions, however, implies solving the structure prediction problem. The problem is simplified in two ways. (1) Loop regions are often relatively short and can thus be simulated by molecular dynamics. (2) The ends of the loop regions are fixed by the backbone of the template structure. Various methods are employed to model loop regions. The best have the orientation of the loop regions correct in some cases (Abagyan and Totrov 1994; Cardozo, et al. 1995). For lower levels of pairwise sequence identity the accuracy of the sequence alignment used as basis for homology modelling becomes an additional problem. The information about 3D structure accurately captured by the resulting models is typically at the level of ribbon plots (i.e. the mutual orientation of elements such as helices and sheets can be identified; Fig. 1).

5.2.2 Remote homology modelling (threading)

Basic concept. Successful remote homology modelling (<25% pairwise sequence identity between unknown structure U and template T) has to master three obstacles. (1) The remote homologue (T) has to be detected. (2) U and T have to be aligned correctly. (3) The homology modelling procedure has to be tailored to the harder problem of extremely low sequence identity. Most threading methods developed so far have primarily addressed the detection of similar folds (step 1). The basic idea is to thread the sequence of U into the backbone of T and to evaluate the fitness of sequence for structure by environment-based or knowledge-based potentials (Bryant and Altschul 1995; Sippl 1995). Most methods are based on so-called pseudo-potentials and differ in the way such potentials are derived from PDB (Wodak and Rooman 1993). One alternative is to use 1D predictions for the threading procedure (Fischer and Eisenberg 1996; Rost 1995a; Rost 1995b; Russell, et al. 1996).

Remote homologues can often be identified. The optimism generated by one of the first papers on threading published in the 90s (Bowie, et al. 1991) has boosted attempts to develop threading methods (Sippl 1995). The good news after half a decade of intensive research by dozens of groups is that all potentials capture different aspects, and it is likely that the correct remote homologue is found by at least one of them (Lemer, et al. 1995; Shortle 1995). The bad news is that no single method is accurate enough to correctly identify the remote homologue in most cases (Lemer, et al. 1995). Instead, evaluated on a larger test set, the correct remote homologue appears to be detected in less than 40% of all cases (Fischer and Eisenberg 1996; Lemer, et al. 1995; Rost 1995b; Rost, et al. 1996c; Russell, et al. 1996). At the given level of sequence identity (<25%) this accuracy is still clearly superior to traditional sequence alignments (Fischer and Eisenberg 1996; Rost 1995b; Rost, et al. 1996c).

Correct 3D predictions from threading require expertise. Detecting the homologue is only the first of the three obstacles. The second (correct alignment between U and T) is much harder. This is fatal for the third step, the model-building procedure. So far, correct models from threading methods have been published for very few cases (Flöckner, et al. 1995; Lemer, et al. 1995). Successful application of threading tools will often require rather sceptical expert users who can spot wrong hits and false alignments (Hubbard, et al. 1996). Threading techniques may become one of the most successful ways to structure prediction, but theoreticians will have to clear many rocks out of the way.

5.2.3 Predicting secondary structure

Basic concept. The principal idea underlying most secondary structure prediction methods is the fact that segments of consecutive residues have preferences for certain secondary structure states (Kabsch and Sander 1984). Thus, the prediction problem becomes a pattern-classification problem. The goal is to predict whether the residue at the centre of a segment of typically 13-21 adjacent residues is in a helix, strand or in non-regular secondary structure. Many different algorithms have been applied to tackle this simplest version of the protein structure prediction problem (for overviews: (Barton 1995; Garnier, et al. 1996; Rost and Sander 1993; Rost and Sander 1996; Rost, et al. 1993)). When basing predictions on single sequences, prediction accuracy is limited to about 60% (percentage of residues correctly predicted in either helix, strand or other).
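The window-based formulation can be sketched as follows: every residue is represented by the segment of adjacent residues centred on it, and a classifier (not shown) would then assign helix, strand or other to the central residue. The function name, the default width of 17 and the padding symbol used at the chain termini are illustrative assumptions, not details of any particular published method.

```python
def residue_windows(sequence, width=17, pad="X"):
    """Yield (position, window) for every residue of `sequence`.

    Each window has `width` residues and is centred on the residue at
    `position`; positions beyond the chain ends are filled with `pad`.
    """
    if width % 2 == 0:
        raise ValueError("window width must be odd so that it has a centre")
    half = width // 2
    padded = pad * half + sequence + pad * half
    for i in range(len(sequence)):
        yield i, padded[i:i + width]


# Hypothetical ten-residue sequence, short window for readability
windows = list(residue_windows("MKTAYIAKQR", width=5))
for position, window in windows[:3]:
    print(position, window)
```

The text quotes typical widths of 13-21 residues; the width is left as a parameter so that different values can be tried.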

Multiple alignment-based predictions achieved breakthrough. The first method that reached a sustained level of a three-state prediction accuracy above 70% was the profile-based neural network program PHD which uses multiple sequence alignments as input (Rost and Sander 1993). By stepwise incorporation of more evolutionary information, prediction accuracy can be pushed above 72% (Rost and Sander 1994a). A nearest-neighbour algorithm can be used to incorporate the same information with a similar performance (Salamov and Solovyev 1995). In comparison to methods using single sequence information only, methods making use of the growing databases are 6-14 percentage points more accurate. Of practical importance is that some prediction methods provide a reliability index for the assignment of each residue. Predictions not using alignment information reach an accuracy of 80% for the 10% best predicted residues (Rost and Sander 1994a); alignment-based predictions (PHD) reach the same level of accuracy for 70% of the best predicted residues, i.e., they cover seven times as many residues at that accuracy (Rost 1996).
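The statement "accuracy of x% for the y% best predicted residues" can be made concrete with a small sketch: residues are ranked by their reliability index, and accuracy is computed over the top fraction only. The function name and the toy data are hypothetical; the reliability scale (0-9 here) is an assumption and differs between methods.

```python
def accuracy_at_coverage(observed, predicted, reliability, coverage):
    """Fraction of correct predictions among the most reliable residues.

    `coverage` is the fraction (0 < coverage <= 1) of residues kept after
    ranking all residues by decreasing reliability index.
    """
    if not (len(observed) == len(predicted) == len(reliability)):
        raise ValueError("inputs must have equal length")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    order = sorted(range(len(observed)), key=lambda i: reliability[i], reverse=True)
    n_keep = max(1, round(coverage * len(order)))
    kept = order[:n_keep]
    correct = sum(observed[i] == predicted[i] for i in kept)
    return correct / n_keep


# Toy example: H = helix, E = strand, L = other
observed = "HHHHEEEELL"
predicted = "HHHEEEELLL"
reliability = [9, 8, 7, 2, 6, 7, 8, 1, 5, 9]

print(accuracy_at_coverage(observed, predicted, reliability, 1.0))
print(accuracy_at_coverage(observed, predicted, reliability, 0.5))
```

In this toy case the two wrong residues carry the lowest reliability values, so restricting the evaluation to the more reliable half removes them; real indices are, of course, less perfectly correlated with correctness.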

Improved secondary structure predictions of practical use. How good is a prediction accuracy of 72%? It is certainly reasonably good compared with the prediction of secondary structure by homology modelling (Colloc'h, et al. 1993; Rost, et al. 1994b; Russell and Barton 1993). Various applications of improved secondary structure predictions prove that predictions are accurate enough to be of practical use (prediction-based threading (Fischer and Eisenberg 1996; Hubbard and Park 1995; Rost 1995a; Rost 1995b; Russell, et al. 1996); inter-strand contact prediction (Hubbard 1994); chain tracing in X-ray crystallography; design of residue mutations). Secondary structure predictions can be used to predict the coarse-grained structural class of a protein (all-α, all-β, α/β, or other; Levitt and Chothia 1976). The accuracy of correctly assigning one of the four classes based on predictions is above 70% (Eisenhaber, et al. 1995; Rost and Sander 1994a). Thus, predictions are, surprisingly, on average about as accurate as circular dichroism (CD) spectroscopy (Rost 1996).

5.2.4 Predicting solvent accessibility

Basic concept. The goal is to predict the extent to which a residue embedded in a protein structure is accessible to solvent. Solvent accessibility can be described in several ways (Rost and Sander 1994b). The simplest is a two-state description distinguishing between residues that are buried (relative solvent accessibility < 16%) and exposed (relative solvent accessibility ≥ 16%). The classical method to predict accessibility is to assign either of the two states, buried or exposed, according to residue hydrophobicity (for an overview: Rost and Sander 1994b), which is less accurate than a simple neural network prediction (Holbrook, et al. 1990).
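The two-state description amounts to a single threshold on relative accessibility. A minimal sketch, in which the one-letter state codes ('b', 'e') and the function names are assumptions chosen for illustration:

```python
BURIED_THRESHOLD = 16.0  # percent relative solvent accessibility, as quoted in the text


def two_state(relative_accessibility):
    """Map a relative accessibility (in percent) to 'b' (buried) or 'e' (exposed)."""
    return "b" if relative_accessibility < BURIED_THRESHOLD else "e"


def two_state_string(relative_accessibilities):
    """Convert a list of per-residue relative accessibilities to a state string."""
    return "".join(two_state(value) for value in relative_accessibilities)


# Hypothetical per-residue values in percent
print(two_state_string([3, 40, 16, 0]))
```

A predicted and an observed state string obtained this way can then be compared residue by residue to give the two-state accuracy.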

Multiple alignment-based predictions mirror conservation. Solvent accessibility at each position of the protein structure is evolutionarily conserved within sequence families (Rost and Sander 1994b). This fact has been used to develop methods for predicting accessibility using multiple alignment information (Benner, et al. 1994; Rost and Sander 1994b; Wako and Blundell 1994). Prediction accuracy is about 75±7% and thus significantly higher than for methods using single-sequence information only (Rost and Sander 1994b).

5.2.5 Predicting transmembrane helices

Basic concept. Predicting the locations of transmembrane helices is a task comparable to secondary structure prediction. Elaborate combinations of expert-rules, hydrophobicity analyses and statistics yield high levels of accuracy (Jones, et al. 1992; Rost, et al. 1995; Sipos and von Heijne 1993; von Heijne 1992). Prediction of signal peptides is a separate task (see service from CBS, Box 5; Schneider and Wrede 1993).
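The simplest ingredient named above, a hydrophobicity analysis, can be sketched as a sliding-window average. The example uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a cut-off of 1.6; the scale, window and cut-off are not specified in this text and are assumptions taken from common practice, and the sketch is far cruder than the cited methods.

```python
# Kyte-Doolittle hydropathy values (assumed scale, not named in the text)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, width=19):
    """Return (centre, mean hydropathy) for every full window of `width`."""
    half = width // 2
    scores = []
    for centre in range(half, len(sequence) - half):
        window = sequence[centre - half:centre + half + 1]
        # unknown residue codes contribute 0.0
        total = sum(KYTE_DOOLITTLE.get(aa, 0.0) for aa in window)
        scores.append((centre, total / width))
    return scores


def candidate_segments(sequence, width=19, cutoff=1.6):
    """Merge consecutive window centres scoring >= cutoff into (first, last) ranges."""
    segments = []
    start = prev = None
    for centre, score in hydropathy_profile(sequence, width):
        if score >= cutoff:
            if start is None:
                start = centre
            prev = centre
        elif start is not None:
            segments.append((start, prev))
            start = None
    if start is not None:
        segments.append((start, prev))
    return segments


# Toy sequence: a 22-residue poly-leucine stretch between hydrophilic flanks
toy = "DEKRSNQ" * 3 + "L" * 22 + "DEKRSNQ" * 3
segments = candidate_segments(toy)
print(segments)
```

The ranges returned refer to window centres, not to helix ends; turning such candidates into helix boundaries is exactly where the expert rules and statistics of the cited methods come in.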

Multiple alignment-based predictions rather accurate. For two methods the use of multiple alignment information is reported to clearly improve the accuracy of predicting transmembrane helices (Persson and Argos 1994; Rost, et al. 1995). The best current prediction methods have a similarly high accuracy of around 95%.

Prediction of topology at 85% accuracy. Intra- and extra-cytoplasmic regions have different amino acid compositions (Nakashima and Nishikawa 1992; von Heijne 1992). This difference allows the prediction of the orientation of transmembrane helices with respect to the membrane (N-term pointing inside or outside). Alignment-based predictions of topology have an expected accuracy of above 85%, i.e., for 85% of all proteins all transmembrane helices and the topology are correctly predicted (Rost, et al. 1996a; Rost, et al. 1996b); single sequence-based predictions of topology are about ten percentage points less accurate (Jones, et al. 1992).

Scanning entire genomes for helical transmembrane proteins. Predictions of transmembrane helix locations and topology constitute an effective tool to analyse entire genomes (e.g. scanning the entire Haemophilus influenzae genome of > 1600 proteins for transmembrane helices takes a couple of hours on a workstation). An important result for such enterprises is the accuracy of distinguishing proteins with and without transmembrane helices. In fact, this distinction can be accomplished at a very high level of accuracy: the rate of false positives is below 2% (proteins with no observed transmembrane helices predicted to contain membrane helices); and the rate of false negatives is below 3% (proteins with transmembrane helices that were not predicted) (Rost, et al. 1996b).

6 Avoiding common traps of sequence analysis

Ease of use bears an ease of misuse. All the tools we summarised are easy to use. You do not have to become an expert in sequence analysis to run several of the methods. However, the ease of providing and accessing prediction has inherent problems. (1) Inaccurate methods (or insufficiently validated ones) are made available bypassing selection systems such as referees. (2) Users may misinterpret results due to a lack of insight into the features of prediction methods. The possible pitfalls in analysing the results are numerous, including picking a lousy server, mis- or over-interpreting the results. Recently, we summarised some of the common traps of sequence analysis (Rost and Valencia 1996). Here, we shall briefly sketch a few possible misinterpretations of results ( Table 2 ; other hints in: (Bork and Gibson 1996)).

Table 2: Rules of thumb for sequence analysis

Tool: pairwise alignment: inferring structural homology
Rule: pairwise sequence identity > 25%
Restrictions: 25% over more than 80 aligned residues (for shorter regions identity must be higher (Sander and Schneider 1991))

short motifs (< 10 residues) not sufficiently indicative for homology

25% level may not apply to engineered proteins

composition biased regions (e.g. GRA rich regions in DNA binding proteins) should be excluded from compiling level of sequence identity

many gaps: if an alignment between two proteins contains too many insertions (gaps) even a relatively high value of sequence identity may not suffice to ascertain homology (typical structure alignments contain up to 10% gaps)

Tool: pairwise alignment: inferring structural homology
Rule: pairwise sequence similarity > pairwise sequence identity (Table 1)
Restrictions: depends on similarity metric chosen, hence comparison between different methods problematic

Tool: pairwise alignment: inferring function
Rule: level of similarity required for identifying functionally equivalent proteins in two species depends on the overall divergence of the species and on the particular protein family
Restrictions: functional annotations for the homologue used to infer function may be incomplete or wrong. Thus, annotations for the putative homologue ought to be verified in the original sources of functional assignments (a more reliable database is SWISS-PROT (Bairoch and Apweiler 1996)).

errors in sequences, such as frame-shifts or sequencing errors (very frequent in ESTs), could lead to falsely inferred function

Tool: pairwise alignment: inferring function
Rule: alignment used to infer function should contain the functional residues, ideally the alignment should extend over the entire proteins
Restrictions: false annotations (s. a.)

errors in sequences (s. a.)

Tool: multiple alignment: sufficient information
Rule: cover entire range of evolutionary divergence, i.e., representatives with 90%, 80%, ... , 30% pairwise sequence identity
Restrictions: if A1 and A2 have 99% identical residues and both have a pairwise sequence identity of 40% to U, including A2 in a multiple alignment does not increase the amount of information contained in the multiple alignment

Tool: multiple alignment: aligning entire families
Rule: align entire folds, rather than short fragments, assure conservation of local motifs
Restrictions: if A1 is aligned to U in a region where A1 is known to have a functionally important sequence motif (Bairoch, et al. 1996) and this motif is not in U, this may indicate a false alignment

Tool: multiple alignment: aligning based on family specific profiles
Rule: alignments based on family specific profiles are more accurate and more sensitive (finding non-trivial homologues) than pairwise alignments
Restrictions: if the family profile used for the search contains errors, the final alignment may be less accurate than pairwise alignments

good profile-based alignments require expertise and alignments containing many sequences

Tool: 1D prediction
Rule: 70% correct implies 30% incorrect
Restrictions: 70% is an average over a distribution with a typical standard deviation of 10% (Rost 1996). Thus, for a particular protein U prediction accuracy can be less than 60% or more than 80% (Rost 1996).

expected values for accuracy hold for classes of proteins used to set up prediction method (e.g. no prediction of transmembrane helices with tools optimised for globular proteins (Rost, et al. 1995))

prediction accuracy for engineered proteins hard to estimate

Tool: 1D prediction based on alignments
Rule: the more accurate and informative the alignment, the more accurate the prediction
Restrictions: the quality of multiple alignments depends on the divergence of the sequences aligned and the completeness with which the family is covered

Tool: 3D prediction: homology modelling
Rule: accuracy depends on level of sequence identity
Restrictions: simulating ligand binding requires > 70-90% pairwise sequence identity

no accurate predictions for inserted loop regions

Tool: 3D prediction: remote homology modelling
Rule: most models proposed by threading methods are wrong!
Restrictions: do not use threading without a lot of caution, skill and intuition!

6.1 Pitfalls of sequence alignments

Inferring structural homology from pairwise alignments. The success of alignment programs is based on evolutionary connections between homologous proteins: if 24 out of 80 aligned residues (i.e. 30%; more for shorter matches; (Sander and Schneider 1991)) are identical between two naturally evolved proteins, the two have similar 3D structures and similar functions (Chothia and Lesk 1986; Feng and Doolittle 1987; Sander and Schneider 1991). This margin for sufficient sequence identity holds only for 'normal' alignments ( Table 2 ).
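The rules of thumb for pairwise alignments (Table 2) can be checked mechanically once an alignment is available. The sketch below computes the number of aligned residue pairs, the percentage sequence identity over those pairs and the percentage of gapped columns, and then applies fixed cut-offs. The function names are hypothetical; the defaults (25% identity, 80 aligned residues, 10% gaps) are taken from Table 2 as simple constants, whereas the published threshold is length-dependent (Sander and Schneider 1991) and is not reproduced here.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Return (aligned residue pairs, percent identity, percent gapped columns)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = identical = gapped = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # column carries no information
        if a == gap or b == gap:
            gapped += 1
        else:
            aligned += 1
            identical += a == b
    columns = aligned + gapped
    identity = 100.0 * identical / aligned if aligned else 0.0
    gap_percent = 100.0 * gapped / columns if columns else 0.0
    return aligned, identity, gap_percent


def passes_rule_of_thumb(aligned_a, aligned_b,
                         min_identity=25.0, min_length=80, max_gap_percent=10.0):
    """Apply the fixed cut-offs; short alignments are never accepted here
    because they would need a higher, length-dependent identity."""
    aligned, identity, gap_percent = alignment_stats(aligned_a, aligned_b)
    return (aligned >= min_length
            and identity > min_identity
            and gap_percent <= max_gap_percent)


# Hypothetical alignments
long_a = "A" * 50 + "C" * 50
long_b = "A" * 50 + "D" * 50
print(alignment_stats(long_a, long_b), passes_rule_of_thumb(long_a, long_b))
print(alignment_stats("ACDE-G", "ACDFHG"), passes_rule_of_thumb("ACDE-G", "ACDFHG"))
```

Such a check says nothing about composition-biased regions or engineered proteins, for which the restrictions listed in Table 2 still have to be considered by hand.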

Inferring functional homology from pairwise alignments. A typical mistake is to predict function by putative homology based on an over-interpreted level of sequence similarity. Functional and structural constraints are translated into sequence conservation in a way that depends on the particular protein structure and its evolution. The level of similarity required for identifying functionally equivalent proteins in two species depends on the overall divergence of the species and on the particular protein family. When predicting function based on similarity to proteins of known function (as annotated in databases), it is important to be aware of incomplete or wrong annotations and possible errors in sequences (very frequent in ESTs).

Quality and stability of pairwise alignments. Despite the central role that alignment programs play in sequence analysis, a thorough analysis of the quality of methods based on statistically significant numbers of proteins has yet to be accomplished. In general, alignments are more likely to be correct for higher levels of pairwise sequence identity; and are less likely to be correct in more variable regions. If the alignment between two proteins is not stable, i.e., if two 'good' alignment programs yield different alignments in detail for some regions, this may imply that these regions cannot be uniquely aligned. Such regions could be situated in non-regular secondary structure. In practice, such unreliably aligned regions should be ignored when inferring function or structure (Saqi and Sternberg 1991; Vingron and Argos 1990; Zuker 1991).

Profile alignments may intrude into the twilight zone. In the twilight zone (Feng and Doolittle 1987) of 20-30% pairwise sequence identity, sequence alignments become tricky. Only methods using profiles derived from the sequence family of your protein U may reliably intrude into that zone (Barton and Russell 1993; Gribskov and Veretnik 1996; Higgins, et al. 1996; Schneider 1994; Taylor 1996). The quality of such alignments depends crucially on the information contained in the alignment, i.e., the size (number of sequences) and divergence (levels of pairwise sequence identity) of the sequence family. In sparse regions (fewer sequences) alignments are generally less reliable. Penetrating the twilight zone therefore requires attention (hints on generating alignments: Bork and Gibson 1996; Koonin, et al. 1996; Poch and Delarue 1996).

6.2 Pitfalls of 1D structure prediction

Estimates for expected prediction accuracy refer to distributions. How can we estimate the accuracy of prediction methods? The principal problem is that we want to know how accurately a tool predicts aspects of structure for proteins of unknown structure. The way out of this dilemma is careful cross-validation (Rost and Sander 1993; Rost and Sander 1994c; Rost and Sander 1995; Rost and Sander 1996). Users should bear in mind various aspects of such an analysis. (1) Estimates for expected accuracy are averages, i.e., the estimate is that on average 72% of all residues are predicted in the correct state of secondary structure (Rost 1996). (2) Estimates for expected accuracy refer to distributions for which one standard deviation is in the order of ten percentage points (Rost 1996) (Table 2). (3) Estimates are valid for a certain type of proteins (Rost, et al. 1995; Rost and Valencia 1996), e.g., prediction accuracy is particularly difficult to estimate for engineered proteins. (4) The accuracy for alignment-based predictions depends on the quality of the multiple alignment used for the prediction (Table 2).
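Points (1) and (2) are easy to reproduce on any test set: compute the three-state accuracy for each protein separately and report both the mean and the standard deviation of the resulting distribution. The sketch below uses hypothetical observed and predicted strings (H = helix, E = strand, L = other); the function names are assumptions made for illustration.

```python
from statistics import mean, stdev


def q3(observed, predicted):
    """Percentage of residues predicted in the correct one of three states."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


def accuracy_distribution(pairs):
    """Return (mean, standard deviation, per-protein scores) for (observed, predicted) pairs."""
    scores = [q3(observed, predicted) for observed, predicted in pairs]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread, scores


# Three hypothetical test proteins
pairs = [
    ("HHHHLLEEEE", "HHHHLLEEEE"),
    ("HHHHLLEEEE", "HHHLLLEEEL"),
    ("LLEEEELLHH", "LLLLEELLHL"),
]
average, spread, scores = accuracy_distribution(pairs)
print(scores, round(average, 1), round(spread, 1))
```

Reporting the per-protein spread alongside the average makes it explicit that the accuracy for any single protein may lie well above or below the quoted mean.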

Combining different methods may be fatal. Today, several automatic services accomplish secondary structure predictions. Some users fall into the what-is-common-is-correct trap, i.e., they average over all prediction methods and consider identical regions as more reliable. Exceptionally, such a majority vote may be beneficial. More frequently, however, the result will be the worst-of-all prediction. Often, it is preferable to use reliability indices provided by some methods. Such indices answer the question: how reliably is the tryptophan at position 307 predicted in a surface loop? (Note: the correlation between such indices and prediction accuracy has been adequately tested only for a few methods.)

1D structure may or may not be sufficient to infer 3D structure. Say you obtain as a prediction for regular secondary structure: helix-strand-strand-helix-strand-strand (αββαββ). Assume you find a protein of known structure with the same motif (αββαββ). Can you conclude that the two proteins have the same fold? Yes and no: your guess may be correct, but there are various ways to realise the given motif by completely different structures. For example, the secondary structure motif 'αββαββ' is contained in at least 16 structurally unrelated proteins.

6.3 Pitfalls of 3D structure prediction

Accuracy of homology modelling at the level of ribbon plots. The main purpose of homology modelling is to translate a given alignment into more intuitive 3D images. Such images often look temptingly 'real'. It is crucial to bear in mind which regions were unreliable in the alignment or in the original structure. Homology modelling tends to yield sketches of structure rather than accurate co-ordinates. Is homology modelling of any use? Is ANY hypothesis about a structure better than NO hypothesis? Guessing details about protein function is such a difficult task that any null hypothesis may help to guide experiments. This may result in over-interpretations of the level of homology suggested by the sequence alignment and the level of accuracy of the resulting model. In practice, modelling often strikes a balance between users pushing for more interpretations and models pin-pointing the limitations of the methods.

Remote homology modelling (threading) is extremely tricky! One problem of homology modelling for lower levels of pairwise sequence identity is to get the alignment between your sequence U and the template structure T correct. But even if the alignment were correct, a principal limitation is that T and U just do NOT have identical 3D structures. This problem is particularly fatal for remote homology modelling (threading), i.e., prediction of 3D structure based on less than 25% pairwise sequence identity. However, current threading methods are even more limited: getting the alignment correct is the exception rather than the rule (Moult, et al. 1995). The basic message of this statement for you is NOT "don't use threading programs", but "use them with extreme caution and be aware that most resulting models are likely to be mostly wrong".

6.4 How to separate the chaff from the wheat?

The Asilomar prediction contest. A systematic testing of performance is a precondition for any prediction to become reliably useful. For example, the history of secondary structure prediction has partly been a hunt for highest accuracy scores, with over-optimistic claims by predictors seeding the scepticism of potential users. John Moult (CARB, Washington, DC) has initiated a large-scale experiment aiming at estimating the realistic state-of-the-art of current prediction methods (Moult, et al. 1995): predictions had to be submitted prior to publication of the experimental structure determination and were then evaluated at a meeting in Asilomar, CA (Dec. 1994). One message of that first experiment (the follow-up will be held in Dec. 1996) was: exaggerated claims are more damaging than genuine errors. Even a prediction method of limited accuracy can be useful if the user knows what to expect. However, the temptation to violate this concept is too high. On the one hand, stakes are publications and grants. On the other hand, detection of over-optimism, mild cheating, or even straight fraud is unlikely. Thus, users still have to acquire some skill in separating the chaff from the wheat.

The fewer numbers for accuracy, the better? On the contrary, users should always be more sceptical if the prediction service offered was not thoroughly analysed. Estimates for standard deviations of performance accuracy are crucial (typically about 10% for secondary structure prediction). Note that very small standard deviations (< 5% for secondary structure prediction) are more likely to indicate a flawed evaluation than a more accurate method.

Method appropriately evaluated? A sustained evaluation of prediction methods, in our view, needs to meet four requirements. (1) No significant pairwise sequence identity: the proteins used for setting up a method (training set) and those used for evaluating it should have a pairwise sequence identity of less than 25% (length-dependent cut-off (Sander and Schneider 1991)), otherwise homology modelling could be applied which would be much more accurate than ab initio predictions. (2) Sufficiently large data set: all available unique proteins should be used for testing (currently more than 400); evaluations based on too small numbers are not representative, and estimates based on subsets of proteins for which performance accuracy looks more favourable are completely wrong. (3) Avoid comparing apples with oranges: no matter which data sets are used for a particular evaluation, results should always be reported additionally on standard sets. (4) No optimisation with respect to test set: a seemingly trivial - and often violated - rule is that methods should never be optimised with respect to the data set chosen for final evaluation.

Test it yourself? If possible, we advise users to test prediction services based on a handful of examples, preferably composed of protein data (sequences or structures) which became available after the development of a given prediction method.

7 Conclusions

No prediction of 3D structure, but useful information. After four decades of research, theoretical biology still cannot predict protein structure from sequence. But, by using the knowledge deposited in exploding databases, predictions become increasingly accurate and useful. In particular cases, you can, e.g., find all proteins in yeast that are likely to perform a function similar to that of the human protein you are interested in (Casari, et al. 1996). Alignments and predictions of 1D structure can be extremely useful to design your next experiment (which residue to mutate? where is the binding site?).

The growing WWW jungle. Sequence analysis is a rapidly expanding field of research. We have tried to squeeze methods into a few pages which are detailed in some 1000 papers per year. Putting methods on the WWW is becoming increasingly popular as it is simpler both for the developer of the program and for the user (Henikoff 1993). Thus, prediction services are mushrooming, e.g., the number of services for the prediction of secondary structure has increased from 1 (1992) to more than a dozen (1996). A few years ago, the problem was only to find services. Now, you have to find and to make a good selection from a variety of alternatives at the same time. Consequently, you will have to dedicate quite some time to keep up with the flow of data. However, the stakes are high. The World Wide Web is a platform to learn (e.g., try to use one of the search engines (Box 1) to search for 'centrifugation'), to find researchers working on the same subject, to find some material needed for a manuscript (e.g. a picture of mycoplasma), and to facilitate your work (find the cell culture you need directly on the WWW, rather than going through the trouble of first locating the paper, then getting and reading it, then finding the address of the author and then waiting for snail mail to be delivered). The initial time and energy barrier you have to overcome to take off into the WWW will certainly be compensated by the time you gain in making adequate use of the web!


Abagyan R and Totrov M (1994) Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins. J Mol Biol 235:983-1002

Altschul SF and Gish W (1996) Local alignment statistics. Meth Enzymol 266:460-480

Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403-410

Bairoch A and Apweiler R (1996) The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucl Acids Res 24:21-25

Bairoch A, Bucher P and Hofmann K (1996) The PROSITE database, its status in 1995. Nucl Acids Res 24:189-196

Baldi P, Chauvin Y, Hunkapiller T and McClure MA (1994) Hidden Markov models of biological primary sequence information. Proc Natl Acad Sci USA 91:1059-1063

Barton GJ (1995) Protein secondary structure prediction. Curr Opin Str Biol 5:372-376

Barton GJ and Russell RB (1993) Protein structure prediction. Nature 361:505-506

Barton GJ and Sternberg MJE (1987) A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J Mol Biol 198:327-337

Benner SA, Badcoe I, Cohen MA and Gerloff DL (1994) Bona Fide Prediction of Aspects of Protein Conformation. J Mol Biol 235:926-958

Benson DA, Boguski M, Lipman DJ and Ostell J (1996) GenBank. Nucl Acids Res 24:1-5

Bernstein FC, Koetzle TF, Williams GJB, Meyer EF, Brice MD, Rodgers JR, Kennard O, Shimanouchi T and Tasumi M (1977) The Protein Data Bank: a computer based archival file for macromolecular structures. J Mol Biol 112:535-542

Bork P and Gibson TJ (1996) Applying motif and profile searches. Meth Enzymol 266:162-184

Boswell DR and McLachlan AD (1984) Sequence comparison by exponentially-damped alignment. Nucl Acids Res 12:457-464

Bowie JU, Lüthy R and Eisenberg D (1991) A Method to Identify Protein Sequences That Fold into a Known Three-Dimensional Structure. Science 253:164-169

Brenner SE, Chothia C, Hubbard TJP and Murzin AG (1996) Understanding protein structure: using Scop for fold interpretation. Meth Enzymol 266:635-643

Brown D (1995) Washington Post, October 1:A1

Brunak S, Engelbrecht J and Knudsen S (1991) Prediction of human mRNA donor and acceptor sites from the DNA sequence. J Mol Biol 220

Bryant SH and Altschul SF (1995) Statistics of sequence-structure threading. Curr Opin Str Biol 5:236-244

Bucher P and Hofmann K (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In: States D, et al. (eds) Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, St. Louis, M.O., U.S.A., pp 44-51

Cardozo T, Totrov M and Abagyan R (1995) Homology modeling by the ICM method. Proteins 23:403-414

Casari G, De Daruvar A, Sander C and Schneider R (1996) Bioinformatics and the discovery of gene function. TIG 12:244-245

Chinea G, Padron G, Hooft RWW, Sander C and Vriend G (1995) The use of position-specific rotamers in model building by homology. Proteins 23:415-421

Chothia C and Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823-826

Colloc'h N, Etchebest C, Thoreau E, Henrissat B and Mornon J-P (1993) Comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment. Prot Engng 6:377-382

Cornell WD, Howard AE and Kollman P (1991) Molecular mechanical potential functions and their application to study molecular systems. Curr Opin Str Biol 1:201-212

Dayhoff MO (1978) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D. C., U. S. A.

De Filippis V, Sander C and Vriend G (1994) Predicting local structural changes that result from point mutations. Prot Engng 7:1203-1208

Depiereux E and Feytmans E (1992) MATCH-BOX: a fundamentally new algorithm for the simultaneous alignment of several protein sequences. Comput Appl Biosci 8:501-509

Di Francesco V, Garnier J and Munson PJ (1996) Improving protein secondary structure prediction with aligned homologous sequences. Prot Sci 5:106-113

Doolittle R (1996) Computer methods for macromolecular sequence analysis. Academic Press, San Diego

Doolittle RF (1986) Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley California

Doolittle RF (1994) Convergent evolution: the need to be explicit. Trends Biochem Sci 19:15-18

Dujon B (1996) The yeast genome project: what did we learn? TIG 12:263-270

Eddy SR (1995) Multiple alignment using hidden Markov models. In: Rawlings C, et al. (eds) Third International Conference on Intelligent Systems for Molecular Biology (ISMB). Menlo Park, CA: AAAI Press, Cambridge, England, pp 114-120

Eisenhaber F, Persson B and Argos P (1995) Prediction of protein structure. Recognition of primary, secondary and tertiary structural features from amino acid sequence. Crit Rev Biochem & Mol Biol 30:1-94

Eisenmenger F, Argos P and Abagyan R (1993) A Method to Configure Protein Side-chains from the Main-chain Trace in Homology Modelling. J Mol Biol 231:849-860

Engelbrecht J, Knudsen S and Brunak S (1992) G + C-rich tract in 5' end of human introns. J Mol Biol 227:108-113

Etzold T, Ulyanov A and Argos P (1996) SRS: Information retrieval system for molecular biology data banks. Meth Enzymol 266:114-128

Feng D-F and Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351-360

Feng D-F and Doolittle RF (1996) Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. Meth Enzymol 266:368-382

Feng D-F, Johnson MS and Doolittle RF (1985) Aligning amino acid sequences: commonly used methods. J Mol Evol 21:112-125

Fischer D and Eisenberg D (1996) Protein fold recognition using sequence-derived predictions. Prot Sci 5:947-955

Fleischmann RD, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512

Flöckner H, Braxenthaler M, Lackner P, Jaritz M, Ortner M and Sippl MJ (1995) Progress in fold recognition. Proteins 23:376-386

Gaasterland T and Selkov E (1995) Reconstruction of metabolic networks using incomplete information. In: Rawlings C, et al. (eds) Third International Conference on Intelligent Systems for Molecular Biology (ISMB). Menlo Park, CA: AAAI Press, Cambridge, England, pp 127-135

Gaasterland T and Sensen C (1996) Automated Microbial Genome Analysis. Tutorial held at ISMB'96, Report

Garnier J, Gibrat J-F and Robson B (1996) GOR method for predicting protein secondary structure from amino acid sequence. Meth Enzymol 266:540-553

George DG, Hunt LT and Barker WC (1996) PIR-international protein sequence database. Meth Enzymol 266:41-59

Goebel U, Sander C, Schneider R and Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins 18:309-317

Gonnet GH, Cohen MA and Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443-1445

Gribskov M, Luethy R and Eisenberg D (1990) Profile analysis. Meth Enzymol 183:146-159

Gribskov M and Veretnik S (1996) Identification of sequence patterns with profile analysis. Meth Enzymol 266:198-227

Hansen JE, Lund O, Engelbrecht J, Bohr H, Nielsen JO, Hansen J-ES and Brunak S (1995) Prediction of O-glycosylation of mammalian proteins: specificity patterns of UDP-GalNAc: polypeptide N-acetylgalactosaminyltransferase. Biochem J 308:801-813

Henikoff JG and Henikoff S (1996) Blocks database and its applications. Meth Enzymol 266:88-104

Henikoff S (1993) Sequence analysis by electronic mail server. Trends Biochem Sci 18:267-268

Henikoff S and Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sc USA 89:10915-10919

Henikoff S and Henikoff JG (1993) Performance evaluation of amino acid substitution matrices. Proteins 17:49-61

Henikoff S and Henikoff JG (1994) Position-based sequence weights. J Mol Biol 243:574-578

Higgins DG and Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244

Higgins DG, Thompson JD and Gibson TJ (1996) Using CLUSTAL for multiple sequence alignments. Meth Enzymol 266:383-402

Holbrook SR, Muskal SM and Kim S-H (1990) Predicting surface exposure of amino acids from protein sequence. Prot Engng 3:659-665

Holden C (1995) Folding proteins fast. Science 269:1821

Holm L, Rost B, Sander C, Schneider R and Vriend G (1994) Data based modeling of proteins. In: Doniach S (ed) Statistical Mechanics, Protein Structure, and Protein Substrate Interactions. Plenum Press, New York, pp 277-296

Holm L and Sander C (1996) The FSSP database: fold classification based on structure-structure alignment of proteins. Nucl Acids Res 24:206-210

Hubbard T, et al. (1996) Update on protein structure prediction: results of the 1995 IRBM workshop. Folding & Design 1:R55-R63

Hubbard TJP (1994) Use of β-strand interaction pseudo-potential in protein structure prediction and modelling. In: Hunter L (ed) 27th Hawaii International Conference on System Sciences. IEEE Society Press, Maui, Hawaii, USA, pp 336-344

Hubbard TJP and Park J (1995) Fold recognition and ab initio structure predictions using Hidden Markov models and β-strand pair potentials. Proteins 23:398-402

Johnston M (1996) Towards a complete understanding of how a simple eukaryotic cell works. TIG 12:242-243

Jones DT, Taylor WR and Thornton JM (1992) A new approach to protein fold recognition. Nature 358:86-89

Kabsch W and Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers 22:2577-2637

Kabsch W and Sander C (1984) On the use of sequence homologies to predict protein structure: Identical pentapeptides can have completely different conformations. Proc Natl Acad Sc USA 81:1075-1078

Karp PD, Ouzounis C and Paley S (1996) HinCyc: a knowledge base of the complete genome and metabolic pathways of H. influenzae. In: States D, et al. (eds) Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, St. Louis, M.O., U.S.A., pp 116-124

Koonin EV, Tatusov RL and Rudd KE (1996) Protein sequence comparison at genome scale. Meth Enzymol 266:295-322

Krogh A, Brown M, Mian IS, Sjölander K and Haussler D (1994) Hidden Markov Models in Computational Biology: Applications to Protein Modeling. J Mol Biol 235:1501-1531

Larsen NI, Engelbrecht J and Brunak S (1995) Analysis of eukaryotic promoter sequences reveals a systematically occurring CT-signal. Nucl Acids Res 23:1223-1230

Lattman EE (1994) Protein crystallography for all. Proteins 18:103-106

Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF and Wootton JC (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262:208-214

Lemer CM-R, Rooman MJ and Wodak SJ (1995) Protein structure prediction by threading methods: evaluation of current techniques. Proteins 23:337-355

Lesk AM, Levitt M and Chothia C (1986) Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. Prot Engng 1:77-78

Levitt M (1992) Accurate Modeling of Protein Conformation by Automatic Segment Matching. J Mol Biol 226:507-533

Levitt M and Chothia C (1976) Structural patterns in globular proteins. Nature 261:552-558

Livingstone CD and Barton GJ (1993) Protein sequence alignment: a strategy for the hierarchical analysis of residue conservation. Comput Appl Biosci 9:745-756

Madden TL, Tatusov RL and Zhang J (1996) Applications of network BLAST server. Meth Enzymol 266:131-140

May ACW and Blundell TL (1994) Automated comparative modelling of protein structures. Curr Opin Biotechn 5:355-360

McLachlan AD (1972) Repeating sequences and gene duplication in proteins. J Mol Biol 64:417-437

Moult J, Pedersen JT, Judson R and Fidelis K (1995) A large-scale experiment to assess protein structure prediction methods. Proteins 23:ii-iv

Murata M (1990) Three-way Needleman-Wunsch algorithm. Meth Enzymol 183:365-375

Nakashima H and Nishikawa K (1992) The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins. FEBS Lett 303:141-146

Needleman SB and Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443-453

Neher E (1994) How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sc USA 91:98-102

Notredame C and Higgins DG (1996) SAGA: sequence alignment by genetic algorithm. Nucl Acids Res 24:1515-1524

Oliver S (1996) A network approach to the systematic analysis of yeast gene function. TIG 12:241-242

Orengo CA, Flores TP, Taylor WR and Thornton JM (1993) Identification and classification of protein fold families. Prot Engng 6:485-500

Overington J, Donnelly D, Johnson MS, Sali A and Blundell TL (1992) Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Prot Sci 1:216-226

Overington J, Johnson MS, Sali A and Blundell TL (1990) Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc Royal Soc Lond B 241:132-145

Pastore A and Lesk AM (1990) Comparison of the Structures of Globins and Phycocyanins: Evidence for Evolutionary Relationship. Proteins 8:133-155

Pearson WR (1996) Effective protein sequence comparison. Meth Enzymol 266:227-258

Pearson WR and Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sc USA 85:2444-2448

Persson B and Argos P (1994) Prediction of transmembrane segments in proteins utilising multiple sequence alignments. J Mol Biol 237:182-192

Poch O and Delarue M (1996) Converting sequence block alignments into structural insights. Meth Enzymol 266:662-680

Risler J-L, Delorme M-O, Delacroix H and Henaut A (1988) Amino acid substitutions in structurally related proteins. A pattern recognition approach. J Mol Biol 204:1019-1029

Rodionov MA and Johnson MS (1994) Residue-residue contact substitution probabilities derived from aligned three-dimensional structures and the identification of common folds. Prot Sci 3:2366-2377

Rost B (1995a) Fitting 1-D predictions into 3-D structures. In: Bohr H and Brunak S (eds) Protein folds: a distance based approach. CRC Press, Boca Raton, Florida, pp 132-151

Rost B (1995b) TOPITS: Threading One-dimensional Predictions Into Three-dimensional Structures. In: Rawlings C, et al. (eds) Third International Conference on Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press, Cambridge, England, pp 314-321

Rost B (1996) PHD: predicting one-dimensional protein structure by profile based neural networks. Meth Enzymol 266:525-539

Rost B, Casadio R and Fariselli P (1996a) Refining neural network predictions for helical transmembrane proteins by dynamic programming. In: States D, et al. (eds) Fourth International Conference on Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press, St. Louis, M.O., U.S.A., pp 192-200

Rost B, Casadio R and Fariselli P (1996b) Topology prediction for helical transmembrane proteins at 86% accuracy. Prot Sci 5:in press

Rost B, Casadio R, Fariselli P and Sander C (1995) Prediction of helical transmembrane helices at 95% accuracy. Prot Sci 4:521-533

Rost B and Sander C (1992) Jury returns on structure prediction. Nature 360:540

Rost B and Sander C (1993) Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 232:584-599

Rost B and Sander C (1994a) Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19:55-72

Rost B and Sander C (1994b) Conservation and prediction of solvent accessibility in protein families. Proteins 20:216-226

Rost B and Sander C (1994c) Structure prediction of proteins - where are we now? Curr Opin Biotechn 5:372-380

Rost B and Sander C (1995) Progress of 1D protein structure prediction at last. Proteins 23:295-300

Rost B and Sander C (1996) Bridging the protein sequence-structure gap by structure predictions. Annu Rev Biophys Biomol Struct 25:113-136

Rost B, Sander C and Schneider R (1993) Progress in protein structure prediction? Trends Biochem Sci 18:120-123

Rost B, Sander C and Schneider R (1994a) PHD - an automatic server for protein secondary structure prediction. Comput Appl Biosci 10:53-60

Rost B, Sander C and Schneider R (1994b) Redefining the goals of protein secondary structure prediction. J Mol Biol 235:13-26

Rost B, Schneider R and Sander C (1996c) Protein fold recognition by prediction-based threading. J Mol Biol submitted Nov 27, 1995

Rost B and Valencia A (1996) Pitfalls of protein sequence analysis. Curr Opin Biotechn in press

Rost B and Schneider R (1996) WWW services for sequence analysis. EMBL, WWW document (

Russell RB and Barton GJ (1992) Multiple Protein sequence alignment From Tertiary Structure Comparison: Assignment of Global and Residue Confidence Levels. Proteins 14:309-323

Russell RB and Barton GJ (1993) The limits of protein secondary structure prediction accuracy from multiple sequence alignment. J Mol Biol 234:951-957

Russell RB, Copley RR and Barton GJ (1996) Protein fold recognition by mapping predicted secondary structures. J Mol Biol in press

Salamov AA and Solovyev VV (1995) Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignment. J Mol Biol 247:11-15

Sali A and Blundell T (1994) Comparative Protein Modelling by Satisfaction of Spatial Restraints. In: Bohr H and Brunak S (eds) Protein Structure by Distance Analysis. IOS Press, Amsterdam, Oxford, Washington, pp 64-87

Sander C and Schneider R (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56-68

Sankoff D and Kruskal JB (1983) Time warps, string edits, and macromolecules: The theory and practice of sequence comparison. Addison-Wesley, Reading, MA

Saqi MAS and Sternberg MJE (1991) A simple method to generate non-trivial alternate alignments of protein sequences. J Mol Biol 219:727-732

Schneider G and Wrede P (1993) Development of Artificial Neural Filters for Pattern Recognition in Protein Sequences. J Mol Evol 36:586-595

Schneider R (1994) Sequenz und Sequenz-Struktur Vergleiche und deren Anwendung für die Struktur- und Funktionsvorhersage von Proteinen. Univ. of Heidelberg, PhD

Schneider R and Sander C (1996) The HSSP database of protein structure-sequence alignments. Nucl Acids Res 24:201-205

Schuler GD, Epstein JA, Ohkawa H and Kans JA (1996) Entrez: molecular biology database and retrieval system. Meth Enzymol 266:141-162

Sellers PH (1974) An algorithm for the distance between two finite sequences. J Combin Theor A 16:253-258

Shindyalov IN, Kolchanov NA and Sander C (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Prot Engng 7:349-358

Shomer B, Harper RAL and Cameron GN (1996) Information services of the European Bioinformatics Institute. Meth Enzymol 266:3-27

Shortle D (1995) Protein fold recognition. Nature Struct Biol 2:91-92

Sipos L and von Heijne G (1993) Predicting the topology of eukaryotic membrane proteins. Eur J Biochem 213:1333-1340

Sippl MJ (1995) Knowledge-based potentials for proteins. Curr Opin Str Biol 5:229-235

Smith RF and Smith TF (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. Prot Engng 5:35-41

Smith TF, Waterman MS and Fitch WM (1981) Comparative biosequence metrics. J Mol Evol 18:38-46

Summers NL and Karplus M (1990) Modeling of Globular Proteins. J Mol Biol 216:991-1016

Taylor WR (1986) The classification of amino acid conservation. J Theor Biol 119:205-218

Taylor WR (1987) Multiple sequence alignment by a pairwise algorithm. Comput Appl Biosci 3:81-87

Taylor WR (1996) Multiple protein sequence alignment: algorithms and gap insertion. Meth Enzymol 266:343-367

Taylor WR and Hatrick K (1994) Compensating changes in protein multiple sequence alignment. Prot Engng 7:341-348

Thompson J, Higgins D and Gibson T (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res 22:4673-4690

Uberbacher EC, Xu Y and Mural RJ (1996) Discovering and understanding genes in human DNA sequence using GRAIL. Meth Enzymol 266:259-281

Vingron M and Argos P (1989) A fast and sensitive multiple sequence alignment algorithm. Comput Appl Biosci 5:115-121

Vingron M and Argos P (1990) Determination of reliable regions in protein sequence alignment. Prot Engng 3:565-569

Vingron M and Waterman MS (1994) Sequence alignment and penalty choice. J Mol Biol 235:1-12

von Heijne G (1992) Membrane protein structure prediction. J Mol Biol 225:487-494

Vriend G and Sander C (1993) Quality of Protein Models: Directional Atomic Contact Analysis. J Appl Cryst 26:47-60

Wako H and Blundell TL (1994) Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins I. Solvent accessibility classes. J Mol Biol 238:682-692

Wodak SJ and Rooman MJ (1993) Generating and testing protein folds. Curr Opin Str Biol 3:247-259

Wootton JC and Federhen S (1996) Analysis of compositionally biased regions in sequence databases. Meth Enzymol 266:554-571

Zuker M (1991) Suboptimal sequence alignment in molecular biology. Alignment with error analysis. J Mol Biol 221:403-420