| Title: | Domains, motifs, and clusters in the protein universe |
| Author: | Jinfeng Liu & Burkhard Rost |
| Citation: | Current Opinion in Chemical Biology (2003), 7, 5-11 |
Domains, motifs, and clusters in the protein universe
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 3 | Dept. of Pharmacology, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA |
| 4 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| * | Corresponding authors: email = liu@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-4018, fax: +1-212-305-7932 |
NOTE for authors (after publication): Upon publication the notice must be changed to read This article is published in (Current Opinion in Chemical Biology, 7, 2003 and pages) © copyright Elsevier Science (2002). Elsevier is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.
The rapid growth of bio-sequences results in an increasing demand for reliable methods that group proteins. A few databases with curated alignments of protein families have demonstrated that expert-driven repositories can keep up with the data deluge in the genome era. These original resources implicitly identify domain-like modules in proteins. An increasing number of automatic methods that cluster the protein universe have sprouted in recent years. Many of these implicitly dissect proteins into structural domain-like fragments. In a very coarse-grained evaluation, some of the automatic methods appear on par with expert-driven approaches. However, neither automatic nor manual methods are currently entirely up to the challenges of tasks such as target selection in structural genomics. Thus, we urgently need refined and sustained automatic clustering tools.
Key words: family classification, clustering, protein domain, motif, evolution; protein structure; bioinformatics
| 3D structure | three-dimensional co-ordinates of protein structure |
| Blocks | database of protein alignment blocks derived from multiple compilations [1] |
| COGs | Clusters of Orthologous Groups of proteins [2] |
| HMM | hidden Markov model, i.e. a particular alignment method |
| InterPro | repository cross-linking most important original databases of protein sequences and families [3, 4] |
| PDB | Protein Data Bank of experimentally determined 3D structures of proteins [5] |
| Pfam | database of expert-curated alignments of protein families (strictly called 'Pfam-A') [6] |
| SMART | database of expert-curated protein modules [7] |
| SWISS-PROT | database of protein sequences [8] |
| TIGRFAMs | expert-curated database of protein families [9] |
| TrEMBL | translation of the EMBL-nucleotide database coding DNA to protein sequences [10] |
| UniProt | database integrating PIR, SWISS-PROT and TrEMBL (future). |
Gordon Moore correctly predicted that the potency of computers would double every 18-24 months (Moore's 'law') [11]. The only example of computer-independent information growing faster may be the unravelling of bio-sequences [12]. And while the growth of computer potency is beginning to slow down [13], the growth rate for bio-sequences continues to increase. This reality is one of the technical reasons why clustering and classifying proteins becomes increasingly important. We argue that there is no reasonable way of clustering and classifying proteins without dissecting proteins into structural domain-like fragments [14, 15]. In fact, such domain-like fragments also appear crucial to infer structure and function. Here, we review some of the recent manual and automatic methods that attempt to classify proteins (URLs in Table 1).
| DB/Method | Version | Latest update | Entries | Update mode | URL (all begin with http://) |
| Short sequence motifs | | | | | |
| PROSITE | 17.23 | 10/2002 | 1573 | manual | www.expasy.ch/prosite/ |
| Blocks+ | | 8/2001 | 8656 | manual | blocks.fhcrc.org/blocks |
| PRINTS | 35.0 | 7/2002 | 1750 | manual | www.bioinf.man.ac.uk/dbbrowser/PRINTS/ |
| Structural domain-like regions | | | | | |
| Pfam-A | 7.6 | 9/2002 | 4463 | manual | pfam.wustl.edu |
| TIGRFAM | 2.1 | 9/2002 | 1622 | manual | www.tigr.org/TIGRFAMs |
| SMART | 3.4 | 10/2002 | 654 | manual | smart.embl-heidelberg.de |
| SBASE | 9.0 | 10/2002 | 483 | semi-manual | hydra.icgeb.trieste.it/~kristian/SBASE/ |
| DOMO | 2.0 | 4/1998 | | automatic | www.infobiogen.fr/services/domo/ |
| ProDom | 2001.3 | 12/2001 | | automatic | prodes.toulouse.inra.fr/prodom/doc/prodom.htm |
| GeneRAGE | | | | automatic | www.ebi.ac.uk/research/cgg/services/rage/ |
| TribeMCL | | | | automatic | www.ebi.ac.uk/research/cgg/tribe/ |
| CHOP | | 10/2002 | | automatic | cubic.bioc.columbia.edu/db/chop/ |
| Integration | | | | | |
| InterPro | 5.2 | 9/2002 | 5875 | N/A | www.ebi.ac.uk/interpro |
| MetaFam | 4.1 | 9/2002 | | N/A | metafam.ahc.umn.edu |
| Clusters of proteins | | | | | |
| CluSTr | | | | automatic | www.ebi.ac.uk/clustr/ |
| SYSTERS | 3.0 | | | automatic | systers.molgen.mpg.de |
| PICASSO | 0 | 3/1998 | | automatic | systers.molgen.mpg.de |
| ProtoNet | 1.4 | 9/2002 | | automatic | www.protonet.cs.huji.ac.il/protonet/ |
| ProClust | 1.0 | | | automatic | promoter.mi.uni-koeln.de/~proclust/ |
Motifs and domains. Two types of expert-curated resources complement one another: motif-based and domain-based databases. It is extremely difficult to infer similarities in structure or function from short alignments [16] . Particular short sequence motifs such as nuclear localization signals [17] are related to protein function, and often span evolutionarily diverged families. In fact, short motifs may constitute candidates for the 'atoms of evolution' [15] . Even more powerful are motifs defined by proximity in three-dimensional (3D) structures that constitute skeletons of 'functional units' [18] . However, protein families often cannot be characterised by single motifs. In contrast, structural domains constitute regions that (1) share a common fold, (2) have some functional similarity, and (3) may be evolutionarily related. Thus, domain-based families capture biologically crucial features beyond short motifs.
Motif-based classifications. PROSITE motifs are extracted from the literature [19, 20, 21]; annotations are cross-linked to, and updates synchronised with, SWISS-PROT [10]. Not all motifs are equally informative; this reality is reflected by statistics on how often a certain motif matches in SWISS-PROT. Families are usually defined as 'all proteins that share a certain motif' that is expressed as a regular expression (e.g. [KH]DE[LF] abbreviates the following four peptides: KDEL, HDEL, KDEF, HDEF). Profiles have been added in order to allow detecting diverged families; these profile-extended patterns currently cover 15% of all entries. Motifs based on single sequences can capture signatures that are specific to a particular protein. Replacing single-sequence motifs by motifs derived from alignments significantly reduces the noise. However, over 80% of the PROSITE entries are still based on single sequences. The Blocks database [22] builds un-gapped, weighted local alignments (blocks) through dynamic programming [23] for proteins grouped by PROSITE, PRINTS [24], Pfam-A [6], ProDom [25] and DOMO [26]. Blocks alignments extend over 5 to 55 residues (Fig. 1B). The PRINTS [24] database also contains groups of aligned, un-weighted motifs referred to as 'fingerprints' that are derived through iterative database searches, followed by semi-manual alignments, and by a final manual validation/annotation.
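As a minimal sketch of how a pattern-defined family works, the example pattern [KH]DE[LF] can be expanded into its peptides and matched against sequences with a regular expression. The helper names and the toy sequences are hypothetical and serve illustration only; full PROSITE syntax contains further elements (such as wildcards and repeat counts) that this sketch does not handle.

```python
import itertools
import re


def expand_pattern(pattern: str) -> list[str]:
    """Enumerate all peptides matched by a simple pattern such as [KH]DE[LF].

    Only single residues and bracketed alternatives are supported.
    """
    # Split into positions: '[KH]' -> 'KH', 'D' -> 'D', ...
    tokens = re.findall(r"\[[A-Z]+\]|[A-Z]", pattern)
    positions = [token.strip("[]") for token in tokens]
    return ["".join(residues) for residues in itertools.product(*positions)]


def family_members(pattern: str, sequences: dict[str, str]) -> list[str]:
    """Return the names of all sequences that contain the motif."""
    regex = re.compile(pattern)
    return [name for name, seq in sequences.items() if regex.search(seq)]


# Hypothetical toy sequences, not taken from any database.
toy_sequences = {
    "protA": "MSTAGKDELQ",  # contains KDEL
    "protB": "MHDEFAAA",    # contains HDEF
    "protC": "MKDEAL",      # no match
}

print(expand_pattern("[KH]DE[LF]"))   # ['KDEL', 'KDEF', 'HDEL', 'HDEF']
print(family_members("[KH]DE[LF]", toy_sequences))  # ['protA', 'protB']
```

The family is simply the set of sequences that match; how often such a short pattern also matches unrelated proteins is what the match statistics mentioned above try to capture.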
Structure-based domain classification. A particular example of 'human with machine vs. machines' is the SCOP classification of proteins of known structure [27]. When structures are added to PDB [5], Alexei Murzin visually classifies these into 'known fold' and 'new fold'. Folds are further grouped into families and super-families, and structural domains are assigned. CATH also classifies structures and defines domains [28]; it has been moving steadily from expert-driven to automatic classifications. Fully automated structure-based domain classifications are available through DALI [29], VAST [30] and PrISM [31].
Classifying structural domain-like families. Another comprehensive expert-curated resource is Pfam [6] (more precisely Pfam-A). Pfam pioneered the following concept: (1) build seed alignments for domain-like regions, and (2) extend seed alignments into larger families. Domain seeds from the literature are extended by expert-controlled searches with HMMer. New domain-like entries are added on a first-catches-the-eye basis. TIGRFAMs [9] and SMART [7] implement a strategy similar to that of Pfam; thus all three overlap (Fig. 1A). One difference is that SMART seeds are identified by PSI-BLAST [32]. SMART modules are much shorter than structural domains (Fig. 1B). SMART also estimates the likelihood of a given domain to be secreted, cytoplasmic, or nuclear, annotates trans-membrane helices, coiled-coils, signal peptides, and internal repeats, and cross-links to OMIM [33]. While Pfam, TIGRFAMs, and SMART are cross-linked to one another, it is not clear to what extent these three databases overlap. SBASE [34] groups families by recursively applying k-means clustering to proteins with similar biological names. Families are defined as groups of domain-like regions with significant BLAST similarities; a new query sequence can be assigned to a family by a nearest-neighbour approach, by a probabilistic score, or by a neural network. Another grouping is realised by the COGs database, which attempts to classify proteins according to phylogeny [2].
Fig. 1B. Length distribution of alignments/domains in different databases. We plotted the average lengths of family entries against the cumulative percentage of families. For SCOP [27] (version 1.59, 1824 families) and SBASE [34], the numbers refer to the average lengths of all sequences in each family/domain, for Pfam [6], TIGRFAMs [9] and SMART [7] to the lengths of the HMMs, and for Blocks+ [22], DOMO [26] and ProDom [25] to the lengths of the family alignments. The closer the curves to the central line defined by SCOP, the more the entries in that database resemble structural domains. All Blocks+ alignments are shorter than 55 residues. Since Blocks+ is not designed to capture structure-like domains, the Blocks+ distribution constitutes the lower end of the distribution (too fragmented). The corresponding upper end (too long) is given by TIGRFAMs, for which the distribution is similar to that of full-length proteins [15]. ProDom and SMART are biased towards short fragments, with almost half of the families shorter than 60 residues. In comparison to the expert-curated SMART modules, the automatic ProDom domain dissection appears surprisingly accurate, on average. The observation that all other data sets fall below SCOP indicates that too many proteins are not dissected into domains.
Database integration. All expert-curated family databases have their strengths and weaknesses. InterPro [3] provides a unified documentation resource for protein families, domains and functional sites by merging annotations from PROSITE, PRINTS, Pfam, TIGRFAMs, SMART, and ProDom. The next-generation extension UniProt will merge SWISS-PROT, TrEMBL, InterPro and PIR resources [35]. MetaFam [36] combines Blocks, DOMO [26], Pfam, PIR-ALN, PRINTS, PROSITE, ProDom [25], ProtoNet [37], SBASE, and SYSTERS [38]. MetaFam first converts all proteins in the family databases into a common set of non-redundant proteins; then common families are identified, and supersets are created. Domain boundaries are identified by finding consensus regions among the databases.
Different objectives yield different clusters. One problem for automatic clustering methods is the definition of similarity thresholds that yield biologically relevant classifications. Another problem is sketched by the following alternative: (1) group all proteins that share feature X into one cluster Y, or (2) ensure that no protein outside cluster Y shares feature X with any protein in Y. Both objectives first must translate 'similarity in sequence' into 'similarity in feature X'. For the feature 'similarity in structure', the criteria are well defined in the following way: if the sequence similarity (S_AB) between proteins A and B exceeds threshold T, we can reliably infer that A and B are structurally similar [39, 31]. However, if S_AB < T, we cannot infer that A and B differ in structure [14]. The threshold problem becomes more difficult when we want to infer similarity in function: different aspects of function such as sub-cellular localisation [40], enzymatic activity [41, 16], or cellular function [42] require different thresholds. We may seek a way out of this problem by restricting clusters to close homologues, such as COGs [2]. However, the dilemma between the Scylla of 'restrictive thresholds yielding many small clusters' and the Charybdis of 'permissive thresholds yielding few large clusters' is a fundamental one. Neither objective automatic nor subjective expert-driven classifications can circumvent this problem.
Evaluating clustering methods is problematic. Since there is no single 'correct' solution to the clustering problem, there is also no unambiguous way to evaluate methods. Methods classifying proteins into structural domains could be compared to large sets of structural domains annotated by SCOP, CATH, DALI [29] , VAST [30] , and PrISM [31] . However, even structure-based domain-assignments agree only to some extent. A simple, coarse-grained feature is the agreement between the distributions of domain lengths suggested by structure- and by sequence-based domain assignments ( Fig. 1 B).
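One hedged way to turn the coarse-grained comparison of length distributions into a number is to measure the largest vertical gap between two cumulative distributions (a Kolmogorov-Smirnov-type statistic). This choice of statistic is an assumption made for illustration and is not a method described in this review; the function names and the length lists are hypothetical toy data.

```python
from bisect import bisect_right


def cumulative_fraction(lengths, cutoff):
    """Fraction of families whose average length is <= cutoff."""
    ordered = sorted(lengths)
    return bisect_right(ordered, cutoff) / len(ordered)


def max_cdf_gap(lengths_a, lengths_b):
    """Largest vertical gap between two empirical cumulative distributions."""
    points = sorted(set(lengths_a) | set(lengths_b))
    return max(
        abs(cumulative_fraction(lengths_a, x) - cumulative_fraction(lengths_b, x))
        for x in points
    )


# Hypothetical average family lengths (residues), not real database values.
structure_based = [80, 120, 150, 200]
too_fragmented = [20, 30, 40, 50]

print(max_cdf_gap(structure_based, structure_based))  # 0.0
print(max_cdf_gap(structure_based, too_fragmented))   # 1.0
```

A gap near 0 means the sequence-based lengths follow the structure-based reference closely; a gap near 1 means the two distributions barely overlap. Agreement in length distribution is necessary but by no means sufficient for correct domain boundaries.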
Clusters establish similarity not distance. Calculating pairwise sequence similarity is usually the first step toward clustering. Expectation values (E-values) from BLAST/PSI-BLAST [43] are often adopted to save CPU-time [26, 44, 45, 38, 25]. One problem is that E-values change when adding sequences to the cluster. Another problem is that BLAST E-values are not symmetric, i.e. they differ between aligning A against B and aligning B against A. Most methods account for the asymmetry by ad-hoc hacks: use Smith-Waterman alignments when only one BLAST E-value is above the threshold [44, 38], or replace asymmetric E-values by averages over both [46]. Other methods establish similarity through Smith-Waterman alignments [23]. For example, ProClust uses normalised Smith-Waterman scores [47]; CluSTr uses Z-scores resulting from Monte-Carlo simulations of Smith-Waterman alignments [48]. ProtoMap [45], ProtoNet [37], and BioSphere [49] combine measures from Smith-Waterman, BLAST, and FASTA alignments. One important reality of sequence comparisons is that alignment methods optimise the similarity between two sequences: 'less similar' does not imply 'more distant'. To illustrate this point for structural similarity: 90% of all pairs of proteins that have 15% identical residues over their entire length have different structures; however, 90% of the pairs of proteins with similar structure have less than 15% identical residues [12, 39, 31].
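The averaging hack for asymmetric E-values can be sketched as follows. This is a simplified illustration under stated assumptions: the arithmetic mean is used, a pair with only one recorded direction simply keeps that value, and the function name and toy E-values are hypothetical. The published methods may average differently or fall back to Smith-Waterman alignments.

```python
def symmetrise(evalues):
    """Replace directional E-values by one value per unordered protein pair.

    evalues maps (query, hit) -> E-value; the result maps
    frozenset({query, hit}) -> averaged E-value.
    """
    symmetric = {}
    for (query, hit), e_forward in evalues.items():
        if query == hit:
            continue  # self-hits carry no pairwise information
        pair = frozenset((query, hit))
        e_reverse = evalues.get((hit, query))
        if e_reverse is None:
            # Assumption: keep the single available direction as is.
            symmetric[pair] = e_forward
        else:
            symmetric[pair] = (e_forward + e_reverse) / 2.0
    return symmetric


# Hypothetical BLAST results: A vs B differs from B vs A.
blast = {("A", "B"): 1e-10, ("B", "A"): 3e-10, ("A", "C"): 1e-5}
for pair, evalue in symmetrise(blast).items():
    print(sorted(pair), evalue)
```

After this step every pair has a single value, so the similarity graph is undirected and standard clustering algorithms can be applied.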
Clustering without considering the domain problem. Some methods try to ignore the domain problem by applying very conservative thresholds. SYSTERS clusters proteins with BLAST E-values < 10^-40 by single-linkage [38]. At this level, many partial matches in multi-domain proteins are eliminated at the cost of small clusters. ProtoNet classifies proteins at different levels of confidence [37]. It begins at a high sequence similarity (E-values < 10^-100) with many small clusters. These initial clusters are then merged gradually at various levels of similarity [50]. Users can determine the desired level of coarse-grained representation by dialling through different thresholds.
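The effect of dialling through thresholds can be illustrated with a minimal single-linkage sketch based on a union-find structure. It is not the SYSTERS or ProtoNet implementation; the function name, the protein names, and the E-values are hypothetical toy data.

```python
def single_linkage(items, evalues, threshold):
    """Join any two items whose pairwise E-value is below the threshold."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), evalue in evalues.items():
        if evalue < threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())


proteins = ["P1", "P2", "P3", "P4", "P5"]
# Hypothetical symmetric E-values for a chain of related proteins.
toy_evalues = {
    ("P1", "P2"): 1e-120,
    ("P2", "P3"): 1e-50,
    ("P3", "P4"): 1e-20,
    ("P4", "P5"): 1e-110,
}

for threshold in (1e-100, 1e-40, 1e-10):
    print(threshold, single_linkage(proteins, toy_evalues, threshold))
# 1e-100 -> [['P1', 'P2'], ['P3'], ['P4', 'P5']]
# 1e-40  -> [['P1', 'P2', 'P3'], ['P4', 'P5']]
# 1e-10  -> [['P1', 'P2', 'P3', 'P4', 'P5']]
```

The restrictive threshold yields many small clusters and the permissive one a single large cluster; with single-linkage, one spurious or partial (multi-domain) hit below the threshold is enough to merge two otherwise unrelated groups.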
Domain-based clustering. A few methods explicitly predict domain boundaries from sequence information, in particular, through database searches [51, 52], concepts from protein folding [53], statistics [54], and neural networks [55, 56]. None of these is well enough established yet for large-scale sequence analysis. Many proteins appear to have regions depleted of regular structure [57]; identifying such regions may assist in predicting domain boundaries [58]. However, most methods that predict domain boundaries use alignment information and also classify the protein universe in two steps: (1) chop proteins into domain-like fragments, (2) cluster these fragments (ProDom [25], DOMO [26], GeneRAGE [44], CHOP [59]). ProDom applies the following algorithm [25]. First, stack all sequences in SWISS-PROT and TrEMBL. Then iterate: (1) identify the shortest sequence in the stack, (2) find related regions through PSI-BLAST, and (3) remove already clustered fragments from the stack. The algorithm terminates when no sequences are left. DOMO applies successive steps based on similarity in amino acid composition, di-peptide composition, local sequence similarity, and multiple sequence alignment similarity to detect domain boundaries and then clusters the domains [26]. DOMO tends to propose longer regions than ProDom (Fig. 1B). Picasso dissects and then clusters domain-like fragments by [60]: (1) defining close neighbours by pairwise BLAST, and (2) hierarchically merging the initial neighbours through profile-profile comparisons. Domain borders are determined based on overlapping maximal clusters (clusters that are not fully contained in any other cluster); unified families are defined as sets of clusters that share at least one common domain. GeneRAGE [44] detects multi-domain proteins through simple phylogenetic transitivity: if A is similar to both B and C, and B is not similar to C, then A has at least two domains. The resulting fragments are clustered by single-linkage. Although GeneRAGE appeared adequate for bacterial genomes, its accuracy does not suffice to cope with the complexity of eukaryotes [46, 14].
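The transitivity rule used to flag multi-domain proteins can be written down in a few lines. The sketch below only implements the rule as stated in the text (A is similar to B and C, but B is not similar to C); it is not the GeneRAGE implementation, and the function name and the toy similarity relation are hypothetical.

```python
from itertools import combinations


def putative_multidomain(similar):
    """Flag proteins that have two neighbours which are not similar to each other.

    similar maps each protein to the set of proteins it is similar to;
    the relation is assumed to be symmetric.
    """
    flagged = set()
    for protein, neighbours in similar.items():
        for b, c in combinations(sorted(neighbours), 2):
            if c not in similar.get(b, set()):
                flagged.add(protein)  # b and c likely match different regions
                break
    return flagged


# Hypothetical toy data: AB carries one A-like and one B-like domain.
toy_similar = {
    "AB": {"A1", "A2", "B1"},
    "A1": {"AB", "A2"},
    "A2": {"AB", "A1"},
    "B1": {"AB"},
}

print(putative_multidomain(toy_similar))  # {'AB'}
```

The rule relies on every pairwise relation being detected: if the similarity between A1 and A2 were missed, AB would still be flagged, but for the wrong reason, which hints at why such a simple test struggles with large and divergent eukaryotic families.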
Clustering with implicit domain information. A few methods based on graph-theory attempt to avoid the explicit dissection into domains by embedding domain information into the clustering procedure. ProClust [47] encodes partial alignments resulting from multi-domain proteins into the edges of similarity graphs. Instead of using symmetric Smith-Waterman scores corresponding to undirected edges, ProClust normalises the score by the length of the proteins, thus yielding two directed edges differentiated by protein length. The graphs are then partitioned into Strongly Connected Components (SCCs) that constitute the final clusters. For about 55% of the data, the method is reported to achieve a high specificity (>99%) when tested against SCOP [27]. TRIBE-MCL expresses pairwise similarity through a particular matrix (Markov matrix) that is then clustered (by a Markov cluster algorithm) [46]. The algorithm iterates over rounds of expansion and inflation to alter the matrix. Another graph-based method uses the normalised cut (Ncut) algorithm to classify proteins through pairwise relations [61]. The method was reported to reproduce COG families [2] accurately.
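The idea of length-normalised directed edges followed by a partition into strongly connected components can be sketched as follows. The normalisation used here (raw score divided by the length of the protein at the tail of the edge, compared with a fixed cutoff) is an assumption for illustration; the actual ProClust scaling may differ. All names, scores, and lengths are hypothetical, and the SCCs are found with Kosaraju's algorithm.

```python
def directed_edges(scores, lengths, cutoff):
    """Keep edge a -> b if the score is high relative to the length of a."""
    edges = {name: set() for name in lengths}
    for (a, b), score in scores.items():
        if score / lengths[a] >= cutoff:
            edges[a].add(b)
        if score / lengths[b] >= cutoff:
            edges[b].add(a)
    return edges


def strongly_connected_components(edges):
    """Kosaraju's algorithm (recursive; adequate for small toy graphs)."""
    order, seen = [], set()

    def visit(u):
        seen.add(u)
        for v in edges[u]:
            if v not in seen:
                visit(v)
        order.append(u)  # record finishing order

    for u in edges:
        if u not in seen:
            visit(u)

    reverse = {u: set() for u in edges}
    for u, targets in edges.items():
        for v in targets:
            reverse[v].add(u)

    components, assigned = [], set()

    def collect(u, component):
        assigned.add(u)
        component.add(u)
        for v in reverse[u]:
            if v not in assigned:
                collect(v, component)

    for u in reversed(order):
        if u not in assigned:
            component = set()
            collect(u, component)
            components.append(component)
    return components


# Hypothetical toy data: AB is a long two-domain protein, A1 and A2 are short.
lengths = {"A1": 100, "A2": 100, "AB": 300}
scores = {("A1", "A2"): 90, ("A1", "AB"): 85, ("A2", "AB"): 85}

edges = directed_edges(scores, lengths, cutoff=0.5)
clusters = strongly_connected_components(edges)
print(sorted(sorted(c) for c in clusters))  # [['A1', 'A2'], ['AB']]
```

The partial match covers most of the short proteins but only a third of AB, so the edges point towards AB and not back; AB therefore does not join the cluster of A1 and A2, and cannot bridge it to proteins that resemble only its second domain.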
Grouping proteins into families is important both for biological and computational reasons. Expert-curated family databases like Pfam [6] , TIGRFAMs [9] , SMART [7] , and COGs [2] have steadily increased their coverage of the protein universe over the last year. Obviously, these resources overlap to some extent ( Fig. 1 A). Therefore, the first large-scale efforts toward integration of many resources are extremely important additions to the field of databases [3, 36, 4] . Structural genomics reveals the importance of identifying structural domains [62, 63, 64, 65, 14, 18, 37] . While the expert-driven family databases implicitly identify domains, the quality of this identification is quite mixed ( Fig. 1 B). Furthermore, entire proteomes can currently only be clustered through automatic methods. Some methods try to avoid the domain problem by elaborate hierarchical clustering schemata [38, 37, 50] . Others identify domains through alignments and then cluster these domains [26, 44, 59, 25] . The first methods have been published that address the task of identifying domain boundaries directly [54, 51, 55, 52, 53, 58, 56] . Three methods implicitly combine domain-dissection and clustering through algorithms from graph-theory [47, 61, 46] . At this point, most of these methods have not been compared to one another. A coarse-grained comparison suggests that some of the automatic methods may be able to compete with expert-driven annotations ( Fig. 1 B). None of the existing clustering and domain-dissection methods appears to solve the problems conclusively. If we assume that structural domains constitute one candidate for 'the atom of evolution', we may hope to find the 'final' solution some day. Lupas and colleagues [66] speculated that proteins evolved through inserting and deleting fragments that are more like Blocks [22] , PRINTS [24] , or PROSITE [19, 20, 21] motifs than like structural domains. 
If true, methods that dissect proteins into domains based on sequence similarity alone may be doomed to fail. Additional information, such as predicted secondary structure, may be needed to determine the domain border. One point is clear: We urgently need better tools to dissect proteins into domains and to cluster these domains.
Thanks to Henry Bigelow (Columbia University) for helpful comments and for critical proofreading. JL and BR were supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health (NIH). Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.
| 1. | Henikoff, S., Henikoff, J. G. & Pietrokovski, S. (1999). Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics, 15, 471-9. |
| 2. | Tatusov, R. L., Natale, D. A., Garkavtsev, I. V., Tatusova, T. A., Shankavaram, U. T. et al. (2001). The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucl. Acids Res., 29, 22-8. |
| 3. | Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E. et al. (2001). The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucl. Acids Res., 29, 37-40. |
| 4. | Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A. et al. (2002). InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform, 3, 225-35. |
| 5. | Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242. |
| 6. | Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L. et al. (2002). The Pfam protein families database. Nucl. Acids Res., 30, 276-80. |
| 7. | Letunic, I., Goodstadt, L., Dickens, N. J., Doerks, T., Schultz, J. et al. (2002). Recent improvements to the SMART domain-based sequence annotation resource. Nucl. Acids Res., 30, 242-4. |
| 8. | Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-48. |
| 9. | Haft, D. H., Loftus, B. J., Richardson, D. L., Yang, F., Eisen, J. A. et al. (2001). TIGRFAMs: a protein family resource for the functional identification of proteins. Nucl. Acids Res., 29, 41-3. |
| 10. | O'Donovan, C., Martin, M. J., Gattiker, A., Gasteiger, E., Bairoch, A. et al. (2002). High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform, 3, 275-84. |
| 11. | Moore, G. (1965). Cramming more components onto integrated circuits. Electronics, 38. |
| 12. | Rost, B. (1998). Marrying structure and genomics. Structure, 6, 259-263. |
| 13. | Moore, G. & Dillon, P. (2002). Chip "law" expands beyond its creator's wildest expectations. Forbes, 38. |
| 14. | Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922-933. |
| 15. | Rost, B. (2002). Did evolution leap to create the protein universe? Curr. Opin. Str. Biol., 12, 409-416. |
| 16. | Rost, B. (2002). Enzyme function less conserved than anticipated. J. Mol. Biol., 318, 595-608. |
| 17. | Nair, R., Carter, P. & Rost, B. (2002). NLSdb: database of nuclear localization signals. Nucl. Acids Res., in press. |
| 18. | Nagano, N., Orengo, C. & Thornton, J. (2002). One Fold with Many Functions: The Evolutionary Relationships between TIM Barrel Families Based on their Sequences, Structures and Functions. J. Mol. Biol., 321, 741. |
| 19. | Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. (1999). The PROSITE database, its status in 1999. Nucl. Acids Res., 27, 215-219. |
| 20. | Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C. J. et al. (2002). The PROSITE database, its status in 2002. Nucl. Acids Res., 30, 235-8. |
| 21. | Sigrist, C. J., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L. et al. (2002). PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform, 3, 265-274. |
| 22. | Henikoff, J. G., Greene, E. A., Pietrokovski, S. & Henikoff, S. (2000). Increased coverage of protein families with the blocks database servers. Nucl. Acids Res., 28, 228-30. |
| 23. | Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147, 195-7. |
| 24. | Attwood, T. K., Blythe, M. J., Flower, D. R., Gaulton, A., Mabey, J. E. et al. (2002). PRINTS and PRINTS-S shed light on protein ancestry. Nucl. Acids Res., 30, 239-41. |
| 25. | Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J. et al. (2002). ProDom: automated clustering of homologous domains. Brief Bioinform, 3, 246-51. |
| 26. | Gracy, J. & Argos, P. (1998). DOMO: a new database of aligned protein domains. TIBS, 23, 495-7. |
| 27. | Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. & Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucl. Acids Res., 30, 264-7. |
| 28. | Orengo, C. A., Bray, J. E., Buchan, D. W., Harrison, A., Lee, D. et al. (2002). The CATH protein family database: A resource for structural and functional annotation of genomes. Proteomics, 2, 11-21. |
| 29. | Dietmann, S. & Holm, L. (2001). Identification of homology in protein structure classification. Nat. Struct. Biol., 8, 953-957. |
| 30. | Marchler-Bauer, A., Panchenko, A. R., Ariel, N. & Bryant, S. H. (2002). Comparison of sequence and structure alignments for protein domains. Proteins, 48, 439-446. |
| 31. | Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J. Mol. Biol., 301, 679-689. |
| 32. | Schaffer, A. A., Aravind, L., Madden, T. L., Shavirin, S., Spouge, J. L. et al. (2001). Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucl. Acids Res., 29, 2994-3005. |
| 33. | Hamosh, A., Scott, A. F., Amberger, J., Bocchini, C., Valle, D. et al. (2002). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucl. Acids Res., 30, 52-5. |
| 34. | Vlahovicek, K., Murvai, J., Barta, E. & Pongor, S. (2002). The SBASE protein domain library, release 9.0: an online resource for protein domain identification. Nucl. Acids Res., 30, 273-5. |
| 35. | Wu, C. H., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y. et al. (2002). The Protein Information Resource: an integrated public resource of functional annotation of proteins. Nucl. Acids Res., 30, 35-7. |
| 36. | Silverstein, K. A., Shoop, E., Johnson, J. E. & Retzel, E. F. (2001). MetaFam: a unified classification of protein families. I. Overview and statistics. Bioinformatics, 17, 249-61. |
| 37. | Portugaly, E., Kifer, I. & Linial, M. (2002). Selecting targets for structural determination by navigating in a graph of protein families. Bioinformatics, 18, 899-907. |
| 38. | Krause, A., Haas, S. A., Coward, E. & Vingron, M. (2002). SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein. Nucl. Acids Res., 30, 299-300. |
| 39. | Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin., 12, 85-94. |
| 40. | Nair, R. & Rost, B. (2002). Sequence conserved for sub-cellular localization. Prot. Sci., in press. |
| 41. | Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol., 307, 1113-1143. |
| 42. | Devos, D. & Valencia, A. (2001). Intrinsic errors in genome annotation. TIGS, 17, 429-431. |
| 43. | Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-402. |
| 44. | Enright, A. J. & Ouzounis, C. A. (2000). GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451-7. |
| 45. | Yona, G., Linial, N. & Linial, M. (2000). ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucl. Acids Res., 28, 49-55. |
| 46. | Enright, A. J., Van Dongen, S. & Ouzounis, C. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucl. Acids Res., 30, 1575-84. |
| 47. | Bolten, E., Schliep, A., Schneckener, S., Schomburg, D. & Schrader, R. (2001). Clustering protein sequences--structure prediction by transitive homology. Bioinformatics, 17, 935-41. |
| 48. | Kriventseva, E. V., Fleischmann, W., Zdobnov, E. M. & Apweiler, R. (2001). CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucl. Acids Res., 29, 33-6. |
| 49. | Yona, G. & Levitt, M. (2002). Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol., 315, 1257-1275. |
| 50. | Sasson, O., Linial, N. & Linial, M. (2002). The metric space of proteins-comparative study of clustering algorithms. Bioinformatics, 18 Suppl 1, S14-21. |
| 51. | Kulikowski, C. A., Muchnik, I., Yun, H. J., Dayanik, A. A., Zhang, D. et al. (2001). Protein structural domain parsing by consensus reasoning over multiple knowledge sources and methods. Medinfo, 10, 965-969. |
| 52. | George, R. A. & Heringa, J. (2002). Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins, 48, 672-81. |
| 53. | George, R. A. & Heringa, J. (2002). SnapDRAGON: a method to delineate protein structural domains from sequence data. J. Mol. Biol., 316, 839-851. |
| 54. | Wheelan, S. J., Marchler-Bauer, A. & Bryant, S. H. (2000). Domain size distributions can predict domain boundaries. Bioinformatics, 16, 613-618. |
| 55. | Murvai, J., Vlahovicek, K., Szepesvari, C. & Pongor, S. (2001). Prediction of protein functional domains from sequences using artificial neural networks. Genome Res., 11, 1410-1417. |
| 56. | Miyazaki, S., Kuroda, Y. & Yokoyama, S. (2002). Characterization and prediction of linker sequences of multi-domain proteins by a neural network. Journal of Structural and Functional Genomics, 2, 37-51. |
| 57. | Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P. et al. (2001). Intrinsically disordered protein. J Mol Graph Model, 19, 26-59. |
| 58. | Liu, J., Tan, H. & Rost, B. (2002). Loopy proteins appear conserved in evolution. J. Mol. Biol., 322, 53-64. |
| 59. | Carter, P., Liu, J. & Rost, B. (2002). PEP: Predictions for Entire Proteomes. Nucl. Acids Res., in press. |
| 60. | Heger, A. & Holm, L. (2001). Picasso: generating a covering set of protein family profiles. Bioinformatics, 17, 272-9. |
| 61. | Abascal, F. & Valencia, A. (2002). Clustering of proximal sequence space for the identification of protein families. Bioinformatics, 18, 908-21. |
| 62. | Montelione, G. T. (2001). Structural genomics: an approach to the protein folding problem. Proc Natl Acad Sci U S A, 98, 13488-9. |
| 63. | Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nat. Struct. Biol., 8, 559-566. |
| 64. | Frishman, D. (2002). Knowledge-based selection of targets for structural genomics. Prot. Engin., 15, 169-183. |
| 65. | Hurley, J. H., Anderson, D. E., Beach, B., Canagarajah, B., Ho, Y. S. et al. (2002). Structural genomics and signaling domains. TIBS, 27, 48-53. |
| 66. | Lupas, A. N., Ponting, C. P. & Russell, R. B. (2001). On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol., 134, 191-203. |
| Contact: rost@columbia.edu | Version: Nov 27, 2002 |