
Title: Did evolution leap to create the protein universe?
Author:Burkhard Rost
Quote: B Rost (2002) Current Opinion in Structural Biology, Vol 12:409-416

Did evolution leap to create the protein universe?

Burkhard Rost 1, 2, *

1    CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2    Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
*     rost@columbia.edu http://cubic.bioc.columbia.edu/      Tel: +1-212-305-3773, fax: +1-212-305-7932

Abstract

Over 60 organisms from all three kingdoms of life are now entirely sequenced. In many respects the inventory of proteins used in different kingdoms appears surprisingly similar. However, eukaryotes differ from other kingdoms in that they use many long proteins and have more proteins with coiled-coil helices and with long regions lacking regular secondary structure. Particular structural domains are used in many pathways. Nevertheless, one domain tends to occur only once in any particular pathway. Many proteins may not have close homologues in different species (orphans) and there may even be folds that are specific to one species. This view implies that protein fold space is discrete. An alternative model suggests that structure space is continuous in that modern proteins evolved by aggregating fragments of ancient proteins. Either way, after having harvested proteomes by applying standard tools, the challenge now seems to be to develop better methods for comparative proteomics.

 


Introduction

Natura non facit saltus (Nature does not make leaps)
Attributed to Tito Lucrezio Caro (Titus Lucretius Carus), 34 ?;
Gottfried Wilhelm Leibniz, 1698; Charles Darwin, 1859

Some perceive New York City as a place jammed with humans. However, for those who live here, the metropolis feels like a village because life evolves around neighbourhoods. Scientists explaining our behaviour from a 'systems' perspective may argue that we belong to various groups defined by our address, our work, or even the shape of our noses. We may feel that such classifications do not explain who we really are. By analogy, we may not expect that catalogues of entirely sequenced organisms convey understanding of protein structure, function, or evolution. The literature bursts with examples of detailed studies of how protein structure and function co-evolve [1, 2, 3, 4, 5, 6, 7, 8, 9]. However, while most of these analyses utilise the wealth of biological data, they are not explicitly based on the fact that we have the entire sequences from representatives of all three kingdoms of life: eukaryotes, prokaryotes and archaea. What do we learn from generating lists of parts of the whole [10, 11]? Here, I focus on some findings from tools that strive to capture overall features for entire organisms. I contend that bioinformatics is slowly but steadily approaching the point at which we can smoothly move through neighbourhoods of protein relations in order to generate an atlas of the fates and functions of proteins in the context of the cell.

 


Overview over proteomes: catalogues of structure and function

Eukaryotes have many very long proteins. Genomes differ significantly in their nucleotide composition [12]. In contrast, the amino acid compositions of the entire proteomes for 28 organisms from all three kingdoms are similar [13] (Fig. 1A). The three kingdoms differ significantly in protein length (Fig. 1B); in particular, about 7% of the eukaryotic proteins are longer than 1000 residues, while less than 2% of all proteins in archaea and prokaryotes are that long. Many proteins are overlooked in the initial annotation of genome projects, as we realise from the annual growth of the estimated number of proteins in worm [14], yeast [15], or human. Some groups specialise in hunting for overlooked proteins [16]. However, the number of proteins in microbial genomes appears significantly over-estimated [12]. If so, the differences in the protein length distributions for short proteins may not hold up (Fig. 1B). However, the corrections in the number of proteins suggested by Skovgaard, Krogh and colleagues [12] alter the distributions for long proteins only marginally (data not shown). Some of the incorrectly annotated short proteins may actually be short RNAs [17]; others may be pseudo-genes [18].
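The two length statistics discussed above (the fraction of proteins longer than 1000 residues, and a length distribution binned in intervals of ten residues) can be sketched as follows. This is only a minimal illustration: the function names and the toy list of lengths are hypothetical and are not taken from the analysis in [13].

```python
from collections import Counter


def fraction_longer_than(lengths, cutoff=1000):
    """Return the fraction of proteins strictly longer than `cutoff` residues."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > cutoff) / len(lengths)


def length_histogram(lengths, bin_width=10):
    """Count proteins per length bin; each key is the lower edge of its bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


# Hypothetical toy proteome (protein lengths in residues), not real data.
toy_lengths = [95, 101, 108, 350, 352, 1200, 2400, 480, 77, 1001]

if __name__ == "__main__":
    print(fraction_longer_than(toy_lengths))
    print(sorted(length_histogram(toy_lengths).items()))
```

Applied per kingdom to real proteome files, the same two functions would give the kind of numbers compared in the text; the cut-off of 1000 residues and the bin width of ten are the values quoted above.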



Fig. 1: Comparing proteomes.
(A) Amino acid usage in each of the three kingdoms: the height of the letter is proportional to the frequency of the respective amino acid (overall percentages given on the right). The lower rows show the number of codons for each of the amino acids and the age-rank (1 is oldest, 20 newest) as estimated by Edward Trifonov [88] . The variation of the composition within the kingdoms is as insignificant as that between the kingdoms [13] .
(B) Distribution of protein length: the data is binned in intervals of ten residues. The inset gives the cumulative length; it illustrates the significantly higher proportion of long proteins in eukaryotes.
(C) Note that the lines give the spread between the minima and maxima found in the respective kingdom. Left panel: percentage of proteins for which we can predict structure through comparative modelling. Three types of models are distinguished by applying different cut-offs for the required level of sequence similarity [89]: at levels above 70% pairwise sequence identity, models reach an average accuracy around 2 Å rmsd ('Good models', dark blue bars). At levels of 32% pairwise sequence identity over 100 residues (HSSP-distance of 0 [13]), models differ by about 3-5 Å ('OK models', light blue bars), and at a PSI-BLAST threshold of 10^-3, most models identify the fold correctly ('Model gives fold', blue-gray bars). Right panel: percentage of proteins predicted with coiled-coil helices (green bars), transmembrane helices (brown bars), signal peptides (yellow bars, note: all these proteins are extra-cellular), and long regions without regular secondary structure (NORS regions [51], pinkish bars). A and B are based on an analysis of 238,326 proteins in 63 complete proteomes (12 archaea, 46 prokaryotes and 5 eukaryotes [90]); the graphs in C are based on 30 proteomes as given in [13].




Despite the complete set of sequences, comparisons remain guesses. Comparing proteomes based on residue composition or protein length constitutes an extremely dumb realisation of comparative genomics. Unfortunately, we cannot benefit from the completeness of sequenced organisms for any other overview features. Statements about protein families, folds, and functions are either restricted to some arbitrary subset of proteins for which we can infer features by homology or limited by the accuracy of prediction methods. Consider the idea of investigating whether or not particular folds are used more often in some organisms than in others. Firstly, we know neither which fraction of all existing folds we know already (estimates range from 10-50% [19, 3, 13, 20, 21]), nor whether the types of folds we know are representative. Secondly, we can, at best, infer the folds for half of all proteins in entirely sequenced organisms (Fig. 1C) [22, 23, 24, 11, 5, 20, 25, 26]. Thirdly, we have no way to predict novel folds in the context of entire proteomes [1]. It is much easier to infer aspects of structure from sequence similarity than to infer aspects of function. Thus, the classification of proteomes by function relies even more on guesses than that of structure.

Popular folds also most populated in proteomes. Mark Gerstein pioneered the structural census of proteomes [27, 11] by analysing which folds are most often used in organisms. When analysing the universe of protein sequences, we observe that some types of proteins belong to larger families than others [28, 29, 30, 31, 32, 13, 33] . As expected, folds with large families also dominate proteomes [34, 3, 10, 35, 36, 11, 37, 9] . The major differences between the types of proteins that populate the largest sequence- and the largest fold-based families resulted from the under-representation of membrane proteins in known structures [23, 25] .

Kingdoms have similar percentages but different types of membrane proteins. Contrary to speculations prior to completing the sequences of two animals and one plant, it seems that the percentage of membrane helical proteins differs less between multi- and uni-cellular organisms than between different organisms within each of the three kingdoms [13]. Overall, about 16-26% of all proteins have membrane helices [13] (Fig. 1C). Most of the membrane helical proteins have fewer than four helices, and about half of all membrane proteins have no globular regions of considerable length [13]. One reason why membrane helix predictions are so valuable in the context of proteomes is that the number of helices is typically related to the type of protein and its function. While proteins with seven transmembrane helices (G protein-coupled receptors, 7-TM) are significantly over-represented in worm and human, proteins with six and 12 helices (transporters) are over-represented in most prokaryotes [13]. Surprisingly, we found relatively few 7-TM proteins in fly and many in worm. This finding may be explained by the immense difference in the number of olfactory receptors alone: the worm appears to contain 1000 smell receptors, whereas the fly has fewer than 100 [38]. Interestingly, the conservation of protein type does not always span all kingdoms: families spanning all kingdoms do not necessarily have the same number of membrane helices, suggesting that proteins can add or remove helices over the course of evolution [13]. While many methods predict alpha-helical membrane proteins, few methods have addressed the prediction of proteins that insert beta-strand barrels into the membrane [39]. Recently, methods have reached reasonable levels of accuracy [40]. Furthermore, a simple statistical model predicted about 105 potential beta-barrel membrane proteins in the two Gram-negative bacteria Escherichia coli and Pseudomonas aeruginosa [41].

Coiled-coil proteins are significantly over-represented in eukaryotes. Secondary structure correlates with function [42] , and we can learn about evolution from a taxonomy of secondary structure [43] . However, the content of secondary structure is only of limited value in context of comparative proteomics [44] . One exception is the presence of coiled-coil helices: eukaryotes appear to have significantly more coiled-coil proteins than all other kingdoms ( Fig. 1 C) [13] . Proteins with coiled-coil regions are often insoluble because such regions are often responsible for aggregation; they often indicate structural proteins. However, the high fraction in eukaryotes may originate from the role of coiled-coil regions in protein-DNA interactions and in regulation, transcription and translation. The problem with this explanation is that while we tend to assume that eukaryotes utilise a more complex machinery to control their protein repository, we have no solid data confirming this assumption on the scale of the entire proteome (see below).

Eukaryotes have significantly more 'loopy' proteins. The most outstanding difference between eukaryotes and other kingdoms in terms of protein structure is the high fraction of proteins that appear unlike typical globular structures. It has long held true that proteins fold into a unique three-dimensional structure and that this structure determines protein function. Over the last years evidence has gathered that we have to re-assess this paradigm of structural biology: many long regions appear essentially unstructured in isolation [45, 46] . Such regions may introduce particular flexibility in that they could adopt different shapes through binding (induced fit) [47, 48, 49, 50] . We have recently analysed proteins with long regions (>70 residues) that appear to have no regular secondary structure (NORS regions; [51] ). Confirming neural network-based predictions of disordered regions [45, 52] , we found that eukaryotes had on average 3-5 times more proteins with NORS regions than organisms from other kingdoms ( Fig. 1 C) [51] . Many 'loopy' proteins appear involved in gene regulation. However, experimental results are needed to shed more light on this new class of proteins.

Too many functionally unclassified proteins hampered comparing function. Using EUCLID [53], we could classify about 45-65% of all proteins from 30 complete proteomes into one of 13 classes of cellular function at a level reported to yield about 70% correct classifications [54, 55]. When grouping the 13 classes into three super-classes (energy, information, and communication), we found similar compositions within the archaeal and eukaryotic kingdoms [13]. In contrast, the composition varied significantly between prokaryotic organisms [13]. In detail, we found the following differences: proteins in biosynthesis and energy metabolism were abundant in prokaryotes, whereas human seemed to have a larger portion of the classes related to transport, binding, and regulatory functions [13]. The significant variations between prokaryotic proteomes may reflect the very different environments in which these organisms dwell. However, the most important result is that even when accepting classification errors of 30% or more, we can still classify only about half of all proteins. Thus, conclusions about the meaning of the relative proportions remain highly speculative, at best. We also could not verify earlier findings that the subset of proteins with homology to known structures differs in cellular function from proteins of unknown structure (J. Liu & B. Rost, unpublished).

 


Detailed analysis of evolution: domains, pathways, and orphans

Working hypothesis: domains constitute the atoms of protein structures. Structural genomics aims at experimentally determining one structure for every fold in nature [56, 57, 5, 58]. Proteins of unknown folds are identified by clustering protein sequences [28, 59, 3, 20, 26]. Intuitively, the goal is to group proteins with similar 'structural elements' and to separate these clusters from proteins not containing those structural elements. What is the smallest structural element? Structural biologists tend to take structural domains for the 'atom of structure'. Thus, proteins have to be chopped into domains before clustering the protein universe; all methods that do this use evolutionary relations and thereby implicitly connect ‘atom of structure’ and ‘atom of evolution’ [60, 61, 29, 62, 63, 33, 64, 37, 9, 65]. The principal assumption is that if protein A is similar to B and to C, but B is not similar to C, then B and C constitute domains of A. Applying this scheme to all complete eukaryotes, we found over 17000 'domain-like' fragment clusters [26]. This lower bound based on complete data generally confirmed earlier estimates based on extrapolations from representative data sets [20]. Even if structural domains are not the atoms of evolution, analyses based on domains are more accurate than those based on entire sequences.
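The assumption stated above (A is similar to B and to C, but B is not similar to C, hence B and C mark different domains of A) can be written down as a simple search over a table of pairwise similarities. The sketch below is only an illustration of that rule, not the clustering procedure of [26]; the function name, the symmetric similarity table, and the toy identifiers are all hypothetical.

```python
from itertools import combinations


def domain_like_fragments(similar):
    """Apply the rule 'A~B, A~C, but not B~C' to a similarity table.

    `similar` maps each protein id to the set of ids it is similar to
    (assumed symmetric, without self-hits). The result maps a protein A to
    the list of pairs (B, C) that suggest two distinct domains in A.
    """
    result = {}
    for a, partners in similar.items():
        pairs = [
            (b, c)
            for b, c in combinations(sorted(partners), 2)
            if c not in similar.get(b, set())
        ]
        if pairs:
            result[a] = pairs
    return result


# Hypothetical toy similarity table, not real data.
toy_similar = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A"},
    "D": {"E"},
    "E": {"D"},
}

if __name__ == "__main__":
    print(domain_like_fragments(toy_similar))
```

In practice the similarity table would come from alignments that also record which region of A each hit covers; the sketch ignores alignment positions and only captures the logical rule.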

Domain combinations are evolutionarily conserved. Like many other biological relations, domain family relations follow scale-free network relations [66, 67, 68, 69, 36, 70, 71, 72], i.e., these relations are explained by statistical models that do not require assumptions about biology. In particular, only large families engage in many types of domain combinations, while small families engage in only a few types of domain combinations. The majority of domain combinations AB involve families of domains A and B spanning across all kingdoms of life [66, 67]. Teichmann and colleagues suggest that evolution creates novel functions predominantly by combining existing domains [66, 67]. There are more repeats of similar domains adjacent to one another in eukaryotes than in other kingdoms [66, 67]; the most extreme example of this is the giant protein titin [73]. Interestingly, the sequential order of different domains appears evolutionarily conserved [66, 67, 74].
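One way to look at how many types of combinations each domain family engages in is to count, for every family, the number of distinct families found adjacent to it along protein sequences. The sketch below assumes that each protein is given as an ordered list of family names and that order is ignored when counting partners; both the function name and the toy architectures are hypothetical, and the cited studies [66, 67] may define combinations differently.

```python
from collections import defaultdict


def combination_partners(architectures):
    """Count distinct neighbouring families for each domain family.

    `architectures` is a list of proteins, each an ordered list of domain
    family names. Adjacent repeats of the same family are not counted as a
    combination.
    """
    partners = defaultdict(set)
    for arch in architectures:
        for left, right in zip(arch, arch[1:]):
            if left != right:
                partners[left].add(right)
                partners[right].add(left)
    return {family: len(found) for family, found in partners.items()}


# Hypothetical toy architectures, not real data.
toy_architectures = [
    ["A", "B"],
    ["A", "C"],
    ["A", "D"],
    ["D", "D", "A"],
    ["E", "A"],
]

if __name__ == "__main__":
    print(combination_partners(toy_architectures))
```

In a scale-free network, a histogram of these partner counts over all families would be dominated by families with one or two partners, with a few hubs such as family "A" in the toy data.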

Mapping domains onto pathways suggests the image of a mosaic. Although we have information about structure for less than half of all proteins in proteomes, almost 90% of the enzymes from the 106 small molecular metabolic pathways in Escherichia coli have domains of known structure [75, 76]. A particular fold is typically used only once in a given pathway. In other words, more homologues are distributed across pathways than within pathways. Interestingly, 75% of all enzymes in metabolic pathways of Escherichia coli appear to be enzymes known to catalyse a single enzymatic reaction, and the majority of enzymes used in any metabolic pathway are specific to that particular pathway [77]. The authors concluded that pathways use enzyme mosaics [75, 76], i.e. they are taken from a limited set of protein families and there are no discernible repetitions. As established in an excellent study of the structural conservation of enzymatic activity [7], enzyme families of small molecule pathways also often conserve their catalytic or cofactor binding properties, while substrate recognition seems rarely conserved [75, 76]. About half of all protein-protein interactions are between domains from the same family, and the only groups of proteins for which two different domains interact within and between proteins appear to be enzymes and proteins with domains interacting within one family [69]. Obviously, the ultimate goal of analysing proteomes is to learn ways of refining our database searches. One particular application that uses the completeness of proteomes combines sequence analysis with structure prediction to find all disulfide oxidoreductases in yeast [78].
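The statement that more homologues are distributed across pathways than within pathways amounts to counting pairs of enzymes from the same family and asking whether the two members share a pathway. A minimal sketch follows, assuming each enzyme is assigned to exactly one pathway and one family (a simplification; real enzymes can take part in several pathways and contain several domains). The function name and the toy records are hypothetical.

```python
from itertools import combinations


def homologue_pair_counts(enzymes):
    """Count same-family enzyme pairs within and across pathways.

    `enzymes` is a list of (enzyme_id, pathway, family) tuples.
    Returns a tuple (within, across).
    """
    within = 0
    across = 0
    for (_, pathway1, family1), (_, pathway2, family2) in combinations(enzymes, 2):
        if family1 != family2:
            continue  # not homologues under this simplified definition
        if pathway1 == pathway2:
            within += 1
        else:
            across += 1
    return within, across


# Hypothetical toy assignments, not real data.
toy_enzymes = [
    ("e1", "glycolysis", "TIM barrel"),
    ("e2", "glycolysis", "Rossmann"),
    ("e3", "tryptophan", "TIM barrel"),
    ("e4", "tryptophan", "Rossmann"),
    ("e5", "histidine", "TIM barrel"),
    ("e6", "histidine", "TIM barrel"),
]

if __name__ == "__main__":
    print(homologue_pair_counts(toy_enzymes))
```

A mosaic-like use of families shows up as a small first number relative to the second.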

Some folds may have been realised only once in nature. Although the term ‘fold’ is not well defined, it intuitively refers to subunits from 30 to over 700 residues long that let structural biologists recognise a particular protein structure. A few protein folds are used by many different protein sequences; they are often referred to as super-folds [37, 9]. Certain folds are energetically more favourable than others [79]. However, the most surprising result from the advent of genome sequencing is the observation that the number of proteins that have no homologue of very similar sequence is constantly rising, i.e. each species uses some very specific proteins [22], often referred to as orphans. If we believe that cross-species evolution was a major event, we may argue that we simply fail to recognise the similarity between a particular kinase in Aquifex and human and therefore incorrectly classify that kinase as an orphan. In other words, we may argue that there is another protein that adopts the same fold, thus using a similar mechanism to realise function, but that we simply fail to find it since it has diverged too far in evolution. However, Coulson & Moult [21] recently proposed a somewhat shocking conclusion: most folds are specific to one species, i.e. the Aquifex and the human kinase have different structures. They propose a model that assumes three separated regions: uni-folds (realised only once in nature), super-folds (repeated many times), and meso-folds (between uni and super). Coulson & Moult estimate that there are over 10,000 folds in nature. Most of these are uni-folds corresponding to orphan families. Note that this estimate is about three [13] to ten times [19] higher than previous estimates not considering the reality of orphans. At the other extreme, the model suggests that 80% of all sequence families adopt one of 400 super-folds, most of which are already known. If this proposition were true, we could speed up structural genomics significantly by identifying the corresponding super-folds for the proposed tens of thousands of targets [20, 26].

Domains may not constitute the evolutionary atom. Did nature really separate three fold types (uni/meso/super), or are the separations based on a lack of the complete picture? If we believe that structural domains constitute the atoms of evolution, the concept of 'folds' and of three types of distinct folds appears reasonable. However, there is evidence that the working hypothesis of folds or structural domains at the basis of evolution may not be the last word. First, the structurally most conserved regions often appear to be skeletons of active sites [80]. Technically, we can explore this observation by searching known structures with such 3D motifs in order to find similarities obfuscated in sequence [6]. Second, the attempt to cluster sequence space based on putative structural domains results in large clusters of proteins that are connected through a ladder of 10-30 overlapping residues [26]. Third, particular stretches of 10-40 residue fragments are observed often in protein structures [81]. This leads to successes in predicting protein structure based on such fragments [1, 82, 83]. All these findings may be explained by a rather challenging hypothesis argued for in detail by Andrei Lupas, Chris Ponting and Rob Russell [84]: the diversity of today's folds might have evolved from peptide ancestors referred to as 'antecedent domain segments' (ADS). The authors explain how ancient protein structures could have been formed by self-assembling aggregates of short polypeptides. They speculate that subsequently, and perhaps concomitantly with the evolution of higher fidelity DNA replication and repair systems, single polypeptide domains arose from the fusion of ADS genes. While the authors provide ample details for the feasibility of their assumptions, we may never be able to falsify their model. Clearly, however, the hypothesis explains why it is so difficult to find similarities, and why the same functional motif is often realised by many structures. The model implies that some modern proteins may have evolved by fusing multiple ADSs or by recombining domains that contain structurally compatible ADSs; these proteins are essentially of poly-phyletic origin. Thus, we would also understand why phylogenetic trees do not always agree [85, 34]. We find many internal repeats that are shorter than structural domains; such repeats may constitute an evolutionary advantage in that they can adopt many functions easily [86]. The study of such repeats supports the ADS-model [86]. Overall, the ADS-model is appealing in the number of observations it explains given a minimum set of assumptions. Unfortunately, we still have to technically solve the difficult problem of identifying these antecedent domain segments from sequence, and sequences drift easily, thus possibly erasing the ancient signal…


Conclusions

Is protein structure and/or sequence space continuous, or has nature leaped when inventing folds and functions? If proteins were assembled from fragments, does this imply modularity of sequences and folds, as e.g. seen in short peptide fragments that regulate the targeting of proteins through the cell? Does the existence of short motifs or modules imply fragment assembly? I doubt that we have data unambiguously answering these questions. In fact, the evidence from analyses of entirely sequenced organisms is equally spread between pro and con 'natura non facit saltus'.

The age of comparative proteomics has just begun. Researchers have already harvested many fruits by combining tools that had been developed in the last decade. Most papers reviewed here give examples for combining state-of-the-art methods and databases to explore protein function, structure, and evolution. Today the network of databases and methods generated by computational biology and bioinformatics approaches the complexity of organisms. However, we are still a long way from an atlas mapping the activity of a cell in terms of space (localization) and time (interaction history of each protein). Kenta Nakai recently reviewed tools that capture some aspects of the 'in vivo fates' of proteins [87]. In a more abstract way, we can describe this objective by the following concept. (1) Describe all proteins by neighbourhoods in terms of sequence families, structural families, sequence motifs, functional classes, pathways, expression profiles, interaction-networks, and sub-cellular compartments. (2) Extend each similarity measure to a measure of distance. (3) Combine these distances to enable a database search that simultaneously considers multiple classes of neighbourhoods to find similarities between two proteins. If today's proteins really evolved by fragment-assembly [84], methods that merge different features will be essential for comparative proteomics.
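Steps (2) and (3) of the concept above can be illustrated with a small sketch. It assumes, purely for illustration, that each neighbourhood (sequence family, structural family, sub-cellular compartment, and so on) already provides a similarity between 0 and 1, that distance is taken as one minus similarity, and that distances are merged by a weighted mean. None of these choices comes from the text; the function names, weights, and toy values are hypothetical.

```python
def similarity_to_distance(similarity):
    """Step (2): turn a similarity in [0, 1] into a distance in [0, 1]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity


def combined_distance(similarities, weights):
    """Step (3): weighted mean distance over the neighbourhoods both dicts share."""
    shared = [name for name in similarities if name in weights]
    total_weight = sum(weights[name] for name in shared)
    if total_weight <= 0:
        raise ValueError("need at least one neighbourhood with positive weight")
    weighted = sum(
        weights[name] * similarity_to_distance(similarities[name])
        for name in shared
    )
    return weighted / total_weight


def rank_neighbours(candidates, weights):
    """Sort candidate protein ids by combined distance to the query, closest first."""
    scored = [
        (combined_distance(similarities, weights), protein_id)
        for protein_id, similarities in candidates.items()
    ]
    return [protein_id for _, protein_id in sorted(scored)]


# Hypothetical toy similarities of three candidates to one query protein.
toy_weights = {"sequence": 1.0, "structure": 1.0, "localisation": 1.0}
toy_candidates = {
    "P1": {"sequence": 0.9, "structure": 0.9, "localisation": 1.0},
    "P2": {"sequence": 0.1, "structure": 0.8, "localisation": 1.0},
    "P3": {"sequence": 0.1, "structure": 0.2, "localisation": 0.0},
}

if __name__ == "__main__":
    print(rank_neighbours(toy_candidates, toy_weights))
```

In the toy data, P2 has almost no sequence similarity to the query but still ranks ahead of P3 because structure and localisation agree. A search that merges several neighbourhoods is meant to recover exactly this kind of relation.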


Acknowledgements

Thanks to Jinfeng Liu (Columbia) for computer assistance, the collection of genome data sets, and for providing preliminary data; to Henry Bigelow (Columbia) for helpful comments on the manuscript. The work of BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health. Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.


References

1. Baker, D. & Sali, A. (2001). Protein structure prediction and structural genomics. Science, 294, 93-96.
2. Di Gennaro, J. A., Siew, N., Hoffman, B. T., Zhang, L., Skolnick, J. et al. (2001). Enhanced functional annotation of protein sequences via the use of structural descriptors. J. Struct. Biol., 134, 232-245.
3. Dietmann, S. & Holm, L. (2001). Identification of homology in protein structure classification. Nat. Struct. Biol., 8, 953-957.
4. Gaasterland, T. & Oprea, M. (2001). Whole-genome analysis: annotations and updates. Curr. Opin. Str. Biol., 11, 377-381.
5. Teichmann, S. A., Murzin, A. G. & Chothia, C. (2001). Determination of protein function, evolution and interactions by structural genomics. Curr. Opin. Str. Biol., 11, 354-363.
6. Thornton, J. M. (2001). From genome to function. Science, 292, 2095-2097.
7. Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol., 307, 1113-1143.
8. Lichtarge, O. & Sowa, M. E. (2002). Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Str. Biol., 12, 21-27.
9. Orengo, C. A., Bray, J. E., Buchan, D. W., Harrison, A., Lee, D. et al. (2002). The CATH protein family database: A resource for structural and functional annotation of genomes. Proteomics, 2, 11-21.
10. Hegyi, H. & Gerstein, M. (2001). Annotation transfer for genomics: measuring functional divergence in multi-domain proteins. Genome Res., 11, 1632-1640.
11. Qian, J., Stenger, B., Wilson, C. A., Lin, J., Jansen, R. et al. (2001). PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information. Nucl. Acids Res., 29, 1750-1764.
12. Skovgaard, M., Jensen, L. J., Brunak, S., Ussery, D. & Krogh, A. (2001). On the total number of genes and their length distribution in complete microbial genomes. TIGS, 17, 425-428.
13. Liu, J. & Rost, B. (2001). Comparing function and structure between entire proteomes. Prot. Sci., 10, 1970-1979.
14. Stein, L., Sternberg, P., Durbin, R., Thierry-Mieg, J. & Spieth, J. (2001). WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucl. Acids Res., 29, 82-86.
15. Mewes, H. W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K. et al. (2002). MIPS: a database for genomes and protein sequences. Nucl. Acids Res., 30, 31-34.
16. Kumar, A., Harrison, P. M., Cheung, K. H., Lan, N., Echols, N. et al. (2002). An integrated approach for finding overlooked genes in yeast. Nat. Biotechnol., 20, 58-63.
17. Rivas, E., Klein, R. J., Jones, T. A. & Eddy, S. R. (2001). Computational identification of noncoding RNAs in E. coli by comparative genomics. Curr. Biol., 11, 1369-1373.
18. Harrison, P. M., Echols, N. & Gerstein, M. B. (2001). Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucl. Acids Res., 29, 818-830.
19. Wolf, Y. I., Grishin, N. V. & Koonin, E. V. (2000). Estimating the number of protein folds and families from complete genome data. J. Mol. Biol., 299, 897-905.
20. Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nat. Struct. Biol., 8, 559-566.
21. Coulson, A. F. & Moult, J. (2002). A unifold, mesofold, and superfold model of protein fold use. Proteins, 46, 61-71.
22. Fischer, D. & Eisenberg, D. (1999). Predicting structures for genome proteins. Curr. Opin. Str. Biol., 9, 208-211.
23. Gough, J., Karplus, K., Hughey, R. & Chothia, C. (2001). Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol., 313, 903-919.
24. Pawlowski, K., Rychlewski, L., Zhang, B. & Godzik, A. (2001). Fold predictions for bacterial genomes. J. Struct. Biol., 134, 219-231.
25. Gough, J. & Chothia, C. (2002). SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucl. Acids Res., 30, 268-272.
26. Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, in press.
27. Gerstein, M. & Levitt, M. (1997). A structural census of the current population of protein sequences. Proc. Natl. Acad. Sci. U.S.A., 94, 11911-11916.
28. Linial, M. & Yona, G. (2000). Methodologies for target selection in structural genomics. Prog. Biophys. molec. Biol., 73, 297-320.
29. Yona, G., Linial, N. & Linial, M. (2000). ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucl. Acids Res., 28, 49-55.
30. Bejerano, G. & Yona, G. (2001). Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17, 23-43.
31. Kriventseva, E. V., Biswas, M. & Apweiler, R. (2001). Clustering and analysis of protein families. Curr. Opin. Str. Biol., 11, 334-339.
32. Kriventseva, E. V., Fleischmann, W., Zdobnov, E. M. & Apweiler, R. (2001). CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucl. Acids Res., 29, 33-36.
33. Tatusov, R. L., Natale, D. A., Garkavtsev, I. V., Tatusova, T. A., Shankavaram, U. T. et al. (2001). The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucl. Acids Res., 29, 22-28.
34. Lin, J. & Gerstein, M. (2000). Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res., 10, 808-818.
35. Pearl, F. M., Martin, N., Bray, J. E., Buchan, D. W., Harrison, A. P. et al. (2001). A rapid classification protocol for the CATH domain database to support structural genomics. Nucl. Acids Res., 29, 223-227.
36. Qian, J., Luscombe, N. M. & Gerstein, M. (2001). Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J. Mol. Biol., 313, 673-681.
37. Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. & Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucl. Acids Res., 30, 264-267.
38.Vosshall, L. B., Wong, A. M. &Axel, R. (2000). An olfactory sensory map in the fly brain. Cell, 102, 147-159.
39. Schulz, G. E. (2000). beta-Barrel membrane proteins. Curr. Opin. Str. Biol., 10, 443-447.
40. Jacoboni, I., Martelli, P. L., Fariselli, P., De Pinto, V. & Casadio, R. (2001). Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor. Prot. Sci., 10, 779-787.
41. Wimley, W. C. (2002). Toward genomic identification of beta-barrel membrane proteins: Composition and architecture of known structures. Prot. Sci., 11, 301-312.
42. Andersen, C. A. F., Palmer, A. G., Brunak, S. & Rost, B. (2002). Continuous assignment of secondary structure correlates with protein flexibility. Structure, 10, 175-184.
43. Przytycka, T., Aurora, R. & Rose, G. D. (1999). A protein taxonomy based on secondary structure. Nat. Struct. Biol., 6, 672-682.
44. Rost, B. (2001). Protein secondary structure prediction continues to rise. J. Struct. Biol., 134, 204-218.
45. Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P. et al. (2001). Intrinsically disordered protein. J. Mol. Graph. Model., 19, 26-59.
46. Dunker, A. K. & Obradovic, Z. (2001). The protein trinity-linking function and disorder. Nat. Biotechnol., 19, 805-806.
47. Wright, P. E. & Dyson, H. J. (1999). Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol., 293, 321-331.
48. Uversky, V. N., Gillespie, J. R. & Fink, A. L. (2000). Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins, 41, 415-427.
49. Namba, K. (2001). Roles of partly unfolded conformations in macromolecular self-assembly. Genes Cells, 6, 1-12.
50. Zetina, C. R. (2001). A conserved helix-unfolding motif in the naturally unfolded proteins. Proteins, 44, 479-483.
51. Liu, J., Tan, H. & Rost, B. (2002). Eukaryotes full of loopy proteins? J. Mol. Biol., submitted.
52. Romero, P., Obradovic, Z., Li, X., Garner, E. C., Brown, C. J. et al. (2001). Sequence complexity of disordered protein. Proteins, 42, 38-48.
53. Tamames, J., Ouzounis, C., Casari, G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification of proteins in functional classes by their database annotations. Bioinformatics, 14, 542-543.
54. Devos, D. & Valencia, A. (2000). Practical limits of function prediction. Proteins, 41, 98-107.
55. Devos, D. & Valencia, A. (2001). Intrinsic errors in genome annotation. TIGS, 17, 429-431.
56. Shapiro, L. & Harris, T. (2000). Finding function through structural genomics. Curr. Opin. Biotech., 11, 31-35.
57. Brenner, S. E. (2001). A tour of structural genomics. Nat. Rev. Genet., 2, 801-809.
58. Thornton, J. (2001). Structural genomics takes off. TIBS, 26, 88-89.
59. Mallick, P., Goodwill, K. E., Fitz-Gibbon, S., Miller, J. H. & Eisenberg, D. (2000). Selecting protein targets for structural genomics of Pyrobaculum aerophilum: validating automated fold assignment methods by using binary hypothesis testing. Proc. Natl. Acad. Sci. U.S.A., 97, 2450-2455.
60. Corpet, F., Servant, F., Gouzy, J. & Kahn, D. (2000). ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucl. Acids Res., 28, 267-269.
61. Enright, A. J. & Ouzounis, C. A. (2000). GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451-457.
62. Heger, A. & Holm, L. (2001). Picasso: generating a covering set of protein family profiles. Bioinformatics, 17, 272-279.
63. Reddy, B. V., Li, W. W., Shindyalov, I. N. & Bourne, P. E. (2001). Conserved key amino acid positions (CKAAPs) derived from the analysis of common substructures in proteins. Proteins, 42, 148-163.
64. Wu, C. H., Xiao, C., Hou, Z., Huang, H. & Barker, W. C. (2001). iProClass: an integrated, comprehensive and annotated protein classification database. Nucl. Acids Res., 29, 52-54.
65. Yona, G. & Levitt, M. (2002). Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol., 315, 1257-1275.
66. Apic, G., Gough, J. & Teichmann, S. A. (2001). An insight into domain combinations. Bioinformatics, 17, S83-S89.
67. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol., 310, 311-325.
68. Lappe, M., Park, J., Niggemann, O. & Holm, L. (2001). Generating protein interaction maps from incomplete data: application to fold assignment. Bioinformatics, 17, S149-S156.
69. Park, J., Lappe, M. & Teichmann, S. A. (2001). Mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J. Mol. Biol., 307, 929-938.
70. Rzhetsky, A. & Gomez, S. M. (2001). Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics, 17, 988-996.
71. Wuchty, S. (2001). Scale-free behavior in protein domain networks. Mol. Biol. Evol., 18, 1694-1702.
72. Wolf, Y. I., Karev, G. & Koonin, E. V. (2002). Scale-free networks in biology: new insights into the fundamentals of evolution? Bioessays, 24, 105-109.
73. Amodeo, P., Fraternali, F., Lesk, A. M. & Pastore, A. (2001). Modularity and homology: modelling of the titin type I modules and their interfaces. J. Mol. Biol., 311, 283-296.
74. Bashton, M. & Chothia, C. (2002). The geometry of domain combination in proteins. J. Mol. Biol., 315, 927-939.
75. Teichmann, S. A., Rison, S. C., Thornton, J. M., Riley, M., Gough, J. et al. (2001). The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J. Mol. Biol., 311, 693-708.
76. Teichmann, S. A., Rison, S. C., Thornton, J. M., Riley, M., Gough, J. et al. (2001). Small-molecule metabolism: an enzyme mosaic. TIBTECH, 19, 482-486.
77. Tsoka, S. & Ouzounis, C. A. (2001). Functional versatility and molecular diversity of the metabolic map of Escherichia coli. Genome Res., 11, 1503-1510.
78. Fetrow, J. S., Siew, N., Di Gennaro, J. A., Martinez-Yamout, M., Dyson, H. J. et al. (2001). Genomic-scale comparison of sequence- and structure-based methods of function prediction: does structure provide additional insight? Prot. Sci., 10, 1005-1014.
79. Rykunov, D. S., Lobanov, M. Y. & Finkelstein, A. V. (2000). Search for the most stable folds of protein chains: III. Improvement in fold recognition by averaging over homologous sequences and 3D structures. Proteins, 40, 494-501.
80. Irving, J. A., Whisstock, J. C. & Lesk, A. M. (2001). Protein structural alignments and functional genomics. Proteins, 42, 378-382.
81. Bystroff, C., Thorsson, V. & Baker, D. (2000). HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol., 301, 173-190.
82. Jones, D. T. (2001). Predicting novel protein folds by using FRAGFOLD. Proteins, 45 Suppl 5, S127-S132.
83. de La Cruz, X., Sillitoe, I. & Orengo, C. (2002). Use of structure comparison methods for the refinement of protein structure predictions. I. Identifying the structural family of a protein from low-resolution models. Proteins, 46, 72-84.
84. Lupas, A. N., Ponting, C. P. & Russell, R. B. (2001). On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol., 134, 191-203.
85. Grishin, N. V., Wolf, Y. I. & Koonin, E. V. (2000). From complete genomes to measures of substitution rate variability within and between proteins. Genome Res., 10, 991-1000.
86. Andrade, M. A., Perez-Iratxeta, C. & Ponting, C. P. (2001). Protein repeats: structures, functions, and evolution. J. Struct. Biol., 134, 117-131.
87. Nakai, K. (2001). Review: prediction of in vivo fates of proteins in the era of genomics and proteomics. J. Struct. Biol., 134, 103-116.
88. Trifonov, E. N., Kirzhner, A., Kirzhner, V. M. & Berezovsky, I. N. (2001). Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 53, 394-401.
89. Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-1243.
90. Carter, P., Liu, J. & Rost, B. (2002). PEP: database with Predictions for Entire Proteomes. Bioinformatics, in preparation.

Contact:    rost@columbia.edu Version:    May 13, 2002