| Title: | Cataloguing proteins in cell cycle control |
| Author: | Kazimierez O Wrzeszczynski & Burkhard Rost |
| Quote: | 2003 In Cell Cycle Checkpoint Control Protocols, H Lieberman (ed.), Totowa: Humana Press, 219-233 |
Cataloguing proteins in cell cycle control
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| * | Corresponding authors: email = kaz@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-4018, fax: +1-212-305-7932 |
Key words: cell cycle control, genome sequence analysis, protein function prediction, multiple alignments.
| 3D structure | three-dimensional co-ordinates of protein structure |
| BLAST | fast sequence alignment method [1] |
| PDB | Protein Data Bank of experimentally determined 3D structures of proteins [2] |
| PSI-BLAST | position specific iterated database search [3] |
| SWISS-PROT | database of protein sequences [5] |
| TrEMBL | translation of the EMBL nucleotide database coding DNA to protein sequences [5] |
No direct path from sequence similarity to 'biological' similarity. How can bioinformatics tools help to identify particular types of proteins? In general, the answer depends on the type of protein. Alignment methods can identify similarities between two proteins. However, while database search tools are optimised to find the best possible superposition between two proteins, they cannot answer questions such as: does the query protein Q perform the same function as the database protein H, for which we have some experimental data about function? Instead, alignment methods typically provide a statistical score that estimates the probability that the similarity between Q and H arose by chance [1, 3]. The precise function relating such a statistical score for sequence similarity to the actual 'biological' similarity of two proteins, i.e. similarity in terms of their three-dimensional (3D) structure and/or function, depends on the problem. For example, if the PSI-BLAST expectation value for the similarity between Q and H is below 10^-5, then this typically implies that H and Q have similar local 3D structure [6]. However, fewer than 70% of all pairs of enzymes at this level of sequence similarity have exactly the same enzymatic activity [7], whereas over 90% of all pairs with such similar sequences are observed in the same sub-cellular compartment [8]. Establishing these estimates typically requires solving three different tasks: (1) define biological similarity (3D structure, enzymatic activity, sub-cellular localisation), (2) build unbiased data sets of experimentally reliable information, and (3) establish thresholds that relate sequence similarity to biological similarity. These steps have been completed for a variety of biological features such as structure [9, 10, 11, 12, 6], enzymatic activity [13, 14, 15, 16, 17, 7], active sites [14], binding sites [14], functional keywords [14], functional classes [14, 16], and sub-cellular localisation [8, 18].
However, there is no way to infer from these results at which level of sequence similarity we can conclude that two homologous proteins play the same role in processes such as cell cycle control.
The field of proteomics has evolved into various levels of biological and computational techniques that identify and classify proteins in the context of entire genomes and proteomes. These techniques span a broad spectrum of approaches: from detailed literature searches [19, 20] or text analysis of database annotations [21], database mining [22, 23, 24, 25, 26, 27, 28, 20], multiple sequence alignments [29, 30, 31, 3, 32, 33], protein family clustering [34, 35, 36, 37, 38, 39], methods predicting aspects of protein function and structure [40, 41, 42, 43, 44, 45, 46] and computational modelling of the cell cycle [47], to gene microarray or 'chip' expression techniques [48, 49, 50, 51, 52], yeast two-hybrid systems [53, 54] and, recently, mass spectrometry of protein complexes [55, 56]. The process of unifying these techniques from an assortment of cataloguing tools into a more coherent analysis of the cell cycle, and specifically of cell cycle control proteins, is only beginning to take shape. Here, we present a first step in this process: we use database mining and literature searches to evaluate the current status of cell cycle control proteins in various databases, and combine this with an evaluation of sequence alignments to identify cell cycle control proteins in various proteomes. We began by archiving proteins known to be involved in cell cycle control through database and literature searches. Then, we established levels of sequence similarity that imply similarity in function. Finally, we attempted to identify cell cycle control proteins through homology in entirely sequenced eukaryotic proteomes.
Curated, well-formatted and annotated databases are among the most important resources for bioinformatics. A few public databases contain information about cell cycle proteins ( Table 1 ); from these we built a resource that registers the cell cycle information currently available. To create this repository, we collected about 3811 records from MEDLINE [57]. Using SRS [4], we retrieved about 364 proteins from SWISS-PROT [5], and 98 proteins of known structure from PDB [2]. Only seven of these 98 were classified as 'cell cycle control' proteins. A closer inspection of the SWISS-PROT data set revealed 534 proteins with the keyword 'cell cycle', and 940 with the keyword 'cell division'. ProtoNet [36, 58] is a tool that clusters all proteins from SWISS-PROT into families of related proteins. ProtoNet identified 1476 clusters with a total of 512 proteins for the SWISS-PROT keyword 'cell cycle' and 887 proteins in 1983 clusters with the keyword 'cell division'. The obvious next task was to extract a catalogue of unique families of proteins related to the cell cycle (Methods).
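The keyword counts above were obtained through SRS queries against SWISS-PROT. Purely as an illustration, the following sketch counts entries by keyword in SWISS-PROT flat-file text, relying on the standard `ID` and `KW` line codes and the `//` entry terminator. The function names and the three toy records are invented for this example; it is not the SRS query used for the numbers reported here.

```python
from typing import Dict, Iterator, List, Set

# Three invented records in SWISS-PROT flat-file layout (hypothetical IDs).
TOY_FLAT_FILE = """\
ID   TOYA_HUMAN     STANDARD;      PRT;   300 AA.
KW   Cell cycle; Cell division;
KW   Mitosis.
//
ID   TOYB_YEAST     STANDARD;      PRT;   250 AA.
KW   Cell cycle; Kinase.
//
ID   TOYC_MOUSE     STANDARD;      PRT;   120 AA.
KW   Transport.
//
"""


def iter_entries(flat_text: str) -> Iterator[List[str]]:
    """Yield the lines of each entry; an entry ends at a '//' line."""
    entry: List[str] = []
    for line in flat_text.splitlines():
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:
        yield entry


def entry_id(entry: List[str]) -> str:
    """Return the entry name from the ID line, or '' if there is none."""
    for line in entry:
        if line.startswith("ID   "):
            return line.split()[1]
    return ""


def entry_keywords(entry: List[str]) -> Set[str]:
    """Collect lower-cased keywords from all KW lines of one entry."""
    joined = " ".join(line[5:] for line in entry if line.startswith("KW   "))
    joined = joined.strip().rstrip(".")
    return {kw.strip().lower() for kw in joined.split(";") if kw.strip()}


def entries_by_keyword(flat_text: str, keywords: List[str]) -> Dict[str, List[str]]:
    """Map each requested keyword to the IDs of the entries that carry it."""
    hits: Dict[str, List[str]] = {kw: [] for kw in keywords}
    for entry in iter_entries(flat_text):
        found = entry_keywords(entry)
        for kw in keywords:
            if kw.lower() in found:
                hits[kw].append(entry_id(entry))
    return hits


keyword_hits = entries_by_keyword(TOY_FLAT_FILE, ["cell cycle", "cell division"])
for keyword, ids in keyword_hits.items():
    print(keyword, len(ids), ids)
```

Matching whole keywords rather than substrings keeps 'cell cycle' and 'cell division' as separate counts, which mirrors how the two keyword totals are reported separately above.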
All human sequences were extracted from SWISS-PROT and TrEMBL [5]. We retrieved all other proteome sequences from the respective public sites: Drosophila melanogaster: http://www.fruitfly.org/; Caenorhabditis elegans: ftp://ncbi.nlm.nih.gov/genbank/genomes/; Saccharomyces cerevisiae: the yeast genome directory [59]; Arabidopsis thaliana: http://www.arabidopsis.org/; and Mus musculus: http://www.ensembl.org.
| Databases | |
| The Suiseki Information Extraction System | www.pdg.cnb.uam.es/suiseki/ |
| Yeast Cell Cycle Analysis Project | genome-www.stanford.edu/cellcycle/data/rawdata/ |
| SCPD: Promoter Database of Saccharomyces cerevisiae | cgsigma.cshl.org/jian/ |
| Mouse Genome Informatics | www.informatics.jax.org |
| The Interactive Fly - Cell Cycle in Drosophila | sdb.bio.purdue.edu/fly/aimain/aadevinx.htm |
| Transfac & Transpath | transfac.gbf.de/TRANSFAC/ |
| Mitosis World | www.bio.unc.edu/faculty/salmon/lab/mitosis/mitosis.html |
| TRRD - Transcription Regulatory Regions Database | www.bionet.nsc.ru/trrd/ |
| The Ubiquitin System for Protein Modification and Degradation | www.nottingham.ac.uk/biochemcourses/students/ub/ubindex.html |
| KEGG: Kyoto Encyclopedia of Genes and Genomes | www.genome.ad.jp/kegg/ www.genome.ad.jp/kegg/pathway/hsa/hsa04110.html |
| The p53 web site | p53.curie.fr/ |
| The Kinesin Home Page | www.proweb.org/kinesin/ www.proweb.org/kinesin//KinesinTree.html |
| The Database for Interacting Proteins | dip.doe-mbi.ucla.edu/ |
| The Forsburg Lab pombe Pages | pingu.salk.edu/~forsburg/lab.html |
| Protonet - Automatic Hierarchical Classification of Proteins | www.protonet.cs.huji.ac.il/protonet/index.php |
| MIPS - Comprehensive Yeast Genome Database | mips.gsf.de/proj/yeast/ |
| Protein Information Resource | pir.georgetown.edu/ |
| PDB: database of protein structures | www.rcsb.org/pdb |
| SWISS-PROT (annotated proteins) | www.expasy.ch/sprot/sprot-top.html |
| Tools | |
| PSI-BLAST (database search) | www.ncbi.nlm.nih.gov/BLAST |
| Predictions of post-translational modifications | www.cbs.dtu.dk/services/ |
| PredictProtein (sequence analysis + structure prediction) | cubic.bioc.columbia.edu/predictprotein |
| META-PP (interface to variety of tools) | cubic.bioc.columbia.edu/predictprotein/submit_meta.html |
| ExPasy (tools, databases, links) | www.expasy.ch/ |
| WWW links for molecular biology | cubic.bioc.columbia.edu/doc/links_index.html |
a Note 1: we dropped the string 'http://' from the URLs, e.g. to access KEGG you may have to type 'http://www.genome.ad.jp/kegg' in some browsers.
Keyword search in SWISS-PROT. First, we searched SWISS-PROT for proteins with trusted experimental information about cell cycle control. Most of the proteins retrieved in this way control the G1/S and G2/M transitions, or are related to the M and S phases. In total, we found 361 proteins ( Table 2 ) that were distributed amongst various species. Next, we clustered these proteins into families.
Sequence-unique data sets. In order to reduce the bias from overly similar sequences, we generated sequence-unique subsets for all types of proteins under consideration. We defined a set as 'sequence-unique' if no pair in the set had more than 33% identical residues over more than 100 aligned residues (HSSP-threshold of 0 [6] ). Given an all-against-all pairwise alignment for the biased set, we simply used a greedy search to find the largest subset that fulfilled this condition. This reduced the entire set of 361 to 42 unique proteins or protein families.
Extending simple keyword-based search. The 42 unique proteins did not suffice to develop statistical criteria for the level of sequence similarity that is significant and also implies similarity in the cell cycle process. We expanded our original data set by including searches for other cell cycle controlling factors such as ubiquitin and members of the ras super-family, plus other proteins annotated for cell division control. This extended search for cell cycle control proteins increased the list to a total of 595 proteins; 97 of these had multiple, conflicting annotations ( Table 2 ); 113 were sequence-unique, i.e. the extended keyword-based search increased the number of families from 42 to 113. The entire data set of cell cycle control proteins is being prepared for release online at the CUBIC website: cubic.bioc.columbia.edu.
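The reduction to a sequence-unique subset described in the Methods can be sketched as follows. The text only states that a greedy search was used; the particular rule below (repeatedly discard the protein with the most too-similar partners) is one common greedy variant and is an assumption, as are all function names and the toy alignment records. The similarity criterion (more than 33% identity over more than 100 aligned residues) is taken from the text.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (protein_a, protein_b, percent_identity, aligned_length); toy values only.
Pair = Tuple[str, str, float, int]


def too_similar(pide: float, length: int,
                min_pide: float = 33.0, min_len: int = 100) -> bool:
    """Criterion from the text: >33% identity over >100 aligned residues."""
    return pide > min_pide and length > min_len


def greedy_unique(ids: Iterable[str], pairs: Iterable[Pair]) -> List[str]:
    """Greedy sketch: drop the most-connected protein until no pair is too similar.

    Assumed variant; the source does not specify which greedy rule was used.
    """
    neighbours: Dict[str, Set[str]] = {i: set() for i in ids}
    for a, b, pide, length in pairs:
        if a == b or a not in neighbours or b not in neighbours:
            continue  # ignore self-hits and proteins outside the set
        if too_similar(pide, length):
            neighbours[a].add(b)
            neighbours[b].add(a)

    kept: Set[str] = set(neighbours)
    while kept:
        # Protein with the most remaining too-similar partners (ties: largest ID).
        worst = max(kept, key=lambda i: (len(neighbours[i] & kept), i))
        if not neighbours[worst] & kept:
            break  # no too-similar pair is left
        kept.remove(worst)
    return sorted(kept)


toy_pairs: List[Pair] = [
    ("A", "B", 90.0, 300),  # too similar
    ("B", "C", 45.0, 150),  # too similar
    ("C", "D", 80.0, 15),   # high identity, but alignment too short
    ("A", "D", 25.0, 200),  # long alignment, but identity too low
]
unique_subset = greedy_unique(["A", "B", "C", "D"], toy_pairs)
print(unique_subset)
```

A greedy search of this kind does not guarantee the largest possible subset, but it is simple and usually close for sparse similarity graphs; removing the most-connected protein first tends to retain more representatives than picking proteins in arbitrary order.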
1 Eukaryotic proteins are presented; the remaining proteins in the set of 595 cell cycle proteins are involved in the prokaryotic cell cycle process.
Establishing thresholds for significant sequence similarity. If we want to find proteins that have similar roles in the cell cycle as the proteins for which we have experimental information in public databases, we first have to establish a threshold for 'significant sequence similarity', i.e. we have to address the question: at which level of sequence similarity can we infer similarity in the specific functional role of a protein? Such thresholds have to balance accuracy and coverage; in other words, we have to navigate between the Scylla of 'high selectivity/low sensitivity', i.e. finding very few homologues all of which are right, and the Charybdis of 'low selectivity/high sensitivity', i.e. finding many putative homologues, most of which are wrong. Cumulative accuracy and coverage at a similarity threshold T were calculated as:
$$\mathrm{accuracy}(T) = 100 \times \frac{\text{number of cell cycle proteins correctly identified at threshold } T}{\text{number of all proteins identified at threshold } T} \qquad (1)$$
$$\mathrm{coverage}(T) = 100 \times \frac{\text{number of cell cycle proteins found at threshold } T}{\text{number of all cell cycle proteins}} \qquad (2)$$
with the thresholds for sequence similarity specified below.
Aligning proteins. We generated alignments for all sequences from the trusted cell cycle data set (595) against a set of non-nuclear (but including cytoplasmic) proteins of known function other than cell cycle control (total of 6728 proteins) using pairwise BLAST [1]. To refine the analysis, we also generated PSI-BLAST profiles with three iterations against a filtered version of all currently known sequences [60]. These profiles were then aligned against our 'cell cycle control plus all other proteins' data set. Sequence similarity was measured by percentage identity, BLAST E-values, and the distance from the HSSP-threshold, which relates percentage sequence identity to alignment length, thus accounting for the fact that 80% pairwise identity is not significant over a stretch of 15 consecutive residues but is highly informative over entire proteins [61].
Accuracy and coverage of inferring cell cycle role by homology. When we aligned all trusted cell cycle proteins (595) against all true negatives (6116 non-cell cycle proteins), we found that at HSSP-distances of 15 (corresponding to 48% pairwise sequence identity for more than 100 aligned residues), we could seemingly infer the role in the cell cycle at an accuracy of 95% ( Fig. 1 ). However, when using the unbiased, sequence-unique subset of 113 cell cycle proteins to evaluate accuracy, we found only 60% accuracy. In order to reach 95% accuracy, we had to increase the HSSP-distance from 15 to 40 ( Fig. 2 ), i.e. to require over 70% pairwise sequence identity. Replacing the HSSP-distance by the expectation values from BLAST or PSI-BLAST (E-values) did not yield a more accurate distinction between true and false positives. This finding confirmed our previous results on establishing thresholds for sequence similarity implying similarity in 3D structure and sub-cellular localisation [8, 7].
Fig. 1: Sequence conservation of all trusted cell cycle control proteins. We aligned all trusted cell cycle proteins (595) against all true negatives (6116 non-cell cycle proteins) using BLAST. Solid lines with filled squares describe cumulative accuracy (percentage of correctly identified cell cycle proteins at the given threshold, Eqn. 1); dotted lines with open circles describe cumulative coverage (percentage of all cell cycle proteins found at the given threshold, Eqn. 2). We measured sequence similarity in three different ways: (A) by the percentage pairwise sequence identity (left graph), (B) by the distance from the HSSP-threshold, which accounts for the length of the alignment (central graph), and (C) by the negative logarithm (base 10) of the BLAST E-values (right graph). For example, the accuracy exceeded 80% for levels > 60% pairwise sequence identity (left), HSSP-distances above 3 (centre), and BLAST expectation values below 10^-12 (right). At all levels of accuracy ≥ 80%, the HSSP-distance performed best in terms of coverage. Note that these estimates were based on large data sets; however, they constituted over-estimates, since the bias in the data sets was not removed.
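The cumulative accuracy and coverage plotted in the figures can be computed along the following lines. This is a minimal sketch under assumptions: each hit is reduced to a similarity score (here an HSSP-distance) and a flag saying whether the protein found is a trusted cell cycle protein, a hit counts as identified when its score is at or above the threshold, and both quantities are expressed as percentages. The function name and the toy hit list are invented for illustration.

```python
from typing import List, Tuple

# (similarity score, is the protein found a trusted cell cycle protein?)
Hit = Tuple[float, bool]


def cumulative_accuracy_coverage(hits: List[Hit], threshold: float,
                                 n_cell_cycle_total: int) -> Tuple[float, float]:
    """Return (accuracy, coverage) in percent for all hits with score >= threshold.

    accuracy = 100 * correct cell cycle hits / all hits above threshold
    coverage = 100 * correct cell cycle hits / all cell cycle proteins
    """
    selected = [is_cell_cycle for score, is_cell_cycle in hits if score >= threshold]
    true_positives = sum(selected)
    accuracy = 100.0 * true_positives / len(selected) if selected else float("nan")
    coverage = 100.0 * true_positives / n_cell_cycle_total
    return accuracy, coverage


# Toy data: 5 cell cycle proteins and 3 true negatives with made-up scores.
toy_hits: List[Hit] = [
    (45.0, True), (42.0, True), (30.0, True), (28.0, False),
    (16.0, True), (12.0, False), (5.0, False), (2.0, True),
]
curve = {t: cumulative_accuracy_coverage(toy_hits, t, n_cell_cycle_total=5)
         for t in (0, 15, 40)}
for t, (acc, cov) in curve.items():
    print(f"threshold {t:>2}: accuracy {acc:5.1f}%  coverage {cov:5.1f}%")
```

The toy curve shows the trade-off discussed above: raising the threshold increases accuracy while coverage drops.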
Fig. 2: Estimating accuracy and coverage for BLAST and PSI-BLAST. In order to correctly estimate the likely accuracy and coverage, we had to remove the bias from our initial data sets by aligning the subset of 113 sequence-unique trusted cell cycle proteins against all trusted cell cycle proteins and against all true negatives. For this, we compared the performance of pairwise BLAST (open symbols) to that of PSI-BLAST (filled symbols). Accuracy (solid lines) and coverage (dashed lines with circles) were defined as in Fig. 1. In general, PSI-BLAST clearly outperformed BLAST. For example, at HSSP-distances > 40 the accuracy of PSI-BLAST searches was above 95%. Note that these estimates were considerably lower than those that would have been obtained using the biased data (Fig. 1). Using only the E-values taken from PSI-BLAST and BLAST alignments required very stringent cut-off thresholds: even at levels of 10^-10, implying that only about one in ten billion hits occurred by chance, less than 70% of the inferences were correct. The residual problem with the data resulted from the small set sizes (jagged curves).
Identifying cell cycle control proteins from entirely sequenced proteomes. We used a variety of thresholds for inferring the role of cell cycle control proteins by homology in order to transfer the annotations about these roles from our trusted data set to homologues in entirely sequenced eukaryotes. In particular, we scanned the proteomes of human (Homo sapiens), mouse (Mus musculus), fly (Drosophila melanogaster), worm (Caenorhabditis elegans), weed (Arabidopsis thaliana), and yeast (Saccharomyces cerevisiae). At levels of around 95% accuracy, we could extend the number of proteins known to be involved in cell cycle control from 284 for the six completely sequenced organisms to about 747 ( Table 3 ). Our analysis also pulled out about 500-1300 additional proteins (difference between column D=40 and columns D=25 and D=15 in Table 3) that may constitute candidates for unknown cell cycle control proteins. At the other extreme, our data illustrated that over 10000 proteins across these six proteomes have 3D structures similar to that of one of the known cell cycle proteins. Presumably, most of these are not related to cell cycle control, illustrating the variety of functions that can be adopted by proteins of similar structure.
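The proteome scan described above amounts to counting, per proteome, the proteins whose best hit to the trusted data set reaches a given HSSP-distance D. The sketch below illustrates this bookkeeping under assumptions: hits are given as (proteome, protein, HSSP-distance) triples, a protein is counted when its best distance is at or above D, and the thresholds 0, 15, 25 and 40 are those of Table 3. All names and the toy hits are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# (proteome, protein identifier, HSSP-distance to a trusted cell cycle protein)
ScanHit = Tuple[str, str, float]

THRESHOLDS = (0, 15, 25, 40)  # HSSP-distance cut-offs used in Table 3


def best_distance_per_protein(hits: Iterable[ScanHit]) -> Dict[Tuple[str, str], float]:
    """Keep only the largest HSSP-distance observed for each protein."""
    best: Dict[Tuple[str, str], float] = {}
    for proteome, protein, distance in hits:
        key = (proteome, protein)
        if key not in best or distance > best[key]:
            best[key] = distance
    return best


def count_predictions(hits: Iterable[ScanHit],
                      thresholds: Iterable[float] = THRESHOLDS) -> Dict[float, Counter]:
    """Count proteins per proteome whose best hit is at or above each threshold."""
    thresholds = tuple(thresholds)
    best = best_distance_per_protein(hits)
    counts: Dict[float, Counter] = {d: Counter() for d in thresholds}
    for (proteome, _protein), distance in best.items():
        for d in thresholds:
            if distance >= d:
                counts[d][proteome] += 1
    return counts


toy_scan: List[ScanHit] = [
    ("yeast", "p1", 42.0), ("yeast", "p1", 10.0),  # two hits, best one counts
    ("yeast", "p2", 27.5), ("yeast", "p3", -4.0),
    ("fly", "q1", 16.0), ("fly", "q2", 3.0),
]
prediction_counts = count_predictions(toy_scan)
for d in THRESHOLDS:
    print(f"D={d}: {dict(prediction_counts[d])}")
```

Because each stricter threshold selects a subset of the proteins selected by the looser ones, the counts can only shrink from D=0 to D=40, which is the pattern seen across the columns of Table 3.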
1 Distance from the HSSP-threshold (D), chosen as seen in Fig. 2 for various levels of percentage accuracy using the PSI-BLAST curve. Levels of accuracy are estimated according to Fig. 2, e.g. at a threshold of D=40 more than 95% of the proteins for which we infer the involvement in cell cycle control by homology are presumably correctly inferred.
2 The number of previously known annotated cell cycle control proteins represented in each specific proteome as used in our trusted data set is given for comparison.
Every day, biologists search databases with their protein of interest Q using standard alignment methods to uncover putative homologues. Due to large-scale sequencing efforts, these database searches increasingly retrieve proteins without any annotation other than 'hypothetical protein'. Such results are obviously not very informative for generating hypotheses about function. More difficult are the 'helpful' cases in which a protein H with experimental annotation about function is similar to Q. The pitfalls that can lead to incorrect hypotheses based on database searches are manifold [62, 63, 64, 14, 65, 28]. Nevertheless, an increasing number of publications in modern biology are based on hints obtained from database searches. How can we separate the wheat from the chaff? Certainly, it is a sine qua non to establish thoroughly evaluated, statistically significant estimates for which level of sequence similarity implies what [13, 14, 66, 16, 17, 7]. In the context of cell cycle proteins, our approach aims at identifying commonalities in the evolutionary conservation of a selected group of functions. On the one hand, it may appear evident that all proteins involved in cell cycle and cell cycle control share common evolutionary constraints. If this is true, we can infer the involvement of a protein in the cell cycle process based on sequence similarity. On the other hand, we may suspect that two kinases such as pyruvate dehydrogenase kinase and Cdk1 are more similar to each other than the two cell cycle proteins Cdk1 kinase and the E2F transcription factor. If this is true, we have to define all types of function related to the cell cycle and establish thresholds for each functional type; in other words, our inference of cell cycle roles based on homology is rather limited. Arguably, reality falls between these two extremes. Therefore, discovering new proteins in cell cycle control through homology works to some extent, but is rather restricted.
Jones & Sgouros [67] studied cohesin complex proteins through sequence motifs and database searches. They used PSI-BLAST to identify all homologues of the SMC (Structural Maintenance of Chromosomes) and the SCC (Sister-Chromatid Cohesion) proteins from yeast (Smc1, Smc3, Scc1, Scc2, Scc3, and Scc4), as well as four proteins interacting with cohesin proteins (Trf4, Prp11, Tid3, Esp1). Next, the authors aligned the putative homologues identified by PSI-BLAST using the dynamic programming based method ClustalX [32, 33], and constructed putative evolutionary trees from these ClustalX alignments using the program PHYLIP [68]. Finally, the study identified possible binding partners from the complete two-hybrid screens available through the Yeast Proteome Database and putative sequence motifs through the program Teiresias [69]. The study established five families of SMC proteins and a cohesin interaction network of 17 proteins, and identified possible common sequence motifs for binding and a kinase active site.
Kel and colleagues [70] combined experimental and theoretical techniques in a comprehensive study identifying the 5' regulatory regions of cell cycle related genes. First, the group developed a program that identifies context-specific binding sites for the E2F transcription factors. All these sites were identified in entirely sequenced genomes with the aim of identifying new genes that play a role in controlling cell proliferation, differentiation, and apoptosis. Finally, the predictions were verified by chromatin immunoprecipitation assays. The study found a total of 313 new potential E2F targets, 8 of which were verified through in vivo experiments.
Blaschke & Valencia [19] developed a text analysis system (SUISEKI: System for Information Extraction on Interactions) that automatically identifies cell cycle related protein-protein interactions from scientific literature, i.e. from MEDLINE abstracts. At the heart of the system, text searches are organised into frames that capture the various language constructs used to convey protein interactions. The authors selected 5,283 abstracts that included the word 'cell-cycle'; the system detected 6,778 protein interactions in these abstracts, resulting finally in 4,657 distinct interactions from a total of 1,471 abstracts. The data are currently available at www.pdg.cnb.uam.es/suiseki/.
Thanks to Jinfeng Liu (Columbia) for computer assistance and the collection of genome data sets; to Jinfeng Liu, Dariusz Przybylski (Columbia), and Rajesh Nair (Columbia) for providing preliminary information and programs. Particular thanks to Volker Eyrich (Columbia) for programming and maintaining most of the immensely valuable software that runs the EVA and META-PredictProtein servers! The work of JL and BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health. Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.
Note 2: we will make all our data, along with a novel cell cycle specific database, available through our website cubic.bioc.columbia.edu
Methods
Cell cycle and cell cycle control proteins in public databases
Table 2
| Species | cell cycle control | G1/S | G2/M | M phase | S phase | other | multiple |
| Eukaryotes | 582 | 135 | 86 | 66 | 156 | 229 | 90 |
| Homo sapiens | 99 | 28 | 11 | 23 | 41 | 24 | 28 |
| Mus musculus | 68 | 25 | 8 | 10 | 30 | 18 | 23 |
| Drosophila melanogaster | 15 | 5 | 3 | 2 | 4 | 3 | 2 |
| Caenorhabditis elegans | 10 | 1 | 4 | 1 | 2 | 2 | 0 |
| Arabidopsis thaliana | 5 | 0 | 1 | 0 | 0 | 4 | 0 |
| Saccharomyces cerevisiae | 87 | 20 | 11 | 5 | 19 | 46 | 14 |
Cell cycle control protein identification through sequence similarity
Table 3
| Proteome | Known cell cycle control proteins 2 | Predicted cell cycle control proteins, D=0 (55%) | D=15 (65%) | D=25 (90%) | D=40 (95%) |
| Homo sapiens | 99 | 3073 | 782 | 476 | 299 |
| Mus musculus | 68 | 3162 | 574 | 310 | 203 |
| Drosophila melanogaster | 15 | 970 | 181 | 96 | 50 |
| Caenorhabditis elegans | 10 | 1005 | 185 | 87 | 32 |
| Arabidopsis thaliana | 5 | 1888 | 303 | 148 | 63 |
| Saccharomyces cerevisiae | 87 | 513 | 148 | 119 | 100 |
| Sum | 284 | 10611 | 2173 | 1236 | 747 |
Notes
Limits of inferring function through homology
Other tools targeting cell cycle proteins
Acknowledgements
References
1. Altschul, S. F. & Gish, W. (1996). Local alignment statistics. Methods in Enzymology, 266, 460-480.
2. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucleic Acids Res, 28, 235-42.
3. Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.
4. Etzold, T., Ulyanov, A. & Argos, P. (1996). SRS: Information retrieval system for molecular biology data banks. Methods in Enzymology, 266, 114-128.
5. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res, 28, 45-8.
6. Rost, B. (1999). Twilight zone of protein sequence alignments. Protein Eng, 12, 85-94.
7. Rost, B. (2002). Enzyme function less conserved than anticipated. Journal of Molecular Biology, 318, 595-608.
8. Nair, R. & Rost, B. (2002). Sub-cellular localisation surprisingly conserved in sequence. Protein Science, submitted.
9. Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins: Structure, Function, and Genetics, 9, 56-68.
10. Abagyan, R. A. & Batalov, S. (1997). Do aligned sequences share the same fold? Journal of Molecular Biology, 273, 355-368.
11. Alexandrov, N. N. & Soloveyev, V. V. (1998). Statistical significance of ungapped sequence alignments. In HICCS' 98: Pacific Symposium on Biocomputing' 98 (Altman, R. B., Dunker, A. K., Hunter, L. & Klein, T. E., eds.), pp. 463-472, World Scientific, Maui, Hawaii, U.S.A.
12. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proceedings of the National Academy of Sciences, 95, 6073-6078.
13. Shah, I. & Hunter, L. (1997). Predicting enzyme function from sequence: a systematic appraisal. In Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. et al., eds.), pp. 276-283, AAAI Press, Halkidiki, Greece.
14. Devos, D. & Valencia, A. (2000). Practical limits of function prediction. Proteins: Structure, Function, and Genetics, 41, 98-107.
15. Jaroszewski, L., Rychlewski, L. & Godzik, A. (2000). Improving the quality of twilight-zone alignments. Protein Science, 9, 1487-1496.
16. Wilson, C. A., Kreychman, J. & Gerstein, M. (2000). Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. Journal of Molecular Biology, 297, 233-249.
17. Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. Journal of Molecular Biology, 307, 1113-1143.
18. Wrzeszczynski, K. O. & Rost, B. (2002). Retention signals for Endoplasmic reticulum and Golgi apparatus motifs inaccurate. Proteins: Structure, Function, and Genetics, in preparation.
19. Blaschke, C. & Valencia, A. (2001). The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform Ser Workshop Genome Inform, 12, 123-34.
20. Valencia, A. (2002). Search and retrieve: Large-scale data generation is becoming increasingly important in biological research. But how good are the tools to make sense of the data? EMBO Reports, 3, 396-400.
21. Nair, R. & Rost, B. (2002). Inferring sub-cellular localisation through automated lexical analysis. Bioinformatics, in press.
22. Walker, D. R. & Koonin, E. V. (1997). SEALS: a system for easy analysis of lots of sequences. In Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. et al., eds.), pp. 333-339, AAAI Press, Halkidiki, Greece.
23. Schmitt, A. O., Specht, T., Beckmann, G., Dahl, E., Pilarsky, C. P. et al. (1999). Exhaustive mining of EST libraries for genes differentially expressed in normal and tumour tissues. Nucleic Acids Research, 27, 4251-60.
24. Andrade, M. A. & Bork, P. (2000). Automated extraction of information in molecular biology. FEBS Lett, 476, 12-7.
25. Gaasterland, T., Sczyrba, A., Thomas, E., Aytekin-Kurban, G., Gordon, P. et al. (2000). MAGPIE/EGRET annotation of the 2.9-Mb Drosophila melanogaster Adh region. Genome Res., 10, 502-510.
26. Galperin, M. Y. & Koonin, E. V. (2000). Who's your neighbor? New computational approaches for functional genomics. Nature Biotechnology, 18, 609-613.
27. Gaasterland, T. & Oprea, M. (2001). Whole-genome analysis: annotations and updates. Curr. Opin. Str. Biol., 11, 377-381.
28. Koonin, E. V. (2001). Computational genomics. Curr Biol, 11, R155-8.
29. Smith, T. F., Waterman, M. S. & Burks, C. (1985). The statistical distribution of nucleic acid similarities. Nucl. Acids Res., 13, 645-656.
30. Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1996). Using CLUSTAL for multiple sequence alignments. Meth. Enzymol., 266, 383-402.
31. Pearson, W. R. (1996). Effective protein sequence comparison. Methods in Enzymology, 266, 227-258.
32. Jeanmougin, F., Thompson, J. D., Gouy, M., Higgins, D. G. & Gibson, T. J. (1998). Multiple sequence alignment with Clustal X. Trends in Biochemical Sciences, 23, 403-405.
33. Higgins, D. G. & Taylor, W. R. (2000). Multiple sequence alignment. Methods Mol Biol, 143, 1-18.
34. Enright, A. J. & Ouzounis, C. A. (2000). GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451-457.
35. Gerstein, M. & Jansen, R. (2000). The current excitement in bioinformatics-analysis of whole-genome expression data: how does it relate to protein structure and function? Curr Opin Struct Biol, 10, 574-584.
36. Linial, M. & Yona, G. (2000). Methodologies for target selection in structural genomics. Progress in Biophysics and Molecular Biology, 73, 297-320.
37. Heger, A. & Holm, L. (2001). Picasso: generating a covering set of protein family profiles. Bioinformatics, 17, 272-279.
38. Rehmsmeier, M. & Vingron, M. (2001). Phylogenetic information improves homology detection. Proteins: Structure, Function, and Genetics, 45, 360-371.
39. Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, in press.
40. Jones, D. T. (1997). Progress in protein structure prediction. Current Opinion in Structural Biology, 7, 377-387.
41. Rost, B. & Sander, C. (2000). Third generation prediction of secondary structure. Methods in Molecular Biology, 143, 71-95.
42. Thornton, J. W. & DeSalle, R. (2000). Gene family evolution and homology: genomics meets phylogenetics. Annu Rev Genomics Hum Genet, 1, 41-73.
43. Baker, D. & Sali, A. (2001). Protein structure prediction and structural genomics. Science, 294, 93-96.
44. Pawlowski, K., Rychlewski, L., Zhang, B. & Godzik, A. (2001). Fold predictions for bacterial genomes. Journal of Structural Biology, 134, 219-231.
45. Rost, B. (2001). Protein secondary structure prediction continues to rise. Journal of Structural Biology, 134, 204-218.
46. Rost, B. (2002). Did evolution leap to create the protein universe? Current Opinion in Structural Biology, 12, 409-416.
47. Tyson, J. J. & Novak, B. (2001). Regulation of the eukaryotic cell cycle: molecular antagonism, hysteresis, and irreversible transitions. J Theor Biol, 210, 249-63.
48. Gaasterland, T. & Bekiranov, S. (2000). Making the most of microarray data. Nature Genetics, 24, 204-206.
49. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P. et al. (2001). Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Gen., 29, 365-371.
50. Cho, R. J., Huang, M., Campbell, M. J., Dong, H., Steinmetz, L. et al. (2001). Transcriptional regulation and function during the human cell cycle. Nat Genet, 27, 48-54.
51. Sherlock, G., Hernandez-Boussard, T., Kasarskis, A., Binkley, G., Matese, J. C. et al. (2001). The Stanford Microarray Database. Nucleic Acids Res, 29, 152-5.
52. Shedden, K. & Cooper, S. (2002). Analysis of cell-cycle-specific gene expression in human cells as determined by microarrays and double-thymidine block synchronization. Proc Natl Acad Sci U S A, 99, 4379-84.
53. Cagney, G., Uetz, P. & Fields, S. (2000). High-throughput screening for protein-protein interactions using two-hybrid assay. Methods Enzymol, 328, 3-14.
54. Tucker, C. L., Gera, J. F. & Uetz, P. (2001). Towards an understanding of complex protein networks. Trends Cell Biol, 11, 102-6.
55. Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M. et al. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141-7.
56. Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L. et al. (2002). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180-3.
57. Airozo, D., Allard, R., Brylawski, B., Canese, K., Kenton, D. et al. (1999). MEDLINE.
58. Bilu, Y. & Linial, M. (2002). The advantage of functional prediction based on clustering of yeast genes and its correlation with non-sequence based classifications. Journal of Computational Biology, 9, 193-210.
59. (1997). The yeast genome directory. Nature, 387, 5.
60. Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins: Structure, Function, and Genetics, 46, 195-205.
61. Nair, R., Cokol, M. & Rost, B. (2000). PredictNLS: prediction of nuclear localisation signals.
62. Bork, P. & Gibson, T. J. (1996). Applying motif and profile searches. Methods in Enzymology, 266, 162-184.
63. Rost, B. & Valencia, A. (1996). Pitfalls of protein sequence analysis. Current Opinion in Biotechnology, 7, 457-461.
64. Eisenhaber, F. & Bork, P. (1998). Wanted: subcellular localization of proteins based on sequence. Trends in Cell Biology, 8, 169-170.
65. Devos, D. & Valencia, A. (2001). Intrinsic errors in genome annotation. Trends in Genetics, 17, 429-431.
66. Pawlowski, K., Jaroszewski, L., Rychlewski, L. & Godzik, A. (2000). Sensitive sequence comparison as protein function predictor. Pac Symp Biocomput, 8, 42-53.
67. Jones, S. & Sgouros, J. (2001). The cohesin complex: sequence homologies, interaction networks and shared motifs. Genome Biology, 2, RESEARCH0009.1-0009.12.
68. Felsenstein, J. (1988). PHYLIP: phylogeny inference package. Cladistics, 5, 355-356.
69. Rigoutsos, I., Floratos, A., Ouzounis, C., Gao, Y. & Parida, L. (1999). Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins: Structure, Function, and Genetics, 37, 264-277.
70. Kel, A. E., Kel-Margoulis, O. V., Farnham, P. J., Bartley, S. M., Wingender, E. et al. (2001). Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors. J Mol Biol, 309, 99-120.
Contact: rost@columbia.edu
Version: Feb 24, 2003