| Title: | Automatic prediction of protein function |
| Author: | Burkhard Rost , Jinfeng Liu , Rajesh Nair , Kazimierz O. Wrzeszczynski and Yanay Ofran |
| Quote: | Cellular Molecular Life Sciences, 2003, 60:2637-2650 |
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 4 | Dept. of Pharmacology, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA |
| 5 | Dept. of Physics, Columbia Univ., 538 West 120th Street, New York, NY 10027, USA |
| 6 | Dept. of Medical Informatics, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA |
| * | Corresponding author: cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-4018, fax: +1-212-305-7932 |
Most methods annotating protein function utilise sequence homology to proteins of experimentally known function. Such a homology-based annotation transfer is problematic and limited in scope. Therefore, computational biologists have begun to develop ab initio methods that predict aspects of function, including sub-cellular localization, post-translational modifications, functional type, and protein-protein interactions. For the first two cases, the most accurate approaches rely on identifying short signalling motifs, while the most general methods utilise tools of artificial intelligence. An outstanding new method predicts classes of cellular function directly from sequence. Similarly, promising methods have been developed predicting protein-protein interaction partners at acceptable levels of accuracy for some pairs in entire proteomes. No matter how difficult the task, successes over the last few years have clearly paved the way for ab initio prediction of protein function.
Key words: genome analysis; protein function prediction; ab initio prediction; neural networks; multiple alignments; sequence analysis; sub-cellular localization; post-translational modifications; protein-protein interactions; bioinformatics
| 3D | three-dimensional; 3D structure, three-dimensional (co-ordinates of protein structure) |
| PDB | Protein Data Bank of experimentally determined 3D structures of proteins [1] |
| SWISS-PROT | data base of protein sequences [2] |
| TrEMBL | translation of the EMBL-nucleotide database coding DNA to protein sequences [2] . |
'Protein function' is an operational concept. Proteins perform most important tasks in organisms, such as catalysis of biochemical reactions, transport of nutrients, recognition and transmission of signals. The plethora of aspects of the role of any particular protein is referred to as its 'function'. However, protein function is not a well-defined term; instead, function is a complex phenomenon that is associated with many mutually overlapping levels: biochemical, cellular, organism mediated, developmental, and physiological. These overlapping levels are intertwined in complex ways. For example, protein kinases can be related to different cellular functions (such as cell cycle) and to a chemical function (transferase); the same kinase may also 'mis-function', thereby causing disease. Here, we use the generalised, operational notion that 'function is everything that happens to or through a protein'.
Sequence-structure and sequence-function gaps. The first entire genome (DNA) sequence of a free-living organism, Haemophilus influenzae, was published in 1995 [3] . Now, we know the genomes for over 100 organisms; for over 60, the data is publicly available and contributes about 250K protein sequences, i.e. one fourth of all known protein sequences [4, 5, 6, 7] . This explosion of sequence information has widened the gap between the number of protein sequences and the number of experimentally characterised proteins [8, 9, 10, 4] . Computational biology plays a central role in bridging this gap [11, 12, 13, 14, 15, 16] . For about 10-40% of all sequences, we can deduce structure from homology to known structures [17, 18, 19, 20, 4, 21, 22] . For more than 40-60% of all sequences from current genome projects, sequence homology suggests some aspects of function [23, 24, 25, 10] . However, a firm conclusion about function is not always clear, as predictions can be anything from cellular function (e.g. ATPase or ion channel) to details about cofactor binding sites (e.g. ATP binding sites).
Transfer of function based on sequence homology. Querying MEDLINE [26] with 'predict protein function' retrieves over 1000 papers from one year. The vast majority describe single-case studies in which experts combine many tools to guess aspects of function for a particular protein or protein family. Recently, James Whisstock and Arthur Lesk have focused on these aspects in an excellent, comprehensive review [27] . Here, we focus mainly on ab initio methods that predict function in the absence of experimental annotations for homologues. We also discuss some problems of homology transfer, but we do not cover methods that successfully identify functionally important residues from multiple alignments and/or protein structures [28, 29, 30, 31, 32, 33, 34, 35] . Arguably the most successful approaches combine tools from artificial intelligence (neural networks, hidden Markov models, support vector machines) with evolutionary information contained in multiple alignments and aspects of protein structure.
Molecular biology databases with functional information. Information about protein sequences is stored in public databases such as SWISS-PROT and TrEMBL (Table 1). SWISS-PROT [2] is a curated database of protein sequences that also contains annotations about function added by a team of experts who extract this information primarily from journal publications [36] . TrEMBL [2] consists of entries that are derived from the translation of all coding sequences in the EMBL nucleotide sequence database [37] and are not in SWISS-PROT. Unlike SWISS-PROT records, those in TrEMBL are awaiting manual annotation. SWISS-PROT currently contains 122,564 sequence entries while the TrEMBL database contains over 821,014 sequence entries [2] . Many databases of protein families are derived from these original resources [12, 38, 39, 10, 40, 41, 42, 43, 44] . An issue that becomes increasingly important is the redundancy in original and derived databases. Such redundancy causes problems for database search techniques (alignments) and complicates estimates for the accuracy of annotation transfer [45] . A few resources address this problem by maintaining non-redundant sub-sets like KIND [46] , CluSTR [47] , or BLOCKS+ [48] databases, others provide tools to address the problem [49, 50, 51] .
| Name | Description | URL |
|---|---|---|
| General databases | ||
| SWISS-PROT | annotated protein sequences | http://www.ebi.ac.uk/swissprot/ |
| TrEMBL | translated protein sequences | http://www.ebi.ac.uk/trembl/ |
| Gene Ontology (GO) | ontology of protein function | http://www.geneontology.org/ |
| MIPS | annotation and ontology of function | http://mips.gsf.de/ |
| Ensembl | proteins from human and mouse | http://www.ensembl.org/ |
| Post-translational modification | ||
| RESID | database of post-translational modifications | http://www.nbrf.georgetown.edu/pirwww/dbinfo/resid.html |
| PROSITE | database of protein motifs | http://www.expasy.ch/prosite/ |
| PlantsP | database of phosphorylation for plants | plantsp.sdsc.edu |
| NetPhos | predict protein phosphorylation | http://www.cbs.dtu.dk/services/NetPhos/ |
| NetOGlyc | predict O-α-GlcNAc glycosylation | http://www.cbs.dtu.dk/services/NetOGlyc/ |
| DictyOGlyc | predict O-GalNAc glycosylation | http://www.cbs.dtu.dk/services/DictyOGlyc/ |
| YinOYang | predict O-β-GlcNAc glycosylation and Yin-Yang sites | http://www.cbs.dtu.dk/services/YinOYang/ |
| GPI-predict | predict GPI-anchored proteins | http://mendel.imp.univie.ac.at/gpi/gpi_prediction.html |
| The Sulfinator | predict tyrosine sulfation | http://us.expasy.org/tools/sulfinator/ |
| Sub-cellular localization | ||
| HMMTOP | predict transmembrane helices | http://www.enzim.hu/hmmtop/ |
| TMHMM | predict transmembrane helices | http://www.cbs.dtu.dk/services/TMHMM/ |
| PHDhtm | predict transmembrane helices | http://cubic.bioc.columbia.edu/predictprotein/ |
| PredictNLS | nuclear localization signals | http://cubic.bioc.columbia.edu/predictNLS/ |
| LOC3d | localization for eukaryotic structures | http://cubic.bioc.columbia.edu/db/LOC3d/ |
| PSORT II | predict localization | http://psort.nibb.ac.jp/ |
| NNPSL | predict localization | http://www.doe-mbi.ucla.edu/cgi/astrid/nnpsl_mult.cgi |
| TargetP | combination of signal, chloroplast, and mitochondrial targeting signals | http://www.cbs.dtu.dk/services/TargetP/ |
| No Name | predict localization of yeast proteins | http://bioinfo.mbb.yale.edu/genome/localize/ |
| ProtComp | predict localization for plants | http://www.softberry.com/berry.phtml?topic=proteinloc |
| Predotar | predict mitochondrial and plastid targeting | http://www.inra.fr/Internet/Produits/Predotar/ |
| Processing, degradation, and antigen presentation | ||
| MEROPS | database of proteases | http://www.merops.co.uk |
| IMGT | immunogenetics database | http://imgt.cines.fr/ |
| FIMM | database of functional immunology | http://sdmc.krdl.org.sg:8080/fimm/ |
| MHCPEP | database of MHC-binding peptides | http://wehih.wehi.edu.au/mhcpep/ |
| SYFPEITHI | database of MHC ligands and peptide motifs; also includes the prediction service | http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm |
| BIMAS | predict HLA peptide binding | http://bimas.dcrt.nih.gov/molbio/hla_bind/ |
| NetChop | predict human proteasome cleavage sites | http://www.cbs.dtu.dk/services/NetChop/ |
| Functional class | ||
| ProtFun | predict cellular, enzyme, and GO class | http://www.cbs.dtu.dk/services/ProtFun/ |
| Meta servers | ||
| Pedant | proteome predictions and analysis | http://pedant.gsf.de/methods.html |
| PEP | predictions for entire proteomes | http://cubic.bioc.columbia.edu/db/PEP/ |
Transfer of annotation through homology. Experimentally determining protein function continues to be a laborious task that may take enormous resources. For instance, more than a decade after its discovery, we still do not know the precise and entire functional role of the prion protein [52] . The automatic elucidation of protein function is therefore an appealing challenge [53, 38, 54, 27] . Bioinformatics exploits the observation that two proteins with similar sequence often have similar function. Although this concept appears straightforward, in practice there are many hurdles to overcome. First, function is not well defined; hence it is very difficult to create controlled vocabularies [55, 56, 57] . Second, the precise values for thresholds of significant sequence similarity (T) are actually specific to particular aspects of function and have to be re-established for any given task [58, 59, 60, 61, 62, 56, 63, 45, 64, 65] ( Fig. 1 ). The problem of annotating function was illustrated immediately after the release of the first genome [3] : 148 amendments were published a few weeks after the original publication [66] . Similar amendments followed most papers presenting entirely sequenced genomes [67, 68, 69] . Several pitfalls in transferring annotations of function have been reported, e.g. inadequate knowledge of thresholds for 'significant sequence similarity', using only the best database hit, or ignoring the domain organisation of proteins [70, 68, 71, 69, 72, 73, 74] . However, Eugene Koonin and colleagues turned the issue of annotation transfer errors around by collecting a few examples for which subsequent experiments showed that theoretical predictions had been more accurate than previous experiments [75] .
Problem 1: Multiple levels of description. Several groups and associations have ventured to introduce numerical schemata to define function. The first attempt was the introduction of Enzyme Classification numbers (EC [76] ); this classification uses four digits to classify enzymatic activity [56] . The MIPS database attempts to extend this idea to a wider range of proteins and roles through its classification catalogue [77] . Another characterisation of protein function originates from the Gene Ontology (GO) consortium [55] . GO distinguishes three levels of protein function. (1) Molecular function: at the molecular level, the protein can, for example, catalyse a metabolic reaction, or recognize or transmit a signal. (2) Biological process: a set of many co-operating proteins is responsible for achieving broad biological goals, for example, mitosis, purine metabolism, or signal transduction cascades. (3) Cellular component: this category includes the structure of sub-cellular compartments, the localization of proteins, and macromolecular complexes. Examples include nucleus, telomere, and origin recognition complex. The sub-cellular localization of a protein is an essential attribute for this level. The totality of the physiological sub-systems of the cell and their interplay with various environmental stimuli determines properties of the phenotype, the morphology and physiology of the organism and its behaviour. Although not complete, GO constitutes the best set of definitions available today.
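To make the four-digit EC scheme concrete, the following minimal Python sketch counts how many leading EC levels two annotations share. The function name `shared_ec_levels` and the treatment of `-` as an unspecified level are assumptions made for illustration; they are not part of the EC or MIPS resources described above.

```python
def shared_ec_levels(ec_a, ec_b):
    """Count the leading EC levels (0 to 4) on which two EC numbers agree.

    A level written as "-" is treated as unspecified and ends the comparison
    (an assumption of this sketch).
    """
    levels = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a == "-" or b == "-" or a != b:
            break
        levels += 1
    return levels


if __name__ == "__main__":
    # Two illustrative class-2 (transferase) EC strings that differ only
    # in the fourth digit.
    print(shared_ec_levels("2.7.11.1", "2.7.11.22"))
```

In this picture, transferring the "full enzymatic activity" between two proteins corresponds to requiring agreement on all four levels, whereas coarser transfers accept agreement on the first one to three levels only.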
Problem 2: Functional information not machine-readable. Nearly all databases present the protein sequence in formats that are more or less straightforward to parse by computers. However, annotations are mostly written in free text using a rich biological vocabulary that often varies in different areas of research. Such annotations are primarily meant for the eyes of human experts, hence, they are not machine-readable [78] . Another problem that hampers automatic annotations is the quality of database annotations: only few database groups attempt a quality control of curated annotations [79] .
Establish accuracy of homology transfer. The reliability of transfer by homology depends on the particular feature of function/structure considered. In order to estimate the accuracy in transferring function given a particular threshold in sequence similarity, we have to complete the following three steps ( Fig. 1 sketch on top): (1) Build data sets that have experimental annotations about the presence (true, e.g. all proteins experimentally known to be nuclear) and absence (false, e.g. all proteins experimentally known NOT to be nuclear) of a certain aspect of function. (2) In order to avoid estimates that are incorrectly biased by the distribution of today's experimental information [45] , extract a representative sub-set of proteins from the true data and align it against all proteins in the true set (minus the representative sub-set) and the false set. (3) For all alignments, count how many true and false hits we find at every given threshold for sequence similarity. How to measure sequence similarity? The most popular way is the level of pairwise sequence identity, i.e. the percentage of residues that are identical in an alignment of two proteins (R on R -> 1, R on K -> 0). The major problem with such a score in the context of automatic annotations is that it does not reflect the length of the alignment. For example, two proteins that share a stretch of 11 identical residues (100% identity over a very short alignment) may differ in both function and structure [80, 63, 45] . On the other hand, levels of pairwise sequence identity like 33% for alignments longer than 100 residues, or 22% for alignments longer than 250 residues imply similarity in structure [80] . This observation is used to compile an empirical threshold for significant sequence similarity as a function of alignment length [81, 82, 80] . We refer to this threshold as the HSSP-value; it is empirically chosen such that any pair of proteins A, B have similar structure if HSSP-value(A,B)>0.
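The counting in step (3) and the definition of pairwise sequence identity can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the protocol used for Fig. 1: the function names, the convention of ignoring gap columns, and the toy hit list are all hypothetical, and the HSSP-value itself is not implemented because its formula is not given in the text.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over aligned, non-gap columns.

    Both strings must come from the same pairwise alignment and use "-"
    for gaps (R on R counts 1, R on K counts 0).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


def transfer_accuracy(hits, threshold):
    """Accuracy of annotation transfer at a given similarity threshold.

    hits: list of (similarity, is_true) pairs, one per alignment between a
    representative 'true' protein and a protein from the true or false set.
    Returns (fraction of true hits at or above threshold, number of hits),
    or (None, 0) if no hit reaches the threshold.
    """
    selected = [is_true for similarity, is_true in hits
                if similarity >= threshold]
    if not selected:
        return None, 0
    return sum(selected) / len(selected), len(selected)


if __name__ == "__main__":
    # Made-up hits: (percent identity, hit shares the annotation?)
    toy_hits = [(90, True), (80, True), (70, False), (40, False), (35, True)]
    for t in (30, 60, 85):
        print(t, transfer_accuracy(toy_hits, t))
```

Scanning the threshold from permissive to strict in this way yields accuracy-versus-similarity curves of the kind the procedure above is meant to produce; the number of hits returned alongside the accuracy indicates how many transfers remain possible at that threshold.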
Another measure of sequence similarity is the expectation value built into the popular PSI-BLAST [83] alignment program. An important point to realise for BLAST and PSI-BLAST users is that the expectation value depends on the database used to search for related proteins. This implies the following. Assume we align proteins A and B by pairwise BLAST in two ways: (i) by searching with A against SWISS-PROT, and (ii) by searching with A against SWISS-PROT + PDB [1] . Even if the resulting alignments between A and B are identical, the expectation values may differ significantly due to the difference in size of the two databases. Unfortunately, the accuracy of transferring different aspects of function differs substantially ( Fig. 1 A-C illustrate this for the case of localization and enzymatic activity).
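The dependence of the expectation value on database size can be illustrated with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length in residues. This relation is not spelled out in the text above and is assumed here; the sketch ignores the effective-length corrections that BLAST applies, and the bit score and database sizes are made-up numbers.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value for a hit with the given bit score.

    Uses E = m * n * 2**(-S'), without effective-length corrections
    (a simplification assumed for this sketch).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    bit_score = 60.0          # identical alignment in both searches
    query_len = 300           # residues in protein A
    small_db = 40_000_000     # hypothetical database size in residues
    large_db = 80_000_000     # hypothetical size after adding more entries
    print(expect_value(bit_score, query_len, small_db))
    print(expect_value(bit_score, query_len, large_db))
```

Because the expectation value scales linearly with the size of the search space, an identical alignment becomes less significant simply because the database grew; thresholds established on one database therefore should not be carried over to another without re-calibration.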
Fig. 1 : Accuracy and power of homology transfer. Thresholds for sequence similarity implying functional similarity depend on the particular aspect of function that we want to infer. For example, transfer of annotations for enzymatic activity (thick lines with open plus signs A-C) requires higher levels of similarity than transfer for annotations about sub-cellular localization (thin lines with diamonds A-C). Even at levels above 80% pairwise sequence identity, or at expectation values below 10^-150, we still make mistakes in transferring EC numbers. For which fraction of entirely sequenced organisms can we transfer annotations? An upper limit is provided by the fraction of proteins that have sequence similarity to proteins from SWISS-PROT (E-G). If we want the transfer at error levels <10% (arrows A-C), maximally 60% of all proteins from 62 entirely sequenced organisms can be annotated (arrow F). This estimate provides an upper limit since its two basic assumptions are likely over-optimistic: (1) not all SWISS-PROT proteins have reliable and detailed experimental annotations about function and (2) the accuracy of homology transfer for details of the functional role may be much lower for mechanisms that are less local than enzymatic activity.
Most annotations of function through homology transfer. In general, the inference of function is reliable only for very high levels of sequence similarity [60, 63, 45] . Although some perceive the estimate that 30% of the annotations may contain errors as particularly high [73] , our analysis of the sequence conservation of enzymatic activity suggested that this value may be rather over-optimistic [45] : if we want to transfer the full enzymatic activity with less than 30% errors, we have to require levels of >60% pairwise sequence identity, and for errors below 10%, >75% sequence identity ( Fig. 1 A). At the same error level (<10%), the HSSP-value must be >5 (Fig. 1B) and the expectation value below 10^-48 (Fig. 1C). How many proteins from entire proteomes can we annotate at such a level of accuracy? We aligned all proteins from 62 entirely sequenced organisms [5] by PSI-BLAST (protocol described in detail elsewhere [84] , basically three iterations at thresholds of 10^-10) against a database containing all proteins from SWISS-PROT, TrEMBL, and PDB. Then we monitored at which level of sequence similarity we found the most similar protein in SWISS-PROT or PDB. If we assume that all proteins in SWISS-PROT and PDB have complete annotations about function, and that the accuracy of homology transfer for all aspects of function is similar to that for enzymatic activity, we simply have to mark the points of 90% accuracy ( Fig. 1 D-F arrows). Maximally - when using the HSSP-value for annotation - we can thus transfer annotation for about 60% of all proteins in the 62 proteomes. When we require less than 5% errors, the number drops to about 35% of all proteins, and when permitting 40% errors, it rises to above 70% of all proteins. The latter (70%) also constitutes the saturation: for about 25-30% of the proteins from proteomes we find no protein in SWISS-PROT or PDB even at thresholds of sequence similarity that are far too permissive to transfer annotations ( Fig. 1 ).
These estimates are likely to constitute upper limits since the assumption that all proteins in SWISS-PROT and PDB are fully annotated experimentally is over-optimistic. Nevertheless, we currently know over 1.4 million protein sequences. Even if we pessimistically expect the ratio of reliable transfers to be only 10%, we still conclude that most annotations about function result from homology transfer. Furthermore, all these numbers ignore the capability of experts who can increase accuracy by combining many resources to annotate families [27] as realised, for instance, in Pfam-A [85] and TIGRFAMS [86] .
Basic concept. Bacterial cells generally consist of a single intracellular compartment surrounded by a plasma membrane. In contrast, eukaryotic cells are elaborately subdivided into functionally distinct, membrane-bounded compartments. Most eukaryotic proteins are encoded in the nuclear genome and synthesised in the cytosol, and many need to be further sorted into other sub-cellular compartments. The sorting signals that direct the movement of a protein through the cell, and thereby determine its eventual sub-cellular localization, are contained in its amino acid sequence [87, 88] . Proteins that remain in the cytosol do not have sorting signals. Many others, however, have specific sorting signals that direct their transport from the cytosol into the nucleus, the ER, mitochondria, plastids (in plants), or peroxisomes; sorting signals can also direct the transport of proteins from the ER to other destinations in the cell [89] . Proteins must be localised in the same sub-cellular compartment to cooperate towards a common physiological function. Thus, the native sub-cellular localization of a protein is one indicator of protein function. Aberrant sub-cellular localization of proteins has been observed in the cells of several diseases, such as cancer and Alzheimer's disease. Attempts to predict sub-cellular localization have become a central task in bioinformatics [78, 90] . The main methods are based on homology transfer, motif recognition, or the correlation between sequence features and localization ( Fig. 2 ).
Fig. 2 : Methods predicting sub-cellular localization. Four types of methods currently predict sub-cellular localization. (1) Transfer by homology: if we know that protein A is nuclear and we find protein B very similar in sequence to A, we can usually infer that B is also nuclear (Fig. 1 A-C, thin lines). (2) Identification by motifs: many proteins are shuttled between different compartments by carrier proteins that recognise short sequence motifs. Some of these motifs are consecutive in sequence (signal peptide, nuclear localization signal), others are discernible only from the folded structure (lysosomal retention signals). (3) Ab initio methods exploit the correlation between sequence features and localization. (4) Protein-protein interactions are another mechanism to shuttle proteins between compartments. Assume that two interacting proteins A and B are nuclear and that A has a nuclear localization signal that is recognised by an importin that carries A into the nucleus; B could be imported into the nucleus by binding to the complex A-importin. Recently, we combined the first three methods with another method that automatically recognises keywords in SWISS-PROT annotations [180] to annotate the localization of all eukaryotic proteins of known structure [97, 112] . The vast majority of all annotations resulted from homology transfer or lexical analysis (inner circle of top pie-chart). When applying the same methods to the entire proteome of C. elegans, this picture changed completely: about 87% of all proteins could only be handled by ab initio methods. Interestingly, 43% of all eukaryotic proteins of known structure appear to be extra-cellular (lower pie).
Prediction of localization through sequence motifs. One means for predicting localization is the identification of local sequence motifs such as signal peptides or nuclear localization signals (NLS). A number of neural network-based tools identify signal peptides that target proteins to the secretory pathway and the mitochondria [91, 92] . In a recent benchmark study [93] , these tools predicted signal peptide cleavage sites at >80% accuracy. A particular problem for methods detecting N-terminal signals is that start codons are predicted with less than 70% accuracy by genome projects [94, 95] . We have collected a data set of experimental and potential NLS motifs that predict nuclear localization at 100% accuracy [96, 97] ; the downside of this look-up library is that it is not complete: most proteins have no known NLS. Either the motif remains to be discovered, or the protein is imported into the nucleus through binding to another protein that has an NLS. Overall, known and predicted sequence motifs enable annotating about 30% of the proteins in six eukaryotic proteomes [22, 5] .
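A motif look-up of this kind amounts to scanning a sequence against a library of patterns. The minimal Python sketch below uses a single regular expression for the classical monopartite NLS consensus K-K/R-X-K/R as a toy stand-in; this pattern is an assumption chosen for illustration, is not taken from the data set cited above, and is far too permissive to reach the accuracy reported there. The function name and the test peptide are likewise hypothetical.

```python
import re

# Toy one-entry "library": classical monopartite NLS consensus K-K/R-X-K/R.
# Assumed for illustration only; a curated library would hold many,
# much more specific motifs.
NLS_PATTERN = re.compile(r"K[KR].[KR]")


def find_nls_candidates(sequence):
    """Return (0-based start, matched residues) for non-overlapping matches."""
    return [(m.start(), m.group())
            for m in NLS_PATTERN.finditer(sequence.upper())]


if __name__ == "__main__":
    # Peptide containing the well-known SV40 large T antigen NLS (PKKKRKV),
    # embedded in arbitrary flanking residues.
    print(find_nls_candidates("MAPKKKRKVEDP"))
```

The sketch also shows why such libraries are incomplete by construction: a protein without any listed motif simply returns an empty list, which says nothing about whether it is nuclear.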
Ab initio methods predict localization for all proteins at lower accuracy. Another approach to predicting localization has been suggested by the observation that the total amino acid composition correlates with the sub-cellular localization [98, 99, 100, 101, 102, 103] . This observation has led to the development of a variety of prediction methods based solely on composition [104, 94, 105, 106] . With the availability of large numbers of completely sequenced genomes, phylogenetic profiles have been employed to identify sub-cellular localization [107] . So far, this approach has been much less accurate in predicting localization than methods based solely on composition. Other methods have tried to integrate rules based on amino acid composition with databases of known signal sequences, e.g., PSORT II is a knowledge-based expert system that integrates the two kinds of information [108] . In particular, PSORT II uses other original prediction methods such as SignalP [109] , ChloroP [110] , and NNPSL [94] as input. Consequently, we may expect that PSORT II would improve if these original methods were improved. Drawid & Gerstein have proposed a Bayesian system based on a diverse range of 30 different features [111] . They applied their method to predicting localization of the full Saccharomyces cerevisiae proteome and provide estimates of the fraction of all yeast proteins found in different compartments. We have recently combined homology transfer, motifs-based and ab initio predictions to annotate all eukaryotic proteins of known structure ( Fig. 2 ). We learned that combining evolutionary and structural information yielded the most accurate predictions and that prediction methods appeared far less accurate when presented with fragments of the native protein sequence [112] .
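The idea behind composition-based prediction can be sketched as follows: each protein is reduced to a 20-dimensional vector of amino acid frequencies, and a classifier assigns the compartment whose typical composition is closest. The nearest-centroid rule used here is a deliberately simple stand-in assumed for illustration; the methods cited above use neural networks, expert rules, or Bayesian systems, and the toy sequences and labels below are made up.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fractions of the 20 standard amino acids, in AMINO_ACIDS order."""
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_centroid(sequence, centroids):
    """Label whose centroid composition is closest (squared Euclidean)."""
    vec = composition(sequence)

    def distance(label):
        return sum((x - y) ** 2 for x, y in zip(vec, centroids[label]))

    return min(centroids, key=distance)


if __name__ == "__main__":
    # Made-up centroids from toy sequences: one rich in basic residues,
    # one rich in hydrophobic residues. Real centroids would be averaged
    # over many experimentally annotated proteins.
    centroids = {
        "nuclear": composition("KKKKRRRR"),
        "membrane": composition("LLLLVVVV"),
    }
    print(nearest_centroid("KKRRKRAL", centroids))
```

Because every sequence has a composition, such a predictor returns an answer for all proteins, which is the source of both the full coverage and the lower accuracy noted above.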
Basic concept. Over 325 structural and regulatory post-translational modifications in proteins are known today [113] ; prediction methods are currently constrained to a few most relevant of these. These tools typically employ highly conserved sequence motifs, more complex sequence patterns, or structural properties such as solvent accessibility. Prominent post-translational modifications targeted for prediction include: N-terminal signal peptide cleavage sites [114, 102, 91, 115, 116, 117, 93, 118] , proteolytic cleavage and more specifically proteasome cleavage sites [119, 120, 117, 121, 118, 122, 123, 124] , phosphorylation sites [125, 126] , lipid modification [127] and N- and O-glycosylations [128, 129] .
Archiving known sequence motifs and predicting modifications. PhosphoBase [126] includes information on over 400 phosphorylated proteins, their phosphorylation sites and the specific kinase of action. These data were used to develop an ab initio method that predicts phosphorylation sites (NetPhos [125] ); predictions for serine, threonine, and tyrosine residues reach 69-96% sensitivity. The method uses information about sequence and structure. Given the difficulty in predicting structure around the phosphorylation site and the considerable variation of consensus sequences for kinase substrate specificity, the prediction of phosphorylation remains a difficult task. A similar neural network approach based on charged residues within glycosylation sites together with sequence context and surface accessibility is used to identify O-glycosylation modifications at about 80% accuracy [128] . The limited substrate specificity for both N-glycosylation and O-glycosylation currently limits progress [130] . Predictions of lipid modifications are currently restricted to glycosyl-phosphatidyl-inositol (GPI) anchors [127] . C-terminal motifs (omega-site) and physical properties of GPI-anchors enabled accurate predictions for the effects of mutations on known anchors. N-terminal motifs apparently allow for accurate predictions of N-myristoyltransferase (NMT) substrate sites [131] . Finally, a comprehensive study of proteasome digestion data yielded a method that accurately predicts MHC class I ligand boundaries after proteasomal degradation: 65% of the cleavage sites and 85% of the non-cleavage sites appeared to be predicted correctly [124] .
Basic concept. Monica Riley introduced the most widely used schema for classes of cellular function to annotate E. coli [132] . TIGR (The Institute for Genome Research) [3] and many other genome centres have adopted this schema with minor modifications. Transferring annotations of cellular function by homology has for long been almost the only field in which methods were developed. In fact, many researchers consider exclusively such methods when referring to the prediction of protein function. However, recently groups have begun developing methods that predict functional classes in the absence of experimental annotations.
Functional classes can be predicted from sequence. An interesting hybrid system uses inductive logic programming to predict functional classes with and without homology to experimentally annotated proteins [133] . While it is not clear how successful the system is in ab initio prediction, the levels of accuracy published on average appear promising. Genes located in a close neighbourhood on the genome may have some functional commonalities. While such neighbourhood relations sometimes enable predicting aspects such as classes of cellular function, the average signal is very weak, i.e. most often neighbours are not related in function [134, 135, 136] . The most recent breakthrough in the field of predicting protein function came through a collaboration of the groups of Søren Brunak (CBS Copenhagen) and Alfonso Valencia (CNB Madrid). Their aim is to predict cellular function from sequence alone. Their means are complex, elaborate, and hierarchical systems of neural networks [137] . A first group of networks is used to identify 'sequence features' (like protein length or amino acid composition) that optimally separate any two types of functional classes. These basic predictions are then combined in a final prediction step, again through neural networks. The authors applied their method to annotating functional classes for all human proteins. For example, the prion protein is predicted to belong to the 'transport and binding category' and to 'not have enzymatic activity'. This appears compatible with the observation that prion binds and transports copper while no catalytic activity has ever been observed [138] . Recently, the Brunak group have applied their new concepts to identifying novel enzymes in archaea [139] and to predicting the functional type of all human proteins according to the GO classification [140] . The most impressive news from these ground-breaking methods is that aspects of function can be predicted without homology, i.e. for completely uncharacterised proteins.
Basic concept. Every protein has a biological function, yet most biological functions are carried out by groups of proteins interacting in complex networks. Interactions between proteins can be physical, i.e. by chemically binding each other or by binding together to a third substrate, or they can be functional, e.g. by controlling each other's expression or by participating in the same biochemical pathway. To fully understand the molecular mechanism that underlies a certain biological function (or malfunction) we need to decipher the intricate networks of protein interactions that underlie these mechanisms. Therefore, an extensive research effort is invested in both experimental and computational methods that unravel protein-protein interactions [141, 142, 143, 144, 145, 146, 147, 148, 149, 32, 150, 151, 152, 16, 153, 154, 155, 156, 157] . In particular, many methods and databases attempt to draw complete maps of interactions for entire proteomes. Once it is known with which other proteins a newly discovered protein interacts, it will be easier to predict its function. Furthermore, it is hoped that these interaction maps will surrender the secrets of biological processes, and enhance the understanding of the underlying molecular mechanisms. A complete picture of all the proteins that are involved in a certain biological process would also break new ground in drug development by identifying new targets for drugs.
Databases and data-mining techniques compile existing information. A vast amount of information about protein-protein interactions already exists in the literature. However, this information is scattered across millions of pages of scientific publications. Several efforts aim at extracting this information from the literature [158, 159, 160, 161, 16, 162, 154, 155, 156] . The DIP database [162] is an example of a database dedicated to protein interactions. The curators of DIP manually survey the literature to find experimentally determined interactions. They also employ automatic techniques to obtain data from other databases. Other approaches use natural language processing algorithms, as well as other computational methods, to automatically extract interaction information from scientific papers [163, 158, 164, 160, 165, 161, 15, 16] . SUISEKI, a system for information extraction on interactions [158] , is reported to successfully extract 70-80% of the interactions in a large corpus of scientific abstracts.
Computational approaches predict protein-protein interactions. Many groups attempt to develop computational methods that predict protein-protein interactions in silico [136, 16] ( Fig. 3 ). Although not all proteins encoded by neighbouring genes on the genome interact with one another, gene location occasionally reveals true protein-protein interactions [166, 167] . Another approach screens genomes for sequences that appear as two different chains in one genome and are fused into a single protein chain in another, evolutionarily younger genome [168, 169] . The assumption is that evolution fused these two proteins into a single one because they interact with one another. Another comparative method searches for pairs of proteins that always occur together in all known genomes, i.e. there is no genome in which only one of the two proteins occurs [170, 169] . Such protein pairs are very likely to interact. Using these two methods, Eisenberg and his co-workers proposed thousands of protein pairs that may interact. However, there are no confirmed statistics regarding the reliability of the predictions of these methods. The assumption that interacting proteins co-evolve gave rise to other prediction methods. One specific implementation uses the observation that interacting proteins sometimes have phylogenetic trees that are mirror images of each other [171, 172, 173] . Alfonso Valencia and his group introduced an approach based on the assumption that interacting proteins evolve together, hence the mutations that occur in two interacting proteins in the course of evolution should be correlated. First, they demonstrated that such correlated mutations could distinguish between correct and incorrect docking solutions [174] . Then they developed a method that predicts protein-protein interaction partners by analysing the correlation between the mutations in different proteins across different species [175] . Preliminary results indicate that the predictions of these methods have a low false negative rate. Sprinzak and Margalit [176] predicted protein-protein interactions based on a very simple concept: assume we experimentally know that proteins P1 and P2 interact, and that they contain particular motifs or domains M1 and M2, respectively; if we find the same motifs in proteins P1' and P2', we might suspect that P1' and P2' also interact. The method can be improved by adding filters that take entire networks into account, weighting the probability of a predicted interaction between P1' and P2' by how often this combination is observed in an organism [177] . Aloy and Russell use alignments and 3D structures of known interactions to predict possible binding partners [178] . Given a 3D structure of a complex, they assess the likelihood that homologues are involved in similar interactions. An extension of this concept has been proposed by Skolnick and co-workers, who applied algorithms developed to detect more distant sequence relations (threading) to identify binding partners given an experimentally known 3D complex [179] . However, the major restriction of all these methods is that each of them is applicable only to a limited set of proteins. Another shortcoming of most methods is that they merely indicate whether a pair of proteins interacts, but they do not identify the interaction sites - a crucial piece of information for molecular research.
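The motif-pair concept of Sprinzak and Margalit described above can be illustrated with a minimal sketch. It is not the published implementation: all protein and motif names are hypothetical toy data, and a plain co-occurrence count stands in for whatever statistical score the original method uses.

```python
from collections import Counter
from itertools import product


def count_signature_pairs(known_interactions, motifs):
    """Count how often each unordered motif pair occurs across known interacting pairs."""
    counts = Counter()
    for p1, p2 in known_interactions:
        for m1, m2 in product(motifs.get(p1, ()), motifs.get(p2, ())):
            counts[tuple(sorted((m1, m2)))] += 1
    return counts


def supporting_signatures(pa, pb, motifs, counts, min_support=1):
    """Return motif pairs carried by (pa, pb) that were seen in known interactions."""
    found = set()
    for m1, m2 in product(motifs.get(pa, ()), motifs.get(pb, ())):
        pair = tuple(sorted((m1, m2)))
        # min_support is an assumed, adjustable cut-off on the raw count
        if counts[pair] >= min_support:
            found.add(pair)
    return sorted(found)


# Toy data (hypothetical): motif content per protein and one known interaction.
motifs = {
    "P1": {"M1"},
    "P2": {"M2"},
    "P1x": {"M1", "M3"},  # plays the role of P1' in the text
    "P2x": {"M2"},        # plays the role of P2' in the text
    "Q": {"M4"},
}
known = [("P1", "P2")]
counts = count_signature_pairs(known, motifs)

print(supporting_signatures("P1x", "P2x", motifs, counts))  # [('M1', 'M2')]
print(supporting_signatures("P1x", "Q", motifs, counts))    # []
```

A non-empty result only says that the pair carries a motif combination already seen in a known interaction; as noted above, it says nothing about where the interaction site lies.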
Fig. 3 : Methods predicting protein-protein interactions. (A) Genomic profiles [170, 169] : Entire genomes are searched for the presence of each protein; the table represents the presence (1) or absence (0) of a certain protein in a given genome. If two proteins have an identical pattern, i.e. neither of them appears in any of the genomes without the other, the method assumes that they interact. (B) Rosetta stone [168, 169] : If two separate proteins 1 and 2 are in genome A, and the same two are merged into one single protein in genome B, the method assumes that proteins 1 and 2 interact with one another. (C) Correlated mutations [174] : If a particular pair of interacting residues is important to maintain the interaction between proteins 1 and 2, we might expect to find examples of mutations that are correlated between 1 and 2, e.g. a positively charged amino acid in protein 1 that forms a salt bridge with a negatively charged amino acid in protein 2 might be mutated to a negatively charged one; this could be compensated by the reverse mutation of the negatively charged into a positively charged amino acid in protein 2. A refined version of this basic idea is implemented in methods that predict protein-protein interaction sites and interaction pairs based on correlated mutations. (D) Sequence signatures [177, 176] : Sequence motifs are marked in pairs of proteins that interact. The likelihood of interaction between every pair of motifs is used to predict interactions between the proteins carrying these motifs.
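Panels (A) and (B) lend themselves to a short illustration. The sketch below is a toy version under stated assumptions: the presence/absence profiles, the protein names, and the family labels (which in practice would come from a homology search) are all hypothetical, and the functions are not taken from any of the cited methods.

```python
from itertools import combinations


def identical_profile_pairs(profiles):
    """Panel A: return protein pairs whose presence/absence patterns are identical."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if profiles[a] == profiles[b]
    ]


def rosetta_stone_pairs(separate, fused):
    """Panel B: return pairs of separate proteins whose families are fused elsewhere.

    separate: protein name -> family label (genome A)
    fused:    fused protein name -> tuple of family labels it covers (genome B)
    """
    by_family = {}
    for protein, family in separate.items():
        by_family.setdefault(family, []).append(protein)
    pairs = set()
    for families in fused.values():
        for fam_a, fam_b in combinations(families, 2):
            for pa in by_family.get(fam_a, []):
                for pb in by_family.get(fam_b, []):
                    if pa != pb:
                        pairs.add(tuple(sorted((pa, pb))))
    return sorted(pairs)


# Hypothetical profiles over four genomes: 1 = present, 0 = absent.
profiles = {
    "P1": (1, 0, 1, 1),
    "P2": (1, 0, 1, 1),
    "P3": (0, 1, 1, 0),
    "P4": (1, 1, 1, 0),
}
print(identical_profile_pairs(profiles))  # [('P1', 'P2')]

# Hypothetical family assignments for genome A and one fusion protein in genome B.
separate = {"prot1": "famX", "prot2": "famY", "prot3": "famZ"}
fused = {"fusion_B": ("famX", "famY")}
print(rosetta_stone_pairs(separate, fused))  # [('prot1', 'prot2')]
```

The exact-match criterion mirrors the caption; a practical version would probably need to tolerate noisy profiles and to discount proteins that are present in every genome, since those match each other trivially.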
Homology transfer of function: use with extreme caution! No matter what definition of 'the function' or 'the fold' of a protein you may have, function clearly involves fewer residues directly than the fold does. Therefore, random mutations are more likely to influence function than structure. In other words, when proteins diverge they will - on average - lose their function before they lose their fold. This makes it more difficult for computational biology to predict function than to predict structure. Despite this problem, the most accurate means of predicting function for particular proteins undoubtedly is the expert-controlled transfer of experimental annotations through homology [27] . However, in the context of automatic, non-expert, and/or proteome-wide searches, homology transfer becomes problematic: on the one hand, we need very high levels of sequence similarity to reliably infer aspects of function through homology ( Fig. 1 A). On the other hand, the likelihood of finding close homologues is exponentially smaller than that of finding more diverged homologues ( Fig. 1 D). Thus, we find relatively few homologues of known function that allow transfer at very high accuracy. Many estimates for the functional coverage of entire genomes appear over-optimistic because they accept very high error rates. An additional complication is that computational biology has to establish separate accuracy thresholds for transferring function by homology for each aspect of function. Due to the lack of experimental data in general, and of machine-readable data in particular, such analyses have only begun over the last few years. The lesson from the large-scale analyses available for these few aspects is extreme caution in transferring function by homology!
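The trade-off between accuracy and coverage described above can be made concrete with a small sketch. Everything in it is an assumption for illustration: the hit lists, the annotations, and in particular the two identity thresholds, which are arbitrary and not values recommended by this review.

```python
def transfer_annotations(hits, min_identity):
    """Transfer the annotation of the closest homologue at or above min_identity.

    hits: query name -> list of (percent_identity, annotation) for homologues
          of experimentally known function.
    """
    transferred = {}
    for query, homologues in hits.items():
        eligible = [h for h in homologues if h[0] >= min_identity]
        if eligible:
            best = max(eligible, key=lambda h: h[0])
            transferred[query] = best[1]
    return transferred


def coverage(hits, min_identity):
    """Fraction of queries that receive an annotation at the given threshold."""
    return len(transfer_annotations(hits, min_identity)) / len(hits)


# Hypothetical search results for four query proteins.
toy_hits = {
    "q1": [(92.0, "kinase")],
    "q2": [(45.0, "transporter"), (38.0, "kinase")],
    "q3": [(28.0, "protease")],
    "q4": [],
}

print(transfer_annotations(toy_hits, 80.0))  # {'q1': 'kinase'}
print(coverage(toy_hits, 80.0))              # 0.25
print(coverage(toy_hits, 30.0))              # 0.5
```

Raising the threshold shrinks the set of annotated proteins, which is the coverage cost of demanding reliable transfer; the appropriate threshold would have to be established separately for each aspect of function.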
Ab initio prediction of function: first successes scored. For some aspects of function, such as sub-cellular localization, ab initio prediction methods have been pursued for some time. However, for most aspects of function, transfer by homology has long been the only means. Thus, overall the field of predicting function in silico is still in its infancy. Nevertheless, a few very promising methods have been proposed recently. Methods that predict sub-cellular localization are becoming increasingly accurate; methods that predict post-translational modifications are becoming increasingly useful and comprehensive, although the vast majority of experimentally observed post-translational modifications have not yet been covered. The first breakthroughs have been made in predicting protein-protein interactions and cellular function from sequence. In combination, all these novel methods may considerably aid the advance of molecular biology. Given that the appetite of molecular and medical biologists for functional annotations grows with the exponential increase in the number of known proteomes, the recent advances in computational biology are falling on fertile ground.
Particular thanks to Arthur Lesk (MRC, Cambridge, England) for essential comments and for making his master review available to us before publication. Our work was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health (NIH), and the grant DBI-0131168 from the National Science Foundation (NSF). Last but not least, thanks to the GeneOntology team around Michael Ashburner (Cambridge, England) for their gargantuan effort, to Amos Bairoch (SIB, Geneva), Rolf Apweiler (EBI, Hinxton), Phil Bourne (San Diego Univ.), and their crews for maintaining excellent databases, and to all experimentalists who enable computational biology by making their data publicly available.
| 1. | Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242. |
| 2. | Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A. et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucl. Acids Res., 31, 365-370. |
| 3. | Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F. et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496-512. |
| 4. | Liu, J. & Rost, B. (2001). Comparing function and structure between entire proteomes. Prot. Sci., 10, 1970-1979. |
| 5. | Carter, P., Liu, J. & Rost, B. (2003). PEP: Predictions for Entire Proteomes. Nucl. Acids Res., 31, 410-413. |
| 6. | Liu, J. & Rost, B. (2003). CHOP proteins into structural domain-like fragments. J. Mol. Biol., submitted 2003-03-25. |
| 7. | Pruess, M., Fleischmann, W., Kanapin, A., Karavidopoulou, Y., Kersey, P. et al. (2003). The Proteome Analysis database: a tool for the in silico analysis of whole proteomes. Nucl. Acids Res., 31, 414-417. |
| 8. | Andrade, M. A. & Bork, P. (2000). Automated extraction of information in molecular biology. FEBS Lett., 476, 12-17. |
| 9. | Lewis, S., Ashburner, M. & Reese, M. G. (2000). Annotating eukaryote genomes. Curr. Opin. Str. Biol., 10, 349-354. |
| 10. | Koonin, E. V. (2001). Computational genomics. Curr. Biol., 11, R155-8. |
| 11. | Fleischmann, W., Moller, S., Gateau, A. & Apweiler, R. (1999). A novel method for automatic functional annotation of proteins. Bioinformatics, 15, 228-233. |
| 12. | Holm, L. & Sander, C. (1999). Protein folds and families: sequence and structure alignments. Nucl. Acids Res., 27, 244-247. |
| 13. | Luscombe, N. M., Laskowski, R. A. & Thornton, J. M. (2001). Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucl. Acids Res., 29, 2860-2874. |
| 14. | Thornton, J. M. (2001). From genome to function. Science, 292, 2095-2097. |
| 15. | Valencia, A. (2002). Bioinformatics: biology by other means. Bioinformatics, 18, 1551-1552. |
| 16. | Valencia, A. & Pazos, F. (2002). Computational methods for the prediction of protein interactions. Curr. Opin. Str. Biol., 12, 368-373. |
| 17. | Gerstein, M. & Levitt, M. (1997). A structural census of the current population of protein sequences. Proc. Natl. Acad. Sci. U.S.A., 94, 11911-11916. |
| 18. | Teichmann, S. A., Chothia, C. & Gerstein, M. (1999). Advances in structural genomics. Curr. Opin. Str. Biol., 9, 390-399. |
| 19. | Wolf, Y., Brenner, S., Bash, P. & Koonin, E. (1999). Distribution of protein folds in the three superkingdoms of life. Genome Res., 9, 17-26. |
| 20. | Moult, J. & Melamud, E. (2000). From fold to function. Curr. Opin. Str. Biol., 10, 384-389. |
| 21. | Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nat. Struct. Biol., 8, 559-566. |
| 22. | Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922-933. |
| 23. | Bork, P., Ouzounis, C., Sander, C., Scharf, M., Schneider, R. et al. (1992). What's in a genome? Nature, 358, 287. |
| 24. | Andrade, M. A., Brown, N. P., Leroy, C., Hoersch, S., de Daruvar, A. et al. (1999). Automated genome sequence analysis and annotation. Bioinformatics, 15, 391-412. |
| 25. | Iliopoulos, I., Tsoka, S., Andrade, M. A., Janssen, P., Audit, B. et al. (2001). Genome sequences and great expectations. Genome Biol., 2, interactions 2000. |
| 26. | Airozo, D., Allard, R., Brylawski, B., Canese, K., Kenton, D. et al. (1999). MEDLINE. 1999, . |
| 27. | Whisstock, J. C. & Lesk, A. M. (2003). Prediction of protein function from protein sequence and structure. Quart. Rev. Biophys., in press. |
| 28. | Casari, G., Sander, C. & Valencia, A. (1995). A method to predict functional residues in proteins. Nat. Struct. Biol., 2, 171-178. |
| 29. | Lichtarge, O., Bourne, H. R. & Cohen, F. E. (1996). An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol., 257, 342-358. |
| 30. | Mizuguchi, K., Deane, C. M., Blundell, T. L., Johnson, M. S. & Overington, J. P. (1998). JOY: protein sequence-structure representation and analysis. Bioinformatics, 14, 617-623. |
| 31. | Andersen, C. A. F., Palmer, A. G., Brunak, S. & Rost, B. (2002). Continuum secondary structure captures protein flexibility. Structure, 10, 175-184. |
| 32. | Lichtarge, O. & Sowa, M. E. (2002). Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Str. Biol., 12, 21-27. |
| 33. | Pupko, T., Bell, R. E., Mayrose, I., Glaser, F. & Ben-Tal, N. (2002). Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics, 18, S71-S77. |
| 34. | del Sol Mesa, A., Pazos, F. & Valencia, A. (2003). Automatic methods for predicting functionally important residues. J. Mol. Biol., 326, 1289-1302. |
| 35. | Glaser, F., Pupko, T., Paz, I., Bell, R. E., Bechor-Shental, D. et al. (2003). ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics, 19, 163-164. |
| 36. | Junker, V., Contrino, S., Fleischmann, W., Hermjakob, H., Lang, F. et al. (2000). The role SWISS-PROT and TrEMBL play in the genome research environment. J Biotechnol, 78, 221-234. |
| 37. | Stoesser, G., Baker, W., van Den Broek, A., Camon, E., Garcia-Pastor, M. et al. (2001). The EMBL nucleotide sequence database. Nucl. Acids Res., 29, 17-21. |
| 38. | Apweiler, R. (2000). Protein sequence databases. Adv Protein Chem, 54, 31-71. |
| 39. | Xu, D., Xu, Y. & Uberbacher, E. C. (2000). Computational tools for protein modeling. Curr Protein Pept Sci, 1, 1-21. |
| 40. | Kriventseva, E. V., Biswas, M. & Apweiler, R. (2001). Clustering and analysis of protein families. Curr. Opin. Str. Biol., 11, 334-339. |
| 41. | Frishman, D., Kaps, A. & Mewes, H. W. (2002). Online genomics facilities in the new millennium. Pharmacogenomics, 3, 265-271. |
| 42. | Sigrist, C. J., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L. et al. (2002). PROSITE: a documented database using patterns and profiles as motif descriptors. Briefing Bioinf., 3, 265-274. |
| 43. | Tamames, J., Clark, D., Herrero, J., Dopazo, J., Blaschke, C. et al. (2002). Bioinformatics methods for the analysis of expression arrays: data clustering and information extraction. J Biotechnol, 98, 269-283. |
| 44. | Liu, J. & Rost, B. (2003). Domains, motifs, and clusters in the protein universe. Curr. Opin. Chem. Biol., 7, 5-11. |
| 45. | Rost, B. (2002). Enzyme function less conserved than anticipated. J. Mol. Biol., 318, 595-608. |
| 46. | Kallberg, Y. & Persson, B. (1999). KIND-a non-redundant protein database. Bioinformatics, 15, 260-261. |
| 47. | Kriventseva, E. V., Servant, F. & Apweiler, R. (2003). Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters. Nucl. Acids Res., 31, 388-389. |
| 48. | Henikoff, S., Henikoff, J. G. & Pietrokovski, S. (1999). Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics, 15, 471-479. |
| 49. | O'Donovan, C., Martin, M. J., Glemet, E., Codani, J. J. & Apweiler, R. (1999). Removing redundancy in SWISS-PROT and TrEMBL. Bioinformatics, 15, 258-259. |
| 50. | Li, W., Jaroszewski, L. & Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282-283. |
| 51. | Mika, S. & Rost, B. (2003). UniqueProt: creating representative protein sequence sets. Nucl. Acids Res., 31, in press. |
| 52. | Harrison, P. M., Bamborough, P., Daggett, V., Prusiner, S. & Cohen, F. E. (1997). The prion folding problem. Curr. Opin. Str. Biol., 7, 53-59. |
| 53. | Gaasterland, T. & Sensen, C. W. (1996). Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture. Biochimie, 78, 302-310. |
| 54. | Eisenberg, D., Marcotte, E. M., Xenarios, I. & Yeates, T. O. (2000). Protein function in the post-genomic era. Nature, 405, 823-826. |
| 55. | Ashburner, M., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M. et al. (2000). Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat. Gen., 25, 25-29. |
| 56. | Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol., 307, 1113-1143. |
| 57. | O'Donovan, C., Martin, M. J., Gattiker, A., Gasteiger, E., Bairoch, A. et al. (2002). High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Briefing Bioinf., 3, 275-284. |
| 58. | Shah, I. & Hunter, L. (1997). Predicting enzyme function from sequence: a systematic appraisal. In Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. et al., eds.), pp. 276-283, AAAI Press, Halkidiki, Greece. |
| 59. | Ouzounis, C., Perez-Irratxeta, C., Sander, C. & Valencia, A. (1998). Are binding residues conserved? Pac Symp Biocomput, 3, 399-410. |
| 60. | Devos, D. & Valencia, A. (2000). Practical limits of function prediction. Proteins, 41, 98-107. |
| 61. | Pawlowski, K., Jaroszewski, L., Rychlewski, L. & Godzik, A. (2000). Sensitive sequence comparison as protein function predictor. Pac Symp Biocomput, 8, 42-53. |
| 62. | Wilson, C. A., Kreychman, J. & Gerstein, M. (2000). Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol., 297, 233-249. |
| 63. | Nair, R. & Rost, B. (2002). Sequence conserved for sub-cellular localization. Prot. Sci., 11, 2836-2847. |
| 64. | Wrzeszczynski, K. O. & Rost, B. (2003). In silico analysis of retention signals for Endoplasmic reticulum and Golgi apparatus. Proteins, submitted. |
| 65. | Wrzeszczynski, K. O. & Rost, B. (2003). Cataloguing proteins in cell cycle control. In Cell cycle checkpoint control protocols (Lieberman, H., eds.), pp. in press, Humana Press, Totowa, NJ. |
| 66. | Casari, G., Andrade, M. A., Bork, P., Boyle, J., Daruvar, A. et al. (1995). Challenging times for bioinformatics. Nature, 376, 647-648. |
| 67. | Ouzounis, C., Casari, G., Sander, C., Tamames, J. & Valencia, A. (1996). Computational comparisons of model genomes. TIBTECH, 14, 280-285. |
| 68. | Kyrpides, N. C. & Ouzounis, C. A. (1998). Errors in genome reviews. Science, 281, 1457. |
| 69. | Kyrpides, N. C. & Ouzounis, C. A. (1999). Whole-genome sequence annotation: 'Going wrong with confidence'. Mol. Microbiol., 32, 886-887. |
| 70. | Galperin, M. Y. & Koonin, E. V. (1998). Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol., 1, 55-67. |
| 71. | Brenner, S. E. (1999). Errors in genome annotation. TIGS, 15, 132-133. |
| 72. | Mushegian, A. R. (2000). Annotations of biochemically uncharacterized open reading frames (ORFs). Mol. Microbiol., 35, 697-698. |
| 73. | Devos, D. & Valencia, A. (2001). Intrinsic errors in genome annotation. TIGS, 17, 429-431. |
| 74. | Tamames, J., Gonzalez-Moreno, M., Mingorance, J., Valencia, A. & Vicente, M. (2001). Bringing gene order into bacterial shape. TIGS, 17, 124-126. |
| 75. | Iyer, L. M., Aravind, L., Bork, P., Hofmann, K., Mushegian, A. R. et al. (2001). Quod erat demonstrandum? The mystery of experimental validation of apparently erroneous computational analyses of protein sequences. Genome Biol., 2, RESEARCH0051. |
| 76. | Webb, E. C. (1992). Enzyme Nomenclature 1992. Recommendations of the Nomenclature committee of the International Union of Biochemistry and Molecular Biology. Academic Press, New York. |
| 77. | Mewes, H. W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K. et al. (2002). MIPS: a database for genomes and protein sequences. Nucl. Acids Res., 30, 31-34. |
| 78. | Eisenhaber, F. & Bork, P. (1998). Wanted: subcellular localization of proteins based on sequence. TICB, 8, 169-170. |
| 79. | Tsoka, S. & Ouzounis, C. A. (2001). Functional versatility and molecular diversity of the metabolic map of Escherichia coli. Genome Res., 11, 1503-1510. |
| 80. | Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin., 12, 85-94. |
| 81. | Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins, 9, 56-68. |
| 82. | Nielsen, H., Engelbrecht, J., von Heijne, G. & Brunak, S. (1996). Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins, 24, 165-177. |
| 83. | Altschul, S., Madden, T., Shaffer, A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-3402. |
| 84. | Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins, 46, 195-205. |
| 85. | Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L. et al. (2002). The Pfam protein families database. Nucl. Acids Res., 30, 276-280. |
| 86. | Haft, D. H., Selengut, J. D. & White, O. (2003). The TIGRFAMs database of protein families. Nucl. Acids Res., 31, 371-373. |
| 87. | Schatz, G. & Dobberstein, B. (1996). Common principles of protein translocation across membranes. Science, 271, 1519-1526. |
| 88. | Mattaj, I. W. & Englmeier, L. (1998). Nucleocytoplasmic transport: the soluble phase. Annu Rev Biochem, 67, 265-306. |
| 89. | Pelham, H. R. & Rothman, J. E. (2000). The debate about transport in the Golgi--two sides of the same coin? Cell, 102, 713-719. |
| 90. | Nakai, K. (2000). Protein sorting signals and prediction of subcellular localization. Adv Protein Chem, 54, 277-344. |
| 91. | Nielsen, H., Brunak, S. & von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Prot. Engin., 12, 3-9. |
| 92. | Emanuelsson, O., von Heijne, G. & Schneider, G. (2001). Analysis and prediction of mitochondrial targeting peptides. Methods Cell Biol, 65, 175-187. |
| 93. | Menne, K. M., Hermjakob, H. & Apweiler, R. (2000). A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics, 16, 741-742. |
| 94. | Reinhardt, A. & Hubbard, T. (1998). Using neural networks for prediction of the subcellular location of proteins. Nucl. Acids Res., 26, 2230-2235. |
| 95. | Gaasterland, T. & Oprea, M. (2001). Whole-genome analysis: annotations and updates. Curr. Opin. Str. Biol., 11, 377-381. |
| 96. | Cokol, M., Nair, R. & Rost, B. (2000). Finding nuclear localisation signals. EMBO Rep., 1, 411-415. |
| 97. | Nair, R., Carter, P. & Rost, B. (2003). NLSdb: database of nuclear localization signals. Nucl. Acids Res., 31, 397-399. |
| 98. | Nishikawa, K. & Ooi, T. (1982). Correlation of the amino acid composition of a protein to its structural and biological characteristics. J. Biochem., 91, 1821-1824. |
| 99. | Nakai, K. & Kanehisa, M. (1991). Expert system for predicting protein localization sites in gram-negative bacteria. Proteins, 11, 95-110. |
| 100. | Nakai, K. & Kanehisa, M. (1992). A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897-911. |
| 101. | Nakashima, H. & Nishikawa, K. (1992). The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins. FEBS Lett., 303, 141-146. |
| 102. | Claros, M. G. & Vincens, P. (1995). Computational method to predict mitochondrially imported proteins and their transit peptides. Eur. J. Biochem., 241, 779-786. |
| 103. | Horton, P. & Nakai, K. (1996). A probabilistic classification system for predicting the cellular localization sites of proteins. In Fourth International Conference on Intelligent Systems for Molecular Biology (States, D., Agarwal, P., Gaasterland, T., Hunter, L. & Smith, R. F., eds.), pp. 109-115, AAAI Press, St. Louis, M.O., U.S.A. |
| 104. | Cedano, J., Aloy, P., Pérez-Pons, J. A. & Querol, E. (1997). Relation between amino acid composition and cellular location of proteins. J. Mol. Biol., 266, 594-600. |
| 105. | Hua, S. & Sun, Z. (2001). Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728. |
| 106. | Mott, R., Schultz, J., Bork, P. & Ponting, C. P. (2002). Predicting protein cellular localization using a domain projection method. Genome Res., 12, 1168-1174. |
| 107. | Marcotte, E. M., Xenarios, I., van Der Bliek, A. M. & Eisenberg, D. (2000). Localizing proteins in the cell from their phylogenetic profiles. Proc. Natl. Acad. Sci. U.S.A., 97, 12115-12120. |
| 108. | Nakai, K. & Horton, P. (1999). PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. TIBS, 24, 34-6. |
| 109. | Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Prot. Engin., 10, 1-6. |
| 110. | Emanuelsson, O., Nielsen, H. & von Heijne, G. (1999). ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Prot. Sci., 8, 978-984. |
| 111. | Drawid, A. & Gerstein, M. (2000). A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J. Mol. Biol., 301, 1059-1075. |
| 112. | Nair, R. & Rost, B. (2003). Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins, in press. |
| 113. | Garavelli, J. S. (2003). The RESID Database of Protein Modifications: 2003 developments. Nucl. Acids Res., 31, 499-501. |
| 114. | Ladunga, I., Czakó, F., Csabai, I. & Geszti, T. (1991). Improving signal peptide prediction accuracy by simulated neural network. CABIOS, 7, 485-487. |
| 115. | Schneider, G. (1999). How many potentially secreted proteins are contained in a bacterial genome? Gene, 237, 113-121. |
| 116. | Emanuelsson, O., Nielsen, H., Brunak, S. & von Heijne, G. (2000). Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005-1016. |
| 117. | Jagla, B. & Schuchhardt, J. (2000). Adaptive encoding neural networks for the recognition of human signal peptide cleavage sites. Bioinformatics, 16, 245-250. |
| 118. | Nakai, K. (2001). Prediction of in vivo fates of proteins in the era of genomics and proteomics. J. Struct. Biol., 134, 103-116. |
| 119. | Cai, Y. D., Yu, H. & Chou, K. C. (1998). Artificial neural network method for predicting HIV protease cleavage sites in protein. J Protein Chem, 17, 607-615. |
| 120. | Wrede, P., Landt, O., Klages, S., Fatemi, A., Hahn, U. et al. (1998). Peptide design aided by neural networks: biological activity of artificial signal peptidase I cleavage sites. Biochem., 37, 3588-3593. |
| 121. | Jarmer, H., Larsen, T. S., Krogh, A., Saxild, H. H., Brunak, S. et al. (2001). Sigma A recognition sites in the Bacillus subtilis genome. Microbiology, 147, 2417-2424. |
| 122. | Nussbaum, A. K., Kuttler, C., Hadeler, K. P., Rammensee, H. G. & Schild, H. (2001). PAProC: a prediction algorithm for proteasomal cleavages available on the WWW. Immunogenetics, 53, 87-94. |
| 123. | Graber, J. H., McAllister, G. D. & Smith, T. F. (2002). Probabilistic prediction of Saccharomyces cerevisiae mRNA 3'-processing sites. Nucl. Acids Res., 30, 1851-1858. |
| 124. | Kesmir, C., Nussbaum, A. K., Schild, H., Detours, V. & Brunak, S. (2002). Prediction of proteasome cleavage motifs by neural networks. Prot. Engin., 15, 287-296. |
| 125. | Blom, N., Gammeltoft, S. & Brunak, S. (1999). Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J. Mol. Biol., 294, 1351-1362. |
| 126. | Kreegipuu, A., Blom, N. & Brunak, S. (1999). PhosphoBase, a database of phosphorylation sites: release 2.0. Nucl. Acids Res., 27, 237-9. |
| 127. | Eisenhaber, B., Bork, P. & Eisenhaber, F. (2001). Post-translational GPI lipid anchor modification of proteins in kingdoms of life: analysis of protein sequence data from complete genomes. Prot. Engin., 14, 17-25. |
| 128. | Hansen, J., Lund, O., Tolstrup, N., Gooley, A. A., Williams, K. L. et al. (1998). NetOglyc: Prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconjugate Journal, 15, 115-130. |
| 129. | Gupta, R. & Brunak, S. (2002). Prediction of glycosylation across the human proteome and the correlation to protein function. Pac Symp Biocomput, 310-322. |
| 130. | Christlet, T. H., Biswas, M. & Veluraja, K. (1999). A database analysis of potential glycosylating Asn-X-Ser/Thr consensus sequences. Acta Crystallogr D Biol Crystallogr, 55, 1414-1420. |
| 131. | Maurer-Stroh, S., Eisenhaber, B. & Eisenhaber, F. (2002). N-terminal N-myristoylation of proteins: prediction of substrate proteins from amino acid sequence. J. Mol. Biol., 317, 541-557. |
| 132. | Riley, M. (1993). Function of the gene products in Escherichia coli. Microbiol. Rev., 57, 862-952. |
| 133. | Clare, A. & King, R. D. (2002). Machine learning of functional class from phenotype data. Bioinformatics, 18, 160-166. |
| 134. | Tamames, J., Casari, G., Ouzounis, C. & Valencia, A. (1997). Conserved clusters of functionally related genes in two bacterial genomes. J. Mol. Evol., 44, . |
| 135. | Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G. D. & Maltsev, N. (1999). Use of contiguity on the chromosome to predict functional coupling. In Silico Biol., 1, 93-108. |
| 136. | Galperin, M.Y. & Koonin, E. V. (2000). Who's your neighbor? New computationalapproaches for functional genomics. Nat. Biotechnol., 18, 609-613. |
| 137. | Jensen, L.J., Gupta, R., Blom, N., Devos, D., Tamames, J. et al. (2002). Prediction ofhuman protein function from post-translational modifications and localizationfeatures. J. Mol. Biol.,319, 1257-1265. |
| 138. | Brown, D. R.(2002). Copper and prion diseases. Biochem Soc Trans, 30, 742-745. |
| 139. | Jensen, L.J., Skovgaard, M. & Brunak, S. (2002). Prediction of novel archaeal enzymesfrom sequence-derived features. Prot. Sci., 11, 2894-2898. |
| 140. | Jensen, L.J., Gupta, R., Staerfeldt, H. H. & Brunak, S. (2003). Prediction of humanprotein function according to Gene Ontology categories. Bioinformatics, 19, 635-642. |
| 141. | Marcotte, E.M. (2000). Computational genetics: finding protein function by nonhomologymethods. Curr. Opin. Str. Biol., 10, 359-365. |
| 142. | Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S. et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-627. |
| 143. | Mann, M., Hendrickson, R. C. & Pandey, A. (2001). Analysis of proteins and proteomes by mass spectrometry. Annu. Rev. Biochem., 70, 437-473. |
| 144. | Michnick, S. W. (2001). Exploring protein interactions by interaction-induced folding of proteins from complementary peptide fragments. Curr. Opin. Str. Biol., 11, 472-477. |
| 145. | Teichmann, S. A., Murzin, A. G. & Chothia, C. (2001). Determination of protein function, evolution and interactions by structural genomics. Curr. Opin. Str. Biol., 11, 354-363. |
| 146. | Xenarios, I., Fernandez, E., Salwinski, L., Duan, X. J., Thompson, M. J. et al. (2001). DIP: the database of interacting proteins: 2001 update. Nucl. Acids Res., 29, 239-241. |
| 147. | DeLano, W. (2002). Unravelling hot spots in binding interfaces: progress and challenges. Curr. Opin. Str. Biol., 12, 14-20. |
| 148. | Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M. et al. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141-147. |
| 149. | Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L. et al. (2002). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180-183. |
| 150. | Sheinerman, F. B. & Honig, B. (2002). On the role of electrostatic interactions in the design of protein-protein interfaces. J. Mol. Biol., 318, 161-177. |
| 151. | Smith, G. R. & Sternberg, M. J. (2002). Prediction of protein-protein interactions by docking methods. Curr. Opin. Str. Biol., 12, 28-35. |
| 152. | Tong, A. H., Drees, B., Nardelli, G., Bader, G. D., Brannetti, B. et al. (2002). A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science, 295, 321-324. |
| 153. | Aloy, P. & Russell, R. B. (2003). InterPreTS: protein Interaction Prediction through Tertiary Structure. Bioinformatics, 19, 161-162. |
| 154. | Bader, G. D., Betel, D. & Hogue, C. W. (2003). BIND: the Biomolecular Interaction Network Database. Nucl. Acids Res., 31, 248-250. |
| 155. | Bock, J. R. & Gough, D. A. (2003). Whole-proteome interaction mining. Bioinformatics, 19, 125-134. |
| 156. | Ng, S. K., Zhang, Z., Tan, S. H. & Lin, K. (2003). InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucl. Acids Res., 31, 251-254. |
| 157. | von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P. et al. (2003). STRING: a database of predicted functional associations between proteins. Nucl. Acids Res., 31, 258-261. |
| 158. | Blaschke, C. & Valencia, A. (2001). The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform Ser Workshop Genome Inform, 12, 123-134. |
| 159. | Gromiha, M. M. & Selvaraj, S. (2001). Comparison between long-range interactions and contact order in determining the folding rate of two-state proteins: application of long-range order to folding rate prediction. J. Mol. Biol., 310, 27-32. |
| 160. | Marcotte, E. M., Xenarios, I. & Eisenberg, D. (2001). Mining literature for protein-protein interactions. Bioinformatics, 17, 359-363. |
| 161. | Krauthammer, M., Kra, P., Iossifov, I., Gomez, S. M., Hripcsak, G. et al. (2002). Of truth and pathways: chasing bits of information through myriads of articles. Bioinformatics, 18, S249-S257. |
| 162. | Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S. M. et al. (2002). DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucl. Acids Res., 30, 303-305. |
| 163. | Blaschke, C., Oliveros, J. C. & Valencia, A. (2001). Mining functional information associated with expression arrays. Funct Integr Genomics, 1, 256-268. |
| 164. | Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. (2001). GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, S74-S82. |
| 165. | Blaschke, C., Hirschman, L. & Valencia, A. (2002). Information extraction in molecular biology. Brief Bioinform, 3, 154-165. |
| 166. | Dandekar, T., Snel, B., Huynen, M. & Bork, P. (1998). Conservation of gene order: a fingerprint of proteins that physically interact. TIBS, 23, 324-328. |
| 167. | Huynen, M., Snel, B., Lathe, W. & Bork, P. (2000). Predicting protein function by genomic context. Genome Res., 4, 1204-1210. |
| 168. | Enright, A. J., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86-90. |
| 169. | Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. et al. (1999). Detecting protein function and protein-protein interactions from genome sequences. Science, 285, 751-753. |
| 170. | Gaasterland, T. & Ragan, M. A. (1998). Constructing multigenome views of whole microbial genomes. Microb Comp Genomics, 3, 177-192. |
| 171. | Goh, C. S., Bogan, A. A., Joachimiak, M., Walther, D. & Cohen, F. E. (2000). Co-evolution of proteins with their interaction partners. J. Mol. Biol., 299, 283-293. |
| 172. | Pazos, F. & Valencia, A. (2001). Similarity of phylogenetic trees as indicator of protein-protein interaction. Prot. Engin., 14, 609-614. |
| 173. | Goh, C. S. & Cohen, F. E. (2002). Co-evolutionary analysis reveals insights into protein-protein interactions. J. Mol. Biol., 324, 177-192. |
| 174. | Pazos, F., Helmer-Citterich, M., Ausiello, G. & Valencia, A. (1997). Correlated mutations contain information about protein-protein interaction. J. Mol. Biol., 271, 511-523. |
| 175. | Pazos, F. & Valencia, A. (2002). In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins, 47, 219-227. |
| 176. | Sprinzak, E. & Margalit, H. (2001). Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol., 311, 681-692. |
| 177. | Gomez, S. M., Lo, S. H. & Rzhetsky, A. (2001). Probabilistic prediction of unknown metabolic and signal-transduction networks. Genetics, 159, 1291-1298. |
| 178. | Aloy, P. & Russell, R. B. (2002). Interrogating protein interaction networks through structural biology. Proc. Natl. Acad. Sci. U.S.A., 99, 5896-5901. |
| 179. | Lu, L., Lu, H. & Skolnick, J. (2002). MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins, 49, 350-364. |
| 180. | Nair, R. & Rost, B. (2002). Inferring sub-cellular localisation through automated lexical analysis. Bioinformatics, 18, S78-S86. |
| Contact: rost@columbia.edu | Version: Jun 3, 2003 |