| Title: | UniqueProt: creating representative protein sequence sets |
| Author: | Sven Mika & Burkhard Rost |
| Quote: | Nucl Acids Res, 2003, 31, 3642-3644 |
UniqueProt: creating representative protein sequence sets
| 1 | CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 2 | Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA |
| 3 | North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA |
| 4 | Institute of Physical Biochemistry, University Witten/Herdecke, Stockumer Strasse 10, 58448 Witten, Germany |
| * | Corresponding author: email = mika@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/ Tel: +1-212-305-4018, fax: +1-212-305-7932 |
This article is published in (Nucleic Acids Research, issue, 2003 and pages) © copyright Oxford University Press (2003). OUP is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.
UniqueProt is a practical and easy to use web-service designed to create representative, unbiased data sets of protein sequences. The largest possible representative sets are found through a simple greedy algorithm using the HSSP-value to establish sequence similarity. UniqueProt is not a real clustering program in the sense that the 'representatives' are not at the centres of well-defined clusters since the definition of such clusters is problem-specific. Overall, UniqueProt is a reasonable fast solution for bias in data sets. The service is accessible at http://cubic.bioc.columbia.edu/services/uniqueprot; a command-line version for Linux is downloadable from this website.
Key words: sequence-unique data sets; sequence homology; clustering; HSSP distance; bioinformatics
| BIG | non-identical merger of SWISS-PROT [1] , TrEMBL [1] and PDB [2] |
| BLAST | fast sequence alignment method [3, 4] |
| HSSP | Homology derived structures of proteins [5] |
| PDB | Protein Data Bank of experimentally determined 3D structures of proteins [2] SWISS-PROT, database of protein sequences [1] |
| TrEMBL | translation of the EMBL-nucleotide database coding DNA to protein sequences [1] . |
The problem of biased data sets.Increasingly often experimentalists face the problem of searching for some 'significant' motifs or features in a set of proteins retrieved from common database searches. When we simply use the sequences with today's bias, we risk to over-estimate significance [6] . The bias has two potential sources: (1) certain families could be missing or (2) could be over-represented. Such bias may hinder finding sequence-patterns that are related to protein structure and/or function. We cannot solve the first problem since we do not have any insight into the still undiscovered and missing sequences of the protein universe. However, we can discard over-represented sequences by grouping similar proteins.
Inferring functional similarity from sequence similarity. Supposedly, the mostly desired criterion for grouping two proteins into one 'family' is that the two share a common function. This is by far not an easy task considering the many different levels of functional roles any particular protein orchestrates within a living cell. In fact, while such inferences are accurate for high levels of pairwise sequence similarity, they become accurate rather rapidly with the level of divergence between the two proteins [7, 6] . If we consider two proteins to have similar function by the token that both participate in cell cycle control, we need to establish different thresholds for pairwise sequence similarity that allows to infer this feature by homology [8] . We need to apply yet a different battery of thresholds to infer that (1) two proteins dwell in the same sub-cellular compartment [9, 10] , (2) that they belong to the same groups of cellular function [11] , have similar binding sites [12] , or belong to similar descriptions according to the GeneOntology [13, 14] .
Inferring structural similarity from sequence similarity.Arguably, the feature that is most conserved with evolutionarily diverging sequences is protein structure [15, 16, 17] . If we consider protein sequences as simple strings of letters, mathematics suggests that the probability of finding 10 in 20 aligned residues (50%) is much higher than that of finding 100 in 200 (also 50%) [18] . Sander & Schneider accounted for this obvious reality of sequence analysis by introducing an empirical threshold that related alignment length and pairwise sequence identity in a way allowing to automatically determine families of proteins with similar structure in their HSSP database [5] . A refined version of this original HSSP curve proved to better discriminate between proteins of similar and non-similar structure than expectation values from pairwise BLAST searches [16] . Since the functional form of this curve also appears to rather accurately reflect similarity in sub-cellular localisation [9, 8, 10] and enzymatic activity [6] , we based our bias-reduction tool UniqueProt on this curve. UniqueProt removes the bias of sequence-redundant proteins from a given data set in the hope of acquiring unique sub-sets that constitute more accurate approximations to the goal of analysing sets representative for the protein universe. However, users should be careful about submitting data sets with very heterogeneous domain architectures since the UniqueProt algorithm may completely remove domain-representatives. Especially the submission of sequence-fragments is not recommended.
Fig. 1. : HSSP-curve for different values u. The curves illustrate different HSSP-values u from the original HSSP-curve (Eqn. 1). Every pairwise alignment can be represented as a point in the graph above. Any naturally evolved two proteins for which the similarity falls above the curve u=0 are expected to have similar structures. Higher values provide more cautious estimates about features common to two proteins and larger sequence-unique sub-sets.
Input. The program accepts either a set of sequences in FASTA format or a list of identifiers from either of the following protein databases: SWISS-PROT [1] , PDB [2] or TrEMBL [1] . Alternatively, one of the following alignment-file formats is accepted to bypass the first step of the algorithm (see below): BLAST, PSIBLAST, pair, markx0, markx1, markx2, markx3, markx10 or srspair.
HSSP-value to measure sequence similarity. First all sequences are compared with BLAST [3, 4] . Then the percentage of identical residues and the length (L) of the BLAST-derived alignment (without including the gaps) are converted into the HSSP-value (HV) according to Eq 1. Here PID is the number of identical residues in the BLAST alignment times 100 and divided by L. The HSSP-value reflects whether an alignment is above the HSSP-curve [5, 16] (HSSP-value >0) or below (<0). For the first case (>0) the HSSP-value can be seen as a degree of sequence-proximity whereas for the latter case (<0) it gives an estimate about the distance between two compared sequences. For the case that an alignment file instead of a FASTA file or list of identifiers is submitted the HSSP-value is directly derived from the alignment information without performing a BLAST comparison first.
|
(eqn. 1) |
Algorithm. In order to find the largest sub-set of proteins that fulfil the constraint that no pair in that set has an HSSP-value > u (u = user defined threshold), we applied a simple greedy algorithm similar to that employed toward this end by Hobohm and Sander [19] : For each protein P in the submitted set, the algorithm counts the number of proteins NP that share an HSSP-value with P larger than u. We consider all proteins {NP} with HV > u as belonging to the family F(P). Next, we store the number and identifiers of all neighbours for each protein and sort the entire data set by the size of the families {F}. Finally, the greedy algorithm simply works down that list by starting at protein P' and excluding all members of family F(P'). We start either with the largest or the smallest family (option selected by user). In particular, the algorithm is as follows. (1) Take singletons: If the family F(P') contains only one sequence, P' is added to the unique list. (2) Non-singletons: all family members {F(P')} except for P' are erased from the list. (3) Overlap to previously identified proteins: If P' has one family-member Q that has already been included in the unique list at a previous step, the representative P' and all other family members {F(P')} except for Q will be removed from the stack. Note that this situation may have two reasons: (i) because of the asymmetrical nature of the distance-network generated by BLAST, and (ii) due to some overlap between domains that invalidates the triangular relation (e.g. A similar to B and A similar to C does not imply that B is also similar to C). The algorithm completes if no protein remains in the stack.
User options. The user-defined parameter Òsmallest firstÓ or Òlargest firstÓ influences the final set of representatives in the following way: Assume a set of three proteins with A and B being single domain non-homologous proteins and with C being a two-domain fusion of A+B. For a certain HSSP-value the setting Òlargest firstÓ would yield one group (A, B, C) whereas the setting Òsmallest firstÓ yields two (A,C) and (B,C). Sequence-space-hopping is a procedure to enlarge protein families by applying a triangular equation: if HV(A,B)>0, HV(B,C)> 0 <<<0 this usually implies that we cannot infer the similarity between A and C directly [20, 16] . Sequence space hopping (or intermediate sequence searches) explore the fact that B is an intermediate common to families A and C to infer the similarity between A and C. We enable user to apply this concept until no more new homologue sequences are found. ÒSmallest firstÓ often leads to families that can be connected via sequence-space-hopping. In our example an alignment of A would lead to sequence C and the second-round alignment of C would bring us back to A but also to B. Note: the default setting for the algorithm is Òlargest firstÓ.
Output. Since our server accepts a range of HSSP-values instead of a single value in order to better exploit a once done BLAST-run on a submitted set, one output-file is produced for each HSSP-threshold processed by the program. Those output-files are simple FASTA-files each one of them holding a single representative set. When using the internet-version of UniqueProt, the output will be downloadable from our server in a compressed format (zip or tar) once the job has been finished. To get a better overview, user-friendly html-files with links to the mentioned FASTA-files can be obtained additionally and will be included in the compressed archive. These files will also contain the HSSP-values for each submitted protein-pair.
Although the program treats sequences as a whole rather than considering domains, the UniqueProt algorithm is a convenient and relatively fast way to thin out some set of sequences by removing bias originating from redundancy without losing the most important representatives. A data set containing about 1000 sequences submitted to our server takes on average 15 minutes to complete. There is a restriction on the amount of data (500kb for FASTA-files, 20kb for ID-files, 10Mb for alignment-files) in order to prevent over-load of our CPU resources. Users who want to process larger sets can download the software and run it on their local LINUX/UNIX machines.
UniqueProt constitutes a level in between a relatively slow and careful clustering algorithm as used for example in GeneRAGE [21] and between the extremely fast and crude bias-reduction scheme CD-HI [22] . We compared UniqueProt to the clustering method on a single data set of 187 nuclear-matrix associated proteins taken from SWISSPROT. GeneRAGE grouped these proteins into 27 clusters. We grouped the same set through UniqueProt using different HSSP-values and both algorithm-modes (Òsmallest firstÓ and Òlargest firstÓ). We found the highest overlap between the two methods at an HSSP-values of 10 and with the mode Òlargest firstÓ. 17 of 27 GeneRAGE clusters contained at least one representative in the mentioned UniqueProt set. The reason for the rather high value for the best-fit proximity threshold (HSSP-value of +10) was that GeneRAGE grouped half the proteins in the data set into one cluster and split the remaining proteins into many small clusters. Although, we have no good reason to assume that our single test is representative for all possible data sets, we were encouraged that UniqueProt is an alternative that works fast, is accessible and probably accurate enough if the proteins have similar domain architectures. We plan to investigate to which extend we could apply the fast algorithm employed in CD-HI [22] to achieve a first, fast grouping of our results in the future.
Thanks to Jinfeng Liu and Megan Restuccia (Columbia) for computer assistance and to Avner Schlessinger (Columbia) for testing the program over and over again. Thanks also to the anonymous reviewers for their help to improve manuscript and tool. This work was supported by the grants RO1-GM63029-01 from the National Institute of Health (NIH) and 1-R01-LM07329-01 from the National Library of Medicine (NLM). Last, not least, thanks to Amos Bairoch (SIB, Geneva), Rolf Apweiler (EBI, Hinxton), Phil Bourne (San Diego Univ.), and their crews for maintaining excellent databases and to all experimentalists who enabled this tool by making their data publicly available.
| 1. | Bairoch, A. & Apweiler, R.(2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in2000. Nucleic Acids Research, 28, 45-48. |
| 2. | Berman, H. M., Westbrook, J., Feng,Z., Gillliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. NucleicAcids Research, 28, 235-242. |
| 3. | Altschul, S. F. & Gish, W.(1996). Local alignment statistics. Methods in Enzymology, 266, 460-480. |
| 4. | Altschul, S., Madden, T., Shaffer,A., Zhang, J., Zhang, Z. et al. (1997). Gapped Blast and PSI-Blast: a newgeneration of protein database search programs. Nucleic Acids Research, 25,3389-3402. |
| 5. | Sander, C. & Schneider, R.(1991). Database of homology-derived structures and the structural meaning ofsequence alignment. Proteins: Structure, Function, and Genetics, 9, 56-68. |
| 6. | Rost, B. (2002). Enzyme functionless conserved than anticipated. Journal of Molecular Biology, 318, 595-608. |
| 7. | Todd, A. E., Orengo, C. A. &Thornton, J. M. (2001). Evolution of function in protein superfamilies, from astructural perspective. Journal of Molecular Biology, 307, 1113-1143. |
| 8. | Wrzeszczynski, K. O. & Rost, B.(2003). Cataloguing proteins in cell cycle control. In Cell cycle checkpointcontrol protocols (Lieberman, H., eds.), pp. submitted, Humana Press, Totowa,NJ. |
| 9. | Nair, R. & Rost, B. (2002).Sequence conserved for sub-cellular localization. Protein Science, 11,2836-2847. |
| 10. | Wrzeszczynski, K. O. & Rost, B.(2003). In silico anaysis of retention signals for Endoplasmic reticulum andGolgi apparatus. Proteins: Structure, Function, and Genetics, submitted. |
| 11. | Tamames, J., Ouzounis, C., Casari,G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification ofproteins in functional classes by their database annotations. Bioinformatics,14, 542-3. |
| 12. | Devos, D. & Valencia, A.(2000). Practical limits of function prediction. Proteins: Structure, Function,and Genetics, 41, 98-107. |
| 13. | Ashburner, M., Blake, J. A.,Botstein, D., Butler, H., Cherry, J. M. et al. (2000). Gene ontology: tool forthe unification of biology. The gene ontology consortium. Nature Genetics, 25,25-29. |
| 14. | Wilson, C. A., Kreychman, J. &Gerstein, M. (2000). Assessing annotation transfer for genomics: quantifyingthe relations between protein sequence, structure and function throughtraditional and probabilistic scores. Journal of Molecular Biology, 297,233-249. |
| 15. | Brenner, S. E., Chothia, C. &Hubbard, T. J. P. (1998). Assessing sequence comparison methods with reliablestructurally identified distant evolutionary relationships. Proceedings of theNational Academy of Sciences, 95, 6073-6078. |
| 16. | Rost, B. (1999). Twilight zone ofprotein sequence alignments. Protein Engineering, 12, 85-94. |
| 17. | Yang, A. S. & Honig, B. (2000).An integrated approach to the analysis and modeling of protein sequences andstructures. II. On the relationship between sequence and structural similarityfor proteins that are not obviously related in sequence. Journal of MolecularBiology, 301, 679-689. |
| 18. | Alexandrov, N. N. & Soloveyev,V. V. (1998). Statistical significance of ungapped sequence alignments. InHICCS' 98: Pacific Symposium on Biocomputing' 98 (Altman, R. B., Dunker, A. K.,Hunter, L. & Klein, T. E., eds.), pp. 463-472, World Scientific, Maui,Hawaii, U.S.A.. |
| 19. | Hobohm, U. & Sander, C. (1994).Enlarged representative set of protein structures. Protein Science, 3, 522-524. |
| 20. | Park, J., Teichmann, S. A.,Hubbard, T. & Chothia, C. (1997). Intermediate sequences increase thedetection of distant sequence homologies. Journal of Molecular Biology, 273,349-354. |
| 21. | Enright, A. J. & Ouzounis, C.A. (2000). GeneRAGE: a robust algorithm for sequence clustering and domaindetection. Bioinformatics, 16, 451-457. |
| 22. | Li, W., Jaroszewski, L. &Godzik, A. (2001). Clustering of highly homologous sequences to reduce the sizeof large protein databases. Bioinformatics, 17, 282-283. |
| Contact: rost@columbia.edu | Version: Mar 20, 2003 |