
Title: Automatic target selection for structural genomics on eukaryotes
Author: Jinfeng Liu, Hedi Hegyi, Thomas B. Acton, Gaetano T. Montelione & Burkhard Rost
Quote: Proteins, 2004, 56(2):188-200

Automatic target selection for structural genomics on eukaryotes

Jinfeng Liu 1,3,4, Hedi Hegyi 3, Thomas B Acton 5,6, Gaetano T Montelione 5,6 and Burkhard Rost 1,2,3,*

1 CUBIC, Dept. of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
4 Dept. of Pharmacology, Columbia Univ., 630 West 168th Street, New York, NY 10032, USA
5 Center for Advanced Biotechnology and Medicine (CABM), Rutgers University, and Department of Biochemistry, Robert Wood Johnson Medical School, Piscataway, NJ 08854-5638
6 Northeast Structural Genomics Consortium (NESG), Department of Molecular Biology and Biochemistry, Rutgers University, Piscataway, NJ 08854-5638
* Corresponding author: cubic@cubic.bioc.columbia.edu URL http://cubic.bioc.columbia.edu/  Tel: +1-212-305-4018, fax: +1-212-305-7932

 

This article is published in (Proteins, issue, 2003 and pages) copyright Proteins: Structure, Function, and Genetics Wiley (2003). Wiley is the only authorized source. All copying of this article including placing on another website requires the written permission of the copyright owner.

 



Abstract

A central goal of structural genomics is to experimentally determine representative structures for all protein families. At least 14 structural genomics pilot projects are currently investigating the feasibility of high-throughput structure determination; nine of these in the USA are NIH funded. Initiatives differ in the particular subset of 'all families' on which they focus. At the NorthEast Structural Genomics consortium (NESG), we target eukaryotic protein domain families. The automatic target selection procedure has three aims: (1) Identify all protein domain families from currently five entirely sequenced eukaryotic target organisms based on their sequence homology. (2) Discard those families that can be modelled based on structural information already present in the PDB. (3) Target representatives of the remaining families for structure determination. In order to guarantee that all members of one family share a common fold-like region, we had to begin by dissecting proteins into structural domain-like regions before clustering. Our hierarchical approach, CHOP, utilising homology to PrISM, Pfam-A, and SWISS-PROT chopped the 103,796 eukaryotic proteins/ORFs into 247,222 fragments. 122,999 of these fragments appeared suitable targets that were grouped into over 27,000 singleton and over 18,000 multi-fragment clusters. Thus, our results suggested that it might be necessary to determine over 40,000 structures to minimally cover the subset of five eukaryotic proteomes.

 

Key words: structural genomics, target selection, protein structure family, cluster, domains, proteome analysis.

 

Abbreviations used:

3D structure: three-dimensional co-ordinates of protein structure
CHOP: dissection into structural domain-like fragments [1]
CLUP: simple clustering algorithm for CHOP fragments [1]
COILS: prediction of coiled-coil regions from sequence based on statistics and expert rules [2]
NORS: segment of more than 70 consecutive residues of NO Regular Secondary structure, i.e. without helix or strand (more precisely, we required that less than 12% of the residues in the respective region were in helix or strand and that at least one region of more than 10 residues was exposed to solvent) [3, 4]
ORF: open reading frame (for simplicity we usually refer to ORFs from genome sequencing projects as 'proteins')
PDB: Protein Data Bank of experimentally determined 3D structures of proteins [5]
Pfam-A: expert curated database of protein families [6]
PrISM: automatic method assigning sequence-consecutive structural domains from PDB co-ordinates [7, 8, 9]
SEG: program detecting low-complexity regions [10]
SignalP: method predicting signal peptides [11, 12]
SWISS-PROT: data base of protein sequences [13]
TMH: transmembrane helices


Notations used:

protein sequences: We refer to all sequences as 'proteins' although some are ORFs.
proteome: We refer to all the proteins in an organism as the 'proteome' of that organism.
sequence-structure families: Groups of proteins that are sufficiently similar in sequence to recognise a common fold by reliable cut-off thresholds in database searches. Usually, our criterion to consider proteins as members of one sequence-structure family is a PSI-BLAST E-value < 10^-3. Note that this particular definition implies that two different families may share the same fold; however, this is not apparent without knowing the structure of both.
sequence-unique: We refer to all proteins within one sequence-structure family as 'not sequence-unique' (note that each sequence-structure family has only one sequence-unique representative).
target proteomes: Five entire eukaryotic proteomes currently targeted by NESG (yeast: Saccharomyces cerevisiae, fruit-fly: Drosophila melanogaster, worm: Caenorhabditis elegans, human: Homo sapiens, and weed: Arabidopsis thaliana).
reagent proteomes: Proteomes from which NESG determines structures (note: these include bacterial and archaeal proteins that map to eukaryotic target clusters).

 

 

Introduction

Structural genomics: determine a structure for each sequence-structure family. In 2000, the National Institutes of Health (NIH) in the USA began to finance pilot projects for large-scale protein structure determination (structural genomics) [14]. One goal of structural genomics [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28] is to determine at least one structure for each representative protein family for which a structure cannot be inferred by comparative modelling. An important benefit is the basic understanding of biology and biological processes that will result from the determination of structural scaffolds for most basic functional elements. Nevertheless, with the advances of many pilot projects, it becomes increasingly apparent that 'structures for all families' is only one contribution of structural genomics. Structural genomics also pioneers high-throughput projects targeting proteins rather than genes. This challenge requires the development of techniques and protocols for large-scale expression, purification, crystallisation and structure determination. Such techniques, which semi- or fully automate the work with proteins, are likely to simplify many aspects of, e.g., biochemistry and cell biology and to add many techniques from biophysics to the standard battery of tools. Thus, these tools may ultimately become the most profound impact of structural genomics on everyday wet-lab biology.



Table 1: Structural genomics initiatives.

| Acronym | Name | Country | URL* |
|---|---|---|---|
| BSGC | Berkeley Structural Genomics Center | USA | www.strgen.org/ |
| CESG | Center for Eukaryotic Structural Genomics | USA | www.uwstructuralgenomics.org/ |
| JCSG | The Joint Center for Structural Genomics | USA | www.jcsg.org/ |
| MCSG | The Midwest Center for Structural Genomics | USA | www.mcsg.anl.gov/ |
| NYSGRC | New York Structural Genomics Research Consortium | USA | www.nysgrc.org/ |
| NESG | Northeast Structural Genomics Consortium | USA | www.nesg.org/ |
| SECSG | The Southeast Collaboratory for Structural Genomics | USA | www.secsg.org/ |
| SGPP | Structural Genomics of Pathogenic Protozoa Consortium | USA | depts.washington.edu/sgpp/ |
| TB | TB Structural Genomics Consortium | USA | www.doe-mbi.ucla.edu/TB/ |
| S2F | Sequence To Function | USA | s2f.umbi.umd.edu/ |
| BSGI | Montreal-Kingston Bacterial Structural Genomics Initiative | Canada | euler.bri.nrc.ca/brimsg/bsgi.html |
| SGC | Structural Genomics Consortium | Canada | www.uhnres.utoronto.ca/proteomics/ |
| SPINE | Structural Proteomics in Europe | Europe | www.spineurope.org/ |
| ASG | After Sequencing Genomes | France | |
| BIGS | Bacterial targets Genomics and Structural Information | France | igs-server.cnrs-mrs.fr/Str_gen/ |
| NWSGC | North West Structural Genomics Centre | England | www.nwsgc.ac.uk/ |
| OPPF | Oxford Protein Production Facility | England | www.oppf.ox.ac.uk |
| PSB | Partnership for Structural Biology | France | psb.esrf.fr/ |
| PSF | Protein Structure Factory | Germany | www.rzpd.de/psf/ |
| SGM | Structural Genomics of Mycobacteria | France | feu.sis.pasteur.fr/cgi-bin/WebObjects/MINISGP |
| WSPC | Weizmann Structural Proteomics Center | Israel | www.weizmann.ac.il/~wspc/ |
| YSG | Yeast Structural genomics | France | genomics.eu.org/ |
| BIRC | Biological Information Research Center | Japan | www.aist.go.jp/aist_e/ressearch_units/research_center/birc/birc_main.html |
| RSGI | RIKEN Structural Genomics Initiative | Japan | www.rsgi.riken.go.jp/ |

* URLs are listed without the leading 'http://', e.g. www.nesg.org/ stands for http://www.nesg.org/



One structure per family: simple concept, tough task! The goal of choosing one structure per sequence-structure family ('Notations used') appears conceptually trivial: (1) establish thresholds for levels of sequence similarity that accurately imply structural similarity [29, 30, 31, 32, 33, 34, 35, 7, 36], (2) begin from any protein and pull in all those proteins from the sequence universe that have the same structure. Unfortunately, this simple concept fails in practice [1, 37]. The detailed reasoning for our conclusion is beyond the scope of this manuscript. However, the two major points are that (i) we have to begin the clustering from fragments that resemble structural domains, and that (ii) the task at hand is to cluster entire proteomes, rather than to build sequence-structure families for selected proteins as in the spirit of the HSSP database [29, 38], or the PIR [39, 40], CATH [41], and SCOP [42] super-families. Therefore, we first have to chop proteins as reliably as possible into structural domain-like fragments, and then cluster these fragments before we can systematically choose 'one fold per sequence-structure family'.

NESG focus on eukaryotic domain families. The NorthEast Structural Genomics consortium (NESG, http://www.nesg.org/), one of the NIH funded structural genomics pilot projects, has focused on proteins from the fully sequenced eukaryotic model organisms Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens. We initially decided to focus on protein targets shorter than 340 residues to reduce the problem of multi-domain proteins: over 90% of the structural domains in SCOP [42] and PrISM [7] are shorter than 340 residues [1]. The primary goal is to experimentally determine one structure per sequence-structure family not represented in the PDB from these eukaryotic organisms. To this end, we target, clone, and express proteins from these eukaryotic model proteomes (target proteomes) and, in some cases, proteins from a number of bacterial and archaeal reagent proteomes that belong to the eukaryotic sequence-structure families.

First stage automatic target selection at NESG. The first stage of automated target selection has to solve the following four tasks: (1) Cluster all proteins from the eukaryotic model organisms such that each cluster represents one particular fold. (2) For each cluster, pull in those proteins from the non-eukaryotic reagent proteomes that share the fold represented by this cluster. (3) Exclude all clusters for which the fold is known. (4) Mark regions that are likely to hamper experimental progress, namely helical membrane proteins (PHDhtm [43, 44] , note that we most likely do not make any major mistake by ignoring beta-membrane proteins in eukaryotes [45] ), signal peptides (SignalP [11] ), proteins dominated by coiled-coil regions (COILS [2] ), low-complexity regions (SEG [10] ), or long regions without regular secondary structure (NORS [3, 4] ). All proteins that have less than 50 residues left after steps 3 (PDB) and 4 (difficult/unwanted) were excluded from further analysis; consequently, entire clusters may be excluded. Note that our procedure does not exclude secreted or membrane proteins, rather it only excludes proteins that are structurally characterised except for 'un-wanted' regions such as signal peptides or membrane helices. The resulting clusters comprise the NESG target list. One advantage gained from working with these clusters is realised in the protein production aspect of our project: The differences in sequence between members of a cluster can result in proteins that differ crucially in their characteristics of expression, solubility, crystallisability and amenability to structure determination by NMR. This multiplex scheme in which many members of each cluster are cloned and expressed in parallel increases the chances of eventually obtaining a sample suitable for structure determination. 
For example, while a target protein from yeast might not be soluble when over-expressed in Escherichia coli, a homologue from Aquifex aeolicus might prove soluble in the same system. If purification succeeds for more than one member of a cluster, a second stage of target selection is invoked, the details of which will be described elsewhere (Diana Murray et al., in preparation). For each selected target, the amino acid sequence, nucleic acid sequence, and other key information required for the cloning process are organised for our molecular biology and protein production efforts in the web-based ZebaView database [46]. Details of progress in cloning, expression, purification, and structure determination for each NESG protein target are then tracked by the SPiNE laboratory information management system [47]. All target clusters are linked to public databases and information about protein structure and function through our PEP database [48].

Here, we describe the results from the first stage automatic target selection. To group proteins into sequence-structure families, we first chopped all proteins from the eukaryotic target proteomes and from the reagent proteomes into structural domain-like fragments by the procedure CHOP [1] . CHOP imposes a hierarchy beginning from the most reliable information about structure domains (PrISM [8, 9] domains for proteins of known structure), continues to families that are well characterised by experts (Pfam-A [6] ), and finally explores the reliable information about N- and C-terminal ends of proteins that are characterised by experimentalists (SWISS-PROT [13, 49] ). The objective is not to obtain all domain boundaries, but rather to identify only those boundaries for which we are confident. Our clustering strategy identified 18-21,000 non-singleton clusters representing the five entirely sequenced eukaryotes. These 18-21,000 domain-family clusters can be viewed as the minimal set of protein domains that structural genomics has to determine experimentally for eukaryotes.
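The hierarchy described above can be illustrated with a small sketch. It is not the CHOP implementation [1]; it only shows the idea of accepting cuts in order of reliability and keeping the uncovered stretches as 'remain' fragments. The function name `chop`, the interval layout, and the rule that a hit is rejected when it overlaps a more reliable cut are assumptions made for illustration. In particular, SWISS-PROT contributes information about termini in the text, whereas this sketch treats every source as plain intervals.

```python
# Illustrative sketch only: NOT the published CHOP code.
PRIORITY = ["PrISM", "Pfam-A", "SWISS-PROT"]  # most to least reliable source

def chop(length, hits):
    """Cut a protein of `length` residues into domain-like fragments.

    hits: dict mapping a source name to a list of (start, end) regions,
          1-based and inclusive (hypothetical input format).
    Returns a sorted list of (start, end, label) tuples.
    """
    accepted = []
    for source in PRIORITY:
        for start, end in sorted(hits.get(source, [])):
            # Accept a hit only if it overlaps no previously accepted cut.
            if all(end < s or start > e for s, e, _ in accepted):
                accepted.append((start, end, source))
    if not accepted:
        # Nothing matched: the protein stays untouched.
        return [(1, length, "full-length")]
    accepted.sort()
    fragments, pos = [], 1
    for start, end, source in accepted:
        if start > pos:
            fragments.append((pos, start - 1, "remain"))
        fragments.append((start, end, source))
        pos = end + 1
    if pos <= length:
        fragments.append((pos, length, "remain"))
    return fragments

# Toy protein of 400 residues; the hit coordinates are made up.
example = chop(400, {"PrISM": [(50, 150)], "Pfam-A": [(120, 220), (230, 330)]})
for fragment in example:
    print(fragment)
```

In this toy case the Pfam-A hit at 120-220 is dropped because it overlaps the more reliable PrISM region, which mirrors the stated objective of keeping only those boundaries for which confidence is high.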

 

 

Results

Most proteins had more than one fragment. For the 103,796 proteins in the five eukaryotic target proteomes, the CHOP algorithm [1] generated 247,222 fragments. 47% of these resulted from sequence similarity to PrISM, Pfam-A or to SWISS-PROT termini ( Fig. 1 A). This fraction was considerably lower than that obtained for a set of 62 entirely sequenced proteomes, although these were mostly non-eukaryotic [1] . Only 14% of the final fragments were full-length proteins that remained untouched by our algorithm. To illustrate these percentages by numbers: 34,914 proteins (14% of all fragments and 34% of all proteins) were not chopped while 115,601 fragments (47% of fragments) originated directly from similarities to PrISM, Pfam-A, or SWISS-PROT; another 96,707 fragments (39% of fragments) were left over after chopping. The subset of these untouched 'left-over' fragments differed significantly in its length-distribution from all full-length proteins ( Fig. 1 B). The distribution of the number of fragments per protein differed slightly from that for 62 entirely sequenced proteomes. In particular for the subset of all proteins that were chopped by our algorithm, 28% had only one fragment in the set of all 62 proteomes [1] , and about 19% in the five eukaryotic proteomes; about 1% of the chopped eukaryotic proteins - corresponding to 1026 proteins - had more than 10 fragments ( Fig. 1 C). On average, the number of CHOP fragments was directly proportional to the protein length (fit: Average length of protein = 202 + 103 * Number of fragments, Fig. 1 D open circles). This linear fit for the average length of a structural domain-like fragment unravelled two rather remarkable features. The first was that most fragments extend over about 100 residues. 
Although the current PDB is biased toward certain types of proteins, in particular, proteins that are shorter than the averages observed from entire genome sequencing, the corresponding fit for structurally known protein domains - defined by PrISM - was parallel (fit: Average length of protein = 65 + 97 * Number of fragments, Fig. 1 D crosses). Thus, most domains in multi-domain proteins in PDB were also about 100 residues long. The second remarkable feature was that the linear fit for both the eukaryotic proteomes and the PDB did not begin at 0 but at 65 for PrISM/PDB domains and at 202 for our eukaryotic fragments. In other words, N-1 domains in proteins with N domains extend over about 100 residues while one extends over 160-300 residues. In order to establish that this unexpected finding was not caused by the particular way of presenting the data (average length vs. number of domains), we pooled all domain-like fragments and randomly 'assembled proteins' according to the observed distributions for the number of fragments per protein (data not shown). As expected, this control experiment yielded a line passing through 0. Our finding is therefore not explained by the particular presentation of the data. Consequently, the detailed fit is likely to constitute a more precise estimate of the average length of a structural domain than the one that is obtained from compiling a simple average over all domains currently annotated in PDB.
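The two fits quoted above have the form L = a + b N. As a minimal sketch, the following evaluates both fits for proteins with one to three fragments; the coefficients are taken from the text, while the function name `mean_length` is introduced here only for illustration.

```python
# Coefficients (intercept a, slope b) as quoted in the text for Fig. 1 D.
FITS = {
    "eukaryotic CHOP fragments": (202, 103),
    "PrISM/PDB domains": (65, 97),
}

def mean_length(n_fragments, a, b):
    """Average protein length predicted by the linear fit L = a + b * N."""
    return a + b * n_fragments

for name, (a, b) in FITS.items():
    lengths = [mean_length(n, a, b) for n in (1, 2, 3)]
    # The value for N = 1 equals a + b, i.e. the implied length of the one
    # 'long' domain; each further domain adds about b residues.
    print(name, lengths)
```

For N = 1 the fits give 162 residues (PrISM/PDB) and 305 residues (eukaryotic fragments), which corresponds to the range of 160-300 residues stated for the one longer domain.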




Fig. 1 : Statistics on eukaryotic CHOP fragments. (A) Percentage of all CHOP fragments (including remaining regions): 47% of all the final CHOP fragments originated from cuts according to PrISM, Pfam-A, or SWISS-PROT; 14% of the 'fragments' were full-length proteins untouched by CHOP. (B) Note that all curves described the CHOP fragments, e.g. the thick black line with filled triangles showed the distributions for fragments that were chopped through similarity to PrISM domains. 'Remain' marks those fragments that remained N- and/or C-terminal from a region cut out according to similarity to either PrISM domains, Pfam-A regions, or SWISS-PROT termini; 'Full-length' mark proteins that were not touched at all by the CHOP algorithm. All CHOP fragments (grey with open circles) were similar to the subset of Pfam-A fragments. Fragments cut according to SWISS-PROT termini (5%, Fig. 2) were more similar to the distribution of all full-length proteins from the eukaryotic proteomes (not shown) than the subset of proteins that remained untouched. (C) Distribution of number of CHOP fragments per protein (as percentage of all proteins that were chopped). For example, we found that less than 20% of the eukaryotic proteins had a single structural domain-like fragment. (D) Relation between number of fragments and protein length: on average, the number of CHOP fragments appeared to increase linearly with protein length (open circles). The basic functional form for this plot was similar for domains from proteins of known structures (crosses, taken from PrISM). The lines fit the data with L = a + b N with L being the length of the protein and N the number of fragments.
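The percentages in panel (A) follow directly from the fragment counts reported in the Results. The short check below recomputes them; the counts are taken from the text, and the variable and function names are chosen here for illustration.

```python
# Counts as reported in the Results section.
n_proteins = 103_796
counts = {
    "full-length (not chopped)": 34_914,
    "PrISM / Pfam-A / SWISS-PROT": 115_601,
    "left over after chopping": 96_707,
}
total = sum(counts.values())  # all final CHOP fragments

def percent(part, whole):
    """Share of `part` in `whole`, rounded to a whole percentage."""
    return round(100 * part / whole)

shares = {label: percent(n, total) for label, n in counts.items()}
unchopped_share_of_proteins = percent(counts["full-length (not chopped)"], n_proteins)

print(total)
print(shares)
print(unchopped_share_of_proteins)
```

The three categories add up to the 247,222 fragments reported, and the rounded shares reproduce the 14%, 47% and 39% quoted for fragments as well as the 34% quoted for proteins.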




Half of all fragments were not suitable for structural genomics. For each of the 247,222 eukaryotic structural domain-like fragments generated by CHOP, we searched for similarities to PDB structures (Methods) and applied a variety of prediction methods. The objective of this step was to exclude fragments from further analysis that either had known structure or constituted low-priority targets for structural genomics. 167,717 fragments did not match to any known structure. In order to filter out potentially difficult cases for structure determination (low-priority targets), we discarded all fragments that were dominated by predicted membrane helices [44] , coiled-coil helices [2] , low-complexity regions [10] , long regions of low secondary structure contents ( NORS regions [3, 4] ), and those of insufficient length (<50 residues). The precise criterion to accept a fragment was that we found at least 50 consecutive residues without known structure and without any of the 'problematic' regions listed; this step left 122,999 globular eukaryotic fragments. Albeit conceptually easy, our criterion for exclusion of fragments made it impossible to directly investigate the relative contribution of each of the possible reasons for exclusion. For example, a fragment might have residues 1-40 in a signal peptide, 60-65, 70-75, 80-85 in SEG, and a membrane helix close to the C-terminus. However, we could measure the per-residue contribution of each reason for exclusion: overall our filtering procedure considered about 51% of the residues in all eukaryotic proteins as either 'known structure' or as 'unwanted'; this percentage was 46 percentage points higher for fragments excluded before clustering (77%) than for the 122,999 fragments considered further (31%, Fig. 2 ). Most 'unwanted' residues originated from similarity to very short segments of known structure (58% of the residues in excluded fragments). 
The second most common reason for exclusion was the presence of a long region lacking regular secondary structure (NORS). The contribution from signal peptides (SignalP), membrane helices (TMH) and coiled-coil helices (COILS) was rather small in comparison: NORS+PDB accounted for 95% of the 'unwanted' residues in excluded fragments. Since our condition for inclusion was that we could find at least one region of 50 consecutive residues without unwanted residues, the 122,999 fragments used for clustering still contained considerable fractions of such residues (31%).
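The acceptance criterion (at least 50 consecutive residues without known structure and without any 'problematic' region) can be sketched as follows. This is an illustration under stated assumptions, not the NESG pipeline: the function names, the interval format, the protein lengths, and the position of the membrane helix are hypothetical; only the 50-residue threshold and the signal-peptide and SEG intervals come from the example in the text.

```python
def longest_clean_run(length, masked):
    """Longest stretch of consecutive residues not covered by any masked region.

    masked: list of (start, end) intervals, 1-based and inclusive.
    """
    flagged = [False] * (length + 1)  # index 0 is unused
    for start, end in masked:
        for i in range(max(1, start), min(length, end) + 1):
            flagged[i] = True
    best = run = 0
    for i in range(1, length + 1):
        run = 0 if flagged[i] else run + 1
        best = max(best, run)
    return best

def is_target(length, masked, min_clean=50):
    """Accept a fragment if it has at least `min_clean` clean residues in a row."""
    return longest_clean_run(length, masked) >= min_clean

# Signal peptide 1-40 and SEG regions 60-65, 70-75, 80-85 as in the text.
# The fragment lengths and helix positions below are assumptions.
unwanted = [(1, 40), (60, 65), (70, 75), (80, 85)]
short_fragment = unwanted + [(125, 145)]  # 150 residues, helix near the C-terminus
long_fragment = unwanted + [(175, 195)]   # 200 residues, helix near the C-terminus

print(longest_clean_run(150, short_fragment), is_target(150, short_fragment))
print(longest_clean_run(200, long_fragment), is_target(200, long_fragment))
```

In the shorter toy fragment no single annotation causes the exclusion; it is the combination of regions that leaves no clean stretch of 50 residues. This is why the relative contribution of each reason could only be measured per residue.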




Fig. 2 : Statistics on exclusion of CHOP fragments for NESG. While we cannot predict from sequence which proteins are the 'best' targets for structural genomics, we can predict which ones are likely not suitable. Most obviously, these are proteins with known structure, and proteins with many regions that hamper high-throughput structure determination since they have long coiled-coil (COILS) or membrane helices (TMH), long regions without regular secondary structure (NORS), or with low-complexity regions (SEG), or fragments that basically contain only a signal peptide (SignalP). Before clustering CHOP fragments, we excluded all segments that would clearly not be suitable. Our criterion for exclusion was simply that we could not find at least 50 consecutive residues without any unwanted region. Note that this definition implies that we did choose targets with 'problematic regions'. In particular, many targets originated from secreted proteins, only the signal peptides were excluded. Here, we compared the percentage of residues in such unwanted regions for all eukaryotic target proteins (grey bars) with that of the excluded fragments (stippled bars), and the target fragments clustered in this work (black bars).




Simple clustering on CHOP fragments yielded reasonable groups. We grouped all 122,999 eukaryotic fragments that constituted potential targets for structural genomics by a simple clustering procedure (CLUP [1] ). Most resulting clusters (27,669) were singletons, i.e. contained only one fragment; 21,309 of the clusters contained multiple eukaryotic fragments. Of these 21,309 non-singleton clusters, about one half contained three or more members, and about 13% had more than ten eukaryotic fragments ( Fig. 3 A, dark grey). While NESG aims at experimentally determining structures for eukaryotic proteins (target proteomes), we also target homologues of these protein domain families from non-eukaryotic reagent proteomes ( Table 2 ). Prokaryotic members of these domain families are often easier to produce in E. coli expression systems, and often correspond to full-length versions of domains that occur within large multi-domain eukaryotic proteins. On average, the non-eukaryotic reagent proteomes contributed fewer members to each cluster than the target proteomes ( Fig. 3 A, light grey gives reagent + target proteomes). 143 clusters contained over 100 fragments; the largest cluster contained 643 fragments; the seed of this cluster was the worm protein caeel_fr26007 annotated by Pfam-A (PF00153) as 'Mitochondrial carrier protein'. By definition, all members of one CLUP cluster share one region that might constitute a common fold. This implies that the same fragment can belong to more than one cluster ( Fig. 3 B). One reason could be that the fragment actually consists of two structural domains that were not recognised as such by CHOP. The majority of all fragments mapped to a single cluster (74%), and only 1% were associated with more than five clusters ( Fig. 3 B). 
The non-singleton clusters united a total of 94,678 fragments, 21,290 of these constituted full-length proteins that had been left untouched by CHOP, and only 2,817 of these were fragments from proteins for which CHOP identified a single domain-like region. When we applied a more

 




Fig. 3: Statistics on target clusters. (A) Number of fragments per target cluster for the eukaryotic target proteomes (dark) and the target + reagent proteomes (light): Over one-third of the non-singleton clusters contain two fragments, about half of all clusters have more than three members (inset: cumulative percentages). (B) Degeneracy of clusters. Most CHOP fragments (74%) are associated with a single cluster, while ~1% of the proteins are associated with more than 5 clusters. Note that only those clusters were considered that constitute valid targets for structural genomics.
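The degeneracy shown in panel (B) is simply the number of clusters to which a fragment belongs. A minimal sketch of how such a statistic could be computed from cluster memberships is given below; the data structure, the function names, and the toy identifiers are assumptions for illustration and do not reflect the CLUP output format.

```python
from collections import Counter

def degeneracy(clusters):
    """Map each fragment to the number of clusters that contain it."""
    counts = Counter()
    for members in clusters.values():
        counts.update(set(members))  # count a fragment once per cluster
    return counts

def fraction_in_single_cluster(clusters):
    """Fraction of fragments that belong to exactly one cluster."""
    counts = degeneracy(clusters)
    return sum(1 for n in counts.values() if n == 1) / len(counts)

# Toy data: cluster names and fragment identifiers are made up.
toy = {
    "c1": ["f1", "f2", "f3"],
    "c2": ["f3", "f4"],
    "c3": ["f5", "f6"],
}

print(dict(degeneracy(toy)))
print(fraction_in_single_cluster(toy))
```

Here fragment f3 is shared by two clusters, for instance because it contains two domains that were not separated, so five of the six toy fragments map to a single cluster.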




 

 

Discussion and Conclusions

Do structural domains constitute the 'atom of evolution'? While we failed to cluster full-length proteins [1], even our simple clustering strategy yielded a reasonable grouping for domain-like fragments. All members of one cluster share a region that is likely to constitute a common fold. The number of fragments that belong to more than one cluster, i.e. the degeneracy of our clusters, reflects the success of combining CHOP and CLUP: 74% of all eukaryotic fragments amenable to structural genomics mapped exclusively to one eukaryotic cluster (Fig. 3 B). When we applied CLUP to known structural domains (PrISM), the degeneracy was negligible [1]; these data may suggest that the remaining degeneracy is largely caused by CHOP being incomplete: 56% of all fragments did not match to known domain-like regions (Fig. 1 A). However, the lack of degeneracy for PrISM domains might also just be a size effect, i.e., we might observe a higher degeneracy if PDB were ten times larger. We also evaluated the degeneracy when we merged all clusters that had at least 2 in 3 fragments in common (data not shown). Such a permissive merging decreased the degeneracy considerably (91% of the fragments mapped to only one cluster, and only 1% to more than 3). However, after such merging, we can no longer ascertain that all members in one cluster share a common fold. Our combined chopping and clustering strategy implicitly assumed that structural domains constitute something like the 'atoms of evolution'. This becomes evident when considering the alternative: if evolution proceeded by a cut-and-paste mechanism of units shorter than structural domains - as recently suggested [50, 51] - a simple clustering would not separate the groups in the sense of 0 degeneracy. Presumably, the degeneracy that we observe originated to some extent from the partial error in our initial assumption and to some extent from the fact that evolution often does proceed by cut-and-paste of sub-domains [50, 51]. Structural domains may not be 'the atom of evolution'; however, our data suggested that even our incomplete chopping procedure constituted a rather successful starting point on the quest for evolutionary units. Since most protein-protein interactions are between single domains [52, 53, 54, 55, 56], our structural domain-like fragments may also help reduce noise in two-hybrid experiments by probing protein-protein interactions between fragments rather than between full-length proteins.

Thresholds optimised for high rate of sequence-unique structures. Although our automatic target selection strategy is conceptually rather simple, it required choosing many more or less arbitrary thresholds about when to consider a similarity sufficient to chop or to cluster and when to exclude fragments. Overall, our results were rather stable with respect to minor changes in these thresholds with one prominent exception: the sequence similarity to known PDB structures that excludes fragments. For this exclusion, we applied a threshold (PSI-BLAST expectation value of 1 or an HSSP-value of 0 [34]) that is fairly conservative in the sense that we are likely to exclude 'trivial' similarities. The success of this choice is apparent in the high percentage of structures determined by NESG for proteins that could not have been modelled by homology: 59% of the solved structures constituted sequence-unique proteins (Fig. 4); this was over seven times higher than the corresponding percentage for the entire PDB from the same period (and it was the second highest for all structural genomics consortia, surpassed only by the S2F Structure-to-function consortium headed by John Moult [57]). Although successful in avoiding overlap with known structures, these thresholds are far too permissive to ascertain that all protein pairs with this level of similarity are within homology-modelling distance of each other. Therefore, we had to choose a different threshold (PSI-BLAST E-value < 10^-3) to assign proteins to a particular cluster. At this level, comparative modelling correctly predicts about 30-50% of the proteins at a main chain root mean square deviation of about 2-5 Å (Marc Marti-Renom, unpublished data [58, 59]). Assume that a structural biologist determines the structure for one of the fragments in our cluster X. Should we automatically remove that cluster from our list? Obviously, our threshold for grouping is still too permissive to rule out that we would benefit from determining additional members of X. The NESG therefore carries out an additional step, namely an expert-driven, detailed manual examination of cluster X that aims at providing a more reliable answer to the question of whether multiple structures are required to characterise the entire cluster (Diana Murray, Cornell, unpublished).
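The use of two different E-value thresholds can be made explicit in a small sketch. The threshold values are those given in the text; the function names are hypothetical, the treatment of the boundary value is an assumption, and the alternative HSSP-value criterion is not modelled.

```python
PDB_EXCLUSION_EVALUE = 1.0  # permissive: even weak similarity to a known structure
CLUSTER_EVALUE = 1e-3       # stricter: membership in a sequence-structure family

def matches_known_structure(evalue_to_pdb):
    """True if a fragment is treated as similar to a PDB structure.

    Whether an E-value of exactly 1 counts is an assumption of this sketch.
    """
    return evalue_to_pdb <= PDB_EXCLUSION_EVALUE

def same_family(evalue_between_fragments):
    """True if two fragments are similar enough to share a cluster."""
    return evalue_between_fragments < CLUSTER_EVALUE

# A weak hit (E = 0.5) counts as similarity to a known structure,
# yet it would not place two fragments into the same cluster.
print(matches_known_structure(0.5), same_family(0.5))
```

The asymmetry is deliberate in the strategy described above: a permissive threshold avoids overlap with known structures, while the stricter one keeps members of a cluster closer to homology-modelling distance.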




Fig. 4: Early success of structural genomics. One goal of structural genomics is to determine structures for proteins for which we cannot predict structure through comparative modelling. The lower panel shows the number of structures deposited into the PDB (light grey) by structural genomics consortia (listed in Table 1); the dark and stippled bars show the subsets of these for which comparative modelling was not applicable at the time of deposition, assuming that we can build all models when a homologue in PDB is similar at a BLAST expectation value of 10^-3 (black) or 10^-10 (stippled); the numbers on top of the black bars give the percentage of sequence-unique proteins at 10^-3. About 49% of all the structures solved by all 10 US consortia could not have been modelled (for comparison: the corresponding number for the entire PDB for the same time period was about 8%; black bars mark E-values < 10^-3; stippled bars mark E-values < 10^-10); the upper panel shows the same information for the number of residues modelled in all these proteins. These leverage values demonstrate the impact of a single well-chosen experimental structure: all consortia determined 189 unique structures that might yield new models - at least at low resolution - for about 23,000 proteins and over 5 million residues, i.e. on average each unique structure gave rise to about 120 new models. However, the number of sequence-structure families with over 600 members is rather limited; less than 40% of all families have over 100 members [64, 71]. Consequently, the leverage values will decrease with increasing success of structural genomics in structurally covering all sequence-structure families. Thus, leverage values may not constitute the most reasonable measures of success for structural genomics.
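The leverage quoted in the caption is the ratio of newly modellable proteins (or residues) to unique experimental structures. The sketch below recomputes it from the rounded figures in the caption; the function name `leverage` is chosen here, and the residue count is a lower bound because the text only states 'over 5 million'.

```python
unique_structures = 189
new_models = 23_000            # 'about 23,000 proteins'
modelled_residues = 5_000_000  # 'over 5 million residues' (lower bound)

def leverage(modelled, structures):
    """Average number of modelled items per unique experimental structure."""
    return modelled / structures

models_per_structure = leverage(new_models, unique_structures)
residues_per_structure = leverage(modelled_residues, unique_structures)

print(round(models_per_structure), round(residues_per_structure))
```

With these rounded inputs the ratio comes out slightly above 120 models per unique structure, consistent with the approximate value given in the caption.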




Structural genomics already contributed many sequence-unique structures. About 8000 structures containing a total of about 14000 chains were added to the PDB while structural genomics consortia have existed; 1100 of these chains (~8%) were sequence-unique [5, 60], based on the criteria described above. One way of evaluating the success of structural genomics pilot projects in the USA over the first three years of funding is by measuring the percentage of new sequence-unique structures determined by all pilot projects (49%, Fig. 4). To illustrate the impact of this high number: although structural genomics projects in the US contributed fewer than 3% of all structures, they solved almost 20% of the sequence-unique structures deposited into the PDB in that period. Nevertheless, given that, for instance, the NESG threshold for excluding proteins with similarities to known structures from the target list is very permissive (E-value of 1), it seems surprising that the rate of sequence-unique structures from NESG was 'only' 67%, albeit the highest for all consortia. Part of the reason is a legacy of initial 'technology-development' targets that were selected and brought into the structure analysis pipeline before the target selection process outlined here had been developed. Indeed, almost 80% of the targets selected by the initial realisation of the concept described here were sequence-unique; the rate is still not 100% because we considered targets as sequence-unique at the time of deposition in the PDB rather than at the time of selection (we did not have this information for all consortia). Thus, by this definition, the structural genomics consortia compete with each other and with all other structural biologists. Since many consortia make an effort to choose 'high-leverage targets' (as demonstrated by the above-average values for the leverage of each structure deposited, Fig. 4 upper panels), the high rate of sequence-unique structures illustrates another aspect of success: speed.
Although the consortia differ in their focus, they obviously overlap in the attempt to prioritise targets that appear promising and interesting. This may explain the high degree of overlap between the consortia (on average 50%, Fig. 5).



Fig. 5: Overlap between structural genomics consortia. The fraction of sequence-unique proteins in Fig. 4 implicitly reflects the overlap between the structural genomics consortia and all structures deposited into the PDB (overlap means PSI-BLAST E-values below 10^-3). Here, we addressed the question of to what extent the structural genomics consortia overlapped (E-values below 10^-3) with each other (top panel: number of proteins that overlap as a percentage of the number of proteins deposited in the PDB by 2003-03-26). Although all consortia have similar implicit constraints (as many sequence-unique structures as possible in order to substantiate funding), the overall overlap remained below 30% (84 of 315 proteins deposited into the PDB by 2003-03-26). The consortia differed substantially in the percentage of overlapping targets for which they deposited the structure faster than any other consortium (lower panel: number of proteins determined before all other consortia as a percentage of the number of overlapping proteins).




Alarm system detecting recently solved structures: work in progress. Each week we compare all proteins added to the PDB against all our target PSI-BLAST profiles. If we find a new structure that has sequence similarity to any of our targets, we notify the group of Diana Murray at Weill Medical College of Cornell University, members of the NESG, who investigate the case in more detail. We currently apply three different thresholds depending on the experimental stage of the target. (1) If the target is already expressed, soluble, and purified and we have a good or promising first HSQC spectrum and/or a promising crystal, we require a similarity to a known structure at either a PSI-BLAST E-value <10^-10 or an HSSP-proximity >10 to "stop work" on the corresponding target. (2) If the target is cloned, expressed, soluble, and purified, but has not yet yielded success in preliminary efforts of structure analysis, the thresholds are E-value <10^-3 or HSSP-proximity >2 for "stop work". (3) If the target has not been touched experimentally, or has been cloned but not yet expressed in soluble form, we simply remove it from the target list for E-values <1 and for HSSP-proximity >0 [...] <10^-3) are alerted.
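The three stage-dependent thresholds can be summarised in a few lines of code. The following is only a minimal sketch and not the NESG implementation: the names Stage, THRESHOLDS and should_stop are hypothetical, and we assume that at every stage either criterion alone (E-value or HSSP-proximity) suffices to raise the alarm.

```python
from enum import Enum


class Stage(Enum):
    """Experimental stage of a target (hypothetical labels for the three cases)."""
    UNTOUCHED = 1   # not touched, or cloned but not yet expressed in soluble form
    PURIFIED = 2    # cloned, expressed, soluble, purified; no structural success yet
    PROMISING = 3   # good first HSQC spectrum and/or promising crystal


# (maximal PSI-BLAST E-value, minimal HSSP-proximity) as quoted in the text.
THRESHOLDS = {
    Stage.PROMISING: (1e-10, 10.0),
    Stage.PURIFIED: (1e-3, 2.0),
    Stage.UNTOUCHED: (1.0, 0.0),
}


def should_stop(stage, e_value, hssp_proximity):
    """Return True if a newly deposited structure is similar enough to stop work.

    Assumption: either criterion alone triggers the decision at every stage.
    """
    max_e_value, min_proximity = THRESHOLDS[stage]
    return e_value < max_e_value or hssp_proximity > min_proximity


# The same hit (E-value 1e-5, HSSP-proximity 3) stops a merely purified target
# but not one that already gave a promising spectrum or crystal.
print(should_stop(Stage.PURIFIED, 1e-5, 3.0))
print(should_stop(Stage.PROMISING, 1e-5, 3.0))
```

Keeping the thresholds in a single table makes the design intent visible: the more experimental effort has been invested in a target, the stronger the evidence required before it is abandoned.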

Over 40,000 targets for five eukaryotes, alone! The largest funding for structural genomics worldwide originates from the Science & Technology Agency in Japan and is concentrated at the RIKEN Structural Genomics Initiative (RSGI) at the Institute of Physical and Chemical Research [61]. The second largest funding originates from the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health (NIH) in the USA. The NIGMS protein structure initiative (PSI) formulates as one of the goals of structural genomics "to determine representative structures from all protein families" [14]. The goal of our automatic target selection is exactly along this line: (1) find all representative families from entirely sequenced eukaryotes, (2) exclude those for which we already have knowledge about structure and those which are likely to hinder progress in a high-throughput enterprise, and (3) develop the means that allow rapid structure determination for each of these families. We were surprised by the number of technical difficulties that we had to surmount to approximate a solution for the tasks imposed on target selection by the first two steps; the results that we presented here refine in many ways the simple concepts laid out earlier [62, 63, 64, 65, 66]. Possibly the most surprising result of our work is summarised by the following two numbers: 27,669 singleton and 18,079 (after permissive merging of clusters) to 21,309 non-singleton clusters, summing up to over 40,000 proteins for which we have to determine structure if we want to 'minimally cover' the proteomes of only five eukaryotes. In fact, these numbers still underestimate the actual workload as they ignore membrane regions, domains that appear depleted of regular secondary structure [67, 68, 3, 4], as well as proteins that clearly have similar folds but differ in function.
About 6,700 clusters contain at least one full-length protein (untouched by CHOP) that falls into the NESG length range and belongs to only one cluster; most of these target clusters are already in the experimental pipeline of NESG. Currently, we work on ways to connect clusters through threading. While such connections will be very unreliable, they may enable us to further raise the probability of spanning as many structural families as possible with as few structures as we can determine experimentally. Nevertheless, at the current rate of structure determination, structural genomics projects will not run out of targets for many years to come.

 

 

Methods



Sequences from entirely sequenced organisms

We obtained all ORFs for the archaeal and prokaryotic reagent proteomes, for Saccharomyces cerevisiae, and for Arabidopsis thaliana from the NCBI web site: ftp://ftp.ncbi.nih.gov/genbank/genomes/. For the remaining eukaryotic proteomes, we used the following sources: Homo sapiens from SWISS-PROT (release 39) and TrEMBL (release 18), Drosophila melanogaster from http://www.fruitfly.org/ (release 2), and Caenorhabditis elegans from http://www.sanger.ac.uk/Projects/C_elegans/wormpep/ (wormpep 65).



Table 2: Target and reagent proteomes at NESG. *

Organism                              Number of ORFs  Number of CHOP fragments  Number of CLUP clusters

Eukaryotic target proteomes
Arabidopsis thaliana                           17401                     26867                     8487
Caenorhabditis elegans                         12390                     18094                     8529
Drosophila melanogaster                         7655                     10994                     8152
Homo sapiens                                   22547                     34413                    12605
Saccharomyces cerevisiae                        3120                      4401                     3225

Archaeal reagent proteomes
Aeropyrum pernix K1                              296                       334                      430
Archaeoglobus fulgidus                           317                       364                      440
Methanobacterium thermoautotrophicum             292                       330                      401
Pyrococcus furiosus                              290                       340                      369
Pyrococcus horikoshii                            261                       296                      364
Thermoplasma acidophilum                         235                       265                      330

Prokaryotic reagent proteomes
Aquifex aeolicus                                 304                       380                      511
Bacillus subtilis                                470                       535                      527
Brucella melitensis                              260                       309                      340
Campylobacter jejuni                             193                       220                      274
Caulobacter crescentus                           450                       517                      540
Deinococcus radiodurans                          390                       451                      476
Escherichia coli                                 694                       812                      747
Fusobacterium nucleatum                          224                       267                      316
Haemophilus influenzae                           226                       278                      341
Helicobacter pylori                              200                       233                      289
Lactococcus lactis                               248                       285                      375
Neisseria meningitidis                           251                       296                      385
Staphylococcus aureus                            312                       357                      475
Streptomyces coelicolor                          827                       941                      752
Streptococcus pyogenes                           192                       227                      348
Thermotoga maritima                              275                       325                      432
Vibrio cholerae                                  332                       405                      488

* All numbers refer to the subset of proteins in our final 22,037 non-singleton clusters. Note that the number of clusters to which each proteome contributes does not sum to the number of all target clusters, since all the clusters considered have more than one member. Note that only the US projects have consistently deposited their data into TargetDB, i.e. the special database for structural genomics provided by the PDB.





Database searches and prediction methods

Search for similar proteins. We detected similar sequences in two ways. (1) We ran PSI-BLAST [69] searches against all known sequences contained in SWISS-PROT [13, 49], TrEMBL [13, 49], and PDB [5]. For simplicity, we refer to the combination of these three databases as the set BIG. We first searched against a 'filtered' version of BIG (regions of low complexity were marked by SEG [10], proteins with over 98% pairwise sequence identity were removed) and then used the final profile to search against the unfiltered BIG [70, 71]. When building a sequence-structure family, we included all hits below a PSI-BLAST E-value of 10^-3 or above an HSSP-value of 0 [34]. The HSSP-curve relates alignment length to pairwise sequence identity or similarity [29, 34]; for alignments of 100 residues, HSSP=0 corresponds to 33% pairwise sequence identity, for alignments longer than 250 residues to about 20%; we refer to values >0 as HSSP-proximity (the larger the more similar) and to values <0 as HSSP-distance (the larger the more distant).

Membrane proteins. We used only the filtered MaxHom alignments [29, 34] for predicting membrane regions by the program PHDhtm [43, 44] using the default threshold of 0.8. Note: our notion of 'membrane proteins' is restricted to integral helical membrane proteins. In particular, we ignored proteins anchoring helices in the membrane since these classes of proteins cannot be identified generally from sequence information alone. We also ignored proteins inserting beta-strands (porins) into the membrane since (1) these proteins are not assumed to exist in any major fraction in any eukaryote [45] , and (2) no method that is currently publicly available detects all these proteins with sufficient reliability.

Signal peptides. We predicted signal peptides using the program SignalP [11, 12]. We considered a protein to contain a signal peptide if the mean S value in the prediction was above the default threshold. The accuracy of SignalP was estimated to be around 90% [11, 12, 72]. We excluded the archaeal reagent proteomes from the analysis since SignalP was developed for prokaryotes and eukaryotes. As for all other 'problematic regions', the presence of a signal peptide did not exclude the respective protein from our target list; rather, we excluded only the signal peptide itself, as well as proteins for which no region of interest outside the signal peptide was longer than 50 residues.

Coiled-coil helices. We used the program COILS [2] to predict coiled-coil regions, with the window size set to 28 residues and the probability threshold set to 0.9.

Low-complexity (SEG) and non-regular secondary structure (NORS). We labelled regions of low complexity with the program SEG [10] at its default parameters. Using the filtered MaxHom alignments, we applied PROFsec [73, 74, 43, 75] to predict secondary structure and PROFacc [76, 43, 75] to predict solvent accessibility. We considered stretches of more than 70 consecutive residues with less than 12% predicted helix or strand as 'NORS' [3, 4].
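The NORS criterion lends itself to a sliding-window scan. The sketch below is an illustration under stated assumptions rather than the NORSp program itself: the function name find_nors is hypothetical, the predicted secondary structure is assumed to be a three-state string (H = helix, E = strand, anything else = other), the window is taken as 71 residues ("more than 70"), and overlapping qualifying windows are merged into one region.

```python
def find_nors(sec_str, window=71, max_structured=0.12):
    """Return merged half-open (start, end) regions that satisfy the NORS rule.

    A window qualifies if fewer than `max_structured` of its residues are
    predicted as helix (H) or strand (E). Overlapping windows are merged.
    """
    n = len(sec_str)
    if n < window:
        return []
    structured = [1 if state in "HE" else 0 for state in sec_str]
    count = sum(structured[:window])  # structured residues in the first window
    regions = []
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one residue: add the entering, drop the leaving residue.
            count += structured[start + window - 1] - structured[start - 1]
        if count / window < max_structured:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


# Toy prediction: 50 helix residues, 100 loop residues, 50 strand residues.
prediction = "H" * 50 + "L" * 100 + "E" * 50
print(find_nors(prediction))
```

Because each window tolerates up to 12% regular secondary structure, the reported region extends a few residues into the flanking helix and strand.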



Chopping into domain-like fragments (CHOP) and clustering (CLUP)

CHOP and CLUP, i.e. our methods for dissecting proteins into structural domain-like fragments and for clustering these fragments, are described elsewhere [1]. Here, we only sketch the idea of both procedures (Fig. 6 gives a simplified flow-chart for the entire automatic selection stage at the NESG).




Fig. 6 : Simplified flow-chart for automatic target selection stage at the NESG. Our first objective is to dissect as many of the proteins from the target and reagent proteomes (Table 2) into structural domain-like fragments (CHOP procedure [1] ) and to label these fragments (left panel). CHOP imposes a hierarchy in the sense that it explores first the most reliable information (PrISM domains of known structure), then the next reliable (Pfam-A regions annotated by experts), and finally pulls in the least reliable information (experimentally verified full-length proteins from SWISS-PROT). At each step, proteins are cut into the region that is homologous to any of the three sources (PrISM, Pfam-A, SWISS-PROT, labelled X in the flow-chart) and the fragments left and right of this cut (labelled notX). Fragments shorter than 30 residues are removed after this cut. Next, we filter out all CHOP fragments that match to known structures, and mark the regions in CHOP fragments from the eukaryotic target proteomes that may pose problems to rapid structure determination (NORS, SEG, TMH, COILS). Fragments that have less than 50 consecutive residues without any such 'problematic' regions are also removed at this step. (Note one of the many additional complications left out here is that we based these predictions on multiple alignments rather than on single sequences.) The second major objective is to group the CHOP fragments into domain-like sequence-structure families and to map proteins from the non-eukaryotic reagent proteomes into these families (right panel). We first cluster all CHOP fragments from the eukaryotic target proteomes (CLUP [1] ). Then we generate PSI-BLAST profiles for each cluster, and finally map the non-eukaryotic reagent proteins into these cluster families by aligning them into the CLUP PSI-BLAST profiles.



Domain-like fragments. CHOP implements three hierarchical steps that were applied by decreasing confidence in the accuracy of the information: (1) High reliability: PrISM domains, (2) Acceptable confidence: Pfam-A families, and (3) protein termini from SWISS-PROT. We discarded all fragments from step S that overlapped with fragments identified in the previous step S-1 (more reliable identification of domain boundaries). At any step, we discarded fragments with less than 30 residues. We applied CHOP to all proteins in the five eukaryotic target and the 23 reagent proteomes.
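The hierarchical cutting can be illustrated with simple interval arithmetic. This is a hedged sketch, not the CHOP code described in [1]: the function names are hypothetical, regions are half-open residue intervals, the homology hits for each step are assumed to be given, and a hit is skipped whenever it overlaps any region accepted earlier (the text states this for the previous step; applying it within a step is our simplification).

```python
MIN_FRAGMENT = 30  # fragments with fewer residues are discarded (from the text)


def overlaps(a, b):
    """True if the half-open intervals a and b share at least one residue."""
    return a[0] < b[1] and b[0] < a[1]


def chop(length, hits_by_step):
    """Cut a protein of `length` residues into domain-like and leftover fragments.

    `hits_by_step` lists the homologous regions per step, ordered from the most
    reliable source (e.g. PrISM) to the least reliable (e.g. SWISS-PROT).
    Returns (domains, leftovers), both as sorted lists of (start, end).
    """
    domains = []                # accepted regions, labelled X in the flow-chart
    leftovers = [(0, length)]   # uncut regions, labelled notX
    for hits in hits_by_step:
        for hit in hits:
            if any(overlaps(hit, d) for d in domains):
                continue        # conflicts with a more reliable assignment
            updated = []
            for frag in leftovers:
                if not overlaps(hit, frag):
                    updated.append(frag)
                    continue
                start, end = max(hit[0], frag[0]), min(hit[1], frag[1])
                pieces = [(frag[0], start), (end, frag[1])]
                updated.extend(p for p in pieces if p[1] - p[0] >= MIN_FRAGMENT)
                if end - start >= MIN_FRAGMENT:
                    domains.append((start, end))
            leftovers = updated
    return sorted(domains), sorted(leftovers)


# A 300-residue protein with one hit in step 1 and two hits in step 2.
print(chop(300, [[(50, 150)], [(100, 200), (180, 280)]]))
```

In this toy case the step-2 hit at residues 100-200 is dropped because it overlaps the more reliable step-1 domain, and the 20-residue C-terminal leftover falls below the 30-residue minimum.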

Clustering CHOP fragments. We clustered all the fragments (and uncut full-length proteins) obtained from CHOP (except those that were removed, see below). Toward this end, we simply ran an all-against-all PSI-BLAST [69] search with a threshold E-value of <10^-3. Then we applied a simple clustering algorithm starting from the shortest fragments in the groups with fewest members identified by the previous PSI-BLAST search. Finally, we merged all clusters that had more than 10 members and shared 90% of their members.

Conservative and permissive merging of CLUP clusters. In our final step, we merged all clusters that had more than 10 members and shared 90% of their members. We introduced this step after visually analysing the relations between the largest degenerate clusters. For all examples that we looked at, this particularly conservative threshold did not challenge our goal that all members of a given cluster have a structurally similar region that is likely to constitute a structural domain. We also analysed a more permissive merging strategy by joining all clusters that shared at least 50% of their members (two in three for clusters with only three members).
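Both merging strategies can be expressed with one small routine. This is a sketch under assumptions and not the CLUP implementation: merge_clusters is a hypothetical name, and we assume that the shared fraction is measured relative to the smaller of the two clusters and that merging is repeated until no pair qualifies.

```python
def merge_clusters(clusters, min_shared=0.9, min_size=11):
    """Repeatedly merge pairs of clusters that share enough members.

    Defaults mimic the conservative rule (more than 10 members, 90% shared).
    The permissive rule corresponds to min_shared=0.5 and min_size=1.
    Assumption: the shared fraction is relative to the smaller cluster.
    """
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                smaller = min(len(a), len(b))
                if smaller < min_size:
                    continue
                if len(a & b) / smaller >= min_shared:
                    clusters[i] = a | b
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan, since cluster contents changed
    return clusters


a = set(range(0, 20))      # 20 members
b = set(range(1, 21))      # shares 19 of 20 members with a
d = set(range(10, 30))     # shares 11 of 20 members with the union of a and b
print(len(merge_clusters([a, b, d])))                              # conservative
print(len(merge_clusters([a, b, d], min_shared=0.5, min_size=1)))  # permissive
```

With the 50% threshold relative to the smaller cluster, a three-member cluster needs two shared members to be merged, which matches the "two in three" remark in the text.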



Thresholds and exclusion of fragments

Removing fragments. Many proteins of known structure contain regions of low complexity [68, 3]. However, proteins that contain almost no high-complexity regions constitute at best low-priority targets for structural genomics. Before clustering, we removed all fragments that had fewer than 50 residues outside any predicted membrane, coiled-coil, signal-peptide, SEG, or NORS region.
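The exclusion rule amounts to masking all 'problematic' residues and measuring what remains. The sketch below is illustrative only: the function names are hypothetical, problematic regions are assumed to be given as half-open intervals pooled over all predictors, and we follow the caption of Fig. 6 in requiring 50 consecutive unmasked residues.

```python
MIN_CLEAN = 50  # minimal number of unmasked residues (from the text)


def longest_clean_run(length, problem_regions):
    """Length of the longest stretch not covered by any problematic region."""
    masked = [False] * length
    for start, end in problem_regions:
        for i in range(max(start, 0), min(end, length)):
            masked[i] = True
    best = run = 0
    for is_masked in masked:
        run = 0 if is_masked else run + 1
        best = max(best, run)
    return best


def keep_fragment(length, problem_regions, min_clean=MIN_CLEAN):
    """True if the fragment retains a long enough unmasked stretch."""
    return longest_clean_run(length, problem_regions) >= min_clean


# 120-residue fragment: predicted signal peptide (0-25) and SEG region (70-100).
print(longest_clean_run(120, [(0, 25), (70, 100)]))
print(keep_fragment(120, [(0, 25), (70, 100)]))
```

Counting all unmasked residues instead of the longest run would be a more lenient reading of the rule; the consecutive version is the stricter of the two.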

Different thresholds for different objectives. Our procedure required introducing different thresholds for sequence similarity depending on whether we wanted to chop proteins into fragments, to ascertain that two proteins have a common fold, or to guarantee that a target cluster is really unlikely to be modelled by comparative modelling. CHOP was restricted to pairwise BLAST E-values <10^-2 over 80% of the PrISM domain and to E<10^-2 in HMMer [77] for Pfam-A regions. We included proteins into a sequence-structure family when they overlapped at least 50 residues with the representative chosen by CLUP at a similar threshold for sequence similarity as used to chop.

 

 



Acknowledgements

Thanks to our experimental colleagues at the Northeast Structural Genomics Consortium (NESG) for the invaluable readiness to let theory determine what they sweat on in their labs! In particular, thanks to Thomas Szyperski (Buffalo), Cheryl Arrowsmith and Aled Edwards (Toronto), John Hunt and Liang Tong (Columbia), to Mike Kennedy (Pacific Northwest Natl Laboratory, Richland) and to George DeTitta (Buffalo). Thanks to our colleagues also involved in target selection for helpful discussions: Barry Honig and Sharon Goldsmith (Columbia) and Diana Murray (Cornell). Thanks to Mark Gerstein and his group (Yale) for pushing us to develop PEP. Particular thanks to Phil Carter (Columbia, New York and Imperial College, London) for building the databases PEP, CHOP, and CLUP, to An-Suei Yang (Columbia) for providing and helping with PrISM, and to Dariusz Przybylski (Columbia) for providing preliminary information and programs. Also thanks to the EVA-team for helping with this resource essential for many aspects of this work, in particular to Volker Eyrich and Ingrid Koh (both Columbia), Alfonso Valencia (Madrid), Marc Marti-Renom, Andrej Sali (both UCSF) and their groups. This work was supported by a grant from the Protein Structure Initiative of the National Institutes of Health (P50 GM62413). Last but not least, thanks to all those who deposit their experimental data in public databases, in particular in the context of structural genomics, and to the teams around PDB (Helen Berman, Rutgers and Phil Bourne, UCSD), Pfam (Alex Bateman, Sanger and Erik Sonnhammer, Stockholm), and SWISS-PROT (Amos Bairoch, SIB Geneva) who maintain these databases that were central to this work.

 

 

 

 

References

1. Liu, J. & Rost, B. (2003). CHOP proteins into structural domain-like fragments. Proteins, submitted.
2. Lupas, A. (1996). Prediction and analysis of coiled-coil structures. Meth. Enzymol., 266, 513-525.
3. Liu, J., Tan, H. & Rost, B. (2002). Loopy proteins appear conserved in evolution. J. Mol. Biol., 322, 53-64.
4. Liu, J. & Rost, B. (2003). NORSp: predictions of long regions without regular secondary structure. Nucl. Acids Res., 31, 3833-3835.
5. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242.
6. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L. et al. (2002). The Pfam protein families database. Nucl. Acids Res., 30, 276-80.
7. Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J. Mol. Biol., 301, 679-689.
8. Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance. J. Mol. Biol., 301, 665-678.
9. Yang, A. S. & Honig, B. (2000). An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments. J. Mol. Biol., 301, 691-711.
10. Wootton, J. C. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Meth. Enzymol., 266, 554-571.
11. Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Prot. Engin., 10, 1-6.
12. Nielsen, H., Brunak, S. & von Heijne, G. (1999). Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Prot. Engin., 12, 3-9.
13. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-48.
14. NIGMS (2003). Protein structure initiative. 2003, .
15. Gaasterland, T. (1998). Structural genomics taking shape. TIGS, 14, 135.
16. Rost, B. (1998). Marrying structure and genomics. Structure, 6, 259-263.
17. Sali, A. (1998). 100,000 protein structures for the biologist. Nat. Struct. Biol., 5, 1029-1032.
18. Shapiro, L. & Lima, C. D. (1998). The Argonne Structural Genomics Workshop: Lamaze class for the birth of a new science. Structure, 6, 265-267.
19. Brenner, S. E., Barken, D. & Levitt, M. (1999). The PRESAGE database for structural genomics. Nucl. Acids Res., 27, 251-3.
20. Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R. et al. (1999). Structural genomics: beyond the human genome project. Nat. Gen., 23, 151-157.
21. Montelione, G. T. & Anderson, S. (1999). Structural genomics: keystone for a Human Proteome Project. Nat. Struct. Biol., 6, 11-12.
22. Teichmann, S. A., Chothia, C. & Gerstein, M. (1999). Advances in structural genomics. Curr. Opin. Str. Biol., 9, 390-399.
23. Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Savchenko, A. et al. (2000). Structural proteomics of an archaeon. Nat. Struct. Biol., 7, 903-9.
24. Hendrickson, W. A. (2000). Synchrotron crystallography. TIBS, 25, 637-643.
25. Montelione, G. T., Zheng, D., Huang, Y. J., Gunsalus, K. C. & Szyperski, T. (2000). Protein NMR spectroscopy in structural genomics. Nat. Struct. Biol., 7, 982-985.
26. Moult, J. & Melamud, E. (2000). From fold to function. Curr. Opin. Str. Biol., 10, 384-389.
27. Skolnick, J., Fetrow, J. S. & Kolinski, A. (2000). Structural genomics and its importance for gene function analysis. Nat. Biotechnol., 18, 283-287.
28. Thornton, J. (2001). Structural genomics takes off. TIBS, 26, 88-89.
29. Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins, 9, 56-68.
30. Abagyan, R. A. & Batalov, S. (1997). Do aligned sequences share the same fold? J. Mol. Biol., 273, 355-368.
31. Park, J., Teichmann, S. A., Hubbard, T. & Chothia, C. (1997). Intermediate sequences increase the detection of distant sequence homologies. J. Mol. Biol., 273, 349-354.
32. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. U.S.A., 95, 6073-6078.
33. Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D. et al. (1998). Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201-1210.
34. Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin., 12, 85-94.
35. Li, W., Pio, F., Pawlowski, K. & Godzik, A. (2000). Saturated BLAST: an automated multiple intermediate sequence search used to detect distant homology. Bioinformatics, 16, 1105-1110.
36. Blake, J. D. & Cohen, F. E. (2001). Pairwise sequence alignment below the twilight zone. J. Mol. Biol., 307, 721-735.
37. Liu, J. & Rost, B. (2003). Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol., 7, 5-11.
38. Schneider, R., de Daruvar, A. & Sander, C. (1997). The HSSP database of protein structure-sequence alignments. Nucl. Acids Res., 25, 226-230.
39. Barker, W. C., Garavelli, J. S., McGarvey, P. B., Marzec, C. R., Orcutt, B. C. et al. (1999). The PIR-International Protein Sequence Database. Nucl. Acids Res., 27, 39-43.
40. Wu, C. H., Yeh, L. S., Huang, H., Arminski, L., Castro-Alvear, J. et al. (2003). The Protein Information Resource. Nucl. Acids Res., 31, 345-347.
41. Orengo, C. A., Bray, J. E., Buchan, D. W., Harrison, A., Lee, D. et al. (2002). The CATH protein family database: A resource for structural and functional annotation of genomes. Proteomics, 2, 11-21.
42. Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. & Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucl. Acids Res., 30, 264-7.
43. Rost, B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525-539.
44. Rost, B., Casadio, R. & Fariselli, P. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Prot. Sci., 5, 1704-1718.
45. Schulz, G. E. (2000). beta-Barrel membrane proteins. Curr. Opin. Str. Biol., 10, 443-447.
46. Wunderlich, Z., Liu, J., Kornhaber, G., Acton, T. B., Rost, B. et al. (2003). ZebaView: the official protein target list of the northeast structural genomics consortium. Proteins, in press.
47. Bertone, P., Kluger, Y., Lan, N., Zheng, D., Christendat, D. et al. (2001). SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics. Nucl. Acids Res., 29, 2884-2898.
48. Carter, P., Liu, J. & Rost, B. (2003). PEP: Predictions for Entire Proteomes. Nucl. Acids Res., 31, 410-3.
49. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A. et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucl. Acids Res., 31, 365-370.
50. de Souza, S. J., Long, M., Schoenbach, L., Roy, S. W. & Gilbert, W. (1996). Intron positions correlate with module boundaries in ancient proteins. Proc. Natl. Acad. Sci. U.S.A., 93, 14632-14636.
51. Lupas, A. N., Ponting, C. P. & Russell, R. B. (2001). On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol., 134, 191-203.
52. Eisenberg, D., Marcotte, E. M., Xenarios, I. & Yeates, T. O. (2000). Protein function in the post-genomic era. Nature, 405, 823-826.
53. Jones, S., Marin, A. & Thornton, J. M. (2000). Protein domain interfaces: characterization and comparison with oligomeric protein interfaces. Prot. Engin., 13, 77-82.
54. Teichmann, S. A., Rison, S. C., Thornton, J. M., Riley, M., Gough, J. et al. (2001). The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J. Mol. Biol., 311, 693-708.
55. Liu, Y. & Eisenberg, D. (2002). 3D domain swapping: As domains continue to swap. Prot. Sci., 11, 1285-1299.
56. Ofran, Y. & Rost, B. (2003). Analysing six types of protein-protein interfaces. J. Mol. Biol., 325, 377-387.
57. Moult, J. (2003). Structure to function consortium (S2F). 2003, .
58. Eyrich, V., Marti-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-1243.
59. Marti-Renom, M. A. (2003). Accuracy of comparative modelling. .
60. Koh, I. Y. Y., Eyrich, V. A., Marti-Renom, M. A., Przybylski, D., Madhusudhan, M. S. et al. (2003). EVA: evaluation of protein structure prediction servers. Nucl. Acids Res., 31, 3311-3315.
61. Yokoyama, S. & Kuramitsu, S. (2003). RIKEN Structural Genomics Initiative (RSGI). 2003, .
62. Linial, M. & Yona, G. (2000). Methodologies for target selection in structural genomics. Prog. Biophys. molec. Biol., 73, 297-320.
63. Brenner, S. E. (2001). A tour of structural genomics. Nat. Rev. Genet., 2, 801-809.
64. Liu, J. & Rost, B. (2001). Comparing function and structure between entire proteomes. Prot. Sci., 10, 1970-1979.
65. Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nat. Struct. Biol., 8, 559-566.
66. Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922-933.
67. Wright, P. E. & Dyson, H. J. (1999). Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol., 293, 321-331.
68. Dunker, A. K. & Obradovic, Z. (2001). The protein trinity-linking function and disorder. Nat. Biotechnol., 19, 805-806.
69. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389-402.
70. Jones, D. T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202.
71. Przybylski, D. & Rost, B. (2002). Alignments grow, secondary structure prediction improves. Proteins, 46, 195-205.
72. Emanuelsson, O. & von Heijne, G. (2001). Prediction of organellar targeting signals. Biochim. Biophys. Ac., 1541, 114-119.
73. Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.
74. Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72.
75. Rost, B. & Liu, J. (2003). The PredictProtein server. Nucl. Acids Res., 31, 3300-3304.
76. Rost, B. & Sander, C. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216-226.
77. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14, 755-63.

Contact:    rost@columbia.edu Version:    Sep 28, 2003