
Title: Loopy proteins appear conserved in evolution
Author: Jinfeng Liu, Hepan Tan & Burkhard Rost
Quote: J Mol Biol, 2002, 332(1): 53-64

This article is published in (Journal of Molecular Biology, Vol 332, 2002, p53-64) © copyright Journal of Molecular Biology, Academic Press (2002). Academic Press is the only authorised source. All copying of this article including placing on another website requires the written permission of the copyright owner.



Loopy proteins appear conserved in evolution

Jinfeng Liu 1,2,3, Hepan Tan 2 & Burkhard Rost 2, 3, 4, *

  1. Department of Pharmacology, Columbia University, 630 West 168th Street, New York, NY 10032, USA, liu@cubic.bioc.columbia.edu
  2. Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street, New York, NY 10032, USA, liu@cubic.bioc.columbia.edu, tan@cubic.bioc.columbia.edu, rost@columbia.edu
  3. North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA
  4. Columbia University Center for Computational Biology and Bioinformatics (C2B2), RussBerrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA


    Abstract

    Over the last decade, structural biologists have unravelled many proteins that appear natively disordered. Common assumptions are that many of these proteins adopt structure through binding and that the structural flexibility enables them to adopt different functions. Here, we investigated regions of more than 70 sequence-consecutive residues that have no regular secondary structure (NORS). Analysing 31 entirely sequenced organisms, we predicted five times as many proteins with NORS regions ('loopy' proteins) in eukaryotes (20%) as in prokaryotes and archaea (4%). Thousands of these NORS regions were over 150 residues long. The amino acid composition of NORS regions differed from that of loops in PDB. Although NORS regions had significantly more low-complexity residues than other proteins, simple cut-off thresholds for sequence bias missed most NORS regions. On average, NORS regions were evolutionarily at least as conserved as their flanking regions. Furthermore, yeast proteins with NORS regions had more protein-protein interaction partners than other proteins. Regulatory and transcription-related functions were over-represented in loopy proteins, whereas biosynthesis and energy metabolism were under-represented. Overall, our analysis confirmed that proteins with non-regular structures appear to play important functional roles, and they may adopt yet unknown types of protein structures.

     

     

    Key words: genome sequence analysis; proteomes; bioinformatics; natively unstructured proteins; no regular secondary structure; disordered regions; low-complexity; protein function; protein-protein interactions.

     

     

    Abbreviations used

    3D structure: three-dimensional structure, i.e. co-ordinates of all residues/atoms in a protein
    COILS: prediction of coiled-coil regions from sequence based on statistics and expert rules [1]
    DIP: database of interacting proteins [2]
    DSSP: automatic assignment of secondary structure and solvent accessibility from 3D co-ordinates [3]
    NORS: segment of more than 70 consecutive residues of NO Regular Secondary structure, i.e. without helix or strand (more precisely, we required that less than 12% of the residues in the respective region were in helix or strand and that at least one region of more than 10 residues was exposed to solvent, see Methods)
    NORS proteins: proteins with at least one NORS region
    ORF: open reading frame (protein predicted by genome sequencing project)
    PDB: protein data bank of protein structures [4]
    PDBsub: sequence-unique subset of PDB with 1947 chains
    PHDacc: profile-based neural network prediction of solvent accessibility [5]
    PHDsec: profile-based neural network prediction of secondary structure [6, 7]
    SignalP: neural network based prediction of signal peptides [8]
    SWISS-PROT: curated database with protein sequences and functional annotations [9]
    TrEMBL: automatic translation of EMBL nucleotide database of protein sequences [9]


     

     

    Introduction

    Protein function may require flexible structures. The sequence of a protein largely determines its three-dimensional (3D) structure [10, 11, 12], and structure often determines function [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. Nevertheless, many proteins undergo changes in conformation upon binding to substrates or other ligands [23, 15, 24, 25]. Some biological functions may require more flexible structures than others: for the catalytic activity of an enzyme, the precise interaction between enzyme and substrate may be critical [26, 27]. On the other hand, a structure that is intrinsically more flexible, more 'loopy', may adapt more readily to different environments. Consequently, loopy structures may recognise many different biological targets [28, 29, 30, 31, 32, 33, 34, 35, 36, 37]. For example, the Serine/threonine phosphatase Calcineurin becomes activated by binding a Ca2+-calmodulin complex through a region that exists as a disordered ensemble [38, 39, 40]. Loopy structures also appear important for macro-molecular assembly, as exemplified by the assembly of the tobacco mosaic virus or the bacterial flagellum [41].

    Identifying disordered regions in silico. One class of 'natively disordered' regions was initially identified through the observation that such regions are invisible in electron density maps, since the disorder prevented them from crystallising into well-ordered structures that scatter X-rays coherently. These regions appear to be frequently characterised by a particular bias in the use of amino acids, usually referred to as 'compositional bias' or 'regions of low sequence complexity' [42, 43, 44, 45] . Romero and colleagues developed a method that predicts disordered regions by training a neural network to identify low-complexity regions longer than 40 residues [46, 47] . Applying their method to the SWISS-PROT database [9] they have found more than 15,000 protein regions that are putatively disordered [46] .

    Here, we studied the problem of disordered proteins from a more structure-oriented perspective. We investigated regions of more than 70 residues that have very low content in regular secondary structure (helix or strand). These extended regions of 'no regular secondary structure' (NORS) may still be sufficiently 'ordered' to deflect X-rays and yield electron density maps. However, their lack of regular secondary structure is certainly intriguing. We found NORS regions to be particularly abundant in Eukaryotic proteomes, to be evolutionarily conserved, and to be enriched in regulatory functions and in protein-protein interactions.

     

    Results


    Analysing 'loopy' proteins in PDB

    Visual classification into four types. We defined NORS proteins as those with at least one sequence-continuous fragment of over 70 residues with fewer than seven residues in regular secondary structure (helix or strand). We found fewer than 20 such proteins in a sequence-unique subset of PDB [4], and then visually sorted these NORS proteins into 'types' according to the structural context. Note that these types were by no means objective, i.e. they were not based on a definition enabling automatic classification. We distinguished the following four types (Fig. 1). (1) Connecting loops (Fig. 1A) are long loops that connect structural domains or linked subunits (e.g. 1AA6 [48], 1BF2 [49], and 4DPV [50]). (2) Loopy ends (Fig. 1B) are long N- or C-terminal loops (e.g. 1DHX [51], 1B35 C chain [52], and 1B0P [53]). (3) Wrapping loops (Fig. 1C) are long loops wrapping around otherwise 'normal globular' domains (e.g. 2BAA [54], 7CAT [55, 56], and 1CPO [57]). (4) Loopy domains (Fig. 1D) are entire proteins or domains lacking regular secondary structure (e.g. 1TBI [58] and 1TAC [59]).




    Fig. 1: Four types of PDB proteins with NORS regions. NORS regions are defined to have at least 70 consecutive residues with less than 10% regular secondary structure (helix or strand). We found four types of proteins.
    (A) Connecting loops: long loops that connect two domains or chains (shown Formate Dehydrogenase H, 1AA6, [48] ). In the Isoamylase (1BF2), the 96-residue loop connects domains N and A, and forms an inter-domain bridge across the barrel. The corresponding loop is absent in the alpha-amylase family enzymes lacking domain N, but a domain analysis based on the distance map showed the loop to be included in domain A [49] . The loop EF (residues 279-334) in the DNA-containing capsid of canine parvovirus (4DPV) contacts several neighbouring beta-strands thus apparently accounting for the specificity in the assembly of interactions [50] .
    (B) Loopy ends: long N- or C-terminal regions that lack regular secondary structure (shown Hexon from adenovirus type 2, 1DHX [51]). The overall shape of the trimeric Adenovirus type 2 hexon (1DHX) is unusual and may be divided into a pseudo-hexagonal base rich in secondary structure, and a triangular top formed from three long loops [51]. The hexon top consists of intimately interacting loops (l1, l2 and l4) emerging from P1 and P2 in the base; it has a triangular shape not exhibiting the pseudo-symmetry of the base. In general, temperature factors are good indicators of atomic flexibility. The N-terminal arm at the base and the loop l1 at the top of 1DHX have the highest average temperature values in that structure. Similarly, the loop insertions between β-strands E and F and G and H of CPV capsid protein (1B35, C chain) add specificity to the assembly interaction by forming inter-bridging sheets [52].
    (C) Loopy wraps: long loopy regions wrapping around globular domains (shown Class II chitinase, 2BAA [54] ). The third domain of beef liver Catalase (7CAT, residues 321-436) is referred to as the "wrapping domain"; it forms an outer layer to each subunit. It lacks discernible secondary structure in a long stretch between residues 366 and 420. However, this domain contains the essential helix with the proximal ligand Tyr357. The wrapping domain also forms a short secondary structure with an identical region of the P-axis-related-subunit [55] .
    (D) Loopy domains: entire structures that have almost no regular secondary structure (shown extra-cellular domain of T beta RI, 1TBI [58]). The Arg78-Gly79-Asp80 (RGD loop) of the HIV-1 trans-activating regulatory protein TAT (1TAC) is a key site for adhesive recognition and receptor interaction. This region is also solvent-exposed at the tip of a hairpin structure that is experimentally well defined by several NOESY cross-peaks, as reflected in the low variation between the NMR ensemble for this loop. This low variation is unusual given that RGD loops seem to be very flexible in most proteins studied so far; however, the structure of decorsin (1DEC) also has a rigid RGD loop similar to that of the HIVZ2 Tat protein [104]. The rigidity of the HIVZ2 Tat protein RGD loop structure may be due to two close Proline residues, Pro77 and Pro81, flanking the loop.




     

    Functional reasons for NORS regions. Most NORS regions in PDB have enzymatic activities and/or are involved in substrate/ligand processes. For example, the turnover number of Pyruvate ferredoxin oxidoreductase (1B0P, A chain) increases by at least a factor of five upon reduction and complete removal of the loop [53]. A molecular dynamics simulation has indicated that the removal of the C-terminal domain VII may result in the formation of a new hydrophobic channel of only 7 Å. This channel could bridge the active site close to the molecular surface and could serve to evacuate reaction products. The observed increase in activity of the reduced enzyme compared to the oxidised form may be due to an easier flow of substrates and products toward and from the active site. For Chloroperoxidase (1CPO) [57] and for the Class II Chitinases (2BAA) [54], the long loops are involved in substrate binding. The RGD loop is the key site for adhesive recognition and receptor interaction in HIVZ2 Tat protein (1TAC) [59]. The loop of the Sea raven type II antifreeze protein (SRAFP) (2AFP) [60] is part of the ice-binding site. The loops share inhibitor binding and DNA binding capabilities in Carboxypeptidase (6CPA) [61].

    Errors in predicting NORS regions. We optimised our definition of predicted NORS regions to yield a low false positive rate when applying the method to PDB ( Table 1 S, Supplement): The predicted content in regular secondary structure (helix or strand) is below 12% over at least 70 consecutive residues, and at least 10 consecutive residues are predicted to be exposed. Based on these criteria, we predicted NORS regions for 23 proteins from our sequence-unique subset of PDB. Five of these were also identified when using the DSSP assignments of the actual rather than the predicted secondary structure; five others were false positives. The remaining 13 proteins contained unusually long loopy regions although they were not detected when applying our criterion on the DSSP assignment from the 3D coordinates.

    NORS regions on average depleted of hydrogen bonds. We investigated whether NORS regions found in PDB were indeed flexible in structure in the following way: we calculated the number of hydrogen bonds within and between NORS regions, as well as between NORS regions and non-NORS regions. Between residues within NORS regions, we counted on average 0.66 hydrogen bonds per residue. Between residues in non-NORS regions of similar length, we counted 1.209 hydrogen bonds per residue. These two values differed significantly (t=-7.8, p<0.001). Similarly, we found about 0.13 hydrogen bonds per residue between NORS residues and the rest of the proteins (non-local), while non-NORS regions had 0.27 non-local hydrogen bonds per residue (t=-3.3, p=0.001). Thus, NORS regions appeared significantly less stabilised by hydrogen bonds than non-NORS regions.

     


    Predicting NORS regions in entire proteomes

    Many proteins with NORS regions in proteomes. We predicted a high fraction of proteins with NORS regions in each of the 31 entirely sequenced proteomes that we tested (Fig. 2; Table 2S, Supplement). The numbers differed considerably between the three kingdoms: for most archaebacteria and prokaryotes we predicted NORS regions in less than 5% of all proteins (exception: Aeropyrum pernix, 13%), while the percentages were 17-30% for eukaryotes (Fig. 2A). For eukaryotes, 7-15% of all residues were predicted to be in NORS regions (Fig. 2B). Most NORS regions were 70-130 residues long (Fig. 2C). Almost all extremely long NORS regions (>500 residues) were found in eukaryotes.




    Fig. 2: Many NORS proteins predicted in proteomes. We predicted many NORS regions in 31 entirely sequenced organisms. NORS proteins appeared particularly abundant in eukaryotes. The upper left graph (A) gives the percentage of proteins in the respective proteome for which we predict at least one NORS region. The upper right graph (B) illustrates the percentage of all the residues of the respective proteome for which we predict a NORS region (note the difference in scale between A and B). The lower graph (C) gives the percentage of all predicted NORS regions that are between N and N+10 residues long (note that, by definition, NORS regions are longer than 70 residues). Surprisingly, almost 15% of all the predicted NORS regions extend over more than 200 residues (inset of C).





     

    NORS regions have specific amino acid composition. We compared the NORS regions predicted in the proteomes (Fig. 3A) to non-NORS regions in the same set of proteins (Fig. 3B), as well as to all proteins in SWISS-PROT (Fig. 3C) and to all residues without regular secondary structure in a sequence-unique subset of PDB (Fig. 3D). The most rigid amino acids (WCFIYVLM), as measured by the Vihinen scale [62] that reflects side chain motion, were severely under-represented, while only some of the flexible amino acids (GQSP) were over-represented (Fig. 3A). Although loop residues (Fig. 3D) exhibited a similar trend, the amino acid composition of NORS regions differed significantly from that of loop residues. More specifically, NORS regions were more depleted in WFVD, and more enriched in QSP. NORS regions and loops shared some similarities (high P, low E) that distinguished them from SWISS-PROT proteins and non-NORS proteins in proteomes.




    Fig. 3: NORS regions use particular amino acids. The height of the one-letter amino acid code is proportional to the abundance of the respective acid in each data set. The actual value is the difference in occurrence with respect to the frequency observed in a sequence-unique subset of PDB.

    Inverted letters indicate acids that are less frequent than 'expected'. The amino acids are sorted by 'flexibility' [62] , with the more rigid ones on the left. Overall, NORS regions are as abundant in more flexible residues as loop regions in PDB. However, we found considerably more Serine (S), Glutamine (Q), and Glycine (G) and considerably fewer Arginine (R), Aspartic acid (D), Glutamic acid (E), Tryptophan (W), and Phenylalanine (F) in NORS regions than in loop regions, in general.
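
    The composition difference plotted in Fig. 3 can be illustrated with a short sketch. It assumes that the plotted value is simply the fractional frequency of each amino acid in a data set minus its frequency in a reference set; the function names `composition` and `composition_difference` are hypothetical and not taken from the original work.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition(sequences):
    """Return the fractional frequency of each standard amino acid."""
    counts = Counter()
    for seq in sequences:
        # Ignore non-standard symbols such as X or gap characters.
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_difference(dataset, reference):
    """Frequency in `dataset` minus frequency in `reference`, per amino acid.

    Positive values mean over-representation relative to the reference
    (assumed here to play the role of the sequence-unique PDB subset).
    """
    observed = composition(dataset)
    expected = composition(reference)
    return {aa: observed[aa] - expected[aa] for aa in AMINO_ACIDS}
```

    Negative values correspond to the inverted letters in the figure, i.e. amino acids that are less frequent than in the reference set.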




     

    Low-complexity and NORS regions differed. We compared the complexity (K2, Eqn. 1) between NORS and non-NORS regions (Fig. 4). NORS regions were clearly shifted toward lower complexity values (Fig. 4A): about 16% of the NORS regions had K2 values below 2.9, while only 1% of the fragments in non-NORS, PDB proteins, or SWISS-PROT proteins were below this value (Fig. 4A, inset). We also monitored low-complexity using a slightly different definition, namely the percentage of residues considered to be of low-complexity according to the widely used method SEG [42]. Consistent with the findings for the K2 distribution, NORS regions had a higher fraction of SEG residues (Fig. 4B). However, more than 80% of the NORS regions predicted could not have been identified by only applying some threshold in low-complexity.




    Fig. 4: Most NORS regions had similar compositional bias as PDB proteins. We measured sequence composition in two slightly different ways: by the Shannon entropy averaged over segments of 45 consecutive residues (K2, Eqn. 1) and by the percentage of low-complexity residues assigned by the program SEG [42]. The distribution of the Shannon entropy for NORS regions was shifted towards lower values than that for non-NORS regions (A). Similarly, NORS regions had significantly more residues of low-complexity than non-NORS regions (B). However, if we choose a threshold in complexity that considers only 1% of the PDB proteins to have low-complexity segments longer than 45 residues, we detect only 16% of the NORS regions predicted (cumulative percentage given in the inset of A).




     

    NORS regions as conserved as flanking regions. In multiple alignments of evolutionarily diverged protein families, we typically observe two kinds of consecutive regions [63, 64, 65, 66] : (1) Regions that can be aligned over the entire length, and (2) regions for which some of the family members have insertions. The usual assumption is that regions with long deletions/insertions are functionally less important. To determine whether NORS regions were evolutionarily conserved, we compared the information content ( eqn. 3 ) in NORS regions to that of the N- and C-terminal non-NORS segments. For more than 20% of all NORS regions, we could not distinguish between the conservation of the NORS and of its flanking regions (no difference in information content, Fig. 5 ). For about 56% of all pairs NORS/flanking region, the NORS had a similar or higher information content, i.e. was evolutionarily equally or more conserved (negative values, Fig. 5 inset). A detailed analysis revealed that the differences in evolutionary conservation were not statistically significant. This suggested that NORS regions evolved according to similar evolutionary constraints as the flanking regions.




    Fig. 5: NORS regions as conserved as flanking regions. In order to investigate whether or not NORS regions were evolutionarily conserved, we measured the difference in the information content between alignments in NORS and their flanking regions (I_flanking - I_NORS, Eqn. 3). The percentage values were compiled over all pairs of NORS/flanking regions, i.e. the total number of pairs was twice the number of NORS regions found. The inset gives the cumulative percentages. The difference in conservation between NORS and flanking regions was not statistically significant. In other words, NORS regions appeared, on average, as conserved as non-NORS regions.




     

    NORS proteins had slightly more interaction partners than non-NORS proteins. We analysed the Database of Interacting Proteins (DIP) [2], which predominantly lists all protein-protein interactions unravelled by the first two large-scale yeast two-hybrid experiments [67]. We found that 3464 (72%) of all non-NORS yeast proteins had one or more binding partners. In contrast, about 1126 (79%) of all 1556 NORS proteins in yeast had at least one interaction partner (Fig. 6). This difference was statistically significant (z=5.18, p<0.001, Eqn. 2). The comprehensive experimental analysis of protein-protein interactions in Helicobacter pylori [68] yielded similar results: 64% of the NORS and 44% of the non-NORS proteins had interaction partners (z=2.41, p=0.008, Eqn. 2). Most data from yeast two-hybrid experiments do not reveal the precise regions involved in protein-protein interfaces. However, we found 37 examples of protein-protein interactions in DIP for which the regions of interaction overlapped with NORS regions for at least 70 residues (Table 3S, Supplement). For example, the yeast protein YKR025w, a potential subunit of RNA polymerase III, interacts with RPC4_YEAST (YDL150W, RNA polymerase III chain C53) via region 62-200 [69], which coincides with the predicted NORS region at 82-155 (Table 3S). Another family of examples was revealed by a genome-wide two-hybrid screening showing that Lsm (Like sm) proteins interact with some splicing factors and proteins involved in mRNA turnover [70]. Many of these protein-protein interactions of the Lsm proteins seemed to be mediated by predicted NORS regions (Table 3S). Finally, NORS regions also appeared involved in interactions with actin-related proteins [71] (Table 3S).
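
    A comparison of two fractions, such as the fraction of NORS and of non-NORS proteins with at least one interaction partner, can be sketched with a textbook pooled two-proportion z-test. This is an assumption for illustration only: Eqn. 2 is not reproduced in this section, so the test actually used may differ, and the function name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(hits_a, total_a, hits_b, total_b):
    """Pooled two-proportion z statistic (textbook form, not Eqn. 2 itself).

    hits_*  : number of proteins with at least one interaction partner
    total_* : number of proteins in the respective group
    Returns a positive value when group A has the higher fraction.
    """
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("z is undefined when the pooled fraction is 0 or 1")
    return (p_a - p_b) / std_err
```

    With made-up counts, `two_proportion_z(80, 100, 60, 100)` compares 80% against 60% in two groups of 100 proteins each.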




    Fig. 6: NORS proteins interacted more than non-NORS proteins. We compared the number of interacting partners annotated in DIP [2] between predicted NORS and non-NORS proteins. We found that considerably more NORS than non-NORS proteins had one or more interaction partners (the inset gives the cumulative percentages). The difference between the distributions for NORS and non-NORS proteins was statistically significant (z=5.18, p<0.001, Eqn. 2).





     

    NORS proteins often related to regulation and transcription. For all predicted NORS proteins, we searched for functional annotations in SWISS-PROT. We found a variety of descriptions, including numerous carbohydrate modification sites, phosphorylation sites, disulphide bridges, and catalytic active sites. NORS regions occurred in many transcription factors, and were frequently found spanning half of the Zinc Finger motifs and the residues preceding these. Residues immediately upstream of Homeodomains were also often in NORS regions. Monika Riley introduced classes of cellular function to characterise the functional content of proteomes [72, 73] . We assigned such classes automatically through the program EUCLID [74, 75] . We classified about 45-65% of all proteins into one of 14 functional classes at a level reported to yield 70% correct classifications, namely above 30% pairwise sequence identity [76] . Notably, NORS proteins were significantly under-represented in most biosynthesis classes ('Amino acid biosynthesis', 'Biosynthesis of cofactors, prosthetic groups, and carriers', 'Fatty acid and phospholipid metabolism', 'Purines, pyrimidines, nucleosides, and nucleotides'), in 'Energy metabolism', and in 'Translation' compared to non-NORS proteins ( Fig. 7 A). In contrast, 'Regulatory Functions' and 'Transcription' classes were more abundant in NORS proteins ( Fig. 7 A). This was consistent with our observation that NORS appeared in many transcription factors. When grouping the 14 classes into three super-classes: energy, information and communication [74] , we found that NORS proteins were more often associated with 'Communication' (24±1% vs. 14±1%, Fig. 7 B) and less often with 'Energy' (21±1% vs. 31±2%, Fig. 7 B) than were non-NORS proteins.




    Fig. 7: NORS proteins unique in their spectrum of cellular functional classes. The program EUCLID [74] sorts proteins of experimentally known function into classes of cellular function. For each proteome, we compared the fraction of NORS proteins in each of these classes to that of the non-NORS proteins. Here, we show the averages over all 31 proteomes (inner circle: NORS proteins, outer circle: non-NORS proteins). The upper graph (A) separates all 14 classes assigned by EUCLID, the lower graph (B) groups the 14 classes into three 'super-classes'.


     

     

     




     

     

     

    Discussion and Conclusion

    Do NORS-, disordered-, natively unfolded regions and structural switches differ? It is commonly assumed that regions of non-regular secondary structure (turns, loops, or bends) are more flexible than the networks of backbone hydrogen bonds that stabilise helices and strands. In fact, two-thirds of all globular protein structures fall into a rather narrow window of 50-65% regular secondary structure, and only 1% of the proteins longer than 70 residues have less than 20% regular secondary structure [77]. Over recent years, experimentalists have found increasing evidence of proteins that appear to be unfolded in their native, unbound conformation [30, 31, 78, 35, 36, 37], or that can undergo considerable conformational changes upon binding [79, 80, 81, 82, 83, 84, 85, 86]; structural switches can be predicted from sequence [87, 25]. Dunker and colleagues developed a method predicting 'natively disordered' regions (labelled Dunker-Regions in the following) [43, 44, 46]. Uversky, Gillespie & Fink claimed that natively unfolded proteins could be identified through their net charge and hydrophobicity [78]; Zetina claimed that many 'natively unfolded' proteins contain a particular 'helix-unfolding' sequence-motif [88]. Here, we focused on regions longer than 70 residues that have unusually low content of regular secondary structure (NORS). We expect that NORS regions overlap with regions identified by the Dunker group: both Dunker-Regions and our NORS regions had more segments of low complexity (K2) than typical globular proteins (Fig. 4 and [47]). In contrast, the Dunker-Regions and NORS differed in their amino acid composition: while R and E were more abundant in Dunker-Regions, the usage of R was similar between NORS and non-NORS (Fig. 3) and E was even less frequent in NORS than in non-NORS (Fig. 3). The under-representation of C in Dunker-Regions also did not correspond to the observation in NORS regions. Finally, Dunker et al. predicted almost two times more 'disordered' regions in eukaryotes than we predicted NORS regions (Fig. 2 vs. [34]). Dunker and colleagues published their 20 strongest predictions: for all of these proteins we also predicted NORS regions, although for two of them the Dunker-Regions and NORS did not overlap (Table 4S, Supplement). We could not verify that all NORS regions were strictly confined to a particular region in the hydrophobicity vs. net charge plot, as has been postulated for natively unfolded proteins [78]. Clearly, the NORS regions we predicted did not overlap with regions typically labelled as 'structural switches'. In summary, the various attempts at characterising non-regular regions in proteins identify sets of regions that overlap only to some extent.

    Are NORS regions important for function? Although NORS regions were abundant in low complexity residues, they were evolutionarily as conserved as flanking regions (Fig. 5). This suggested that NORS regions are important for function. One way in which NORS regions could play important functional roles is through protein-protein interactions. Calcineurin is one particular example of protein-protein interactions that are mediated by disordered regions: the flexibility of a 95-residue segment in subunit A is important for Calmodulin binding [89, 90]. The analysis of the yeast two-hybrid results (Fig. 6) confirmed that proteins with NORS regions have, on average, more interaction partners than other proteins. The seemingly more active role of NORS proteins in protein-protein interaction might be explained by the hypothesis that NORS regions are stabilised by protein-protein interactions (induced fit). We also found NORS proteins to be more often related to regulatory functions and transcription than non-NORS proteins (Fig. 7). We do not know the precise role NORS regions play in transcription factors. An obvious hypothesis is that the conformational adaptability of NORS regions enables different regulation. This might also explain why NORS proteins appeared particularly abundant in Eukaryotes (Fig. 2).

    New types of protein structures? We found over 20,000 proteins with NORS regions over 130 residues in Eukaryotes alone. Currently, we have no example for any of these in PDB. Large-scale efforts at determining all protein structures (structural genomics initiatives) [15, 91, 92, 93, 94, 95, 96, 97] may unravel whether or not loopy proteins constitute a 'new class of protein structures'. However, while we have no data supporting speculations about how these proteins look or what they do, we could refute the assumption that NORS regions constitute some ancient carry-over that is functionally unimportant.

     

    Methods


    Data sets

    Source of proteome sequences. We obtained the sequences for all 31 organisms that we analysed from the public domain. We downloaded most ORFs from NCBI (ftp://ncbi.nlm.nih.gov/genbank/genomes). The exceptions were Homo sapiens (from SWISS-PROT release 39 and TrEMBL database release 15), Caenorhabditis elegans (from www.sanger.ac.uk/Projects/C_elegans/wormpep/, wormpep65), Drosophila melanogaster (from www.fruitfly.org/, release 2), and Mus musculus (from www.ensembl.org/Mus_musculus/).

    Sequence-unique subset of PDB (PDBsub). To reduce the bias from mutation studies, we restricted our analysis of PDB to a sequence-unique subset. This set was defined such that no pair in the subset had more than 33% pairwise identical residues over more than 100 aligned residues; more precisely, the HSSP distance [98] was below 0 for any pair in the set. We maintain a weekly update of such a set through our EVA server [99]. The set used for this study contained 1947 protein chains.


    Prediction methods

    Secondary structure, membrane helices and solvent accessibility. We obtained multiple sequence alignments by searching with the dynamic programming method MaxHom [100] against SWISS-PROT [9] . The resulting alignments were subsequently filtered [98] and used as input for PHDsec [6, 7] , PHDacc [7] , and PHDhtm [101] . For all methods we used the default parameters. For proteins of known structure, we assigned accessibility and secondary structure with DSSP [3] . In particular, we used the following convention to convert the 8 DSSP states into 3 classes: DSSP 'HGI'-> helix (H), DSSP 'EB'-> strand (E), and all other to non-regular (L). Buried residues were defined as those with a relative accessibility to solvent of < 16%.
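
    The conversion of the 8 DSSP states into 3 classes and the burial threshold described above can be sketched as follows. The mapping and the 16% threshold are taken from the text; the function names `dssp_to_three_state` and `is_buried` are hypothetical, and the input is assumed to be a plain string of DSSP state characters.

```python
HELIX_STATES = set("HGI")   # DSSP states mapped to helix (H)
STRAND_STATES = set("EB")   # DSSP states mapped to strand (E)


def dssp_to_three_state(dssp_string):
    """Map a string of DSSP states to the 3 classes H, E and L.

    Every state that is not in 'HGI' or 'EB' (including the blank that
    DSSP uses for unassigned residues) becomes non-regular (L).
    """
    return "".join(
        "H" if s in HELIX_STATES else "E" if s in STRAND_STATES else "L"
        for s in dssp_string
    )


def is_buried(relative_accessibility_percent):
    """A residue counts as buried if its relative accessibility is < 16%."""
    return relative_accessibility_percent < 16.0
```

    For example, `dssp_to_three_state("HGIEBTS C")` maps the three helix states to H, the two strand states to E, and the remaining four characters to L.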

    Secreted proteins and coiled-coil regions. We predicted signal peptides using the program SignalP [8], considering a protein to contain a signal peptide if the “mean S” value was above the default threshold. We predicted coiled-coil regions with COILS [1] using a window size of 28 and a probability threshold of 0.9.

    Sequence complexity. Low-complexity regions were determined with SEG [42] using default parameters. The compositional complexity K_2 of a sequence window was calculated as in the work of Wootton & Federhen [42]:

    $$ K_2 = -\sum_{i=1}^{N} \frac{n_i}{L} \log_2 \frac{n_i}{L} $$               (1)

    where N represents the number of letters in the alphabet (20 for amino acids in proteins) and n_i is the number of occurrences of amino acid i in a sequence window of length L. In particular, we chose segments of 45 consecutive residues to measure compositional bias.
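    As an illustration of the window-based complexity measure, the following sketch assumes the entropy form of K2 given by Wootton & Federhen, i.e. K2 = -sum_i (n_i/L) log2(n_i/L), evaluated over windows of 45 residues. The function names are hypothetical; this is not the SEG implementation.

```python
import math
from collections import Counter


def k2_complexity(window: str) -> float:
    """Compositional complexity (Shannon entropy in bits) of one sequence window.

    Assumes K2 = -sum_i (n_i / L) * log2(n_i / L), where n_i counts residue i
    in a window of length L. Residues that do not occur contribute nothing.
    """
    length = len(window)
    counts = Counter(window)
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


def k2_profile(sequence: str, window_size: int = 45) -> list[float]:
    """K2 for every window of `window_size` consecutive residues."""
    return [
        k2_complexity(sequence[start:start + window_size])
        for start in range(len(sequence) - window_size + 1)
    ]
```

    With a 20-letter alphabet the value ranges from 0 (a homopolymer window) to log2(20), about 4.32 bits, for a window in which all amino acids are equally frequent.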

    Functional classification. We classified cellular function using the program EUCLID [75]. The SWISS-PROT homologues input to EUCLID were identified by MaxHom (pairwise sequence identity > 30%). EUCLID assigned the following 14 categories of cellular function [102]: "Amino acid biosynthesis", "Biosynthesis of cofactors, prosthetic groups, and carriers", "Cell envelope", "Cellular processes", "Central intermediary metabolism", "Energy metabolism", "Fatty acid and phospholipid metabolism", "Other categories", "Purines, pyrimidines, nucleosides, and nucleotides", "Regulatory functions", "Replication", "Transcription", "Translation", "Transport and binding proteins". Finally, we added the class "Unclassified" listing all those proteins for which we either did not find homologues in SWISS-PROT or which could not be classified by EUCLID.


    Definition of No Regular Secondary Structure region (NORS)

    We identified NORS regions (extended regions of NO Regular Secondary structure) in the following way. First, we applied all programs (PHDsec, PHDacc, PHDhtm, COILS, and SignalP). Then, we compiled the percentage of residues with 'regular structural' signals (regular secondary structure, transmembrane helices, coiled-coil regions, signal peptides) over sliding windows of 70 consecutive residues. A NORS region was assigned if both of the following conditions applied. (1) The 'regular structural' content was below 12%, i.e. fewer than 12% of the residues in the window were predicted as helix, strand, coiled-coil, membrane helix, or signal peptide. (2) We found at least one continuous segment longer than 10 residues within which all residues were exposed to solvent. NORS regions were extended in both directions as long as the above two criteria remained valid.
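    The two criteria can be sketched as a sliding-window scan. The sketch below takes two per-residue boolean lists as input: `structured` (any 'regular structural' signal from the prediction programs) and `exposed` (predicted solvent exposure). These inputs, the function names, and the choice to model the extension step as the union of overlapping qualifying windows are assumptions made for illustration, not the original implementation.

```python
def longest_run(flags: list[bool]) -> int:
    """Length of the longest stretch of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def find_nors(structured: list[bool], exposed: list[bool], window: int = 70,
              max_structured: float = 0.12, min_exposed_run: int = 11):
    """Return NORS candidates as half-open (start, end) residue intervals.

    A window qualifies if (1) less than 12% of its residues carry a regular
    structural signal and (2) it contains a run of more than 10 consecutive
    exposed residues. Overlapping qualifying windows are merged (assumption).
    """
    regions: list[tuple[int, int]] = []
    for start in range(len(structured) - window + 1):
        end = start + window
        fraction = sum(structured[start:end]) / window
        if fraction < max_structured and longest_run(exposed[start:end]) >= min_exposed_run:
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

    For example, a 200-residue chain that is fully exposed, with predicted structure only in the first and last 50 residues, yields a single candidate region, because each qualifying window may contain at most 8 structured residues (8/70 < 12%).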


    Calculation of inter- and intra-region hydrogen bonds

    We extracted hydrogen bond information for all residues in PDBsub (Data sets) through the DSSP program [3]. We calculated the number of inter- and intra-region hydrogen bonds per residue for each NORS region in PDB, and averaged over all regions. For non-NORS regions, these numbers were obtained from 2000 randomly selected 70-residue sequence windows from the data set, in order to avoid over-sampling of overlapping sequence windows.


    Statistical analysis

    We applied the standard Student's t-test to determine whether the difference between the number of hydrogen bonds of NORS regions and non-NORS regions was significant. We also tested the differences between the proportions of two populations p1 and p2 in the following way:

                      (2)

    where  , and  

    The one-tailed probability was then obtained from a normal distribution table.
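    For readers who want to reproduce a test of this kind, the following sketch implements the standard pooled two-proportion z-test with a one-tailed probability from the normal distribution. That this textbook form matches the formula used in the study is an assumption; the function name and the argument layout (counts x1, x2 out of sample sizes n1, n2) are hypothetical.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test (textbook form, assumed here).

    x1 / n1 and x2 / n2 are the observed proportions p1 and p2.
    Returns the z statistic and the one-tailed probability of observing
    a value at least as extreme as |z| under the standard normal.
    Raises ZeroDivisionError if the pooled proportion is exactly 0 or 1.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Upper-tail probability of the standard normal, replacing the table lookup.
    one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return z, one_tailed
```

    Using math.erfc avoids a table lookup while giving the same upper-tail probability of the standard normal distribution.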


    Calculation of information content in a sequence segment

    The information content of sequence segments was determined from the multiple sequence alignment generated by MaxHom, according to the method described by Gorodkin et al. [103]. Briefly, the information content I_i of position i of the alignment was calculated as follows:

    $$ I_i = \sum_{k \in A} q_{ik} \log_2 \frac{q_{ik}}{p_k} $$               (3)

    where A = {A, C, D, …, W, Y, –} is the set of the 20 amino acids plus the gap symbol ('–'), and q_ik is the fraction of 'amino acid' k at position i. When k is not '–', p_k equals the a priori frequency of that amino acid in the SWISS-PROT database; for gaps, p_– is set to 1. The average information content of a sequence segment was then calculated by averaging over the individual positions within the segment.
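    A minimal sketch of this calculation follows, assuming the relative-entropy form of Gorodkin et al., I_i = sum_k q_ik log2(q_ik / p_k), with the gap background set to 1. Alignment columns are passed as strings with '-' for gaps. The function names are hypothetical, and the background frequencies must be supplied by the caller (the study used SWISS-PROT frequencies, which are not reproduced here).

```python
import math
from collections import Counter


def column_information(column: str, background: dict[str, float]) -> float:
    """Information content (bits) of one alignment column.

    Assumes I_i = sum_k q_ik * log2(q_ik / p_k); the gap symbol '-' uses
    p = 1, so gaps can only lower the information content.
    """
    total = len(column)
    information = 0.0
    for symbol, count in Counter(column).items():
        q = count / total
        p = 1.0 if symbol == "-" else background[symbol]
        information += q * math.log2(q / p)
    return information


def segment_information(columns: list[str], background: dict[str, float]) -> float:
    """Average information content over the positions of a segment."""
    return sum(column_information(c, background) for c in columns) / len(columns)
```

    With a uniform background of 0.05 per amino acid (a placeholder, not the SWISS-PROT distribution), a fully conserved column scores log2(20), about 4.32 bits, and a column consisting only of gaps scores 0.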


    Possible functional annotation of non-structured regions

    For proteins with NORS regions that were contained in SWISS-PROT, we extracted functional annotations from the 'FT' entry. For other proteins, we aligned each NORS region against SWISS-PROT (using MaxHom), and extracted functional annotations for homologues that had more than 50% pairwise identical residues over more than 100 aligned residues, i.e. an HSSP-distance above 15 [98]. Wherever possible, we only kept annotations that were related explicitly to the NORS regions.

     

     



    Acknowledgements

    Thanks to Henrik Nielsen (CBS, Denmark) for providing the source code for SignalP and for his generous help in using this program, to Andrei Lupas (MPI Tübingen) for helpful suggestions about using the COILS program, and to Henry Bigelow (Columbia) for crucial comments on the manuscript. Thanks to Florencio Pazos, Damien Devos and Alfonso Valencia (all CNB Madrid) for supplying and helping with the program EUCLID. Thanks to Alexei Murzin (MRC Cambridge) for useful discussions. Finally, thanks to the undisclosed reviewer who suggested analysing the hydrogen bonding networks of NORS regions. The work of JL and BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health. Last but not least, thanks to all those who deposit their experimental data into public databases, and to those who maintain these databases.

     

     

    References

    1. Lupas, A. (1996). Prediction and analysis of coiled-coil structures. Meth. Enzymol., 266, 513-525.
    2. Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S. M. et al. (2002). DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucl. Acids Res., 30, 303-5.
    3. Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22, 2577-2637.
    4. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucl. Acids Res., 28, 235-242.
    5. Rost, B. & Sander, C. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins, 20, 216-226.
    6. Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.
    7. Rost, B. & Sander, C. (1994). Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55-72.
    8. Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Prot. Engin., 10, 1-6.
    9. Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28, 45-48.
    10. Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181, 223-230.
    11. Anfinsen, C. B. & Scheraga, H. A. (1975). Experimental and theoretical aspects of protein folding. Adv. Prot. Chem., 29, 205-300.
    12. Ellis, R. J., Dobson, C. & Hartl, U. (1998). Sequence does specify protein conformation. TIBS, 23, 468.
    13. Murzin, A. G. (1996). Structural classification of proteins: new superfamilies. Curr. Opin. Str. Biol., 6, 386-394.
    14. Blomberg, N. & Nilges, M. (1997). Functional diversity of PH domains: an exhaustive modelling study. Folding & Design, 2, 343-355.
    15. Lima, C. D., Klein, M. G. & Hendrickson, W. A. (1997). Structure-based analysis of catalysis and substrate definition in the HIT protein family. Science, 278, 286-290.
    16. Wallace, A. C., Borkakoti, N. & Thornton, J. M. (1997). TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Prot. Sci., 6, 2308-2323.
    17. Russell, R. B. (1998). Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. J. Mol. Biol., 279, 1211-1227.
    18. Jones, S., van Heyningen, P., Berman, H. M. & Thornton, J. M. (1999). Protein-DNA interactions: A structural analysis. J. Mol. Biol., 287, 877-896.
    19. Moult, J. & Melamud, E. (2000). From fold to function. Curr. Opin. Str. Biol., 10, 384-389.
    20. Irving, J. A., Whisstock, J. C. & Lesk, A. M. (2001). Protein structural alignments and functional genomics. Proteins, 42, 378-382.
    21. Jones, S., Daley, D. T., Luscombe, N. M., Berman, H. M. & Thornton, J. M. (2001). Protein-RNA interactions: a structural analysis. Nucl. Acids Res., 29, 943-954.
    22. Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol., 307, 1113-1143.
    23. Trowbridge, I. S. (1991). Endocytosis and signals for internalization. Curr. Opin. Cell Biol., 3, 634-641.
    24. Unwin, N. (1998). The nicotinic acetylcholine receptor of the Torpedo electric ray. J. Struct. Biol., 121, 181-190.
    25. Young, M., Kirshenbaum, K., Dill, K. A. & Highsmith, S. (1999). Predicting conformational switches in proteins. Prot. Sci., 8, 1752-1764.
    26. Rozovsky, S., Jogl, G., Tong, L. & McDermott, A. E. (2001). Solution-state NMR investigations of triosephosphate isomerase active site loop motion: ligand release in relation to active site loop dynamics. J. Mol. Biol., 310, 271-280.
    27. Rozovsky, S. & McDermott, A. E. (2001). The time scale of the catalytic loop motion in triosephosphate isomerase. J. Mol. Biol., 310, 259-270.
    28. Perutz, M. F. (1992). What are enzyme structures telling us? Faraday Discuss, 1-11.
    29. Wang, Y., Sha, M., Ren, W. Y., van Heerden, A., Browning, K. S. et al. (1996). pH-dependent and ligand induced conformational changes of eucaryotic protein synthesis initiation factor eIF-(iso)4F: a circular dichroism study. Biochim. Biophys. Ac., 1297, 207-213.
    30. Wyatt, R., Kwong, P. D., Desjardins, E., Sweet, R. W., Robinson, J. et al. (1998). The antigenic structure of the HIV gp120 envelope glycoprotein. Nature, 393, 705-711.
    31. Wright, P. E. & Dyson, H. J. (1999). Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol., 293, 321-31.
    32. Zhou, G., Ellington, W. R. & Chapman, M. S. (2000). Induced fit in arginine kinase. Biophys. J., 78, 1541-50.
    33. Claussen, H., Buning, C., Rarey, M. & Lengauer, T. (2001). FlexE: efficient molecular docking considering protein structure variations. J. Mol. Biol., 308, 377-395.
    34. Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P. et al. (2001). Intrinsically disordered protein. J Mol Graph Model, 19, 26-59.
    35. Okada, K., Hirotsu, K., Hayashi, H. & Kagamiyama, H. (2001). Structures of Escherichia coli branched-chain amino acid aminotransferase and its complexes with 4-methylvalerate and 2-methylleucine: induced fit and substrate recognition of the enzyme. Biochem., 40, 7453-63.
    36. Weiss, M. A. (2001). Floppy SOX: mutual induced fit in hmg (high-mobility group) box-DNA recognition. Mol Endocrinol, 15, 353-62.
    37. Yaremchuk, A., Tukalo, M., Grotli, M. & Cusack, S. (2001). A succession of substrate induced conformational changes ensures the amino acid specificity of Thermus thermophilus prolyl-tRNA synthetase: comparison with histidyl-tRNA synthetase. J. Mol. Biol., 309, 989-1002.
    38. Manalan, A. S. & Klee, C. B. (1983). Activation of calcineurin by limited proteolysis. Proc. Natl. Acad. Sci. U.S.A., 80, 4291-4295.
    39. Manalan, A. S., Krinks, M. H. & Klee, C. B. (1984). Calcineurin: a member of a family of calmodulin-stimulated protein phosphatases. Proc Soc Exp Biol Med, 177, 12-16.
    40. Kissinger, C. R., Parge, H. E., Knighton, D. R., Lewis, C. T., Pelletier, L. A. et al. (1995). Crystal structures of human calcineurin and the human FKBP12-FK506-calcineurin complex. Nature, 378, 641-644.
    41. Namba, K. (2001). Roles of partly unfolded conformations in macromolecular self-assembly. Genes Cells, 6, 1-12.
    42. Wootton, J. C. & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Meth. Enzymol., 266, 554-571.
    43. Dunker, A. K., Garner, E., Guilliot, S., Romero, P., Albrecht, K. et al. (1998). Protein disorder and the evolution of molecular recognition: theory, predictions and observations. Pac. Symp. Biocomput., 473-84.
    44. Garner, E., Cannon, P., Romero, P., Obradovic, Z. & Dunker, A. K. (1998). Predicting Disordered Regions from Amino Acid Sequence: Common Themes Despite Differing Structural Characterization. Genome Inform Ser Workshop Genome Inform, 9, 201-213.
    45. Dunker, A. K. & Obradovic, Z. (2001). The protein trinity-linking function and disorder. Nat. Biotechnol., 19, 805-806.
    46. Romero, P., Obradovic, Z., Kissinger, C. R., Villafranca, J. E., Garner, E. et al. (1998). Thousands of proteins likely to have long disordered regions. Pac. Symp. Biocomput., 437-448.
    47. Romero, P., Obradovic, Z., Li, X., Garner, E. C., Brown, C. J. et al. (2001). Sequence complexity of disordered protein. Proteins, 42, 38-48.
    48. Boyington, J. C., Gladyshev, V. N., Khangulov, S. V., Stadtman, T. C. & Sun, P. D. (1997). Crystal structure of formate dehydrogenase H: catalysis involving Mo, molybdopterin, selenocysteine, and an Fe4S4 cluster. Science, 275, 1305-1308.
    49. Katsuya, Y., Mezaki, Y., Kubota, M. & Matsuura, Y. (1998). Three-dimensional structure of Pseudomonas isoamylase at 2.2 A resolution. J. Mol. Biol., 281, 885-897.
    50. Xie, Q. & Chapman, M. S. (1996). Canine parvovirus capsid structure, analyzed at 2.9 A resolution. J. Mol. Biol., 264, 497-520.
    51. Athappilly, F. K., Murali, R., Rux, J. J., Cai, Z. & Burnett, R. M. (1994). The refined crystal structure of hexon, the major coat protein of adenovirus type 2, at 2.9 A resolution. J. Mol. Biol., 242, 430-455.
    52. Tate, J., Liljas, L., Scotti, P., Christian, P., Lin, T. et al. (1999). The crystal structure of cricket paralysis virus: the first view of a new virus family. Nat. Struct. Biol., 6, 765-774.
    53. Chabriere, E., Charon, M. H., Volbeda, A., Pieulle, L., Hatchikian, E. C. et al. (1999). Crystal structures of the key anaerobic enzyme pyruvate:ferredoxin oxidoreductase, free and in complex with pyruvate. Nat. Struct. Biol., 6, 182-190.
    54. Hart, P. J., Pfluger, H. D., Monzingo, A. F., Hollis, T. & Robertus, J. D. (1995). The refined crystal structure of an endochitinase from Hordeum vulgare L. seeds at 1.8 A resolution. J. Mol. Biol., 248, 402-413.
    55. Fita, I. & Rossmann, M. G. (1985). The NADPH binding site on beef liver catalase. Proc Natl Acad Sci U S A, 82, 1604-1608.
    56. Fita, I. & Rossmann, M. G. (1985). The active center of catalase. J. Mol. Biol., 185, 21-37.
    57. Sundaramoorthy, M., Terner, J. & Poulos, T. L. (1995). The crystal structure of chloroperoxidase: a heme peroxidase--cytochrome P450 functional hybrid. Structure, 3, 1367-1377.
    58. Jokiranta, T. S., Tissari, J., Teleman, O. & Meri, S. (1995). Extracellular domain of type I receptor for transforming growth factor-beta: molecular modelling using protectin (CD59) as a template. FEBS Lett., 376, 31-36.
    59. Bayer, P., Kraft, M., Ejchart, A., Westendorp, M., Frank, R. et al. (1995). Structural studies of HIV-1 Tat protein. J. Mol. Biol., 247, 529-535.
    60. Gronwald, W., Loewen, M. C., Lix, B., Daugulis, A. J., Sonnichsen, F. D. et al. (1998). The solution structure of type II antifreeze protein reveals a new member of the lectin family. Biochem., 37, 4712-21.
    61. Kim, H. & Lipscomb, W. N. (1990). Crystal structure of the complex of carboxypeptidase A with a strongly bound phosphonate in a new crystalline form: comparison with structures of other complexes. Biochem., 29, 5546-5555.
    62. Vihinen, M., Torkkila, E. & Riikonen, P. (1994). Accuracy of protein flexibility predictions. Proteins, 19, 141-9.
    63. Pascarella, S. & Argos, P. (1992). Analysis of insertions/deletions in protein structures. J. Mol. Biol., 224, 461-471.
    64. Bork, P. & Gibson, T. J. (1996). Applying motif and profile searches. Meth. Enzymol., 266, 162-184.
    65. Taylor, W. R. (1997). Multiple sequence threading: an analysis of alignment quality and stability. J. Mol. Biol., 269, 902-943.
    66. Tatusov, R. L., Natale, D. A., Garkavtsev, I. V., Tatusova, T. A., Shankavaram, U. T. et al. (2001). The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucl. Acids Res., 29, 22-28.
    67. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S. et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-7.
    68. Rain, J. C., Selig, L., De Reuse, H., Battaglia, V., Reverdy, C. et al. (2001). The protein-protein interaction map of Helicobacter pylori. Nature, 409, 211-5.
    69. Flores, A., Briand, J. F., Gadal, O., Andrau, J. C., Rubbi, L. et al. (1999). A protein-protein interaction map of yeast RNA polymerase III. Proc Natl Acad Sci U S A, 96, 7815-20.
    70. Fromont-Racine, M., Mayes, A. E., Brunet-Simon, A., Rain, J. C., Colley, A. et al. (2000). Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast, 17, 95-110.
    71. Bon, E., Recordon-Navarro, P., Durrens, P., Iwase, M., Toh, E. A. et al. (2000). A network of proteins around Rvs167p and Rvs161p, two proteins related to the yeast actin cytoskeleton. Yeast, 16, 1229-41.
    72. Riley, M. (1993). Function of the gene products in Escherichia coli. Microbiol. Rev., 57, 862-952.
    73. Riley, M. & Labedan, B. (1997). Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module. J. Mol. Biol., 268, 857-868.
    74. Tamames, J., Ouzounis, C., Sander, C. & Valencia, A. (1996). Genomes with distinct function composition. FEBS Lett., 389, 96-101.
    75. Tamames, J., Ouzounis, C., Casari, G., Sander, C. & Valencia, A. (1998). EUCLID: automatic classification of proteins in functional classes by their database annotations. Bioinformatics, 14, 542-3.
    76. Devos, D. & Valencia, A. (2000). Practical limits of function prediction. Proteins, 41, 98-107.
    77. Rost WWW, B. (2001). Analysing secondary structure composition of known structures. Columbia University, WWW document (http://cubic.bioc.columbia.edu/results/2001/secstr/).
    78. Uversky, V. N., Gillespie, J. R. & Fink, A. L. (2000). Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins, 41, 415-427.
    79. Stouten, P. F. W., Sander, C., Wittinghofer, A. & Valencia, A. (1993). How does the switch II region of G-domains work? FEBS Lett., 320, 1-6.
    80. Ashkenazi, G., Ripoll, D. R., Lotan, N. & Scheraga, H. A. (1997). A molecular switch for biochemical logic gates: conformational studies. Biosens Bioelectron, 12, 85-95.
    81. Noel, J. R. (1997). Turning off the Ras switch with the flick of a finger. Nat. Struct. Biol., 4, 677-681.
    82. Solano, R., Fuertes, A., Sanchez-Pulido, L., Valencia, A. & Paz-Ares, J. (1997). A single residue substitution causes a switch from the dual DNA binding specificity of plant transcription factor MYB.Ph3 to the animal c-MYB specificity. J. Biol. Chem., 272, 2889-95.
    83. Azuma, Y., Renault, L., Garcia-Ranea, J. A., Valencia, A., Nishimoto, T. et al. (1999). Model of the ran-RCC1 interaction using biochemical and docking experiments. J. Mol. Biol., 289, 1119-30.
    84. Simeonidis, S., Stauber, D., Chen, G., Hendrickson, W. A. & Thanos, D. (1999). Mechanisms by which IkappaB proteins control NF-kappaB activity. Proc. Natl. Acad. Sci. U.S.A., 96, 49-54.
    85. Sola, M., Lopez-Hernandez, E., Cronet, P., Lacroix, E., Serrano, L. et al. (2000). Towards understanding a molecular switch mechanism: thermodynamic and crystallographic studies of the signal transduction protein cheY. J. Mol. Biol., 303, 213-25.
    86. Falke, S., Fisher, M. T. & Gogol, E. P. (2001). Structural changes in GroEL effected by binding a denatured protein substrate. J. Mol. Biol., 308, 569-577.
    87. Kirshenbaum, K., Young, M. & Highsmith, S. (1999). Predicting allosteric switches in myosins. Prot. Sci., 8, 1806-1815.
    88. Zetina, C. R. (2001). A conserved helix-unfolding motif in the naturally unfolded proteins. Proteins, 44, 479-483.
    89. Meador, W. E., Means, A. R. & Quiocho, F. A. (1992). Target enzyme recognition by calmodulin: 2.4 A structure of a calmodulin-peptide complex. Science, 257, 1251-5.
    90. Kissinger, C. R., Parge, H. E., Knighton, D. R., Lewis, C. T., Pelletier, L. A. et al. (1995). Crystal structures of human calcineurin and the human FKBP12-FK506-calcineurin complex. Nature, 378, 641-4.
    91. Gaasterland, T. (1998). Structural genomics taking shape. TIGS, 14, 135.
    92. Rost, B. (1998). Marrying structure and genomics. Structure, 6, 259-263.
    93. Sali, A. (1998). 100,000 protein structures for the biologist [see comments]. Nat. Struct. Biol., 5, 1029-32.
    94. Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R. et al. (1999). Structural genomics: beyond the human genome project. Nat Genet, 23, 151-7.
    95. Blundell, T. L. & Mizuguchi, K. (2000). Structural genomics: an overview. Prog Biophys Mol Biol, 73, 289-295.
    96. Shapiro, L. & Harris, T. (2000). Finding function through structural genomics. Current Opinion in Biotechnology, 11, 31-35.
    97. Thornton, J. (2001). Structural genomics takes off. TIBS, 26, 88-89.
    98. Rost, B. (1999). Twilight zone of protein sequence alignments. Prot. Engin., 12, 85-94.
    99. Eyrich, V., Martí-Renom, M. A., Przybylski, D., Fiser, A., Pazos, F. et al. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-1243.
    100. Sander, C. & Schneider, R. (1991). Database of homology-derived structures and the structural meaning of sequence alignment. Proteins, 9, 56-68.
    101. Rost, B., Casadio, R. & Fariselli, P. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Prot. Sci., 5, 1704-1718.
    102. Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, R. A. et al. (1995). The minimal gene complement of Mycoplasma genitalium [see comments]. Science, 270, 397-403.
    103. Gorodkin, J., Heyer, L. J., Brunak, S. & Stormo, G. D. (1997). Displaying the information contents of structural RNA alignments: the structure logos. CABIOS, 13, 583-6.
    104. Krezel, A. M., Wagner, G., Seymour-Ulmer, J. & Lazarus, R. A. (1994). Structure of the RGD protein decorsin: conserved motif and distinct function in leech proteins that affect blood clotting. Science, 264, 1944-1947.

    Contact:    rost@columbia.edu Version:    Jul 15, 2002