Protein structures sustain evolutionary drift

Folding & Design, 1997, 2, 519-524.

Burkhard Rost
Address: EMBL, 69 012 Heidelberg, Germany
Correspondence: Burkhard Rost
e-mail: rost@embl-heidelberg.de
contact e-mail:rost@embl-heidelberg.de


Table of Contents


Abstract

Background: A protein sequence folds into a unique three-dimensional protein structure. However, different sequences can fold into similar structures. How stable is a protein structure with respect to sequence changes? What percentage of the sequence are 'anchor' residues, i.e., are crucial for protein structure and function? Here, these questions were pursued by analysing large numbers of structurally homologous protein pairs.

Results: The distribution of pairwise sequence identity between structural homologues showed one prominent and unexpected feature: most pairs clustered in an approximately Gaussian peak centred at 8-9% sequence identity. The distribution was surprisingly similar to that expected for 'random' pairs of completely unrelated sequences. Qualitatively, the distributions were similar for different criteria to consider two structures homologous. An analysis of four entire genomes suggested that most structurally homologous pairs have less than 45% pairwise sequence identity.

Conclusions: (1) Most pairs of similar structures have sequence identity as low as expected from randomly related sequences. (2) On average only three to four percent of all residues are 'anchor' residues (residues crucial for maintaining the structure). (3) The symmetric shape of the distribution at low sequence identity suggests that for most structures, four billion years of evolution was sufficient to reach an equilibrium. (4) The mean identities for convergent (different ancestor) and divergent evolution (same ancestor) of proteins to similar structures are quite close, and hence, in most cases it is difficult to distinguish between the two effects. In particular, low levels of sequence identity appear not to be indicative of convergent evolution.


Introduction

Large scale studies of protein structure evolution can now begin. We have a detailed and ever-widening knowledge of the evolution of DNA sequences. But what do we really know about the evolution of protein structure? Until recently, the answer was: not much. The first detailed structures were determined 27 years ago [1, 2], and 14 years ago, the database of atomic-resolution protein structures contained just 312 structures (PDB [3]). Since then, due to advances in determination methods [4], the PDB has grown exponentially; presently it holds 5000 entries. A parallel development has occurred in methods for aligning of protein structures [5-16]. Applying these automatic methods to the current PDB, we can now begin to analyse the evolution of protein structure.

Stability of structures with respect to sequence changes enable evolutionary drift. It has long been accepted that each protein sequence folds into a unique 3D structure, and that proteins with similar sequences have similar structures [17]. But exactly how similar do two sequences have to be to have similar structures? In other words, how large is the sequence attractor of a protein structure (i.e., the region in sequence space which maps onto the same fold)? The answer is: surprisingly large: essentially all protein pairs of known structure with more than 30 out of 100 residues pairwise identical have similar structures[18]. This high robustness of structures with respect to residue exchanges explains partly the robustness of organisms with respect to gene-replication errors, and it allows much scope for variety in evolution. In recent years, many examples of protein pairs have been uncovered which have similar structures at even lower levels of pairwise sequence identity. At first, this was a surprise [19, 20]. However, we now start to realise that low levels of sequence identity between similar structures is not the exception.

Here, a [21] of protein structure evolution is detailed, and extended. This analysis is based on all pairs of proteins in the PDB with similar three-dimensional structures. For each pair, the structures were aligned, and the sequence identity (pairwise identical residues) in the aligned regions was measured. To minimise bias in regions of higher identity, distributions of pairwise sequence identity were compiled for four entire genomes representing all terrestrial kingdoms: haemophilus influenzae (prokaryote), yeast (eukaryote), mycoplasma genitalium (prokaryote), and methanococcus jannaschii (archea).


Results

Expected distributions for convergent and divergent evolution. A priori, we might suppose that divergent evolution of sequences from the same ancestor would give rise to a distribution of sequence identity scores with a peak value, D, at some probably low value (e.g. D < 30%), and a smooth, relatively flat tail for high values (Fig. 1). In the case of convergent evolution, where two unrelated sequences evolve to the same structure, we would expect a sharp Gaussian distribution with a peak value, C, at very low identity, e.g. C < 10% (Fig. 1).


Fig. 1

fig1.gif

Hypothetical distribution of pairwise sequence identity for two evolutionary events: (1) protein pairs that converged from a different ancestor to similar structures (grey line; peak at C), and (2) proteins that diverged from a common ancestor maintaining a similar structure (black line; peak at D). The dashed line indicates that it is a priori not clear which relation to expect between the divergent peak and its tail at high levels of sequence identity.


Observed distribution for structural homologues: one peak around 8-9%. The distribution of sequence identity scores for structural homologues (Fig. 2) has three distinct regions: (1) a large, approximately Gaussian, peak centred at 8-9%; (2) many smaller, sharp peaks between 15-95%; and (3) a large peak near 100%. The last peak may arise from engineered mutants to facilitate structure determination. The second region can be explained by 'incoherent noise' peaks arising from uneven sampling and the still relatively small size of the current PDB (see below). The peak in the first region seemed to be incompatible with the hypothesis that convergent and divergent evolution yielded two different Gaussian distributions (around C and D ); the observed peak occurs at very low average identity (8-9%), and is remarkably symmetric (Fig. 2, left panel). The peak is also very similar to the distribution of random sequence identities with a peak value, R , at 5.6% (Fig. 2, left panel). Qualitatively similar results were obtained when using sequence similarity instead of sequence identity [21].


Fig. 2

fig2.gif

Distribution of pairwise sequence identity for structural alignments (open circles, black line) and random alignments (left panel, only; crosses, grey line; see Materials). The average sequence identity of all remote structural homologues (< 25% pairwise sequence identity, left panel) was about 8.5% (standard deviation 5%). The dashed line is a Gaussian envelope (left panel) fitted to the observed distribution. The average sequence identity of random alignments was about 5.6% (standard deviation 3%).


Have divergent and convergent evolution reached a similar equilibrium? Three scenarios could have generated the observed distribution for similar structures with vanishing pairwise sequence identity (<15%) as a superposition of two separate events (Fig. 2). (1) Divergent evolution has not reached down to very low levels of sequence identity; the observed distribution for remote homologues is entirely dominated by pairs that converged from different ancestors to adopt similar structures. (2) Convergent evolution is negligible; all observed pairs have originated from divergence to very low levels of sequence identity. (3) Divergent and convergent evolution have reached similar equilibrium distributions.

Divergent evolution not under-represented for remote structural homologues. The under-representation of pairs that have diverged from a common ancestor may have been caused by the particular definition of structural similarity (the more 'relaxed' the more likely even functionally unrelated proteins could be deemed 'structurally similar'). However, various different criteria yielded qualitatively the same result: a single Gaussian distribution peaking around 8-9% sequence identity explained the observed data best (Fig. 3).


Fig. 3

fig3.gif

Distribution of pairwise sequence identity for remote structural homologues using different thresholds for structural similarity. Different thresholds were generated by excluding all pairs in the FSSP database [23] below a given threshold in the DALI z-score (data given for zDali < 2, 3, and 6) [24]; and for which the alignment covered a too small percentage of the aligned preoteins (data given for R < 0.2, 0.6). Note: the logarithmic scale of the vertical axis, in lagarithmic scale a Gaussian becomes a parabola. By tuning the threshold, rather different data sets were generated. At the most stringent cut-off value (Z<6), too few examples for statistical analysis were found in the PDB. However, the plots suggest that the details of the shape of the function displayed in Fig. 2 were independent of the particular choice of the cut-off for structural similarity.


Similar results for alignment inter- and intra-kingdom. The observation of one peak at around 8-10% sequence identity (Fig. 2) may have arose from alignments between proteins from the same terrestrial kingdom (e.g. prokaryotes with prokaryotes, or eukaryotes with eukaryotes) whereas the several peaks above 30% sequence identity may have resulted from alignments between proteins from the different terrestrial kingdom (e.g. prokaryotes with eukaryotes). Compiling the distributions separately for all prokaryote-prokaryote, for all eukaryote-eukaryote, and for all eukaryote-prokaryote alignments did not confirm this suspicion. Instead, for all inter- and intra-kingdom alignments the distributions appeared qualitatively similar to the one for all protein pairs (Fig. 4).


Fig. 4

fig4.gif

Distribution of pairwise sequence identity for structural homologues belonging to the same kingdom; shown are the results for all prokaryotes against all prokaryotes (solid black line), for all eukaryotes against all eukaryotes (dashed black line), and for all eukaryotes against all prokaryotes (solid grey line). The lower left and the upper right chart differ in the subsets of PDB considered (lower left chart: starting from the sequence unique subset of PDB, Fig. 6; upper right chart: starting from the structure unique subset of PDB, Fig. 6). The bars indicate that the absolute values of the distribution below and above 25% sequence identity are not comparable! To minimise the effect of database bias below 25% the start set was aligned against a subset of PDB (in which no two proteins had more than 25% pairwise sequence identity), whereas above 25% the start set was aligned against the entire PDB (Fig. 6).


Most close structural homologues have less than 45% pairwise sequence identity. Sequence alignments of all proteins from the four entire genomes against SWISS-PROT, and TREMBL [22] yielded several clear results. (1) The coherent peak near 100% (Fig. 2, right panel) is not present (Fig. 5) for any of the genomes. (2) The various smaller peaks between 40 and 80% in the distribution of structural alignments (Fig. 2, right panel) are not coherently observed in the four genomes (Fig. 5). (3) The majority of close structural homologues (>30% pairwise sequence identity) for all four genomes had 30-42% pairwise sequence identity (data not shown).


Fig. 5

fig5.gif

Distribution of pairwise sequence identity for all close structural homologues between the four entire genomes and the SWISS-PROT database. The results from the analysis of structural pairs are also plotted. Counts are normalised to percentages by the sum over all pairs for each genome (resp. the structural pairs). Values between the structural alignments (PDB) and the sequence alignments (genomes) were not strictly comparable. However, the main trend was clear: the peaks in the PDB distribution around 80 and 100% sequence identity arise from bias in the selection of structures in the PDB.



Discussion

How stable is protein structure with respect to sequence changes? Most pairs of similar structures have sequence identity as low as expected from randomly related sequences (Fig. 2, left panel). This does not imply that sequence changes were random but that to us - as observers of the record of evolutionary history - the sequence variations look random.

How many 'anchor' residue define a structure? The average percentage 'anchor' residues, i.e., residues that are crucial for protein structure and function, is not given by the average of the observed distribution. Instead, the relevant number is the difference between the peak observed for structural homologues (O = 8-%, Fig. 2), and the peak for random alignments (R). Thus, on average only three to four percent of the residues anchor a protein structure (O - R , Fig. 2).

Did evolution have enough time to reach an equilibrium? Most remote structural homologues have less than 15% pairwise sequence identity (Fig. 2), and most close structural homologues have less than 45% pairwise sequence identity (Fig. 3). This implies that the rate of creation of new structures is much slower than the drift towards the mean sequence identity (D ). Furthermore, the symmetric shape of the distribution at low sequence identity suggests that four billion years of evolution was sufficient to reach an equilibrium between these two processes.

Can we distinguish between convergent and divergent evolution? Naïvely, we may have supposed that the level of pairwise sequence identity for remote homologues can be used to distinguish between convergent and divergent evolution. However, the results presented here suggest that convergent, and divergent evolution may have quite similar equilibrium states (difference between divergent and convergent average, D - C quite small, Fig. 2), and hence, in the remote homology region (<15%) it is difficult to distinguish between the two effects.

How will the distribution look for all globular proteins? Assuming that the three terrestrial kingdoms (eukaryotes, prokaryotes, archaeans) would result in separated clusters, the observed distributions could also be attributed to such clusters. The relation between the major peak around 8-10% sequence identity, and the minor peaks around 60, and 80% (Fig. 2) would then reflect merely the distribution of organisms that are the sources for the protein sequences in the PDB. Furthermore, the conclusion suggested by Fig. 2 that most structural homologues have less than 15% pairwise sequence identity may also be a result of the particular distribution of organisms in PDB. However, restricting the distributions to alignments the same kingdom, and to alignments between two different kingdoms did qualitatively not alter the distributions (Fig. 4). The fact that very different data sets produced similar results may suggest that the distribution for all globular proteins would look similar to the one observed today. However, the data sets are still too small to allow drawing firm conclusions.

Trivial or surprise? In presenting this analysis at the Copenhagen meeting, and elsewhere, most experts have expressed surprise at the low value of the average pairwise sequence identity. Clearly, then, the distributions shown in Fig. 3 contain an important lesson in advancing our understanding of the evolution of proteins.


Materials and methods

Compiling pairwise sequence identity. The basic score compiled was the level of pairwise sequence identity, i.e., the percentage of residues identical between two aligned proteins. Structure alignments were taken from the FSSP database of structure alignments [23]. The method that produced these alignments (DALI) attempts to superimpose two structures according to their similarity in the pattern of inter-residue contacts [24]. Thus, the feature analysed here (pairwise sequence identity) has NOT been used by the structural alignment algorithm.

Two regions of pairwise sequence identity: close and remote structural homologues. The level of pairwise sequence identity for which two naturally evolved proteins are guaranteed to have similar structures [17], depends on the alignment length [18]. A 'twilight zone' [25] distinguishes the region in which sequence identity implies structure similarity, and the region for which many proteins have different structures. Schneider and Sander [18] defined a length-dependent cut-off line which was used in this analysis to separate: (1) the region of close structural homologues (pairwise sequence identity > 25%), and (2) the region of remote structural homologues (pairwise sequence identity < 25%). In the first region sequence alignment methods produce accurate alignments; in the second region, reliable alignments have to be based on knowledge about structure.

Avoiding bias by populated folds through selection of data set. First, the largest subset of unique structures was selected (272 proteins for which 148 had at least one remote homologue; Fig. 6; compiled from [23]). Second, the largest subset of sequence unique structures was selected (849 proteins; Fig. 6; compiled from [23]). To further reduce possible bias, the unique structures were aligned against the set of unique sequences, only (instead of against the entire PDB; Fig. 6). The distributions of levels of pairwise sequence identity above 25% were generated by aligning all proteins in the 'structure unique' set against all proteins in PDB (note that by definition of the set of sequence unique structures there is no pair in that set with more than 25% pairwise sequence identity; Fig. 6). To explore the effect of comparisons inter- and intra the two major terrestrial kingdoms the alignments were additionally restricted to homologues between: (1) prokaryotes and prokaryotes; (2) eukaryotes and eukaryotes; and (3) mixed pairs, i.e. one protein from prokaryotes, the other from eukaryotes. For the later, starting from the structure unique set yielded too low counts (upper right chart in Fig. 4). Therefore, the inter- and intra kingdom data was also compiled when starting from the sequence unique subset of PDB (lower left chart in Fig. 4).


Fig. 6

fig6.gif

Avoiding bias by selection of data set. The analysis presented in this paper was compiled based on the largest possible subsets of PDB that fulfilled the following criteria: (A) sequence unique: no pair of structures in that set had more than 25% pairwise sequence identity (according to structural alignment [24]); (B) structure unique: starting from the sequence unique set, a set was selected in which no pair of structures had a significant structural similarity (defined by the DALI cut-off [24]). This procedure implied the separation into two regions of pairwise sequence identity: (1) remote structural homologues (< 25% pairwise sequence identity): all proteins in the structure unique set (grey inner circle) were aligned against all proteins in the sequence unique set (white outer circle); (2) close structural homologues (> 25% pairwise sequence identity): all protein in the structure unique set (grey inner circle) were aligned against all proteins in PDB.


Exploring four entire genomes. The structure alignments yielded a rather 'noisy' distribution in the region above 40% sequence identity (Fig. 2). A way to less biased distributions has been opened by sequencing the entire genomes of (1) haemophilus influenzae (HI) [26]; (2) yeast (saccaromyces cerevisiae, YE) [27]; (3) mycoplasma genitalium (MG) [28]; and (4) methanococcus jannaschii (MJ) [29]. For each genome, distributions were generated by aligning all protein sequences against the SWISS-PROT and the TREMBL [22] database (using the multiple sequence alignment program MAXHOM [18, 30]).

Generating random background distributions. The random background distribution was generated by the following procedure. All proteins in the 'structure unique' set were randomly superimposed onto all proteins in the 'sequence unique' set. 'Randomly superimposed' refers to selecting 'alignment' begin and end in both 'aligned' proteins by generating random numbers, i.e. irrespective of the amino acids that were superimposed. A constraint was imposed while randomly selecting pairs: the pairs had to mirror the distribution of alignment length observed by the structurally aligned pairs. This particular construction of random pairs guaranteed that the random background was representative for the set of structurally aligned proteins: as well singlet frequencies (amino acid composition), as higher order correlations (di-, tri-, multi-peptide frequencies) were similar. (Therefore, the average value of 5.6% was lower, than what would have been expected from superimposition of randomly shuffled sequences.)


Acknowledgements

I am most grateful to Sean O'Donoghue (EMBL, Heidelberg) for crucial encouragement, many discussions, extremely valuable remarks, and last not least, for proof-reading. Thanks also to Chris Sander (EBI, Hinxton) for his intellectual, and financial support. Furthermore, thanks to the extremely constructive comments from one of the anonymous referees. In general thanks to all those who deposit information in public databases, and to those who carry the burden of maintaining these valuable evolutionary record.


References