bottom - TOC - CUBIC-papers - CUBIC - Rost group

Title: Using genetic algorithms to select most predictive protein features
Author:Andrew Kernytsky & Burkhard Rost
Quote: Proteins, 2008, vol, pages

Using genetic algorithms to select most predictive protein features

Andrew Kernytsky 1,2,*, Burkhard Rost 1 1,2,3,4,
1 Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Columbia University, 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA
4 New York Consortium On Membrane Protein Structure (NYCOMPS), Columbia University, 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA
* Corresponding author: kernytsky@rostlab.org URL http://www.rostlab.org/ 

 

Table of contents


Abstract

Many important characteristics of proteins such as biochemical activity and subcellular localization present a challenge to machine learning methods: it is often difficult to encode the appropriate input features at the residue level for the purpose of making a prediction for the entire protein. The problem is usually that the biophysics of the connection between a machine learning method's input (sequence feature) and its output (observed phenomenon to be predicted) remains unknown; in other words, we may only know that a certain protein is an enzyme (output) without knowing which region may contain the active site residues (input). The goal then becomes to dissect a protein into a vast set of sequence-derived features and to correlate those features with the desired output.We introduce a framework that begins with a set of global sequence features and then vastly expands the feature space by generically encoding the co-existence of residue-based features. It is this combination of individual features, i.e. the step from the fractions of serine and buried (input space 20+2) to the fraction of buried serine (input space 20*2) that implicitly shifts the search space from global feature inputs to features that can capture very local evidence such as a the individual residues of a catalytic triad. The vast feature space created is explored by a genetic algorithm paired with neural networks and support vector machines. We find that the genetic algorithm is critical for selecting combinations of features that are neither too general, resulting in poor performance, nor too specific, leading to overtraining. The final framework manages to effectively sample a feature space that is far too large for exhaustive enumeration. We demonstrate the power of the concept by applying it to prediction of protein enzymatic activity.

Contact: amk2002@columbia.edu

Key words: genome sequence analysis, predicting globularity, protein domains, protein function prediction, secondary structure, solvent accessibility, multiple alignments, transmembrane helices, feature selection, genetic algorithm, neural network, support vector machine.

Abbreviations used

AUCarea under the ROC curve
GAgenetic algorithm
ROC curvereceiver operating characteristics curve
Swiss-Protdatabase of well annotated protein sequences

Introduction

Biological features not always mapped to local sequence regions. Proteins that frequently find their way into test tubes have a tendency to be well characterized in that researchers know their solubilities in various media, their functions, and perhaps even the residues that give rise to their functions. For the scores of proteins that are seldom studied, we usually know neither the function nor the relevant residues and have no reasonable hypothesis as to what type of sequence motif might reveal the function. For example, while we can list some features that prevent proteins from being soluble or from crystallizing, predicting solubility or crystallizability from sequence is still an open problem [1] . This problem is related to the observation that short segments of a few consecutive residues can prevent a protein from crystallizing [2] , while an increase in crystallizability might be a result of many residues working in concert, i.e. crystallizability might stem from the addition of small contributions from many residues spread out over the entire protein.

Many approaches infer function through homology-transfer, i.e. they copy (transfer) the experimental annotation of a homologous protein [3, 4] . In contrast, de novo methods predict function without requiring experimental annotations for homologues. De novo function prediction often faces the same type of problem that stymies efforts to predict crystallizability. Simply put: we do not quite know where to look; all we can do is attempt to correlate sequence features with aspects of function. For instance, native subcellular localization has been correlated with overall amino acid composition [5] . The rationale for this observation is that different cellular environments call for different biophysical properties of the proteins native to these environments. Unsurprisingly, this correlation is therefore the most relevant on the protein surface [6] . However, for de novo predictions, using the overall amino acid composition yields predictions that are actually very good with respect to the methods available a decade ago [7] . Combining machine learning methods such as support vector machines (SVMs) and probabilistic methods such as hidden Markov models has led to some success in detecting combinations of amino acid residues that are predictive of crystallizability [8] . Further, the evolutionary rate at an amino acid site is correlated with functional or structural importance and can be identified using Bayesian methods therefore focusing attention on functionally relevant residues [9] .

Local sequence motifs versus global features. There are at least two possible solutions to the problem of predicting characteristics of biological molecules when we do not know what gives rise to the characteristics. The first is to identify sequence motifs, the second to correlate the aspect that we want to predict with a set of global features – those calculated by taking a percent composition of some state over the entire sequence, e.g. percentages of alanine composition or buried residue composition. Sequence motifs derived from experiments such as those in PROSITE [10] or nuclear localization signals [11, 12] are extremely powerful means of extending homology-transfer far beyond actual homology, i.e. certifiable evolutionary connections. However, collections of in silico derived motifs or sliding window-based searches for functionally important regions usually fail to provide enough information to predict particular biological features. For instance, such local regions or motifs do not suffice to identify all proteins with enzymatic activity in the human proteome (collection of all proteins in homo sapiens) [13] . Therefore, computational biologists often have no other way of overcoming a particular prediction challenge than to identify global sequence features (e.g. overall amino acid composition) that somehow correlate with a certain biological feature (e.g. localization).

Early on, many groups correlated macroscopic features to global sequence features [14, 15, 16, 17, 18, 19] . The Brunak group pioneered the idea of using trial-and-error to identify a set of more advanced global features that are sequence-derived and correlate weakly with the feature to be predicted [13, 20, 21, 22] . However, if the number of data points (e.g. number of known enzymes) is orders of magnitude smaller than the number of possible input feature combinations, these methods will not be able to effectively explore the input feature space. If we had enough observations and if we knew all relevant sequence features, we could supposedly solve any prediction task by feeding these two into supervised learning devices. In a realistic situation, both conditions are not met: we have too few experimental examples and we may scarcely be able to see and explore more than the tip of the iceberg of all relevant features. In this situation, it is important to somehow identify a subset of the most relevant features by a combination of intuition and trial-and-error [13, 20] . What if we lack good intuition outright, do not have the time to exhaustively apply it to a wide range of classes requiring prediction (e.g. many functions requiring distinct features), or if our intuition suggests trial-and-error for a number of features that exceeds by far what the experimental data supports – that is, features that are so specific that they occur once or even never in the training set (over-training)?

Here, we describe a novel methodological framework that allows the exploration of a vast feature space that is not matched by the amount of experimental data. Through a combination of genetic algorithms (GA) with supervised machine learning algorithms, namely neural networks, this new methodology enabled us to discover the few dozen most informative features in an ocean of many tens of thousands of sampled features. Our main goal was the development of a framework that could be applied to a variety of tasks. In the current work, however, we illustrated the success of the concept by solving one particular prediction task, namely the prediction of the enzymatic activity for a protein sequence of unknown function.

Methods

Test problem. The method we present for sampling the space of sequence features was initially created to predict the enzymatic activity of a protein given only its sequence. The motivation for such a project stems from the fact that there is presently a very large set of protein sequences whose function is not known [23] . We set out to predict whether a protein had enzymatic activity and if so, what that activity was in terms of the Enzyme Commission (EC) classification hierarchy [24] . This system classifies enzymes according to the enzymatic reaction they catalyze by specifying a four digit EC number wherein each successive digit gives a more specific classification level of the reaction type (digits 1-3) and substrate (4th digit).

Data set. We created a sequence unique subset of all protein sequences in Swiss-Prot [25] release 42 with sequence redundancy reduced to HSSP-value 0 – i.e., for alignments of 250 residues, no pair of proteins with > 22% pairwise sequence identity was allowed [26, 27, 28, 29] . In eliminating proteins above this threshold, the algorithm made no preference for enzyme or non-enzyme proteins. Thus we have created the largest possible subset of Swiss-Prot proteins such that we cannot infer structural similarity – let alone enzymatic activity [30] – between any pair of proteins from sequence alone.

The EC annotations [24, 31] that are cross-referenced in Swiss-Prot were used to describe the proteins' enzymatic activity; proteins with experimental annotations but without EC numbers were considered non-enzymes. We discarded any sequences that had the keywords hypothetical, probable, putative, fragment or polypeptide in the Swiss-Prot description field (DE) or had a first or last residue marked as NONTERM (denotes residue is a non-terminal residue) in the features entry field (FE). The final dataset consisted of 6843 non-enzymes and 1795 enzymes with over 1,000 EC classes mapped to the enzymes (Table 1). We evaluated our method on the prediction of each of the 59 mapped EC classes that contain 18 or more sequence-unique proteins (Table 1 sum of last column) by training and evaluating the system on each enzyme class independently.



Table 1
Table 1: Dataset composition. *

EC level

Proteins

EC classes

EC classes used in training

None

6843

1

1

1 (x)

1795

6

6

2 (x.x)

1747

56

21

3 (x.x.x)

1688

167

24

4 (x.x.x.x)

1429

846

7

EC level – number of EC digits annotated (None = non-enzyme), Proteins – proteins in dataset with at least the given EC level annotation (excepting the non-enzyme level, successive annotation levels are subsets of the previous levels, i.e. having level 2 annotation implies having level 1), EC classes – number of EC classes found in the dataset with the exact EC level (successive levels are not subclasses), EC classes used in training – number of EC classes found in the dataset in more than 18 proteins with the exact EC level.



Cross-validation. For each type of enzymatic activity, the positive set consisted of all sequence-unique proteins with that enzymatic activity or a more specific activity while the negative set consisted of all other sequence-unique proteins (enzyme and non-enzyme). That is, for the prediction of enzymatic activity with EC code 3.1, the positive set includes EC 3.1 as well as all of its children nodes: 3.1.1, 3.1.21, 3.1.21.4, etc. Both the positive and negative sets were randomly divided into three sub-groups for cross-validation (Table 2). The testing set is never seen by any algorithm except for the purpose of providing a final performance measure. The fitness used by the genetic algorithm (GA fitness) to assess the usefulness of each combination of features is thus calculated on a set that is independent of this final result.



Table 2
Table 2: Role of three sub-divisions created in eachdataset set.

Dataset Name

Genetic algorithm (GA) Neural network Support vector machine (SVM)

Training

Not used Train predictor Train predictor

Cross-training

Determines fitness of input set (solution) Detect stopping point for training Optimize SVM parameters

Testing

Not used Report final result Report final result



Identification of initial group of relevant features. We began by encoding sequence features that we suspected would be useful for the prediction enzymatic activity. With the exception of sequence length, the most useful ones could typically be predicted or calculated from sequence on a per-residue basis including secondary structure, solvent accessibility, and transmembrane helix propensity as predicted by PROF/PHD [32, 33, 34] , functional residue prediction [35] , as well as a score of residue conservation in a multiple sequence alignment (Evolutionary Information, below) based on ClustalX [36] . We then cast these into global features – percentage composition of a particular amino acid or secondary structure state across an entire sequence – and used these features as input to machine learning devices (below). The success of our subsequent feature deducing method was judged based on whether it could improve considerably over the initial group of relevant features.

Evolutionary Information. Multiple sequence alignments were generated by performing a PSI-BLAST [37] search against a database of the sequences in Swiss-Prot [25] , TrEMBL [25] , and PDB [38] . The resulting sequence profile was filtered [28, 39] to only include hits below a PSI-BLAST e-value of 10-3. The profiles were further filtered to remove sequences that were very similar to each other by removing any protein with sequence identity >80% to a previously added sequence. This is done to reduce the number of nearly identical sequences and to obtain evolutionary information from more distant sequences.

Intersection features: combinations of simple components. Rather than calculating isolated, simple global percentages for a protein feature, e.g. the percentage glycine or the percentage helix in the sequence, we can combine these two features by calculating their intersection, i.e. the percentage of the protein that is both a particular amino acid and in a particular secondary structure state (Fig. 1). These intersection features allow the learning method to detect relevant combinations of states at a particular amino acid residue.

A formal definition of intersection features is left to the supplementary materials section Intersection features and levels defined. However, the important points are that: (i) any feature's intersection level equals the number of different global features that are intersected to create the intersection feature, (ii) maximal intersection features are formed by the intersection of all of the global features (Fig. 1 C, line 6), and (iii) minimal intersection features result from intersecting two global features (Fig. 1 C, line 3).

Fig. 1
fig1.gif

  

Fig. 1: Intersection features and feature encoding. (A) Creation of intersection features by calculating the percent of the sequence positions that have a particular feature combination across one, two or all three lines/tuples: (from top to bottom) 20% of residues are glycine, 10% are glycine predicted to be buried, 5% are glycines predicted to be buried and to be in an helix (B) Legend of abbreviations used for feature classes and number of features in each (C) Columns: Feature Class Combination – short format for describing a set of feature classes (see Feature Class Legend for abbreviation meanings), Description of features – full description of the set of feature classes, Features – enumeration of some or all of the individual features encoded by the solution, Calculation – calculation of number of features encoded by solution (next column), Number of Features – total number of features produced by the solution.



Combinatorial expansion of feature space. For the purpose of simplifying this study and to show its power on even a limited set of input features, we have chosen to use only a subset of all the feature classes that we have used in practice (and representing some of the features most commonly used in prediction methods): amino acid residue, secondary structure state, solvent accessibility, transmembrane helix propensity, and conservation. We use the term feature class to refer to these groups of features ( Fig. 1 C, Feature Class Combination column) while feature refers to a single feature (Fig. 1C, Features column). All combinations of global and intersected features produced 2,147,483,647 possible combinations given only these 5 feature classes (eqn. 1)
  (Eq. 1)

where c is the number of global feature classes (i.e. amino acid residue, secondary structure), x the number of feature classes including the original global features and all intersected features generated, and t the total number of combinations of all x (intersection and global) feature classes. Fig. 2 shows an example with two global features (c = 2) – amino acid residue and secondary structure type – that give rise to a total of three global and intersection features (x = 3) and seven combinations (t = 7) of the intersection and global features. In practice, many more than our five feature classes may be used and the number of combinations t grows very rapidly (5 features: 2.1 x 109, 6 features: 9.2 x 1018 combinations, 7 features: 5.7 x 1038). Clearly this space is too large to explore exhaustively and therefore requires the use of a heuristic.

Discretizing input features. It has been our experience that some learning algorithms – in particular neural networks – often train more successfully when continuous input features are discretized into several "bins". This has frequently been our experience and a s a result many input features such as the predicted solvent accessibility of a residue – a floating point number – are discretized into three possible classes; for example, the levels of solvent accessibility are split into buried, intermediate, and exposed (Fig. 1 A). In the context of predicting enzymatic activity, we suspected that selecting real valued thresholds for each of the class boundaries for some features could improve performance. Consequently, we evaluated a range of thresholds for marking a residue as conserved. In terms of eqn. 1each time a feature is discretized at a different threshold it is treated as a new feature class, effectively incrementing x by one. Since there are many possible thresholds, the discretization of course expanded the intersection space even further.

Genetic algorithms (GAs). We ultimately developed a method that uses a genetic algorithm (GA) to select appropriate combinations of feature classes that are then fed into a neural network inner learning method which in our test problem predicted enzymatic activity. The method works by learning which combinations of feature classes (e.g. amino acid residues intersected with secondary structure states) are most relevant for a particular prediction target (e.g. that a protein acts as ligase). Although there are some learning systems that use a GA to evolve the connections in neural networks as well as feature selection systems for neural networks, the use of GAs to actually select the input features for a neural network (an entirely different algorithm) is not as widespread [40, 41] . We valued genetic algorithms for their ability to escape local minima, to deal with a discontinuous fitness landscape, and to traverse very large feature spaces [41] . An application of GAs for neural network feature selection that shares some similarities with our system is called GANN and is used for the detection of conserved regions in DNA [42] .

Inner learning method. The inner learning method predicts, in the test case, enzymatic activity from a given set of input features. We tested two machine learning methods: feed forward neural networks [43] and support vector machines [44] . For each protein in the dataset, we constructed a vector of input features (Identification of relevant features above) that corresponded to real numbers predicted or calculated from the protein sequence. The real numbers for each input component corresponded to either a global percentage composition (e.g. percentage of glycine in the entire protein) or an intersection feature composition (e.g. percentage buried glycine). The neural network was trained using three-fold cross-validation with balanced sampling [43, 45, 46] , i.e. the neural networks saw equal numbers of positive and negative samples. Although our method ultimately used a neural network as the inner learning method, we also evaluated the use of a support vector machine (SVM) in this role. The SVM was trained with three-fold cross-validation using a radial basis function (RBF) kernel. Balanced training was accomplished by using a misclassification cost factor that adjusted for the different sizes of the positives and negatives. A search for the best RBF gamma parameter was performed each time the GA evaluated a new input feature class set (solution), while several cost (C) parameters were tested on different runs of the GA.

GA structure: "solutions" are the sets of input features. The functional unit of a GA is a candidate solution – also referred to as chromosomes or genomes – to the problem the GA is trying to solve. These candidate solutions are used to specify the input sets to the inner learning method: which feature classes to use as simple global features, and which feature classes should be intersected with which other feature classes to create intersection features (Fig. 1 C). These solutions can be thought of as recipes to construct a feature vector that is passed on to the inner learning method and used for training (Fig. 2 yellow rectangles). Note that the solutions do not encode specific combinations of features such as "alanines in helices" (%Ala-helical); instead, they encode combinations of entire feature classes such as the intersection of all amino acid types and all secondary structure states. This construction was used because it emulated the way in which we started experimenting with feature intersection. It also has the added benefit of limiting the likelihood that the GA will over-train by picking some very specific feature combinations that may only occur in a few proteins but that will not generalize well to proteins outside the training dataset. A solution may contain more than one combination of features to intersect (Fig. 1 C and Fig. 2). Sequence length is not a feature that can be used to make intersection features as it is calculated for an entire sequence and does not vary from residue to residue as does secondary structure and cannot be combined with other features into intersection features. It is thus excluded from the GA selection process (not coded in the GA solutions) and is instead used as a single input feature in all input sets.

Fig. 2
fig2.gif

  

Fig. 2 : Algorithm schematic. From left to right: For a given protein sequence, a large number of feature class combinations can theoretically be built; all seven possible feature combinations (t = 7 in eqn. 1) of only one intersection and two global feature classes shown for clarity (c = 2, x = 3 in eqn. 1). Any given solution population (vertical rectangular box) encodes several of these combinations as "existing" solutions (yellow horizontal bars). These combinations of features are created and input to the inner learning algorithm and the performance on the cross-training set becomes the solution's fitness. Genetic algorithm evolution (green) then selects solutions that are generally more fit (selection) and spawns the next generation with genetic operations (usually crossover and mutation) effecting some change in the population's solutions – i.e. changing the combinations of feature classes used.

 


GA fitness (objective) function. For the GA to evolve better input features (coded by the solutions) for the inner learning methods, the performance of the given set of input features needs to be assessed. We monitored the area under the receiver operating characteristics (ROC) curve [47] . In particular, the ROC was plotted as true positive rate (or sensitivity) = TP/(TP+FN) vs. false positive rate (or 1-specificity) = FP/(FP+TN), where TP is the true positive count, FP the false positives, TN the true negatives, and FN the false negatives.

GA learning strategy. The GA begins by creating an initial population of solutions encoding combinations of global and intersection feature classes (Fig. 2). The feature vectors encoded by these solutions are constructed and used as input to the neural network inner learning method. At this point the inner learning method is trained using three-fold cross-validation, and its performance – measured as AUC, i.e. area under the ROC curve on the cross-training set – is considered to be the fitness of that input set (Table 2). The GA then creates a next generation of input sets (solutions). Although many of the new solutions will be less fit than their parents, some may be more fit and will be more likely to survive, since they encode intersection feature classes that produce better predictions. The process is repeated for a pre-defined number of generations (generally several hundred generations).

GA learning parameters. The initial population of solutions consists of multiple copies of only two solutions: one containing all five of the global feature classes (Fig. 1 C, line 5) that creates 31 input features, and another containing the maximal intersection of all five feature classes (Fig. 1 C, line 6) that creates 360 input features. Thus only two of the 109 possible combinations (t) of the five feature classes exist as solutions in the first generation's population (Fig. 3).

New offspring solutions are created by inserting or deleting nodes in the existing solutions. Nodes can be feature classes or an intersection operator "×" (Fig. 3 Feature Class Legend) which combines two global feature classes (Fig. 1 C, line 2) into an intersection feature class (Fig. 1 C, line 3). Inserting the intersection operator thus allows a new intersection feature class to be created from two global feature classes in the parent solution (Fig. 3 solid arrow). The probability of each mutation event occurring – insertion or deletion – is set to 0.4 per node per generation. At this mutation frequency, a typical solution containing about 4-5 nodes has a good chance of having two or more nodes deleted or inserted. This has the effect of allowing significant "leaps" around the feature class space to occur. The probability of crossover of two solutions is set to 0.2 per solution per generation (Fig. 3 dashed arrow).

Selection of the solutions that continue to the next generation is done by means of a steady state process: after new offspring solutions are generated via crossover and/or mutation (insertion/deletion) of the parent solutions, the worst solutions – parents and/or offspring – are then discarded to bring the population back to the original size (Fig. 3). This type of selection ensures that the best performing solutions are not selected out of the next generation by chance and has a tendency to converge faster at the cost of losing diversity among the solutions more quickly. This contrasts with the more typical selection scheme – roulette wheel selection – where the most fit solutions have a higher chance than less fit solutions of getting to the next generation but are not guaranteed survival. We used a population size of 100 solutions with 50 potential offspring created each generation; each GA ran for 1000 generations.

Fig. 3
fig3.gif

  

Fig. 3 : Solution population evolution. Hypothetical solution populations of the first three generations of GA evolution. Fitness of each solution is shown after the È symbol. Generation 1: Initial population (n solutions 1-4) are comprised of all global or maximal intersection feature classes. New potential offspring (solutions 5-8) are created via insertion, deletion and crossover operations. Generation 2+: Only the best n solutions continue to the 2nd generation – i.e. both parents and offspring may not survive. Potential offspring are again produced as in generation 1. Mutation: Insertion of intersection operator "×" (solid arrow) creates new intersection feature class AA×sec. Crossover of two solutions (dashed arrow) creates solution with new feature classes.

 



Decidability index. The number of features prescribed in the solutions can quickly become very large when creating various intersection features. Therefore, when this number of input features was greater than 100, the feature set was reduced to the best 100 features as scored by the decidability index (DI):

  (Eq. 2)

where p1 and p2 are the mean output values of the positive and negative predictions and s1 and s2 are the corresponding standard deviations [48, 49] . Put another way, input features that are better at discriminating between the positive and negative sets (higher DI) will have a larger separation between the means of the two sets or "narrower" standard deviations. In the most extreme case, a solution encoding the maximal intersection of all five feature classes stipulates inputting 360 intersection features to the inner learning method (Fig. 1 line 6). The decidability index would select the 100 most predictive features and discard the others. The added value of the additional 260 features in this case would be extremely low (or even slightly negative because of overtraining) while the computational requirements would increase almost fourfold.

Measuring performance and significance. We determined performance by running the entire algorithm three times, rotating through each of the three roles assigned to the three random splits of the sequence unique proteins, such that each third of the proteins is the testing set once (Table 2). We reported the average of the testing set performances. We assessed the significance of improvement by repeating the above procedure five times with different random splits into three sets (15 runs total). Then we applied a one-tailed t-test (only positive changes considered) for each predicted enzyme function. The null hypothesis was rejected for all enzyme functions for levels p < 0.01.

 

 

Results and Discussion

Features selected by GA. We aggregated the results for the 59 function classes that were used in training into the four EC levels and one all-enzyme level 0 that includes all proteins in the EC hierarchy. EC levels 2 and 3 have the highest number of function classes (Table 3); fewer classes are listed in EC level 4 because although it is the most populated, virtually all of its functional classes have very few sequence-unique proteins and are excluded from training (Table 1).

When using the neural network as the inner learning method, the GA selected feature classes generally improved on the performance of the global features at each EC level, showing a significant (p < 0.01) improvement in 41 out of 59 classes (Table 3). Using all maximal intersection features (Fig. 1 C, line 6) usually fell between using global and using GA selected feature classes (Fig. 4).

Fig. 4
fig4.gif

  

Fig. 4: Enzyme class prediction. With a neural network as the inner learning algorithm, the average performance is shown for prediction of the enzyme classes in the EC hierarchy at each of four depths (1-4) as well as at level 0 which corresponds to the prediction of whether or not a protein is an enzyme (in the EC hierarchy or not). The AUC is given for all global features classes (gray line, circles), all maximal intersection features classes (dashed line, squares), and for the GA selected feature classes (black line, pluses).





Table 3
Table 3 : GA improvement by using global features. *
EC level Global features GA improvement Functions at EC level Function with significant improvement Improvement on significant functions

0

0.811

0.016

 1

 1 (100%)

0.016

1

0.776

0.059

 6

 4 (66%)

0.078

2

0.801

0.081

21

16 (76%)

0.097

3

0.803

0.098

24

17 (71%)

0.121

4

0.877

0.044

 7

 3 (43%)

0.074

* EC level – 0 corresponds to a prediction of any enzymatic activity, 1-4 correspond to enzyme functions at that depth in the EC hierarchy, Global Features – AUC performance using all available global features, GA improvement – increase in performance over global features when using the GA to select the best combinations of intersection and global feature classes, Functions at EC level– number of enzyme functions individually trained on, Functions with significant improvement – number (and percentage) of functions where GA selection imparted a significant (p < 0.01) increase, Improvement on significant functions – increase in performance for enzyme functions that had a significant improvement.



Using an SVM as the inner learning method rather than a neural network yielded similar results for all global and all maximal intersection features, but the GA selection of best feature classes was scarcely able to outperform the use of maximal intersection features (Supporting Online Materials, Using a Support Vector Machine as the Inner Learning Method and Fig. S1). We thus focused on neural networks and all further results use a neural network as the inner learning method.

Correct feature composition improves performance. Overall, the best feature combination chosen by the GA improved on using either all maximal intersection or all global features (Fig. 4). However, looking at the best possible performances with various levels of intersection properties and various numbers of global features can provide more detail about how the GA selects features (Fig. 5). The highest feature intersection level (|i| in Eqn. S5 and x axis in Fig. 5) corresponds to how fine grained an intersection feature exists in the solution (inner learning method input set), while the number of global features (y axis in Fig. 5) indicates how many inputs are more general global features. Fig. 5 shows the best case performance on the testing set by not using cross-validation in order to illustrate general trends and interesting cases.

Fig. 5
fig5.gif

  

Fig. 5 : Performance vs. feature composition. Non-white squares show the best non cross-validated testing set performance achieved by the corresponding types of GA selected features as area under the ROC curve. White squares are regions that the GA did not explore.

 



The distinction between enzymes and non-enzymes (EC 0) exhibited a common pattern: mostly global feature classes (top left of plot) and mostly high intersection level features (bottom right of plot) performed more poorly than most other points on the plot (Fig. 5 A). The GA selected features performed well (Fig. 5 A, yellow/orange) over a somewhat broad area. The situation differed for the prediction of glycosyltransferase prediction (EC 2.4, Fig. 5 B): the GA quickly moved away from the global features in the top left and extensively explored the intersection features out to features of intersection level 7 (Fig. 5 B, orange points at bottom of graph), while never visiting some of the points that the GA trained on enzyme/non-enzyme – EC 0 – visited (Fig. 5 B, white points at 1,4 and 2,3). Note that this was the expected behavior since the GA is unlikely to allow the significantly less fit solutions in the upper left to survive resulting in a rapid early extinction of these types of solutions. In contrast, the prediction of hydrolases (EC 3, Fig. 5 C) was more atypical: the GA found an island of higher fitness in the set of input features (solution) at 2,1 that was noticeably more fit than all of the surrounding solutions (Fig. 5 C).

The feature class combination solutions that correspond to the best features for EC 0 (Fig. 5 A) are combinations of one or two intersection features each comprised of some combination of accessibility, secondary structure, amino acid and conservation (Table 4).

Varying the level of feature discretization was not found to be particularly helpful since a separate training run with the discretization of the conservation feature fixed at 95 gave the same average AUC for GA selected features (Methods, Discretizing input features).



Table 4
Table 4: Best feature combinations for enzyme/non-enzymeprediction. *

5

AA acc sec htm cons-95

4

AA acc sec cons-95 AA acc acc×sec htm cons-95

3

AA sec cons-97 AA acc×sec sec cons-95 AA acc×sec×cons-94 acc sec

2

AA sec AA acc sec×cons-94 cons-83×cons-94 AA acc×sec×cons-89  cons-95

1

AA AA acc×cons-96 sec×cons-91 AA acc×sec×cons-94  acc×cons-94

0

AA×cons-94 acc×cons-94 AA×cons-82 acc×sec×cons-94

↑ Global

Inter.→

1

2

3

The table entries show the feature class combinations (GA solutions) that gave the highest performance in the testing set for a given combination of number of global features (y axis) and highest feature intersection levels (x axis); they correspond to the values in left "half" of Fig. 4A – enzyme / non-enzyme (Highest feature intersection levels 1 through 3); cons-X features are conservation features discretized at varying percentiles X of conservation throughout the entire sequence



GA learning complements neural network learning. As the inner learning method, the neural network alone (without GA feature class selection) reaches a performance ceiling for classifying many enzymatic activity classes (Fig. 4 dashed and gray lines). When all maximally encoded intersection features (Fig. 1 C, line 6) are passed to a neural network, there may as many as one hundred redundant and/or noisy inputs that can only be eliminated by slowly reducing their weight to zero via back propagation and internode weight decay. Bringing an internode weight to zero is a slow and asymptotic process that will never come to completion. The GA, however, can completely eliminate entire swaths of input features by eliminating one or more feature classes in a GA solution. With fewer noisy or marginally useful features, the neural network can often do a better job of learning the training data in a way that generalizes well. In this way, the GA is able to progressively prune away the unnecessary feature classes from the starting solutions (Fig. 1 C, lines 5 and 6) to arrive at a better performing solution than the neural network could have achieved alone.

Effect of feature selection via the decidability index. While the GA performs a large part of the feature selection work, the decidability index (DI) performs a second selection step that pares down the pool of features coded for by GA solutions to the 100 most predictive features (Methods, Decidability Index). While this was done to prevent a few very complex GA solutions coding for over 1000 input features from slowing the entire GA evolution, this selection also bestows some performance benefit to the maximally intersected and GA selected feature classes. The selection by the DI does not affect the use of all global feature classes since they code for far less than 100 input features (32 using all six global feature classes). Creating all maximally intersected features, however, creates 1440 input features with the DI discarding 1340 features that may have predictive value. It is thus possible that the DI is actually decreasing the performance of the maximally intersected features by removing these features.

Eliminating the selection of the top 100 features using the DI significantly decreases the performance of the maximally intersected features (Fig. 6 green solid line: 100 features, green dotted line: no DI, 1440 features). Examining the training of the neural network inner learning method shows that compared to using the top 100 features, disabling the feature selection via DI and using all 1440 features leads to significant more cases of overtraining wherein the classification performance increases on the training set but decreases rapidly on the cross-training and/or validation sets (data not shown). This is consistent with the hypothesis that maximal intersection features define very specific combinations of features that occur so infrequently that they lead to the method training on features that do not generalize well beyond the training data set.

Fig. 6
fig6.gif

  

Fig. 6 : Effect of decidability index on enzyme class prediction. The decidability index (eqn. 2) is used to select the top 100 features (solid lines), 400 features (dashed lines), and is turned off allowing up to all 1440 features (dotted lines). Area under the ROC curve is shown for prediction of the enzyme function classes in the EC hierarchy at each of four EC levels (1-4) as well as at depth "0" which corresponds to any enzyme function. Performance is given for all global features classes (circles), all maximal intersection features classes (triangles), and for the GA selected feature classes (squares).

 


Using the top 400 features selected by DI (Fig. 6 green dashed line) might be a compromise between selecting the top 100 features and using all 1440 features and shows performance that comes closer to that of the data using the top 100 features. That the effect of the DI is most pronounced at EC levels 3 and 4 is not surprising since these are the levels with the most specific functions and with the smallest numbers of positive training samples. Thus, they are most susceptible to overtraining and performance decreases on out of training samples as the number of features selected by the DI increases from 100 to 400 and 1440.

The performance with global feature classes (Fig. 6 blue solid line) does not change since they number fewer than the 100 selected for by the strictest DI filter. The performance of the GA selected feature classes (Fig. 6 red lines) is generally not as significantly affected by the choice of DI stringency and shows somewhat higher performance when the DI selects the top 100 features at EC levels 2 and 3 along with a decreased performance for enzyme / non-enzyme prediction (EC level 0). Considering that EC levels 2 and 3 comprise 45 out of the 59 function classes, the loss of performance for EC level 0 seems to be an acceptable trade off. The use of the DI is thus helpful in allowing the maximal intersection features to achieve adequate performance on their own and seems to have an overall beneficial effect on the GA based performance.

Prediction of enzymes competitive. Although we chose the prediction of enzymatic activity only as an example to test our framework, we obviously have to compare the concept in light of state-of-the-art methods for the prediction of enzymatic activity. ProtFun is a popular method that predicts EC numbers from sequence [13, 20] . One problem for a head-to-head comparison between ProtFun and our method was that we did not have identical training/testing conditions available. Therefore, the following comparison of the prediction of whether a protein has any enzymatic activity – "EC 0" – has to be viewed rather cautiously. ProtFun reported an AUC of 0.796 (Lars Jensen, personal communication). We queried the ProtFun server with 1,000 proteins chosen at random from our data set. We did not use all proteins in order to avoid overloading the ProtFun server, and the performance of our method on those 1,000 was in no way outstanding. For this set ProtFun gave an AUC of 0.805. The small difference between these numbers, 0.796 on their data vs. 0.805 on our data, might indicate similarities in the data sets used by the ProtFun authors and by us. Alternatively, ProtFun may simply be good at generalizing on out-of-training-set data. ProtFun's AUC level of 0.796 was similar to our method using global features (AUC=0.811 on those 1,000 proteins), but it was lower than the performance when using the GA for feature selection (AUC=0.827). A comparison between ProtFun and our method at lower level EC class predictions was not possible because of differing definitions of the positive and negative classes used in training at these levels. However, the coarse-grained comparison provided the proof-of-principle: the GA easily performed on par with a well-designed specialized method.

With the exception of residue conservation, the features that we used were a subset of the 14 features used in ProtFun and notably excluded subcellular localization and the three post-translational modifications features that ProtFun found useful. This points to the possibility that our method is encoding a set of features via intersection that are fairly orthogonal to those used by ProtFun and most other methods while achieving comparable performance.

GA captures essence of serine protease activity. A well known example of convergent evolution is the catalytic triad found in serine proteases (or serine endopeptidases) that is composed of serine, histidine and aspartic acid (Fig. 7). Remarkably, evolution arrived at the same residue triplet with the same functionality – cleaving substrate proteins – and the same chemical mechanism completely independently in prokaryotes and eukaryotes. Because of this there is no significant structural or sequence similarity detectable between the two molecules that might give us a hint about their functional similarity. We cannot therefore use homology methods to infer the function of subtilisin from chymotrypsin and vice versa. We might then try to use a motif database to identify their common function; PROSITE sequence motifs [10] can identify the histidine and serine residues in each protein individually. (Aspartic acid has a comparatively weak motif only in subtilisin.) However, the motifs for the corresponding histidine and serine residues in subtilisin and chymotrypsin have almost no similarities (Table V); that is, chymotrypsin's histidine motif will not detect the catalytic histidine in subtilisin and vice versa. In this sense PROSITE motifs, tremendously useful as they are, only detect a local sequence fingerprint specific to some subset of serine proteases, but may not be capturing the essence of the biology at the catalytic site. The predictor trained by the GA for the serine endopeptidases EC class (3.4.21) can, however, effectively detect a wide range of serine protease activity (19 proteins are in the unbiased data set used for training) including both subtilisin and chymotrypsin and has an AUC of 0.881. By selecting appropriate combinations of feature classes, the serine protease predictor ends up focusing on what roughly amounts to proteins that have at least one glycine and one serine that are conserved and predicted to be buried (glycine's backbone amide group plays an auxiliary role in stabilizing a transition state during the cleavage reaction). In this way it is able to detect the biological reality that gives rise to serine protease activity and succeeds in a situation where traditional motifs and homology transfer are unable to detect the activity common to both chymotrypsin and subtilisin.

Fig. 7
fig7.jpg

  

Fig. 7 : Catalytic triad site in two serine proteases. Structure of chymotrypsin [52] (left) and subtilisin [53] (right) in cartoon format with beta sheets in yellow and alpha helices in blue. Catalytic triad of aspartic acid, histidine and serine (left to right) is shown in stick format with carbons green, nitrogens blue, and oxygens red. Both enzymes are serine proteases, cleaving the substrate protein. Serine initiates the enzymatic cleavage reaction via nucleophilic attack on the substrate chain's backbone carbonyl group ultimately leading to cleavage of the substrate; histidine and aspartic acid function as a charge relay network with histidine also briefly accepting a water molecule. The two molecules are examples of convergent evolution wherein prokaryotes and eukaryotes evolved the same reaction center completely independently. There is no significant structural or sequence similarity between the two. Generated with VMD [54] .

 

 



Detecting enzymatic activity in poorly characterized proteins. Although a predictor that captures the activity of both subtilisin and chymotrypsin may have some utility, we believe its true advantage lies in the detection of serine protease activity (and by extension many other EC class activities) for proteins that do not have strong PROSITE motifs. There are 98 active sub-classes of the serine endopeptidase EC class possibly corresponding to as many different substrate specifies. PROSITE includes serine active site motifs for only 9 of those sub-classes and histidine active site motifs for 6. The GA predictor and its intersection features can actually identify serine protease activity in many of the other 89 proteins that have not been extensively characterized and may not have enough examples for the creation of strong PROSITE motifs. Since the prediction of EC level 3 annotations – such as serine endopeptidase activity – has an average AUC of 0.901 (Fig. 4 ), we can expect that the predictors generated for these classes will be equally successful at identifying enzymatic activities in the many proteins that are less studied and characterized than the most popular proteins.

Not just feature selection. Although these results have so far focused on the applicability and performance of the method for enzyme function prediction, they may indicate that similar results are achievable in other applications. The method succeeds because of the particular way in which it generates and uses intersection features. The GA's selection effectively zooms into local regions or sequence motifs with significantly co-existing traits such as particular amino acids predicted to be in particular secondary structure states without having any knowledge of these motifs a priori. Unlike many traditional sequence motifs, these "motifs" neither depend on relative sequence positions (such as GxxG), nor are they conformational in the 3D space of protein structure. Instead, our framework implicitly uses intersection features that can encode rules similar to the one used to identify serine proteases (see GA captures essence of serine protease catalytic triad). Although our framework succeeds by implicitly using such relations on the full-length protein, we expect that inputting domain-like fragments would improve performance, certainly for very long proteins.

Methods that select the appropriate features to improve a learning algorithm's performance on a particular prediction task are worthwhile as is evident from the myriad examples described in the literature [50, 51, 40] . However, the more important novelty presented here is the completely generalized encoding that allows the method to capture co-existing features (or intersection features). The problem solved by this encoding rears a follow-up puzzle of selecting appropriate combinations of these intersection and global features which is solvable using a genetic algorithm. The success of the method thus stems from the appropriate combination of intersection features with automatic feature selection.

Plug-in concept for many tasks. We believe that our methodology fills an important space in the realm of learning methods for biological sequence analysis. The question of how to encode and select relevant biological features for presentation to a learning method is central to many prediction problems. This is particularly true for protein level predictions ranging the gamut from enzymatic function to solubility (and crystallizability) to subcellular localization. These are cases that call for a prediction for the protein as a whole while somehow grasping the local biology at a handful of residues. Pairing the genetic algorithm with intersection features that focus on co-occurring residue features, we can detect the most useful feature classes for at least one of these prediction problems – enzymatic activity. The range of success on the 59 different EC classes seems to indicate that the method can be successfully applied to various other protein level prediction problems.

 

Acknowledgements

Thanks to Dariusz Przybylski, Kazimierz Wrzeszczynski, and Avner Schlessinger (all Columbia) for valuable suggestions and feedback and to Gabor Halasz, Marco Punta, and Zsuzsanna Dosztanyi (Columbia) for helpful comments on the manuscript. The well-documented, open source genetic algorithm library GAlib written by Matthew Wall (MIT) was instrumental to this work. AK and BR were supported by the grant R01-LM07329-01 from the National Library of Medicine (NLM), the grant R01-GM079767 from the National Institute of General Medical Sciences (NIGMS) at the NIH, and by the grant U54-GM074958-01 to the Northeast Structural Genomics consortium (NESG) from the Protein Structure Initiative (PSI) of the National Institutes of Health (NIH). Finally, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.

References

1.Liu, J. & Rost, B. (2002). Target space forstructural genomics revisited. Bioinformatics, 18, 922-933.
2.Rupp, B. & Wang, J. (2004). Predictive models forprotein crystallization. Methods, 34,390-407.
3.Rost, B., Liu, J., Nair, R., Wrzeszczynski, K. O.& Ofran, Y. (2003). Automatic prediction of protein function. Cellularand Molecular Life Sciences, 60,2637-2650.
4.Whisstock, J. C. & Lesk, A. M. (2003). Predictionof protein function from protein sequence and structure. Quarterly Reviewsof Biophysics, 36, 307-340.
5.Nakai, K. & Kanehisa, M. (1991). Expert systemfor predicting protein localization sites in gram-negative bacteria. Proteins, 11, 95-110.
6.Andrade, M. A., O'Donoghue, S. I. & Rost, B.(1998). Adaptation of protein surfaces to subcellular location. J Mol Biol, 276, 517-25.
7.Nair, R. & Rost, B. (2005). Mimicking cellularsorting improves prediction of subcellular localization. J Mol Biol, 348, 85-100.
8.Smialowski, P., Schmidt, T., Cox, J., Kirschner, A.& Frishman, D. (2006). Will my protein crystallize? A sequence-basedpredictor. 62, 343-355.
9.Mayrose, I., Graur, D., Ben-Tal, N. & Pupko, T.(2004). Comparison of Site-Specific Rate-Inference Methods for ProteinSequences: Empirical Bayesian Methods Are Superior. Mol Biol Evol, 21, 1781-1791.
10.Falquet, L., Pagni, M., Bucher, P., Hulo, N.,Sigrist, C. J. et al. (2002). The PROSITE database, its status in 2002. NucleicAcids Research, 30, 235-238.
11.Cokol, M., Nair, R. & Rost, B. (2000). Finding nuclearlocalisation signals. EMBO Reports, 1,411-415.
12.Nair, R., Carter, P. & Rost, B. (2003). NLSdb:database of nuclear localization signals. Nucleic Acids Research, 31, 397-399.
13.Jensen, L. J., Gupta, R., Blom, N., Devos, D.,Tamames, J. et al. (2002). Prediction of human protein function frompost-translational modifications and localization features. Journal ofMolecular Biology, 319,1257-1265.
14.Blout, E. R., de LozŽ, C., Bloom, S. M. &Fasman, G. D. (1960). Dependence of the conformation of synthetic polypeptideson amino acid composition. Journal of American Chemical Society, 82, 3787-3789.
15.Davies, D. R. (1964). A correlation between aminoacid composition and protein structure. Journal of Molecular Biology, 9, 605-609.
16.von Heijne, G. (1981). Membrane proteins. The aminoacid composition of membrane-penetrating segments. Eur. J. Biochem., 120, 275-278.
17.Nishikawa, K. & Ooi, T. (1982). Correlation ofthe amino acid composition of a protein to its structural and biologicalcharacteristics. Journal of Biochemistry,91, 1821-1824.
18.Nishikawa, K., Kubota, Y. & Ooi, T. (1983).Classification of proteins into groups based on amino acid composition andother characters: I. Angular distribution. Journal of Biochemistry, 94, 981-995.
19.Nakai, K. & Kanehisa, M. (1991). Expert systemfor predicting protein localization sites in gram-negative bacteria. Proteins:Structure, Function, and Genetics, 11,95-110.
20.Jensen, L. J., Gupta, R., Staerfeldt, H. H. &Brunak, S. (2003). Prediction of human protein function according to GeneOntology categories. Bioinformatics,19, 635-642.
21.Bendtsen, J. D., Nielsen, H., von Heijne, G. &Brunak, S. (2004). Improved prediction of signal peptides: SignalP 3.0. JMol Biol, 340, 783-95.
22.Eden, E. & Brunak, S. (2004). Analysis andrecognition of 5' UTR intron splice sites in human pre-mRNA. Nucleic AcidsRes, 32, 1131-42.
23.Ofran, Y., Punta, M., Schneider, R. & Rost, B.(2005). Beyond annotation transfer by homology: novel protein functionprediction methods that can assist drug discovery. Drug Discovery Today, 10, 1475-1482.
24.NC-IUBMB (1992). Recommendations of theInternational Union of Biochemistry on the Nomenclature and Classification ofEnzymes. In Enzyme Nomenclature eds.), pp. Academic Press, New York, .
25.Boeckmann, B., Bairoch, A., Apweiler, R., Blatter,M. C., Estreicher, A. et al. (2003). The SWISS-PROT protein knowledgebase andits supplement TrEMBL in 2003. Nucleic Acids Research, 31, 365-370.
26.Sander, C. & Schneider, R. (1991). Database ofhomology-derived structures and the structural meaning of sequence alignment. Proteins:Structure, Function, and Genetics, 9,56-68.
27.Hobohm, U., Scharf, M., Schneider, R. & Sander,C. (1992). Selection of representative protein data sets. Protein Sci, 1, 409-17.
28.Rost, B. (1999). Twilight zone of protein sequencealignments. Protein Engineering, 12,85-94.
29.Mika, S. & Rost, B. (2003). UniqueProt: creatingrepresentative protein sequence sets. Nucleic Acids Research, 31, 3789-3791.
30.Rost, B. (2002). Enzyme function less conserved thananticipated. Journal of Molecular Biology,318, 595-608.
31.Bairoch, A. (1999). The ENZYME data bank in 1999. NucleicAcids Res, 27, 310-1.
32.Rost, B., Sander, C. & Schneider, R. (1994). PHD- an automatic server for protein secondary structure prediction. CABIOS, 10, 53-60.
33.Rost, B. (2001). Protein secondary structureprediction continues to rise. Journal of Structural Biology, 134, 204-218.
34.Rost, B. (2005). How to use protein 1D structurepredicted by PROFphd. In The Proteomics Protocols Handbook (Walker, J. E.,eds.), pp. 875-901, Humana, Totowa NJ.
35.Capra, J. A. & Singh, M. (2007). Predictingfunctionally important residues from sequence conservation. Bioinformatics, 23, 1875-1882.
36.Thompson, J. D., Gibson, T. J., Plewniak, F.,Jeanmougin, F. & Higgins, D. G. (1997). The CLUSTAL_X windows interface:flexible strategies for multiple sequence alignment aided by quality analysistools. Nucleic Acids Res, 25,4876-82.
37.Altschul, S. F., Madden, T. L., Schaffer, A. A.,Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a newgeneration of protein database search programs. Nucleic Acids Res, 25, 3389-402.
38.Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W.F., Bourne, P. E. et al. (2002). The Protein Data Bank. Acta Crystallogr DBiol Crystallogr, 58,899-907.
39.Przybylski, D. & Rost, B. (2002). Alignmentsgrow, secondary structure prediction improves. Proteins, 46, 197-205.
40.Leray, P. & Gallinari, P. (1999). Featureselection with neural networks. Behaviormetrika, 26, 145-166.
41.Reed, R. & Marks, R. (1999). Genetic Algorithmsand Neural Networks. In Neural smithing: supervised learning in feedforwardartificial neural networks eds.), pp. MIT Press, .
42.Beiko, R. G. & Charlebois, R. L. (2005). GANN:genetic algorithm neural networks for the detection of conserved combinationsof features in DNA. BMC Bioinformatics,6, 36.
43.Rost, B. & Sander, C. (1993). Prediction ofprotein secondary structure at better than 70% accuracy. J Mol Biol, 232, 584-99.
44.Vapnik, V. N. (1995). The nature of statisticallearning theory. Springer, .
45.Rost, B. & Sander, C. (1993). Improvedprediction of protein secondary structure by use of sequence profiles andneural networks. Proc Natl Acad Sci U S A,90, 7558-62.
46.Rost, B. (1996). PHD: predicting one-dimensionalprotein structure by profile based neural networks. Methods in Enzymology, 266, 525-539.
47.Fawcett, T. (2006). An introduction to ROC analysis.Pattern Recognition Letters, 27,861-874.
48.Williams, G. O. (1996). The use of d' as adecidability index. Security Technology, 1996. 30th Annual 1996International Carnahan Conference,65-71.
49.Daugman, J. (2000). Biometric decision landscapes.University of Cambridge Computer Laboratory, TR482.
50.Setiono, R. & Liu, H. (1996). Improvingbackpropagation learning with feature selection. Applied Intelligence, 6, 129-139.
51.Kohavi, R. & John, G. H. (1997). Wrappers forfeature subset selection. Artif. Intell.,97, 273-324.
52.Blevins, R. A. & Tulinsky, A. (1985). Therefinement and the structure of the dimer of alpha-chymotrypsin at 1.67-Aresolution. J Biol Chem, 260,4264-75.
53.Takeuchi, Y., Noguchi, S., Satow, Y., Kojima, S.,Kumagai, I. et al. (1991). Molecular recognition at the active site ofsubtilisin BPN': crystallographic studies using genetically engineeredproteinaceous inhibitor SSI (Streptomyces subtilisin inhibitor). Protein Eng, 4, 501-8.
54.Humphrey, H., Dalke, A. & Schulten, K. (1996).VMD - Visual Molecular Dynamics. J. Molec. Graphics, 14, 33-38.
55.Abagyan, R. & Totrov, M. (1994). Biasedprobability Monte Carlo conformational searches and electrostatic calculationsfor peptides and proteins. Journal of Molecular Biology, 235, 983-1002.

Contact:    admin@rostlab.org Version:    Jun 30, 2008
 top - TOC - CUBIC-papers - CUBIC - Rost group