Title: Using genetic algorithms to select most predictive protein features
Author: Andrew Kernytsky & Burkhard Rost
Quote: Proteins, 2008, vol, pages

Abstract:

Many important characteristics of proteins, such as biochemical activity and subcellular localization, present a challenge to machine learning methods: it is often difficult to encode the appropriate input features at the residue level for the purpose of making a prediction for the entire protein. The problem is usually that the biophysics of the connection between a machine learning method's input (sequence feature) and its output (observed phenomenon to be predicted) remains unknown; in other words, we may only know that a certain protein is an enzyme (output) without knowing which region may contain the active site residues (input). The goal then becomes to dissect a protein into a vast set of sequence-derived features and to correlate those features with the desired output. We introduce a framework that begins with a set of global sequence features and then vastly expands the feature space by generically encoding the co-existence of residue-based features. It is this combination of individual features, i.e. the step from the separate fractions of serine residues and of buried residues (input space 20+2) to the fraction of buried serine residues (input space 20*2), that implicitly shifts the search space from global feature inputs to features that can capture very local evidence such as the individual residues of a catalytic triad. The vast feature space created is explored by a genetic algorithm paired with neural networks and support vector machines. We find that the genetic algorithm is critical for selecting combinations of features that are neither too general, resulting in poor performance, nor too specific, leading to overtraining. The final framework manages to effectively sample a feature space that is far too large for exhaustive enumeration. We demonstrate the power of the concept by applying it to the prediction of protein enzymatic activity.
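
The step from 20+2 separate fractions to 20*2 combined fractions, and the idea of searching feature subsets with a genetic algorithm, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The two-letter burial encoding ("b" for buried, "e" for exposed), the function names, and the simple genetic operators (truncation selection, one-point crossover, bit-flip mutation) are assumptions chosen for illustration. In the framework described above, the fitness of a feature subset would come from the performance of a neural network or support vector machine trained on it; here the fitness is any caller-supplied function.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BURIAL_STATES = "be"  # assumed encoding: b = buried, e = exposed


def global_features(sequence, burial):
    """Return 20 + 2 separate fractions: one per amino acid, one per burial state."""
    n = len(sequence)
    feats = {aa: sequence.count(aa) / n for aa in AMINO_ACIDS}
    feats.update({state: burial.count(state) / n for state in BURIAL_STATES})
    return feats


def combined_features(sequence, burial):
    """Return 20 * 2 co-occurrence fractions, e.g. ('S', 'b') is the fraction of buried serine."""
    n = len(sequence)
    counts = {pair: 0 for pair in product(AMINO_ACIDS, BURIAL_STATES)}
    for aa, state in zip(sequence, burial):
        counts[(aa, state)] += 1
    return {pair: c / n for pair, c in counts.items()}


def evolve_mask(fitness, n_features, pop_size=20, generations=30,
                mutation_rate=0.05, seed=0):
    """Search for a boolean feature mask that maximizes the supplied fitness function."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half of the population unchanged (elitism).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Toy protein: serine makes up half the residues and half the residues are
# buried, but only two of the six residues are buried serines.
sequence = "MSSAKS"
burial = "ebeebb"
separate = global_features(sequence, burial)
combined = combined_features(sequence, burial)
```

For the toy protein, the separate fractions for serine and for buried residues cannot say how often the two coincide, while `combined[("S", "b")]` records that directly. A mask returned by `evolve_mask` has one boolean per feature and can be used to select which of the combined features are passed to a downstream classifier.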

Contact: amk2002@columbia.edu