Loctree


NOTE: LocTree3 is a new sub-cellular localization predictor for membrane and globular proteins in life's three domains.



Introduction

LOCtree is a prediction method for the sub-cellular localization of proteins.

Prediction algorithm.

LOCtree is a novel system of support vector machines (SVMs) that predicts the subcellular localization of proteins, and the DNA-binding propensity of nuclear proteins, by incorporating a hierarchical ontology of localization classes modeled on biological processing pathways. Biological similarities are incorporated from the description of cellular components provided by the Gene Ontology (GO) Consortium. GO definitions have been simplified and tailored to the problem of protein sorting. Technically, the ontology has been implemented as a decision tree with SVMs as the nodes. LOCtree was extremely successful at learning evolutionary similarities among subcellular localization classes and was significantly more accurate than other traditional networks at predicting subcellular localization. Whenever available, LOCtree also reports predictions based on the following:

  1. Nuclear localization signals found by PredictNLS,
  2. Localization inferred using Prosite motifs and Pfam domains found in the protein, and
  3. SWISS-PROT keywords associated with a protein.

Localization is inferred in the last two cases using the entropy-based LOCkey algorithm. Additional information can be found in the LOCtree manuscript and associated PredictNLS and LOCkey publications.

Comprehensive prediction of localization.

LOCtree can predict the subcellular localization and DNA-binding propensity of non-membrane proteins in non-plant and plant eukaryotes as well as prokaryotes. LOCtree classifies eukaryotic animal proteins into one of five subcellular classes, plant proteins into one of six classes, and prokaryotic proteins into one of three classes. The novel feature of the hierarchical architecture is the ability to predict intermediate localization classes at much higher accuracy. Another source of improvement is the use of 'noisy' training data: 'noisy' predictions from LOCkey (SWISS-PROT keyword based annotations) and LOCHom (annotations using sequence homology) are used to train the hierarchical SVMs.

Accuracy of localization prediction.

LOCtree achieved a sustained level of 74% accuracy for non-plant eukaryotes, 70% for plants, and 84% for prokaryotes during six-fold cross-validation on a non-redundant data set. We rigorously benchmarked LOCtree against the best alternative methods for localization prediction. LOCtree outperformed all other methods in nearly all benchmarks. Localization assignments made by LOCtree agreed quite well with data from recent large-scale experiments. Our preliminary analysis of a few entirely sequenced organisms, namely human (Homo sapiens), yeast (Saccharomyces cerevisiae), and weed (Arabidopsis thaliana), suggested that over 35% of all non-membrane proteins are nuclear, that about 20% are retained in the cytosol, and that every fifth protein in the weed resides in the chloroplast.

Availability/Web server

This program can be accessed via the PredictProtein service.

Downloading and Installing LOCtree

Please consult the package overview page on how to get loctree. Also check out this Installation Guide.

Running LOCtree

Please see the LOCtree man page:

 man loctree

Help for LOCtree predictions

Reference

Mimicking Cellular Sorting Improves Prediction of Subcellular Localization. Rajesh Nair and Burkhard Rost. Journal of Molecular Biology, 348(1):85-100.

Predicted Subcellular localization

Eukaryotic non-plant proteins are classified by LOCtree into one of five subcellular classes (extra-cellular, organelles, nuclear, cytoplasmic or mitochondrial), while plant proteins are classified into one of six classes, with chloroplast being the sixth class. The organelles class comprises proteins sorted to one of the following subcellular compartments: ER, Golgi, Lysosome, Peroxysome or Vacuoles. Gram-positive bacteria are classified into one of two classes (cytoplasmic and extra-cellular), while Gram-negative bacteria are classified into one of three classes, with periplasm being the third class. For eukaryotic proteins predicted to be Nuclear, LOCtree additionally tries to predict whether the protein is "DNA-binding" or "Not DNA-binding". The "DNA-binding" prediction module has not been published. Additional information regarding the LOCtree algorithm can be obtained from the original JMB article.

Intermediate Localization prediction

The novel feature of LOCtree is the prediction of intermediate subcellular classes using our SVM implementation, which mimics the cellular sorting machinery. Intermediate subcellular classes are predicted at much higher accuracy and can provide useful clues for inferring the true localization of the protein. See Figure 1 for further explanation of the predicted intermediate subcellular classes.
Explaining the LOCtree hierarchical architecture.
Figure 1
The figure explains the hierarchical architecture of LOCtree. LOCtree uses a specialized architecture to predict the subcellular localization of proteins from different organisms: (a) architecture for eukaryotic non-plant proteins; (b) architecture for plant proteins; and (c) architecture for prokaryotic proteins. At each branch point a support vector machine (SVM) performs a binary classification (the protein either belongs to localization class L or does not belong to L). The hierarchical architecture has been designed to mimic the biological protein sorting mechanism as closely as possible. The branches of the tree represent intermediate stages in the sorting machinery, while the nodes represent its decision points. The leaves of the tree, drawn as rectangular boxes, represent the final localization classes for which a prediction is made. If a leaf has a depth smaller than the overall depth of the tree, it is propagated without branching for the remainder of the tree.

The different levels of SVMs in the hierarchical tree are labeled Level 0, Level 1, etc. Level 0 is the top node SVM, which discriminates between secretory pathway proteins and other intra-cellular proteins ((a) and (b)), or between proteins that remain in the cytoplasm and the rest (c). The intermediate node SVMs at Level 1 separate extra-cellular proteins from proteins sorted to the organelles, and nuclear proteins from cytoplasmic proteins ((a) and (b)). For the prokaryotic architecture (c), Level 1 is the terminal level for Gram-negative bacteria and separates extra-cellular proteins from periplasmic proteins; Level 1 also contains the cytoplasmic leaf, which is propagated without branching from Level 0. For Gram-positive bacteria, Level 0 is the terminal level and separates cytoplasmic proteins from extra-cellular proteins (non-cytoplasmic branch). Level 2 is the terminal level for the eukaryotic non-plant architecture (a) and sorts proteins into one of five subcellular classes (mitochondria and cytosol plus the three leaves from Level 1), while Level 3 is the terminal level for the plant architecture (b) and separates proteins into one of six classes (mitochondria and chloroplast plus the four leaves from Level 2). The prediction accuracy of the parent nodes is higher than that of the child nodes, leading to a significantly improved prediction accuracy for the intermediate localization states.

Abbreviations: EXT, extra-cellular; NUC, nucleus; CYT, cytosol; MIT, mitochondria; CHLORO, chloroplast; RIP, periplasm; and ORG, organelle. Organelles are the endoplasmic reticulum, Golgi apparatus, peroxysomes, lysosomes, and vacuolar compartments.
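
The traversal described in the caption can be sketched in a few lines of Python. This is only an illustration of the idea for architecture (a), not the LOCtree implementation: the Node class, the predict function, the feature names and the toy scorers are all hypothetical, and each scorer simply reads a precomputed value where LOCtree would evaluate a trained SVM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Features = Dict[str, float]

@dataclass
class Node:
    """One decision point of the tree; in LOCtree this would be a trained SVM."""
    scorer: Callable[[Features], float]  # hypothetical stand-in for an SVM decision value
    labels: Tuple[str, str]              # class names for (score >= 0, score < 0)
    children: Tuple[Union["Node", str], Union["Node", str]]  # sub-tree or leaf label

def predict(root: Node, features: Features) -> Tuple[str, List[Tuple[str, float]]]:
    """Walk from Level 0 down to a leaf, recording every intermediate decision."""
    path: List[Tuple[str, float]] = []
    node: Union[Node, str] = root
    while isinstance(node, Node):
        score = node.scorer(features)
        branch = 0 if score >= 0 else 1
        path.append((node.labels[branch], abs(score)))
        node = node.children[branch]
    return node, path

# Architecture (a): eukaryotic non-plant proteins (toy scorers, assumed feature names).
cyt_vs_mit = Node(lambda f: f["cytosol"], ("Cytosol", "Mitochondria"), ("CYT", "MIT"))
nuc_vs_cyto = Node(lambda f: f["nuclear"], ("Nuclear", "Not nuclear"), ("NUC", cyt_vs_mit))
ext_vs_org = Node(lambda f: f["extracellular"], ("Extra-cellular", "Organelle"), ("EXT", "ORG"))
level0 = Node(lambda f: f["secretory"], ("Secreted", "Not secreted"), (ext_vs_org, nuc_vs_cyto))

leaf, path = predict(level0, {"secretory": -1.2, "nuclear": -0.4, "cytosol": 0.8})
print(leaf, path)
```

Because the path keeps every intermediate decision, a caller can stop at Level 0 or Level 1 when the deeper decisions are weak, which is the practical benefit of the hierarchical design.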

Reliability Index

Figure 2

Reliability index (RI) values range from 1 to 10, with 10 denoting the most confident predictions. The reliability index measures the strength of the SVM prediction. A reliability index of 10 implies that the prediction is among the top 10% strongest predictions for the predicted subcellular class, while a reliability index of 7 implies that the prediction is among the top 30%-40% strongest predictions. The performance of LOCtree (Figure 2) has been rigorously evaluated on a non-redundant test set of proteins. Predictions with a reliability index of 3 or less are borderline cases with a high chance of being wrong. For such predictions the user should try to corroborate the result using other sources.
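
One way to read the percentile description above is as a decile ranking of prediction strength within a class. The sketch below follows that reading; it is an assumption, not the published LOCtree formula, and the function names and reference scores are made up for illustration.

```python
from bisect import bisect_right

def reliability_index(score, reference_scores):
    """Map a prediction strength to RI 1..10 by its decile among reference scores.

    Assumed scheme: RI 10 for the strongest 10% of predictions, RI 9 for the
    next 10%, and so on down to RI 1.
    """
    ordered = sorted(abs(s) for s in reference_scores)
    n = len(ordered)
    # Number of reference predictions strictly stronger than this one.
    stronger = n - bisect_right(ordered, abs(score))
    return 10 - min(9, (stronger * 10) // n)

def is_borderline(ri):
    """Predictions with RI of 3 or less should be corroborated elsewhere."""
    return ri <= 3

reference = list(range(1, 101))           # hypothetical strengths for one class
print(reliability_index(100, reference))  # strongest prediction
print(reliability_index(65, reference))   # 35 predictions are stronger
```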

The curves in Figure 2 show the prediction accuracy of LOCtree for eukaryotic animal sequences. (a) Overall performance: the prediction accuracy decreases as we descend the hierarchical tree (Figure 1(a)). The Level 2 accuracy shown includes the accuracy of all Level 1 leaves, i.e. the extra-cellular, organelle and nuclear classes (Figure 1(a)), and represents the accuracy of classifying the protein into one of five subcellular classes. At 75% coverage the prediction accuracy is around 94% for Level 0, dropping to 84% for Level 1 and 77% for Level 2. The ability of the hierarchical system to predict intermediate localization states at a significantly higher accuracy is evident from the 17 percentage point difference in prediction accuracy between Level 0 and Level 2. Level 1 separates proteins into one of four subcellular classes and is about 7 percentage points more accurate than Level 2, which separates proteins into one of five classes. (b) Class-wise performance: LOCtree is best at discriminating secretory pathway proteins from all other proteins (91% accuracy at 50% coverage). Prediction of nuclear and extra-cellular proteins was only slightly less accurate (84% accuracy at 50% coverage), while performance was significantly worse for cytosolic proteins, with only 64% correctly predicted. The standard deviation in the prediction accuracy for each of the localization classes was roughly 7%.
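
An accuracy-versus-coverage point such as "94% accuracy at 75% coverage" can be computed by keeping only the most reliable fraction of predictions and measuring accuracy on that subset. The following is a minimal sketch of that calculation, assuming predictions are ranked by reliability index; the function name and the toy data are hypothetical and are not the LOCtree benchmark.

```python
def accuracy_at_coverage(predictions, coverage):
    """Accuracy over the most reliable `coverage` fraction of predictions.

    `predictions` is a list of (reliability, is_correct) pairs.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    kept = max(1, round(coverage * len(ranked)))  # always keep at least one
    correct = sum(1 for _, ok in ranked[:kept] if ok)
    return correct / kept

# Toy data: (reliability index, prediction was correct)
toy = [(10, True), (9, True), (8, True), (7, False),
       (5, True), (4, False), (2, False), (1, False)]

print(accuracy_at_coverage(toy, 0.5))  # accuracy on the 4 most reliable predictions
print(accuracy_at_coverage(toy, 1.0))  # accuracy on all predictions
```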

Interpretation of example prediction

Consider the case where the intermediate localization prediction column reads "Not Secreted,Nuclear,Not DNA-binding" and the reliability index of intermediate localization prediction column reads "9,3,4". This implies that the protein is predicted to be "Not Secreted" with RI=9, "Nuclear" with RI=3 and "Not DNA-binding" with RI=4. The predicted subcellular localization is "Not DNA-binding" with RI=4. The weakest link in this prediction is the "Nuclear" prediction, which has an RI of only 3. Thus the prediction with the second highest confidence for this protein would be in the "Non Nuclear" protein category. The final localization score is the average over all predicted scores, (9 + 3 + 4)/3 ≈ 5.3, which is 5 to the nearest integer.
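
The reading of the two columns in this example can be reproduced with a short script. This is an illustrative sketch only: the function name and the returned field names are invented here, and rounding the average to the nearest integer is an assumption about how the final score is reported.

```python
def interpret(labels_column, ri_column):
    """Pair each intermediate prediction with its RI and summarize the result."""
    labels = [s.strip() for s in labels_column.split(",")]
    ris = [int(s) for s in ri_column.split(",")]
    if len(labels) != len(ris):
        raise ValueError("columns must have the same number of entries")
    steps = list(zip(labels, ris))
    mean_ri = sum(ris) / len(ris)
    return {
        "steps": steps,
        "final": steps[-1],                          # last step is the reported class
        "weakest": min(steps, key=lambda s: s[1]),   # least reliable decision
        "mean_ri": mean_ri,
        "score": round(mean_ri),                     # assumed rounding
    }

result = interpret("Not Secreted,Nuclear,Not DNA-binding", "9,3,4")
print(result["final"], result["weakest"], result["score"])
```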

Datasets used

Sequence unique datasets used for developing/testing LOCtree can be downloaded here.

Additional Prediction Methods

In addition to the support vector machine (SVM) based subcellular localization prediction, the LOCtree server also predicts nuclear localization signals using PredictNLS, SWISS-PROT keyword based localization using LOCkey, and localization based on the presence of PROSITE and PFAM signatures. Localization predictions from the different algorithms are reported separately, since they are based on different features with very different causes of wrong predictions. In general, PredictNLS is the most accurate, with nearly 100% accuracy, but has the lowest coverage. This is followed by PROSITE/PFAM based predictions. Next in accuracy are the SWISS-PROT keyword based predictions and LOCtree predictions. PredictNLS and PROSITE/PFAM based predictions all rely on the presence of functional motifs/signatures, though PFAM family assignments are not 100% accurate and thus constitute an additional source of error. SWISS-PROT keywords are quite often wrongly assigned to a protein, and hence the keyword based method has a lower accuracy than the previous two methods. In contrast to the previous methods, LOCtree predicts localization at 100% coverage, which leads to a lower average accuracy. However, LOCtree predictions at high reliability have an accuracy comparable to the previous methods.

LOCkey: keyword based annotations

Figure 3

LOCkey is a novel method for assigning proteins to subcellular classes based on lexical analysis of SWISS-PROT keywords. For a query protein U, SWISS-PROT keywords are assigned by first identifying the sequence homologues of U in the SWISS-PROT database. Next, all keywords for the homologues of U are extracted, merged, and assigned to U. Figure 3 describes the entropy-based algorithm used by LOCkey to infer subcellular localization.

LOCkey: information theory based classifier. The LOCkey system is a novel M-ary classifier which predicts the sub-cellular localization of a protein based on SWISS-PROT keywords. The algorithm can be divided into two steps: (1) building data sets of trusted vectors for known proteins, and (2) classifying unknown proteins.

First, a list of keywords is extracted from SWISS-PROT for all proteins with known sub-cellular localization. Most proteins have 2-5 keywords. A binary vector is generated for each protein by representing the presence of a certain keyword in the protein by 1 and its absence by 0.

Second, to infer the sub-cellular localization of an unknown protein U, all keywords for U are read from SWISS-PROT and translated into a binary keyword vector. From this original keyword vector, LOCkey generates the set of all alternative vectors obtained by flipping vector components of value 1 (presence of keyword) to 0 in all possible combinations. For example, for a protein with three keywords, there are 2^3 - 1 = 7 possible sub-vectors: 111, 110, 101, 011, 100, 010 and 001. These sub-vectors constitute all possible keyword combinations for protein U.

The keyword combination, i.e. sub-vector, that yields the best classification of U into one of ten classes of sub-cellular localization is then found. This is done by retrieving all exact matches of each sub-vector to any of the proteins in the trusted set, i.e. by finding all proteins in the trusted set that contain all the keywords present in the sub-vector. By construction, the proteins retrieved in this way may also contain keywords not found in U. The next task is to estimate the 'surprise value' of the given assignment. Toward this end, LOCkey simply compiles the number of retrieved proteins belonging to each type of sub-cellular localization. This procedure is repeated in turn for each of the sub-vectors, and localization is finally assigned to the protein by minimising an entropy-based objective function. The system accurately solves the classification problem when the number of data points (proteins) and the dimensionality of the feature space (number of keywords) are not too large. LOCkey reached a level of more than 82% accuracy in a full cross-validation test. Read the LOCkey manuscript.
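
The sub-vector search described above can be sketched as follows. This is a simplified illustration, not the LOCkey implementation: plain Shannon entropy stands in for the entropy-based objective function, ties are broken in favour of the sub-vector with more matches (an assumption), and the trusted set, keyword names and function names are made up.

```python
from collections import Counter
from itertools import combinations
from math import log2

def sub_vectors(keywords):
    """Yield every non-empty keyword combination (2^n - 1 of them)."""
    kws = sorted(keywords)
    for size in range(len(kws), 0, -1):
        for combo in combinations(kws, size):
            yield frozenset(combo)

def entropy(counts):
    """Shannon entropy (bits) of a localization count table."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def classify(keywords, trusted):
    """Return (localization, best sub-vector, counts) or None if nothing matches."""
    best = None
    for sub in sub_vectors(keywords):
        # Trusted proteins that contain all keywords of this sub-vector.
        hits = Counter(loc for kws, loc in trusted if sub <= kws)
        if not hits:
            continue
        rank = (entropy(hits), -sum(hits.values()))  # low entropy, then many matches
        if best is None or rank < best[0]:
            best = (rank, sub, hits)
    if best is None:
        return None
    _, sub, hits = best
    return hits.most_common(1)[0][0], sub, hits

trusted = [  # hypothetical trusted set: (keywords, known localization)
    (frozenset({"Hydrolase", "Signal"}), "Extra-cellular"),
    (frozenset({"Hydrolase"}), "Cytoplasm"),
    (frozenset({"Signal", "Glycoprotein"}), "Extra-cellular"),
    (frozenset({"DNA-binding", "Transcription"}), "Nuclear"),
]
query = {"Hydrolase", "Signal", "Glycoprotein"}
print(classify(query, trusted))
```

Enumerating all sub-vectors grows exponentially with the number of keywords, which fits the remark above that the approach works when the number of keywords is not too large.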

Predicted subcellular localization using LOCkey

LOCkey assigns proteins to one of 10 subcellular classes: Extra-cellular, Nuclear, Cytoplasm, Mitochondria, Chloroplast, ER, Golgi, Peroxysome, Lysosome and Vacuole. A protein is assigned to a subcellular class only if the SWISS-PROT keywords associated with this protein meet pre-specified entropy cut-off criteria.

Confidence of prediction using LOCkey

Confidence is assigned to a prediction based on the occurrence of the combination of keywords that best localizes a protein in a certain subcellular class. For example, if a protein is predicted as Nuclear with a confidence of 85%, this implies that the combination of keywords used to infer this localization was found in Nuclear proteins 85% of the time.
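
Read literally, the confidence is the share of matching annotated proteins that fall in the predicted class. A minimal sketch under that reading, with a hypothetical function name and invented counts:

```python
from collections import Counter

def lockey_confidence(matches):
    """Return the majority localization and its share among matching proteins.

    `matches` lists the known localization of every annotated protein that
    carries the winning keyword combination.
    """
    counts = Counter(matches)
    location, hits = counts.most_common(1)[0]
    return location, hits / len(matches)

# Hypothetical example: 20 annotated proteins carry the keyword combination.
matches = ["Nuclear"] * 17 + ["Cytoplasm"] * 3
location, confidence = lockey_confidence(matches)
print(location, f"{confidence:.0%}")
```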

SWISS-PROT keywords used in LOCkey

LOCkey assigns subcellular localization by assigning SWISS-PROT keywords to a protein and looking at the occurrence of these keywords in a localization annotated database of SWISS-PROT proteins. Only those SWISS-PROT keywords are used which are found to be correlated with subcellular localization based on entropy criteria.

PROSITE motif based annotations

Proteins are assigned to subcellular classes based on lexical analysis of the PROSITE and PFAM motifs or signatures found in the protein. The algorithm used to infer the subcellular class is similar to the one used by LOCkey. See Figure 3 for further explanation of the entropy-based algorithm used by LOCkey to infer subcellular localization.

Predicted subcellular localization using PROSITE and PFAM motifs

Proteins are assigned to one of 10 subcellular classes: Extra-cellular, Nuclear, Cytoplasm, Mitochondria, Chloroplast, ER, Golgi, Peroxysome, Lysosome and Vacuole. A protein is assigned to a subcellular class only if the PROSITE or PFAM signature associated with this protein meets pre-specified entropy cut-off criteria.

Confidence of prediction using PROSITE/PFAM signatures

Confidence is assigned to a prediction based on the occurrence of the combination of PROSITE or PFAM signatures that best localizes a protein in a certain subcellular class. For example, if a protein is predicted as Nuclear with a confidence of 85%, this implies that the combination of PROSITE/PFAM signatures used to infer this localization was found in Nuclear proteins 85% of the time.

PROSITE/PFAM signatures used to assign localization

This column shows the PROSITE and PFAM signatures which were used to assign subcellular class to this protein. Clicking on the links provides further information about the respective PROSITE and PFAM signatures. Only those PROSITE and PFAM signatures are used which are found to be correlated with subcellular localization based on entropy criteria.

Contact

For questions, please contact Tatyana Goldberg goldberg@rostlab.org
