PROFtmb - Per-residue and whole-proteome prediction of bacterial transmembrane beta barrels
- 1 Introduction
- 2 Availability/Web server
- 3 Download and install Software
- 4 Methods
- 5 Download supporting material
PROFtmb - a prediction method for Bacterial Transmembrane Beta Barrels
This service predicts whole-protein class (TMB/non-TMB) by providing a length-based z-score for a given protein. When tested on a representative set of known TMB and non-TMB proteins, the method detected 50% of TMBs at 80% accuracy (z-score >= 10) and 70% of TMBs at 35% accuracy (z-score >= 6).
Four-state Residue prediction.
The service also provides a four-state (up-strand, down-strand, periplasmic loop, and outer loop) per-residue prediction for the protein. On a jackknifed test set of 8 families of TMBs of known structure, our method predicted 87.2% of residues correctly.
In addition to obtaining new predictions, you may download 4-state per residue predictions for all proteins in 78 Gram-Negative bacterial genomes.
Bigelow HR, Petrey DS, Liu J, Przybylski D, Rost B. Predicting transmembrane beta-barrels in proteomes. Nucleic Acids Res. 2004 May 11;32(8):2566-77.
PROFtmb provides both whole-protein (TMB/non-TMB) and per-residue (up-strand, down-strand, periplasmic loop, extracellular loop) predictions. However, if the whole-protein score is below 0 (a value corresponding to 55% coverage and 55% accuracy), PROFtmb does not provide a per-residue 4-state prediction, since it is expected to be inaccurate. In general, a per-residue prediction is of no use if the protein turns out not to be a TMB. About half of TMBs receive low scores and are thus undetectable by statistical means.
This program can be accessed via the PredictProtein service.
Download and install Software
All source and binary distributions come with a complete set of input files necessary to do a test run. For an actual run, the user must generate a PSI-BLAST profile for the protein sequence of interest (see "How to generate an HSSP file from alignment").
Design and Training of Profile-based HMM.
We use an HMM whose parameters are trained on a set of labelled sequence profiles, and which accepts sequence profiles as input for prediction. Training uses the Baum-Welch (Expectation Maximization) algorithm, considering only valid paths when calculating the expectations at each iteration. See the Web supplement for the mathematical derivation. This idea was originally proposed by Anders Krogh (ref. below) and adapted to profile-trained, profile-fed HMMs by Martelli et al.
Clustering and Profiles.
We started with the 56 transmembrane beta barrel structures in the PDB, clustering them at an HSSP distance of 3 to obtain 11 families, 3 of which were discarded as explained in the associated PROFtmb paper. Then, we used a representative of each of these 8 families to build a PSI-BLAST profile.
We labelled each amino acid position in the profile based on its structural environment, recognizing individual latitudes along the transmembrane strands, two abundant types of hairpin on the periplasmic side (4- and 5-hairpins) plus a 'general' hairpin, and extracellular loops. There were 75 labels in total: 10 hairpin states, 1 extracellular loop state, 32 up-strand states, and 32 down-strand states. The process is depicted in this picture.
Baum-Welch Parameter Estimation.
Having specified the model architecture, we used the Baum-Welch Parameter Estimation procedure to train the model parameters. During training, the expected number of times each parameter is used to generate the training profile is calculated. The resulting expectations are normalized over all emission parameters from a given node, or all transition parameters from a given node. The model parameters are then assigned these normalized expectations. This cycle is iterated until the total probability of the profile converges within a given step size. As mentioned above, only valid paths (paths through the architecture consistent with the structure-based sequence labelling) are used to calculate the expectations.
For good introductions to these procedures see:
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Lawrence Rabiner. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286.
Bioinformatics: The Machine Learning Approach. Pierre Baldi and Søren Brunak. MIT Press, Feb. 1998.
For the details specific to valid-path ('clamped') training, see:
Hidden Markov Models for Labeled Sequences. Anders Krogh. Proceedings - International Conference on Pattern Recognition, vol. 2, 1994, pp. 140-144.
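As a rough illustration of the two ideas in this section (normalizing expected counts, and restricting the sum to valid paths), here is a minimal Python sketch on a toy discrete-emission HMM. It is not the PROFtmb implementation: PROFtmb emits profile columns rather than single symbols, and the function names, the `state_label` mapping, and the dictionary-based model layout are all assumptions made for this example.

```python
def normalize_counts(expected):
    """Re-estimation step: turn expected usage counts for the emission
    (or transition) parameters of one node into probabilities."""
    total = sum(expected.values())
    if total == 0:
        return dict(expected)  # node never used; leave counts untouched
    return {key: count / total for key, count in expected.items()}


def clamped_forward(obs, labels, states, state_label, init, trans, emit):
    """Forward probability summed only over valid paths, i.e. paths whose
    state labels agree with the given labelling at every position."""

    def allowed(state, t):
        return state_label[state] == labels[t]

    # alpha[s]: probability of the observations so far, ending in state s,
    # with every state on the path consistent with the labelling.
    alpha = {
        s: init[s] * emit[s][obs[0]] if allowed(s, 0) else 0.0
        for s in states
    }
    for t in range(1, len(obs)):
        alpha = {
            s: (sum(alpha[r] * trans[r][s] for r in states) * emit[s][obs[t]]
                if allowed(s, t) else 0.0)
            for s in states
        }
    return sum(alpha.values())
```

In a full training loop, the same clamping would be applied to the backward pass as well, and the resulting expected counts passed through `normalize_counts` once per node and per iteration.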
Per-residue 4-state Prediction.
To use the trained model to predict the labelling of a new sequence, we first generate the PSI-BLAST profile of that sequence, then input it to the model. The prediction is achieved by a process generally called decoding. In our case, decoding is done in two steps. In the first step, we calculate the probability that each position in the profile is in a beta-strand state as the sum of the probabilities of all 64 individual beta-strand states. Then, given this sequence of probabilities, we use the Viterbi algorithm combined with the original transition probabilities to find the highest-probability two-state path through the model. This path is used as the per-residue two-state prediction. See this prediction showing the true labelling.
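The two decoding steps can be sketched as follows. This is a simplified illustration, not the PROFtmb code: it assumes the per-position posterior probabilities of every state are already available (for example from a forward-backward pass), and it collapses the model to a hypothetical two-state chain ("strand" and "loop") with made-up transition probabilities, whereas PROFtmb uses the transitions of the full trained architecture.

```python
import math


def strand_probability(posteriors, strand_states):
    """Step 1: per position, sum the posterior over all strand states."""
    return [sum(column[s] for s in strand_states) for column in posteriors]


def _log(x):
    return math.log(x) if x > 0 else float("-inf")


def two_state_viterbi(p_strand, trans, init):
    """Step 2: most probable strand/loop path, scoring position t with
    p_strand[t] in state 'strand' and 1 - p_strand[t] in state 'loop'."""
    states = ("strand", "loop")

    def score(state, p):
        return _log(p if state == "strand" else 1.0 - p)

    best = {s: _log(init[s]) + score(s, p_strand[0]) for s in states}
    backpointers = []
    for p in p_strand[1:]:
        previous, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            source = max(states, key=lambda r: previous[r] + _log(trans[r][s]))
            pointers[s] = source
            best[s] = previous[source] + _log(trans[source][s]) + score(s, p)
        backpointers.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Working in log space avoids numerical underflow on long sequences, which matters for proteins of several hundred residues.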
To detect TMBs in a database of proteins of unknown structure, we start with the bits (log-odds) score, calculated as log_2(P(s|M)/P(s|B)), where P(s|M) is the full sum-over-paths probability of the model generating the sequence profile s. P(s|B) is the corresponding probability under the background model, a single Markov state whose emission probabilities equal the database amino acid composition. Since extreme bits scores seem to depend on protein length, the background bits score distribution must be characterized as a function of protein length. The Z-score adjustment we use is that described by Richard Hughey and Anders Krogh in:
Hidden Markov models for sequence analysis: extension and analysis of the basic method. Richard Hughey and Anders Krogh. Computer Applications in the Biosciences (CABIOS), vol. 12, no. 2, Apr. 1996, pp. 95-107.
In this procedure, the average length and bits score are calculated for moving windows of 500 proteins, using a threshold of 2.0 standard deviations as the cutoff for outlier removal. In this way, a series of means and standard deviations is generated covering every protein length encountered. See the z-score calibration curve for PROFtmb.
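A minimal sketch of this kind of length-dependent calibration is given below. It follows the description above only loosely and several details are assumptions: the window step, repeating the outlier removal until nothing more is removed, the use of the population standard deviation, and looking up the window nearest in mean length are all choices made for this example, and every function name is hypothetical.

```python
import math
from bisect import bisect_left


def bits_score(log_p_model, log_p_background):
    """log2(P(s|M) / P(s|B)) from natural-log probabilities."""
    return (log_p_model - log_p_background) / math.log(2.0)


def _mean_sd(values):
    mean = sum(values) / len(values)
    return mean, math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def trimmed_mean_sd(values, n_sd=2.0, max_rounds=10):
    """Mean and SD after repeatedly dropping values beyond n_sd SDs."""
    kept = list(values)
    for _ in range(max_rounds):
        mean, sd = _mean_sd(kept)
        inliers = [v for v in kept if abs(v - mean) <= n_sd * sd]
        if len(inliers) == len(kept) or len(inliers) < 2:
            break
        kept = inliers
    return _mean_sd(kept)


def calibrate(lengths_and_bits, window=500, step=1, n_sd=2.0):
    """Slide a window over proteins sorted by length; return a list of
    (mean length, background mean bits, background SD bits) tuples."""
    data = sorted(lengths_and_bits)
    window = min(window, len(data))
    curve = []
    for start in range(0, len(data) - window + 1, step):
        chunk = data[start:start + window]
        mean_length = sum(length for length, _ in chunk) / window
        mean_bits, sd_bits = trimmed_mean_sd([b for _, b in chunk], n_sd)
        curve.append((mean_length, mean_bits, sd_bits))
    return curve


def z_score(length, bits, curve):
    """Z-score of a bits score against the window nearest in mean length."""
    lengths = [point[0] for point in curve]
    i = bisect_left(lengths, length)
    if i == len(lengths):
        i -= 1
    elif i > 0 and length - lengths[i - 1] <= lengths[i] - length:
        i -= 1
    _, mean_bits, sd_bits = curve[i]
    if sd_bits == 0:
        raise ValueError("background SD is zero for this length window")
    return (bits - mean_bits) / sd_bits
```

A call such as `z_score(350, 12.5, calibrate(background))` would then express a protein's bits score relative to background proteins of similar length, where `background` is a hypothetical list of `(length, bits)` pairs.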
Using the z-score, we ran PROFtmb on a non-redundant database of proteins with well-annotated subcellular location. This dataset, called SetROC, contained the following numbers of proteins:
- 13 Integral Outer Membrane
- 21 Peripheral Outer Membrane
- 106 Inner Membrane
- 197 Single Membrane
- 1455 Nonmembrane
Figure 2 shows the cluster plot using PROFtmb Z-score:
Shown here is the original cluster plot of protein length vs. Z-score. This plot reveals the overall shape of the background distribution, as well as the fact that about half of the 13 TMBs actually score moderately to very poorly.
We evaluated 4-state per-residue performance using the jack-knife (leave-one-out) procedure. To do this, a model is trained on 7 of the 8 training profiles and tested on the 8th. This is repeated for each of the 8 profiles, and the results are compiled together. We use Q2, MCC (Matthews correlation coefficient), and Sov (segment-overlap measure of prediction accuracy) to evaluate the accuracy of this compiled set of results. Such a set of jack-knifed predictions and the compiled results are shown here.
Whole-protein discrimination was evaluated on a few different datasets as described in the accompanying paper. In each test, ROCn curves were calculated using bits scores as the cutoff criterion. The positives were well-annotated TMBs (none of which had significant homology to the 8 proteins in the training set), and the negatives were well-annotated non-TMBs. Annotations were based on SWISS-PROT keywords or on manual inspection of the SUBCELLULAR LOCATION field. As can be seen in the figure, the estimate of coverage vs. accuracy varies widely depending on which evaluation set is used.
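For readers unfamiliar with the per-residue measures named above, the following sketch computes Q2 and MCC from a predicted and a true two-state labelling. The one-letter encoding ("S" for strand, "L" for everything else) and the function names are assumptions made for this example, and Sov is omitted because it requires segment-level bookkeeping.

```python
import math


def q2(predicted, true):
    """Fraction of residues whose two-state label is predicted correctly."""
    if len(predicted) != len(true):
        raise ValueError("labellings must have the same length")
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


def confusion_counts(predicted, true, positive="S"):
    """Return (tp, tn, fp, fn), treating `positive` as the strand label."""
    tp = tn = fp = fn = 0
    for p, t in zip(predicted, true):
        if p == positive and t == positive:
            tp += 1
        elif p != positive and t != positive:
            tn += 1
        elif p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0
```

In a jack-knife setting, the counts from each held-out protein would be pooled before computing the final scores, matching the "compiled together" step described above.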
Download supporting material
Whole proteome predictions (z-value >= 4.0) README
GramNeg 2056 hits from 78 Gram Negative Genomes
TypGramPos 65 hits from 14 Typical Gram Positive genomes
AtypGramPos 31 hits from 5 Atypical Gram Positive genomes (mycolata)
GenomesGramClass Gram classification for list of 92 searched genomes
Complete set of input files for PROFtmb training
Complete SWISSPROT Localizations Assigned with Meta_Annotator SwissLocExp
SetROC: The sequence-unique, low-complexity filtered, length-filtered Discrimination test set SetROC