
Title: Online tools for predicting integral membrane proteins
Author: Henry Bigelow and Burkhard Rost
Quote: In: M Peirce & R Wait (Eds.): Proteomic analysis of membrane proteins: methods and protocols Humana, 2007, in press

Online tools for predicting integral membrane proteins

Henry Bigelow 1,2,* & Burkhard Rost 1,2,3

1 Dept. of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA
2 Columbia University Center for Computational Biology and Bioinformatics (C2B2), 1130 St. Nicholas Avenue Rm 802, New York, NY 10032, USA
3 North East Structural Genomics Consortium (NESG), Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY 10032, USA
* Corresponding author: URL  Tel: +1-212-851-4669


We identify and describe a set of tools readily available for integral membrane protein prediction. These tools address two problems: finding potential transmembrane proteins in a pool of new sequences, and identifying their transmembrane regions. All methods involve comparing the query protein against one or more target models. In the simplest of these, the target "model" is another protein sequence, while the more elaborate methods group together the entire set of transmembrane helical or transmembrane beta barrel proteins. In general, prediction accuracy, whether in identifying new integral membrane proteins or in locating the transmembrane regions of known ones, depends strongly on how closely the query fits the model. Because of this, the best approach is an opportunistic one: submit the protein of interest to all methods and choose the results with the highest confidence scores.

Key words: membrane protein structure prediction; transmembrane helix; transmembrane beta barrel; hidden Markov model; neural network; remote homolog detection; proteome searching


Basic concept of alignments. At a basic level, all methods work by the same paradigm. The simplest of these is BLAST. BLAST aligns the query sequence with each target sequence in a database. The alignment algorithm assigns a score to each alignment of query and target using a 20 x 20 matrix of scores called a "substitution matrix". The substitution matrix quantifies how often proteins whose sequences are aligned, based on known structure, have the same or different amino acids at each position. The alignment score involves summing substitution matrix values along with scores associated with gaps. Finally, taking all these alignments, a score threshold identifies a subset as target homologs.
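The scoring step described above can be illustrated with a minimal sketch. It assumes a toy match/mismatch scheme in place of a real 20 x 20 substitution matrix and a flat per-position gap penalty; the constants `MATCH`, `MISMATCH` and `GAP` are hypothetical values chosen for illustration, and BLAST's actual matrices and gap costs differ.

```python
# Toy scores: hypothetical values, NOT a real substitution matrix.
MATCH, MISMATCH, GAP = 2, -1, -3

def substitution(a: str, b: str) -> int:
    """Stand-in for a lookup in a 20 x 20 substitution matrix."""
    return MATCH if a == b else MISMATCH

def alignment_score(aligned_query: str, aligned_target: str) -> int:
    """Sum substitution and gap scores over an existing alignment.

    Both strings must have equal length; '-' marks a gap.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for q, t in zip(aligned_query, aligned_target):
        if q == "-" or t == "-":
            score += GAP
        else:
            score += substitution(q, t)
    return score

print(alignment_score("ACD-F", "ACEGF"))
```

The sketch only scores a given alignment; finding the best-scoring alignment and applying a score threshold to call homologs are separate steps.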

Homology-transfer through alignments. Available experimental information for any of the targets can be transferred to the query (homology-transfer). For example, if one of the targets (database proteins) is experimentally known to be a transmembrane helical (TMH) protein, the homologous query is likely to also be a TMH protein. Moreover, if particular regions of a target protein are known to be TMHs, the regions in the query aligned to these regions are likely to also be TMHs. Of course, both inferences are subject to the accuracy of the alignment and the similarity between the two proteins.
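As a rough sketch of homology-transfer, the function below copies per-residue labels from an annotated target onto the query through a pairwise alignment. The function name, the use of "-" for gaps and the "?" placeholder for query residues with no aligned target residue are assumptions made for this illustration.

```python
def transfer_annotation(aligned_query: str, aligned_target: str,
                        target_labels: str) -> str:
    """Map target labels onto query residues through an alignment.

    target_labels holds one label per ungapped target residue.
    Returns one label per ungapped query residue; '?' marks query
    residues aligned to a gap in the target.
    """
    labels = []
    t_pos = 0  # index into the ungapped target
    for q, t in zip(aligned_query, aligned_target):
        if t != "-":
            label = target_labels[t_pos]
            t_pos += 1
        else:
            label = "?"
        if q != "-":
            labels.append(label)
    return "".join(labels)

# Target residues A, C, G, D carry labels T, T, T, I.
print(transfer_annotation("AC-DE", "ACGD-", "TTTI"))
```

The transferred labels are only as reliable as the alignment itself, which is the caveat stated above.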

As with all elements of living things, protein sequences originate from an evolutionary process of divergence and selection, creating a tree of proteins related in hierarchical fashion. Extending this idea to the homology search, a query protein can be compared to an entire family of related target proteins that are pre-aligned. Often, where a query might not have apparent similarity to any individual target protein in a family, it may have similarity to the target family taken as a whole. Essentially all advanced methods implement this idea.

Improved profile-based alignment methods. A well-known example of this extension is PSI-BLAST [1], which works as follows. First, the query is searched against a database of individual sequences using ordinary BLAST, resulting in a set of query-target alignments. Next, the query and set of target proteins are aligned to each other in a single multiple sequence alignment. The frequency of each amino acid in each column of the multiple sequence alignment is calculated, resulting in a set of 20-element vectors, one for each position in the original query. This statistical representation, called a position specific score matrix (PSSM), can be seen as a substitution matrix custom-designed for each position in the query protein. In subsequent rounds, the PSSM, rather than the original query, is searched against the original database of individual sequences. For statistical reasons, conserved regions tend to be more influential in scoring subsequent alignments, allowing for improved detection of more diverged sequences.
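The column-frequency step can be sketched as follows. This is a simplified illustration, not PSI-BLAST's procedure: it assumes the query is the first row of the alignment, ignores gaps, and omits the sequence weighting, pseudocounts and log-odds conversion that a real PSSM involves.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(msa: list[str]) -> list[list[float]]:
    """Return one 20-element frequency vector per query position.

    msa[0] is assumed to be the query; columns where the query has
    a gap are skipped so the profile follows query numbering.
    """
    profile = []
    for i in range(len(msa[0])):
        if msa[0][i] == "-":
            continue
        counts = Counter(seq[i] for seq in msa if seq[i] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append([counts[a] / total if total else 0.0
                        for a in AMINO_ACIDS])
    return profile

profile = column_frequencies(["ACD", "ACE", "A-D"])
print(profile[2][AMINO_ACIDS.index("D")])  # fraction of D in column 3
```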

Like PSI-BLAST, Pfam [2] uses multiple sequence alignments. There are two differences, however. First, while PSI-BLAST iteratively re-queries a database of individual sequences with a PSSM, Pfam is the inverse: it is a database of protein families, and the individual query protein is aligned against each family in the database. Second, while PSI-BLAST uses PSSMs to represent a protein family multiple sequence alignment, Pfam uses a hidden Markov model (HMM). An HMM extends the idea of position specific substitution scores to include gap insertion and deletion scores that are also position-specific. These are possible to derive from the original multiple sequence alignment by observing how many aligned proteins contain insertions or deletions relative to the query protein at each position in the query. As in PSI-BLAST or BLAST, the query protein is aligned against each HMM in the Pfam database and assigned an e-value comparable to BLAST e-value, representing the expected number of matches as good or better occurring by chance. Since HMM-based alignment methods are often more sensitive than BLAST or PSI-BLAST, they may succeed in finding a homologous family.

BLAST, PSI-BLAST and Pfam are very general methods capable of identifying sequence or family homologs of virtually any kind of protein, including specific kinds of membrane proteins. For integral membrane protein prediction however, another generalization yields further improvement.

Two major classes of transmembrane proteins: TMB and TMH. Integral membrane proteins come in two general structural classes. Transmembrane alpha helical (TMH) proteins span the plasma membrane in one or more alpha helices in alternating direction (Fig. 2).

Specific prediction methods. Methods designed to predict TMH or TMB proteins in general are built on each class taken as a group. Because of the diversity in specific structure (different numbers of transmembrane helices or strands), it is impossible to derive a single multiple sequence alignment for such a class. Instead, these methods extract features common to all TMH or all TMB proteins without the need for explicit multiple sequence alignment. Technically, this is achieved by assigning one of a set of discrete labels to each position in each sequence, based on its structure. For example, the labels T, I, and O can be assigned one per residue to each TMH sequence, identifying the transmembrane helices, inner loops, and outer loops. From the resulting set of labeled protein sequences, a general model (often also an HMM) can be derived that recognizes features common to all labeled protein sequences. Such general models can potentially detect TMH or TMB proteins that have diverged further from any sequence homolog (perhaps a member of a previously undiscovered subfamily) than sequence-alignment based methods such as PSI-BLAST and Pfam can.

A different homology approach is exemplified in the PROSITE [3] and PRINTS [4] databases. They contain a set of local sequence patterns defined by strong association with a specific protein function or structure. Because protein function and structure can be modular, some of these patterns may be found within a collection of proteins differing in overall structure. Others are very well correlated with overall structure despite their sequence-local nature. For identifying TMH or TMB proteins, several patterns prove useful (Methods). Potentially, such patterns may be conserved in a protein whose overall sequence is so diverged from any homologs as to be unidentifiable by alignment-based methods.

In general, all methods relying on alignment of proteins work optimally in aligning proteins in a specific similarity range corresponding to the range of sequences from which they are derived. In a degenerate sense, BLAST can be thought of as searching a database of "models" consisting of individual sequences. It is optimized to find close-range homologs. PSI-BLAST and Pfam build statistical models from multiple alignments of very similar sequences, and they work best to find medium-range homologs. TMH and TMB-specific methods are single statistical models built from a diverse set of TMH or TMB proteins only related by broad structural category. Thus, they are optimized to find long-range homologs.

Optimal results of each of these methods will be obtained fortuitously when the query happens to have a single sequence homolog, homology to a sequence family, or homology to a structure family. It is impossible to know in advance which if any of these will be the case. Because of this, we recommend an opportunistic approach: run all prediction methods and select those giving highest confidence scores. We provide a guide for obtaining as much relevant information about your protein as possible, and some general principles for interpreting the information.

This guide is in three parts. Firstly, we describe how to obtain a quick, comprehensive set of homology based information and possible experimental information about your protein, and how to use it to identify whether it is an integral membrane protein. Secondly, we describe those methods suitable for screening an entire set of proteins for potential TMH or TMB proteins. Thirdly, we present the methods for predicting which residues in a known or suspected transmembrane protein are in the membrane, and the overall orientation in the membrane. For quick reference, we provide a list of selected programs (Table 1) and databases (Table 2).

Table 1: Selected programs *
Method               Scope    Service     Ref.
BLAST and PSI-BLAST  general  WP          [16, 1]
PiMohtm              TMH      PR3         [18]
Phobius              TMH      PR3, SP, S  [19]
Split4               TMH      PR2         [22]
PRED-TMBB            TMB      PR3         [23, 24]
PROFtmb              TMB      PR3, S      [26]
TMB-HUNT             TMB      S           [27, 28]
SignalP              SP       SP, S       [30]
Pfam                 domain   WP          [2]
Superfamily          domain   WP          [31]
Panther              domain   WP          [32]
SMART                domain   WP          [33, 34]
PROSITE              motif    WP          [3]
PRINTS               motif    WP          [4]

* Selected programs and databases for identification and per-residue prediction of integral membrane proteins. Scope. TMH, TMB: built on a representative collection of TMH or TMB proteins. motif: built on short sequence motifs associated with particular function or structure. domain: built on medium to long sequence regions of particular structure. Service. PRn: per-residue prediction, in which all residues are assigned to one of n discrete structural states. PR2: (TM, non-TM). PR3: (TMB: TM-strand, extracellular loop, periplasmic loop; TMH: TM-helix, cytoplasmic loop, non-cytoplasmic loop). PR5: as PR3, but distinguishing non-TM portions of helical overhang on both sides. SP: signal peptide and cleavage site prediction. S: suitable for whole-proteome screening; these methods all allow multiple-sequence submission and have been evaluated for accuracy and coverage in whole protein discrimination. WP: whole protein prediction of individual proteins.

Table 2: Selected databases *
Database      Common Name / Description                                        Ref.
GO            Gene Ontology                                                    [35]
PIR           Protein Information Resource                                     [6]
PDB           Protein Data Bank                                                [36]
InterPro      Database of Protein Families, Domains and Functional Sites       [37]
SCOP          Structural Classification of Proteins                            [38, 39, 40]
InterProScan  Scanning of InterPro Database                                    [41, 42]
UniProt       Universal Protein Resource                                       [43]
OPM           Orientations of Proteins in Membranes                            [44]
PDBTM         Protein Data Bank of Transmembrane Proteins                      [45, 46]
MPtopo        Membrane Protein Topology Database                               [47]


Determining if your protein is integral to the membrane

Both TMH- or TMB-specific methods and general methods are available. The general methods are motif- and domain-based, and potentially identify the protein as one of a subtype of TMH or TMB proteins. TMH- or TMB-specific methods are designed to identify features common to all TMH (or all TMB) proteins, and do not identify subtypes. InterProScan is a portal that allows querying all the general methods at once. UniProt provides a comprehensive view of previously analyzed results on many proteins and accompanying experimental information on structure or function.

TMB-specific methods. BOMP (β-barrel outer membrane protein predictor), TMB-HUNT and PROFtmb are specially designed to identify TMB proteins in a pool. They have all been evaluated for accuracy in discriminating TMBs from background. Unfortunately, a definitive comparison is complicated by the fact that the evaluations are all done on different data sets. It is recommended that you submit your query to all three and scrutinize the results. Taking a consensus of predictors has been found consistently to yield better accuracy than relying on one individual predictor.
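One simple way to combine the three predictors is a majority vote over their yes/no calls. The sketch below assumes each predictor's output has already been reduced to a boolean by some user-chosen threshold; the source does not prescribe a particular consensus rule, so majority voting is only one possible choice.

```python
def majority_vote(calls: dict[str, bool]) -> bool:
    """Return True when more than half of the predictors say 'TMB'."""
    positives = sum(calls.values())
    return positives * 2 > len(calls)

# Hypothetical calls for a single query protein.
calls = {"BOMP": True, "TMB-HUNT": True, "PROFtmb": False}
print(majority_vote(calls))
```

Because the underlying scores are not in comparable units, the individual results still deserve the scrutiny recommended above.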

TMH-specific methods. Of the six TMH-specific methods, only TMHMM has been rigorously evaluated for accuracy in discriminating TMH proteins from others. All of these methods implicitly predict a protein to be a TMH protein when they predict one or more TM-helices. However, because the other methods have not been evaluated for accuracy in this task, we do not recommend using them to screen a pool for potential TMH proteins.

InterProScan. InterProScan submits your query to up to 13 individual predictors at once. Go to InterProScan, make sure all Applications to Run are selected, paste your sequence, and Submit. When results are returned, select Table View to see the individual scores associated with each hit. Individual scores are unfortunately not in any standard units. Though a thorough statistical comparison between different scoring systems has not been done, we will discuss this issue below.

UniProt. UniProt joins together all sequences from SWISS-PROT, TrEMBL [5] and PIR [6] (Protein Information Resource). Release 6.0 of September 2005 contains 2,299,834 sequences. Each protein is linked to a set of pre-run predictions and database annotations in an advanced searchable framework. Results of searches contain links to the original sources of prediction or annotation.

  1. Go to UniProt. Select Searches/Tools -> Blast. Select UniProtKB in the pull-down menu of databases to search. Paste your query and Submit.
  2. You will see a table with several columns. Select checkboxes of all proteins with convincing alignments. The goal is to discover homologs with associated structural or functional information. Though your query may have several very close homologs, only those with associated experimental information are useful. Therefore, it is best to be liberal here, say to E-value ≤ 1.0. If you have a particular region of interest of your protein, you can base this choice on suitable position and length of alignment overlap, shown graphically in the Alignment column. Clicking on the colored bar in that column pops up the BLAST alignment.
  3. To view complete information, you must first re-query UniProt with the IDs of your selected homologs. To do this, select Save Options -> Table, and extract the list from the resulting tab-separated file.
  4. Select Searches/Tools -> Useful Tools/Links -> Batch Retrieval (PIR). Select database UniProtKB, and define Entry IDs as UniProt KB ID, paste your list of IDs and Search.
  5. Now you will have a table of all homologs with complete information. Click Display Options and move all columns into Columns in Display, then select Apply.

Analyzing the results. Looking at your InterProScan and UniProt results, your protein has hopefully matched homologs with associated structural or functional annotation in the databases PIR, PDB, InterPro, SCOP, and Gene Ontology (GO). The alignment-based model databases of interest are Pfam, SMART, Superfamily and TigrFAM. Finally, the functional motif databases are PROSITE and PRINTS.

The goal is to identify whether your query is indeed a TMH or TMB protein. However, it is not always easy to determine how closely these features are associated with integral membrane status. Because of the diversity of integral membrane proteins, there is no simple way to tell which annotations or features definitively indicate integral membrane status. Therefore, it will be necessary to do a careful reading of the descriptions, available through the links on the InterProScan and UniProt results tables. Below, we describe some of these in more detail. For PROSITE we present our own quick analysis of particular motifs closely associated with TMH or TMB proteins.

Pfam. As discussed above, Pfam is a database of HMMs, with each HMM built from an alignment of sequence-related proteins. As of December 2005, Pfam contains 8,183 families. 94% of all known protein sequences match at least one Pfam family. Its main purpose is to identify to what family or families a protein of unknown structure or function belongs. To search Pfam, one submits the protein of interest, and just as in BLAST, it is aligned to each HMM and significant alignments are reported with an accompanying log2-odds (or "bits") and e-value scores. The Pfam E-value is comparable to a BLAST e-value. Bits is the log2 (logarithm of base 2) of the odds of the sequence being a true match. For example, a bits score of 3 means the protein is 8 times more likely to be a true match than false match (8/9, or 89% chance of being a true match). 
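The bits-to-probability arithmetic in the example above can be written out as a short sketch. It follows the interpretation given in the text (odds of true match to false match equal to 2 raised to the bits score); the function name is ours.

```python
def bits_to_probability(bits: float) -> float:
    """Convert a log2-odds (bits) score to a probability of a true match."""
    odds = 2.0 ** bits           # e.g. 3 bits -> odds of 8 to 1
    return odds / (1.0 + odds)   # 8 / 9 for 3 bits

print(round(bits_to_probability(3), 2))
```

A score of 0 bits corresponds to even odds, and each additional bit doubles the odds in favor of a true match.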

Pfam families are categorized as "family", "domain", "motif" and "repeat" and have an accompanying average length of alignment. Family and domain type families are most closely associated with structure. Families are organized into clans, one of which is "Outer membrane beta-barrel". There is no corresponding clan covering TMH associated Pfam models, but a keyword search "membrane" reveals many relevant families.

Using Pfam to determine integral membrane status is complicated by the possibility that TMH or TMB proteins may contain N- or C-terminal domains or extra-membrane loops that form domains also found in soluble proteins.

SMART. Like Pfam, SMART is a collection of HMMs built from seed alignments. SMART focuses exclusively on protein domains, and describes them as "extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues". A large fraction of SMART domains will overlap with Pfam.

Superfamily. Similar to Pfam and SMART, Superfamily is a database of HMMs built on multiply aligned protein sequences. While Pfam uses similar sequences, Superfamily groups together sequences which have no detectable homology but are structurally similar at the SCOP superfamily level of structural classification. Because these HMMs are built from more diverse groups of sequences, they may have greater power to detect TMH or TMB proteins evolutionarily more distant from known homologs than Pfam. Depending on whether your sequence has a close or remote homolog, you may see reliable hits to any or all of these HMM-based databases.

Gene Ontology (GO). According to their homepage, "The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism". It is useful in computer searches in which the existing words typically used to describe a given function are imprecise or ambiguous. Using the Gene Ontology, the European Bioinformatics Institute in a project called GOA [7, 8] (Gene Ontology Annotation) manually assigns GO terms to existing proteins in UniProt based on the literature, thus allowing comprehensive searches for proteins by GO terms.

PROSITE. PROSITE is a database of usually short sequence patterns tightly associated with function or overall protein structure. The patterns are all evaluated for their predictive power of protein class. For example, PROSITE pattern [LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-x-N-[LIVMFYC](3), called "Tyrosine protein kinases specific active-site signature", detects 97.9% of all known protein tyrosine kinases at 95.4% accuracy. Since each PROSITE pattern is designed to identify specific elements of function or structure, it would be useful to know if any of these happen to correlate well with TMH or TMB proteins (Tables 3 and 4 and below).
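To make the pattern syntax concrete, the sketch below translates a PROSITE pattern into a Python regular expression. It handles only the elements used in the example pattern (bracketed sets, {excluded} residues, x wildcards and (n) or (n,m) repeats) and does not support anchors or other PROSITE syntax. The test sequence is a made-up string constructed to satisfy the pattern, not a real kinase.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        if low:
            repeat = low if high is None else low + "," + high
            core += "{" + repeat + "}"
        parts.append(core)
    return "".join(parts)

PATTERN = "[LIVMFYC]-{A}-[HY]-x-D-[LIVMFY]-[RSTAC]-{D}-x-N-[LIVMFYC](3)"
regex = prosite_to_regex(PATTERN)
print(regex)
print(bool(re.search(regex, "LGHRDLAARNLLL")))  # synthetic sequence
```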

PROSITE motifs specific for integral membrane proteins. Since PROSITE motifs are defined by association to specific function or structure which is often local, some motifs are found in more than one overall protein structure. Many other motifs are well correlated with overall structure, and are potentially useful in identifying integral membrane proteins.

To estimate how well each motif correlates with integral membrane protein structure, we did the following. We prepared lists of TMH and TMB proteins by querying UniProt as follows. For TMBs, we queried with GO: integral to membrane AND GO: outer membrane, returning 3,464 proteins. For TMH the query was GO: integral to membrane NOT GO: outer membrane, returning 309,360 proteins. Technically, we carried out these queries by parsing the file gene_association.goa_uniprot from the GOA project, since the TMH result set exceeded the download limit.

With the lists of known TMH and TMB proteins, we counted, for each PROSITE motif, the number of TMH (or TMB) proteins containing the motif (true positives) and the number of non-TMH (or non-TMB) proteins containing the motif (false positives). We calculated the accuracy of a given motif in identifying TMH (TMB) proteins, selecting those patterns with a significant number of true positives at a given accuracy (Tables 3 and 4). This is the same procedure used for PROSITE pattern accuracies, but taken with respect to TMH or TMB proteins. Since our lists of TMH and TMB proteins will be incomplete due to missing GO annotation, the accuracies are necessarily lower bound estimates.
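The accuracy calculation amounts to TP / (TP + FP) per motif, followed by a filter. The sketch below applies it to three TP/FP pairs taken from Table 3; the thresholds passed to `select_motifs` are hypothetical, since the exact cutoffs used for the tables are not stated.

```python
def minimum_accuracy(tp: int, fp: int) -> float:
    """Lower-bound accuracy of a motif: TP / (TP + FP)."""
    return tp / (tp + fp)

def select_motifs(counts: dict[str, tuple[int, int]],
                  min_tp: int, min_accuracy: float) -> list[str]:
    """Keep motifs with at least min_tp true positives and min_accuracy."""
    return [acc for acc, (tp, fp) in counts.items()
            if tp >= min_tp and minimum_accuracy(tp, fp) >= min_accuracy]

# (TP, FP) counts copied from Table 3.
counts = {"PS51003": (1273, 5), "PS00216": (263, 46), "PS50109": (223, 97)}
for accession, (tp, fp) in counts.items():
    print(accession, f"{minimum_accuracy(tp, fp):.0%}")
print(select_motifs(counts, min_tp=200, min_accuracy=0.80))
```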

Similar analysis can be done for PRINTS but we did not perform this. Since Pfam, SMART and Superfamily domains tend to match long (>100 residues) stretches of sequence, they tend to correlate well with overall protein structure. We advise simply to read the descriptions of each domain your protein may match.

Overall recommendation. Taking all information you can gather from both homology-based models and individual sequence homology to any proteins with useful annotation, decide which annotations you trust based on the respective E-values and other scores given by the search methods. Read the descriptions carefully, keeping in mind the possibility that integral membrane proteins contain motifs also found in soluble proteins. In general, matches to short regions of your protein are less reliable indicators of overall protein structure. This includes Pfam families of the "motif" and "repeat" types. Hopefully, this first step will give a very comprehensive view of what kind of domains or motifs your protein contains and what kind of protein it is likely to be.

Table 3: TMH-specific PROSITE motifs *
TP FP Minimum Accuracy Accession PROSITE Motif Description
303 0 100% PS50928 ABC transporter integral membrane type-1 domain profile.
212 0 100% PS00236 Neurotransmitter-gated ion-channels signature.
209 0 100% PS50261 G-protein coupled receptors family 2 profile 2.
209 0 100% PS00077 Heme-copper oxidase catalytic subunit, copper B binding region signature.
1,273 5 100% PS51003 Cytochrome b/b6 C-terminal region profile.
1,326 9 99% PS51002 Cytochrome b/b6 N-terminal region profile.
265 2 99% PS50999 Cytochrome oxidase subunit II transmembrane region profile.
2,136 20 99% PS50262 G-protein coupled receptors family 1 profile.
209 2 99% PS50855 Cytochrome oxidase subunit I profile.
250 4 98% PS00232 Cadherin domain signature.
646 11 98% PS50850 Major facilitator superfamily (MFS) profile.
253 5 98% PS50268 Cadherins domain profile.
205 6 97% PS00238 Visual pigments (opsins) retinal binding site.
334 11 97% PS00154 E1-E2 ATPases phosphorylation site.
2,122 95 96% PS00237 G-protein coupled receptors family 1 signature.
274 14 95% PS50857 Cytochrome oxidase subunit II copper A binding domain profile.
271 14 95% PS00078 CO II and nitrous oxide reductase dinuclear copper centers signature.
263 46 85% PS00216 Sugar transport proteins signature 1.
256 45 85% PS50929 ABC transporter integral membrane type-1 fused domain profile.
223 97 70% PS50109 Histidine kinase domain profile.
245 113 68% PS00217 Sugar transport proteins signature 2.
305 192 61% PS50853 Fibronectin type-III domain profile.
993 754 57% PS50835 Ig-like domain profile.
461 417 53% PS01186 EGF-like domain signature 2.
209 204 51% PS00109 Tyrosine protein kinases specific active-site signature.
292 289 50% PS00290 Immunoglobulins and major histocompatibility complex proteins signature.
355 355 50% PS50026 EGF-like domain profile.
418 441 49% PS00022 EGF-like domain signature 1.
215 321 40% PS00152 ATP synthase alpha and beta subunits signature.
337 522 39% PS00142 Neutral zinc metallopeptidases, zinc-binding region signature.
317 1,301 20% PS50893 ATP-binding cassette, ABC transporter-type domain profile.
333 1,412 19% PS00211 ABC transporters family signature.
338 1,809 16% PS50011 Protein kinase domain profile.
326 1,800 15% PS00107 Protein kinases ATP-binding region signature.

* PROSITE motifs specific for TMH proteins. TP: True Positives; the number of TMH proteins containing the motif. FP: False Positives; the number of non-TMH proteins also containing the motif. Minimum Accuracy: TP / (TP + FP), a lower bound estimate of the probability that an unknown protein containing the motif in question is a TMH. It is a lower bound because some TMH proteins will be missing the appropriate GO terms, and will incorrectly be considered false positives in these lists (when in fact they should be true positives).

Table 4: TMB-specific PROSITE motifs *
TP FP Minimum Accuracy Accession PROSITE Motif Description
39 0 100% PS00576 General diffusion Gram-negative porins signature.
36 0 100% PS00558 Eukaryotic mitochondrial porin signature.
34 0 100% PS01151 Fimbrial biogenesis outer membrane usher protein signature.
10 0 100% PS00695 Enterobacterial virulence outer membrane protein signature 2.
9 0 100% PS00694 Enterobacterial virulence outer membrane protein signature 1.
46 7 87% PS01068 OmpA-like domain.
5 1 83% PS00835 Aspartyl proteases, omptin family signature 2.
5 1 83% PS00834 Aspartyl proteases, omptin family signature 1.
46 18 72% PS51123 OmpA-like domain profile.
28 46 38% PS01156 TonB-dependent receptor proteins signature 2.
10 23 30% PS00439 Acyltransferases ChoActase / COT / CPT family signature 1.
10 23 30% PS00440 Acyltransferases ChoActase / COT / CPT family signature 2.
2 16 11% PS01098 Lipolytic enzymes "G-D-S-L" family, serine active site.
28 332 8% PS00430 TonB-dependent receptor proteins signature 1.
3 42 7% PS50304 Tudor domain profile.
4 83 5% PS50255 Cytochrome b5 family, heme-binding domain profile.
4 88 4% PS00191 Cytochrome b5 family, heme-binding domain signature.
3 76 4% PS50209 CARD caspase recruitment domain profile.
43 1,359 3% PS00013 Prokaryotic membrane lipoprotein lipid attachment site.
3 242 1% PS50084 Type-1 KH domain profile.
5 467 1% PS50005 TPR repeat profile.
5 496 1% PS50293 TPR repeat region circular profile.

* PROSITE motifs specific for TMB proteins. Abbreviations as in Table 3.

Screening a pool of sequences for integral membrane proteins

A few methods identify potential integral membrane proteins among a pool of sequences. They have been evaluated for accuracy and coverage and allow multiple sequence submission. BOMP and PROFtmb screen for Gram-negative bacterial TMB proteins, while TMB-HUNT additionally screens for eukaryotic TMB proteins and those of atypical Gram-positive bacteria. TMHMM and Phobius screen for TMH proteins.

TMB screening methods. BOMP scores sequences on a 0 to 5 scale. The special score 0 means that BOMP itself did not predict the query protein to be a TMB but it had one or more known TMB homologs. The scores 1 through 5 represent varying qualitative degrees of confidence in the prediction. BOMP is extremely fast, screening about one protein per second. It can be run with or without using BLAST for additional information (we recommend running with BLAST). Pre-computed predictions for entire genomes are available for download.

TMB-HUNT discriminates TMBs from non-TMBs based on whole-protein amino acid composition. It is also extremely fast and runs with or without the use of BLAST internally. Unfortunately multiple sequence submission is only available when it is run without BLAST, which is less accurate. TMB-HUNT reports for each protein an odds ratio and E-value calculated from the odds ratio. Although it is not the same formula as the BLAST E-value, the authors claim it can be interpreted the same way.

PROFtmb is much slower, requiring about four minutes per protein. It builds a PSI-BLAST profile for each query, allowing user-specified parameters. Internally, each protein is assigned a bits score in log odds units. For technical reasons, the traditional E-value measure cannot be calculated; instead a so-called Z-score is reported. The Z-score is the number of standard deviations the query bits score is separated from the average bits scores of background proteins of similar length in a test set. Also reported are the accuracy and coverage for the given Z-score. The accuracy can be interpreted as the probability that the given query protein is a TMB.
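The Z-score definition above can be sketched as follows. This is an illustration of the formula only: the background scores are invented numbers, the function name is ours, and whether PROFtmb uses the sample or the population standard deviation is not stated, so the sample form is an assumption.

```python
from statistics import mean, stdev

def z_score(query_bits: float, background_bits: list[float]) -> float:
    """Number of standard deviations the query score lies above the
    mean score of background proteins of similar length."""
    return (query_bits - mean(background_bits)) / stdev(background_bits)

# Invented background bits scores for proteins of similar length.
background = [1.0, 2.0, 3.0, 4.0, 5.0]
print(round(z_score(9.0, background), 2))
```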

TMH screening programs. Of the six TMH prediction programs, only TMHMM is evaluated for accuracy in discriminating TMH from non-TMH proteins. It reports the position and orientation of each predicted helix and the expected number of residues in TM-helices. While there is no note on the website, in the original work they recommend using the expected number of residues, with a cutoff at 18, as the criterion for whole-proteome screening. For a test set of 160 TMH proteins and 645 soluble proteins, only 0.5 to 1% of the soluble proteins are predicted to have over 18 residues in TM-helices and only 1 or 2 TMH proteins (1%) were predicted to have 18 or fewer TM-helical residues. They caution that signal peptides are often mis-predicted as TM-helices.
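The screening criterion described above reduces to a threshold on the expected number of TM-helical residues. A minimal sketch, assuming those values have already been extracted from the TMHMM output into a dictionary (the protein identifiers and values here are invented):

```python
TM_RESIDUE_CUTOFF = 18  # cutoff recommended in the original TMHMM work

def screen_tmh(expected_tm_residues: dict[str, float],
               cutoff: float = TM_RESIDUE_CUTOFF) -> list[str]:
    """Return IDs of proteins with more than `cutoff` expected TM residues."""
    return [pid for pid, n in expected_tm_residues.items() if n > cutoff]

# Invented example values.
expected = {"protA": 42.7, "protB": 0.3, "protC": 18.0}
print(screen_tmh(expected))
```

Proteins passing the cutoff may still be false positives caused by signal peptides, as cautioned above.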

Phobius combines TMHMM and SignalP to optimize discrimination between signal peptides and transmembrane helices, and prediction of protein orientation (N-terminal out or in) using the fact that presence of a cleavage site for a TMH protein is equivalent to N-terminal out orientation. While Phobius accuracy in whole-protein discrimination hasn't been formally measured, the results have been thoroughly compared with TMHMM. It is likely that Phobius is an improvement over TMHMM for the purpose of whole-protein screening. Choosing short output format, a single table is provided with columns TM (number of predicted TM helices), SP (Y=signal peptide present; N=absent), and Prediction (positions and orientations of each predicted TM-helix). The resulting list of proteins can easily be screened in the TM column. If more than one version of the sequence is available, the longer precursor sequence should be submitted. Polyphobius is also available on the website, although it isn't documented in the original paper. It first builds a multiple sequence alignment using BLAST before running the model.
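Screening on the TM column can be sketched as below. The parser assumes a whitespace-separated table with one header line and the columns ID, TM, SP and Prediction in that order, with Y/N in the SP column as described above; the actual Phobius output layout may differ, and the example rows and topology strings are invented.

```python
def parse_short_output(text: str) -> list[dict]:
    """Parse an assumed 4-column short-format table into dictionaries."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        seq_id, tm, sp, prediction = line.split(None, 3)
        rows.append({"id": seq_id, "tm": int(tm),
                     "sp": sp == "Y", "prediction": prediction})
    return rows

# Invented example output.
example = """ID TM SP PREDICTION
protA 2 N i10-30o45-65i
protB 0 Y n5-15c20/21o
"""
candidates = [row["id"] for row in parse_short_output(example) if row["tm"] > 0]
print(candidates)
```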

Per-residue predictions

Per-residue prediction assigns each residue to one of a small number of structural states. For TMH proteins, these are usually the three states TM-helix, cytoplasmic loop and non-cytoplasmic loop. For TMB proteins they are TM-strand, extracellular loop and periplasmic loop. Such prediction is not provided explicitly by Pfam, PROSITE or any of the family-, domain-, or motif-specific methods. The methods described below do provide such a prediction. In the rare case that your protein has close homology to a target protein of known structure, you will get a more accurate prediction simply by transferring structural annotation to your query through the alignment. In this case you may use 3D structure-based methods to estimate position and thickness of the membrane bilayer relative to the protein. These methods include MPtopo, PDBTM and OPM. They are also used as reference standards for evaluating accuracy of per-residue prediction methods that predict TM regions from sequence alone. Low-resolution experiments also reveal transmembrane regions, which are annotated in the FEATURES section of the UniProt record (click on the UniProt ID) or "FT" fields in the flat file (go to Save Options -> Flatfile).
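A per-residue prediction is conveniently held as a string with one state label per residue, which can then be collapsed into segments. The sketch below uses single-letter labels (T for a TM segment, I and O for inner and outer loops); the 1-based inclusive coordinates are a choice made for this illustration.

```python
from itertools import groupby

def segments(labels: str) -> list[tuple[str, int, int]]:
    """Collapse a per-residue label string into (state, start, end) runs,
    using 1-based inclusive residue numbering."""
    result = []
    start = 1
    for state, run in groupby(labels):
        length = len(list(run))
        result.append((state, start, start + length - 1))
        start += length
    return result

print(segments("IIITTTTOO"))
```

Comparing such segment lists between a prediction and a structure-based reference is one way to see where the two disagree on TM boundaries.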



Fig. 1: TMB per-residue prediction methods. Per-residue predictions from the three best-performing TMB methods are displayed for long-chain fatty acid transport protein from E. coli, both linearly and mapped onto the 3D structure (PDB: 1T16 [13]). PDBTM and OPM, both 3D structure-based estimates of relative membrane orientation, are used for comparison against per-residue predictions. Note that the agreement between the methods for this example is not representative. A. 3D structure illustration. Inner loops, TM-β-strands, and outer loops are depicted respectively as colored C-α trace, colored ribbons, and grey ribbons. Note that none of the programs actually predicts such a 3D ribbon plot; instead, the actual predictions are as shown in B. B. Linear display. Inner loops, TM-β-strands, and outer loops are depicted respectively by a thin line, a thick line, and no line. 3D structure images rendered by GRASP [14].

TMB proteins. A recent study by Bagos et al. [9] evaluated the per-residue prediction accuracy of eleven TMB methods. They used 20 TMB PDB structures as a reference set, and considered just the bilayer-spanning region (as defined by PDBTM) of each TM β-strand. Three methods were found to be the most accurate by far, especially in distinguishing between periplasmic and extracellular loops: PRED-TMBB, HMM-B2TMR, and PROFtmb. All three are HMMs that provide a three-state (TM-strand, extracellular loop, periplasmic loop) per-residue prediction, a probability plot roughly indicating the confidence in the prediction for each residue, and an overall score indicating how likely the query is to be a TMB. Importantly, PROFtmb was designed to predict the entire span of each TM-β-strand, including extracellular extensions. Thus, the choice of evaluation criteria did not match the intended use of this method. Also, it is apparent that OPM and PDBTM estimate the membrane thickness differently (Figures 1 and 2), leading to significant differences in the inferred position of protein TM regions along the sequence.

Specific recommendations. PRED-TMBB allows three decoding schemes. In the independent evaluation, posterior decoding gave the best performance, closely followed by n-best. The Viterbi decoding option was not listed among the top predictors at all, and we recommend against using it. Similarly, HMM-B2TMR and B2TMR (NN-based) are both available, but only HMM-B2TMR scored well in the evaluation. Use of NN-based B2TMR is therefore not recommended. Also note that PROFtmb is designed to predict the entire span of TM-β-strands, including any possible extra-membrane extensions.

Interpretation of results. For all three methods, results consist of a 3-state per-residue prediction and a graphical output of state probabilities at each sequence position. Positions predicted with high probability according to the graph are more likely to be correct, although the correlation has not been rigorously quantified for any of the models. PROFtmb is designed to predict the position of entire transmembrane strands, including extracellular and periplasmic extensions, while the other two are designed to predict just the transmembrane portion of strands.

Overall confidence score. PRED-TMBB gives a whole-protein score for which lower is better. The authors do not provide any estimate of the probability of a TMB given the score, but simply state that a protein scoring below the threshold of 2.995 is probably a TMB. PROFtmb provides a Z-score, defined as the number of standard deviations from the average non-TMB protein of similar length (discussed above).
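The two kinds of whole-protein score can be illustrated with a short Python sketch. It assumes the usual Z-score form, (score - mean) / standard deviation, computed against a background of raw scores from non-TMB proteins of similar length. Whether PROFtmb uses a population or sample standard deviation is not stated here, so the population form is an assumption; the background values and function names are hypothetical.

```python
from statistics import mean, pstdev


def z_score(raw_score, background_scores):
    """Number of standard deviations between raw_score and the background mean.

    background_scores: raw scores of non-TMB proteins of similar length
    (hypothetical values here). Population standard deviation is an assumption.
    """
    sigma = pstdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return (raw_score - mean(background_scores)) / sigma


def passes_lower_is_better(score, threshold=2.995):
    """True if a lower-is-better score falls below the threshold quoted for PRED-TMBB."""
    return score < threshold


# Invented background with mean 5.0 and population standard deviation 2.0.
background = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_score(11.0, background)
```

Note the opposite directions: a larger Z-score means a better fit, whereas for the thresholded score a smaller value does.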

If you already know your protein is a TMB and only need the positions of transmembrane strands, the overall score is still useful. A better score (higher for the PROFtmb Z-score, lower for the PRED-TMBB score) indicates a better fit to the model, which in turn indicates more accurate prediction of the positions of transmembrane strands. Also, a tell-tale sign of an unreliable prediction is a poorly shaped probability graph in which the predicted transmembrane strands do not have near 100% probability for a majority of their length.
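The check on the shape of the probability graph can be made concrete. The sketch below flags predicted TM segments in which at most half of the residues reach a high TM probability. The cutoffs (0.9 for "near 100%", 0.5 for "a majority"), the state letter "M", and the function name `weak_segments` are assumptions chosen for illustration; the per-residue probabilities would have to be read off or exported from the server output.

```python
def weak_segments(states, tm_probs, tm_state="M", high=0.9, min_fraction=0.5):
    """Return (start, end, fraction) for predicted TM segments with weak support.

    states: per-residue state string; tm_probs: per-residue TM probabilities.
    A segment is flagged when the fraction of residues with probability >= high
    is not greater than min_fraction. Indices are 0-based, end exclusive.
    """
    if len(states) != len(tm_probs):
        raise ValueError("states and tm_probs must have equal length")
    flagged = []
    i, n = 0, len(states)
    while i < n:
        if states[i] != tm_state:
            i += 1
            continue
        j = i
        while j < n and states[j] == tm_state:
            j += 1  # extend to the end of this TM segment
        fraction = sum(1 for p in tm_probs[i:j] if p >= high) / (j - i)
        if fraction <= min_fraction:
            flagged.append((i, j, fraction))
        i = j
    return flagged


# Invented example: the first segment is well supported, the second is not.
states = "iiMMMMooMMMMii"
probs = [0.1, 0.2, 0.95, 0.99, 0.98, 0.92, 0.1, 0.1,
         0.6, 0.95, 0.7, 0.55, 0.2, 0.1]
flagged = weak_segments(states, probs)
```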

TMH proteins. Two studies comparing the accuracies of available methods reveal overlapping results. In 2002, Chen et al. [10] tested a set of 28 methods, using the positions of TM-helices estimated by MPtopo from 36 PDB structures as a reference set. The three most accurate methods were PiMohtm (then known as PHDhtm), HMMTOP, and TMHMM. They all predicted 80% of residues correctly as being either TM-helix or non-TM-helix. The more recent study, published in June 2005 by Cuthbertson et al. [11], used 25 TMH PDB structures. They used the DSSP algorithm [12] to define alpha-helical secondary structure, manually joining two helices if they were separated by one or two kinks. Unlike the 2002 study, no attempt was made to define a transmembrane region. Instead, the entire length of the helix, including regions likely outside of the bilayer, was considered "TM-helix" for the purpose of evaluation. They reported the most accurate methods (by 2-state accuracy) as Split4 (85.2%), TMHMM (83.3%), and MEMSAT (82.6%).
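Two-state per-residue accuracy, as used in these comparisons, is the fraction of residues for which prediction and reference agree on TM-helix versus non-TM-helix. A minimal sketch follows; the state letters, the function name `two_state_accuracy`, and the example strings are hypothetical and do not reproduce either study's evaluation code.

```python
def two_state_accuracy(predicted, reference, tm_state="M"):
    """Fraction of residues on which predicted and reference agree as TM or non-TM.

    Both arguments are per-residue state strings of equal length. All non-TM
    states (for example inner and outer loops) are collapsed into one class.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty strings of equal length")
    agree = sum(
        (p == tm_state) == (r == tm_state)
        for p, r in zip(predicted, reference)
    )
    return agree / len(reference)


# Invented example: one helix boundary is off by one residue, and the last
# residue is on the wrong side of the membrane (not penalized in two-state terms).
accuracy = two_state_accuracy("iiMMMMoooo", "iMMMMMoooi")
```

Because the measure depends entirely on how the reference defines "TM-helix", the same prediction scores differently against a bilayer-only reference than against a whole-helix reference.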



Fig. 2: TMH per-residue prediction methods. Per-residue predictions from the six best-performing TMH methods are displayed for cow rhodopsin, both linearly and mapped onto the 3D structure (PDB: 1GZM [15]). PDBTM and OPM are used as in Fig. 1. Note that the agreement between the methods for this example is not representative. A. 3D structure illustration. Inner loops, TM-helices, and outer loops are depicted respectively as colored C-α trace, colored ribbons, and grey ribbons. Note that none of the programs actually predicts such a 3D ribbon plot; instead, the actual predictions are as shown in B. B. Linear display. Inner loops, TM-helices, and outer loops are depicted respectively by a thin line, a thick line, and no line. For MEMSAT, predicted TM-helix caps are depicted as an intermediate-thickness line and as darker green ribbons on the structure. Note that all methods except Split4 distinguish inner and outer loops. 3D structure images rendered by GRASP [14].

Comparison of the 3D structure-based estimates of transmembrane regions by PDBTM and OPM reveals differences that have a significant effect on the inferred position of TM-helices (Fig. 2). The different per-residue prediction methods tend to over- or under-predict helices on the inside or outside end. Unfortunately, a detailed analysis of the particular strengths and weaknesses of individual predictors is not available. All predictors except Split4 distinguish inner from outer loops, while MEMSAT further distinguishes the transmembrane portion of the helix from extra-membrane helical overhangs.

As with TMB predictors, PiMohtm, Split4 and TMHMM provide per-residue probability plots which aid in distinguishing the relative reliability in the prediction at different locations in the sequence (not shown).




The main strategy in homology search is to be opportunistic. One cannot know in advance whether the query will turn out to have a close, medium, or remote homolog. Various methods are optimized to search for homologs in different ranges of similarity, from BLAST (closest) to Pfam (medium) to the set of TMB- or TMH-specific methods (remote). Because of this, all of these methods must be tried and the prediction with the most reliable scores chosen. Where more than one prediction is equally reliable, take a consensus of them. Lastly, continue to check the methods every few months, as they are steadily improving, as are the underlying databases of structural and functional annotation.
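One simple way to take a consensus of several equally reliable per-residue predictions is a majority vote at each sequence position. This is only one possible reading of "consensus" and is an assumption, not a method prescribed by any of the tools; the function name `consensus`, the state letters, and the example predictions are hypothetical.

```python
from collections import Counter


def consensus(predictions, tie="?"):
    """Majority vote per position over equal-length per-residue state strings.

    Positions where the two most common states are tied are marked with `tie`.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(tie)  # no clear majority at this position
        else:
            result.append(ranked[0][0])
    return "".join(result)


# Invented predictions from three hypothetical methods for a 7-residue stretch.
combined = consensus(["iiMMMoo", "iMMMMoo", "iiMMooo"])
```

Positions marked with the tie symbol are the ones worth inspecting by hand, for example against the per-residue probability plots.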




We would like to thank Gunnar von Heijne, Henrik Nielsen, Jannick Bendtsen and Lukas Käll for very informative and useful advice through email correspondence in clarifying issues related to integral membrane protein biology and specific prediction programs. We also gratefully acknowledge Erroll Rueckert for twice reading the manuscript; his many suggestions greatly improved organization and readability. This work was supported by the grant RO1-LM07329-01 from the National Library of Medicine (NLM). Last but not least, thanks to all those who deposit their experimental data in public databases, and to those who maintain these databases.





1. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-402.
2. Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V. et al. (2004). The Pfam protein families database. Nucleic Acids Res, 32, D138-41.
3. Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E. et al. (2006). The PROSITE database. Nucleic Acids Res, 34, D227-30.
4. Attwood, T. K., Bradley, P., Flower, D. R., Gaulton, A., Maudling, N. et al. (2003). PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res, 31, 400-2.
5. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C., Estreicher, A. et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, 31, 365-70.
6. Wu, C. H., Yeh, L. S., Huang, H., Arminski, L., Castro-Alvear, J. et al. (2003). The Protein Information Resource. Nucleic Acids Res, 31, 345-7.
7. Camon, E., Barrell, D., Lee, V., Dimmer, E. & Apweiler, R. (2004). The Gene Ontology Annotation (GOA) Database--an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biol, 4, 5-6.
8. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E. et al. (2004). The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res, 32, D262-6.
9. Bagos, P. G., Liakopoulos, T. D. & Hamodrakas, S. J. (2005). Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics, 6, 7.
10. Chen, C. P., Kernytsky, A. & Rost, B. (2002). Transmembrane helix predictions revisited. Protein Sci, 11, 2774-91.
11. Cuthbertson, J. M., Doyle, D. A. & Sansom, M. S. (2005). Transmembrane helix prediction: a comparative evaluation and analysis. Protein Eng Des Sel, 18, 295-308.
12. Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-637.
13. van den Berg, B., Black, P. N., Clemons, W. M., Jr. & Rapoport, T. A. (2004). Crystal structure of the long-chain fatty acid transporter FadL. Science, 304, 1506-9.
14. Petrey, D. & Honig, B. (2003). GRASP2: visualization, surface properties, and electrostatics of macromolecular structures and sequences. Methods Enzymol, 374, 492-509.
15. Li, J., Edwards, P. C., Burghammer, M., Villa, C. & Schertler, G. F. (2004). Structure of bovine rhodopsin in a trigonal crystal form. J Mol Biol, 343, 1409-38.
16. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215, 403-10.
17. Sonnhammer, E. L., von Heijne, G. & Krogh, A. (1998). A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol, 6, 175-82.
18. Rost, B., Fariselli, P. & Casadio, R. (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci, 5, 1704-18.
19. Kall, L., Krogh, A. & Sonnhammer, E. L. (2004). A combined transmembrane topology and signal peptide prediction method. J Mol Biol, 338, 1027-36.
20. Tusnady, G. E. & Simon, I. (2001). The HMMTOP transmembrane topology prediction server. Bioinformatics, 17, 849-50.
21. McGuffin, L. J., Bryson, K. & Jones, D. T. (2000). The PSIPRED protein structure prediction server. Bioinformatics, 16, 404-5.
22. Juretic, D., Zoranic, L. & Zucic, D. (2002). Basic charge clusters and predictions of membrane protein topology. J Chem Inf Comput Sci, 42, 620-32.
23. Bagos, P. G., Liakopoulos, T. D., Spyropoulos, I. C. & Hamodrakas, S. J. (2004). PRED-TMBB: a web server for predicting the topology of beta-barrel outer membrane proteins. Nucleic Acids Res, 32, W400-4.
24. Bagos, P. G., Liakopoulos, T. D., Spyropoulos, I. C. & Hamodrakas, S. J. (2004). A Hidden Markov Model method, capable of predicting and discriminating beta-barrel outer membrane proteins. BMC Bioinformatics, 5, 29.
25. Martelli, P. L., Fariselli, P., Krogh, A. & Casadio, R. (2002). A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics, 18 Suppl 1, S46-53.
26. Bigelow, H. R., Petrey, D. S., Liu, J., Przybylski, D. & Rost, B. (2004). Predicting transmembrane beta-barrels in proteomes. Nucleic Acids Res, 32, 2566-77.
27. Garrow, A. G., Agnew, A. & Westhead, D. R. (2005). TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucleic Acids Res, 33, W188-92.
28. Garrow, A. G., Agnew, A. & Westhead, D. R. (2005). TMB-Hunt: an amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins. BMC Bioinformatics, 6, 56.
29. Berven, F. S., Flikka, K., Jensen, H. B. & Eidhammer, I. (2004). BOMP: a program to predict integral beta-barrel outer membrane proteins encoded within genomes of Gram-negative bacteria. Nucleic Acids Res, 32, W394-9.
30. Bendtsen, J. D., Nielsen, H., von Heijne, G. & Brunak, S. (2004). Improved prediction of signal peptides: SignalP 3.0. J Mol Biol, 340, 783-95.
31. Gough, J., Karplus, K., Hughey, R. & Chothia, C. (2001). Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol, 313, 903-19.
32. Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J. et al. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res, 33, D284-8.
33. Schultz, J., Milpetz, F., Bork, P. & Ponting, C. P. (1998). SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A, 95, 5857-64.
34. Letunic, I., Copley, R. R., Pils, B., Pinkert, S., Schultz, J. et al. (2006). SMART 5: domains in the context of genomes and networks. Nucleic Acids Res, 34, D257-60.
35. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H. et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25, 25-9.
36. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N. et al. (2000). The Protein Data Bank. Nucleic Acids Res, 28, 235-42.
37. Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E. et al. (2000). InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics, 16, 1145-50.
38. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 247, 536-40.
39. Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. & Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res, 30, 264-7.
40. Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J., Chothia, C. et al. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res, 32, D226-9.
41. Zdobnov, E. M. & Apweiler, R. (2001). InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics, 17, 847-8.
42. Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N. et al. (2005). InterProScan: protein domains identifier. Nucleic Acids Res, 33, W116-20.
43. Wu, C. H., Apweiler, R., Bairoch, A., Natale, D. A., Barker, W. C. et al. (2006). The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res, 34, D187-91.
44. Lomize, M. A., Lomize, A. L., Pogozheva, I. D. & Mosberg, H. I. (2006). OPM: orientations of proteins in membranes database. Bioinformatics, 22, 623-5.
45. Tusnady, G. E., Dosztanyi, Z. & Simon, I. (2004). Transmembrane proteins in the Protein Data Bank: identification and classification. Bioinformatics, 20, 2964-72.
46. Tusnady, G. E., Dosztanyi, Z. & Simon, I. (2005). PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic Acids Res, 33, D275-8.
47. Jayasinghe, S., Hristova, K. & White, S. H. (2001). MPtopo: A database of membrane protein topology. Protein Sci, 10, 455-8.

Contact: Version:    Apr 27, 2007