PredictProtein - Documentation
- 1 Introduction
- 2 Input
- 3 PredictProtein Output
- 3.1 Visual PredictProtein Output
- 3.2 Detailed Results (HTML Format)
- 3.2.1 Secondary Structure, Solvent Accessibility and Transmembrane Helices Prediction
- 3.2.2 Multiple Sequence Alignment
- 3.2.3 Iterated profile-based search (PSI-BLAST)
- 3.2.4 Filtering Input Alignment
- 3.2.5 Definition of Family
- 3.2.6 Functional sequence motifs (ProSite)
- 3.2.7 Low-complexity regions (SEG)
- 3.2.8 Prediction of nuclear localisation signal (PredictNLS)
- 3.2.9 Secondary structure (PROF)
- 3.2.10 Solvent accessibility (PHDacc)
- 3.2.11 Globularity of proteins (GLOBE)
- 3.2.12 Transmembrane helices (PHDhtm)
- 3.2.13 Coiled-coil regions (COILS)
- 3.2.14 Structural switches (ASP)
- 3.2.15 Contact Prediction (PROFcon)
- 3.2.16 CHOP
- 3.2.17 PROFTMB
- 3.2.18 InteractionSites
- 3.2.19 METADISORDER
- 3.2.20 PROFBVAL
- 3.2.21 SNAP
- 3.2.22 LOCTREE
- 3.2.23 NORSp
- 4 Hints and Tips
- 4.1 What can you expect from secondary structure prediction?
- 4.2 In a Nutshell: how to avoid pitfalls?
- 4.3 Nuts and Bolts: What to Keep in Mind?
- 4.4 Cut-off For including Homologues in Alignment
- 4.5 Quality of Multiple Sequence Alignment
- 4.6 PROFphd Minimal Length of Sequences Limitation
- 4.7 Insertions in Multiple Sequence Alignment
- 4.8 Untypical proteins
- 4.9 Prediction of transmembrane helices (HTM's) and topology
- 4.10 Reliability indices for PROFPHD predictions
- 4.11 Combination of results with that of other methods
- 5 Credits
- 6 Copyright
- 7 Contact
- 8 Information Regarding Subscription
What Is PredictProtein
What is PredictProtein (PP)?
PP is an automatic service for protein database searches and the prediction of aspects of protein structure and function. Given an amino acid sequence or an alignment as input, PP returns:
- Multiple sequence alignment
- ProSite sequence motifs
- low-complexity regions (SEG)
- Nuclear localisation signals
- and predictions of
- secondary structure
- solvent accessibility
- globular regions
- transmembrane helices
- coiled-coil regions
- structural switch regions
- disordered regions
- inter-residue contacts
- protein-protein and protein-DNA binding sites
- sub-cellular localization
- domain assignment
- beta barrels
- cysteine predictions and disulphide bridges
How does PredictProtein work?
Generating an alignment. The following steps are performed.
- The sequence database (compiled from SWISSPROT+TrEMBL+PDB) is scanned for similar sequences (by BLASTP).
- A multiple sequence alignment is generated by iterated BLAST searches (PSI-BLAST).
- ProSite motifs are retrieved from the ProSite database.
- Low-complexity regions (e.g. composition bias) are marked by the program SEG.
Prediction of protein structure in 1D. The multiple alignment is used as input for profile-based neural network predictions (PROF methods). The following levels of prediction accuracy have been evaluated in cross-validation experiments:
- Secondary structure prediction (PHDsec or PROFsec):
Expected three-state (helix, strand, rest) overall accuracy > 72% (PHD) and > 76% (PROF) for water-soluble globular proteins. You may find details about accuracy in graphs, in tables, and in the literature: Rost 1997 (paper) and 1996 (paper); Rost & Sander 1993 and 1994.
- Solvent accessibility prediction (PHDacc or PROFacc):
Expected correlation between observed and predicted relative accessibility > 0.5. You may find details about accuracy in graphs, in tables, and in the literature: Rost 1997 (paper) and 1996 (paper), Rost & Sander 1994 (abstract).
- Transmembrane helix prediction (PHDhtm): Expected overall two-state accuracy (transmembrane, non-transmembrane) > 95%. For the refined prediction of transmembrane helices and topology, the expected likelihood of predicting all helices correctly is about 89%, and the expected accuracy of topology prediction is > 86%. You may find details about accuracy in tables, and in the literature: Rost, Casadio & Fariselli 1996, and Rost, Casadio, Fariselli & Sander 1995.
- Other predictions
Predictions of secondary structure and accessibility are aligned against PDB to detect remote homologues (prediction-based threading). You may find details about accuracy in the literature: Rost, Schneider & Sander, 1996 (paper), Rost 1995 (abstract) and 1994 (abstract).
What can we do for you?
- You have a protein sequence and want to find out anything we can say about structure and function?
In general, we can provide multiple sequence alignments and predictions of secondary structure, residue solvent accessibility and the location of transmembrane helices (examples for: request; and output).
- You have a helical transmembrane protein sequence and want a refined prediction of the helix locations and topology?
We provide multiple sequence alignments and refined predictions for the location of transmembrane helices and for the topology, i.e. the orientation of the N-term with respect to the membrane (examples for: request; and output).
- You have a protein sequence and search for remote homologues (i.e., homologues with <25% sequence identity)?
We find secondary structure and accessibility motifs similar between a known structure and your protein by prediction-based threading (examples for: request; and output).
- You have a multiple sequence alignment and want to obtain a prediction of 1D structure based on that alignment?
We use your alignment as input to the methods predicting secondary structure, solvent accessibility and transmembrane helices (examples for: request; and output).
- You have a list of sequences not in current databases and want it to be used for 1D predictions?
We align your sequences and use the resulting alignment as input to the structure and function prediction methods (examples for: request; and output).
- You have a prediction of secondary structure and accessibility and want to search for similar motifs in known structures?
We base the threading procedure on your prediction (examples for: request; and output).
- You have a prediction and an observation of secondary structure and you want to compile the prediction accuracy?
We compile per-residue and per-segment scores for the evaluation of prediction accuracy (examples for: request; and output).
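The per-residue part of such an evaluation can be sketched in a few lines of Python. This is only an illustration, not the server's evaluation code: the function name `q3_accuracy` and the one-letter state alphabet (H = helix, E = strand, L = rest) are assumptions made for this example, and the per-segment scores are not covered.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state equals the observed state.

    Both strings use one character per residue (assumed alphabet: H, E, L).
    """
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    if not predicted:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


# Toy strings (hypothetical, not real server output): 8 of 10 states agree.
pred = "HHHHEEELLL"
obs = "HHHLEEELLH"
print(f"three-state per-residue accuracy = {q3_accuracy(pred, obs):.2f}")
```

The score is a plain fraction, so multiplying by 100 gives the percentage values quoted in this documentation.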
Currently the server accepts only protein sequences in one letter code.
Visual PredictProtein Output
PredictProtein presents predictions and database search results schematically in a single figure. The following is a brief explanation of the colors and shapes shown on the VisualPP page. NOTE: we are working on a more interactive legend for this image.
- PROFacc: blue: exposed, yellow: buried, green: between exposed and buried; intensity: confidence of prediction
- PROFsec: yellow: strand, red: helix
- MD: red: disorder prediction, green: order prediction; intensity: confidence
- Other predictions: the colored areas represent predictions in the indicated location
- PSI-BLAST: lines represent subjects of HSPs (high-scoring segment pairs)
Detailed Results (HTML Format)
The output format is self-documenting. The output contains:
- A list of likely homologues found in the protein database (UniProt+PDB), and the multiple sequence alignment of these sequences (by default in 'MSF' format)
- If found: a list of the putative ProSite motifs.
- If found: a prediction of coiled-coil regions.
Secondary Structure, Solvent Accessibility and Transmembrane Helices Prediction
The HTML and TEXT versions of the predictions provide the following information:
- Expected levels of accuracy of structure predictions. (We suggest that newcomers read this carefully.)
- Prediction of aspects of protein structure. These are grouped in the following way:
- Prediction of secondary structure for all residues, with an expected average three-state accuracy of > 72%;
- Prediction of secondary structure for reliably scored residues only, with an expected three-state accuracy for these residues of > 82%;
- Prediction of solvent accessibility for all residues, with an expected average correlation to the experimentally observed values of 0.54;
- Prediction of solvent accessibility for reliably scored residues only, with an expected correlation between experimental observation and prediction of 0.69;
- Prediction of transmembrane helices and their topology (if any detected), with an expected prediction accuracy of about 95% in two states.
Note: for the prediction of transmembrane helices a conservative threshold is chosen. Thus, your protein may not be reported to contain an HTM although it may have one. If you opt explicitly for the refined prediction of transmembrane helices and topology ("predict htm"), four predictions are given (example for output):
- neural network prediction (expected accuracy for HTM's about 78%);
- result of empirical filter (expected accuracy for HTM's about 97%);
- refined prediction (expected accuracy for HTM's about 99%);
- prediction of topology (expected accuracy about 86%).
If your sequence has at least one non-trivial homologue in the database of protein sequences, you receive a multiple sequence alignment and the annotated prediction in the following form:
- Block with multiple sequence alignment in the Alignment section.
- Block with explanations about the prediction method in the Secondary Structure prediction section.
- Block with prediction (example for secondary structure prediction follows).
    .........1.........2.........3.........4.........5.........6
AA  KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
PHD EEEEEE EEEEEE EEEEEE EEEE EEE
Rel 854777641334566643102441577762566642443213663122112234155
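The distinction between "all residues" and "reliably scored residues only" can be illustrated with the Rel row of the example above. The following Python sketch is an assumption-laden illustration: the threshold of 5 and the function names are hypothetical, and the server may define "reliably scored" differently.

```python
# Reliability row copied from the example prediction block (one digit per residue).
REL = "854777641334566643102441577762566642443213663122112234155"


def reliable_fraction(reliability: str, threshold: int = 5) -> float:
    """Fraction of residues with a reliability index >= threshold (threshold is an assumption)."""
    return sum(int(r) >= threshold for r in reliability) / len(reliability)


def mask_unreliable(prediction: str, reliability: str, threshold: int = 5) -> str:
    """Replace predicted states below the reliability threshold with '.'."""
    if len(prediction) != len(reliability):
        raise ValueError("prediction and reliability rows must have the same length")
    return "".join(
        state if int(rel) >= threshold else "."
        for state, rel in zip(prediction, reliability)
    )


print(f"residues with Rel >= 5: {reliable_fraction(REL):.0%}")
# Toy prediction and reliability rows (hypothetical, for illustration only).
print(mask_unreliable("HHHEEE", "905172"))
```

Raising the threshold keeps fewer residues but, according to the accuracy figures listed above, those residues are predicted more accurately on average.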
Multiple Sequence Alignment
The multiple sequence alignment is built up in essentially three steps (MaxHom, Sander & Schneider, Proteins, 1991, 9, 56-68).
- The protein database (currently SWISS-PROT) is searched by a fast alignment program (currently BLASTP).
- In sweep 1, sequences are aligned consecutively to the search sequence by a standard dynamic programming method. After each sequence has been added a profile is compiled, and used to align the next sequence.
- In sweep 2, after all sequences with significant homology have been picked from the BLASTP output, the profile is recompiled, and the dynamic programming algorithm starts once again to align consecutively the sequences, this time using the conservation profile as derived after completion of sweep 1.
Iterated profile-based search (PSI-BLAST)
PSI-BLAST is a fast, yet sensitive database search program. We run the iterated PSI-BLAST on a subset of the BIG database with UniProt + PDB sequences. The number of iterations, the cut-off thresholds and the details of which sequences are used from BIG have been optimized in our group.
Filtering Input Alignment
If the divergence found in your family is not 'well' spread, prediction accuracy may drop. In particular, too many highly similar sequences may be problematic in the absence of further diverged family members. This problem came up only in the post-genome era, i.e. since the number of known sequences started to explode. To correct for this problem we run a crude filter on the alignment by default. To switch the filter off, tick the 'no filter' option in the submission form.
Definition of Family
Family = structural family, i.e. all proteins in a family have similar structures. We search with your input sequence against the specified sequence database (SWISS-PROT by default). All proteins P whose level of sequence similarity to your query protein Q allows us to predict that P and Q have a similar three-dimensional structure are returned in the alignment. The iterated PSI-BLAST frequently finds more diverged proteins P2 that also have a structure similar to that of Q.
Functional homology. In general, much higher levels of sequence similarity are required to infer particular aspects of function. Thus, the members of a protein family may or may not share particular functional motifs.
Functional sequence motifs (ProSite)
ProSite is a method for determining the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
Low-complexity regions (SEG)
The following description is from the original SEG documentation (JC Wootton & S Federhen, 1996, Meth Enzymology, 266, 554-571): SEG divides sequences into contrasting segments of low-complexity and high-complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally-biased regions". Locally-optimized low-complexity segments are produced at defined levels of stringency, based on formal definitions of local compositional complexity. The segment lengths and the number of segments per sequence are determined automatically by the algorithm.
Prediction of nuclear localisation signal (PredictNLS)
PredictNLS finds experimentally known nuclear localisation signals present in your protein. The program produces an output if and only if a known NLS was found. Note that the original version of the program at http://cubic.bioc.columbia.edu/predictNLS also allows you to obtain statistics for putative NLS motifs.
Secondary structure (PROF)
Secondary structure is predicted by a system of neural networks rated at an expected average accuracy > 72% for the three states helix, strand and loop (Rost & Sander, PNAS, 1993, 90, 7558-7562; Rost & Sander, JMB, 1993, 232, 584-599; and Rost & Sander, Proteins, 1994, 19, 55-72; evaluation of accuracy). Evaluated on the same data set, PHDsec is rated at ten percentage points higher three-state accuracy than methods using only single sequence information, and at more than six percentage points higher than, e.g., a method using alignment information based on statistics (Levin, Pascarella, Argos & Garnier, Prot. Engng., 6, 849-54, 1993). PHDsec predictions have three main features:
- improved accuracy through evolutionary information from multiple sequence alignments
- improved beta-strand prediction through a balanced training procedure
- more accurate prediction of secondary structure segments by using a multi-level system
Solvent accessibility (PHDacc)
Solvent accessibility is predicted by a neural network method rating at a correlation coefficient (correlation between experimentally observed and predicted relative solvent accessibility) of 0.54 cross-validated on a set of 238 globular proteins (Rost & Sander, Proteins, 1994, 20, 216-226; evaluation of accuracy). The output of the neural network codes for 10 states of relative accessibility. Expressed in units of the difference between prediction by homology modelling (best method) and prediction at random (worst method), PHDacc is some 26 percentage points superior to a comparable neural network using three output states (buried, intermediate, exposed) and using no information from multiple alignments.
Globularity of proteins (GLOBE)
An additional result from the prediction of solvent accessibility is that of protein globularity.
Transmembrane helices (PHDhtm)
Transmembrane helices in integral membrane proteins are predicted by a system of neural networks. The shortcoming of the network system is that the predicted helices are often too long. These are cut by an empirical filter. The final prediction (Rost et al., Protein Science, 1995, 4, 521-533; evaluation of accuracy) has an expected per-residue accuracy of about 95%. The number of false positives, i.e., transmembrane helices predicted in globular proteins, is about 2% (Rost et al. 1996).
The neural network prediction of transmembrane helices (PHDhtm) is refined by a dynamic programming-like algorithm. This method resulted in correct predictions of all transmembrane helices for 89% of the 131 proteins used in a cross-validation test; more than 98% of the transmembrane helices were correctly predicted. The output of this method is used to predict topology, i.e., the orientation of the N-term with respect to the membrane. The expected accuracy of the topology prediction is > 86%. Prediction accuracy is higher than average for eukaryotic proteins and lower than average for prokaryotes. PHDtopology is more accurate than all other methods tested on identical data sets (Rost, Casadio & Fariselli, 1996a and 1996b; evaluation of accuracy).
Coiled-coil regions (COILS)
COILS is a program that compares a sequence to a database of known parallel two-stranded coiled-coils and derives a similarity score. By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.
Structural switches (ASP)
ASP identifies amino acid subsequences that are the most likely to switch between different types of secondary structure. The program was developed by MM Young, K Kirshenbaum, KA Dill and S Highsmith. ASP was designed to identify the location of conformational switches in proteins with known switches. It is NOT designed to predict whether a given sequence does or does not contain a switch. For best results, ASP should be used on sequences of length >150 amino acids with >10 sequence homologues in the SWISS-PROT data bank. ASP has been validated against a set of globular proteins and may not be generally applicable. Please see Young et al., Protein Science 8(9):1752-64. 1999. and Kirshenbaum et al., Protein Science 8(9):1806-1815. 1999. for details and for how best to interpret this output. We consider ASP to be experimental at this time, and would appreciate any feedback from our users.
Contact Prediction (PROFcon)
PROFcon predicts contacts between residue pairs in single chains. Our definition of contact is based on distances between Cbeta atoms (Calpha for glycines). Two residues whose Cbeta atoms are closer than 8 Å are considered to be in contact; all other pairs are considered not in contact. The last column of the output is the predicted contact score (the contact probability is high if the score is close to 1).
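The contact definition above can be written down directly. The sketch below is a minimal illustration in Python and is not part of PROFcon: it assumes you already have one coordinate per residue (the Cbeta atom, or Calpha for glycine), and the function name and toy coordinates are hypothetical.

```python
import math
from itertools import combinations

CONTACT_CUTOFF = 8.0  # Angstrom, as in the contact definition given in the text


def contact_pairs(coords, cutoff=CONTACT_CUTOFF):
    """Return index pairs (i, j), i < j, whose representative atoms are closer than cutoff.

    coords: list of (x, y, z) tuples, one per residue
            (Cbeta, or Calpha for glycine; supplied by the caller).
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) < cutoff
    ]


# Toy coordinates (hypothetical): four atoms spaced 3.8 Angstrom apart on a line.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
print(contact_pairs(toy))
```

Such an observed contact list is what a predicted contact score would be compared against when a structure is known.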
CHOP
A method for dissecting proteins into domain-like fragments based on sequence homology.
PROFTMB
Per-residue and whole-proteome prediction of bacterial transmembrane beta barrels.
InteractionSites
Protein-protein interaction sites identified from sequence.
PROFBVAL
A method for predicting residue mobility based on amino-acid sequence.
SNAP
A method for evaluating effects of single amino acid substitutions on protein function. The server can be found at http://rostlab.org/services/snap/
LOCTREE
A prediction method for sub-cellular localization of proteins.
NORSp
A predictor of NOn-Regular Secondary Structure.
Hints and Tips
Note. The following notes result from the experiences I have gathered by offering, and running the PredictProtein service and during various structure prediction workshops. The comments are tailored in particular to the PROF methods; however, most comments hold also for using other secondary structure prediction methods.
What can you expect from secondary structure prediction?
How accurate are the predictions? The expected levels of accuracy (PROFsec = 72±11% (three-state per-residue accuracy); PROFacc = 75±7% (two-state per-residue accuracy); PHDhtm = 94±6% (two-state per-residue accuracy)) are valid for typical globular, water-soluble (PROFsec, PROFacc), or helical transmembrane proteins (PHDhtm) when the multiple alignment contains many and diverse sequences. High values for the reliability indices indicate more accurate predictions. (Note: for alignments with little variation in the sequences, the reliability indices adopt misleadingly high values.) PROFsec predictions tend to be relatively accurate for porins; however, for helical membrane proteins other programs ought to be used.
Confusion between strand and helix? PROFphd (as well as other methods) focuses on predicting hydrogen bonds. Consequently, occasionally strongly predicted (high reliability index) helices are observed as strands and vice versa (expected accuracy of PROFsec).
Strong signal from secondary structure caps? The ends of helices and strands contain a strong signal. However, on average PROFPHD predicts the core of helices and strands more accurately than the caps (B. Rost and C. Sander, 1D secondary structure prediction through evolutionary profiles, in: H. Bohr and S. Brunak (eds.), Protein Structure by Distance Analysis, Amsterdam: IOS Press, 257-276 (1994)). This seems to also hold for other methods (Garnier, priv. comm.).
Are internal helices predicted poorly? Steven Benner has indicated that internal buried helices are particularly difficult to predict. On average, this is not the case for PROFPHD predictions (expected accuracy of PROFsec for buried helices).
Accessibility useful to provide upper limits for contacts? The predicted solvent accessibility (PROFacc) can be translated into a prediction of the number of water atoms around a given residue. Consequently, PROFacc can be used to derive upper and lower limits for the number of inter-residue contacts of a certain residue (such an estimate could improve predictions of inter-residue contacts).
How to predict porins? PHDhtm predicts only transmembrane helices, and PROFsec has been trained on globular, water-soluble proteins. How to predict 1D structure for porins then? As porins are partly accessible to solvent, prediction accuracy of PROFsec was relatively high (70%) for the known structures. Thus, PROFsec appears to be applicable.
How to use the prediction of transmembrane helices? One possible application of PHDhtm is to scan, e.g., entire chromosomes for possible transmembrane proteins. The classification as transmembrane protein is not sufficient to have knowledge about function, but may shed some light into the puzzle of genome analyses. When using PHDhtm for this purpose, the user should keep in mind that on average about 5% of the globular proteins are falsely predicted to have transmembrane helices.
What about protein design and synthesised peptides? The PROFPHD networks are trained on naturally evolved proteins. However, the predictions have proven to be useful in some cases to investigate the influence of single mutations (e.g. for Chameleon, or for Janus; Rost, unpublished). For short polypeptides, the following should be taken into account: the network input consists of 17 adjacent residues; thus, predictions for shorter sequences may be dominated by the ends (which are treated as solvent).
In a Nutshell: how to avoid pitfalls?
70% correct implies 30% incorrect. The most accurate methods for predicting secondary structure reach sustained levels of about 70% accuracy. When interpreting predictions for a particular protein it is often instructive to mark the 30% of the residues you suspect to be falsely predicted.
Spread of prediction accuracy. An expected accuracy of 70% does NOT imply that for your protein U 70% of all residues are correctly predicted. Instead, values published for prediction accuracy are averaged over hundreds of unique proteins. An expected accuracy of 70±10% (one standard deviation) implies that, on average, for two thirds of all proteins between 60 and 80% of the residues will be predicted correctly (expected accuracy of PHDsec). Thus, prediction accuracy can be higher than 80% or lower than 60% for your protein. Few methods supply well tested indices for the reliability of predictions. Such indices can help to reduce or increase your trust in a particular prediction.
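The "two thirds" figure follows if per-protein accuracy is assumed to be roughly normally distributed around the mean. The short Python check below makes that assumption explicit; the normal shape is an idealisation introduced for this example, not a statement about the actual distribution of per-protein accuracies.

```python
from statistics import NormalDist

# Assumption: per-protein accuracy ~ Normal(mean = 70%, standard deviation = 10%).
accuracy = NormalDist(mu=70.0, sigma=10.0)

within_one_sd = accuracy.cdf(80.0) - accuracy.cdf(60.0)  # share of proteins between 60% and 80%
below_60 = accuracy.cdf(60.0)                            # share of proteins predicted worse than 60%

print(f"between 60% and 80%: {within_one_sd:.1%}")
print(f"below 60%:           {below_60:.1%}")
```

Under this assumption roughly one protein in six falls below 60% accuracy, which is why a single prediction should not be read as "70% correct".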
Special classes of proteins. Prediction methods are usually derived from knowledge contained in subsets of proteins from databases. Consequently, they should not be applied to classes of proteins which have not been included in the subsets. For example, methods for predicting helices in globular proteins are likely to fail when applied to predict transmembrane helices. In general, results should be taken with caution for proteins with unusual features, such as proline-rich regions, unusually many cysteine bonds, or for domain interfaces.
Better alignments yield better predictions. Multiple alignment-based predictions are substantially more accurate than single sequence-based predictions. How many sequences do you need in your alignment to expect an improvement; and how sensitive are prediction methods with respect to errors in the alignment? The more divergent sequences contained in the alignment, the better (two distantly related sequences often improve secondary structure predictions by several percentage points). Regions with few aligned sequences yield less reliable predictions. The sensitivity to alignment errors depends on the methods, e.g., secondary structure prediction is less sensitive to alignment errors than accessibility prediction.
Better + worse = even better? Today, several automatic services provide secondary structure predictions. Some users fall into the what-is-common-is-correct trap, i.e., they average over all prediction methods and consider identical regions as more reliable. Exceptionally, such a majority vote may be beneficial. Frequently, however, the result will be the worst-of-all prediction. Often, it is preferable to use reliability indices provided by some methods. Such indices answer the question: how reliably is the tryptophan at position 307 predicted in a surface loop? (Note: the correlation between such indices and prediction accuracy is sufficiently tested for only a few methods.)
1D structure may or may not be sufficient to infer 3D structure. Say you obtain as prediction for regular secondary structure: helix-strand-strand-helix-strand-strand (H-E-E-H-E-E). Assume you find a protein of known structure with the same motif (H-E-E-H-E-E). Can you conclude that the two proteins have the same fold? Yes and no: your guess may be correct, but there are various ways to realize the given motif by completely different structures. For example, the secondary structure motif 'H-E-E-H-E-E' is contained in at least 16 structurally unrelated proteins.
Nuts and Bolts: What to Keep in Mind?
Information content in multiple sequence alignment. If the multiple sequence alignment contains only a few proteins very similar to the one you sent (pairwise sequence identity > 90%), the expected accuracy for 1D structure predictions (secondary structure, accessibility, transmembrane helices) drops significantly. Note: this implies a reduction of the expected accuracy for threading. The scores for expected accuracy (PROFsec, PROFacc, PHDhtm) are valid for typical alignments as found in the HSSP database. The information content of the alignment is difficult to measure. Two important parameters are:
- Number of aligned sequences: the more sequences in the alignment, the better. The exact number of sequences needed for a 'good prediction' cannot be given, as it depends on the variation and on characteristics of the particular protein family. As a rule of thumb: one is clearly NOT sufficient, more than five sequences can be enough.
- Variation of aligned sequences: the aligned sequences should have a considerable variation with respect to the guide sequence (your protein). Ideally, the alignment should contain sequences at levels of 80%, 60%, 50%, 40%, and about 30% pairwise sequence identity (with respect to the predicted protein). In general, more diverged sequences (30-40%) contribute more to the information content than do very similar ones (> 80%). Note: the levels of sequence identity are summarized in the alignment header of the output returned (example).
- Alignment errors for distant homologues: More distantly related sequences contribute more to the alignment diversity, which is the basis for improved prediction accuracy. However, the more distant relatives are difficult to align (below levels of some 40% sequence identity some alignment errors are guaranteed). Furthermore, even the correct detection of more distant relatives becomes highly complicated below levels of about 35% sequence identity.
- Bias by identical sequences: Growing databases result in an explosion of highly redundant information. This has recently (1996-7) led to the situation where the previous rule 'the more sequences, the better' is not applicable anymore. Instead, you should leave out some (or all) family members in the high homology (>70%) region, in particular when there are not many rather diverged sequences present. Furthermore, the current version of PROFPHD does not handle redundant information, i.e., when you have two proteins A and B of say 40% sequence identity to your query, and when A and B are highly similar (>90% sequence identity to one another), you should leave out one of the two from the alignment you use for the prediction!
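The advice on redundant sequences can be sketched as a simple greedy filter. The Python below is a minimal illustration under stated assumptions, not the filter used by the server: identity is computed over alignment columns where neither sequence has a gap ('-'), the function names are hypothetical, and the toy sequences are invented for the example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def drop_redundant(aligned: dict, max_identity: float = 90.0) -> list:
    """Greedy filter: keep a sequence only if it is not more than
    max_identity percent identical to any sequence already kept."""
    kept = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= max_identity for k in kept):
            kept.append(name)
    return kept


# Toy alignment (hypothetical sequences); the query comes first so it is always kept.
alignment = {
    "query": "ACDEFGHIKLMNPQRSTVWY",
    "homA":  "ACDEFGHIKLAAAAAAAAAA",  # 50% identical to the query
    "homB":  "ACDEFGHIKLAAAAAAAAAC",  # 95% identical to homA, so it is dropped
    "homC":  "ACDEFGHIKLMNPQAAAAAA",  # 70% identical to the query, 80% to homA
}
print(drop_redundant(alignment))
```

Because the filter is greedy, the order of the input decides which member of a redundant pair survives; placing the query first guarantees that it is retained.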
Cut-off For including Homologues in Alignment
In the multiple sequence alignment returned to you, only homologues down to levels of 30% pairwise sequence identity over 80 or more residues are included. This cut-off is five percentage points above the threshold for structural homology (Sander & Schneider, 1990), in an attempt to stay clearly off the twilight zone of sequence similarity, and provide high-quality multiple alignments in an automated fashion.
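Expressed as code, the cut-off is a two-part test on each candidate homologue. The following Python sketch only illustrates the rule stated above; the `Hit` record, its field names and the example values are hypothetical, and treating exactly 30% as included is an assumption.

```python
from typing import NamedTuple

MIN_IDENTITY = 30.0  # percent pairwise sequence identity (cut-off from the text)
MIN_LENGTH = 80      # aligned residues (cut-off from the text)


class Hit(NamedTuple):
    name: str
    identity: float      # percent identity to the query
    aligned_length: int  # number of aligned residues


def passes_cutoff(hit: Hit) -> bool:
    """True if the hit would be included under the 30% / 80-residue rule."""
    return hit.identity >= MIN_IDENTITY and hit.aligned_length >= MIN_LENGTH


# Hypothetical hits for illustration.
hits = [
    Hit("close_homologue", 62.0, 140),
    Hit("twilight_zone", 27.0, 150),   # below 30% identity
    Hit("short_match", 45.0, 60),      # fewer than 80 aligned residues
]
print([h.name for h in hits if passes_cutoff(h)])
```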
Quality of Multiple Sequence Alignment
On average, more residues are falsely aligned for lower levels of pairwise sequence identity. Down to levels of about 30%, the automatic MaxHom alignments are usually quite accurate. However, for many families there are regions for which the 'correct' alignment is, in principle, not well defined. One way to spot such regions is the stability of the alignment with respect to including or excluding some of the aligned sequences. By providing different lists of sequences ("input option 'PIR list'") you can monitor the stability of the alignment. Often such regions may form surface loops. Predictions may be less accurate in such regions.
PROFphd Minimal Length of Sequences Limitation
The PROFphd programs treat N- and C-terminal ends of proteins as solvent molecules. The size of the input window for predicting 1D structure is up to 17 residues. Thus, the first and the last 17 residues of your sequence will 'see solvent'. Especially for short fragments that you cut out from large proteins, this may result in false predictions.
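How the chain ends come to 'see solvent' can be illustrated with a sliding window. The Python sketch below assumes a single window of 17 residues centred on the residue being predicted and uses '-' as a hypothetical padding symbol for solvent; the actual input encoding of the PROFphd networks is not described here and may differ.

```python
WINDOW = 17  # window size quoted in the text
PAD = "-"    # hypothetical symbol standing in for solvent beyond the chain ends


def input_windows(sequence: str, size: int = WINDOW, pad: str = PAD) -> list:
    """Return one window per residue, centred on that residue and padded at the ends."""
    half = size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + size] for i in range(len(sequence))]


# First 20 residues of the example sequence shown earlier in this documentation.
fragment = "KELVLALYDYQEKSPREVTM"
windows = input_windows(fragment)
print(windows[0])   # window for the first residue: half of it is padding
print(sum(PAD in w for w in windows), "of", len(windows), "windows contain padding")
```

In this single-window sketch only the outermost 8 positions on each side contain padding; stacking such windows in a multi-level system widens that margin. For a 20-residue fragment most windows are affected, which is why very short fragments are problematic.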
Insertions in Multiple Sequence Alignment
- Insertions in guide sequence: Do NOT use insertions for the guide sequence when you supply your alignment to be used as input for the predictions ("input option 'MSF format'"). In the current implementation, PROFPHD will treat such insertions as if the corresponding positions were occupied by solvent. This may lead to particular prediction errors!
- Split alignment into domains: If your alignment (of say 20 sequences) contains long (> 10 residues) regions for which only very few sequences do not have insertions (in positions R1-R2), split the alignment into fragments that are not full of insertions for all sequences. For the problematic region (R1-R2) it may be better to include only those sequences without insertions. The existence of such regions may indicate that the protein contains various domains (one for residues < R1, another for residues > R2). When you submit your alignment in fragments, mind the minimal length of sequences (see above).
Untypical proteins
- Globular, water-soluble proteins. The PROFPHD neural networks have been trained on proteins with typical features as contained in the database of known protein structures (PDB). Thus, accuracy may be lower if the methods are applied to other proteins. For instance, PROFsec (secondary structure) correctly predicts only about 50% of the residues in transmembrane helices of integral membrane proteins. However, the network system trained on transmembrane proteins (PHDhtm) predicts residues in transmembrane helices on average at a level of well above 90% accuracy. In general, the PROFPHD methods learn to extract characteristic features of currently known protein structures. Problematic cases are proteins with many cysteine bridges that stabilize the particular protein structure, or proteins for which the structure is stabilised by functional constraints (co-factors).
- Transmembrane proteins. PHDhtm for globular proteins. The rate of false positives, i.e., of globular, water-soluble proteins for which PHDhtm predicts transmembrane helices, is in the order of 5%. Such false positive predictions occur more often for structures with very hydrophobic beta-strands. Consequently, a prediction of transmembrane helices for a globular protein may indicate the existence of very hydrophobic beta-strands. PROFPHD for porin-like beta structures. For the beta-strand transmembrane protein, porin, the accuracy of PROFsec was below the expected average (60%), but it was higher than the average for helical transmembrane proteins (50%). The explanation may be that the barrels formed by porins share features of globular, water-soluble proteins and thus can be predicted relatively well. MaxHom alignments for transmembrane proteins. The alignment procedure MaxHom is optimized on globular water-soluble proteins. For transmembrane proteins, the alignments of the more hydrophobic transmembrane segments may require changes in the alignment details. Furthermore, in particular in transmembrane regions, often more distantly related sequences could be aligned by hand based on, e.g., hydrophobicity analyses. Unfortunately, we do not yet provide such refinements of the alignment automatically.
- Multi-domain proteins. The accuracy for predicting solvent accessibility (PROFacc) for single-domain proteins is higher than for multi-domain proteins. Predictions are more likely to be wrong at interfaces between domains. This shortcoming may be used to predict inter-domain interfaces in regions where PROFacc predicts buried residues that would otherwise not be compatible with your guess about the fold of the protein.
- Novel folds are NOT 'untypical proteins'. The expected prediction accuracy for the PROFPHD programs has been re-evaluated several times over the last years. So far, the results have always confirmed our estimates (Rost & Sander, 1995, Proteins, 23, 295-300).
Prediction of transmembrane helices (HTM's) and topology
- False positives: globular proteins predicted with HTM's. By default we search for possible transmembrane helices in your sequence. The rate of false positive detection (i.e. proteins falsely predicted to contain transmembrane helices) is about 1.6%. Thus, a reported transmembrane segment may just indicate a rather hydrophobic patch in a globular protein. If you explicitly request a prediction of transmembrane helices (HTM's), we assume that you know the protein to contain HTM's and consequently apply a lower threshold to eliminate false positives.
- Refined prediction of transmembrane helices and topology. By default we use the neural network system PHDhtm and an empirical filter to predict the locations of transmembrane helices. A refined (more accurate) version of that program, as well as the prediction of transmembrane topology (orientation of the N-terminal non-transmembrane region with respect to the cell), is available upon request ("predict htm topology"). All predicted HTM's are sorted according to the reliability of the prediction. The reliability index provided may help experts to spot falsely predicted HTM's. Note: try NOT to provide sequences that start or end with HTM regions as this may result in wrong topology predictions!
Reliability indices for PROFPHD predictions
The reliability indices of the PROFPHD methods correlate well with prediction accuracy. In other words, residues predicted with high reliability (0 = low, 9 = high) are more likely to be predicted correctly. However, when basing the prediction on single sequences (rather than multiple alignments) the scale has to be shifted. For instance, values of RI > 4 usually imply an expected accuracy of > 80% for PROFsec. When using a single sequence as input the same level of accuracy is reached only for residues predicted at RI > 7.
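The thresholds above can be applied mechanically. The following minimal sketch assumes the reliability indices are available as a string with one digit (0-9) per residue; that input format and the function name are assumptions for illustration, not a description of the actual PROFPHD output parser.

```python
from typing import List


def reliable_positions(ri: str, single_sequence: bool = False) -> List[int]:
    """Return 1-based positions whose reliability index exceeds the threshold.

    Thresholds follow the text: RI > 4 for alignment-based predictions,
    RI > 7 when the prediction was based on a single sequence.
    """
    threshold = 7 if single_sequence else 4
    return [i for i, digit in enumerate(ri, start=1) if int(digit) > threshold]
```

For example, `reliable_positions("0123456789")` keeps positions 6 to 10, whereas `reliable_positions("0123456789", single_sequence=True)` keeps only positions 9 and 10.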
Combination of results with that of other methods
A combination of two prediction methods is likely to improve the accuracy only if the following points are met:
- The predictions are based on methods using independent information, e.g., prediction-based threading and potential-based threading.
- The accuracy of the two methods is comparable, e.g., NOT for combining Chou-Fasman (about 50% accuracy) and PROFsec (> 72% accuracy).
Say you want to focus on the most likely secondary structure segments. You may hope that the best predicted segments are those for which method X and PROFsec agree. This may or may not be correct. It may be more reasonable to identify such regions based on the reliability index provided by PROFPHD. The PROFPHD methods have been tailored to provide a reasonable estimate for the reliability of the prediction, whereas a combination of two arbitrary prediction methods yields improvements only at random, at best.
Homologue of known structure. Ab-initio prediction (by e.g. PROFPHD) is, in general, less accurate than homology modeling. Thus, if we find a protein of known structure that has > 25% pairwise sequence identity to your sequence, you ought to make use of the known structure by homology modeling.
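To check the 25% criterion on a pairwise alignment of your own, a sketch such as the one below can be used. It assumes two already-aligned sequences of equal length with '-' or '.' as gap characters, and it defines identity as identical residues divided by the number of positions aligned without gaps; the service may compute identity differently, so treat the function names and the definition as assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap_chars: str = "-.") -> float:
    """Percentage of identical residues over positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in gap_chars or b in gap_chars:
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def suggests_homology_modeling(aligned_a: str, aligned_b: str, cutoff: float = 25.0) -> bool:
    """True if the pairwise identity exceeds the cutoff quoted in the text (> 25%)."""
    return percent_identity(aligned_a, aligned_b) > cutoff
```

Note that identity over a very short aligned stretch is much less meaningful than the same value over a long one, so the length of the alignment should be considered alongside the percentage.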
Credits
- Burkhard Rost (Technische Universität München) pioneered the PredictProtein service; wrote the PHD prediction programs: PHDsec, PHDacc, PHDhtm, and PHDtopology; developed the prediction-based threading method PHDthreader/TOPITS; programmed and proposed the scores compiled by EvalSec; hacked some of the PredictProtein scripts; is helping to keep the service up and running; and is responsible for the documentation.
- Guy Yachdav (Columbia Univ, New York) is currently maintaining the PredictProtein server and providing on-going support.
- Laszlo Kajan (Technische Universität München) - visualPP, packaging of PredictProtein and subcomponents.
- Others who contributed (in the past):
- Reinhard Schneider (now EMBL, Heidelberg) wrote the program MaxHom for multiple sequence and 1D structure alignments, and helped the service take off.
- Antoine de Daruvar (now Univ. Bordeaux and LION, Heidelberg) rewrote the scripts managing requests, and kept the service running for twelve months.
- Volker Eyrich (Chemistry Dept, Columbia Univ, New York) wrote the scripts for the META server, which allows you to access a large variety of selected servers world-wide from a single-page interface.
- Jinfeng Liu provided scientific support for the PredictProtein server.
- Chris Sander (now Memorial Sloan Kettering Medical Center) organized resources, contributed ideas, and simulated the grand guru.
- Authors of other programs used:
- Stephen F Altschul and Samuel Karlin wrote the initial database search bestseller BLAST.
- Stephen F Altschul and colleagues wrote the recent hit PSI-BLAST.
- Amos Bairoch maintains SWISS-PROT, and initialised ProSite (as well as numerous other services!).
- Amos Bairoch, Philip Bucher, Kay Hofmann maintain ProSite, and wrote the scripts returning the ProSite output.
- Andrei Lupas wrote the program COILS (detection of coiled-coil regions). Note: the latest version of the source code was provided by Rob Russell.
- Reinhard Schneider wrote the program MaxHom (multiple sequence alignment).
- John C Wootton and Scott Federhen wrote the classic program for detecting regions of low-complexity (composition bias).
- Malin M Young, Kent Kirshenbaum, Ken Dill, and Stefan Highsmith wrote the program for detecting regions likely to undergo structural switches (ASP).
- Jinfeng Liu wrote CHOP and NORS.
- Dariusz Przybylski developed fold recognition.
- Rajesh Nair, sub-cellular localization.
- Yanay Ofran, protein-protein and protein-DNA binding sites.
- Avner Schlessinger, protein disorder.
- Marco Punta, inter-residue contacts.
- Andrea Passerini, disulphide bridges.
- Yana Bromberg, functional changes effected by SNPs.
- Henry Bigelow, beta-barrel prediction.
Comment: if you find the following list too long, please note that PredictProtein combines a large number of independent tools. Where other servers adopt the philosophy 'one mail/WWW interface per tool' (for a few examples), PP attempts to provide one single interface to many tools. Many of the tools are at the forefront of bio-informatics (academics may 'pay' these services by quoting).
Links to more literature related to PP.
PredictProtein: B Rost, G Yachdav and J Liu (2004) The PredictProtein Server. Nucleic Acids Research 32(Web Server issue):W321-W326.
- Author: B Rost
- URL: http://www.predictprotein.org
- Description: PredictProtein is the collective name for all prediction programs run.
PROSITE: A Bairoch, P Bucher & K Hofmann (1997) Nucleic Acids Research, 25:217-221
- Author: A Bairoch, P Bucher & K Hofmann
- URL: http://www.expasy.ch/prosite
- Version: 99.07
- Description: PROSITE is a database of functional motifs. ScanProsite finds all functional motifs in your sequence that are annotated in the ProSite db.
SEG: J C Wootton & S Federhen (1996) Methods in Enzymology, 266:554-571
- Author: J C Wootton & S Federhen
- Version: 1994
- Description: SEG divides sequences into regions of low-, and high-complexity. Low-complexity regions typically correspond to 'simple sequences' or 'compositionally-biased' regions.
ProDom: ELL Sonnhammer & D Kahn (1994) Protein Science, 3:482-492
- Author: ELL Sonnhammer; J Gouzy, F Corpet, F Servant, D Kahn, firstname.lastname@example.org
- URL: http://protein.toulouse.inra.fr/prodom.html
- Version: 2000.1
- Description: ProDom is a database of putative protein domains. The database is searched with BLAST for domains corresponding to your protein.
PHD: B Rost (1996) Methods in Enzymology, 266:525-539
- Author: B Rost
- URL: http://www.predictprotein.org
- Description: PHD is a suite of programs predicting 1D structure (secondary structure, solvent accessibility) from multiple sequence alignments.
PHDsec: B Rost & C Sander (1993) J. of Molecular Biology, 232:584-599
- Author: B Rost
- Description: PHDsec predicts secondary structure from multiple sequence alignments.
PHDacc: B Rost & C Sander (1994) Proteins, 20:216-226
- Author: B Rost
- Description: PHDacc predicts per residue solvent accessibility from multiple sequence alignments.
PHDhtm: B Rost, P Fariselli & R Casadio (1996) Protein Science, 5:1704-1718
- Author: B Rost
- Description: PHDhtm predicts the location and topology of transmembrane helices from multiple sequence alignments.
PROF: B Rost (2004) Meth. Mol. Biol., submitted.
- Author: B Rost
- Description: PROF is a suite of programs predicting 1D structure (secondary structure, solvent accessibility) from multiple sequence alignments.
PROFsec: B Rost (2004) Meth. Mol. Biol., submitted.
- Author: B Rost
- Description: PROFsec predicts secondary structure from multiple sequence alignments.
PROFacc: B Rost (2004) Meth. Mol. Biol., submitted.
- Author: B Rost
- Contact: email@example.com
- Description: PROFacc predicts per residue solvent accessibility from multiple sequence alignments.
GLOBE: B Rost (1998) unpublished
- Author: B Rost
- Description: GLOBE predicts the globularity of a protein
DISULFIND: A Ceroni, P Frasconi, A Passerini & A Vullo (2004) Bioinformatics, 20:653-659
- Author: A Ceroni, P Frasconi, A Passerini & A Vullo
- Version: 1.0-rg2
- Description: DISULFIND is a disulphide bridge predictor based on a two-step process.
ASP (a conformational switch prediction program): M Young, K Kirshenbaum, KA Dill & S Highsmith (1999) Protein Science, 8:1752-1764
- Author: M Young, K Kirshenbaum, KA Dill & S Highsmith
- Version: 1.0
- Description: ASP finds regions that are most likely to behave as switches in proteins known to exhibit this behavior.
NORS, CHOP, InteractionSites, DISIS, NORSnet, PROFbval, MD, PROFcon, PROFtmb, SNAP, LOCtree
Copyright
The PredictProtein web service and the predictprotein software are copyrighted by the ROSTLAB. Commercial users should apply for a license.
Contact
Address questions, suggestions, bug reports, or comments to firstname.lastname@example.org
Please see our webpage for additional contact and support info.
Information Regarding Subscription
All sign-ups to PredictProtein from academia are free. If you want a premium service (to receive your predictions earlier or to get improved support), you can support us with a minimal fee (below). Subscriptions to the premium service for as little as $25 will give you many benefits:
- Powerful, Faster Processors - we completed testing a new server that cuts the processing and wait time of a query significantly. Queries on the new system are now completed within a few minutes.
- Additional Support - premium subscribers get priority technical and scientific support. Get help on parsing your results, integrating them with additional services, converting them to presentable graphics, and more.
- Storage space - you will no longer need to download your results immediately, instead we will store your results for three months so that you can conveniently and securely access them from anywhere.
- Improved user interface - we are continuously streamlining the user interface for better experience and more intuitive display.
Frequently asked questions about subscription to PredictProtein
Do I need to register to access PredictProtein?
Yes, you need to create an account to use PredictProtein. The account is free. Having an account allows us to help you more efficiently.
Which subscription should I pick?
All subscriptions to PredictProtein from academia are free. However, if you want a faster turn-around (to receive your predictions earlier) and improved support, you will have to pay. Users who want the premium services and may query the server more than that will need to consider the length of time they will be using the server; for instance, if your research calls for results within the coming weeks, we recommend you pick the shortest subscription period (3 months).
How do I renew my subscription?
The easiest way to renew a subscription is to check the recurring payment check-box during sign up. That way your subscription will be renewed automatically by the system without disrupting your work. In case you didn't sign up for an automatic renewal
How and when can I cancel my subscription?
You can cancel your membership at any time by going to the unsubscribe page.
I assumed PredictProtein was a free service, how come I need to pay?
You were right: PredictProtein is free for academia. Most users will find that they only need to sign up for the free account, and those who pay are charged a minimal fee. This minimal fee helps us to maintain the best possible service for all of you. Your money will contribute toward machine resources.
Do you automatically mail an invoice?
Yes. The system will automatically send an invoice to the subscriber's email in-box with full transaction details.
Which company sells the subscription to PredictProtein?
PredictProtein licenses are sold through Biosof LLC. The contact info for Biosof appears below.
Mailing address: 138 W. 25th Street, 10th floor, New York, NY 10001; telephone: +1 (212) 749-4294; email: email@example.com
How do I contact customer support if I have more questions?
The easiest way is to write to firstname.lastname@example.org for technical and customer service support.