Distinguishing protein-coding from non-coding RNAs through support vector machines.

Title: Distinguishing protein-coding from non-coding RNAs through support vector machines.
Publication Type: Journal Article
Year of Publication: 2006
Authors: Liu, J, Gough, J, Rost, B
Journal: PLoS Genet
Date Published: 2006 Apr
Keywords: Animals; Databases, Nucleic Acid; Expressed Sequence Tags; Genetic Code; Genetic Variation; Genetic Vectors; Mice; Models, Genetic; Models, Theoretical; Molecular Sequence Data; Protein Biosynthesis; Proteins; Reproducibility of Results; RNA; RNA, Messenger; ROC Curve; Transcription, Genetic

RIKEN's FANTOM project has revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. Ever more transcripts that do not code for proteins have been identified by transcriptome studies, in general. Increasing evidence points to the important cellular roles of such non-coding RNAs (ncRNAs). The distinction of protein-coding RNA transcripts from ncRNA transcripts is therefore an important problem in understanding the transcriptome and carrying out its annotation. Very few in silico methods have specifically addressed this problem. Here, we introduce CONC (for "coding or non-coding"), a novel method based on support vector machines that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy. Nucleotide frequencies are also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, ncRNAs from RNAdb and NONCODE the true negatives. Ten-fold cross-validation suggested that CONC distinguished coding RNAs from ncRNAs at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM3 dataset, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.
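
As an illustration of the sequence-derived features listed in the abstract, the following minimal Python sketch computes three of them for a putative peptide: its length, its amino acid composition, and a compositional entropy. It is not the CONC implementation. The function names are hypothetical, and taking compositional entropy to be the Shannon entropy of the amino acid frequencies is an assumption, since the abstract does not define it. Features that require external predictors or database searches (secondary structure, exposed residues, homolog counts, alignment entropy) are left out.

```python
import math
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(peptide: str) -> dict[str, float]:
    """Fraction of each standard amino acid; other symbols are ignored."""
    counts = Counter(peptide.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: counts[a] / total for a in AMINO_ACIDS}


def compositional_entropy(peptide: str) -> float:
    """Shannon entropy in bits of the composition (assumed definition)."""
    fractions = composition(peptide).values()
    return -sum(p * math.log2(p) for p in fractions if p > 0)


def peptide_features(peptide: str) -> list[float]:
    """Hypothetical feature vector: length, entropy, then 20 frequencies."""
    comp = composition(peptide)
    return [float(len(peptide)), compositional_entropy(peptide)] + [
        comp[a] for a in AMINO_ACIDS
    ]


if __name__ == "__main__":
    # A made-up peptide, used only to show the shape of the output.
    features = peptide_features("MKTAYIAKQR")
    print(len(features), features[:2])
```

A vector of this kind, computed for each candidate translation of a transcript, is the sort of input a support vector machine classifier could be trained on, with labeled coding and non-coding examples supplying the two classes.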

Alternate Journal: PLoS Genet.
PubMed ID: 16683024
PubMed Central ID: PMC1449884
Grant List: R01-LM07329-01 / LM / NLM NIH HHS / United States
U54-GM074958-01 / GM / NIGMS NIH HHS / United States