




















CUBIC: NLProt / Help
| Who should use NLProt |
NLProt is intended for researchers who want to build databases fully or partially automatically. It finds protein names in natural-language text with high accuracy
and assigns database IDs (SWISS-PROT, TrEMBL) to the names it finds.
|
| Example Files |
|
| Input |
- Create a plain ASCII file on your machine containing the text you want to scan for protein names. Copy and paste the contents of this file into the text box on the Submit page and press
  the Submit Text button. Please note that your input text has to consist of full sentences, since the algorithm needs the context surrounding each protein name in order to work properly.
- Each request takes only a few seconds to finish. After that time, the output will appear on the screen.
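If the source text contains accented or other non-ASCII characters, it has to be converted before it can be pasted in as plain ASCII. The following is a minimal sketch of one way to do that with the Python standard library. It is not part of NLProt, and the function name `to_ascii` and the sample sentence are made up for illustration.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Return an ASCII-only approximation of `text` (hypothetical helper)."""
    # Split accented characters into a base letter plus a combining mark,
    # e.g. "ü" becomes "u" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop every character that has no ASCII representation,
    # which removes the combining marks left by the step above.
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "Müller et al. showed that p53 binds MDM2 in naïve cells."
print(to_ascii(sample))
```

Note that characters without an ASCII base letter, such as Greek letters, are dropped silently by this approach. Since such characters often occur in protein names, it is safer to replace them by hand (for example, "beta" for "β") before converting.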
|
| Output |
The output of the program is either an ASCII or an HTML file, depending on the user's preferences. It contains the tagged input text (in HTML format, names are shown in red) followed by a detailed table listing all found (tagged) names. Each found name is listed
together with its position, its score and, in some cases, a database ID (SWISS-PROT, TrEMBL).
In ASCII output, the <n> tag marks the beginning of a protein name and the </n> tag marks its end. In the table at the end of the output file, TXT-POS is the position of the name in the text, SCORE is the score
NLProt assigned to the name, and METHOD is the method by which the name was found. METHOD can take the following values:
- SVM: the name was found by the SVM system.
- projected: the name was found by the SVM system, but at a different position in the text (the name was thus 'projected' onto the rest of the text).
- dictionary: the name is a long name that was found in the dictionary (a long name that also appears in the dictionary is a strong indication of a protein name).
- abbr.-ext.: the name is the long form of an abbreviation that was found by the SVM system.
Additionally, NLProt searches the text for tissue types and species names in order to assign the correct UniProt ID (SWISS-PROT or TrEMBL) to each found name.
In HTML output, tissues and species are marked in green and blue, respectively. In ASCII output, they are tagged with <t> and <s>, respectively.
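To post-process the ASCII output, the tagged names can be pulled out of the text with a regular expression. The following is a minimal sketch, not a tool shipped with NLProt. It assumes that tissue and species tags are closed in the same way as name tags (with a slash before the tag letter), that tags are not nested, and that whitespace inside the angle brackets may or may not be present. The function name `extract_tagged` and the sample sentence are made up for illustration, and the table at the end of the output is not parsed because its exact layout is not described here.

```python
import re

# Tag letters as described in the help text:
# n = protein name, t = tissue, s = species.
TAG_LABELS = {"n": "protein", "t": "tissue", "s": "species"}

# Matches an opening tag, the tagged text, and the matching closing tag.
# Optional whitespace inside the brackets is tolerated (assumption).
TAG_PATTERN = re.compile(r"<\s*([nts])\s*>(.*?)<\s*/\s*\1\s*>", re.DOTALL)


def extract_tagged(text: str) -> dict:
    """Collect tagged spans from ASCII output (hypothetical helper)."""
    found = {label: [] for label in TAG_LABELS.values()}
    for match in TAG_PATTERN.finditer(text):
        letter, content = match.group(1), match.group(2)
        # Collapse line breaks and repeated spaces inside a tagged span.
        found[TAG_LABELS[letter]].append(" ".join(content.split()))
    return found


# Made-up sample line in the assumed tag format.
sample = "Expression of <n>p53</n> was measured in <s>mouse</s> <t>liver</t>."
print(extract_tagged(sample))
```

The result is a dictionary with one list each for proteins, tissues and species, in the order in which they occur in the text.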