NER Resources

Currently 20 NER resources
 
Description
Machine-learning-based NER system for tagging biological entities (genes, proteins, cell lines, cell types, RNA, DNA) in text. It is based on conditional random fields (CRFs) and trained on the NLPBA and BioCreative corpora. It is implemented in Java and can be used locally. 
Abstract
ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art. It also includes a Java application programming interface allowing users to incorporate ABNER into their own systems and train models on new corpora. (PMID:15860559) 
Input
Free text (local); Sentences 
Output
Gene/protein names; Bio-entity tagged text; Semantically labelled text; Gene/Protein labelled text
State
Online
Download:
Web service:
 
Description
BANNER is a named entity recognition system, primarily intended for biomedical text. It is a machine-learning system based on conditional random fields and contains a wide survey of the best features in recent literature on biomedical named entity recognition (NER). BANNER is portable and is designed to maximize domain independence by not employing semantic features or rule-based processing steps. It is therefore useful to developers as an extensible NER implementation, to researchers as a standard for comparing innovative techniques, and to biologists requiring the ability to find novel entities in large amounts of text. This system has been compared to the official results of the Second BioCreative Challenge Evaluation as well as to other applications such as ABNER, LingPipe and NERBio. 
Abstract
BANNER: an executable survey of advances in biomedical named entity recognition. There has been an increasing amount of research on biomedical named entity recognition, the most basic text extraction problem, resulting in significant progress by different research teams around the world. This has created a need for a freely-available, open source system implementing the advances described in the literature. In this paper we present BANNER, an open-source, executable survey of advances in biomedical named entity recognition, intended to serve as a benchmark for the field. BANNER is implemented in Java as a machine-learning system based on conditional random fields and includes a wide survey of the best techniques recently described in the literature. It is designed to maximize domain independence by not employing brittle semantic features or rule-based processing steps, and achieves significantly better performance than existing baseline systems. It is therefore useful to developers as an extensible NER implementation, to researchers as a standard for comparing innovative techniques, and to biologists requiring the ability to find novel entities in large amounts of text. (PMID:18229723) 
Input
Free text (paste); Sentences 
Output
Confidence score; Gene/protein names; Bio-entity tagged text; Acronyms/Abbreviations 
State
Online
Download:
Web service:
 
Description
BIGNER (Background Information driven Gene Named Entity Recognizer) is a system for automatically tagging gene and protein mentions in biomedical literature. The core of the system is a dictionary generated by semi-supervised learning from a large amount of unlabeled biomedical text. Two models are provided: (a) maximum match based on the dictionary, and (b) a combination of the dictionary and a conditional random field (CRF) model.
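The dictionary-based maximum match in model (a) can be illustrated with a minimal sketch. This is not BIGNER's code: the function name, the toy dictionary and the `max_len` window are assumptions made for illustration, and the real dictionary is learned from unlabeled text rather than written by hand.

```python
def forward_max_match(tokens, dictionary, max_len=5):
    """Greedy left-to-right longest-match tagging.

    `dictionary` is a set of lower-cased token tuples (hypothetical format).
    Returns a list of (start, end) token spans, end exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate starting at i first, then shrink it.
        for j in range(min(len(tokens), i + max_len), i, -1):
            candidate = tuple(t.lower() for t in tokens[i:j])
            if candidate in dictionary:
                match_end = j
                break
        if match_end is None:
            i += 1                      # no entry starts here; move on
        else:
            spans.append((i, match_end))
            i = match_end               # continue after the matched mention
    return spans


# Toy dictionary, invented for this example only.
GENE_DICT = {
    ("brca1",),
    ("p53",),
    ("tumor", "necrosis", "factor"),
    ("tumor", "necrosis", "factor", "alpha"),
}

tokens = "Mutations in BRCA1 alter tumor necrosis factor alpha signalling".split()
spans = forward_max_match(tokens, GENE_DICT)
mentions = [" ".join(tokens[s:e]) for s, e in spans]
print(mentions)
```

Because the longest candidate is tried first, "tumor necrosis factor alpha" is tagged as one mention rather than stopping at the shorter entry "tumor necrosis factor".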
Abstract
Incorporating rich background knowledge for gene named entity classification and recognition. ABSTRACT: BACKGROUND: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, have played a central part. However, this type of feature tends to cause extreme sparseness in feature space. As a result, out-of-vocabulary (OOV) terms in the training data are not modeled well due to lack of information. RESULTS: We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher level features using term frequency and co-occurrence information of highly indicative features in huge amount of unlabeled data. We examine its performance in a named entity classification task, which is designed to remove non-gene entries in a large dictionary derived from online resources. The results show that new features generated by FCG outperform lexical features by 5.97 F-score and 10.85 for OOV terms. Also in this framework each extension yields significant improvements and the sparse lexical features can be transformed into both a lower dimensional and more informative representation. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on BioCreative 2 GM test set. Then we combined the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, which improves the performance of the CRF-based tagger by 4.46 with little impact on the efficiency of the recognition system. A demo of the NER system is available at http://202.118.75.18:8080/bioner. (PMID:19615051) 
Input
Free text (paste); Free text (local); Sentences 
Output
Gene/protein names; Bio-entity tagged text; Semantically labelled text; Gene/Protein labelled text
State
Online
Download:
Web service:
 
Description
BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to UniProt Knowledgebase (UniProtKB) protein entries (Liu et al., 2006a, 2006b). It covers all UniProtKB protein entries and consists of several million names extracted from multiple resources based on database cross-references in iProClass. The web site allows the retrieval of synonymous names of given protein entries and the identification of ambiguous names shared by multiple proteins.
Abstract
BioThesaurus: a web-based thesaurus of protein and gene names. BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to protein entries in the UniProt Knowledgebase. Currently covering more than two million proteins, BioThesaurus consists of over 2.8 million names extracted from multiple molecular biological databases according to the database cross-references in iProClass. The BioThesaurus web site allows the retrieval of synonymous names of given protein entries and the identification of protein entries sharing the same names. AVAILABILITY: BioThesaurus is accessible for online searching at http://pir.georgetown.edu/iprolink/biothesaurus (PMID:16267085) 
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
KEX
 
Description
KEX (Knowledge EXtraction) is a protein name annotation tool based on PROPER (PROtein Proper-noun Extraction Rules). Input files should be plain text or in 'MEDLINE report' format. The output file uses a one_sentence-one_line format, with protein names annotated with special mark-ups. KEX can be downloaded and was tested on Solaris, DEC and IRIX. You need a C compiler (preferably gcc) and Perl version 5.
Abstract
PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. MOTIVATION: Since their initial development, integration and construction of databases for molecular-level data have progressed. Though biological molecules are related to each other and form a complex system, the information is stored in the vast archives of the literature or in diverse databases. There is no unified naming convention for biological object, and biological terms may be ambiguous or polysemic. This makes the integration and interaction of databases difficult. In order to eliminate these problems, machine-readable natural language resources appear to be quite promising. We have developed a workbench for protein name abbreviation dictionary (PNAD) building. RESULTS: We have developed PNAD Construction Support System (PNAD-CSS), which offers various convenient facilities to decrease the construction costs of a protein name abbreviation dictionary of which entries are collected from abstracts in biomedical papers. The system allows the users to concentrate on higher level interpretation by removing some troublesome tasks, e.g. management of abstracts, extracting protein names and their abbreviations, and so on. To extract a pair of protein names and abbreviations, we have developed a hybrid system composed of the PROPER System and the PNAD System. The PNAD System can extract the pairs from parenthetical-paraphrases involved in protein names, the PROPER System identified these pairs, with 98.95% precision, 95.56% recall and 97.58% complete precision. AVAILABILITY: PROPER System is freely available from http://www.hgc.inc.u-tokyo.ac.jp/service/tooldoc/KeX/intro.html. The other software are also available on request. Contact the authors. CONTACT: mikio@ims.u-tokyo.ac.jp (PMID:10842739)
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
LingPipe is a suite of Java NLP tools that includes a named-entity detector, an approximate dictionary-match named-entity detector, a heuristic sentence boundary detector, a heuristic within-document coreference resolution engine and a set of tools for MEDLINE data.
Abstract
Phrasal Queries with LingPipe and Lucene: Ad Hoc Genomics Text Retrieval. The hypothesis we explored for the Ad Hoc task of the Genomics track for TREC 2004 was that phrase-level queries would increase precision over a baseline of token-level terms. We implemented our approach using two open source tools: the Apache Jakarta Lucene TF/IDF search engine (version 1.3) and the Alias-i LingPipe tokenizer and named entity annotator (version 1.0.6). Contrary to our intuitions, the baseline system provided better performance in terms of recall and precision for almost every query at almost every precision/recall operating point. (No PubMed ref.) 
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
MaxMatcher is a biological concept extraction tool that uses dictionary-based approximate matching. The UMLS 2004AA release is used as the dictionary. Precision and recall on the GENIA 3.02 corpus are 71.60% and 75.18%, respectively.
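For readers who prefer a single figure, the reported precision and recall can be combined into a balanced F-score. The value below is derived here from the two reported numbers and is not stated in the MaxMatcher description; the helper name is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported for MaxMatcher on GENIA 3.02, in percent.
f1 = f1_score(71.60, 75.18)
print(f"{f1:.2f}")  # about 73.35
```

Scores measured on different corpora are not directly comparable, so this figure should not be set against the F-scores quoted for other tools in this list.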
Abstract
Using Concept-Based Indexing to Improve Language Modeling Approach to Genomic IR. Genomic IR, characterized by its highly specific information need, severe synonym and polysemy problem, long term name and rapid growing literature size, is challenging IR community. In this paper, we are focused on addressing the synonym and polysemy issue within the language model framework. Unlike the ways translation model and traditional query expansion techniques approach this issue, we incorporate concept-based indexing into a basic language model for genomic IR. In particular, we adopt UMLS concepts as indexing and searching terms. A UMLS concept stands for a unique meaning in the biomedicine domain; a set of synonymous terms will share same concept ID. Therefore, the new approach makes the document ranking effective while maintaining the simplicity of language models. A comparative experiment on the TREC 2004 Genomics Track data shows significant improvements are obtained by incorporating concept-based indexing into a basic language model. The MAP (mean average precision) is significantly raised from 29.17% (the baseline system) to 36.94%. The performance of the new approach is also significantly superior to the mean (21.72%) of official runs participated in TREC 2004 Genomics Track and is comparable to the performance of the best run (40.75%). Most official runs including the best run extensively use various query expansion and pseudo-relevance feedback techniques while our approach does nothing except for the incorporation of concept-based indexing, which evidences the view that semantic smoothing, i.e. the incorporation of synonym and sense information into the language models, is a more standard approach to achieving the effects traditional query expansion and pseudo-relevance feedback techniques target. (PMID:) 
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
MetaMap is a program for mapping biomedical text to concepts in the UMLS Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text. MetaMap uses a knowledge intensive approach based on symbolic, natural language processing (NLP) and computational linguistic techniques. Because of the intensiveness of its computations, it is not appropriate for real time processing. On the other hand, it is thorough and is particularly adept at constructing partial matches when a phrase cannot be described by a single concept. MetaMap has been used to support tasks such as information retrieval, text mining, literature-based discovery, document indexing, classification and question answering. MetaMap's output normally consists of the best 'mappings' for input text phrases, i.e., sets of Metathesaurus concepts which best match the input. Intermediate results, also available for output, consist of ranked lists of concepts (keywords, gene/protein names, ..., i.e., any concept in the UMLS Metathesaurus), a shallow parse of the text and a list of author-defined acronyms/abbreviations. MetaMap has been used by a range of different tools and applications, some of which are listed in the keyword section. MetaMap is one of the primary components of the NLM Medical Text Indexer (MTI), which is in daily use assisting NLM indexers in creating the MeSH indexing for MEDLINE citations.
Abstract
Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. The UMLS Metathesaurus, the largest thesaurus in the biomedical domain, provides a representation of biomedical knowledge consisting of concepts classified by semantic type and both hierarchical and non-hierarchical relationships among the concepts. This knowledge has proved useful for many applications including decision support systems, management of patient records, information retrieval (IR) and data mining. Gaining effective access to the knowledge is critical to the success of these applications. This paper describes MetaMap, a program developed at the National Library of Medicine (NLM) to map biomedical text to the Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text. MetaMap uses a knowledge intensive approach based on symbolic, natural language processing (NLP) and computational linguistic techniques. Besides being applied for both IR and data mining applications, MetaMap is one of the foundations of NLM's Indexing Initiative System which is being applied to both semi-automatic and fully automatic indexing of the biomedical literature at the library. (PMID:11825149) 
Input
Free text (paste); Free text (upload); Free text (local) 
Output
Ranked list; Nr. documents; Confidence score; Keyword; Gene/protein names; Gene/protein identifiers; POS labelled text; Parses; Acronyms/Abbreviations; Ranked gene/protein lists; Geographical locations
State
Online
Download:
Web service:
 
Description
mSTRAP is an automated workflow for mining mutation annotations from full-text biomedical literature using natural language processing (NLP) techniques as well as for their subsequent reuse in protein structure annotation and visualization. This system, called mSTRAP™ (Mutation extraction and STRucture Annotation Pipeline), is designed for both information aggregation and subsequent brokerage of the mutation annotations. It facilitates the coordination of semantically related information from a series of text mining and sequence analysis steps into a formal OWL-DL ontology. The ontology is designed to support application-specific data management of sequence, structure, and literature annotations that are populated as instances of object and data type properties. mSTRAPviz™ is a subsystem that facilitates the brokerage of structure information and the associated mutations for visualization. For mutated sequences without any corresponding structure available in the Protein Data Bank (PDB), an automated pipeline for homology modeling is developed to generate the theoretical model. With mSTRAP, we demonstrate a workable system that can facilitate automation of the workflow for the retrieval, extraction, processing, and visualization of mutation annotations, tasks that are well known to be tedious, time-consuming, complex, and error-prone. To run this system, you need to register (name and e-mail) and install Java (version 1.6 or above), ClustalW and MODELLER (version 9v2).
Abstract
A workflow for mutation extraction and structure annotation. Rich information on point mutation studies is scattered across heterogeneous data sources. This paper presents an automated workflow for mining mutation annotations from full-text biomedical literature using natural language processing (NLP) techniques as well as for their subsequent reuse in protein structure annotation and visualization. This system, called mSTRAP (Mutation extraction and STRucture Annotation Pipeline), is designed for both information aggregation and subsequent brokerage of the mutation annotations. It facilitates the coordination of semantically related information from a series of text mining and sequence analysis steps into a formal OWL-DL ontology. The ontology is designed to support application-specific data management of sequence, structure, and literature annotations that are populated as instances of object and data type properties. mSTRAPviz is a subsystem that facilitates the brokerage of structure information and the associated mutations for visualization. For mutated sequences without any corresponding structure available in the Protein Data Bank (PDB), an automated pipeline for homology modeling is developed to generate the theoretical model. With mSTRAP, we demonstrate a workable system that can facilitate automation of the workflow for the retrieval, extraction, processing, and visualization of mutation annotations -- tasks which are well known to be tedious, time-consuming, complex, and error-prone. The ontology and visualization tool are available at (http://datam.i2r.a-star.edu.sg/mstrap). (PMID:18172931) 
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
MutationFinder is a tool for automatically extracting point mutations of amino acid residues from the literature. It can be downloaded and run over large collections of abstracts. According to its authors, it achieves nearly perfect precision and markedly improved recall over a baseline on blind test data.
Abstract
MutationFinder: a high-performance system for extracting point mutation mentions from text. Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline. AVAILABILITY: MutationFinder, along with a high-quality gold standard data set, and a scoring script for mutation extraction systems have been made publicly available. Implementations, source code and unit tests are available in Python, Perl and Java. MutationFinder can be used as a stand-alone script, or imported by other applications. PROJECT URL: http://bionlp.sourceforge.net. (PMID:17495998) 
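To give a feel for what a rule-based point mutation extractor does, here is a minimal sketch using two regular expressions. These are not MutationFinder's actual rules, which are considerably more extensive; the pattern names, the function `find_point_mutations` and the sample sentence are assumptions made for illustration. The sketch only recognises the compact forms "A123T" and "Gly12Val".

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "".join(sorted(THREE_TO_ONE.values()))
THREE = "|".join(THREE_TO_ONE)

# Wild-type residue, 1-based position, mutant residue.
ONE_LETTER = re.compile(rf"\b([{ONE}])([1-9]\d*)([{ONE}])\b")
THREE_LETTER = re.compile(rf"\b({THREE})([1-9]\d*)({THREE})\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples in order of appearance."""
    found = []
    for m in ONE_LETTER.finditer(text):
        wt, pos, mut = m.groups()
        if wt != mut:  # identical residues are not a substitution
            found.append((m.start(), wt, int(pos), mut))
    for m in THREE_LETTER.finditer(text):
        wt, pos, mut = m.groups()
        if wt != mut:
            found.append((m.start(), THREE_TO_ONE[wt], int(pos), THREE_TO_ONE[mut]))
    return [(wt, pos, mut) for _, wt, pos, mut in sorted(found)]


sample = "The A123T and Gly12Val substitutions reduced activity, unlike H2O."
mutations = find_point_mutations(sample)
print(mutations)
```

Restricting both patterns to valid amino acid codes keeps tokens such as "H2O" from being reported, though short one-letter forms can still collide with cell line or gene names in real text.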
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service: