NER Resources

Currently 20 NER resources
 
Description
Machine-learning-based NER system for tagging biological entities (genes, proteins, cell lines, cell types, RNA, DNA) in text. It is based on conditional random fields (CRFs) and trained on the NLPBA and BioCreative corpora. It is implemented in Java and can be used locally. 
Abstract
ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. ABNER (A Biomedical Named Entity Recognizer) is an open source software tool for molecular biology text mining. At its core is a machine learning system using conditional random fields with a variety of orthographic and contextual features. The latest version is 1.5, which has an intuitive graphical interface and includes two modules for tagging entities (e.g. protein and cell line) trained on standard corpora, for which performance is roughly state of the art. It also includes a Java application programming interface allowing users to incorporate ABNER into their own systems and train models on new corpora. (PMID:15860559) 
Input
Free text (local); Sentences 
Output
Gene/protein names; Bio-entity tagged text; Semantically labelled text; Gene/Protein labelled text
State
Online
Download:
Web service:
 
Description
Tool to tag biomedical abbreviations in text using XML. 
Abstract
Identifies abbreviations in text and marks them with XML tags. If the long form is given in the text or can be guessed from the document context, the tag surrounding the abbreviation contains the normalised form of the expansion. The system is written in Java and uses SVMlight. (No PubMed ref.)
Input
Free text (local); Sentences 
Output
Articles; Sentences; Abstracts; Acronyms/Abbreviations; Acronyms/Abbreviations tagged text
Reference
No reference, contact person: Sylvain Gaudan 
State
Online
Download:
Web service:
 
Description
AIIA::GMT is an XML-RPC client for the AIIA gene mention tagger, a web service that recognizes named entities in biomedical articles. The AIIA gene mention tagger was developed by the Adaptive Internet Intelligent Agents Lab, Institute of Information Science, Academia Sinica, Taiwan, and I-Fang Chung's Lab, Institute of Bioinformatics, National Yang-Ming University, Taiwan. It participated in the BioCreative II challenge evaluation and attained an F-score of 0.8683 (ranked 2nd) in the final system assessment of the Gene Mention task. This module is intended for those who want to use the remote service over XML-RPC rather than through its web interface. The module and the service are released under the GPLv3 license and are free for both academic and personal use.
Abstract
Integrating high dimensional bi-directional parsing models for gene mention tagging. MOTIVATION: Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this article, we describe in detail our gene mention tagger participated in BioCreative 2 challenge and analyze what contributes to its good performance. Our tagger is based on the conditional random fields model (CRF), the most prevailing method for the gene mention tagging task in BioCreative 2. Our tagger is interesting because it accomplished the highest F-scores among CRF-based methods and second over all. Moreover, we obtained our results by mostly applying open source packages, making it easy to duplicate our results. RESULTS: We first describe in detail how we developed our CRF-based tagger. We designed a very high dimensional feature set that includes most of information that may be relevant. We trained bi-directional CRF models with the same set of features, one applies forward parsing and the other backward, and integrated two models based on the output scores and dictionary filtering. One of the most prominent factors that contributes to the good performance of our tagger is the integration of an additional backward parsing model. However, from the definition of CRF, it appears that a CRF model is symmetric and bi-directional parsing models will produce the same results. We show that due to different feature settings, a CRF model can be asymmetric and the feature setting for our tagger in BioCreative 2 not only produces different results but also gives backward parsing models slight but constant advantage over forward parsing model. To fully explore the potential of integrating bi-directional parsing models, we applied different asymmetric feature settings to generate many bi-directional parsing models and integrate them based on the output scores. Experimental results show that this integrated model can achieve even higher F-score solely based on the training corpus for gene mention tagging.
Input
Output
Gene mention 
State
Online
Download:
Web service:
 
Description
BANNER is a named entity recognition system, primarily intended for biomedical text. It is a machine-learning system based on conditional random fields and contains a wide survey of the best features in recent literature on biomedical named entity recognition (NER). BANNER is portable and is designed to maximize domain independence by not employing semantic features or rule-based processing steps. It is therefore useful to developers as an extensible NER implementation, to researchers as a standard for comparing innovative techniques, and to biologists requiring the ability to find novel entities in large amounts of text. This system has been compared to the official results of the Second BioCreative Challenge Evaluation as well as to other applications such as ABNER, LingPipe and NERBio. 
Abstract
BANNER: an executable survey of advances in biomedical named entity recognition. There has been an increasing amount of research on biomedical named entity recognition, the most basic text extraction problem, resulting in significant progress by different research teams around the world. This has created a need for a freely-available, open source system implementing the advances described in the literature. In this paper we present BANNER, an open-source, executable survey of advances in biomedical named entity recognition, intended to serve as a benchmark for the field. BANNER is implemented in Java as a machine-learning system based on conditional random fields and includes a wide survey of the best techniques recently described in the literature. It is designed to maximize domain independence by not employing brittle semantic features or rule-based processing steps, and achieves significantly better performance than existing baseline systems. It is therefore useful to developers as an extensible NER implementation, to researchers as a standard for comparing innovative techniques, and to biologists requiring the ability to find novel entities in large amounts of text. (PMID:18229723) 
Input
Free text (paste); Sentences 
Output
Confidence score; Gene/protein names; Bio-entity tagged text; Acronyms/Abbreviations 
State
Online
Download:
Web service:
 
Description
BIGNER (Background Information driven Gene Named Entity Recognizer) is a system for automatically tagging gene and protein mentions in biomedical literature. The core of the system is a dictionary generated by semi-supervised learning from a large amount of unlabeled biomedical text. Two models are provided: (a) maximum match based on the dictionary, and (b) a combination of the dictionary and a conditional random field (CRF) model.
Abstract
Incorporating rich background knowledge for gene named entity classification and recognition. ABSTRACT: BACKGROUND: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, have played a central part. However, this type of feature tends to cause extreme sparseness in feature space. As a result, out-of-vocabulary (OOV) terms in the training data are not modeled well due to lack of information. RESULTS: We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher level features using term frequency and co-occurrence information of highly indicative features in huge amount of unlabeled data. We examine its performance in a named entity classification task, which is designed to remove non-gene entries in a large dictionary derived from online resources. The results show that new features generated by FCG outperform lexical features by 5.97 F-score and 10.85 for OOV terms. Also in this framework each extension yields significant improvements and the sparse lexical features can be transformed into both a lower dimensional and more informative representation. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on BioCreative 2 GM test set. Then we combined the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, which improves the performance of the CRF-based tagger by 4.46 with little impact on the efficiency of the recognition system. A demo of the NER system is available at http://202.118.75.18:8080/bioner. (PMID:19615051) 
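The forward maximum match step mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenization, case-insensitive lookup, and a small toy dictionary; the function name, the `max_len` parameter, and the dictionary entries are hypothetical and are not taken from BIGNER itself.

```python
def forward_maximum_match(tokens, dictionary, max_len=5):
    """Greedy left-to-right longest match of token spans against a dictionary.

    Returns a list of (start, end) token offsets, end exclusive.
    """
    # Store entries as lower-cased token tuples for exact span lookup.
    entries = {tuple(entry.lower().split()) for entry in dictionary}
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate span first, then shrink it.
        for j in range(min(len(tokens), i + max_len), i, -1):
            if tuple(t.lower() for t in tokens[i:j]) in entries:
                match_end = j
                break
        if match_end is None:
            i += 1  # no entry starts here; move one token forward
        else:
            spans.append((i, match_end))
            i = match_end  # continue after the matched span
    return spans


# Toy input (hypothetical dictionary, for illustration only).
tokens = "the tumor necrosis factor alpha gene".split()
toy_dictionary = ["tumor necrosis factor", "tumor necrosis factor alpha", "necrosis"]
spans = forward_maximum_match(tokens, toy_dictionary)
print([" ".join(tokens[s:e]) for s, e in spans])
```

Because the longest candidate is tried first, the shorter overlapping entries ("tumor necrosis factor", "necrosis") are not reported separately once the longer name matches.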
Input
Free text (paste); Free text (local); Sentences 
Output
Gene/protein names; Bio-entity tagged text; Semantically labelled text; Gene/Protein labelled text
State
Online
Download:
Web service:
 
Description
BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to UniProt Knowledgebase (UniProtKB) protein entries (Liu et al., 2006a, 2006b). It covers all UniProtKB protein entries and consists of several million names extracted from multiple resources based on database cross-references in iProClass. The web site allows the retrieval of synonymous names of given protein entries and the identification of ambiguous names shared by multiple proteins.
Abstract
BioThesaurus: a web-based thesaurus of protein and gene names. BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to protein entries in the UniProt Knowledgebase. Currently covering more than two million proteins, BioThesaurus consists of over 2.8 million names extracted from multiple molecular biological databases according to the database cross-references in iProClass. The BioThesaurus web site allows the retrieval of synonymous names of given protein entries and the identification of protein entries sharing the same names. AVAILABILITY: BioThesaurus is accessible for online searching at http://pir.georgetown.edu/iprolink/biothesaurus (PMID:16267085) 
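The two lookups described here (synonyms of an entry, and names shared by several entries) amount to an index in both directions. The sketch below is only an illustration of that idea: the entry IDs are placeholders, the case-folding of names is an assumption, and none of it reflects how BioThesaurus is actually implemented.

```python
from collections import defaultdict


def build_name_index(entry_names):
    """Invert an entry -> names mapping into a name -> set of entries index."""
    index = defaultdict(set)
    for entry_id, names in entry_names.items():
        for name in names:
            # Assumption: names are compared case-insensitively.
            index[name.lower()].add(entry_id)
    return index


def ambiguous_names(index):
    """Return the names that point to more than one entry."""
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}


# Placeholder entries (hypothetical IDs, not real UniProtKB accessions).
toy_entries = {
    "ENTRY_A": ["p53", "TP53"],
    "ENTRY_B": ["P53", "Trp53"],
    "ENTRY_C": ["BRCA1"],
}
name_index = build_name_index(toy_entries)
print(toy_entries["ENTRY_A"])       # synonyms of one entry
print(ambiguous_names(name_index))  # names shared by several entries
```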
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
DNorm is an automated method for determining which diseases are mentioned in biomedical text, the task of disease normalization. Diseases have a central role in many lines of biomedical research, making this task important for many lines of inquiry, including etiology (e.g. gene-disease relationships) and clinical aspects (e.g. diagnosis, prevention, and treatment). DNorm is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. DNorm is the first technique to use machine learning to normalize disease names and also the first method employing pairwise learning to rank in a normalization task. DNorm achieved the best performance in the 2013 ShARe/CLEF shared task on disease normalization in clinical notes. https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#DNorm 
Abstract
DNorm: disease name normalization with pairwise learning to rank. MOTIVATION: Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization, DNorm) compared with other normalization tasks in biomedical text mining research. METHODS: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval. RESULTS: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively. AVAILABILITY: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
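The abstract reports both a micro-averaged and a macro-averaged F-measure. A minimal sketch of how the two can differ is given below, assuming the usual definitions (micro pools the counts over all documents, macro averages the per-document scores) and one set of concept IDs per document. The function names and the toy IDs are hypothetical and are not part of DNorm.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from true positive, false positive, false negative counts."""
    if tp == 0:
        return 0.0  # assumption: a document with no correct prediction scores 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_macro_f(gold, predicted):
    """gold, predicted: lists of sets of concept IDs, one set per document."""
    tp_total = fp_total = fn_total = 0
    per_document = []
    for gold_ids, predicted_ids in zip(gold, predicted):
        tp = len(gold_ids & predicted_ids)
        fp = len(predicted_ids - gold_ids)
        fn = len(gold_ids - predicted_ids)
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
        per_document.append(f_measure(tp, fp, fn))
    micro = f_measure(tp_total, fp_total, fn_total)  # pooled counts
    macro = sum(per_document) / len(per_document)    # mean of per-document scores
    return micro, macro


# Toy example: one long document tagged perfectly, one short document missed.
gold = [{"D1", "D2", "D3", "D4"}, {"D5"}]
predicted = [{"D1", "D2", "D3", "D4"}, {"D6"}]
micro, macro = micro_macro_f(gold, predicted)
print(round(micro, 3), round(macro, 3))  # 0.8 0.5
```

In this toy case the micro average is dominated by the document with many mentions, while the macro average weights both documents equally.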
Input
plain text 
Output
disease mentions 
State
Online
Download:
Web service:
 
Description
GNormPlus is an end-to-end system that handles both gene/protein name and identifier detection in biomedical literature, including gene/protein mentions, family names and domain names. GNormPlus also integrates several advanced text-mining techniques (i.e., GenNorm, SR4GN, SimConcept, Ab3P and CRF++) for resolving composite gene names. On two public benchmarking datasets, GNormPlus compares favorably to other state-of-the-art methods.
Abstract
GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains. The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available and these tools are typically restricted to detecting only mention level gene names or only document level gene identifiers. In this work, we report GNormPlus: an end-to-end and open source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations for not only gene names and their identifiers, but also closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator. 
Input
Plain text: PubTator (tab-delimited text file), BioC (XML), and JSON
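For readers unfamiliar with the PubTator format, the sketch below parses it under the commonly used layout: `PMID|t|title` and `PMID|a|abstract` lines, followed by tab-delimited annotation lines of the form PMID, start offset, end offset, mention, type, and an optional identifier. This layout is an assumption here and should be checked against the tool's own documentation; the parser name and the sample record (including its PMID) are made up for illustration.

```python
def parse_pubtator(text):
    """Parse PubTator-style text into {pmid: {title, abstract, annotations}}."""
    docs = {}

    def get_doc(pmid):
        return docs.setdefault(pmid, {"title": "", "abstract": "", "annotations": []})

    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a"):
            # Title or abstract line: PMID|t|text or PMID|a|text
            key = "title" if parts[1] == "t" else "abstract"
            get_doc(parts[0])[key] = parts[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 5:
            # Annotation line: PMID, start, end, mention, type, [identifier]
            pmid, start, end, mention, entity_type = fields[:5]
            get_doc(pmid)["annotations"].append({
                "start": int(start),
                "end": int(end),
                "mention": mention,
                "type": entity_type,
                "id": fields[5] if len(fields) > 5 else None,
            })
    return docs


# Made-up sample record (placeholder PMID) for illustration.
sample = (
    "0000000|t|BRCA1 mutations in breast cancer.\n"
    "0000000|a|We review BRCA1 variants.\n"
    "0000000\t0\t5\tBRCA1\tGene\t672\n"
)
docs = parse_pubtator(sample)
print(docs["0000000"]["annotations"])
```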
Output
gene mentions 
State
Online
Download:
Web service:
KEX
 
Description
KEX (Knowledge EXtraction) is a protein name annotation tool based on PROPER (PROtein Proper-noun Extraction Rules). The input file must be plain text or in 'MEDLINE report' format. The output file contains one sentence per line, with protein names annotated with special mark-up. KEX can be downloaded and was tested on Solaris, DEC and IRIX. You need a C compiler (preferably gcc) and Perl version 5.
Abstract
PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. MOTIVATION: Since their initial development, integration and construction of databases for molecular-level data have progressed. Though biological molecules are related to each other and form a complex system, the information is stored in the vast archives of the literature or in diverse databases. There is no unified naming convention for biological objects, and biological terms may be ambiguous or polysemic. This makes the integration and interaction of databases difficult. In order to eliminate these problems, machine-readable natural language resources appear to be quite promising. We have developed a workbench for protein name abbreviation dictionary (PNAD) building. RESULTS: We have developed PNAD Construction Support System (PNAD-CSS), which offers various convenient facilities to decrease the construction costs of a protein name abbreviation dictionary of which entries are collected from abstracts in biomedical papers. The system allows the users to concentrate on higher level interpretation by removing some troublesome tasks, e.g. management of abstracts, extracting protein names and their abbreviations, and so on. To extract a pair of protein names and abbreviations, we have developed a hybrid system composed of the PROPER System and the PNAD System. The PNAD System can extract the pairs from parenthetical-paraphrases involved in protein names, the PROPER System identified these pairs, with 98.95% precision, 95.56% recall and 97.58% complete precision. AVAILABILITY: PROPER System is freely available from http://www.hgc.inc.u-tokyo.ac.jp/service/tooldoc/KeX/intro.html. The other software are also available on request. Contact the authors. CONTACT: mikio@ims.u-tokyo.ac.jp (PMID:10842739)
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
LingPipe is a suite of NLP tools (in Java) that includes a named-entity detector, an approximate dictionary-match named-entity detector, a heuristic sentence boundary detector, a heuristic within-document coreference resolution engine and a set of tools for MEDLINE data.
Abstract
Phrasal Queries with LingPipe and Lucene: Ad Hoc Genomics Text Retrieval. The hypothesis we explored for the Ad Hoc task of the Genomics track for TREC 2004 was that phrase-level queries would increase precision over a baseline of token-level terms. We implemented our approach using two open source tools: the Apache Jakarta Lucene TF/IDF search engine (version 1.3) and the Alias-i LingPipe tokenizer and named entity annotator (version 1.0.6). Contrary to our intuitions, the baseline system provided better performance in terms of recall and precision for almost every query at almost every precision/recall operating point. (No PubMed ref.) 
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service: