NER Resources

Currently 20 NER resources
 
Description
LINNAEUS is a general-purpose dictionary matching software, capable of processing multiple types of document formats in the biomedical domain (MEDLINE, PMC, BMC, OTMI, text, etc.). It can produce multiple types of output (XML, HTML, tab-separated-value file, or save to a database). It also contains methods for acting as a server (including load balancing across several servers), allowing clients to request matching over a network. A package with files for recognizing and identifying species names is available for LINNAEUS, showing 94% recall and 97% precision compared to LINNAEUS-species-corpus. LINNAEUS can be run in two different ways: using an internal dictionary, or using an external dictionary. The external dictionaries are available for download below. The internal dictionaries (subsets of the external dictionaries, containing the 10,000 most frequently mentioned species in MEDLINE, representing ~99% of mentions) are contained in the Java .jar archive, and do not need any configuration. Due to the small size of the internal dictionaries, they require very little memory.  
Abstract
Background The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. Results In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. Conclusions LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/. 
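As a rough illustration of the dictionary-based matching the abstract describes, the Python sketch below finds the longest dictionary entry starting at each token and maps it to an identifier. It is not LINNAEUS code (LINNAEUS is distributed as a Java .jar): the toy dictionary, the names build_trie and match_species, and the use of a plain token trie in place of the finite-state automaton are all assumptions made for illustration, and no disambiguation heuristics are included.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: species name -> NCBI Taxonomy identifier.
SPECIES: Dict[str, str] = {
    "homo sapiens": "9606",
    "human": "9606",
    "mus musculus": "10090",
    "mouse": "10090",
}

END = "_id"  # marks a complete entry; cannot clash with tokens (letters only)


def build_trie(entries: Dict[str, str]) -> dict:
    """Build a token-level trie from the dictionary entries."""
    root: dict = {}
    for name, tax_id in entries.items():
        node = root
        for token in name.split():
            node = node.setdefault(token, {})
        node[END] = tax_id
    return root


def match_species(text: str, trie: dict) -> List[Tuple[str, str]]:
    """Return (mention, identifier) pairs, preferring the longest match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        node, j, best = trie, i, None
        # Walk the trie as far as the tokens allow, remembering the
        # last position at which a complete entry ended.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best = (j, node[END])
        if best is not None:
            end, tax_id = best
            hits.append((" ".join(tokens[i:end]), tax_id))
            i = end
        else:
            i += 1
    return hits


trie = build_trie(SPECIES)
print(match_species("Genes conserved between Homo sapiens and mouse.", trie))
# [('homo sapiens', '9606'), ('mouse', '10090')]
```

A real system would also need to handle abbreviations, case-sensitive acronyms and ambiguous names, which is where the heuristics mentioned in the abstract come in.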
Input
multiple types of document formats in the biomedical domain (MEDLINE, PMC, BMC, OTMI, text, etc.) 
Output
multiple types of output (XML, HTML, tab-separated-value file, or save to a database) 
State
Online
Download:
Web service:
 
Description
MaxMatcher is a biological concept extraction tool that uses dictionary-based approximate matching. The UMLS 2004AA release is used as the dictionary. Precision and recall on the GENIA 3.02 corpus are 71.60% and 75.18%, respectively. 
Abstract
Using Concept-Based Indexing to Improve Language Modeling Approach to Genomic IR. Genomic IR, characterized by its highly specific information need, severe synonym and polysemy problem, long term name and rapid growing literature size, is challenging IR community. In this paper, we are focused on addressing the synonym and polysemy issue within the language model framework. Unlike the ways translation model and traditional query expansion techniques approach this issue, we incorporate concept-based indexing into a basic language model for genomic IR. In particular, we adopt UMLS concepts as indexing and searching terms. A UMLS concept stands for a unique meaning in the biomedicine domain; a set of synonymous terms will share same concept ID. Therefore, the new approach makes the document ranking effective while maintaining the simplicity of language models. A comparative experiment on the TREC 2004 Genomics Track data shows significant improvements are obtained by incorporating concept-based indexing into a basic language model. The MAP (mean average precision) is significantly raised from 29.17% (the baseline system) to 36.94%. The performance of the new approach is also significantly superior to the mean (21.72%) of official runs participated in TREC 2004 Genomics Track and is comparable to the performance of the best run (40.75%). Most official runs including the best run extensively use various query expansion and pseudo-relevance feedback techniques while our approach does nothing except for the incorporation of concept-based indexing, which evidences the view that semantic smoothing, i.e. the incorporation of synonym and sense information into the language models, is a more standard approach to achieving the effects traditional query expansion and pseudo-relevance feedback techniques target. (PMID:) 
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
MetaMap is a program for mapping biomedical text to concepts in the UMLS Metathesaurus or, equivalently, for discovering Metathesaurus concepts referred to in text. MetaMap uses a knowledge-intensive approach based on symbolic, natural language processing (NLP) and computational linguistic techniques. Because its computations are intensive, it is not suited to real-time processing. On the other hand, it is thorough and is particularly adept at constructing partial matches when a phrase cannot be described by a single concept. MetaMap has been used to support tasks such as information retrieval, text mining, literature-based discovery, document indexing, classification and question answering. MetaMap's output normally consists of the best 'mappings' for input text phrases, i.e., sets of Metathesaurus concepts which best match the input. Intermediate results, also available for output, consist of ranked lists of concepts (keywords, gene/protein names, ..., i.e., any concept in the UMLS Metathesaurus), a shallow parse of the text and a list of author-defined acronyms/abbreviations. MetaMap has been used by a range of different tools/applications, some of which are listed in the keyword section. MetaMap is one of the primary components of the NLM Medical Text Indexer (MTI), which is in daily use assisting NLM indexers in creating the MeSH indexing for MEDLINE citations. 
Abstract
Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. The UMLS Metathesaurus, the largest thesaurus in the biomedical domain, provides a representation of biomedical knowledge consisting of concepts classified by semantic type and both hierarchical and non-hierarchical relationships among the concepts. This knowledge has proved useful for many applications including decision support systems, management of patient records, information retrieval (IR) and data mining. Gaining effective access to the knowledge is critical to the success of these applications. This paper describes MetaMap, a program developed at the National Library of Medicine (NLM) to map biomedical text to the Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text. MetaMap uses a knowledge intensive approach based on symbolic, natural language processing (NLP) and computational linguistic techniques. Besides being applied for both IR and data mining applications, MetaMap is one of the foundations of NLM's Indexing Initiative System which is being applied to both semi-automatic and fully automatic indexing of the biomedical literature at the library. (PMID:11825149) 
Input
Free text (paste); Free text (upload); Free text (local) 
Output
Ranked list; Nr. documents; Confidence score; Keyword; Gene/protein names; Gene/protein identifiers; POS labelled text; Parses; Acronyms/Abbreviations ; Ranked gene/protein lists; Geographical locations 
State
Online
Download:
Web service:
 
Description
mSTRAP is an automated workflow for mining mutation annotations from full-text biomedical literature using natural language processing (NLP) techniques as well as for their subsequent reuse in protein structure annotation and visualization. This system, called mSTRAP™ (Mutation extraction and STRucture Annotation Pipeline), is designed for both information aggregation and subsequent brokerage of the mutation annotations. It facilitates the coordination of semantically related information from a series of text mining and sequence analysis steps into a formal OWL-DL ontology. The ontology is designed to support application-specific data management of sequence, structure, and literature annotations that are populated as instances of object and data type properties. mSTRAPviz™ is a subsystem that facilitates the brokerage of structure information and the associated mutations for visualization. For mutated sequences without any corresponding structure available in the Protein Data Bank (PDB), an automated pipeline for homology modeling is developed to generate the theoretical model. With mSTRAP, we demonstrate a workable system that can facilitate automation of the workflow for the retrieval, extraction, processing, and visualization of mutation annotations, tasks which are well known to be tedious, time-consuming, complex, and error-prone. To run this system, you need to register (name and e-mail) and install Java (version 1.6 or above), ClustalW, and MODELLER (version 9v2). 
Abstract
A workflow for mutation extraction and structure annotation. Rich information on point mutation studies is scattered across heterogeneous data sources. This paper presents an automated workflow for mining mutation annotations from full-text biomedical literature using natural language processing (NLP) techniques as well as for their subsequent reuse in protein structure annotation and visualization. This system, called mSTRAP (Mutation extraction and STRucture Annotation Pipeline), is designed for both information aggregation and subsequent brokerage of the mutation annotations. It facilitates the coordination of semantically related information from a series of text mining and sequence analysis steps into a formal OWL-DL ontology. The ontology is designed to support application-specific data management of sequence, structure, and literature annotations that are populated as instances of object and data type properties. mSTRAPviz is a subsystem that facilitates the brokerage of structure information and the associated mutations for visualization. For mutated sequences without any corresponding structure available in the Protein Data Bank (PDB), an automated pipeline for homology modeling is developed to generate the theoretical model. With mSTRAP, we demonstrate a workable system that can facilitate automation of the workflow for the retrieval, extraction, processing, and visualization of mutation annotations -- tasks which are well known to be tedious, time-consuming, complex, and error-prone. The ontology and visualization tool are available at (http://datam.i2r.a-star.edu.sg/mstrap). (PMID:18172931) 
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
MutationFinder is a tool that automatically extracts mentions of amino acid residue mutations from the literature. It can be downloaded and run over a large collection of abstracts. Its authors report nearly perfect precision and markedly improved recall over a baseline on blind test data. 
Abstract
MutationFinder: a high-performance system for extracting point mutation mentions from text. Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline. AVAILABILITY: MutationFinder, along with a high-quality gold standard data set, and a scoring script for mutation extraction systems have been made publicly available. Implementations, source code and unit tests are available in Python, Perl and Java. MutationFinder can be used as a stand-alone script, or imported by other applications. PROJECT URL: http://bionlp.sourceforge.net. (PMID:17495998) 
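To give a feel for what a rule-based point mutation extractor does, here is a minimal Python sketch that recognises two common notations, one-letter (A123T) and three-letter (Gly12Val), and normalises both to a (wild type, position, mutant) tuple. The pattern and the function name find_point_mutations are assumptions made for illustration; they are not MutationFinder's actual rules or API, and they cover far fewer cases than the real system.

```python
import re
from typing import List, Tuple

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "".join(sorted(THREE_TO_ONE.values()))  # "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Illustrative rule: <residue><position><residue>, in either notation.
PATTERN = re.compile(
    rf"\b(?:([{ONE}])(\d+)([{ONE}])|({THREE})(\d+)({THREE}))\b"
)


def find_point_mutations(text: str) -> List[Tuple[str, int, str]]:
    """Return (wild_type, position, mutant) tuples in order of appearance."""
    results: List[Tuple[str, int, str]] = []
    for m in PATTERN.finditer(text):
        if m.group(1):  # one-letter form, e.g. A123T
            wild, pos, mutant = m.group(1), m.group(2), m.group(3)
        else:  # three-letter form, e.g. Gly12Val
            wild = THREE_TO_ONE[m.group(4)]
            pos = m.group(5)
            mutant = THREE_TO_ONE[m.group(6)]
        results.append((wild, int(pos), mutant))
    return results


print(find_point_mutations("The A123T and Gly12Val substitutions were tested."))
# [('A', 123, 'T'), ('G', 12, 'V')]
```

A pattern this simple will also fire on strings that merely look like mutations (some cell line names, for example), which is one reason precision has to be measured on held-out data, as the abstract describes.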
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
Neji is a flexible and powerful platform for biomedical information extraction from scientific texts, such as patents, publications and electronic health records.
New in Neji 2:
* Neji Web Server:
  - Management of annotation services and respective dictionaries and machine-learning models
  - Web page with interactive annotation for each service
  - REST API for each service
* Gimli for machine learning NER training:
  - Gimli is now easier to use with faster training and processing times. Its functionalities are now integrated into Neji, providing the same high accuracy previously achieved
Abstract
Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools. This article presents Neji, an open source framework optimized for biomedical concept recognition built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing. Concept recognition is provided through dictionary matching and machine learning with normalization methods. Neji also integrates an innovative concept tree implementation, supporting overlapped concept names and respective disambiguation techniques. The most popular input and output formats, namely Pubmed XML, IeXML, CoNLL and A1, are also supported. On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. 
Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM and NCBI disease corpus), achieving high performance results on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, gene and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching and correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, gene and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast and multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification. Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions.  
Input
BioC, PubMed XML, PubMedCentral XML, XML, HTML, RAW 
Output
JSON, A1, BC2, Base64, BioC, CoNLL, IeXML, Pipe, PipeExtended 
State
Online
Download:
Web service:
OSIRISv1.2
 
Description
Sequence variants, in particular Single Nucleotide Polymorphisms (SNPs), are considered key elements in fields such as genetic epidemiology and pharmacogenomics [Palmer and Cardon, 2005]. Researchers in these areas are interested in finding genes associated with diseases or with drug responses, as well as in selecting the relevant sequence variants on candidate genes for genotyping studies. Several public databases are available containing sequence information on genes and proteins (NCBI Entrez, SwissProt and many others). Data on sequence variants can be found at other public resources such as NCBI dbSNP and HapMap. In contrast, information about phenotypic consequences of the sequence variants of genes is generally found as non-structured text in the biomedical literature. However, the identification of the relevant documents and the extraction of the information from them are often hampered by the lack of a widely accepted standard notation for genes, proteins and sequence variants in the biomedical literature, and by the large size of current literature databases. Bearing this in mind, automatic systems for the identification of gene/protein entities and their corresponding sequence variants from biomedical texts are required. Our group has previously reported the development of OSIRIS, a search system that integrates different sources of information and incorporates ad-hoc tools for synonymy generation with the aim of retrieving literature about sequence variation of a gene using the PubMed search engine. We have developed a new version of OSIRIS as a first step towards an integrated text mining system for the extraction of information about genes, sequence variants and related phenotypes. The new implementation of OSIRIS (OSIRISv1.2) incorporates a new entity recognition module and is built on top of a local mirror of the MEDLINE collection and HgenetInfoDB. HgenetInfoDB is a database that integrates data on human genes from the NCBI Gene database and dbSNP. 
The entity recognition module is based on a corpus of articles annotated with gene identifiers and the new search algorithm, which uses a pattern-based search strategy and a sequence variant nomenclature dictionary for the identification of terms denoting SNPs and other sequence variants and their mapping to dbSNP entries. The use of OSIRISv1.2 generates a corpus of annotated literature linked to sequence database entries (NCBI Gene and dbSNP). The results of the searches are stored in a database that can be used to query the results and, in the future, for the extraction of relationships among biological entities. The performance of OSIRISv1.2 was evaluated on a manually annotated corpus, resulting in 99% precision at 82% recall, and an F-score of 0.9. 
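The reported figures are mutually consistent if the F-score is taken to be the usual balanced F1, the harmonic mean of precision and recall. That reading is an assumption, since the description does not say which F-score variant was used. A short Python check:

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the description above: 99% precision, 82% recall.
f1 = f1_score(0.99, 0.82)
print(round(f1, 3))  # 0.897, which rounds to the reported 0.9
```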
Abstract
(PMID:18251998) 
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
T2K Gene Tagger is a web tool that takes a medical text file or a list of gene names and tags genes with a <'gene'> tag carrying taxonomy and sequence information. Note that tagging long paragraphs or a full paper will take quite a long time. 
Abstract
None (No PubMed ref.) 
Input
Free text (paste); 
Output
Gene/protein names; Bio-entity tagged text; Semantically labelled text; Gene/Protein labelled text 
Reference
- 
State
Online
Download:
Web service:
 
Description
TaxonGrab is a tool written in PHP for the purpose of taxonomic name extraction. 
Abstract
TaxonGrab: extracting taxonomic names from text. Identification of organism names in biological texts is essential for the management of archival resources to facilitate comparative biological investigation. Because organism nomenclature conforms closely to prescribed rules, automated techniques may be useful for identifying organism names from existing documents, and may also support the completion of comprehensive indices of taxonomic names; such comprehensive lists are not yet available. Using a combination of contextual rules and a language lexicon, we have developed a set of simple computational techniques for extracting taxonomic names from biological text. Our proposed method consistently performs at greater than 96% Precision and 94% Recall, and at a much higher speed than manual extraction techniques. An implementation of the described method is available as a Web based tool written in PHP. Additionally, the PHP source code is available from SourceForge: http://sourceforge.net/projects/taxongrab, and the project website is http://research.amnh.org/informatics/taxlit/apps/ (No PubMed ref.) 
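One way to read "contextual rules and a language lexicon" is sketched below in Python: a capitalised word followed by a lowercase word is treated as a candidate binomial, and the candidate is kept only if neither word appears in an ordinary English word list. Both the rule and the tiny lexicon are assumptions made for illustration; this is not the TaxonGrab PHP code, and the function name extract_binomials is hypothetical.

```python
import re
from typing import List, Set

# Hypothetical stand-in for a full English lexicon.
ENGLISH_LEXICON: Set[str] = {
    "we", "found", "near", "the", "river", "specimens", "of",
    "were", "collected", "and", "in", "this", "study",
}

# Contextual rule: Capitalised word + lowercase word (Genus species shape).
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([a-z]+)\b")


def extract_binomials(text: str, lexicon: Set[str] = ENGLISH_LEXICON) -> List[str]:
    """Return candidate binomials whose words are both absent from the lexicon."""
    names: List[str] = []
    for m in CANDIDATE.finditer(text):
        genus, species = m.group(1), m.group(2)
        if genus.lower() not in lexicon and species not in lexicon:
            names.append(f"{genus} {species}")
    return names


print(extract_binomials("We found Apis mellifera near the river."))
# ['Apis mellifera']
```

The quality of such a filter depends heavily on lexicon coverage: with a small word list, ordinary sentence openings would be mistaken for names.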
Input
Undefined 
Output
Undefined 
State
Online
Download:
Web service:
 
Description
Text-mining mutation information from the literature has become a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. Current approaches are mostly rule-based and focus on limited types of sequence variations such as protein point mutations. Here we report tmVar, a text-mining approach based on conditional random fields (CRF) for extracting a wide range of sequence variants at both protein and gene levels according to a standard sequence variant nomenclature developed by the Human Genome Variation Society (HGVS). By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model with a set of customized features, our method achieves high performance of over 90% in F-measure on both our own corpus and a publicly available benchmarking data set and compares favorably to state-of-the-art methods. 
Abstract
tmVar: a text mining approach for extracting sequence variants in biomedical literature. MOTIVATION: Text-mining mutation information from the literature becomes a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. It has also been used for assisting the creation of disease-related mutation databases. Most of existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual efforts in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy. RESULTS: Here, we report tmVar, a text-mining approach based on conditional random field (CRF) for extracting a wide range of sequence variants described at protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model and feature set, our method achieves higher performance than a state-of-the-art method on both our corpus (91.4 versus 78.1% in F-measure) and their own gold standard (93.9 versus 89.4% in F-measure). These results suggest that tmVar is a high-performance method for mutation extraction from biomedical literature. AVAILABILITY: tmVar software and its corpus of 500 manually curated abstracts are available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/pub/tmVar 
Input
plain text 
Output
mutation mentions 
State
Online
Download:
Web service: