OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents


OrganismTagger example result annotationOrganismTagger example result annotation
Motivation: Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name and if possible link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation.

Results: We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, recall of 97% and grounding accuracy of 97.4%.

Availability: The OrganismTagger, including supporting tools, resources, training data and manual annotations, as well as end user and developer documentation, is freely available under an open-source license at

Ontology Design for Biomedical Text Mining


Text Mining in biology and biomedicine requires a large amount of domain-specific knowledge. Publicly accessible resources hold much of the information needed, yet their practical integration into natural language processing (NLP) systems is fraught with manifold hurdles, especially the problem of semantic disconnectedness throughout the various resources and components. Ontologies can provide the necessary framework for a consistent semantic integration, while additionally delivering formal reasoning capabilities to NLP.

In this chapter, we address four important aspects relating to the integration of ontology and NLP: (i) An analysis of the different integration alternatives and their respective vantages; (ii) The design requirements for an ontology supporting NLP tasks; (iii) Creation and initialization of an ontology using publicly available tools and databases; and (iv) The connection of common NLP tasks with an ontology, including technical aspects of ontology deployment in a text mining framework. A concrete application example—text mining of enzyme mutations—is provided to motivate and illustrate these points.

Keywords: Text Mining, NLP, Ontology Design, Ontology Population, Ontological NLP

Syndicate content