OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents

Abstract

Figure: OrganismTagger example result annotation
Motivation: Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name and, if possible, link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation.

Results: We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, recall of 97% and grounding accuracy of 97.4%.
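The detect, normalize and ground steps described above can be illustrated with a minimal sketch. This is not the OrganismTagger implementation: the tiny lexicon, the function names (build_surface_forms, tag_organisms) and the purely lexical matching are assumptions made for illustration, and the machine-learning strain detection is not covered. The three taxonomy IDs are the NCBI Taxonomy IDs for these species.

```python
import re

# Hypothetical toy lexicon: canonical scientific name -> NCBI Taxonomy ID.
# A real system would generate this from a copy of the NCBI Taxonomy database.
TAXONOMY = {
    "Escherichia coli": 562,
    "Homo sapiens": 9606,
    "Mus musculus": 10090,
}

# Hypothetical common-name table mapping to canonical names.
COMMON_NAMES = {"human": "Homo sapiens", "mouse": "Mus musculus"}


def build_surface_forms():
    """Map lowercased surface forms (full, abbreviated, common) to canonical names."""
    forms = {}
    for name in TAXONOMY:
        genus, species = name.split()
        forms[name.lower()] = name
        # Abbreviated binomial, e.g. "E. coli". Assumed unambiguous in this toy
        # lexicon; in real text an abbreviation can match several genera.
        forms[f"{genus[0]}. {species}".lower()] = name
    for common, name in COMMON_NAMES.items():
        forms[common.lower()] = name
    return forms


def tag_organisms(text):
    """Return (mention, canonical name, taxonomy ID) for each lexicon match."""
    forms = build_surface_forms()
    # Try longer surface forms first so full names win over shorter ones.
    alternation = "|".join(
        re.escape(form) for form in sorted(forms, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    results = []
    for match in pattern.finditer(text):
        mention = match.group(0)
        canonical = forms[mention.lower()]               # normalization
        results.append((mention, canonical, TAXONOMY[canonical]))  # grounding
    return results


if __name__ == "__main__":
    for row in tag_organisms("E. coli was co-cultured with human cells."):
        print(row)
```

The sketch resolves each abbreviation in isolation. A document-level approach would instead prefer the genus already mentioned in the same text when an abbreviation such as "E. coli" is ambiguous.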

Availability: The OrganismTagger, including supporting tools, resources, training data and manual annotations, as well as end user and developer documentation, is freely available under an open-source license at http://www.semanticsoftware.info/organism-tagger.

Reference

Nona Naderi, Thomas Kappler, Christopher J.O. Baker, and René Witte. OrganismTagger: Detection, normalization, and grounding of organism entities in biomedical documents. Bioinformatics, Vol. 27, No. 19, pp. 2721–2729, Oxford University Press, 2011. DOI: 10.1093/bioinformatics/btr452. (Impact Factor: 4.877; 5-Year Impact Factor: 6.325).

History: Received on March 7, 2011; revised on July 14, 2011; accepted on July 31, 2011. First published online August 9, 2011.

BibTeX entry (also for download):

@article{orgtagger11,
	author = {Nona Naderi and Thomas Kappler and 
                  Christopher J.O. Baker and Ren{\'e} Witte},
	title = {{OrganismTagger:} Detection, normalization, and grounding 
                 of organism entities in biomedical documents},
	journal = {Bioinformatics},
	volume = {27},
	number = {19},
	year = {2011},
	month = aug,
	pages = {2721--2729},
	publisher = {Oxford University Press},
	issn = {1460-2059 (online) 1367-4803 (print)},
	doi = {10.1093/bioinformatics/btr452}
}

More Information