Text Mining

Ontology Design for Biomedical Text Mining


Text mining in biology and biomedicine requires a large amount of domain-specific knowledge. Publicly accessible resources hold much of the information needed, yet integrating them into natural language processing (NLP) systems in practice faces many hurdles, especially the semantic disconnectedness across the various resources and components. Ontologies can provide the necessary framework for consistent semantic integration, while additionally bringing formal reasoning capabilities to NLP.

In this chapter, we address four important aspects of integrating ontologies and NLP: (i) an analysis of the different integration alternatives and their respective advantages; (ii) the design requirements for an ontology supporting NLP tasks; (iii) the creation and initialization of an ontology using publicly available tools and databases; and (iv) the connection of common NLP tasks with an ontology, including technical aspects of ontology deployment in a text mining framework. A concrete application example, text mining of enzyme mutations, is provided to motivate and illustrate these points.

Keywords: Text Mining, NLP, Ontology Design, Ontology Population, Ontological NLP

Enhanced Semantic Access to the Protein Engineering Literature using Ontologies Populated by Text Mining


The biomedical literature is growing at an ever-increasing rate, which underscores the need to support scientists with advanced, automated means of accessing knowledge. We investigate a novel approach employing description logics (DL)-based queries made to formal ontologies that have been created using the results of text mining of full-text research papers. In this paradigm, an OWL-DL ontology is populated with instances detected through natural language processing (NLP). The generated ontology can be queried by biologists using DL reasoners or integrated into bioinformatics workflows for further automated analyses. We demonstrate the feasibility of this approach with a system targeting the protein mutation literature.

Keywords: text mining; semantic web; ontological NLP; protein mutations; automated reasoning in bioinformatics; querying OWL-DL ontologies; description logics.

Creating a Fuzzy Believer to Model Human Newspaper Readers

Montreal 2007


We present a system capable of modeling human newspaper readers. It is based on the extraction of reported speech, which is subsequently converted into a fuzzy-theoretic representation of single statements. A domain analysis then assigns statements to topics. A number of fuzzy set operators, including fuzzy belief revision, are applied to model different belief strategies. In the end, our system holds certain beliefs while rejecting others.

Text Mining and Software Engineering: An Integrated Source Code and Document Analysis Approach



Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. In this paper, we present a text mining system capable of populating a software ontology with information detected in documents. A particular novelty is the integration of results from automated source code analysis into an NLP pipeline, which allows software artifacts represented in code and in natural language to be cross-linked on a semantic level.

Towards a Systematic Evaluation of Protein Mutation Extraction Systems


The development of text analysis systems targeting the extraction of information about mutations from research publications is an emergent topic in biomedical research. Current systems differ in both scope and approaches, which prevents a meaningful comparison of their performance and therefore possible synergies. To overcome this "evaluation bottleneck," we developed a comprehensive framework for the systematic analysis of mutation extraction systems, precisely defining tasks and corresponding evaluation metrics that will allow a comparison of existing and future applications.

Keywords: mutation extraction systems; mutation evaluation tasks; mutation evaluation metrics
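
To make the idea of a shared evaluation metric concrete, here is a minimal sketch that scores one system's extracted mutation mentions against a gold standard. It assumes precision, recall, and F1 over exact-match (document, mutation) pairs, which is a common choice in information extraction; the abstract does not state which tasks and metrics the framework actually defines, and the `evaluate` function and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumption: a mention is identified by (document id, mutation string).
Mention = Tuple[str, str]


def evaluate(predicted: Iterable[Mention], gold: Iterable[Mention]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 of predicted mentions against gold."""
    pred: Set[Mention] = set(predicted)
    ref: Set[Mention] = set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example data, for illustration only.
gold = [("doc1", "A123T"), ("doc1", "G45D"), ("doc2", "W7F")]
predicted = [("doc1", "A123T"), ("doc2", "W7F"), ("doc2", "K9R")]
scores = evaluate(predicted, gold)
```

Running two systems through the same function on the same gold set is what makes their numbers comparable, which is the point of fixing tasks and metrics in advance.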

Protein Domains

Ontological Text Mining of Software Documents

Paris, France


Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. In this paper, we present a text mining system capable of populating a software ontology with information detected in documents.

Processing of Beliefs extracted from Reported Speech in Newspaper Articles

A fuzzy believer?


The growing number of publicly available information sources makes it impossible for individuals to keep track of all the various opinions on one topic. The goal of our artificial believer system presented in this paper is to extract and analyze statements of opinion from newspaper articles.

Beliefs are modeled using a fuzzy-theoretic approach applied after NLP-based information extraction. A fuzzy believer models a human agent, deciding what statements to believe or reject based on different, configurable strategies.
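
As a rough illustration of configurable belief strategies, the sketch below combines statements reported by two sources using the standard fuzzy union (max) and intersection (min), then accepts statements whose degree reaches a threshold. This is not the system's actual representation or its belief revision operator; the data layout, the function names, the threshold, and the example statements are all assumptions made for illustration.

```python
from typing import Dict, Set

# Assumption: a fuzzy set maps each statement to a membership degree in [0, 1].
FuzzySet = Dict[str, float]


def fuzzy_union(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Standard fuzzy union: element-wise maximum of the degrees."""
    return {k: max(a.get(k, 0.0), b.get(k, 0.0)) for k in a.keys() | b.keys()}


def fuzzy_intersection(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Standard fuzzy intersection: element-wise minimum of the degrees."""
    return {k: min(a.get(k, 0.0), b.get(k, 0.0)) for k in a.keys() | b.keys()}


def accept(beliefs: FuzzySet, threshold: float = 0.5) -> Set[str]:
    """Hold the statements whose degree reaches the threshold; reject the rest."""
    return {statement for statement, degree in beliefs.items() if degree >= threshold}


# Made-up degrees for two hypothetical sources.
source_a = {"taxes will rise": 0.8, "the bridge will close": 0.3}
source_b = {"taxes will rise": 0.6, "the bridge will close": 0.7}

credulous = accept(fuzzy_union(source_a, source_b))         # believes if any source supports it
skeptical = accept(fuzzy_intersection(source_a, source_b))  # believes only if all sources do
```

Swapping the combining operator changes which statements end up believed, which is one simple way to read "different, configurable strategies".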

Enriching Protein Structure Visualizations with Mutation Annotations Obtained by Text Mining Protein Engineering Literature

Multiple Sequence Alignment


Protein structure visualization tools render images that allow the user to explore structural features of a protein. Context-specific information relating to a particular protein or protein family is not easily integrated and must be uploaded from databases or provided through manual curation of input files. We describe an approach that combines natural language processing and sequence analysis to retrieve mutation-specific annotations from full-text articles for rendering with protein structures.


Keywords: Text Mining, Protein Structure Annotation, Protein Function, ProSAT, Xylanase

Engineering a Semantic Desktop for Building Historians and Architects

Page scan from 'Handbuch der Architektur'


We analyse the requirements for advanced semantic support of the users (building historians and architects) of a multi-volume encyclopedia of architecture from the late 19th century. Novel requirements include the integration of content retrieval, content development, and automated content analysis based on natural language processing.

We present a system architecture for the detected requirements and its current implementation. A complex scenario demonstrates how a desktop supporting semantic analysis can contribute to specific, relevant user tasks.

Combining Biological Databases and Text Mining to support New Bioinformatics Applications

Alicante, Spain


A large amount of biological knowledge today is only available from full-text research papers. Since neither manual database curators nor users can keep up with the rapidly expanding volume of scientific literature, natural language processing approaches are becoming increasingly important for bioinformatics projects.

In this paper, we go beyond simply extracting information from full-text articles by describing an architecture that supports targeted access to information from biological databases using the results derived from text mining of research papers, thereby integrating information from both sources within a biological application.

The described architecture is currently being used to extract information about protein mutations from full-text research papers. Text mining results drive the retrieval of sequence information from protein databases and the use of algorithmic sequence analysis tools, which in turn facilitate further data access from protein structure databases. Complex mapping of NLP-derived text annotations to protein structures allows the rendering, with 3D structure visualization, of information not available in databases of mutation annotations.
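
The step from text annotations to sequence data can be sketched in a much simplified form: find point-mutation mentions in the common one-letter notation (wild-type residue, position, mutant residue, e.g. E5Q), then check each against a protein sequence. This is an illustrative assumption about one possible approach, not the described architecture; the regular expression, the helper names, the example sentence, and the sequence are all made up.

```python
import re
from typing import List, NamedTuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Matches mentions such as "E5Q": wild-type residue, 1-based position, mutant residue.
POINT_MUTATION = re.compile(rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b")


class Mutation(NamedTuple):
    wild_type: str
    position: int  # 1-based, as written in the text
    mutant: str


def find_mutations(text: str) -> List[Mutation]:
    """Extract point-mutation mentions in one-letter notation from free text."""
    return [
        Mutation(m.group(1), int(m.group(2)), m.group(3))
        for m in POINT_MUTATION.finditer(text)
    ]


def consistent_with(sequence: str, mutation: Mutation) -> bool:
    """True if the sequence has the stated wild-type residue at the stated position."""
    index = mutation.position - 1
    return 0 <= index < len(sequence) and sequence[index] == mutation.wild_type


# Made-up sentence and sequence, for illustration only.
text = "The E5Q variant lost activity, whereas K3R retained it."
sequence = "MKTAEL"
mutations = find_mutations(text)
grounded = [m for m in mutations if consistent_with(sequence, m)]
```

Here only E5Q agrees with the sequence; K3R does not, because position 3 holds T. A real pipeline would also have to handle numbering offsets between the paper and the database sequence, which this sketch ignores.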
