The FungalWeb Ontology is a knowledge representation vehicle designed to integrate information relevant to industrial applications of enzymes. The ontology integrates information from established sources and supports complex queries to the instantiated FungalWeb knowledge base. The ontology represents prototype Semantic Web technology customized to the domain of industrial enzymes with a focus on enzyme discovery, commercial enzyme products and vendors, and the industrial applications and benefits of industrial enzymes. Using a series of application scenarios we demonstrate the utility of this 'Semantic Web' infrastructure to the enzyme biotechnologist.
Text Mining in biology and biomedicine requires a large amount of domain-specific knowledge. Publicly accessible resources hold much of the information needed, yet their practical integration into natural language processing (NLP) systems is fraught with manifold hurdles, especially the problem of semantic disconnectedness throughout the various resources and components. Ontologies can provide the necessary framework for a consistent semantic integration, while additionally delivering formal reasoning capabilities to NLP.
In this chapter, we address four important aspects relating to the integration of ontology and NLP: (i) An analysis of the different integration alternatives and their respective vantages; (ii) The design requirements for an ontology supporting NLP tasks; (iii) Creation and initialization of an ontology using publicly available tools and databases; and (iv) The connection of common NLP tasks with an ontology, including technical aspects of ontology deployment in a text mining framework. A concrete application example—text mining of enzyme mutations—is provided to motivate and illustrate these points.
Keywords: Text Mining, NLP, Ontology Design, Ontology Population, Ontological NLP
Enhanced Semantic Access to the Protein Engineering Literature using Ontologies Populated by Text Mining
The biomedical literature is growing at an ever-increasing rate, which pronounces the need to support scientists with advanced, automated means of accessing knowledge. We investigate a novel approach employing description logics (DL)-based queries made to formal ontologies that have been created using the results of text mining full-text research papers. In this paradigm, an OWL-DL ontology becomes populated with instances detected through natural language processing (NLP). The generated ontology can be queried by biologists using DL reasoners or integrated into bioinformatics workflows for further automated analyses. We demonstrate the feasibility of this approach with a system targeting the protein mutation literature.
Keywords: text mining; semantic web; ontological NLP; protein mutations; automated reasoning in bioinformatics; querying OWL-DL ontologies; description logics.
Software maintainers routinely have to deal with a multitude of artifacts, like source code or documents. These artifacts often end up disconnected from each other, due to their different representations and levels of abstractions. One of the main challenges in software maintenance therefore is to recover and maintain the semantic connections among these artifacts. In this research, we present a novel approach that addresses this traceability issue by creating formal ontological representations for both software documentation and source code artifacts. The resulting representations are then aligned to establish traceability links at semantic level. Ontological queries and reasoning can be applied on these representations to infer and establish additional traceability links to support specific maintenance tasks.
Categories and Subject Descriptors: D2.7 [Distribution, Maintenance, and Enhancement]: Documentation, Restructuring, reverse engineering
General Terms: Software, Documentation, Management
Keywords: Ontologies, Traceability, Software Maintenance
Update summaries as defined for the new DUC 2007 task deliver focused information to a user who has already read a set of older documents covering the same topic. In this paper, we show how to generate this kind of summary from the same data structure—fuzzy coreference cluster graphs—as all other generic and focused multi-document summaries. Our system ERSS 2007 implementing this algorithm also participated in the DUC 2007 main task, without any changes from the 2006 version.
We present a system capable of modeling human newspaper readers. It is based on the extraction of reported speech, which is subsequently converted into a fuzzy theory-based representation of single statements. A domain analysis then assigns statements to topics. A number of fuzzy set operators, including fuzzy belief revision, are applied to model different belief strategies. At the end, our system holds certain beliefs while rejecting others.
Large document collections, such as those delivered by Internet search engines, are difficult and time-consuming for users to read and analyse. The detection of common and distinctive topics within a document set, together with the generation of multi-document summaries, can greatly ease the burden of information management. We show how this can be achieved with a clustering algorithm based on fuzzy set theory, which (i) is easy to implement and integrate into a personal information system, (ii) generates a highly flexible data structure for topic analysis and summarization, and (iii) also delivers excellent performance.
Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. In this paper, we present a text mining system capable of populating a software ontology with information detected in documents. A particular novelty is the integration of results from automated source code analysis into an NLP pipeline, allowing to cross-link software artifacts represented in code and natural language on a semantic level.
Software maintainers routinely have to deal with a multitude of artifacts, like source code or documents, which often end up disconnected, due to their different representations and the size and complexity of legacy systems. One of the main challenges in software maintenance is to establish and maintain the semantic connections among all the different artifacts. In this paper, we show how Semantic Web technologies can deliver a unified representation to explore, query and reason about a multitude of software artifacts. A novel feature is the automatic integration of two important types of software maintenance artifacts, source code and documents, by populating their corresponding sub-ontologies through code analysis and text mining. We demonstrate how the resulting "Software Semantic Web" can support typical maintenance tasks through ontology queries and DL reasoning, such as security analysis, architectural evolution, and traceability recovery between code and documents.
Keywords: Software Maintenance, Ontology Population, Text Mining.
The development of text analysis systems targeting the extraction of information about mutations from research publications is an emergent topic in biomedical research. Current systems differ in both scope and approaches, which prevents a meaningful comparison of their performance and therefore possible synergies. To overcome this "evaluation bottleneck," we developed a comprehensive framework for the systematic analysis of mutation extraction systems, precisely defining tasks and corresponding evaluation metrics that will allow a comparison of existing and future applications.
Keywords: mutation extraction systems; mutation evaluation tasks; mutation evaluation metrics