Recent posts

Enhancing the OpenOffice.org Word Processor with Natural Language Processing Capabilities


Abstract

Today's knowledge workers are often overwhelmed by the vast amount of readily available natural language documents that are potentially relevant for a given task. Natural language processing (NLP) and text mining techniques can deliver automated analysis support, but they are often not integrated into commonly used desktop clients, such as word processors. We present a plug-in for the OpenOffice.org word processor Writer that provides access to any kind of NLP analysis service, mediated through a service-oriented architecture. Semantic Assistants can now provide services such as information extraction, question-answering, index generation, or automatic summarization directly within an end user's application.

Professional Activities

I have been involved in a number of review and event organization activities.

New Job, New Website

As of June 1st, 2008, I'm now working as an assistant professor in the Department of Computer Science and Software Engineering at Concordia University in Montréal, Canada. Coinciding with the new position, I'm also building a new website, www.semanticsoftware.info. There are two main ideas behind this website: First, to inform about the research and teaching activities of my Semantic Software Lab, which I'm establishing at Concordia; and second, to establish a community portal for selected topics in the area of semantic systems — for example, for people interested in the applications of NLP in software engineering.

A Semantic Wiki Approach to Cultural Heritage Data Management

Abstract

Providing access to cultural heritage data beyond book digitization and information retrieval projects is important for delivering advanced semantic support to end users, in order to address their specific needs. We introduce a separation of concerns for heritage data management by explicitly defining different user groups and analyzing their particular requirements. Based on this analysis, we developed a comprehensive system architecture for accessing, annotating, and querying textual historic data. Novel features are the deployment of a Wiki user interface, natural language processing services for end users, metadata generation in OWL ontology format, SPARQL queries on textual data, and the integration of external clients through Web Services. We illustrate these ideas with the management of a historic encyclopedia of architecture.
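The idea of running SPARQL-style queries over metadata generated from textual data can be pictured with a minimal sketch. The triples and predicate names below ("type", "mentions") are purely illustrative, not the project's actual OWL vocabulary, and the pattern matcher stands in for a real SPARQL engine:

```python
# Minimal sketch of metadata triples and a SPARQL-like pattern match.
# Predicate names are illustrative, not the project's actual OWL schema.
triples = [
    ("entry42", "type", "EncyclopediaEntry"),
    ("entry42", "mentions", "Karlsruhe"),
    ("entry43", "type", "EncyclopediaEntry"),
]

def query(triples, pred, obj):
    """Return all subjects s such that (s, pred, obj) holds."""
    return [s for (s, p, o) in triples if p == pred and o == obj]

print(query(triples, "mentions", "Karlsruhe"))  # ['entry42']
```

In the actual architecture, such triples would live in an OWL ontology and the query would be expressed in SPARQL against a triple store.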

Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles


Abstract

Reported speech in the form of direct and indirect reported speech is an important indicator of evidentiality in traditional newspaper texts, but also increasingly in the new media that rely heavily on citation and quotation of previous postings, as for instance in blogs or newsgroups. This paper details the basic processing steps for reported speech analysis and reports on the performance of an implementation in the form of a GATE resource.
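The core of direct reported speech detection can be illustrated with a toy sketch. The regex below (a single source name, three reporting verbs, one quoted span) is a deliberate simplification; the actual GATE resource uses a much richer grammar over reporting-verb lexicons and syntactic structure:

```python
# Illustrative sketch only: a toy pattern for direct reported speech
# of the form '<Source> said, "<quote>"'.
import re

PATTERN = re.compile(
    r'(?P<source>[A-Z]\w+)\s+(?P<verb>said|stated|claimed),?\s+"(?P<quote>[^"]+)"'
)

def tag_reported_speech(text):
    """Return (source, reporting verb, quoted content) triples."""
    return [(m.group("source"), m.group("verb"), m.group("quote"))
            for m in PATTERN.finditer(text)]

sample = 'Smith said, "The results are promising."'
print(tag_reported_speech(sample))
```

A real system must additionally handle indirect speech ("Smith claimed that ..."), multi-sentence quotes, and source descriptions longer than a single name.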

Deadline extended for STSM

We extended the paper submission deadline for our workshop on Semantic Technologies in System Maintenance (STSM) to April 25th.

Traceability in Software Engineering - Past, Present and Future

CASCON 2007 Workshop Report

IBM Technical Report: TR-74-211

October 25, 2007

Abstract

Many changes have occurred in software engineering research and practice since 1968, when software engineering was established as a research domain. One of these research areas is traceability: a key aspect of any engineering discipline, it enables engineers to understand the relations and dependencies among the various artifacts in a system.

Call for Papers: International Workshop on Semantic Technologies in System Maintenance (STSM 2008)

Together with Jürgen Rilling, Dragan Gašević, and Jeff Z. Pan, I'm organizing the first International Workshop on Semantic Technologies in System Maintenance (STSM 2008), which will be co-located with the 16th IEEE International Conference on Program Comprehension (ICPC 2008) in Amsterdam, The Netherlands.

Detailed information on the workshop, submission guidelines, and other news is now available from the workshop's webpage.

Workshop on Semantic Technologies in System Maintenance at ICPC 2008

It's official: I'm co-organizing the (first) International Workshop on Semantic Technologies in System Maintenance (STSM) at the next IEEE International Conference on Program Comprehension (ICPC 2008) in Amsterdam, The Netherlands. Some preliminary information is available on the ICPC website. A call for papers and more details are coming soon!

A Unified Ontology-Based Process Model for Software Maintenance and Comprehension

Abstract

In this paper, we present a formal process model to support the comprehension and maintenance of software systems. The model provides a formal ontological representation that supports the use of reasoning services across different knowledge resources. In the presented approach, we employ our Description Logic knowledge base to support the maintenance process management, as well as detailed analyses among resources, e.g., the traceability between various software artifacts. The resulting unified process model provides users with active guidance, context-sensitive to a particular comprehension task, in selecting and utilizing these resources. We illustrate both the technical foundation, based on our existing SOUND environment, and the general objectives and goals of our process model.

Keywords: Software maintenance, process modeling, ontological reasoning, software comprehension, traceability, text mining.

An Ontological Software Comprehension Process Model

Abstract

Comprehension is an essential part of software maintenance. Only software that is well understood can evolve in a controlled manner. In this paper, we present a formal process model to support the comprehension of software systems by using Ontology and Description Logic. This formal representation supports the use of reasoning services across different knowledge resources and therefore, enables us to provide users with guidance during the comprehension process that is context sensitive to their particular comprehension task.

Keywords: Software maintenance, program comprehension, process modeling, ontological reasoning

An Ontology-based Approach for the Recovery of Traceability Links

Abstract

Traceability links provide support for software engineers in understanding the relations and dependencies among software artifacts created during the software development process. In this research, we focus on re-establishing traceability links between existing source code and documentation to support reverse engineering. We present a novel approach that addresses this issue by creating formal ontological representations for both the documentation and source code artifacts.
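The intuition behind recovering links between code and documentation can be sketched with a simple term-overlap heuristic. The paper's approach is ontology-based; the snippet below only illustrates the underlying idea, and all names in it are hypothetical:

```python
# Hedged sketch: scoring a candidate traceability link by the overlap
# between the terms of a source-code identifier and the words of a
# documentation sentence. Not the paper's actual ontological method.
import re

def terms(identifier):
    """Split a camelCase identifier into a set of lowercase terms."""
    return {t.lower() for t in re.findall(r"[A-Z]?[a-z]+", identifier)}

def link_score(identifier, doc_sentence):
    """Fraction of the identifier's terms that appear in the sentence."""
    doc_words = {w.lower() for w in re.findall(r"\w+", doc_sentence)}
    shared = terms(identifier) & doc_words
    return len(shared) / max(len(terms(identifier)), 1)

print(link_score("parseConfigFile", "This method parses the config file."))
```

An ontological representation improves on such surface matching by normalizing terms and letting a reasoner infer indirect relations between artifacts.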

A Context-Driven Software Comprehension Process Model

Abstract

Comprehension is an essential part of software evolution. Only software that is well understood can evolve in a controlled manner. In this paper, we present a formal process model to support the comprehension of software systems by using Ontology and Description Logic. This formal representation supports the use of reasoning services across different knowledge resources and therefore enables us to provide users with guidance during the comprehension process that is context sensitive to their particular comprehension task. As part of the process model, we also adopt a new interactive story metaphor to represent the interactions between users and the comprehension process.

Keywords: Software evolution, program comprehension, process modeling, story metaphor, ontological reasoning

Ontology-based Program Comprehension Tool Supporting Website Architectural Evolution

Abstract

A challenge of existing program comprehension approaches is to provide consistent and flexible representations for software systems. Maintainers have to match their mental models with the different representations these tools provide. In this paper, we present a novel approach that addresses this issue by providing a consistent ontological representation for both source code and documentation. The ontological representation unifies information from various sources, and therefore reduces the maintainers’ comprehension efforts. In addition, representing software artifacts in a formal ontology enables maintainers to formulate hypotheses about various properties of software systems. These hypotheses can be validated through an iterative exploration of information derived by our ontology inference engine. The implementation of our approach is presented in detail, and a case study is provided to demonstrate the applicability of our approach during the architectural evolution of a website content management system.

Keywords: Program Comprehension, Software Evolution, Ontology, Automated Reasoning

Tutorial: Applications for the Semantic Web

Description

The Semantic Web vision is considered the next generation of the Web that enables sharing data, resources and knowledge between parties that belong to different organizations, different cultures, and/or different communities. Ontologies and rules play the main role in the Semantic Web for publishing community vocabularies and policies, for annotating resources and for turning Web applications into inference-enabled collaboration platforms. After a short introduction into the basic concepts, standards, and tools of the Semantic Web, we present how today's Semantic Web tools, languages, and techniques can be used in various applications. We start with the use of Semantic Web technologies for providing online educators with feedback about how their students use online courses in learning management systems. Next, we demonstrate the use of Semantic Web technologies and text mining techniques to improve the software development process and software maintenance. Finally, we explain the use of Semantic Web technologies in multimedia-enhanced applications.

Tutorial: Introduction to Text Mining

Tutorial Description

Do you suffer from a lack of information? Or do you rather feel overwhelmed by the sheer amount of available (online) content, like emails, news, web pages, and electronic documents? The rather young field of Text Mining developed from the observation that most knowledge today - more than 80% of the data stored in databases - is hidden within documents written in natural languages and thus cannot be automatically processed by traditional information systems.

Text Mining, "also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text." Text Mining is a highly interdisciplinary field, drawing on foundations and technologies from fields like computational linguistics, database systems, and artificial intelligence, but applying these in new and often unconventional ways.
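A first, very modest step in this direction can be shown in a few lines: surfacing the most prominent content words of a document after filtering common stop words. This is a toy illustration of preprocessing, not a text-mining system; the stop-word list and example text are made up:

```python
# Toy illustration of an early text-mining step: most frequent content
# words in a document after stop-word removal.
from collections import Counter
import re

STOPWORDS = {"the", "a", "of", "in", "and", "is", "to", "from"}

def top_terms(text, n=3):
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    return Counter(words).most_common(n)

doc = ("Text mining extracts knowledge from text. "
       "Mining text is knowledge discovery.")
print(top_terms(doc))  # [('text', 3), ('mining', 2), ('knowledge', 2)]
```

Real text mining goes far beyond frequency counts, adding linguistic analysis (tokenization, parsing, named-entity recognition) and knowledge-based interpretation on top of such basic statistics.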

Text Mining: Wissensgewinnung aus natürlichsprachigen Dokumenten

(This webpage is about a technical report on Text Mining, written in German. Try Google Translate for an English version.)
(Title page of the Text Mining report)

Internal Report 2006-5, Fakultät für Informatik, Universität Karlsruhe (TH), Germany

Edited by René Witte and Jutta Mülle

ISSN 1432-7864

200 pages, 75 figures

Mutation Miner - Textual Annotation of Protein Structures

Abstract

Protein structure visualization tools render images that allow the user to explore structural features of a protein. Context specific information relating to a particular protein or protein family is not easily integrated and must be uploaded from databases or provided through manual curation of input files. We describe a mixed natural language processing and protein sequence analysis approach for the retrieval of mutation specific annotations from full text articles for rendering with protein structures.

Fuzzy Extensions for Reverse Engineering Repository Models

Abstract

(Slide from the WCRE 2003 talk)

Reverse Engineering is a process fraught with imperfections. The importance of dealing explicitly with imprecise, possibly inconsistent data when interacting with the reverse engineer has been pointed out before.

In this paper, we go one step further: we argue that the complete reverse engineering process must be augmented with a formal representation model capable of modeling imperfections. This includes automatic as well as human-centered tools.

We show how this can be achieved by merging a fuzzy set-theory based knowledge representation model with a reverse engineering repository. Our approach is not only capable of modeling a wide range of different kinds of imperfections (uncertain as well as vague information), but also admits robust processing models by defining explicit degrees of certainty and their modification through fuzzy belief revision operators.

The repository-centered approach is proposed as the foundation for a new generation of reverse engineering tools. We show how various RE tasks can benefit from our approach and present first design ideas for fuzzy reverse engineering tools.
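The core fuzzy machinery can be sketched in a few lines: facts in the repository carry membership degrees in [0, 1], combined with the standard fuzzy operators (min for conjunction, max for disjunction). The scenario and numbers below are illustrative only, not taken from the paper:

```python
# Sketch of the fuzzy idea: recovered facts carry degrees of certainty
# in [0, 1], combined with standard fuzzy set operators.
def fuzzy_and(a, b):
    """Standard fuzzy conjunction (minimum t-norm)."""
    return min(a, b)

def fuzzy_or(a, b):
    """Standard fuzzy disjunction (maximum t-conorm)."""
    return max(a, b)

# Hypothetical degrees of belief that a recovered "calls" relation
# really exists, derived from two imperfect evidence sources:
static_evidence = 0.9   # static analysis, fairly reliable
doc_evidence = 0.4      # mentioned only in outdated documentation

print(fuzzy_and(static_evidence, doc_evidence))  # conservative: 0.4
print(fuzzy_or(static_evidence, doc_evidence))   # optimistic: 0.9
```

The paper's model goes further, supporting vague as well as uncertain information and revising such degrees through fuzzy belief revision operators.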

Mutation Miner (CPI 2005)

Introduction

Biological researchers today have access to vast amounts of exponentially growing research data in a structured form within several publicly accessible databases. A large proportion of salient information is however still hidden within individual research papers, since costly manual database curation efforts are overwhelmed by the scale of new information being generated. In the domain of protein engineering, critical units of information required from the literature include: the identity of the mutated protein, the identity and position of wild type residues that are mutated, the identity of the resulting mutant residues and the impacts of the mutations on functional properties of the proteins.
Mutation Miner is a system designed to automate the extraction of mutations and textual annotations describing the impacts of mutations on protein properties (mutation annotations) from full text scientific literature. Furthermore, the system retrieves and carries out bioinformatic analyses on mutated sequences providing the mapped coordinates of mutants on a selected structure. Integration of multiple formatted mutation annotations with associated residue coordinates facilitates their rendering with structure visualization tools. We describe the architecture and tools that support Mutation Miner (Text mining-NLP, Sequence Analysis, Structure Visualization) and present performance evaluations that demonstrate the feasibility of this approach.
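The simplest class of mutation mentions, the compact "wild type + position + mutant" notation (e.g., "A123V"), can be captured with a single regular expression. This is only an illustration of the extraction target; Mutation Miner itself uses a full NLP pipeline that also handles spelled-out forms and links mutations to their host proteins:

```python
# Illustrative sketch: a minimal regex for point-mutation mentions in
# one-letter amino-acid notation, e.g. "A123V" = Ala at position 123
# mutated to Val. Not Mutation Miner's actual extraction grammar.
import re

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
MUTATION = re.compile(rf"\b([{AMINO}])(\d+)([{AMINO}])\b")

def extract_mutations(text):
    """Return (wild-type residue, position, mutant residue) triples."""
    return [(wt, int(pos), mut) for wt, pos, mut in MUTATION.findall(text)]

sentence = "The A123V and G56D mutants showed reduced thermostability."
print(extract_mutations(sentence))  # [('A', 123, 'V'), ('G', 56, 'D')]
```

A production system must also resolve which protein each mutation belongs to and map the stated positions onto the residue numbering of the selected structure.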

Mutation Miner (ISMB 2005)

Introduction

Biological researchers today have access to vast amounts of exponentially growing research data in a structured form within several publicly accessible databases. A large proportion of salient information is however still hidden within individual research papers, since costly manual database curation efforts are overwhelmed by the scale of new information being generated. In the domain of protein engineering, critical units of information required from the literature include: the identity of the mutated protein, the identity and position of wild type residues that are mutated, the identity of the resulting mutant residues and the impacts of the mutations on functional properties of the proteins.
Mutation Miner is a system designed to automate the extraction of mutations and textual annotations describing the impacts of mutations on protein properties (mutation annotations) from full text scientific literature. Furthermore, the system retrieves and carries out bioinformatic analyses on mutated sequences providing the mapped coordinates of mutants on a selected structure. Integration of multiple formatted mutation annotations with associated residue coordinates facilitates their rendering with structure visualization tools. We describe the architecture and tools that support Mutation Miner (Text mining-NLP, Sequence Analysis, Structure Visualization) and present performance evaluations that demonstrate the feasibility of this approach.

Empowering the Enzyme Biotechnologist with Ontologies

Introduction

The FungalWeb Ontology is a knowledge representation vehicle designed to integrate information relevant to industrial applications of enzymes. The ontology integrates information from established sources and supports complex queries to the instantiated FungalWeb knowledge base. The ontology represents prototype Semantic Web technology customized to the domain of industrial enzymes with a focus on enzyme discovery, commercial enzyme products and vendors, and the industrial applications and benefits of industrial enzymes. Using a series of application scenarios we demonstrate the utility of this 'Semantic Web' infrastructure to the enzyme biotechnologist.

Ontology Design for Biomedical Text Mining

Abstract

Text Mining in biology and biomedicine requires a large amount of domain-specific knowledge. Publicly accessible resources hold much of the information needed, yet their practical integration into natural language processing (NLP) systems is fraught with manifold hurdles, especially the problem of semantic disconnectedness throughout the various resources and components. Ontologies can provide the necessary framework for a consistent semantic integration, while additionally delivering formal reasoning capabilities to NLP.

In this chapter, we address four important aspects relating to the integration of ontology and NLP: (i) An analysis of the different integration alternatives and their respective vantages; (ii) The design requirements for an ontology supporting NLP tasks; (iii) Creation and initialization of an ontology using publicly available tools and databases; and (iv) The connection of common NLP tasks with an ontology, including technical aspects of ontology deployment in a text mining framework. A concrete application example—text mining of enzyme mutations—is provided to motivate and illustrate these points.

Keywords: Text Mining, NLP, Ontology Design, Ontology Population, Ontological NLP

Fuzzy Set Theory-Based Belief Processing for Natural Language Texts

Introduction

The growing number of publicly available information sources makes it impossible for individuals to keep track of all the various opinions on one topic. The goal of the artificial believer system we present in this paper is to extract and analyze opinionated statements from newspaper articles.

Beliefs are modeled with a fuzzy-theoretic approach applied after NLP-based information extraction. A fuzzy believer models a human agent, deciding what statements to believe or reject based on different, configurable strategies.

Enhanced Semantic Access to the Protein Engineering Literature using Ontologies Populated by Text Mining

Abstract

The biomedical literature is growing at an ever-increasing rate, which pronounces the need to support scientists with advanced, automated means of accessing knowledge. We investigate a novel approach employing description logics (DL)-based queries made to formal ontologies that have been created using the results of text mining full-text research papers. In this paradigm, an OWL-DL ontology becomes populated with instances detected through natural language processing (NLP). The generated ontology can be queried by biologists using DL reasoners or integrated into bioinformatics workflows for further automated analyses. We demonstrate the feasibility of this approach with a system targeting the protein mutation literature.

Keywords: text mining; semantic web; ontological NLP; protein mutations; automated reasoning in bioinformatics; querying OWL-DL ontologies; description logics.