Recent posts

Enhanced Semantic Access to the Protein Engineering Literature using Ontologies Populated by Text Mining


The biomedical literature is growing at an ever-increasing rate, which heightens the need to support scientists with advanced, automated means of accessing knowledge. We investigate a novel approach employing description logic (DL)-based queries against formal ontologies that have been created from the results of text mining full-text research papers. In this paradigm, an OWL-DL ontology is populated with instances detected through natural language processing (NLP). The generated ontology can be queried by biologists using DL reasoners or integrated into bioinformatics workflows for further automated analyses. We demonstrate the feasibility of this approach with a system targeting the protein mutation literature.

Keywords: text mining; semantic web; ontological NLP; protein mutations; automated reasoning in bioinformatics; querying OWL-DL ontologies; description logics.
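
As a rough illustration of the paradigm described in this abstract, the sketch below fills a small in-memory collection with mutation instances detected in text and then queries it. Everything in it is an assumption made for illustration: the `Mutation` class, the single-letter point-mutation pattern (for example `N35D`), and the list standing in for the ontology are hypothetical, and they are far simpler than the NLP pipeline, OWL-DL ontology, and DL reasoner that the abstract refers to but does not detail.

```python
import re
from dataclasses import dataclass

# Hypothetical mention pattern: wild-type residue, position, mutant residue.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(rf"\b([{AMINO}])(\d+)([{AMINO}])\b")


@dataclass(frozen=True)
class Mutation:
    """Illustrative stand-in for an ontology instance."""
    wild_type: str
    position: int
    mutant: str


def populate(text):
    """Return one Mutation per distinct mention, in order of first appearance."""
    instances = []
    for wild_type, position, mutant in POINT_MUTATION.findall(text):
        candidate = Mutation(wild_type, int(position), mutant)
        if candidate not in instances:
            instances.append(candidate)
    return instances


def mutations_at(instances, position):
    """Toy query: all detected mutations at a given sequence position."""
    return [m for m in instances if m.position == position]


sentence = "The N35D variant was more stable than E78Q, while N35D/E78Q lost activity."
instances = populate(sentence)
print(instances)
print(mutations_at(instances, 78))
```

A real system would replace the regular expression with proper entity detection and the list with an ontology that a reasoner can query; the sketch only shows the populate-then-query shape of the idea.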

Automatic Traceability Recovery: An Ontological Approach


Software maintainers routinely have to deal with a multitude of artifacts, like source code or documents. These artifacts often end up disconnected from each other, due to their different representations and levels of abstraction. One of the main challenges in software maintenance is therefore to recover and maintain the semantic connections among these artifacts. In this research, we present a novel approach that addresses this traceability issue by creating formal ontological representations for both software documentation and source code artifacts. The resulting representations are then aligned to establish traceability links at the semantic level. Ontological queries and reasoning can be applied to these representations to infer and establish additional traceability links that support specific maintenance tasks.

Categories and Subject Descriptors: D2.7 [Distribution, Maintenance, and Enhancement]: Documentation, Restructuring, reverse engineering
General Terms: Software, Documentation, Management
Keywords: Ontologies, Traceability, Software Maintenance

Generating Update Summaries for DUC 2007


Update summaries as defined for the new DUC 2007 task deliver focused information to a user who has already read a set of older documents covering the same topic. In this paper, we show how to generate this kind of summary from the same data structure—fuzzy coreference cluster graphs—as all other generic and focused multi-document summaries. Our system ERSS 2007 implementing this algorithm also participated in the DUC 2007 main task, without any changes from the 2006 version.

An Initial Fuzzy Coreference Cluster Graph

Creating a Fuzzy Believer to Model Human Newspaper Readers

Montreal 2007


We present a system capable of modeling human newspaper readers. It is based on the extraction of reported speech, which is subsequently converted into a fuzzy theory-based representation of single statements. A domain analysis then assigns statements to topics. A number of fuzzy set operators, including fuzzy belief revision, are applied to model different belief strategies. At the end, our system holds certain beliefs while rejecting others.

Fuzzy Clustering for Topic Analysis and Summarization of Document Collections

Montreal 2007


Large document collections, such as those delivered by Internet search engines, are difficult and time-consuming for users to read and analyse. The detection of common and distinctive topics within a document set, together with the generation of multi-document summaries, can greatly ease the burden of information management. We show how this can be achieved with a clustering algorithm based on fuzzy set theory, which (i) is easy to implement and integrate into a personal information system, (ii) generates a highly flexible data structure for topic analysis and summarization, and (iii) also delivers excellent performance.

Text Mining and Software Engineering: An Integrated Source Code and Document Analysis Approach



Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. In this paper, we present a text mining system capable of populating a software ontology with information detected in documents. A particular novelty is the integration of results from automated source code analysis into an NLP pipeline, which allows software artifacts represented in code and in natural language to be cross-linked on a semantic level.

Empowering Software Maintainers with Semantic Web Technologies

Achtung Seilbahn!


Software maintainers routinely have to deal with a multitude of artifacts, like source code or documents, which often end up disconnected, due to their different representations and the size and complexity of legacy systems. One of the main challenges in software maintenance is to establish and maintain the semantic connections among all the different artifacts. In this paper, we show how Semantic Web technologies can deliver a unified representation to explore, query and reason about a multitude of software artifacts. A novel feature is the automatic integration of two important types of software maintenance artifacts, source code and documents, by populating their corresponding sub-ontologies through code analysis and text mining. We demonstrate how the resulting "Software Semantic Web" can support typical maintenance tasks through ontology queries and DL reasoning, such as security analysis, architectural evolution, and traceability recovery between code and documents.

Keywords: Software Maintenance, Ontology Population, Text Mining.

Towards a Systematic Evaluation of Protein Mutation Extraction Systems


The development of text analysis systems targeting the extraction of information about mutations from research publications is an emergent topic in biomedical research. Current systems differ in both scope and approaches, which prevents a meaningful comparison of their performance and therefore possible synergies. To overcome this "evaluation bottleneck," we developed a comprehensive framework for the systematic analysis of mutation extraction systems, precisely defining tasks and corresponding evaluation metrics that will allow a comparison of existing and future applications.

Keywords: mutation extraction systems; mutation evaluation tasks; mutation evaluation metrics

Protein Domains

Ontological Text Mining of Software Documents

Paris, France


Documents written in natural languages constitute a major part of the software engineering lifecycle artifacts. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. In this paper, we present a text mining system capable of populating a software ontology with information detected in documents.

Task-Dependent Visualization of Coreference Resolution Results

A single coreference chain visualized as a Topic Map


Graphical visualizations of coreference chains support a system developer in analyzing the behavior of a resolution algorithm. In this paper, we state explicit use cases for coreference chain visualizations and show how they can be resolved by transforming chains into other, standardized data formats, namely Topic Maps and Ontologies.

Processing of Beliefs extracted from Reported Speech in Newspaper Articles

A fuzzy believer?


The growing number of publicly available information sources makes it impossible for individuals to keep track of all the various opinions on one topic. The goal of our artificial believer system presented in this paper is to extract and analyze statements of opinion from newspaper articles.

Beliefs are modeled using a fuzzy-theoretic approach applied after NLP-based information extraction. A fuzzy believer models a human agent, deciding what statements to believe or reject based on different, configurable strategies.

Next-Generation Summarization: Contrastive, Focused, and Update Summaries

Conference Hotel, Borovets, Bulgaria


Classical multi-document summaries focus on the common topics of a document set and omit distinctive themes particular to a single document—thereby often suppressing precisely that kind of information a user might need for a specific task. This can be avoided through advanced multi-document summaries that take a user's context and history into account, by delivering focused, contrastive, or update summaries. To facilitate the generation of these different summaries, we propose to generate all types from a single data structure, topic clusters, which provide for an abstract representation of a set of documents. Evaluations carried out on five years' worth of data from the DUC summarization competition prove the feasibility of this approach.

Connecting Wikis and Natural Language Processing Systems

Palais de Congres, Montreal, Canada


We investigate the integration of Wiki systems with automated natural language processing (NLP) techniques. The vision is that of a "self-aware" Wiki system reading, understanding, transforming, and writing its own content, as well as supporting its users in information analysis and content development. We provide a number of practical application examples, including index generation, question answering, and automatic summarization, which demonstrate the practicability and usefulness of this idea. A system architecture providing the integration is presented, as well as first results from an initial implementation based on the GATE framework for NLP and the MediaWiki system.

General Terms: Design, Human Factors, Languages
Keywords: Self-aware Wiki System, Wiki/NLP Integration

LockMe! for PalmOS

Current version is 1.1.
  Works on PalmOS 2.x and higher
  Developed under Linux with gcc, pilrc and CoPilot.

(This web page is about an old PalmOS security utility of mine, LockMe! Although it is no longer maintained, the tool and its source code are still available.)


LockMe! periodically locks your Palm, starting at a specified time.

Fuzzy Belief Revision


Fuzzy sets, having been the long-standing mainstay of modeling and manipulating imperfect information, are an obvious candidate for representing uncertain beliefs.

Unfortunately, unadorned fuzzy sets are too limited to capture complex or potentially inconsistent beliefs, because all too often they reduce to absurdities ("nothing is possible") or trivialities ("everything is possible").

However, we show that by combining the syntax of propositional logic with the semantics of fuzzy sets, a rich framework for expressing and manipulating uncertain beliefs can be created. This framework admits Gärdenfors-style expansion, revision, and contraction operators and is, moreover, amenable to easy integration with conventional "crisp" information processing.

The model presented here addresses many of the shortcomings of traditional approaches for building fuzzy data models, which will hopefully lead to a wider adoption of fuzzy technologies for the creation of information systems.


Keywords: fuzzy belief revision, fuzzy information systems, soft computing, fuzzy object-oriented data model
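
The limitation of unadorned fuzzy sets mentioned in this abstract can be seen in a small example. The sketch below assumes the standard min, max, and complement operators and a made-up colour domain; the abstract does not say which operators or domains the authors used, so this is only one plausible reading of "nothing is possible" and "everything is possible".

```python
def fuzzy_and(a, b):
    """Standard (min) intersection of two fuzzy sets over the same domain."""
    return {x: min(a[x], b[x]) for x in a}


def fuzzy_or(a, b):
    """Standard (max) union of two fuzzy sets over the same domain."""
    return {x: max(a[x], b[x]) for x in a}


def fuzzy_not(a):
    """Standard complement."""
    return {x: 1.0 - a[x] for x in a}


# Hypothetical beliefs about an object's colour, as possibility degrees.
says_red = {"red": 1.0, "green": 0.0, "blue": 0.0}
says_green = {"red": 0.0, "green": 1.0, "blue": 0.0}

# Accepting two conflicting statements at once leaves no possible value.
both = fuzzy_and(says_red, says_green)

# Accepting a statement or its negation rules nothing out.
either_way = fuzzy_or(says_red, fuzzy_not(says_red))

print(both)
print(either_way)
```

Plain set operations therefore lose the information that a conflict occurred, which is the gap that a representation keeping the propositional structure of the beliefs is meant to close.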

Fuzzy Coreference Resolution for Summarization



We present a fuzzy-theory based approach to coreference resolution and its application to text summarization.

Automatic determination of coreference between noun phrases is fraught with uncertainty. We show how fuzzy sets can be used to design a new coreference algorithm which captures this uncertainty in an explicit way and allows us to define varying degrees of coreference.

The algorithm is evaluated within a system that participated in the 10-word summary task of the DUC 2003 competition.
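
To make the idea of varying degrees of coreference concrete, the sketch below represents a coreference chain as a fuzzy set that maps each noun phrase to a membership degree. The merge rule, the threshold step, and all example values are assumptions chosen for illustration; they are not taken from the paper and should not be read as the authors' algorithm.

```python
def merge_chains(a, b, link):
    """Merge fuzzy chain b into fuzzy chain a through a link of certainty `link`.

    Members reached through the link are capped at the link's degree (min),
    and the strongest evidence for each noun phrase is kept (max).
    """
    merged = dict(a)
    for noun_phrase, degree in b.items():
        merged[noun_phrase] = max(merged.get(noun_phrase, 0.0), min(degree, link))
    return merged


def defuzzify(chain, threshold):
    """Turn a fuzzy chain into a crisp one by keeping members at or above threshold."""
    return sorted(np_ for np_, degree in chain.items() if degree >= threshold)


# Hypothetical chains: noun phrase -> degree of membership in the chain.
chain_a = {"the president": 1.0, "he": 0.8}
chain_b = {"Mr. Smith": 1.0, "the politician": 0.6}

merged = merge_chains(chain_a, chain_b, link=0.7)
print(merged)
print(defuzzify(merged, 0.7))   # lenient: more members, less certainty
print(defuzzify(merged, 0.9))   # strict: fewer members, more certainty
```

Choosing the threshold late lets a downstream task such as summarization decide how much uncertainty it is willing to accept.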

Using Knowledge-poor Coreference Resolution for Text Summarization


We present a system that produces 10-word summaries based on the single summarization strategy of outputting noun phrases representing the most important text entities (as represented by noun phrase coreference chains). The coreference chains were computed using fuzzy set theory combined with knowledge-poor coreference heuristics.
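
The summarization strategy described above can be sketched in a few lines. The ranking by chain length, the choice of the longest noun phrase as a chain's representative, and the example chains are all assumptions made for illustration; the abstract does not specify how importance or representatives were determined.

```python
def summarize(chains, max_words=10):
    """Build a short summary from coreference chains (lists of noun phrases).

    Assumption: a chain with more mentions denotes a more important entity.
    One representative noun phrase per chain is emitted while the word
    budget allows it.
    """
    ranked = sorted(chains, key=len, reverse=True)
    summary, used = [], 0
    for chain in ranked:
        # Hypothetical choice: the longest noun phrase is the most informative.
        representative = max(chain, key=lambda np_: len(np_.split()))
        cost = len(representative.split())
        if used + cost > max_words:
            continue  # skip phrases that would exceed the budget
        summary.append(representative)
        used += cost
    return summary


chains = [
    ["the new trade agreement", "it", "the agreement"],
    ["Canada", "the Canadian government"],
    ["Tuesday"],
]
summary = summarize(chains)
print("; ".join(summary))
```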

An Integration Architecture for User-Centric Document Creation, Retrieval, and Analysis



The different stages in the life-cycle of content—creation, storage, retrieval, and analysis—are usually regarded as distinct and isolated steps. In this paper we examine the synergies resulting from their integration within a single architecture.

Our goal is to employ such an architecture to improve user support for knowledge-intensive tasks. We present a case study from the area of building architecture, which is currently ongoing.



We present here the outline of an ongoing research effort to recognize, represent, and interpret attributive constructions such as reported speech in newspaper articles. The role of reported speech is attribution: the statement does not assert some information as 'true' but attributes it to some source. The description of the source and the choice of the reporting verb can express the reporter's level of confidence in the attributed material.

Supporting Reverse Engineering Tasks with a Fuzzy Repository Framework


Bad Honnef, the place to go!

Software reverse engineering (RE) is often hindered not by the lack of available data, but by an overabundance of it: the (semi-)automatic analysis of static and dynamic code information, data, and documentation results in a huge heap of often incomparable data. Additionally, the gathered information is typically fraught with various kinds of imperfections, for example conflicting information found in software documentation vs. program code.

Our approach to this problem is twofold: for the management of the diverse RE results, we propose the use of a repository that supports an iterative and incremental discovery process with the aid of a reverse engineer. To deal with imperfections, we propose to enhance the repository model with additional representation and processing capabilities based on fuzzy set theory and fuzzy belief revision.


Keywords: fuzzy reverse engineering, meta model, extension framework, iterative process, knowledge evolution

Multi-ERSS and ERSS 2004


Last year, we presented a system, ERSS, which constructed 10-word summaries in the form of a list of noun phrases. It was based on a knowledge-poor extraction of noun phrase coreference chains implemented on a fuzzy set-theoretic base. This year we present the performance of an improved version, ERSS 2004, and an extension of the same basic system: Multi-ERSS constructs 100-word extract summaries for clusters of texts. With very few modifications, we ran ERSS 2004 on Tasks 1 and 3 and Multi-ERSS on Tasks 2, 4, and 5, scoring generally above average in all but the linguistic quality aspects.

Enriching Protein Structure Visualizations with Mutation Annotations Obtained by Text Mining Protein Engineering Literature

Multiple Sequence Alignment


Protein structure visualization tools render images that allow the user to explore structural features of a protein. Context-specific information relating to a particular protein or protein family is not easily integrated and must be uploaded from databases or provided through manual curation of input files. We describe an approach that combines natural language processing with sequence analysis to retrieve mutation-specific annotations from full-text articles for rendering with protein structures.


Keywords: Text Mining, Protein Structure Annotation, Protein Function, ProSAT, Xylanase

Engineering a Semantic Desktop for Building Historians and Architects

Page scan from 'Handbuch der Architektur'


We analyse the requirements for an advanced semantic support of users—building historians and architects—of a multi-volume encyclopedia of architecture from the late 19th century. Novel requirements include the integration of content retrieval, content development, and automated content analysis based on natural language processing.

We present a system architecture for the detected requirements and its current implementation. A complex scenario demonstrates how a desktop supporting semantic analysis can contribute to specific, relevant user tasks.

Combining Biological Databases and Text Mining to support New Bioinformatics Applications

Alicante, Spain


A large amount of biological knowledge today is only available from full-text research papers. Since neither manual database curators nor users can keep up with the rapidly expanding volume of scientific literature, natural language processing approaches are becoming increasingly important for bioinformatics projects.

In this paper, we go beyond simply extracting information from full-text articles by describing an architecture that supports targeted access to information from biological databases using the results derived from text mining of research papers, thereby integrating information from both sources within a biological application.

The described architecture is currently being used to extract information about protein mutations from full-text research papers. Text mining results drive the retrieval of sequence information from protein databases and the use of algorithmic sequence analysis tools, which facilitate further data access from protein structure databases. Complex mapping of NLP-derived text annotations to protein structures allows the rendering, with 3D structure visualization, of information not available in databases of mutation annotations.