Multi-ERSS and ERSS 2004


Last year, we presented a system, ERSS, which constructed 10 word summaries in form of a list of noun phrases. It was based on a knowledge-poor extraction of noun phrase coreference chains implemented on a fuzzy set theoretic base. This year we present the performance of an improved version, ERSS 2004 and an extension of the same basic system: Multi-ERSS constructs 100-word extract summaries for clusters of texts. With very few modifications we ran ERSS 2004 on Tasks 1 and 3 and Multi-ERSS on Tasks 2, 4, and 5, scoring generally above average in all but the linguistic quality aspects.

Enriching Protein Structure Visualizations with Mutation Annotations Obtained by Text Mining Protein Engineering Literature

Multiple Sequence Alignment


Protein structure visualization tools render images that allow the user to explore structural features of a protein. Context specific information relating to a particular protein or protein family is not easily integrated and must be uploaded from databases or provided through manual curation of input files. We describe a mixed natural language processing and sequence analysis based approach for the retrieval of mutation specific annotations from full text articles for rendering with protein structures.


Text Mining, Protein Structure Annotation, Protein Function, ProSAT, Xylanase

Engineering a Semantic Desktop for Building Historians and Architects

Page scan from 'Handbuch der Architektur'


We analyse the requirements for an advanced semantic support of users—building historians and architects—of a multi-volume encyclopedia of architecture from the late 19th century. Novel requirements include the integration of content retrieval, content development, and automated content analysis based on natural language processing.

We present a system architecture for the detected requirements and its current implementation. A complex scenario demonstrates how a desktop supporting semantic analysis can contribute to specific, relevant user tasks.

Combining Biological Databases and Text Mining to support New Bioinformatics Applications

Alicante, Spain


A large amount of biological knowledge today is only available from full-text research papers. Since neither manual database curators nor users can keep up with the rapidly expanding volume of scientific literature, natural language processing approaches are becoming increasingly important for bioinformatic projects.

In this paper, we go beyond simply extracting information from full-text articles by describing an architecture that supports targeted access to information from biological databases using the results derived from text mining of research papers, thereby integrating information from both sources within a biological application.

The described architecture is currently being used to extract information about protein mutations from full-text research papers. Text mining results drive the retrieval of sequence information from protein databases and the employment of algorithmic sequence analysis tools, which facilitate further data access from protein structure databases. Complex mapping of NLP derived text annotations to protein structures allows the rendering, with 3D structure visualization, of information not available in databases of mutation annotations.

Agents and Databases: Friends or Foes?

Friendly Meetings in Vancouver


On first glance agent technology seems more like a hostile intruder into the database world. On the other hand, the two could easily complement each other, since agents carry out information processes whereas databases supply information to processes. Nonetheless, to view agent technology from a database perspective seems to question some of the basic paradigms of database technology, particularly the premise of semantic consistency of a database. The paper argues that the ensuing uncertainty in distributed databases can be modelled by beliefs, and develops the basic concepts for adjusting peer-to-peer databases to the individual beliefs in single nodes and collective beliefs in the entire distributed database.

The FungalWeb Ontology: Application Scenarios


The FungalWeb Ontology aims to support the data integration needs of enzyme biotechnology from inception to product roll out. Serving as a knowledge base for decision support, the conceptualization seeks to link fungal species with enzymes, enzyme substrates, enzyme classifications, enzyme modifications, enzyme retail and applications. We demonstrate how the FungalWeb Ontology supports this remit by presenting application scenarios, conceptualizations of the ontological frame able to support these scenarios and semantic queries typical of a Biotech Manager. Queries to the knowledge base are answered with description logic (DL) and automated reasoning tools.

A Self-Learning Context-Aware Lemmatizer for German

Vancouver Waterfront


Accurate lemmatization of German nouns mandates the use of a lexicon. Comprehensive lexicons, however, are expensive to build and maintain. We present a self-learning lemmatizer capable of automatically creating a full-form lexicon by processing German documents.

ERSS 2005: Coreference-Based Summarization Reloaded


Friendly Meetings in Vancouver
We present ERSS 2005, our entry to this year's DUC competition. With only slight modifications from last year's version to accommodate the more complex context information present in DUC 2005, we achieved a similar performance to last year's entry, ranking roughly in the upper third when examining the ROUGE-1 and Basic Element score.

We also participated in the additional manual evaluation based on the new Pyramid method and performed further evaluations based on the Basic Elements method and the automatic generation of Pyramids. Interestingly, the ranking of our system differs greatly between the different measures; we attempt to analyse this effect based on correlations between the different results using the Spearman coefficient.

Context-based Multi-Document Summarization using Fuzzy Coreference Cluster Graphs

The IPD cluster computing cluster summaries using a clustering algorithm :)


Constructing focused, context-based multi-document summaries requires an analysis of the context questions, as well as their corresponding document sets. We present a fuzzy cluster graph algorithm that finds entities and their connections between context and documents based on fuzzy coreference chains and describe the design and implementation of the ERSS summarizer implementing these ideas.

Mutation mining - A prospector's tale

Screenshot of ProSAT/Webmol with MutationMiner annotations


Protein structure visualization tools render images that allow the user to explore structural features of a protein. Context specific information relating to a particular protein or protein family is, however, not easily integrated and must be uploaded from databases or provided through manual curation of input files. Protein Engineers spend considerable time iteratively reviewing both literature and protein structure visualizations manually annotated with mutated residues. Meanwhile, text mining tools are increasingly used to extract specific units of raw text from scientific literature and have demonstrated the potential to support the activities of Protein Engineers.

The transfer of mutation specific raw-text annotations to protein structures requires integrated data processing pipelines that can co-ordinate information retrieval, information extraction, protein sequence retrieval, sequence alignment and mutant residue mapping. We describe the Mutation Miner pipeline designed for this purpose and present case study evaluations of the key steps in the process. Starting with literature about mutations made to protein families; haloalkane dehalogenase, bi-phenyl dioxygenase, and xylanase we enumerate relevant documents available for text mining analysis, the available electronic formats, and the number of mutations made to a given protein family. We review the efficiency of NLP driven protein sequence retrieval from databases and report on the effectiveness of Mutation Miner in mapping annotations to protein structure visualizations. We highlight the feasibility and practicability of the approach.


Text mining - Protein structure annotation - Protein mutation - Data mining - Haloalkane dehalogenase - Biphenyl dioxygenase - Xylanase

Syndicate content