OntoQuery

Contact:
het.id@cbs.dk

Ontology-based Querying

Project description

This project falls within the priority subfield 'data and information management' (data- og informationshåndtering) mentioned in the report National delstrategi for IT-forskning ('A National Substrategy for Research in Information Technology'), the Danish Ministry of Research, 22.6.98. In addition, the project focusses on problems concerning the development of the information society, also emphasised in the report, especially pertaining to user interfaces and Internet searching of special relevance for Danish-speaking users.

1 Introduction
1.1 Background
1.2 Research Objectives
1.3 Research Aspects
2 Research Approach and Methods
2.1 Ontologies for Application Domains
2.2 A Description Language for Ontologies
2.3 Ontology-based Semantic Analysis of Texts
2.4 Development of an Experimental System
2.5 Related Contemporary Research and International Collaboration
3 Organisation of the project
4 Project Research Plan
4.1 Form and Extent of Results to be Reported
5 References

1 Introduction

The intended outcome of the project is a coherent, general theory for

ontological representation of domain knowledge,
ontological semantics for natural language phrases, and
ontology-based search in text databases.

A key idea in the project is to introduce and further develop a formal concept language in which to express this theory. The theoretical results will, for purposes of validation and demonstration, be exploited in the development of a prototype system with a set of accompanying tools and resources on selected real world domains.

1.1 Background

The rapid development of information technology with recent media such as the Internet and CD-ROM, which provide huge sources of often loosely structured information, calls for improved methods and systems for content-oriented computational querying. Contemporary IT applications are characterised by complex information sources where formatted and free-format text sources are mixed. Such information sources require content-based flexible querying and data-structuring methodologies. Traditional tools for data-structuring such as relational databases and more recent methods such as object-oriented data models fail to meet these demands. The present project aims to investigate and develop flexible, ontology-based formal description languages for data structuring. Interdisciplinary research in this area is an important prerequisite for the development of information handling tools of general public utility.

1.2 Research Objectives

The overall goal of the proposed project is to make scientific contributions in the form of theories and methods involving ontologies for the representation of domain knowledge, for ontological semantics for natural language phrases, and for ontology-based search strategies.
Ontologies are conceptual models, which transcend linguistic dictionaries for lexical semantics in the methods used for incorporating domain-oriented semantic relations. On the other hand, the ontology is not a knowledge base, since only part of the relevant information of the target application is formalised in the ontology. Rather, ontologies constitute type structure skeletons, which can be gradually extended to form full-fledged knowledge bases. Such an extension is facilitated by the ideas laid out below, and is in accordance with an ambitious long-term objective of the research project.
The project aims to contribute to the development of general solutions to the querying of databases and to the extraction of descriptions of database objects through limited computational natural language understanding. More precisely, the project addresses content-based retrieval and access to Danish text sources such as online document databases and encyclopaedias.
Stressing the use of ontologies, the project will provide a content-based query and retrieval functionality going beyond the superficial key word recognition typical of contemporary search engines, whilst not attempting a full semantic analysis of source texts. Thus this project calls for the integration of contributions from knowledge representation, natural language processing, terminology systems and database querying.

1.3 Research Aspects

The general goal stated above can be split up into the following research issues.

Development of theories, methods and tools for establishing formal ontologies integrated with language specific terminology and lexical networks. A key research idea is to introduce a distinguished formal language for ontologies, the Unified Description Language, UDL, which combines classification taxonomies with object and relational expression forms. This formal language is intended for representations of abstracted information from text sources and queries.
Development of methods for ontology-based linguistic analysis of source texts and queries. This will primarily concern the identification and analysis of noun phrases, comprising morphological, syntactic and semantic analysis. In our approach the noun phrases are central for the specification of particular concepts in an ontology.
Development of methodologies for ontology-based query processing that efficiently compare an internal formal description of a query with the ontological descriptions of database objects. The query processing reduces to a matching of the query description with the descriptions of text database objects in the framework of the given ontology.

2 Research Approach and Methods

In our opinion, the interaction between theoretical research and experimental work is essential for ensuring the viability of the project. Therefore theoretical research will proceed hand in hand with the development of a prototype in which ontologies serve to obtain the central functionality described below.
The prototype will comprise a data- and knowledge base, a Description Generator (DG) and a Query Evaluator (QE). The data- and knowledge base contains domain knowledge (comprising a concept dictionary, the ontology and additional concept relations), a number of text-objects and formal UDL-descriptions of the text-objects. These descriptions are generated by the DG module, which applies an NP-grammar and uses the domain knowledge, primarily the ontology, for disambiguation of expressions in the texts. A user can search for texts in the database by posing a query in natural language. The DG will produce a UDL-description of the query, and the Query Evaluator will match this description and the text-descriptions to find relevant texts or text fragments.
The use of an ontology will make it possible for a user to find a text containing the phrase tilskud til energibesparende foranstaltninger ('contributions to energy-saving devices') by querying for støtte til solvarme ('support for solar heating'). The ontology will make it clear that the expressions tilskud and støtte both refer to the same concept ('financial support'), and that the concept referred to with solvarme is a specialisation of the concept referred to with energibesparende foranstaltning. Hence retrieval is content-based (meaning-based), in contrast to the string-based retrieval in text processors, databases, and search engine systems. This intended system functionality creates interesting research problems involving integration and comparison of information from text queries and ontologies.
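The retrieval example above can be illustrated with a minimal sketch. It is not the project's Description Generator or Query Evaluator: the names LEXICON, IS_A, subsumes and matches are hypothetical, the phrases are assumed to be already reduced to their content terms, and the matching rule (every query concept must be equal to, or a specialisation of, some concept in the text) is only one plausible reading of the example.

```python
# Illustrative assumptions only: a tiny concept dictionary and taxonomy.

# Concept dictionary: Danish term -> concept identifier (hypothetical names).
LEXICON = {
    "tilskud": "financial_support",
    "støtte": "financial_support",
    "solvarme": "solar_heating",
    "energibesparende foranstaltninger": "energy_saving_measure",
}

# Taxonomy: concept -> direct parent concept (conceptual inclusion).
IS_A = {
    "solar_heating": "energy_saving_measure",
}


def subsumes(general, specific):
    """Return True if `general` equals `specific` or is one of its ancestors."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = IS_A.get(current)
    return False


def to_concepts(terms):
    """Map the terms known to the lexicon to their concepts."""
    return [LEXICON[t] for t in terms if t in LEXICON]


def matches(query_terms, text_terms):
    """Assumed rule: each query concept is covered by some text concept."""
    query_concepts = to_concepts(query_terms)
    text_concepts = to_concepts(text_terms)
    return bool(query_concepts) and all(
        any(subsumes(tc, qc) for tc in text_concepts)
        for qc in query_concepts
    )


query = ["støtte", "solvarme"]
text = ["tilskud", "energibesparende foranstaltninger"]

string_based = any(term in text for term in query)  # no shared strings
concept_based = matches(query, text)                # shared or related concepts
```

Here the string-based comparison finds nothing in common, whereas the concept-based comparison succeeds because both support terms map to the same concept and the solar heating concept lies below the energy-saving concept in the taxonomy.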

2.1 Ontologies for Application Domains

An ontology is a, possibly simplified, description of an application domain comprising the essential concepts in the domain and the important semantic relationships between these concepts. The core of an ontology is a taxonomy describing the domain concepts and their conceptual inclusion relationships (hyponymy/hyperonymy).
Besides the taxonomy the ontology comprises a number of semantic relations pertaining to the understanding of the target domain. Some of these relations are linguistic in nature, such as (near) synonymy and antonymy relationships; others are more oriented towards a logical understanding of the domain, with relationships such as part-of (partonomy), causes, serves etc. Thus ontologies stress the analysis of logical relationships between concepts in a domain.
Ontologies bear similarity to the conceptual models applied in database modelling and to terminological concept systems. Ontological approaches are, however, typically distinguished by the adoption of more expressive logical description formalisms, a higher level of ambition with respect to the identification of conceptual commonalities across varieties of application domains, and the incorporation of linguistic description components.
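As a rough illustration of this structure, the sketch below stores a taxonomy together with named semantic relations. The class Ontology, its methods, and all concept names are hypothetical and chosen only for illustration; the project text does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Hypothetical container: a taxonomy plus named semantic relations."""
    is_a: dict = field(default_factory=dict)      # concept -> set of direct parents
    relations: set = field(default_factory=set)   # (relation, source, target) triples

    def add_is_a(self, child, parent):
        self.is_a.setdefault(child, set()).add(parent)

    def add_relation(self, name, source, target):
        self.relations.add((name, source, target))

    def ancestors(self, concept):
        """All concepts that include `concept`, found by walking is_a upwards."""
        seen = set()
        stack = [concept]
        while stack:
            for parent in self.is_a.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def related(self, name, source):
        """Direct targets of the relation `name` starting at `source`."""
        return {t for (n, s, t) in self.relations if n == name and s == source}


# Illustrative content (assumed, not taken from a real domain ontology).
onto = Ontology()
onto.add_is_a("solar_heating", "energy_saving_measure")
onto.add_is_a("energy_saving_measure", "measure")
onto.add_relation("part_of", "solar_collector", "solar_heating")
onto.add_relation("near_synonym", "subsidy", "grant")

solar_ancestors = onto.ancestors("solar_heating")
collector_wholes = onto.related("part_of", "solar_collector")
```

Keeping the taxonomy separate from the other relations reflects the description above, in which conceptual inclusion forms the core and relations such as part-of or near-synonymy are added around it.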

2.2 A Description Language for Ontologies

A central novel methodological characteristic of the project is the adoption of a common formal language of concepts and relationships between concepts, referred to above as UDL. The key idea is to unify taxonomy languages with object classification structures and relational database objects.
The UDL is intended primarily as a theoretical, logical framework, in which the different traditional representations at the various levels of analysis are conceptualised, analysed, and integrated. Thus a strategic purpose is to facilitate coherence in the resulting system architecture. The language consists of descriptions, which serve multiple purposes ranging from feature structures in the linguistic analysis, via lexical semantic bases and terminology bases to ontologies and query descriptions.
It is proposed that this description language be a binary relational logic, thereby complying with contemporary relational and algebraic approaches in lexical semantics, as well as the use of description logic in knowledge bases and feature-structure analysis. An extended relational approach is distinguished by supporting database-relational, object-oriented, logical and algebraic approaches in one formal language. This means that the framework simultaneously supports ontology building, linguistic analysis, automatic abstraction of text descriptions, and computation of database queries.
The relational algebraic descriptions become theoretically well-founded as an algebraic logic, which is distinguished also by supporting the notion of lattices, which is crucial to ontologies. It contains operators for combining descriptions to form compound descriptions, and for matching descriptions, thus forming a logical calculus. It situates the well-known record/frame/feature structures within algebraic lattices for handling the dimensions of a taxonomy and a partonomy.
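The idea of compound descriptions that can be combined and matched within a lattice can be sketched as follows. This is not UDL, whose syntax and semantics are not given here; the class Desc, the functions concept_leq, concept_join and subsumed_by, the role name "for", and the tree-shaped taxonomy are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed tree-shaped taxonomy: concept -> direct parent.
IS_A = {
    "solar_heating": "energy_saving_measure",
    "energy_saving_measure": "measure",
    "subsidy": "financial_support",
}


def concept_leq(specific, general):
    """True if `specific` is `general` or lies below it in the taxonomy."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = IS_A.get(current)
    return False


def concept_join(a, b):
    """Least common generalisation of two concepts, or None if there is none."""
    current = a
    while current is not None:
        if concept_leq(b, current):
            return current
        current = IS_A.get(current)
    return None


@dataclass(frozen=True)
class Desc:
    """A frame-like description: a concept plus (role, filler) pairs."""
    concept: str
    roles: tuple = ()

    def with_role(self, role, filler):
        """Combine this description with a filler to form a compound description."""
        return Desc(self.concept, self.roles + ((role, filler),))


def subsumed_by(specific, general):
    """Match: `specific` is at least as specific as `general`, role by role."""
    if not concept_leq(specific.concept, general.concept):
        return False
    fillers = dict(specific.roles)
    return all(
        role in fillers and subsumed_by(fillers[role], filler)
        for role, filler in general.roles
    )


general = Desc("financial_support").with_role("for", Desc("energy_saving_measure"))
specific = Desc("subsidy").with_role("for", Desc("solar_heating"))

specific_fits = subsumed_by(specific, general)
general_fits = subsumed_by(general, specific)
common = concept_join("solar_heating", "energy_saving_measure")
```

In this sketch the taxonomy supplies the ordering on atomic concepts, and the record-like role structure extends that ordering to compound descriptions, which is the sense in which frames are situated within a lattice.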

2.3 Ontology-based Semantic Analysis of Texts

The use of natural language for database querying traditionally relies on a logical semantics, which determines the translation from syntax trees into a logical query language. This technique has to restrict the query language to a small fragment of natural language, since a full computational semantic treatment of comprehensive natural language fragments is far beyond the scope of current language technology.
As an alternative approach this project introduces an ontology-based semantic analysis for natural language texts and query phrases, which refrains from a full logical analysis of the meaning of natural language texts. As a modest starting point we will apply the ontological semantics characterised above to the analysis and disambiguation of NP heads.
At a later stage, we will extend this analysis to more complex NPs, including in particular adjectives, prepositional phrases (both complements and adjuncts) and genitives using semantic relations from the ontology.

2.4 Development of an Experimental System

As already mentioned, it is a high priority of the project that the work on theoretical issues should be complemented by practical, experimental development. The research ideas and approaches sketched above are to be incorporated in an experimental prototype for validation purposes. Moreover, the development of such a prototype provides a framework which makes possible collaboration with potential user partners who have specific application needs. Affirmative contacts concerning co-operation have already been established with the following two organisations:
Statens Information (the Danish National Information Service), which has the explicit aim of improving public information service on the Internet.
Danmarks Nationalleksikon (the Danish National Encyclopaedia), which plans an electronic publication of Den Store Danske Encyklopædi (The Large Danish Encyclopaedia).
The intention is to restrict prototype work to a few domains.

2.5 Related Contemporary Research and International Collaboration

There are numerous connections between this project and ongoing research in knowledge-based systems and computational linguistics. Similar research systems with this functionality are emerging, e.g. OntoSeek (cf. Guarino et al. 1998). In our opinion, however, the themes and particular approaches outlined above give this project an individual profile. In particular, the integration of ontology-based analysis of phrases in a framework that is simultaneously relational, logical, algebraic and object-oriented is novel.
The most interesting projects in the area of large-scale wordnets and higher level ontologies are WordNet and EuroWordNet (cf. Vossen 1998). A preliminary resource involving Danish is currently being built at CST in the so-called SIMPLE-project (Semantic Information for Multifunctional Plurilingual Lexica). The present project plans to exploit this resource as a starting point for the semantic encoding, and collaboration with Nicoletta Calzolari & Alessandro Lenci, Dipartimento di Linguistica, Università di Pisa, and with Federica Busa & James Pustejovsky, Computer Science Department, Brandeis University, will be continued.
Members of the HHK and SDU groups have been working in close contact with Barbara Partee, University of Massachusetts at Amherst, and Vladimir Borschev, VINITI Moscow, on the integration of lexical and formal semantics. This collaboration will continue.
The project possesses a first-hand perspective on ongoing research within databases and knowledge bases since the RUC group has arranged a series of international workshops and conferences under the title "Flexible Query Answering Systems (FQAS)" ('94, '96, '98) attracting researchers from many countries (Andreasen, 1997, Christiansen, 1998). Numerous research contacts have been established at these events and the FQAS conference series continues under the auspices of an international program committee. The RUC group has close contacts and collaboration with ENSSAT, Université de Rennes, Lannion, France (Patrick Bosc, Olivier Pivert), Dept. of Computer Science & AI, University of Granada, Spain (Amparo Vila) and Machine Intelligence Institute, Iona College, USA (Ronald Yager).
The DTU group is collaborating with various European research groups in knowledge bases, e.g. the Uppsala Computational logic and knowledge engineering group. Furthermore the DTU group organised the international conference in Information Modelling and Knowledge Bases in 1996 reported in (Kangassalo, Nilsson et al., 1997). Finally the RUC and DTU groups have collaborated in the joint project "Intelligente Søgesystemer" (Intelligent Search Systems, 1994-1997) supported by The Danish Natural Science Research Council.

3 Organisation of the project

The project involves 5 project partners with the following areas of research specialisation:

Roskilde University (RUC): knowledge-based systems, intelligent query processing, database technology
Technical University of Denmark (DTU): knowledge representation, knowledge modelling, knowledge-based systems
Copenhagen Business School (HHK): terminological concept analysis, natural language querying systems, NP analysis (genitives, compounds, nominalisations)
Centre for Language Technology (CST): lexical networks, NP analysis, morphological analysis, NLP lexicon for Danish
University of Southern Denmark (SDU): natural language querying systems, NP analysis (prepositional phrases, genitives, compounds), HPSG-to-TAG algorithm

It is expected that a number of PhD students fully financed by the institutions will participate in the project. One PhD student will be financed by the project, and the project partners will organise workshops, schools and courses throughout the project period as specified in the research plan below.
A development group consisting of RUC and CST will be responsible for the development of the experimental prototypes, and a management committee consisting of Troels Andreasen (RUC), Jørgen Fischer Nilsson (DTU) and Hanne Erdman Thomsen (HHK) will be responsible for the technical and administrative management of the project.

4 Project Research Plan

The total period of the project is divided into three phases: the first year, the second and third years, and the fourth year. In order to ensure the time resources necessary for innovative research work, it has been decided to adopt a commercially available object-oriented relational database system (Oracle) as a platform for the whole OntoQuery prototype system. Components in this system will thus take the form of database application code, to allow for efficient development with a minimum of programming effort. Furthermore, the prototype development is planned such that it can proceed without depending on final results from the research tasks.

4.1 Form and Extent of Results to be Reported

The activities in the project are divided into theoretical research and experimental development. The deliverables will take the following forms:

Theoretical research, typically carried out in subgroups, will be reported in papers presented at regular internal and open seminars (workshops), or published internationally.
Experimental system development, comprising system component design, typically obtained as spin-offs from the research groups and reported in the papers, with supplementary technical report documentation in relevant cases, as well as development, test and experimental validation of system modules and integration of these into the system prototype.

The main effort in the project will be put into the theoretical aspects. It is, however, considered essential for the successful outcome of the project, that experiments on ideas are carried out, that real-world domains and applications are dealt with, and that continuous interaction between the theoretical and the experimental work is ensured.
The papers for publication are planned to be written in groups of typically 3 persons composed so as to ensure cross-institutional as well as cross-disciplinary (Comp.Sc./Comp.ling.) representation. This is in the interest of promoting coherence of results from the project research tasks as well as interlocking of developed system modules. For each year 5 to 10 papers are expected, and each participant is expected to co-author at least one paper per year (on average two).
Finally the project group is to host a number of research events on issues closely related to the project. For each year: 2-3 open events (primarily workshops and seminars), and during the 4-year project period: 2 PhD seminars and 1 international summer school. Already included in the plans, at this point, are a workshop on "Ontology-based NP-interpretation", January 2000, and a conference, the 5th "Flexible Query-Answering Systems", FQAS 2002. (The 4th, FQAS 2000, will be co-chaired by the RUC group, but is already planned to take place in France.)

5 References

Andreasen, T., Christiansen, H., Legind Larsen, H. (eds.) (1997): Flexible Query Answering Systems, Kluwer Academic Publishers. Edited book with contributions, describing the area.

Christiansen, H., Larsen, H.L., Andreasen, T. (1998): International Conference on Flexible Query Answering Systems, Roskilde, Denmark. Revised Conference papers, Lecture Notes in Artificial Intelligence, Springer Verlag, Berlin.

Guarino, N., Masolo, C., and Vetere, G. (1998): OntoSeek: Using Large Linguistic Ontologies for Gathering Information Resources from the Web. LADSEB-CNR Int. Rep. 02/98, March 1998.

Kangassalo, H., Nilsson, J. F. et al. (eds.) (1997): Information Modelling and Knowledge Bases VIII, IOS Press.

Vossen, P. (ed.) (1998): EuroWordNet, A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, The Netherlands.

Hanne Erdman Thomsen <het.id@cbs.dk>