Project description
This project falls within the priority subfield 'data and information management' (data- og informationshåndtering) mentioned in the report National delstrategi for IT-forskning ('A National Substrategy for Research in Information Technology'), the Danish Ministry of Research, 22.6.98. In addition, the project focusses on problems concerning the development of the information society, also emphasised in the report, especially pertaining to user interfaces and Internet searching of special relevance for Danish-speaking users.
1.2 Research Objectives
1.3 Research Aspects
2 Research Approach and Methods
2.1 Ontologies for Application Domains
2.2 A Description Language for Ontologies
2.3 Ontology-based Semantic Analysis of Texts
2.4 Development of an Experimental System
2.5 Related Contemporary Research and International Collaboration
3 Organisation of the project
4 Project Research Plan
4.1 Form and Extent of Results to be Reported
1.1 Background
The rapid development of information technology with recent media such as the Internet and CD-ROM, which provide huge sources of often loosely structured information, calls for improved methods and systems for content-oriented computational querying. Contemporary IT applications are characterised by complex information sources where formatted and free-format text sources are mixed. Such information sources require content-based flexible querying and data-structuring methodologies. Traditional tools for data-structuring such as relational databases and more recent methods such as object-oriented data models fail to meet these demands. The present project aims to investigate and develop flexible, ontology-based formal description languages for data structuring. Interdisciplinary research in this area is an important prerequisite for the development of information handling tools of general public utility.
1.2 Research Objectives
The overall goal of the proposed project is to make scientific contributions in the form of theories and methods involving ontologies for the representation of domain knowledge, for ontological semantics for natural language phrases, and for ontology-based search strategies.
Ontologies are conceptual models, which transcend linguistic dictionaries for lexical semantics in the methods used for incorporating domain-oriented semantic relations. On the other hand, the ontology is not a knowledge base, since only part of the relevant information of the target application is formalised in the ontology. Rather, ontologies constitute type structure skeletons, which can be gradually extended to form full-fledged knowledge bases. Such an extension is facilitated by the ideas laid out below, and is in accordance with an ambitious long-term objective of the research project.
The project aims to contribute to the development of general solutions to the querying of databases and to the extraction of descriptions of database objects through limited computational natural language understanding. More precisely, the project addresses content-based retrieval and access to Danish text sources such as online document databases and encyclopaedias.
Stressing the use of ontologies, the project will provide a content-based query and retrieval functionality going beyond the superficial key word recognition typical of contemporary search engines, whilst not attempting a full semantic analysis of source texts. Thus this project calls for the integration of contributions from knowledge representation, natural language processing, terminology systems and database querying.
1.3 Research Aspects
The general goal stated above can be split up into the following research issues.
2 Research Approach and Methods
In our opinion, the interaction between theoretical research and experimental work is essential for ensuring the viability of the project. Therefore theoretical research will proceed hand in hand with the development of a prototype in which ontologies serve to obtain the central functionality described below.
The prototype will comprise a data- and knowledge base, a Description Generator (DG) and a Query Evaluator (QE). The data- and knowledge base contains domain knowledge (comprising a concept dictionary, the ontology and additional concept relations), a number of text-objects and formal UDL-descriptions of the text-objects. These descriptions are generated by the DG module, which applies an NP-grammar and uses the domain knowledge, primarily the ontology, for disambiguation of expressions in the texts. A user can search for texts in the database by posing a query in natural language. The DG will produce a UDL-description of the query, and the Query Evaluator will match this description and the text-descriptions to find relevant texts or text fragments.
The use of an ontology will make it possible for a user to find a text containing the phrase tilskud til energibesparende foranstaltninger ('contributions to energy-saving devices') by querying for støtte til solvarme ('support for solar heating'). The ontology will make it clear that the expressions tilskud and støtte both refer to the same concept ('financial support'), and that the concept referred to with solvarme is a specialisation of the concept referred to with energibesparende foranstaltning. Hence retrieval is content-based (meaning-based), in contrast to the string-based retrieval in text-processors, databases, and search engine systems. This intended system functionality creates interesting research problems involving integration and comparison of information from text queries and ontologies.
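The retrieval example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's UDL or its DG and QE modules: the concept names, the dictionary layout and the functions `describe`, `subsumes` and `matches` are all hypothetical, and a phrase is simply split at the preposition til.

```python
# Toy sketch of ontology-based matching; all names and data are hypothetical.

# Concept dictionary: expression -> concept (tilskud and støtte share one concept).
LEXICON = {
    "tilskud": "financial_support",
    "støtte": "financial_support",
    "solvarme": "solar_heating",
    "energibesparende foranstaltninger": "energy_saving_device",
}

# Taxonomy: concept -> direct hyperonym (more general concept).
IS_A = {"solar_heating": "energy_saving_device"}


def ancestors(concept):
    """Return the concept followed by all of its hyperonyms."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its hyperonyms."""
    return general in ancestors(specific)


def describe(phrase):
    """Map a phrase of the form 'X til Y' to a pair of concepts."""
    head, _, argument = phrase.partition(" til ")
    return LEXICON[head], LEXICON[argument]


def matches(query, text):
    """A text matches when each of its concepts subsumes the query's concept."""
    q_head, q_arg = describe(query)
    t_head, t_arg = describe(text)
    return subsumes(t_head, q_head) and subsumes(t_arg, q_arg)


print(matches("støtte til solvarme",
              "tilskud til energibesparende foranstaltninger"))
```

A plain string comparison of the two phrases finds no shared content word, whereas the concept-level comparison succeeds because the synonym pair and the specialisation are both recorded in the toy ontology.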
2.1 Ontologies for Application Domains
An ontology is a (possibly simplified) description of an application domain comprising the essential concepts in the domain and the important semantic relationships between these concepts. The core of an ontology is a taxonomy describing the domain concepts and their conceptual inclusion relationships (hypo-/hyperonymy).
Besides the taxonomy, the ontology comprises a number of semantic relations pertaining to the understanding of the target domain. Some of these relations are linguistic in nature, such as (near) synonymy and antonymy relationships; others are more oriented towards a logical understanding of the domain, with relationships such as part-of (partonomy), causes, serves etc. Thus ontologies stress the analysis of logical relationships between concepts in a domain.
Ontologies bear similarity to the conceptual models applied in database modelling and to terminological concept systems. Ontological approaches are, however, typically distinguished by the adoption of more expressive logical description formalisms, a higher level of ambition with respect to the identification of conceptual commonalities across varieties of application domains, and the incorporation of linguistic description components.
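One way to picture such a structure is a taxonomy plus a set of typed relation triples. The sketch below is only an illustration: the class `Ontology`, its field names and the sample concepts are hypothetical, and letting a concept inherit the relations of its hyperonyms is an assumption made here, not something the project description states.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # Taxonomy: concept -> direct hyperonym.
    is_a: dict = field(default_factory=dict)
    # Further semantic relations as (source, relation, target) triples.
    relations: set = field(default_factory=set)

    def hyperonyms(self, concept):
        """All more general concepts, nearest first."""
        result = []
        while concept in self.is_a:
            concept = self.is_a[concept]
            result.append(concept)
        return result

    def related(self, concept, relation):
        """Targets of `relation` for the concept or any of its hyperonyms.

        Inheritance along the taxonomy is an assumption of this sketch.
        """
        sources = [concept] + self.hyperonyms(concept)
        return {t for (s, r, t) in self.relations
                if r == relation and s in sources}


# Hypothetical sample data.
onto = Ontology(
    is_a={
        "solar_heating": "energy_saving_device",
        "energy_saving_device": "device",
    },
    relations={
        ("solar_collector", "part_of", "solar_heating"),
        ("energy_saving_device", "serves", "energy_saving"),
    },
)

print(onto.hyperonyms("solar_heating"))
print(onto.related("solar_heating", "serves"))
```

Keeping the taxonomy separate from the other relations mirrors the distinction drawn above between the taxonomic core and relations such as part-of or serves.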
2.2 A Description Language for Ontologies
A central novel methodological characteristic of the project is the adoption of a common formal language of concepts and relationships between concepts, referred to above as UDL. The key idea is to unify taxonomy languages with object classification structures and relational database objects.
The UDL is intended primarily as a theoretical, logical framework, in which the different traditional representations at the various levels of analysis are conceptualised, analysed, and integrated. Thus a strategic purpose is to facilitate coherence in the resulting system architecture. The language consists of descriptions, which serve multiple purposes ranging from feature structures in the linguistic analysis, via lexical semantic bases and terminology bases to ontologies and query descriptions.
It is proposed that this description language be a binary relational logic, thereby complying with contemporary relational and algebraic approaches in lexical semantics, as well as the use of description logic in knowledge bases and feature-structure analysis. An extended relational approach is distinguished by supporting database-relational, object-oriented, logical and algebraic approaches in one formal language. This means that the framework simultaneously supports ontology building, linguistic analysis, automatic abstraction of text descriptions, and computation of database queries.
The relational algebraic descriptions become theoretically well-founded as an algebraic logic, which is distinguished also by supporting the notion of lattices, which is crucial to ontologies. It contains operators for combining descriptions to form compound descriptions, and for matching descriptions, thus forming a logical calculus. It situates the well-known record/frame/feature structures within algebraic lattices for handling the dimensions of a taxonomy and a partonomy.
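The lattice view of descriptions can be illustrated with a small sketch. It is not the UDL itself: it assumes a tree-shaped taxonomy with a single root, represents a description as a flat attribute-to-concept mapping, and uses hypothetical names throughout (`join`, `join_descriptions`, `matches`, and the sample concepts).

```python
# Hypothetical tree-shaped taxonomy: concept -> direct hyperonym.
IS_A = {
    "solar_heating": "energy_saving_device",
    "insulation": "energy_saving_device",
    "energy_saving_device": "thing",
    "grant": "financial_support",
    "loan": "financial_support",
    "financial_support": "thing",
}


def chain(concept):
    """The concept followed by its hyperonyms up to the root."""
    path = [concept]
    while path[-1] in IS_A:
        path.append(IS_A[path[-1]])
    return path


def join(a, b):
    """Least common hyperonym of two concepts (their lattice join)."""
    b_chain = set(chain(b))
    for concept in chain(a):
        if concept in b_chain:
            return concept
    raise ValueError("no common hyperonym")


def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its hyperonyms."""
    return general in chain(specific)


def join_descriptions(d1, d2):
    """Generalise two descriptions: keep shared attributes, join their values."""
    return {key: join(d1[key], d2[key]) for key in d1.keys() & d2.keys()}


def matches(general, specific):
    """Description-level subsumption, attribute by attribute."""
    return all(key in specific and subsumes(value, specific[key])
               for key, value in general.items())


d1 = {"head": "grant", "purpose": "solar_heating"}
d2 = {"head": "loan", "purpose": "insulation"}
print(join_descriptions(d1, d2))
```

In this toy setting the generalised description subsumes both inputs, which is the property that lets a compound description be matched against more specific text descriptions.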
2.3 Ontology-based Semantic Analysis of Texts
The use of natural language for database querying traditionally relies on a logical semantics, which determines the translation from syntax trees into a logical query language. This technique has to restrict the query language to a small fragment of natural language, since a full computational semantic treatment of comprehensive natural language fragments is far beyond the scope of current language technology.
As an alternative approach this project introduces an ontology-based semantic analysis for natural language texts and query phrases, which refrains from a full logical analysis of the meaning of natural language texts. As a modest starting point we will apply the ontological semantics characterised above to the analysis and disambiguation of NP heads.
At a later stage, we will extend this analysis to more complex NPs, including in particular adjectives, prepositional phrases (both complements and adjuncts) and genitives using semantic relations from the ontology.
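As a rough illustration of how semantic relations might help disambiguate an NP head, consider the sketch below. The project description does not specify the procedure, so everything here is an assumption: the two senses listed for støtte, the table `ADMISSIBLE` of argument types licensed per sense and preposition, and the function names are invented for the example.

```python
# Hypothetical sense inventory: an ambiguous head noun with two candidate concepts.
SENSES = {"støtte": ["financial_support", "physical_prop"]}

# Hypothetical concept dictionary for unambiguous arguments.
LEXICON = {"solvarme": "solar_heating", "tag": "roof"}

# Hypothetical taxonomy: concept -> direct hyperonym.
IS_A = {
    "solar_heating": "energy_saving_device",
    "roof": "building_part",
}

# Assumed table: (head sense, preposition) -> most general admissible argument.
ADMISSIBLE = {
    ("financial_support", "til"): "energy_saving_device",
    ("physical_prop", "til"): "building_part",
}


def subsumed_by(concept, ancestor):
    """True if `concept` equals `ancestor` or lies below it in the taxonomy."""
    while True:
        if concept == ancestor:
            return True
        if concept not in IS_A:
            return False
        concept = IS_A[concept]


def disambiguate(head, preposition, argument):
    """Keep only the head senses whose licensed argument type fits the argument."""
    arg_concept = LEXICON[argument]
    return [sense for sense in SENSES[head]
            if (sense, preposition) in ADMISSIBLE
            and subsumed_by(arg_concept, ADMISSIBLE[(sense, preposition)])]


print(disambiguate("støtte", "til", "solvarme"))
print(disambiguate("støtte", "til", "tag"))
```

The sketch returns a list rather than a single sense, so that a phrase the ontology cannot resolve stays visibly ambiguous instead of being forced to one reading.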
2.4 Development of an Experimental System
As already mentioned, it is a high priority of the project that the work on theoretical issues should be complemented by practical, experimental development. The research ideas and approaches sketched above are to be incorporated in an experimental prototype for validation purposes. Moreover, the development of such a prototype provides a framework which makes possible collaboration with potential user partners who have specific application needs. Affirmative contacts concerning co-operation have already been established with the following two organisations:
Statens Information (the Danish National Information Service), which has the explicit aim of improving public information service on the Internet.
Danmarks Nationalleksikon (the Danish National Encyclopaedia), which plans an electronic publication of Den Store Danske Encyklopædi (The Large Danish Encyclopaedia).
The intention is to restrict prototype work to a few domains.
2.5 Related Contemporary Research and International Collaboration
There are numerous connections between this project and ongoing research in knowledge-based systems and computational linguistics. Research systems with similar functionality are emerging, e.g. OntoSeek (cf. Guarino et al. 1998). In our opinion, however, the themes and particular approaches outlined above give this project an individual profile. In particular, the integration of ontology-based analysis of phrases in a framework that is simultaneously relational, logical, algebraic and object-oriented is novel.
The most interesting projects in the area of large-scale wordnets and higher level ontologies are WordNet and EuroWordNet (cf. Vossen 1998). A preliminary resource involving Danish is currently being built at CST in the so-called SIMPLE-project (Semantic Information for Multifunctional Plurilingual Lexica). The present project plans to exploit this resource as a starting point for the semantic encoding, and collaboration with Nicoletta Calzolari & Alessandro Lenci, Dipartimento di Linguistica, Università di Pisa, and with Federica Busa & James Pustejovsky, Computer Science Department, Brandeis University, will be continued.
Members of the HHK and SDU groups have been working in close contact with Barbara Partee, University of Massachusetts at Amherst, and Vladimir Borschev, VINITI Moscow, on the integration of lexical and formal semantics. This collaboration will continue.
The project possesses a first-hand perspective on ongoing research within databases and knowledge bases since the RUC group has arranged a series of international workshops and conferences under the title "Flexible Query Answering Systems (FQAS)" ('94, '96, '98) attracting researchers from many countries (Andreasen, 1997; Christiansen, 1998). Numerous research contacts have been established at these events and the FQAS conference series continues under the auspices of an international program committee. The RUC group has close contacts and collaboration with ENSSAT, Université de Rennes, Lannion, France (Patrick Bosc, Olivier Pivert), Dept. of Computer Science & AI, University of Granada, Spain (Amparo Vila) and Machine Intelligence Institute, Iona College, USA (Ronald Yager).
The DTU group is collaborating with various European research groups in knowledge bases, e.g. the Uppsala Computational logic and knowledge engineering group. Furthermore the DTU group organised the international conference in Information Modelling and Knowledge Bases in 1996 reported in (Kangassalo, Nilsson et al., 1997). Finally the RUC and DTU groups have collaborated in the joint project "Intelligente Søgesystemer" (Intelligent Search Systems, 1994-1997) supported by The Danish Natural Science Research Council.
3 Organisation of the project
The project involves 5 project partners with the following areas of research specialisation:
A development group consisting of RUC and CST will be responsible for the development of the experimental prototypes, and a management committee consisting of Troels Andreasen (RUC), Jørgen Fischer Nilsson (DTU) and Hanne Erdman Thomsen (HHK) will be responsible for the technical and administrative management of the project.
4 Project Research Plan
The total period of the project is divided into three phases: the first year, the second and third years, and the fourth year. In order to ensure the time resources necessary for innovative research work, it has been decided to adopt a commercially available object-oriented relational database system (Oracle) as a platform for the whole OntoQuery prototype system. Components in this system will thus take the form of database application code, to allow for efficient development with a minimum of programming effort. Furthermore, the prototype development is planned such that it can proceed without depending on final results from the research tasks.
4.1 Form and Extent of Results to be Reported
The activities in the project are divided into theoretical research and experimental development. The deliverables will take the following forms:
The papers for publication are planned to be written in groups of typically 3 persons composed so as to ensure cross-institutional as well as cross-disciplinary (Comp.Sc./Comp.ling.) representation. This is in the interest of promoting coherence of results from the project research tasks as well as interlocking of developed system modules. For each year 5 to 10 papers are expected and each participant is expected to co-author at least one paper per year (on the average two).
Finally, the project group is to host a number of research events on issues closely related to the project. For each year: 2-3 open events (primarily workshops and seminars), and during the 4-year project period: 2 PhD seminars and 1 international summer school. Already included in the plans at this point are a workshop on "Ontology-based NP-interpretation" in January 2000 and the conference "5th Flexible Query-Answering Systems" (FQAS 2002). (The 4th, FQAS 2000, will be co-chaired by the RUC group, but is already planned to take place in France.)
Hanne Erdman Thomsen <email@example.com >