Contact:
borsmose@ruc.dk

PhD Course: Concept Analysis and Concept Based Retrieval
Participants - Project descriptions

Alessio Bosca, Politecnico di Torino, Italy.

Title: Analysis, specification and development of systems and architectures for the distributed elaboration

In recent years, research in software engineering and intelligent systems has produced new algorithms and tools for building complex software systems that are characterized by a continuous increase in the amount and structural complexity of the data they deal with, and that therefore require self-managing capabilities.

Of particular interest is the study of models, methodologies and tools for the next-generation web, such as the Semantic Web, ontologies, multi-agent systems and autonomic computing, in order to realize platforms for the distribution of services related to innovative Web Intelligence applications. These technologies are deeply linked: the core value added by the Semantic Web resides in the machine-understandable paradigm, so software agents are the future direct consumers of those resources and the delegates for such high-level information processing. Furthermore, such technologies find their natural place in sectors with high socio-economic impact, such as e-Government and e-Health.

Antoine Doucet, University of Helsinki, Finland.

Title: Exact coverage information retrieval based on static passage-clustering

For a few decades, the amount of digital information stored in computer systems and networks has been growing exponentially. Using automated data collection tools, massive amounts of data have been collected and stored in databases. Within this rising quantity of data, the proportion of textual data has been constantly increasing. Consequently, automatic and efficient methods have been developed for searching for specific information within textual data.

Unfortunately, these algorithms do not take any extra information into account. Information such as the structure of a document would definitely improve the relevance of document descriptors. Since this data is nowadays included in most textual documents, an obvious need for including this contextual information within the algorithms has now appeared.

My first goal is to extend the set of document features so as to include hierarchical document information. The hypothesis here is that the hierarchical structure of a document (e.g., its subdivision into sections, subsections, paragraphs, subparagraphs, etc.) carries semantically meaningful information. In current text processing systems, this information is often skipped or used only partially (for example, by increasing the weight assigned to a word when it is used in a title element).

Taking this extra data into account raises many difficulties, most of them computational. For example, in the case of geographical information, one needs a way to include non-sequential information (such as maps, images or tables) in a process based on typically sequential data (words used in natural language). How to mix these totally different kinds of data is a real problem; indeed, the occurrence of a map within a document often carries the same information whether it lies before or after a given word sequence.

When thinking of adding information to the document using a morphological analyser, the potential amount of new information is huge. A way to select the relevant morphological information has to be devised. Then, a way to represent this information has to be found. A first option might be to replace each word by a feature vector including the corresponding data. But handling several pieces of information for the same word is another big issue, since the method is so far based on a single piece of information (the word itself) per index of the sequence.

This basically defines the main challenge of this thesis, i.e., to adapt ideas based on sequential and single-valued information to semi-sequential and multi-valued information.

Bonino Dario, Politecnico di Torino, Italy.

The Ph.D research activity involves the usage and the extension of the technologies proposed by the "Semantic Web" research community, to develop new architectures and algorithms for intelligent web sites and services. These applications are able to provide high-level interaction by integrating many techniques coming from Artificial Intelligence, in particular "Computational Intelligence", and from the Evolutionary Algorithms research field. The developed tools will be applied to different real-world problems involving usability, wireless interfaces, search engines and multimedia information retrieval.
With respect to the first Ph.D year, research activities involved two main topics related to the Web Intelligence field and the Semantic Web. The first includes the design and preliminary implementation of an evolutionary prediction system that aims at correctly predicting the next requested page on a web server; different application scenarios have been investigated, specifically with regard to the dynamic Web area. The second involved the proposal, design and implementation of a semantic platform able to provide automatic semantic annotation facilities and semantic search functionalities for existing web resources. The architecture is able to work at different levels of detail, for example from a single paragraph in a web page to entire chapters in a book, and is well integrated with existing semantic and lexical networks such as WordNet.

Davide Martinenghi, Roskilde University, Denmark.

The purpose of the PhD project is to develop a general model for evaluation of database integrity constraints with possible emphasis on incremental maintenance and automatic or semi-automatic generation of update routines.
In particular, a procedure for simplification of integrity constraints can be defined in order to specialize integrity checking for the (classes of) updates that are given to the database. Furthermore, a simplification can be characterized as optimal if it corresponds to a minimum in an ordering that represents the effort of evaluating the specialized constraints. It can be shown that no optimal simplification procedure can exist, due to unavoidable undecidability issues.
The best we can do is to find good approximations of optimal procedures and applications in which they prove useful. Simplification of integrity constraints comes in handy, e.g., for data integration, semantic query optimization, abductive reasoning and data mining; but integrity constraints can be used to express semantic knowledge in a number of different contexts, such as text processing and concept analysis.
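
The idea of specializing an integrity check for a given update can be illustrated with a minimal sketch. The constraint (every order refers to an existing customer), the data layout and the function names are hypothetical and chosen only for illustration; they are not taken from the project.

```python
def full_check(orders, customers):
    """Check the whole constraint: every order refers to an existing customer."""
    return all(customer in customers for _, customer in orders)


def simplified_check_for_insert(new_order, customers):
    """Specialized check for inserting one order.

    Assumes the constraint held before the update, so only the
    inserted tuple needs to be examined.
    """
    _, customer = new_order
    return customer in customers


# Hypothetical toy database state.
customers = {"c1"}
orders = {("o1", "c1")}

good = ("o2", "c1")
bad = ("o3", "c9")

# The specialized check agrees with the full check on the updated state,
# but inspects a single tuple instead of the whole relation.
print(simplified_check_for_insert(good, customers), full_check(orders | {good}, customers))
print(simplified_check_for_insert(bad, customers), full_check(orders | {bad}, customers))
```

The simplified check is only equivalent to the full one under the stated assumption that the database was consistent before the update.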

Eija Airio, University of Tampere, Finland.

The subject of my dissertation is cross-language information retrieval (CLIR), which is a subfield of information retrieval (IR). In CLIR, the request is expressed in a language that differs from the document language(s). There are two kinds of CLIR tasks: bilingual and multilingual. The former have only one target language, while the latter have multiple target languages. My research deals with both bilingual and multilingual CLIR tasks.

There are two main approaches in CLIR: either to translate the queries into the target languages or to translate the documents into the source language. The first approach is easier and cheaper to implement, and I apply it in my research as well. There are many ways to translate the queries. I apply bilingual dictionaries. The dictionary-based approach is simple: all the words (except stop-words) are translated into the target language(s), and retrieval is performed with the translated query.
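
A minimal sketch of dictionary-based query translation as described above. The tiny English-to-Finnish dictionary, the stop-word list and the choice to keep out-of-dictionary words untranslated are assumptions made for illustration only.

```python
# Hypothetical toy resources; a real system would use full-size ones.
STOP_WORDS = {"the", "of", "in", "a"}
BILINGUAL_DICT = {
    "fish": ["kala"],
    "course": ["kurssi", "ruokalaji"],
}


def translate_query(query, dictionary, stop_words):
    """Replace every non-stop-word with all of its dictionary translations."""
    translated = []
    for word in query.lower().split():
        if word in stop_words:
            continue
        # Assumption: words missing from the dictionary are kept as they are.
        translated.extend(dictionary.get(word, [word]))
    return translated


print(translate_query("the fish course", BILINGUAL_DICT, STOP_WORDS))
```

Because every translation alternative is kept, an ambiguous source word such as "course" widens the target query; this is one of the known weaknesses of the plain dictionary-based approach.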

Ontologies are useful in IR in two ways: 1) they allow the user to navigate in order to find the relevant concepts for retrieval, and 2) they offer a tool for query expansion. In CLIR, ontologies may be used in many different ways. For example, the user could navigate the ontology in the source language, searching for the right concepts and expanding the query, after which the query is translated using a bilingual dictionary. An alternative, though more demanding, approach would be to create multilingual ontologies. I will apply the first of these approaches in my research, and possibly the latter as well, if resources allow.

Ekaterina Mhaanna, Copenhagen Business School, Denmark.

Frédéric HALLOT, Royal Military Academy of Belgium.

This thesis is at its very beginning, and therefore its final direction is not yet very clear.
With this research, I want to focus on the possibility of making ontologies independent of natural languages.

To make this concrete, I want to work on an existing project.
I have already developed an online multilingual examination system. Currently this system is multilingual in the sense that all the different language versions of the questions and answers are stored in a normalized relational database. The system is quite complete: it allows questions with textual answers as well as multiple choice questions. The examinations are randomly composed in the language of the student. All the multiple choice answers are automatically corrected by the system. For the textual answers there are three possibilities: if the student gave no response, the system considers the answer wrong; if the student gave literally the same answer as the one stored in the system, the system accepts the answer as correct; otherwise the teacher must compare both answers and decide on the correctness of the answer.
Possible improvements using multilingual ontologies:
• Conversion of one typical answer for a textual question into an ontology.
• Conversion of the student’s answer into an ontology, and then trying to match the ontologies, taking synonyms into account.
• Ideally, the student’s answer could be in any natural language (supported by the system) while the typical answer could be stated in the teacher’s language, and still the matching of ontologies should be possible, because then the multilingual aspect of ontologies (and thus also web services) would be achieved.

Gunn Inger Lyse, University of Bergen, Norway.

Title: Translationally derived information about lexical semantics for WSD purposes

In recent years, there has been an increasing interest in the use of translational data as a source of information about lexical semantics. The Mirrors method (Dyvik, 2003) utilises translational (corpus) data in order to derive the sense distinctions of words as well as the semantic relations between the resulting senses (such as synonymy and hypo-/hypernymy). It is not easy to evaluate the Mirrors method, as it is virtually impossible to produce an intersubjective «gold-standard» against which it may be compared.
The goal of this project is to use the practical task of Word Sense Disambiguation (WSD) in order to evaluate how the Mirrors method lends itself as a source of paradigmatic lexical information within a practical NLP-task. Concretely, the project addresses the sparse data problem associated with corpus-based machine-learning approaches to WSD: How to produce sufficiently informative sense-tagged data as training material for stochastic machine-learning (ML) algorithms?
Corpus-based WSD approaches require large amounts of data, since they are generally limited to observations of concrete co-occurrence patterns between a word sense and its context words. Intuitively, however, the presence of a word a (say, fish) in the context of the target sense in question (e.g., course in its food sense) also implies that the target sense is likely to co-occur with a's semantic relatives (e.g., salmon as a hyponym of fish), even if such semantic relatives are not actually exemplified in the training corpus. Therefore, an interesting approach would be to generalise context information from unanalysed words to classes of semantically related word senses. In the Mirrors wordnet, semantic relatedness is encoded through (translationally derived) semantic features: the more closely related two senses are, the more features they share. The basic idea of the project is to base machine learning on the semantic features present in the target sense context, rather than on the set of unanalysed context words, i.e., to produce co-occurrence patterns not between words but between classes of words: all words that share some instantiated feature.
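
The generalisation from context words to shared semantic features can be sketched as follows. The feature table is a hypothetical stand-in for the Mirrors wordnet, and the feature names are invented for illustration.

```python
# Hypothetical sense-to-feature table (not actual Mirrors output).
SENSE_FEATURES = {
    "fish": {"food", "animal"},
    "salmon": {"food", "animal", "fish-kind"},
    "lecture": {"education"},
}


def context_features(context_words, sense_features):
    """Map a list of context words to the union of their semantic features."""
    features = set()
    for word in context_words:
        features |= sense_features.get(word, set())
    return features


def shared_feature_count(a, b, sense_features):
    """Number of features two entries share (more shared = more closely related)."""
    return len(sense_features.get(a, set()) & sense_features.get(b, set()))


# A training context containing "fish" and an unseen context containing
# "salmon" have no words in common, but they overlap at the feature level.
train = context_features(["fish"], SENSE_FEATURES)
unseen = context_features(["salmon"], SENSE_FEATURES)
print(sorted(train & unseen))
```

A learner trained on feature sets rather than raw words could therefore match the unseen context even though "salmon" never occurred in the training data.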

Henrik Bulskov, Roskilde University, Denmark.

The topic of my PhD is information retrieval and information extraction. The main issue is how to refine query evaluation by use of ontological knowledge, and how to extract information from a text corpus to support this. Instead of the traditional word-based descriptions of text objects, I want to create semantic descriptions by shallow natural language parsing and semantic extraction. This is to be done by recognizing simple noun phrases and subsequently extracting relations between these noun phrases, to form simple and compound concepts. The semantic descriptions (sets of concepts) extracted from the text objects are used as the basis for query evaluation, and the same technique is to be applied to the query definitions, so that text objects in the information base and query definitions have the same type of description. The main goal for query evaluation is to find a representation for the semantic descriptions and a similarity measure that takes the semantic knowledge into account. The idea here is to use a simple representation that can be mapped directly into an ontology, and then use distance in the ontology to calculate the similarity and/or relatedness between concepts. An overall requirement is that both the extraction and the query evaluation scale to large information bases, and that the knowledge, primarily ontological and linguistic information, is not bound to one specific knowledge base but can instead draw on different knowledge sources, for instance ones found on the Internet, through translation into a simple generic format.
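
As a rough illustration of using distance in an ontology as a similarity measure, the sketch below computes the shortest path between two concepts in a small is-a hierarchy and turns it into a score. The toy hierarchy and the formula 1 / (1 + distance) are assumptions chosen for simplicity, not the measure used in the project.

```python
from collections import deque

# Hypothetical is-a edges (child -> parent).
IS_A = {"salmon": "fish", "trout": "fish", "fish": "animal", "dog": "animal"}


def neighbours(concept, is_a):
    """Concepts one is-a edge away, in either direction."""
    result = {child for child, parent in is_a.items() if parent == concept}
    if concept in is_a:
        result.add(is_a[concept])
    return result


def distance(a, b, is_a):
    """Shortest path length between two concepts, or None if unconnected."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours(node, is_a):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def similarity(a, b, is_a):
    """Assumed scoring: 1.0 for identical concepts, decreasing with distance."""
    dist = distance(a, b, is_a)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


print(distance("salmon", "trout", IS_A), distance("salmon", "dog", IS_A))
```

With this scoring, "salmon" is closer to "trout" than to "dog", which matches the intuition that siblings under "fish" are more related than concepts that only meet at "animal".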

Henrik Oxhammar, Stockholm University, Sweden.

This project aims at developing, implementing and testing a system that identifies company names and product names in web pages, and maps the product names to a standardized classification scheme (the Common Procurement Vocabulary (CPV)).
The system would limit manual identification, extraction and classification and improve consistency of classification decisions. If employed in a semi-automatic setting, the system will learn from the classification decisions and improve its decisions with the number of learning instances.

The intention is to tackle the recognition and classification of product names in the following way:

* Product Name Recognition: First, a pattern-based approach will be used for finding generic product terms (e.g., alkaline batteries) and true product names (e.g., The Energizer) in English documents, using lexicons, part-of-speech tags, orthography (i.e., capitalization) and special characters (e.g., parentheses, slash, '&'). Subsequently, the pattern-based approach will be complemented by a machine learning approach.
* Product Name Classification: Product names will be matched to nodes in the hierarchy using a vector space approach. In order to map product names into the hierarchy, the node descriptions will be broadened by building "clouds" of semantically similar items around them. This will first be done based on existing thesauri (e.g., WordNet). Eventually the intention is to use methods from Information Retrieval for automatic construction of similarity thesauri.
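
The classification step can be sketched with a plain bag-of-words vector space model and cosine similarity. The node labels and their broadened descriptions are invented for illustration and are not actual CPV entries; term weighting (e.g., tf-idf) is deliberately left out.

```python
import math
from collections import Counter

# Hypothetical node descriptions, already broadened with a "cloud" of related terms.
NODES = {
    "batteries": "batteries battery alkaline cell accumulator",
    "lamps": "lamps lamp bulb light lighting",
}


def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[term] * v[term] for term in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def classify(product_name, nodes):
    """Return the node whose description is most similar to the product name."""
    query = vectorize(product_name)
    return max(nodes, key=lambda node: cosine(query, vectorize(nodes[node])))


print(classify("alkaline battery", NODES))
```

Without the extra terms in the node description, "alkaline battery" would share no token with the bare label "batteries"; the broadening is what makes the match possible in this sketch.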

Jaak Simm, University of Tartu, Estonia.

The goal of my PhD work is to examine and explore possibilities of creating architectures capable of developing a dynamic ontology for their internal representation. The idea is to view intelligent systems as a series of ontological transformations of information. An ontological transformation transforms information from one ontology to another. To have dynamic ontology capabilities is to have the ability to perform dynamic ontological transformations (instead of only the static transformations that are available to an intelligent system at its creation).
For that reason I plan to analyze the way knowledge representation techniques transform ontology. Thus, I will try to create a framework describing how ontological commitments are made and to evaluate the efficiency of these ontological commitments. Such a framework can provide insights for developing architectures with dynamic ontology capabilities.

Jenny Eriksson, Uppsala University, Sweden.

Jesper Mathiassen, Roskilde University, Denmark.

The area of my project is Computational Linguistics. The focus will be on partial parsing, and the idea is to describe a syntax analysis component of a conceptual analysis of domain-specific text corpora. The syntax analyzer should form part of an information retrieval system of the kind outlined in the project description of the OntoQuery project (http://www.ontoquery.dk/description/). This system facilitates information extraction and query answering against the background of a taxonomically ordered ontology. The automatic analysis of text within this project aims at generating descriptors, and the semantics of phrases such as "symptoms due to lack of vitamin-B" is represented with descriptors of the kind exemplified below:
symptom[CBY : lack[WRT : vitaminB]]
The descriptor is a feature structure in which the sort labels are formed by concepts, while WRT (with respect to) and CBY (caused by) are relation names that relate concepts in the ontology; this shows the general format that the analysis is targeting. One of the key ideas in OntoQuery is that a conceptual grammar is used as a supplement to the linguistic grammar. An interesting perspective is to focus on the use of the conceptual grammar in the parse process, with the purpose of applying rule sets that filter away parse trees corresponding to phrases that are conceptually marginal or incorrect in relation to the domain ontology. For example, while in (at least Danish) political discourse it would be sensible to talk about a 'red proposal', this makes little sense in other domains. However, at present, parsers have no way to incorporate such restrictions. In other words, an adjective and a noun would be recognized as a noun phrase irrespective of whether the combination is sensible or not. Therefore, such a technique would be quite helpful in relation to the problem of ambiguity.
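
One possible in-memory form of such descriptors is a small recursive data structure. The class name, the field names and the comma used between multiple relations are assumptions for illustration; only the bracket notation itself comes from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """A concept plus named relations to further descriptors."""
    concept: str
    relations: dict = field(default_factory=dict)  # relation name -> Descriptor

    def render(self):
        """Render in the bracket notation used in the text."""
        if not self.relations:
            return self.concept
        inner = ", ".join(
            f"{name} : {target.render()}" for name, target in self.relations.items()
        )
        return f"{self.concept}[{inner}]"


# "symptoms due to lack of vitamin-B"
descriptor = Descriptor(
    "symptom",
    {"CBY": Descriptor("lack", {"WRT": Descriptor("vitaminB")})},
)
print(descriptor.render())  # symptom[CBY : lack[WRT : vitaminB]]
```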

Jesper Vinther Christensen, Technical University of Denmark.

Title: Specification of geographic information

Geographic information is an abstract representation of reality; that is widely accepted. The specification of what is called the universe of discourse defines the interpretation of reality to form a representation that can be captured in computer-based systems. It is crucial to be able to express the universe of discourse, i.e., the specifications, in a way that adds conceptual transparency and clarity, such that these can be used as solid foundations for both producing and using geographic information.

My PhD concerns formal specification of geographic information with a focus on topographic maps and place names. The aim of the project is to establish a better understanding of what geographic information is, how it is specified, and how it is represented in computer-based systems. An important aspect of geographic information is the classification of individuals and the specification of concept hierarchies. Defining roles that restrict relations among individuals is also an important task. Using the concept hierarchies and defined roles, it is possible to specify rules that must hold for the representation of individuals. At least, this is the project’s hypothesis.

There are many motivations for establishing a framework for writing formal specifications of geographic information. It gives authors a predefined and structured scheme for writing specifications, which minimizes the risk of introducing contradictory and unclear rules. Formal specifications in a well-known structure give users of geographic information access to detailed metadata that can be queried and presented as needed. Formal specification of geographic information allows validation processes that can decide whether some information conforms to a specification. Formal specification also makes the interchange of specifications between different systems possible; this makes implementing a specification in different software easier and, more importantly, minimizes the risk of implementing a specification in different ways.

Kadri Vider, University of Tartu, Estonia.

Title: Word Sense Disambiguation of Verbs According to Lexical-Syntactic Information

The aim of the dissertation is to study how verb senses can be disambiguated on the basis of corpus material, and what lexical-semantic functions (according to Fillmore's FrameNet and Mel'cuk's theory of lexical functions) these verb senses have.
The practical output of my dissertation will be a formally consistent corpus-based lexical-semantic database that describes the usage of the Estonian language on the lexical-semantic level, with the main focus on verb senses.
I have long practical experience in developing the Estonian WordNet (EstWN, a site in the EC project EuroWordNet-2, 1998-1999), and our research group has considerable experience in word sense disambiguation tasks (participation in SensEval-2, 2001). An attempt to disambiguate word senses in CELL (Corpus of Estonian Literary Language) according to the word senses in the Estonian wordnet revealed inconsistencies in the splitting of words into senses. This raises the question of how reliable a wordnet word sense (as just one member of a synset) is in (con)text, and to what extent word senses created manually by lexicographers are covered in real usage.
Constructing a lexical-semantic database of verbs from text corpus data requires taking into consideration that verbs behave differently from nouns. There is mostly one verb with its argument structure per sentence. Verbs cannot be clearly separated into senses as nodes in hypernymy/hyponymy trees; rather, they differ in manner.
Our WSD system works with EstWN hyponym/hypernym hierarchies, taking into consideration the distances between the nodes corresponding to the word senses in the EstWN tree as well as the density of the tree. Results for WSD of verbs are poor for the reasons noted above. My idea is also to improve our WSD system with supporting knowledge about Estonian verb senses and how their argument structure influences disambiguation.

Kean Huat Soon, University of Muenster, Germany.

My PhD project studies the development of user task ontologies and seeks to ensure that the developed user task ontologies can feasibly be mapped to the content ontologies that describe the domain knowledge of database schemata or text corpora. A document that describes the user tasks in a particular domain is selected for this study. The verbs and noun phrases in the document, which represent the actions and goals of user tasks, are manually extracted. In order to form the user task ontology, a concrete conceptual structure is constructed from the extracted information. The conceptual structure of user tasks is formalized with Formal Concept Analysis (FCA), a method well suited to the analysis of data. The user task hierarchy is then represented as a concept lattice derived from a cross table. In this study, we propose a method in which the verbs of the tasks are represented as verbal adjectives (verbs with the suffix “-able”). The verbal adjectives are treated as the formal attributes, whereas the goals of the tasks (noun phrases) are defined as the formal objects of the concepts. Because the focus of this study is activity-centered rather than object-centered, the notion of implication between attributes is used, with the assumption that the action of one particular user task implies that other user tasks are to be accomplished.
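
A minimal sketch of how formal concepts and attribute implications can be derived from a cross table. The objects (task goals) and attributes (verbal adjectives) are hypothetical examples, and the brute-force enumeration over all object subsets is only practical for very small contexts.

```python
from itertools import combinations

# Hypothetical cross table: formal object (goal) -> formal attributes (verbal adjectives).
CONTEXT = {
    "map": {"viewable", "printable"},
    "route": {"viewable", "computable"},
}


def common_attributes(objects, context):
    """Attributes shared by all given objects (all attributes for the empty set)."""
    sets = [context[obj] for obj in objects]
    if not sets:
        return set().union(*context.values())
    return set.intersection(*sets)


def objects_having(attributes, context):
    """Objects that have every one of the given attributes."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """All (extent, intent) pairs, found by closing every subset of objects."""
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset, context)
            extent = objects_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def implies(a, b, context):
    """Attribute implication a -> b: every object with a also has b."""
    return objects_having({a}, context) <= objects_having({b}, context)


print(len(formal_concepts(CONTEXT)))
print(implies("printable", "viewable", CONTEXT))
```

In this toy context, "printable" implies "viewable" because the only printable goal is also viewable, which is the kind of dependency between task actions that the implication notion is meant to capture.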
The user task ontologies developed from the concrete conceptual structure must be mappable to the domain knowledge. However, this depends on how reliable and extensible the conceptual structure resulting from the formalization is. Hence, in this course, I hope to find answers to the following questions:
• Methods for automating concept based retrieval, in order to achieve better retrieval results and to build substantial ontologies from resources such as corpora, database schemata and documents;
• Constructing a concrete conceptual task hierarchy where the focus is on the actions as attributes of the formal concepts;
• The potential of FCA for mapping between two different ontologies.

Kendall Lister, The University of Melbourne, Australia.

The unprecedented speed with which information is being created and made available via current information technology has produced a sea of disparate data sources that do not interoperate (or do so only at a relatively shallow level, such as data visualisation techniques). Interpreting the meaning of these many different sources is left to human analysts who must spend valuable time sorting through them to ‘hand-extract’ the relevant information. The Structured Knowledge Source Integration (henceforth ‘SKSI’) project at Cycorp (Austin, Texas) has been developed to address this issue. Current SKSI tools enable the Cyc knowledge base to integrate (i.e. to access, to perform complex queries over, to assimilate and to merge) a variety of external structured knowledge sources, such as databases, spreadsheets, XML or DAML tagged text, GIS datasets and web pages, with the rich, multi-contextual Cyc ontology (expressed in the specially-tailored language, ‘CycL’) acting as the mediating lingua franca.

Such knowledge source integration is achieved declaratively via a Schema Mapping Language (SML) within CycL. Using this language, assertions are made in the knowledge base with respect to a given knowledge source regarding: its ‘physical schema’ (e.g. How many columns does a given table consist of?), its ‘logical schema’ (e.g. What information do those columns represent?) as well as its access paths, privileges and update frequency. A few such simple assertions are all that is required for the knowledge source to be reasoned over just as if it were part of the Cyc knowledge base. Sources successfully integrated can be ‘dynamically generated’. For example, it is now possible to ask Cyc the current weather in Austin. At present, however, sources have to be found and ‘hand-declared’ in advance in order to be used by the Cyc KB. There is a need for functionality that is more intelligent and proactive in this regard.

We believe that progress in this area may best be made by studying web pages. Web pages, in many cases, form an interesting middle ground between the highly structured knowledge sources with which SKSI tools already deal competently and (currently intractable) natural language (NL) sources, in that they tend to carry their own semantic declarations – in NL, but often in very simple phrases (for instance, in the form of tables or lists of information with simple headers on the columns and sections).

In order to build functionality to proactively identify and add knowledge sources, we wish to exploit the software agent paradigm, which has emerged as a potential aid for interacting with knowledge on the Internet. There is no consensus on an exact definition of the term ‘agent’, though see (Wooldridge and Jennings, 1995), (Ndumu and Nwana, 1997). However, one important type of agent is an independent piece of software which can locate a relatively small amount of accurate information for the end-user, in part by mimicking how a human, knowledgeable about the domain, would seek that information (Sterling, 1997, 1999).

This project will build on work already done by the Intelligent Agent Laboratory at the University of Melbourne in prototyping a range of information agents over the past 5 years. Significant insight has been gained by the Intelligent Agent Laboratory as to when a knowledge-based approach to building software agents may be successful. The key characteristic of a suitable domain is that there is a variety of pages in differing formats but there is some common overall structure. Too much structure reduces the problem to known methods. Too little structure reduces the problem to natural language understanding which is currently too difficult. Domains successfully prototyped include finding paper citations, sports scores, subjects offered in universities, classified advertisements for cars and real estate, and legal information. It has been possible to develop the information agents in a way that can be generalised to some extent from domain to domain (Loke et al, 1999). These agents however have not fitted within an overall ontology.

The aim of this research, therefore, is to advance knowledge-source integration technology by exploring ways in which agents can automatically find interesting web sites on particular topics and automatically generate suitable mappings for them which integrate them within a large-scale, general ontology (in this case, the Cyc ontology).

Lone Bo Sisseck, Copenhagen Business School, Denmark.

Olatz Ansa, University of the Basque Country.

In the context of the increasing importance of lexicons in Natural Language Processing, we have considered the need to build a lexical knowledge base for Basque. We are interested in developing a general lexical-semantic framework in which all types of relations (even multilingual and complex ones) are incorporated (Agirre et al., 2003).
Recently, two works have been carried out in order to create lexical-semantic resources for the Basque language:
i) The Basque WordNet resource, developed in the context of the EuroWordNet project (Agirre et al., 2002), and
ii) The extraction of relations from the analysis of a Basque monolingual dictionary (Agirre et al., 2000).
The Basque WordNet and the concepts extracted from the monolingual dictionary are going to be mapped. Our purpose is to enrich the Basque WordNet resource with the relations stored in the lexical knowledge base, and vice versa. Indirectly we hope to disambiguate extracted relations from the monolingual dictionary via this mapping.
Moreover, we want this lexical resource to be usable in practical applications. The application we are thinking of is a multilingual question-answering one. This system will receive questions written in Basque, and the answers will be obtained from multilingual corpora (Basque, Spanish, English).
At present, we are working on the design of this application and discussing in detail how and where this lexical-semantic information can be used to improve system results.

Päivi Pasanen, University of Helsinki, Finland.

Title: Terminological data in maritime safety texts and methods of their extraction

Traditionally, existing standards, dictionaries and glossaries have served as research material for researchers of terminology. Therefore, there are no satisfactory language-independent methods for extracting terminological data from texts. In addition, as a prescriptive approach has been predominant in terminology, concept development and term variation have attracted only little attention. However, in reality concepts change, and term variation is common.
The objectives of this work are, firstly, to research terminological data embedded in texts, and, secondly, to test the applicability and usefulness of certain methods used for extracting terminological data from texts. Consequently, proposals to develop new methods will be presented.
The research material consists of specialized texts in the subject domain of maritime safety. International and national regulations, textbooks, conference papers, research reports and articles from professional journals in Finnish and in Russian are included.
The research will be conducted in three phases. The first phase consists of the application of certain term extraction methods. These are manual term identification and semi-automatic term extraction, the latter of which will be carried out by using two commercial computer programmes. The results of term extraction will be compared and the accuracy and precision of the methods will be evaluated.
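One way the comparison of extraction results could be scored is sketched below. It assumes that the manually identified terms serve as the reference set against which a tool's output is measured; the abstract does not state this, and the function name and the sample terms are hypothetical.

```python
def precision_recall(extracted, reference):
    """Score a set of extracted term candidates against a reference term set."""
    extracted = {t.lower() for t in extracted}
    reference = {t.lower() for t in reference}
    hits = extracted & reference
    # Precision: share of extracted candidates that are reference terms.
    precision = len(hits) / len(extracted) if extracted else 0.0
    # Recall: share of reference terms that the method found.
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical data: manual identification as reference, tool output as candidates.
manual_terms = {"life raft", "muster station", "distress signal", "bulkhead"}
tool_terms = {"life raft", "distress signal", "safety", "bulkhead", "ship"}

precision, recall = precision_recall(tool_terms, manual_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting recall next to precision shows how many of the manually identified terms a tool misses, not only how many of its candidates are correct.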
During the second phase, other terminological data such as concept relations and characteristics will be retrieved from the texts. It has been argued that certain linguistic expressions could be used to identify terminological data from texts. In this study these expressions, which some researchers call knowledge probes, will be applied to identify concept relations and characteristics.
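As an illustration of how such knowledge probes might operate, the following minimal sketch matches two linguistic patterns against sentences and records the concept relation each one signals. The English patterns, the sample sentences, and the names PROBES and find_relations are assumptions made for illustration; the study itself works with Finnish and Russian texts and does not list its probes here.

```python
import re

# Hypothetical probes: each pairs a relation type with a sentence pattern.
ARTICLE = r"(?:an? |the )?"
PROBES = [
    ("generic", re.compile(
        rf"^{ARTICLE}(?P<a>.+?) is a (?:kind|type) of {ARTICLE}(?P<b>.+)$", re.I)),
    ("partitive", re.compile(
        rf"^{ARTICLE}(?P<a>.+?) is (?:a )?part of {ARTICLE}(?P<b>.+)$", re.I)),
]


def find_relations(text):
    """Return (relation, concept_a, concept_b) triples signalled by the probes."""
    relations = []
    for sentence in re.split(r"[.!?]\s*", text):
        sentence = sentence.strip()
        for label, pattern in PROBES:
            match = pattern.match(sentence)
            if match:
                relations.append(
                    (label, match.group("a").lower(), match.group("b").lower()))
    return relations


sample = ("A life raft is a kind of survival craft. "
          "The keel is part of the hull. Ships sail.")
relations = find_relations(sample)
print(relations)
```

Probes of this kind are language-specific, so each language in the material would need its own pattern set.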
The third phase consists of the comparative analysis of concepts in two languages. The analysis will be performed by the examination of characteristics identified during the second phase of the study. Concepts will be studied diachronically and the function of term variation will be considered.
The research will provide new information on the applicability of terminological methods to the extraction of terminological data from texts. The results may be applied to terminology work in special fields and to the education of translators and field professionals.

Paulo Gottgtroy, Auckland University of Technology, New Zealand.

Title: An ontology driven bioinformatics annotation tool

Problem Definition:
Sequence alignments provide a powerful way to compare novel sequences with previously characterized genes. Statistical measures indicate the quality of the matches. Often the statistical and the biological significance are related. Sometimes, however, matches of real biological significance have low statistical scores.
Objective:
To build an automatic annotation tool that considers both the statistical and the semantic analysis of the definitions presented in the alignment results.
Proposal Description:
Extracting relevant information from a small corpus, such as the definition sentences of alignment results, is a challenging task that relies mainly on information extraction techniques. Most current information extraction algorithms use a large lexicon and a huge training set to extract useful information and relationships among entities. However, in the case of sequence analysis, the results are not homogeneous, the definitions are short, and sometimes the quality of the annotations is not trustworthy.
To overcome these restrictions we propose a methodology that uses ontologies for knowledge representation. Furthermore, we use different levels of ontologies, such as domain, application and task ontologies, to allow reuse and different inferences from the same information. The ontologies are used as a semantic dictionary to guide our concept recognition process.
To implement this solution we are considering the current bioinformatics annotation processes in order to acquire previous knowledge and represent the application domain. The resulting annotations are going to be classified and stored in an ontological representation that is going to be reused in rule-based inferences that will guide the selection of target data sources.
The annotation tool is going to be implemented as a knowledge acquisition tool that will support the population of the knowledge base for further analysis.
The proposal includes the development and implementation of an integrated ontology that includes the Gene Ontology as domain ontology, and other ontologies built from the current annotation process to support the workflow and the concepts involved in the application domain. The prototype is going to include an information extraction process and automatic ontology annotation.

Pirkko Saatsi, University of Tampere, Finland.

At the moment I work at the University of Tampere in the National Ontology Project in Finland. I am constructing an ontology to support the information retrieval process. The subject of the ontology is food products, food production and food supervision.

The ontology construction is based on a three-level model of information retrieval. The levels are the conceptual, linguistic and string levels. The concepts and the relationships among them (generic, partitive and associative) are represented at the conceptual level. The linguistic level contains the natural language expression(s) of each concept. At the string level, the matching patterns of each expression are given in a query-language-independent way.
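The three levels can be pictured as a small data structure. The sketch below is only an illustration under stated assumptions: the class Concept, the function find_concepts and the sample food concepts are hypothetical, and regular expressions stand in for the string-level matching patterns, whose actual notation is not given here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    # Conceptual level: the concept and its relationships to other concepts.
    name: str
    broader: list = field(default_factory=list)   # generic relationships
    part_of: list = field(default_factory=list)   # partitive relationships
    related: list = field(default_factory=list)   # associative relationships
    # Linguistic level: expression -> string-level matching patterns.
    expressions: dict = field(default_factory=dict)


cheese = Concept(
    "cheese",
    broader=["dairy product"],
    expressions={"cheese": [r"\bcheeses?\b"]},
)
dairy = Concept(
    "dairy product",
    expressions={
        "dairy product": [r"\bdairy products?\b"],
        "milk product": [r"\bmilk products?\b"],
    },
)


def find_concepts(text, concepts):
    """Return the names of concepts with at least one pattern matching the text."""
    found = []
    for concept in concepts:
        patterns = (p for pats in concept.expressions.values() for p in pats)
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            found.append(concept.name)
    return found


found = find_concepts("Supervision of milk products and cheeses", [cheese, dairy])
print(found)
```

Keeping the patterns separate from the concepts means that a query posed at the conceptual level can be expanded to all expressions and strings of a concept.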

Puay Leng Lee, Strathclyde University, UK.

My research interest is in the use of conceptual structures to enable and support Information Retrieval (IR). The PhD project investigates the dynamic construction of FCA lattices in response to user queries, for the purpose of retrieving documents from large, non-domain-specific text collections. We conjecture that FCA lattices enable efficient, effective retrieval of documents from such collections, enhancing retrieval performance. This is motivated by potential benefits of lattices to IR, such as: the lattice information space naturally incorporates the navigational as well as the querying space; there are multiple paths to a given document. The project seeks practical solutions to problems arising from the size and the lack of domain-specificity of the collection. The implemented FCA-IR system will be tested against more conventional IR systems in a user evaluation study.
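To make the notion of an FCA lattice concrete, the following minimal sketch computes the formal concepts of a tiny document-term context and locates the concept that answers a query. It is a textbook-style illustration for very small inputs, not the project's system; the function names and the three sample documents are assumptions.

```python
def formal_concepts(context):
    """context: dict mapping a document to its set of terms.

    Returns the set of (extent, intent) pairs, i.e. the formal concepts.
    """
    all_terms = set().union(*context.values())
    # The intents are exactly the intersections of document term sets,
    # together with the full term set (the empty intersection).
    intents = {frozenset(all_terms)}
    for terms in context.values():
        intents |= {frozenset(terms) & intent for intent in intents}
    concepts = set()
    for intent in intents:
        extent = frozenset(d for d, terms in context.items() if intent <= terms)
        concepts.add((extent, intent))
    return concepts


def query_concept(context, query_terms):
    """Return the concept whose extent is all documents containing the query terms."""
    query_terms = set(query_terms)
    extent = frozenset(d for d, terms in context.items() if query_terms <= terms)
    all_terms = set().union(*context.values())
    # With an empty extent, the intent is the full term set by convention.
    intent = frozenset(all_terms.intersection(*(context[d] for d in extent)))
    return extent, intent


docs = {
    "d1": {"ontology", "retrieval"},
    "d2": {"ontology", "lattice"},
    "d3": {"ontology", "retrieval", "lattice"},
}
concepts = formal_concepts(docs)
answer = query_concept(docs, {"retrieval"})
print(len(concepts), answer)
```

From the answer concept, a user could move to a more general concept (fewer terms, more documents) or a more specific one, which is the navigational side of the lattice mentioned above.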

Rasmus Knappe, Roskilde University, Denmark.

Title: Ontology-based Similarity Measures

The main topic is similarity measures in connection with content-based query evaluation. The aim is to use the knowledge from a domain-specific ontology covering the domain of a given information base to obtain better and closer answers on a semantic basis. This is sought by devising a similarity measure that utilizes the different relations and the structure of the ontology to calculate the similarity between a query and objects in the information base.
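As a rough illustration of what a similarity measure over an ontology can look like, the sketch below scores two concepts by the overlap (Jaccard coefficient) of their ancestor sets in an is-a hierarchy. This is a generic stand-in chosen for brevity, not the measure devised in the project; the hierarchy PARENTS and the function names are hypothetical.

```python
def ancestors(concept, parents):
    """Return the concept together with all of its ancestors in the hierarchy."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def similarity(a, b, parents):
    """Jaccard overlap of the two upward closures; 1.0 for identical concepts."""
    up_a, up_b = ancestors(a, parents), ancestors(b, parents)
    return len(up_a & up_b) / len(up_a | up_b)


# Hypothetical is-a hierarchy: concept -> list of direct parents.
PARENTS = {
    "poodle": ["dog"],
    "dog": ["animal"],
    "cat": ["animal"],
    "animal": [],
}

print(similarity("poodle", "dog", PARENTS))
print(similarity("poodle", "cat", PARENTS))
```

A measure of this kind ranks "dog" closer to "poodle" than "cat" is, because the first pair shares more of the hierarchy; a fuller measure would also weight other relation types.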

Sotiris Rompas, University Of Glasgow, Scotland, UK.

Title: Visualization for web post-retrieval clustering

As the World Wide Web has grown beyond all expectations, currently to 5 billion pages (Lawrence & Giles 1998), and continues to increase rapidly, extensive research has taken place to identify effective ways of locating information on the Web. Search engines and directories such as Google and Altavista were created to aid users with their web retrieval tasks. As things stand, users receive long ranked lists of web documents as a result. Identifying the appropriate information becomes a tedious task, as the user has to browse most of the returned results to find it.
Sophisticated information retrieval algorithms have been developed in order to increase the efficiency of the search engines (in terms of precision and recall). Furthermore, clustering techniques have been implemented in various search engines such as Vivissimo (www.vivissimo.com) and Dogpile (www.dogpile.com) to increase retrieval efficiency by supplying meaningful groups of information (clusters) to the user, so that appropriate information can be located faster. Extensive research has taken place on clustering algorithms for web post-retrieval in order to increase the efficiency of the clustering. Considerable effort has also gone into applying visualization techniques to the information space of a retrieval task. These visualizations fail to satisfy users in terms of browsing, as the information space of such a task usually contains a very large number of documents.
We take a different approach by trying to combine both worlds (IR and Information Visualization) and visualize the clusters of a post-retrieval clustering task instead of visualizing the retrieved information space as a whole. We hypothesize that such a visualization will increase the overall efficiency of the search process and reduce the search time. Our visualization interface, VisOC, creates a simple graphical representation of the clusters generated for a specific retrieval task in such a way that the user can easily access every cluster generated without being “lost” in the retrieved information space.
The main aim of VisOC is to minimize user clicks and to increase the efficiency of the search process by giving the user a “picture” of the available information for her query. The user can click on a cluster to access its sub-clusters. At the lowest level of the clustering hierarchy, simple icons represent the retrieved documents in order to make the interface more familiar. At any time the user can view the entire information space without the need for a mouse click.

Thomas Terney, Roskilde University, Denmark.

Till C. Lech, CognIT as / University of Bergen, Norway.

Title: Ontology-based Co-reference Chaining

The proposed PhD project is a part of the KunDoc research project, funded by the Norwegian Research Council (KUNSTI-framework). KunDoc is a cross-disciplinary project, aiming at combining language technology with methods related to knowledge-based systems and the semantic web.

The goal of the proposed work is to develop a method for knowledge-intensive co-reference chaining by means of domain-specific ontologies encoded in RDFS.

There have been some knowledge-based approaches to co-reference chaining, such as Wilks’ preference semantics, Schank’s scripts/frames or semantic networks. However, the lack of available world knowledge has been pointed out as a major drawback of these kinds of methods. In recent years, methods and tools for knowledge-based systems in general, and semantic web initiatives in particular, have made major progress, providing powerful representation languages and concept extraction tools.

The proposed approach aims at transferring results from a linguistic and statistical corpus analysis into explicit domain-specific knowledge represented in RDFS. Special focus is directed towards how predicate-argument structures in sentences can be mapped onto concepts and relations in the domain ontology, which in turn will deliver semantic information for the analysis of unknown documents from the same domain.

Methods and Tools

The work is planned to be carried out in three phases. In the first phase, a domain-specific corpus will be processed, and the predominating concepts will be extracted. In the second phase, the predicates employed by the domain concepts will be statistically evaluated in order to cluster the concepts semantically and store the concept-predicate relations in the domain ontology.

In the third step, the knowledge stored in the ontology will be used for the analysis of unknown texts. If an NP’s referent in a text is unclear, it will be – along with its predicate – fed into the ontology in order to find an appropriate referent.

In order to extract the concepts, the Onto-Extract tool made by CognIT as in the On-To-Knowledge Project will be used, together with the Oslo-Bergen Tagger, a rule-based PoS-Tagger developed at the University of Oslo and AKSIS, Bergen. The predicate-argument analysis will be executed by means of the NORGRAM parser, developed at the University of Bergen.