An Architecture for Linguistic Analysis (POS-tagger, NP-recognizer, Parser) in the OntoQuery Project

Patrizia Paggio, Dorte Haltrup, Lene Offersgaard

 

Abstract:

The OntoQuery Parser Group will present three items: 1) the suggested architecture for the linguistic analysis; 2) the results of training Brill’s POS-tagger for Danish; 3) considerations on using LKB as a syntactic and semantic parser.

The suggested architecture for linguistic analysis consists of the following modules: a tagger, an NP-recogniser, a syntactic and semantic parser, and an interface module to the ontology. The use of dictionaries at different levels, and the interaction between these dictionaries and the linguistic modules, will be discussed. A diagram showing the interaction of the different modules and dictionaries will be presented.
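As a rough illustration of how such modules might be chained, the following Python sketch passes one shared analysis object through a tagger, an NP-recogniser and an ontology interface. The parser stage is left out for brevity. All names, tag labels, lexicon entries and the concept label VEHICLE are hypothetical and chosen only for the example; they are not taken from the OntoQuery implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Analysis:
    """Hypothetical record handed from one module to the next."""
    tokens: List[str]
    tags: List[str] = field(default_factory=list)
    noun_phrases: List[Tuple[int, int]] = field(default_factory=list)  # [start, end) spans
    concepts: List[str] = field(default_factory=list)


# Toy dictionaries (assumptions for this example only).
TAG_LEXICON: Dict[str, str] = {"den": "DET", "røde": "ADJ", "bil": "N", "kører": "V"}
ONTOLOGY_LEXICON: Dict[str, str] = {"bil": "VEHICLE"}
NP_TAGS = {"DET", "ADJ", "N"}


def tag(analysis: Analysis, lexicon: Dict[str, str]) -> None:
    # Look each token up in the tag dictionary; unknown words default to noun.
    analysis.tags = [lexicon.get(tok.lower(), "N") for tok in analysis.tokens]


def recognise_nps(analysis: Analysis) -> None:
    # Collect maximal runs of DET/ADJ/N tags that end in a noun.
    spans: List[Tuple[int, int]] = []
    start = None
    for i, t in enumerate(analysis.tags + ["END"]):  # sentinel closes the last run
        if t in NP_TAGS:
            if start is None:
                start = i
        else:
            if start is not None and analysis.tags[i - 1] == "N":
                spans.append((start, i))
            start = None
    analysis.noun_phrases = spans


def map_to_ontology(analysis: Analysis, lexicon: Dict[str, str]) -> None:
    # Map the head (last) word of each NP to a concept, if the dictionary has one.
    for _, end in analysis.noun_phrases:
        head = analysis.tokens[end - 1].lower()
        if head in lexicon:
            analysis.concepts.append(lexicon[head])


def analyse(tokens: List[str]) -> Analysis:
    analysis = Analysis(tokens=list(tokens))
    tag(analysis, TAG_LEXICON)
    recognise_nps(analysis)
    map_to_ontology(analysis, ONTOLOGY_LEXICON)
    return analysis


result = analyse("den røde bil kører".split())
```

Each module here consults its own dictionary, which is one simple way to picture the interaction between dictionaries and modules; the actual division of labour in the project may differ.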

In OntoQuery, a tagger for Danish has been trained. The principles of Brill’s tagger will be described, followed by the results achieved with it. In his paper on the rule-based tagger, Eric Brill states that the tagger gives 97.2% correct analyses when it is trained on 600,000 words and all words in the test material are known to the tagger. In our attempt to train Brill’s tagger, we obtain 98.5% correct analyses when the tagger is trained on 260,000 words and all words are known. Brill also reports 96.5% correct analyses when the tagger is trained on 950,000 words from the Penn Treebank; in addition, 85% of all unknown words are correctly guessed. We have achieved comparable results for Danish: 96.5% of all words get the right analysis, and 80% of all unknown words are correctly guessed.
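The idea behind a rule-based tagger of this kind, and the way accuracy can be reported separately for known and unknown words, may be sketched as follows. This is a minimal Python illustration, not Brill's implementation: the single contextual rule is written by hand rather than learned from a corpus, and the toy lexicon, tag names and example sentence are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# A contextual rule: change from_tag to to_tag when the previous tag is prev_tag.
Rule = Tuple[str, str, str]


def initial_tags(words: List[str], lexicon: Dict[str, str], default: str = "NN") -> List[str]:
    # Step 1: give each known word its lexicon tag; unknown words get a default tag.
    return [lexicon.get(w, default) for w in words]


def apply_rules(tags: List[str], rules: List[Rule]) -> List[str]:
    # Step 2: apply the transformation rules in order, each over the whole sentence.
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


def accuracy(words: List[str], predicted: List[str], gold: List[str],
             lexicon: Dict[str, str]) -> Dict[str, Optional[float]]:
    # Score all words, then known and unknown words separately.
    known = [p == g for w, p, g in zip(words, predicted, gold) if w in lexicon]
    unknown = [p == g for w, p, g in zip(words, predicted, gold) if w not in lexicon]

    def ratio(hits: List[bool]) -> Optional[float]:
        return sum(hits) / len(hits) if hits else None

    return {"all": ratio(known + unknown), "known": ratio(known), "unknown": ratio(unknown)}


# Toy data (hypothetical): "horse" and "quickly" are not in the lexicon.
lexicon = {"to": "TO", "race": "NN", "the": "DT"}
words = ["to", "race", "the", "horse", "quickly"]
gold = ["TO", "VB", "DT", "NN", "RB"]
rules: List[Rule] = [("NN", "VB", "TO")]  # hand-written, not learned

start = initial_tags(words, lexicon)
final = apply_rules(start, rules)
scores = accuracy(words, final, gold, lexicon)
```

In this toy run the contextual rule corrects "race" after "to", while the unknown word "quickly" keeps the wrong default tag, so the known-word and unknown-word scores differ in the same way as the figures reported above.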

As a last item, we will present our considerations on how to use a parser module in the project. LKB is proposed as a promising candidate. LKB is a grammar development environment based on unification and typed feature structures; it includes a bottom-up chart parser.
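To give an impression of what unification of feature structures involves, here is a minimal Python sketch in which feature structures are plain nested dictionaries. It is a simplification under stated assumptions: it has no type hierarchy and no structure sharing, both of which a typed feature structure system provides, and it does not use or imitate the LKB's own interface. The feature names and the Danish gender example are chosen only for illustration.

```python
from typing import Optional, Union

FeatureStructure = Union[str, dict]


def unify(a: FeatureStructure, b: FeatureStructure) -> Optional[FeatureStructure]:
    """Return the unification of a and b, or None if they are incompatible."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # a clash below this feature makes the whole unification fail
                result[feature] = merged
            else:
                result[feature] = value  # feature only in b: simply add it
        return result
    # Atomic values unify only if they are identical.
    return a if a == b else None


# Determiner "en" requires common gender, singular (toy feature names).
det_en = {"AGR": {"GENDER": "common", "NUMBER": "sg"}}
noun_bil = {"HEAD": "bil", "AGR": {"GENDER": "common"}}   # "en bil"
noun_hus = {"HEAD": "hus", "AGR": {"GENDER": "neuter"}}   # "et hus"

ok = unify(det_en, noun_bil)
clash = unify(det_en, noun_hus)
```

Here the first unification succeeds and merges the agreement information, while the second fails because the gender values clash; a chart parser would use this kind of test to decide whether two adjacent constituents may be combined.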