Systems & Grammars

DELPH-IN members share a commitment to re-usable, multi-purpose resources and active exchange. Based on contributions from several members and joint development over many years, an open-source repository of software and linguistic resources has been created that is widely used in education, research, and application building.

At the core of the DELPH-IN repository is agreement among partners on a shared set of linguistic assumptions (grounded in HPSG and Minimal Recursion Semantics) and on a common formalism (i.e. a logic) for linguistic description in typed feature structures. The formalism is implemented in several development and processing environments (which serve differing purposes) and enables the exchange of grammars and lexicons across platforms. Formalism continuity, in turn, has allowed DELPH-IN researchers to develop several comprehensive, wide-coverage grammars of diverse languages that can be processed by a variety of software tools.
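
To make the notion of description in typed feature structures a little more concrete, the following toy Python sketch, which is purely illustrative and not part of any DELPH-IN tool, unifies two feature structures represented as nested dictionaries, resolving their types against a miniature hierarchy; real DELPH-IN grammars add reentrancy, appropriateness conditions, and much larger type hierarchies.

    # Toy illustration of typed feature structure unification.  This is a
    # sketch, not the DELPH-IN formalism itself.

    # Miniature type hierarchy: each type maps to its immediate supertype.
    HIERARCHY = {"noun": "head", "verb": "head", "head": "*top*", "*top*": None}

    def supertype_chain(t):
        """Return t and all of its supertypes, most specific first."""
        chain = []
        while t is not None:
            chain.append(t)
            t = HIERARCHY.get(t)
        return chain

    def unify_types(t1, t2):
        """Return the more specific of two compatible types, else None."""
        if t2 in supertype_chain(t1):
            return t1
        if t1 in supertype_chain(t2):
            return t2
        return None  # incompatible types

    def unify(fs1, fs2):
        """Unify feature structures given as {'type': ..., FEATURE: value} dicts."""
        t = unify_types(fs1.get("type", "*top*"), fs2.get("type", "*top*"))
        if t is None:
            return None  # unification failure
        result = {"type": t}
        for feat in (set(fs1) | set(fs2)) - {"type"}:
            if feat in fs1 and feat in fs2:
                sub = unify(fs1[feat], fs2[feat])
                if sub is None:
                    return None  # failure in an embedded structure propagates up
                result[feat] = sub
            else:
                result[feat] = fs1.get(feat, fs2.get(feat))
        return result

    # A verb's subject requirement unified with constraints from a rule;
    # unify(verb, {"type": "noun"}) would instead fail with a type clash.
    verb = {"type": "verb", "SUBJ": {"type": "noun"}}
    rule = {"type": "head", "SUBJ": {"type": "*top*", "CASE": {"type": "*top*"}}}
    print(unify(verb, rule))
    # -> {'type': 'verb', 'SUBJ': {'type': 'noun', 'CASE': {'type': '*top*'}}}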

Over time, the following set of core components has emerged as a typical grammar engineering configuration, used both by DELPH-IN members and by other research initiatives.

Linguistic Knowledge Builder (LKB)

provides an interactive grammar development environment for typed feature structure grammars. The LKB includes a parser and generator, visualization tools for all relevant data structures (including trees, feature structures, MRSs, hierarchies, parse and generation charts), and a set of specialized debugging facilities (like ‘interactive unification’) and well-formedness tests for grammar and lexicon.
The LKB is implemented in ANSI Common-Lisp and available in full source code or as precompiled binaries for common platforms, including Linux, Solaris, and MS Windows.

The PET System

for the high-efficiency processing of typed feature structure grammars complements the LKB as a run-time and application delivery component. PET interprets the same logical formalism (in fact reads the exact same grammar source files) and provides a parser that is (much) less resource-demanding than the LKB, more robust, portable, and available as a library that can be embedded into NLP applications. Unlike the LKB, PET includes only very limited debugging facilities.
The PET System is implemented in ANSI C++ (with critical parts in pure ANSI C to improve run-time efficiency) and has been ported to several Unix flavours and MS Windows. Its industrial-strength code quality has already been confirmed in a commercial product built on top of PET. Full source code and pre-compiled binaries for (currently) Linux are available.
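
As a rough sketch of how PET can serve as a batch run-time component, the following Python fragment drives the two PET command-line tools, flop (which compiles a grammar) and cheap (which parses input). The grammar file names and the assumption that cheap reads one sentence per line from standard input are illustrative; real invocations depend on grammar-specific settings.

    # Sketch of driving PET as a batch run-time component from Python.
    # Assumptions (not taken from this page): the PET tools are invoked as
    # 'flop' and 'cheap', flop compiles 'english.tdl' into 'english.grm',
    # and cheap parses one sentence per line read from standard input.
    import subprocess

    GRAMMAR_TDL = "english.tdl"   # hypothetical top-level grammar file
    GRAMMAR_GRM = "english.grm"   # compiled grammar image produced by flop

    def compile_grammar():
        """Compile the TDL sources into PET's binary grammar format."""
        subprocess.run(["flop", GRAMMAR_TDL], check=True)

    def parse_sentences(sentences):
        """Feed sentences to cheap on stdin and return its raw output."""
        proc = subprocess.run(
            ["cheap", GRAMMAR_GRM],
            input="\n".join(sentences),
            capture_output=True,
            text=True,
            check=True,
        )
        return proc.stdout

    if __name__ == "__main__":
        compile_grammar()
        print(parse_sentences(["The dog barks.", "Kim gave Sandy a book."]))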

Competence and Performance Profiler

The [incr tsdb()] Competence and Performance Profiler provides an evaluation and benchmarking tool to grammar writers and system developers alike. [incr tsdb()] (‘tee ess dee bee plus plus’) acts like an umbrella application to a range of processing systems for typed feature structure grammars, including the LKB and PET, and defines a common format for the organization of test suites or corpora and the storage of precise and fine-grained measures of grammar and processor behavior. [incr tsdb()] profiles abstract over the idiosyncrasies of individual platforms and, thus, facilitate contrastive cross-platform comparison as well as in-depth analysis.
The [incr tsdb()] environment is implemented in ANSI C (for a simple DBMS), ANSI Common-Lisp (core functionality), and Tcl/Tk (GUI) and has been used successfully in various Un*x flavours. Besides a distribution in full source-code, pre-compiled object files are available that can be loaded on top of common LKB run-time binaries.
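
An [incr tsdb()] profile is essentially a small relational database: a directory of plain-text files whose record layout is declared in a relations file, with field values separated by ‘@’. The following Python sketch reads the schema for the item relation and prints the test items; the file layout shown reflects common practice, but the details (including the example field names i-id and i-input) should be treated as assumptions rather than a specification.

    # Sketch of reading an [incr tsdb()] profile with plain Python.  A
    # profile is a directory of '@'-separated text files (e.g. 'item',
    # 'parse', 'result'); the 'relations' file declares each relation's
    # fields.  Details are assumptions based on common practice.
    import os

    def read_schema(profile_dir):
        """Map each relation name to its ordered list of field names."""
        schema, current = {}, None
        with open(os.path.join(profile_dir, "relations")) as f:
            for line in f:
                line = line.rstrip("\n")
                if not line.strip():
                    current = None            # blank line ends a relation block
                elif line.endswith(":") and not line.startswith(" "):
                    current = line[:-1].strip()
                    schema[current] = []
                elif current is not None:
                    schema[current].append(line.split()[0])
        return schema

    def read_relation(profile_dir, relation, schema):
        """Yield one dictionary per record in the relation's data file."""
        fields = schema[relation]
        with open(os.path.join(profile_dir, relation)) as f:
            for line in f:
                yield dict(zip(fields, line.rstrip("\n").split("@")))

    if __name__ == "__main__":
        profile = "tsdb/home/sample"          # hypothetical profile directory
        schema = read_schema(profile)
        for item in read_relation(profile, "item", schema):
            print(item["i-id"], item["i-input"])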


Linguistic resources that are available as part of the DELPH-IN open-source repository include broad-coverage grammars for English, German, and Japanese, as well as a set of ‘emerging’ grammars for French, Korean, Modern Greek, Norwegian, Portuguese, and Spanish. Additionally, a proprietary grammar for Italian (developed by CELI s.r.l. in Torino) uses the exact same DELPH-IN formalism (and many of the Matrix assumptions) and is available for licensing. Following is some more background information on selected grammars:

LinGO English Resource Grammar (ERG)

The ERG has been under development at the Center for the Study of Language and Information (CSLI) at Stanford University since 1993. It was originally developed within the Verbmobil machine translation effort, but over the past few years has been ported to additional domains (most notably in an ecommerce and financial services self-help product that is marketed by a CSLI industrial affiliate) and significantly extended. The grammar includes a hand-built lexicon of around ten thousand lexemes and allows interfacing to external lexical resources (like COMLEX). The main grammar developer is Dan Flickinger, with contributions by (among others) Emily Bender, Rob Malouf, and Jeff Smith.

La Grenouille

The French Resource Grammar was originally designed as a tool for modeling selected linguistic phenomena by incorporating insights from ongoing research into the formal analysis of French in HPSG (Abeillé, Bonami, Boyé, Desmets, Godard, Miller, Sag, Tseng). In addition to basic clausal structures, the grammar provides a treatment of (for example) complex predicate constructions (compound tenses, causatives) and morpho-syntactic and phono-syntactic effects (clitic climbing, contraction, vowel elision, consonant liaison). La Grenouille, currently in its tadpole stage, is undergoing metamorphosis; a generation-enabled version was made available for public distribution in mid-2006. Further inquiries can be addressed to Jesse Tseng, the primary developer at Loria (Nancy, France).

JaCY Japanese Grammar

Jacy is a large-scale grammar of Japanese, currently used mainly in the Jaen machine translation system. The grammar is comparable in scope and size to the LinGO ERG and builds on the ChaSen package for word segmentation, morphological analysis, and a treatment of unknown words. It has been developed at multiple sites: originally at the German Research Center for Artificial Intelligence (DFKI GmbH) and Saarland University (both in Saarbrücken, Germany), then in cooperation with YY Technologies, later at NTT Communications Research Laboratories and the National Institute for Information Technologies, Japan, and now at Nanyang Technological University. Melanie Siegel, Emily Bender, and Francis Bond are the main developers.

Korean Resource Grammar

A computational grammar for Korean currently under development by Jong-Bok Kim at Kyung Hee University and Jaehyung Yang at Kangnam University. The grammar adopts the formalism of HPSG and Minimal Recursion Semantics, with the aim of providing an open-source grammar of Korean, and uses the MACH morphological analyzer for Korean. The development team cooperates closely with the LinGO Research Laboratory at CSLI, Stanford, and the JaCY developer team. The current grammar covers basic sentence types, relative clauses, light verb constructions, case phenomena, auxiliary constructions, and so forth.

Modern Greek Resource Grammar

A computational grammar for Modern Greek currently being developed at the Department of Computational Linguistics of Saarland University. The grammar includes, among other things, analyses of basic clause syntax, word order and cliticization phenomena in Modern Greek, valence-alternating and ditransitive constructions, subject-verb inversion, subordinate clauses, relative clauses, UDCs, raising and control, and politeness constructions, as well as implementations of the syntax of noun phrases, passives, and coordination phenomena. Valia Kordoni and Julia Neu are the main developers of the Modern Greek Resource Grammar.

NorSource Norwegian Grammar

Under development at the Norwegian University of Science and Technology (NTNU) in Trondheim. Similar in spirit to the other resource grammars, NorSource aims for a re-usable and precise grammar of Norwegian, adapting the theory of HPSG and Minimal Recursion Semantics to a language (family) that arguably presents a couple of novel challenges to existing work within the HPSG framework. Grammar development is partially funded by the EU Deep-Thought initiative and currently focuses on core syntactic constructions, argument structure and the syntax–semantics interface, and interfacing to an existing computational lexicon for Norwegian. Lars Hellan and Petter Haugereid at NTNU are the main NorSource developers, working with a team of other researchers and students.

Spanish Resource Grammar (SRG)

A computational grammar for Spanish currently being developed at the Institut Universitari de Lingüística Aplicada of Universitat Pompeu Fabra. SRG development is currently funded by the Juan de la Cierva program (MEC, Spain) within the TEXTERM-II project (BFF2003-2111). Montserrat Marimon and Núria Bel are the main developers of the SRG.

As several HPSG implementations evolved within the same common formalism, it became clear that homogeneity among existing grammars could be increased and development cost for new grammars greatly reduced by compiling an inventory of cross-linguistically valid (or at least useful) types and constructions. The LinGO Grammar Matrix provides a starter kit for grammar engineers, facilitating not only efficient bootstrapping but also rapid growth towards the wide coverage necessary for robust natural language processing and the precision parses and semantic representations that the ‘deep’ processing paradigm has to offer. The Matrix (in its current release version 0.4) comprises (a) type definitions for the basic feature geometry and technical devices, (b) the representation and composition machinery for Minimal Recursion Semantics in a typed feature structure grammar, (c) general classes of rules, including derivational and inflectional (lexical) rules, unary and binary phrase structure rules, headed and non-headed rules, and head-initial and head-final rules, and (d) types for basic constructions such as head-complement, head-specifier, head-subject, head-filler, and head-modifier rules and coordination, as well as more specialized classes of constructions.
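
The division of labour between Matrix-provided classes and grammar-specific specializations can be pictured as cross-classification in a type hierarchy. The toy Python sketch below, whose class names are purely illustrative and are not Matrix types, shows a head-complement rule obtained by combining a generic headed-phrase class with a head-initial ordering class.

    # Toy illustration of cross-classified rule types, in the spirit of the
    # Matrix inventory (headed vs. non-headed, head-initial vs. head-final,
    # basic constructions).  Class names are illustrative, not Matrix types.

    class Phrase:
        """Most general class of phrase structure rules."""
        def __init__(self, daughters):
            self.daughters = daughters

    class HeadedPhrase(Phrase):
        """Phrases with a designated head daughter."""
        head_index = None
        @property
        def head(self):
            return self.daughters[self.head_index]

    class HeadInitial(HeadedPhrase):
        head_index = 0    # the head precedes its sister(s)

    class HeadFinal(HeadedPhrase):
        head_index = -1   # the head follows its sister(s)

    class HeadComplement(HeadInitial):
        """Basic head-complement construction for a head-initial language."""
        @property
        def complement(self):
            return self.daughters[1]

    vp = HeadComplement(["reads", "books"])
    print(vp.head, vp.complement)   # -> reads books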

Finally, as processing efficiency and grammatical coverage have become less pressing concerns for ‘deep’ NLP applications, the research focus of several DELPH-IN members has shifted to combinations of ‘deep’ processing with stochastic approaches to NLP, on the one hand, and to building hybrid NLP systems that integrate ‘deep’ and ‘shallow’ techniques in novel ways, on the other. More specifically, the transfer of DELPH-IN resources into industry has amplified the need for improved parse ranking, disambiguation, and robust recovery techniques, and there is now broad consensus that applications of broad-coverage linguistic grammars for analysis or generation require the use of sophisticated stochastic models. The LinGO Redwoods initiative provides the methodology and tools for a novel type of treebank, far richer in the granularity of available linguistic information and dynamic in both the access to treebank information and its evolution over time. Redwoods has completed two sets of treebanks, each of around 7,000 sentences, for Verbmobil transcribed dialogues and customer emails from an ecommerce domain. Ongoing research by the Redwoods group at Stanford (and partners in Edinburgh and Saarbrücken) is investigating generative and conditional probabilistic models for parse disambiguation in conjunction with the LinGO ERG (and other DELPH-IN grammars).
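
The conditional models explored in this line of work are typically log-linear: each candidate analysis of a sentence is described by a vector of feature counts, and the model prefers the analysis with the highest weighted score. The Python sketch below illustrates that ranking step; the feature names and weights are invented for illustration, and in practice the weights would be estimated from a Redwoods-style treebank.

    # Minimal sketch of conditional (log-linear) parse ranking: given
    # several candidate analyses of one sentence, score each as a weighted
    # sum of feature counts and normalize into a conditional distribution.
    # Feature names and weights are invented for illustration.
    import math

    WEIGHTS = {                      # estimated from a treebank in practice
        "head-complement": 0.7,
        "head-modifier": 0.3,
        "low-attachment": -0.4,
    }

    def score(features):
        """Linear score w . f(tree) for one candidate analysis."""
        return sum(WEIGHTS.get(name, 0.0) * count
                   for name, count in features.items())

    def rank(candidates):
        """Return (analysis, conditional probability) pairs, best first."""
        scores = {name: score(f) for name, f in candidates.items()}
        z = sum(math.exp(s) for s in scores.values())
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [(name, math.exp(s) / z) for name, s in ranked]

    # Two hypothetical analyses of one ambiguous sentence.
    candidates = {
        "high-attachment": {"head-complement": 2, "head-modifier": 1},
        "low-attachment": {"head-complement": 2, "low-attachment": 1},
    }
    print(rank(candidates))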

Heart of Gold

The Heart of Gold environment is an XML-based middleware for the integration of deep and shallow natural language processing components, with a focus on robust, multilingual, application-oriented HPSG parsing assisted by, for example, shallow part-of-speech taggers, chunkers, and named entity recognizers. The Heart of Gold provides a uniform infrastructure for building applications that use RMRS-based and/or XML-based natural language processing components. The middleware itself has been developed at DFKI and Saarland University within the DeepThought and Quetal projects and is published under the LGPL. However, many of the NLP components for which adapters (‘Modules’) are provided come with different licenses, most of them free for research purposes. The deep component that is currently integrated is PET, with all of the deep HPSG grammars mentioned on the DELPH-IN site. Additional deep and shallow NLP components can be integrated easily by providing a simple Java class or an XML-RPC interface.
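
For the XML-RPC route, a component only needs to expose its analysis function as an XML-RPC method. The Python sketch below shows the general shape of such a service using the standard library; the method name analyze, the port, and the returned annotation format are illustrative assumptions, not the actual Heart of Gold module API.

    # Sketch of a shallow NLP component exposed over XML-RPC, as one way a
    # component might be hooked up to middleware such as the Heart of Gold.
    # The method name 'analyze', the port, and the returned annotation
    # format are illustrative assumptions, not the actual module API.
    from xmlrpc.server import SimpleXMLRPCServer

    def analyze(text):
        """Return a trivial token annotation for the input text."""
        return {
            "text": text,
            "tokens": [{"form": t, "position": i}
                       for i, t in enumerate(text.split())],
        }

    if __name__ == "__main__":
        server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
        server.register_function(analyze, "analyze")
        print("shallow component listening on http://localhost:8000/")
        server.serve_forever()

A client could then call the hypothetical service with xmlrpc.client.ServerProxy("http://localhost:8000/").analyze("some text").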