A historical tour of the web
On both sides of the Atlantic, public institutions and universities have developed a number of projects of great interest to us.
[Discuss here the creation of the TEI, which has already celebrated its tenth anniversary]
EAGLES I and II (1995-1999)
Within Europe, EAGLES I and II (Expert Advisory Group on Language Engineering Standards) stand out. The first project ended in 1996; the second ran from 1997 to the spring of 1999. According to the introduction (http://www.ilc.pi.cnr.it/EAGLES96/intro.html):
EAGLES is an initiative of the European Commission (…)
which aims to accelerate the provision of standards for:
- Very large-scale language resources (such as text corpora, computational lexicons and speech corpora);
- Means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools;
- Means of assessing and evaluating resources, tools and products.
The work towards common specifications is carried out by
five working groups:
- Text Corpora
- Computational Lexicons
- Grammar Formalisms
- Evaluation
- Spoken Language
One outcome of this work was the Corpus Encoding Standard (CES, http://www.cs.vassar.edu/CES/) and XCES (http://www.cs.vassar.edu/XCES/), its XML version.
The CES is designed to be optimally suited for use in
language engineering research and applications, in order to
serve as a widely accepted set of encoding standards for
corpus-based work in natural language processing
applications. The CES is an application of SGML compliant
with
the specifications of the TEI Guidelines.
The CES specifies a minimal encoding level that corpora
must achieve to be considered standardized in terms of
descriptive representation (marking of structural and
typographic information) as well as general architecture (so
as to be maximally suited for use in a text database). It also
provides encoding specifications for linguistic annotation,
together with a data architecture for linguistic corpora.
In its present form, the CES provides the following:
- a set of metalanguage level recommendations (particular profile of SGML use, character sets, etc.);
- tagsets and recommendations for documentation of encoded data;
- tagsets and recommendations for encoding primary data, including written texts across all genres, for the purposes of corpus-based work in language engineering.
- tagsets and recommendations for encoding linguistic annotation commonly associated with texts in language engineering, currently including:
  - segmentation of the text into sentences and words (tokens),
  - morpho-syntactic tagging,
  - parallel text alignment.
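The annotation levels just listed (sentence and token segmentation plus morpho-syntactic tags) can be pictured with a small sketch. The Python fragment below builds a tiny annotated sentence in that spirit; the element and attribute names (body, s, tok, pos) and the part-of-speech labels are simplified assumptions chosen for the illustration, not the official CES/XCES tag set.

import xml.etree.ElementTree as ET

# Minimal sketch: wrap pre-tokenized, pre-tagged sentences in sentence (s)
# and token (tok) elements, the kind of segmentation and morpho-syntactic
# annotation the CES describes. Tag names are illustrative assumptions.
def annotate(sentences):
    body = ET.Element("body")
    for s_id, tokens in enumerate(sentences, start=1):
        s = ET.SubElement(body, "s", id=f"s{s_id}")
        for t_id, (form, pos) in enumerate(tokens, start=1):
            tok = ET.SubElement(s, "tok", id=f"s{s_id}.t{t_id}", pos=pos)
            tok.text = form
    return body

sample = [[("Corpora", "NNS"), ("need", "VBP"), ("standards", "NNS"), (".", "PUNCT")]]
print(ET.tostring(annotate(sample), encoding="unicode"))

Run as-is, this prints a single s element containing four tok children, each carrying an identifier and a part-of-speech attribute.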
However, the most influential outcome of these projects is the EAGLES Guidelines. The group's work is continued by the ISLE project.
MULTEXT (1994-1996)
MULTEXT (Multilingual Text Tools and Corpora, LRE 62-050, 1994-96, http://www.lpl.univ-aix.fr/projects/multext/). These were its initial objectives:
Existing tools for NLP and
MT corpus-based research are
typically embedded in large, non-adaptable systems which
are fundamentally incompatible. Little effort has been
made to develop software standards, and software
reusability is virtually non-existent. As a result, there
is a serious lack of generally usable tools to manipulate
and analyze text corpora that are widely available for
research, especially for multi-lingual applications. At
the same time, the availability of data is hampered by a
lack of well-established standards for encoding
corpora. Although the TEI has
provided guidelines for text
encoding, they are so far largely untested on real-scale
data, especially multi-lingual data. Further, the TEI
guidelines offer a broad range of text encoding solutions
serving a variety of disciplines and applications, and are
not intended to provide specific guidance for the purposes
of NLP and MT corpus-based research. MULTEXT proposes to
tackle both of these problems. First, MULTEXT will work
toward establishing a software standard, which we see as
an essential step toward reusability, and publish the
standard to enable future development by others. Second,
MULTEXT will test and extend the TEI standards on
real-size data, and ultimately develop TEI-based encoding
conventions specifically suited to multi-lingual corpora
and the needs of NLP and MT corpus-based research.
Tools developed by the MULTEXT project are
ISLE (2000-2002)
The project's web site is at http://lingue.ilc.pi.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm. There we read:
The ISLE project which started on 1 January
2000 continues work carried out under the EAGLES
initiative. ISLE (International Standards for Language
Engineering) is both the name of a project and the name of an
entire set of co-ordinated activities regarding the HLT
field. ISLE acts under the aegis of the EAGLES initiative,
which has seen a successful development and a broad deployment
of a number of recommendations and de facto
standards.[1]
The aim of ISLE is to develop HLT standards within an
international framework, in the context of the EU-US
International Research Cooperation initiative. Its objectives
are to support national projects, HLT RTD projects and the
language technology industry in general by developing,
disseminating and promoting de facto HLT standards and
guidelines for language resources, tools and
products.[2]
ISLE targets the 3 areas: multilingual lexicons, natural
interaction and multimodality
(NIMM), and evaluation of
HLT
systems. These areas were chosen not only for their relevance
to the current HLT call but also for their long-term
significance. For multilingual computational lexicons, ISLE
will:[3]
- extend EAGLES work on lexical semantics, necessary to establish inter-language links;
- design standards for multilingual lexicons;
- develop a prototype tool to implement lexicon guidelines and standards;
- create exemplary EAGLES-conformant sample lexicons and tag exemplary corpora for validation purposes;
- develop standardised evaluation procedures for lexicons.
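As a rough illustration of the inter-language links mentioned in the first point, the following Python sketch models a multilingual lexicon as entries whose senses point to equivalent senses in other languages. The field names and the sense-identifier scheme are assumptions made for the example, not part of any ISLE or EAGLES specification.

from dataclasses import dataclass, field

# Toy model: each entry belongs to one language; each sense can list the
# identifiers of equivalent senses in other languages. These names and the
# id scheme are illustrative assumptions, not an ISLE format.
@dataclass
class Sense:
    sense_id: str
    gloss: str
    equivalents: list[str] = field(default_factory=list)

@dataclass
class Entry:
    lemma: str
    lang: str
    pos: str
    senses: list[Sense] = field(default_factory=list)

bank = Entry("bank", "en", "noun", [
    Sense("en:bank:1", "financial institution", ["es:banco:1"]),
    Sense("en:bank:2", "edge of a river", ["es:orilla:1"]),
])

for sense in bank.senses:
    print(sense.gloss, "->", sense.equivalents)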
SALT (2000-2001)
«SALT» (Standards-based Access to multilingual Lexicons and Terminologies) was a project within the Fifth Framework Programme (2000-2001). One of its web pages is at http://www.loria.fr/projets/SALT/saltsite.html. The project arose from the recognition of a need:
This project responds to the fact that many
organizations in the localization industry are now using
both human translation enhanced by productivity tools
and MT with or without human
post-editing. This duality of translation
modes brings with it the need to integrate existing
resources in the form of (a) the NLP lexicons used
in MT (which we categorize as
lexbases) and (b) the
concept-oriented terminology databases used in
human-translation productivity tools (which we call
termbases). This
integration facilitates consistency
among various translation activities and leverages data
from expensive information sources for both lex side and
the term side of language processing.
The SALT project combines two recently finalized
interchange formats: «OLIF»
(Open Lexicon Interchange Format),
which focuses on the interchange of data
among lexbase resources from various machine
translation systems, (Thurmaier et al. 1999), and
«MARTIF»
(ISO 12200:1999, MAchine-Readable Terminology
Interchange Format), which facilitates the interchange
of termbase resources with conceptual data models
ranging from simple to sophisticated. The goal of SALT
is to integrate lexbase and termbase resources into a
new kind of database, a lex/term-base called
«XLT»
(eXchange format for Lex/Term-data).
XLT is based on XML. The «Default XLT» is known as «TBX»: ‘TermBase eXchange format’.
Control of TBX has been handed over from the SALT
project (…) to LISA (and its OSCAR SIG).
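The concept-oriented shape of a TBX term entry (one entry per concept, one langSet per language) can be sketched as follows. The fragment is hand-written for the illustration and is not a complete or validated TBX document; element names such as termEntry, langSet, tig and term follow the general TBX layout, but treat them as an assumption rather than a normative example.

import xml.etree.ElementTree as ET

# Illustrative sketch: parse a tiny TBX-style term entry and list its terms
# per language. The sample is hand-written, not a validated TBX file.
TBX_SAMPLE = """
<termEntry id="c42">
  <langSet xml:lang="en"><tig><term>terminology database</term></tig></langSet>
  <langSet xml:lang="es"><tig><term>base de datos terminológica</term></tig></langSet>
</termEntry>
"""

XML_NS = "{http://www.w3.org/XML/1998/namespace}"  # namespace bound to the xml: prefix

entry = ET.fromstring(TBX_SAMPLE)
for lang_set in entry.findall("langSet"):
    lang = lang_set.get(XML_NS + "lang")
    terms = [term.text for term in lang_set.iter("term")]
    print(lang, terms)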
LISA and OSCAR
Pending and urgent: TMX, TBX, SRX.