A historical tour of the web
On both sides of the Atlantic, public institutions and universities have developed a number of projects of great interest to us.
[Discuss here the creation of the TEI, which has already celebrated its tenth anniversary]
EAGLES I and II (1995-1999)
Within Europe, EAGLES I and II (Expert Advisory Group on Language Engineering Standards) stand out. The first project ended in 1996; the second ran from 1997 to the spring of 1999. According to the introduction (http://www.ilc.pi.cnr.it/EAGLES96/intro.html):
EAGLES is an initiative of the European Commission (…)
which aims to accelerate the provision of standards for:
- Very large-scale language resources (such as text corpora, computational lexicons and speech corpora);
- Means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools;
- Means of assessing and evaluating resources, tools and products.
The work towards common specifications is carried out by
five working groups:
- Text Corpora
- Computational Lexicons
- Grammar Formalisms
- Evaluation
- Spoken Language
One outcome of this work was the Corpus Encoding Standard (CES, http://www.cs.vassar.edu/CES/) and XCES (http://www.cs.vassar.edu/XCES/), its XML version.
The CES is designed to be optimally suited for use in
language engineering research and applications, in order to
serve as a widely accepted set of encoding standards for
corpus-based work in natural language processing
applications. The CES is an application of SGML compliant
with
the specifications of the TEI Guidelines.
The CES specifies a minimal encoding level that corpora
must achieve to be considered standardized in terms of
descriptive representation (marking of structural and
typographic information) as well as general architecture (so
as to be maximally suited for use in a text database). It also
provides encoding specifications for linguistic annotation,
together with a data architecture for linguistic corpora.
In its present form, the CES provides the following:
- a set of metalanguage level recommendations (particular profile of SGML use, character sets, etc.);
- tagsets and recommendations for documentation of encoded data;
- tagsets and recommendations for encoding primary data, including written texts across all genres, for the purposes of corpus-based work in language engineering.
- tagsets and recommendations for encoding linguistic annotation commonly associated with texts in language engineering, currently including:
  - segmentation of the text into sentences and words (tokens),
  - morpho-syntactic tagging,
  - parallel text alignment.
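The annotation levels just listed (sentence and token segmentation plus morpho-syntactic tags) can be pictured with a small sketch. The Python fragment below builds a tiny annotated sentence in that spirit; the element and attribute names (body, s, tok, pos) and the part-of-speech labels are simplified assumptions chosen for the illustration, not the official CES/XCES tag set.

import xml.etree.ElementTree as ET

# Minimal sketch: wrap pre-tokenized, pre-tagged sentences in sentence (s)
# and token (tok) elements, the kind of segmentation and morpho-syntactic
# annotation the CES describes. Tag names are illustrative assumptions.
def annotate(sentences):
    body = ET.Element("body")
    for s_id, tokens in enumerate(sentences, start=1):
        s = ET.SubElement(body, "s", id=f"s{s_id}")
        for t_id, (form, pos) in enumerate(tokens, start=1):
            tok = ET.SubElement(s, "tok", id=f"s{s_id}.t{t_id}", pos=pos)
            tok.text = form
    return body

sample = [[("Corpora", "NNS"), ("need", "VBP"), ("standards", "NNS"), (".", "PUNCT")]]
print(ET.tostring(annotate(sample), encoding="unicode"))

Run as-is, this prints a single s element containing four tok children, each carrying an identifier and a part-of-speech attribute.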
However, the most influential outcome of these projects is the EAGLES Guidelines. The group's work is continued by the ISLE project.
MULTEXT (1994-1996)
MULTEXT (Multilingual Text Tools and Corpora, LRE 62-050, 1994-96, http://www.lpl.univ-aix.fr/projects/multext/). These were its initial objectives:
Existing tools for NLP and
MT corpus-based research are
typically embedded in large, non-adaptable systems which
are fundamentally incompatible. Little effort has been
made to develop software standards, and software
reusability is virtually non-existent. As a result, there
is a serious lack of generally usable tools to manipulate
and analyze text corpora that are widely available for
research, especially for multi-lingual applications. At
the same time, the availability of data is hampered by a
lack of well-established standards for encoding
corpora. Although the TEI has
provided guidelines for text
encoding, they are so far largely untested on real-scale
data, especially multi-lingual data. Further, the TEI
guidelines offer a broad range of text encoding solutions
serving a variety of disciplines and applications, and are
not intended to provide specific guidance for the purposes
of NLP and MT corpus-based research. MULTEXT proposes to
tackle both of these problems. First, MULTEXT will work
toward establishing a software standard, which we see as
an essential step toward reusability, and publish the
standard to enable future development by others. Second,
MULTEXT will test and extend the TEI standards on
real-size data, and ultimately develop TEI-based encoding
conventions specifically suited to multi-lingual corpora
and the needs of NLP and MT corpus-based research.
Tools developed by the MULTEXT project are
ISLE (2000-2002)
The project's web site is at http://lingue.ilc.pi.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm. There we read:
The ISLE project which started on 1 January
2000 continues work carried out under the EAGLES
initiative. ISLE (International Standards for Language
Engineering) is both the name of a project and the name of an
entire set of co-ordinated activities regarding the HLT
field. ISLE acts under the aegis of the EAGLES initiative,
which has seen a successful development and a broad deployment
of a number of recommendations and de facto
standards.[1]
The aim of ISLE is to develop HLT standards within an
international framework, in the context of the EU-US
International Research Cooperation initiative. Its objectives
are to support national projects, HLT RTD projects and the
language technology industry in general by developing,
disseminating and promoting de facto HLT standards and
guidelines for language resources, tools and
products.[2]
ISLE targets the 3 areas: multilingual lexicons, natural
interaction and multimodality
(NIMM), and evaluation of
HLT
systems. These areas were chosen not only for their relevance
to the current HLT call but also for their long-term
significance. For multilingual computational lexicons, ISLE
will:[3]
- extend EAGLES work on lexical semantics, necessary to establish inter-language links;
- design standards for multilingual lexicons;
- develop a prototype tool to implement lexicon guidelines and standards;
- create exemplary EAGLES-conformant sample lexicons and tag exemplary corpora for validation purposes;
- develop standardised evaluation procedures for lexicons.
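As a rough illustration of the inter-language links mentioned in the first point, the following Python sketch models a multilingual lexicon as entries whose senses point to equivalent senses in other languages. The field names and the sense-identifier scheme are assumptions made for the example, not part of any ISLE or EAGLES specification.

from dataclasses import dataclass, field

# Toy model: each entry belongs to one language; each sense can list the
# identifiers of equivalent senses in other languages. These names and the
# id scheme are illustrative assumptions, not an ISLE format.
@dataclass
class Sense:
    sense_id: str
    gloss: str
    equivalents: list[str] = field(default_factory=list)

@dataclass
class Entry:
    lemma: str
    lang: str
    pos: str
    senses: list[Sense] = field(default_factory=list)

bank = Entry("bank", "en", "noun", [
    Sense("en:bank:1", "financial institution", ["es:banco:1"]),
    Sense("en:bank:2", "edge of a river", ["es:orilla:1"]),
])

for sense in bank.senses:
    print(sense.gloss, "->", sense.equivalents)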
SALT (2000-2001)
«SALT» (Standards-based Access to multilingual Lexicons and Terminologies) was a project within the Fifth Framework Programme (2000-2001). One of its web pages is at http://www.loria.fr/projets/SALT/saltsite.html. The project arose from the recognition of a need:
This project responds to the fact that many
organizations in the localization industry are now using
both human translation enhanced by productivity tools
and MT with or without human
post-editing. This duality of translation
modes brings with it the need to integrate existing
resources in the form of (a) the NLP lexicons used
in MT (which we categorize as
lexbases) and (b) the
concept-oriented terminology databases used in
human-translation productivity tools (which we call
termbases). This
integration facilitates consistency
among various translation activities and leverages data
from expensive information sources for both lex side and
the term side of language processing.
The SALT project combines two recently finalized
interchange formats: «OLIF»
(Open Lexicon Interchange Format),
which focuses on the interchange of data
among lexbase resources from various machine
translation systems, (Thurmaier et al. 1999), and
«MARTIF»
(ISO 12200:1999, MAchine-Readable Terminology
Interchange Format), which facilitates the interchange
of termbase resources with conceptual data models
ranging from simple to sophisticated. The goal of SALT
is to integrate lexbase and termbase resources into a
new kind of database, a lex/term-base called
«XLT»
(eXchange format for Lex/Term-data).
XLT is based on XML. The «Default XLT» is known as «TBX»: ‘TermBase eXchange format’.
Control of TBX has been handed over from the SALT
project (…) to LISA (and its OSCAR SIG).
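The concept-oriented shape of a TBX term entry (one entry per concept, one langSet per language) can be sketched as follows. The fragment is hand-written for the illustration and is not a complete or validated TBX document; element names such as termEntry, langSet, tig and term follow the general TBX layout, but treat them as an assumption rather than a normative example.

import xml.etree.ElementTree as ET

# Illustrative sketch: parse a tiny TBX-style term entry and list its terms
# per language. The sample is hand-written, not a validated TBX file.
TBX_SAMPLE = """
<termEntry id="c42">
  <langSet xml:lang="en"><tig><term>terminology database</term></tig></langSet>
  <langSet xml:lang="es"><tig><term>base de datos terminológica</term></tig></langSet>
</termEntry>
"""

XML_NS = "{http://www.w3.org/XML/1998/namespace}"  # namespace bound to the xml: prefix

entry = ET.fromstring(TBX_SAMPLE)
for lang_set in entry.findall("langSet"):
    lang = lang_set.get(XML_NS + "lang")
    terms = [term.text for term in lang_set.iter("term")]
    print(lang, terms)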
LISA and OSCAR
Pending and urgent: TMX, TBX, SRX.