Massive multilingual corpus compilation: Acquis


The FIDA & MULTEXT-East
language resources
Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
http://nl.ijs.si/et/
Gralis 2006
Institut für Slawistik der Universität Graz
2006-05-09
Overview
1. Background
2. FIDA: a reference corpus of Slovene
3. MULTEXT-East: morphosyntactic resources for Central and East European languages
4. Other language resources for Slovene
Gralis
2006-05-09
Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute
Language Resources

LRs comprise three layers of data:
– corpora: mono- or multilingual, reference or specialised, …
/variously annotated/
– lexica: vocabularies, morphosyntactic, syntactic, semantic,
(ontologies)
– standards: linguistic and technical encoding

LRs, esp. corpora, are used for
empirical language research:
– linguistic studies:
(annotated) corpus + (sophisticated) search engine
– human language technology R&D:
testing and training dataset
Part I.
The FIDA corpus

Slovene reference corpus for
linguistic studies
FIDA
http://www.fida.net/
Joint project (1997-2000) of
– Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
– Institut Jožef Stefan: Tomaž Erjavec
– DZS: Simon Krek
– Amebis: Peter Holozan, Miro Romih
Financed by industry partners
Characteristics of FIDA




monolingual
synchronic
written language
reference
– representative
– balanced

annotated
Sizes
Total: 103,513,072 words in 29,177 texts
Avg. text length: 3,548 words
Largest texts:
– Leksikon DZS: 508,370 words
– 69 texts > 100,000 words
Smallest texts:
– 2,648 texts < 100 words
– 2 x <w>rezgrtshdrghgth4</w>
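The size figures above can be cross-checked with a little arithmetic. The following is a minimal sketch in Python; the variable names are illustrative, and it assumes that "2,648 < 100 words" counts texts, which the slide does not state explicitly.

```python
# Figures as given on the slide.
total_words = 103_513_072
total_texts = 29_177
short_texts = 2_648  # assumed: number of texts shorter than 100 words

# Average text length in words; rounds to the 3,548 quoted on the slide.
avg_len = total_words / total_texts
print(round(avg_len))

# Share of texts shorter than 100 words (derived here, not on the slide).
short_share = short_texts / total_texts
print(f"{short_share:.2%}")
```

So roughly one text in eleven is shorter than 100 words, while a few very large texts such as Leksikon DZS pull the average up.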
Time Composition



Oldest/most recent text: 1989/2000
Average date 1997-02
Texts/Words with unknown date:
3.94%/8.28%
FIDA taxonomy:
publication types
…
Ft.P.P.O (published)            95.72%
Ft.P.P.O.K (books)              22.71%
Ft.P.P.O.P (periodicals)        70.50%
Ft.P.P.O.P.C (newspaper)        46.59%
Ft.P.P.O.P.C.D (daily)          32.67%
Ft.P.P.O.P.C.T (weekly)         66.18%
Ft.P.P.O.P.C.V (multi-weekly)   17.74%
…
FIDA taxonomy:
text types
Ft.Z (text type)                 99.47%
Ft.Z.N (non-fiction)             93.57%
Ft.Z.N.N (non-professional)      75.14%
Ft.Z.N.S (professional)          18.37%
Ft.Z.N.S.H (hum. & soc. sci.)    10.57%
Ft.Z.N.S.N (nat. & tech. sci.)    6.04%
Ft.Z.U (fiction)                  5.90%
Ft.Z.U.D (drama)                  0.10%
Ft.Z.U.P (poetry)                 0.17%
Ft.Z.U.R (prose)                  5.12%
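The dotted taxonomy codes form a hierarchy: each code's parent is the code with its last component removed. A minimal Python sketch of how such codes can be walked and sanity-checked follows. It assumes that the percentages are shares of the whole corpus and that codes and percentages pair up in the order listed on the slide; the helper names are hypothetical.

```python
# Text-type shares (percent of corpus), paired in slide order (an assumption).
shares = {
    "Ft.Z": 99.47,
    "Ft.Z.N": 93.57,
    "Ft.Z.N.N": 75.14,
    "Ft.Z.N.S": 18.37,
    "Ft.Z.N.S.H": 10.57,
    "Ft.Z.N.S.N": 6.04,
    "Ft.Z.U": 5.90,
    "Ft.Z.U.D": 0.10,
    "Ft.Z.U.P": 0.17,
    "Ft.Z.U.R": 5.12,
}

def children(code, table):
    """Return the direct children of a taxonomy code (one level deeper)."""
    depth = code.count(".") + 1
    return [c for c in table
            if c.startswith(code + ".") and c.count(".") == depth]

def child_sum(code, table):
    """Sum the shares of the direct children, rounded to two decimals."""
    return round(sum(table[c] for c in children(code, table)), 2)

for code in shares:
    kids = children(code, shares)
    if kids:
        print(code, shares[code], "children sum to", child_sum(code, shares))
```

Under this reading, non-fiction and fiction together account for the whole of Ft.Z, while lower levels sum to slightly less than their parent, which would be consistent with some texts being classified only at the coarser level.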
Markup of FIDA



corpus elements annotated with metadata (bibliographic, taxonomy)
text linguistically annotated
encoded according to international
standards and recommendations
– technical: SGML, TEI P3
– linguistic: MULTEXT-East
(MULTEXT, EAGLES)
Linguistic annotation
Accessibility
Exploitation by partners:
– DZS: new dictionaries
– Amebis: development of HLT
– Arts faculty: teaching
– IJS: research on HLT
Availability to the public:
– access via concordance engine by Amebis
– free access, but displays only a few hits
– possibility of academic licences
FIDA (web site) no longer maintained!
FIDA+
http://www.fidaplus.net/

FIDA Plus project:
– Filozofska fakulteta, Fakulteta za družbene vede, Institut
Jožef Stefan
– DZS, Amebis


Financed by the ministry + ind. partners
Extend the corpus with
– Web materials
– spoken component



Better linguistic markup
Free concordances: up to 100 lines
Also possibility of licences
Concordancer
Output
Extended searches
Corpus “Nova Beseda”
http://bos.zrc-sazu.si/





being developed at Institute for
Slovene language, ZRC SAZU (Primož
Jakopin)
Web concordancer with no hit limit
now larger than FIDA
but much less varied:
fiction, Delo, DZ
not linguistically annotated
Part II.
MULTEXT-East

multilingual morphosyntactic
resources for HLT development
MULTEXT-East
resources

MULTEXT-East: Copernicus Joint Project COP 106
(1995-1997) Multilingual Texts and Corpora for
Eastern and Central European Languages


Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
– corpus encoding standardisation (TEI / CES)
– multilingual parallel, comparable, speech corpora
– morphosyntactic specifications (EAGLES / MULTEXT)
– (inflectional) lexicon
– annotated corpus
– language processing tools
History of MULTEXT-East
resources





First release 1998 on TELRI CD-ROM Vol II:
already extended with new languages
Resources since 1998 available on the Web:
http://nl.ijs.si/ME/
Second release 2002 in scope of EU CONCEDE:
re-encoding in XML/TEI, harmonisation
Third release 2004:
merge of first two releases, further languages
Work (indirectly) supported by:
TELRI, CONCEDE, NSF grant, bi-lateral projects
The Languages of
MULTEXT-East




Germanic: English
Romance: Romanian
Baltic:
– Latvian
– Lithuanian
Finno-Ugric:
– Estonian
– Hungarian
Slavic:
– Russian (East Slavic)
– Czech (West Slavic)
– Slovene (South West Slavic)
– Resian (Slovene dialect)
– Croatian (South West Slavic)
– Serbian (South West Slavic)
– Bulgarian (South East Slavic)
In progress:
– Macedonian
– Persian
Version 3



Available on http://nl.ijs.si/ME/V3/
Some parts completely free, others
free for research → Web licence
Web page gives:
– extensive documentation
– bibliography list
– web licence form
– resource download
The MULTEXT
morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus
1. Morphosyntactic
specifications



Based on EAGLES / MULTEXT
Define PoS, their attributes and values
The specs are a document containing:
– introduction
– common tables
– language particular sections


Written in LaTeX → PDF & HTML
Derived XML/TEI encoding as feature structures
Example common table
Example language-specific section
– table (shows only categories actually used)
– notes
– combinations
– lexicon
– for Slovene (FIDA): localisation of category names
Morphosyntactic
Complexity
2. The lexica




Medium size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
~ all word-forms of ca. 15,000 lemmas
Lexical entry is composed of three fields:
– the word-form: the inflected form of the word
– the lemma: the base-form of the word
– the morphosyntactic description (MSD)
Example: Slovene lexicon
abeced      abeceda   Ncfdg
abeced      abeceda   Ncfpg
abeceda     =         Ncfsn
abecedah    abeceda   Ncfdl
abecedah    abeceda   Ncfpl
abecedam    abeceda   Ncfpd
abecedama   abeceda   Ncfdd
abecedama   abeceda   Ncfdi
abecedami   abeceda   Ncfpi
abecede     abeceda   Ncfpa
abecede     abeceda   Ncfpn
abecede     abeceda   Ncfsg
abecedi     abeceda   Ncfda
abecedi     abeceda   Ncfdn
…
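A lexical entry of this kind (word-form, lemma, MSD) is straightforward to process. The sketch below, in Python, reads one whitespace-separated entry and decodes a noun MSD position by position. The function names are hypothetical, the "=" lemma is assumed to mean "identical to the word-form", and the attribute table covers only the noun values that occur in the example, so it is an illustration rather than the full specification.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    form: str   # inflected word-form
    lemma: str  # base form
    msd: str    # morphosyntactic description

def parse_entry(line: str) -> Entry:
    """Split one lexicon line; '=' is assumed to mean lemma == word-form."""
    form, lemma, msd = line.split()
    if lemma == "=":
        lemma = form
    return Entry(form, lemma, msd)

# Positional attributes of a noun MSD after the leading 'N' (partial table,
# limited to values seen in the example above).
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

def decode_noun_msd(msd: str) -> dict:
    """Expand a noun MSD such as 'Ncfsn' into attribute-value pairs."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd}")
    features = {"category": "noun"}
    for (name, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code != "-":  # '-' marks an attribute that does not apply
            features[name] = values.get(code, code)
    return features

entry = parse_entry("abeceda = Ncfsn")
print(entry)
print(decode_noun_msd(entry.msd))
```

Read this way, Ncfsn is a common feminine noun in the nominative singular, and the two entries for abeced differ only in number (dual vs. plural genitive).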
Lexicon sizes
3. The “1984” corpus





Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)
Example linguistic encoding
Sentence alignment & context-disambiguated lemmas and MSDs
<text id="Osl." lang="sl">
<body>
<div type="part" id="Osl.1">
<div type="chapter" id="Osl.1.2">
<p id="Osl.1.2.2">
<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>
<c>.</c>
</s>
…
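To illustrate how this encoding can be consumed, here is a minimal Python sketch that reads one sentence element and lists each token with its lemma and MSD. It uses only the standard library; the sample is a shortened, self-closed copy of the sentence above, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sentence above, closed so that it is well-formed.
SAMPLE = """<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>"""

def tokens(sentence_xml):
    """Return (form, lemma, MSD) triples; punctuation has no lemma or MSD."""
    root = ET.fromstring(sentence_xml)
    result = []
    for element in root:
        if element.tag == "w":    # word token
            result.append((element.text, element.get("lemma"),
                           element.get("ana")))
        elif element.tag == "c":  # punctuation token
            result.append((element.text, None, None))
    return result

for form, lemma, msd in tokens(SAMPLE):
    print(form, lemma, msd)
```

Because every word carries its lemma and MSD as attributes, the same loop can feed a tagger training set or a frequency count without any further linguistic processing.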
Quantifying the corpus
Utility of MULTEXT-East
LRs



Specifications became, for some, the “national” standard
Training/testing dataset for HLT development:
PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
– Word-sense disambiguation
– WordNet development and evaluation
– Syntactic parser induction



Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages:
Resian, Croatian, Macedonian, Persian
LRs @ JSI
http://nl.ijs.si/nl.html#Resource
Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS
Overview of Slovene LRs and services @
Slovenian Language Technologies Society
http://nl.ijs.si/sdjt/
Thank you!