Encoding Croatian Corpora
Marko Tadić
([email protected], www.hnk.ffzg.hr/mt)
Department of linguistics/Institute of linguistics,
Faculty of philosophy, University of Zagreb
(www.ffzg.hr/www.ffzg.hr/zzl/zzl-home.htm)
Tübingen, 2001-02-22
Lecture plan

Monolingual corpora
– Croatian National Corpus (HNK)

Bilingual corpora
– Croatian-English parallel corpus
– Croatian-Slovenian parallel corpus
– Acquis translations parallel corpus
Croatian National Corpus (HNK) 1

project of the Ministry of Science and Technology of the
Republic of Croatia 130718, Computational processing of
Croatian language, formally started 1996, actually 1998

theoretical foundations (www.hnk.ffzg.hr/cilj) in 1995, published:
– Tadić (1996) Računalna obradba hrvatskoga i nacionalni korpus,
Suvremena lingvistika 41-42, 603-612
– Tadić (1998) Raspon, opseg i sastav korpusa suvremenoga
hrvatskoga jezika, Filologija 30-31, 337-347

need for a reference corpus of Croatian
– 1st step: written
– later: some 10% spoken

a tentative solution for its composition

the size, time-span and structure were elaborated

accessibility via WWW service was suggested
HNK 2: structure

30m
30-million Corpus of
Contemporary Croatian
– texts from 1990 until today
– different domains and genres
– representativeness for
contemporary Croatian
standard

HETA
Croatian Electronic Text Archive
(Hrvatski Elektronski Tekstovni
Arhiv)
– whole texts older than 1990
– whole texts of complete
publications after 1990 which
would skew the
representativeness of 30m
HNK 3: 30m text typology
Text type                                      %    Tokens
1. Informative texts/Faction                  76  22800000
1.1. newspapers                               37  11100000
1.1.1. daily                                  22   6600000
1.1.2. weekly                                  9   2700000
1.1.3. bi-weekly                               3    900000
1.1.4. irregular                               3    900000
1.2. magazines                                17   5100000
1.2.1. weekly                                 10   3000000
1.2.2. bi-weekly                               1    300000
1.2.3. monthly                                 3    900000
1.2.4. bi/tri-monthly                          3    900000
1.3. books                                    22   6600000
1.3.1. journalism                              7   2100000
1.3.2. crafts etc.                             2    600000
1.3.3. science                                13   3900000
2. Imaginative texts/Fiction                  21   6300000
2.1. prose                                    21   6300000
2.1.1. novels                                 13   3900000
2.1.2. stories                                 7   2100000
2.1.3. diaries, travel notes...                1    300000
3. Mixed texts                                 3    900000
3.1. imaginative-factographic pieces           1    300000
3.2. essays                                    1    300000
3.3. speeches                                  1    300000
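The token figures in the typology follow directly from the percentage shares applied to the 30-million-token target. A minimal sketch of that arithmetic for the three top-level text types; the names `TARGET_TOKENS`, `TOP_LEVEL_SHARES` and `token_quota` are illustrative and not part of the HNK tooling:

```python
# Target size of the 30m corpus in tokens (from "30-million Corpus").
TARGET_TOKENS = 30_000_000

# Percentage shares of the three top-level text types in the typology.
TOP_LEVEL_SHARES = {
    "Informative texts/Faction": 76,
    "Imaginative texts/Fiction": 21,
    "Mixed texts": 3,
}


def token_quota(share_percent: int, target: int = TARGET_TOKENS) -> int:
    """Return the number of tokens that a percentage share corresponds to."""
    return target * share_percent // 100


quotas = {name: token_quota(share) for name, share in TOP_LEVEL_SHARES.items()}
for name, tokens in quotas.items():
    print(f"{name}: {tokens} tokens")
```

The same function reproduces the lower-level rows, for example 22% for daily newspapers gives 6600000 tokens.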
HNK 4: corpus on www
http://www.hnk.ffzg.hr

Testing V 1.0: 1998-12-05
– 30m: 3 mW

Testing V 1.1: 1999-02-14 & 1999-07-20
– 30m: 7.67 mW
– HETA: 2.9 mW from CD-ROM: Classics of Croatian literature,
Naklada Bulaja, Zagreb, 1999

Testing V 1.1 (approx. 10 mW) of corpus is www accessible
– text format: quasi HTML, no XML
– no POS marking

Testing V 1.2 (approx. 17 mW)
– being filled right now
– no additional retrieval facilities
HNK 5: Statistics

www.hnk.ffzg.hr/stats
Item                                    Value
Hits                                    261182
Total Data Transferred                  7.28 gigabytes
Total Visiting Users                    28871
Time Period                             November 27, 1998, 08:43 AM to December 31, 2000, 11:47 PM
Average Hits per User                   9.05
Average Users per Day                   37.69
Average Data Transferred per Day        9.74 megabytes
Hits cached by Client                   67983 (26.03%)
Report generated on                     January 11, 2001 at 11:44 AM
Incomplete downloads/file requests      3037 (1.16%)
Log spans a period of                   766 days
Total failed requests                   16574 (6.35%)
Unique IP Addresses                     9480
Average Data Transferred per User       264.50 kilobytes
Average Hits per Day                    340.97
Average Data Transferred per Hit        29.24 kilobytes
Each user has visited approximately     3.05 times
Hits on Pages                           123105
Hits on Files                           18620
Hits on Images                          102883
Domain Name                           Hits  Percentage
Croatia/Hrvatska (.hr)              119542      63.01%
Commercial (.com)                    22371      11.79%
Germany (.de)                        18312       9.65%
Network (.net)                       11400       6.01%
Austria (.at)                         2848       1.50%
Educational (.edu)                    2037       1.07%
Slovenia (.si)                        1378       0.73%
Netherlands (.nl)                      990       0.52%
Australia (.au)                        870       0.46%
Czech Republic (.cz)                   782       0.41%
Italy (.it)                            770       0.41%
Canada (.ca)                           747       0.39%
Yugoslavia (.yu)                       698       0.37%
France (.fr)                           682       0.36%
Sweden (.se)                           669       0.35%
Poland (.pl)                           555       0.29%
Bosnia and Herzegowina (.ba)           502       0.26%
Russian Federation (.ru)               396       0.21%
United Kingdom (.uk)                   396       0.21%
Japan (.jp)                            373       0.20%
Switzerland (.ch)                      326       0.17%
New Zealand (.nz)                      294       0.15%
Denmark (.dk)                          276       0.15%
Slovakia (Slovak Republic) (.sk)       224       0.12%
Hungary (.hu)                          205       0.11%
Non-profit Organization (.org)         171       0.09%
Israel (.il)                           166       0.09%
Belgium (.be)                          148       0.08%
Greece (.gr)                           145       0.08%
Norway (.no)                           145       0.08%
Macedonia (.mk)                        118       0.06%
Spain (.es)                             96       0.05%
Finland (.fi)                           96       0.05%
Kuwait (.kw)                            83       0.04%
Portugal (.pt)                          71       0.04%
United States (.us)                     70       0.04%
Ukraine (.ua)                           65       0.03%
Ireland (.ie)                           64       0.03%
Bulgaria (.bg)                          64       0.03%
Brazil (.br)                            57       0.03%
Estonia (.ee)                           57       0.03%
HNK 6: text conversion and encoding

XML
– XCES (XML version of CES)
– Ide, Bonhomme & Romary (2000)

DIVs, Ps, Ws

S-boundary detection algorithm
– problem with ordinal numbers written with punctuation

input text formats
– WWW: HTML, XML
– DTP: RTF, DOC, QXD, WP, TXT etc.

conversion
– 2XML: custom-made software
• input: HTML, RTF / output: XML, no header
• two-step conversion by user-defined scripts
• enables high level of automation
HNK 7: corpus format 1
<?xml version="1.0"?>
<!DOCTYPE cesDoc PUBLIC "-//CES//DTD XML cesDoc//EN"
"xcesDoc.dtd" [
]>
<cesDoc version="3.19">
<cesHeader type="text" version="3.19">
<fileDesc>
<titleStmt>
<h.title>Electronic version of Vecernji list, vl990311</h.title>
<respStmt>
<respType>XCES markup prepared by</respType>
<respName>Bosko Bekavac</respName>
</respStmt>
</titleStmt>
<extent>
<wordCount>4456</wordCount>
<byteCount>25385</byteCount>
</extent>
<publicationStmt>
<distributor>Project MZT RH 130718</distributor>
<pubAddress>Institute of linguistics</pubAddress>
<telephone>+385 1 6120-142</telephone>
<fax>+385 1 6856-118</fax>
<eAddress>http://www.ffzg.hr/zzl/zzl-home.htm</eAddress>
<idno>76676665676</idno>
<availability status="free">
</availability>
<pubDate>1999-12-20</pubDate>
</publicationStmt>
<sourceDesc>
<biblStruct>
HNK 7: corpus format 2
<BODY>
<DIV0 type="article">
<HEAD type="nn">U GORICI SVETOJANSKOJ ODRŽAN 12. FESTIVAL PJEVAČA AMATERA</HEAD>
<HEAD type="na">Ivana osvojila županijski Sanremo</HEAD>
<HEAD type="pn">* Od 20 natjecatelja žiri je najboljom proglasio Ivanu Erdeljac s pjesmom "Crazy", druga je Antoni
<FIGURE>Publici su se najviše svidjeli Marija Šalić i Petar Puhijera</FIGURE>
<P>Pod medijskim pokroviteljstvom "Večernjeg lista" i Radio Jaske, a uz pomoć DIR "Rubinić" kao generalnog te još
održan je 12. festival pjevača amatera.</P>
<P>Prve festivalske večeri, na kojoj su nastupila 22 izvođača do 15 godina, prvu nagradu stručnog žirija odnijela
Nikolini Oslaković iz Gornje Reke za pjesmu "Neka mi ne svane", a treća Mariji Jurini iz Desinca za pjesmu "Ginem"
nagradu dodijelila Natali Rajnović iz Jaske za pjesmu "Don"t ever cry", a treću Aniti Oslaković iz Desinca za pjes
pjesmom "Izdali me".</P>
<P>Druga večer - s dvadeset starijih izvođača iz Jaske, Karlovca, Bjelovara, Zagreba i Velike Gorice - bila je oso
nije bilo lako odabrati najbolje.</P>
<P>Nakon poduže stanke tijekom koje su izbrojani glasovi - a koju su publici kratili gost večeri Ivo Pattiera te s
stručnog žirija, prvu nagradu i zlatnu plaketu "Večernjaka" dobila je Karlovčanka Ivana Erdeljac za vrlo dobro otp
a treća Kseniji Cvetetić iz Petrovine za pjesmu "Neka mi ne svane".</P>
<P>Publika je najviše glasova dodijelila svetojansko-zagrebačkom duetu Mariji Šalić i Petru Puhijeri za interpreta
mjesto publika je svrstala "Svetojanske tamburaše" koji su nastupili s pjesmom "Dobro jutro", a na treće Zagrepčan
<P>Najboljom debitanticom završne večeri proglašena je Zagrepčanka Marina Posilović s pjesmom "Piši, piši mi", a n
suseda, suseda". Čini se da su ovogodišnje nagrade - a bilo ih je doista mnogo, od sedmodnevnog boravka u Opatiji,
Oni koji ih nisu dobili, a možda su ih također zaslužili, neka se ovaj put utješe pljeskom publike, a dogodine će
županije - nastavlja se.</P>
<BYLINE>N. Godrijan-Videc</BYLINE>
</DIV0>
</BODY>
HNK 8: corpus format 3

tokenization
– TOKENIZER: custom-made software
• input: XML
• output 1: tabbed file for
data-base input
• output 2: tokenized XML
sample of output 1 (columns: token, document ID, character offset, token type)
<BODY>                   vl990301gr01    1    X
<DIV0 type="article">    vl990301gr01    7    X
<HEAD type="nn">         vl990301gr01    28   X
U                        vl990301gr01    44   R
GORICI                   vl990301gr01    46   R
SVETOJANSKOJ             vl990301gr01    53   R
ODR&#381;AN              vl990301gr01    66   R
12                       vl990301gr01    78   B
.                        vl990301gr01    80   I
FESTIVAL                 vl990301gr01    82   R
PJEVA&#268;A             vl990301gr01    91   R
AMATERA                  vl990301gr01    104  R
</HEAD>                  vl990301gr01    111  X
<HEAD type="na">         vl990301gr01    118  X
Ivana                    vl990301gr01    134  R
osvojila                 vl990301gr01    140  R
&#382;upanijski          vl990301gr01    149  R
Sanremo                  vl990301gr01    165  R
</HEAD>                  vl990301gr01    172  X
<HEAD type="pn">         vl990301gr01    179  X
*                        vl990301gr01    195  I
Od                       vl990301gr01    197  R
20                       vl990301gr01    200  B
natjecatelja             vl990301gr01    203  R
&#382;iri                vl990301gr01    216  R
je                       vl990301gr01    226  R
najboljom                vl990301gr01    229  R
proglasio                vl990301gr01    239  R
Ivanu                    vl990301gr01    249  R
Erdeljac                 vl990301gr01    255  R
s                        vl990301gr01    264  R
pjesmom                  vl990301gr01    266  R
"                        vl990301gr01    275  I
Crazy                    vl990301gr01    276  R
"                        vl990301gr01    281  I
,                        vl990301gr01    282  I
druga                    vl990301gr01    284  R
je                       vl990301gr01    290  R
Antonija                 vl990301gr01    293  R
Mikita                   vl990301gr01    302  R
s                        vl990301gr01    309  R
pjesmom                  vl990301gr01    311  R
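The tabbed output can be approximated with a small regular-expression tokenizer. This is only a sketch, not the actual TOKENIZER software: the reading of the type codes (X for markup, R for words, B for numbers, I for punctuation) is inferred from the sample, and the names `TOKEN_RE` and `tokenize` are made up for the illustration. Offsets are 1-based character positions, and a numeric character reference such as `&#381;` is kept inside the word, as in the sample.

```python
import re

# One alternative per token type; the group name doubles as the type code.
TOKEN_RE = re.compile(
    r"(?P<X><[^>]+>)"                  # XML tag
    r"|(?P<R>(?:[^\W\d_]|&#\d+;)+)"    # word: letters and numeric character references
    r"|(?P<B>\d+)"                     # number
    r"|(?P<I>\S)"                      # any other non-space character: punctuation
)


def tokenize(text, doc_id):
    """Return rows of (token, document ID, 1-based offset, type code)."""
    return [
        (m.group(), doc_id, m.start() + 1, m.lastgroup)
        for m in TOKEN_RE.finditer(text)
    ]


sample = (
    '<BODY><DIV0 type="article"><HEAD type="nn">'
    "U GORICI SVETOJANSKOJ ODR&#381;AN 12. FESTIVAL"
)
rows = tokenize(sample, "vl990301gr01")
for row in rows:
    # Tab-separated, ready for loading into a database table.
    print("\t".join(str(field) for field in row))
```

Keeping the offset with every token means that the database rows can always be mapped back to the exact position in the source XML file.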
HNK 9: corpus format 4

output 2: tokenized XML
<BODY>
<DIV0 type="article">
<HEAD type="nn">
<W type="R">U</W>
<W type="R">GORICI</W>
<W type="R">SVETOJANSKOJ</W>
<W type="R">ODRŽAN</W>
<W type="B">12</W>
<W type="I">.</W>
<W type="R">FESTIVAL</W>
<W type="R">PJEVAČA</W>
<W type="R">AMATERA</W>
</HEAD>
<HEAD type="na">
<W type="R">Ivana</W>
<W type="R">osvojila</W>
<W type="R">županijski</W>
<W type="R">Sanremo</W>
</HEAD>
<HEAD type="pn">
<W type="I">*</W>
<W type="R">Od</W>
<W type="B">20</W>
<W type="R">natjecatelja</W>
<W type="R">žiri</W>
<W type="R">je</W>
<W type="R">najboljom</W>
<W type="R">proglasio</W>
<W type="R">Ivanu</W>
<W type="R">Erdeljac</W>
<W type="R">s</W>
<W type="R">pjesmom</W>
<W type="I">"</W>
<W type="R">Crazy</W>
<W type="I">"</W>
<W type="I">,</W>
<W type="R">druga</W>
<W type="R">je</W>
<W type="R">Antonija</W>
<W type="R">Mikita</W>
<W type="R">s</W>
<W type="R">pjesmom</W>
<W type="I">"</W>
<W type="R">To</W>
<W type="I">"</W>
<W type="I">,</W>
<W type="R">a</W>
<W type="R">treće</W>
<W type="R">je</W>
<W type="R">mjesto</W>
<W type="R">osvojila</W>
<W type="R">Ksenija</W>
<W type="R">Cvetetić</W>
</HEAD>
<FIGURE>
<W type="R">Publici</W>
<W type="R">su</W>
<W type="R">se</W>
<W type="R">najviše</W>
<W type="R">svidjeli</W>
<W type="R">Marija</W>
<W type="R">Šalić</W>
<W type="R">i</W>
<W type="R">Petar</W>
<W type="R">Puhijera</W>
</FIGURE>
<P>
<W type="R">Pod</W>
<W type="R">medijskim</W>
<W type="R">pokroviteljstvom</W>
<W type="I">"</W>
<W type="R">Večernjeg</W>
<W type="R">lista</W>
<W type="I">"</W>
<W type="R">i</W>
<W type="R">Radio</W>
<W type="R">Jaske</W>
<W type="I">,</W>
<W type="R">a</W>
<W type="R">uz</W>
<W type="R">pomoć</W>
<W type="R">DIR</W>
<W type="I">"</W>
<W type="R">Rubinić</W>
<W type="I">"</W>
<W type="R">kao</W>
<W type="R">generalnog</W>
<W type="R">te</W>
<W type="R">još</W>
<W type="R">sedamdesetak</W>
<W type="R">drugih</W>
<W type="R">sponzora</W>
<W type="I">,</W>
<W type="R">u</W>
<W type="R">petak</W>
<W type="R">i</W>
<W type="R">u</W>
<W type="R">subotu</W>
<W type="R">u</W>
<W type="R">Gorici</W>
<W type="R">Svetojanskoj</W>
<W type="R">pokraj</W>
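The tokenized XML shown above can be derived from (token, type) pairs by passing markup through unchanged and wrapping every other token in a W element. A minimal sketch under that assumption; the function name `to_tokenized_xml` and the sample rows are illustrative, and decoding character references such as `&#381;` into plain characters is a choice made here to match the appearance of the sample:

```python
from html import unescape
from xml.sax.saxutils import escape


def to_tokenized_xml(rows):
    """Turn (token, type) pairs into one line of tokenized XML per token."""
    lines = []
    for token, token_type in rows:
        if token_type == "X":
            # Markup is copied through unchanged.
            lines.append(token)
        else:
            # Decode character references, then re-escape &, < and > only.
            text = escape(unescape(token))
            lines.append(f'<W type="{token_type}">{text}</W>')
    return "\n".join(lines)


rows = [
    ('<HEAD type="nn">', "X"),
    ("ODR&#381;AN", "R"),
    ("12", "B"),
    (".", "I"),
    ("</HEAD>", "X"),
]
xml_text = to_tokenized_xml(rows)
print(xml_text)
```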
HNK 10: POS annotation 1

Croatian
– morphologically rich language
• nouns: 7 cases, 2 numbers, 3 genders
• adjectives: + 2 forms (definite & indefinite), 3 degrees of comparison
• adverbs: 3 degrees of comparison
• pronouns: 7 cases, 2 numbers, 3 genders, 3 persons
• numerals: 7 cases, 3 genders
• verbs:
– 2 numbers, 3 persons
– 3 simple, 3 periphrastic tenses (with 3 genders and 2 numbers distinguished
in participles)
– 2 additional participles
– 2 conditionals
– imperative
– very complex system of aspect (perfective & imperfective/iterative)

a lot of syntactic relations coded by morphology
– POS annotation and lemmatization more important than for e.g.
English
HNK 11: POS annotation 2

Croatian morphological lexicon
– 36000 headwords
– GenOblik2 morphological generator
Tadić (1994)

MulTextEast MSD recommendation
– 6 CEE languages
– Croatian specification added in 1998
– Erjavec: MulTextEast recommendation V 2.0 ?

matching with corpus
= abeceda Ncfsn
abecede abeceda Ncfsg
abecedi abeceda Ncfsd
abecedu abeceda Ncfsa
abecedo abeceda Ncfsv
abecedi abeceda Ncfsl
abecedom abeceda Ncfsi
abecede abeceda Ncfpn
abeceda abeceda Ncfpg
abecedama abeceda Ncfpd
abecede abeceda Ncfpa
abecede abeceda Ncfpv
abecedama abeceda Ncfpl
abecedama abeceda Ncfpi
= abolicija Ncfsn
abolicije abolicija Ncfsg
aboliciji abolicija Ncfsd
aboliciju abolicija Ncfsa
abolicijo abolicija Ncfsv
aboliciji abolicija Ncfsl
abolicijom abolicija Ncfsi
abolicije abolicija Ncfpn
abolicija abolicija Ncfpg
abolicijama abolicija Ncfpd
abolicije abolicija Ncfpa
abolicije abolicija Ncfpv
abolicijama abolicija Ncfpl
abolicijama abolicija Ncfpi
= abrazija Ncfsn
abrazije abrazija Ncfsg
abraziji abrazija Ncfsd
abraziju abrazija Ncfsa
abrazijo abrazija Ncfsv
abraziji abrazija Ncfsl
abrazijom abrazija Ncfsi
abrazije abrazija Ncfpn
abrazija abrazija Ncfpg
abrazijama abrazija Ncfpd
abrazije abrazija Ncfpa
abrazije abrazija Ncfpv
abrazijama abrazija Ncfpl
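Matching corpus tokens against such a lexicon amounts to a lookup from word form to all of its (lemma, MSD) readings. The sketch below assumes three whitespace-separated fields per entry (word form, lemma, MSD), assumes that "=" stands for "identical to the other field", and decodes noun tags positionally as type, gender, number and case; the function names and the shortened sample are illustrative only.

```python
from collections import defaultdict

NOUN_TYPE = {"c": "common", "p": "proper"}
GENDER = {"m": "masculine", "f": "feminine", "n": "neuter"}
NUMBER = {"s": "singular", "p": "plural"}
CASE = {
    "n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
    "v": "vocative", "l": "locative", "i": "instrumental",
}


def decode_noun_msd(msd):
    """Expand a five-character noun MSD such as 'Ncfsg' into named features."""
    if len(msd) != 5 or msd[0] != "N":
        raise ValueError(f"not a noun MSD: {msd!r}")
    return {
        "type": NOUN_TYPE[msd[1]],
        "gender": GENDER[msd[2]],
        "number": NUMBER[msd[3]],
        "case": CASE[msd[4]],
    }


def build_index(lines):
    """Map each word form to the list of its (lemma, MSD) readings."""
    index = defaultdict(list)
    for line in lines:
        first, second, msd = line.split()
        wordform = second if first == "=" else first
        lemma = first if second == "=" else second
        index[wordform].append((lemma, msd))
    return index


LEXICON = [
    "= abeceda Ncfsn",
    "abecede abeceda Ncfsg",
    "abecedu abeceda Ncfsa",
    "abecede abeceda Ncfpn",
    "abecede abeceda Ncfpa",
]
index = build_index(LEXICON)
for lemma, msd in index["abecede"]:
    print(lemma, msd, decode_noun_msd(msd))
```

The lookup for "abecede" returns several readings, which illustrates why lexicon matching alone leaves ambiguity for a later disambiguation step.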
HNK 12: POS annotation 3
HNK 13: POS annotation 4

automatically annotate a 1 Mw corpus

manual correction

use it as training data for a tagger

TnT
Parallel corpora

Croatian-English parallel corpus

Slovene-Croatian parallel corpus

Acquis translations corpus
HR-EN parallel corpus 1

source: Croatia Weekly
– similar to USA Today: different domains
• politics, economy and finance, tourism, ecology, culture, art, events,
sports
– 12 pages, A3
– prepared in Croatian, then translated by a professional translation
agency

availability
– 118 issues
– started January 1998, finished May 2000
– access to all texts in electronic form in both languages
HR-EN parallel corpus 2

Articles:    4,343

Sentences:
– HR        67,694   (15.59 s/article avg.)
– EN        75,390   (17.36 s/article avg.)

Tokens:
– HR     1,490,964   (22.03 w/s avg.)
– EN     1,796,744   (23.83 w/s avg.)
– Total  3,287,708
HR-EN parallel corpus 3
HR-EN parallel corpus 4

Sentence marking
– </S><S> insertion after punctuation followed by capital letter
– filtered for known exceptions: Mr., Mrs., Miss., dr., St. etc.
– problem of ordinal numbers written with punctuation by Croatian
orthography
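The sentence-marking rule and its weak point can be sketched as follows. This is an illustration of the rule as stated, not the code used for the corpus; `mark_sentences` and the `skip_ordinals` switch are hypothetical. The switch suppresses a boundary after a token of the form digits plus full stop, which is only one possible workaround and would miss sentences that really end in a number.

```python
import re

# Known exceptions after which no sentence boundary is inserted.
ABBREVIATIONS = {"Mr.", "Mrs.", "Miss.", "dr.", "St."}

# Candidate boundary: punctuation, whitespace, then a capital letter.
CANDIDATE_RE = re.compile(r"[.!?]\s+(?=[A-ZČĆĐŠŽ])")


def mark_sentences(text, skip_ordinals=False):
    """Wrap sentences in <S>...</S> using the punctuation + capital letter rule."""
    pieces, start = [], 0
    for m in CANDIDATE_RE.finditer(text):
        end = m.start() + 1                     # just after the punctuation mark
        last_token = text[:end].split()[-1]     # token that carries the punctuation
        if last_token in ABBREVIATIONS:
            continue
        if skip_ordinals and re.fullmatch(r"\d+\.", last_token):
            continue
        pieces.append(text[start:end])
        start = m.end()
    pieces.append(text[start:])
    return "<S>" + "</S><S>".join(pieces) + "</S>"


english = "Mr. Smith arrived. He left."
croatian = "Održan je 12. Festival amatera. Pobijedila je Ivana."
print(mark_sentences(english))
print(mark_sentences(croatian))                      # ordinal "12." causes a false split
print(mark_sentences(croatian, skip_ordinals=True))
```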

Vanilla aligner

alignments
– 0:1      310  in  235 articles   0.45%
– 1:0       25  in   12 articles   0.04%
– 1:1    56783  in 4143 articles  84.12%
– 1:2     8611  in 3288 articles  12.76%
– 2:1     1391  in 1012 articles   2.06%
– 2:2      379  in  345 articles   0.56%

Total alignments: 67499 in 4143 articles
HR-EN parallel corpus 5

encoding problem: How to store alignments?

Tadić (2000): LREC2000

(X)CES way:
– each language in a separate document
– <S id="...">
– pointers to IDs of aligned sentences in 3rd document
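A third document of this kind can be generated from a list of aligned sentence IDs. The sketch below uses a `link` element with an `xtargets` attribute in which a semicolon separates the two languages; the `linkGrp` wrapper, its `targType` attribute and the sentence IDs are assumptions made for the example and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET


def build_alignment_doc(links):
    """Build a stand-off alignment document from (hr_ids, en_ids) pairs."""
    root = ET.Element("linkGrp", {"targType": "s"})
    for hr_ids, en_ids in links:
        # IDs of one language are space-separated; ';' separates the languages.
        xtargets = " ".join(hr_ids) + " ; " + " ".join(en_ids)
        ET.SubElement(root, "link", {"xtargets": xtargets})
    return root


# Hypothetical IDs: one 1:1 alignment and one 1:2 alignment.
links = [
    (["hr.s1"], ["en.s1"]),
    (["hr.s2"], ["en.s2", "en.s3"]),
]
alignment = build_alignment_doc(links)
print(ET.tostring(alignment, encoding="unicode"))
```

Because the links live in their own document, the two monolingual files stay untouched and can be re-aligned or corrected independently.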
HR-EN parallel corpus 6
Acquis translations parallel corpus

Croatia is on its way to becoming a candidate country for the EU

translation of the AC (acquis communautaire) = the only task common to all candidate countries

translating 200,000 pages of the EU OJ into Croatian (ca 60 Mw)

translating 100,000 pages of Croatian legislation into English/French...

Ministry of European Integration of the Republic of Croatia
– organizing the translation process
– 200 freelance translators or translation companies
– existing on-line lexical databases (CELEX...): no Croatian terms and/or
translation equivalents (TE)

how to maintain the consistency of translations?

EuroVoc = translated into Croatian
– thesaurus of European Commission terms

Institute of linguistics
– proposal for a joint project on the preparation of AC texts for translation
– marking of terms found in EuroVoc and suggestion of TEs
AC translations parallel corpus 3
AC translations parallel corpus
AC translations parallel corpus 5
AC translations parallel corpus 6
AC translations parallel corpus 7

if we put <S>s and </S>s and give them ID-attributes in both
original and translation we can use the whole of AC as a huge
Translation memory

parallel corpus aligned at the <S> level = TM
– just a matter of encoding
• alignment and/or <TU> marking
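The claim that a sentence-aligned parallel corpus is already a translation memory can be made concrete: given two documents with ID-bearing S elements and a list of links, the TM is the list of linked text pairs. A minimal sketch; the wrapper element `text`, the IDs and the link syntax ("source IDs ; target IDs") are assumptions for the illustration, and the sample sentences are the fragments used in the term-marking examples.

```python
import xml.etree.ElementTree as ET

EN_DOC = '<text><S id="en1">The European Parliament may ask</S></text>'
HR_DOC = '<text><S id="hr1">Europski bi parlament mogao tražiti</S></text>'
LINKS = ["en1 ; hr1"]


def sentences_by_id(xml_text):
    """Map each S id to the plain text of that sentence."""
    root = ET.fromstring(xml_text)
    return {s.get("id"): "".join(s.itertext()).strip() for s in root.iter("S")}


def build_tm(source_xml, target_xml, links):
    """Return translation units as (source text, target text) pairs."""
    source = sentences_by_id(source_xml)
    target = sentences_by_id(target_xml)
    units = []
    for link in links:
        source_ids, target_ids = (side.split() for side in link.split(";"))
        units.append((
            " ".join(source[i] for i in source_ids),
            " ".join(target[i] for i in target_ids),
        ))
    return units


tm = build_tm(EN_DOC, HR_DOC, LINKS)
for source_text, target_text in tm:
    print(source_text, "=>", target_text)
```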

term marking
– <W>-level marking needed
– several encoding solutions
AC translations parallel corpus 8

solution 1: term tags intermixed with corpus data
<P>
<S>
<W id="845">The</W>
<term><W id="846">European</W>
<W id="847">Parliament</W></term>
<W id="848">may</W>
<W id="849">ask</W>...
</S>...
</P>...

problem: non-contiguous multi-W terminological units
AC translations parallel corpus 9

solution 2: term marking in stand-off annotation i.e. in other XML
document linked to corpus data
<P>
<S>
<W id="845">The</W>
<W id="846">European</W>
<W id="847">Parliament</W>
<W id="848">may</W>
<W id="849">ask</W>...
</S>...
</P>...

<W id="765">Europski</W>
<W id="766">bi</W>
<W id="767">parlament</W>
<W id="768">mogao</W>
<W id="769">tražiti</W>

<term_unit id="en122">
<link xtargets="846 ; 847"/>
</term_unit>
<term_unit id="hr345">
<link xtargets="765 ; 767"/>
</term_unit>

allows marking of non-contiguous terms
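Resolving such stand-off term units back to corpus words is a matter of looking up the linked W IDs. The sketch below does this for the Croatian example, where the term "Europski ... parlament" is interrupted by the clitic "bi"; the wrapper element `terms` and the function name are hypothetical, and attribute values are quoted so that a standard XML parser accepts the input.

```python
import xml.etree.ElementTree as ET

CORPUS = """<S>
<W id="765">Europski</W>
<W id="766">bi</W>
<W id="767">parlament</W>
<W id="768">mogao</W>
<W id="769">tražiti</W>
</S>"""

ANNOTATION = """<terms>
<term_unit id="hr345"><link xtargets="765 ; 767"/></term_unit>
</terms>"""


def resolve_terms(corpus_xml, annotation_xml):
    """Map each term_unit id to the list of corpus words it points to."""
    words = {w.get("id"): w.text for w in ET.fromstring(corpus_xml).iter("W")}
    terms = {}
    for unit in ET.fromstring(annotation_xml).iter("term_unit"):
        targets = unit.find("link").get("xtargets").split(";")
        terms[unit.get("id")] = [words[t.strip()] for t in targets]
    return terms


terms = resolve_terms(CORPUS, ANNOTATION)
print(terms)
```

The corpus document is only read, never modified, which is what makes this approach usable for read-only data.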
AC translations parallel corpus 10

solution 3: term marking with translation equivalent suggestion
<P>
<S>
<W id="845">The</W>
<W id="846">European</W>
<W id="847">Parliament</W>
<W id="848">may</W>
<W id="849">ask</W>...
</S>...
</P>...

<W id="765">Europski</W>
<W id="766">bi</W>
<W id="767">parlament</W>
<W id="768">mogao</W>
<W id="769">tražiti</W>

<term_unit id="en122">
<link xtargets="846 ; 847"/>
</term_unit>
<term_unit id="hr345">
<link xtargets="765 ; 767"/>
</term_unit>
<tu><link xtargets="en122 ; hr345"/></tu>
AC translations parallel corpus 11

XLink
– W3C Working Draft, 2000-02-21 (http://www.w3.org/TR/xlink)
– XML’s powerful linking tool
– allows stand-off annotation (Ide et al. 2000)
• no changes in corpus data <= annotation of read-only data
• multimodal corpora annotation
– time-line links
– links of language data with audio or video (paralinguistic data)

Systems using XLink intensively
– MATE workbench (McKelvie et al. 2000)
– LDC (Bird & Liberman 2000)
– ...
Some methodological remarks 1

some skepticism

what do we do exactly by putting annotations in corpora?
– adding the secondary data to our primary data in order to be able to
retrieve information later
– adding categories selected from the prepared list and applying
them to our corpus data

not concerned here with meta-description (usually in headers)

secondary data = result of interpretation of primary data

by adding already prepared categories
– we get a lot of information which could not be collected any other
way
– could we miss some phenomena which we haven’t foreseen at the
stage of category preparation?
Some methodological remarks 2

example on the very basic level of word boundary
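nikoji, zam. pridj. nijedan, nikakav (pronominal adjective: "none, not any")
(Anić, Vladimir: Rječnik hrvatskoga jezika, 1991)
Ni u kojem se slučaju ne smiješ okrenuti! ("Under no circumstances may you turn around!")
oligo- and poly-saccharides...
Ivan je Šikić radosno krenuo nizbrdo. ("Ivan Šikić happily set off downhill.")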
– How many words do we have here?
– Is it a trivial question?
– opposition between “graphic words” and lemmas

not to mention syntax and/or semantics
Some methodological remarks 3

putting only one kind of secondary/interpretive data in corpus
– filtering only those linguistic phenomena which we are able to
grasp by our already prepared categories
– missing phenomena for which we are not prepared

keeping our secondary/tertiary/... data apart from basic resource
data
– allows other researchers to have their own secondary etc. data and
different interpretations
– allows us to compare different interpretive data interpersonally
and/or automatically

XML and the concept of stand-off annotation give us a tool for that