Creating parallel and comparable
corpora for work in domain
specific areas of language
Belinda Maia
FLUP
Parallel corpora - definition
• “A parallel corpus is a collection of texts,
each of which is translated into one or more
other languages than the original. The
simplest case is where two languages only
are involved: one of the corpora is an exact
translation of the other. ....... The direction
of the translation may not even be known”.
Parallel corpora - uses
• “Parallel corpora are objects of interest at
present because of the opportunity offered
to align original and translation and gain
insights into the nature of translation. From
this work it is hoped that tools to aid
translation will be devised. Probabilistic
machine translation systems can moreover
be trained on such corpora”.
Comparable corpora - definition
• “A comparable corpus is one which selects
similar texts in more than one language or
variety. There is as yet no agreement on the
nature of the similarity, because there are
very few examples of comparable corpora”.
Comparable corpora - uses
• “The possibilities of a comparable corpus
are to compare different languages or
varieties in similar circumstances of
communication, but avoiding the inevitable
distortion introduced by the translations of a
parallel corpus”.
Quotations from:
• EAGLES - Expert Advisory Group on
Language Engineering Standards
• Guidelines – 1996 – at:
• http://www.ilc.pi.cnr.it/EAGLES96/browse.html
Parallel corpora
- alignment & annotation
• Most common form of alignment
= at sentence level
• E.g. Text aligners:
– WORDSMITH – recognizes full stops only
– WinAlign – TRADOS – recognizes a certain amount of
formatting, paragraphs, numbers, tagging
• Ongoing research to align at:
– term/word level
– tag level
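A minimal sketch of sentence-level alignment, assuming the crudest possible approach: sentences are split on full stops only and then paired by position. The function names are hypothetical, and this is not how WordSmith or WinAlign work internally; real aligners also have to handle abbreviations, merged or split sentences, and formatting.

```python
def split_sentences(text):
    """Split on full stops only (no handling of abbreviations or other marks)."""
    parts = [part.strip() for part in text.split(".")]
    return [part + "." for part in parts if part]


def align_one_to_one(source_text, target_text):
    """Pair sentences by position; a sentence with no partner is paired with None."""
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    length = max(len(source), len(target))
    source += [None] * (length - len(source))
    target += [None] * (length - len(target))
    return list(zip(source, target))


# Illustrative sentences, not taken from any corpus named in the slides.
pairs = align_one_to_one(
    "The corpus is small. It contains legal texts.",
    "O corpus é pequeno. Contém textos jurídicos.",
)
for source_sentence, target_sentence in pairs:
    print(source_sentence, "|", target_sentence)
```

Positional pairing breaks as soon as a translator merges or splits a sentence, which is one reason alignment remains a research topic.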
Parallel corpora
- alignment & annotation
problems
• Different linguistic theories = different annotation
schemes
– E.g. Morphological, syntactic or semantic?
• Different languages = different annotation
schemes
– E.g. English / Portuguese / Polish / Finnish / Chinese
• Different languages = different types of alignment
– E.g. English / Hebrew / Chinese
Parallel corpora
- professional uses
• Translation memories – aligned collections
of repetitive texts in special domains
– Provide previous translations for translator to
consult / copy
– Allow economy in translation process
– Provide material for probabilistic machine
translation
– E.g. EU translation services, Canadian Hansard
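The idea of consulting previous translations can be sketched as a lookup over aligned segment pairs. This is only an illustration under stated assumptions: the memory is a plain dictionary, similarity is measured with the standard library's `difflib.SequenceMatcher`, and the 0.7 threshold and the example segments are invented. Commercial translation memory tools use their own matching methods.

```python
from difflib import SequenceMatcher


def best_match(memory, segment, threshold=0.7):
    """Return (stored source, stored translation, score) for the closest
    stored segment, or None if nothing reaches the threshold."""
    best_score, best_pair = 0.0, None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_pair = score, (source, target)
    if best_pair is not None and best_score >= threshold:
        return best_pair[0], best_pair[1], best_score
    return None


# Hypothetical English > Portuguese memory with two aligned segments.
memory = {
    "The committee adopted the proposal.": "A comissão aprovou a proposta.",
    "The sitting is closed.": "Está encerrada a sessão.",
}

exact = best_match(memory, "The sitting is closed.")
fuzzy = best_match(memory, "The committee adopted the report.")
unrelated = best_match(memory, "Hello")
print(exact, fuzzy, unrelated, sep="\n")
```

A fuzzy match gives the translator a previous solution to adapt rather than copy, which is where the economy in the translation process comes from.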
Translation memories –
requirements
• “Garbage in = garbage out!”
• Original > good quality – hence
– Emphasis on: good editing and proof reading >
controlled language
– E.g. EU documentation – training people to edit
English documents written by non-native speakers
• Translation > good quality – but certain parallel
relationship to the original
• Therefore: tendency to homogeneity
– (e.g. Eurospeak)
Parallel corpora
- academic uses
• For studying the translation process
• For studying translation solutions
• E.g.
– INTERSECT – French/English (Brighton)
– English-Norwegian Parallel Corpus Project (Oslo)
– COMPARA/DISPARA – Portuguese/English – online
at http://www.portugues.mct.pt/
• For terminology extraction
Parallel corpora
- requirements
• Theory should allow for any original + translation
- warts and all!
– Much literary criticism of translation thrives on the
‘warts’!
– Useful for study of errors, translationese etc
• Practical applications require quality:
– Contrastive linguistics
– Pedagogical applications
– Terminology extraction
Comparable corpora –
Perceived needs
• Texts as:
– Examples of ‘natural’ original text in the source
language culture
– E.g.
• Legal texts written according to local conventions
• Socially conventional texts: e.g. the ‘deaths column’
and advertisements for houses and jobs.
• Academic / scientific texts – different cultural
conventions
Comparable corpora –
Advantages
• Availability
– More texts
– Greater variety
• Versatility - applications for research in:
– Discourse analysis
– Pragmatics
– Information retrieval
– Knowledge engineering
What makes
texts / corpora
COMPARABLE?
EAGLES - quotes
• “A comparable corpus is one which selects
similar texts in more than one language or
variety”.
Similar - in more than one language
AND/OR
Similar - in variety
“...similar circumstances of communication...”
Similarity – Form/content?
• Form
– Size, no. of words, sentences, paragraphs
– Length of texts
– Format - .txt, .doc, .html, .xml
• Content
– General language
– Specialised domains
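The formal measures listed above (words, sentences, paragraphs) can be counted with a few lines of code. The sketch below assumes plain-text input, blank lines between paragraphs, and sentence boundaries at `.`, `!` or `?`; the function name and sample text are hypothetical.

```python
import re


def form_profile(text):
    """Count paragraphs, sentences and words in a plain-text document."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
    }


sample = "One two. Three four five!\n\nSix seven."
profile = form_profile(sample)
print(profile)
```

Comparing such profiles across two sets of texts gives a first, purely formal indication of whether they are similar in size and length.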
Similarity – Structure/Function?
• Structure
– Formal, carefully constructed texts – e.g. Legal
texts
– Informal, loosely organized discourse – e.g.
transcriptions of conversation
• Function
– Social
– Cultural
Similarity – Register?
• Register
– Field – situation, subject matter etc
– Tenor – interpersonal relationships
• e.g. formal/informal, politeness, etc
– Mode
• Spoken: e.g. speech, formal dialogue, conversation
• Written: e.g. book, essay, instruction manual
• Multimedia: e.g. Encarta, films
Similarity - Dialect?
• Dialect
– Geographical
• e.g. urban/rural areas, developed/developing
countries
– Temporal
• e.g. historical periods, different age groups
– Social
• e.g. social classes, educational backgrounds
Comparability in Very Large
Corpora
• Very Large Corpora comparable if:
– similar in size
– constructed according to same criteria – e.g.
quantity and quality of text types
• Consider:
– British National Corpus
– Mannheimer Corpora
Comparability in newspaper
corpora
• Newspaper corpora vary according to:
– Type: ‘quality’/‘popular’, general/specialised
content
– Time: same day/month/year > ‘concurrent’
corpora
• Consider:
– CETEMPúblico - Portuguese
– Reuters Corpus - English
Comparability in literary corpora
• Period:
– Medieval, 18th Century, Post-war
• School:
– Romanticism, Realism, Post-modernism
• Genre:
– Novel, science fiction, drama, poetry
Comparability in technical and
scientific corpora - form
• Pamphlets
• Manuals
• Textbooks
• Articles and papers
• Dissertations, theses
Comparability in technical and
scientific corpora - content
• Everyday information
• Encyclopedic information
• Instructions
• Education
• Expert-to-expert communication
Constructing comparable corpora
- general language
• Where does one start?
• Very large comparable corpora in 2 or more
languages = mega-proposition!
• Carefully selected annotated general
corpora – like ICAME corpora (Brown,
LOB etc) = a possibility + limitations
Using comparable corpora –
general language
• Advantages:
– Comparative and contrastive research at all
levels
– Particularly useful for lexicographical research
and search for syntactic patterns
• Disadvantages:
– Difficult to manage for more delicate analysis
– Unnecessary for certain types of research
Constructing comparable corpora
– Newspaper texts
• Newspaper corpora
– Relatively easy to acquire
– A wide variety of fields
– Similarity in
• tenor
• mode
Using comparable corpora –
Newspaper texts
Concurrent corpora > extraction of similar
news items > e.g.
– War reports
– Politics – election campaigns
– Football during the World Cup
OR > styles of journalism > comparing
individual journalists etc.
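Extracting similar news items from concurrent corpora can be sketched as a filter on publication date and topic keywords. The record layout (`source`, `date`, `text`), the keywords and the sample articles are all assumptions made for illustration; they do not reflect the actual structure of CETEMPúblico or the Reuters Corpus.

```python
from datetime import date


def concurrent_items(articles, start, end, keywords):
    """Keep articles published within [start, end] that mention any keyword."""
    wanted = [keyword.lower() for keyword in keywords]
    selected = []
    for article in articles:
        in_period = start <= article["date"] <= end
        on_topic = any(keyword in article["text"].lower() for keyword in wanted)
        if in_period and on_topic:
            selected.append(article)
    return selected


# Hypothetical articles from an English and a Portuguese newspaper.
articles = [
    {"source": "paper_en", "date": date(2002, 6, 5),
     "text": "England open their World Cup campaign."},
    {"source": "paper_pt", "date": date(2002, 6, 5),
     "text": "Portugal estreia-se no Mundial."},
    {"source": "paper_en", "date": date(2002, 6, 5),
     "text": "Interest rates stay unchanged."},
    {"source": "paper_pt", "date": date(2002, 8, 1),
     "text": "Balanço do Mundial."},
]

selected = concurrent_items(
    articles, date(2002, 6, 1), date(2002, 6, 30), ["World Cup", "Mundial"]
)
print([article["source"] for article in selected])
```

Keyword lists have to be supplied per language, since the same event is named differently in each newspaper.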
Constructing comparable corpora
– general language
+ restricted text type
• General subject texts of similar text type –
e.g. Encyclopedia entries, tourism pamphlets
• Literary texts of similar period, school or
genre
• Technical and scientific texts with similar
form or function e.g. textbooks
Using comparable corpora –
general language
+ restricted text type
• Discourse analysis
• Pragmatics
• Genre analysis
• Sociolinguistic analysis
Constructing comparable corpora
– specialized language
• Special domains at various levels – e.g.
– Geography > population geography > ethnic
minorities
– Engineering > mechanical engineering >
tribology
– Medicine > oncology > breast cancer
Using comparable corpora –
specialized language
• Genre analysis
• Terminology extraction
• Information retrieval
• Web browsing technology
• Knowledge engineering
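One simple way to look for term candidates in a specialised corpus is to compare how often each word occurs there with how often it occurs in general-language text. The sketch below uses a plain relative-frequency ratio with add-one smoothing; this is an assumption chosen for brevity, not a method described in the slides, and the function names and sample texts are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def term_candidates(domain_text, reference_text, min_count=2):
    """Rank words by how much more frequent they are in the domain text
    than in the reference text."""
    domain = Counter(tokenize(domain_text))
    reference = Counter(tokenize(reference_text))
    n_domain = sum(domain.values())
    n_reference = sum(reference.values())
    scored = []
    for word, count in domain.items():
        if count < min_count:
            continue
        domain_rate = count / n_domain
        # Add-one smoothing so words absent from the reference do not divide by zero.
        reference_rate = (reference[word] + 1) / (n_reference + 1)
        scored.append((word, domain_rate / reference_rate))
    return sorted(scored, key=lambda item: item[1], reverse=True)


domain_text = ("Friction and wear of the bearing. Wear and friction increase. "
               "The wear of the surface")
reference_text = "The cat and the dog sat on the mat and the sun was out"

candidates = term_candidates(domain_text, reference_text)
print(candidates)
```

Words such as "wear" and "friction" rise to the top, while function words such as "the" score below 1. Multi-word terms would need a further step not shown here.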
All corpora construction
• Must establish:
– Overall general policy in relation to:
• Form – computational structure
• Content of sub-corpora
• Availability to general / restricted public
– Specific objectives of sub-corpora
All corpora construction
• Must take into account:
– Copyright restrictions
– Effect of external factors on the text
• Idiosyncrasies of individual author
• Characteristics of writing in specific cultural/social
situation
• Homogenising effect of internationalisation
– Eurospeak
– Anglicisation of scientific terminology
Linguateca - Porto
More immediate objectives
• To construct comparable and parallel corpora in
Portuguese and English using:
– Texts in special domains already being investigated
– Adding corpora from special domains as and when the
opportunity arises
• To construct the necessary computational
framework for using the corpora for research
• To make these corpora as widely available as the
respective copyright situation permits
Linguateca - Porto
Longer-term objectives
• To extend the notion of comparability to:
– genre-specific corpora
– restricted general language corpora
• To construct integrated networks of
comparable corpora
• To extend these objectives to other
languages
• To contribute to similar projects elsewhere
Bibliography
• Bourigault, Didier, Christian Jacquemin & Marie-Claude
L’Homme (Eds.). 2001. Recent Advances in Computational
Terminology. Amsterdam & Philadelphia: John Benjamins
Publishing Co.
• Charlet, J., M. Zacklad, G. Kassel & D. Bourigault. 2001.
Ingénierie des connaissances. Paris: Éditions Eyrolles.
• Véronis, Jean (Ed.). 2000. Parallel Text Processing –
Alignment and Use of Translation Corpora. Dordrecht:
Kluwer Academic Publishers.