The Corpógrafo
Belinda Maia & Luís Sarmento
PoloFLUP
LINGUATECA
Corpora Linguistics 23.08.04
A bit of history
• PALC ’97 – ‘Do-it-yourself corpora ... with a little bit of help from your friends!’
• CULT 1998 – ‘Making corpora – a learning process’
➢ Contrastive linguistics
➢ Corpora linguistics
➢ Translation teaching
➢ General > specific language
A bit of history
• 2000 – 1st Master’s in Terminology and Translation at FLUP
• PALC 2001 – ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
➢ Specialized translation and terminology
➢ Contact with domain experts
➢ Importance of IT
➢ Need for technical help for more ambitious students!
A bit of history
• LREC 2002 – ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
• 2002 – 2nd Master’s in Terminology and Translation at FLUP
➢ Plea for help to Diana Santos
➢ October 2002 – LINGUATECA Polo FLUP
LINGUATECA
• See http://www.linguateca.pt
• Leader > Diana Santos (SINTEF – Oslo)
• Objective - to create resources and tools for
the computational processing of Portuguese
• Nodes at Oslo, Lisbon, Braga and Porto
• Porto - Polo CLUP/FLUP
Polo CLUP/FLUP
General focus
• See http://www.linguateca.pt/poloclup/
• On constructing resources specific to the
needs of FLUP/CLUP
– For researchers, teachers and students
– For teaching methodology at FLUP
➢ BNC & Reuters corpora on intranet
➢ A small ‘chat’ corpus
➢ Comparable corpora
More history
• 2003 – Poster of the GC – at CL2003
• 2003 – ‘What are comparable corpora?’
CL2003
• 2003 – Experimentation with evaluation of
Machine Translation
• 2003 – Experimentation with GC
• 2003 – 3rd Master’s in Terminology and
Translation at FLUP
Polo CLUP/FLUP
Research focus
• See http://www.linguateca.pt/poloclup/
• On-line suite of corpora tools to work with
comparable corpora with emphasis on
bilingual research
– Focus on special domains
– Construction of terminology databases,
ontologies and domain models
➢ Corpógrafo
And ...
• Evaluation of Machine Translation
– Experimentation with evaluation
– Teaching + research focus
– Tools for collecting empirical data
• Results:
– TrAva – MT evaluation tool
– CorTA – Corpus of 1 EN input + 4 MT
output sentences
The Corpógrafo results from:
• Terminology, translation and language study
and research (Belinda)
• Computational linguistics research and
production of resources (Diana)
• Information retrieval and artificial
intelligence (Luís)
• Terminology data (Domain experts)
= Discussions on priorities!
GC – Integrated Web Environment for Corpora Linguistics
What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for Corpora-Based Linguistic Research. GC allows users to:
• access several Corpora tools from a single entry point using a regular web browser
• access and query generic Corpora (BNC, Reuters, COMPARA, CETEMPúblico)
• build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)
• use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, Filters, etc.)
• communicate and exchange results with other users
Motivation
• Lack of Comprehensive, wide-scope Corpora Tools
• Commercial Packages are usually difficult to Integrate/Customize
• Tools are not prepared to support cooperative work.
• Linguistic knowledge is not usually integrated in tools.
Internet Integration
GC provides seamless integration with the World Wide Web allowing users to:
• search specific Corpora resources on the Internet
• use available translation-engines in parallel.
• query the web for concordances
Developer’s Tasks:
• Integrate Existing Tools/Resources
• Develop Additional Generic Tools
• Interact with Users/Administrator
• Develop Custom Tools for particular research needs
Administrator’s Tasks:
• Users, Groups and Disk Quotas
• Corpora Taxonomy (see box)
• Documentation Organization
• Access Service Statistics
Teacher’s Tasks:
• Provide on-line tutorials
• Provide links to:
– on-line teaching material
– bibliography and other resources
Tool Pool
• Concordance Engine
• Corpora Bot
• Taggers
• Statistics
• Aligner (Semi-Auto)
• Custom Tools
Inter-User Communication
• Tagging and Aligning Cooperatively
• Messaging Service
• Exchange of Corpora Resources
Corpora Taxonomy
• Medium: written, spoken, multimedia
• Domain: Engineering, medicine, etc.
• Genre: scientific, technical, informative, etc.
[Diagram labels: generic corpora (BNC, CETEMPúblico, COMPARA, Others), each behind a Custom Interface; roles DEV, ADM, USER; Internet; Terminology DB; Virtual Desktop; Personal Corpora; Terminology Extraction Tool (Auto/Semi-Auto); input formats PS, TXT, RTF, HTML, PDF, DOC]
Working with the Corpógrafo
• Corpógrafo is a suite of integrated tools for
INDIVIDUAL or GROUP research
• All research done ONLINE
• Each username/password = separate space on our
server
• At present > anyone can work with it using 10 MB
space for FREE
• BUT - you get an empty space + tools + tutorial!
Corpora and Terminology
• Special Domain Corpora
• Terminology extraction
• Terminology databases
• Structuring of domain knowledge
• Further corpora and information retrieval
[Diagram labels: Internet; Corpora; Corpora Analysis; Terminology Database; Text details (×3)]
Terminology
Prescription or Description?
• Prescriptive > descriptive
• Paper > digital form
• Static > dynamic resources
• ‘Democratization’ of terminology
• ISO standards > socioterminology
• Knowledge structures increasingly recognized as structured but dynamic
Perspectives of terminology users
• Domain experts and vested interests
• Translators
• Information retrieval
• Knowledge engineering
➢ Standardized terminology
➢ The ‘right word’
➢ Finding information
➢ Perfecting Google
➢ Structuring knowledge
➢ Finding it fast
Bridging the Gap
• General linguists
• Translation teachers
• Translation students
• Corpus linguists
• Computational linguists
• Computer engineers
Computer-phobia
Computer-worship
Focus of Corpógrafo
• Design priorities are to:
– See the Big Picture
– Create the Overall Framework
– Get feedback from users
– Develop according to real research needs
– Fill in details and improve techniques as needed
File Manager
Area where each individual or group can:
– Upload texts to space on server
– Convert various text formats to .txt
– ‘Clean’ them of unnecessary material
– Check tokenization and sentence divisions
– Register full information on source, domain and text type
– Group – and re-group – texts into corpora
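The checking and registration steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TextRecord` class, its field names, and the regex-based splitter and tokenizer are hypothetical and are not taken from the Corpógrafo itself.

```python
import re
from dataclasses import dataclass


@dataclass
class TextRecord:
    """One uploaded text plus the metadata the slide lists (hypothetical layout)."""
    source: str
    domain: str
    text_type: str
    content: str


def split_sentences(text):
    """Naive sentence division: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Words (keeping internal hyphens and apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-'’]\w+)*|[^\w\s]", sentence)


record = TextRecord(
    source="example.txt",  # assumed file name
    domain="Geography",
    text_type="scientific",
    content="Coastal erosion is a natural hazard. It affects low-lying areas!",
)

# Grouping texts into corpora can be as simple as a name -> list mapping.
corpora = {"natural-hazards": [record]}

for sentence in split_sentences(record.content):
    print(tokenize(sentence))
```

A splitter this simple will mis-handle abbreviations such as "e.g.", which is one reason a manual check of sentence divisions is worth having.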
General corpus analysis
• Concordancing tools allowing for
– Concordancing at sentence level
– KWIC concordancing
– Collocations
• N-gram tool
– Case-sensitive
– Alphabetical or frequency ordering
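As a rough illustration of the two tools named above, the following Python sketch builds KWIC (keyword in context) lines and counts n-grams with a case-sensitivity switch and a choice of ordering. The function names, parameters and the tokenizer are assumptions made for this example, not the Corpógrafo's own interface.

```python
import re
from collections import Counter


def tokenize(text):
    """Word tokens only; punctuation is dropped for simplicity."""
    return re.findall(r"\w+(?:[-'’]\w+)*", text)


def kwic(tokens, keyword, width=3, case_sensitive=False):
    """Return (left context, hit, right context) for every occurrence of keyword."""
    norm = (lambda t: t) if case_sensitive else str.lower
    target = norm(keyword)
    lines = []
    for i, tok in enumerate(tokens):
        if norm(tok) == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def ngrams(tokens, n, case_sensitive=True):
    """Count all contiguous n-token sequences."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ordered(counts, by="frequency"):
    """Sort n-grams alphabetically, or by descending frequency with ties alphabetical."""
    if by == "alphabetical":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


tokens = tokenize("The corpus is small. The corpus is specialized.")
for left, hit, right in kwic(tokens, "corpus", width=2):
    print(f"{left:>20} | {hit} | {right}")
print(ordered(ngrams(tokens, 2)))
```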
Corpora + TDB
• Choose corpus
• Choose related TDB
= All terms, examples, definitions extracted
(semi) automatically from corpus and
transferred to TDB
= All metadata on texts providing data can be
automatically transferred to TDB
Term extraction
• N-grams
– Unfiltered
– Filtered with restrictions on term in PT EN FR
IT ES DE
– Filtered with restrictions on term and context in
PT EN FR IT ES DE
– Singular + plural terms can be combined
– Existing terms in TDB need not appear
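The slides do not say which restrictions the filters apply, so the sketch below assumes one common choice: an n-gram is rejected if it starts or ends with a function word. The stopword list, the crude English plural rule and the `known_terms` parameter are all illustrative assumptions; restrictions on context are not shown.

```python
from collections import Counter

# Tiny illustrative English stopword list (an assumption, not the tool's list).
STOPWORDS_EN = {"the", "of", "a", "an", "in", "and", "is", "to"}


def candidate_terms(tokens, n, stopwords, known_terms=frozenset()):
    """Count n-grams that neither start nor end with a stopword.

    Plural forms are folded into the singular when both occur, and terms
    already stored in the terminology database are left out of the result.
    """
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = [t.lower() for t in tokens[i:i + n]]
        if gram[0] in stopwords or gram[-1] in stopwords:
            continue
        counts[" ".join(gram)] += 1

    merged = Counter()
    for gram, freq in counts.items():
        singular = gram[:-1]
        # Crude English-only rule: "machines" joins "machine" if both were seen.
        key = singular if gram.endswith("s") and singular in counts else gram
        merged[key] += freq

    for term in known_terms:
        merged.pop(term.lower(), None)
    return merged


tokens = "the kidney machine and the kidney machines of the clinic".split()
print(candidate_terms(tokens, 2, STOPWORDS_EN))
```

Each of the six languages listed would need its own stopword list and its own rule for combining singular and plural forms.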
Term selection from n-grams
• Consultation of list of n-grams
• Check term status of each n-gram via
underlying concordances
• Check sources
• Send to TDB
Search for Candidates for
Definitions
and/or Semantic Relations
• Already possible via TDB
• Under development
• Research areas for Master’s dissertations and research assistants
– Expressions that find definitions
– Expressions that find semantic relations
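One plausible reading of "expressions that find definitions" is a set of lexical patterns matched against corpus sentences. The Python sketch below assumes three English patterns chosen purely for illustration; the expressions actually used or under research at the time are not given in the slides.

```python
import re


def definition_candidates(sentences, term):
    """Return the sentences in which `term` appears inside a definition-like frame."""
    t = re.escape(term)
    patterns = [
        rf"\b{t}\s+(?:is|are)\s+(?:a|an|the)\b",          # "X is a ..."
        rf"\b{t}\s+(?:is|are)\s+defined\s+as\b",          # "X is defined as ..."
        rf"\b(?:is|are)\s+(?:called|known\s+as)\s+(?:an?\s+|the\s+)?{t}\b",  # "... is known as X"
    ]
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [s for s in sentences if any(p.search(s) for p in compiled)]


sentences = [
    "Coastal erosion is the wearing away of land by the sea.",
    "Coastal erosion increased in 2003.",
    "This process is known as coastal erosion.",
]
for hit in definition_candidates(sentences, "coastal erosion"):
    print(hit)
```

Expressions for semantic relations could be handled the same way, with patterns built around cues such as "such as" or "is a kind of".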
TDB - Terminology database
Databases are designed to be multilingual
– Terms listed alphabetically + language tag
– General data
– Morphological data
– Source metadata: Authors, texts etc
– Definitions + search for candidates
– Translation equivalents
– Semantic relations
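The fields listed above suggest a record shape along the following lines. This is a hypothetical data model written for illustration: the class and attribute names are assumptions, and the real TDB schema is not described in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Where a term, example or definition was found."""
    author: str
    text: str


@dataclass
class TermEntry:
    term: str
    language: str                                     # language tag, e.g. "EN" or "PT"
    general: dict = field(default_factory=dict)       # general data
    morphology: dict = field(default_factory=dict)    # e.g. {"pos": "noun"}
    sources: list = field(default_factory=list)       # list of Source
    definitions: list = field(default_factory=list)
    equivalents: dict = field(default_factory=dict)   # language tag -> list of terms
    relations: list = field(default_factory=list)     # (relation, target term) pairs


def listing(entries):
    """Terms in alphabetical order, each followed by its language tag."""
    ordered = sorted(entries, key=lambda e: (e.term.casefold(), e.language))
    return [f"{e.term} [{e.language}]" for e in ordered]


entries = [
    TermEntry("erosão costeira", "PT", equivalents={"EN": ["coastal erosion"]}),
    TermEntry(
        "coastal erosion",
        "EN",
        morphology={"pos": "noun"},
        sources=[Source(author="(unknown)", text="example.txt")],
        equivalents={"PT": ["erosão costeira"]},
        relations=[("is_a", "natural hazard")],
    ),
]
print(listing(entries))
```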
Future developments
• General testing and improvement
• Development of new ideas or functions
• Isomorphic relationship between:
– Research possibilities
– Researchers’ needs
– Our skills
• Coordination of individual corpus projects into
bigger projects, when possible or necessary
Theoretical questions / problems
• How large is a good domain corpus?
• Comparable corpora v. Parallel corpora?
• How much information does a database
need – for information retrieval and
knowledge engineering?
• How much does the user of a database need
– for translation, teaching etc.?
Corpógrafo and special domains
• Master’s in Terminology and Translation
• Terminology projects with the support of domain
specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
– Medicine – Kidney support machines, Neurology
– Science – Genetics
– Technology – GPS (Global Positioning System)
Corpógrafo and
terminology/translation research
• Ongoing dissertations on aspects of:
– Terminology – neologisms, definition searches, semantic relations, conceptual analysis
– Corpora – text analysis, corpora construction
– Technical writing > Electrical Appliances
– Localization
– Terminology in documentaries
– Translation of Multimedia
Linguateca
• Linguateca’s policy - all resources and
tools freely available online
• Primary users - Portuguese and Brazilian
• Other users also welcome
Polo CLUP/FLUP
• Bi- or multi-lingual in interest
• Corpógrafo available for experiments on a
small scale to the general public
• Possibilities of future work on projects with
users from other universities and other
countries
Corpógrafo team
• Belinda Maia – FLUP – Associate Professor
• Luís Sarmento - Linguateca, FCCN – Computer
Engineer - Researcher-in-charge
• Luís Miguel Cabral - Linguateca, FCCN –
Computer Engineer, Research assistant
• Débora Oliveira - Linguateca, FCCN – Research
assistant
• Ana Sofia Pinto – FLUP – technical assistant
Contacts
If you are interested in finding out more, please
contact:
Belinda Maia at [email protected]
Or
Luís Sarmento at [email protected]
The Corpógrafo can be used
(with a username and password) at:
http://www.linguateca.pt/corpografo and
http://poloclup.linguateca.pt/ferramentas/gc