The Corpógrafo
Belinda Maia & Luís Sarmento
PoloFLUP LINGUATECA
Corpora Linguistics 23.08.04

A bit of history
• PALC ’97 – ‘Do-it-yourself corpora ... with a little bit of help from your friends!’
• CULT 1998 – ‘Making corpora – a learning process’
Contrastive linguistics
Corpora linguistics
Translation teaching
General > specific language

A bit of history
• 2000 – 1st Master’s in Terminology and Translation at FLUP
• PALC 2001 – ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
Specialized translation and terminology
Contact with domain experts
Importance of IT
Need for technical help for more ambitious students!

A bit of history
• LREC 2002 – ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
• 2002 – 2nd Master’s in Terminology and Translation at FLUP
Plea for help to Diana Santos
October 2002
LINGUATECA Polo FLUP

LINGUATECA
• See http://www.linguateca.pt
• Leader > Diana Santos (SINTEF – Oslo)
• Objective – to create resources and tools for the computational processing of Portuguese
• Nodes at Oslo, Lisbon, Braga and Porto
• Porto – Polo CLUP/FLUP

Polo CLUP/FLUP General focus
• See http://www.linguateca.pt/poloclup/
• On constructing resources specific to the needs of FLUP/CLUP
  – For researchers, teachers and students
  – For teaching methodology at FLUP
BNC & Reuter’s corpora on intranet
A small ‘chat’ corpus
Comparable corpora

More history
• 2003 – Poster of the GC at CL2003
• 2003 – ‘What are comparable corpora?’ CL2003
• 2003 – Experimentation with evaluation of Machine Translation
• 2003 – Experimentation with GC
• 2003 – 3rd Master’s in Terminology and Translation at FLUP

Polo CLUP/FLUP Research focus
• See http://www.linguateca.pt/poloclup/
• On-line suite of corpora tools to work with comparable corpora with emphasis on bilingual research
  – Focus on special domains
  – Construction of terminology databases, ontologies and domain models
Corpógrafo

And ...
• Evaluation of Machine Translation
  – Experimentation with evaluation
  – Teaching + research focus
  – Tools for collecting empirical data
• Results:
  – TrAva – MT evaluation tool
  – CorTA – Corpus of 1 EN input + 4 MT output sentences

The Corpógrafo results from:
• Terminology, translation and language study and research (Belinda)
• Computational linguistics research and production of resources (Diana)
• Information retrieval and artificial intelligence (Luís)
• Terminology data (Domain experts)
= Discussions on priorities!

GC – Integrated Web Environment for Corpora Linguistics

What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for Corpora-Based Linguistic Research.

Motivation
• Lack of comprehensive, wide-scope Corpora Tools
• Commercial Packages are usually difficult to Integrate/Customize
• Tools are not prepared to support cooperative work
• Linguistic knowledge is not usually integrated in tools

GC allows users to:
• access several Corpora tools from a single entry point using a regular web browser
• access and query generic Corpora (BNC, Reuter’s, COMPARA, CETEMPúblico)
• build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)
• use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, Filters, etc.)
• communicate and exchange results with other users

Internet Integration
GC provides seamless integration with the World Wide Web allowing users to:
• search specific Corpora resources on the Internet
• use available translation-engines in parallel
• query the web for concordances

Developer’s Tasks:
• Integrate Existing Tools/Resources
• Develop Additional Generic Tools
• Interact with Users/Administrator
• Develop Custom Tools for particular research needs

Tool Pool
• Concordance Engine
• Corpora Bot
• Taggers
• Statistics
• Aligner (Semi-Auto)
• Custom Tools

Administrator’s Tasks:
• Users, Groups and Disk Quotas
• Corpora Taxonomy (see box)
• Documentation Organization
• Access Service Statistics

Teacher’s Tasks:
• Provide on-line tutorials
• Provide links to:
  – on-line teaching material
  – bibliography and other resources

Inter-User Communication
• Tagging and Aligning Cooperatively
• Messaging Service
• Exchange of Corpora Resources

Corpora Taxonomy
• Medium: written, spoken, multimedia
• Domain: Engineering, medicine, etc.
• Genre: scientific, technical, informative, etc.

[Diagram labels: BNC, CETEMPúblico, COMPARA, Others (each with a Custom Interface); DEV, ADM, USER; Internet; Terminology DB; Virtual Desktop; Personal Corpora; Terminology Extraction Tool (Auto/Semi-Auto); file formats PS, TXT, RTF, HTML, PDF, DOC]

Working with the Corpógrafo
• Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
• All research done ONLINE
• Each username/password = separate space on our server
• At present > anyone can work with it using 10 MB space for FREE
• BUT – you get an empty space + tools + tutorial!

Corpora and Terminology
• Special Domain Corpora
• Terminology extraction
• Terminology databases
• Structuring of domain knowledge
• Further corpora and information retrieval

[Diagram: Internet, Corpora, Corpora Analysis, Terminology Database; text details]

Terminology: Prescription or Description?
• Prescriptive > descriptive
• Paper > digital form
• Static > dynamic resources
• ‘Democratization’ of terminology
• ISO standards > socioterminology
• Knowledge structures increasingly recognized as structured but dynamic

Perspectives of terminology users
• Domain experts and vested interests
• Translators
• Information retrieval
• Knowledge engineering
Standardized terminology
The ‘right word’
Finding information
Perfecting Google
Structuring knowledge
Finding it fast

Bridging the Gap
• General linguists
• Translation teachers
• Translation students
• Corpus linguists
• Computational linguists
• Computer engineers
Computer-phobia
Computer-worship

Focus of Corpógrafo
• Design priorities are to:
  – See the Big Picture
  – Create the Overall Framework
  – Get feedback from users
  – Develop according to real research needs
  – Fill in details and improve techniques as needed

File Manager
Area where each individual or group can:
  – Upload texts to space on server
  – Convert various text formats to .txt
  – ‘Clean’ them of unnecessary material
  – Check tokenization and sentence divisions
  – Register full information on source, domain and text type
  – Group, and re-group, texts into corpora

General corpus analysis
• Concordancing tools allowing for
  – Concordancing at sentence level
  – KWIC concordancing
  – Collocations
• N-gram tool
  – Case-sensitive
  – Alphabetical or frequency ordering
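
The KWIC concordancing and n-gram functions listed on the "General corpus analysis" slide can be illustrated with a minimal Python sketch. This is not the Corpógrafo's own code: the function names, the regex tokenizer, and the default window size are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: runs of letters/digits, allowing internal
    # hyphens and apostrophes (e.g. "do-it-yourself", "Reuter's").
    return re.findall(r"[^\W_]+(?:[-'][^\W_]+)*", text)


def kwic(tokens, keyword, window=3, case_sensitive=False):
    # Key Word In Context: return (left context, keyword, right context)
    # for every occurrence of the keyword.
    norm = (lambda s: s) if case_sensitive else str.lower
    target = norm(keyword)
    lines = []
    for i, tok in enumerate(tokens):
        if norm(tok) == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def ngram_counts(tokens, n=2, case_sensitive=True):
    # Count every contiguous sequence of n tokens.
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ordered(counts, by="frequency"):
    # Alphabetical ordering, or descending frequency with alphabetical ties.
    if by == "alphabetical":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = "The coastal erosion rate is high. Coastal erosion affects the coast."
    toks = tokenize(sample)
    for left, key, right in kwic(toks, "erosion", window=2):
        print(f"{left:>15} [{key}] {right}")
    for gram, freq in ordered(ngram_counts(toks, 2, case_sensitive=False))[:3]:
        print(freq, " ".join(gram))
```

Keeping case sensitivity and ordering as parameters mirrors the two options the slide names for the n-gram tool; a real tool would also need sentence splitting for sentence-level concordancing, which this sketch leaves out.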

Corpora + TDB
• Choose corpus
• Choose related TDB
= All terms, examples and definitions are extracted (semi-)automatically from the corpus and transferred to the TDB
= All metadata on the texts providing the data can be automatically transferred to the TDB

Term extraction
• N-grams
  – Unfiltered
  – Filtered with restrictions on term in PT EN FR IT ES DE
  – Filtered with restrictions on term and context in PT EN FR IT ES DE
  – Singular + plural terms can be combined
  – Existing terms in TDB need not appear

Term selection from n-grams
• Consultation of list of n-grams
• Check term status of each n-gram via underlying concordances
• Check sources
• Send to TDB

Search for Candidates for Definitions and/or Semantic Relations
• Already possible via TDB
• Under development
• Research areas for Master’s dissertations and research assistants
  – Expressions that find definitions
  – Expressions that find semantic relations

TDB – Terminology database
Databases are designed to be multilingual
  – Terms listed alphabetically + language tag
  – General data
  – Morphological data
  – Source metadata: authors, texts, etc.
  – Definitions + search for candidates
  – Translation equivalents
  – Semantic relations

Future developments
• General testing and improvement
• Development of new ideas or functions
• Isomorphic relationship between:
  – Research possibilities
  – Researchers’ needs
  – Our skills
• Coordination of individual corpus projects into bigger projects, when possible or necessary

Theoretical questions / problems
• How large is a good domain corpus?
• Comparable corpora v. parallel corpora?
• How much information does a database need for information retrieval and knowledge engineering?
• How much does the user of a database need for translation, teaching, etc.?
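
The options on the "Term extraction" slide (filtered n-grams, combined singular and plural forms, hiding terms already in the TDB) and the "expressions that find definitions" idea can be sketched roughly as follows. The slides do not say which restrictions or expressions the Corpógrafo uses, so everything here is an assumption: the stopword list, the naive English-only plural rule, the definition patterns, and all names are hypothetical.

```python
import re
from collections import Counter

# Assumed restriction: a candidate term may not begin or end with one of these.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "are", "by"}

# Assumed "expressions that find definitions" (English only, illustrative).
DEFINITION_PATTERN = re.compile(
    r"\b(?:is|are) (?:a|an|the)\b|\bis defined as\b|\brefers to\b",
    re.IGNORECASE,
)


def candidate_terms(tokens, n=2, known_terms=frozenset()):
    # Count n-grams that pass the stopword restriction, merging a plural
    # head word into its singular and skipping terms already in the TDB.
    tokens = [t.lower() for t in tokens]
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
            continue
        head = gram[-1]
        # Naive plural rule (assumption): strip a final "s", but not "ss".
        if len(head) > 3 and head.endswith("s") and not head.endswith("ss"):
            gram = gram[:-1] + [head[:-1]]
        term = " ".join(gram)
        if term not in known_terms:
            counts[term] += 1
    return counts


def definition_candidates(sentences, term):
    # Keep sentences that mention the term and contain a defining expression.
    term_re = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [
        s for s in sentences
        if term_re.search(s) and DEFINITION_PATTERN.search(s)
    ]


if __name__ == "__main__":
    toks = ["Natural", "hazards", "include", "floods", "and", "fires",
            "A", "natural", "hazard", "is", "a", "threat"]
    print(candidate_terms(toks).most_common(3))
    print(definition_candidates(
        ["A natural hazard is a threat to people.", "Natural hazards were mapped."],
        "natural hazard",
    ))
```

A frequency list like this only proposes candidates. As the "Term selection from n-grams" slide notes, each n-gram still has to be checked against its concordances and sources before it is sent to the TDB.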

Corpógrafo and special domains
• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
  – Engineering: Electronics, Mechanical Engineering
  – Geography: Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
  – Medicine: Kidney support machines, Neurology
  – Science: Genetics
  – Technology: GPS (Global Positioning System)

Corpógrafo and terminology/translation research
• Ongoing dissertations on aspects of:
  – Terminology: neologisms, definition searches, semantic relations, conceptual analysis
  – Corpora: text analysis, corpora construction
  – Technical writing > Electrical Appliances
  – Localization
  – Terminology in documentaries
  – Translation of Multimedia

Linguateca
• Linguateca’s policy: all resources and tools freely available online
• Primary users: Portuguese and Brazilian
• Other users also welcome

Polo CLUP/FLUP
• Bi- or multi-lingual in interest
• Corpógrafo available for experiments on a small scale to the general public
• Possibilities of future work on projects with users from other universities and other countries

Corpógrafo team
• Belinda Maia – FLUP – Associate Professor
• Luís Sarmento – Linguateca, FCCN – Computer Engineer, Researcher-in-charge
• Luís Miguel Cabral – Linguateca, FCCN – Computer Engineer, Research assistant
• Débora Oliveira – Linguateca, FCCN – Research assistant
• Ana Sofia Pinto – FLUP – Technical assistant

Contacts
If you are interested in finding out more, please contact me:
Belinda Maia at [email protected]
or Luís Sarmento at [email protected]
The Corpógrafo can be used (with a username and password) at:
http://www.linguateca.pt/corpografo
and
http://poloclup.linguateca.pt/ferramentas/gc