Transcript Slide 1
HLT, Data Sparsity and Semantic Tagging
Louise Guthrie
(University of Sheffield)
Roberto Basili
(University of Tor Vergata, Rome)
Hamish Cunningham
(University of Sheffield)
Outline
– A ubiquitous problem: data sparsity
– The approach: coarse-grained semantic tagging; learning by combining multiple evidence
– The evaluation: intrinsic and extrinsic measures
– The expected outcomes: architectures, tools, development support
Applications
Present
We’ve seen growing interest in a range of HLT tasks: e.g. IE, MT
Trends
– Fully portable IE, unsupervised learning
– Content Extraction vs. IE
Data Sparsity
• Language Processing depends on a model of the features important to an application.
– MT: trigrams and frequencies
– Extraction: word patterns
• New texts always seem to contain many phenomena we haven't seen before
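The sparsity problem behind trigram models can be made concrete with a minimal sketch (the toy corpus and counts below are purely illustrative): a maximum-likelihood trigram estimate assigns probability zero to any trigram absent from training data.

```python
from collections import Counter

# Toy corpus: MLE trigram counts illustrate the sparsity problem.
tokens = "the crane lifted the beam and the crane turned".split()
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigrams = Counter(zip(tokens, tokens[1:]))

def p_trigram(w1, w2, w3):
    """MLE estimate Pr(w3 | w1 w2); zero for any unseen trigram."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_trigram("the", "crane", "lifted"))  # seen once of two: 0.5
print(p_trigram("the", "crane", "flew"))    # unseen: 0.0
```

Any realistic new text will contain many such unseen trigrams, which is exactly the problem the class-based estimates later in the talk try to address.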
Different kinds of patterns
PERSON was appointed as POST of COMPANY
COMPANY named PERSON to POST
• Almost all extraction systems tried to find patterns of mixed words and entities.
– People, locations, organizations, dates, times, currencies
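Such mixed word/entity patterns can be sketched as regular expressions over entity-tagged text (a minimal illustration; the inline tag format and the example sentence are hypothetical, not a real system's output):

```python
import re

# Hypothetical tagged text, as an upstream NE tagger might produce it.
text = ("<PERSON>Alice Smith</PERSON> was appointed as "
        "<POST>CEO</POST> of <COMPANY>Acme</COMPANY>")

# Pattern mixing literal words with entity placeholders.
appoint = re.compile(
    r"<PERSON>(?P<person>.+?)</PERSON> was appointed as "
    r"<POST>(?P<post>.+?)</POST> of <COMPANY>(?P<company>.+?)</COMPANY>"
)

m = appoint.search(text)
if m:
    print(m.groupdict())
# → {'person': 'Alice Smith', 'post': 'CEO', 'company': 'Acme'}
```

The point of the slide is that the pattern vocabulary is limited to a handful of entity types; semantic tagging would enrich that vocabulary.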
Can we do more?
Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday
humans aboard space_vehicle dodge satellite timeref.
Could we know these are the same?
The IRA bombed a family owned shop in Belfast yesterday.
FMLN set off a series of explosions in central Bogota today.
ORGANIZATION ATTACKED LOCATION DATE
Machine translation
• Lexical ambiguity often means that a word can be translated in several different ways.
• Would knowing the semantic class of a word help us choose the translation?
Sometimes . . .
• Crane the bird vs. crane the machine
• Bat the animal vs. bat for cricket and baseball
• Seal on a letter vs. the animal
So ...
P(translation(crane) = grulla | animal) > P(translation(crane) = grulla)
P(translation(crane) = grua | machine) > P(translation(crane) = grua)
Can we show the overall effect lowers entropy?
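The entropy question can be illustrated numerically (a minimal sketch; the translation probabilities below are made up for illustration): conditioning on a semantic class sharpens the translation distribution, which shows up directly as lower Shannon entropy.

```python
import math

# Hypothetical translation distributions for "crane" (probabilities made up).
p_uncond = {"grulla": 0.5, "grua": 0.5}          # no class information
p_given_animal = {"grulla": 0.9, "grua": 0.1}    # conditioned on class = animal

def entropy(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

print(entropy(p_uncond))        # 1.0 bit
print(entropy(p_given_animal))  # ≈ 0.469 bits: class info lowers entropy
```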
Language Modeling – Data Sparseness again ..
• We need to estimate Pr(w3 | w1 w2)
• If we have never seen w1 w2 w3 before, can we instead develop a model and estimate Pr(w3 | C1 C2) or Pr(C3 | C1 C2)?
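A class-based estimate of this kind can be sketched in a few lines (a toy illustration; the word-to-class map and corpus are invented): word trigrams that never co-occurred can still receive probability mass via their class trigram.

```python
from collections import Counter

# Hypothetical word → semantic class map (made up for illustration).
word_class = {"crane": "MACHINE", "lorry": "MACHINE", "lifted": "ACTION",
              "moved": "ACTION", "beam": "OBJECT", "pipe": "OBJECT"}

corpus = ["crane lifted beam", "lorry moved pipe"]

class_trigrams = Counter()
class_bigrams = Counter()
for sent in corpus:
    classes = [word_class[w] for w in sent.split()]
    class_trigrams.update(zip(classes, classes[1:], classes[2:]))
    class_bigrams.update(zip(classes, classes[1:]))

def p_class(c1, c2, c3):
    """MLE estimate Pr(C3 | C1 C2) over class trigrams."""
    if class_bigrams[(c1, c2)] == 0:
        return 0.0
    return class_trigrams[(c1, c2, c3)] / class_bigrams[(c1, c2)]

# "crane moved pipe" was never seen as a word trigram, but its class
# trigram MACHINE ACTION OBJECT has been seen in every training sentence.
print(p_class("MACHINE", "ACTION", "OBJECT"))  # 1.0
```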
A Semantic Tagging technology. How?
• We will exploit similarity with NE tagging ...
– Development of pattern-matching rules as incremental wrapper induction
• ... with semantic (sense) disambiguation
– Use as much evidence as possible
– Exploit existing resources like MRDs or LKBs
• ... and with machine learning tasks
– Generalize from positive examples in training data
Multiple Sources of Evidence
• Lexical information (priming effects)
• Distributional information from general and training texts
• Syntactic features
– SVO patterns or adjectival modifiers
• Semantic features
– Structural information in LKBs
– (LKB-based) similarity measures
Machine Learning for ST
• Similarity estimation
– among contexts (text overlaps, ...)
– among lexical items w.r.t. MRDs/LKBs
• We will experiment with:
– Decision tree learning (e.g. C4.5)
– Support Vector Machines (e.g. SVMlight)
– Memory-based learning (TiMBL)
– Bayesian learning
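One of the listed learners, Bayesian learning, can be sketched from scratch as a naive Bayes classifier over context words (a minimal illustration under invented data; the semantic categories and training examples below are hypothetical, not the project's annotation scheme):

```python
from collections import Counter, defaultdict
import math

# Toy training data: context words → semantic category (made-up examples).
train = [
    (["flew", "nest", "wings"], "ANIMAL"),
    (["feathers", "nest"], "ANIMAL"),
    (["lifted", "steel", "site"], "MACHINE"),
    (["construction", "site"], "MACHINE"),
]

prior = Counter(cat for _, cat in train)
likelihood = defaultdict(Counter)
for words, cat in train:
    likelihood[cat].update(words)

def classify(context):
    """Naive Bayes over context words with add-one smoothing."""
    vocab = {w for words, _ in train for w in words}
    best, best_lp = None, float("-inf")
    for cat in prior:
        lp = math.log(prior[cat] / len(train))
        total = sum(likelihood[cat].values())
        for w in context:
            lp += math.log((likelihood[cat][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = cat, lp
    return best

print(classify(["nest", "wings"]))   # ANIMAL
print(classify(["steel", "site"]))   # MACHINE
```

The decision tree, SVM, and memory-based learners named on the slide would consume the same kind of feature vectors, which is what makes combining them straightforward.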
What’s New?
• Granularity
– Semantic categories are coarser than word senses (cf. homograph level in MRDs)
• Integration of existing ML methods
– Pattern induction is combined with probabilistic description of word semantic classes
• Co-training
– Annotated data are used to drive the sampling of further evidence from unannotated material (active learning)
How we know what we’ve done: measurement, the corpus
Hand-annotated corpus
- from the BNC, a 100-million-word balanced corpus
- 1 million words annotated: a little under ½ million categorised noun phrases
Extrinsic evaluation
Perplexity of lexical choice in Machine Translation
Intrinsic evaluation
Standard measures of precision, recall, and false positives (baseline: tag with the most common category = 33%)
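The intrinsic measures reduce to simple counting (a minimal sketch; the gold and system tags below are invented, and the resulting figures are illustrative rather than the project's results):

```python
from collections import Counter

# Hypothetical gold and system tags over 10 noun phrases (made up).
gold   = ["HUMAN", "HUMAN", "PLACE", "OBJECT", "HUMAN",
          "PLACE", "OBJECT", "HUMAN", "PLACE", "OBJECT"]
system = ["HUMAN", "PLACE", "PLACE", "OBJECT", "HUMAN",
          "PLACE", "HUMAN", "HUMAN", "PLACE", "HUMAN"]

correct = sum(g == s for g, s in zip(gold, system))
accuracy = correct / len(gold)

# Baseline: always tag with the most common category in the gold data.
most_common, count = Counter(gold).most_common(1)[0]
baseline = count / len(gold)

print(f"system accuracy: {accuracy:.0%}")         # 70%
print(f"baseline ({most_common}): {baseline:.0%}")  # 40%
```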
Ambiguity levels in the training data
NPs by number of candidate semantic categories:

Senses   NPs      %
 0     104824   23.1%
 1     119228   26.3%
 2      96852   21.4%
 3      44385    9.8%
 4      35671    7.9%
 5      15499    3.4%
 6      13555    3.0%
 7       7635    1.7%
 8       6000    1.3%
 9       2191    0.5%
10       3920    0.9%
11       1028    0.2%
12        606    0.1%
13        183    0.0%
14        450    0.1%
15        919    0.2%
17        414    0.1%

Total NPs (interim): 453360

[Chart: distribution of NPs to categories, by number of senses]
Maximising project outputs: software infrastructure for HLT
Three outputs from the project:
1. A new resource
- automatic annotation of the whole corpus
2. Experimental evidence re. 1.
- how accurate the final results are
- how accurate the various methods employed are
3. Component tools for doing 1., based on GATE
(a General Architecture for Text Engineering)
What is GATE?
• An architecture
A macro-level organisational picture for LE software systems.
• A framework
For programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment
For language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Some free components ... and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/
Where did GATE come from?
A number of researchers realised in the early-to-mid-1990s (e.g. in TIPSTER):
• Increasing trend towards multi-site collaborative projects
• Role of engineering in scalable, reusable, and portable HLT solutions
• Support for large data, in multiple media, languages, formats, and locations
• Lower the cost of creation of new language processing components
• Promote quantitative evaluation metrics via tools and a level playing field
History:
– 1996–2002: GATE version 1, proof of concept
– March 2002: version 2, rewritten in Java, component based, LGPL, more users
– Fall 2003: new development cycle
Role of GATE in the project
• Productivity
- reuse some baseline components for simple tasks
- development environment support for implementors (MATLAB for HLT?)
- reduce integration overhead (standard interfaces between components)
- system takes care of persistency, visualisation, multilingual editing, ...
• Quantification
- tool support for metrics generation
- visualisation of key/response differences
- regression test tool for nightly progress verification
• Repeatability
- open-source, supported, maintained, documented software
- cross-platform (Linux, Windows, Solaris, others)
- easy install and proven usability (thousands of people, hundreds of sites)
- mobile code if you write in Java; web services otherwise