Transcript Slide 1

HLT, Data Sparsity and Semantic Tagging

Louise Guthrie

(University of Sheffield)

Roberto Basili

(University of Tor Vergata, Rome)

Hamish Cunningham

(University of Sheffield)

Outline

– A ubiquitous problem: data sparsity
– The approach: coarse-grained semantic tagging; learning by combining multiple evidence
– The evaluation: intrinsic and extrinsic measures
– The expected outcomes: architectures, tools, development support

Applications

Present

We’ve seen growing interest in a range of HLT tasks: e.g. IE, MT

Trends

– Fully portable IE, unsupervised learning
– Content Extraction vs. IE

Data Sparsity

• Language Processing depends on a model of the features important to an application.

– MT: trigrams and frequencies
– Extraction: word patterns

• New texts always seem to have lots of phenomena we haven’t seen before

Different kinds of patterns

PERSON was appointed as POST of COMPANY

COMPANY named PERSON to POST

• Almost all extraction systems tried to find patterns of mixed words and entities.

– People, Locations, Organizations, dates, times, currencies
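A minimal sketch of such a mixed word/entity pattern (the inline tag format, the example text, and the pattern itself are our own illustration, not taken from any particular extraction system):

```python
import re

# Entities are assumed to have been tagged inline already; the pattern then
# mixes literal words ("was appointed as", "of") with entity-class slots.
TAGGED = ("<PERSON>John Smith</PERSON> was appointed as "
          "<POST>director</POST> of <COMPANY>Acme Ltd</COMPANY>")

PATTERN = re.compile(
    r"<PERSON>(?P<person>.+?)</PERSON> was appointed as "
    r"<POST>(?P<post>.+?)</POST> of <COMPANY>(?P<company>.+?)</COMPANY>"
)

m = PATTERN.search(TAGGED)
if m:
    print(m.groupdict())
# → {'person': 'John Smith', 'post': 'director', 'company': 'Acme Ltd'}
```

The slot fillers come out as named groups, so the same template extracts who was appointed, to what post, at which company.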

Can we do more?

Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday

Humans aboard space_vehicle dodge satellite timeref.

Could we know these are the same?

The IRA bombed a family owned shop in Belfast yesterday.

FMLN set off a series of explosions in central Bogota today.

ORGANIZATION ATTACKED LOCATION DATE
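One way to see the two sentences collapsing onto the same template, as a toy sketch (the gazetteer and the trigger-word list below are invented for illustration):

```python
# Toy lookup tables standing in for real NE tagging and trigger detection.
GAZETTEER = {"IRA": "ORGANIZATION", "FMLN": "ORGANIZATION",
             "Belfast": "LOCATION", "Bogota": "LOCATION",
             "yesterday": "DATE", "today": "DATE"}
TRIGGERS = {"bombed": "ATTACKED", "set off": "ATTACKED"}

def template(sentence):
    """Reduce a surface sentence to (event, has-ORG, has-LOC, has-DATE)."""
    slots = {}
    for trigger, label in TRIGGERS.items():
        if trigger in sentence:
            slots["event"] = label
    for mention, label in GAZETTEER.items():
        if mention in sentence:
            slots[label] = True
    return (slots.get("event"),
            "ORGANIZATION" in slots, "LOCATION" in slots, "DATE" in slots)

s1 = "The IRA bombed a family owned shop in Belfast yesterday."
s2 = "FMLN set off a series of explosions in central Bogota today."
print(template(s1) == template(s2))  # → True
```

Both surface forms reduce to the same abstract event: an ORGANIZATION ATTACKED a LOCATION on a DATE.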

Machine translation

• Ambiguity often means that a word can be translated in several different ways.

• Would knowing the semantic class of a word help us choose the translation?

Sometimes . . .

• Crane the bird vs. crane the machine
• Bat the animal vs. bat for cricket and baseball
• Seal on a letter vs. seal the animal

So . . .

P(translation(crane) = grulla | animal) > P(translation(crane) = grulla)

P(translation(crane) = grua | machine) > P(translation(crane) = grua)

Can we show the overall effect lowers entropy?
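A toy calculation of the entropy claim (the translation counts are invented; the point is only that conditioning on the semantic class cannot raise, and here strictly lowers, the entropy of the translation choice):

```python
from math import log2

# Invented counts: how often "crane" translates as "grulla" (the bird)
# vs. "grua" (the machine), together with its semantic class.
counts = {("grulla", "animal"): 40, ("grua", "machine"): 60}

def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(dist.values())
    return -sum(c / total * log2(c / total) for c in dist.values() if c)

# Unconditioned translation distribution: P(grulla)=0.4, P(grua)=0.6.
h_uncond = entropy({"grulla": 40, "grua": 60})   # ≈ 0.971 bits

# Given the class, each translation is (in this toy) fully determined,
# so H(translation | class) = 0.4·H(grulla) + 0.6·H(grua) = 0 bits.
h_cond = 0.4 * entropy({"grulla": 40}) + 0.6 * entropy({"grua": 60})
print(h_uncond, h_cond)
```

With real data the conditional distributions would not be deterministic, but the same comparison H(translation) vs. H(translation | class) quantifies how much the semantic tag helps.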

Language Modeling – Data Sparseness again ..

• We need to estimate Pr(w3 | w1 w2)

• If we have never seen w1 w2 w3 before, can we instead develop a model and estimate Pr(w3 | C1 C2) or Pr(C3 | C1 C2)?
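A minimal sketch of the class-based backoff idea (the word classes and tiny corpus are invented; Pr(w3 | C1 C2) is approximated here as Pr(C3 | C1 C2) · Pr(w3 | C3)):

```python
from collections import defaultdict

# Invented hard word-to-class mapping for the illustration.
CLASS_OF = {"the": "DET", "cat": "ANIMAL", "dog": "ANIMAL",
            "sat": "ACT", "ran": "ACT"}

def train(corpus):
    tri = defaultdict(int)              # class-trigram counts
    bi = defaultdict(int)               # class-bigram (context) counts
    word_in_class = defaultdict(int)    # count of w within its class
    class_total = defaultdict(int)      # total tokens per class
    for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
        c1, c2, c3 = CLASS_OF[w1], CLASS_OF[w2], CLASS_OF[w3]
        tri[(c1, c2, c3)] += 1
        bi[(c1, c2)] += 1
    for w in corpus:
        word_in_class[w] += 1
        class_total[CLASS_OF[w]] += 1
    return tri, bi, word_in_class, class_total

def p_class_backoff(w3, w1, w2, model):
    tri, bi, word_in_class, class_total = model
    c1, c2, c3 = CLASS_OF[w1], CLASS_OF[w2], CLASS_OF[w3]
    if bi[(c1, c2)] == 0:
        return 0.0
    p_c3 = tri[(c1, c2, c3)] / bi[(c1, c2)]     # Pr(C3 | C1 C2)
    p_w3 = word_in_class[w3] / class_total[c3]  # Pr(w3 | C3)
    return p_c3 * p_w3

model = train("the cat sat the dog ran".split())
# "the dog sat" never occurred, but its class pattern DET ANIMAL ACT did:
print(p_class_backoff("sat", "the", "dog", model))  # → 0.5
```

The unseen word trigram still gets a non-zero estimate because its class trigram was observed, which is exactly the sparsity relief the slide asks about.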

A Semantic Tagging technology. How?

• We will exploit similarity with NE tagging ...
– development of pattern matching rules as incremental wrapper induction
• ... with semantic (sense) disambiguation
– use as much evidence as possible
– exploit existing resources like MRDs or LKBs
• ... and with machine learning tasks
– generalise from positive examples in training data

Multiple Sources of Evidence

• Lexical information (priming effects)
• Distributional information from general and training texts
• Syntactic features
– SVO patterns or adjectival modifiers
• Semantic features
– structural information in LKBs
– (LKB-based) similarity measures

Machine Learning for ST

• Similarity estimation
– among contexts (text overlaps, …)
– among lexical items wrt MRDs/LKBs
• We will experiment with:
– decision tree learning (e.g. C4.5)
– Support Vector Machines (e.g. SVMlight)
– memory-based learning (TiMBL)
– Bayesian learning
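As an illustration of the Bayesian option, a self-contained naive Bayes tagger over context words (the training examples and the two categories are invented for this sketch):

```python
from collections import Counter, defaultdict
from math import log

# Invented toy data: classify the semantic category of "crane"
# from the words in its context.
TRAIN = [
    (["flew", "over", "the", "nest"], "animal"),
    (["lifted", "the", "steel", "beam"], "machine"),
    (["the", "bird", "flew", "south"], "animal"),
    (["the", "crane", "lifted", "cargo"], "machine"),
]

def fit(examples):
    prior = Counter(label for _, label in examples)
    likelihood = defaultdict(Counter)   # per-class context-word counts
    vocab = set()
    for context, label in examples:
        likelihood[label].update(context)
        vocab.update(context)
    return prior, likelihood, vocab

def predict(context, model):
    prior, likelihood, vocab = model
    total = sum(prior.values())
    def score(label):
        s = log(prior[label] / total)
        denom = sum(likelihood[label].values()) + len(vocab)
        for w in context:
            s += log((likelihood[label][w] + 1) / denom)  # add-one smoothing
        return s
    return max(prior, key=score)

model = fit(TRAIN)
print(predict(["the", "crane", "flew", "over"], model))  # → animal
```

The same interface (context features in, category out) is what the other learners in the list would implement; only the estimation method changes.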

What’s New?

• Granularity
– semantic categories are coarser than word senses (cf. homograph level in MRDs)
• Integration of existing ML methods
– pattern induction is combined with a probabilistic description of word semantic classes
• Co-training
– annotated data are used to drive the sampling of further evidence from unannotated material (active learning)
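A sketch of the sampling step this active-learning loop relies on (the posterior distributions below are invented stand-ins for the output of a tagger trained on the annotated data):

```python
def least_confident(unlabelled, predict_proba):
    """Pick the unannotated item the current model is least sure about."""
    def confidence(item):
        return max(predict_proba(item).values())  # probability of best class
    return min(unlabelled, key=confidence)

# Invented posteriors standing in for a trained tagger's predictions
# over unannotated material.
POSTERIORS = {
    "sentence A": {"animal": 0.95, "machine": 0.05},
    "sentence B": {"animal": 0.55, "machine": 0.45},
    "sentence C": {"animal": 0.80, "machine": 0.20},
}

pick = least_confident(POSTERIORS, lambda item: POSTERIORS[item])
print(pick)  # the near-tie "sentence B" is the next item to hand-annotate
```

Annotating the least-confident items first is one common way to let the labelled data "drive the sampling" of further evidence, as the slide puts it.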

How we know what we’ve done: measurement, the corpus

Hand-annotated corpus

- from the BNC, a 100-million-word balanced corpus
- 1 million words annotated: a little under ½ million categorised noun phrases

Extrinsic evaluation

Perplexity of lexical choice in Machine Translation

Intrinsic evaluation

Standard measures of precision, recall, and false positives (baseline: tag with the most common category = 33%)
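A sketch of the intrinsic measures and the majority-category baseline (the key/response tags are invented toy data; on the real corpus the slide reports a 33% baseline):

```python
from collections import Counter

# Invented gold-standard ("key") and system ("response") category tags.
KEY      = ["obj", "obj", "anim", "abstr", "anim", "obj"]
RESPONSE = ["obj", "anim", "anim", "abstr", "obj", "obj"]

correct = sum(k == r for k, r in zip(KEY, RESPONSE))
precision = correct / len(RESPONSE)  # since every token here receives a tag,
recall = correct / len(KEY)          # precision == recall == accuracy

# Baseline: tag everything with the most common category in the key.
majority, count = Counter(KEY).most_common(1)[0]
baseline = count / len(KEY)
print(precision, recall, baseline)
```

A tagger is only interesting to the extent that it beats this most-common-category baseline.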

Ambiguity levels in the training data

NPs by number of possible semantic categories (interim total: 453360 NPs):

categories   NPs      share
0            104824   23.1%
1            119228   26.3%
2             96852   21.4%
3             44385    9.8%
4             35671    7.9%
5             15499    3.4%
6             13555    3.0%
7              7635    1.7%
8              6000    1.3%
9              2191    0.5%
10             3920    0.9%
11             1028    0.2%
12              606    0.1%
13              183    0.0%
14              450    0.1%
15              919    0.2%
17              414    0.1%

[Bar chart: distribution of NPs over the number of senses]

Maximising project outputs: software infrastructure for HLT

Three outputs from the project: 1. A new resource

Automatic annotation of the whole corpus

2. Experimental evidence re. 1.

- how accurate the final results are
- how accurate the various methods employed are

3. Component tools for doing 1., based on GATE

(a General Architecture for Text Engineering)

What is GATE?

An architecture

• A macro-level organisational picture for LE software systems.

A framework

For programmers, GATE is an object-oriented class library that implements the architecture.

A development environment

For language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing, e.g., Information Extraction.

Some free components...

Tools

...and wrappers for other people's components for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.

Free

software (LGPL). Download at http://gate.ac.uk/download/

Where did GATE come from?

A number of researchers realised in the early-to-mid 1990s (e.g. in TIPSTER):

• Increasing trend towards multi-site collaborative projects
• Role of engineering in scalable, reusable, and portable HLT solutions
• Support for large data, in multiple media, languages, formats, and locations
• Lower the cost of creation of new language processing components
• Promote quantitative evaluation metrics via tools and a level playing field

History:

1996 – 2002

: GATE version 1, proof of concept

March 2002

: version 2, rewritten in Java, component based, LGPL, more users

Fall 2003

: new development cycle

Role of GATE in the project


Productivity

- reuse some baseline components for simple tasks
- development environment support for implementors (MATLAB for HLT?)
- reduce integration overhead (standard interfaces between components)
- system takes care of persistency, visualisation, multilingual editing, ...

Quantification

- tool support for metrics generation
- visualisation of key/response differences
- regression test tool for nightly progress verification

Repeatability

- open source: supported, maintained, documented software
- cross-platform (Linux, Windows, Solaris, others)
- easy install and proven usability (thousands of people, hundreds of sites)
- mobile code if you write in Java; web services otherwise