Empirical Methods for Natural Language Processing


Learning Text Mining
Walter Daelemans
[email protected]
http://cnts.uia.ac.be
CNTS, University of Antwerp, Belgium
ILK, Tilburg University, Netherlands
Language Technology
BERGEN 2002
Outline
• Using a Language Bank for Text Mining
• Shallow Parsing for Text Mining
– Example: Question Answering
• Memory-Based Learning
– Memory-Based Shallow Parsing
• Tagging
• Chunking
• Relation-finding
– Memory-Based Information Extraction
Text Mining
• Automatic extraction of reusable information (knowledge) from text, based on linguistic features of the text
• Goals:
– Data mining (KDD) from unstructured and semi-structured data
– (Corporate) Knowledge Management
– “Intelligence”
• Examples:
– Email routing and filtering
– Finding protein interactions in biomedical text
– Matching resumes and vacancies
[Diagram: from a Document or Set of Documents (+ existing data) to Structured Information, via tasks such as:]
• Author Recognition
• Document Dating
• Language Identification
• Text Categorization
• Information Extraction
• Summarization
• Question Answering
• Topic Detection and Tracking
• Document Clustering
• Terminology Extraction
• Ontology Extraction
• Knowledge Discovery
LE Components
[Diagram: pipeline from Text to Meaning, with the applications each level of analysis enables.]
• Components: Lexical / Morphological Analysis → Tagging → Chunking → Syntactic Analysis → Grammatical Relation Finding → Semantic Analysis → Reference Resolution → Discourse Analysis
• Applications: OCR, Spelling Error Correction, Grammar Checking, Information Retrieval, Document Classification, Information Extraction, Named Entity Recognition, Summarization, Word Sense Disambiguation, Question Answering, Ontology Extraction and Refinement, Dialogue Systems, Machine Translation
LE Components
[Same diagram, with Tagging, Chunking, Syntactic Analysis, and Grammatical Relation Finding grouped together as Shallow Parsing.]
• Components: Lexical / Morphological Analysis → Shallow Parsing → Semantic Analysis → Reference Resolution → Discourse Analysis
• Applications: as above
Example: Shallow Parsing for Question Answering
• Give an answer to a question
(contrast with document retrieval: find documents relevant to a query)
• Who invented the telephone?
– Alexander Graham Bell
• When was the telephone invented?
– 1876
(Buchholz & Daelemans, 2001)
QA System: Shapaqa
• Parse question
When was the telephone invented?
– Which slots are given?
• Verb
invented
• Object
telephone
– Which slots are asked?
• Temporal phrase linked to the verb
• Document retrieval on the internet with the given slot keywords
• Parse the retrieved sentences that contain all given slots
• Count the most frequent entry found in the asked slot (temporal phrase); see the sketch below
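A minimal sketch of this counting loop, assuming hypothetical `retrieve` and `parse` helpers standing in for the search engine and the shallow parser (not the actual Shapaqa code):

```python
from collections import Counter

def answer(given_slots, asked_slot, retrieve, parse):
    """Slot-based QA: retrieve sentences for the given slot keywords,
    shallow-parse them, and rank fillers of the asked slot by frequency.
    `retrieve` and `parse` are assumed stand-ins, not the Shapaqa API."""
    candidates = Counter()
    for sentence in retrieve(given_slots.values()):
        slots = parse(sentence)  # e.g. {"verb": "invented", "object": "the telephone", ...}
        # keep only sentences in which all given slots are realized
        if all(slots.get(name) == value for name, value in given_slots.items()):
            if asked_slot in slots:
                candidates[slots[asked_slot]] += 1
    return candidates.most_common()

# "When was the telephone invented?" becomes:
# answer({"verb": "invented", "object": "the telephone"}, "temporal", retrieve, parse)
```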
Shapaqa: example
• When was the telephone invented?
• Google: invented AND “the telephone”
– produces 835 pages
– 53 parsed sentences with both slots and with a temporal phrase, e.g.:
“… is through his interest in Deafness and fascination with acoustics that the telephone was invented in 1876, with the intent of helping Deaf and hard of hearing …”
“The telephone was invented by Alexander Graham Bell in 1876 …”
“When Alexander Graham Bell invented the telephone in 1876, he hoped that these same electrical signals could …”
Shapaqa: example (2)
• So when was the phone invented?
• Internet answer is noisy, but robust
– 17: 1876
– 3: 1874
– 2: ago
– 2: later
– 1: Bell
– …
• System was developed quickly
• Precision 76% (Google 31%)
• International competition (TREC): MRR 0.45
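MRR (mean reciprocal rank) scores each question by the reciprocal of the rank at which the first correct answer appears. A minimal sketch of the metric, not TREC's official scorer:

```python
def mean_reciprocal_rank(rankings, gold):
    """rankings: one ranked answer list per question; gold: the correct
    answer per question. Each question contributes 1/rank of its first
    correct answer, or 0 if the answer is missing from the list."""
    total = 0.0
    for ranked, answer in zip(rankings, gold):
        for rank, candidate in enumerate(ranked, start=1):
            if candidate == answer:
                total += 1.0 / rank
                break
    return total / len(gold)

# mean_reciprocal_rank([["1876", "1874", "ago"]], ["1876"]) -> 1.0
```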
Who shot Kennedy?
4 × OSWALD
* www.anusha.com/jfk.htm
situation in which Oswald shot Kennedy on November 22 , 1963 .
* www.mcb.ucdavis.edu/people/hemang/spooky.html
Lee Harvey Oswald shot Kennedy from a warehouse and ran .
* www.gallica.co.uk/monarch.htm
November 1963 U.S. President Kennedy was shot by Lee Harvey Oswald .
* astrospeak.indiatimes.com/mystic_corner.htm
Lee Harvey Oswald shot Kennedy from a warehouse and fled .
2 × BISHOP
* www.powells.com/biblio/0-200/000637901X.html
The day Kennedy was shot by Jim Bishop .
* www.powells.com/biblio/49200-49400/0517431009.html
The day Kennedy was shot by by Jim Bishop .
1 × BULLET
* www.lustlysex.com/index_m.htm
President John F. Kennedy was shot by a Republican bullet .
1 × MAN
* www.ncas.org/condon/text/appndx-p.htm
KENNEDY ASSASSINATION Kennedy was shot by a man who was not .
[Diagram: Documents → Text Analysis / Shallow Parsing → Analyzed Text (who, what, where, when, why, how, …) → Text Mining → Structured Data / Knowledge. A Language Bank supplies the training material for the Classifiers that perform the text analysis step.]
[Diagram: Information Sources (annotated corpus) → Examples → Machine Learning → Optimal Classifier; at run time: Input → Optimal Classifier → Postprocessing → Output. The Machine Learning step covers:]
• Feature selection and construction
• Learning algorithm parameter optimization
• Combination
• Boosting
• Cross-validation
Generalisation vs. Abstraction
[Diagram: learning methods plotted along two axes, generalisation and abstraction.]
• + abstraction, + generalisation: Rule Induction, Connectionism, Inductive Logic Programming, Statistics, … (fill in your most hated linguist here)
• + abstraction, - generalisation: Handcrafting
• - abstraction, + generalisation: Memory-Based Learning
• - abstraction, - generalisation: Table Lookup
MBL: Use memory traces of experiences as a basis for
analogical reasoning, rather than using rules or other
abstractions extracted from experience and replacing the
experiences.
This “rule of nearest neighbor” has considerable
elementary intuitive appeal and probably corresponds to
practice in many situations. For example, it is possible
that much medical diagnosis is influenced by the doctor's
recollection of the subsequent history of an earlier patient
whose symptoms resemble in some way those of the current
patient. (Fix and Hodges, 1952, p.43)
[Diagram: MBL toy example — choosing the Dutch diminutive suffix (-etje vs. -kje) for a new word from features of its last syllable, such as the nucleus and coda.]
Memory-Based Learning
• Basis: k nearest neighbor algorithm:
– store all examples in memory
– to classify a new instance X, look up the k examples in
memory with the smallest distance D(X,Y) to X
– let each nearest neighbor vote with its class
– classify instance X with the class that has the most
votes in the nearest neighbor set
• Choices:
– similarity metric
– number of nearest neighbors (k)
– voting weights
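A minimal sketch of the basic algorithm with the overlap metric and majority voting; TiMBL adds feature weighting and other distance metrics on top of this:

```python
from collections import Counter

def overlap_distance(x, y):
    """The basic overlap metric: number of mismatching feature values."""
    return sum(a != b for a, b in zip(x, y))

def knn_classify(memory, instance, k=1):
    """memory holds (feature_tuple, class) pairs stored verbatim; classify
    a new instance by majority vote among its k nearest stored neighbors."""
    nearest = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    votes = Counter(cls for _, cls in nearest)
    return votes.most_common(1)[0][0]

# memory = [(("Det", "Adj"), "NN"), (("Adj", "NN"), "VBZ")]
# knn_classify(memory, ("Det", "Adj"), k=1) -> "NN"
```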
The properties of NLP tasks …
• NLP tasks are mappings between linguistic representation levels that are
– context-sensitive (but mostly local!)
– complex (sub/ir/regularity), with pockets of exceptions
• Similar representations at one linguistic level correspond to similar representations at the other level
• Several information sources interact in (often) unpredictable ways at the same level
• Data is sparse
… fit the bias of MBL
• The mappings can be represented as (cascades of) classification tasks (disambiguation or segmentation)
• Locality is implemented through windowing over representations
• Inference is based on similarity-based / analogical reasoning
• Adaptive data fusion / relevance assignment is available through feature weighting (see the information-gain sketch below)
• It is a non-parametric approach
• Similarity-based smoothing is implicit
• Regularities and subregularities / exceptions can be modeled uniformly
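The relevance weights are typically information-gain based; a minimal sketch of plain information gain (without the gain-ratio normalization TiMBL uses by default). The resulting weights scale the overlap metric: D(X, Y) = sum_i w_i * [x_i != y_i].

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature_index):
    """IG of one feature over (feature_tuple, class) examples: class
    entropy minus the expected entropy after splitting on the feature."""
    by_value = {}
    for feats, cls in examples:
        by_value.setdefault(feats[feature_index], []).append(cls)
    remainder = sum(len(sub) / len(examples) * entropy(sub)
                    for sub in by_value.values())
    return entropy([cls for _, cls in examples]) - remainder
```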
Shallow Parsing Classifiers
The woman will give Mary a book
POS Tagging
The/Det woman/NN will/MD give/VB
Mary/NNP a/Det book/NN
Chunking
[The/Det woman/NN]NP [will/MD give/VB]VP
[Mary/NNP]NP [a/Det book/NN]NP
Relation Finding
[The woman](Subject) [will give] [Mary](I-Object) [a book](Object)
Memory-Based POS Tagger
Assigning morpho-syntactic categories (parts-of-speech) to
words in context:
word:       The   green    train     runs    down           that      track   .
ambiguity:  Det   Adj/NN   NNS/VBZ   NN/VB   Prep/Adv/Adj   SC/Pron   NN/VB   .
chosen:     Det   Adj      NNS       VB      Prep           Pron      NN      .
Disambiguation: resolution of a combination of lexical and
local contextual constraints.
• Lexical representations: Frequency-sensitive ambiguity
class lexicon.
• Convert sentences to MBL cases by ‘windowing’: local constraints are modeled by features of neighboring words (see the sketch below).
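A minimal windowing sketch, turning a token sequence into one fixed-width instance per focus position, with two positions of context on each side; the real tagger fills the left context with already-assigned tags and the right context with ambiguous lexical classes:

```python
def windows(tokens, width=2, pad="_"):
    """Yield one fixed-length window per token: the focus token in the
    middle, `width` tokens of left and right context, padded at the edges."""
    padded = [pad] * width + tokens + [pad] * width
    for i in range(len(tokens)):
        yield tuple(padded[i:i + 2 * width + 1])

# list(windows(["The", "green", "train", "runs"]))[:2]
# -> [('_', '_', 'The', 'green', 'train'),
#     ('_', 'The', 'green', 'train', 'runs')]
```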
Memory-Based POS Tagger (2)
• Case base for known words. Features:
tag-2, tag-1, lex(focus), word(top 100)(focus), lex+1, lex+2 → POS tag
• Case base for unknown words (feature extraction sketched below). Features:
tag-2, tag-1, pref, cap, hyp, num, suf1, suf2, suf3, lex+1, lex+2 → POS tag
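A sketch of the unknown-word features, following the slide's feature names; the exact encodings (single letters for pref/suf, boolean flags) are assumptions for illustration:

```python
def unknown_word_features(word):
    """Surface clues for an out-of-lexicon word: first letter (pref),
    capitalization / hyphen / digit flags, and the last three letters
    (suf1 = last, suf2, suf3). Encodings here are assumed, not TiMBL's."""
    return {
        "pref": word[:1].lower(),
        "cap": word[:1].isupper(),
        "hyp": "-" in word,
        "num": any(ch.isdigit() for ch in word),
        "suf1": word[-1:],
        "suf2": word[-2:-1],
        "suf3": word[-3:-2],
    }

# unknown_word_features("Antwerp-based")
# -> {'pref': 'a', 'cap': True, 'hyp': True, 'num': False,
#     'suf1': 'd', 'suf2': 'e', 'suf3': 's'}
```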
Memory-Based POS Tagger (3)
• Experimental results: (Zavrel & Daelemans, 1999)
language     | tagset size | train | test | accuracy
English WSJ  |  44         | 2000  | 200  | 96.4
English LOB  | 170         |  931  | 115  | 97.0
Dutch        |  13         |  611  | 100  | 95.7
Czech        |  42         |  495  | 100  | 93.6
Spanish      | 484         |  711  |  89  | 97.8
Swedish      |  23         | 1156  |  11  | 95.6
Memory-Based XP Chunker
Assigning non-recursive phrase brackets (Base XPs) to
phrases in context:
[NP The woman] [VP will give] [NP Mary] [NP a book]

word:  The    woman   will   give   Mary   a      book
POS:   Det    NN      MD     VB     NNP    Det    NN
IOB:   I-NP   I-NP    I-VP   I-VP   I-NP   B-NP   I-NP
Convert NP, VP, ADJP, ADVP, PrepP, and PP brackets to
classification decisions (I/O/B tags) (Ramshaw & Marcus, 1995).
Features:
POS-2, IOBtag-2, word-2, POS-1, IOBtag-1, word-1, POS(focus), word(focus), POS+1, word+1, POS+2, word+2 → IOB tag (decoded back into chunks as sketched below)
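Reading chunks back off the predicted tags is deterministic. A minimal decoder for the IOB1 scheme used here, where B- only marks the start of a chunk that immediately follows another chunk of the same type:

```python
def iob1_chunks(words, tags):
    """Read (type, words) chunks off a sequence of IOB1 tags."""
    chunks, start, kind = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last chunk
        prefix, _, label = tag.partition("-")
        # close the open chunk at O, at B-, or at a chunk-type change
        if start is not None and (prefix in ("B", "O") or label != kind):
            chunks.append((kind, words[start:i]))
            start = None
        # open a new chunk on any I-/B- tag when none is open
        if prefix in ("I", "B") and start is None:
            start, kind = i, label
    return chunks

# iob1_chunks("The woman will give Mary a book".split(),
#             ["I-NP", "I-NP", "I-VP", "I-VP", "I-NP", "B-NP", "I-NP"])
# -> [('NP', ['The', 'woman']), ('VP', ['will', 'give']),
#     ('NP', ['Mary']), ('NP', ['a', 'book'])]
```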
Memory-Based XP Chunker (2)
• Results (WSJ corpus) (Buchholz, Veenstra, Daelemans, 1999)
type     | prec | recall | F1
NP       | 92.5 | 92.2   | 92.3
VP       | 91.9 | 91.7   | 91.8
ADJP     | 68.4 | 65.0   | 66.7
ADVP     | 78.0 | 77.9   | 77.9
Prep     | 95.5 | 96.7   | 96.1
PP       | 91.9 | 92.2   | 92.0
ADVFunc  | 78.0 | 69.5   | 73.5
• Useful for: Information Retrieval, Information Extraction,
Terminology Discovery, etc.
• See also CoNLL-2000 Shared task at:
http://lcg-www.uia.ac.be/
Memory-Based GR labeling
Assigning labeled Grammatical Relation links between words
in a sentence:
word:   The    woman    will   give   Mary    a      book   .
POS:    Det    NN       MD     VB     NNP     Det    NN     .
chunk:  I-NP   I-NP     I-VP   I-VP   I-NP    B-NP   I-NP   O
GR:     -      SUBJ-1   VP-1   VP-1   OBJ-1   -      -      -
GRs of the focus chunk in relation to verbs (subject, object, location, …, none); instance construction is sketched below.
Features:
Focus: prep, adv-func, word+1, word0, word-1, word-2, POS+1, POS0, POS-1, POS-2, Chunk+1, Chunk0, Chunk-1, Chunk-2
Verb: POS, word
Distance: words, VPs, commas
→ GR type
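A sketch of how such instances could be built, pairing each candidate focus chunk with each verb and adding the slide's distance features; the chunk representation (dicts with index/head/pos/words keys) is an assumption for illustration:

```python
def gr_instances(chunks, verbs):
    """Pair every candidate focus chunk with every verb in the sentence,
    with distance features counting intervening words, VP chunks, and
    commas. Class labels (SUBJ, OBJ, ..., NONE) come from the treebank."""
    for focus in chunks:
        for verb in verbs:
            lo, hi = sorted((focus["index"], verb["index"]))
            between = chunks[lo + 1:hi]  # chunks strictly between the pair
            yield {
                "focus_word": focus["head"], "focus_pos": focus["pos"],
                "verb_word": verb["head"], "verb_pos": verb["pos"],
                "dist_words": sum(len(c["words"]) for c in between),
                "dist_vps": sum(c["type"] == "VP" for c in between),
                "dist_commas": sum("," in c["words"] for c in between),
            }
```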
Memory-Based GR labeling (2)
• Results (WSJ corpus) (Buchholz, Veenstra, Daelemans, 1999)
features         | prec | recall | F1
words+POS only   | 60.7 | 41.3   | 49.1
+NPs             | 65.9 | 55.7   | 60.4
+VPs             | 72.1 | 62.9   | 67.2
+ADJPs +ADVPs    | 72.1 | 63.0   | 67.3
+Preps           | 72.5 | 64.3   | 68.2
+PPs             | 73.6 | 65.6   | 69.3
+ADVFunc         | 74.8 | 67.9   | 71.2
• Completes the shallow parser. Useful for, e.g., Question Answering, Information Extraction, etc.
• See demo at: http://ilk.kub.nl/
From POS tagging to IE
Classification-Based Approach
• POS tagging
The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
• NP chunking
The/I-NP woman/I-NP will/I-VP give/I-VP Mary/I-NP a/B-NP
book/I-NP
• Relation Finding
[NP-SUBJ-1 The woman] [VP-1 will give] [NP-I-OBJ-1 Mary] [NP-OBJ-1 a book]
• Semantic Tagging = Information Extraction
[Giver the woman] [will give] [Givee Mary] [Given a book]
• Semantic Tagging = Question Answering
Who will give Mary a book?
[Giver ?] [will give] [Givee Mary] [Given a book]
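Schematically, the whole classification-based cascade composes trained classifiers, each consuming the previous level's output; the four classifier objects below are stand-ins for trained memory-based modules, not an actual API:

```python
def analyze(sentence, tagger, chunker, relator, sem_tagger):
    """Cascaded classification: POS tags feed the chunker, chunk tags
    feed the relation finder, and relations feed the semantic tagger,
    which yields the Giver/Givee/Given-style role labels."""
    words = sentence.split()
    tags = tagger(words)                       # e.g. Det NN MD VB ...
    chunks = chunker(words, tags)              # e.g. I-NP I-NP I-VP ...
    relations = relator(words, tags, chunks)   # e.g. SUBJ-1 VP-1 ...
    return sem_tagger(words, tags, chunks, relations)
```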
[Recap diagram: Language Bank → Classifiers; Documents → Text Analysis / Shallow Parsing → Analyzed Text (who, what, where, when, why, how, …) → Text Mining → Structured Data / Knowledge]
IE: Experimental comparison with HMM
• Systems:
– TK_SemTagger: Textkernel’s Memory-Based shallow
semantic tagger, using TiMBL v4.0; and
– TnT (Brants, 2000), trigram HMM, smoothing by
deleted interpolation. Handles unknown words by
successive abstraction over suffix tree.
• Classification-based Information Extraction
(Zavrel & Daelemans, forthcoming)
• Data set: Seminar Announcements.
• Score: F1 (=2*p*r/(p+r)), Exact match, all occurrences.
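Per the formula on the slide, a one-line scorer:

```python
def f1(p, r):
    """F1 = 2*p*r / (p + r): the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Sanity check against the relation-finding table above:
# f1(0.725, 0.643) -> 0.6815..., i.e. the 68.2 reported for +Preps.
```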
Data set
Seminar Announcement data set (Freitag, 1998): 485 documents (1-243 for training, 243-283 for validation, 283-323 for testing).
• Extracted fields:
– “speaker”
– “location”
– “start time”
– “end time”
Experiments: TK_SemTagger
Feature set:

position   | features
-4 … -1    | tag, word{f>4}
0 (focus)  | lextag, word{f>4}, suf1, suf2, pref1, cap, allcap, initial, hyp, num, at
+1 … +4    | lextag, word{f>4}
Experiments: comparison
Comparison of the best TnT and TK_SemTagger models (F1 per field):

model        | speaker | location | stime | etime
TnT Context  | 0.66    | 0.70     | 0.85  | 0.90
TKSem Best   | 0.71    | 0.87     | 0.95  | 0.96
Conclusions
• Text Mining tasks benefit from linguistic analysis
(shallow parsing)
• Problems ranging from shallow parsing to application-oriented tasks (Information Extraction) can be formulated as classification-based learning tasks
• These classifiers can be trained on Language
Banks
• Memory-Based Learning seems to have the right
bias for this type of task (can cope with rich
feature sets and exceptions)
Text Mining at CNTS
(http://cnts.uia.ac.be)
• Development of shallow parsers, lemmatizers,
named-entity recognizers and other tools
– ML basic research aspects, design, adaptation to
specific TM applications
– English, Dutch, …
• Information and Ontology Extraction
– BIOMINT (?) (EU), ONTOBASIS (IWT)
• Summarization and automatic subtitling
– MUSA (EU), ATRANOS (IWT)