Transcript Slide 1
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 9:
Natural Language Processing and IR.
Tagging, WSD, and Anaphora Resolution
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
Reducing synonyms can help IR
Better matching
Ontologies are used. WordNet
Morphology is a variant of synonymy
widely used in IR systems
Precise analysis: dictionary-based analyzers
Quick-and-dirty analysis: stemmers
Rule-based stemmers. Porter stemmer
Statistical stemmers
2
Previous Chapter: Research topics
Construction and application of ontologies
Building of morphological dictionaries
Treatment of unknown words with morphological
analyzers
Development of better stemmers
Statistical stemmers?
3
Contents
Tagging:
for each word, determine its POS (Part of Speech:
noun, ...) and grammatical characteristics
WSD (Word Sense Disambiguation):
for each word, determine which homonym is used
Anaphora resolution:
For a pronoun (it, ...), determine what it refers to
4
Tagging: The problem
Ambiguity of parts of speech
rice flies like sand
= insects living in rice consider sand good?
= rice can fly similarly to sand?
... insect of a container with rice...?
We can fly like sand ... We think fly like sand...
Ambiguity of grammatical characteristics
He has read the book
He will read the book... He read the book
Very frequent phenomenon, nearly at each word!
5
Tagger...
A program that looks at the context and decides
what the part of speech (and other characteristics) is
Input:
He will read the book
Morphological analysis
He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...>
Tagger
?
Tags: Ns = noun singular, Va = verb auxiliary, Vpa = verb past,
Vpp = verb past participle, Vinf = verb infinitive, ...
6
...Tagger
Input of tagger
He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...>
Task: Choose one!
Output:
He<...> will<Va> read<Vinf> the<...>
How do we do it?
He will<Ns>: not possible, so Va
will<Va> read: so Vinf
This is simple, but imagine that He is ambiguous too... Combinatorial explosion
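The elimination above can be sketched as pruning candidate tags with local constraints. This is only an illustration: the tags Pron and Det are placeholders assumed here (the slide leaves the tags of "He" and "the" unspecified), and the constraint table and the function name prune are hypothetical.

```python
# Candidate tags per word, as a morphological analyzer might return them.
# "Pron" and "Det" are placeholder tags assumed for this sketch.
CANDIDATES = [
    ("He", {"Pron"}),
    ("will", {"Ns", "Va"}),
    ("read", {"Vpa", "Vpp", "Vinf"}),
    ("the", {"Det"}),
]

# Hypothetical local constraints: (left tag, right tag) pairs that cannot be adjacent.
FORBIDDEN_BIGRAMS = {
    ("Pron", "Ns"),  # "He will<Ns>" is not possible
    ("Va", "Vpa"),   # after the auxiliary only the infinitive fits
    ("Va", "Vpp"),
}


def prune(candidates, forbidden):
    """Repeatedly drop tags that cannot combine with any tag of a neighbour."""
    tags = [set(t) for _, t in candidates]
    changed = True
    while changed:
        changed = False
        for i in range(len(tags)):
            for tag in list(tags[i]):
                left_ok = i == 0 or any(
                    (left, tag) not in forbidden for left in tags[i - 1]
                )
                right_ok = i == len(tags) - 1 or any(
                    (tag, right) not in forbidden for right in tags[i + 1]
                )
                if not (left_ok and right_ok):
                    tags[i].discard(tag)
                    changed = True
    return [(word, t) for (word, _), t in zip(candidates, tags)]


result = prune(CANDIDATES, FORBIDDEN_BIGRAMS)
```

With these toy constraints only will<Va> and read<Vinf> survive. With several ambiguous neighbours, the number of tag combinations to consider grows quickly, which is the explosion mentioned above.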
7
Applications
Used for word sense disambiguation:
Oil well in Mexico is used.
Oil is used well in Mexico.
For stemming and lemmatization
Important for matching in information retrieval
Greatly speeds up syntactic analysis
Tagging is local
No need to process the whole sentence to find that a
certain tag is incorrect
8
How: Parsing?
We can find all the syntactic structures
Only the correct variants will enter the syntactic
structure
will + Vinf form a syntactic unit
will + Vpa do not
Problems
Computationally expensive
What to do with ambiguities?
• rice flies like sand
• Depends on what you need
9
Statistical tagger
Example: TnT tagger
Based on Hidden Markov Model (HMM)
Idea:
Some words are more probable after some other words
Find these probabilities
Guess the word if you know the nearby ones
Problem:
Letter strings denote meanings
In “x is more probable after y”, x and y are meanings, not strings
so guess what you cannot see: meanings
10
Hidden Markov Model: Idea
A system changes its state
What a person thinks
Random... but not completely (how?)
In each state, it emits an output
What he says when he thinks something
Random... but somehow (?) depends on what he thinks
We know the sequence of produced outputs
Text: we can see it!
Guess what were the underlying states
Hidden: we cannot see them
11
Hidden Markov Model: Hypotheses
A finite set of states: q1 ... qN (invisible)
POS and grammatical characteristics (language)
A finite set of observations: v1 ... vM
Strings we see in the corpus (language)
A random sequence of states xi
POS in the
Probabilities of state transitions P(xi+1| xi)
Language rules and use
Probabilities of observations P(vk| xi)
words expressing the meanings: Vinf: ask, V3: asks
12
Hidden Markov Model: Problem
The same observation can correspond to different meanings
Vinf: read, Vpp: read
Looking at what we can see, guess what we cannot
This is why the model is called hidden
Given a sequence of observations oi
The text: sequence of letter strings. Training set
Guess the sequence of states xi
The POS of each word
Our hypotheses on xi depend on each other
Highly combinatorial task
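A standard way to handle this combinatorial task is dynamic programming with the Viterbi algorithm, which the slides do not name. The sketch below finds the most probable state sequence for a toy model. All probabilities are invented for illustration, and the tag Pron is an assumed placeholder.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[i][s] = (probability of the best path ending in state s at position i,
    #               previous state on that path)
    best = [
        {s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}
    ]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Toy model; every number here is an assumption made for the example.
STATES = ["Pron", "Ns", "Va", "Vinf", "Vpa"]
START_P = {"Pron": 0.6, "Ns": 0.4, "Va": 0.0, "Vinf": 0.0, "Vpa": 0.0}
TRANS_P = {
    "Pron": {"Va": 0.5, "Vpa": 0.5},
    "Ns": {"Va": 0.5, "Vpa": 0.5},
    "Va": {"Vinf": 1.0},
    "Vinf": {},
    "Vpa": {},
}
EMIT_P = {
    "Pron": {"he": 1.0},
    "Ns": {"will": 1.0},
    "Va": {"will": 1.0},
    "Vinf": {"read": 1.0},
    "Vpa": {"read": 1.0},
}

tags = viterbi(["he", "will", "read"], STATES, START_P, TRANS_P, EMIT_P)
```

The cost grows linearly with sentence length and quadratically with the number of states, instead of exponentially as in a naive enumeration of all tag sequences.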
13
Hidden Markov Model: Solutions
Need to find the parameters of the model:
P(xi+1| xi)
P(vk| xi)
In an optimal way: to maximize the probability of
generating this specific output
Optimization methods from Operations Research are
used
More details? Not so simple...
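When a hand-tagged training corpus is available, a simpler alternative to the optimization mentioned above is to estimate both tables by relative frequency. The sketch below shows only this supervised shortcut, not the optimization procedure itself. The corpus, the tag Pron, and the function name estimate_hmm are assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sentences):
    """Estimate P(next_tag | tag) and P(word | tag) by relative frequency."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            emit_counts[tag][word.lower()] += 1
        for (_, tag), (_, next_tag) in zip(sentence, sentence[1:]):
            trans_counts[tag][next_tag] += 1

    def normalize(counts):
        return {
            key: {item: n / sum(c.values()) for item, n in c.items()}
            for key, c in counts.items()
        }

    return normalize(trans_counts), normalize(emit_counts)


# Toy tagged corpus, invented for illustration.
CORPUS = [
    [("He", "Pron"), ("will", "Va"), ("read", "Vinf")],
    [("He", "Pron"), ("read", "Vpa")],
]

trans_p, emit_p = estimate_hmm(CORPUS)
```

Unseen words and unseen tag pairs get no probability at all here, so a real tagger would need some form of smoothing.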
14
Brill Tagger (rule-based)
Eric Brill
Makes an initial assumption about
POS tags in the text
Uses context-dependent rewriting
rules to correct some tags
Applies them iteratively
Learns the rules from a training corpus
The rules are in human-understandable form
You can correct them manually to improve the tagger
Unlike HMMs, whose parameters are not human-understandable
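The tagging side of this scheme can be sketched as follows. The lexicon of initial guesses and the two rules are written by hand here purely for illustration; in the real tagger the rules are learned from a training corpus, and that learning step is not shown. The tags Pron and Det are assumed placeholders.

```python
# Hypothetical lexicon giving the initial guess for each word.
INITIAL_TAG = {"he": "Pron", "will": "Ns", "read": "Vpa", "the": "Det", "book": "Ns"}

# Context-dependent rewriting rules: (from_tag, to_tag, required_previous_tag).
RULES = [
    ("Ns", "Va", "Pron"),   # noun becomes auxiliary after a pronoun
    ("Vpa", "Vinf", "Va"),  # past becomes infinitive after an auxiliary
]


def brill_tag(words, initial_tag, rules, default="Ns"):
    """Tag with the initial guess, then apply each rule in order over the text."""
    tags = [initial_tag.get(w.lower(), default) for w in words]
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))


tagged = brill_tag("He will read the book".split(), INITIAL_TAG, RULES)
```

Because each rule is a readable triple, a wrong correction can be traced to a specific rule and edited by hand.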
15
Word Sense Disambiguation
Query: international bank in Seoul
Bank: financial institution, river shore, place to store something, ...
한: Korean, superior, 한상용, ...
원: $, official, ...
...
Hotel located at the beautiful bank of Han river.
Relevant for the query?
POS is the same. Tagger will not distinguish them
16
Applications
Translation
대원군 -> Great Governor of the Court
만원 -> 10 thousand won
international bank -> banco internacional
river bank -> orilla del río
Information retrieval
Document retrieval: is it really useful? Same info
Passage retrieval: can prove very useful!
Semantic analysis
17
Representation of word senses
1. Explanations. Semantic dictionaries
Bank1 is an institution to keep money
Bank2 is a sloping edge of a river
2. Synsets and ontology: WordNet (HanNet: Chinese)
Synonyms: {bank, shore}
WordNet terminology: synset #12345
Corresponds to all ways to call a concept
Relationships:
#12345 IS_PART_OF #67890 {river, stream}
#987 IS_A #654 {institution, organization}
WordNet also has glosses
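A minimal sketch of how such synsets, relationships, and glosses could be stored. The synset numbers are the illustrative ones from the slide, not real WordNet identifiers. The Synset class, the glosses attached to each number, and the assumption that #987 is the 'institution' sense of bank are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    id: int
    words: frozenset
    gloss: str = ""
    # Relation name -> list of target synset ids.
    relations: dict = field(default_factory=dict)


SYNSETS = {
    12345: Synset(
        12345,
        frozenset({"bank", "shore"}),
        "sloping edge of a river",
        {"IS_PART_OF": [67890]},
    ),
    67890: Synset(67890, frozenset({"river", "stream"})),
    # Assumption: #987 is the 'institution' sense of "bank".
    987: Synset(
        987, frozenset({"bank"}), "institution to keep money", {"IS_A": [654]}
    ),
    654: Synset(654, frozenset({"institution", "organization"})),
}


def senses_of(word, synsets):
    """Return the ids of all synsets that contain the word."""
    return sorted(s.id for s in synsets.values() if word in s.words)


bank_senses = senses_of("bank", SYNSETS)
```

An ambiguous word is simply one that belongs to more than one synset.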
18
Task
Given a text (probably POS-tagged)
Tag each word with its synset number #123 or
dictionary number bank1
Input:
Mary keeps the money in a bank.
Han river’s bank is beautiful.
Output:
Mary keeps<1> the money<1> in a bank<1>.
Han river’s bank<2> is beautiful.
19
Lesk algorithm
Michael Lesk
Explanatory dictionary
Bank1 is an institution to keep money
Bank2 is a sloping edge of a river
Mary keeps her money (savings) in a bank.
Choose the sense which has more words in common
with immediate context
Improvements (Pedersen, Gelbukh & Sidorov)
Use synonyms when no direct matches
Use synonyms of synonyms, ...
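A simplified sketch of the basic overlap idea, without the synonym-based improvements. The stop-word list and the function names are assumptions. No morphological analysis is applied, so "keeps" does not match "keep" here.

```python
import re

# Small hand-made stop-word list, assumed for this example.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in", "her", "s"}

DEFINITIONS = {
    "bank1": "an institution to keep money",
    "bank2": "a sloping edge of a river",
}


def content_words(text):
    """Lower-cased words of the text without stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def lesk(context, definitions):
    """Choose the sense whose definition shares the most words with the context."""
    ctx = content_words(context)
    return max(
        definitions,
        key=lambda sense: len(ctx & content_words(definitions[sense])),
    )


sense_a = lesk("Mary keeps her money (savings) in a bank.", DEFINITIONS)
sense_b = lesk("Han river’s bank is beautiful.", DEFINITIONS)
```

In the first sentence the only shared word is "money", in the second it is "river". When no word is shared, this sketch falls back to the first sense listed, which is the situation where the synonym-based improvements help.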
20
Other word relatedness measures
Lexical chains in WordNet
The length of the path in the graph of relationships
Mutual information: frequent co-occurrences
Collocations (Bolshakov & Gelbukh)
Keep in bank1
Bank2 of river
Very large dictionary of such combinations
Number of words in common between explanations
Recursive: common words or related words
(Gelbukh & Sidorov)
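The path-length measure can be sketched as a breadth-first search over the graph of relationships. The edges below are invented for illustration and are not taken from WordNet. The function name path_length is an assumption.

```python
from collections import defaultdict, deque

# Toy undirected relationship graph, invented for this example.
EDGES = [
    ("bank1", "institution"),
    ("institution", "money"),
    ("bank2", "shore"),
    ("shore", "river"),
    ("river", "water"),
]

GRAPH = defaultdict(set)
for a, b in EDGES:
    GRAPH[a].add(b)
    GRAPH[b].add(a)


def path_length(graph, start, goal):
    """Number of edges on the shortest path, or None if the nodes are not connected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


d1 = path_length(GRAPH, "bank1", "money")
d2 = path_length(GRAPH, "bank2", "money")
```

A shorter path means closer relatedness. In this toy graph bank1 is two steps from "money", while bank2 is not connected to it at all.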
21
Other methods
Hidden Markov Models
Logical reasoning
22
Yarowsky’s Principles
David Yarowsky
One sense per text!
One sense per collocation
I keep my money in the bank1. This is an international bank1
with a great capital. The bank2 is located near Han river.
3 words vote for ‘institution’, one for ‘shore’
Institution!
bank1 is located near Han river.
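The voting in this example can be sketched as counting the clue words in the whole text and assigning the winning sense to every occurrence. The clue table is hand-made for the example; how the clues are obtained is not shown.

```python
import re
from collections import Counter

# Hypothetical clue words and the sense each one votes for.
CLUES = {
    "money": "bank1",
    "international": "bank1",
    "capital": "bank1",
    "river": "bank2",
}


def vote_sense(text, clues):
    """Return the sense with the most votes in the text, plus the vote counts."""
    words = re.findall(r"[a-z]+", text.lower())
    votes = Counter(clues[w] for w in words if w in clues)
    if not votes:
        return None, votes
    return votes.most_common(1)[0][0], votes


TEXT = (
    "I keep my money in the bank. This is an international bank "
    "with a great capital. The bank is located near Han river."
)
winner, votes = vote_sense(TEXT, CLUES)
```

Three clue words vote for bank1 and one for bank2, so all three occurrences of "bank" receive the sense bank1.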
23
Anaphora resolution
Mainly pronouns.
Also co-reference: when do two words refer to the same thing?
John took cake from the table and ate it.
John took cake from the table and washed it.
Translation into Spanish: la ‘she’ table / lo ‘he’ cake
Methods:
Dictionaries
Different sources of evidence
Logical reasoning
24
Applications
Translation
Information retrieval:
Can improve frequency counts (?)
Passage retrieval: can be very important
25
Mitkov’s knowledge-poor method
Ruslan Mitkov
Rule-based and statistical approach
Uses simple information on POS and general word
classes
Combines different sources of evidence
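The general idea of combining simple sources of evidence can be sketched as scoring each candidate antecedent and taking the best one. The two indicators, their weights, and the word-class table below are invented for illustration and are not Mitkov's actual indicators. The sentences are the cake and table examples from the anaphora resolution slide.

```python
# Hypothetical general word classes and verb preferences.
WORD_CLASS = {"John": "person", "cake": "food", "table": "furniture"}
VERB_PREFERS = {"ate": {"food"}, "washed": {"furniture"}}


def resolve_it(candidates, verb):
    """Pick the antecedent of 'it' with the highest combined evidence score."""

    def score(candidate):
        word_class = WORD_CLASS.get(candidate)
        total = 0
        if word_class != "person":
            total += 1  # "it" rarely refers to a person
        if word_class in VERB_PREFERS.get(verb, set()):
            total += 2  # the verb prefers objects of this class
        return total

    return max(candidates, key=score)


# John took cake from the table and ate it / washed it.
ate = resolve_it(["John", "cake", "table"], "ate")
washed = resolve_it(["John", "cake", "table"], "washed")
```

No parsing or deep semantics is involved: only word classes and a small preference table decide between "cake" and "table".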
26
Hidden Anaphora
John bought a house. The kitchen is big.
= that house’s kitchen
John was eating. The food was delicious.
= “that eating” ’s food
John was buried. The widow was mad with grief.
= “that burying” ’s death’s widow
Intersection of scenarios of the concepts
(Gelbukh & Sidorov)
house has a kitchen
burying results from death & widow results from death
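The intersection idea can be sketched with sets. The scenario dictionary below is invented for the example, and the function name find_link is an assumption. A link is either direct (the noun belongs to the antecedent's scenario) or goes through a concept shared by both scenarios.

```python
# Hypothetical scenarios: concepts typically associated with each word.
SCENARIOS = {
    "house": {"kitchen", "roof", "door"},
    "eating": {"food", "table"},
    "burying": {"death", "grave"},
    "widow": {"death", "husband"},
}


def find_link(antecedent, noun, scenarios):
    """Return ('direct', {noun}), ('shared', common concepts), or None."""
    a = scenarios.get(antecedent, set())
    n = scenarios.get(noun, set())
    if noun in a:
        return ("direct", {noun})
    common = a & n
    return ("shared", common) if common else None


link1 = find_link("house", "kitchen", SCENARIOS)
link2 = find_link("burying", "widow", SCENARIOS)
link3 = find_link("house", "widow", SCENARIOS)
```

Here "kitchen" links to "house" directly, "widow" links to "burying" through the shared concept "death", and "widow" has no link to "house".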
27
Evaluation
Senseval and TREC international competitions
Korean track available
Human annotated corpus
Very expensive
Inter-annotator agreement is often low!
A program cannot do what humans cannot do
Apply the program and compare with the corpus
Accuracy
Sometimes the program cannot tag a word
Precision, recall
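When the program may leave some words untagged, precision and recall can be computed as below. This follows the convention that precision is measured over the attempted words and recall over all words in the annotated corpus. The data are invented for the example, with None marking a word the program did not tag.

```python
def evaluate(gold, predicted):
    """Return (precision, recall); predicted holds None where the program abstained."""
    attempted = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in attempted if g == p)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


GOLD = ["bank1", "bank2", "bank1", "bank1"]
PREDICTED = ["bank1", None, "bank2", "bank1"]

precision, recall = evaluate(GOLD, PREDICTED)
```

Here three words are attempted and two are correct, so precision is 2/3 and recall is 2/4. If the program tags every word, both values coincide with accuracy.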
28
Research topics
Too many to list
New methods
Lexical resources (dictionaries)
= Computational linguistics
29
Conclusions
Tagging, word sense disambiguation, and
anaphora resolution are cases of disambiguation of
meaning
Useful in translation, information retrieval, and text
understanding
Dictionary-based methods
good but expensive
Statistical methods
cheap and sometimes imperfect... but not always (if very
large corpora are available)
30
Thank you!
Till May 31? June 1?
6 pm
31