Transcript Slide 1

Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 9:
Natural Language Processing and IR.
Tagging, WSD, and Anaphora Resolution
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
 Reducing synonyms can help IR
 Better matching
 Ontologies are used. WordNet
 Morphology is a variant of synonymy
 widely used in IR systems
 Precise analysis: dictionary-based analyzers
 Quick-and-dirty analysis: stemmers
 Rule-based stemmers. Porter stemmer
 Statistical stemmers
2
Previous Chapter: Research topics
 Construction and application of ontologies
 Building of morphological dictionaries
 Treatment of unknown words with morphological
analyzers
 Development of better stemmers
 Statistical stemmers?
3
Contents
 Tagging:
for each word, determine its POS (Part of Speech:
noun, ...) and grammatical characteristics
 WSD (Word Sense Disambiguation):
for each word, determine which homonym is used
 Anaphora resolution:
For a pronoun (it, ...), determine what it refers to
4
Tagging: The problem
 Ambiguity of parts of speech
 rice flies like sand
 = insects living in rice consider sand good?
 = rice can fly similarly to sand?
 = ... insects of a container with rice...?
 We can fly like sand ... We think fly like sand...
 Ambiguity of grammatical characteristics
 He has read the book
 He will read the book... He read the book
 Very frequent phenomenon, at nearly every word!
5
Tagger...
 A program that looks at the context and decides what
the part of speech (and other characteristics) is
 Input:
 He will read the book
 Morphological analysis
 He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...>
Tags: Ns = noun singular, Va = verb auxiliary, Vpa = verb past,
Vpp = verb past participle, Vinf = verb infinitive, ...
6
...Tagger
 Input of tagger
 He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...>
 Task: Choose one!
 Output:
 He<...> will<Va> read<Vinf> the<...>
 How do we do it?
 He will<Ns> is not possible → Va
 will<Va> read → Vinf
 This is simple, but imagine that He is also ambiguous... combinatorial explosion
7
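To make the choice step concrete, here is a minimal sketch in Python of a hand-written rule tagger for the slide's example. It is not the actual tagger discussed in the lecture; the Pron and Det tags for He and the stand in for the slide's <...> placeholders.

# A minimal sketch of rule-based tag disambiguation for the slide's example.
# Tag names follow the slide (Ns, Va, Vpa, Vpp, Vinf); Pron and Det are assumed
# placeholders for the "<...>" tags of "He" and "the".

def disambiguate(tokens):
    """tokens: list of (word, set of candidate tags); returns list of (word, tag)."""
    result = []
    for i, (word, tags) in enumerate(tokens):
        chosen = tags
        # Rule 1: "He will<Ns>" is not possible, so "will" after a pronoun is Va.
        if word == "will" and {"Ns", "Va"} <= tags and i > 0 and result[i - 1][1] == "Pron":
            chosen = {"Va"}
        # Rule 2: a verb right after the auxiliary "will" must be an infinitive.
        if i > 0 and result[i - 1] == ("will", "Va") and "Vinf" in tags:
            chosen = {"Vinf"}
        result.append((word, sorted(chosen)[0]))
    return result

# He<Pron> will<Ns|Va> read<Vpa|Vpp|Vinf> the<Det> book<Ns>
sentence = [("He", {"Pron"}), ("will", {"Ns", "Va"}),
            ("read", {"Vpa", "Vpp", "Vinf"}), ("the", {"Det"}), ("book", {"Ns"})]
print(disambiguate(sentence))
# [('He', 'Pron'), ('will', 'Va'), ('read', 'Vinf'), ('the', 'Det'), ('book', 'Ns')]

The explosion mentioned above appears as soon as several neighboring words each keep more than one candidate; that is what the statistical and rule-learning taggers below address.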
Applications
 Used for word sense disambiguation:
 Oil well in Mexico is used.
 Oil is used well in Mexico.
 For stemming and lemmatization
 Important for matching in information retrieval
 Greatly speeds up syntactic analysis
 Tagging is local
 No need to process the whole sentence to find that a
certain tag is incorrect
8
How: Parsing?
 We can find all the syntactic structures
 Only the correct variants will enter the syntactic
structure
 will + Vinf form a syntactic unit
 will + Vpa does not
 Problems
 Computationally expensive
 What to do with ambiguities?
• rice flies like sand
• Depends on what you need
9
Statistical tagger
 Example: TnT tagger
 Based on Hidden Markov Model (HMM)
 Idea:
 Some words are more probable after some other words
 Find these probabilities
 Guess the word if you know the nearby ones
 Problem:
 Letter strings denote meanings
 in “x is more probable after y”, x and y are meanings, not strings
 so we must guess what we cannot see: the meanings
10
Hidden Markov Model: Idea
 A system changes its state
 What a person thinks
 Random... but not completely (how?)
 In each state, it emits an output
 What he says when he thinks something
 Random... but somehow (?) depends on what he thinks
 We know the sequence of produced outputs
 Text: we can see it!
 Guess what were the underlying states
 Hidden: we cannot see them
11
Hidden Markov Model: Hypotheses
 A finite set of states: q1 ... qN (invisible)
 POS and grammatical characteristics (language)
 A finite set of observations: v1 ... vM
 Strings we see in the corpus (language)
 A random sequence of states xi
 POS tags in the text
 Probabilities of state transitions P(xi+1| xi)
 Language rules and use
 Probabilities of observations P(vk| xi)
 words expressing the meanings: Vinf: ask, V3: asks
12
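To make the two probability tables concrete, here is a small Python sketch; the tag set and all numeric values are invented for illustration only.

# A sketch of the two HMM parameter tables listed above.
# Tags and all probability values are invented for illustration.

states = ["Va", "Vinf", "Ns"]              # hidden states: POS tags
observations = ["will", "read", "book"]    # visible outputs: letter strings

# P(x_{i+1} | x_i): probability of the next tag given the current tag
transition = {
    "Va":   {"Vinf": 0.8, "Ns": 0.2},
    "Vinf": {"Ns": 0.6, "Va": 0.4},
    "Ns":   {"Va": 0.5, "Vinf": 0.5},
}

# P(v_k | x_i): probability that a tag is expressed by a given word string
emission = {
    "Va":   {"will": 0.9, "read": 0.05, "book": 0.05},
    "Vinf": {"read": 0.7, "book": 0.2, "will": 0.1},
    "Ns":   {"book": 0.6, "will": 0.3, "read": 0.1},
}

# probability of generating "will read" from the tag sequence Va, Vinf:
print(emission["Va"]["will"] * transition["Va"]["Vinf"] * emission["Vinf"]["read"])  # 0.504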
Hidden Markov Model: Problem
 The same observation can correspond to different meanings
 Vinf: read, Vpp: read
 Looking at what we can see, guess what we cannot
 This is why hidden
 Given a sequence of observations oi
 The text: sequence of letter strings. Training set
 Guess the sequence of states xi
 The POS of each word
 Our hypotheses on xi depend on each other
 Highly combinatorial task
13
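The guessing problem above is solved exactly by the standard Viterbi algorithm; a sketch follows, reusing the invented toy tables from the previous sketch (start probabilities are also invented) so that it runs on its own.

# A sketch of Viterbi decoding: given the observed words, recover the most
# probable hidden tag sequence. All tables are invented toy values.

states = ["Va", "Vinf", "Ns"]
start = {"Va": 0.2, "Vinf": 0.1, "Ns": 0.7}
transition = {"Va": {"Vinf": 0.8, "Ns": 0.2},
              "Vinf": {"Ns": 0.6, "Va": 0.4},
              "Ns": {"Va": 0.5, "Vinf": 0.5}}
emission = {"Va": {"will": 0.9, "read": 0.05, "book": 0.05},
            "Vinf": {"read": 0.7, "book": 0.2, "will": 0.1},
            "Ns": {"book": 0.6, "will": 0.3, "read": 0.1}}

def viterbi(words):
    # best[s] = (probability, tag path) of the best path ending in state s
    best = {s: (start[s] * emission[s][words[0]], [s]) for s in states}
    for w in words[1:]:
        new_best = {}
        for s in states:
            candidates = [(p * transition[prev].get(s, 0.0) * emission[s][w], path + [s])
                          for prev, (p, path) in best.items()]
            new_best[s] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])

print(viterbi(["will", "read", "book"]))  # (0.036..., ['Va', 'Vinf', 'Ns'])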
Hidden Markov Model: Solutions
 Need to find the parameters of the model:
 P(xi+1| xi)
 P(vk| xi)
 In the optimal way: to maximize the probability of
generating this specific output
 Optimization methods from Operations Research are
used
 More details? Not so simple...
14
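The slide leaves the training procedure open; the simplest case is supervised estimation by relative-frequency counting over a hand-tagged corpus, sketched below with an invented two-sentence corpus. Training on untagged text (e.g. with Baum-Welch/EM optimization) is the harder problem the slide alludes to.

# A sketch of the simplest way to estimate the parameters: relative-frequency
# counts over a small hand-tagged corpus (toy data, invented for illustration).
from collections import Counter, defaultdict

tagged_corpus = [
    [("he", "Pron"), ("will", "Va"), ("read", "Vinf"), ("the", "Det"), ("book", "Ns")],
    [("he", "Pron"), ("read", "Vpa"), ("the", "Det"), ("book", "Ns")],
]

trans_counts = defaultdict(Counter)   # counts behind P(x_{i+1} | x_i)
emit_counts = defaultdict(Counter)    # counts behind P(v_k | x_i)
for sentence in tagged_corpus:
    for i, (word, tag) in enumerate(sentence):
        emit_counts[tag][word] += 1
        if i + 1 < len(sentence):
            trans_counts[tag][sentence[i + 1][1]] += 1

def normalize(table):
    return {tag: {k: v / sum(c.values()) for k, v in c.items()} for tag, c in table.items()}

transition = normalize(trans_counts)
emission = normalize(emit_counts)
print(transition["Pron"])   # {'Va': 0.5, 'Vpa': 0.5} in this toy corpus
print(emission["Vinf"])     # {'read': 1.0}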
Brill Tagger (rule-based)
 Eric Brill
 Makes an initial assumption about
POS tags in the text
 Uses context-dependent rewriting
rules to correct some tags
 Applies them iteratively
 Learns the rules from a training corpus
 The rules are in human-understandable form
 You can correct them manually to improve the tagger
 Unlike HMMs, whose parameters are not human-understandable
15
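A sketch of the Brill scheme described above: a naive initial assignment followed by human-readable, context-dependent rewriting rules applied in order. The two rules and the tiny lexicon are invented for illustration; they are not the rules Brill's tagger actually learns.

# A sketch of the Brill scheme: naive initial tags, then context-dependent
# rewriting rules applied iteratively. Rules and lexicon invented for illustration.

def initial_tags(words, most_frequent_tag):
    # Step 1: initial assumption = each word's most frequent tag
    return [most_frequent_tag.get(w, "Ns") for w in words]

# Step 2: learned rules "change tag A to B if the previous tag is C",
# readable (and correctable) by a human
rules = [
    {"from": "Ns", "to": "Va", "if_prev": "Pron"},    # "will" after a pronoun is the auxiliary
    {"from": "Vpa", "to": "Vinf", "if_prev": "Va"},   # a verb after auxiliary "will" is an infinitive
]

def brill_tag(words, most_frequent_tag):
    tags = initial_tags(words, most_frequent_tag)
    for rule in rules:                    # apply the rules in order, each over the whole text
        for i in range(1, len(tags)):
            if tags[i] == rule["from"] and tags[i - 1] == rule["if_prev"]:
                tags[i] = rule["to"]
    return list(zip(words, tags))

lexicon = {"he": "Pron", "will": "Ns", "read": "Vpa", "the": "Det", "book": "Ns"}
print(brill_tag(["he", "will", "read", "the", "book"], lexicon))
# [('he', 'Pron'), ('will', 'Va'), ('read', 'Vinf'), ('the', 'Det'), ('book', 'Ns')]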
Word Sense Disambiguation
 Query: international bank in Seoul
 Bank:
 financial institution
 river shore
 place to store something
 ...
 Korean 한:
 Korean
 superior
 한상용
 ...
 Korean 원:
 $
 official
 ...
 Hotel located at the beautiful bank of Han river.
 Relevant for the query?
 POS is the same. Tagger will not distinguish them
16
Applications
 Translation
 international bank → banco internacional
 river bank → orilla del río
 대원군 → Great Governor of the Court
 만원 → 10 thousand won
 Information retrieval
 Document retrieval: is it really useful? Same info
 Passage retrieval: can prove very useful!
 Semantic analysis
17
Representation of word senses
1. Explanations. Semantic dictionaries
 Bank1 is an institution to keep money
 Bank2 is a sloping edge of a river
2. Synsets and ontology: WordNet (HowNet for Chinese)
 Synonyms: {bank, shore}
• WordNet terminology: synset #12345
• Corresponds to all ways to call a concept
 Relationships:
#12345 IS_PART_OF #67890 {river, stream}
#987 IS_A #654 {institution, organization}
 WordNet also has glosses
18
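If WordNet is installed locally, the synsets, glosses, and relationships listed above can be inspected programmatically; here is a sketch using NLTK's WordNet interface (assumes the nltk package and its wordnet data have been installed).

# A sketch of browsing the WordNet sense inventory for "bank" with NLTK.
# Assumes: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())            # the gloss
    print("   lemmas:", [l.name() for l in synset.lemmas()])  # all ways to call the concept
    print("   is-a:", [h.name() for h in synset.hypernyms()]) # IS_A relationships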
Task
 Given a text (probably POS-tagged)
 Tag each word with its synset number #123 or
dictionary number bank1
 Input:
 Mary keeps the money in a bank.
 Han river’s bank is beautiful.
 Output
 Mary keeps<1> the money<1> in a bank<1>
 Han river’s bank<2> is beautiful.
19
Lesk algorithm
 Michael Lesk
 Explanatory dictionary
 Bank1 is an institution to keep money
 Bank2 is a sloping edge of a river
 Mary keeps her money (savings) in a bank.
 Choose the sense whose definition has more words in
common with the immediate context
 Improvements (Pedersen, Gelbukh & Sidorov)
 Use synonyms when no direct matches
 Use synonyms of synonyms, ...
20
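A minimal sketch of the Lesk overlap count described above; the two-entry dictionary and stop-word list are simplified for illustration, and the synonym-based improvements are not included.

# A minimal sketch of the Lesk idea: pick the sense whose dictionary definition
# shares the most words with the immediate context. Toy dictionary for illustration.

definitions = {
    "bank1": "an institution to keep money",
    "bank2": "a sloping edge of a river",
}
STOP = {"a", "an", "the", "of", "to", "in", "is", "her"}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOP}

def lesk(context_sentence):
    context = content_words(context_sentence)
    return max(definitions, key=lambda s: len(context & content_words(definitions[s])))

print(lesk("Mary keeps her money savings in a bank"))     # -> bank1
print(lesk("Hotel at the beautiful bank of Han river"))   # -> bank2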
Other word relatedness measures
 Lexical chains in WordNet
 The length of the path in the graph of relationships
 Mutual information: frequent co-occurrences
 Collocations (Bolshakov & Gelbukh)
 Keep in bank1
 Bank2 of river
 Very large dictionary of such combinations
 Number of words in common between explanations
 Recursive: common words or related words
(Gelbukh & Sidorov)
21
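For the mutual-information measure mentioned in the list above, here is a sketch of pointwise mutual information computed from co-occurrence counts; all counts are invented for illustration.

# A sketch of pointwise mutual information (PMI) for word relatedness.
# All counts are invented for illustration.
import math

total_windows = 100_000     # context windows in the corpus
count_keep = 500            # windows containing "keep"
count_bank = 400            # windows containing "bank"
count_both = 50             # windows containing both

p_keep = count_keep / total_windows
p_bank = count_bank / total_windows
p_both = count_both / total_windows

pmi = math.log2(p_both / (p_keep * p_bank))
print(round(pmi, 2))        # ~4.64: the pair co-occurs far more often than by chance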
Other methods
 Hidden Markov Models
 Logical reasoning
22
Yarowsky’s Principles
 David Yarowsky
 One sense per text!
 One sense per collocation
 I keep my money in the bank1. This is an international bank1 with a great capital. The bank2 is located near Han river.
 3 words vote for ‘institution’, one for ‘shore’
 Institution! → The bank1 is located near Han river.
23
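A sketch of the one-sense-per-text vote illustrated above: after a first, possibly noisy local decision for each occurrence, the majority sense in the document overrides the minority ones. The local labels follow the slide's vote counts.

# A sketch of the "one sense per text" vote: the majority sense of a word
# within one document overrides the minority decisions.
from collections import Counter

# local decisions for the occurrences of "bank" in one document
local_senses = ["bank1", "bank1", "bank1", "bank2"]

def one_sense_per_text(senses):
    majority, _ = Counter(senses).most_common(1)[0]
    return [majority] * len(senses)

print(one_sense_per_text(local_senses))   # ['bank1', 'bank1', 'bank1', 'bank1']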
Anaphora resolution
 Mainly pronouns.
 Also co-reference: when do two words refer to the same thing?
 John took cake from the table and ate it.
 John took cake from the table and washed it.
 Translation into Spanish: la ‘she’ = table / lo ‘he’ = cake
 Methods:
 Dictionaries
 Different sources of evidence
 Logical reasoning
24
Applications
 Translation
 Information retrieval:
 Can improve frequency counts (?)
 Passage retrieval: can be very important
25
Mitkov’s knowledge poor method
 Ruslan Mitkov
 A rule-based and statistics-based approach
 Uses simple information on POS and general word
classes
 Combines different sources of evidence
26
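A hedged sketch of the "combine sources of evidence" idea: candidate antecedents are first filtered by simple agreement checks and then ranked by a few weighted indicators. The indicators, weights, and feature names below are invented for illustration; they are not Mitkov's actual indicator set.

# A sketch of knowledge-poor pronoun resolution: filter candidates by agreement,
# then rank them by simple weighted indicators. Indicators and weights are
# invented for illustration (not Mitkov's actual ones).

def resolve(pronoun, candidates):
    def score(c):
        s = 0
        s += 2 if c["is_subject"] else 0   # evidence 1: subjects are preferred
        s -= c["distance"]                 # evidence 2: closer candidates are preferred
        return s
    compatible = [c for c in candidates
                  if c["number"] == pronoun["number"] and c["animate"] == pronoun["animate"]]
    return max(compatible, key=score)["text"] if compatible else None

it = {"text": "it", "number": "sg", "animate": False}
candidates = [
    {"text": "John",      "number": "sg", "animate": True,  "distance": 3, "is_subject": True},
    {"text": "the cake",  "number": "sg", "animate": False, "distance": 2, "is_subject": False},
    {"text": "the table", "number": "sg", "animate": False, "distance": 1, "is_subject": False},
]
print(resolve(it, candidates))   # -> "the table" (closest compatible candidate)

Note that such surface evidence alone cannot separate "ate it" from "washed it" in the examples above; that is exactly where the dictionary- and reasoning-based methods come in.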
Hidden Anaphora
 John bought a house. The kitchen is big.
 = that house’s kitchen
 John was eating. The food was delicious.
 = “that eating” ’s food
 John was buried. The widow was mad with grief.
 “that burying” ’s death’s widow
 Intersection of scenarios of the concepts
(Gelbukh & Sidorov)
 house has a kitchen
 burying results from death & widow results from death
27
Evaluation
 Senseval and TREC international competitions
 Korean track available
 Human annotated corpus
 Very expensive
 Inter-annotator agreement is often low!
 A program cannot do what humans cannot do
 Apply the program and compare with the corpus
 Accuracy
 Sometimes the program cannot tag a word
 Precision, recall
28
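A sketch of the evaluation arithmetic behind accuracy, precision, and recall, under the usual definitions; the gold annotations and program output below are invented for illustration.

# A sketch of corpus-based evaluation: compare program output with the
# human-annotated gold standard. The data is invented for illustration.

gold = {"w1": "bank1", "w2": "bank2", "w3": "bank1", "w4": "bank1"}   # human annotation
output = {"w1": "bank1", "w2": "bank1", "w4": "bank1"}                # w3 left untagged

correct = sum(1 for w, sense in output.items() if gold[w] == sense)

accuracy = correct / len(gold)      # correct answers over all words in the corpus
precision = correct / len(output)   # correct answers over the words the program did tag
recall = correct / len(gold)        # correct answers over the words that should be tagged

print(accuracy, precision, recall)  # 0.5 0.666... 0.5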
Research topics
 Too many to list
 New methods
 Lexical resources (dictionaries)
 = Computational linguistics
29
Conclusions
 Tagging, word sense disambiguation, and
anaphora resolution are cases of disambiguation of
meaning
 Useful in translation, information retrieval, and text
understanding
 Dictionary-based methods
 good but expensive
 Statistical methods
 cheap and sometimes imperfect... but not always (if very
large corpora are available)
30
Thank you!
Till May 31? June 1?
6 pm
31