
Introduction to Language Models
Evaluation in information retrieval
Lecture 4
Last lecture: term weighting

tf.idf term weighting
tf.idf_{w,d} = tf_{w,d} × log(N / df_w)

tf_{w,d} = # of occurrences of word w in doc d (term frequency)
N = number of documents in the collection
df_w = # of docs in the collection that contain w (document frequency)
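As a sketch of the weighting above (a minimal illustration; the natural-log base is an assumption, since the slide does not specify one):

```python
import math

def tf_idf(tf_wd, df_w, N):
    """tf.idf weight: term frequency times inverse document frequency."""
    return tf_wd * math.log(N / df_w)

# A word occurring 3 times in a document, appearing in 5 of 1000 documents:
print(round(tf_idf(3, 5, 1000), 3))  # 3 * ln(200) ≈ 15.895
```

Rare words (low df_w) get a large idf boost; a word appearing in every document gets log(1) = 0 and contributes nothing.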
Last lecture: vector representation

Vector representation
– Binary vector
– Frequency vector
– tf.idf vector
– Each component corresponds to a word
– Sparse vectors (lots of 0 elements)
Last lecture: document similarity

k and s are the vector representations of two documents

sim(k, s) = (k · s) / (|k| |s|) = Σᵢ kᵢsᵢ / (√(Σᵢ kᵢ²) √(Σᵢ sᵢ²))
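The cosine similarity formula can be sketched in Python (a minimal illustration, not code from the lecture):

```python
import math

def cosine_sim(k, s):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(ki * si for ki, si in zip(k, s))
    norm_k = math.sqrt(sum(ki * ki for ki in k))
    norm_s = math.sqrt(sum(si * si for si in s))
    return dot / (norm_k * norm_s)

# Query and document vectors over the dimensions ('fried', 'chicken'):
q = (1, 1)
k = (0, 6)
print(round(cosine_sim(q, k), 4))  # 0.7071
```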
Fried chicken example (p.770)

Query (‘fried chicken’)
– q = (1,1)
Document j (‘fried chicken recipe’)
– j = (3,8)
Document k (‘poached chicken recipe’)
– k = (0,6)

q = (1,1); j = (3,8); k = (0,6)
1 3  1 8
11
sim (q, j ) 

 0.8198
(1  1)(9  81)
180
1 0  1 6
6
sim (q, k ) 

 0.7071
(1  1)(0  36)
72
Corpus representation: a term-by-document matrix

term      Document j   Document k
chicken        8            6
fried          3            0
poached        0            4
recipe         1            1
Document length influence

If term t appears, say, 50 times in a 100-word paper and 80 times in a 5,000-word document, in which is the word more descriptive?
– Maximum tf normalization: divide tf by the maximum tf observed in the document
When computing document similarity
– What happens when one document subsumes the other?
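Maximum tf normalization can be sketched as follows (a minimal illustration using the term counts from the earlier matrix):

```python
def max_tf_normalize(tf_counts):
    """Divide each raw term frequency by the largest tf in the document."""
    max_tf = max(tf_counts.values())
    return {w: tf / max_tf for w, tf in tf_counts.items()}

doc_j = {"chicken": 8, "fried": 3, "recipe": 1}
print(max_tf_normalize(doc_j))  # {'chicken': 1.0, 'fried': 0.375, 'recipe': 0.125}
```

This makes the weights comparable across documents of different lengths: the most frequent term in every document gets weight 1.0.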
Language models: introduction
Next Word Prediction

From a NY Times story...
– Stocks ...
– Stocks plunged this ….
– Stocks plunged this morning, despite a cut in interest rates
– Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
– Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began
Claim

A useful part of the knowledge needed to allow Word Prediction can be captured using simple statistical techniques.
In particular, we'll rely on the notion of the probability of a sequence (of letters, words, …).
Applications

Why do we want to predict a word, given some preceding words?
– Rank the likelihood of sequences containing various alternative hypotheses, e.g. for ASR:
  Theatre owners say popcorn/unicorn sales have doubled...
– Spelling correction
– IR: how likely is a document to generate a query
N-Gram Models of Language

Use the previous N-1 words in a sequence to predict the next word
Language Model (LM)
– unigrams, bigrams, trigrams, …
How do we train these models?
– Very large corpora
Simple N-Grams

Assume a language has T word types in its lexicon; how likely is word x to follow word y?
– Simplest model of word probability: 1/T
– Alternative 1: estimate the likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability)
  popcorn is more likely to occur than unicorn
– Alternative 2: condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, …)
  mythical unicorn is more likely than mythical popcorn

Unigram model
– likely topics

P(w) = count(w) / # tokens

Bigram model
– grammaticality

P(wᵢ | wᵢ₋₁) = count(wᵢ₋₁ wᵢ) / count(wᵢ₋₁)
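A minimal sketch of these maximum-likelihood estimates (the tiny corpus here is illustrative, not from the lecture):

```python
from collections import Counter

tokens = "the mythical unicorn is more likely than the mythical popcorn".split()

# Unigram: P(w) = count(w) / # tokens
unigram_counts = Counter(tokens)
def p_unigram(w):
    return unigram_counts[w] / len(tokens)

# Bigram: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
bigram_counts = Counter(zip(tokens, tokens[1:]))
def p_bigram(w, prev):
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_unigram("the"))             # 2/10 = 0.2
print(p_bigram("mythical", "the"))  # 2/2  = 1.0
```

Both occurrences of "the" are followed by "mythical" in this toy corpus, so the bigram estimate P(mythical | the) is 1.0; real corpora give far less extreme estimates.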
Computing the Probability of a Word Sequence

Compute the product of component conditional probabilities?
– P(the mythical unicorn) = P(the) × P(mythical | the) × P(unicorn | the mythical)
The longer the sequence, the less likely we are to find it in a training corpus
– P(Most biologists and folklore specialists believe that in fact the mythical unicorn horns derived from the narwhal)
Solution: approximate using n-grams
Bigram Model

Approximate P(wₙ | w₁ … wₙ₋₁) by P(wₙ | wₙ₋₁)
– P(unicorn | the mythical) by P(unicorn | mythical)
Markov assumption: the probability of a word depends only on the probability of a limited history
Generalization: the probability of a word depends only on the probability of the n previous words
– trigrams, 4-grams, …
– the higher n is, the more data needed to train
– backoff models…
A Simple Example: bigram model

– P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
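The chain of bigram factors above can be sketched as a running product (the probability values below are made up for illustration, not estimated from a real corpus):

```python
# Hypothetical bigram probabilities (illustrative values only)
bigram_p = {
    ("<start>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "Chinese"): 0.02,
    ("Chinese", "food"): 0.56,
    ("food", "<end>"): 0.10,
}

def sentence_prob(words):
    """Probability of a sentence as the product of its bigram probabilities."""
    padded = ["<start>"] + words + ["<end>"]
    p = 1.0
    for prev, w in zip(padded, padded[1:]):
        p *= bigram_p.get((prev, w), 0.0)  # unseen bigram -> probability 0
    return p

print(sentence_prob("I want to eat Chinese food".split()))
```

Even a short, ordinary sentence ends up with a tiny probability, which is why log probabilities are used in practice to avoid underflow.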
Generating WSJ
Google N-Gram Release

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
Evaluation in information retrieval

How do we know one system is better than another?
How can we tell if a new feature improves performance?
Metrics developed for IR are used in other fields as well
Gold standard/ground truth

Given a user information need, documents in a collection are classified as either relevant or nonrelevant
Relevant = pertinent to the user information need
Information needs are not equivalent to queries
– Information on whether drinking red wine is more effective at reducing your risk of heart attack than white wine
– Pros and cons of low-fat diets for weight control
– Health effects from drinking green tea
Needed for evaluation

Test document collection
Reasonable number of information needs
– At least 50
Relevance judgments
– Practically impossible to get these for every document in the collection
– Usually only for the top-ranked results returned from systems
Accuracy

Problematic measure for IR evaluation
– accuracy = (tp+tn)/(tp+tn+fp+fn)
– 99.9% of the documents will be nonrelevant
– High performance is trivially achieved
Precision
Recall
Precision/Recall trade off

Which is more important depends on the user needs
– Typical web users: high precision in the first page of results
– Paralegals and intelligence analysts: need high recall, and are willing to tolerate some irrelevant documents as a price
F-measure
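The metrics above can be sketched together in Python. The slides give only the accuracy formula; the precision, recall, and balanced F-measure definitions below are the standard ones (precision = tp/(tp+fp), recall = tp/(tp+fn), F as their harmonic mean), assumed rather than quoted from the lecture:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all classification decisions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that are retrieved."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# A system returns 30 documents: 20 relevant (tp), 10 not (fp);
# 40 relevant documents in the collection were missed (fn).
p, r = precision(20, 10), recall(20, 40)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))  # 0.667 0.333 0.444
```

The harmonic mean punishes imbalance: a system with high precision but near-zero recall (or vice versa) gets a low F, unlike the arithmetic mean.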