A Risk Minimization Framework for Information Retrieval


Information Retrieval Models:
Language Models
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Modeling Relevance:
Roadmap for Retrieval Models

(Figure: a taxonomy of retrieval models, organized by how relevance is modeled.)

• Relevance constraints [Fang et al. 04]
• Relevance(Rep(q), Rep(d)) – similarity-based models with different representations & similarity measures: vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89)
• P(r=1|q,d), r ∈ {0,1} – probability of relevance:
  – Regression model (Fuhr 89); learning to rank (Joachims 02, Burges et al. 05)
  – Generative models via doc generation: classical prob. model (Robertson & Sparck Jones, 76); divergence from randomness (Amati & Rijsbergen 02)
  – Generative models via query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• P(d→q) or P(q→d) – probabilistic inference with different inference systems: prob. concept space model (Wong & Yao, 95); inference network model (Turtle & Croft, 91)
Query Generation (→ Language Models for IR)

$$
\begin{aligned}
O(R=1 \mid Q,D) &\propto \frac{P(Q,D \mid R=1)}{P(Q,D \mid R=0)} = \frac{P(Q \mid D,R=1)\,P(D \mid R=1)}{P(Q \mid D,R=0)\,P(D \mid R=0)} \\
&\propto P(Q \mid D,R=1)\,\frac{P(D \mid R=1)}{P(D \mid R=0)} \qquad (\text{assuming } P(Q \mid D,R=0) \approx P(Q \mid R=0))
\end{aligned}
$$

Here P(Q|D,R=1) is the query likelihood and P(D|R=1)/P(D|R=0) acts as a document prior.
Assuming a uniform prior, we have O(R=1|Q,D) ∝ P(Q|D,R=1).

Now, the question is how to compute P(Q|D,R=1). Generally this involves two steps:
(1) estimate a language model based on D;
(2) compute the query likelihood according to the estimated model.

P(Q|D,R=1) is the probability that a user who likes D would pose query Q. How to estimate it?
The Basic LM Approach [Ponte & Croft 98]

(Figure: each document is used to estimate its own language model. A "text mining paper" induces a model with high probabilities for words such as "text", "mining", "association", "clustering" and a very low probability for "food"; a "food nutrition paper" induces a model with high probabilities for "food", "nutrition", "healthy", "diet".)

Query = "data mining algorithms"
Which model would most likely have generated this query?
Ranking Docs by Query Likelihood

(Figure: each document d1, d2, …, dN is used to estimate a document LM θd1, θd2, …, θdN; the query q is then scored against each model by its query likelihood p(q|d1), p(q|d2), …, p(q|dN).)
Modeling Queries: Different Assumptions

• Multi-Bernoulli: modeling word presence/absence
  – q = (x_1, …, x_{|V|}), x_i = 1 for presence of word w_i; x_i = 0 for absence

$$p(q=(x_1,\ldots,x_{|V|}) \mid d) = \prod_{i=1}^{|V|} p(w_i = x_i \mid d) = \prod_{i=1,\,x_i=1}^{|V|} p(w_i=1 \mid d) \prod_{i=1,\,x_i=0}^{|V|} p(w_i=0 \mid d)$$

  – Parameters: {p(w_i=1|d), p(w_i=0|d)}, with p(w_i=1|d) + p(w_i=0|d) = 1

• Multinomial (unigram LM): modeling word frequency
  – q = q_1, …, q_m, where q_j is a query word

$$p(q=q_1\ldots q_m \mid d) = \prod_{j=1}^{m} p(q_j \mid d) = \prod_{i=1}^{|V|} p(w_i \mid d)^{c(w_i,q)}$$

  – c(w_i, q) is the count of word w_i in query q
  – Parameters: {p(w_i|d)}, with p(w_1|d) + … + p(w_{|V|}|d) = 1

[Ponte & Croft 98] uses multi-Bernoulli; most other work uses multinomial.
Multinomial seems to work better [Song & Croft 99, McCallum & Nigam 98, Lavrenko 04].
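A minimal sketch (Python, not part of the original slides) of the two query models above, computing the query probability under each assumption from a toy, already-smoothed document language model; the word probabilities are illustrative only:

```python
import math
from collections import Counter

def multinomial_log_prob(query_words, p_w_d):
    """log p(q|d) under the multinomial (unigram) model: sum of c(w,q) * log p(w|d)."""
    counts = Counter(query_words)
    return sum(c * math.log(p_w_d[w]) for w, c in counts.items())

def bernoulli_log_prob(query_words, p_w_d):
    """log p(q|d) under the multi-Bernoulli model: every vocabulary word contributes
    either log p(w=1|d) (present in q) or log p(w=0|d) (absent from q).
    Here p_w_d[w] is read as the presence probability p(w=1|d)."""
    present = set(query_words)
    return sum(math.log(p_w_d[w]) if w in present else math.log(1.0 - p_w_d[w])
               for w in p_w_d)

# Toy document model over a tiny vocabulary.
p_w_d = {"data": 0.2, "mining": 0.3, "algorithms": 0.1, "food": 0.4}
query = ["data", "mining", "algorithms"]
print(multinomial_log_prob(query, p_w_d), bernoulli_log_prob(query, p_w_d))
```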
Retrieval as LM Estimation

• Document ranking based on query likelihood:

$$\log p(q \mid d) = \sum_{i=1}^{m} \log p(q_i \mid d) = \sum_{i=1}^{|V|} c(w_i, q) \log p(w_i \mid d), \quad \text{where } q = q_1 q_2 \ldots q_m$$

  and p(w_i|d) is the document language model.
• Retrieval problem ⇒ estimation of p(w_i|d)
• Smoothing is an important issue, and distinguishes different approaches
How to Estimate p(w|d)?

• Simplest solution: maximum likelihood estimator
  – p(w|d) = relative frequency of word w in d
  – What if a word doesn't appear in the text? p(w|d) = 0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
• This is what "smoothing" is about…
Language Model Smoothing (Illustration)

(Figure: P(w) plotted over words w, comparing the maximum likelihood estimate

$$p_{ML}(w) = \frac{\text{count of } w}{\text{count of all words}}$$

with a smoothed LM that lowers the probability of seen words and gives unseen words non-zero probability.)
How to Smooth?

• All smoothing methods try to
  – discount the probability of words seen in a document
  – re-allocate the extra probability mass so that unseen words get a non-zero probability

• Method 1, Additive smoothing [Chen & Goodman 98]: add a constant to the counts of each word, e.g., "add one" (Laplace):

$$p(w \mid d) = \frac{c(w,d) + 1}{|d| + |V|}$$

  where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size.
Improve Additive Smoothing

• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words:

$$p(w \mid d) = \begin{cases} p_{DML}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w \mid REF) & \text{otherwise} \end{cases}$$

  where p_DML is the discounted ML estimate, p(w|REF) is the reference language model, and the normalizer α_d (the probability mass reserved for unseen words) is

$$\alpha_d = \frac{1 - \sum_{w \text{ seen in } d} p_{DML}(w \mid d)}{\sum_{w \text{ unseen in } d} p(w \mid REF)}$$
Other Smoothing Methods

• Method 2, Absolute discounting [Ney et al. 94]: subtract a constant δ from the counts of each word:

$$p(w \mid d) = \frac{\max(c(w,d) - \delta,\, 0) + \delta\,|d|_u\, p(w \mid REF)}{|d|}$$

  where |d|_u is the number of unique words in d.

• Method 3, Linear interpolation [Jelinek-Mercer 80]: "shrink" uniformly toward p(w|REF):

$$p(w \mid d) = (1-\lambda)\,\frac{c(w,d)}{|d|} + \lambda\, p(w \mid REF)$$

  where c(w,d)/|d| is the ML estimate and λ is the smoothing parameter.
Other Smoothing Methods (cont.)

• Method 4, Dirichlet prior / Bayesian [MacKay & Peto 95, Zhai & Lafferty 01a, Zhai & Lafferty 02]: assume μ pseudo counts distributed according to p(w|REF), where μ is the smoothing parameter:

$$p(w \mid d) = \frac{c(w,d) + \mu\, p(w \mid REF)}{|d| + \mu} = \frac{|d|}{|d| + \mu}\cdot\frac{c(w,d)}{|d|} + \frac{\mu}{|d| + \mu}\, p(w \mid REF)$$

• Method 5, Good-Turing [Good 53]: assume the total # of unseen events to be n_1 (# of singletons), and adjust the seen events in the same way:

$$p(w \mid d) = \frac{c^*(w,d)}{|d|}; \quad c^*(w,d) = \big(c(w,d)+1\big)\,\frac{n_{c(w,d)+1}}{n_{c(w,d)}}; \quad 0^* = \frac{n_1}{n_0},\ 1^* = \frac{2\,n_2}{n_1},\ \ldots$$

  where n_r is the number of words with count r.
  What if n_{c(w,d)} = 0? What about p(w|REF)? Heuristics are needed.
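A small sketch (Python, not from the slides) of the additive, absolute-discounting, Jelinek-Mercer, and Dirichlet formulas above as plain functions; the default parameter values (delta, lam, mu) are illustrative only:

```python
def additive(c_wd, doc_len, vocab_size, delta=1.0):
    """Additive (add-delta) smoothing: (c(w,d) + delta) / (|d| + delta*|V|)."""
    return (c_wd + delta) / (doc_len + delta * vocab_size)

def absolute_discount(c_wd, doc_len, n_unique, p_w_ref, delta=0.7):
    """Absolute discounting: subtract delta from seen counts, move the mass to REF."""
    return (max(c_wd - delta, 0) + delta * n_unique * p_w_ref) / doc_len

def jelinek_mercer(c_wd, doc_len, p_w_ref, lam=0.1):
    """Linear interpolation of the ML estimate with the reference model."""
    return (1 - lam) * (c_wd / doc_len) + lam * p_w_ref

def dirichlet(c_wd, doc_len, p_w_ref, mu=2000):
    """Dirichlet prior smoothing with mu pseudo counts from the reference model."""
    return (c_wd + mu * p_w_ref) / (doc_len + mu)
```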
Dirichlet Prior Smoothing

• ML estimator: M̂ = argmax_M p(d|M)
• Bayesian estimator:
  – First consider the posterior: p(M|d) = p(d|M)p(M)/p(d)
  – Then consider the mean or mode of the posterior distribution
• p(d|M): sampling distribution (of the data)
• p(M) = p(θ_1, …, θ_N): our prior on the model parameters
• Conjugate prior: the prior can be interpreted as "extra"/"pseudo" data
• The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution:

$$Dir(\theta \mid \alpha_1, \ldots, \alpha_N) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_N)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_N)} \prod_{i=1}^{N} \theta_i^{\alpha_i - 1}$$

  where the α_i = μ p(w_i|REF) play the role of "extra"/"pseudo" word counts.
Dirichlet Prior Smoothing (cont.)

Posterior distribution of parameters:

$$p(\theta \mid d) = Dir(\theta \mid c(w_1)+\alpha_1, \ldots, c(w_N)+\alpha_N)$$

Property: if θ ~ Dir(θ | α), then E(θ_i) = α_i / Σ_i α_i.

The predictive distribution is the same as the mean:

$$p(w_i \mid \hat{\theta}) = \int p(w_i \mid \theta)\, Dir(\theta \mid c(w_1)+\alpha_1, \ldots, c(w_N)+\alpha_N)\, d\theta = \frac{c(w_i) + \alpha_i}{|d| + \sum_{i=1}^{N} \alpha_i} = \frac{c(w_i) + \mu\, p(w_i \mid REF)}{|d| + \mu}$$

This is exactly Dirichlet prior smoothing.
Smoothing with Collection Model Illustrated

(Figure: a (unigram) language model θ is estimated from a document and then smoothed with the collection LM via Jelinek-Mercer or Dirichlet prior.)

Document (total # words = 100): text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1

ML estimates p(w|θ): text 10/100, mining 5/100, association 3/100, database 3/100, query 1/100, network 0/100, …

Collection LM p(w|C): the 0.1, a 0.08, …, computer 0.02, database 0.01, …, text 0.001, network 0.001, mining 0.0009, …
Query Likelihood Retrieval Functions

With the general smoothing scheme, the query likelihood can be written as

$$\log p(q \mid d) = \sum_{w_i \in d,\, w_i \in q} \log \frac{p_{seen}(w_i \mid d)}{\alpha_d\, p(w_i \mid C)} + n \log \alpha_d + \sum_{i=1}^{n} \log p(w_i \mid C), \qquad p(w \mid C) = \frac{c(w,C)}{\sum_{w' \in V} c(w',C)}$$

where n is the query length.

With Jelinek-Mercer (JM), where α_d = λ and the document-independent terms are dropped:

$$S_{JM}(q,d) = \sum_{w \in d,\, w \in q} \log \left[ 1 + \frac{1-\lambda}{\lambda}\cdot\frac{c(w,d)}{|d|\, p(w \mid C)} \right]$$

With Dirichlet prior (DIR), where α_d = μ/(|d|+μ):

$$S_{DIR}(q,d) = \sum_{w \in d,\, w \in q} \log \left[ 1 + \frac{c(w,d)}{\mu\, p(w \mid C)} \right] + n \log \frac{\mu}{|d| + \mu}$$

What assumptions have we made in order to derive these functions?
Do they capture the same retrieval heuristics (TF-IDF, length normalization) as a vector space retrieval function?
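A minimal sketch (Python, not from the slides) of the Dirichlet-smoothed scoring function S_DIR above, ranking a toy two-document collection; the documents, query, and mu value are invented for illustration:

```python
import math
from collections import Counter

def dirichlet_score(query, doc, coll_prob, mu=2000):
    """S_DIR(q,d): sum over query terms matched in the doc of
    log(1 + c(w,d) / (mu * p(w|C))), plus n * log(mu / (|d| + mu))."""
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = sum(math.log(1 + doc_counts[w] / (mu * coll_prob[w]))
                for w in query if w in doc_counts)
    return score + len(query) * math.log(mu / (doc_len + mu))

# Toy collection: two documents plus collection-level word probabilities.
docs = {"d1": "text mining algorithms for text data".split(),
        "d2": "food nutrition and healthy diet".split()}
all_words = [w for d in docs.values() for w in d]
coll_prob = {w: c / len(all_words) for w, c in Counter(all_words).items()}
query = "data mining algorithms".split()
ranked = sorted(docs, key=lambda d: dirichlet_score(query, docs[d], coll_prob, mu=10),
                reverse=True)
print(ranked)  # "d1" should rank first for this query
```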
So, which method is the best?

It depends on the data and the task!
Cross-validation is generally used to choose the best method and/or set the smoothing parameters…
For retrieval, Dirichlet prior performs well…
Backoff smoothing [Katz 87] doesn't work well due to a lack of 2nd-stage smoothing…
Note that many other smoothing methods exist; see [Chen & Goodman 98] and other publications in speech recognition…
Comparison of Three Methods [Zhai & Lafferty 01a]

Average precision by query type:

Method             Title    Long
Jelinek-Mercer     0.228    0.278
Dirichlet          0.256    0.276
Abs. Discounting   0.237    0.260

(Figure: relative precision of JM, Dir., and AD for title vs. long queries.)

Comparison is performed on a variety of test collections.
Understanding Smoothing

The general smoothing scheme:

$$p(w \mid d) = \begin{cases} p_{DML}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w \mid REF) & \text{otherwise} \end{cases}$$

Retrieval formula using the general smoothing scheme:

$$
\begin{aligned}
\log p(q \mid d) &= \sum_{w \in V} c(w,q) \log p(w \mid d) \\
&= \sum_{w \in V,\, c(w,d)>0} c(w,q) \log p_{DML}(w \mid d) + \sum_{w \in V,\, c(w,d)=0} c(w,q) \log \alpha_d\, p(w \mid REF) \\
&= \sum_{w \in V,\, c(w,d)>0} c(w,q) \log \frac{p_{DML}(w \mid d)}{\alpha_d\, p(w \mid REF)} + \sum_{w \in V} c(w,q) \log \alpha_d\, p(w \mid REF) \\
&= \sum_{w \in V,\, c(w,d)>0} c(w,q) \log \frac{p_{DML}(w \mid d)}{\alpha_d\, p(w \mid REF)} + |q| \log \alpha_d + \sum_{w \in V} c(w,q) \log p(w \mid REF)
\end{aligned}
$$

The key rewriting step is pulling the unseen-word part out as a sum over all query words.
Similar rewritings are very common when using LMs for IR…
Smoothing & TF-IDF Weighting [Zhai & Lafferty 01a]

• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

$$\log p(q \mid d) = \sum_{\substack{w \in V,\ c(w,d)>0 \\ c(w,q)>0}} c(w,q) \log \frac{p_{DML}(w \mid d)}{\alpha_d\, p(w \mid REF)} + |q| \log \alpha_d + \sum_{w \in V} c(w,q) \log p(w \mid REF)$$

  – The first sum is over words in both the query and the document: p_DML(w|d) gives TF-like weighting, and 1/p(w|REF) gives IDF-like weighting.
  – |q| log α_d acts as document length normalization (a long doc is expected to have a smaller α_d).
  – The last term is document-independent and can be ignored for ranking.

• Smoothing with p(w|C) ⇒ TF-IDF weighting + length normalization; smoothing implements traditional retrieval heuristics
• LMs with simple smoothing can be computed as efficiently as traditional retrieval models
The Dual-Role of Smoothing [Zhai & Lafferty 02]

(Figure: retrieval precision as a function of the smoothing parameter, shown separately for keyword queries and verbose queries, long and short.)

Why does query type affect smoothing sensitivity?
Another Reason for Smoothing

Query = "the algorithms for data mining" (content words: "algorithms", "data", "mining")

Word          p_DML(w|d1)   p_DML(w|d2)
the           0.04          0.02
algorithms    0.001         0.001
for           0.02          0.01
data          0.002         0.003
mining        0.003         0.004

p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), p("mining"|d1) < p("mining"|d2).
Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)…
So we should make p("the") and p("for") less different for all docs, and smoothing helps achieve this goal.

After smoothing with p(w|d) = 0.1·p_DML(w|d) + 0.9·p(w|REF), we get p(q|d1) < p(q|d2):

Word          p(w|REF)    Smoothed p(w|d1)   Smoothed p(w|d2)
the           0.2         0.184              0.182
algorithms    0.00001     0.000109           0.000109
for           0.2         0.182              0.181
data          0.00001     0.000209           0.000309
mining        0.00001     0.000309           0.000409
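A tiny numerical check (Python, not in the slides) that reproduces this example and confirms that smoothing flips the ranking of d1 and d2:

```python
# Numerical check of the example above.
query = ["the", "algorithms", "for", "data", "mining"]
p_d1 = {"the": 0.04, "algorithms": 0.001, "for": 0.02, "data": 0.002, "mining": 0.003}
p_d2 = {"the": 0.02, "algorithms": 0.001, "for": 0.01, "data": 0.003, "mining": 0.004}
p_ref = {"the": 0.2, "algorithms": 0.00001, "for": 0.2, "data": 0.00001, "mining": 0.00001}

def likelihood(p_wd):
    prob = 1.0
    for w in query:
        prob *= p_wd[w]
    return prob

def smooth(p_wd, lam=0.1):
    # p(w|d) = 0.1 * p_DML(w|d) + 0.9 * p(w|REF), as in the slide
    return {w: lam * p_wd[w] + (1 - lam) * p_ref[w] for w in p_wd}

print(likelihood(p_d1) > likelihood(p_d2))                   # True: unsmoothed, d1 wins
print(likelihood(smooth(p_d1)) < likelihood(smooth(p_d2)))   # True: smoothed, d2 wins
```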
Two-stage Smoothing [Zhai & Lafferty 02]

Stage 1: explain unseen words with a Dirichlet prior (Bayesian), parameter μ, collection LM p(w|C).
Stage 2: explain noise in the query with a 2-component mixture, parameter λ, user background model p(w|U):

$$p(w \mid d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)$$

The user background model p(w|U) can be approximated by p(w|C).
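A small sketch (Python, assumed, not from the slides) of the two-stage smoothed word probability; parameter defaults are illustrative:

```python
def two_stage_prob(c_wd, doc_len, p_w_coll, mu=2000, lam=0.5, p_w_user=None):
    """Two-stage smoothing: Dirichlet prior (stage 1) inside a query-noise
    mixture with the user background model (stage 2).
    p(w|U) is approximated by p(w|C) when no user model is available."""
    if p_w_user is None:
        p_w_user = p_w_coll
    dirichlet = (c_wd + mu * p_w_coll) / (doc_len + mu)
    return (1 - lam) * dirichlet + lam * p_w_user
```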
Estimating μ using leave-one-out [Zhai & Lafferty 02]

Each word w is left out of its document d_i in turn and predicted with the Dirichlet-smoothed model estimated from the rest, giving the leave-one-out log-likelihood

$$\ell_{-1}(\mu \mid C) = \sum_{i=1}^{N} \sum_{w \in V} c(w, d_i) \log \frac{c(w, d_i) - 1 + \mu\, p(w \mid C)}{|d_i| - 1 + \mu}$$

The maximum likelihood estimator

$$\hat{\mu} = \arg\max_{\mu}\ \ell_{-1}(\mu \mid C)$$

can be found with Newton's method.
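A sketch (Python, assumed) of the leave-one-out objective above; a simple grid search stands in for the Newton's method used in the paper, and the collection model is assumed to cover every document word:

```python
import math
from collections import Counter

def loo_log_likelihood(docs, coll_prob, mu):
    """Leave-one-out log-likelihood l_{-1}(mu | C) over a list of tokenized docs."""
    ll = 0.0
    for doc in docs:
        counts, n = Counter(doc), len(doc)
        for w, c in counts.items():
            ll += c * math.log((c - 1 + mu * coll_prob[w]) / (n - 1 + mu))
    return ll

def estimate_mu(docs, coll_prob, grid=(100, 500, 1000, 2000, 4000)):
    """Pick the mu that maximizes the leave-one-out likelihood."""
    return max(grid, key=lambda mu: loo_log_likelihood(docs, coll_prob, mu))
```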
Why would "leave-one-out" work?

20 words by author1:
  abc abc ab c d d abc cd d d abd ab ab ab ab cd d e cd e
20 words by author2:
  abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s

Suppose we keep sampling and get 10 more words. Which author is likely to "write" more new words?

Now, suppose we leave one "e" out…

For author1, μ doesn't have to be big:

$$p_{ml}(\text{"e"} \mid author1) = \frac{1}{19}, \qquad p_{smooth}(\text{"e"} \mid author1) = \frac{20}{20+\mu}\cdot\frac{1}{19} + \frac{\mu}{20+\mu}\, p(\text{"e"} \mid REF)$$

For author2, μ must be big (more smoothing):

$$p_{ml}(\text{"e"} \mid author2) = \frac{0}{19}, \qquad p_{smooth}(\text{"e"} \mid author2) = \frac{20}{20+\mu}\cdot\frac{0}{19} + \frac{\mu}{20+\mu}\, p(\text{"e"} \mid REF)$$

The amount of smoothing is closely related to the underlying vocabulary size.
Estimating λ using Mixture Model [Zhai & Lafferty 02]

Stage 1 gives each document d_i a Dirichlet-smoothed model, with μ̂ estimated in stage 1:

$$p(q_j \mid d_i) = \frac{c(q_j, d_i) + \hat{\mu}\, p(q_j \mid C)}{|d_i| + \hat{\mu}}$$

Stage 2 treats the query Q = q_1 … q_m as generated from a mixture of each document model and the user background model, (1-λ) p(w|d_i) + λ p(w|U), over documents d_1, …, d_N. λ is set by the maximum likelihood estimator, computed with the Expectation-Maximization (EM) algorithm.
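A compact EM sketch (Python, assumed) for the stage-2 mixture weight. It is a simplification of the estimator in the paper: the hidden variable is whether a query word was drawn from the user background model rather than the stage-1 document model, and every query word is assumed to appear in each model dictionary:

```python
def estimate_lambda(query, doc_models, p_w_user, iters=20, lam=0.5):
    """Simplified EM for lambda: maximizes
    sum_{i,j} log[(1-lam) p(q_j|d_i) + lam p(q_j|U)]."""
    for _ in range(iters):
        posteriors = []
        for p_w_d in doc_models:          # one dict p(w|d_i) per document
            for w in query:
                noise = lam * p_w_user[w]
                posteriors.append(noise / (noise + (1 - lam) * p_w_d[w]))
        lam = sum(posteriors) / len(posteriors)   # M-step: average responsibility
    return lam
```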
Automatic 2-stage results ≈ Optimal 1-stage results [Zhai & Lafferty 02]

Average precision (3 DBs × 4 query types, 150 topics); * indicates a significant difference:

Collection      AP88-89                      WSJ87-92                     ZIFF1-2
Query           SK     LK     SV     LV      SK     LK     SV     LV      SK     LK     SV     LV
Optimal-JM      20.3%  36.8%  18.8%  28.8%   19.4%  34.8%  17.2%  27.7%   17.9%  32.6%  15.6%  26.7%
Optimal-Dir     23.0%  37.6%  20.9%  29.8%   22.3%  35.3%  19.6%  28.2%   21.5%  32.6%  18.5%  27.9%
Auto-2stage     22.2%* 37.4%  20.4%  29.2%   21.8%* 35.8%  19.9%  28.8%*  20.0%  32.2%  18.1%  27.9%*

Completely automatic tuning of parameters IS POSSIBLE!
Feedback and Doc/Query Generation

Classic prob. model (document generation):

$$O(R=1 \mid Q,D) \propto \frac{P(D \mid Q, R=1)}{P(D \mid Q, R=0)}$$

Query likelihood ("language model"):

$$O(R=1 \mid Q,D) \propto P(Q \mid D, R=1)$$

(Figure: parameter estimation from relevance judgments such as (q1,d1,1), (q1,d2,1), (q1,d3,1), (q1,d4,0), (q1,d5,0), (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0). P(D|Q,R=1) is a relevant-document model, P(D|Q,R=0) a non-relevant-document model, and P(Q|D,R=1) a "relevant-query" model.)

Initial retrieval:
– query as "relevant doc" vs. doc as "relevant query"
– P(Q|D,R=1) is more accurate

Feedback:
– P(D|Q,R=1) can be improved for the current query and future docs (query-based feedback)
– P(Q|D,R=1) can also be improved, but only for the current doc and future queries (doc-based feedback)
Difficulty in Feedback with Query Likelihood

• Traditional query expansion [Ponte 98, Miller et al. 99, Ng 99]
  – Improvement is reported, but there is a conceptual inconsistency
  – What's an expanded query, a piece of text or a set of terms?
• Avoiding expansion
  – Query term reweighting [Hiemstra 01, Hiemstra 02]
  – Translation models [Berger & Lafferty 99, Jin et al. 02]
  – Only achieves limited feedback
• Doing relevant query expansion instead [Nallapati et al. 03]
• The difficulty is due to the lack of a query/relevance model
• The difficulty can be overcome with alternative ways of using LMs for retrieval (e.g., relevance model [Lavrenko & Croft 01], query model estimation [Lafferty & Zhai 01b; Zhai & Lafferty 01b])
Two Alternative Ways of Using LMs

• Classic probabilistic model: doc generation as opposed to query generation

$$O(R=1 \mid Q,D) \propto \frac{P(D \mid Q, R=1)}{P(D \mid Q, R=0)} \approx \frac{P(D \mid Q, R=1)}{P(D)}$$

  – Natural for relevance feedback
  – Challenge: estimate p(D|Q,R=1) without relevance feedback; the relevance model [Lavrenko & Croft 01] provides a good solution

• Probabilistic distance model: similar to the vector space model, but with LMs as opposed to TF-IDF weight vectors
  – A popular distance function is the Kullback-Leibler (KL) divergence, covering query likelihood as a special case:

$$score(Q,D) = -D(\theta_Q \,\|\, \theta_D), \quad \text{essentially} \sum_{w \in V} p(w \mid \theta_Q) \log p(w \mid \theta_D)$$

  – Retrieval now amounts to estimating query & doc models, and feedback is treated as query LM updating [Lafferty & Zhai 01b; Zhai & Lafferty 01b]

Both methods outperform the basic LM significantly.
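A KL-divergence scoring sketch (Python, assumed, not from the slides), ranking by the cross-entropy form above with a Dirichlet-smoothed document model; the collection model is assumed to cover all query-model words:

```python
import math
from collections import Counter

def kl_score(query_model, doc, coll_prob, mu=2000):
    """Rank-equivalent KL-divergence score: sum_w p(w|theta_Q) * log p(w|theta_D),
    with a Dirichlet-smoothed document model theta_D."""
    counts, n = Counter(doc), len(doc)
    score = 0.0
    for w, p_q in query_model.items():
        p_d = (counts[w] + mu * coll_prob[w]) / (n + mu)
        score += p_q * math.log(p_d)
    return score

# With the ML query model (relative frequencies of the query words), this reduces
# to query likelihood up to a constant; feedback simply replaces query_model with
# an updated (e.g., interpolated) model.
```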
Query Model Estimation [Lafferty & Zhai 01b, Zhai & Lafferty 01b]

• Question: how to estimate a better query model than the ML estimate based on the original query?
• "Massive feedback": improve a query model through co-occurrence patterns learned from
  – a document-term Markov chain that outputs the query [Lafferty & Zhai 01b]
  – thesauri, corpus [Bai et al. 05, Collins-Thompson & Callan 05]
• Model-based feedback: improve the estimate of the query model by exploiting pseudo-relevance feedback
  – Update the query model by interpolating the original query model with a learned feedback model [Zhai & Lafferty 01b]
  – Estimate a more integrated mixture model using pseudo-feedback documents [Tao & Zhai 06]
Feedback as Model Interpolation [Zhai & Lafferty 01b]

The document D is scored against an updated query model by D(θ_Q' || θ_D), where the original query model θ_Q (from query Q) is interpolated with a feedback model θ_F estimated from the feedback docs F = {d1, d2, …, dn}:

$$\theta_{Q'} = (1-\alpha)\,\theta_Q + \alpha\,\theta_F$$

α = 0: θ_Q' = θ_Q (no feedback); α = 1: θ_Q' = θ_F (full feedback).

θ_F can be estimated with a generative model or by divergence minimization, as described next.
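A small sketch (Python, assumed) of the interpolation step itself, with models represented as word-probability dicts:

```python
def interpolate_query_model(theta_q, theta_f, alpha=0.5):
    """Feedback as model interpolation: theta_Q' = (1-alpha)*theta_Q + alpha*theta_F."""
    vocab = set(theta_q) | set(theta_f)
    return {w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
            for w in vocab}
```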
θ_F Estimation Method I: Generative Mixture Model

Each word in the feedback docs F = {D1, …, Dn} is generated either from the background model p(w|C) (with probability λ) or from the topic model p(w|θ) (with probability 1-λ):

$$\log p(F \mid \theta) = \sum_{D \in F} \sum_{w \in D} c(w; D) \log\big( (1-\lambda)\, p(w \mid \theta) + \lambda\, p(w \mid C) \big)$$

Maximum likelihood:

$$\theta_F = \arg\max_{\theta} \log p(F \mid \theta)$$

The learned topic model is called a "parsimonious language model" in [Hiemstra et al. 04].
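A hedged EM sketch (Python, assumed) for the mixture model above with λ fixed. The E-step computes, for each word, the probability that it came from the topic model; the M-step re-estimates p(w|θ) from those topic-attributed counts:

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, coll_prob, lam=0.7, iters=30):
    """EM for the generative mixture model: each word occurrence in the feedback
    docs comes from the background p(w|C) with prob. lam, or from the topic model
    theta with prob. 1-lam. Returns the estimated p(w|theta)."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}   # initialize with ML estimate
    for _ in range(iters):
        # E-step: probability that each word occurrence came from the topic model
        resp = {w: (1 - lam) * theta[w] /
                   ((1 - lam) * theta[w] + lam * coll_prob[w]) for w in counts}
        # M-step: re-estimate theta from topic-attributed counts
        weighted = {w: counts[w] * resp[w] for w in counts}
        norm = sum(weighted.values())
        theta = {w: weighted[w] / norm for w in counts}
    return theta
```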
θ_F Estimation Method II: Empirical Divergence Minimization

Find the model θ that is close to the models of the feedback docs F = {D1, …, Dn} but far (weight λ) from the background model θ_C, by minimizing the empirical divergence:

$$D_{emp}(\theta, F, C) = \frac{1}{|F|} \sum_{i=1}^{n} D(\theta \,\|\, \theta_{D_i}) - \lambda\, D(\theta \,\|\, \theta_C)$$

$$\theta_F = \arg\min_{\theta}\ D_{emp}(\theta, F, C)$$
Example of Feedback Query Model

TREC topic 412: "airport security"; feedback docs = top 10 docs from a Web database; mixture model approach.

λ = 0.9:

W               p(W|θ_F)
security        0.0558
airport         0.0546
beverage        0.0488
alcohol         0.0474
bomb            0.0236
terrorist       0.0217
author          0.0206
license         0.0188
bond            0.0186
counter-terror  0.0173
terror          0.0142
newsnet         0.0129
attack          0.0124
operation       0.0121
headline        0.0121

λ = 0.7:

W               p(W|θ_F)
the             0.0405
security        0.0377
airport         0.0342
beverage        0.0305
alcohol         0.0304
to              0.0268
of              0.0241
and             0.0214
author          0.0156
bomb            0.0150
terrorist       0.0137
in              0.0135
license         0.0127
state           0.0127
by              0.0125

With the larger λ, more of the common words are "explained away" by the background model, so the feedback model is more discriminative; with λ = 0.7, stop words such as "the", "to", "of" remain prominent.
Model-based Feedback Improves over Simple LM [Zhai & Lafferty 01b]

Collection   Metric   Simple LM    Mixture      Improv.   Div. Min.    Improv.
AP88-89      AvgPr    0.21         0.296        +41%      0.295        +40%
             InitPr   0.617        0.591        -4%       0.617        +0%
             Recall   3067/4805    3888/4805    +27%      3665/4805    +19%
TREC8        AvgPr    0.256        0.282        +10%      0.269        +5%
             InitPr   0.729        0.707        -3%       0.705        -3%
             Recall   2853/4728    3160/4728    +11%      3129/4728    +10%
WEB          AvgPr    0.281        0.306        +9%       0.312        +11%
             InitPr   0.742        0.732        -1%       0.728        -2%
             Recall   1755/2279    1758/2279    +0%       1798/2279    +2%
What You Should Know

• Derivation of the query likelihood retrieval model using query generation (what are the assumptions made?)
• Dirichlet prior and Jelinek-Mercer smoothing methods
• Connection between query likelihood and TF-IDF weighting + doc length normalization
• The basic idea of two-stage smoothing
• KL-divergence retrieval model
• Basic idea of feedback methods (mixture model)