ACM SIGIR 2005 Tutorial - University of Illinois at Urbana


Lectures 2 & 3:
Statistical Language Models
for Information Retrieval
ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
China-US-France Summer School, Lotus Hill Inst. 2008
Query Generation

$$O(R=1|Q,D) = \frac{P(Q,D|R=1)}{P(Q,D|R=0)} = \frac{P(Q|D,R=1)\,P(D|R=1)}{P(Q|D,R=0)\,P(D|R=0)} \approx P(Q|D,R=1)\,\frac{P(D|R=1)}{P(D|R=0)}$$

(assuming $P(Q|D,R=0) \approx P(Q|R=0)$)

$P(Q|D,R=1)$ is the query likelihood p(q|d), and $P(D|R=1)/P(D|R=0)$ is a document prior. Assuming a uniform prior, we have $O(R=1|Q,D) \propto P(Q|D,R=1)$.

Now, the question is how to compute $P(Q|D,R=1)$. This generally involves two steps:
(1) estimate a language model based on D
(2) compute the query likelihood according to the estimated model

Leading to the so-called “Language Modeling Approach” …
Outline
1. Overview
2. The Basic Language Modeling Approach
3. More Advanced Language Models
4. Language Models for Special Retrieval Tasks
5. Summary
What is a Statistical LM?
• A probability distribution over word
sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context/topic dependent!
• Can also be regarded as a probabilistic
mechanism for “generating” text, thus also
called a “generative” model
Why is a LM Useful?
• Provides a principled way to quantify the
uncertainties associated with natural
language
• Allows us to answer questions like:
– Given that we see “John” and “feels”, how likely are we to see “happy” as opposed to “habit” as the next word? (speech recognition)
– Given that we observe “baseball” three times and “game” once in a news article, how likely is the article to be about “sports”? (text categorization, information retrieval)
– Given that a user is interested in sports news, how likely is the user to use “baseball” in a query? (information retrieval)
Source-Channel Framework
(Model of Communication System [Shannon 48])

Source --X--> Transmitter (encoder) --Y--> Noisy Channel --> Receiver (decoder) --X'--> Destination
P(X): source model;  P(Y|X): channel model;  P(X|Y) = ?

$$\hat{X} = \arg\max_X p(X|Y) = \arg\max_X p(Y|X)\,p(X) \quad \text{(Bayes' rule)}$$

When X is text, p(X) is a language model.

Many examples:
  Speech recognition:     X = word sequence     Y = speech signal
  Machine translation:    X = English sentence  Y = Chinese sentence
  OCR error correction:   X = correct word      Y = erroneous word
  Information retrieval:  X = document          Y = query
  Summarization:          X = summary           Y = document
The Simplest Language Model
(Unigram Model)
• Generate a piece of text by generating each word independently
• Thus, $p(w_1 w_2 \ldots w_n) = p(w_1)\,p(w_2)\cdots p(w_n)$
• Parameters: $\{p(w_i)\}$, with $p(w_1)+\cdots+p(w_N)=1$ (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
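To make this generative view concrete, here is a minimal Python sketch; the word probabilities are hypothetical toy values, not from the tutorial:

```python
import math

# A toy unigram LM: a fragment of a word distribution p(w|theta)
# (hypothetical numbers; a full model would cover the whole vocabulary).
theta = {"text": 0.2, "mining": 0.1, "association": 0.01,
         "clustering": 0.02, "the": 0.3, "food": 0.00001}

def log_likelihood(words, model):
    """log p(w1 ... wn) = sum_i log p(wi), by the independence assumption."""
    return sum(math.log(model[w]) for w in words)

# The same bag of words gets the same probability in any order:
print(log_likelihood(["text", "mining"], theta))  # log(0.2) + log(0.1)
print(log_likelihood(["mining", "text"], theta))  # identical value
```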
Text Generation with Unigram LM

(Unigram) language model θ: p(w|θ)  --sampling-->  document d

  Topic 1: Text mining          Topic 2: Health
    text 0.2                      food 0.25
    mining 0.1                    nutrition 0.1
    association 0.01              healthy 0.05
    clustering 0.02               diet 0.02
    …                             …
    food 0.00001
    …
  → “Text mining paper”          → “Food nutrition paper”

Given θ, p(d|θ) varies according to d.
Estimation of Unigram LM

Document (total # words = 100)   --Estimation-->   (Unigram) language model θ: p(w|θ) = ?
  text 10                                            p(text|θ)        = 10/100
  mining 5                                           p(mining|θ)      = 5/100
  association 3                                      p(association|θ) = 3/100
  database 3                                         p(database|θ)    = 3/100
  algorithm 2                                        …
  …                                                  p(query|θ)       = 1/100
  query 1                                            …
  efficient 1
  …

How good is the estimated model θ?
It gives our document sample the highest probability, but it doesn’t generalize well… More about this later…
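A small Python sketch of this maximum likelihood estimate, using a shortened version of the slide’s example document; note how any unseen word gets probability zero:

```python
from collections import Counter

def mle_unigram(doc_words):
    """Maximum likelihood estimate: p(w|d) = c(w,d) / |d|."""
    counts = Counter(doc_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
model = mle_unigram(doc)
print(model["text"])            # 10/21 here; 10/100 on the slide's 100-word doc
print(model.get("query", 0.0))  # 0.0 -- unseen words get zero probability
```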
More Sophisticated LMs
• N-gram language models
– In general, $p(w_1 w_2 \ldots w_n) = p(w_1)\,p(w_2|w_1)\cdots p(w_n|w_1 \ldots w_{n-1})$
– n-gram: condition only on the past n−1 words
– E.g., bigram: $p(w_1 \ldots w_n) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_2)\cdots p(w_n|w_{n-1})$
• Remote-dependence language models (e.g.,
Maximum Entropy model)
• Structured language models (e.g., probabilistic
context-free grammar)
• Will not be covered in detail in this tutorial. If
interested, read [Jelinek 98, Manning & Schutze 99, Rosenfeld 00]
Why Just Unigram Models?
• Difficulty in moving toward more complex models
– They involve more parameters, so need more data
to estimate (A doc is an extremely small sample)
– They increase the computational complexity
significantly, both in time and space
• Capturing word order or structure may not add so
much value for “topical inference”
• But, using more sophisticated models can still be
expected to improve performance ...
Evaluation of SLMs
• Direct evaluation criterion: How well does the model fit the data to be modeled?
  – Example measures: data likelihood, perplexity, cross entropy, Kullback-Leibler divergence (mostly equivalent)
• Indirect evaluation criterion: Does the model help improve the performance of the task?
  – Specific measure is task dependent
  – For retrieval, we look at whether a model helps improve retrieval accuracy
  – We hope more “reasonable” LMs would achieve better retrieval performance
Representative LMs for IR (1998 – 2005)

• Basic LM (query likelihood scoring): Ponte & Croft 98; Hiemstra & Kraaij 99; Miller et al. 99; parameter sensitivity [Ng 00]
• Improved basic LM: smoothing examined [Zhai & Lafferty 01a]; theoretical justification [Lafferty & Zhai 01a, 01b]; two-stage LMs [Zhai & Lafferty 02]; Bayesian query likelihood [Zaragoza et al. 03]; URL prior [Kraaij et al. 02]; time prior [Li & Croft 03]; term-specific smoothing [Hiemstra 02]; cluster LM [Kurland & Lee 04]; cluster smoothing [Liu & Croft 04; Tao et al. 06]
• Beyond unigram: [Song & Croft 99]; translation model [Berger & Lafferty 99]; title LM [Jin et al. 02]; concept likelihood [Srikanth & Srihari 03]; dependency LM [Gao et al. 04]; thesauri [Cao et al. 05]
• Query/relevance model & feedback: Markov-chain query model [Lafferty & Zhai 01b]; model-based feedback [Zhai & Lafferty 01b]; relevance LM [Lavrenko & Croft 01]; relevant query feedback [Nallapati et al. 03]; parsimonious LM [Hiemstra et al. 04]; pseudo query [Kurland et al. 05]; query expansion [Bai et al. 05]; robust estimation [Tao & Zhai 06]
• Special IR tasks: [Xu & Croft 99; Xu et al. 01; Lavrenko et al. 02; Zhang et al. 02; Cronen-Townsend et al. 02; Si et al. 02; Zhai et al. 03; Ogilvie & Callan 03; Kurland & Lee 05; Shen et al. 05; Tan et al. 06]
• Dissertations: Ponte 98; Hiemstra 01; Berger 01; Zhai 02; Lavrenko 04; Kraaij 04; Srikanth 04; Tao 06; Kurland 06
Ponte & Croft’s Pioneering Work
[Ponte & Croft 98]
• Contribution 1:
  – A new “query likelihood” scoring method: p(Q|D)
  – [Maron and Kuhns 60] had the idea of query likelihood, but didn’t work out how to estimate p(Q|D)
• Contribution 2:
  – Connecting LMs with text representation and weighting in IR
  – [Wong & Yao 89] had the idea of representing text with a multinomial distribution (relative frequency), but didn’t study the estimation problem
• Good performance is reported using the simple query likelihood method
Early Work (1998-1999)
• At about the same time as SIGIR 98, in TREC 7, two groups explored similar ideas independently: BBN [Miller et al. 99] & Univ. of Twente [Hiemstra & Kraaij 99]
• In TREC-8, Ng from MIT motivated the same query likelihood method in a different way [Ng 99]
• All follow the simple query likelihood method; the methods differ in the way the model is estimated and in the event model for the query
• All show promising empirical results
• Main problems:
  – Feedback is explored heuristically
  – Lack of understanding of why the method works…
Later Work (1999-)
• Attempt to understand why LMs work [Zhai & Lafferty 01a, Lafferty & Zhai 01a, Ponte 01, Greiff & Morgan 03, Sparck Jones et al. 03, Lavrenko 04]
• Further extend/improve the basic LMs [Song & Croft 99, Berger & Lafferty 99, Jin et al. 02, Nallapati & Allan 02, Hiemstra 02, Zaragoza et al. 03, Srikanth & Srihari 03, Nallapati et al. 03, Li & Croft 03, Gao et al. 04, Liu & Croft 04, Kurland & Lee 04, Hiemstra et al. 04, Cao et al. 05, Tao et al. 06]
• Explore alternative ways of using LMs for retrieval (mostly query/relevance model estimation) [Xu & Croft 99, Lavrenko & Croft 01, Lafferty & Zhai 01a, Zhai & Lafferty 01b, Lavrenko 04, Kurland et al. 05, Bai et al. 05, Tao & Zhai 06]
• Explore the use of SLMs for special retrieval tasks [Xu & Croft 99, Xu et al. 01, Lavrenko et al. 02, Cronen-Townsend et al. 02, Zhang et al. 02, Ogilvie & Callan 03, Zhai et al. 03, Kurland & Lee 05, Shen et al. 05, Balog et al. 06, Fang & Zhai 07]
The Basic Language Modeling Approach
The Basic LM Approach [Ponte & Croft 98]

Estimate a language model for each document, then ask: which model would most likely have generated this query?

  Document: “Text mining paper”     Document: “Food nutrition paper”
    text ?                            food ?
    mining ?                          nutrition ?
    association ?                     healthy ?
    clustering ?                      diet ?
    …                                 …
    food ?
    …

  Query = “data mining algorithms”
Ranking Docs by Query Likelihood

  Doc     Doc LM      Query likelihood
  d1  →   θ_d1   →    p(q|θ_d1)
  d2  →   θ_d2   →    p(q|θ_d2)   ←  query q
  …
  dN  →   θ_dN   →    p(q|θ_dN)
Modeling Queries: Different Assumptions

• Multi-Bernoulli: modeling word presence/absence
  – q = (x_1, …, x_{|V|}), x_i = 1 for presence of word w_i; x_i = 0 for absence

$$p(q=(x_1,\ldots,x_{|V|})\,|\,d) = \prod_{i=1}^{|V|} p(w_i = x_i|d) = \prod_{i=1,\, x_i=1}^{|V|} p(w_i = 1|d) \prod_{i=1,\, x_i=0}^{|V|} p(w_i = 0|d)$$

  – Parameters: {p(w_i=1|d), p(w_i=0|d)}, with p(w_i=1|d) + p(w_i=0|d) = 1

• Multinomial (unigram LM): modeling word frequency
  – q = q_1 … q_m, where q_j is a query word

$$p(q = q_1 \ldots q_m\,|\,d) = \prod_{j=1}^{m} p(q_j|d) = \prod_{i=1}^{|V|} p(w_i|d)^{c(w_i, q)}$$

  – c(w_i, q) is the count of word w_i in query q
  – Parameters: {p(w_i|d)}, with p(w_1|d) + … + p(w_{|V|}|d) = 1

[Ponte & Croft 98] uses multi-Bernoulli; most other work uses multinomial.
Multinomial seems to work better [Song & Croft 99, McCallum & Nigam 98, Lavrenko 04].
Retrieval as LM Estimation

• Document ranking based on query likelihood:

$$\log p(q|d) = \sum_{i=1}^{m} \log p(q_i|d) = \sum_{i=1}^{|V|} c(w_i, q)\,\log p(w_i|d), \quad \text{where } q = q_1 q_2 \ldots q_m$$

  Here p(w_i|d) is the document language model.

• Retrieval problem ⇒ estimation of p(w_i|d)
• Smoothing is an important issue, and distinguishes different approaches
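A Python sketch of query-likelihood ranking under this formula; names are illustrative, and each doc_model is assumed to be an already-smoothed word distribution (so every query word has non-zero probability):

```python
import math
from collections import Counter

def query_log_likelihood(query_words, doc_model):
    """log p(q|d) = sum_w c(w,q) * log p(w|d)."""
    q_counts = Counter(query_words)
    return sum(c * math.log(doc_model[w]) for w, c in q_counts.items())

def rank(query_words, doc_models):
    """Rank documents by query likelihood, highest first."""
    scores = {d: query_log_likelihood(query_words, m)
              for d, m in doc_models.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```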
How to Estimate p(w|d)?
• Simplest solution: Maximum Likelihood Estimator
– P(w|d) = relative frequency of word w in d
– What if a word doesn’t appear in the text? P(w|d)=0
• In general, what probability should we give a word
that has not been observed?
• If we want to assign non-zero probabilities to such
words, we’ll have to discount the probabilities of
observed words
• This is what “smoothing” is about …
Language Model Smoothing (Illustration)

[Figure: P(w) plotted over words w. The maximum likelihood estimate
  p_ML(w) = (count of w) / (count of all words)
assigns zero probability to unseen words; the smoothed LM discounts seen words and leaves non-zero probability for unseen words.]
How to Smooth?
•
All smoothing methods try to
– discount the probability of words seen in a document
– re-allocate the extra counts so that unseen words will
have a non-zero count
•
Method 1 Additive smoothing [Chen & Goodman 98]: Add
a constant  to the counts of each word, e.g.,
“add 1”
Counts of w in d
“Add one”, Laplace
c( w, d )  1
p( w | d ) 
| d |  |V |
Vocabulary size
Length of d (total counts)
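A one-function Python sketch of add-one smoothing (names are illustrative):

```python
from collections import Counter

def add_one_smoothing(doc_words, vocab):
    """p(w|d) = (c(w,d) + 1) / (|d| + |V|): every word in the vocabulary,
    seen or unseen, gets a non-zero probability."""
    counts = Counter(doc_words)
    d_len = len(doc_words)
    return {w: (counts[w] + 1) / (d_len + len(vocab)) for w in vocab}
```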
Improve Additive Smoothing

• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words:

$$p(w|d) = \begin{cases} p_{DML}(w|d) & \text{if } w \text{ is seen in } d \quad \text{(discounted ML estimate)} \\ \alpha_d\, p(w|REF) & \text{otherwise} \quad \text{(reference language model)} \end{cases}$$

where the normalizer α_d distributes the probability mass reserved for unseen words:

$$\alpha_d = \frac{1 - \sum_{w \text{ is seen}} p_{DML}(w|d)}{\sum_{w \text{ is unseen}} p(w|REF)}$$
Other Smoothing Methods

• Method 2: Absolute discounting [Ney et al. 94]: subtract a constant δ from the counts of each word:

$$p(w|d) = \frac{\max(c(w,d) - \delta,\, 0) + \delta\,|d|_u\, p(w|REF)}{|d|}$$

where |d|_u is the number of unique words in d.

• Method 3: Linear interpolation [Jelinek-Mercer 80]: “shrink” the ML estimate uniformly toward p(w|REF), with parameter λ:

$$p(w|d) = (1-\lambda)\,\frac{c(w,d)}{|d|} + \lambda\, p(w|REF)$$
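A Python sketch of both methods under the formulas above (the discount values are illustrative defaults, not prescribed by the slide):

```python
def jelinek_mercer(w, doc_counts, doc_len, p_ref, lam=0.1):
    """p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)."""
    return (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * p_ref[w]

def absolute_discounting(w, doc_counts, doc_len, p_ref, delta=0.7):
    """p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|,
    where |d|_u is the number of unique words in d."""
    unique = len(doc_counts)
    return (max(doc_counts.get(w, 0) - delta, 0)
            + delta * unique * p_ref[w]) / doc_len
```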
Other Smoothing Methods (cont.)
• Method 4: Dirichlet prior/Bayesian [MacKay & Peto 95, Zhai & Lafferty 01a, Zhai & Lafferty 02]: assume μ pseudo counts distributed as p(w|REF):

$$p(w|d) = \frac{c(w,d) + \mu\, p(w|REF)}{|d| + \mu} = \frac{|d|}{|d|+\mu}\cdot\frac{c(w,d)}{|d|} + \frac{\mu}{|d|+\mu}\, p(w|REF)$$

• Method 5: Good-Turing [Good 53]: assume the total count of unseen events to be n_1 (the number of singletons), and adjust the counts of seen events in the same way:

$$p(w|d) = \frac{c^*(w,d)}{|d|}; \quad c^*(w,d) = (c(w,d)+1)\,\frac{n_{c(w,d)+1}}{n_{c(w,d)}}; \quad 0^* = \frac{n_1}{n_0},\; 1^* = \frac{2 n_2}{n_1},\; \ldots$$

where n_r is the number of words with count r.
What if $n_{c(w,d)+1} = 0$? What about p(w|REF)? Heuristics needed.
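A matching Python sketch of Dirichlet prior smoothing (μ = 2000 is just a commonly used ballpark in the literature, not a value fixed by the slide):

```python
def dirichlet_prior(w, doc_counts, doc_len, p_ref, mu=2000.0):
    """p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu).

    Equivalent to Jelinek-Mercer with a document-dependent weight
    lambda = mu / (|d| + mu): longer documents are smoothed less."""
    return (doc_counts.get(w, 0) + mu * p_ref[w]) / (doc_len + mu)
```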
So, which method is the best?
It depends on the data and the task!
Cross validation is generally used to choose the best
method and/or set the smoothing parameters…
For retrieval, Dirichlet prior performs well…
Backoff smoothing [Katz 87] doesn’t work well due to a lack
of 2nd-stage smoothing…
Note that many other smoothing methods exist
See [Chen & Goodman 98] and other publications in speech recognition…
Comparison of Three Methods [Zhai & Lafferty 01a]

  Query Type   Jelinek-Mercer   Dirichlet   Abs. Discounting
  Title        0.228            0.256       0.237
  Long         0.278            0.276       0.260

[Bar chart: relative precision of JM, Dir., and AD for title vs. long queries.]

Comparison is performed on a variety of test collections.
Understanding Smoothing

The general smoothing scheme:

$$p(w|d) = \begin{cases} p_{DML}(w|d) & \text{if } w \text{ is seen in } d \quad \text{(discounted ML estimate)} \\ \alpha_d\, p(w|REF) & \text{otherwise} \quad \text{(reference language model)} \end{cases}$$

Retrieval formula using the general smoothing scheme:

$$\begin{aligned}
\log p(q|d) &= \sum_{w\in V} c(w,q)\,\log p(w|d) \\
&= \sum_{w\in V,\, c(w,d)>0} c(w,q)\,\log p_{DML}(w|d) + \sum_{w\in V,\, c(w,d)=0} c(w,q)\,\log \alpha_d\, p(w|REF) \\
&= \sum_{\substack{w\in V,\, c(w,d)>0 \\ c(w,q)>0}} c(w,q)\,\log\frac{p_{DML}(w|d)}{\alpha_d\, p(w|REF)} + |q|\,\log\alpha_d + \sum_{w\in V} c(w,q)\,\log p(w|REF)
\end{aligned}$$

The last line is the key rewriting step. Similar rewritings are very common when using LMs for IR…
Smoothing & TF-IDF Weighting [Zhai & Lafferty 01a]

• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain

$$\log p(q|d) = \underbrace{\sum_{\substack{w\in V,\, c(w,d)>0 \\ c(w,q)>0}} c(w,q)\,\log\frac{p_{DML}(w|d)}{\alpha_d\, p(w|REF)}}_{\text{words in both query and doc: TF + IDF-like weighting}} + \underbrace{|q|\,\log\alpha_d}_{\text{doc length normalization}} + \underbrace{\sum_{w\in V} c(w,q)\,\log p(w|REF)}_{\text{ignore for ranking}}$$

  (A long doc is expected to have a smaller α_d, hence the length normalization.)

• Smoothing with p(w|C) ⇒ TF-IDF weighting + length normalization
• Smoothing implements traditional retrieval heuristics
• LMs with simple smoothing can be computed as efficiently as traditional retrieval models
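A Python sketch of this rank-equivalent scoring for the Dirichlet case, touching only query terms that occur in the document; an illustration of the rewriting, not any paper’s exact implementation:

```python
import math

def score(query_counts, doc_counts, doc_len, p_ref, mu=2000.0):
    """Rank-equivalent query-likelihood score via the rewritten formula:
    only terms in BOTH the query and the document are touched, plus one
    per-document length term -- as cheap as a TF-IDF dot product.
    For Dirichlet smoothing, alpha_d = mu / (|d| + mu)."""
    alpha_d = mu / (doc_len + mu)
    q_len = sum(query_counts.values())
    s = q_len * math.log(alpha_d)          # doc length normalization
    for w, cq in query_counts.items():
        cd = doc_counts.get(w, 0)
        if cd > 0:                         # matched terms only
            p_dml = (cd + mu * p_ref[w]) / (doc_len + mu)
            s += cq * math.log(p_dml / (alpha_d * p_ref[w]))
    return s  # the sum of c(w,q) log p(w|REF) is doc-independent: dropped
```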
The Dual-Role of Smoothing [Zhai & Lafferty 02]

[Figure: retrieval performance as a function of the amount of smoothing, for keyword vs. verbose queries, long and short; verbose queries are far more sensitive to the smoothing parameter.]

Why does query type affect smoothing sensitivity?
Another Reason for Smoothing

Query = “the algorithms for data mining” (content words: “data”, “mining”)

  w              the     algorithms  for    data    mining
  p_DML(w|d1):   0.04    0.001       0.02   0.002   0.003
  p_DML(w|d2):   0.02    0.001       0.01   0.003   0.004

p(“algorithms”|d1) = p(“algorithms”|d2);  p(“data”|d1) < p(“data”|d2);  p(“mining”|d1) < p(“mining”|d2)

Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)…

So we should make p(“the”) and p(“for”) less different for all docs, and smoothing helps achieve this goal. After smoothing with p(w|d) = 0.1 p_DML(w|d) + 0.9 p(w|REF), p(q|d1) < p(q|d2)!

  w                   the     algorithms  for    data      mining
  p(w|REF):           0.2     0.00001     0.2    0.00001   0.00001
  Smoothed p(w|d1):   0.184   0.000109    0.182  0.000209  0.000309
  Smoothed p(w|d2):   0.182   0.000109    0.181  0.000309  0.000409
Two-stage Smoothing [Zhai & Lafferty 02]

Stage-1: explain unseen words — Dirichlet prior (Bayesian), parameter μ, collection LM p(w|C)
Stage-2: explain noise in the query — two-component mixture, parameter λ, user background model p(w|U) (can be approximated by p(w|C))

$$p(w|d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w|C)}{|d| + \mu} + \lambda\, p(w|U)$$
Estimating μ using leave-one-out [Zhai & Lafferty 02]

Leave each word occurrence out in turn (w_1: P(w_1|d−w_1), w_2: P(w_2|d−w_2), …, w_n: P(w_n|d−w_n)) and compute the leave-one-out log-likelihood:

$$\ell_{-1}(\mu\,|\,C) = \sum_{i=1}^{N}\sum_{w\in V} c(w, d_i)\,\log\frac{c(w,d_i) - 1 + \mu\, p(w|C)}{|d_i| - 1 + \mu}$$

Maximum likelihood estimator, solved with Newton’s method:

$$\hat{\mu} = \arg\max_{\mu}\, \ell_{-1}(\mu\,|\,C)$$
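A Python sketch of the leave-one-out objective; for simplicity it uses a grid search over candidate μ values instead of the Newton’s method used in the paper:

```python
import math
from collections import Counter

def leave_one_out_ll(mu, docs, p_ref):
    """l_{-1}(mu|C) = sum_i sum_w c(w,d_i) *
       log( (c(w,d_i) - 1 + mu*p(w|C)) / (|d_i| - 1 + mu) )."""
    ll = 0.0
    for doc in docs:
        counts = Counter(doc)
        n = len(doc)
        for w, c in counts.items():
            ll += c * math.log((c - 1 + mu * p_ref[w]) / (n - 1 + mu))
    return ll

def estimate_mu(docs, p_ref, candidates=(100, 500, 1000, 2000, 5000)):
    """Grid search stands in for Newton's method; same objective."""
    return max(candidates, key=lambda mu: leave_one_out_ll(mu, docs, p_ref))
```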
Why would “leave-one-out” work?

20 words by author1:
  abc abc ab c d d
  abc cd d d
  abd ab ab ab ab
  cd d e cd e

20 words by author2:
  abc abc ab c d d
  abe cb e f
  acf fb ef aff abef
  cdc db ge f s

Suppose we keep sampling and get 10 more words. Which author is likely to “write” more new words?

Now, suppose we leave “e” out…

$$p_{ml}(\text{“e”}\,|\,author1) = \frac{1}{19} \qquad p_{smooth}(\text{“e”}\,|\,author1) = \frac{20}{20+\mu}\cdot\frac{1}{19} + \frac{\mu}{20+\mu}\,p(\text{“e”}\,|\,REF)$$
  ⇒ μ doesn’t have to be big

$$p_{ml}(\text{“e”}\,|\,author2) = \frac{0}{19} \qquad p_{smooth}(\text{“e”}\,|\,author2) = \frac{20}{20+\mu}\cdot\frac{0}{19} + \frac{\mu}{20+\mu}\,p(\text{“e”}\,|\,REF)$$
  ⇒ μ must be big! (more smoothing)

The amount of smoothing is closely related to the underlying vocabulary size.
Estimating λ using Mixture Model [Zhai & Lafferty 02]

Stage-1: smooth each document d_1, …, d_N with the μ̂ estimated by leave-one-out:

$$p(q_j\,|\,d_i) = \frac{c(q_j, d_i) + \hat{\mu}\, p(q_j|C)}{|d_i| + \hat{\mu}}$$

Stage-2: model the query Q = q_1 … q_m as generated from a two-component mixture of each document LM and a user background model, with mixing weight λ:

$$(1-\lambda)\, p(w|d_1) + \lambda\, p(w|U), \quad \ldots, \quad (1-\lambda)\, p(w|d_N) + \lambda\, p(w|U)$$

λ is fit by the maximum likelihood estimator, computed with the Expectation-Maximization (EM) algorithm.
Automatic 2-stage results ≈ optimal 1-stage results [Zhai & Lafferty 02]

Average precision (3 DBs + 4 query types, 150 topics); * indicates a significant difference:

  Collection  Query  Optimal-JM  Optimal-Dir  Auto-2stage
  AP88-89     SK     20.3%       23.0%        22.2%*
              LK     36.8%       37.6%        37.4%
              SV     18.8%       20.9%        20.4%
              LV     28.8%       29.8%        29.2%
  WSJ87-92    SK     19.4%       22.3%        21.8%*
              LK     34.8%       35.3%        35.8%
              SV     17.2%       19.6%        19.9%
              LV     27.7%       28.2%        28.8%*
  ZIFF1-2     SK     17.9%       21.5%        20.0%
              LK     32.6%       32.6%        32.2%
              SV     15.6%       18.5%        18.1%
              LV     26.7%       27.9%        27.9%*

Completely automatic tuning of parameters IS POSSIBLE!
The Notion of Relevance

Relevance
├─ (Rep(q), Rep(d)) similarity — different reps & similarity measures
│    ├─ Vector space model (Salton et al., 75)
│    └─ Prob. distr. model (Wong & Yao, 89)
├─ P(r=1|q,d), r ∈ {0,1} — probability of relevance
│    ├─ Regression model (Fox 83)
│    └─ Generative model
│         ├─ Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
│         └─ Query generation: basic LM approach (Ponte & Croft, 98)
│              — initially, LMs were applied to IR in this way
└─ P(d→q) or P(q→d), probabilistic inference — different inference systems
     ├─ Prob. concept space model (Wong & Yao, 95)
     └─ Inference network model (Turtle & Croft, 91)

Later, LMs were used along the other lines too.
Interpretation of Query Likelihood [Lafferty & Zhai 01a]

$$O(R=1|Q,D) = \frac{P(Q,D|R=1)}{P(Q,D|R=0)} = \frac{P(Q|D,R=1)\,P(D|R=1)}{P(Q|D,R=0)\,P(D|R=0)} \approx P(Q|D,R=1)\,\frac{P(D|R=1)}{P(D|R=0)}$$

(assuming $P(Q|D,R=0) \approx P(Q|R=0)$; the first factor is the query likelihood p(q|d), the second a document prior)

Assuming a uniform prior, we have $O(R=1|Q,D) \propto P(Q|D,R=1)$.

Computing P(Q|D, R=1) generally involves two steps:
(1) estimate a language model based on D
(2) compute the query likelihood according to the estimated model

P(Q|D) = P(Q|D, R=1): the probability that a user who likes D would pose query Q — a relevance-based interpretation of the so-called “document language model”.
Variants of the Basic LM Approach

• Different smoothing strategies
  – Hidden Markov Models (essentially linear interpolation) [Miller et al. 99]
  – Smoothing with an IDF-like reference model [Hiemstra & Kraaij 99]
  – Performance tends to be similar to the basic LM approach
  – Many other possibilities for smoothing [Chen & Goodman 98]
• Different priors
  – Link information as prior leads to significant improvement of Web entry page retrieval performance [Kraaij et al. 02]
  – Time as prior [Li & Croft 03]
  – PageRank as prior [Kurland & Lee 05]
• Passage retrieval [Liu & Croft 02]
More Advanced Language Models
Improving the Basic LM Approach

• Capturing limited dependencies
  – Bigrams/trigrams [Song & Croft 99]; grammatical dependency [Nallapati & Allan 02, Srikanth & Srihari 03, Gao et al. 04]
  – Generally insignificant improvement compared with other extensions such as feedback
• Full Bayesian query likelihood [Zaragoza et al. 03]
  – Performance similar to the basic LM approach
• Translation model for p(Q|D,R) [Berger & Lafferty 99, Jin et al. 02, Cao et al. 05]
  – Addresses polysemy and synonyms; improves over the basic LM methods, but computationally expensive
• Cluster-based smoothing/scoring [Liu & Croft 04, Kurland & Lee 04, Tao et al. 06]
  – Improves over the basic LM, but computationally expensive
• Parsimonious LMs [Hiemstra et al. 04]
  – Use a mixture model to “factor out” non-discriminative words
Translation Models

• Directly model the “translation” relationship between words in the query and words in a doc
• When relevance judgments are available, (q,d) pairs serve as data to train the translation model
• Without relevance judgments, we can use synthetic data [Berger & Lafferty 99], <title, body> pairs [Jin et al. 02], or thesauri [Cao et al. 05]

Basic translation model:

$$p(Q|D,R) = \prod_{i=1}^{m}\sum_{w_j\in V} p_t(q_i|w_j)\, p(w_j|D)$$

where $p_t(q_i|w_j)$ is the translation model and $p(w_j|D)$ the regular doc LM.
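A Python sketch of this score; p_trans is a hypothetical translation table, and a real implementation would keep the self-translation probability p_t(w|w) large so the model degrades gracefully to the basic LM:

```python
def translation_likelihood(query_words, doc_model, p_trans):
    """p(Q|D) = prod_i sum_{w in V} p_t(q_i|w) * p(w|D).

    doc_model maps w -> p(w|D) (already smoothed);
    p_trans[(q, w)] gives p_t(q|w), defaulting to 0 for missing pairs.
    In practice the inner sum is restricted to words with non-negligible
    translation probability, for efficiency."""
    p = 1.0
    for q in query_words:
        p *= sum(p_trans.get((q, w), 0.0) * pw
                 for w, pw in doc_model.items())
    return p
```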
Cluster-based Smoothing/Scoring

• Cluster-based smoothing: smooth a document LM with a cluster of similar documents [Liu & Croft 04]; improves over the basic LM, but insignificantly
• Document expansion smoothing: smooth a document LM with the neighboring documents (essentially one cluster per document) [Tao et al. 06]; improves over the basic LM more significantly
• Cluster-based query likelihood: similar to the translation model, but “translate” the whole document to the query through a set of clusters [Kurland & Lee 04]:

$$p(Q|D,R) = \sum_{C\in Clusters} p(Q|C)\, p(C|D)$$

where p(Q|C) is the likelihood of Q given cluster C, and p(C|D) is how likely doc D belongs to cluster C. Only effective when interpolated with the basic LM scores.
Feedback in Language Models
Overview of Feedback Techniques

• Feedback as machine learning: many possibilities
  – Standard ML: Given examples of relevant (and non-relevant) documents, learn how to classify a new document as either “relevant” or “non-relevant”
  – “Modified” ML: Given a query and examples of relevant (and non-relevant) documents, learn how to rank new documents based on relevance
  – Challenges:
    • Sparse data
    • Censored sample
    • How to deal with the query?
  – Modeling noise in pseudo feedback (as semi-supervised learning)
• Feedback as query expansion: traditional IR
  – Step 1: Term selection
  – Step 2: Query expansion
  – Step 3: Query term re-weighting
• Traditional IR is still robust (Rocchio), but ML approaches can potentially be more accurate
Feedback and Doc/Query Generation

Classic prob. model (doc generation):   $O(R=1|Q,D) \propto \frac{P(D|Q,R=1)}{P(D|Q,R=0)}$
Query likelihood (“language model”):    $O(R=1|Q,D) \propto P(Q|D,R=1)$

Parameter estimation from judged (query, doc, relevance) triples, e.g., (q1,d1,1), (q1,d2,1), (q1,d3,1), (q1,d4,0), (q1,d5,0), (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0):
  P(D|Q,R=1): rel. doc model;  P(D|Q,R=0): non-rel. doc model;  P(Q|D,R=1): “rel. query” model

Initial retrieval:
  – query as rel doc vs. doc as rel query
  – P(Q|D,R=1) is more accurate
Feedback:
  – P(D|Q,R=1) can be improved for the current query and future docs (query-based feedback)
  – P(Q|D,R=1) can also be improved, but for the current doc and future queries (doc-based feedback)
Difficulty in Feedback with Query Likelihood

• Traditional query expansion [Ponte 98, Miller et al. 99, Ng 99]
  – Improvement is reported, but there is a conceptual inconsistency: what is an expanded query, a piece of text or a set of terms?
• Avoid expansion
  – Query term reweighting [Hiemstra 01, Hiemstra 02]
  – Translation models [Berger & Lafferty 99, Jin et al. 02]
  – Only achieves limited feedback
• Doing relevant query expansion instead [Nallapati et al. 03]
• The difficulty is due to the lack of a query/relevance model
• The difficulty can be overcome with alternative ways of using LMs for retrieval (e.g., relevance model [Lavrenko & Croft 01], query model estimation [Lafferty & Zhai 01b; Zhai & Lafferty 01b])
Two Alternative Ways of Using LMs

• Classic probabilistic model: doc-generation as opposed to query-generation:

$$O(R=1|Q,D) \propto \frac{P(D|Q,R=1)}{P(D|Q,R=0)} \approx \frac{P(D|Q,R=1)}{P(D)}$$

  – Natural for relevance feedback
  – Challenge: estimate p(D|Q,R=1) without relevance feedback; the relevance model [Lavrenko & Croft 01] provides a good solution

• Probabilistic distance model: similar to the vector-space model, but with LMs as opposed to TF-IDF weight vectors
  – A popular distance function: Kullback-Leibler (KL) divergence, covering query likelihood as a special case:

$$score(Q,D) = -D(\theta_Q\,\|\,\theta_D), \quad \text{essentially}\ \sum_{w\in V} p(w|\theta_Q)\,\log p(w|\theta_D)$$

  – Retrieval is now to estimate query & doc models, and feedback is treated as query LM updating [Lafferty & Zhai 01b; Zhai & Lafferty 01b]

Both methods outperform the basic LM significantly.
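A Python sketch of KL-divergence scoring in the rank-equivalent form above; doc_model is assumed smoothed so every query-model word has non-zero probability:

```python
import math

def kl_score(query_model, doc_model):
    """score(Q,D) = sum_w p(w|theta_Q) * log p(w|theta_D),
    rank-equivalent to -D(theta_Q || theta_D). Reduces to query
    likelihood when theta_Q is the empirical query word distribution."""
    return sum(pq * math.log(doc_model[w])
               for w, pq in query_model.items() if pq > 0)
```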
Relevance Model Estimation [Lavrenko & Croft 01]

• Question: How to estimate P(D|Q,R) (or p(w|Q,R)) without relevant documents?
• Key idea:
  – Treat the query as observations about p(w|Q,R)
  – Approximate the model space with document models
• Two methods for decomposing p(w,Q):
  – Independent sampling (Bayesian model averaging):

$$p(w|Q,R) = \int_{\theta_D} p(w|\theta_D)\, p(\theta_D|Q,R)\, d\theta_D \approx \sum_{D\in C} p(w|\theta_D)\, p(\theta_D|R)\, p(Q|\theta_D) \approx \sum_{D\in C} p(w|\theta_D)\prod_{j=1}^{m} p(q_j|\theta_D)$$

  – Conditional sampling, p(w,Q) = p(w) p(Q|w):

$$p(w|Q,R=1) \propto p(w)\, p(Q|w) \approx p(w)\prod_{i=1}^{m}\sum_{D\in C} p(q_i|D)\, p(D|w)$$

    with $p(w) = \sum_{D\in C} p(w|D)\, p(D)$ and $p(D|w) = \frac{p(w|D)\,p(D)}{p(w)}$

    (The original formula in [Lavrenko & Croft 01] is written as $p(D|w) = \frac{p(w|D)\,p(w)}{p(D)}$.)
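A Python sketch of the independent-sampling (RM1-style) estimate over a set of top-ranked document models; all names are illustrative:

```python
def relevance_model(query_words, doc_models, doc_prior=None):
    """p(w|Q,R) ~ sum_D p(w|theta_D) * p(theta_D) * prod_j p(q_j|theta_D).
    doc_models: dict doc_id -> smoothed LM {w: p(w|theta_D)}."""
    weights = {}
    for d, model in doc_models.items():
        prior = doc_prior[d] if doc_prior else 1.0 / len(doc_models)
        q_lik = 1.0
        for q in query_words:
            q_lik *= model.get(q, 1e-10)  # smoothed models should cover q
        weights[d] = prior * q_lik
    z = sum(weights.values())
    rel = {}
    for d, model in doc_models.items():
        for w, pw in model.items():
            rel[w] = rel.get(w, 0.0) + pw * weights[d] / z
    return rel  # a word distribution: sums to 1 if each doc model does
```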
Kernel-based Allocation [Lavrenko 04]

• A general generative model for text — an infinite mixture model with a kernel-based density function:

$$p(w_1\ldots w_n) = \int\Big(\prod_{i=1}^{n} p(w_i|\theta)\Big)\, p(\theta)\, d\theta, \qquad p(\theta) = \frac{1}{N}\sum_{w\in T} k_w(\theta)$$

  where T is the training data and the kernel function $k_w(\theta)$ measures similarity(w, θ).

• Choices of the kernel function:
  – Delta kernel:

$$p(w_1\ldots w_n) = \frac{1}{N}\sum_{w\in T}\prod_{i=1}^{n} p(w_i|w)$$

    i.e., the average probability of w_1…w_n over all training points.
  – Dirichlet kernel: allows a training point to “spread” its influence.
Query Model Estimation [Lafferty & Zhai 01b, Zhai & Lafferty 01b]

• Question: How to estimate a better query model than the ML estimate based on the original query?
• “Massive feedback”: Improve a query model through co-occurrence patterns learned from
  – A document-term Markov chain that outputs the query [Lafferty & Zhai 01b]
  – Thesauri, corpus [Bai et al. 05, Collins-Thompson & Callan 05]
• Model-based feedback: Improve the estimate of the query model by exploiting pseudo-relevance feedback
  – Update the query model by interpolating the original query model with a learned feedback model [Zhai & Lafferty 01b]
  – Estimate a more integrated mixture model using pseudo-feedback documents [Tao & Zhai 06]
Feedback as Model Interpolation [Zhai & Lafferty 01b]

  Document D → θ_D;  Query Q → θ_Q;  score with −D(θ_Q' || θ_D)
  Feedback docs F = {d1, d2, …, dn} → θ_F (via a generative model or divergence minimization)

$$\theta_{Q'} = (1-\alpha)\,\theta_Q + \alpha\,\theta_F$$

  α = 0:  θ_Q' = θ_Q  (no feedback)
  α = 1:  θ_Q' = θ_F  (full feedback)
θ_F Estimation Method I: Generative Mixture Model

Each word in the feedback documents F = {D_1, …, D_n} is generated either from the background model P(w|C) (with probability λ) or from the topic model P(w|θ) (with probability 1−λ):

$$\log p(F|\theta) = \sum_{D\in F}\sum_{w\in D} c(w;D)\,\log\big((1-\lambda)\,p(w|\theta) + \lambda\,p(w|C)\big)$$

Maximum likelihood:

$$\theta_F = \arg\max_{\theta}\,\log p(F|\theta)$$

The learned topic model is called a “parsimonious language model” in [Hiemstra et al. 04].
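A Python sketch of EM for this two-component mixture, with λ fixed and θ estimated by maximum likelihood; the initialization and iteration count are arbitrary choices:

```python
from collections import Counter

def estimate_feedback_model(fb_docs, p_bg, lam=0.5, iters=30):
    """Each word occurrence in F comes from the topic model theta with
    prob (1 - lam) or from the background p(w|C) with prob lam."""
    counts = Counter(w for doc in fb_docs for w in doc)
    theta = {w: 1.0 / len(counts) for w in counts}      # uniform init
    for _ in range(iters):
        new = {}
        for w, c in counts.items():
            # E-step: prob that an occurrence of w came from the topic
            t = (1 - lam) * theta[w]
            z = t / (t + lam * p_bg[w])
            new[w] = c * z                              # expected counts
        total = sum(new.values())
        theta = {w: v / total for w, v in new.items()}  # M-step
    return theta
```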
θ_F Estimation Method II: Empirical Divergence Minimization

Find a θ that is close to the feedback document models θ_{D_1}, …, θ_{D_n} but far (weight λ) from the background model θ_C:

$$D(\theta, F, C) = \frac{1}{|F|}\sum_{i=1}^{n} D(\theta\,\|\,\theta_{D_i}) - \lambda\, D(\theta\,\|\,\theta_C)$$

Divergence minimization:

$$\theta_F = \arg\min_{\theta}\, D(\theta, F, C)$$
Example of Feedback Query Model

TREC topic 412: “airport security”; mixture model approach, Web database, top 10 docs.

  λ = 0.9                       λ = 0.7
  w               p(w|θ_F)      w               p(w|θ_F)
  security        0.0558        the             0.0405
  airport         0.0546        security        0.0377
  beverage        0.0488        airport         0.0342
  alcohol         0.0474        beverage        0.0305
  bomb            0.0236        alcohol         0.0304
  terrorist       0.0217        to              0.0268
  author          0.0206        of              0.0241
  license         0.0188        and             0.0214
  bond            0.0186        author          0.0156
  counter-terror  0.0173        bomb            0.0150
  terror          0.0142        terrorist       0.0137
  newsnet         0.0129        in              0.0135
  attack          0.0124        license         0.0127
  operation       0.0121        state           0.0127
  headline        0.0121        by              0.0125

(With the larger background weight λ = 0.9, the learned feedback model is dominated by content words; with λ = 0.7, common function words leak in.)
Model-based Feedback Improves over Simple LM [Zhai & Lafferty 01b]

  Collection  Metric   Simple LM   Mixture     Improv.  Div.Min.    Improv.
  AP88-89     AvgPr    0.21        0.296       +41%     0.295       +40%
              InitPr   0.617       0.591       −4%      0.617       +0%
              Recall   3067/4805   3888/4805   +27%     3665/4805   +19%
  TREC8       AvgPr    0.256       0.282       +10%     0.269       +5%
              InitPr   0.729       0.707       −3%      0.705       −3%
              Recall   2853/4728   3160/4728   +11%     3129/4728   +10%
  WEB         AvgPr    0.281       0.306       +9%      0.312       +11%
              InitPr   0.742       0.732       −1%      0.728       −2%
              Recall   1755/2279   1758/2279   +0%      1798/2279   +2%

Translation models, relevance models, and feedback-based query models have all been shown to improve performance significantly over the simple LMs. (Parameter tuning is necessary in many cases, but see [Tao & Zhai 06] for “parameter-free” pseudo feedback.)
Some Further Improvements [Tao & Zhai 06]

• Document-specific mixing coefficient (models non-relevant content)
• Use the query as a prior
• Regularized EM
  – Start with a strong prior
  – Gradually reduce the strength of the prior to achieve the feedback effect
• Increases the robustness of the model
LMs for Special Retrieval Tasks
Cross-Lingual IR

• Use a query in language A (e.g., English) to retrieve documents in language B (e.g., Chinese)

• Cross-lingual p(Q|D,R) [Xu et al. 01]:

$$p(Q|D,R) = \prod_{i=1}^{m}\Big[\alpha\, p(q_i|REF_{English}) + (1-\alpha)\sum_{c\in V_{Chinese}} p(c|D)\, p_{trans}(q_i|c)\Big]$$

  where q_i is an English query word, c a Chinese word, and p_trans(q_i|c) the translation model.

• Cross-lingual p(D|Q,R) [Lavrenko et al. 02]:

$$p(c|Q,R) = \frac{p(c, q_1\ldots q_m)}{p(q_1\ldots q_m)}$$

  Method 1 (estimate with parallel corpora):

$$p(c, q_1\ldots q_m) = \sum_{(M_E, M_C)\in M} p(M_E, M_C)\, p(c|M_C)\prod_{i=1}^{m} p(q_i|M_E)$$

  Method 2 (p_trans estimated with a bilingual lexicon or parallel corpora):

$$p(c, q_1\ldots q_m) = \sum_{M_C\in M} p(M_C)\, p(c|M_C)\prod_{i=1}^{m} p(q_i|M_C), \quad \text{where } p(q_i|M_C) = \sum_{c\in V_{Chinese}} p_{trans}(q_i|c)\, p(c|M_C)$$
Distributed IR

• Retrieve documents from multiple collections
• The task is generally decomposed into two subtasks: collection selection and result fusion
• Using LMs for collection selection [Xu & Croft 99, Si et al. 02]
  – Treat collection selection as “retrieving collections” as opposed to “documents”
  – Estimate each collection model by maximum likelihood estimation [Si et al. 02] or clustering [Xu & Croft 99]
• Using LMs for result fusion [Si et al. 02]
  – Assume query likelihood scoring for all collections, but on each collection, a distinct reference LM is used for smoothing
  – Adjust the biased score p(Q|D,Collection) to recover the fair score p(Q|D)
Structured Document Retrieval [Ogilvie & Callan 03]

A document D has parts D_1 (title), D_2 (abstract), D_3 (body-part1), D_4 (body-part2), …, D_k. To generate each query word, select a part D_j and generate the word from D_j:

$$p(Q|D,R=1) = \prod_{i=1}^{m} p(q_i|D,R=1) = \prod_{i=1}^{m}\sum_{j=1}^{k} s(D_j|D,R=1)\, p(q_i|D_j,R=1)$$

The “part selection” probability s(D_j|D,R=1) serves as the weight for D_j and can be trained using EM.

– We want to combine different parts of a document with appropriate weights
– Anchor text can be treated as a “part” of a document
– Applicable to XML retrieval
Personalized/Context-Sensitive Search [Shen et al. 05, Tan et al. 06]

• User information and search context can be used to estimate a better query model:

  Context-independent query LM:  $\hat{\theta}_Q = \arg\max_{\theta}\, p(\theta\,|\,Query, Collection)$
  Context-sensitive query LM:    $\hat{\theta}_Q = \arg\max_{\theta}\, p(\theta\,|\,Query, User, SearchContext, Collection)$

Refinement of this model leads to specific retrieval formulas. Simple models often end up interpolating many unigram language models based on different sources of evidence, e.g., short-term search history [Shen et al. 05] or long-term search history [Tan et al. 06].
Modeling Redundancy

• Given two documents D1 and D2, decide how redundant D1 (or D2) is w.r.t. D2 (or D1)
• Redundancy of D1 ≈ “to what extent can D1 be explained by a model estimated based on D2”
• Use a unigram mixture model [Zhai 02], interpolating the LM for D2 with a reference LM:

$$\log p(D_1\,|\,\lambda, \theta_{D_2}) = \sum_{w\in V} c(w, D_1)\,\log\big[\lambda\, p(w|\theta_{D_2}) + (1-\lambda)\, p(w|REF)\big]$$

$$\lambda^* = \arg\max_{\lambda}\,\log p(D_1\,|\,\lambda, \theta_{D_2})$$

λ* (a maximum likelihood estimate, computed with the EM algorithm) is the measure of redundancy.

• See [Zhang et al. 02] for a 3-component redundancy model
• Along a similar line, we could measure document similarity in an asymmetric way [Kurland & Lee 05]
Predicting Query Difficulty [Cronen-Townsend et al. 02]

• Observations:
  – Discriminative queries tend to be easier
  – Comparison of the query model and the collection model can indicate how discriminative a query is
• Method:
  – Define “query clarity” as the KL-divergence between an estimated query model (or relevance model) and the collection LM:

$$clarity(Q) = \sum_{w} p(w|\theta_Q)\,\log\frac{p(w|\theta_Q)}{p(w|Collection)}$$

  – An enriched query LM can be estimated by exploiting pseudo feedback (e.g., a relevance model)
• Correlation between the clarity scores and retrieval performance is found
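A direct Python transcription of the clarity score; it assumes the query model and collection model share a vocabulary:

```python
import math

def clarity(query_model, collection_model):
    """clarity(Q) = sum_w p(w|theta_Q) * log(p(w|theta_Q) / p(w|Collection)):
    the KL divergence between the (possibly feedback-enriched) query model
    and the collection LM. Higher = more discriminative = likely easier."""
    return sum(pq * math.log(pq / collection_model[w])
               for w, pq in query_model.items() if pq > 0)
```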
Expert Finding [Balog et al. 06, Fang & Zhai 07]

• Task: Given a topic T, a list of candidates {C_i}, and a collection of support documents S = {D_i}, rank the candidates according to the likelihood that a candidate C is an expert on T.
• Retrieval analogy:
  – Query = topic T
  – Document = candidate C
  – Rank according to P(R=1|T,C)
  – Similar derivations to those on slides 55-56, 64 can be made
• Candidate generation model:

$$O(R=1|T,C) \stackrel{rank}{=} \sum_{D\in S} p(C|D,R=1)\, p(D|T,R=1)$$

• Topic generation model:

$$O(R=1|T,C) \stackrel{rank}{=} \sum_{D\in S} p(T|D,R=1)\,\frac{p(C|D,R=1)}{\sum_{D'\in S} p(C|D',R=1)}\cdot\frac{p(C|R=1)}{p(C|R=0)}$$
Summary
SLMs vs. Traditional IR

• Pros:
  – Statistical foundations (better parameter setting)
  – More principled way of handling term weighting
  – More powerful for modeling subtopics, passages, …
  – Leverage LMs developed in related areas
  – Empirically as effective as well-tuned traditional models, with potential for automatic parameter tuning
• Cons:
  – Lack of discrimination (a common problem with generative models)
  – Less robust in some cases (e.g., when queries are semi-structured)
  – Computationally complex
  – Empirically, performance appears to be inferior to well-tuned full-fledged traditional methods (at least, no evidence for beating them)
What We Have Achieved So Far

• Framework and justification for using LMs for IR
• Several effective models have been developed
  – Basic LM with Dirichlet prior smoothing is a reasonable baseline
  – Basic LM with informative priors often improves performance
  – Translation model handles polysemy & synonyms
  – Relevance model incorporates LMs into the classic probabilistic IR model
  – KL-divergence model ties feedback with query model estimation
  – Mixture models can model redundancy and subtopics
• Completely automatic tuning of parameters is possible
• LMs can be applied to virtually any retrieval task, with great potential for modeling complex IR problems
Challenges and Future Directions

• Challenge 1: Establish a robust and effective LM that
  – Optimizes retrieval parameters automatically
  – Performs as well as or better than well-tuned traditional retrieval methods with pseudo feedback
  – Is as efficient as traditional retrieval methods
  Can LMs consistently (convincingly) outperform traditional methods without sacrificing efficiency?

• Challenge 2: Demonstrate consistent and substantial improvement by going beyond unigram LMs
  – Model limited dependency between terms
  – Derive more principled weighting methods for phrases
  Can we do much better by going beyond unigram LMs?
Challenges and Future Directions (cont.)

• Challenge 3: Develop LMs that can support “life-time learning”
  – Develop LMs that can improve accuracy for a current query through learning from past relevance judgments
  – Support collaborative information retrieval
  How can we learn effectively from past relevance judgments?

• Challenge 4: Develop LMs that can model document structures and subtopics
  – Recognize query-specific boundaries of relevant passages
  – Passage-based/subtopic-based feedback
  – Combine different structural components of a document
  How can we break the document unit in a principled way?
Challenges and Future Directions (cont.)

• Challenge 5: Develop LMs to support personalized search
  – Infer and track a user’s interests with LMs
  – Incorporate the user’s preferences and search context in retrieval
  – Customize/organize search results according to the user’s interests
  How can we exploit user information and search context to improve search?

• Challenge 6: Generalize LMs to handle relational data
  – Develop LMs for semi-structured data (e.g., XML)
  – Develop LMs to handle structured queries
  – Develop LMs for keyword search in relational databases
  What role can LMs play when combining text with relational data?
Challenges and Future Directions (cont.)

• Challenge 7: Develop LMs for hypertext retrieval
  – Combine LMs with link information
  – Model and exploit anchor text
  – Develop a unified LM for hypertext search
  How can we develop an effective unified retrieval model for Web search?

• Challenge 8: Develop LMs for retrieval with complex information needs, e.g.,
  – Subtopic retrieval
  – Readability-constrained retrieval
  – Entity retrieval (e.g., expert search)
  How can we exploit LMs to develop models for complex retrieval tasks?
Lectures 2 & 3: Key Points
• Statistical language models represent a new generation of probabilistic models for retrieval
  – Better connect IR with statistics (estimation)
  – Better connect search with machine learning (unsupervised, semi-supervised learning)
  – Achieve good empirical performance
  – Can model a variety of special retrieval problems
• Performance-wise, they haven’t yet convincingly outperformed traditional TF-IDF models
References
[Agichtein & Cucerzan 05] E. Agichtein and S. Cucerzan, Predicting accuracy of extracting information from
unstructured text collections, Proceedings of ACM CIKM 2005. pages 413-420.
[Baeza-Yates & Ribeiro-Neto 99] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley,
1999.
[Bai et al. 05] Jing Bai, Dawei Song, Peter Bruza, Jian-Yun Nie, Guihong Cao, Query expansion using term relationships
in language models for information retrieval, Proceedings of ACM CIKM 2005, pages 688-695.
[Balog et al. 06] K. Balog, L. Azzopardi, M. de Rijke, Formal models for expert finding in enterprise corpora, Proceedings
of ACM SIGIR 2006, pages 43-50.
[Berger & Lafferty 99] A. Berger and J. Lafferty. Information retrieval as statistical translation. Proceedings of the ACM
SIGIR 1999, pages 222-229.
[Berger 01] A. Berger. Statistical machine learning for information retrieval. Ph.D. dissertation, Carnegie Mellon
University, 2001.
[Blei et al. 02] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. In T G Dietterich, S. Becker, and Z. Ghahramani,
editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
[Cao et al. 05] Guihong Cao, Jian-Yun Nie, Jing Bai, Integrating word relationships into language models, Proceedings of
ACM SIGIR 2005, Pages: 298 - 305.
[Carbonell and Goldstein 98] J. Carbonell and J. Goldstein, The use of MMR, diversity-based reranking for reordering
documents and producing summaries. In Proceedings of SIGIR'98, pages 335--336.
[Chen & Goodman 98] S. F. Chen and J. T. Goodman. An empirical study of smoothing techniques for language
modeling. Technical Report TR-10-98, Harvard University.
[Collins-Thompson & Callan 05] K. Collins-Thompson and J. Callan, Query expansing using random walk models,
Proceedings of ACM CIKM 2005, pages 704-711.
[Cronen-Townsend et al. 02] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. Predicting query performance. In
Proceedings of the ACM Conference on Research in Information Retrieval (SIGIR), 2002.
[Croft & Lafferty 03] W. B. Croft and J. Lafferty (ed), Language Modeling and Information Retrieval. Kluwer Academic
Publishers. 2003.
[Fang et al. 04] H. Fang, T. Tao and C. Zhai, A formal study of information retrieval heuristics, Proceedings of ACM
SIGIR 2004. pages 49-56.
References (cont.)
[Fang & Zhai 07] H. Fang and C. Zhai, Probabilistic models for expert finding, Proceedings of ECIR 2007.
[Fox 83] E. Fox. Expending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and
Multiple Concept Types. PhD thesis, Cornell University. 1983.
[Fuhr 01] N. Fuhr. Language models and uncertain inference in information retrieval. In Proceedings of the Language
Modeling and IR workshop, pages 6--11.
[Gao et al. 04] J. Gao, J. Nie, G. Wu, and G. Cao, Dependence language model for information retrieval, In Proceedings of
ACM SIGIR 2004.
[Good 53] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3
and 4):237--264, 1953.
[Greiff & Morgan 03] W. Greiff and W. Morgan, Contributions of Language Modeling to the Theory and Practice of IR, In W.
B. Croft and J. Lafferty (eds), Language Modeling for Information Retrieval, Kluwer Academic Pub. 2003.
[Grossman & Frieder 04] D. Grossman and O. Frieder, Information Retrieval: Algorithms and Heuristics, 2nd Ed, Springer,
2004.
[He & Ounis 05] Ben He and Iadh Ounis, A study of the Dirichlet priors for term frequency normalisation, Proceedings of
ACM SIGIR 2005, Pages 465 - 471
[Hiemstra & Kraaij 99] D. Hiemstra and W. Kraaij, Twenty-One at TREC-7: Ad-hoc and Cross-language track, In
Proceedings of the Seventh Text REtrieval Conference (TREC-7), 1999.
[Hiemstra 01] D. Hiemstra. Using Language Models for Information Retrieval. PhD dissertation, University of Twente,
Enschede, The Netherlands, January 2001.
[Hiemstra 02] D. Hiemstra. Term-specific smoothing for the language modeling approach to information retrieval: the
importance of a query term. In Proceedings of ACM SIGIR 2002, 35-41
[Hiemstra et al. 04] D. Hiemstra, S. Robertson, and H. Zaragoza. Parsimonious language models for information retrieval, In
Proceedings of ACM SIGIR 2004.
[Hofmann 99] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR 1999, pages 50-57.
[Jarvelin & Kekalainen 02] K. Jarvelin and J. Kekalainen, Cumulated gain-based evaluation of IR techniques, ACM TOIS, Vol. 20, No. 4, 422-446, 2002.
[Jelinek 98] F. Jelinek, Statistical Methods for Speech Recognition, Cambirdge: MIT Press, 1998.
[Jelinek & Mercer 80] F. Jelinek and R. L. Mercer. Interpolated estimation of markov source parameters from sparse data. In
E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. 1980. Amsterdam, North-Holland,.
References (cont.)
[Jeon et al. 03] J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval using Cross-media
Relevance Models, In Proceedings of ACM SIGIR 2003
[Jin et al. 02] R. Jin, A. Hauptmann, and C. Zhai, Title language models for information retrieval, In Proceedings of ACM SIGIR
2002.
[Kalt 96] T. Kalt. A new probabilistic model of text classification and retrieval. University of Massachusetts Technical Report TR9818, 1996.
[Katz 87] S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer.
IEEE Transactions on Acoustics, Speech and Signal Processing, volume ASSP-35:400--401.
[Kraaij et al. 02] W. Kraaij,T. Westerveld, D. Hiemstra: The Importance of Prior Probabilities for Entry Page Search.
Proceedings of SIGIR 2002, pp. 27-34
[Kraaij 04] W. Kraaij. Variations on Language Modeling for Information Retrieval, Ph.D. thesis, University of Twente, 2004,
[Kurland & Lee 04] O. Kurland and L. Lee. Corpus structure, language models, and ad hoc information retrieval. In
Proceedings of ACM SIGIR 2004.
[Kurland et al. 05] Oren Kurland, Lillian Lee, Carmel Domshlak, Better than the real thing?: iterative pseudo-query processing
using cluster-based language models, Proceedings of ACM SIGIR 2005. pages 19-26.
[Kurland & Lee 05] Oren Kurland and Lillian Lee, PageRank without hyperlinks: structural re-ranking using links induced by
language models, Proceedings of ACM SIGIR 2005. pages 306-313.
[Lafferty and Zhai 01a] J. Lafferty and C. Zhai, Probabilistic IR models based on query and document generation. In
Proceedings of the Language Modeling and IR workshop, pages 1--5.
[Lafferty & Zhai 01b] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information
retrieval. In Proceedings of the ACM SIGIR 2001, pages 111-119.
[Lavrenko & Croft 01] V. Lavrenko and W. B. Croft. Relevance-based language models. In Proceedings of the ACM SIGIR
2001, pages 120-127.
[Lavrenko et al. 02] V. Lavrenko, M. Choquette, and W. Croft. Cross-lingual relevance models. In Proceedings of SIGIR 2002,
pages 175-182.
[Lavrenko 04] V. Lavrenko, A generative theory of relevance. Ph.D. thesis, University of Massachusetts. 2004.
[Li & Croft 03] X. Li, and W.B. Croft, Time-Based Language Models, In Proceedings of CIKM'03, 2003
[Liu & Croft 02] X. Liu and W. B. Croft. Passage retrieval based on language models. In Proceedings of CIKM 2002, pages 15-19.
References (cont.)
[Liu & Croft 04] X. Liu and W. B. Croft. Cluster-based retrieval using language models. In Proceedings of ACM SIGIR
2004.
[MacKay & Peto 95] D. MacKay and L. Peto. (1995). A hierarchical Dirichlet language model. Natural Language
Engineering, 1(3):289--307.
[Maron & Kuhns 60] M. E. Maron and J. L. Kuhns, On relevance, probabilistic indexing and information retrieval. Journal
of the ACM, 7:216--244.
[McCallum & Nigam 98] A. McCallum and K. Nigam (1998). A comparison of event models for Naïve Bayes text
classification. In AAAI-1998 Learning for Text Categorization Workshop, pages 41--48.
[Miller et al. 99] D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In
Proceedings of ACM-SIGIR 1999, pages 214-221.
[Minka & Lafferty 03] T. Minka and J. Lafferty, Expectation-propagation for the generative aspect model, In Proceedings
of the UAI 2002, pages 352--359.
[Nallapati & Allan 02] R. Nallapati and J. Allan, Capturing term dependencies using a language model based on sentence trees. In Proceedings of CIKM 2002, pages 383-390.
[Nallapati et al. 03] R. Nallapati, W. B. Croft, and J. Allan, Relevant query feedback in statistical language modeling, In Proceedings of CIKM 2003.
[Ney et al. 94] H. Ney, U. Essen, and R. Kneser. On Structuring Probabilistic Dependencies in Stochastic Language
Modeling. Comput. Speech and Lang., 8(1), 1-28.
[Ng 00] K. Ng. A maximum likelihood ratio information retrieval model. In Voorhees, E. and Harman, D., editors,
Proceedings of the Eighth Text REtrieval Conference (TREC-8), pages 483--492. 2000.
[Ogilvie & Callan 03] P. Ogilvie and J. Callan Combining Document Representations for Known Item Search. In
Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR 2003), pp. 143-150
[Ponte & Croft 98] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings
of ACM-SIGIR 1998, pages 275-281.
[Ponte 98] J. M. Ponte. A language modeling approach to information retrieval. Phd dissertation, University of
Massachusets, Amherst, MA, September 1998.
References (cont.)
[Ponte 01] J. Ponte. Is information retrieval anything more than smoothing? In Proceedings of the Workshop on
Language Modeling and Information Retrieval, pages 37-41, 2001.
[Robertson & Sparck Jones 76] S. Robertson and K. Sparck Jones (1976). Relevance Weighting of Search Terms.
JASIS, 27, 129-146.
[Robertson 77] S. E. Robertson. The probability ranking principle in IR. Journal of Documentation, 33:294-304, 1977.
[Robertson & Walker 94] S. E. Robertson and S. Walker, Some simple effective approximations to the 2-Poisson model
for probabilistic weighted retrieval. Proceedings of ACM SIGIR 1994. pages 232-241. 1994.
[Rosenfeld 00] R. Rosenfeld, Two decades of statistical language modeling: where do we go from here? In Proceedings
of IEEE, volume 88.
[Salton et al. 75] G. Salton, A. Wong and C. S. Yang, A vector space model for automatic indexing. Communications of
the ACM, 18(11):613--620.
[Salton & Buckley 88] G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval, Information
Processing and Management, 24(5), 513-523. 1988.
[Shannon 48] C. E. Shannon (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423, 623-656.
[Shen et al. 05] X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval with implicit feedback. In
Proceedings of ACM SIGIR 2005.
[Si et al. 02] L. Si , R. Jin, J. Callan and P.l Ogilvie. A Language Model Framework for Resource Selection and Results
Merging. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM)
. 2002
[Singhal et al. 96] A. Singhal, C. Buckley, and M. Mitra, Pivoted document length normalization, Proceedings of ACM
SIGIR 1996.
[Singhal 01] A. Singhal, Modern Information Retrieval: A Brief Overview. Amit Singhal. In IEEE Data Engineering Bulletin
24(4), pages 35-43, 2001.
[Song & Croft 99] F. Song and W. B. Croft. A general language model for information retrieval. In Proceedings of Eighth
International Conference on Information and Knowledge Management (CIKM 1999)
[Sparck Jones 72] K. Sparck Jones, A statistical interpretation of term specificity and its application in retrieval. Journal of
Documentation 28, 11-21, 1972 and 60, 493-502, 2004.
References (cont.)
[Sparck Jones et al. 00] K. Sparck Jones, S. Walker, and S. E. Robertson, A probabilistic model of information retrieval:
development and comparative experiments - part 1 and part 2. Information Processing and Management,
36(6):779--808 and 809--840.
[Sparck Jones et al. 03] K. Sparck Jones, S. Robertson, D. Hiemstra, H. Zaragoza, Language Modeling and
Relevance, In W. B. Croft and J. Lafferty (eds), Language Modeling for Information Retrieval, Kluwer Academic
Pub. 2003.
[Srikanth & Srihari 03] M. Srikanth, R. K. Srihari. Exploiting Syntactic Structure of Queries in a Language Modeling
Approach to IR. in Proceedings of Conference on Information and Knowledge Management(CIKM'03).
[Srikanth 04] M. Srikanth. Exploiting query features in language modeling approach for information retrieval. Ph.D.
dissertation, State University of New York at Buffalo, 2004.
[Tan et al. 06] Bin Tan, Xuehua Shen, and ChengXiang Zhai,, Mining long-term search history to improve search
accuracy, Proceedings of ACM KDD 2006.
[Tao et al. 06] Tao Tao, Xuanhui Wang, Qiaozhu Mei, and ChengXiang Zhai, Language model information retrieval with
document expansion, Proceedings of HLT/NAACL 2006.
[Tao & Zhai 06] Tao Tao and ChengXiang Zhai, Regularized estimation of mixture models for robust pseudo-relevance
feedback. Proceedings of ACM SIGIR 2006.
[Turtle & Croft 91]H. Turtle and W. B. Croft, Evaluation of an inference network-based retrieval model. ACM
Transactions on Information Systems, 9(3):187--222.
[van Rijsbergen 86] C. J. van Rijsbergen. A non-classical logic for information retrieval. The Computer Journal, 29(6).
[Witten et al. 99] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes - Compressing and Indexing Documents and
Images. Academic Press, San Diego, 2nd edition, 1999.
[Wong & Yao 89] S. K. M. Wong and Y. Y. Yao, A probability distribution model for information retrieval. Information
Processing and Management, 25(1):39--53.
[Wong & Yao 95] S. K. M. Wong and Y. Y. Yao. On modeling information retrieval with probabilistic inference. ACM
Transactions on Information Systems, 13(1):69--99.
[Xu & Croft 99] J. Xu and W. B. Croft. Cluster-based language models for distributed retrieval. In Proceedings of the
ACM SIGIR 1999, pages 15-19,
[Xu et al. 01] J. Xu, R. Weischedel, and C. Nguyen. Evaluating a probabilistic model for cross-lingual information
retrieval. In Proceedings of the ACM-SIGIR 2001, pages 105-110.
References (cont.)
[Zaragoza et al. 03] Hugo Zaragoza, D. Hiemstra and M. Tipping, Bayesian extension to the language model for ad hoc
information retrieval. In Proceedings of SIGIR 2003: 4-9.
[Zhai & Lafferty 01a] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc
information retrieval. In Proceedings of the ACM-SIGIR 2001, pages 334-342.
[Zhai & Lafferty 01b] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information
retrieval, In Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM
2001).
[Zhai & Lafferty 02] C. Zhai and J. Lafferty. Two-stage language models for information retrieval. In Proceedings of the
ACM-SIGIR 2002, pages 49-56.
[Zhai et al. 03] C. Zhai, W. Cohen, and J. Lafferty, Beyond Independent Relevance: Methods and Evaluation Metrics for
Subtopic Retrieval, In Proceedings of ACM SIGIR 2003.
[Zhai & Lafferty 06] C. Zhai and J. Lafferty, A risk minimization framework for information retrieval, Information
Processing and Management, 42(1), Jan. 2006, pages 31-55.
[Zhai 02] C. Zhai, Language Modeling and Risk Minimization in Text Retrieval, Ph.D. thesis, Carnegie Mellon University,
2002.
[Zhang et al. 02] Y. Zhang , J. Callan, and Thomas P. Minka, Novelty and redundancy detection in adaptive filtering. In
Proceedings of SIGIR 2002, 81-88
Discussion

• Generative models for text vs. generative models for images/video
• Query model in multimedia retrieval:
  – Independent models for different media vs. joint models
  – How to learn such a query model using “multimedia feedback”?
    • Learn a text model from image feedback?
    • Learn an image model from text feedback?
• Special retrieval tasks for multimedia
  – Entity retrieval
  – Video summarization
  – Cross-language image search
  – …