Probabilistic Model (SM)

Probabilistic Information Retrieval
[ ProbIR ]
Suman K Mitra
DAIICT, Gandhinagar
Acknowledgment
Alexander Dekhtyar, University of Maryland
Mandar Mitra, ISI, Kolkata
Prasenjit Majumder, DAIICT, Gandhinagar
Why use Probabilities?
•Information Retrieval deals with uncertain
information
•Probability is a measure of uncertainty
•Probabilistic Ranking Principle
•provable
•minimization of risk
•Probabilistic Inference
•To justify your decision
1. How good is the representation?
2. How exact is the representation?
3. How well is the query matched?
4. How relevant is the result to the query?
[Figure: Basic IR System. A Document Collection is mapped to a Document Representation and a Query to a Query Representation; the labels 1 to 4 mark where the four questions above apply.]
Approaches and main Contributors
Probability Ranking Principle – Robertson 1970 onwards
Information Retrieval as Probabilistic Inference – Van Rijsbergen et al. 1970 onwards
Probabilistic Indexing – Fuhr et al. 1980 onwards
Bayesian Nets in Information Retrieval – Turtle, Croft 1990 onwards
Probabilistic Logic Programming in Information Retrieval – Fuhr et al. 1990 onwards
Probability Ranking Principle
1. Collection of documents
2. Representation of documents
3. User uses a query
4. Representation of query
5. A set of documents to return
Question: In what order should the documents be presented to the user?
Logically: Best document first, then the next best, and so on.
Requirement: A formal way to judge the goodness of documents with respect to the query.
Possibility: Probability of relevance of the document with respect to the query.
Probability Ranking Principle
If a retrieval system’s response to each request is a
ranking of the documents in the collection in order of
decreasing probability of goodness to the user who
submitted the request ...
… where the probabilities are estimated as accurately as
possible on the basis of whatever data has been made available to
the system for this purpose ...
… then the overall effectiveness of the system to its users
will be the best that is obtainable on the basis of that data.
W. S. Cooper
Probability Basics
Bayes’ Rule
Let a and b be two events.
$$p(a \mid b)\, p(b) = p(a \cap b) = p(b \mid a)\, p(a)$$
$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$
The odds of an event a are defined as
$$O(a) = \frac{p(a)}{p(\bar{a})} = \frac{p(a)}{1 - p(a)}$$
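As a quick numeric check of Bayes' rule and the odds definition, here is a minimal Python sketch. The probabilities used (p(a) = 0.2, p(b|a) = 0.9, p(b|not a) = 0.3) are made-up illustration values, and the function names are hypothetical.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return p(a|b) = p(b|a) p(a) / p(b)."""
    if p_b <= 0:
        raise ValueError("p(b) must be positive")
    return p_b_given_a * p_a / p_b


def odds(p: float) -> float:
    """Return O(a) = p(a) / (1 - p(a)); infinite when p(a) = 1."""
    if p >= 1:
        return float("inf")
    return p / (1.0 - p)


# Illustration values only (not from the slides).
p_a = 0.2
p_b = 0.9 * p_a + 0.3 * (1 - p_a)  # total probability of b
posterior = bayes_posterior(0.9, p_a, p_b)

print(round(posterior, 4), round(odds(posterior), 4))
```

Odds are a monotone function of the probability, so ranking by odds and ranking by probability give the same order.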
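Conditional probability satisfies all axioms of probability
(i) $0 \le p(a \mid b) \le 1$
(ii) $p(S \mid b) = 1$
(iii) If the $a_i$ are mutually exclusive events, then
$$p\Big(\bigcup_{i=1} a_i \,\Big|\, b\Big) = \sum_{i=1} p(a_i \mid b)$$
$$p(a \mid b) = \frac{p(ab)}{p(b)}, \quad p(ab) \ge 0, \; p(b) > 0 \;\Rightarrow\; p(a \mid b) \ge 0$$
$$ab \subseteq b \;\Rightarrow\; p(ab) \le p(b) \;\Rightarrow\; \frac{p(ab)}{p(b)} \le 1 \;\Rightarrow\; p(a \mid b) \le 1$$
Hence (i)
$$p(S \mid b) = \frac{p(Sb)}{p(b)} = \frac{p(b)}{p(b)} = 1$$
Hence (ii)
[Figure: Venn diagram of an event b overlapping mutually exclusive events a_1, a_2, a_3, a_4.]
$$p\Big(\bigcup_{i=1} a_i \,\Big|\, b\Big) = \frac{p\big((\bigcup_{i=1} a_i)\, b\big)}{p(b)} = \frac{p\big(\bigcup_{i=1} a_i b\big)}{p(b)} = \frac{\sum_{i=1} p(a_i b)}{p(b)} = \sum_{i=1} p(a_i \mid b)$$
[the $a_i b$'s are all mutually exclusive]
Hence (iii)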
Probability Ranking Principle
Let x be a document in the collection.
Let R represent relevance of a document w.r.t. given (fixed) query
and let NR represent non-relevance.
p(R|x) - probability that a retrieved document x is relevant.
p(NR|x) - probability that a retrieved document x is non-relevant.
$$p(R \mid x) = \frac{p(x \mid R)\, p(R)}{p(x)}$$
$$p(NR \mid x) = \frac{p(x \mid NR)\, p(NR)}{p(x)}$$
p(R),p(NR) - prior probability of
retrieving a relevant and nonrelevant document respectively
p(x|R), p(x|NR) - probability that if a relevant (non-relevant)
document is retrieved, it is x.
Probability Ranking Principle
$$p(R \mid x) = \frac{p(x \mid R)\, p(R)}{p(x)}$$
$$p(NR \mid x) = \frac{p(x \mid NR)\, p(NR)}{p(x)}$$
Ranking Principle (Bayes’ Decision Rule):
If p(R|x) > p(NR|x) then x is relevant,
otherwise x is not relevant
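The decision rule can be sketched in a few lines of Python. This is only an illustration: the likelihoods and the prior passed in below are invented numbers, and `posteriors` and `is_relevant` are hypothetical helper names.

```python
def posteriors(p_x_given_r, p_x_given_nr, p_r):
    """Return (p(R|x), p(NR|x)) from the likelihoods and the prior p(R)."""
    p_nr = 1.0 - p_r
    p_x = p_x_given_r * p_r + p_x_given_nr * p_nr  # total probability of x
    if p_x == 0:
        raise ValueError("p(x) must be positive")
    return p_x_given_r * p_r / p_x, p_x_given_nr * p_nr / p_x


def is_relevant(p_x_given_r, p_x_given_nr, p_r):
    """Bayes' decision rule: x is relevant iff p(R|x) > p(NR|x)."""
    p_rel, p_nonrel = posteriors(p_x_given_r, p_x_given_nr, p_r)
    return p_rel > p_nonrel


# Invented numbers: a rare relevant class, p(R) = 0.1.
print(is_relevant(0.05, 0.01, 0.1))  # numerators 0.005 vs 0.009
print(is_relevant(0.20, 0.01, 0.1))  # numerators 0.020 vs 0.009
```

Because both posteriors share the denominator p(x), the comparison only depends on p(x|R) p(R) against p(x|NR) p(NR).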
Probability Ranking Principle
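Does the PRP minimize the average probability of error?
[Table: actual class (x is R, x is NR) against decision (x is R, x is NR); a decision is an error when it differs from the actual class.]
$$p(\text{error} \mid x) = \begin{cases} p(R \mid x) & \text{if we decide NR} \\ p(NR \mid x) & \text{if we decide R} \end{cases}$$
$$p(\text{error}) = \sum_{x} p(\text{error} \mid x)\, p(x)$$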
p(error) is minimal when all p(error|x) are minimal.
Bayes’ decision rule minimizes each p(error|x).
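$$p(\text{error}) = \sum_{x} p(\text{error} \mid x)\, p(x)$$
x is either in R or NR, so
$$p(\text{error}) = \sum_{x \in NR} p(R \mid x)\, p(x) + \sum_{x \in R} p(NR \mid x)\, p(x)$$
$$= \sum_{x \in NR} p(R \mid x)\, p(x) + \sum_{x \in R} p(NR \mid x)\, p(x) + \sum_{x \in NR} p(NR \mid x)\, p(x) - \sum_{x \in NR} p(NR \mid x)\, p(x)$$
$$= \sum_{x \in NR} \{p(R \mid x) - p(NR \mid x)\}\, p(x) + \underbrace{\sum_{x} p(NR \mid x)\, p(x)}_{\text{Constant}}$$
Similarly,
$$p(\text{error}) = \sum_{x \in R} \{p(NR \mid x) - p(R \mid x)\}\, p(x) + \text{Constant}$$
Minimization of p(error) is equivalent to minimization of 2p(error):
$$\min \Big[ \sum_{x \in R} \{p(NR \mid x) - p(R \mid x)\}\, p(x) + \sum_{x \in NR} \{p(R \mid x) - p(NR \mid x)\}\, p(x) \Big]$$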
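p(x) is non-negative and does not depend on the decision, so the sum is minimized term by term.
Define
$$S_1 = \{x : p(R \mid x) - p(NR \mid x) > 0\}$$
$$S_2 = \{x : p(R \mid x) - p(NR \mid x) \le 0\}$$
S1 and S2 are exactly the sets of x that Bayes' decision rule assigns to R
and NR respectively
Hence
S1 and S2: the decision minimizes p(error)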
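The argument above can be checked by brute force on a toy example: enumerate every possible assignment of documents to R or NR and confirm that the Bayes decision has the smallest p(error). The four p(x) and p(R|x) values below are invented for illustration, and the names are hypothetical.

```python
from itertools import product

# Invented toy collection: p(x) and p(R|x) for four document descriptions.
p_x = [0.1, 0.2, 0.3, 0.4]
p_r_given_x = [0.9, 0.6, 0.3, 0.1]


def expected_error(decisions):
    """p(error) = sum_x p(error|x) p(x); decisions[i] is True when x_i is put in R."""
    total = 0.0
    for decide_r, px, pr in zip(decisions, p_x, p_r_given_x):
        # Deciding R is wrong with probability p(NR|x); deciding NR with p(R|x).
        total += ((1.0 - pr) if decide_r else pr) * px
    return total


# Bayes' decision rule: put x in R iff p(R|x) > p(NR|x).
bayes = tuple(pr > 1.0 - pr for pr in p_r_given_x)

# Exhaustive search over all 2^4 possible decisions.
best = min(product([True, False], repeat=len(p_x)), key=expected_error)

print(bayes == best, round(expected_error(bayes), 4))
```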
Probability Ranking Principle
Issues
• How do we compute all those probabilities?
– Cannot compute exact probabilities, have to use
estimates from the (ground truth) data.
(Binary Independence Retrieval)
(Bayesian Networks??)
Assumptions
– “Relevance” of each document is independent of
relevance of other documents.
– Most applications are for Boolean model.
Probability Ranking Principle
[Table: actual class (X is R, X is NR) against decision (X is R, X is NR).]
• Simple case: no selection costs.
• x is relevant iff p(R|x) > p(NR|x)
(Bayes’ Decision Rule)
• PRP: Rank all documents by p(R|x).
Probability Ranking Principle
[Table: actual class (X is R, X is NR) against decision (X is R, X is NR).]
• More complex case: retrieval costs.
– C - cost of retrieval of relevant document
– C’ - cost of retrieval of non-relevant document
– let d be a document
• Probability Ranking Principle: if
$$C \cdot p(R \mid d) + C' \cdot (1 - p(R \mid d)) \le C \cdot p(R \mid d') + C' \cdot (1 - p(R \mid d'))$$
for all d’ not yet retrieved, then d is the next document to
be retrieved
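A minimal Python sketch of this selection step, assuming the probabilities p(R|d) are already available. The document identifiers, probabilities, and cost values are invented, and `expected_cost` and `next_document` are hypothetical names.

```python
def expected_cost(p_rel, cost_rel, cost_nonrel):
    """Expected retrieval cost: C * p(R|d) + C' * (1 - p(R|d))."""
    return cost_rel * p_rel + cost_nonrel * (1.0 - p_rel)


def next_document(p_rel_by_doc, cost_rel, cost_nonrel):
    """Return the not-yet-retrieved document with the smallest expected cost."""
    return min(
        p_rel_by_doc,
        key=lambda d: expected_cost(p_rel_by_doc[d], cost_rel, cost_nonrel),
    )


# Invented example: retrieving a relevant document costs 0, a non-relevant one costs 1.
remaining = {"d1": 0.2, "d2": 0.7, "d3": 0.5}
print(next_document(remaining, cost_rel=0.0, cost_nonrel=1.0))
```

Whenever C < C', the expected cost decreases as p(R|d) grows, so this selection coincides with ranking by p(R|d).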
Binary Independence Model
Binary Independence Model
• Traditionally used in conjunction with PRP
• “Binary” = Boolean: documents are represented
as binary vectors of terms:
– $\vec{x} = (x_1, \ldots, x_n)$
– $x_i = 1$ iff term i is present in document x.
• “Independence”: terms occur in documents
independently
• Different documents can be modeled as the same vector.
Binary Independence Model
• Queries: binary vectors of terms
• Given query q,
– for each document d, need to compute p(R|q,d).
– replace with computing p(R|q,x) where x is vector
representing d
• Interested only in ranking
• Use odds:
$$O(R \mid q, \vec{x}) = \frac{p(R \mid q, \vec{x})}{p(NR \mid q, \vec{x})} = \frac{p(R \mid q)}{p(NR \mid q)} \cdot \frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)}$$
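Binary Independence Model
$$O(R \mid q, \vec{x}) = \frac{p(R \mid q, \vec{x})}{p(NR \mid q, \vec{x})} = \frac{p(R \mid q)}{p(NR \mid q)} \cdot \frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)}$$
The first factor, $p(R \mid q) / p(NR \mid q)$, is constant for each query; the second factor needs estimation.
• Using Independence Assumption:
$$\frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)} = \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$
• So:
$$O(R \mid q, d) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$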
Binary Independence Model
$$O(R \mid q, d) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$
• Since $x_i$ is either 0 or 1:
$$O(R \mid q, d) = O(R \mid q) \cdot \prod_{x_i = 1} \frac{p(x_i = 1 \mid R, q)}{p(x_i = 1 \mid NR, q)} \cdot \prod_{x_i = 0} \frac{p(x_i = 0 \mid R, q)}{p(x_i = 0 \mid NR, q)}$$
• Let $p_i = p(x_i = 1 \mid R, q)$; $r_i = p(x_i = 1 \mid NR, q)$;
• Assume, for all terms not occurring in the query ($q_i = 0$), $p_i = r_i$
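Binary Independence Model
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i}{r_i}}_{\text{all matching terms}} \cdot \underbrace{\prod_{x_i = 0,\; q_i = 1} \frac{1 - p_i}{1 - r_i}}_{\text{non-matching query terms}}$$
$$= O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}}_{\text{all matching terms}} \cdot \underbrace{\prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}}_{\text{all query terms}}$$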
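Binary Independence Model
$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$
$O(R \mid q)$ and the product over all query terms are constant for each query; the product over the matching terms is the only quantity to be estimated for rankings.
• Retrieval Status Value:
$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$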
Binary Independence Model
• It all boils down to computing the RSV.
$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$
$$RSV = \sum_{x_i = q_i = 1} c_i; \qquad c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$
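To make the sum over matching terms concrete, here is a small Python sketch that computes the RSV when p_i and r_i are taken as given. The term names and probability values are invented for illustration, and `term_weight` and `rsv` are hypothetical helper names.

```python
import math


def term_weight(p_i, r_i):
    """c_i = log( p_i (1 - r_i) / (r_i (1 - p_i)) ); assumes 0 < p_i, r_i < 1."""
    return math.log(p_i * (1 - r_i) / (r_i * (1 - p_i)))


def rsv(query_terms, doc_terms, p, r):
    """Sum c_i over terms present in both the query and the document."""
    matching = set(query_terms) & set(doc_terms)
    return sum(term_weight(p[t], r[t]) for t in matching)


# Invented probabilities per term.
p = {"ir": 0.8, "model": 0.5, "cat": 0.3}
r = {"ir": 0.2, "model": 0.5, "cat": 0.3}
query = {"ir", "model"}

print(rsv(query, {"ir", "model", "cat"}, p, r))  # only "ir" and "model" contribute
print(rsv(query, {"model"}, p, r))               # p_i = r_i gives weight log(1) = 0
```

Terms with p_i = r_i contribute nothing, which matches the assumption made for terms that do not occur in the query.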
So, how do we compute ci’s from our data ?
Binary Independence Model
• Estimating RSV coefficients.
• For each term i look at the following table:
Documents   Relevant   Non-Relevant    Total
x_i = 1     s          n - s           n
x_i = 0     S - s      N - n - S + s   N - n
Total       S          N - S           N
• Estimates:
$$p_i \approx \frac{s}{S}, \qquad r_i \approx \frac{n - s}{N - S}$$
$$c_i \approx K(N, n, S, s) = \log \frac{s / (S - s)}{(n - s) / (N - n - S + s)}$$
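The estimates can be computed directly from the four counts in the table. The sketch below uses invented counts and a hypothetical function name; it assumes none of the table cells is zero, since otherwise the ratio is undefined.

```python
import math


def estimate_term_weight(N, n, S, s):
    """Return (p_i, r_i, c_i) estimated from the contingency-table counts.

    N: documents in the collection, n: documents containing the term,
    S: relevant documents, s: relevant documents containing the term.
    Assumes all four table cells are non-zero.
    """
    p_i = s / S
    r_i = (n - s) / (N - S)
    c_i = math.log(p_i * (1 - r_i) / (r_i * (1 - p_i)))
    return p_i, r_i, c_i


# Invented counts for illustration.
p_i, r_i, c_i = estimate_term_weight(N=1000, n=100, S=20, s=10)
closed_form = math.log((10 / (20 - 10)) / ((100 - 10) / (1000 - 100 - 20 + 10)))
print(p_i, round(r_i, 4), round(c_i, 4), round(closed_form, 4))
```

Substituting the estimates into c_i = log[p_i (1 - r_i) / (r_i (1 - p_i))] gives the same value as the closed form K(N, n, S, s).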
PRP and BIR: The lessons
• Getting reasonable approximations of
probabilities is possible.
• Simple methods work only with restrictive
assumptions:
– term independence
– terms not in query do not affect the outcome
– boolean representation of documents/queries
– document relevance values are independent
• Some of these assumptions can be removed
Probabilistic weighting scheme
$$\log \frac{s (N - n - S + s)}{(n - s)(S - s)}$$
Add 0.5 to each term:
$$\log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$
Without this correction, the log of the ratio of probabilities may go to positive or negative infinity when one of the counts is zero.
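A short Python sketch of the 0.5-corrected weight, showing that it stays finite when a count is zero. The counts are invented and `smoothed_weight` is a hypothetical name.

```python
import math


def smoothed_weight(N, n, S, s):
    """log of (s+0.5)(N-n-S+s+0.5) / ((n-s+0.5)(S-s+0.5))."""
    numerator = (s + 0.5) * (N - n - S + s + 0.5)
    denominator = (n - s + 0.5) * (S - s + 0.5)
    return math.log(numerator / denominator)


# s = 0 would make the uncorrected ratio zero (log -> -infinity).
print(math.isfinite(smoothed_weight(N=1000, n=100, S=20, s=0)))

# With no relevance information (S = s = 0) the weight depends only on N and n.
print(round(smoothed_weight(N=1000, n=100, S=0, s=0), 4))
```

With S = s = 0 the expression reduces to log((N - n + 0.5) / (n + 0.5)), a weight that shrinks as the term appears in more documents.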
Probabilistic weighting scheme [S.Robertson]
In general form, the weighting function is
$$\frac{(k_3 + 1)\, q}{k_3 + q} \cdot \frac{(k_1 + 1)\, f}{k_1 L + f} \cdot \log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$
k1 , k3 : are constants.
q : within query frequency (wqf),
f : within document frequency (wdf),
n : number of documents in the collection indexed by this term,
N : total number of documents in the collection,
s : number of relevant documents indexed by this term,
S : total number of relevant documents,
L : normalised document length (i.e. the length of this document
divided by the average length of documents in the collection).
Probabilistic weighting scheme [S.Robertson]
BM11
Stephen Robertson's BM11 uses the general form for
weights, but adds an extra item to the sum of term
weights to give the overall document score
$$k_2\, n_q\, \frac{1 - L}{1 + L}$$
This term is 0 when L=1
$n_q$ : number of terms in the query (the query length),
$k_2$ : another constant.
$$k_2\, n_q\, \frac{1 - L}{1 + L} + \sum_{\text{terms}} \frac{(k_3 + 1)\, q}{k_3 + q} \cdot \frac{(k_1 + 1)\, f}{k_1 L + f} \cdot \log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$
Probabilistic weighting scheme
BM15
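BM15 is the same as BM11 with the term $k_1 L + f$
replaced by $k_1 + f$
BM11:
$$k_2\, n_q\, \frac{1 - L}{1 + L} + \sum_{\text{terms}} \frac{(k_3 + 1)\, q}{k_3 + q} \cdot \frac{(k_1 + 1)\, f}{k_1 L + f} \cdot \log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$
BM15:
$$k_2\, n_q\, \frac{1 - L}{1 + L} + \sum_{\text{terms}} \frac{(k_3 + 1)\, q}{k_3 + q} \cdot \frac{(k_1 + 1)\, f}{k_1 + f} \cdot \log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$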
Probabilistic weighting scheme
BM25
BM25 combines BM11 and BM15 with a scaling factor, b,
which turns BM15 into BM11 as it moves from 0 to 1
$$\frac{(k_3 + 1)\, q}{k_3 + q} \cdot \frac{(k_1 + 1)\, f}{k + f} \cdot \log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$
$$k = k_1 (bL + (1 - b))$$
b = 1 gives BM11
b = 0 gives BM15
$k_2 = 0,\; b = 1$ gives the general form
Default values used: $k_1 = 1,\; k_2 = 0,\; k_3 = 1,\; b = 0.5$
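The per-term BM25 weight can be sketched in Python as below. This is an illustrative sketch, not a reference implementation: the function name is hypothetical, the example counts are invented, the extra k2 item is left out (k2 = 0, as in the defaults listed above), and S = s = 0 is assumed when no relevance information is available.

```python
import math


def bm25_term_weight(f, q, L, N, n, S=0, s=0, k1=1.0, k3=1.0, b=0.5):
    """BM25 weight of one term.

    f: within-document frequency, q: within-query frequency,
    L: normalised document length, N/n/S/s: counts as defined in the slides.
    """
    k = k1 * (b * L + (1 - b))           # b = 1 -> k1 * L (BM11), b = 0 -> k1 (BM15)
    query_part = (k3 + 1) * q / (k3 + q)
    doc_part = (k1 + 1) * f / (k + f)
    rel_part = math.log(
        ((s + 0.5) * (N - n - S + s + 0.5)) / ((n - s + 0.5) * (S - s + 0.5))
    )
    return query_part * doc_part * rel_part


# Invented example: a document twice the average length (L = 2).
for b in (0.0, 0.5, 1.0):
    print(b, round(bm25_term_weight(f=3, q=1, L=2.0, N=1000, n=100, b=b), 4))
```

For a document of average length (L = 1), k equals k1 for every b, so BM11, BM15, and BM25 give the same term weight.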
Bayesian Networks
1. Independence Assumption
$\vec{x} = (x_1, x_2, \ldots, x_n)$: the $x_i$ are independent
Can they be dependent?
2. Binary Assumption
$x_i$ = 0 or 1
Can it be 0, 1, ..., n?
A possible way could be Bayesian Networks
Bayesian Network (BN) Basics
A Bayesian Network (BN) is a hybrid system of probability theory and graph theory.
Graph theory gives the user an interface for modeling highly interacting
variables. Probability theory, on the other hand, ensures that the system as a whole is
consistent, and provides ways to interface models to data.
BN: Modeling of the JPD (joint probability distribution) in a compact way by using graph(s) to reflect conditional independence relationships
BN : Representation
Nodes: Random variables
Arcs: Conditional independence (or causality)
Undirected arcs: MRF
Directed arcs: BNs
BN : Advantages
•Arc from node A to B implies : A causes B
•Compact representation of JPD of nodes
•Easier to learn (Fit to data)
G = Global structure (DAG – Directed Acyclic Graph) that contains
• a node for each variable $x_i \in U$, i = 1, 2, ..., n
• edges that represent the probabilistic dependencies between nodes of variables $x_i \in U$
[Figure: example DAG over the nodes x1, x2, x3, x4.]
M = set of local structures {M1, M2, ..., Mn}: n mappings for n variables. $M_i$ maps
each value of {$x_i$, Par($x_i$)} to a parameter $\theta$. Here Par($x_i$) denotes the set of parent
nodes of $x_i$ in G.
The joint probability distribution over U can be decomposed by the global structure
G as
$$P(U \mid B) = \prod_i P(x_i \mid Par(x_i), \theta, M_i, G)$$
* By Kevin Murphy
BN : An Example
Nodes: Cloud (True/False), Sprinkler (On/Off), Rain (Yes/No), Wet Grass (Yes/No)

Cloud:
p(C=T)   p(C=F)
1/2      1/2

Rain:
C    p(R=Y)   p(R=N)
T    0.8      0.2
F    0.2      0.8

Sprinkler:
C    p(S=On)   p(S=Off)
T    0.1       0.9
F    0.5       0.5

Wet Grass:
S     R    p(W=Y)   p(W=N)
On    Y    0.99     0.01
On    N    0.9      0.1
Off   Y    0.9      0.1
Off   N    0.0      1.0

By the chain rule: p(C,S,R,W) = p(C) p(S|C) p(R|C,S) p(W|C,S,R)
Using the conditional independence relationships in the network: p(C,S,R,W) = p(C) p(S|C) p(R|C) p(W|S,R)
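The factorized joint distribution of this example can be evaluated by plain enumeration. The Python sketch below uses the conditional probability values from the example tables; the variable and function names are hypothetical, and the queries computed (the marginal p(W=Y) and the posterior p(S=On | W=Y)) are added only as an illustration.

```python
from itertools import product

# Conditional probability tables from the example (True = T / On / Yes).
p_c = {True: 0.5, False: 0.5}
p_s_on = {True: 0.1, False: 0.5}    # p(S=On | C)
p_r_yes = {True: 0.8, False: 0.2}   # p(R=Y | C)
p_w_yes = {                         # p(W=Y | S, R)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.9,
    (False, False): 0.0,
}


def bern(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(c, s, r, w):
    """p(C,S,R,W) = p(C) p(S|C) p(R|C) p(W|S,R)."""
    return p_c[c] * bern(p_s_on[c], s) * bern(p_r_yes[c], r) * bern(p_w_yes[(s, r)], w)


states = [True, False]
total = sum(joint(*v) for v in product(states, repeat=4))
p_wet = sum(joint(c, s, r, True) for c, s, r in product(states, repeat=3))
p_sprinkler_and_wet = sum(joint(c, True, r, True) for c, r in product(states, repeat=2))

print(round(total, 6), round(p_wet, 4), round(p_sprinkler_and_wet / p_wet, 4))
```

Enumeration is exponential in the number of variables; it is used here only because the example has four binary nodes.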
BN : As Model
•Hybrid system of probability theory and graph theory
•Graph theory gives the user an interface for modeling
highly interacting variables
•Probability theory provides ways to interface the model to data
BN : Notations
$B = (B_s, \theta)$
$B_s$ : Structure of the Network
$\theta$ : Set of parameters that encode local prob. dist.
Bs again has two parts G and M
G is the global structure (DAG)
M is the mapping for n variables (arcs) {xi , par( xi )}
BN : Learning
Structure ($B_s$) : DAG, either known or unknown
Parameter ($\theta$) : CPD, either known or unknown
Parameters can only be learnt if the structure is either
known or has been learnt earlier
BN : Learning
Structure is known and full data is observed (nothing missing)
Parameter learning: MLE, MAP
Structure is known and full data is NOT observed (data missing)
Parameter learning: EM
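For the fully observed case, maximum likelihood estimation of a conditional probability table reduces to counting. The sketch below is a minimal illustration with invented samples and a hypothetical function name; it does not cover MAP estimation or the EM case with missing data.

```python
from collections import Counter


def mle_cpt(samples, child, parents):
    """Estimate p(child | parents) by relative frequencies.

    samples: list of dicts mapping variable name -> observed value.
    Returns {(parent_values, child_value): probability} for observed combinations.
    """
    joint_counts = Counter()
    parent_counts = Counter()
    for row in samples:
        parent_values = tuple(row[p] for p in parents)
        joint_counts[(parent_values, row[child])] += 1
        parent_counts[parent_values] += 1
    return {
        key: count / parent_counts[key[0]]
        for key, count in joint_counts.items()
    }


# Invented, fully observed samples of two variables C and R.
data = [
    {"C": True, "R": True},
    {"C": True, "R": True},
    {"C": True, "R": False},
    {"C": False, "R": False},
]
cpt = mle_cpt(data, child="R", parents=["C"])
print(cpt)
```

Parent configurations that never occur in the data get no entry at all, which is one reason a prior (MAP) is often preferred on small samples.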
BN : Learning
Learning Structure
•To find the best DAG that fits the data
Objective Function:
$$p(B_s \mid D) = \frac{p(D \mid B_s)\, p(B_s)}{p(D)}$$
$$\log p(B_s \mid D) = \log p(D \mid B_s) + \log p(B_s) - \log p(D)$$
$\log p(D)$ is constant and independent of $B_s$
Search Algorithm : NP hard
Performance criteria used : MIC, BIC
References (basics)
1. S. E. Robertson, “The Probability Ranking Principle in IR”, Journal of Documentation, 33,
294-304, 1977.
2. K. S. Jones, S. Walker and S. E. Robertson, “A Probabilistic model for information
retrieval: development and comparative experiments-Part-1”, Information Processing and
Management, 36, 779-808, 2000.
3. K. S. Jones, S. Walker and S. E. Robertson, “A Probabilistic model for information
retrieval: development and comparative experiments-Part-2”, Information Processing and
Management, 36, 809-840, 2000.
4. S. E. Robertson and H. Zaragoza, “The Probabilistic relevance framework: BM25 and
beyond”, Foundations and Trends in Information Retrieval, 3, 333-389, 2009.
5. S. E. Robertson, C. J. Van Rijsbergen and M. F. Porter, “Probabilistic models of indexing
and searching”, Information Retrieval Research, Oddy et al. (Eds.), 36, 35-56, 1981.
6. N. Fuhr and C. Buckley, “A probabilistic learning approach for document indexing”, ACM
Trans. on Information Systems, 9, 223-248, 1991.
7. H. R. Turtle and W. B. Croft, “Evaluation of an inference network based retrieval model”,
ACM Trans. on Information Systems, 9, 187-222, 1991.
Thank You
Discussions