Transcript PowerPoint

CS 430 / INFO 430
Information Retrieval
Lecture 12
Probabilistic Information Retrieval
1
Course Administration
Discussion Class: Lucene
The Web site lists four questions to think about as
you prepare for this discussion class. Here is
another:
Suppose that you are unhappy with the ranking of
results provided by Lucene. What can you do
about it?
2
Course Administration
Midterm Examination
Wednesday, October 12, 7:30 to 9:00, Upson B17
The topics to be examined are all lectures and
discussion class readings before the midterm break.
See the Web site for a sample paper from a previous
year.
See the Web site for instructions about laptop
computers.
3
Course Administration
Discussion Class on October 19
This class will be held in Phillips Hall 213
4
Three Approaches to Information
Retrieval
Many authors divide the methods of information retrieval into
three categories:
Boolean (based on set theory)
Vector space (based on linear algebra)
Probabilistic (based on Bayesian statistics)
In practice, the latter two have considerable overlap.
5
Probability revision: independent random
variables and conditional probability
Let a, b be two events, with probabilities P(a) and P(b).
Independent events
The events a and b are independent if and only if:
P(a ∩ b) = P(a) P(b)
Conditional probability
P(a | b) is the probability of a given b, also called the conditional
probability of a given b.
Conditional independence
The events a1, ..., an are conditionally independent given b if and only if:
P(ai | b ∩ aj) = P(ai | b) for all i ≠ j
6
Example: independent random
variables and conditional probability
Independent
a and b are the results of throwing two dice
P(a=5 | b=3) = P(a=5) = 1/6
Not independent
a and b are the results of throwing two dice
t=a+b
P(t=8 | a=2) = 1/6
P(t=8 | a=1) = 0
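The dice probabilities above can be checked by counting outcomes. The following is a minimal sketch, assuming two fair six-sided dice; the helper `prob` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (a, b) of throwing two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event, given=lambda a, b: True):
    """P(event | given), computed by counting equally likely outcomes."""
    conditioned = [(a, b) for a, b in outcomes if given(a, b)]
    matching = [(a, b) for a, b in conditioned if event(a, b)]
    return Fraction(len(matching), len(conditioned))

# Independent: knowing b tells us nothing about a.
p_a5 = prob(lambda a, b: a == 5)
p_a5_given_b3 = prob(lambda a, b: a == 5, given=lambda a, b: b == 3)

# Not independent: the total t = a + b depends on a.
p_t8 = prob(lambda a, b: a + b == 8)
p_t8_given_a2 = prob(lambda a, b: a + b == 8, given=lambda a, b: a == 2)
p_t8_given_a1 = prob(lambda a, b: a + b == 8, given=lambda a, b: a == 1)

print(p_a5, p_a5_given_b3)                 # 1/6 1/6
print(p_t8, p_t8_given_a2, p_t8_given_a1)  # 5/36 1/6 0
```

Conditioning on b leaves the probability of a=5 unchanged, while conditioning on a changes the probability of t=8 from its unconditional value of 5/36.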
7
Probability: Conditional probability
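[Venn diagram: two overlapping events a and b. Region probabilities: x = a and b, y = a only, w = b only, z = neither.]
P(a) = x + y
P(b) = w + x
P(a | b) = x / (w + x)
P(a | b) P(b) = P(a ∩ b) = P(b | a) P(a)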
8
Probability Theory -- Bayes Theorem
Notation
Let a, b be two events.
P(a | b) is the probability of a given b
Bayes Theorem
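$$P(a \mid b) = \frac{P(b \mid a)\,P(a)}{P(b)}$$
$$P(\bar{a} \mid b) = \frac{P(b \mid \bar{a})\,P(\bar{a})}{P(b)} \quad \text{where } \bar{a} \text{ is the event not } a$$
Derivation
$$P(a \mid b)\,P(b) = P(a \cap b) = P(b \mid a)\,P(a)$$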
9
Probability Theory -- Bayes Theorem
Terminology used with Bayes Theorem
$$P(a \mid b) = \frac{P(b \mid a)\,P(a)}{P(b)}$$
P(a) is called the prior probability of a
P(a | b) is called the posterior probability
of a given b
10
Example of Bayes Theorem
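Example
a: Weight over 200 lb.
b: Height over 6 ft.
P(a | b) = x / (w + x) = x / P(b)
P(b | a) = x / (x + y) = x / P(a)
x is P(a ∩ b)
[Venn diagram: two overlapping events a and b. Region probabilities: x = a and b, y = a only, w = b only, z = neither.]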
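As a numerical check of Bayes Theorem on this example, the sketch below assigns made-up probabilities to the four regions x, y, w, z. The values are assumptions chosen only for illustration, not data from the lecture.

```python
from fractions import Fraction

# Hypothetical region probabilities (assumed values; they must sum to 1).
x = Fraction(10, 100)  # a and b: over 200 lb. and over 6 ft.
y = Fraction(5, 100)   # a only: over 200 lb., not over 6 ft.
w = Fraction(15, 100)  # b only: over 6 ft., not over 200 lb.
z = Fraction(70, 100)  # neither
assert x + y + w + z == 1

p_a = x + y            # P(a)
p_b = w + x            # P(b)
p_a_given_b = x / p_b  # P(a | b), read directly from the regions
p_b_given_a = x / p_a  # P(b | a), read directly from the regions

# Bayes Theorem: P(a | b) = P(b | a) P(a) / P(b)
via_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, via_bayes)  # 2/5 2/5
```

Both routes give the same value because each side of the derivation equals x, the probability of the overlap.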
11
Probability Ranking Principle
"If a reference retrieval system’s response to each request is a
ranking of the documents in the collections in order of
decreasing probability of usefulness to the user who submitted
the request, where the probabilities are estimated as accurately
as possible on the basis of whatever data is made available to the
system for this purpose, then the overall effectiveness of the
system to its users will be the best that is obtainable on the
basis of that data."
W.S. Cooper
12
Probabilistic Ranking
Basic concept:
"For a given query, if we know some documents that are
relevant, terms that occur in those documents should be given
greater weighting in searching for other relevant documents.
"By making assumptions about the distribution of terms and
applying Bayes Theorem, it is possible to derive weights
theoretically."
Van Rijsbergen
13
Probabilistic Principle
Basic concept:
The probability that a document is relevant to a query is assumed to
depend on the terms in the query and the terms used to index the
document, only.
Given a user query q, the ideal answer set, R, is the set of all
relevant documents.
Given a user query q and a document dj in the collection, the
probabilistic model estimates the probability that the user will
find dj relevant, i.e., that dj is a member of R.
14
Probabilistic Principle
Initial probabilities:
Given a query q and a document dj the model needs an estimate
of the probability that the user finds dj relevant, i.e., P(R | dj).
Similarity measure:
The similarity (dj, q) is the ratio of the probability that dj is
relevant to q, to the probability that dj is not relevant to q.
This measure runs from near zero, if the probability is small that
the document is relevant, to large as the probability of relevance
approaches one.
15
Probabilistic Principle
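$$\text{similarity}(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)} = \frac{P(d_j \mid R)\,P(R)}{P(d_j \mid \bar{R})\,P(\bar{R})} \quad \text{by Bayes Theorem}$$
$$= k \times \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})} \quad \text{where } k = P(R)/P(\bar{R}) \text{ is constant}$$
Here $\bar{R}$ is the complement of R, the set of documents that are not relevant.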
P(dj | R) is the probability of randomly selecting dj from R.
16
Binary Independence Retrieval Model
(BIR)
Let x = (x1, x2, ... xn) be the term incidence vector for dj.
xi = 1 if term i is in the document and 0 otherwise.
We estimate P(dj | R) by P(x | R)
If the index terms are independent
P(x | R) = P(x1 ∩ x2 ∩ ... ∩ xn | R)
= P(x1 | R) P(x2 | R) ... P(xn | R)
= ∏ P(xi | R)
{This is known as the Naive Bayes probabilistic model.}
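Under the independence assumption, P(x | R) is a product with one factor per index term. The sketch below shows this; the function name and the probability values are assumptions for illustration only.

```python
import math

def prob_vector_given_relevant(x, p):
    """P(x | R) under the term-independence assumption.

    x: term incidence vector (1 if term i is in the document, else 0).
    p: p[i] is an estimate of P(x_i = 1 | R) (hypothetical values here).
    """
    # Each term contributes P(x_i = 1 | R) if present, P(x_i = 0 | R) if absent.
    return math.prod(p_i if x_i == 1 else 1.0 - p_i for x_i, p_i in zip(x, p))

x = [1, 0, 1]          # terms 1 and 3 occur in the document
p = [0.5, 0.25, 0.5]   # assumed estimates of P(x_i = 1 | R)

print(prob_vector_given_relevant(x, p))  # 0.5 * 0.75 * 0.5 = 0.1875
```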
17
Binary Independence Retrieval Model
(BIR)
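$$S = \text{similarity}(d_j, q) = k \, \frac{\prod_i P(x_i \mid R)}{\prod_i P(x_i \mid \bar{R})}$$
Since the xi are either 0 or 1, this can be written:
$$S = k \prod_{x_i = 1} \frac{P(x_i = 1 \mid R)}{P(x_i = 1 \mid \bar{R})} \prod_{x_i = 0} \frac{P(x_i = 0 \mid R)}{P(x_i = 0 \mid \bar{R})}$$
18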
Binary Independence Retrieval Model
(BIR)
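For terms that appear in the query let
$p_i = P(x_i = 1 \mid R)$
$r_i = P(x_i = 1 \mid \bar{R})$
For terms that do not appear in the query assume
$p_i = r_i$
$$S = k \prod_{x_i = q_i = 1} \frac{p_i}{r_i} \prod_{x_i = 0,\ q_i = 1} \frac{1 - p_i}{1 - r_i}$$
$$= k \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$
The last product, taken over all query terms, is constant for a given query.
19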
Binary Independence Retrieval Model
(BIR)
Taking logs and ignoring factors that are constant for a given
query, we have:
$$\text{similarity}(d, q) = \sum_i \log \frac{p_i (1 - r_i)}{(1 - p_i)\, r_i}$$
where the summation is taken over those terms that appear in
both the query and the document.
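The log form of the similarity can be computed directly as a sum over shared terms. The following is a minimal sketch, not a reference implementation: the function name, the example terms, and the values of pi and ri are all assumptions for illustration.

```python
import math

def bir_similarity(doc_terms, query_terms, p, r):
    """Sum of log{p_i (1 - r_i) / ((1 - p_i) r_i)} over terms in both d and q.

    p[t] estimates P(term t present | relevant);
    r[t] estimates P(term t present | not relevant).
    Both must lie strictly between 0 and 1 for the log to be defined.
    """
    shared = set(doc_terms) & set(query_terms)
    return sum(math.log(p[t] * (1 - r[t]) / ((1 - p[t]) * r[t])) for t in shared)

# Hypothetical estimates for three query terms.
p = {"retrieval": 0.8, "probabilistic": 0.6, "model": 0.5}
r = {"retrieval": 0.2, "probabilistic": 0.1, "model": 0.5}

query = ["probabilistic", "retrieval", "model"]
doc_a = ["probabilistic", "retrieval", "systems"]
doc_b = ["model", "systems"]

score_a = bir_similarity(doc_a, query, p, r)
score_b = bir_similarity(doc_b, query, p, r)
print(score_a, score_b)
```

With these assumed values, doc_a ranks above doc_b. The term "model" contributes nothing to doc_b because its pi equals its ri, so it does not discriminate between relevant and non-relevant documents.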
20
Relationship to Term Vector Space
Model
Suppose that, in the term vector space, document d is
represented by a vector whose component in dimension i is
$$\log \frac{p_i (1 - r_i)}{(1 - p_i)\, r_i}$$
if term i occurs in d and 0 otherwise,
and the query q is represented by a vector with value 1 in each
dimension that corresponds to a term in the query.
Then the Binary Independence Retrieval similarity (d, q) is the
inner product of these two vectors.
Thus this approach can be considered as a probabilistic way of
determining term weights in the vector space model
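The equivalence with the vector space model can be illustrated by building both vectors and comparing their inner product with the direct sum over shared terms. This is a sketch under assumed values: the vocabulary, the estimates of pi and ri, and all function names are hypothetical.

```python
import math

# Hypothetical vocabulary and probability estimates (assumed values).
vocabulary = ["model", "probabilistic", "retrieval", "systems"]
p = {"model": 0.5, "probabilistic": 0.6, "retrieval": 0.8, "systems": 0.3}
r = {"model": 0.5, "probabilistic": 0.1, "retrieval": 0.2, "systems": 0.3}

def term_weight(t):
    """log{p_i (1 - r_i) / ((1 - p_i) r_i)} for term t."""
    return math.log(p[t] * (1 - r[t]) / ((1 - p[t]) * r[t]))

def document_vector(doc_terms):
    # Component is the term weight if the term occurs in the document, else 0.
    return [term_weight(t) if t in doc_terms else 0.0 for t in vocabulary]

def query_vector(query_terms):
    # Component is 1 for each term in the query, else 0.
    return [1.0 if t in query_terms else 0.0 for t in vocabulary]

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

doc = {"probabilistic", "retrieval", "systems"}
query = {"probabilistic", "retrieval", "model"}

via_vectors = inner_product(document_vector(doc), query_vector(query))
via_sum = sum(term_weight(t) for t in doc & query)
print(via_vectors, via_sum)
```

The two values agree because the query vector zeroes out every dimension whose term is not in the query, and the document vector zeroes out every dimension whose term is not in the document.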
21
Practical Application
The probabilistic model is an alternative to the term vector
space model.
The Binary Independence Retrieval similarity measure is used
instead of the cosine similarity measure to rank all documents
against the query q.
Techniques such as stoplists and stemming can be used with
either model.
Variations to the model result in slightly different expressions
for the similarity measure.
22
Practical Application
Early uses of probabilistic information retrieval
were based on relevance feedback
R is a set of documents that are guessed to be relevant and $\bar{R}$ is
the complement of R.
1. Guess a preliminary probabilistic description of R and
use it to retrieve a first set of documents.
2. Interact with the user to refine the description of R
(relevance feedback).
3. Repeat, thus generating a succession of approximations
to R.
23
Initial Estimates of P(xi | R)
Initial guess, with no information to work from:
$p_i = P(x_i \mid R) = c$
$r_i = P(x_i \mid \bar{R}) = n_i / N$
where:
c is an arbitrary constant, e.g., 0.5
ni is the number of documents that contain xi
N is the total number of documents in the collection
24
Initial Similarity Estimates
With these assumptions, and taking c = 0.5:
$$\text{similarity}(d, q) = \sum_i \log \frac{p_i (1 - r_i)}{(1 - p_i)\, r_i} = \sum_i \log \frac{N - n_i}{n_i}$$
where the summation is taken over those terms that appear in
both the query and the document.
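With the initial guess c = 0.5, each shared term contributes log{(N - ni)/ni}, so rare terms weigh more than common ones. The sketch below computes this initial score; the collection size, document frequencies, and function name are assumptions for illustration.

```python
import math

def initial_similarity(doc_terms, query_terms, doc_freq, n_docs):
    """Sum of log{(N - n_i) / n_i} over terms in both d and q (assumes c = 0.5).

    doc_freq[t] is n_i, the number of documents containing term t.
    Requires 0 < n_i < N; otherwise the weight is undefined.
    """
    shared = set(doc_terms) & set(query_terms)
    return sum(math.log((n_docs - doc_freq[t]) / doc_freq[t]) for t in shared)

# Hypothetical collection statistics.
N = 1000
n = {"probabilistic": 10, "retrieval": 100, "the": 900}

query = ["the", "probabilistic", "retrieval"]
doc = ["probabilistic", "retrieval", "systems"]

score = initial_similarity(doc, query, n, N)
print(score)

# A term found in most documents gets a negative weight.
weight_the = math.log((N - n["the"]) / n["the"])
print(weight_the)
```

The weight depends only on how many documents contain the term, which makes it similar in spirit to an inverse document frequency weight.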
25
Improving the Estimates of P(xi | R)
Human feedback -- relevance feedback
Automatically
(a) Run query q using initial values. Consider the t top ranked
documents. Let si be the number of these documents that
contain the term xi.
(b) The new estimates are:
$p_i = P(x_i \mid R) = s_i / t$
$r_i = P(x_i \mid \bar{R}) = (n_i - s_i) / (N - t)$
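Step (b) can be sketched as follows, treating the t top-ranked documents as if they were the relevant set. The function name and the sample data are hypothetical, and the code applies the two formulas exactly as stated, with no smoothing.

```python
def reestimate(top_docs, doc_freq, n_docs):
    """Return new estimates (p, r) from the t top-ranked documents.

    top_docs: list of term sets, one per top-ranked document.
    doc_freq[term]: n_i, the number of documents in the collection with the term.
    n_docs: N, the total number of documents in the collection.
    """
    t = len(top_docs)
    p, r = {}, {}
    for term, n_i in doc_freq.items():
        # s_i: how many of the top-ranked documents contain the term.
        s_i = sum(1 for doc in top_docs if term in doc)
        p[term] = s_i / t
        r[term] = (n_i - s_i) / (n_docs - t)
    return p, r

# Hypothetical collection statistics and top-ranked documents.
N = 1000
doc_freq = {"probabilistic": 10, "retrieval": 100}
top_docs = [
    {"probabilistic", "retrieval"},
    {"retrieval", "model"},
    {"probabilistic", "retrieval", "systems"},
    {"systems"},
]

p, r = reestimate(top_docs, doc_freq, N)
print(p)  # {'probabilistic': 0.5, 'retrieval': 0.75}
print(r)
```

An estimate of exactly 0 or 1 makes the log weight undefined, so a practical system would need some adjustment for terms that appear in none or all of the top-ranked documents.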
26
Discussion of Probabilistic Model
Advantages
• Built on a firm theoretical basis
Disadvantages
• Initial definition of R has to be guessed.
• Weights ignore term frequency
• Assumes independent index terms (as does the vector space model)
27