Classic IR Models
Boolean model
simple model based on set theory
queries as Boolean expressions
adopted by many commercial systems
Vector space model
queries and documents as vectors in an M-dimensional space
M is the number of terms
find documents most similar to the query in the M-dimensional space
Probabilistic model
a probabilistic approach
assume an ideal answer set for each query
iteratively refine the properties of the ideal answer set
E.G.M. Petrakis
Information Retrieval Models
Document Index Terms
Each document is represented by a set of
representative index terms or keywords
requires text pre-processing (off-line)
these terms summarize document contents
adjectives, adverbs, connectives are less useful
the index terms are mainly nouns (lexicon look-up)
Not all terms are equally useful
very frequent terms are not useful
very infrequent terms are not useful either
terms have varying relevance (weights) when used
to describe documents
Text Preprocessing
Extract terms from documents and queries
document - query profile
Processing stages
word separation
sentence splitting
change terms to a standard form (e.g., lowercase)
eliminate stop-words (e.g. and, is, the, …)
reduce terms to their base form (e.g., eliminate
prefixes, suffixes)
construct term indices (usually inverted files)
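The processing stages above can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the suffix list and the names `preprocess` and `strip_suffix` are assumptions, sentence splitting is omitted, and the crude suffix stripping merely stands in for a real stemmer.

```python
import re

# Hypothetical stop-word and suffix lists, chosen only for illustration.
STOP_WORDS = {"and", "is", "the", "a", "of", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(term):
    """Crude stand-in for a stemmer: drop the first matching suffix."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text):
    """Word separation, lowercasing, stop-word removal, suffix stripping."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]


terms = preprocess("The cats are chasing the dog")
```

The resulting term list is what would be fed to the index construction step.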
Text Preprocessing Chart
[Figure: text preprocessing chart, from Baeza-Yates & Ribeiro-Neto, 1999]
Inverted Index
index (terms): άγαλμα (statue), αγάπη (love), …, δουλειά (work), …, πρωί (morning), …, ωκεανός (ocean)
posting lists: (1,2)(3,4); (4,3)(7,5); …; (10,3)
documents: 1, 2, …, 11
[Figure: each index term points to its posting list, whose entries refer to documents]
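A minimal sketch of building such an index follows. It assumes that each posting is a (document id, term frequency) pair, which the slide does not state explicitly; the function name `build_inverted_index` and the toy documents are illustrative only.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency) pairs.

    docs: {doc_id: list of already preprocessed terms}.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, freq in sorted(Counter(docs[doc_id]).items()):
            index[term].append((doc_id, freq))
    return dict(index)


# Toy corpus (assumption): three documents given as term lists.
docs = {
    1: ["statue", "love", "statue"],
    2: ["love"],
    3: ["ocean", "statue"],
}
index = build_inverted_index(docs)
```

Because documents are visited in increasing id order, every posting list comes out sorted by document id.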
Basic Notation
Document: usually text
D: document collection (corpus)
d: an instance of D
Query: same representation as documents
Q: set of all possible queries
q: an instance of Q
Relevance: R(d,q)
binary relation R: D × Q → {0,1}
d is “relevant” to q iff R(d,q) = 1 or
degree of relevance: R(d,q) ∈ [0,1] or
probability of relevance: R(d,q) = Prob(R|d,q)
Term Weights
T = {t1, t2, …, tM}: the terms in the corpus
N: number of documents in the corpus
dj: a document
dj is represented by (w1j, w2j, …, wMj) where
wij > 0 if ti appears in dj
wij = 0 otherwise
q is represented by (q1, q2, …, qM)
R(d,q) > 0 if q and d have common terms
Term Weighting
Term-document weight matrix (rows: terms, columns: docs)
      d1    d2    …    dN
t1    w11   w12   …    w1N
t2    w21   w22   …    w2N
…
tM    wM1   wM2   …    wMN
Document Space (corpus)
[Figure: the document space D with a query q; legend: query, relevant document, non-relevant document]
Boolean Model
Based on set theory and Boolean algebra
Boolean queries: “John” and “Mary” not “Ann”
terms linked by “and”, “or”, “not”
term weights are 0 or 1 (wij = 0 or 1)
query terms are present or absent in a document
a document is relevant if the query condition is
satisfied
Pros: simple; used in many commercial systems
Cons: no ranking; complex queries are hard to formulate
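As a small illustration of the set-theoretic view, the query “John” and “Mary” not “Ann” can be evaluated with set intersection and difference. The index layout (term mapped to a set of document ids), the function name `boolean_and_not` and the toy data are assumptions made for this sketch.

```python
def boolean_and_not(index, all_of, none_of):
    """Documents containing every term in all_of and no term in none_of.

    index: {term: set of doc ids} (binary weights: present or absent).
    """
    result = None
    for term in all_of:
        docs = index.get(term, set())
        result = set(docs) if result is None else result & docs
    result = result or set()
    for term in none_of:
        result -= index.get(term, set())
    return result


# Toy index (assumption): term -> ids of the documents containing it.
index = {"john": {1, 2, 3}, "mary": {2, 3, 4}, "ann": {3}}
answer = boolean_and_not(index, ["john", "mary"], ["ann"])
```

The answer is an unordered set: a document either satisfies the condition or it does not, which is why the model offers no ranking.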
Query Processing
For each term ti in query q = {t1, t2, …, tM}
1) use the index to retrieve all dj with wij > 0
2) sort them in decreasing order (e.g., by term
frequency)
Return documents satisfying the query
condition
Slow for many terms: involves set
intersections
Speed-ups:
keep only the top K documents for each
term at step 2, or
do not process all query terms
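The procedure can be sketched as follows, assuming that the query condition is a conjunction (all terms must match) and that postings are (document id, term frequency) pairs. Both are assumptions of this sketch, as are the names `top_k_postings` and `conjunctive_query`.

```python
def top_k_postings(postings, k):
    """Step 2: sort postings by decreasing term frequency, keep the top k."""
    return sorted(postings, key=lambda p: (-p[1], p[0]))[:k]


def conjunctive_query(index, terms, k):
    """Intersect the (truncated) document sets of all query terms."""
    doc_sets = []
    for term in terms:
        kept = top_k_postings(index.get(term, []), k)
        doc_sets.append({doc_id for doc_id, _ in kept})
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)


# Toy index (assumption): term -> list of (doc_id, term_frequency).
index = {
    "a": [(1, 5), (2, 1), (3, 4)],
    "b": [(1, 2), (3, 3), (4, 9)],
}
exact = conjunctive_query(index, ["a", "b"], k=3)
truncated = conjunctive_query(index, ["a", "b"], k=2)
```

With k=3 nothing is cut and documents 1 and 3 are returned; with k=2 document 1 is lost, which shows that the top-K shortcut trades completeness for speed.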
Vector Space Model
Documents and queries are M-dimensional
term vectors
non-binary weights are assigned to index terms
a query is similar to a document if their
vectors are similar
retrieved documents are sorted by
decreasing similarity
a document may match a query only
partially
SMART is the most popular implementation
Query – Document Similarity
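$$\mathrm{Sim}(q, d) = \cos\theta = \frac{q \cdot d}{|q|\,|d|} = \frac{\sum_{i=1}^{M} w_{iq}\, w_{id}}{\sqrt{\sum_{i=1}^{M} w_{iq}^{2}}\;\sqrt{\sum_{i=1}^{M} w_{id}^{2}}}$$
[Figure: query vector q and document vector d separated by angle θ]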
Similarity is defined as the cosine of the
angle between document and query vectors
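A minimal sketch of this similarity for dense weight vectors is given below. The function name `cosine_similarity` and the convention of returning 0 for an all-zero vector are choices made here, not part of the slides.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_q * norm_d)


orthogonal = cosine_similarity([1, 0], [0, 1])
parallel = cosine_similarity([1, 2, 0], [2, 4, 0])
```

Vectors with no common terms score 0, and vectors pointing in the same direction score 1 regardless of their lengths.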
Weighting Scheme
tf x idf weighting scheme:
$$w_{ij} = tf_{ij} \cdot idf_i = \frac{freq_{ij}}{\max_{l} freq_{lj}} \cdot \log\frac{N}{n_i}$$
wij: weight of term ti associated with document dj
freqij: frequency of term ti in document dj
max_l freqlj: the maximum frequency, computed over all terms tl in dj
tfij: normalized frequency
idfi: inverse document frequency
ni: number of documents where term ti occurs
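The tf x idf scheme can be sketched as below. The natural logarithm is an assumption (the slide does not fix the base), and the function name `tf_idf` and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents in which each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values(), default=1)
        weights.append({
            term: (f / max_freq) * math.log(n_docs / doc_freq[term])
            for term, f in freq.items()
        })
    return weights


weights = tf_idf([["a", "a", "b"], ["a", "c"]])
```

Term "a" occurs in every document, so its idf is log(1) = 0 and it receives zero weight everywhere, however frequent it is inside a document.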
Weight Normalization
Many ways to express weights
E.g., using log(tfij)
The weight is normalized in [0,1]
$$w_{ij} = \frac{(1 + \log tf_{ij})\; idf_i}{\sqrt{\sum_{k=1}^{M} \bigl(1 + \log tf_{kj}\bigr)^{2}}}$$
Normalize by document length
Normalization by Document Length
The longer the document, the more
likely it is for a given term to appear
in it
Normalize the term weights by
document length (so longer documents
are not given more weight)
$$w'_{ij} = \frac{w_{ij}}{\sqrt{\sum_{k=1}^{M} w_{kj}^{2}}}$$
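A short sketch of this normalization: every weight is divided by the Euclidean length of the document vector. The function name `normalize` and the decision to leave an all-zero vector unchanged are assumptions.

```python
import math


def normalize(weights):
    """Scale a document's weight vector to unit Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights))
    if length == 0:
        return list(weights)  # nothing to scale
    return [w / length for w in weights]


short_doc = normalize([3, 4])
long_doc = normalize([30, 40])
```

A document ten times longer with the same term proportions ends up with the same normalized vector, so its length alone gives it no advantage.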
Comments on Term Weighting
tfij: term frequency – measures how well
a term describes a document
intra-document characterization
idfi: terms appearing in many documents
are not very useful in distinguishing
relevant from non-relevant documents
inter-document characterization
This scheme favors terms of medium frequency
Comments on Vector Space Model
Pros:
at least as good as other models
approximate query matching: a query and a
document need not contain exactly the
same terms
allows for ranking of results
Cons:
assumes term independence
Document Distance
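Consider documents d1, d2 with unit-length vectors u1, u2
their distance is defined as the length of the chord AB joining the vector endpoints
$$\mathrm{distance}(d_1, d_2) = 2\sin(\theta/2) = \sqrt{2\,(1 - \cos\theta)} = \sqrt{2\,\bigl(1 - \mathrm{similarity}(d_1, d_2)\bigr)}$$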
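The identity can be checked numerically. The sketch below assumes non-zero vectors that are first scaled to unit length, and compares the chord length computed directly with the value derived from the cosine similarity; all function names are illustrative.

```python
import math


def unit(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def chord_distance(u1, u2):
    """Euclidean distance between the endpoints of the unit vectors."""
    a, b = unit(u1), unit(u2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_from_similarity(u1, u2):
    """sqrt(2 (1 - cos(theta))), with cos(theta) the cosine similarity."""
    a, b = unit(u1), unit(u2)
    cos_theta = sum(x * y for x, y in zip(a, b))
    return math.sqrt(max(0.0, 2 * (1 - cos_theta)))


direct = chord_distance([1, 0], [0, 2])
derived = distance_from_similarity([1, 0], [0, 2])
```

For these orthogonal vectors θ = 90°, so both expressions give the square root of 2, which also equals 2 sin(45°).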
Probabilistic Model
Computes the probability that the document
is relevant to the query
ranks the documents according to their probability
of being relevant to the query
Assumption: there is a set R of relevant
documents which maximizes the overall
probability of relevance
R: ideal answer set
R is not known in advance
initially assume a description (the terms) of R
iteratively refine this description
Basic Notation
D: corpus, d: an instance of D
Q: set of queries, q: an instance of Q
R = {(d, q) | d ∈ D, q ∈ Q, d is relevant to q}
R̄ = {(d, q) | d ∈ D, q ∈ Q, d is not relevant to q}
P(R | d): probability that d is relevant
P(R̄ | d): probability that d is not relevant
Probability of Relevance
P(R|d): probability that d is relevant
Bayes rule:
$$P(R|d) = \frac{P(d|R)\,P(R)}{P(d)}$$
P(d|R): probability of selecting d from R
P(R): probability of selecting R from D
P(d): probability of selecting d from D
Document Ranking
Take the odds of relevance as the rank
$$\mathrm{Sim}(d|q) = \frac{P(R|d)}{P(\bar{R}|d)} = \frac{P(d|R)\,P(R)}{P(d|\bar{R})\,P(\bar{R})}$$
Minimizes the probability of erroneous
judgment
P(R) and P(R̄) are the same for all docs, so
$$\mathrm{Sim}(d|q) \sim \frac{P(d|R)}{P(d|\bar{R})}$$
Ranking (cont’d)
Each document is represented by a set
of index terms t1, t2, …, tM
assume binary weights wi for terms ti
d = (w1, w2, …, wM) where
wi = 1 if the term appears in d
wi = 0 otherwise
Assuming independence of index terms
$$P(d|R) = \prod_{t_i \in d} P(t_i|R) \prod_{t_i \notin d} P(\bar{t}_i|R)$$
Ranking (cont’d)
By taking logarithms and by omitting
constant terms
$$\mathrm{Sim}(d|q) \sim \frac{P(d|R)}{P(d|\bar{R})} \sim \sum_{i=1}^{M} w_{iq}\, w_{id} \log\frac{P(t_i|R)}{1 - P(t_i|R)} + \sum_{i=1}^{M} w_{iq}\, w_{id} \log\frac{1 - P(t_i|\bar{R})}{P(t_i|\bar{R})}$$
R is initially unknown
Initial Estimation
Make simplifying assumptions such as
$$P(t_i|R) = 0.5, \qquad P(t_i|\bar{R}) = \frac{n_i}{N}$$
where ni: number of documents containing ti
and N: total number of documents
Retrieve initial answer set using these
values
Refine answer iteratively
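With these initial values the ranking formula can be evaluated directly. The sketch below assumes binary weights and 0 < ni < N for every query term (otherwise the logarithm is undefined); the function name `initial_score` and the toy statistics are illustrative.

```python
import math


def initial_score(query_terms, doc_terms, doc_freq, n_docs):
    """Sum the two log-odds terms over terms shared by query and document.

    Uses the initial guesses P(t|R) = 0.5 and P(t|not R) = n_i / N.
    """
    score = 0.0
    for term in set(query_terms) & set(doc_terms):
        p_rel = 0.5
        p_nonrel = doc_freq[term] / n_docs
        score += math.log(p_rel / (1 - p_rel))
        score += math.log((1 - p_nonrel) / p_nonrel)
    return score


# Toy statistics (assumption): N = 10 documents, n_a = 1, n_b = 5.
doc_freq = {"a": 1, "b": 5}
rare_match = initial_score(["a", "b"], ["a", "c"], doc_freq, 10)
common_match = initial_score(["a", "b"], ["b"], doc_freq, 10)
```

Since log(0.5 / 0.5) = 0, the initial score reduces to a sum of log((N - ni) / ni), an idf-like quantity: matching the rare term "a" contributes log 9, matching "b" (present in half of the corpus) contributes nothing.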
Improvement
Let V be the number of documents retrieved
initially and taken as relevant
(e.g., the first r answers)
From them compute Vi: number of documents
containing ti
Update the initial probabilities:
$$P(t_i|R) = \frac{V_i}{V}, \qquad P(t_i|\bar{R}) = \frac{n_i - V_i}{N - V}$$
Resubmit query and repeat until convergence
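One update step can be sketched as follows, assuming the V documents taken as relevant are given as sets of terms. The function name `update_probabilities` and the toy numbers are assumptions.

```python
def update_probabilities(term, top_docs, doc_freq, n_docs):
    """Re-estimate P(t|R) and P(t|not R) from the V documents assumed relevant.

    top_docs: list of term sets, one per document taken as relevant.
    """
    v = len(top_docs)
    v_i = sum(1 for doc in top_docs if term in doc)
    p_rel = v_i / v
    p_nonrel = (doc_freq[term] - v_i) / (n_docs - v)
    return p_rel, p_nonrel


# Toy data (assumption): V = 2 top documents out of N = 10.
top_docs = [{"a", "b"}, {"a"}]
doc_freq = {"a": 3, "b": 4}
p_a = update_probabilities("a", top_docs, doc_freq, 10)
p_b = update_probabilities("b", top_docs, doc_freq, 10)
```

Estimates of exactly 0 or 1 (as for term "a" here) make the log-odds in the ranking formula undefined, so in practice a small smoothing adjustment is usually added to these ratios.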
Comments on Probabilistic Model
Pros:
good theoretical basis
Cons:
need to guess initial probabilities
binary weights
independence assumption
Extensions:
relevance feedback: humans choose relevant docs
OKAPI formula for non-binary weights
Comparison of Models
The Boolean model is simple and is
used almost everywhere. It does not
allow for partial matches. It is the
weakest model
The Vector space model has been shown
(Salton and Buckley) to outperform the
other two models
Various extensions deal with their
weaknesses
Query Modification
The results are not always satisfactory
some answers are correct, others are not
queries can’t specify user’s needs precisely
Iteratively reformulate and resubmit
the query until the results become
satisfactory
Two approaches
relevance feedback
query expansion
Relevance Feedback
Mark answers as
relevant: positive examples
irrelevant: negative examples
Query: a point in document space
at each iteration compute new query point
the query moves towards an “optimal point”
that distinguishes relevant from non-relevant documents
the weights of query terms are modified
“term reweighting”
Rocchio Vectors
[Figure: successive query points q0, q1, q2 moving towards the optimal query]
Rocchio Formula
Query point
$$q = \alpha\, q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} d_i - \frac{\gamma}{n_2} \sum_{j=1}^{n_2} d_j$$
di: relevant answer
dj: non-relevant answer
n1: number of relevant answers
n2: number of non-relevant answers
α, β, γ: relative strength (usually α = β = γ = 1)
α = 1, β = 0.75, γ = 0.25: used when q0 and the relevant
answers are taken to contain the important information
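One iteration of this update can be sketched as below. The defaults use the α = 1, β = 0.75, γ = 0.25 setting; the function name `rocchio` and the handling of an empty answer set (its term is simply dropped) are assumptions of this sketch.

```python
def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query towards relevant and away from non-relevant answers."""

    def centroid(vectors):
        # Mean vector; an empty set contributes nothing to the update.
        if not vectors:
            return [0.0] * len(q0)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    return [alpha * q + beta * r - gamma * n for q, r, n in zip(q0, rel, non)]


# Toy vectors (assumption): two terms, two relevant and one non-relevant answer.
q1 = rocchio([1, 0], relevant=[[1, 1], [1, 3]], non_relevant=[[0, 4]])
```

The second term was absent from the original query but gains weight because the relevant answers use it, which is the “term reweighting” effect.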
Query Expansion
Adds new terms to the query which are
related to existing terms
synonyms from a dictionary (e.g., staff, crew)
semantically related terms from a
thesaurus (e.g., WordNet: man, woman,
mankind, human, …)
terms with similar pronunciation (Phonix,
Soundex)
Better results in many cases, but the query
may lose focus (topic drift)
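A minimal sketch of thesaurus-based expansion follows. The `THESAURUS` dictionary is a hypothetical toy stand-in for a real resource such as WordNet, and `expand_query` is an illustrative name.

```python
# Hypothetical toy thesaurus; a real system would consult a dictionary or WordNet.
THESAURUS = {
    "staff": ["crew"],
    "man": ["human", "mankind"],
}


def expand_query(terms, thesaurus):
    """Append related terms of every query term, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


expanded = expand_query(["staff", "meeting"], THESAURUS)
```

Only the original terms are looked up, not the added ones; expanding recursively would pull in ever more loosely related terms and worsen topic drift.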
Comments
Combine both techniques
query expansion: new terms are added from
relevant documents, dictionaries, thesauri
term reweighting by the Rocchio formula
If consistent relevance judgments are
provided
2-3 iterations improve results
quality depends on the corpus
Extensions
Pseudo relevance feedback: mark top k
answers as relevant, bottom k answers
as non-relevant and apply the Rocchio
formula
Relevance models for probabilistic
model
evaluation of initial answers by humans
term reweighting model by Bruce Croft,
1983
Text Clustering
The grouping of similar vectors into
clusters
Similar documents tend to be relevant
to the same requests
Clustering in M-dimensional space
M: number of terms
Clustering Methods
Sound methods based on the document-to-document similarity matrix
graph theoretic methods
O(N²) time
Iterative methods operating directly on
the document vectors
O(N log N) or O(N²/log N) time
Sound Methods
1. Two documents with similarity > T
(threshold) are connected with an
edge [Duda&Hart73]
clusters: the connected components
(or the maximal cliques) of the resulting graph
problem: selection of appropriate
threshold T
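The connected-components variant of this method can be sketched as below (the maximal-clique variant is not shown). The function name `threshold_clusters` and the toy similarity matrix are assumptions.

```python
def threshold_clusters(sim, threshold):
    """Clusters = connected components of the graph with edges sim > threshold.

    sim: symmetric N x N similarity matrix (list of lists).
    """
    n = len(sim)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal of the component containing `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and sim[i][j] > threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters


# Toy similarity matrix (assumption) for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
loose = threshold_clusters(sim, 0.5)
strict = threshold_clusters(sim, 0.95)
```

The same data yields two clusters at T = 0.5 and four singleton clusters at T = 0.95, which illustrates why choosing the threshold is the hard part.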
Zahn’s method [Zahn71]
Find the minimum spanning tree
For each doc delete edges with length l > lavg
lavg: average length of its incident edges
Or remove the longest edge (1 edge removed
=> 2 clusters, 2 edges removed => 3 clusters)
Clusters: the connected components of the
graph
[Figure: the dashed edge is inconsistent and is deleted]
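The sketch below implements only the simpler variant, removing the longest edges of the minimum spanning tree; the l > lavg inconsistency test is not implemented. It builds the tree with Kruskal's algorithm, a choice made here rather than taken from the slides, and the name `mst_clusters` is illustrative.

```python
def mst_clusters(dist, n_clusters):
    """Cut the (n_clusters - 1) longest MST edges; return the components.

    dist: symmetric N x N distance matrix (list of lists).
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    # Kruskal: edges are accepted in increasing length, so `mst` is sorted.
    mst = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            mst.append((length, i, j))

    keep = mst[: max(0, len(mst) - (n_clusters - 1))]
    # Rebuild the components using only the kept (short) edges.
    parent = list(range(n))
    for _, i, j in keep:
        parent[find(i)] = find(j)
    groups = {}
    for doc in range(n):
        groups.setdefault(find(doc), []).append(doc)
    return sorted(groups.values())


# Toy data (assumption): five documents placed on a line.
points = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]
clusters = mst_clusters(dist, 2)
```

Removing the single longest tree edge (the gap between 2 and 10) splits the five documents into two clusters.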
Iterative Methods
K-means clustering (K known in advance)
Choose some seed points (documents)
possible cluster centroids
Repeat until the centroids do not
change
assign each vector (document) to its
closest seed
compute new centroids
reassign vectors to improve clusters
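The loop above can be sketched as follows, assuming Euclidean distance for “closest” (the slides do not name the measure) and seeds supplied by the caller. The function name `k_means` and the `max_iter` safeguard are assumptions.

```python
def k_means(vectors, seeds, max_iter=100):
    """Return (centroids, assignment) after iterating until centroids are stable."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [list(seed) for seed in seeds]
    assignment = []
    for _ in range(max_iter):
        # Assign each vector to its closest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Compute the new centroid of every cluster.
        new_centroids = []
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


vectors = [[0, 0], [0, 2], [10, 0], [10, 2]]
centroids, assignment = k_means(vectors, seeds=[[0, 0], [10, 0]])
```

The number of seeds fixes K, and the result can depend on which seeds are chosen.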
Cluster Searching
The M-dimensional query vector is compared
with the cluster-centroids
search closest cluster
retrieve documents with similarity > T
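A minimal sketch of this search, assuming cosine similarity is used both for choosing the cluster and for the threshold test, and that each cluster is stored as a list of (document id, vector) pairs. These choices and the names `cosine` and `cluster_search` are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_search(query, centroids, clusters, threshold):
    """Search only the cluster whose centroid is closest to the query."""
    best = max(range(len(centroids)), key=lambda c: cosine(query, centroids[c]))
    return [
        doc_id
        for doc_id, vector in clusters[best]
        if cosine(query, vector) > threshold
    ]


# Toy clusters (assumption): lists of (doc_id, vector) pairs.
centroids = [[1, 0], [0, 1]]
clusters = [
    [("d1", [1, 0.1]), ("d2", [1, 1])],
    [("d3", [0, 1])],
]
hits = cluster_search([1, 0], centroids, clusters, threshold=0.8)
```

Only the documents of one cluster are compared with the query, which is faster than scanning the corpus but can miss relevant documents that fell into other clusters.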
References
"Modern Information Retrieval", Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, 1999
"Searching Multimedia Databases by Content", Christos Faloutsos, Kluwer Academic Publishers, 1996
Information Retrieval Resources
http://nlp.stanford.edu/IR-book/informationretrieval.html
TREC http://trec.nist.gov/
SMART http://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
LEMUR http://www.lemurproject.org/
LUCENE http://lucene.apache.org/