Classic IR Models


Classic IR Models
Boolean model
- simple model based on set theory
- queries as Boolean expressions
- adopted by many commercial systems
Vector space model
- queries and documents as vectors in an M-dimensional space
- M is the number of terms
- find the documents most similar to the query in the M-dimensional space
Probabilistic model
- a probabilistic approach
- assume an ideal answer set for each query
- iteratively refine the properties of the ideal answer set
E.G.M. Petrakis
Information Retrieval Models
1
Document Index Terms
Each document is represented by a set of representative index terms or keywords
- requires text pre-processing (off-line)
- these terms summarize document contents
- adjectives, adverbs, connectives are less useful
- the index terms are mainly nouns (lexicon look-up)
Not all terms are equally useful
- very frequent terms are not useful
- very infrequent terms are not useful either
- terms have varying relevance (weights) when used to describe documents
Text Preprocessing
Extract terms from documents and queries
- document / query profile
Processing stages
- word separation
- sentence splitting
- change terms to a standard form (e.g., lowercase)
- eliminate stop-words (e.g., and, is, the, …)
- reduce terms to their base form (e.g., eliminate prefixes, suffixes)
- construct term indices (usually inverted files)
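The stages above can be sketched in Python. This is a minimal illustration: the stop-word list and the suffix-stripping rules are toy placeholders (a real system would use a full stop list and a proper stemmer such as Porter's):

```python
import re

# Toy stop-word list; real systems use lists of hundreds of words.
STOP_WORDS = {"and", "is", "the", "a", "of", "to", "in", "are"}

def preprocess(text):
    """Run the preprocessing stages: separate words, lowercase,
    remove stop-words, and crudely strip a few common suffixes."""
    words = re.findall(r"[a-zA-Z]+", text)              # word separation
    words = [w.lower() for w in words]                  # standard form
    words = [w for w in words if w not in STOP_WORDS]   # stop-word elimination
    terms = []
    for w in words:                                     # crude suffix stripping
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        terms.append(w)
    return terms

print(preprocess("The dogs are running in the park"))  # → ['dog', 'runn', 'park']
```

Note that the crude suffix rule turns "running" into "runn"; this is exactly why real stemmers need recoding rules.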
Text Preprocessing Chart
(figure: the preprocessing pipeline, from Baeza-Yates & Ribeiro-Neto, 1999)
Inverted Index
The dictionary (index) maps each term, in sorted order, to its posting list of (document, frequency) pairs over the collection (documents 1–11 in the figure):

index                posting list
άγαλμα (statue)      (1,2) (3,4)
αγάπη (love)         (4,3) (7,5)
…                    …
δουλειά (work)       …
πρωί (morning)       …
…                    …
ωκεανός (ocean)      (10,3)
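A minimal sketch of building such an index in Python. The document IDs and texts are made up for illustration; a real system would store the postings compressed on disk:

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to its posting list of (doc_id, frequency) pairs,
    with terms kept in sorted (dictionary) order."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for term, freq in sorted(Counter(text.split()).items()):
            index[term].append((doc_id, freq))
    return dict(sorted(index.items()))

# toy collection: doc id -> already preprocessed text
docs = {1: "love statue statue", 4: "love love love", 10: "ocean"}
index = build_inverted_index(docs)
print(index["love"])  # → [(1, 1), (4, 3)]
```

Sorting the dictionary keys mirrors the lexicographic order of the index column in the figure; sorting by document ID keeps each posting list in increasing doc-ID order, which is what makes fast merge-based intersection possible.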
Basic Notation
Document: usually text
- D: document collection (corpus)
- d: an instance of D
Query: same representation as documents
- Q: set of all possible queries
- q: an instance of Q
Relevance: R(d,q)
- binary relation R: D × Q → {0,1}: d is "relevant" to q iff R(d,q) = 1, or
- degree of relevance: R(d,q) ∈ [0,1], or
- probability of relevance: R(d,q) = Prob(R|d,q)
Term Weights
- T = {t1, t2, …, tM}: the terms in the corpus
- N: number of documents in the corpus
- dj: a document
- dj is represented by (w1j, w2j, …, wMj) where
  - wij > 0 if ti appears in dj
  - wij = 0 otherwise
- q is represented by (q1, q2, …, qM)
- R(d,q) > 0 if q and d have common terms
Term Weighting
The term weights form an M × N term–document matrix (terms as rows, documents as columns):

         d1     d2     …     dN
  t1    w11    w12    …    w1N
  t2    w21    w22    …    w2N
  …      …      …     …     …
  tM    wM1    wM2    …    wMN
Document Space (corpus)
(figure: the document space D with a query point q; relevant documents lie near the query, non-relevant documents farther away)
Boolean Model
Based on set theory and Boolean algebra
- Boolean queries: "John" and "Mary" not "Ann"
- terms linked by "and", "or", "not"
- term weights are 0 or 1 (wij = 0 or 1)
- query terms are present or absent in a document
- a document is relevant if the query condition is satisfied
Pros: simple, used in many commercial systems
Cons: no ranking, not easy to form complex queries
Query Processing
For each term ti in the query q = {t1, t2, …, tM}:
1) use the index to retrieve all dj with wij > 0
2) sort them in decreasing order (e.g., by term frequency)
Return the documents satisfying the query condition
- slow for many terms: involves set intersections
- speed-up: keep only the top K documents for each term at step 2, or
- do not process all query terms
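The intersection step can be sketched as a conjunctive (AND) evaluation over posting lists. The sample index below is made up for illustration; starting from the shortest list is a standard optimization, since the intersection can never grow larger than the rarest term's postings:

```python
def and_query(index, terms):
    """Evaluate a conjunctive (AND) Boolean query by intersecting
    the posting lists of the query terms, shortest list first."""
    postings = [set(doc for doc, _ in index.get(t, [])) for t in terms]
    postings.sort(key=len)          # start from the rarest term
    result = postings[0]
    for p in postings[1:]:
        result = result & p         # set intersection per term
    return sorted(result)

# toy index: term -> posting list of (doc_id, frequency) pairs
index = {
    "john": [(1, 2), (2, 1), (5, 3)],
    "mary": [(2, 2), (5, 1), (7, 4)],
}
print(and_query(index, ["john", "mary"]))  # → [2, 5]
```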
Vector Space Model
Documents and queries are M-dimensional term vectors
- non-binary weights to index terms
- a query is similar to a document if their vectors are similar
- retrieved documents are sorted in decreasing order of similarity
- a document may match a query only partially
- SMART is the most popular implementation
Query – Document Similarity
Similarity is defined as the cosine of the angle θ between the document and query vectors:

Sim(q,d) = (q · d) / (|q| |d|) = Σ_{i=1..M} wiq wid / ( √(Σ_{i=1..M} wiq²) · √(Σ_{i=1..M} wid²) )
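The cosine can be computed directly from the weight vectors. A minimal sketch with made-up weights:

```python
import math

def cosine_sim(q, d):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

q = [1.0, 0.0, 1.0]
d = [2.0, 0.0, 2.0]
print(cosine_sim(q, d))  # parallel vectors → 1.0
```

Orthogonal vectors (no shared terms) score 0; parallel vectors score 1 regardless of length, which is why the cosine gives the partial matching the slide describes.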
Weighting Scheme
The tf × idf weighting scheme:

wij = tfij · idfi = (freqij / max_l freqlj) · log(N / ni)

- wij: weight of term ti in document dj
- freqij: raw frequency of term ti in document dj
- the maximum frequency max_l freqlj is computed over all terms tl in dj
- tfij = freqij / max_l freqlj: normalized term frequency
- idfi = log(N / ni): inverse document frequency
- ni: number of documents where term ti occurs
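A one-term sketch of this weight, with made-up counts. The slide leaves the logarithm base unspecified; the natural log is assumed here (the base only rescales all weights by a constant):

```python
import math

def tf_idf(freq, max_freq, N, n_i):
    """w_ij = (freq_ij / maxfreq_lj) * log(N / n_i)."""
    tf = freq / max_freq        # normalized term frequency
    idf = math.log(N / n_i)     # inverse document frequency
    return tf * idf

# term occurring 3 times in a document whose most frequent term occurs 6 times,
# in a corpus of 1000 documents where the term appears in 100 of them
w = tf_idf(3, 6, 1000, 100)
print(round(w, 4))  # → 1.1513
```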
Weight Normalization
There are many ways to express the weights, e.g., using log(tfij). The weight is normalized in [0,1]:

wij = (1 + log(tfij)) · idfi / √( Σ_{k=1..M} (1 + log(tfkj))² )

This normalizes by document length.
Normalization by Document Length
The longer the document, the more likely it is for a given term to appear in it. Normalize the term weights by document length, so that longer documents are not given more weight:

w'ij = wij / √( Σ_{k=1..M} wkj² )
Comments on Term Weighting
- tfij: term frequency – measures how well a term describes a document (intra-document characterization)
- idfi: terms appearing in many documents are not very useful in distinguishing relevant from non-relevant documents (inter-document characterization)
This scheme favors "average" terms
Comments on Vector Space Model
Pros:
- at least as good as other models
- approximate query matching: a query and a document need not contain exactly the same terms
- allows for ranking of results
Cons:
- assumes term independence
Document Distance
Consider documents d1, d2 with unit-length vectors u1, u2; their distance is defined as the length of the chord AB between the vector endpoints:

distance(d1,d2) = 2 sin(θ/2) = √(2(1 − cos θ)) = √(2(1 − similarity(d1,d2)))
Probabilistic Model
Computes the probability that a document is relevant to the query
- ranks the documents according to their probability of being relevant to the query
Assumption: there is a set R of relevant documents which maximizes the overall probability of relevance
- R: the ideal answer set
- R is not known in advance
- initially assume a description (the terms) of R
- iteratively refine this description
Basic Notation
- D: corpus, d: an instance of D
- Q: set of queries, q: an instance of Q
- R = {(d,q) | d ∈ D, q ∈ Q, d is relevant to q}
- R̄ = {(d,q) | d ∈ D, q ∈ Q, d is not relevant to q}
- P(R|d): probability that d is relevant
- P(R̄|d): probability that d is not relevant
Probability of Relevance
P(R|d): the probability that d is relevant. By Bayes' rule:

P(R|d) = P(d|R) P(R) / P(d)

- P(d|R): probability of selecting d from R
- P(R): probability of selecting R from D
- P(d): probability of selecting d from D
Document Ranking
Take the odds of relevance as the rank; this minimizes the probability of an erroneous judgment:

Sim(d,q) = P(R|d) / P(R̄|d) = ( P(d|R) P(R) ) / ( P(d|R̄) P(R̄) )

P(R) and P(R̄) are the same for all documents, so:

Sim(d,q) ∝ P(d|R) / P(d|R̄)
Ranking (cont’d)
Each document is represented by a set of index terms t1, t2, …, tM
- assume binary weights wi for the terms ti
- d = (w1, w2, …, wM) where wi = 1 if the term appears in d, wi = 0 otherwise
Assuming independence of index terms:

P(d|R) = ∏_{ti ∈ d} P(ti|R) · ∏_{ti ∉ d} P(t̄i|R)
Ranking (cont'd)
By taking logarithms and omitting constant terms:

Sim(d,q) ~ Σ_{i=1..M} wiq wid [ log( P(ti|R) / (1 − P(ti|R)) ) + log( (1 − P(ti|R̄)) / P(ti|R̄) ) ]

R is initially unknown
Initial Estimation
Make simplifying assumptions such as:

P(ti|R) = 0.5,   P(ti|R̄) = ni / N

where ni is the number of documents containing ti and N is the total number of documents. Retrieve an initial answer set using these values, then refine the answer iteratively.
Improvement
Let V be the number of documents retrieved initially. Take the first r answers as relevant, and from them compute Vi: the number of those documents containing ti. Update the initial probabilities:

P(ti|R) = Vi / V,   P(ti|R̄) = (ni − Vi) / (N − V)

Resubmit the query and repeat until convergence.
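The estimation loop can be sketched like this. It shows the initial guesses and one refinement step; the document frequencies and retrieved counts are made up, and the variable names follow the slides:

```python
def initial_estimates(n, N):
    """Initial guesses: P(t_i|R) = 0.5 and P(t_i|notR) = n_i / N."""
    return {t: 0.5 for t in n}, {t: n_i / N for t, n_i in n.items()}

def refine(n, N, V, V_i):
    """Update step: P(t_i|R) = V_i/V, P(t_i|notR) = (n_i - V_i)/(N - V)."""
    p_rel = {t: V_i[t] / V for t in n}
    p_irr = {t: (n[t] - V_i[t]) / (N - V) for t in n}
    return p_rel, p_irr

n = {"statue": 100, "ocean": 10}   # document frequencies (made up)
N = 1000                           # corpus size
p_rel, p_irr = initial_estimates(n, N)        # first, the blind guesses
# suppose 20 docs were retrieved with them; "statue" occurs in 5, "ocean" in 2
p_rel, p_irr = refine(n, N, V=20, V_i={"statue": 5, "ocean": 2})
print(p_rel["statue"], p_irr["ocean"])
```

In a full system this loop would alternate with re-scoring the corpus until the rankings stop changing.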
Comments on Probabilistic Model
Pros:
- good theoretical basis
Cons:
- need to guess initial probabilities
- binary weights
- independence assumption
Extensions:
- relevance feedback: humans choose relevant docs
- OKAPI formula for non-binary weights
Comparison of Models
The Boolean model is simple and used almost everywhere. It does not allow for partial matches. It is the weakest model.
The vector space model has been shown (Salton and Buckley) to outperform the other two models.
Various extensions deal with their weaknesses.
Query Modification
The results are not always satisfactory
- some answers are correct, others are not
- queries can't specify the user's needs precisely
Iteratively reformulate and resubmit the query until the results become satisfactory
Two approaches
- relevance feedback
- query expansion
Relevance Feedback
Mark answers as
- relevant: positive examples
- irrelevant: negative examples
A query is a point in document space
- at each iteration compute a new query point
- the query moves towards an "optimal point" that distinguishes relevant from non-relevant documents
- the weights of query terms are modified: "term reweighting"
Rocchio Vectors
(figure: successive query points q0, q1, q2 moving towards the optimal query)
Rocchio Formula
New query point:

q = α·q0 + (β/n1)·Σ_{i=1..n1} di − (γ/n2)·Σ_{j=1..n2} dj

- di: relevant answer
- dj: non-relevant answer
- n1: number of relevant answers
- n2: number of non-relevant answers
- α, β, γ: relative strength (usually α = β = γ = 1)
- α = 1, β = 0.75, γ = 0.25: q0 and the relevant answers contain the most important information
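A minimal sketch of the update, using the α = 1, β = 0.75, γ = 0.25 setting from the slide. The vectors are made up; clipping negative weights to 0 is a common convention assumed here, not stated in the slides:

```python
def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """New query = alpha*q0 + beta*centroid(relevant) - gamma*centroid(non_relevant)."""
    M = len(q0)

    def centroid(vectors):
        if not vectors:
            return [0.0] * M
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(M)]

    c_rel, c_irr = centroid(relevant), centroid(non_relevant)
    # negative term weights are usually clipped to 0 (assumed convention)
    return [max(0.0, alpha * q0[i] + beta * c_rel[i] - gamma * c_irr[i])
            for i in range(M)]

q0 = [1.0, 0.0, 0.0]                            # original query
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]   # marked relevant
non_relevant = [[0.0, 0.0, 2.0]]                # marked non-relevant
print(rocchio(q0, relevant, non_relevant))      # → [1.75, 0.375, 0.0]
```

Terms common to q0 and the relevant answers gain weight; terms occurring only in non-relevant answers are pushed toward (or to) zero.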
Query Expansion
Adds new terms to the query which are somehow related to existing terms
- synonyms from a dictionary (e.g., staff, crew)
- semantically related terms from a thesaurus (e.g., WordNet): man, woman, mankind, human, …
- terms with similar pronunciation (Phonix, Soundex)
Better results in many cases, but the query defocuses (topic drift)
Comments
Do all together
- query expansion: new terms are added from relevant documents, dictionaries, thesauri
- term reweighting by the Rocchio formula
If consistent relevance judgments are provided
- 2-3 iterations improve results
- quality depends on the corpus
Extensions
Pseudo relevance feedback: mark the top k answers as relevant, the bottom k answers as non-relevant, and apply the Rocchio formula
Relevance models for the probabilistic model
- evaluation of initial answers by humans
- term reweighting model by Bruce Croft, 1983
Text Clustering
The grouping of similar vectors into clusters
Similar documents tend to be relevant to the same requests
Clustering in the M-dimensional space (M: number of terms)
Clustering Methods
Sound methods, based on the document-to-document similarity matrix
- graph-theoretic methods
- O(N²) time
Iterative methods, operating directly on the document vectors
- O(N log N) or O(N²/log N) time
Sound Methods
1. Two documents with similarity > T (threshold) are connected with an edge [Duda & Hart, 1973]
- clusters: the connected components (or maximal cliques) of the resulting graph
- problem: selection of an appropriate threshold T
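A sketch of the connected-components variant (the maximal-cliques variant is harder, since clique enumeration is exponential in the worst case). The similarity matrix is made up; components are found with a depth-first search:

```python
def threshold_clusters(sim, T):
    """Connect documents with similarity > T; clusters are the
    connected components of the resulting graph (found by DFS)."""
    docs = list(sim)
    adj = {d: [e for e in docs if e != d and sim[d].get(e, 0) > T] for d in docs}
    seen, clusters = set(), []
    for d in docs:
        if d in seen:
            continue
        stack, component = [d], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            component.append(u)
            stack.extend(adj[u])
        clusters.append(sorted(component))
    return clusters

# symmetric similarity matrix for 4 documents (made-up values)
sim = {
    1: {2: 0.9, 3: 0.1, 4: 0.2},
    2: {1: 0.9, 3: 0.2, 4: 0.1},
    3: {1: 0.1, 2: 0.2, 4: 0.8},
    4: {1: 0.2, 2: 0.1, 3: 0.8},
}
print(threshold_clusters(sim, T=0.5))  # → [[1, 2], [3, 4]]
```

Raising T toward 1 splits everything into singletons, lowering it merges everything into one cluster, which is exactly the threshold-selection problem the slide mentions.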
Zahn’s method [Zahn71]
the dashed edge
is inconsistent
and is deleted
Find the minimum spanning tree
For each doc delete edges with length l > lavg
 lavg: average distance if its incident edges
Or remove the longest edge (1 edge removed
=> 2 clusters, 2 edges removed => 3 clusters
Clusters: the connected components of the
graph
Iterative Methods
K-means clustering (K known in advance)
- choose some seed points (documents): possible cluster centroids
- repeat until the centroids do not change:
  - assign each vector (document) to its closest seed
  - compute new centroids
  - reassign vectors to improve the clusters
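The loop above can be sketched in a few lines. The 2-dimensional vectors and seeds are toy stand-ins for M-dimensional document vectors; the iteration stops when the assignment no longer changes:

```python
def kmeans(vectors, seeds, iterations=100):
    """K-means: assign each vector to its closest centroid, then
    recompute centroids; stop when assignments no longer change."""
    centroids = [list(s) for s in seeds]
    assignment = None
    for _ in range(iterations):
        new_assignment = []
        for v in vectors:                       # assignment step
            dists = [sum((vi - ci) ** 2 for vi, ci in zip(v, c)) for c in centroids]
            new_assignment.append(dists.index(min(dists)))
        if new_assignment == assignment:        # centroids stable
            break
        assignment = new_assignment
        for k in range(len(centroids)):         # update step
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
assignment, centroids = kmeans(vectors, seeds=[[0.0, 0.0], [5.0, 5.0]])
print(assignment)  # → [0, 0, 1, 1]
```

Squared Euclidean distance is used here for the "closest seed" test; with length-normalized document vectors, ranking by this distance is equivalent to ranking by cosine similarity.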
Cluster Searching
The M-dimensional query vector is compared with the cluster centroids
- search the closest cluster
- retrieve its documents with similarity > T
References
- "Modern Information Retrieval", Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Addison Wesley, 1999
- "Searching Multimedia Databases by Content", Christos Faloutsos, Kluwer Academic Publishers, 1996
- Information Retrieval Resources: http://nlp.stanford.edu/IR-book/information-retrieval.html
- TREC: http://trec.nist.gov/
- SMART: http://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
- LEMUR: http://www.lemurproject.org/
- LUCENE: http://lucene.apache.org/