Modeling (Chap. 2)

Modern Information Retrieval
Spring 2000
Introduction



- Traditional IR systems adopt index terms to index and retrieve documents
- An index term is simply any word that appears in the text of a document
- Retrieval based on index terms is simple
  - the premise is that the semantics of documents and of the user information need can be expressed through a set of index terms

Key Question
- semantics in the document (and in the user request) is lost when text is replaced with a set of words
- matching between documents and user requests is done in the very imprecise space of index terms (low-quality retrieval)
- the problem is worsened for users with no training in properly forming queries (a cause of frequent dissatisfaction of Web users with the answers obtained)

Taxonomy of IR Models

Three classic models
- Boolean: documents and queries represented as sets of index terms
- Vector: documents and queries represented as vectors in a t-dimensional space
- Probabilistic: document and query representations based on probability theory
Basic Concepts


- Classic models consider that each document is described by index terms
- An index term is a (document) word that helps in recalling the document's main themes
  - index terms are used to index and summarize document content
  - in general, index terms are nouns (because they have meaning by themselves)
  - index terms may include all distinct words in a document collection
- Distinct index terms have varying relevance when describing document contents
- Thus numerical weights are assigned to each index term of a document
- Let ki be an index term, dj a document, and wi,j >= 0 the weight for the pair (ki, dj)
- The weight quantifies the importance of the index term for describing the document's semantic contents
Definition (p. 25)

- Let t be the number of index terms in the system and ki a generic index term.
- K = {k1, ..., kt} is the set of all index terms.
- A weight wi,j > 0 is associated with each index term ki of a document dj.
- For an index term that does not appear in the document text, wi,j = 0.
- The document dj is associated with an index term vector dj = (w1,j, w2,j, ..., wt,j).
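The definition above can be sketched in a few lines of Python (a minimal illustration; the vocabulary and helper name are hypothetical). With a fixed set of index terms K, each document maps to a weight vector dj = (w1,j, ..., wt,j), with wi,j = 0 for absent terms; here raw counts serve as weights.

```python
from collections import Counter

# Hypothetical vocabulary K = {k1, ..., kt} of t index terms
K = ["information", "retrieval", "model", "query"]

def term_vector(doc_tokens, vocabulary=K):
    """Build dj = (w1,j, ..., wt,j); absent terms get weight 0."""
    counts = Counter(doc_tokens)
    return [counts.get(k, 0) for k in vocabulary]

d_j = term_vector(["information", "retrieval", "information"])
print(d_j)  # [2, 1, 0, 0]
```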
Boolean Model



- Simple retrieval model based on set theory and Boolean algebra
- the framework is easy for users to grasp (the concept of a set is intuitive)
- Queries are specified as Boolean expressions, which have precise semantics

Drawbacks

- Retrieval strategy is a binary decision (a document is relevant or non-relevant)
  - prevents good retrieval performance
- not simple to translate an information need into a Boolean expression (difficult and awkward to express)
- nevertheless, the dominant model in commercial DB systems
Boolean Model (Cont.)




- Considers that index terms are present or absent in a document
- index term weights are binary, i.e. wi,j ∈ {0,1}
- a query q is composed of index terms linked by not, and, or
- the query is a Boolean expression which can be represented in disjunctive normal form (DNF)
Boolean Model (Cont.)

- The query [q = ka ∧ (kb ∨ ¬kc)] can be written in DNF as [qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)]
  - each component is a binary weighted vector associated with the tuple (ka, kb, kc)
  - the binary weighted vectors are called the conjunctive components of qdnf
Boolean Model (Cont.)

- Index term weight variables are all binary, i.e. wi,j ∈ {0,1}
- a query q is a conventional Boolean expression
- Let qdnf be the DNF for the query q
- Let qcc be any of the conjunctive components of qdnf
- The similarity of a document dj to the query q is defined as
  - sim(dj,q) = 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc)), where gi(dj) = wi,j
  - sim(dj,q) = 0 otherwise
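The Boolean matching rule can be sketched as follows (a minimal illustration with hypothetical helper names; the query is expressed as a Python predicate over term assignments). A document matches iff its binary vector equals one of the conjunctive components of the query's DNF.

```python
from itertools import product

def to_binary_vector(doc_terms, vocabulary):
    """Map a document's term set to its binary weight vector g(dj)."""
    return tuple(1 if t in doc_terms else 0 for t in vocabulary)

def dnf_components(query, vocabulary):
    """Enumerate the binary vectors (conjunctive components) satisfying the query."""
    return {v for v in product((0, 1), repeat=len(vocabulary))
            if query(dict(zip(vocabulary, v)))}

def sim(doc_terms, query, vocabulary):
    """sim(dj,q) = 1 iff the document's vector equals some conjunctive component."""
    return 1 if to_binary_vector(doc_terms, vocabulary) in dnf_components(query, vocabulary) else 0

vocab = ("ka", "kb", "kc")
q = lambda w: w["ka"] and (w["kb"] or not w["kc"])  # q = ka AND (kb OR NOT kc)

print(sorted(dnf_components(q, vocab)))  # [(1, 0, 0), (1, 1, 0), (1, 1, 1)]
print(sim({"ka", "kb"}, q, vocab))       # 1 (matches component (1,1,0))
print(sim({"kb", "kc"}, q, vocab))       # 0
```

Enumerating all 2^t assignments is exponential; it is used here only to make the DNF semantics concrete for a tiny vocabulary.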
Boolean Model (Cont.)




- If sim(dj,q) = 1, the Boolean model predicts that document dj is relevant to query q (it might not be)
- Otherwise, the prediction is that the document is not relevant
- The Boolean model predicts that each document is either relevant or non-relevant
  - no notion of a partial match

Main advantages
- clean formalism
- simplicity

Main disadvantages
- exact matching may lead to retrieval of too few or too many documents
- index term weighting could lead to improvement in retrieval performance
Vector Model



- Assigns non-binary weights to index terms in queries and documents
- term weights are used to compute the degree of similarity between each document and the user query
- by sorting retrieved documents in decreasing order of degree of similarity, the vector model considers partially matched documents
  - the ranked document answer set is a lot more precise than the answer set of the Boolean model
Vector Model (Cont.)
- The weight wi,j for a pair (ki, dj) is positive and non-binary
- index terms in the query are also weighted
- Let wi,q be the weight associated with the pair [ki, q], where wi,q >= 0
- the query vector q is defined as q = (w1,q, w2,q, ..., wt,q), where t is the total number of index terms in the system
- the vector for a document dj is represented by dj = (w1,j, w2,j, ..., wt,j)
Vector Model (Cont.)



- Document dj and user query q are represented as t-dimensional vectors
- the degree of similarity of dj with regard to q is evaluated as the correlation between the vectors dj and q
- this correlation can be quantified by the cosine of the angle between the two vectors:

  sim(dj,q) = (dj · q) / (|dj| × |q|)
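The cosine measure above can be sketched directly (a minimal illustration; the helper name is hypothetical, and vectors are plain lists of term weights over the same t terms):

```python
from math import sqrt

def cosine_sim(d, q):
    """sim(dj,q) = (dj . q) / (|dj| * |q|); 0 if either vector is all zeros."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = sqrt(sum(w * w for w in d))
    norm_q = sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim([0.5, 0.8, 0.0], [0.5, 0.8, 0.0]))  # same direction -> ~1.0
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))            # orthogonal -> 0.0
```

Since the weights are non-negative, the result indeed lies in [0, 1], matching the range stated on the next slide.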
Vector Model (Cont.)



- sim(dj,q) varies from 0 to +1
- Ranks documents according to their degree of similarity to the query
- a document may be retrieved even if it only partially matches the query
  - establish a threshold on sim(dj,q) and retrieve the documents with a degree of similarity above that threshold
Index term weights




- Documents are a collection C of objects
- The user query is a set A of objects
- The IR problem is to determine which documents are in set A and which are not (i.e. a clustering problem)
- In the clustering problem
  - intra-cluster similarity (which features better describe the objects in set A)
  - inter-cluster dissimilarity (which features better distinguish the objects in set A from the remaining objects in collection C)
- In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj (the tf factor)
  - how well the term describes the document contents
- inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection (the idf factor)
  - terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one
Definition (p. 29)

- Let N be the total number of documents in the system
- let ni be the number of documents in which index term ki appears
- let freqi,j be the raw frequency of term ki in document dj
  - i.e. the number of times term ki is mentioned in the text of document dj
- The normalized frequency fi,j of term ki in dj is

  fi,j = freqi,j / max_l freql,j

  - the maximum is computed over all terms mentioned in the text of document dj
  - if term ki does not appear in document dj, then fi,j = 0
- let idfi, the inverse document frequency for ki, be

  idfi = log (N / ni)

- The best known term-weighting scheme uses

  wi,j = fi,j × log (N / ni)
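The tf-idf scheme above can be sketched as follows (a minimal illustration; helper names and the toy collection are hypothetical, and documents are plain token lists):

```python
from collections import Counter
from math import log

def tf(term, doc_tokens):
    """Normalized frequency f i,j = freq i,j / max_l freq l,j."""
    counts = Counter(doc_tokens)
    if term not in counts:
        return 0.0
    return counts[term] / max(counts.values())

def idf(term, docs):
    """idf i = log(N / ni), with ni = number of documents containing the term."""
    n_i = sum(1 for d in docs if term in d)
    return log(len(docs) / n_i) if n_i else 0.0

def tf_idf(term, doc_tokens, docs):
    """wi,j = fi,j * log(N / ni)."""
    return tf(term, doc_tokens) * idf(term, docs)

docs = [["gold", "silver", "truck"],
        ["shipment", "of", "gold"],
        ["delivery", "of", "silver"]]
print(tf_idf("truck", docs[0], docs))  # in 1 of 3 docs: 1.0 * log(3)
print(tf_idf("gold", docs[0], docs))   # in 2 of 3 docs: 1.0 * log(3/2)
```

Note how the rarer term ("truck") gets a higher weight than the more common one ("gold"), exactly the idf intuition from the previous slide.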

Advantages of vector model




- the term-weighting scheme improves retrieval performance
- retrieves documents that approximate the query conditions
- sorts documents according to their degree of similarity to the query

Disadvantage

- index terms are assumed mutually independent
Probabilistic Model

- Given a user query, there is a set of documents containing exactly the relevant documents
  - the ideal answer set
- given a description of the ideal answer set, there is no problem in retrieving its documents
- the querying process is a process of specifying the properties of the ideal answer set
  - the properties are not exactly known
  - there are index terms whose semantics are used to characterize these properties
Probabilistic Model (Cont.)




- These properties are not known at query time
- an effort has to be made to initially guess what they (i.e. the properties) are
- the initial guess generates a preliminary probabilistic description of the ideal answer set, used to retrieve a first set of documents
- user interaction is then initiated to improve the probabilistic description of the ideal answer set
  - the user examines the retrieved documents and decides which ones are relevant
  - this information is used to refine the description of the ideal answer set
  - by repeating this process, the description evolves and gets closer to the ideal answer set
Fundamental Assumption

- Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find document dj relevant
  - assumes that the probability of relevance depends on the query and document representations only
  - assumes that there is a subset of all documents which the user prefers as the answer set for query q
  - such an ideal answer set is labeled R
  - documents in the set R are predicted to be relevant to the query
- Given a query q, the probabilistic model assigns to each document dj the ratio P(dj relevant-to q) / P(dj non-relevant-to q) as its measure of similarity to the query
  - the odds of document dj being relevant to query q







- Index term weight variables are all binary, i.e. wi,j ∈ {0,1}, wi,q ∈ {0,1}
- a query q is a subset of index terms
- let R be the set of documents known (or initially guessed) to be relevant
- let R̄ be the complement of R (the set of non-relevant documents)
- let P(R|dj) be the probability that document dj is relevant to query q
- let P(R̄|dj) be the probability that document dj is not relevant to query q
- The similarity sim(dj,q) of document dj to query q is the ratio

  sim(dj,q) = P(R|dj) / P(R̄|dj)

- applying Bayes' rule and ignoring factors that are constant for all documents,

  sim(dj,q) ~ P(dj|R) / P(dj|R̄)

- assuming independence of index terms and taking logarithms,

  sim(dj,q) ~ Σ(i=1..t) wi,q × wi,j × ( log [P(ki|R) / (1 − P(ki|R))] + log [(1 − P(ki|R̄)) / P(ki|R̄)] )

How to compute P(ki|R) and P(ki|R̄) initially?

- assume P(ki|R) is constant for all index terms ki (typically 0.5)
  - P(ki|R) = 0.5
- assume the distribution of index terms among non-relevant documents is approximated by the distribution of index terms among all documents in the collection
  - P(ki|R̄) = ni/N, where ni is the number of documents containing index term ki and N is the total number of documents

To improve these estimates:

- Let V be the subset of documents initially retrieved and ranked by the model
- let Vi be the subset of V composed of the documents in V that contain index term ki
- P(ki|R) is approximated by the distribution of index term ki among the documents retrieved so far
  - P(ki|R) = Vi / V (here V and Vi denote the sizes of these sets)
- P(ki|R̄) is approximated by considering that all non-retrieved documents are not relevant
  - P(ki|R̄) = (ni − Vi) / (N − V)

Advantages


- documents are ranked in decreasing order of their probability of being relevant

Disadvantages

- need to guess the initial separation of relevant and non-relevant sets
- all index term weights are binary
- index terms are assumed mutually independent
