Transcript Document

Intro to Information Retrieval
By the end of the lecture you should be able to:
• explain the differences between database and information retrieval technologies
• describe the basic maths underlying set-theoretic and vector models of classical IR.
Reminder: efficiency is vital
• Reminder: Google finds documents which match your keywords; this must be done EFFICIENTLY – it can't just scan each document from start to end for each keyword.
• So, the cache stores a copy of each document, and also a "cut-down" version of the document for searching: just a "bag of words", a sorted list (or array/vector/…) of the words appearing in the document (with links back to the full document).
• Try to match keywords against this list; if found, then return the full document.
• Even cleverer: dictionary and inverted file…
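The "bag of words" idea can be sketched in a few lines — a minimal illustration (the names `make_bag` and `contains` are mine, not from the lecture): store each document's distinct words as a sorted list and binary-search that list instead of scanning the full text.

```python
import bisect
import re

def make_bag(text):
    """Reduce a document to a sorted list of its distinct words (a 'bag of words')."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(set(words))

def contains(bag, keyword):
    """Binary-search the sorted word list rather than scanning the document."""
    i = bisect.bisect_left(bag, keyword)
    return i < len(bag) and bag[i] == keyword

bag = make_bag("Recipe for jam pudding: boil the jam with treacle")
print(contains(bag, "jam"))      # True
print(contains(bag, "traffic"))  # False
```

Search in the bag is O(log n) per keyword; the link back to the full document is returned only on a hit.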
Inverted file structure
[Figure: three linked structures — a dictionary of terms with document frequencies (Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), …), each entry pointing into an inverted (postings) file of document numbers, which in turn link to the documents themselves (Doc 1 … Doc 6, …) in the data file.]
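The dictionary-plus-postings idea above can be sketched as follows — illustrative code only, not the lecture's: the dictionary maps each term to its postings list of document numbers, and a keyword lookup returns those numbers directly, with no document scanned.

```python
from collections import defaultdict

# Toy data file: document number -> document text.
docs = {
    1: "recipe for jam pudding",
    2: "dot report on traffic lanes",
    3: "radio item on traffic jam in pudding lane",
}

# Build the inverted index: term -> sorted postings list of doc numbers.
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        index[term].append(doc_id)

print(index["jam"])      # [1, 3]
print(index["traffic"])  # [2, 3]
```

The length of each postings list is the document frequency shown in brackets in the figure's dictionary.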
IR vs DBMS
                      DBMS            IR
match                 exact           partial or best match
inference             deduction       induction
model                 deterministic   probabilistic
data                  record/field    text document
query language        artificial      natural?
query specification   complete        incomplete
items wanted          matching        relevant
error response        sensitive       insensitive
informal introduction
• IR was developed for bibliographic systems. We shall refer to 'documents', but the technique extends beyond items of text.
• central to IR is the representation of a document by a set of 'descriptors' or 'index terms' ("words in the document").
• searching for a document is carried out (mainly) in the 'space' of index terms.
• we need a language for formulating queries, and a method for matching queries with document descriptors.
architecture
[Figure: the user poses a query; a query-matching component searches the object base (objects and their descriptions) and returns hits; the user's feedback on the hits drives a learning component that refines future matching.]
basic notation
Given a list of m documents, D, and a list of n index terms, T, we define wi,j ≥ 0 to be a weight associated with the ith keyword and the jth document.
For the jth document, we define an index term vector, dj:
dj = (w1,j, w2,j, …, wn,j)
For example: D = {d1, d2, d3}, T = {pudding, jam, traffic, lane, treacle}
d1 = (1, 1, 0, 0, 0)  — recipe for jam pudding
d2 = (0, 0, 1, 1, 0)  — DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0)  — radio item on traffic jam in Pudding Lane
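Building these binary term vectors is mechanical; a small sketch (the helper `term_vector` is my own name, and the documents' index terms are supplied directly, side-stepping stemming issues like lanes/lane):

```python
# Index terms, in the lecture's fixed order.
T = ["pudding", "jam", "traffic", "lane", "treacle"]

def term_vector(doc_terms, terms=T):
    """w_{i,j} = 1 if index term i appears in document j, else 0."""
    present = set(doc_terms)
    return [1 if t in present else 0 for t in terms]

d1 = term_vector(["pudding", "jam"])                     # jam pudding recipe
d2 = term_vector(["traffic", "lane"])                    # DoT report
d3 = term_vector(["pudding", "jam", "traffic", "lane"])  # radio item
print(d1)  # [1, 1, 0, 0, 0]
print(d3)  # [1, 1, 1, 1, 0]
```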
set theoretic, Boolean model
• Queries are Boolean expressions formed using keywords, eg:
('Jam' ∨ 'Treacle') ∧ 'Pudding' ∧ ¬'Lane' ∧ ¬'Traffic'
• Query is re-expressed in disjunctive normal form (DNF)
CF: T = {pudding, jam, traffic, lane, treacle}
eg (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
To match a document with a query:
sim(d, qDNF) = 1 if d is equal to a component of qDNF
             = 0 otherwise
CF: T = {pudding, jam, traffic, lane, treacle}
[Figure: Venn diagram over the five terms, with the documents placed by their index terms:
d1 = (1, 1, 0, 0, 0), d2 = (0, 0, 1, 1, 0), d3 = (1, 1, 1, 1, 0)]
collecting results
CF: T = {pudding, jam, traffic, lane, treacle}
Query: ('Jam' ∨ 'Treacle') ∧ 'Pudding' ∧ ¬'Lane' ∧ ¬'Traffic'
[Figure: the region (jam ∪ treacle) ∩ pudding, minus lane, minus traffic, shaded on the Venn diagram of the five terms.]
Answer: d1 = (1, 1, 0, 0, 0) — the jam pud recipe
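The exact-match rule sim(d, qDNF) can be sketched directly — illustrative code, with the DNF written out by hand as the set of its component vectors over T = (pudding, jam, traffic, lane, treacle):

```python
# DNF of ('Jam' v 'Treacle') ^ 'Pudding' ^ not 'Lane' ^ not 'Traffic',
# as binary vectors over (pudding, jam, traffic, lane, treacle).
q_dnf = {(1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (1, 1, 0, 0, 1)}

def sim(d, q_dnf):
    """1 if the document vector equals a component of the DNF, else 0."""
    return 1 if tuple(d) in q_dnf else 0

d1 = (1, 1, 0, 0, 0)
d2 = (0, 0, 1, 1, 0)
d3 = (1, 1, 1, 1, 0)
print([sim(d, q_dnf) for d in (d1, d2, d3)])  # [1, 0, 0]
```

Only d1 matches, reproducing the answer above; d3 mentions jam and pudding but is excluded by the ¬'Lane' ∧ ¬'Traffic' conjuncts.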
Statistical vector model
• weights, 0 ≤ wi,j ≤ 1, no longer binary-valued
• query also represented by a vector
q = (w1q, w2q, …, wnq)
– eg q = (1.0, 0.6, 0.0, 0.0, 0.8)
CF: T = {pudding, jam, traffic, lane, treacle}
to match the jth document with a query:

sim(dj, q) = dj · q / ( |dj| × |q| )
           = Σi=1..n (wij × wiq) / ( √(Σi=1..n wij²) × √(Σi=1..n wiq²) )
           = cos(θ)

where θ is the angle between the vectors dj and q.

Cosine coefficient
[Figure: three diagrams of a document vector D1 = (w11, w21) and a query vector Q = (w1q, w2q) in the plane spanned by terms T1 and T2:
- in general, sim(D1, Q) = cos(θ) for the angle θ between D1 and Q;
- when θ = 0 (the vectors point the same way), sim = cos(0) = 1;
- when θ = 90º (eg w1q = 0 and w21 = 0, so the vectors share no terms), sim = cos(90º) = 0.]
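The cosine coefficient translates line-for-line into code — a minimal sketch (the function name `sim` mirrors the lecture's notation; the zero-vector guard is my addition):

```python
import math

def sim(d, q):
    """Cosine coefficient: d.q / (|d| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # a zero vector shares no terms with anything
    return dot / (norm_d * norm_q)

q  = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2)
print(round(sim(d1, q), 2))  # 0.89
```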
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2)   — jam pud recipe

Σi (wij × wiq) = 0.8×1.0 + 0.8×0.6 + 0.0×0.0 + 0.0×0.0 + 0.2×0.8 = 1.44
Σi wiq² = 1.0² + 0.6² + 0.0² + 0.0² + 0.8² = 2.0
Σi wij² = 0.8² + 0.8² + 0.0² + 0.0² + 0.2² = 1.32

sim(d1, q) = Σi (wij × wiq) / ( √(Σi wij²) × √(Σi wiq²) )
           = 1.44 / √(1.32 × 2.0) = 0.89
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0)   — DoT report

Σi (wij × wiq) = 0.0×1.0 + 0.0×0.6 + 0.9×0.0 + 0.8×0.0 + 0.0×0.8 = 0.0
Σi wiq² = 1.0² + 0.6² + 0.0² + 0.0² + 0.8² = 2.0
Σi wij² = 0.0² + 0.0² + 0.9² + 0.8² + 0.0² = 1.45

sim(d2, q) = 0.0 / √(1.45 × 2.0) = 0.0
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d3 = (0.6, 0.9, 1.0, 0.6, 0.0)   — radio traffic report

Σi (wij × wiq) = 0.6×1.0 + 0.9×0.6 + 1.0×0.0 + 0.6×0.0 + 0.0×0.8 = 1.14
Σi wiq² = 1.0² + 0.6² + 0.0² + 0.0² + 0.8² = 2.0
Σi wij² = 0.6² + 0.9² + 1.0² + 0.6² + 0.0² = 2.53

sim(d3, q) = 1.14 / √(2.53 × 2.0) = 0.51
collecting results
CF: T = {pudding, jam, traffic, lane, treacle}
q = (1.0, 0.6, 0.0, 0.0, 0.8)
Rank  document vector                 document               (sim)
1.    d1 = (0.8, 0.8, 0.0, 0.0, 0.2)  Jam pud recipe         (0.89)
2.    d3 = (0.6, 0.9, 1.0, 0.6, 0.0)  Radio Traffic Report   (0.51)
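The whole retrieval step — score every document against the query and sort best-first — can be sketched end to end (illustrative code; the document names are the lecture's, the function name `cosine` is mine):

```python
import math

def cosine(d, q):
    """Cosine coefficient between a document vector and the query vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    nd = math.sqrt(sum(w * w for w in d))
    nq = math.sqrt(sum(w * w for w in q))
    return dot / (nd * nq) if nd and nq else 0.0

q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {
    "Jam pud recipe":       (0.8, 0.8, 0.0, 0.0, 0.2),
    "DoT Report":           (0.0, 0.0, 0.9, 0.8, 0.0),
    "Radio Traffic Report": (0.6, 0.9, 1.0, 0.6, 0.0),
}

# Score, then rank best-first.
ranking = sorted(((cosine(d, q), name) for name, d in docs.items()), reverse=True)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

Running this reproduces the table: Jam pud recipe 0.89, Radio Traffic Report 0.51, with the DoT Report scoring 0.00 and ranked last.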
Discussion: Set theoretic model
• Boolean model is simple, and queries have precise semantics, but it is an 'exact match' model, and does not rank results
• Boolean model popular with bibliographic systems; available on some search engines
• Users find Boolean queries hard to formulate
• Attempts to use the set theoretic model as the basis for a partial-match system: the fuzzy set model and the extended Boolean model.
Discussion: Vector Model
• Vector model is simple, fast, and has been shown to give 'good' results
• Partial matching leads to ranked output
• Popular model with search engines
• Underlying assumption of term independence (not realistic! phrases, collocations, grammar)
• Generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).
questions raised
• Where do the index terms come from? (ALL the words in the source documents?)
• What determines the weights?
• How well can we expect these systems to work for practical applications?
• How can we improve them?
• How do we integrate IR into more traditional DB management?
Questions to think about
• Why is a traditional database unsuited to retrieval of unstructured information?
• How would you re-express a Boolean query, eg (A or B or (C and not D)), in disjunctive normal form?
• For the matching coefficient, sim(·, ·), show that 0 ≤ sim(·, ·) ≤ 1, and that sim(a, a) = 1.
• Compare and contrast the 'vector' and 'set theoretic' models in terms of power of representation of documents and queries.