Transcript Document
Intro to Information Retrieval
By the end of the lecture you should be able to:
explain the differences between database
and information retrieval technologies
describe the basic maths underlying set-theoretic and vector models of classical IR.
Reminder: efficiency is vital
Reminder: Google finds documents that
match your keywords; this must be done
EFFICIENTLY – we can't just scan each
document from start to end for every keyword
So the cache stores a copy of each document, and also
a "cut-down" version of the document for
searching: just a "bag of words", a sorted list
(or array/vector/…) of the words appearing in the
document, with links back to the full document
Keywords are matched against this list; if
they are found, the full document is returned
Even cleverer: dictionary and inverted file…
Inverted file structure
[Figure: three linked structures. The dictionary lists each term with its document frequency – Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), … – and a pointer to that term's entry in the inverted (postings) file. The postings file holds, for each term, the list of numbers of the documents containing it. These numbers point into the data file of full documents (Doc 1, Doc 2, …, Doc 6, …).]
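The same structure can be sketched in a few lines of code. Below is a minimal in-memory version in Python; the example documents and all identifiers are mine, for illustration only.

```python
# A minimal in-memory dictionary + postings file; the documents and
# all identifiers here are illustrative, not from the lecture.

docs = {
    1: "recipe for jam pudding",
    2: "DoT report on traffic lanes",
    3: "radio item on traffic jam in Pudding Lane",
}

# dictionary: term -> postings (sorted list of ids of documents containing it)
index: dict[str, list[int]] = {}
for doc_id, text in docs.items():
    for term in set(text.lower().split()):
        index.setdefault(term, []).append(doc_id)

def lookup(keyword: str) -> list[int]:
    """Match a keyword against the dictionary; return its postings."""
    return index.get(keyword.lower(), [])

print(lookup("jam"))      # [1, 3] -> full documents via the data file
print(lookup("treacle"))  # []     -> no matching documents
```

Looking a keyword up in the (sorted or hashed) dictionary is then far cheaper than scanning every document from start to end.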
IR vs DBMS

                        DBMS             IR
match                   exact            partial or best match
inference               deduction        induction
model                   deterministic    probabilistic
data                    record/field     text document
query language          artificial       natural?
query specification     complete         incomplete
items wanted            matching         relevant
error response          sensitive        insensitive
informal introduction
IR was developed for bibliographic systems.
We shall refer to ‘documents’, but the
technique extends beyond items of text.
Central to IR is the representation of a document
by a set of 'descriptors' or 'index terms'
("words in the document").
Searching for a document is carried out
(mainly) in the 'space' of index terms.
We need a language for formulating queries,
and a method for matching queries with
document descriptors.
architecture
[Figure: the user sends a query to the query matching component and receives hits back; the user's feedback drives a learning component, which adjusts the matching; queries are matched against the object base (objects and their descriptions).]
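Read as code, the diagram is a loop between components. The skeleton below is a hypothetical sketch – every class and method name is invented here – of how query matching and the learning component sit around the object base.

```python
# Hypothetical skeleton of the architecture above; all names invented.

class ObjectBase:
    """Objects and their descriptions (here, sets of index terms)."""
    def __init__(self, descriptions):
        self.descriptions = descriptions          # doc_id -> set of terms

class IRSystem:
    def __init__(self, object_base):
        self.object_base = object_base

    def match(self, query):
        """Query matching: return hits sharing at least one query term."""
        return [doc_id
                for doc_id, terms in self.object_base.descriptions.items()
                if query & terms]

    def learn(self, query, relevant_hits):
        """Learning component: refine matching from user feedback
        (eg by re-weighting terms of documents judged relevant)."""
        pass                                      # beyond this lecture

base = ObjectBase({1: {"jam", "pudding"}, 2: {"traffic", "lane"}})
system = IRSystem(base)
print(system.match({"pudding"}))                  # hits: [1]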
basic notation
Given a list of m documents, D, and a list of n index
terms, T, we define wi,j ≥ 0 to be a weight associated
with the ith keyword and the jth document.
For the jth document, we define an index term vector, dj:
dj = (w1,j, w2,j, …, wn,j)
For example, with T = {pudding, jam, traffic, lane, treacle}
and D = {d1, d2, d3}:
d1 = (1, 1, 0, 0, 0)  – recipe for jam pudding
d2 = (0, 0, 1, 1, 0)  – DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0)  – radio item on traffic jam in Pudding Lane
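As a quick check of the notation, here is a small Python sketch that builds these binary index term vectors; the helper name and the pre-extracted word sets are assumptions of mine (lower-casing and crude stemming, eg 'lanes' → 'lane').

```python
# Binary index-term vectors for the lecture's example; the helper
# name and the pre-extracted word sets are my own.

T = ["pudding", "jam", "traffic", "lane", "treacle"]

def index_vector(words: set[str]) -> tuple[int, ...]:
    """wi,j = 1 if index term i occurs in document j, else 0."""
    return tuple(1 if t in words else 0 for t in T)

# Words extracted from each document, lower-cased and crudely
# stemmed ('lanes' -> 'lane'):
d1 = index_vector({"recipe", "jam", "pudding"})
d2 = index_vector({"dot", "report", "traffic", "lane"})
d3 = index_vector({"radio", "item", "traffic", "jam", "pudding", "lane"})

print(d1)  # (1, 1, 0, 0, 0)
print(d2)  # (0, 0, 1, 1, 0)
print(d3)  # (1, 1, 1, 1, 0)
```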
set theoretic, Boolean model
Queries are Boolean expressions formed using
keywords, eg:
(‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’
The query is re-expressed in disjunctive normal form (DNF)
CF: T = {pudding, jam, traffic, lane, treacle}
eg (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
To match a document with a query:
sim(d, qDNF) = 1 if d is equal to a component of qDNF
             = 0 otherwise
[Figure: Venn diagram over the terms pudding, jam, traffic, lane and treacle, showing the DNF components (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1) against the document vectors d1 = (1, 1, 0, 0, 0), d2 = (0, 0, 1, 1, 0), d3 = (1, 1, 1, 1, 0).]
collecting results
CF: T = {pudding, jam, traffic, lane, treacle}
Query: (‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’
[Figure: Venn diagram shading (jam ∨ treacle) ∧ pudding, with lane and traffic excluded.]
Answer: d1 = (1, 1, 0, 0, 0) – the jam pudding recipe
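The whole Boolean match fits in a short sketch. The vectors and DNF components below come from the slides; the function and variable names are mine.

```python
# DNF matching for the Boolean model; function and variable names mine.

T = ["pudding", "jam", "traffic", "lane", "treacle"]

# ('Jam' OR 'Treacle') AND 'Pudding' AND NOT 'Lane' AND NOT 'Traffic',
# as the DNF components from the slide:
q_dnf = [(1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (1, 1, 0, 0, 1)]

def sim(d: tuple[int, ...], q_dnf: list[tuple[int, ...]]) -> int:
    """1 if d is equal to a component of the DNF query, 0 otherwise."""
    return 1 if d in q_dnf else 0

docs = {
    "d1 (jam pudding recipe)": (1, 1, 0, 0, 0),
    "d2 (DoT traffic report)": (0, 0, 1, 1, 0),
    "d3 (radio traffic item)": (1, 1, 1, 1, 0),
}
for name, d in docs.items():
    print(name, sim(d, q_dnf))   # only d1 scores 1
```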
Statistical vector model
Weights 0 ≤ wi,j ≤ 1: no longer binary-valued
The query is also represented by a vector:
q = (w1,q, w2,q, …, wn,q)
– eg q = (1.0, 0.6, 0.0, 0.0, 0.8)
CF: T = {pudding, jam, traffic, lane, treacle}
To match the jth document with a query:
sim(dj, q) = dj · q / ( |dj| × |q| )
           = Σi=1..n (wi,j × wi,q) / ( √(Σi=1..n wi,j²) × √(Σi=1..n wi,q²) )
Cosine coefficient
[Figure: three sketches of a document vector D1 and a query vector Q in a two-dimensional term space with axes T1 and T2 (components w11, w21 and w1q, w2q). The first shows a general angle θ between D1 and Q, with sim(D1, Q) = cos(θ); the second shows D1 and Q pointing the same way, so θ = 0 and cos(θ) = 1; the third shows orthogonal vectors (w1q = 0, w21 = 0), so θ = 90° and cos(θ) = 0.]
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2) – jam pud recipe

Σi=1..n (wi,1 × wi,q) = 0.8×1.0 + 0.8×0.6 + 0.0×0.0 + 0.0×0.0 + 0.2×0.8 = 1.44
Σi=1..n wi,q² = 1.0² + 0.6² + 0.0² + 0.0² + 0.8² = 2.0
Σi=1..n wi,1² = 0.8² + 0.8² + 0.0² + 0.0² + 0.2² = 1.32

sim(d1, q) = 1.44 / √(1.32 × 2.0) = 0.89
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0) – DoT report

Σi=1..n (wi,2 × wi,q) = 0.0×1.0 + 0.0×0.6 + 0.9×0.0 + 0.8×0.0 + 0.0×0.8 = 0.0
Σi=1..n wi,q² = 1.0² + 0.6² + 0.0² + 0.0² + 0.8² = 2.0
Σi=1..n wi,2² = 0.0² + 0.0² + 0.9² + 0.8² + 0.0² = 1.45

sim(d2, q) = 0.0 / √(1.45 × 2.0) = 0.0
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d3 = (0.6, 0.9, 1.0, 0.6, 0.0) – radio traffic report

Σi=1..n (wi,3 × wi,q) = 0.6×1.0 + 0.9×0.6 + 1.0×0.0 + 0.6×0.0 + 0.0×0.8 = 1.14
Σi=1..n wi,q² = 1.0² + 0.6² + 0.0² + 0.0² + 0.8² = 2.0
Σi=1..n wi,3² = 0.6² + 0.9² + 1.0² + 0.6² + 0.0² = 2.53

sim(d3, q) = 1.14 / √(2.53 × 2.0) = 0.51
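All three calculations can be checked mechanically. This is a minimal sketch of the cosine coefficient (the function name is mine); it reproduces the three similarities worked out above.

```python
from math import sqrt

def cosine_sim(d: tuple[float, ...], q: tuple[float, ...]) -> float:
    """Cosine coefficient: d.q / (|d| x |q|); 0.0 for a zero vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = sqrt(sum(w * w for w in d)) * sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

q  = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2)   # jam pud recipe
d2 = (0.0, 0.0, 0.9, 0.8, 0.0)   # DoT report
d3 = (0.6, 0.9, 1.0, 0.6, 0.0)   # radio traffic report

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, round(cosine_sim(d, q), 2))   # 0.89, 0.0, 0.51
```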
collecting results
CF: T = {pudding, jam, traffic, lane, treacle}
q = (1.0, 0.6, 0.0, 0.0, 0.8)

Rank   document vector                  document                (sim)
1.     d1 = (0.8, 0.8, 0.0, 0.0, 0.2)   jam pud recipe          (0.89)
2.     d3 = (0.6, 0.9, 1.0, 0.6, 0.0)   radio traffic report    (0.51)
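Continuing the sketch above, the ranked list is just a sort on the similarity scores, dropping zero-scoring documents:

```python
# Ranking (continuing the sketch above): sort by similarity, drop zeros.
scores = [(cosine_sim(d, q), name)
          for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]]
for score, name in sorted(scores, reverse=True):
    if score > 0.0:
        print(name, round(score, 2))   # d1 0.89, then d3 0.51
```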
Discussion: Set theoretic model
The Boolean model is simple and queries have
precise semantics, but it is an 'exact match'
model, and does not rank results
The Boolean model is popular with bibliographic
systems, and available on some search engines
Users find Boolean queries hard to formulate
Attempts to use the set theoretic model as the basis
for a partial-match system include the fuzzy set model
and the extended Boolean model.
Discussion: Vector Model
The vector model is simple and fast, and in
practice leads to 'good' results.
Partial matching leads to ranked output
It is a popular model with search engines
Its underlying assumption of term independence
is not realistic (phrases, collocations, grammar!)
The generalised vector space model relaxes the
assumption that index terms are pairwise
orthogonal (but is more complicated).
questions raised
Where do the index terms come from?
(ALL the words in the source documents?)
What determines the weights?
How well can we expect these systems to
work for practical applications?
How can we improve them?
How do we integrate IR into more traditional
DB management?
Questions to think about
Why is a traditional database unsuited to the
retrieval of unstructured information?
How would you re-express a Boolean query,
eg (A or B or (C and not D)), in disjunctive
normal form?
For the matching coefficient sim(·, ·), show
that 0 ≤ sim(·, ·) ≤ 1, and that sim(a, a) = 1.
Compare and contrast the ‘vector’ and ‘set
theoretic’ models in terms of power of
representation of documents and queries.