Transcript PowerPoint

CS 430: Information Discovery
Lecture 12
Extending the Boolean Model
1
Course Administration
Midterm examination:
Date: Wednesday, 31 October, 7:30 to 8:30 p.m.
Room: TBA
Open book
Assignment 1:
Grades have been sent by email. If you have not
received a grade, please send a message to
[email protected]
2
Problems with the Boolean model
Counter-intuitive results:
Query q = A and B and C and D and E
Document d has terms A, B, C and D, but not E
Intuitively, d is quite a good match for q, but it is rejected by the
Boolean model.
Query q = A or B or C or D or E
Document d1 has terms A, B, C, D and E
Document d2 has term A, but not B, C, D or E
Intuitively, d1 is a much better match than d2, but the Boolean
model ranks them as equal.
3
Problems with the Boolean model (continued)
• The Boolean model has no way to rank documents.
• The Boolean model allows for no uncertainty in assigning index terms to documents.
• The Boolean model has no provision for assigning weights to the importance of query terms.
Boolean is all or nothing.
4
Boolean model as sets
d and q are either in the set A or not in A. There
is no halfway!
[Diagram: query q and document d are each either inside or outside the set A]
5
Extending the Boolean model
Term weighting
• Give weights to terms in documents and/or queries.
• Combine standard Boolean retrieval with vector ranking of results.
Fuzzy sets
• Relax the boundaries of the sets used in Boolean retrieval.
6
Ranking methods in Boolean systems
SIRE (Syracuse Information Retrieval Experiment)
Term weights
• Add term weights to documents. Weights are calculated by the standard method of term frequency * inverse document frequency.
Ranking
• Calculate the results set by standard Boolean methods.
• Rank results by vector distances.
7
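A minimal sketch of this two-stage SIRE-style approach, assuming a toy in-memory collection; the function names (tfidf_weights, boolean_and, rank_by_vector_similarity) and the example documents are illustrative, not part of SIRE itself.

import math
from collections import Counter

# Toy collection: each document is a list of index terms (illustrative only).
docs = {
    "d1": ["boolean", "retrieval", "ranking", "model"],
    "d2": ["boolean", "model", "fuzzy", "sets"],
    "d3": ["vector", "ranking", "model"],
}

def tfidf_weights(docs):
    """Term weight = term frequency * inverse document frequency."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

def boolean_and(docs, query_terms):
    """Standard Boolean AND: keep documents containing every query term."""
    return [d for d, terms in docs.items() if all(q in terms for q in query_terms)]

def rank_by_vector_similarity(result_set, weights, query_terms):
    """Rank the Boolean result set by inner-product similarity to the query."""
    scores = {d: sum(weights[d].get(q, 0.0) for q in query_terms) for d in result_set}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

weights = tfidf_weights(docs)
results = boolean_and(docs, ["boolean", "model"])
print(rank_by_vector_similarity(results, weights, ["boolean", "model"]))

The Boolean step decides which documents are in the results set at all; the vector step supplies the ranking that the pure Boolean model lacks.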
Relevance feedback in SIRE
SIRE (Syracuse Information Retrieval Experiment)
Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded.
• Results set is created by standard Boolean retrieval.
• User selects one document from the results set.
• Other documents in the collection are ranked by vector distance from this document.
8
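A hedged sketch of the feedback step, assuming cosine similarity as the vector distance (the slide does not name the exact measure); the weight vectors and document identifiers below are made up for illustration.

import math

# Toy tf*idf weight vectors (illustrative values, not from the lecture).
weights = {
    "d1": {"boolean": 0.4, "model": 0.1, "ranking": 0.6},
    "d2": {"boolean": 0.4, "fuzzy": 0.9, "sets": 0.9},
    "d3": {"vector": 0.9, "ranking": 0.6, "model": 0.1},
}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_distance_from(selected_id, weights):
    """Rank every other document by vector closeness to the user-selected document."""
    selected = weights[selected_id]
    others = [(d, cosine_similarity(selected, v))
              for d, v in weights.items() if d != selected_id]
    return sorted(others, key=lambda kv: kv[1], reverse=True)

print(rank_by_distance_from("d1", weights))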
Boolean model as fuzzy sets
q is more or less in A. There is a halfway!
[Diagram: query q and document d have partial degrees of membership in the fuzzy set A]
9
Basic concept
• A document has a term weight associated with each index
term. The term weight measures the degree to which that term
characterizes the document.
• Term weights are in the range [0, 1]. (In the standard Boolean
model all weights are either 0 or 1.)
• For a given query, calculate the similarity between the query
and each document in the collection.
• This calculation is needed for every document that has a nonzero weight for any of the terms in the query.
10
MMM: Mixed Min and Max model
Fuzzy set theory
dA is the degree of membership of an element in set A
intersection (and): dAB = min(dA, dB)
union (or): dAB = max(dA, dB)
11
MMM: Mixed Min and Max model
Fuzzy set theory example

Standard set theory:
dA    dB    and dAB    or dAB
1     1     1          1
1     0     0          1
0     1     0          1
0     0     0          0

Fuzzy set theory:
dA    dB    and dAB    or dAB
0.5   0.7   0.5        0.7
0.5   0     0          0.5
0     0.7   0          0.7
0     0     0          0
12
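The fuzzy rows of the table above can be reproduced with two one-line functions; this is only an illustration of the min/max definitions, not code from the lecture.

def fuzzy_and(d_a, d_b):
    # Fuzzy intersection: degree of membership in (A and B).
    return min(d_a, d_b)

def fuzzy_or(d_a, d_b):
    # Fuzzy union: degree of membership in (A or B).
    return max(d_a, d_b)

for d_a, d_b in [(0.5, 0.7), (0.5, 0.0), (0.0, 0.7), (0.0, 0.0)]:
    print(d_a, d_b, fuzzy_and(d_a, d_b), fuzzy_or(d_a, d_b))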
MMM: Mixed Min and Max model
Terms: A1, A2, . . . , An
Document D, with index-term weights: dA1, dA2, . . . , dAn
Qor = (A1 or A2 or . . . or An)
Query-document similarity:
S(Qor, D) = Cor1 * max(dA1, dA2, ..., dAn) + Cor2 * min(dA1, dA2, ..., dAn)
where Cor1 + Cor2 = 1
13
MMM: Mixed Min and Max model
Terms: A1, A2, . . . , An
Document D, with index-term weights: dA1, dA2, . . . , dAn
Qand = (A1 and A2 and . . . and An)
Query-document similarity:
S(Qand, D) = Cand1 * min(dA1, ..., dAn) + Cand2 * max(dA1, ..., dAn)
where Cand1 + Cand2 = 1
14
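A short sketch of both MMM similarity formulas in Python; the default softness coefficients and the example document weights are made-up values (the next slide gives the experimentally useful ranges).

def mmm_or(weights, c_or1=0.8):
    """S(Qor, D) = Cor1 * max(d) + Cor2 * min(d), with Cor1 + Cor2 = 1."""
    c_or2 = 1.0 - c_or1
    return c_or1 * max(weights) + c_or2 * min(weights)

def mmm_and(weights, c_and1=0.7):
    """S(Qand, D) = Cand1 * min(d) + Cand2 * max(d), with Cand1 + Cand2 = 1."""
    c_and2 = 1.0 - c_and1
    return c_and1 * min(weights) + c_and2 * max(weights)

# Document weights for the query terms A1..An, e.g. dA1 = 0.5, dA2 = 0.7, dA3 = 0.2
d = [0.5, 0.7, 0.2]
print(mmm_or(d))   # 0.8 * 0.7 + 0.2 * 0.2 = 0.60
print(mmm_and(d))  # 0.7 * 0.2 + 0.3 * 0.7 = 0.35

With Cor1 = 1 and Cand1 = 1 the formulas reduce to the pure fuzzy-set OR and AND; smaller values soften the operators.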
MMM: Mixed Min and Max model
Experimental values:
Cand1 in range [0.5, 0.8]
Cor1 > 0.2
Computational cost is low. Retrieval performance is much improved.
15
Paice Model
The Paice model is a relative of the MMM model.
The MMM model considers only the maximum and minimum document weights.
The Paice model takes into account all of the document weights.
Computational cost is higher than for MMM. Retrieval performance is improved.
See Frakes, pages 396-397, for more details.
16
P-norm model
Terms: A1, A2, . . . , An
Document D, with term weights: dA1, dA2, . . . , dAn
Query terms are given weights, a1, a2, . . . ,an, which indicate
their relative importance.
Operators have coefficients that indicate their degree of
strictness
Query-document similarity is calculated by considering each
document and query as a point in n space.
See Frakes, pages 397-398, for details.
17
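The slide leaves the actual formulas implicit. As a hedged sketch (not taken from the lecture notes), the commonly cited p-norm similarities score an OR query by the document's weighted distance from the all-zero point and an AND query by its closeness to the all-one point; the strictness coefficient p and the example weights below are illustrative.

def pnorm_or(doc_weights, query_weights, p=2.0):
    """OR similarity: normalized distance of the document from the all-zero point."""
    num = sum((a ** p) * (d ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)

def pnorm_and(doc_weights, query_weights, p=2.0):
    """AND similarity: one minus normalized distance from the all-one point."""
    num = sum((a ** p) * ((1.0 - d) ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1.0 - (num / den) ** (1.0 / p)

# p = 1 behaves like a weighted vector model; as p grows the operators
# approach the strict min/max behaviour of the fuzzy-set model.
d = [0.5, 0.7, 0.2]   # document weights dA1, dA2, dA3 (illustrative)
a = [1.0, 1.0, 0.5]   # query-term weights a1, a2, a3 (illustrative)
print(pnorm_or(d, a, p=2.0), pnorm_and(d, a, p=2.0))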
Test data

         CISI   CACM   INSPEC
P-norm    79    106    210
Paice     77    104    206
MMM       68    109    195

Percentage improvement over standard Boolean model
(average best precision)
Lee and Fox, 1988
18
Reading
E. Fox, S. Betrabet, M. Koushik, and W. Lee, "Extended Boolean Models", Frakes, Chapter 15.
Methods based on fuzzy set concepts
19