CS466 Lecture VI (transcript of slides: cs466-lecture-vi-wsk.ppt)

Today's Topics
• Boolean IR
• Signature files
• Inverted files
• PAT trees
• Suffix arrays
Boolean IR
• History: pre-1970s origins; the dominant industrial model through 1994 (Lexis-Nexis, DIALOG)
• Documents are composed of TERMS (words, stems)
• Results are expressed in set-theoretic terms
[Venn diagrams: documents containing terms A, B, C, with the regions A AND B and (A AND B) OR C shaded]
Boolean Operators
• A AND B
• A OR B
• (A AND B) OR C
• A AND (NOT B)
Proximity Operators (extended ANDs)
• Adjacent AND: "A B", e.g. "Johns Hopkins", "The Who"
• Proximity window (within +/- K words): A w/10 B means A and B within +/- 10 words
• Same sentence: A w/sent B means A and B in the same sentence
Boolean IR (implementation)
• Bit vectors: one bit per term (Term_i) per document
  V1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
  V2 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
  Impractical → very sparse (wastefully big) and costly to compare
• Inverted files (a.k.a. an index)
• PAT trees (a more powerful index)
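A minimal sketch (not from the slides) of Boolean retrieval over bit vectors, using Python integers as bitsets; the toy documents and terms are illustrative:

    # Each term maps to an integer whose bit d is set iff document d contains the term.
    docs = [
        {"baum", "welch"},          # doc 0
        {"bayes", "viterbi"},       # doc 1
        {"baum", "bayes", "hmm"},   # doc 2
    ]

    bitset = {}
    for d, words in enumerate(docs):
        for w in words:
            bitset[w] = bitset.get(w, 0) | (1 << d)

    def matches(bits, n_docs):
        # Expand a result bitset back into document ids.
        return [d for d in range(n_docs) if bits & (1 << d)]

    # (A AND B) OR C  ==  (baum AND bayes) OR viterbi
    result = (bitset["baum"] & bitset["bayes"]) | bitset["viterbi"]
    print(matches(result, len(docs)))   # -> [1, 2]

The sparseness problem is visible even here: each term's bitset carries one bit per document, set or not.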
Problems with Boolean IR
• Does not effectively support relevance ranking of the returned documents
• Base model: expression satisfaction is Boolean; a document either matches the expression or it doesn't
• Extensions to permit ordering, e.g. for (A AND B) OR C:
  – Supermatches (a 5-terms/doc match ranks above a 3-terms/doc match)
  – Partial matches (expression incompletely satisfied: give partial credit)
  – Importance weighting (10·A OR 5·B, the coefficients expressing term weight/importance)
Boolean IR
• Advantages: you can directly control the search
  – Good for precise queries over structured data (e.g. database search or a legal index)
• Disadvantages: you must directly control the search
  – Users should be familiar with the domain and term space (know what to ask for and what to exclude)
  – Poor at relevance ranking
  – Poor at weighted query expansion, user modelling, etc.
Signature Files
• A document's bit vector (e.g. 00000101010010000000...) is mapped by a mapping/hash function f() to a much shorter signature (e.g. 00100100000): superimposed coding using fewer bits
• Problem: several different document bit vectors (i.e. different word sets) get mapped to the same signature
• (Use a stoplist to keep common words from overwhelming the signatures)
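A sketch of superimposed coding, assuming (illustratively) a 64-bit signature and 3 hash bits per word; the hash choice (MD5) is arbitrary:

    import hashlib

    SIG_BITS = 64        # signature width (illustrative)
    BITS_PER_WORD = 3    # bit positions each word turns on

    def word_mask(word):
        # Superimposed coding: hash each word to BITS_PER_WORD bit positions.
        h = hashlib.md5(word.encode()).digest()
        mask = 0
        for i in range(BITS_PER_WORD):
            mask |= 1 << (h[i] % SIG_BITS)
        return mask

    def signature(words):
        # Document signature = OR of its words' masks.
        sig = 0
        for w in words:
            sig |= word_mask(w)
        return sig

    def maybe_contains(sig, word):
        # True if the signature qualifies; this may be a false drop (next slide),
        # so a secondary check against the actual text is still needed.
        m = word_mask(word)
        return (sig & m) == m

    sig = signature(["johns", "hopkins", "university"])
    print(maybe_contains(sig, "hopkins"))   # True
    print(maybe_contains(sig, "viterbi"))   # usually False; True would be a false drop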
False Drop Problem
• On retrieval, all documents/bit vectors mapped to the same signature are retrieved (returned)
• Only a portion of these are relevant
• A secondary validation step is needed to make sure the target words actually match
Prob(false drop) = Prob(signature qualifies & text does not)
Efficiency Problem
Testing for a signature match may require a linear scan through all document signatures.
Vertical Partitioning
• Improves sig1 vs. sig2 comparison speed, but still requires an O(N) linear search over all signatures
• Options:
  – Bit-slice the signatures onto different devices for parallel comparison
  – AND together the matches from each segment
[Diagram: sig1 and sig2 compared segment by segment; the per-segment results are ANDed → result]
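A simplified sketch of the idea: each signature is split into fixed-width segments (a real bit-sliced file would store each bit position contiguously across all signatures, one slice per device), and the per-segment candidate sets are ANDed:

    SEG = 16   # bits per segment (illustrative)

    def segments(sig, n_segs):
        return [(sig >> (SEG * s)) & ((1 << SEG) - 1) for s in range(n_segs)]

    def search(query_sig, sigs, n_segs):
        q = segments(query_sig, n_segs)
        candidates = set(range(len(sigs)))
        for s in range(n_segs):              # each segment could be scanned in parallel
            hits = {i for i in candidates
                    if segments(sigs[i], n_segs)[s] & q[s] == q[s]}
            candidates &= hits               # AND together the per-segment matches
        return candidates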
Horizontal Partitioning
• Goal: avoid sequential scanning of the signature file
[Diagram: input signature → hash function or index into the signature database, yielding specific candidates to try]
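One illustrative bucketing scheme (an assumption, not the slide's): key each signature on its top few bits, and scan only the buckets whose key is compatible with the query's key:

    from collections import defaultdict

    SIG_BITS, PREFIX = 64, 4    # illustrative sizes

    def bucket_key(sig):
        return sig >> (SIG_BITS - PREFIX)

    def build(sigs):
        buckets = defaultdict(list)
        for i, sig in enumerate(sigs):
            buckets[bucket_key(sig)].append((i, sig))
        return buckets

    def search(query_sig, buckets):
        qk = bucket_key(query_sig)
        out = []
        for key, entries in buckets.items():
            if key & qk == qk:   # only buckets that could hold supersets of the query
                out += [i for i, sig in entries if sig & query_sig == query_sig]
        return out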
Inverted Files
• Like an index to a book: a terms index pointing into the documents

  Term      Postings (document IDs)
  Baum      14, 39, 156
  Bayes     39, 45, 156, 290
  Viterbi   41, 86, 156, 217

[Diagram: index entries point into the document collection (docs 14, 15, 16, 17, 37, 38, 39, 40, ...)]
Inverted Files
• Very efficient for single-word queries: just enumerate the documents pointed to by the index
  Cost: O(S_A), the size of term A's postings list
• Efficient for ORs: just enumerate both lists and remove duplicates
  Cost: O(S_A + S_B)
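A minimal sketch of building and querying an inverted file (whitespace tokenization and the toy corpus are illustrative):

    from collections import defaultdict

    def build_index(docs):
        # Inverted file: term -> sorted postings list of document ids.
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():
                index[term].add(doc_id)
        return {t: sorted(ids) for t, ids in index.items()}

    def query_or(index, a, b):
        # OR: enumerate both postings lists and remove duplicates, O(S_A + S_B).
        return sorted(set(index.get(a, [])) | set(index.get(b, [])))

    docs = ["baum welch training", "bayes viterbi decoding", "baum bayes viterbi"]
    index = build_index(docs)
    print(index["bayes"])                      # single-word query: read one postings list
    print(query_or(index, "baum", "viterbi"))  # -> [0, 1, 2]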
AND's using Inverted Files (meet search)
Method 1: linear merge
  Index for Bayes (A):   14, 39, 156, 227, 319
  Index for Viterbi (B): 39, 45, 58, 96, 156, 208
• Begin with two pointers (i, j), one into each postings list (A, B)
• If A[i] = B[j], write A[i] to the output and advance both
• If A[i] < B[j], i++ else j++
Cost: O(S_A + S_B), same as OR, but with smaller output
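The merge as runnable code (a sketch of Method 1, using the slide's postings lists):

    def query_and(a, b):
        # Meet search, Method 1: linear merge of two sorted postings lists.
        # O(S_A + S_B), same as OR but with smaller output.
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    print(query_and([14, 39, 156, 227, 319], [39, 45, 58, 96, 156, 208]))  # [39, 156]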
AND's using Inverted Files
Method 2: useful if one index is much smaller than the other (S_A << S_B)
• For all members A[i] of the smaller index: bsearch(A[i], B), i.e. do a binary search into the larger index
[Diagram: each member of the smaller list (Johns) is binary-searched into the larger list (Hopkins)]
• For A AND B AND C: order the pairwise intersections by smaller list
Cost: S_A · log2(S_B); interpolation search can achieve S_A · log log(S_B)
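Method 2 as runnable code (a sketch using Python's bisect for the binary search; plain bisection gives the log2 bound, not the log log variant):

    import bisect

    def query_and_bsearch(small, large):
        # Meet search, Method 2: binary-search each member of the smaller
        # postings list into the larger one -- O(S_A * log2(S_B)).
        out = []
        for x in small:
            k = bisect.bisect_left(large, x)
            if k < len(large) and large[k] == x:
                out.append(x)
        return out

    # For A AND B AND C: intersect pairwise, smallest lists first, so that
    # intermediate results (the next "small" list) stay as short as possible.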
Proximity Search
• Document-level indexes are not adequate
• Option 1: index every word position in the corpus (position offsets), so the size of the index ≈ the size of the corpus
[Diagram: a positional index into the corpus, with entries for occurrences of "Anthony", "Johns", "Hopkins" across docs 1, 2, 3, ... i]
• Before: match if ptrA = ptrB
• Now: "A B" matches if ptrA = ptrB - 1
• A w/10 B matches if |ptrA - ptrB| ≤ 10
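A sketch of a positional index supporting both operators (the tokenization and corpus are illustrative; the w/K check is the naive pairwise version):

    from collections import defaultdict

    def positional_index(tokens):
        # Map each word to the sorted list of its positions in the corpus.
        index = defaultdict(list)
        for pos, tok in enumerate(tokens):
            index[tok].append(pos)
        return index

    def adjacent(index, a, b):
        # "A B": match if ptrA = ptrB - 1
        pos_b = set(index[b])
        return [p for p in index[a] if p + 1 in pos_b]

    def within(index, a, b, k=10):
        # A w/K B: match if |ptrA - ptrB| <= K  (naive O(S_A * S_B) check)
        return [(pa, pb) for pa in index[a] for pb in index[b] if abs(pa - pb) <= k]

    tokens = "the johns hopkins university is in baltimore".split()
    idx = positional_index(tokens)
    print(adjacent(idx, "johns", "hopkins"))      # [1]: "johns hopkins" is adjacent
    print(within(idx, "johns", "baltimore", 10))  # [(1, 6)]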
Variations 1
• Don't index function words: the index wordlist keeps "Johns" but drops "The"
• For the unindexed words, do a linear match search in the corpus (e.g. for the query "The Johns Hopkins", find "Johns"/"Hopkins" via the index and verify "The" in the corpus text)
→ savings of roughly 50% in index size
→ potential speed improvement, given data-access costs
Variations 2: Multilevel Indexes
• Two levels: a document-level index and a position-level index within documents (e.g. entries for "Johns", "Hopkins", "Anthony" at each level)
→ supports parallel search
→ may have a paging cost advantage
→ cost: a large index, N + dV (N = corpus size; V = vocabulary size; d = avg. documents per vocabulary entry)
Interpolation Search
• Useful when the data are numeric and uniformly distributed
• Example: an index B of 100 cells, with values ranging from 0 … 1000

  cell   value
  17     174
  18     195
  19     211   * (target)
  20     226
  21     230
  22     231
  23     246
  ...
  48     483
  49     496
  50     521
  51     526
  ...
  100    995

• Goal: looking for the value 211
• Binary search: begins by looking at cell 50
• Interpolation search: can we make a better guess for the 1st cell to examine?
Binary Search
Bsearch(low, high, key)
    mid = (high + low) / 2
    if key = A[mid]:
        return mid
    else if key < A[mid]:
        Bsearch(low, mid - 1, key)
    else:
        Bsearch(mid + 1, high, key)
Interpolation Search
Isearch(low, high, key)
    mid = best estimate of position
        = low + (high - low) * (key - A[low]) / (A[high] - A[low])
      (i.e. low plus the expected fraction of the way through the range)
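The pseudocode above as runnable Python (a sketch; the example array matches the slide's value-211 scenario):

    def isearch(A, key):
        # Interpolation search over sorted, roughly uniformly distributed values.
        # Expected ~log log N probes vs. log N for binary search.
        low, high = 0, len(A) - 1
        while low <= high and A[low] <= key <= A[high]:
            if A[low] == A[high]:                 # guard against divide-by-zero
                return low if A[low] == key else -1
            # expected fraction of the way through the range
            frac = (key - A[low]) / (A[high] - A[low])
            mid = low + int((high - low) * frac)
            if A[mid] == key:
                return mid
            elif A[mid] < key:
                low = mid + 1
            else:
                high = mid - 1
        return -1

    A = [174, 195, 211, 226, 230, 231, 246, 483, 496, 521, 526, 995]
    print(isearch(A, 211))   # -> 2, reached in a couple of probes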
Comparison
Typical sequence of cells tested:
• Binary search:        50 → 25 → 12 → 18 → 22 → 21 → 19 (found)
• Interpolation search: 21 → 19 (found) → goes directly to the expected region; expected cost log log(N)
Cost of Computing an Inverted Index
1. Simple: create (word, position) pairs and sort them
   Cost for corpus size N → N log N
2. If N >> memory size:
   1) Tokenize (words → integers)
   2) Create a histogram (term frequencies)
   3) Allocate space in the index
   4) Do a multipass (K-pass) scan through the corpus, on each pass only adding tokens in one of the K bins
K-pass Indexing
[Diagram: an index over words W1–W4 built in K blocks; pass K=1 fills block 1, pass K=2 fills block 2]
Time = K·N, but a big win over N log N on paging
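A sketch of K-pass indexing under one simplifying assumption: the tokens are already integer word ids (step 1) and the vocabulary is split into K contiguous id ranges:

    def kpass_index(tokens, vocab_size, k):
        # Pass p only records tokens whose ids fall in the p-th slice of the
        # vocabulary, so each pass's index block fits in memory.
        # Time ~ K*N, but far fewer page faults than an N log N external sort.
        index = {}
        slice_size = (vocab_size + k - 1) // k
        for p in range(k):
            lo, hi = p * slice_size, (p + 1) * slice_size
            for pos, tok in enumerate(tokens):    # one full pass over the corpus
                if lo <= tok < hi:
                    index.setdefault(tok, []).append(pos)
        return index

    # A histogram (step 2) would let us preallocate exact space per word
    # (step 3) instead of growing Python lists.
    tokens = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
    print(kpass_index(tokens, vocab_size=10, k=2))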
Vector Models for IR
• Gerard Salton, Cornell
  (Salton & Lesk, 68) (Salton, 71) (Salton & McGill, 83)
• SMART System
  Chris Buckley, Cornell → current keeper of the flame
  SMART = "Salton's Magical Automatic Retrieval Tool" (?)
Vector Models for IR
Boolean model:
  Doc V1: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
  Doc V2: 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0
SMART vector model: each Term_i may be a word, a stem, or a special compound
  Doc V1: 1.0 3.5 4.6 0.1 0.0 0.0
  Doc V2: 0.0 0.0 0.0 0.1 4.0 0.0
SMART vectors are composed of real-valued term weights, NOT simply Boolean term-present-or-not.
Example

  Term:     Comput*  C++  Sparc  Compiler  genome  Biolog*  protein  DNA
  Doc V1:   3        5    4      1         0       1        0        0
  Doc V2:   1        0    0      0         5       3        1        4
  Doc V3:   2        8    0      1         0       1        0        0
Issues
• How are the weights determined?
  (simple options: raw frequency; frequency weighted by region, titles, keywords)
• Which terms to include? Stoplists
• Stem or not?
QUERIES and Documents
• Queries and documents share the same vector representation
[Diagram: query vector Q among document vectors D1, D2, D3, separated by angles θ]
• Given a query D_Q → map it to a vector V_Q, then find the document D_i for which sim(V_i, V_Q) is greatest
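A sketch of ranked retrieval in this model (the vectors are illustrative term-weight rows like the earlier example):

    import math

    def cosine(u, v):
        # Cosine similarity; assumes nonzero vectors.
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    def rank(query_vec, doc_vecs):
        # Map the query into the same space, order docs by decreasing similarity.
        return sorted(((cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)),
                      reverse=True)

    doc_vecs = [[3, 5, 4, 1, 0], [1, 0, 0, 5, 3], [2, 8, 0, 1, 0]]
    query = [1, 1, 0, 0, 0]       # the query vector lives in the same term space
    print(rank(query, doc_vecs))  # best-matching document first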
Similarity Functions
• Many other options are available (Dice, Jaccard)
• Cosine similarity is self-normalizing:
    V1: 100  200  300  50
    V2:   1    2    3   0.5
    V3:  10   20   30   5
  V1, V2, and V3 all point in the same direction, so each pair has cosine similarity 1
• Can use arbitrary integer values (they don't need to be probabilities)
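For reference, the standard cosine formula the slide relies on (not spelled out above), in LaTeX:

    \mathrm{sim}(V_i, V_Q) = \cos\theta
      = \frac{V_i \cdot V_Q}{\lVert V_i \rVert \, \lVert V_Q \rVert}
      = \frac{\sum_k v_{i,k}\, v_{Q,k}}
             {\sqrt{\sum_k v_{i,k}^2}\;\sqrt{\sum_k v_{Q,k}^2}}

Since V1 = 100·V2 = 10·V3, dividing by the vector lengths makes each pairwise similarity exactly 1; that is the self-normalizing property the slide points to.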