Efficient Approximate Search on String Collections
Marios Hadjieleftheriou
Chen Li
1
Outline
Part 1:
- Motivation and preliminaries
- Inverted list based algorithms
Part 2:
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
2
Web Search
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together
3
Record Linkage
Table R: informix, microsoft, ...
Table S: infromix, mcrosoft, ...
Record linkage: match entries across the two tables despite such errors, using similarity functions such as:
- Edit distance
- Jaccard
- Cosine
- ...
4
Document Cleaning
Should be “Niels Bohr”
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope
5
Demos
- http://directory.uci.edu/
- http://psearch.ics.uci.edu/advanced/
- http://psearch.ics.uci.edu/
6
State-of-the-art: Oracle 10g and older
Supported by Oracle Text:
CREATE TABLE engdict(word VARCHAR(20), len INT);

Create preferences for text indexing:
begin
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
end;
/

CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS
  ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:
SELECT * FROM engdict
WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters, e.g., Katherine versus Catherine.
7
Microsoft SQL Server [CGG+05]
- Data cleaning tools available in SQL Server 2005
- Part of Integration Services
- Supports fuzzy lookups
- Uses a data flow pipeline of transformations
- Similarity function: tokens with TF/IDF scores
8
Lucene
- Fuzzy queries use Levenshtein distance (edit distance)
- Example: roam~0.8
- Prefix filtering followed by a scan (efficiency?)
9
Problem Formulation
Find strings similar to a given string
Performance is important!
- 10 ms per query: 100 queries per second (QPS)
- 5 ms per query: 200 QPS
Similarity Functions
Similarity function:
- a domain-specific function
- returns a similarity value between two strings
Examples:
- Edit distance
- Hamming distance
- Jaccard similarity
- Soundex
- TF/IDF, BM25, Dice
- See [KSS06] for an excellent survey
11
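To make one of these concrete: a minimal Python sketch of Jaccard similarity computed over 2-gram sets (the function names are illustrative, and the padding-free gram definition is an assumption):

def qgram_set(s, q=2):
    # Set of all overlapping substrings of length q.
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(s1, s2, q=2):
    # Jaccard similarity of the two gram sets: |intersection| / |union|.
    g1, g2 = qgram_set(s1, q), qgram_set(s2, q)
    return len(g1 & g2) / len(g1 | g2) if (g1 | g2) else 1.0

# jaccard("microsoft", "mcrosoft") ~ 0.67  (6 shared 2-grams out of 9 distinct ones)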
Edit Distance
A widely used metric to define string similarity
 Ed(s1,s2) = minimum # of operations (insertion,
deletion, substitution) to change s1 to s2
 Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
12
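Since edit distance is used throughout the rest of the tutorial, here is a minimal Python sketch of the standard dynamic-programming computation (the function name is illustrative):

def edit_distance(s1, s2):
    # Classic dynamic program: row i holds ed(s1[:i], s2[:j]) for all j.
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

# edit_distance("Tom Hanks", "Ton Hank") -> 2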
Outline
- Motivation and preliminaries
- Inverted list based algorithms
  - List-merging algorithms
  - VGRAM
  - List-compression techniques
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
13
“q-grams” of strings
universal → 2-grams: un, ni, iv, ve, er, rs, sa, al
14
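A minimal sketch of q-gram extraction in Python, assuming no prefix/suffix padding characters (some q-gram formulations add them); the function is reused in the later sketches:

def qgrams(s, q=2):
    # All overlapping substrings of length q: a string of length L has L - q + 1 of them.
    return [s[i:i + q] for i in range(len(s) - q + 1)]

# qgrams("universal") -> ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']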
Edit operation’s effect on grams
universal
Fixed length q: k operations can affect at most k * q grams
15
q-gram inverted lists
id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram inverted lists:
at → 4
ch → 0, 2
ck → 1, 3
ic → 0, 1, 2, 4
ri → 0
st → 1, 2, 3, 4
ta → 4
ti → 1, 2, 4
tu → 3
uc → 3
16
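A sketch of how such an index could be built, reusing the qgrams function from the earlier sketch; names are illustrative:

from collections import defaultdict

def build_inverted_index(strings, q=2):
    # Map each q-gram to the sorted list of ids of the strings that contain it.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return {g: sorted(ids) for g, ids in index.items()}

# build_inverted_index(["rich", "stick", "stich", "stuck", "static"])["ic"] -> [0, 1, 2, 4]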
Searching using inverted lists
Query: "shtick", ED(shtick, ?) ≤ 1
2-grams of the query: sh, ht, ti, ic, ck
A matching string must share at least 3 grams with the query (count filter).
Lists probed (from the index above):
sh → (empty)
ht → (empty)
ti → 1, 2, 4
ic → 0, 1, 2, 4
ck → 1, 3
Only string 1 (stick) reaches 3 common grams.
17
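A sketch of the whole count-filter search on this slide: count candidate ids on the lists of the query's grams, then verify the survivors with edit distance. It reuses qgrams, build_inverted_index, and edit_distance from the earlier sketches and assumes the threshold stays positive:

from collections import defaultdict

def approx_search(query, strings, index, k=1, q=2):
    # Count filter: if ed(query, s) <= k, the two strings share at least
    # (len(query) - q + 1) - k*q grams. Count candidates on the inverted
    # lists of the query's grams, then verify with the real edit distance.
    T = (len(query) - q + 1) - k * q
    counts = defaultdict(int)
    for g in qgrams(query, q):
        for sid in index.get(g, []):
            counts[sid] += 1
    return [strings[sid] for sid, c in counts.items()
            if c >= T and edit_distance(query, strings[sid]) <= k]

# approx_search("shtick", strings, index) -> ["stick"]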
T-occurrence Problem
Given a set of inverted lists, each sorted in ascending order, merge them and find the elements that occur on at least T lists.
18
Example
T = 4
[Figure: five sorted inverted lists over ids such as 1, 3, 5, 7, 10, 13, 15; id 13 is the only one that appears on at least T = 4 lists.]
Result: 13
19
List-Merging Algorithms
- HeapMerger
- MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
20
Heap-based Algorithm
Push the frontier of each inverted list onto a min-heap and count the number of occurrences of each element as equal values are popped together.
21
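A sketch of the heap-based merger (names illustrative):

import heapq

def heap_merge_count(lists, T):
    # Merge all sorted lists with a min-heap of (value, list index, position),
    # counting how many lists each value appears on.
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        value, count = heap[0][0], 0
        while heap and heap[0][0] == value:     # pop all frontiers equal to the minimum
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):         # advance that list by one element
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            results.append(value)
    return results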
MergeOpt Algorithm [SK04]
Set aside the T-1 longest lists ("long lists"). Merge the remaining short lists to generate candidates, then verify each candidate on the long lists with binary search.
22
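A sketch of MergeOpt, assuming 1 ≤ T - 1 < number of lists; names are illustrative:

import bisect
from collections import defaultdict

def merge_opt(lists, T):
    # Any id on >= T lists must appear on at least one short list,
    # because it can be on at most T-1 long lists.
    lists = sorted(lists, key=len)              # shortest first
    split = max(len(lists) - (T - 1), 0)
    short, longs = lists[:split], lists[split:]
    counts = defaultdict(int)
    for lst in short:                           # merge (here: plain counting) of short lists
        for x in lst:
            counts[x] += 1
    results = []
    for x, c in counts.items():                 # binary-search each candidate on the long lists
        for lst in longs:
            i = bisect.bisect_left(lst, x)
            if i < len(lst) and lst[i] == x:
                c += 1
        if c >= T:
            results.append(x)
    return sorted(results)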
Example of MergeOpt
[Figure: the five example lists, with the 3 longest treated as long lists and the 2 shortest as short lists.]
Count threshold T ≥ 4
23
ScanCount
Keep an array of counters, one per string id. Scan each inverted list and increment the counter of every id encountered; report the ids whose counters reach the threshold (a short sketch follows below).
[Figure: the five example lists and the counter array; with count threshold T ≥ 4, only id 13 qualifies.]
24
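A sketch of ScanCount; num_strings is assumed to be the size of the string collection:

def scan_count(lists, num_strings, T):
    # One counter per string id: scan every list once, increment the counter
    # of each id seen, and report an id the moment its count reaches T.
    counts = [0] * num_strings
    results = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:
                results.append(sid)
    return results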
List-Merging Algorithms
- HeapMerger
- MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
25
MergeSkip algorithm [BK02, LLL08]
Maintain a min-heap over the list frontiers. Whenever the smallest value cannot reach T occurrences, pop the T-1 smallest frontiers and let their lists jump (skip forward) to the first element greater than or equal to the current top of the heap.
26
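A minimal, untuned sketch of MergeSkip along these lines:

import bisect, heapq

def merge_skip(lists, T):
    # Min-heap over list frontiers (value, list index, position).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:      # pop all frontiers equal to the minimum
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:                     # enough occurrences: report and advance
            results.append(value)
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:                                    # pop T-1 smallest frontiers in total ...
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap))
            if not heap:
                break
            target = heap[0][0]                  # ... and jump their lists to >= heap top
            for _, i, pos in popped:
                j = bisect.bisect_left(lists[i], target, pos)
                if j < len(lists[i]):
                    heapq.heappush(heap, (lists[i][j], i, j))
    return results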
Example of MergeSkip
[Figure: the five example lists; the T-1 smallest frontiers on the min-heap (1, 5, 13) are popped and their lists jump ahead by binary search to the new top of the heap.]
Count threshold T ≥ 4
27
DivideSkip Algorithm [LLL08]
Split the lists into long lists and short lists: run MergeSkip on the short lists, then verify the resulting candidates on the long lists with binary search.
28
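A sketch of DivideSkip built on the merge_skip sketch above; num_long is the number of long lists (chosen, e.g., with the formula on the next slide), and for simplicity the final count re-checks every list by binary search rather than only the long ones:

import bisect

def divide_skip(lists, T, num_long):
    # A qualifying id appears on >= T lists, at most num_long of which are long,
    # so it must be on >= T - num_long short lists: MergeSkip with that reduced
    # threshold generates the candidates.
    lists = sorted(lists, key=len)
    split = max(len(lists) - num_long, 0)
    short = lists[:split]
    results = []
    for x in merge_skip(short, max(T - num_long, 1)):
        count = 0
        for lst in lists:                        # complete the count by binary search
            i = bisect.bisect_left(lst, x)
            if i < len(lst) and lst[i] == x:
                count += 1
        if count >= T:
            results.append(x)
    return results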
How many lists are treated as long lists?
Long lists: candidate lookup by binary search. Short lists: merge.
A good balance in the tradeoff:
# of long lists = T / (μ log M + 1)
29
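A tiny helper for this heuristic; the slide does not spell out the logarithm base or the exact meanings of μ and M, so treating M as the length of the longest inverted list, μ as a tunable machine-dependent constant, and base-2 logarithms are all assumptions here:

import math

def num_long_lists(T, M, mu=0.01):
    # Heuristic from the slide: L = T / (mu * log M + 1).
    # M: length of the longest inverted list (assumed), mu: tuning constant (assumed).
    return max(0, int(T / (mu * math.log2(M) + 1)))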
Length Filtering
If ed(s, t) ≤ 2, then |len(s) - len(t)| ≤ 2. A string s of length 10 and a string t of length 19 can therefore be pruned by length alone.
30
Positional Filtering
Ed(s, t) ≤ 2
s contains the positional gram (ab, 1) while t contains (ab, 12); since the positions differ by more than 2, this shared gram cannot count toward the match.
31
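Minimal sketches of the two filters (names illustrative):

def length_filter(s, t, k):
    # If ed(s, t) <= k, the lengths can differ by at most k.
    return abs(len(s) - len(t)) <= k

def positional_filter(pos_s, pos_t, k):
    # A shared gram occurring at position pos_s in one string and pos_t in the
    # other can only be counted as a match if the positions differ by at most k.
    return abs(pos_s - pos_t) <= k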
Filter tree [LLL08]
The filter tree organizes the inverted lists hierarchically:
- Length level: children of the root partition strings by length (1, 2, 3, ..., n)
- Gram level: each length node fans out by gram (aa, ..., ab, ..., zy, zz)
- Position level: each gram node fans out by position (1, 2, ..., m)
- Leaves store the inverted lists (e.g., 5, 12, 17, 28, 44)
32
Surprising experimental results (DBLP)
DivideSkip running time per query:
  No filter:         2.23 ms
  Length:            0.76 ms
  Length + Position: 1.96 ms
Adding the position filter can increase running time.
33
Filters fragment inverted lists
Applying filters splits each inverted list into many sublists, and each group of sublists must be merged separately.
Saving: reduced list sizes.
Cost: tree traversal and more merge operations.
34
Outline
- Motivation and preliminaries
- Inverted list based algorithms
  - List-merging algorithms
  - VGRAM [LWY07, YWL08]
  - List-compression techniques
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
35
2-grams -> 3-grams?
Query: "shtick", ED(shtick, ?) ≤ 1
3-grams of the query: sht, hti, tic, ick
A matching string now only needs to share at least 1 gram with the query.

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

3-gram inverted lists:
ati → 4
ich → 0, 2
ick → 1
ric → 0
sta → 4
sti → 1, 2
stu → 3
tat → 4
tic → 1, 2, 4
tuc → 3
uck → 3
36
Observation 1: dilemma of choosing “q”
Increasing "q" causes:
- Longer grams → shorter inverted lists
- A smaller number of common grams between similar strings
[Figure: the 2-gram inverted lists of slide 16, shown again for comparison with the 3-gram lists above.]
37
Observation 2: skew distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K occurrences), tions, ystem, catio
38
VGRAM: Main idea
Grams with variable lengths (between qmin and qmax), e.g.:
- zebra → ze (frequency 123)
- corrasion → co (5213), cor (859), corr (171)
Advantages:
- Reduces index size
- Reduces running time
- Adoptable by many algorithms
39
Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and gram-set similarity?
- Adopting VGRAM in existing algorithms?
40
Challenge 1: String → variable-length grams?
Fixed-length 2-grams:
universal → un, ni, iv, ve, er, rs, sa, al
Variable-length grams, using a [2,4]-gram dictionary {ni, ivr, sal, uni, vers}:
universal → grams of length 2 to 4 drawn from the dictionary
41
Representing gram dictionary as a trie
ni
ivr
sal
uni
vers
42
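A sketch of this trie representation together with a longest-match lookup; a simplified illustration, not the exact VGEN algorithm of [LWY07]:

def build_trie(grams):
    # Nested dicts; the key '$' marks the end of a dictionary gram.
    root = {}
    for g in grams:
        node = root
        for ch in g:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def longest_gram_at(trie, s, i):
    # Longest dictionary gram starting at position i of s (None if there is none).
    node, best = trie, None
    for j in range(i, len(s)):
        if s[j] not in node:
            break
        node = node[s[j]]
        if '$' in node:
            best = s[i:j + 1]
    return best

# trie = build_trie(["ni", "ivr", "sal", "uni", "vers"])
# longest_gram_at(trie, "universal", 0) -> "uni"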
Challenge 2: Constructing gram dictionary
Step 1: Collecting frequencies of grams with length in [qmin, qmax]
st   → 0, 1, 3
sti  → 0, 1
stu  → 3
stic → 0, 1
stuc → 3
Gram trie with frequencies
43
Step 2: selecting grams

Pruning trie using a frequency threshold F (e.g., 2)
44
Step 2: selecting grams (cont)
Threshold T = 2
45
Final gram dictionary
A cost-based approach to choosing a gram dictionary [YWL08]
46
Challenge 3: Edit operation’s effect on grams
universal
Fixed length q: k operations can affect at most k * q grams
47
Deletion affects variable-length grams
A deletion at position i can only affect grams inside the window [i - qmax + 1, i + qmax - 1]; grams entirely outside this window are not affected.
48
Grams affected by a deletion
Which grams inside the window [i - qmax + 1, i + qmax - 1] around the deletion are actually affected?
Example: a deletion in "universal" under the [2,4]-gram dictionary {ni, ivr, sal, uni, vers}.
49
Grams affected by a deletion (cont)
To decide which grams around position i are affected, consult the trie of dictionary grams and a trie of the reversed grams.
50
# of grams affected by each operation
Deletion/substitution
Insertion
0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
_u_n_i_v_e_r_s_a_l_
51
Max # of grams affected by k operations
Vector of s = <2, 4, 6, 8, 9>: with 2 edit operations, at most 4 grams of s can be affected.
- Called the NAG vector (# of affected grams)
- Precomputed and stored
- Dynamic programming to compute tight bounds [YWL08]
52
Summary of VGRAM index
53
Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces:
- String s → its grams
- Strings s1, s2 with ed(s1, s2) ≤ k → minimum # of their common grams
54
Lower bound on # of common grams
Fixed length q:
If ed(s1, s2) ≤ k, then their # of common grams ≥ (|s1| - q + 1) - k * q
Variable lengths: # of common grams ≥ # of grams of s1 - NAG(s1, k)
55
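Small sketches of both bounds, assuming the NAG vector is stored so that its k-th entry (1-based) equals NAG(s, k), as in the example vector <2, 4, 6, 8, 9>:

def lower_bound_fixed(len_s1, q, k):
    # Fixed-length q-grams: common grams >= (|s1| - q + 1) - k * q.
    return (len_s1 - q + 1) - k * q

def lower_bound_vgram(num_grams_s1, nag_vector, k):
    # VGRAM: common grams >= # of grams of s1 - NAG(s1, k).
    return num_grams_s1 - nag_vector[k - 1]

# lower_bound_fixed(len("shtick"), 2, 1) -> 3
# lower_bound_vgram(5, [2, 4, 6, 8, 9], 2) -> 1   (hypothetical string with 5 grams)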
Example: algorithm using inverted lists
Query: "shtick", ED(shtick, ?) ≤ 1
Fixed 2-grams: query grams sh, ht, ti, ic, ck; lower bound on common grams = 3
[2,4]-grams (VGRAM): query grams sh, ht, tick; lower bound on common grams = 1
[Figure: inverted lists for rich, stick, stich, stuck, static under both gram schemes; the VGRAM lists probed (e.g., tick) are shorter than the corresponding fixed-length lists.]
56
Outline
- Motivation and preliminaries
- Inverted list based algorithms
  - List-merging algorithms
  - VGRAM
  - List-compression techniques [BJL+09]
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
57
Motivation
- Inverted index: very large
- IR approach: lossless compression (delta encoding, mostly disk-based)
- Decompression overhead
- Difficult to tune the compression ratio
58
Solution
- Two lossy compression techniques
- Queries become faster
- Flexibility to choose the space/time tradeoff
59
Approach 1: Discarding Lists
[Figure: 2-gram inverted lists (in, tf, vi, ir, ef, rv, ne, un, ...); the lists of some grams (e.g., tf, vi) are discarded entirely, while the remaining grams keep their lists.]
60
Approach 2: Combining Lists
[Figure: the same 2-gram inverted lists; the lists of selected groups of grams are combined into a single shared list each.]
61
Technical challenges
- Effect on the list-merging algorithms
- How to choose which lists to discard or combine
62
Outline (end of part 1)
Part 1:
- Motivation and preliminaries
- Inverted list based algorithms
  - List-merging algorithms
  - VGRAM [LWY07, YWL08]
  - List-compression techniques
Part 2:
- Gram signature algorithms
- Length normalized algorithms
- Selectivity estimation
- Conclusion and future directions
63