Efficient Merging and Filtering Algorithms for Approximate String Searches
Jiaheng Lu,
University of California, Irvine
Joint work with Chen Li, Yiming Lu
1
Example: a movie database
Find movies starring Schwarrzenger.
Star              Title            Year   Genre
Keanu Reeves      The Matrix       1999   Sci-Fi
Samuel Jackson    Iron Man         2008   Sci-Fi
Schwarzenegger    The Terminator   1984   Sci-Fi
Samuel Jackson    The Man          2006   Crime
2
In general: Gap between Queries and Data
Errors in the query:
  The user doesn’t remember a string exactly
  The user unintentionally types a wrong string
Query: Schwarrzenger
Data:  Schwarzenegger
3
Data may not be clean
Errors in the database:
  Data is often not clean by itself, especially in data integration and cleansing
Relation R (Star)    Relation S (Star)
Keanu Reeves         Keanu Reeves
Samuel Jackson       Samuel L. Jackson
Schwarzenegger       Schwarzenegger
Samuel Jackson       Samuel L. Jackson
…                    …
4
Queries may include errors
5
Problem definition: approximate string searches
Input: a query q and a collection of strings s
  (e.g., Star: Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson, …)
Output: strings s that satisfy Sim(q,s) ≤ δ
6
Example Similarity Function: Edit Distance
A widely used metric to define string similarity
Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
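The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the talk; the function name `edit_distance` is chosen here for the example.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning the first i characters of s1 into the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as in the example above
```

The table is kept one row at a time, so memory is proportional to the length of s2.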
7
Example: approximate string searches
Query q: Tom Hanks
Collection of strings s (Star): Tom Hank, Thomas Hanks, Ton Hank, Tom J. Hanks, …
Output: strings s that satisfy ed(q,s) ≤ 2
8
Outline
• Problem motivation
• Preliminary
    Grams
    Inverted lists
• Merge algorithms
• Filtering technique
• Conclusion
9
String → Grams
q-grams
For example: the 2-grams of "universal" are
(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
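A short sketch of q-gram extraction, assuming plain overlapping substrings with no padding characters at the string boundaries (the slide's example suggests this, but it is an assumption). The name `qgrams` is illustrative.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping q-grams of s, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```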
10
Inverted lists
Convert strings to gram inverted lists
id   string
0    rich
1    stick
2    stich
3    stuck
4    static
2-gram   inverted list (string ids)
at       4
ch       0, 2
ck       1, 3
ic       0, 1, 2, 4
ri       0
st       1, 2, 3, 4
ta       4
ti       1, 2, 4
tu       3
uc       3
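The conversion from strings to gram inverted lists can be sketched as follows. This is an illustrative sketch: `build_inverted_lists` is a name chosen here, and recording each string id at most once per gram is an assumption made for simplicity.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the sorted list of ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        grams = {s[i:i + q] for i in range(len(s) - q + 1)}  # distinct grams of s
        for g in grams:
            index[g].append(sid)  # ids are appended in increasing order
    return dict(index)


index = build_inverted_lists(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are assigned in scan order, every list comes out sorted, which the merge algorithms rely on.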
11
Main Example
Query: stick, ed(s,q) ≤ 1
Grams: (st,ti,ic,ck)
Data:
id   string
0    rich
1    stick
2    stich
3    stuck
4    static
Inverted lists of the query grams:
st   1, 2, 3, 4
ti   1, 2, 4
ic   0, 1, 2, 4
ck   1, 3
Merge (count >= 2): candidate string ids {1,2,3,4}
Double check for the real edit distance: final answers {1,2,3}
Performance bottleneck: the merge step!
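The whole pipeline of this example (grams, inverted lists, merge with a count threshold, verification) can be sketched end to end. The threshold used below, (number of query grams) minus k·q, is the well-known count bound; it gives 4 - 1·2 = 2 here, matching "count >= 2". The function names are illustrative, the merge is a naive counter rather than one of the algorithms in the talk, and the sketch assumes the threshold is at least 1.

```python
from collections import Counter


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def search(strings, query, k, q=2):
    # Build the gram inverted lists (one id per string per gram).
    index = {}
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index.setdefault(g, []).append(sid)
    # Merge: count how many query-gram lists each string id appears on.
    grams = qgrams(query, q)
    threshold = len(grams) - k * q  # assumed to be >= 1 in this sketch
    counts = Counter(sid for g in grams for sid in index.get(g, []))
    candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    # Verify: compute the real edit distance for each candidate.
    answers = [sid for sid in candidates if edit_distance(strings[sid], query) <= k]
    return candidates, answers


data = ["rich", "stick", "stich", "stuck", "static"]
print(search(data, "stick", k=1))  # ([1, 2, 3, 4], [1, 2, 3])
```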
12
Sub-problem definition:
Given multiple inverted lists of integers sorted in increasing order
and a threshold T, find all values that occur at least T times.
13
Example

Count threshold: 4
[Figure: sorted inverted lists over the values 1, 3, 5, 7, 10, 13, 15]
Result: 13
14
Outline
• Problem motivation
• Preliminary
• Merge algorithms
    Two previous algorithms
    Our proposed three algorithms
• Filtering technique
• Conclusion
15
Five Merge Algorithms
Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
New: ScanCount, MergeSkip, DivideSkip
16
Two previous algorithms (1)
Heap-based Algorithm
Push the elements of the lists to a min-heap
Count the # of occurrences of each element using the heap
17
Example of HeapMerger
[Sarawagi et al 2004]
minHeap: 1, 5, 10, 13, 15
[Figure: sorted inverted lists over the values 1, 3, 5, 7, 10, 13, 15]
Count threshold ≥ 4
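A minimal sketch of the heap-based merge, written from the description above rather than taken from [Sarawagi et al 2004]. The name `heap_merge` and the `(value, list index, position)` heap entries are choices made for this illustration. The input is the set of query-gram lists from the "stick" example.

```python
import heapq


def heap_merge(lists, T):
    """Return the values that occur at least T times across the sorted lists."""
    # One heap entry per list: (current value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every entry equal to the smallest value and advance those lists.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(heap_merge(lists, 2))  # [1, 2, 3, 4]
```

Every element of every list passes through the heap, which is what the later algorithms try to avoid.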
18
Two previous algorithms (2)
MergeOpt Algorithm
Long lists (the T-1 longest): binary search
Short lists (the rest): merge with a heap
20
Example of MergeOpt
[Sarawagi et al 2004]
Min-heap over the short lists
[Figure: sorted inverted lists over the values 1, 3, 5, 7, 10, 13, 15]
Long Lists: 3
Short Lists: 2
Count threshold ≥ 4
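The long/short split can be sketched as below. This is an illustration of the idea, not the implementation from [Sarawagi et al 2004]: the name `merge_opt` is chosen here, and the short lists are merged with `heapq.merge` from the standard library. A value that reaches the threshold must appear on at least one short list, because there are only T-1 long lists.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def merge_opt(lists, T):
    """Heap-merge the short lists, then binary-search candidates in the T-1 longest."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    result = []
    # heapq.merge yields the short-list elements in sorted order.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)  # occurrences on the short lists
        for lst in long_lists:
            pos = bisect_left(lst, value)  # binary search in a long list
            if pos < len(lst) and lst[pos] == value:
                count += 1
        if count >= T:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(merge_opt(lists, 2))  # [1, 2, 3, 4]
```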
21
Can we run faster?
22
Our new algorithms (1)
ScanCount Algorithm
Use an array to record # of occurrences of each element
24
ScanCount Example
Element:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Count:     1  0  1  0  2  0  1  0  0  2  0  0  4  0  1
Result: 13
[Figure: sorted inverted lists over the values 1, 3, 5, 7, 10, 13, 15]
Count threshold ≥ 4
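A minimal sketch of the counting-array idea. The name `scan_count` and the `max_id` parameter (the largest string id, used to size the array) are assumptions made for this illustration.

```python
def scan_count(lists, T, max_id):
    """Count occurrences of every id in one array; report ids reaching T."""
    counts = [0] * (max_id + 1)
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it first reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(scan_count(lists, 2, max_id=4))  # [1, 2, 3, 4]
```

No heap is needed and the lists do not even have to be sorted, at the price of an array as large as the id space.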
25
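Our new algorithms (2)
MergeSkip algorithm
Min-heap over the current element of each list
Pop T-1 elements from the heap
Jump: the T-1 popped lists skip ahead to the new top of the heap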
27
Example of MergeSkip
Step 1: minHeap holds 1, 5, 10, 13, 15
Step 2: Pop 1, 5, 10 (T-1 = 3 elements); minHeap holds 13, 15
Step 3: Jump: each popped list skips to its first element ≥ 13
Step 4: minHeap holds 13, 13, 13, 13, 15
Result: 13
Count threshold ≥ 4
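The pop-and-jump steps can be sketched as follows. This is a reconstruction from the slides, not the authors' code: the name `merge_skip` and the heap entry layout are assumptions, and the jump is done with a binary search (`bisect_left`) from the list's current position.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return values occurring at least T times, skipping elements that cannot qualify."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:  # pop all entries equal to the top
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(value)
            for _, i, pos in popped:  # advance each popped list by one element
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Too few occurrences: pop more entries until T-1 lists are off the heap.
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break  # fewer than T lists remain, so nothing else can reach T
        target = heap[0][0]
        # Jump: any value below target lies on at most T-1 lists and can be skipped.
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(merge_skip(lists, 4))  # [1]
```

The search starts at the popped entry's own position because an extra popped entry may already equal the target and must not be skipped.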
32
Our new algorithms (3)
DivideSkip Algorithm
Long lists (dynamic size): binary search
Short lists: MergeSkip
34
Size of long lists
How many lists are treated as long lists?
Cost: binary search on the long lists + MergeSkip on the short lists
36
Decide L value
A good balance in the tradeoff:
# of long lists: L = T / (μ log M + 1)
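A sketch of how the split and the formula for L could fit together. Several points are assumptions of this illustration, not statements from the slides: M is taken to be the length of the longest list, the logarithm is base 2, the default `mu` is an arbitrary placeholder, and L is rounded down and capped at T-1 so that the short lists keep a threshold of at least 1. The names `merge_skip_counts`, `contains` and `divide_skip` are hypothetical.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant returning (value, count) pairs with count >= T."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            out.append((value, len(popped)))
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break
        target = heap[0][0]
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return out


def contains(lst, value):
    pos = bisect_left(lst, value)  # binary search in a sorted list
    return pos < len(lst) and lst[pos] == value


def divide_skip(lists, T, mu=0.01):
    by_len = sorted((lst for lst in lists if lst), key=len, reverse=True)
    if not by_len:
        return []
    M = len(by_len[0])  # assumption: M is the length of the longest list
    L = int(T / (mu * math.log2(M) + 1))
    L = max(0, min(L, T - 1, len(by_len)))  # keep the short-list threshold >= 1
    long_lists, short_lists = by_len[:L], by_len[L:]
    result = []
    # A qualifying value needs at least T - L occurrences on the short lists.
    for value, count in merge_skip_counts(short_lists, T - L):
        count += sum(contains(lst, value) for lst in long_lists)
        if count >= T:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(divide_skip(lists, 2))  # [1, 2, 3, 4]
```

A larger `mu` yields fewer long lists, shifting work from binary search to MergeSkip.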
37
Empirical verification
Our formula for “L” achieves the best
result among the options tested.
38
Experimental data sets
Three real data sets with various string lengths and data sizes:
DBLP data
IMDB data
Google Web corpus
39
Performance (DBLP data)
[Figure: running time per query with various algorithms]
DivideSkip is the best one
40
# of elements read (DBLP data)
[Figure: number of elements read by each algorithm]
DivideSkip is the best one: it skips reading the most elements
41
Outline
• Problem motivation
• Preliminary
• Merge algorithms
• Filtering technique
    Length, positional filter [Gravano et al. VLDB 2001]
    Filter tree
• Conclusion and future work
42
Length Filtering
s: Length: 10
t: Length: 19
Ed(s,t) ≤ 2 is impossible: pruned by length only!
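The length filter reduces to a single comparison, since each edit operation changes the length by at most one. A minimal sketch; the function name is illustrative.

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """Necessary condition for ed(s, t) <= k: the lengths differ by at most k."""
    return abs(len(s) - len(t)) <= k


# Lengths 10 and 19 differ by 9, so the pair is pruned for k = 2.
print(passes_length_filter("a" * 10, "b" * 19, 2))  # False
```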
43
Positional Filtering
Positional Gram
For example: string abcd:
{(ab,1),(bc,2),(cd,3)}
With Ed(s,t) ≤ 2, gram (ab,1) in s cannot match gram (ab,12) in t:
the positions differ by more than 2
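A small sketch of positional grams and the position check, using 1-based positions as in the abcd example. The function names are chosen for this illustration.

```python
def positional_qgrams(s, q=2):
    """Return (gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_can_match(g1, g2, k):
    """Two positional grams can correspond only if equal and at most k positions apart."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


print(positional_qgrams("abcd"))                   # [('ab', 1), ('bc', 2), ('cd', 3)]
print(grams_can_match(("ab", 1), ("ab", 12), 2))  # False
```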
44
Filter tree
root
Length level: 1, 2, 3, …, n
Gram level: aa, ab, …, zy, zz
Position level: 1, 2, …, m
Inverted list: 5, 12, 17, 28, 44
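One way to picture the tree is as nested dictionaries keyed by length, then gram, then position, with an inverted list at each leaf. This is a sketch of the structure only, assuming 1-based gram positions; `build_filter_tree` and `lists_for_gram` are hypothetical names, not the authors' API.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> sorted list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)
    return tree


def lists_for_gram(tree, gram, pos, query_len, k):
    """Collect the inverted lists that survive the length and positional filters."""
    lists = []
    for length in range(query_len - k, query_len + k + 1):
        if length not in tree:  # membership test does not create empty nodes
            continue
        positions = tree[length].get(gram, {})
        for p in range(pos - k, pos + k + 1):
            if p in positions:
                lists.append(positions[p])
    return lists


tree = build_filter_tree(["rich", "stick", "stich", "stuck", "static"])
# Gram "ic" sits at position 3 of the query "stick" (length 5); allow k = 1.
print(lists_for_gram(tree, "ic", pos=3, query_len=5, k=1))  # [[0], [1, 2]]
```

In this run the occurrence of "ic" in "static" (position 5) is filtered out by position, so id 4 never enters the merge for that gram.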
45
Surprising experimental results (DBLP)
             No filter   Length   Length+Pos
Heap         115.42      11.98    3.64
MergeOpt     14.22       1.40     6.78
ScanCount    30.91       2.68     2.14
MergeSkip    10.12       1.09     2.65
DivideSkip   2.23        0.76     1.96
Wisely use filters, more filters may be bad!
Conclusion
• Three new merge algorithms
    We run faster
• Surprising experimental results
    Wisely use filters, more filters may be bad!
Thank you!
48
Backup: related work
Approximate string matching [Navarro 2001]
Fuzzy lookup in
Varied length Grams [Li et al 2007]
49
Reference
1. [Arasu 2006] A. Arasu, V. Ganti and R. Kaushik, “Efficient Exact Set-similarity Joins” in VLDB 2006
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning” in SIGMOD 2003
3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava, “Approximate string joins in a database (almost) for free” in VLDB 2001
4. [Li 2007] C. Li, B. Wang and X. Yang, “VGRAM: Improving performance of approximate queries on string collections using variable-length grams” in VLDB 2007
5. [Navarro 2001] G. Navarro, “A guided tour to approximate string matching” in ACM Computing Surveys 2001
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicates” in ACM SIGMOD 2004
51