Efficient Approximate Search on String Collections

Part I
Marios Hadjieleftheriou
Chen Li
DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
Try their names (good luck!)
- UCSD: Yannis Papakonstantinou
- Case Western: Meral Ozsoyoglu
- AT&T Research: Marios Hadjieleftheriou
Better system?
http://dblp.ics.uci.edu/authors/
People Search at UC Irvine
http://psearch.ics.uci.edu/
Web Search
Actual queries gathered by Google
http://www.google.com/jobs/britney.html
- Errors in queries
- Errors in data
- Bring query and meaningful results closer together
Data Cleaning
R: informix, microsoft, …
S: infromix, …, mcrosoft, …
Problem Formulation
Find strings similar to a given string: dist(Q,D) <= δ
Example: find strings similar to “hadjeleftheriou”
Performance is important!
- 10 ms per query: 100 queries per second (QPS)
- 5 ms per query: 200 QPS
Outline









Motivation
Preliminaries
Trie-based approach
Gram-based algorithms
Sketch-based algorithms
Compression
Selectivity estimation
Transformations/Synonyms
Conclusion
Part I
Part II
Next…
Preliminaries
Similarity Functions


Similar to:
- a domain-specific function
- returns a similarity value between two strings
Examples:
- Edit distance
- Hamming distance
- Jaccard similarity
- Soundex
- TF/IDF, BM25, DICE
- See [KSS06] for an excellent survey
Edit Distance
A widely used metric to define string similarity
- ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
- Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2 (substitute m -> n, delete s)
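The definition above can be checked with the standard dynamic-programming recurrence. This is a minimal sketch assuming unit cost for each insertion, deletion and substitution; the function name edit_distance is chosen here for illustration and is not from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to change s1 into s2 (unit cost per operation)."""
    # prev[j] = distance between the first i-1 chars of s1 and first j chars of s2
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The row-by-row layout keeps memory at O(|s2|) instead of the full |s1| x |s2| table.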
Next…
Gram-based algorithms
- List-merging algorithms [LLL08]
- Variable-length grams (VGRAM) [LWY07,YWL08]
“q-grams” of strings
universal
2-grams
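As a small illustration of the slide, a string of length n has n - q + 1 overlapping q-grams. The helper name qgrams is an assumption made for this sketch.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping q-grams of s, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```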
Edit operation’s effect on grams
universal
Fixed length: q
k operations could affect k * q grams
If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
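A worked check of the bound, as a sketch: the helper names below are illustrative, and the pair "universal" / "univarsal" (one substitution) is an example chosen here, not taken from the slides.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def min_common_grams(length: int, q: int, k: int) -> int:
    """Lower bound on common q-grams when ed(s1, s2) <= k and |s1| = length."""
    return (length - q + 1) - k * q


def common_gram_count(s1: str, s2: str, q: int) -> int:
    """Size of the multiset intersection of the two gram bags."""
    return sum((Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))).values())


# One substitution (e -> a) destroys the two 2-grams 've' and 'er'.
print(min_common_grams(len("universal"), q=2, k=1))    # 6
print(common_gram_count("universal", "univarsal", 2))  # 6
```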
q-gram inverted lists
id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  inverted list (string ids)
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
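The index on this slide can be rebuilt with a few lines. This is a sketch that assumes string ids are simply positions in the input list and that a gram repeated inside one string is listed once; build_inverted_lists is a name chosen for illustration.

```python
from collections import defaultdict


def build_inverted_lists(strings: list[str], q: int = 2) -> dict[str, list[int]]:
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        seen = set()  # list each string at most once per gram
        for i in range(len(s) - q + 1):
            gram = s[i:i + q]
            if gram not in seen:
                seen.add(gram)
                index[gram].append(sid)
    return dict(index)


index = build_inverted_lists(["rich", "stick", "stich", "stuck", "static"])
print(index["ic"])  # [0, 1, 2, 4]
print(index["st"])  # [1, 2, 3, 4]
```

Because strings are processed in id order, every list comes out sorted, which the list-merging algorithms rely on.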
Searching using inverted lists

Query: “shtick”, ED(shtick, ?)≤1
2-grams of the query: sh, ht, ti, ic, ck
[Figure: the query grams are looked up in the 2-gram inverted lists of the strings rich, stick, stich, stuck, static]
# of common grams >= 3
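The query on this slide can be traced end to end with a filter-and-verify sketch. The function names are illustrative. Counting each query gram against lists that hold every string once can only overestimate the number of common grams, so no answer is lost, and every candidate is verified with the real edit distance.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def approximate_search(query, strings, q, k):
    """Return (id, string) pairs with ed(query, string) <= k."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)

    threshold = (len(query) - q + 1) - k * q  # count-filter lower bound
    if threshold <= 0:
        candidates = range(len(strings))      # bound gives no pruning
    else:
        counts = Counter()
        for gram in qgrams(query, q):
            for sid in index.get(gram, ()):
                counts[sid] += 1
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    # Verification step: the count filter may keep false positives.
    return [(sid, strings[sid]) for sid in candidates
            if edit_distance(query, strings[sid]) <= k]


data = ["rich", "stick", "stich", "stuck", "static"]
print(approximate_search("shtick", data, q=2, k=1))  # [(1, 'stick')]
```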
T-occurrence Problem
Merge the lists (ids in ascending order)
Find elements whose occurrences ≥ T
Example

T = 4
[Figure: inverted lists holding the ids 1, 10, 5, 3, 13, 7, 5, 15, 13, 13, 15, 10, 13]
Result: 13 (the only id that occurs 4 times)
List-Merging Algorithms
- HeapMerger, MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
Heap-based Algorithm
[Figure: the current element of each list is pushed to a min-heap]
Count # of occurrences of each element using a heap
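A minimal sketch of heap-based merging for the T-occurrence problem, assuming every list is sorted ascending and holds each id at most once. The split of the example ids into five lists below is a hypothetical arrangement chosen for this sketch; the slides' exact lists are not recoverable from the transcript.

```python
import heapq


def heap_merge(lists: list[list[int]], T: int) -> list[int]:
    """Return the ids that occur on at least T of the sorted lists."""
    # Heap entries: (id, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, li, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[li]):  # advance the list we popped from
            heapq.heappush(heap, (lists[li][pos + 1], li, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


# Hypothetical arrangement of the ids from the example slide.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, T=4))  # [13]
```

Every element of every list passes through the heap, which is the cost the skipping algorithms try to avoid.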
MergeOpt Algorithm [SK04]
- Long Lists: the T-1 longest lists, probed with binary search
- Short Lists: the remaining lists, merged to produce candidates
Example of MergeOpt
[Figure: inverted lists holding the ids 1, 10, 5, 3, 13, 7, 5, 15, 13, 13, 15, 10, 13]
Long Lists: 3
Short Lists: 2
Count threshold T ≥ 4
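A sketch of the MergeOpt idea under stated assumptions: lists are sorted and duplicate-free, and the short lists are merged here with a plain dictionary count instead of the heap used in [SK04]. Any id that occurs on T lists must appear on at least one short list, because only T-1 lists are set aside as long. The list arrangement is hypothetical.

```python
from bisect import bisect_left


def contains(sorted_list: list[int], x: int) -> bool:
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists: list[list[int]], T: int) -> list[int]:
    """Ids occurring on at least T lists; T-1 longest lists are only probed."""
    if T < 1:
        raise ValueError("T must be at least 1")
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]

    # Candidates come from the short lists only (simplified merge).
    counts = {}
    for lst in short_lists:
        for x in lst:
            counts[x] = counts.get(x, 0) + 1

    result = []
    for x, c in counts.items():
        # Binary-search each long list instead of scanning it.
        total = c + sum(contains(lst, x) for lst in long_lists)
        if total >= T:
            result.append(x)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, T=4))  # [13]
```

With T = 4 the three longest lists become long lists and the two singleton lists are the short lists, matching the 3 / 2 split on the slide.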
ScanCount
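[Figure: an array indexed by string id holds the # of occurrences; scanning the lists increments the entry of each id by 1. For the ids 1, 10, 5, 3, 13, 7, 5, 15, 13, 13, 15, 10, 13 the entry for 13 reaches 4, the entry for 14 stays 0, and the entry for 15 reaches 2.]
Count threshold T ≥ 4
Result: 13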
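ScanCount needs no heap at all. The sketch below assumes string ids are integers in the range 0 to num_ids - 1 and that T is at least 1; num_ids is a parameter introduced for this illustration, and the list arrangement is hypothetical.

```python
def scan_count(lists: list[list[int]], T: int, num_ids: int) -> list[int]:
    """Ids occurring on at least T lists, using one counter per string id."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, T=4, num_ids=16))  # [13]
```

The lists do not even need to be sorted; the price is a counter array as large as the collection.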
MergeSkip algorithm [BK02, LLL08]
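[Figure: a min-heap over the current elements of the lists. If the top element occurs fewer than T times, pop T-1 elements from the heap; each popped list then jumps to its smallest element greater than or equal to the new top of the heap.]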
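A sketch of the skipping idea, written from the description on this slide rather than from the code of [LLL08]; it assumes sorted, duplicate-free lists, and the example lists are a hypothetical arrangement. An id smaller than the new heap top can only live on the T-1 popped lists, so it can never reach T occurrences and is safe to skip.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists: list[list[int]], T: int) -> list[int]:
    """Ids occurring on at least T of the sorted lists, skipping hopeless ids."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:  # all lists currently at `top`
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, li, pos in popped:      # advance each list by one
                if pos + 1 < len(lists[li]):
                    heapq.heappush(heap, (lists[li][pos + 1], li, pos + 1))
        else:
            while heap and len(popped) < T - 1:  # pop T-1 in total
                popped.append(heapq.heappop(heap))
            if not heap:
                break  # fewer than T lists remain, no further answers
            new_top = heap[0][0]
            for _, li, pos in popped:      # jump to the first id >= new_top
                nxt = bisect_left(lists[li], new_top, pos)
                if nxt < len(lists[li]):
                    heapq.heappush(heap, (lists[li][nxt], li, nxt))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, T=4))  # [13]
```

On this input the first round pops 1, 5 and 10 and jumps all three lists straight to 13, so ids 3 and 7 are never touched.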
Example of MergeSkip
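[Figure: MergeSkip run with a minHeap over the current list elements and Jump steps that skip ahead inside the lists; ids as extracted from the figure: 1, 10, 5, 3, 13, 7, 5, 15, 17, 13, 15, 10, 15]
Count threshold T ≥ 4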
DivideSkip Algorithm [LLL08]
- Long Lists: binary search
- Short Lists: MergeSkip
How many lists are treated as long lists?
Length Filtering
s: length 10
t: length 19
ed(s,t) ≤ 2 is impossible: the pair is pruned by length only, since the lengths differ by 9 > 2
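The length filter follows from the fact that each edit operation changes the length by at most one. A one-line sketch (the function name is illustrative):

```python
def passes_length_filter(len_s: int, len_t: int, k: int) -> bool:
    """ed(s, t) <= k is only possible if the lengths differ by at most k."""
    return abs(len_s - len_t) <= k


print(passes_length_filter(10, 19, k=2))  # False: pruned without comparing s and t
```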
Positional Filtering
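ed(s,t) ≤ 2
s contains the positional gram (ab,1); t contains the positional gram (ab,12)
The positions differ by 11 > 2, so these two grams cannot be counted as a common gram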
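A sketch of the positional test on this slide, assuming 1-based gram positions as in (ab,1) and (ab,12); both helper names are chosen for illustration.

```python
def positional_qgrams(s: str, q: int = 2) -> list[tuple[str, int]]:
    """q-grams of s paired with their 1-based start positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_match(g1: tuple[str, int], g2: tuple[str, int], k: int) -> bool:
    """Two positional grams can correspond under ed <= k only if the gram
    text is equal and the positions are at most k apart."""
    return g1[0] == g2[0] and abs(g1[1] - g2[1]) <= k


print(grams_match(("ab", 1), ("ab", 12), k=2))  # False
print(positional_qgrams("abc"))                 # [('ab', 1), ('bc', 2)]
```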
A filter tree
Combine filters with list-merging algorithms [LLL08]
Next…
Variable-length grams (VGRAM) [LWY07,YWL08]
2-grams -> 3-grams?

Query: “shtick”, ED(shtick, ?)≤1
3-grams of the query: sht, hti, tic, ick

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

3-gram  inverted list (string ids)
ati     4
ich     0, 2
ick     1
ric     0
sta     4
sti     1, 2
stu     3
tat     4
tic     1, 2, 4
tuc     3
uck     3

# of common grams >= 1
Observation 1: dilemma of choosing “q”

Increasing “q” causes:
- Longer grams -> Shorter lists
- Smaller # of common grams of similar strings
[Figure: 2-gram inverted lists of the strings rich, stick, stich, stuck, static]
Observation 2: skew distributions of gram frequencies

DBLP: 276,699 article titles

Popular 5-grams: ation (>114K times), tions, ystem, catio
VGRAM: Main idea

Grams with variable lengths (between qmin and qmax)
- zebra: ze(123)
- corrasion: co(5213), cor(859), corr(171)
Advantages
- Reduces index size
- Reduces running time
- Adoptable by many algorithms
Challenges




- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?
Challenge 1: String -> Variable-length grams?
Fixed-length 2-grams: universal
Variable-length grams: universal
[2,4]-gram dictionary: ni, ivr, sal, uni, vers
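One way to decompose a string with such a dictionary, sketched under assumptions: at each position take the longest dictionary gram starting there (between qmin and qmax characters), fall back to the plain qmin-gram if none matches, and drop any gram that lies entirely inside a gram already produced. The function name vgen and the 0-based positions are choices made for this illustration; see [LWY07] for the authors' exact procedure.

```python
def vgen(s: str, dictionary: set[str], qmin: int, qmax: int) -> list[tuple[int, str]]:
    """Decompose s into (position, gram) pairs with variable-length grams."""
    grams = []
    last_end = 0  # exclusive end of the furthest-reaching gram emitted so far
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback: fixed-length qmin-gram
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):  # longest first
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        end = p + len(gram)
        if end > last_end:  # otherwise it is subsumed by an earlier gram
            grams.append((p, gram))
            last_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

Under this sketch "universal" yields 4 grams instead of the 8 fixed-length 2-grams, which is where the smaller index comes from.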
Representing gram dictionary as a trie
[Figure: a trie storing the grams ni, ivr, sal, uni, vers]
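A minimal trie sketch for the gram dictionary. The class name GramTrie and the longest_match method are hypothetical names introduced here; the point is that the longest dictionary gram starting at a position is found in a single walk of at most qmax steps.

```python
class GramTrie:
    """Trie node; the root represents the empty prefix."""

    def __init__(self):
        self.children = {}
        self.is_gram = False  # True if the path from the root spells a gram

    def insert(self, gram: str) -> None:
        node = self
        for ch in gram:
            node = node.children.setdefault(ch, GramTrie())
        node.is_gram = True

    def longest_match(self, s: str, start: int, qmax: int) -> int:
        """Length of the longest dictionary gram starting at s[start], 0 if none."""
        node, best = self, 0
        for i in range(start, min(len(s), start + qmax)):
            node = node.children.get(s[i])
            if node is None:
                break
            if node.is_gram:
                best = i - start + 1
        return best


trie = GramTrie()
for g in ["ni", "ivr", "sal", "uni", "vers"]:
    trie.insert(g)
print(trie.longest_match("universal", 0, qmax=4))  # 3  ("uni")
print(trie.longest_match("universal", 2, qmax=4))  # 0  (no gram starts with "ive")
```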
Challenge 2: Constructing a gram dictionary
qmin=2
qmax=4
- Frequency-based [LWY07]
- Cost-based [YWL08]
Challenge 3: Edit operation’s effect on grams
universal
Fixed length: q
k operations could affect k * q grams
Deletion affects variable-length grams
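[Figure: a deletion at position i can affect only grams that overlap the window from i-qmax+1 to i+qmax-1; grams entirely to the left or right of this window are not affected]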
Main idea



- For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
- Compute number of grams possibly destroyed by k operations
- Store these numbers (for all data strings) as part of the index
Vector of s = <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
Use this number to do count filtering
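How the stored vector feeds the count filter can be sketched as follows. The function name and the value of 10 grams for s are assumptions made for illustration; only the vector <2,4,6,8,9> comes from the slide. Entry k-1 of the vector is read as the maximum number of grams affected by k edit operations.

```python
def vgram_lower_bound(num_grams: int, nag: list[int], k: int) -> int:
    """Lower bound on common grams: # of grams of s minus the number of
    grams that k edit operations can affect."""
    if k == 0:
        return num_grams
    if k > len(nag):
        return 0  # no stored value for this k: fall back to no pruning
    return max(0, num_grams - nag[k - 1])


nag_of_s = [2, 4, 6, 8, 9]  # vector from the slide
print(vgram_lower_bound(num_grams=10, nag=nag_of_s, k=2))  # 6, assuming s has 10 grams
```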
Summary of VGRAM index
Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces:
- String s -> grams
- String s1, s2 such that ed(s1,s2) <= k -> min # of their common grams
Lower bound on # of common grams
Fixed length (q)
universal
If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
Variable lengths: # of grams of s1 - NAG(s1,k)
Example: algorithm using inverted lists

Query: “shtick”, ED(shtick, ?)≤1
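[Figure: the query is looked up in two indexes over the strings rich, stick, stich, stuck, static; with variable-length grams it is decomposed into sh, ht, tick]
- 2-grams index (lists …, ck, ic, …, ti, …): Lower bound = 3
- 2-4 grams index (lists …, ck, ic, ich, …, tic, tick, …): Lower bound = 1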
End of part I









Motivation
Preliminaries
Trie-based approach
Gram-based algorithms
Sketch-based algorithms
Compression
Selectivity estimation
Transformations/Synonyms
Conclusion
Part I
Part II
References









[AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. VLDB 2006
[ACGK08] Incorporating String Transformations in Record Matching. Arvind Arasu, Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008
[BK02] Adaptive Intersection and t-Threshold Problems. Jérémy Barbay, Claire Kenyon. SODA 2002
[BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu. ICDE 2009
[BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher. STOC 1998
[CGG+05] Data Cleaning in Microsoft SQL Server 2005. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. SIGMOD 2005
[CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE 2006
[CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD 2008
[HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE 2008
[HYK+08] Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB 2008
[JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang Jin, Chen Li. VLDB 2005
[JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen Li, Jianhua Feng. WWW 2009
[JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. Liang Jin, Chen Li, Rares Vernica. VLDB Journal 2008
[KSS06] Record Linkage: Similarity Measures and Algorithms. Nick Koudas, Sunita Sarawagi, Divesh Srivastava. SIGMOD 2006
[LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches. Chen Li, Jiaheng Lu, Yiming Lu. ICDE 2008
[LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007
[LWY07] VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. Chen Li, Bin Wang, Xiaochun Yang. VLDB 2007
[MBK+07] Estimating the Selectivity of Approximate String Queries. Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS 2007
[SK04] Efficient Set Joins on Similarity Predicates. Sunita Sarawagi, Alok Kirpal. SIGMOD 2004
[XWL08] Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008
[XWL+08] Efficient Similarity Joins for Near Duplicate Detection. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008
[YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, Chen Li. SIGMOD 2008