Efficient Approximate Search on String Collections

Part I
Marios Hadjieleftheriou
Chen Li
1
DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
2
Try their names (good luck!)
- Yannis Papakonstantinou (UCSD)
- Meral Ozsoyoglu (Case Western)
- Marios Hadjieleftheriou (AT&T Research)
3

4
Better system?
5
http://dblp.ics.uci.edu/authors/
People Search at UC Irvine
6
http://psearch.ics.uci.edu/
Web Search
Actual queries gathered by Google
http://www.google.com/jobs/britney.html

Errors in queries

Errors in data

Bring query and meaningful results closer together
7
Data Cleaning
R: informix, microsoft, …, …
S: infromix, …, mcrosoft, …
8
Problem Formulation
Find strings similar to a given string: dist(Q,D) <= δ
Example: find strings similar to “hadjeleftheriou”
Performance is important!
- 10 ms: 100 queries per second (QPS)
- 5 ms: 200 QPS
9
Outline
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part I
Part II
10
Next…
Preliminaries
11
Similarity Functions


Similar to:
- a domain-specific function
- returns a similarity value between two strings
Examples:
- Edit distance
- Hamming distance
- Jaccard similarity
- Soundex
- TF/IDF, BM25, DICE
- See [KSS06] for an excellent survey
12
Edit Distance
A widely used metric to define string similarity
- ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
- Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2 (substitute m with n, delete s)
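The definition above can be computed with the standard dynamic program over prefixes. The following is a minimal sketch, not code from the talk; the function name `edit_distance` is an assumption made for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] is the distance between the prefix of s1 handled so far
    # and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # prints 2, as in the slide's example
```

The table costs O(|s1| * |s2|) time per pair, which is why the rest of the talk focuses on filters that avoid computing it for most strings.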

13
Next…
Gram-based algorithms
- List-merging algorithms [LLL08]
- Variable-length grams (VGRAM) [LWY07,YWL08]
14
“q-grams” of strings
universal
2-grams: un, ni, iv, ve, er, rs, sa, al
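Extracting the q-grams of a string is a sliding window of width q. A small sketch follows, assuming no padding characters at the string boundaries (which matches the count of |s| - q + 1 grams used later in the talk); the name `qgrams` is an assumption.

```python
def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q, ordered by start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```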
15
Edit operation’s effect on grams
Example string: universal
Fixed length: q
k operations could affect k * q grams
If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
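The bound is plain arithmetic: a string of length |s1| has |s1| - q + 1 grams, and each of the k operations can destroy at most q of them. A sketch with an assumed helper name `min_common_grams`:

```python
def min_common_grams(length: int, q: int, k: int) -> int:
    """Lower bound on the number of q-grams shared with any string within edit distance k."""
    return (length - q + 1) - k * q


# Query "shtick" (length 6) with k = 1, the example used on later slides.
print(min_common_grams(len("shtick"), q=2, k=1))  # 3
print(min_common_grams(len("shtick"), q=3, k=1))  # 1
```

When the value is zero or negative, the count filter cannot prune anything and every string remains a candidate.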
16
q-gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  ids
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
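The index on this slide can be rebuilt in a few lines. This is a sketch under two assumptions: the function name `build_inverted_lists` is made up, and a string id is stored at most once per gram even if the gram repeats inside the string.

```python
from collections import defaultdict


def build_inverted_lists(strings, q):
    """Map each q-gram to the ascending list of ids of the strings containing it."""
    lists = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps each id at most once per gram; ids are appended in
        # increasing order, so every list stays sorted.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            lists[gram].append(sid)
    return dict(lists)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(strings, 2)
for gram in sorted(index):
    print(gram, index[gram])
```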
17
Searching using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
2-grams of the query: sh, ht, ti, ic, ck
Inverted lists of the query grams (strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static):
sh  (no list)
ht  (no list)
ti  1, 2, 4
ic  0, 1, 2, 4
ck  1, 3
# of common grams >= 3
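Putting the pieces together, a query is answered by counting how often each string id appears on the lists of the query's grams, keeping the ids that reach the threshold, and verifying those candidates with the real edit distance. The sketch below is an illustration only; all function names are assumptions, and the counting step is the simplest possible one rather than any of the merging algorithms discussed next.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def build_index(strings, q):
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return index


def search(query, strings, index, q, k):
    """Return (candidate ids after the count filter, ids that pass verification)."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        candidates = list(range(len(strings)))  # the filter cannot prune anything
    else:
        counts = Counter()
        for gram in qgrams(query, q):
            counts.update(index.get(gram, ()))  # +1 for every id on the gram's list
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    answers = [sid for sid in candidates if edit_distance(query, strings[sid]) <= k]
    return candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_index(strings, 2)
candidates, answers = search("shtick", strings, index, q=2, k=1)
print(candidates, [strings[sid] for sid in answers])  # [1] ['stick']
```

Only "stick" shares three grams (ti, ic, ck) with the query, so it is the single candidate that needs an edit-distance computation.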
18
T-occurrence Problem
Merge the lists (each in ascending order)
Find elements whose occurrences ≥ T
19
Example
T=4
Elements on the lists (list boundaries lost in extraction): 1, 10, 5, 3, 13, 7, 5, 15, 13, 13, 15, 10, 13
Result: 13
20
List-Merging Algorithms
- HeapMerger, MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
21
Heap-based Algorithm
Push the first element of every list to a min-heap
Count # of occurrences of each element using a heap
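A sketch of the heap-based merge follows. The function name `heap_merge` is an assumption, and so is the layout of the five lists: the slide's figure is not recoverable from this transcript, so the lists below are one arrangement consistent with the thirteen elements shown in the example.

```python
import heapq


def heap_merge(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    # Heap entries are (element, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every frontier equal to the current minimum and advance its list.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


# Assumed list layout (see the note above).
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, T=4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.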
22
MergeOpt Algorithm [SK04]
Long Lists: the T-1 longest lists, probed with binary search
Short Lists: the remaining lists, merged
23
Example of MergeOpt
Elements on the lists (list boundaries lost in extraction): 1, 10, 5, 3, 13, 7, 5, 15, 13, 13, 15, 10, 13
Long Lists: 3
Short Lists: 2
Count threshold: T = 4
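The idea can be sketched as follows: an element needs T occurrences, and the T-1 long lists can contribute at most T-1 of them, so every answer must appear on at least one short list. The code merges the short lists and probes the long lists by binary search. Names are assumptions, the list layout is an assumed arrangement consistent with the elements on the slide, and each list is assumed to be sorted without duplicates.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    result = []
    # heapq.merge streams the short lists' elements in ascending order;
    # groupby collects equal elements so they can be counted.
    for elem, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        count += sum(1 for lst in long_lists if contains(lst, elem))
        if count >= T:
            result.append(elem)
    return result


# Assumed layout: 3 long lists and 2 short lists for T = 4.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, T=4))  # [13]
```

With this layout only the elements 13 and 15 from the two short lists are ever looked up, instead of all thirteen elements.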
24
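ScanCount
Array of counters, one per string id (# of occurrences); scan each list and increment the counter of every id by 1
Elements on the lists (list boundaries lost in extraction): 1, 10, 5, 3, 13, 7, 5, 15, 13, 13, 15, 10, 13
Counters after the scan: id 13: 4, id 14: 0, id 15: 2
Result: 13
Count threshold: T = 4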
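ScanCount needs no heap at all, only an array with one counter per string id. A minimal sketch, with assumed names and the same assumed list layout as a stand-in for the slide's figure:

```python
def scan_count(lists, T, num_ids):
    """Return the ids whose counter reaches T after scanning every list once."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report an id once, when it first reaches T
                result.append(sid)
    return sorted(result), counts


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
result, counts = scan_count(lists, T=4, num_ids=16)
print(result, counts[13], counts[14], counts[15])  # [13] 4 0 2
```

The scan touches every list element exactly once and does not require the lists to be sorted, at the price of an array as large as the collection.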
25
List-Merging Algorithms
- HeapMerger, MergeOpt [SK04]
- ScanCount, MergeSkip, DivideSkip [LLL08, BK02]
26
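MergeSkip algorithm [BK02, LLL08]
Min-heap over the current element of every list
If the top element occurs fewer than T times: pop T-1 elements in total, then jump in each popped list to the smallest element greater than or equal to the new top of the heap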
27
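Example of MergeSkip
Figure labels: minHeap, Jump (values shown next to them: 1, 5, 13, 10, 15)
Elements on the lists (list boundaries lost in extraction): 1, 10, 5, 3, 13, 7, 5, 15, 17, 13, 15, 10, 15
Count threshold: T = 4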
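The skipping step can be sketched as below. This follows the description in [LLL08] as understood here and is not the authors' code; names are assumptions, each list is assumed sorted without duplicates, and the five lists are an assumed layout taken from the earlier T-occurrence example rather than from this slide's figure.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    pos = [0] * len(lists)  # pos[i] is the current position in list i
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []

    def advance(i, new_pos):
        pos[i] = new_pos
        if new_pos < len(lists[i]):
            heapq.heappush(heap, (lists[i][new_pos], i))

    # With fewer than T unfinished lists, nothing can reach T occurrences.
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:
                advance(i, pos[i] + 1)
        else:
            # Pop further lists until T-1 lists have been popped in total.
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break
            target = heap[0][0]
            # Anything smaller than target lives only on the T-1 popped
            # lists, so it cannot be an answer: jump straight to target.
            for i in popped:
                advance(i, bisect_left(lists[i], target, pos[i]))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, T=4))  # [13]
```

On this input the first iteration pops the frontiers 1, 5 and 10, sees 13 on top of the heap, and jumps all three lists directly to 13 without reading 3 or 7.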
28
DivideSkip Algorithm [LLL08]
Long Lists: binary search
Short Lists: MergeSkip
29
How many lists are treated as long lists?
30
Length Filtering
s: Length: 10
t: Length: 19
Ed(s,t) ≤ 2 is impossible: the lengths differ by 9, which is more than 2
Pruned by length only!
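The length filter rests on the fact that one edit operation changes the length of a string by at most one. A one-line sketch, with an assumed function name:

```python
def passes_length_filter(len_s: int, len_t: int, k: int) -> bool:
    """Strings within edit distance k differ in length by at most k."""
    return abs(len_s - len_t) <= k


print(passes_length_filter(10, 19, k=2))  # False: the pair on the slide is pruned
print(passes_length_filter(10, 11, k=2))  # True: still needs further checks
```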
31
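Positional Filtering
Ed(s,t) ≤ 2
s contains the gram ab at position 1: (ab,1)
t contains the gram ab at position 12: (ab,12)
The two occurrences cannot correspond to each other, because their positions differ by more than 2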
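The positional filter is the same argument applied to gram positions: k edit operations shift a surviving gram by at most k positions. A small sketch, with an assumed function name and positional grams written as (gram, position) pairs as on the slide:

```python
def may_correspond(gram_s, pos_s, gram_t, pos_t, k):
    """Two positional grams can be matched only if they are equal and at most k apart."""
    return gram_s == gram_t and abs(pos_s - pos_t) <= k


print(may_correspond("ab", 1, "ab", 12, k=2))  # False: the pair on the slide
print(may_correspond("ab", 1, "ab", 3, k=2))   # True
```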
32
A filter tree
Combine filters with list-merging algorithms [LLL08]
33
Next…
Variable-length grams (VGRAM) [LWY07,YWL08]
34
2-grams -> 3-grams?
Query: “shtick”, ED(shtick, ?)≤1
3-grams of the query: sht, hti, tic, ick

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

3-gram  ids
ati     4
ich     0, 2
ick     1
ric     0
sta     4
sti     1, 2
stu     3
tat     4
tic     1, 2, 4
tuc     3
uck     3

# of common grams >= 1
35
Observation 1: dilemma of choosing “q”
Increasing “q” causing:
- Longer grams -> Shorter lists
- Smaller # of common grams of similar strings
(Figure: the 2-gram inverted lists of rich, stick, stich, stuck, static, as on the "q-gram inverted lists" slide)
36
Observation 2: skew distributions of gram frequencies

DBLP: 276,699 article titles

Popular 5-grams: ation (>114K times), tions, ystem, catio
37
VGRAM: Main idea
Grams with variable lengths (between qmin and qmax)
- zebra: ze(123)
- corrasion: co(5213), cor(859), corr(171)
Advantages
- Reduce index size
- Reduce running time
- Adoptable by many algorithms
38
Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?
39
Challenge 1: String -> Variable-length grams?
Fixed-length 2-grams
universal
Variable-length grams
universal
[2,4]-gram dictionary: ni, ivr, sal, uni, vers
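One way to turn a string into variable-length grams with such a dictionary is sketched below. It reflects a reading of the greedy procedure in [LWY07] and is not the authors' code: at each position take the longest dictionary gram that starts there, fall back to the qmin-gram if there is none, and drop any gram that lies entirely inside a gram produced earlier. The function name is an assumption, positions are 0-based, and a Python set stands in for the trie.

```python
def vgram_decompose(s, dictionary, qmin, qmax):
    """Return the list of (start position, gram) pairs for string s."""
    grams = []
    covered_end = -1  # largest end position of any gram produced so far
    for p in range(len(s) - qmin + 1):
        # Longest dictionary gram starting at p, if any.
        gram = None
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        if gram is None:
            gram = s[p:p + qmin]  # fall back to the shortest allowed gram
        end = p + len(gram) - 1
        # Earlier grams start at or before p, so the new gram is contained
        # in one of them exactly when it does not extend past covered_end.
        if end > covered_end:
            grams.append((p, gram))
            covered_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgram_decompose("universal", dictionary, qmin=2, qmax=4))
# [(0, 'uni'), (2, 'iv'), (3, 'vers'), (6, 'sal')]
```

The string yields four variable-length grams instead of the eight fixed-length 2-grams.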
40
Representing gram dictionary as a trie
Dictionary: ni, ivr, sal, uni, vers
41
Challenge 2: Constructing a gram dictionary
qmin=2
qmax=4
- Frequency-based [LWY07]
- Cost-based [YWL08]
42
Challenge 3: Edit operation’s effect on grams
universal
Fixed length: q
k operations could affect k * q grams
43
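Deletion affects variable-length grams
Deletion at position i
Affected: grams inside the window from i-qmax+1 to i+qmax-1 may be affected
Not affected: grams to the left of i-qmax+1 and to the right of i+qmax-1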
44
Main idea
- For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
- Compute number of grams possibly destroyed by k operations
- Store these numbers (for all data strings) as part of the index
Vector of s = <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
Use this number to do count filtering
45
Summary of VGRAM index
46
Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces:
- String s -> grams
- String s1, s2 such that ed(s1,s2) <= k -> min # of their common grams
47
Lower bound on # of common grams
Fixed length (q)
universal
If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
Variable lengths: # of grams of s1 - NAG(s1,k)
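The two bounds can be written side by side. This is a sketch with assumed function names; it assumes the stored vector holds NAG(s,1), NAG(s,2), ... in order, which is consistent with the earlier example where the vector <2,4,6,8,9> gives 4 affected grams for 2 operations. The gram count of 9 used below is hypothetical and not taken from the slides.

```python
def fixed_lower_bound(length: int, q: int, k: int) -> int:
    """Fixed-length grams: (|s1| - q + 1) - k * q."""
    return (length - q + 1) - k * q


def vgram_lower_bound(num_grams: int, nag_vector, k: int) -> int:
    """Variable-length grams: number of grams of s1 minus NAG(s1, k)."""
    if k == 0:
        return num_grams
    return num_grams - nag_vector[k - 1]  # entry k-1 is assumed to hold NAG(s1, k)


nag_vector = [2, 4, 6, 8, 9]  # the vector from the "Main idea" slide
print(vgram_lower_bound(num_grams=9, nag_vector=nag_vector, k=2))  # 9 - 4 = 5
print(fixed_lower_bound(len("shtick"), q=2, k=1))                  # 3
```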
48
Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
Query grams (2-4 grams): sh, ht, tick
Strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
2-grams index (lists shown: ck: 1, 3; ic: 0, 1, 2, 4; ti: 1, 2, 4): lower bound = 3
2-4 grams index (lists shown: ck, ic, ich, tic, tick; list contents lost in extraction): lower bound = 1
49
End of part I
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
Part I
Part II
50
References
[AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. VLDB 2006.
[ACGK08] Incorporating string transformations in record matching. Arvind Arasu, Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008.
[BK02] Adaptive intersection and t-threshold problems. Jérémy Barbay, Claire Kenyon. SODA 2002.
[BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu. ICDE 2009.
[BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher. STOC 1998.
[CGG+05] Data cleaning in Microsoft SQL Server 2005. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. SIGMOD 2005.
[CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE 2006.
[CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD 2008.
[HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE 2008.
51
References
[HYK+08] Hashed samples: selectivity estimators for set similarity selection queries. Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB 2008.
[JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang Jin, Chen Li. VLDB 2005.
[JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen Li, Jianhua Feng. WWW 2009.
[JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. Liang Jin, Chen Li, Rares Vernica. VLDBJ 2008.
[KSS06] Record linkage: Similarity measures and algorithms. Nick Koudas, Sunita Sarawagi, Divesh Srivastava. SIGMOD 2006.
[LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches. Chen Li, Jiaheng Lu, Yiming Lu. ICDE 2008.
[LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007.
[LWY07] VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. Chen Li, Bin Wang, Xiaochun Yang. VLDB 2007.
[MBK+07] Estimating the selectivity of approximate string queries. Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS 2007.
52
References
[SK04] Efficient set joins on similarity predicates. Sunita Sarawagi, Alok Kirpal. SIGMOD 2004.
[XWL08] Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008.
[XWL+08] Efficient similarity joins for near duplicate detection. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008.
[YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, Chen Li. SIGMOD 2008.
53
