Slide 1 - Database Group @ Department of Computer

The Similarity Search Algorithm
A Pivotal Prefix Based Filtering Algorithm
for String Similarity Search
Dong Deng, Guoliang Li, Jianhua Feng
Department of Computer Science, Tsinghua University, Beijing, China
Problem Definition
Edit Distance:
The minimum number of edit operations (insertion, deletion, substitution)
needed to transform one string into another.
For example: ED(youtbe, yotde) = 2 (delete u, then substitute b with d)
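The definition above can be checked with the textbook dynamic program for edit distance. This is a minimal sketch, not the verification routine used by the authors; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed part of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


example = edit_distance("youtbe", "yotde")
```

For the poster's example, `example` evaluates to 2, matching ED(youtbe, yotde) = 2.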
[Figure labels: Preprocess, Querying, Probe, Inverted index I]
Query s: yotubecom
pre(s): {ot, om, yo, ub, co}
piv(s): {ot, om, ub}
last(pre(s)): co
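The prefix and pivotal prefix of the query can be reproduced with a short sketch. Several things here are assumptions rather than statements from the poster: the prefix is taken as the first qτ+1 q-grams in the global gram order (consistent with the five-gram prefixes shown), q-grams that do not occur in the dataset are ranked first, and the pivotal prefix is simply the first combination of τ+1 pairwise disjoint prefix q-grams. The poster does not say how the authors select pivotal q-grams, so the selection rule and the names `sorted_qgrams` and `prefix_and_pivots` are hypothetical.

```python
from itertools import combinations

# Global gram order as listed on the poster (rarest q-grams first).
GLOBAL_ORDER = ("im my te bu un nt uc bb tb oy yt ca om "
                "yo ou ut ub co tu be ec").split()
RANK = {g: i for i, g in enumerate(GLOBAL_ORDER)}


def sorted_qgrams(s, q):
    """(q-gram, start position) pairs sorted by the global gram order."""
    grams = [(s[i:i + q], i) for i in range(len(s) - q + 1)]
    # Assumption: q-grams absent from the global order rank before all others.
    return sorted(grams, key=lambda gp: (RANK.get(gp[0], -1), gp[1]))


def prefix_and_pivots(s, q, tau):
    """Return the prefix and one set of tau+1 pairwise disjoint prefix q-grams."""
    pre = sorted_qgrams(s, q)[:q * tau + 1]
    for combo in combinations(pre, tau + 1):
        # Two q-grams are disjoint when their start positions differ by >= q.
        if all(abs(a[1] - b[1]) >= q for a, b in combinations(combo, 2)):
            return [g for g, _ in pre], [g for g, _ in combo]
    return [g for g, _ in pre], None  # no disjoint choice found


pre_s, piv_s = prefix_and_pivots("yotubecom", q=2, tau=2)
```

With q = 2 and τ = 2 this yields `pre_s == ['ot', 'om', 'yo', 'ub', 'co']` and `piv_s == ['ot', 'om', 'ub']`, the sets shown for s. Other selection rules can return a different but equally valid pivotal prefix.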
To support dynamic thresholds, we maintain incremental inverted indexes.
Pivotal Prefix Selection
Edit distance example: youtbe → (delete u) → yotbe → (substitute b with d) → yotde
Global gram order (frequency of each q-gram in R on the second row):
im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec
1  1  1  1  1  1  1  1  1  1  1  2  2  3  3  3  3  3  3  3  4

String dataset R (τ=2, q=2); q-grams sorted by the global gram order and split into pre(ri) | suf(ri):
r1: imyouteca   q(r1): {im, my, te, ca, yo | ou, ut, ec}
r2: ubuntucom   q(r2): {bu, un, nt, uc, om | ub, co, tu}
r3: utubbecou   q(r3): {bb, ou, ut, ub, co | tu, be, ec}
r4: youtbecom   q(r4): {tb, om, yo, ou, ut | co, be, ec}
r5: yoytubeca   q(r5): {oy, yt, ca, yo, ub | tu, be, ec}
[Figure labels: Sort q-grams, Sort and Split String, last(pre(ri)), slt(ri); the entries "( )={<9,1>}" are not recoverable]

Indexing:
1. Fix the global gram order
2. Build inverted indexes for prefixes and pivotal prefixes

Prefix index (entries are <string, position>):
im <r1,1>   my <r1,2>   te <r1,6>   bu <r2,2>   un <r2,3>   nt <r2,4>
uc <r2,6>   bb <r3,4>   tb <r4,4>   oy <r5,2>   yt <r5,3>
ca <r1,8><r5,8>   om <r2,8><r4,8>   yo <r1,3><r4,1><r5,1>
ou <r3,8><r4,2>   ut <r3,1><r4,3>   ub <r3,3><r5,5>   co <r3,7>

Pivotal prefix index:
im <r1,1>   te <r1,6>   bu <r2,2>   nt <r2,4>   uc <r2,6>   tb <r4,4>
yt <r5,3>   ca <r1,8><r5,8>   om <r4,8>   yo <r4,1><r5,1>
ou <r3,8>   ut <r3,1>   ub <r3,3>

Querying:
1. Generate prefix and pivotal prefix for the query string
2. Probe the prefix index with the pivotal prefix of the query
3. Probe the pivotal prefix index with the prefix of the query
4. Verify the candidates

Candidates: r3, r4, r5
Result: r4
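The indexing and querying steps can be sketched end to end on the poster's five strings. This is an illustration under stated assumptions, not the authors' implementation: the pivotal prefixes of r1 to r5 are copied from the pivotal prefix index rather than computed, positions in the index entries are ignored, and the rule that compares last(pre(r)) with last(pre(s)) to decide which probe applies is a reading of the last(pre(·)) labels that should be checked against the paper. All function and variable names are hypothetical.

```python
GLOBAL_ORDER = ("im my te bu un nt uc bb tb oy yt ca om "
                "yo ou ut ub co tu be ec").split()
RANK = {g: i for i, g in enumerate(GLOBAL_ORDER)}
Q, TAU = 2, 2

DATA = {"r1": "imyouteca", "r2": "ubuntucom", "r3": "utubbecou",
        "r4": "youtbecom", "r5": "yoytubeca"}
# Pivotal prefixes copied from the poster's pivotal prefix index.
PIV = {"r1": ["im", "te", "ca"], "r2": ["bu", "nt", "uc"],
       "r3": ["ou", "ut", "ub"], "r4": ["tb", "om", "yo"],
       "r5": ["yt", "ca", "yo"]}


def rank(g):
    return RANK.get(g, -1)  # assumption: unseen q-grams rank first


def prefix(s):
    grams = [s[i:i + Q] for i in range(len(s) - Q + 1)]
    return sorted(grams, key=rank)[:Q * TAU + 1]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_indexes(data, piv):
    prefix_index, pivotal_index = {}, {}
    for rid, r in data.items():
        for g in prefix(r):
            prefix_index.setdefault(g, []).append(rid)
        for g in piv[rid]:
            pivotal_index.setdefault(g, []).append(rid)
    return prefix_index, pivotal_index


def search(s, piv_s, data, prefix_index, pivotal_index):
    pre_s = prefix(s)
    last_s = rank(pre_s[-1])
    last = {rid: rank(prefix(r)[-1]) for rid, r in data.items()}
    cands = set()
    # Step 2: pivotal prefix of the query against the prefix index.
    for g in piv_s:
        cands |= {rid for rid in prefix_index.get(g, []) if last[rid] > last_s}
    # Step 3: prefix of the query against the pivotal prefix index.
    for g in pre_s:
        cands |= {rid for rid in pivotal_index.get(g, []) if last[rid] <= last_s}
    # Step 4: verify the surviving candidates.
    results = [rid for rid in sorted(cands)
               if edit_distance(s, data[rid]) <= TAU]
    return cands, results


prefix_index, pivotal_index = build_indexes(DATA, PIV)
candidates, results = search("yotubecom", ["ot", "om", "ub"], DATA,
                             prefix_index, pivotal_index)
```

On this data the sketch produces the candidates r3, r4, r5 and the single result r4, in line with the poster. Dropping the last(pre(·)) comparison would still be safe but would also admit r2 as a candidate before verification.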
Existence of Pivotal Prefix:
Edit step (yotbe → yotde): substitute b with d
Query string s = “yotubecom” and τ = 2 over the string dataset R:
ed(s, r4) <= 2, so output <s, r4> as a result.
Applications:
• Data cleaning & Data integration
• Spell Checking
• Copy Detection
• Entity Linking
• Macromolecule Sequence Alignment
• …
Alignment Filter
Edit Distance
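q-gram and Existing Filters
q-gram:
A q-gram of a string is a substring of length q.
For example, the 2-gram set of the string s = “youtbecom” is q(s) = {yo, ou, ut, tb, be, ec, co, om}.
2-grams of youtbecom and youtdecom (one substitution, b to d, destroys the 2 q-grams tb and be):
youtbecom   youtdecom
yo          yo
ou          ou
ut          ut
tb          td
be          de
ec          ec
co          co
om          om
τ edit operations destroy at most qτ q-grams.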
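The q-gram table can be regenerated with a few lines. This is a minimal sketch; `qgrams` is an illustrative helper name, and the comparison ignores positions and duplicate q-grams, which is sufficient for this example.

```python
def qgrams(s: str, q: int) -> list:
    """All substrings of length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


before = qgrams("youtbecom", 2)
after = qgrams("youtdecom", 2)
# q-grams of the original string that no longer occur after the edit.
destroyed = [g for g in before if g not in after]
```

Here `destroyed` is `['tb', 'be']`: one substitution (τ = 1) with q = 2 removes two q-grams, which meets the bound qτ = 2.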
Count Filter:
If ED(r,s) ≤ τ, |q(r) ∩ q(s)| ≥ max(|q(r)|, |q(s)|) − qτ
Prefix Filter:
If ED(r,s) ≤ τ, pre(r) ∩ pre(s) ≠ ϕ
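Both filters can be tried on the poster's strings. The sketch below assumes the prefix is the first qτ+1 q-grams under the global gram order (consistent with the five-gram prefixes on the poster) and that q-grams unseen in the dataset rank first; the function names are hypothetical. A string that fails a filter can be pruned, while a string that passes still needs verification.

```python
from collections import Counter

GLOBAL_ORDER = ("im my te bu un nt uc bb tb oy yt ca om "
                "yo ou ut ub co tu be ec").split()
RANK = {g: i for i, g in enumerate(GLOBAL_ORDER)}


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_filter(r, s, q, tau):
    """True if r and s share enough q-grams to possibly satisfy ED(r, s) <= tau."""
    gr, gs = Counter(qgrams(r, q)), Counter(qgrams(s, q))
    common = sum((gr & gs).values())  # multiset intersection size
    return common >= max(sum(gr.values()), sum(gs.values())) - q * tau


def prefix(s, q, tau):
    # Assumption: prefix = first q*tau + 1 q-grams in the global gram order.
    return sorted(qgrams(s, q), key=lambda g: RANK.get(g, -1))[:q * tau + 1]


def prefix_filter(r, s, q, tau):
    """True if the prefixes of r and s share at least one q-gram."""
    return bool(set(prefix(r, q, tau)) & set(prefix(s, q, tau)))


s, r1, r4 = "yotubecom", "imyouteca", "youtbecom"
passes_r4 = (count_filter(r4, s, 2, 2), prefix_filter(r4, s, 2, 2))
passes_r1 = (count_filter(r1, s, 2, 2), prefix_filter(r1, s, 2, 2))
```

For τ = 2 and q = 2, r4 passes both filters. r1 shares only yo and ec with s, fewer than the required 8 − 4 = 4 q-grams, so the count filter prunes it even though its prefix overlaps pre(s) on yo.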
Experiments
Settings:
C++, compiled with g++ 4.8.2 using the -O3 flag
64-bit Ubuntu Server 12.04 LTS
Intel Xeon E5-2650 2.00GHz processor and 16GB memory
Pivotal Prefix Filter
Evaluating Alignment Filter
Evaluating Pivotal Prefix Filter
Comparison with State-of-the-art
Scalability
http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html
Copyright © 2014, Database Research Group, Tsinghua University