PowerPoint Presentation - Tsinghua University


A Pivotal Prefix Based Filtering
Algorithm for String Similarity Search
Dong Deng, Guoliang Li, Jianhua Feng
Database Group, Tsinghua University
Presented by Dong Deng
Search is Important
Google Searches per Year
Source: http://www.internetlivestats.com/google-search-statistics/
Speed Matters
Source:
Data is Dirty
DBLP Complete Search
• Typos: Argyrios Zymnis vs. Argyris Zymnis
• Typo in “title”: relaxed vs. related
Similarity Search
Query
All the strings similar
to the query
String Dataset
Edit Distance
• ED(r, s): The min number of edit operations
(insertion/deletion/substitution) needed to
transform r to s.
• For example: ED(sigcom, sigmod) = 2
sigcom → (substitute c with m) → sigmom → (substitute m with d) → sigmod
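The definition above can be checked with the textbook dynamic program for edit distance. This is a minimal sketch, not the verification routine used by the authors; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(r, s):
    """Minimum number of insertions, deletions, and substitutions turning r into s."""
    # prev[j] holds the distance between the processed prefix of r and s[:j].
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]  # distance from r[:i] to the empty string
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1,               # delete cr
                           cur[j - 1] + 1,            # insert cs
                           prev[j - 1] + (cr != cs))) # substitute or match
        prev = cur
    return prev[-1]

print(edit_distance("sigcom", "sigmod"))
```

The table has (|r|+1)·(|s|+1) cells, which is why running it against every string in R is too slow for large datasets.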
Problem Definition
Query string
s = “yotubecom” and τ = 2
string dataset R
ed(s, r4) <= 2
output r4 as a result
Application
• Spell Checking
• Copy Detection
• Entity Linking
• Bioinformatics
• …
Challenge
Naïve Method
Time complexity: for each query, O(|R| · |s| · τ)
Filter-and-Verification Framework
Input: query string s, threshold τ, and an index built on dataset R
Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r becomes a candidate
Verify: ED(r,s) ≤ τ? If yes, r is added to the results
Preliminary: q-gram
• q-gram: a substring of length q
Example: the 2-grams of youtbecom are yo, ou, ut, tb, be, ec, co, om
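A short sketch of q-gram extraction, assuming overlapping grams taken at every start position as in the example above. The helper name `qgrams` is hypothetical.

```python
def qgrams(s, q):
    """All overlapping substrings of length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("youtbecom", 2))
```

A string of length n yields n - q + 1 q-grams, so the 9-character example has eight 2-grams.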
Preliminary: q-gram
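• One edit operation destroys at most q q-grams.
Example: substituting b with d turns youtbecom into youtdecom, whose 2-grams are yo, ou, ut, td, de, ec, co, om; only tb and be are destroyed.
• τ edit operations destroy at most qτ q-grams.
• If r and s have more than qτ mismatched q-grams, ED(r, s) > τ.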
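The bound above can be illustrated by counting the q-grams of r that have no counterpart in s. This is a rough sketch that treats q-grams as a multiset and ignores positions; `destroyed_grams` and `count_filter_prunes` are names introduced here, not taken from the slides.

```python
from collections import Counter

def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def destroyed_grams(r, s, q):
    """Number of q-grams of r (counted as a multiset) that are unmatched in s."""
    unmatched = Counter(qgrams(r, q)) - Counter(qgrams(s, q))
    return sum(unmatched.values())

def count_filter_prunes(r, s, q, tau):
    """True when the pair can be discarded: more than q*tau grams are destroyed."""
    return destroyed_grams(r, s, q) > q * tau

print(destroyed_grams("youtbecom", "youtdecom", 2))
print(count_filter_prunes("youtbecom", "imyouteca", 2, 1))
```

With one substitution, only tb and be disappear, which stays within the bound q = 2. The second pair loses four grams, more than qτ = 2, so it cannot be within edit distance 1.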
Preliminary: Prefix Filter
Sort all q-grams by global ordering, such as idf
q(r): The sorted q-gram set of string r, split into Pre(r) and suffix(r)
q(s): The sorted q-gram set of string s
Pre(•) is the prefix of q(•)
|Pre(•)|= qτ+1
Prefix Filter: If pre(r) ∩ pre(s) = ϕ, ED(r,s) > τ
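The prefix filter can be sketched as follows. Several things here are assumptions made for illustration: the global order is frequency ascending with alphabetical tie-breaking (the slides only say "such as idf"), the toy dataset uses strings reconstructed from the indexing example later in the deck, and the helper names are hypothetical.

```python
from collections import Counter

def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def prefix(s, q, tau, freq):
    """First q*tau+1 q-grams of s under the global order (frequency, gram)."""
    ordered = sorted(qgrams(s, q), key=lambda g: (freq.get(g, 0), g))
    return ordered[: q * tau + 1]

def prefix_filter_prunes(r, s, q, tau, freq):
    """True when pre(r) and pre(s) are disjoint, which implies ED(r, s) > tau."""
    return not (set(prefix(r, q, tau, freq)) & set(prefix(s, q, tau, freq)))

# Toy dataset (an assumption, see the note above).
data = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]
freq = Counter(g for r in data for g in qgrams(r, 2))

query = "yotubecom"
print(prefix(query, 2, 2, freq))
print([r for r in data if not prefix_filter_prunes(r, query, 2, 2, freq)])
```

The filter never discards a true result, but surviving strings are only candidates and still need verification with the edit distance.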
Preliminary: Prefix Filter
Sort all q-grams by global ordering, such as idf
Pre(•) is the prefix of q(•)
|Pre(•)|= qτ+1
[Figure: the sorted q-gram sets q(r) and q(s) over grams g1 to g13, with Pre(r), suffix(r), and Pre(s) marked; some grams are annotated ">g10".]
Prefix Filter: If pre(r) ∩ pre(s) = ϕ, ED(r,s) > τ
Preliminary: disjoint q-gram
• One edit operation destroys at most 1 disjoint q-gram.
Example: yo, ut, de, and om are disjoint 2-grams of youtdecom.
• τ edit operations destroy at most τ disjoint q-grams.
• If r and s have more than τ mismatched disjoint q-grams, ED(r, s) > τ.
Pivotal Prefix Filter
Sort all q-grams by global ordering, such as idf
q(r): The sorted q-gram set of string r, split into Pre(r) and suffix(r), with Piv(r) inside Pre(r)
q(s): The sorted q-gram set of string s, with Piv(s) inside Pre(s)
Piv(•) is the pivotal prefix of q(•)
|Piv(•)|= τ+1 and the q-grams in Piv(•) are disjoint
If piv(s) ∩ pre(r) = ϕ and piv(r) ∩ pre(s) = ϕ, ED(r,s) > τ
Pivotal Prefix Filter
Sort all q-grams by global ordering, such as idf
Piv(•) is the pivotal prefix of q(•)
|Piv(•)|= τ+1 and the q-grams in Piv(•) are disjoint
[Figure: the sorted q-gram sets q(r) and q(s) over grams g1 to g13, with Pre, Piv, last, and suffix marked, for the case last(s) > last(r).]
Pivotal Prefix Filter: If last(s) > last(r) and piv(r) ∩ pre(s) = ϕ, ED(r,s) > τ
Pivotal Prefix Filter
Sort all q-grams by global ordering, such as idf
Piv(•) is the pivotal prefix of q(•)
|Piv(•)|= τ+1 and the q-grams in Piv(•) are disjoint
[Figure: the sorted q-gram sets q(r) and q(s) over grams g1 to g13, with Pre, Piv, last, and suffix marked, for the case last(r) > last(s).]
Pivotal Prefix Filter: If last(r) > last(s) and piv(s) ∩ pre(r) = ϕ, ED(r,s) > τ
Pivotal Prefix Filter
If last(r)> last(s) and piv(s) ∩ pre(r) = ϕ, ED(r,s) > τ
If last(s)> last(r) and piv(r) ∩ pre(s) = ϕ, ED(r,s) > τ
• Existence: There must exist τ+1 disjoint grams
in the prefix
• The Pivotal Prefix is a subset of the Prefix
– The pivotal prefix filter dominates the prefix filter
– Signature sizes are O(τ) and O(qτ), respectively
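The two rules above can be written as a single pairwise check. This is a sketch under stated assumptions: the gram order, prefixes, and pivotal prefixes are copied from the τ=2, q=2 example in the indexing and querying slides of this deck; a gram missing from the order (such as ot) is ranked first; and a tie last(r) = last(s) is routed to the second rule, whereas the slides state both rules with strict inequalities.

```python
ORDER = "im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec".split()
RANK = {g: i for i, g in enumerate(ORDER)}

def rank(g):
    return RANK.get(g, -1)  # assumption: unseen grams sort before all others

def pivotal_filter_prunes(pre_r, piv_r, pre_s, piv_s):
    """True when the pair can be discarded, i.e. ED(r, s) > tau is implied."""
    if rank(pre_r[-1]) > rank(pre_s[-1]):      # last(r) > last(s)
        return not (set(piv_s) & set(pre_r))
    return not (set(piv_r) & set(pre_s))       # last(s) >= last(r)

pre_s, piv_s = ["ot", "om", "yo", "ub", "co"], ["ot", "om", "ub"]
pre_r1, piv_r1 = ["im", "my", "te", "ca", "yo"], ["im", "te", "ca"]
pre_r4, piv_r4 = ["tb", "om", "yo", "ou", "ut"], ["tb", "om", "yo"]

print(pivotal_filter_prunes(pre_r1, piv_r1, pre_s, piv_s))
print(pivotal_filter_prunes(pre_r4, piv_r4, pre_s, piv_s))
```

Note that pre(r1) and pre(s) share the gram yo, so the plain prefix filter would keep r1, while the pivotal prefix filter discards it.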
Related Work
Method                  |Sig(r)|   |Sig(s)|
Prefix Filter           O(qτ)      O(qτ)
Mismatch Filter         O(qτ)      O(qτ)
Qchunk Filter           O(τ)       O(l)
Pivotal Prefix Filter   O(τ)       O(qτ)
• Mismatch Filter [Xiao VLDB08] : Shorten prefix length, but still O(qτ)
• Qchunk Filter[Qin SIGMOD11] : Shorten one to O(τ) but increased the other one to O(l)
• Adaptive Prefix[Wang SIGMOD12]
– Increase prefix length to reduce candidate number
– Orthogonal and can be integrated into our method
• Flamingo[Li ICDE08]
– Based on count filter. Accelerating counting process.
– Orthogonal and can be integrated into our method
Pivotal Search Algorithm
• Indexing
– Build inverted indexes for both the prefix and the
pivotal prefix of the data strings
• Querying
– Generate prefix and pivotal prefix for the query string
– Probe the prefix index with the pivotal prefix of the query
– Probe the pivotal prefix index with the prefix of the query
– Verify the candidates and output results
Pivotal Prefix Selection
Evaluating Different Pivotal Prefixes:
The longer the inverted lists we probe, the more
candidates we may have.
For query string: min over piv(s) of Σ_{g ∈ piv(s)} (length of inverted list of g)
For data string: min over piv(r) of Σ_{g ∈ piv(r)} (frequency of g)
Optimal Pivotal Prefix Selection
Dynamic Programming:
Objective: Select m=τ+1 optimal pivotal q-grams from the first n=qτ+1 q-grams in the prefix
Case k=n: select m-1 optimal pivotal q-grams from the first n-1 q-grams in the prefix, and select g_n as the last pivotal q-gram
Case k=n-1: select m-1 optimal pivotal q-grams from the first n-2 q-grams, and select g_{n-1} as the last pivotal q-gram
…
Case k=m: select m-1 optimal pivotal q-grams from the first m-1 q-grams, and select g_m as the last pivotal q-gram
Recursive formula:
f(m, n) = min_{m ≤ k ≤ n} ( weight(g_k) + f(m−1, k−1) ), subject to the selected q-grams being disjoint
weight is the length of the inverted list for the query string and the frequency for a data string
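The selection can be sketched as a small dynamic program. This version is an assumption-laden illustration rather than the authors' exact procedure: grams are processed by start position, disjointness is enforced explicitly by requiring start positions at least q apart, and the recurrence is written in a take-or-skip form. The function name `select_pivotal` is hypothetical.

```python
from bisect import bisect_right

def select_pivotal(grams, m, q):
    """grams: list of (start_position, gram, weight) taken from a prefix.
    Returns (total_weight, chosen_grams) for m pairwise-disjoint grams of
    minimum total weight, or None if no such selection exists."""
    grams = sorted(grams)                       # order by start position
    n = len(grams)
    starts = [g[0] for g in grams]
    # before[j]: number of grams that end before gram j starts
    before = [bisect_right(starts, starts[j] - q) for j in range(n)]
    INF = float("inf")
    # f[i][j]: minimum weight of i disjoint grams among the first j grams
    f = [[0] * (n + 1)] + [[INF] * (n + 1) for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            take = grams[j - 1][2] + f[i - 1][before[j - 1]]
            f[i][j] = min(f[i][j - 1], take)
    if f[m][n] == INF:
        return None
    chosen, i, j = [], m, n                     # walk back through the table
    while i > 0:
        if f[i][j] == f[i][j - 1]:
            j -= 1                              # gram j was skipped
        else:
            chosen.append(grams[j - 1][1])      # gram j was taken
            j, i = before[j - 1], i - 1
    return f[m][n], chosen[::-1]

# Prefix of the data string r3 from the indexing example: (position, gram, frequency)
pre_r3 = [(4, "bb", 1), (8, "ou", 3), (1, "ut", 3), (3, "ub", 3), (7, "co", 3)]
print(select_pivotal(pre_r3, m=3, q=2))
```

The table has (τ+1)·(qτ+1) entries, so the selection cost is small compared with probing the inverted lists.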
Filter-and-Verification Framework
Input: query string s, threshold τ, and an index built on dataset R
Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r becomes a candidate
Verify: does r pass the alignment filter? If yes, ED(r,s) ≤ τ? If yes, r is added to the results
Complexity Improvement:
Improved from O(min(|r|, |s|) · τ) to O(qτ²)
Alignment Filter
Intuition of Alignment Filter:
Suppose that in the best case we need err_i edit operations to transform the pivotal q-gram g_i into a substring of r; then ED(r, s) ≥ Σ_{i=1}^{τ+1} err_i.
If Σ_{i=1}^{τ+1} err_i > τ, ED(r, s) > τ
Alignment Filter
Substring edit distance (sed)
sed(g_i, r) is the minimum edit distance between g_i and any substring of r.
Alignment filter:
If Σ_{i=1}^{τ+1} sed(g_i, r) > τ, ED(r, s) > τ
Alignment Filter
Accelerating Calculation:
• The computation complexity of sed(g_i, r) is O(q|r|).
• By the position filter, g_i can only align to a substring x_i of r where |x_i| < 2τ + q.
• Thus if Σ_{i=1}^{τ+1} sed(g_i, x_i) > τ, ED(r, s) > τ.
• The complexity is reduced to O(qτ).
Complexity Improvement:
Improved from O(min(|r|, |s|) · τ) to O(qτ²)
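The substring edit distance and the alignment filter can be sketched as below. This is a simplified illustration: it scans the whole string r for each pivotal q-gram, so it corresponds to the O(q|r|) variant and omits the position-based window x_i. The names `sed` and `alignment_filter_prunes` are chosen here, and the example strings are reconstructed from the deck's running example.

```python
def sed(g, r):
    """Minimum edit distance between g and any substring of r."""
    # Row 0 is all zeros: the empty part of g matches an empty substring anywhere in r.
    prev = [0] * (len(r) + 1)
    for i, cg in enumerate(g, 1):
        cur = [i]
        for j, cr in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                # drop cg
                           cur[j - 1] + 1,             # extend the substring of r
                           prev[j - 1] + (cg != cr)))  # match or substitute
        prev = cur
    return min(prev)  # the substring may end at any position of r

def alignment_filter_prunes(pivotal_grams, r, tau):
    """True when the summed sed of the disjoint pivotal q-grams exceeds tau."""
    return sum(sed(g, r) for g in pivotal_grams) > tau

piv_s = ["ot", "om", "ub"]  # pivotal prefix of the query yotubecom
print(alignment_filter_prunes(piv_s, "imyouteca", 2))
print(alignment_filter_prunes(piv_s, "youtbecom", 2))
```

Because the pivotal q-grams are disjoint, the edits charged to different grams never overlap, which is what makes the sum a valid lower bound on ED(r, s).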
Experiments
Settings:
C++, g++ 4.8.2 with -O3 flags
64bit Ubuntu Server 12.04 LTS version
Intel Xeon E5-2650 2.00GHz processor and 16GB memory.
Evaluating Pivotal Prefix Filter
Average Search Time
Mismatch: From EDJoin
CrossFilter: Cross Filter
PivotalFilter: PivotalFilter
CrossSelect: CrossFilter + Pivotal Prefix Selection
PivotalSearch: PivotalFilter + Pivotal Prefix Selection
Evaluating Pivotal Prefix Filter
Candidate Number
Mismatch: From EDJoin
CrossFilter: Cross Filter
PivotalFilter: PivotalFilter
CrossSelect: CrossFilter + Pivotal Prefix Selection
PivotalSearch: PivotalFilter + Pivotal Prefix Selection
Evaluating Alignment Filter
Average Search Time
NoFilter: without any filter
ContentFilter: From EDJoin
AlignFilter: Alignment Filter
Evaluating Alignment Filter
Candidate Number
NoFilter: without any filter
ContentFilter: From EDJoin
AlignFilter: Alignment Filter
Real: Number of results
Comparison with the State of the Art
PivotalSearch: Our method
Adaptive: [Wang2012]
Flamingo: [Li2008]
Qchunk: [Qin2011]
Scalability
Conclusion
• Pivotal prefix filter
• Pivotal search algorithm
• Optimal pivotal prefix selection
• Alignment filter
Project homepage: http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html
THANK YOU
Q&A
Outline
• Problem Definition
• Pivotal Prefix Filter
• The Similarity Search Algorithm
• Alignment Filter
• Experiment
• Conclusion
Complexity
• Space Complexity: 𝑂(𝑞τ|𝑅|)
• Time Complexity:
Pivotal Prefix Selection
Existence of Pivotal Prefix:
There must exist at least τ+1 disjoint q-grams in the prefix pre(r) for any string r
Evaluating Different Pivotal Prefixes:
The longer the inverted lists we scan, the larger the filtering cost is and the smaller the pruning power is.
For query string: min over piv(s) of Σ_{g ∈ piv(s)} |I+[g]|
For data string: min over piv(r) of Σ_{g ∈ piv(r)} frequency[g]
Complexity
• Space Complexity:
– Prefix Inverted Index Size: O((qτ + 1)|R|)
– Pivotal Prefix Inverted Index Size: O((τ + 1)|R|)
• Query Time Complexity:
– Preprocess Query s: O(|s| + |s| log |s| + qτ)
– Probing Inverted Indexes: O(qτ·l_p), where l_p is the average length of the probed prefix inverted lists
• Verification Complexity: O(cτl), where c is the number of candidates and l is the average string length
Alignment Filter
non-consecutive errors:
youtubecom
yoytupecxm
q=3, the 3 non-consecutive errors destroy 8 q-grams
consecutive errors:
youtubecom
youtzpxcom
q=3, the 3 consecutive errors only destroy 5 q-grams
Indexing
• Fix a global gram order (τ=2, q=2)
We use gram frequency ascending order
Global gram order (gram: frequency):
im: 1, my: 1, te: 1, bu: 1, un: 1, nt: 1, uc: 1, bb: 1, tb: 1, oy: 1, yt: 1,
ca: 2, om: 2,
yo: 3, ou: 3, ut: 3, ub: 3, co: 3, tu: 3, be: 3,
ec: 4
Indexing
• Build inverted indexes for prefix and pivotal prefix (τ=2, q=2)
Split each string into q-grams and sort the q-grams by the global gram order; pre(ri) is the first qτ+1 = 5 q-grams and last(pre(ri)) is the 5th:
q(r1): {im, my, te, ca, yo, ou, ut, ec}
q(r2): {bu, un, nt, uc, om, ub, co, tu}
q(r3): {bb, ou, ut, ub, co, tu, be, ec}
q(r4): {tb, om, yo, ou, ut, co, be, ec}
q(r5): {oy, yt, ca, yo, ub, tu, be, ec}
Indexing
• Build inverted indexes for prefix and pivotal prefix
Each entry <ri, p> records that the q-gram starts at position p of string ri.
Pivotal Prefix Index (inverted index I-):
im: <r1,1>
te: <r1,6>
bu: <r2,2>
nt: <r2,4>
uc: <r2,6>
tb: <r4,4>
yt: <r5,3>
ca: <r1,8> <r5,8>
om: <r4,8>
yo: <r4,1> <r5,1>
ou: <r3,8>
ut: <r3,1>
ub: <r3,3>
Prefix Index (inverted index I+):
im: <r1,1>
my: <r1,2>
te: <r1,6>
bu: <r2,2>
un: <r2,3>
nt: <r2,4>
uc: <r2,6>
bb: <r3,4>
tb: <r4,4>
oy: <r5,2>
yt: <r5,3>
ca: <r1,8> <r5,8>
om: <r2,8> <r4,8>
yo: <r1,3> <r4,1> <r5,1>
ou: <r3,8> <r4,2>
ut: <r3,1> <r4,3>
ub: <r3,3> <r5,5>
co: <r3,7>
Querying
• Generate prefix and pivotal prefix for the query string
s: yotubecom
pre(s): {ot, om, yo, ub, co}
piv(s): {ot, om, ub}
last(pre(s)): co
Querying
• Probe the prefix index with the pivotal prefix of the query
• Probe the pivotal prefix index with the prefix of the query
[Figure: for s: yotubecom, after preprocessing, piv(s) = {ot, om, ub} probes the prefix index (inverted index I+) and pre(s) = {ot, om, yo, ub, co} probes the pivotal prefix index (inverted index I-).]
Querying
• Verify the candidates and output results
Candidates: r3, r4, r5
Verify the candidates with ED(r, s) ≤ τ
Result: r4
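The querying walkthrough above can be replayed with a short end-to-end sketch. The following are assumptions made for illustration: the data strings are reconstructed from the positional q-grams in the index slides; prefixes and pivotal prefixes are hard-coded from those slides rather than recomputed; a query gram missing from the global order (ot) is ranked first; and a tie between last(pre(r)) and last(pre(s)) is handled by the pivotal prefix index probe. Positions are ignored, and all function names are hypothetical.

```python
from collections import defaultdict

TAU = 2
ORDER = "im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec".split()
RANK = {g: i for i, g in enumerate(ORDER)}

# Assumption: strings reconstructed from the <ri, position> entries in the slides.
STRINGS = {"r1": "imyouteca", "r2": "ubuntucom", "r3": "utubbecou",
           "r4": "youtbecom", "r5": "yoytubeca"}
PRE = {"r1": ["im", "my", "te", "ca", "yo"], "r2": ["bu", "un", "nt", "uc", "om"],
       "r3": ["bb", "ou", "ut", "ub", "co"], "r4": ["tb", "om", "yo", "ou", "ut"],
       "r5": ["oy", "yt", "ca", "yo", "ub"]}
PIV = {"r1": ["im", "te", "ca"], "r2": ["bu", "nt", "uc"], "r3": ["ou", "ut", "ub"],
       "r4": ["tb", "om", "yo"], "r5": ["yt", "ca", "yo"]}

def rank(g):
    return RANK.get(g, -1)  # unseen grams sort first

def edit_distance(r, s):
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cr != cs)))
        prev = cur
    return prev[-1]

def build_index(signatures):
    """Inverted index: gram -> ids of the strings whose signature contains it."""
    index = defaultdict(list)
    for rid, grams in signatures.items():
        for g in grams:
            index[g].append(rid)
    return index

def search(s, pre_s, piv_s):
    prefix_index, pivotal_index = build_index(PRE), build_index(PIV)
    last_s = rank(pre_s[-1])
    candidates = set()
    for g in piv_s:  # probe the prefix index with the pivotal prefix of the query
        for rid in prefix_index.get(g, []):
            if rank(PRE[rid][-1]) > last_s:
                candidates.add(rid)
    for g in pre_s:  # probe the pivotal prefix index with the prefix of the query
        for rid in pivotal_index.get(g, []):
            if rank(PRE[rid][-1]) <= last_s:
                candidates.add(rid)
    results = [rid for rid in sorted(candidates)
               if edit_distance(STRINGS[rid], s) <= TAU]
    return sorted(candidates), results

candidates, results = search("yotubecom",
                             ["ot", "om", "yo", "ub", "co"], ["ot", "om", "ub"])
print(candidates, results)
```

Under these assumptions the sketch yields the same candidates and result as the slide. In this example every data prefix ends at or before co in the global order, so all candidates come from the pivotal prefix index probe.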