Efficient Algorithms for Approximate Member Extraction
Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists
Jialong Han
Co-authored with Jiaheng Lu, Xiaofeng Meng
Renmin University of China
Introduction: An Example
A dictionary of strings we are interested in
E.g. product names, postal addresses…
We are going to locate their “approximate appearances” in a series of documents.
See the meaning of “approximate appearance” in the following example:
Jiaheng Lu, Jialong Han, Xiaofeng Meng
Problem Definition
Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k).
Here we call r a piece of evidence for m.
Similarity() is a function measuring the similarity of two strings
Strings are viewed as sets of tokens (words)
An example for Sim(): Jaccard similarity: J(r, m) = wt(r ∩ m) / wt(r ∪ m)
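As a minimal sketch of the weighted Jaccard similarity above, the Python below treats each string as a set of whitespace-separated tokens. The names `wt` and `jaccard` are illustrative, and the token weights are borrowed from the camera example later in this deck.

```python
# Token weights borrowed from the camera example in this deck.
WEIGHTS = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
           "slr": 2.0, "eos": 7.0, "5d": 9.0}


def wt(tokens, weights):
    """Total weight of a set of tokens."""
    return sum(weights[t] for t in tokens)


def jaccard(r, m, weights):
    """Weighted Jaccard similarity: wt(r ∩ m) / wt(r ∪ m)."""
    r_tokens, m_tokens = set(r.split()), set(m.split())
    union_weight = wt(r_tokens | m_tokens, weights)
    if union_weight == 0:
        return 0.0  # two empty strings: define the similarity as 0
    return wt(r_tokens & m_tokens, weights) / union_weight


score = jaccard("canon eos 5d digital camera", "canon eos digital camera", WEIGHTS)
print(score)  # shared weight 11 over union weight 20
```

Only the missing token “5d” separates the two strings, but its high weight pulls the score well below a threshold such as δ = 0.8.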
Outline
Introduction
State-of-the-art techniques
The filtration-verification framework
K-signature scheme
Inverted Signature-based Hashtable
Our algorithms and evaluations
Conclusion
Why pre-pruning is needed
We need to spot evidence to decide whether a substring m should be extracted
Simple verification against all dictionary strings may be inefficient
Pre-pruning followed by post-verification is beneficial
But should it be running-speed-oriented or filtering-power-oriented?
Less time or fewer survivors?
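To make the cost concrete, here is a hedged sketch of extraction with no pre-pruning at all: every token window of the document is verified against every dictionary string. The function `naive_extract`, the `max_len` cap, and the default weight for unknown tokens are assumptions made for illustration, not part of the method presented in this deck.

```python
def weight(tokens, weights, default=1.0):
    # Unknown tokens get a default weight (an assumption for this sketch).
    return sum(weights.get(t, default) for t in tokens)


def jaccard(a, b, weights):
    """Weighted Jaccard similarity of two token sets."""
    union = weight(a | b, weights)
    return weight(a & b, weights) / union if union else 0.0


def naive_extract(doc_tokens, dictionary, weights, delta, max_len):
    """Verify every window of up to max_len tokens against every dictionary string."""
    dict_sets = [(r, set(r.split())) for r in dictionary]
    results = []
    for start in range(len(doc_tokens)):
        last = min(start + max_len, len(doc_tokens))
        for end in range(start + 1, last + 1):
            window = doc_tokens[start:end]
            m_set = set(window)
            for r, r_set in dict_sets:  # no filter: one verification per (window, r)
                if jaccard(r_set, m_set, weights) >= delta:
                    results.append((" ".join(window), r))
    return results


weights = {"canon": 2.0, "slr": 2.0, "camera": 1.0}
doc = "buy canon slr camera today".split()
matches = naive_extract(doc, ["canon slr camera"], weights, delta=0.9, max_len=5)
print(matches)
```

The number of verifications grows with the number of windows times the dictionary size, which is the cost a filtration stage is meant to cut down.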
The issue of compromise comes again
A balance between the two stages should be reached:
More (less) filtration time → Stronger (weaker) filtration power → Fewer (more) candidates → Less (more) verification time
Overall performance = Tf + Tv ?
K-signature scheme
K-signature scheme
Proposed by Chakrabarti et al. (SIGMOD 2008)
Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)
Observation: if r cannot match m, r is likely to have insufficient signature overlap with m
K is a parameter for tuning filtration power
Potential evidence loss
A counter-example was found when k=3
We tried, and could only prove that it works for k=1 and k=∞
Inverted Signature-based Hashtable
Proposed by Chakrabarti et al. (SIGMOD 2008)
Each dictionary string is encoded into a solid 0-1 matrix
A ‘1’ for each occurrence of a <token, sig-token> tuple (‘1’-rectangle)
Bitwise-OR all solid matrices to get the matrix of R
Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.
Formalized into an NP-complete problem
The solution yields filtering power that is too weak
Outline
Introduction
State-of-the-art techniques
Our algorithms and evaluations
Corrected filtering conditions
EvSCAN: Filtration by SIL
EvITER: Incremental optimization on EvSCAN
Supporting Dynamic Thresholds
Conclusion
Our proposed theorem
If Sim(m, r) ≥ δ, what do we have?
wt(Sig(m)∩Sig(r)) ≥ τ(m): too strict!
wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)}: proved by us
So the threshold does not remain constant: it involves the unknown evidence r
Our solution: use inverted lists to count sig-token overlaps.
Note that sig-tokens usually have low document frequency (e.g. with IDF as weights)
Signature-based Inverted Lists
Lists indexed by sig-tokens
Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.
E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }.
wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).
Lists (sig-token, weight → ids of strings):
5d, 9.0 → 1
canon, 2.0 → 1, 3
camera, 1.0 → 2
eos, 7.0 → 1
nikon, 2.0 → 2
slr, 2.0 → 2, 3
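A minimal sketch of building such lists follows. The signature sets are read off the example lists as reconstructed from the slide; how Sig(r) is chosen is not spelled out here, so they are treated as given input. The name `build_sil` is hypothetical.

```python
from collections import defaultdict

# Signature tokens per dictionary string, read off the example lists
# (treated as given input, not derived in this sketch).
SIGNATURES = {
    1: ["5d", "eos", "canon"],      # r1 = "canon eos 5d digital camera"
    2: ["nikon", "slr", "camera"],  # r2 = "nikon digital slr camera"
    3: ["canon", "slr"],            # r3 = "canon slr camera"
}


def build_sil(signatures):
    """Map each sig-token to the ids of the strings that use it as a signature."""
    lists = defaultdict(list)
    for rid in sorted(signatures):
        for token in signatures[rid]:
            lists[token].append(rid)  # one node per (sig-token, rid) pair
    return dict(lists)


sil = build_sil(SIGNATURES)
for token in sorted(sil):
    print(token, sil[token])
```

Tokens that are never a signature (here “digital”) get no list at all, which keeps the index small.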
Filtration by SIL
Using an array called “accumulator” to compute the overlapped sig weight wt(Sig(m)∩Sig(r))
E.g. m = “canon eos digital camera”, δ = 0.8
Lists (sig-token, weight → ids of strings):
5d, 9.0 → 1
canon, 2.0 → 1, 3
camera, 1.0 → 2
eos, 7.0 → 1
nikon, 2.0 → 2
slr, 2.0 → 2, 3
Accumulator:
rid   wt(Sig(m)∩Sig(r))   min{τ(m),τ(r)}
1     9.0                 6.8              Qualified!
2     0                   3.8
3     2.0                 3
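The accumulator step can be sketched as follows, under stated assumptions: the lists and the per-string thresholds min{τ(m), τ(r)} are copied from the example rather than computed, Sig(m) is taken to be {eos, canon} (consistent with the accumulated weight 9.0 for r1), and `filter_candidates` is an illustrative name.

```python
# Signature-based inverted lists from the example: sig-token -> (weight, string ids).
SIL = {
    "5d": (9.0, [1]),
    "canon": (2.0, [1, 3]),
    "camera": (1.0, [2]),
    "eos": (7.0, [1]),
    "nikon": (2.0, [2]),
    "slr": (2.0, [2, 3]),
}

# min{tau(m), tau(r)} per dictionary string, copied from the example (not computed).
THRESHOLDS = {1: 6.8, 2: 3.8, 3: 3.0}


def filter_candidates(sig_m, sil, thresholds):
    """Accumulate wt(Sig(m) ∩ Sig(r)) per rid, then keep rids that reach their threshold."""
    accumulator = {}
    for token in sig_m:
        if token not in sil:
            continue  # token is not a signature of any dictionary string
        token_weight, rids = sil[token]
        for rid in rids:
            accumulator[rid] = accumulator.get(rid, 0.0) + token_weight
    candidates = sorted(rid for rid, w in accumulator.items() if w >= thresholds[rid])
    return accumulator, candidates


# Assumed signature of m = "canon eos digital camera" at delta = 0.8.
accumulator, candidates = filter_candidates(["eos", "canon"], SIL, THRESHOLDS)
print(accumulator)  # r1 accumulates 7.0 + 2.0, r3 accumulates 2.0
print(candidates)   # only r1 passes the filter
```

Passing the filter only makes r1 a candidate. Under weighted Jaccard, J(r1, m) = 11/20 = 0.55, so the verification stage would still reject it at δ = 0.8.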
EvITER: Progressive Computation
Recall that we are checking all substrings
Some of them are quite similar, indicating that they share duplicate computation
An intuition: if m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r
Formally we proved that
Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t}
We have ES(m·t) ⊆ ES(m) ∪ list[t]
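The containment above suggests an incremental step, sketched here with hypothetical names (`extend_evidence`, `passes_filter`): the candidates for the substring extended by t are drawn only from ES(m) and list[t] instead of the whole dictionary. The ids reuse the numbers from the lens example in this deck.

```python
def extend_evidence(es_m, token_lists, t, passes_filter):
    """Return the evidence set of m extended by t, searching only ES(m) ∪ list[t]."""
    pool = set(es_m) | set(token_lists.get(t, ()))
    return {rid for rid in pool if passes_filter(rid)}


# Ids from the lens example: ES(m) = {r1}, list["lens"] = {r22, r53}.
token_lists = {"lens": [22, 53]}
es_m = {1}

# With a filter that accepts everything, the result is the whole candidate pool.
pool = extend_evidence(es_m, token_lists, "lens", passes_filter=lambda rid: True)
print(sorted(pool))  # [1, 22, 53]
```

In a real run `passes_filter` would be the signature-overlap test for the extended substring; the point of the sketch is only that the pool it examines stays small.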
Example
Document M: “…. cannon eos digital camera lens…”
m = “cannon eos digital camera”, t = “lens”
ES(m) = {r1}
List[t]: lens, 3.0 → 22, 53
We know that only r1, r22 and r53 can possibly match “cannon eos digital camera lens”
Flow of Evidence
EvITER for “Evidence ITERATION”
…
The Static Threshold Problem
How does this index work so far?
-“Get ready for δ=0.8 please.”
-“Please wait 30min for index generation…”
-“Ready!”
-“Document M1, δ=0.8. Go!”
-“…Extraction complete.”
-“Document M2, and I want δ=0.9…”
-“Sorry, please wait another 30min for index regeneration…”
-“:-(”
The Static Threshold Problem
This One Seems Better
-“Get ready for δ>=0.8 please.”
-“Please wait 30min for index generation…”
-“Ready!”
-“Document M1, δ=0.8. Go!”
-“…Extraction complete.”
-“Document M2, and I want δ=0.9…”
-“…Extraction complete.”
-“:-)”
Supporting Dynamic Thresholds
An Observation
When δ descends, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking.
I.e. any node <sig-token, rid> is “active” when δ is below a certain “threshold” u<sig-token, rid>.
We record u<sig-token, rid> in each node and sort all nodes in each list in descending order of their u value.
For any given δ, we only need to retrieve a prefix of each list to get all “active nodes”
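The prefix retrieval can be sketched with a binary search over the sorted u values. The class name `ThresholdedList`, the sample u values, and the choice to treat a node as active when u is strictly greater than δ are all assumptions made for this illustration.

```python
from bisect import bisect_left


class ThresholdedList:
    """One inverted list whose nodes (u, rid) are kept in descending order of u."""

    def __init__(self, nodes):
        self.nodes = sorted(nodes, key=lambda node: -node[0])
        # Negated u values are ascending, so bisect can locate the end of the prefix.
        self._neg_u = [-u for u, _ in self.nodes]

    def active_rids(self, delta):
        """Return rids of nodes with u > delta, which always form a prefix of the list."""
        end = bisect_left(self._neg_u, -delta)
        return [rid for _, rid in self.nodes[:end]]


# Hypothetical u values for the list of one sig-token.
slr_list = ThresholdedList([(0.85, 2), (0.95, 1), (0.90, 3)])
print(slr_list.active_rids(0.9))  # [1]
print(slr_list.active_rids(0.8))  # [1, 3, 2]
```

Because one sorted list serves every δ above the value the index was built for, changing δ between documents costs a binary search per list rather than an index rebuild.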
Experimental Datasets
DBLP: 274,788 Paper titles
1,838,973 URLs
Balance should be reached
Recall our two stages of filtration and verification
Performance (DBLP)
Conclusion
Our method causes no false negatives
Our method achieves a good balance between the two phases of filtration and verification
We also propose EvITER to eliminate duplicate computation
Our method is both effective and efficient
References
[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006.
[2] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008.
[3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006.
[4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
[5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.
[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.
[7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257-266, 2008.
[8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable length grams. In VLDB, 2007.
[9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88, 2001.
[10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004.
[11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001.
[12] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In ESA, pages 327-340, 1995.
[13] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.