Efficient Algorithms for Approximate Member Extraction


Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists
Jialong Han
Co-authored with Jiaheng Lu, Xiaofeng Meng
Renmin University of China
Introduction: An Example

A dictionary of strings we are interested in
  - E.g. product names, postal addresses…
We are going to locate their “approximate appearances” in a series of documents.
  - See the meaning of “approximate appearance” in the following example:
Jiaheng Lu, Jialong Han, Xiaofeng Meng
Problem Definition

Given a dictionary R and a threshold δ, extract all proper substrings m from the input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k).
  - Here we call r a piece of evidence for m.
  - Similarity() is a function measuring the similarity of two strings.

Strings are viewed as sets of tokens (words).
An example for Sim(): Jaccard similarity: J(r, m) = wt(r ∩ m) / wt(r ∪ m)
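The weighted Jaccard similarity can be sketched as follows. This is only an illustration: whitespace tokenization and the function names are assumptions, and the token weights are borrowed from the dictionary example used later in the talk.

```python
def wt(tokens, weights):
    """Total weight of a set of tokens."""
    return sum(weights[t] for t in tokens)


def jaccard(r, m, weights):
    """Weighted Jaccard similarity J(r, m) = wt(r ∩ m) / wt(r ∪ m)."""
    r_set, m_set = set(r.split()), set(m.split())
    union_weight = wt(r_set | m_set, weights)
    if union_weight == 0:
        return 0.0
    return wt(r_set & m_set, weights) / union_weight


# Token weights from the dictionary example in the talk.
WEIGHTS = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2,
           "slr": 2, "eos": 7, "5d": 9}

r1 = "canon eos 5d digital camera"
m = "canon eos digital camera"
print(jaccard(r1, m, WEIGHTS))  # 11 / 20 = 0.55
```

Because the missing token “5d” carries a high weight, the similarity drops well below 1 even though four of the five tokens agree.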
Outline


Introduction
State-of-the-art techniques
  - The filtration-verification framework
  - K-signature scheme
  - Inverted Signature-based Hashtable
Our algorithms and evaluations
Conclusion
Why pre-pruning is needed

We need to spot evidence to decide whether a substring m should be extracted
  - Simple verification against all dictionary strings may be inefficient
  - Pre-pruning followed by post-verification is beneficial
  - But should it be running-speed-oriented or filtering-power-oriented?

Less time or fewer survivors?
The issue of compromise comes again

A balance between the two stages should be reached:
  - More (less) filtration time → strong (weak) filtration power → fewer (more) candidates → less (more) verification time
  - Overall performance = Tf + Tv ?
K-signature scheme

K-signature scheme
  - Proposed by Chakrabarti et al. (SIGMOD 2008)
  - Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)
  - Observation: if r cannot match m, r is likely to have insufficient signature overlap with m
  - K is a parameter for tuning the filtration power

Potential evidence loss
  - A counter-example was found when k=3
  - We tried, and could only prove that it works for k=1 and k=∞
Inverted Signature-based Hashtable

Proposed by Chakrabarti et al. (SIGMOD 2008)
  - Each dictionary string is encoded into a solid 0-1 matrix
  - A ‘1’ for each occurrence of a <token, sig-token> tuple (a ‘1’-rectangle)
  - Bitwise-OR all solid matrices to get the matrix of R

Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.
Formalized into an NP-complete problem
  - The solution yields filtering power that is too weak
Outline



Introduction
State-of-the-art techniques
Our algorithms and evaluations
  - Corrected filtering conditions
  - EvSCAN: Filtration by SIL
  - EvITER: Incremental optimization on EvSCAN
  - Supporting Dynamic Thresholds
Conclusion
Our proposed theorem

If Sim(m, r) ≥ δ, what do we have?
  - wt(Sig(m)∩Sig(r)) ≥ τ(m): too strict!
  - wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)}: proved by us

So the threshold does not remain constant
  - It involves the unknown evidence r

Our solution: use inverted lists to count sig-token overlaps.
  - Note that sig-tokens usually have low document frequency (e.g. with IDF as weights)
Signature-based Inverted Lists
  - Lists indexed by sig-tokens
  - Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.

E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }.
wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).

sig-token, weight    rids in list
5d, 9.0              1
canon, 2.0           1, 3
camera, 1.0          2
eos, 7.0             1
nikon, 2.0           2
slr, 2.0             2, 3
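A minimal sketch of how such lists could be built is given here. The signature sets are read off the example lists (following our reading of the slide layout); the rule that selects them is not reproduced, and `build_sil` is a hypothetical name.

```python
from collections import defaultdict

# Signature sets as read off the example lists; how they are chosen
# is outside the scope of this sketch.
SIGNATURES = {
    1: {"5d", "eos", "canon"},       # r1 = "canon eos 5d digital camera"
    2: {"nikon", "slr", "camera"},   # r2 = "nikon digital slr camera"
    3: {"canon", "slr"},             # r3 = "canon slr camera"
}


def build_sil(signatures):
    """Map each sig-token to the list of ids of strings having it as a signature."""
    sil = defaultdict(list)
    # Visiting ids in ascending order keeps every list sorted by rid.
    for rid in sorted(signatures):
        for token in signatures[rid]:
            sil[token].append(rid)
    return dict(sil)


sil = build_sil(SIGNATURES)
print(sil["canon"])  # [1, 3]
print(sil["slr"])    # [2, 3]
```

Tokens that are never a signature, such as “digital” here, get no list at all.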
Filtration by SIL

Using an array called “accumulator” to compute the overlapped sig weight wt(Sig(m)∩Sig(r))

E.g. m = “canon eos digital camera”, δ = 0.8

sig-token, weight    rids in list
5d, 9.0              1
canon, 2.0           1, 3
camera, 1.0          2
eos, 7.0             1
nikon, 2.0           2
slr, 2.0             2, 3

Accumulator
rid    wt(Sig(m)∩Sig(r))    min{τ(m),τ(r)}
1      9.0                  6.8               Qualified!
2      0                    3.8
3      2.0                  3
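The accumulator scan can be sketched as below. Several things are assumptions: the function name `evscan_filter`, the reading Sig(m) = {canon, eos} (inferred from the accumulated weights 9.0 and 2.0), and the assignment of rids to lists. The min{τ(m), τ(r)} values are copied from the table rather than computed.

```python
WEIGHTS = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2,
           "slr": 2, "eos": 7, "5d": 9}
SIL = {"5d": [1], "canon": [1, 3], "camera": [2],
       "eos": [1], "nikon": [2], "slr": [2, 3]}
# min{tau(m), tau(r)} per rid, copied from the example table.
MIN_TAU = {1: 6.8, 2: 3.8, 3: 3.0}


def evscan_filter(sig_m, sil, weights, min_tau):
    """Accumulate wt(Sig(m) ∩ Sig(r)) per rid and keep rids that reach their threshold."""
    accumulator = {}
    for token in sig_m:
        # Every node in this token's list shares the sig-token with m.
        for rid in sil.get(token, []):
            accumulator[rid] = accumulator.get(rid, 0.0) + weights[token]
    candidates = sorted(rid for rid, w in accumulator.items() if w >= min_tau[rid])
    return candidates, accumulator


candidates, acc = evscan_filter({"canon", "eos"}, SIL, WEIGHTS, MIN_TAU)
print(sorted(acc.items()))  # [(1, 9.0), (3, 2.0)]
print(candidates)           # [1]
```

Only r1 survives filtration; r2 is never touched because it shares no sig-token with m, which is where the inverted lists save work.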
EvITER: Progressive Computation

Recall we are checking all substrings
  - Some of them are quite similar, indicating that they share duplicate computation
  - An intuition: if m has potential evidence r, then m ∪ {t} (m extended by the next token t) is very likely to match r

Formally we proved that:
Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t}.
We have ES(m ∪ {t}) ⊆ ES(m) ∪ list[t]
Example

Document M: “…. canon eos digital camera lens…”
  - m = “canon eos digital camera”, t = “lens”
  - ES(m) = {r1}
  - List[t]: lens, 3.0 → 22, 53

We know that only r1, r22 and r53 can possibly match “canon eos digital camera lens”
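The containment used in this example can be sketched as a small incremental step. The function names and the `still_possible` predicate are hypothetical placeholders for whatever filtering condition is applied; only the set relation itself comes from the talk.

```python
def extend_candidates(es_m, t, lists):
    """Superset of ES(m extended by t): ES(m) ∪ list[t]."""
    return set(es_m) | set(lists.get(t, ()))


def iterate_evidence(tokens, lists, still_possible):
    """Grow m one token at a time, re-checking only ES(m) ∪ list[t] at each step."""
    es, m = set(), []
    for t in tokens:
        m.append(t)
        # Candidates outside ES(m) ∪ list[t] cannot become evidence, so skip them.
        es = {rid for rid in extend_candidates(es, t, lists)
              if still_possible(rid, m)}
        yield tuple(m), es


LISTS = {"lens": [22, 53]}  # list[t] for t = "lens", as in the example
print(sorted(extend_candidates({1}, "lens", LISTS)))  # [1, 22, 53]
```

The design point is that the candidate set for the longer substring is derived from the previous one instead of being recomputed from scratch.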
Flow of Evidence

EvITER stands for “Evidence ITERATION”
…
The Static Threshold Problem

How does this index work so far?

- “Get ready for δ = 0.8, please.”
- “Please wait 30 min for index generation…”
- “Ready!”
- “Document M1, δ = 0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ = 0.9…”
- “Sorry, please wait another 30 min for index regeneration…”
- “:-(”
The Static Threshold Problem

This One Seems Better

- “Get ready for δ ≥ 0.8, please.”
- “Please wait 30 min for index generation…”
- “Ready!”
- “Document M1, δ = 0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ = 0.9…”
- “…Extraction complete.”
- “:-)”
Supporting Dynamic Thresholds

An observation
  - When δ descends, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking.
  - I.e. any node <sig-token, rid> is “active” when δ is below a certain “threshold” u<sig-token, rid>.

We record u<sig-token, rid> in each node and sort all nodes in each list in descending order of their u value.
For any given δ, we only need to retrieve a prefix of each list to get all “active nodes”.
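The prefix retrieval can be sketched with a binary search over one list. The u values and rid 7 are made up for illustration, `active_prefix` is a hypothetical name, and treating a node as active only when δ is strictly below u is an assumption about the boundary case.

```python
from bisect import bisect_left

# One inverted list as (u, rid) nodes, sorted by descending threshold u.
# The u values are invented for this sketch.
CANON_LIST = [(0.95, 1), (0.85, 3), (0.60, 7)]


def active_prefix(nodes, delta):
    """Return the rids of the nodes with u > delta, which form a prefix of the list."""
    # Negate u so that the keys are ascending, as bisect requires.
    # A real index would precompute these keys once per list.
    keys = [-u for u, _ in nodes]
    cut = bisect_left(keys, -delta)
    return [rid for _, rid in nodes[:cut]]


print(active_prefix(CANON_LIST, 0.8))  # [1, 3]
print(active_prefix(CANON_LIST, 0.9))  # [1]
```

One index built this way serves every δ at or above the smallest u stored, so changing the threshold between documents needs no regeneration.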
Experimental Datasets


DBLP: 274,788 Paper titles
1,838,973 URLs
Balance should be reached

Recall our two stages of filtration and verification
Performance (DBLP)
Conclusion




Our method causes no false negatives.
Our method achieves a good balance between the two phases of filtration and verification.
We also propose EvITER to eliminate duplicate computation.
Our method is both effective and efficient.
References






[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006.
[2] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008.
[3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006.
[4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
[5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.
[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.
[7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257-266, 2008.
[8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
[9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88, 2001.
[10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004.
[11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001.
[12] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In ESA, pages 327-340, 1995.
[13] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.