Efficient Algorithms for Approximate Member Extraction

Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists
Jialong Han
Co-authored with Jiaheng Lu, Xiaofeng Meng
Renmin University of China
Introduction: An Example

A dictionary of strings we are interested in
- E.g. product names, postal addresses…
We are going to locate their “approximate appearances” in a series of documents.
- See the meaning of “approximate appearance” in the following example:
Jiaheng Lu, Jialong Han, Xiaofeng Meng
Problem Definition

Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k).
- Here we call r a piece of evidence for m.
- Similarity() is a function measuring the similarity of two strings.

Strings are viewed as sets of tokens (words).
An example for Sim(): Jaccard similarity: J(r, m) = wt(r ∩ m) / wt(r ∪ m), where wt() is the total weight of a token set.
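The weighted Jaccard similarity above can be sketched in a few lines of Python. This is only an illustration: the function names `wt` and `jaccard` are chosen here, and the token weights are borrowed from the camera example used later in the talk.

```python
def wt(tokens, weights):
    """Total weight of a collection of distinct tokens."""
    return sum(weights[t] for t in tokens)


def jaccard(r, m, weights):
    """Weighted Jaccard similarity J(r, m) = wt(r ∩ m) / wt(r ∪ m)."""
    r, m = set(r), set(m)
    union = wt(r | m, weights)
    return wt(r & m, weights) / union if union else 0.0


# Token weights from the camera example in the talk.
weights = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2,
           "slr": 2, "eos": 7, "5d": 9}

r1 = "canon eos 5d digital camera".split()
m = "canon eos digital camera".split()
print(jaccard(r1, m, weights))  # shared weight 11 over union weight 20
```

Treating strings as token sets means word order and repeated words are ignored; only the weights of shared and distinct tokens matter.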
Outline


Introduction
State-of-the-art techniques
- The filtration-verification framework
- K-signature scheme
- Inverted Signature-based Hashtable
Our algorithms and evaluations
Conclusion
Why pre-pruning is needed

We need to spot evidence to decide whether a substring m should be extracted
- Simple verification against all dictionary strings may be inefficient
- Pre-pruning followed by post-verification is beneficial
- But should pre-pruning be running-speed-oriented or filtering-power-oriented?

Less time or fewer survivors?
The issue of compromise comes again

A balance between the two stages should be reached:
More (less) filtration time -> strong (weak) filtration power -> fewer (more) candidates -> less (more) verification time
Overall performance = Tf + Tv ?
K-signature scheme

K-signature scheme
- Proposed by Chakrabarti et al. (SIGMOD 2008)
- Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)
- Observation: if r cannot match m, r is likely to have insufficient signature overlap with m
- K is a parameter for tuning filtration power

Potential evidence loss
- A counter-example was found for k=3
- We tried, and could only prove that the scheme works for k=1 and k=∞
Inverted Signature-based Hashtable

Proposed by Chakrabarti et al. (SIGMOD 2008)
- Each dictionary string is encoded into a solid 0-1 matrix
- A ‘1’ for each occurrence of a <token, sig-token> tuple (a ‘1’-rectangle)
- Bitwise-OR all solid matrices to get the matrix of R

Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.
Formalized as an NP-complete problem
- The solution results in filtering power that is too weak
Outline



Introduction
State-of-the-art techniques
Our algorithms and evaluations
- Corrected filtering conditions
- EvSCAN: Filtration by SIL
- EvITER: Incremental optimization on EvSCAN
- Supporting Dynamic Thresholds
Conclusion
Our proposed theorem

If Sim(m, r) ≥ δ, what do we have?
wt(Sig(m)∩Sig(r)) ≥ τ(m)   (too strict!)
wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)}   (proved by us)

So the threshold does not remain constant
- It involves the unknown evidence r
Our solution: use inverted lists to count sig-token overlaps.
- Note that sig-tokens usually have low document frequency (e.g. with IDF as weights)
Signature-based Inverted Lists
- Lists are indexed by sig-tokens
- Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.

E.g. R = {r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera”}.
wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).

[Figure: the inverted lists for R. Each list head is “sig-token, weight”; each node is a string id.]
5d, 9.0: 1
canon, 2.0: 1
camera, 1.0: 2
eos, 7.0: 1
nikon, 2.0: 2
slr, 2.0: 2
Two further nodes holding id 3 also appear in the figure.
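A minimal sketch of how such lists might be built is given below. The function name `build_sil` is chosen here, and the signature sets are assumptions made for illustration: the talk does not spell out Sig(r) for each string, and in particular the choice Sig(r3) = {canon, slr} is a guess.

```python
from collections import defaultdict


def build_sil(signatures):
    """Build signature-based inverted lists.

    signatures maps a string id to the set of its sig-tokens.
    Returns a dict mapping each sig-token to a sorted list of string ids.
    """
    lists = defaultdict(list)
    for rid in sorted(signatures):
        for token in signatures[rid]:
            lists[token].append(rid)
    return dict(lists)


# ASSUMPTION: illustrative signature sets for r1, r2, r3.
signatures = {
    1: {"canon", "eos", "5d"},
    2: {"nikon", "slr", "camera"},
    3: {"canon", "slr"},
}
sil = build_sil(signatures)
for token in sorted(sil):
    print(token, sil[token])
```

A token that is not a sig-token of any string (here "digital") gets no list at all, which keeps the index small.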
Filtration by SIL

Use an array called the “accumulator” to compute the overlapped sig weight wt(Sig(m)∩Sig(r))

E.g. m = “canon eos digital camera”, δ = 0.8

[Figure: the inverted lists from the previous slide, scanned into the accumulator below.]

Accumulator:
rid   wt(Sig(m)∩Sig(r))   min{τ(m),τ(r)}
1     9.0                 6.8              Qualified!
2     0                   3.8
3     2.0                 3
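The accumulator scan can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `filter_by_sil` is a name chosen here, the inverted lists and Sig(m) = {canon, eos} are assumed, and the per-string thresholds min{τ(m), τ(r)} are simply copied from the slide's example rather than computed.

```python
from collections import defaultdict


def filter_by_sil(sig_m, sil, weights, thresholds):
    """Return {rid: overlap weight} for strings that survive filtration.

    sig_m: sig-tokens of the substring m
    sil: sig-token -> list of string ids
    weights: token -> weight
    thresholds: rid -> min{tau(m), tau(r)} (taken as given here)
    """
    accumulator = defaultdict(float)
    for token in sig_m:
        for rid in sil.get(token, []):
            accumulator[rid] += weights[token]
    return {rid: w for rid, w in accumulator.items() if w >= thresholds[rid]}


weights = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2,
           "slr": 2, "eos": 7, "5d": 9}
# ASSUMPTION: inverted lists and Sig(m) chosen to mirror the example.
sil = {"5d": [1], "canon": [1, 3], "camera": [2],
       "eos": [1], "nikon": [2], "slr": [2, 3]}
sig_m = {"canon", "eos"}
thresholds = {1: 6.8, 2: 3.8, 3: 3.0}  # values read off the slide

candidates = filter_by_sil(sig_m, sil, weights, thresholds)
print(candidates)
```

Only strings that appear in at least one scanned list are ever touched, so r2 is never visited in this example. Survivors are candidates only and still have to pass verification with the real similarity function.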
EvITER: Progressive Computation

Recall that we are checking all substrings
- Some of them are quite similar, so they share duplicate computation
- An intuition: if m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r as well

Formally, we proved the following:
Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t}.
We have ES(m·t) ⊆ ES(m) ∪ list[t]
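The containment property suggests an incremental loop over substrings, sketched below. Several things here are assumptions for illustration: the name `iterate_evidence`, the use of plain unweighted Jaccard with a threshold as a stand-in for the talk's signature-based test of "potential evidence", and the toy document. For this stand-in the containment holds, because adding a token t that r does not contain can only lower J(r, m).

```python
def jaccard(a, b):
    """Unweighted Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def iterate_evidence(doc_tokens, dictionary, delta, max_len):
    """Yield (start, end, ES) for each substring doc_tokens[start:end].

    dictionary maps a string id to its token list. ES for an extended
    substring is computed only from ES(m) ∪ list[t].
    """
    # list[t]: ids of all dictionary strings that contain token t
    token_lists = {}
    for rid, tokens in dictionary.items():
        for t in set(tokens):
            token_lists.setdefault(t, set()).add(rid)

    for start in range(len(doc_tokens)):
        es = set()  # the empty substring has no evidence
        stop = min(start + max_len, len(doc_tokens))
        for end in range(start + 1, stop + 1):
            t = doc_tokens[end - 1]
            extended = doc_tokens[start:end]
            # Only ids in ES(m) ∪ list[t] can be evidence for m·t.
            pool = es | token_lists.get(t, set())
            es = {rid for rid in pool
                  if jaccard(dictionary[rid], extended) >= delta}
            yield start, end, es


dictionary = {
    1: "canon eos 5d digital camera".split(),
    2: "nikon digital slr camera".split(),
    3: "canon slr camera".split(),
}
doc = "canon eos digital camera lens".split()
results = {(s, e): es
           for s, e, es in iterate_evidence(doc, dictionary, 0.5, 5)}
print(results[(0, 4)])  # evidence ids for "canon eos digital camera"
```

Each extension step tests only the small pool ES(m) ∪ list[t] instead of the whole dictionary, which is where the saving over a fresh scan per substring comes from.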
Example

Document M: “…. cannon eos digital camera lens…”
where m = “cannon eos digital camera” and t = “lens”

ES(m) = {r1}
list[t]: lens, 3.0: 22, 53

We know that only r1, r22 and r53 can possibly match “cannon eos digital camera lens”
Flow of Evidence

EvITER stands for “Evidence ITERATION”
…
The Static Threshold Problem

How does this index work so far?

- “Get ready for δ=0.8 please.”
- “Please wait 30 min for index generation…”
- “Ready!”
- “Document M1, δ=0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ=0.9…”
- “Sorry, please wait another 30 min for index regeneration…”
- “:-(”
The Static Threshold Problem

This one seems better

- “Get ready for δ>=0.8 please.”
- “Please wait 30 min for index generation…”
- “Ready!”
- “Document M1, δ=0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ=0.9…”
- “…Extraction complete.”
- “:-)”
Supporting Dynamic Thresholds

An observation
- When δ descends, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking.
- I.e. any node <sig-token, rid> is “active” when δ is below a certain “threshold” u<sig-token, rid>.

We record u<sig-token, rid> in each node and sort all nodes in each list in descending order of their u values.
For any given δ, we only need to retrieve a prefix of each list to get all “active” nodes.
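The prefix retrieval can be sketched with a binary search over one list. Everything concrete here is hypothetical: the class name `ThresholdedList`, the u values in the usage lines, and the choice to treat a node as active only when u is strictly greater than δ (the slides say "below" without settling the boundary case).

```python
from bisect import bisect_left


class ThresholdedList:
    """One inverted list whose nodes are sorted by descending u value."""

    def __init__(self, nodes):
        # nodes: iterable of (u, rid) pairs
        self.nodes = sorted(nodes, key=lambda node: -node[0])
        # Negated u values are ascending, so bisect can search them.
        self._neg_u = [-u for u, _ in self.nodes]

    def active(self, delta):
        """Return the rids of nodes with u > delta (a prefix of the list)."""
        cut = bisect_left(self._neg_u, -delta)
        return [rid for _, rid in self.nodes[:cut]]


# Hypothetical u values for three nodes of a single list.
lst = ThresholdedList([(0.95, 1), (0.85, 3), (0.90, 2)])
print(lst.active(0.8))   # every node is active
print(lst.active(0.87))  # only the two nodes with the largest u
```

Because the active nodes always form a prefix, one index built for the lowest supported δ can serve any higher δ without regeneration.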
Experimental Datasets


DBLP: 274,788 Paper titles
1,838,973 URLs
Balance should be reached

Recall our two stages of filtration and verification
Performance (DBLP)
Conclusion




Our method causes no false negatives
Our method achieves a good balance between the two phases of filtration and verification
We also propose EvITER to eliminate duplicate computation
Our method is both effective and efficient
References






[1] A. Arasu, V. Ganti, R. Kaushik. Efficient exact set-similarity joins.
In VLDB, pages 918-929, 2006.
[2] K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin. An efficient filter
for approximate membership checking. In SIGMOD Conference,
2008.
[3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k
search for dictionary-based entity recognition. In ICDE, page 28,
2006.
[4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for
similarity joins in data cleaning. In ICDE, page 5, 2006.
[5] M. R. Garey and D. S. Johnson. Computers and Intractability:
A Guide to the Theory of NP-Completeness.
[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S.
Muthukrishnan, and D. Srivastava. Approximate string joins in a
database (almost) for free. In VLDB, pages 491-500, 2001.
[7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms
for approximate string searches. In ICDE, pages 257–266, 2008.
[8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of
approximate queries on string collections using variable-length
grams. In VLDB, 2007.
[9] G. Navarro. A guided tour to approximate string matching. ACM
Comput. Surv., 33(1):31–88, 2001.
[10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates.
In SIGMOD Conference, 2004.
[11] A. Singhal. Modern information retrieval: A brief overview.
Bulletin of the IEEE Computer Society Technical Committee on Data
Engineering, 24(4):35-43, 2001.
[12] E. Sutinen and J. Tarhio. On using q-gram locations in
approximate string matching. In ESA, pages 327-340, 1995.
[13] W. Wang, C. Xiao, X. Lin, C. Zhang. Efficient approximate entity
extraction with edit distance constraints. In SIGMOD Conference,
2009.