Document sketching
• Problem: duplicate or near-duplicate identification in a collection of
documents
• How to measure the similarity between documents?
• A reasonable (?) candidate: edit distance
– Computationally expensive
• Another measure: resemblance due to [Broder ‘97]
Resemblance of documents [Broder ‘97]
• 𝑟(𝐴, 𝐵): resemblance between documents 𝐴 and 𝐵
• 𝑟(𝐴, 𝐵) ∈ [0,1]; similar 𝐴, 𝐵 means 𝑟(𝐴, 𝐵) close to 1
• Convert documents to sets of integers: 𝐷 ↦ 𝑆(𝐷)
• A contiguous sequence of length 𝑤 (here, 𝑤 consecutive words) contained in document 𝐷 is called a 𝑤-shingle
• Example: 𝐷 = (a rose is a rose is a rose)
• The 4-shingles of 𝐷 are: (a rose is a), (rose is a rose), (is a rose is), (a rose is a), (rose is a rose)
• The set of 4-shingles of 𝐷: {(a rose is a), (rose is a rose), (is a rose is)}
• Map shingles to integers (for some fixed 𝑤)
• From now on, identify the documents with sets of integers in [𝑚] ≔ {1, …, 𝑚}
• Thus a document is represented as a set of integers
• 𝑟(𝐴, 𝐵) ≔ |𝐴 ∩ 𝐵| / |𝐴 ∪ 𝐵| (also known as the Jaccard similarity between sets 𝐴 and 𝐵)
• 𝑟(𝐴, 𝐵) ∈ [0,1]
• Thus 𝑟(𝐴, 𝐴) = 1, but 𝑟(𝐴, 𝐵) = 1 does not mean 𝐴 = 𝐵 (distinct documents can have the same shingle set)
• In practice, 𝑟(𝐴, 𝐵) is a reasonable approximation of the informal notion of similarity of 𝐴 and 𝐵
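To make this concrete, here is a minimal Python sketch of 𝑤-shingling and the exact resemblance computation (illustrative function names, assuming whitespace word tokenization):

```python
def shingles(doc, w=4):
    """Set of w-shingles: contiguous length-w word sequences of the document."""
    words = doc.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b):
    """Exact r(A, B) = |A intersect B| / |A union B| (Jaccard similarity)."""
    return len(a & b) / len(a | b)

d1 = shingles("a rose is a rose is a rose")   # 3 distinct 4-shingles
d2 = shingles("a rose is a rose")             # 2 distinct 4-shingles, both in d1
print(resemblance(d1, d2))                    # 2/3
```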
Estimating resemblance
• Given: 𝐴, 𝐵 ⊆ [𝑚]
• Estimate: 𝑟(𝐴, 𝐵) = |𝐴 ∩ 𝐵| / |𝐴 ∪ 𝐵|
• Exact computation of 𝑟(𝐴, 𝐵) requires time 𝑂(|𝐴| + |𝐵|)
• A basic estimator for 𝑟(𝐴, 𝐵):
– 𝑆𝑚: the set of permutations [𝑚] → [𝑚]
– Choose a uniformly random 𝜋 ∈ 𝑆𝑚; then Pr[min{𝜋(𝐴)} = min{𝜋(𝐵)}] = 𝑟(𝐴, 𝐵)
(the minimum of 𝜋(𝐴 ∪ 𝐵) is equally likely to be the image of any element of 𝐴 ∪ 𝐵, and the two minima coincide exactly when that element lies in 𝐴 ∩ 𝐵)
• The variance of a single such 0/1 comparison is too high
Reducing variance
First method
• Sample 𝑘 random permutations 𝜋1, …, 𝜋𝑘 ∈ 𝑆𝑚
• The sketch of document 𝐴 is (min 𝜋1(𝐴), …, min 𝜋𝑘(𝐴))
• The resemblance can be estimated as |{𝑖 ∈ [𝑘] : min 𝜋𝑖(𝐴) = min 𝜋𝑖(𝐵)}| / 𝑘
(this is an unbiased estimator; the proof follows from the previous slide; see the sketch below)
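A runnable sketch of this first method, assuming a small universe [𝑚] so the 𝑘 permutations can be stored explicitly (the cost of storing permutations is revisited below):

```python
import random

def minhash_sketch(doc, perms):
    """Sketch of A: (min pi_1(A), ..., min pi_k(A))."""
    return [min(perm[x] for x in doc) for perm in perms]

def estimate_resemblance(sk_a, sk_b):
    """Fraction of agreeing coordinates: unbiased estimator of r(A, B)."""
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

m, k = 1000, 200
perms = [random.sample(range(m), m) for _ in range(k)]  # k explicit random permutations of [m]
A, B = set(range(0, 100)), set(range(50, 150))          # r(A, B) = 50/150 = 1/3
print(estimate_resemblance(minhash_sketch(A, perms), minhash_sketch(B, perms)))
```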
Second method
• Let min𝑘(𝑆) denote the set of the 𝑘 smallest elements of 𝑆; if |𝑆| < 𝑘, then min𝑘(𝑆) = 𝑆
• For a constant 𝑘 and a uniformly random 𝜋 ∈ 𝑆𝑚,
|min𝑘(𝜋(𝐴)) ∩ min𝑘(𝜋(𝐵)) ∩ min𝑘(𝜋(𝐴) ∪ 𝜋(𝐵))| / |min𝑘(𝜋(𝐴) ∪ 𝜋(𝐵))|
is an unbiased estimator of 𝑟(𝐴, 𝐵)
(details on the board)
• We can estimate 𝑟(𝐴, 𝐵) within multiplicative error 1 ± 𝜖 with 𝑘 = 𝑂(1/𝜖²) (for both methods above)
• The second method above gives us a way of sketching the documents:
– Fix a permutation 𝜋: [𝑚] → [𝑚] and a constant 𝑘
– For document 𝐴, its sketch is min𝑘(𝜋(𝐴))
• Now, given the sketches of documents 𝐴, 𝐵, 𝐶, … computed with the same permutation 𝜋, we can estimate the resemblance of any pair (see the sketch below)
• The sketch of a document takes space 𝑂(𝑘 log 𝑚), and estimating resemblance takes time 𝑂(𝑘 log 𝑚)
• (We can also do this with the first method, but then we would need to store 𝑘 permutations)
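An illustrative sketch of this bottom-𝑘 scheme; the key point the code exploits is that min𝑘(𝜋(𝐴) ∪ 𝜋(𝐵)) = min𝑘(sk_a ∪ sk_b), so the estimator is computable from the two sketches alone:

```python
import random

def bottom_k(s, k):
    """min_k(S): the k smallest elements of S (all of S if |S| < k)."""
    return set(sorted(s)[:k])

def sketch(doc, perm, k):
    """Bottom-k sketch of a document: min_k(pi(A))."""
    return bottom_k({perm[x] for x in doc}, k)

def estimate(sk_a, sk_b, k):
    """|min_k(pi(A)) & min_k(pi(B)) & min_k(pi(A) | pi(B))| / |min_k(pi(A) | pi(B))|,
    where min_k(pi(A) | pi(B)) = bottom_k(sk_a | sk_b): sketches alone suffice."""
    union_k = bottom_k(sk_a | sk_b, k)
    return len(sk_a & sk_b & union_k) / len(union_k)

m, k = 1000, 50
perm = random.sample(range(m), m)                # the one shared permutation pi
A, B = set(range(0, 100)), set(range(50, 150))   # r(A, B) = 1/3
print(estimate(sketch(A, perm, k), sketch(B, perm, k), k))
```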
Document sketching in small space
• One problem with this: storing permutations is expensive
• Question: Can we work with a small set of permutations instead of all of 𝑆𝑚?
• Yes: Min-wise independent permutations [Broder et al. ‘98]
• Can also use 2-wise independent hash functions [Thorup 2013]
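For illustration, here is the standard 2-wise independent family ℎ(𝑥) = (𝑎𝑥 + 𝑏) mod 𝑝; only the pair (𝑎, 𝑏) is stored rather than a full permutation (that this family suffices for bottom-𝑘 estimation is the content of [Thorup 2013]):

```python
import random

P = (1 << 61) - 1   # a Mersenne prime, assumed larger than the universe size m

def random_hash():
    """Draw h(x) = (a*x + b) mod p from a 2-wise independent family.
    Storing h costs two words, versus ~m words for an explicit permutation."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: (a * x + b) % P

h = random_hash()
A = set(range(100))
print(sorted(h(x) for x in A)[:10])   # bottom-k sketch of A under h, with k = 10
```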
Sampling from data streams
Sampling from a data stream
• How to select a uniformly random size-𝑘 subset of {𝑎1, …, 𝑎𝑛}?
• Choose the (𝑖 + 1)st element with probability (𝑘 − 𝑡)/(𝑛 − 𝑖) if 𝑡 elements have already been selected
• What if the set 𝑆 is given via a stream and we don’t know its length in advance?
• There is a solution similar to the previous one, but the following is easier:
• For the 𝑖th item, sample 𝛼𝑖 ∈ (0,1] uniformly at random, and keep the 𝑘 items with the highest values of 𝛼𝑖 (see the sketch below)
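A sketch of this streaming sampler, keeping the 𝑘 items with the highest tags in a min-heap (illustrative; random.random() draws from [0, 1) rather than (0, 1], which differs only on a measure-zero event):

```python
import heapq, random

def stream_sample(stream, k):
    """Uniformly random size-k subset of a stream of unknown length:
    tag each item with a uniform alpha and keep the k largest tags."""
    heap = []                                # min-heap of (alpha_i, item)
    for item in stream:
        alpha = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (alpha, item))
        elif alpha > heap[0][0]:
            heapq.heapreplace(heap, (alpha, item))   # evict the smallest tag
    return [item for _, item in heap]

print(stream_sample(iter(range(10**5)), 5))
```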
Sampling for subset sum estimation
• Given a stream 𝑤1 , … , 𝑤𝑛 of positive weights, we want to keep a small
amount of information so that later we can estimate the weight of any
given subset (the weight of a subset is the sum of the weights in it)
First Solution (Poisson sampling)
• Choose any probabilities 𝑝1, …, 𝑝𝑛 ∈ (0,1], one for each weight
• On encountering 𝑤𝑖, include it in the sample 𝑆 with probability 𝑝𝑖 (independently of previous decisions)
• Given any set 𝑇 (chosen in advance, before the selection of 𝑆), the estimator is Σ𝑖∈𝑆∩𝑇 𝑤𝑖/𝑝𝑖
• This is an unbiased estimator for 𝑤(𝑇) = Σ𝑖∈𝑇 𝑤𝑖 (see the sketch below)
• The expected number of samples is Σ𝑖∈[𝑛] 𝑝𝑖
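A minimal sketch of Poisson sampling with its Horvitz–Thompson-style estimator; the choice 𝑝𝑖 = 𝑘/𝑛 below is just one option, as discussed on the next slide:

```python
import random

def poisson_sample(weights, probs):
    """Keep item i with probability p_i, independently; store (i, w_i, p_i)."""
    return [(i, w, p) for i, (w, p) in enumerate(zip(weights, probs))
            if random.random() < p]

def estimate_weight(sample, T):
    """Unbiased estimator of w(T): sum of w_i / p_i over sampled i in T."""
    return sum(w / p for i, w, p in sample if i in T)

n, k = 1000, 100
weights = [random.uniform(1, 10) for _ in range(n)]
sample = poisson_sample(weights, [k / n] * n)       # expected sample size k
T = set(range(500))
print(estimate_weight(sample, T), sum(weights[i] for i in T))  # estimate vs. truth
```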
Poisson sampling
• A smaller sample set does not come for free: the variance of the estimate of the weight of the 𝑖th item is 𝑤𝑖²/𝑝𝑖 − 𝑤𝑖²
• One issue with this solution: the sample size is not fixed (although it can be concentrated around the mean)
• Another issue: what should the values of the 𝑝𝑖 be? If we want the sample to have size 𝑘 in expectation, then a possible choice is 𝑝𝑖 = 𝑘/𝑛
• But
– 𝑛 may not be known
– this sampling is not weight-sensitive: we may want to choose 𝑝𝑖 larger for larger 𝑤𝑖 to reduce the variance
Priority sampling [Duffield et al. 2007]
Second solution (priority sampling):
• For each item 𝑖, generate an independent uniform 𝛼𝑖 ∈ (0,1]
• The priority 𝑞𝑖 of item 𝑖 is given by 𝑞𝑖 = 𝑤𝑖/𝛼𝑖
• We assume all priorities are distinct (true with probability 1)
• For a given 𝑘 ≤ 𝑛, the priority sample 𝑆 of size 𝑘 is given by the 𝑘 items of highest priority
• 𝜏 ≔ the (𝑘 + 1)st highest priority; thus 𝑖 ∈ 𝑆 iff 𝑞𝑖 > 𝜏
• For 𝑖 ∈ [𝑛], let ŵ𝑖 ≔ max{𝑤𝑖, 𝜏} if 𝑖 ∈ 𝑆 and ŵ𝑖 ≔ 0 otherwise
Properties of priority sampling:
• Maintains a sample of fixed size 𝑘
• For 𝑖 ∈ [𝑛], 𝐸[ŵ𝑖] = 𝑤𝑖
• And so, for 𝑇 ⊆ [𝑛], 𝐸[Σ𝑖∈𝑇 ŵ𝑖] = Σ𝑖∈𝑇 𝑤𝑖 (see the sketch below)
(proof on the board; also in the Duffield et al. paper)
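A sketch of priority sampling, shown offline for clarity; over a stream one would instead maintain a heap of the 𝑘 + 1 highest-priority items seen so far:

```python
import random

def priority_sample(weights, k):
    """q_i = w_i / alpha_i with alpha_i uniform in (0, 1]; keep the k
    highest-priority items and the threshold tau = (k+1)-st highest priority."""
    prios = sorted(((w / (1.0 - random.random()), i, w)   # 1 - random() lies in (0, 1]
                    for i, w in enumerate(weights)), reverse=True)
    tau = prios[k][0]
    return [(i, w) for _, i, w in prios[:k]], tau

def estimate_weight(sample, tau, T):
    """Sum of w-hat_i over i in T, where w-hat_i = max(w_i, tau) for i in S."""
    return sum(max(w, tau) for i, w in sample if i in T)

weights = [random.uniform(1, 10) for _ in range(1000)]
sample, tau = priority_sample(weights, k=100)
T = set(range(500))
print(estimate_weight(sample, tau, T), sum(weights[i] for i in T))  # estimate vs. truth
```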
Priority sampling properties
We won’t prove the following:
• For distinct 𝑖, 𝑗 ∈ [𝑛], ŵ𝑖 and ŵ𝑗 have zero covariance
• So the variance of the estimate of the weight of a set is the sum of the
variances of the estimators for the items in the set
• The total variance (sum of variances of the estimators of all individual
items) of priority sampling is near-minimal among unbiased estimators