Transcript lecture13
Leveraging Big Data: Lecture 13
http://www.cohenwang.com/edith/bigdataclass2013
Instructors:
Edith Cohen
Amos Fiat
Haim Kaplan
Tova Milo
What are Linear Sketches?
Linear transformations of the input vector to a lower dimension.
Example input vector: b = (5, …, 0, 2)
Examples: JL Lemma on Gaussian random projections, AMS sketch
When to use linear sketches?
Min-Hash sketches:
- Suitable for nonnegative vectors
  (we will talk about weighted vectors later today)
- Mergeable (under MAX)
  - In particular, a value can be replaced with a larger one
- One sketch with many uses: distinct count, similarity, (weighted) sample
But… no support for negative updates.
Linear Sketches
Linear transformations (usually "random"):
- Input vector b of dimension n
- Matrix M of dimension d×n whose entries are specified by (carefully chosen) random hash functions
The sketch is s = M b, of dimension d ≪ n.
Advantages of Linear Sketches
- Easy to update the sketch under positive and negative updates to an entry:
  - An update (i, x), where i ∈ {1, …, n} and x ∈ R, means b_i ← b_i + x.
  - To update the sketch: ∀j, s_j ← s_j + M_{ji} x
- Naturally mergeable (over signed entries):
  s(b + b′) = M(b + b′) = M b + M b′ = s(b) + s(b′)
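To make the update and merge rules concrete, here is a minimal Python sketch of a hash-defined linear sketch. It is illustrative only: the class name, the {0,1} matrix entries, and the tuple-hash seeding are assumptions, not the lecture's construction.

```python
import random

class LinearSketch:
    """Minimal linear sketch s = M b; M is defined implicitly by a
    seeded pseudorandom generator and never stored."""
    def __init__(self, d, n, seed=0):
        self.d, self.n, self.seed = d, n, seed
        self.s = [0.0] * d                      # the sketch vector s
    def _entry(self, j, i):
        # hash-defined matrix entry M[j][i] in {0, 1} (an assumption)
        return random.Random(hash((self.seed, j, i))).randint(0, 1)
    def update(self, i, x):
        # update (i, x) means b[i] <- b[i] + x; x may be negative
        for j in range(self.d):
            self.s[j] += self._entry(j, i) * x
    def merge(self, other):
        # s(b + b') = s(b) + s(b'): merging is entrywise addition
        assert (self.d, self.n, self.seed) == (other.d, other.n, other.seed)
        self.s = [u + v for u, v in zip(self.s, other.s)]
```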
Linear sketches: Today
Design linear sketches for:
- "Exactly1?": Determine if there is exactly one nonzero entry (a special case of distinct counting)
- "Sample1": Obtain the index and value of a (random) nonzero entry
Application: Sketch the "adjacency vectors" of each node so that we can compute connected components and more by just looking at the sketches.
Exactly1?
- Vector b ∈ R^n
- Is there exactly one nonzero entry?
b = (0, 1, 0, −5, 0, 0, 0, 3): No (3 nonzeros)
b = (0, 0, 0, 0, 0, 3, 0, 0): Yes
Exactly1? sketch
- Vector b ∈ R^n
- Random hash function h: [n] → {0, 1}
- Sketch: s_0 = Σ_{i: h(i)=0} b_i ,  s_1 = Σ_{i: h(i)=1} b_i
If exactly one of s_0, s_1 is 0, return yes.
Analysis:
- If Exactly1 holds, then exactly one of s_0, s_1 is zero.
- Else, this happens with probability ≤ 3/4.
How can we boost this?
…Exactly1? sketch
To reduce the error probability to ≤ (3/4)^k:
Use k functions h_1, …, h_k: [n] → {0, 1}
Sketch: s_{j0} = Σ_{i: h_j(i)=0} b_i ,  s_{j1} = Σ_{i: h_j(i)=1} b_i
With k = O(log n), the error probability is ≤ 1/n^c.
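A hedged Python sketch of the boosted Exactly1? test as reconstructed above; the class and parameter names, and the per-(j, i) seeded hash, are assumptions.

```python
import random

class Exactly1Sketch:
    """k hash functions h_j: [n] -> {0,1}; for each j keep the sums
    s_j0 (entries hashed to 0) and s_j1 (entries hashed to 1)."""
    def __init__(self, n, k, seed=0):
        self.n, self.k, self.seed = n, k, seed
        self.s0 = [0.0] * k
        self.s1 = [0.0] * k
    def _h(self, j, i):
        return random.Random(hash((self.seed, j, i))).randint(0, 1)
    def update(self, i, x):                     # b[i] += x
        for j in range(self.k):
            if self._h(j, i) == 0:
                self.s0[j] += x
            else:
                self.s1[j] += x
    def query(self):
        # yes iff for every j exactly one of s_j0, s_j1 is zero; a vector
        # with more than one nonzero passes a single test w.p. <= 3/4,
        # so k independent tests give error probability <= (3/4)^k
        return all((a == 0) != (b == 0) for a, b in zip(self.s0, self.s1))
```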
Exactly1? sketch in matrix form
k functions h_1, …, h_k
Sketch: s_{j0} = Σ_{i: h_j(i)=0} b_i ,  s_{j1} = Σ_{i: h_j(i)=1} b_i

⎡ h_1(1)    h_1(2)    ⋯  h_1(n)   ⎤ ⎡ 5 ⎤   ⎡ s_{11} ⎤
⎢ 1−h_1(1)  1−h_1(2)  ⋯  1−h_1(n) ⎥ ⎢ ⋮ ⎥   ⎢ s_{10} ⎥
⎢ h_2(1)    h_2(2)    ⋯  h_2(n)   ⎥ ⎢ 0 ⎥ = ⎢ s_{21} ⎥
⎢ 1−h_2(1)  1−h_2(2)  ⋯  1−h_2(n) ⎥ ⎣ 2 ⎦   ⎢ s_{20} ⎥
⎢    ⋮          ⋮           ⋮     ⎥         ⎢    ⋮   ⎥
⎣ 1−h_k(1)  1−h_k(2)  ⋯  1−h_k(n) ⎦         ⎣ s_{k0} ⎦

(The row with entries h_j(i) produces s_{j1}; the row with entries 1−h_j(i) produces s_{j0}.)
Linear sketches: Next
Design linear sketches for:
- "Exactly1?": Determine if there is exactly one nonzero entry (a special case of distinct counting)
- "Sample1": Obtain the index and value of a (random) nonzero entry
Application: Sketch the "adjacency vectors" of each node so that we can compute connected components and more by just looking at the sketches.
Sample1 sketch
[Cormode Muthukrishnan Rozenbaum 2005]
A linear sketch with d = O(log² n) which obtains (with a fixed probability, say 0.1) a uniform-at-random nonzero entry.
Vector b = (0, 1, 0, −5, 0, 0, 0, 3)
With probability > 0.1, return one of (2, 1), (4, −5), (8, 3), each with probability 1/3.
Else return failure.
There is also a very small (< 1/n^c) probability of a wrong answer.
Sample1 sketch
For j ∈ [1, ⌈log₂ n⌉], take a random hash function h_j: [1, n] → [0, 2^j − 1].
We only look at the indices that map to 0; for these indices we maintain:
- An Exactly1? sketch (boosted to error probability < 1/n^c)
- s_j = Σ_{i: h_j(i)=0} b_i   (sum of values)
- t_j = Σ_{i: h_j(i)=0} i·b_i   (sum of index times value)
For the lowest j such that Exactly1? = yes, return (t_j / s_j, s_j).
Else (no such j), return failure.
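Putting the pieces together, a sketch of Sample1 in the spirit of [CMR 2005], reusing the Exactly1Sketch above; the 0-based level indexing and the default boosting parameter are assumptions.

```python
import math, random

class Sample1Sketch:
    """Level j keeps only indices with h_j(i) = 0 (probability 2^-(j+1)
    here, since levels are 0-based) plus s_j = sum of b_i and
    t_j = sum of i*b_i over those indices."""
    def __init__(self, n, k=16, seed=0):
        self.n, self.seed = n, seed
        self.levels = max(1, math.ceil(math.log2(n)))
        self.e1 = [Exactly1Sketch(n, k, seed=hash((seed, j)))
                   for j in range(self.levels)]
        self.s = [0.0] * self.levels
        self.t = [0.0] * self.levels
    def _maps_to_zero(self, j, i):
        # h_j: [1..n] -> [0, 2^(j+1) - 1]; keep i iff h_j(i) == 0
        return random.Random(hash((self.seed, 'h', j, i))).randrange(2 ** (j + 1)) == 0
    def update(self, i, x):
        for j in range(self.levels):
            if self._maps_to_zero(j, i):
                self.e1[j].update(i, x)
                self.s[j] += x
                self.t[j] += i * x
    def query(self):
        # lowest level passing Exactly1? yields (index, value)
        for j in range(self.levels):
            if self.e1[j].query() and self.s[j] != 0:
                return (round(self.t[j] / self.s[j]), self.s[j])
        return None                              # failure
```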
Matrix form of Sample1
For each j there is a block of rows, as follows:
- Entries are 0 on all columns t ∈ {1, …, n} for which h_j(t) ≠ 0. Let A_j = {t : h_j(t) = 0}.
- The first O(log n) rows on A_j contain an Exactly1? sketch (the input vector dimension of the Exactly1? sketch equals |A_j|).
- The next row has "1" on t ∈ A_j (and "codes" s_j).
- The last row in the block has t on t ∈ A_j (and "codes" t_j).
Sample1 sketch: Correctness
For the lowest j such that Exactly1? = yes, we return (t_j / s_j, s_j).
If Sample1 returns a sample, correctness only depends on that of the Exactly1? component.
All ⌈log₂ n⌉ "Exactly1?" applications are correct with probability ≥ 1 − ⌈log₂ n⌉ / n^c.
It remains to show that:
With probability ≥ 0.1, for at least one j, h_j(i) = 0 for exactly one nonzero b_i.
Sample1 Analysis
Lemma: With probability ≥ 1/(2e), for some j there is exactly one index that maps to 0.
Proof: What is the probability that exactly one index maps to 0 by h_j?
If there are m nonzeros: p = m · 2^{−j} · (1 − 2^{−j})^{m−1}
⟹ If m ∈ (2^{j−1}, 2^j], then m · 2^{−j} > 1/2 and (1 − 2^{−j})^{m−1} ≥ (1 − 2^{−j})^{2^j − 1} ≥ 1/e, so p ≥ 1/(2e).
⟹ For any m, this holds for some j.
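A small numeric check of the reconstructed bound: for each m, the level j with m ∈ (2^{j−1}, 2^j] should give p ≥ 1/(2e) ≈ 0.18.

```python
import math

# for every m tested, pick j with m in (2^(j-1), 2^j] and check the bound
for m in range(1, 2 ** 12 + 1):
    j = max(1, math.ceil(math.log2(m)))
    p = m * 2 ** -j * (1 - 2 ** -j) ** (m - 1)
    assert p >= 1 / (2 * math.e), (m, j, p)
print("p >= 1/(2e) holds for all m tested")
```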
Sample1: boosting the success probability
Same trick as before:
We can use O(log n) independent applications to obtain a Sample1 sketch with success probability ≥ 1 − 1/n^c for a constant c of our choice.
We will need this small error probability for the next part: connected components computation over the sketched adjacency vectors of the nodes.
Linear sketches: Next
Design linear sketches for:
- "Exactly1?": Determine if there is exactly one nonzero entry (a special case of distinct counting)
- "Sample1": Obtain the index and value of a (random) nonzero entry
Application: Sketch the "adjacency vectors" of each node so that we can compute connected components and more by just looking at the sketches.
Connected Components: Review
Repeat:
- Each node selects an incident edge
- Contract all selected edges (contract = merge the two endpoints into a single node)
Connected Components: Review
Iteration 1:
- Each node selects an incident edge
- Contract selected edges
Iteration 2:
- Each (contracted) node selects an incident edge
- Contract selected edges
Done!
[figures: the example graph before and after each selection/contraction step]
Connected Components: Analysis
Repeat:
- Each "super" node selects an incident edge
- Contract all selected edges (contract = merge the two endpoint supernodes into a single supernode)
Lemma: There are at most log₂ n iterations.
Proof: By induction: after the i-th iteration, each "super" node includes ≥ 2^i original nodes.
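A minimal union-find rendering of this contraction loop on an explicit edge list (0-based node names and helper names are assumptions; this is the plain algorithm, not the sketch-space version):

```python
class DSU:
    """Disjoint-sets (union-find) with path halving."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, v):
        while self.p[v] != v:
            self.p[v] = self.p[self.p[v]]
            v = self.p[v]
        return v
    def union(self, a, b):
        self.p[self.find(a)] = self.find(b)

def connected_components(n, edges):
    dsu = DSU(n)
    for _ in range(max(1, n.bit_length())):      # <= ~log2 n iterations
        pick = {}
        for (u, v) in edges:
            ru, rv = dsu.find(u), dsu.find(v)
            if ru != rv:                         # a cut edge
                pick.setdefault(ru, (u, v))      # each supernode selects
                pick.setdefault(rv, (u, v))      # one incident cut edge
        if not pick:
            break
        for (u, v) in pick.values():             # contract selected edges
            dsu.union(u, v)
    return {dsu.find(v) for v in range(n)}       # one root per component
```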
Adjacency sketches
[Ahn, Guha and McGregor 2012]
Adjacency vectors of nodes
Nodes {1, …, n}.
Each node has an associated adjacency vector of dimension (n choose 2): an entry for each pair (i, j), i < j.
Adjacency vector b of node v:
- b_{(v,j)} = 1 ⟺ edge (v, j) ∈ E, v < j
- b_{(i,v)} = −1 ⟺ edge (i, v) ∈ E, i < v
- b_x = 0 if edge x ∉ E or x is not adjacent to v
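A small helper (hypothetical naming) that materializes this sign convention as a dictionary keyed by pairs (i, j), i < j:

```python
from itertools import combinations

def adjacency_vector(v, n, edges):
    """Adjacency vector of node v: +1 on (v, j) in E with v < j,
    -1 on (i, v) in E with i < v, and 0 elsewhere."""
    E = {tuple(sorted(e)) for e in edges}
    vec = {}
    for (i, j) in combinations(range(1, n + 1), 2):
        if (i, j) in E and v == i:
            vec[(i, j)] = 1
        elif (i, j) in E and v == j:
            vec[(i, j)] = -1
        else:
            vec[(i, j)] = 0
    return vec
```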
Adjacency vector of a node
Node 3:
(1,2) (1,3) (1,4) (1,5) (2,3) (2,4) (2,5) (3,4) (3,5) (4,5)
  0    −1    0     0    −1    0     0    +1    0     0
[figure: an example graph on nodes 1–5]
Adjacency vector of a node
Node 5:
(1,2) (1,3) (1,4) (1,5) (2,3) (2,4) (2,5) (3,4) (3,5) (4,5)
  0     0    0     0     0     0     0     0    0    −1
[figure: the same example graph]
Adjacency vector of a set of nodes
We define the adjacency vector of a set of nodes C to be the sum of the adjacency vectors of its members.
What is the graph interpretation?
Adjacency vector of a set of nodes
C = {2, 3, 4}:
(1,2) (1,3) (1,4) (1,5) (2,3) (2,4) (2,5) (3,4) (3,5) (4,5)
  0    −1    0     0     0     0     0     0    0    +1
(The slide sums the vectors of nodes 2, 3, 4; entries on edges internal to C cancel.)
Entries are ±1 only on cut edges (i ∈ C, j ∉ C).
[figure: the same example graph]
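A quick check of the cut-edge interpretation, reusing adjacency_vector from above on a hypothetical edge list consistent with the node-3 and node-5 examples:

```python
edges = [(1, 3), (2, 3), (3, 4), (4, 5)]   # assumed example graph
total = {}
for v in (2, 3, 4):                        # sum the vectors of C = {2,3,4}
    for key, val in adjacency_vector(v, 5, edges).items():
        total[key] = total.get(key, 0) + val
print({k: x for k, x in total.items() if x != 0})
# -> {(1, 3): -1, (4, 5): 1}: nonzero only on the cut edges of C
```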
Stating the Connected Components Algorithm in terms of adjacency vectors
We maintain a disjoint-sets (union-find) data structure over the set of nodes.
- Disjoint sets correspond to "supernodes."
- For each set S we keep a vector A(S).
Operations:
- Find(v): for node v, return its supernode
- Union(S₁, S₂): merge two supernodes: S ← S₁ ∪ S₂, A(S) ← A(S₁) + A(S₂)
Connected Components computation in terms of adjacency vectors
Initially, each node v creates a supernode, with A being the adjacency vector of v.
Repeat:
- Each supernode S selects a nonzero entry (x, y) in A(S) (this is a cut edge of S)
- For each selected (x, y): Union(S_x, S_y)
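The same loop over explicit adjacency vectors, reusing DSU and adjacency_vector from the earlier sketches; in sketch space, each A(S) below would be replaced by Sample1 sketches and the vector sum by sketch merging.

```python
def cc_via_adjacency_vectors(n, edges):
    dsu = DSU(n + 1)                              # nodes are 1..n
    A = {v: adjacency_vector(v, n, edges) for v in range(1, n + 1)}
    for _ in range(max(1, n.bit_length())):
        roots = {dsu.find(v) for v in range(1, n + 1)}
        selected = []
        for r in roots:                           # pick a nonzero entry:
            nz = next((e for e, val in A[r].items() if val != 0), None)
            if nz is not None:                    # a cut edge of S
                selected.append(nz)
        if not selected:
            break
        for (x, y) in selected:                   # Union(S_x, S_y)
            rx, ry = dsu.find(x), dsu.find(y)
            if rx != ry:
                merged = {e: A[rx][e] + A[ry][e] for e in A[rx]}
                dsu.union(rx, ry)                 # internal entries cancel
                A[dsu.find(rx)] = merged
    return {dsu.find(v) for v in range(1, n + 1)}
```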
Connected Components in sketch space
Sketching: We maintain a Sample1 sketch of the adjacency vector of each node. When edges are added or deleted, we update the sketch.
Connected components query: We apply the connected components algorithm for adjacency vectors over the sketched vectors.
Connected Components in sketch space
Operations on the sketches during the CC computation:
- Select a nonzero in A(S): we use the Sample1 sketch of A(S), which succeeds with probability > 1 − 1/n^c
- Union: we take the sum of the Sample1 sketch vectors of the merged supernodes to obtain the Sample1 sketch of the new supernode
Connected Components in sketch space
Iteration 1:
- Each supernode (node) uses its Sample1 sketch, of dimension d, to select an incident edge
Iteration 1 (continued):
- Union the nodes in each path/cycle; sum up the Sample1 sketches
Iteration 1 (end):
- New supernodes with their vectors
[figures: each node carries a Sample1 sketch of its adjacency vector; after contraction, each new supernode carries the sum of its members' sketches]
Connected Components in sketch space
Important subtlety:
One Sample1 sketch only guarantees (with high probability) one sample!
But the connected components computation uses each sketch ⌈log₂ n⌉ times (once in each iteration).
Solution: We maintain ⌈log₂ n⌉ independent sets of Sample1 sketches of the adjacency vectors, one per iteration.
Connected Components in sketch space
When does sketching pay off?
The plain solution maintains the adjacency list of each node, updates it as needed, and applies a classic connected components algorithm at query time.
Sketching the adjacency vectors is justified when:
- many edges are deleted and added,
- we need to test connectivity "often", and
- "usually" m ≫ n (many more edges than nodes).
Bibliography
- Ahn, Guha, McGregor: "Analyzing graph structure via linear measurements." SODA 2012
- Cormode, Muthukrishnan, Rozenbaum: "Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling." VLDB 2005
- Jowhari, Sağlam, Tardos: "Tight bounds for Lp samplers, finding duplicates in streams, and related problems." PODS 2011
Back to Random Sampling
A powerful tool for data analysis: efficiently estimate properties of a large population (data set) by examining a smaller sample.
We saw sampling several times in this class:
- Min-Hash: uniform over distinct items
- ADS: inclusion probability decreases with distance
- Sampling using linear sketches
- Sample coordination: using the same set of hash functions, we get mergeability and better similarity estimators between sampled vectors
Subset (domain/subpopulation) queries: an important application of samples
A query is specified by a predicate P on items i:
- Estimate the subset cardinality |{i : P(i)}|
- Weighted items: estimate the subset weight Σ_{i: P(i)} w_i
More on "basic" sampling
Reservoir sampling (uniform "simple random" sampling on a stream)
Weighted sampling:
- Poisson and Probability Proportional to Size (PPS)
- Bottom-k/Order sampling:
  - Sequential Poisson / Order PPS / Priority
  - Weighted sampling without replacement
Many names, because these highly useful and natural sampling schemes were re-invented multiple times, by computer scientists and statisticians.
Reservoir Sampling
[Knuth 1969, 1981; Vitter 1985, …]
Model: a stream of (unique) items a_1, a_2, …
Maintain a uniform sample s_1, s_2, …, s_k of size k (all k-tuples equally likely).
When item t arrives:
- If t ≤ k: s_t ← a_t.
- Else:
  - Choose r ∼ U{1, …, t}
  - If r ≤ k: s_r ← a_t
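A direct Python transcription of this update rule (0-based r is an implementation choice):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """After item t, S is a uniform k-subset of the first t items."""
    S = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            S.append(item)                  # first k items fill S
        else:
            r = rng.randrange(t)            # r ~ U{0, ..., t-1}
            if r < k:
                S[r] = item                 # replace a random slot
    return S
```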
Reservoir using bottom-k Min-Hash
Bottom-k Min-Hash samples: each item gets a random "hash" value ∼ U[0,1]. We take the k items with the smallest hash (also in [Knuth 1969]).
- Another form of reservoir sampling, good also with distributed data.
- The Min-Hash form applies to distinct sampling (multiple occurrences of the same item), where we cannot track t (the total population size so far).
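A sketch of the bottom-k Min-Hash form with a heap; the per-item seeded hash is a stand-in assumption for a real shared hash function:

```python
import heapq, random

def bottom_k_minhash(stream, k):
    """Keep the k distinct items with the smallest hash values.
    With a shared hash, samples from different streams are mergeable."""
    def hash01(x):                          # stand-in hash into [0, 1)
        return random.Random(hash(('mh', x))).random()
    heap = []                               # max-heap via negated hashes
    for item in stream:
        entry = (-hash01(item), item)
        if entry in heap:                   # repeated occurrence: skip
            continue
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:               # smaller hash than current max
            heapq.heapreplace(heap, entry)
    return sorted((-negh, item) for negh, item in heap)
```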
Subset queries with a uniform sample
The fraction in the sample is an unbiased estimate of the fraction in the population.
To estimate the number in the population:
- If we know the total number of items n (e.g., a stream of items which occur once), the estimate is the number in the sample times n/k.
- If we do not know n (e.g., sampling distinct items with bottom-k Min-Hash), we use (conditioned) inverse-probability estimates.
The first option is better (when available): lower variance for large subsets.
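The first estimator in a few lines of Python (names are illustrative):

```python
def subset_count_estimate(sample, predicate, n):
    """Unbiased estimate of |{i : P(i)}| from a uniform sample of size k,
    when the population size n is known."""
    k = len(sample)
    return sum(1 for x in sample if predicate(x)) * n / k
```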
Weighted Sampling
- Items often have a skewed weight distribution: Internet flows, file sizes, feature frequencies, numbers of friends in a social network.
- If the sample misses heavy items, subset weight queries would have high variance. ⟹ Heavier items should have higher inclusion probabilities.
Poisson Sampling (generalizes Bernoulli)
- Items have weights w_1, w_2, w_3, …
- Independent inclusion probabilities p_1, p_2, p_3, … that depend on the weights
- The expected sample size is k = Σ_i p_i
[figure: items with inclusion probabilities p_1, …, p_6]
Poisson: Subset Weight Estimation
Inverse probability estimates [HT52]:
If i ∈ S, a_i = w_i / p_i; else a_i = 0.
- Assumes we know w_i and p_i when i ∈ S
HT estimator of w(J) = Σ_{i ∈ J} w_i:
ŵ(J) = Σ_{i ∈ J ∩ S} w_i / p_i
[figure: a subset J of items with inclusion probabilities p_1, …, p_6]
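A minimal HT (inverse-probability) subset-weight estimator, assuming the Poisson sample is stored as (item, w_i, p_i) triples:

```python
def ht_subset_estimate(sample, predicate):
    """HT estimate of the weight of {i : P(i)}: sum the adjusted
    weights a_i = w_i / p_i over sampled items satisfying P."""
    return sum(w / p for item, w, p in sample if predicate(item))
```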
Poisson with HT estimates: Variance
The HT estimator is the linear nonnegative estimator with minimum variance
- linear = estimates each item separately
Variance for item i:
Var[a_i] = p_i (w_i / p_i)² + (1 − p_i) · 0² − w_i² = w_i² (1/p_i − 1)
Poisson: How to choose the p_i?
Optimization problem: given an expected sample size k, minimize the sum of per-item variances
(the variance of the population weight estimate, and the expected variance of a "random" subset).
Minimize Σ_i w_i² (1/p_i − 1)
such that Σ_i p_i = k
Probability Proportional to Size (PPS)
Minimize Σ_i w_i² (1/p_i − 1) such that Σ_i p_i = k
Solution: each item is sampled with probability p_i ∝ w_i (truncated at 1).
We show the proof for 2 items…
PPS minimizes variance: 2 items
Minimize w_1² (1/p_1 − 1) + w_2² (1/p_2 − 1)
such that p_1 + p_2 = k
- Same as minimizing w_1²/p_1 + w_2²/(k − p_1)
- Take the derivative with respect to p_1:
  −w_1²/p_1² + w_2²/(k − p_1)² = 0 ⟹ w_1/p_1 = w_2/(k − p_1), i.e., p_i ∝ w_i
- The second derivative is ≥ 0: the extremum is a minimum
Probability Proportional to Size (PPS)
Equivalent formulation. To obtain a PPS sample with expected size k:
- Take τ to be the solution of k = Σ_i min{1, w_i/τ}
- Sample i with probability p_i = min{1, w_i/τ}
"Take a random h(i) ∼ U[0,1]; sample i ⟺ w_i/h(i) ≥ τ"
For given weights {w_i}, k uniquely determines τ.
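One way (an assumption, via bisection on the monotone map τ ↦ Σ_i min{1, w_i/τ}) to compute τ from k:

```python
def pps_threshold(weights, k):
    """Solve k = sum_i min(1, w_i / tau) for tau; assumes k < len(weights)."""
    lo, hi = 0.0, max(weights) * len(weights)   # size n at lo ... <= 1 at hi
    for _ in range(100):                        # bisection
        tau = (lo + hi) / 2
        if sum(min(1.0, w / tau) for w in weights) > k:
            lo = tau                            # tau too small
        else:
            hi = tau
    return (lo + hi) / 2

# inclusion probabilities are then p_i = min(1, w_i / tau)
```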
Poisson PPS on a stream
Keep the expected sample size k; increase τ as needed.
The sample contains all items with w_i/h(i) ≥ τ.
- We need to track w_i for items that are not sampled. This allows us to re-compute τ so that Σ_i p_i = k when a new item arrives, using only information in the sample.
- When τ increases, we may need to remove items from the sample.
Poisson sampling has a variable sample size!
We would prefer to specify a fixed sample size k.
Obtaining a fixed sample size
Proposed schemes include rejective sampling, VarOpt sampling [Chao 1982] [CDKLT 2009], ….
We focus here on bottom-k/order sampling.
Idea:
- Instead of taking the items with w_i/h(i) > τ (and increasing τ on the go),
- take the k items with the highest w_i/h(i).
- This is the same as taking the bottom-k items with respect to h(i)/w_i.
Keeping the sample size fixed
Bottom-k/Order sampling
[Bengt Rosén (1972, 1997), Esbjörn Ohlsson (1990–)]
Scheme(s) (re-)invented very many times… e.g., Duffield, Lund, Thorup (JACM 2007) ("priority" sampling), Efraimidis Spirakis 2006, C 1997, CK 2007.
Bottom-k sampling (weighted), general form:
- Each item i takes a random "rank" r_i = F(w_i, h(i)), where h(i) ∼ U[0,1]
- The sample includes the k items with the smallest rank values.
Weighted bottom-k sample: Computation
- The rank of item i is r_i = F(w_i, h(i)), where h(i) ∼ U[0,1]
- Take the k items with the smallest rank
This is a weighted bottom-k Min-Hash sketch, and the good properties carry over:
- Streaming / distributed computation
- Mergeable
Choosing F(w, h)
- Uniform weights: using r_i = h(i), we get a bottom-k Min-Hash sample
- With r_i = h(i)/w_i: Order PPS / Priority sample [Ohlsson 1990, Rosén 1997] [DLT 2007]
- With r_i = −ln(h(i))/w_i (exponentially distributed with parameter w_i): weighted sampling without replacement [Rosén 1972] [Efraimidis Spirakis 2006] [CK 2007]…
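A generic bottom-k sampler parameterized by F, with the three rank functions from this slide (the function names are illustrative):

```python
import heapq, math, random

def bottom_k_weighted(items, k, F):
    """items: iterable of (item, weight); returns the k (rank, item, weight)
    triples with the smallest ranks r_i = F(w_i, h(i)), h(i) ~ U[0,1]."""
    # 1 - random() lies in (0, 1], so the log in F_expo is always defined
    ranked = ((F(w, 1.0 - random.random()), x, w) for x, w in items)
    return heapq.nsmallest(k, ranked, key=lambda t: t[0])

F_uniform  = lambda w, h: h                     # bottom-k Min-Hash
F_priority = lambda w, h: h / w                 # Order PPS / Priority
F_expo     = lambda w, h: -math.log(h) / w      # Exp(w) ranks: WSR
```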
Weighted Sampling without Replacement
Iteratively, k times: choose item i with probability p_i = w_i / Σ_{j ∈ remaining} w_j.
We show that this is the same as bottom-k with r_i ∼ Exp[w_i]:
- Part I: The probability that item i has the minimum rank is w_i/W, where W = Σ_j w_j.
- Part II: By the memorylessness property of the exponential distribution, Part I also applies to the subsequent samples, conditioned on the already-selected prefix.
Weighted Sampling without Replacement
Lemma: The probability that item i has the minimum rank is w_i/W, where W = Σ_j w_j.
Proof: Let W′ = Σ_{j≠i} w_j. The minimum of independent exponential r.v.s (each r_j ∼ Exp[w_j]) is exponentially distributed with the sum of the parameters, so min_{j≠i} r_j ∼ Exp[W′].
Pr[r_i < min_{j≠i} r_j]
= ∫₀^∞ w_i e^{−x w_i} ∫_x^∞ W′ e^{−y W′} dy dx
= ∫₀^∞ w_i e^{−x w_i} e^{−x W′} dx
= (w_i/W) ∫₀^∞ W e^{−x W} dx = w_i/W
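A quick Monte Carlo sanity check of the lemma (the weights and trial count are arbitrary choices):

```python
import collections, random

w = {'a': 1.0, 'b': 2.0, 'c': 5.0}
W = sum(w.values())
wins, trials = collections.Counter(), 100_000
for _ in range(trials):
    ranks = {i: random.expovariate(wi) for i, wi in w.items()}
    wins[min(ranks, key=ranks.get)] += 1     # item with the minimum rank
for i, wi in w.items():
    print(i, round(wins[i] / trials, 3), '~', round(wi / W, 3))
```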
Weighted bottom-k: Inverse probability estimates for subset queries
Same as with Min-Hash sketches (uniform weights):
- For each i ∈ S, compute p_i: the probability that i ∈ S given {r_j : j ≠ i}.
- This is exactly the probability that r_i is smaller than the k-th smallest rank in {r_j : j ≠ i}. Note that in our sample this threshold is y, the (k+1)-th smallest rank in {r_j}.
p_i = Pr_{x ∼ U[0,1]}[F(w_i, x) ≤ y]
We take the adjusted weight a_i = w_i/p_i.
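For the priority rank F(w, h) = h/w the probability has a closed form, p_i = min{1, y·w_i}; a hedged subset-weight estimator assuming the adjusted-weight convention a_i = w_i/p_i:

```python
def priority_subset_estimate(sample, y, predicate):
    """sample: (item, weight) pairs with the k smallest ranks; y is the
    (k+1)-th smallest rank. For F(w, h) = h / w:
    p_i = Pr[h / w_i <= y] = min(1, y * w_i)."""
    return sum(w / min(1.0, y * w) for item, w in sample if predicate(item))
```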
Weighted bottom-k: Remark on subset estimators
- Inverse probability (HT) estimators apply also when we do not know the total weight of the population.
- We can estimate the total weight by Σ_{i ∈ S} a_i (same as with the unweighted sketches we used for distinct counting).
When we know the total weight, we can get better estimators for larger subsets:
with uniform weights, we could use the fraction in the sample times the total. The weighted case is harder.
Weighted bottom-k sample: Remark on similarity queries
- The rank of item i is r_i = F(w_i, h(i)), where h(i) ∼ U[0,1]
- Take the k items with the smallest rank
Remark:
Similarly to "uniform"-weight Min-Hash sketches, "coordinated" weighted bottom-k samples of different vectors support similarity queries (weighted Jaccard, cosine, Lp distance) and other queries which involve multiple vectors [CK 2009–2013].