#### Transcript: Estimating Distinct Elements, Optimally

# Estimating Distinct Elements, Optimally

### David Woodruff (IBM). Based on papers with Piotr Indyk, Daniel Kane, and Jelani Nelson

### Problem Description

• Given a long string containing at most n distinct characters, count the number F0 of distinct characters
• See characters one at a time
• One pass over the string
• Algorithms must use small memory and fast update time
  – too expensive to store the set of distinct characters (see the baseline sketch below)
  – algorithms must be randomized and settle for an approximate solution: output an estimate F ∈ [(1-ε)F0, (1+ε)F0] with, say, good constant probability
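For contrast, here is the baseline the bullets rule out: a minimal one-pass exact counter in Python (an illustrative sketch, not part of the talk). It stores the entire set of distinct characters, i.e. Θ(F0) words of memory, which is exactly the cost the algorithms below avoid.

```python
def exact_f0(stream):
    seen = set()          # Theta(F0) words of memory: too expensive
    for c in stream:      # one pass, one update per character
        seen.add(c)
    return len(seen)

print(exact_f0("abracadabra"))  # 5 distinct characters
```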

### Algorithm History

• Flajolet and Martin introduced the problem, FOCS 1983
  – O(log n) space for fixed ε
• Alon, Matias and Szegedy in the random oracle model
  – O(log n) space/update time for fixed ε
• Gibbons and Tirthapura with no oracle
  – O(ε^-2 log n) space and O(ε^-2) update time
• Bar-Yossef et al.
  – O(ε^-2 log n) space and O(log 1/ε) update time
  – O(ε^-2 log log n + log n) space and O(ε^-2) update time, essentially
  – Similar space bound also obtained by Flajolet et al. in the random oracle model
• Kane, Nelson and W
  – O(ε^-2 + log n) space and O(1) update and reporting time
• All time complexities are in the unit-cost RAM model

### Lower Bound History

• Alon, Matias and Szegedy
  – Any algorithm requires Ω(log n) bits of space
• Bar-Yossef
  – Any algorithm requires Ω(ε^-1) bits of space
• Indyk and W
  – If ε > 1/n^(1/9), any algorithm needs Ω(ε^-2) bits of space
• W
  – If ε > 1/n^(1/2), any algorithm needs Ω(ε^-2) bits of space
• Jayram, Kumar and Sivakumar
  – Simpler proof of the Ω(ε^-2) bound for any ε > 1/m^(1/2)
• Brody and Chakrabarti
  – Show the above lower bounds hold even for multiple passes over the string

Combining upper and lower bounds, the complexity of this problem is Θ(ε^-2 + log n) space and Θ(1) update and reporting time.

### Outline for Remainder of Talk

• Proofs of the Upper Bounds
• Proofs of the Lower Bounds

### Hash Functions for Throwing Balls

• We consider a random mapping f of B balls into C containers and count the number of non-empty containers
• The expected number of non-empty containers is C – C(1-1/C)^B
• If instead of the mapping f, we use an O((log C/ε)/(log log C/ε))-wise independent mapping g, then
  – the expected number of non-empty containers under g is the same as that under f, up to a factor of (1 ± ε)
• Proof based on approximate inclusion-exclusion (see the simulation sketch below)
  – express 1 – (1-1/C)^B in terms of a series of binomial coefficients
  – truncate the series at an appropriate place
  – use limited independence to handle the remaining terms
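A quick simulation of this setup (a sketch using a fully random mapping f, not the limited-independence g the talk actually needs): throw B balls into C containers, count the non-empty containers T, and invert the expectation formula to recover an estimate of B.

```python
import math
import random

def estimate_balls(B, C, trials=200):
    total = 0.0
    for _ in range(trials):
        T = len({random.randrange(C) for _ in range(B)})  # non-empty containers
        # Invert E[T] = C - C(1 - 1/C)^B to recover an estimate of B:
        total += math.log(1 - T / C) / math.log(1 - 1 / C)
    return total / trials

random.seed(0)
print(estimate_balls(B=500, C=1000))  # lands near 500
```

This is the same inversion that reappears later as the ln(1-T/K)/ln(1-1/K) term in the main algorithm's estimator.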

### Fast Hash Functions

• Use hash functions g that can be evaluated in O(1) time
• If g is O((log C/ε)/(log log C/ε))-wise independent, the natural family of polynomial hash functions doesn't work (see the sketch below)
• We use theorems due to Pagh, Pagh, and Siegel that construct k-wise independent families for large k and allow O(1) evaluation time
• For example, Siegel shows:
  – Let U = [u] and V = [v] with u = v^c for a constant c > 1, and suppose the machine word size is Ω(log v)
  – Let k = v^o(1) be arbitrary
  – For any constant d > 0 there is a randomized procedure that constructs a k-wise independent hash family H from U to V that succeeds with probability 1 - 1/v^d and requires v^d space. Each h ∈ H can be evaluated in O(1) time
• Can show we have sufficiently random hash functions that can be evaluated in O(1) time and represented with O(ε^-2 + log n) bits of space
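For concreteness, the natural polynomial family mentioned above: a random degree-(k-1) polynomial over a prime field is k-wise independent, but each evaluation takes O(k) time via Horner's rule, which is why the talk needs the Pagh-Pagh and Siegel constructions for O(1) time. (A sketch; it assumes universe elements are smaller than the chosen prime, and the final mod adds a small bias.)

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assume universe elements are below P

def sample_kwise_hash(k, num_buckets):
    """Random degree-(k-1) polynomial over GF(P): a k-wise independent family."""
    coeffs = [random.randrange(P) for _ in range(k)]
    def h(x):
        v = 0
        for c in coeffs:          # Horner's rule: O(k) time per evaluation
            v = (v * x + c) % P
        return v % num_buckets    # small bias from the final mod
    return h

random.seed(0)
g = sample_kwise_hash(k=8, num_buckets=100)
print(g(42), g(43))
```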

### Algorithm Outline

1. Set K = 1/ε^2
2. Instantiate a (lg n) x K bitmatrix A, initializing entries of A to 0
3. Pick random hash functions f: [n] -> [n] and g: [n] -> [K]
4. Obtain a constant factor approximation R to F0 somehow
5. Update(i): Set A[1, g(i)] = 1, A[2, g(i)] = 1, …, A[lsb(f(i)), g(i)] = 1
6. Estimator: Let T = |{j in [K] : A[log(16R/K), j] = 1}|. Output (32R/K) · ln(1 - T/K) / ln(1 - 1/K)
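A self-contained sketch of the six steps, under loud assumptions: f and g are simulated with per-item seeded RNGs rather than the talk's hash families, R is computed exactly instead of by the O(log n)-space rough estimator, and lsb here counts trailing zeros, so the final 2^level scaling plays the role of the slide's 32R/K factor under this sketch's convention. Constants are chosen to make the sketch self-consistent, not to match the talk's analysis; a single run carries the subsampling noise the full analysis controls.

```python
import math
import random

def lsb(x, cap):
    """Trailing zeros of x (capped), so P(lsb(x) >= l) = 2^-l for uniform x."""
    l = 0
    while l < cap and x & 1 == 0:
        x >>= 1
        l += 1
    return l

def estimate_f0(stream, n, eps):
    K = int(1 / eps ** 2)                           # step 1
    depth = n.bit_length()
    A = [[0] * K for _ in range(depth + 1)]         # step 2: bitmatrix
    def f(i): return random.Random("f" + str(i)).randrange(n)  # step 3:
    def g(i): return random.Random("g" + str(i)).randrange(K)  # stand-ins
    R = len(set(stream))       # step 4, idealized; the talk uses the
                               # O(log n)-space rough estimator shown later
    for i in stream:                                # step 5: Update(i)
        col = g(i)
        for row in range(lsb(f(i), depth) + 1):
            A[row][col] = 1
    level = max(int(math.log2(max(16 * R / K, 1))), 0)  # step 6
    T = min(sum(A[level]), K - 1)                   # non-empty containers
    balls = math.log(1 - T / K) / math.log(1 - 1 / K)
    return 2 ** level * balls   # undo the 2^-level subsampling at this row

random.seed(0)
stream = [random.randrange(10**6) for _ in range(5000)]
print(len(set(stream)), round(estimate_f0(stream, 10**6, eps=0.05)))
```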

### Space Complexity

• Naively, A is a (lg n) x K bitmatrix, so O(ε^-2 log n) space
• Better: for each column j, store the identity i(j) of the largest row i for which A[i, j] = 1. Note that if A[i, j] = 1, then A[i', j] = 1 for all i' < i
  – Takes O(ε^-2 log log n) space
• Better yet: maintain a "base level" I. For each column j, store only the offset max(i(j) – I, 0) (simulated in the sketch below)
  – Given an O(1) approximation R to F0 at each point in the stream, set I = log R
  – Don't need to remember i(j) if i(j) < I, since j won't be used in the estimator
  – For the j with i(j) ≥ I, about one half of such j will have i(j) = I, about one fourth will have i(j) = I+1, etc.
  – Total number of bits to store the offsets is now only O(K) = O(ε^-2) with good probability at all points in the stream
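A toy simulation of the claimed O(K) offset bound, under loud simplifying assumptions: balls are spread evenly over columns, levels are fully random geometrics, and the base level is set near the expected per-column depth log2(F0/K) (a stand-in for the slide's I = log R normalization). The total offset cost stays roughly constant as F0 grows.

```python
import random

def offset_bits(F0, K):
    """Simulate the total cost of storing max(i(j) - I, 0) over K columns."""
    per_col = max(F0 // K, 1)
    base = per_col.bit_length() - 1          # base level I ~ log2(F0 / K)
    total = 0
    for _ in range(K):
        deepest = 0
        for _ in range(per_col):             # deepest geometric level hit
            lvl = 0
            while random.random() < 0.5:
                lvl += 1
            deepest = max(deepest, lvl)
        total += max(deepest - base, 0) + 1  # ~unary cost of one offset
    return total

random.seed(0)
for F0 in [10**4, 10**5, 10**6]:
    print(F0, offset_bits(F0, K=400))        # roughly constant in F0
```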

### The Constant Factor Approximation

• Previous algorithms state that at each point in the stream, with probability 1-δ, the output is an O(1) approximation to F0
  – The space of such algorithms is O(log n log 1/δ)
  – Union-bounding over a stream of length m gives O(log n log m) total space
• We achieve O(log n) space, and guarantee that the O(1) approximation R of the algorithm is non-decreasing (sketched below)
  – Apply the previous scheme on a (log n) x (log n / log log n) matrix
  – For each column, maintain the identity of the deepest row with value 1
  – Output 2^i, where i is the largest row containing a constant fraction of 1s
  – We repeat the procedure O(1) times, and output the median of the estimates
  – Can show the output is correct with probability 1 - O(1/log n)
  – Then we use the non-decreasing property to union-bound over O(log n) events
• We only increase the base level each time R increases by a factor of 2
  – Note that the base level never decreases
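A hedged sketch of this rough estimator (one-shot rather than streaming, with fully random hashing standing in for the talk's hash families, and without the machinery that makes R non-decreasing): per column, track the deepest level reached; take the deepest row hit by a constant fraction of columns; output 2^i normalized by the number of columns so the sketch is self-consistent; return the median of a few repetitions.

```python
import math
import random

def rough_estimate(stream, n, reps=5, frac=0.3):
    lg = max(int(math.log2(n)), 2)
    cols = max(lg // max(int(math.log2(lg)), 1), 1)  # log n / log log n columns
    estimates = []
    for rep in range(reps):
        deepest = [0] * cols              # deepest row with a 1, per column
        for i in stream:
            rng = random.Random("h%d-%s" % (rep, i))
            col = rng.randrange(cols)
            lvl = 0                       # geometric level: P(>= l) = 2^-l
            while rng.random() < 0.5:
                lvl += 1
            deepest[col] = max(deepest[col], min(lvl, lg))
        i_star = max(l for l in range(lg + 1)
                     if sum(d >= l for d in deepest) >= frac * cols)
        estimates.append(cols * 2 ** i_star)  # cols normalizes the slide's 2^i
    return sorted(estimates)[reps // 2]       # median of the repetitions

random.seed(1)
stream = [random.randrange(10**6) for _ in range(2000)]
print(len(set(stream)), rough_estimate(stream, 10**6))  # constant-factor approx
```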

### Running Time

• Blandford and Blelloch
  – Definition: a variable length array (VLA) is a data structure implementing an array C_1, …, C_n supporting the following operations:
    • Update(i, x) sets the value of C_i to x
    • Read(i) returns C_i
  – The C_i are allowed to have bit representations of varying lengths len(C_i)

  – Theorem: there is a VLA using O(n + Σ_i len(C_i)) bits of space supporting worst-case O(1) updates and reads, assuming the machine word size is at least log n
• Store our offsets in a VLA, giving O(1) update time for a fixed base level
• Occasionally we need to update the base level and decrement all offsets by 1
  – Show the base level only increases after Θ(ε^-2) updates, so this work can be spread across those updates, giving O(1) worst-case update time
  – Copy the data structure, and use the copy for performing this additional work so it doesn't interfere with reporting the correct answer (see the sketch below)
  – When the base level changes, switch to the copy
• For O(1) reporting time, maintain a count of non-zero containers in a level
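A toy illustration of the copy-and-switch deamortization (not the Blandford-Blelloch VLA itself, which additionally packs the offsets into bit-optimal space): while a rebase is pending, each update also migrates a constant number of entries into a shadow copy kept relative to base + 1, then the structure switches to the copy. The chunk parameter stands in for the Θ(ε^-2)-updates-between-rebases argument, and rebases are assumed far enough apart that migrations finish.

```python
class OffsetTable:
    """Per-column offsets above a base level, with deamortized rebasing."""
    def __init__(self, K, chunk=4):
        self.K, self.chunk = K, chunk
        self.offsets = [0] * K       # max(i(j) - base, 0) for each column j
        self.base = 0
        self.copy = None             # shadow table, relative to base + 1
        self.cursor = 0              # migration progress into the copy

    def raise_base(self):            # called each time R doubles
        self.copy, self.cursor = [0] * self.K, 0

    def update(self, j, level):
        self.offsets[j] = max(self.offsets[j], level - self.base, 0)
        if self.copy is not None:    # keep the shadow copy current, and
            self.copy[j] = max(self.copy[j], level - self.base - 1, 0)
            end = min(self.cursor + self.chunk, self.K)  # do O(1) extra work
            for t in range(self.cursor, end):
                self.copy[t] = max(self.copy[t], self.offsets[t] - 1, 0)
            self.cursor = end
            if self.cursor == self.K:                    # migration finished:
                self.offsets, self.base = self.copy, self.base + 1
                self.copy = None

t = OffsetTable(K=8, chunk=4)
t.update(3, level=5)
t.raise_base()               # R doubled: start migrating toward base 1
t.update(3, level=5)         # updates land in both tables during migration
t.update(1, level=2)         # this update completes the K=8 migration
print(t.base, t.offsets)     # base 1; offsets now relative to the new base
```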

### Outline for Remainder of Talk

• Proofs of the Upper Bounds
• Proofs of the Lower Bounds

### 1-Round Communication Complexity

Alice: input x. Bob: input y. Goal: compute f(x, y).

• Alice sends a single, randomized message M(x) to Bob
• Bob outputs g(M(x), y) for a randomized function g
• g(M(x), y) should equal f(x, y) with constant probability
• Communication cost CC(f) is |M(x)|, maximized over x and the random bits
• Reduction from streaming: Alice creates a string s(x), runs a randomized algorithm A on s(x), and transmits the state of A(s(x)) to Bob
• Bob creates a string s(y) and continues A on s(y), thus computing A(s(x) ∘ s(y))
• If A(s(x) ∘ s(y)) can be used to solve f(x, y), then space(A) ≥ CC(f)
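The reduction in the last three bullets, as a generic code pattern (a sketch; the fresh_state/update/output interface and the helper names are assumptions for illustration): the message is simply the streaming algorithm's memory contents, so solving f this way costs at most space(A) bits of communication.

```python
def one_way_protocol(A, make_s, decide, x, y):
    """Alice runs A on s(x) and ships the state; Bob resumes on s(y)."""
    state = A.fresh_state()
    for c in make_s(x):                   # Alice streams s(x) through A
        state = A.update(state, c)
    message = state                       # M(x): at most space(A) bits
    for c in make_s(y):                   # Bob continues A on s(y)
        message = A.update(message, c)
    return decide(A.output(message), y)   # Bob's answer g(M(x), y)
```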

### The Ω(log n) Bound

• Consider the equality function: f(x, y) = 1 if and only if x = y, for x, y ∈ {0,1}^(n/3)
• Well known that CC(f) = Ω(log n) for (n/3)-bit strings x and y
• Let C: {0,1}^(n/3) -> {0,1}^n be an error-correcting code with all codewords of Hamming weight n/10
  – If x = y, then C(x) = C(y)
  – If x != y, then Δ(C(x), C(y)) = Ω(n)
• Let s(x) be any string on alphabet size n with the i-th character appearing in s(x) if and only if C(x)_i = 1. Similarly define s(y)
• If x = y, then F0(s(x) ∘ s(y)) = n/10. Else, F0(s(x) ∘ s(y)) = n/10 + Ω(n)
• A constant factor approximation to F0 solves f(x, y)
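A toy run of this reduction, where a random fixed-weight encoding stands in for the real error-correcting code (an assumption: random weight-n/10 supports overlap in few positions with high probability, so the Ω(n) distance holds only probabilistically here):

```python
import random

def encode(x, n):                  # stand-in for C: a weight-n/10 "codeword"
    rng = random.Random("code" + x)
    return rng.sample(range(n), n // 10)

def s(codeword):                   # character i appears iff codeword bit i is 1
    return ["ch%d" % i for i in codeword]

def f0(string):
    return len(set(string))

n = 3000
for x, y in [("abc", "abc"), ("abc", "abd")]:
    print(x == y, f0(s(encode(x, n)) + s(encode(y, n))))  # 300 vs ~570
```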


### The Ω(ε^-2) Bound

• Let r = 1/ε^2. Gap Hamming promise problem for x, y ∈ {0,1}^r:
  – f(x, y) = 1 if Δ(x, y) > r/2
  – f(x, y) = 0 if Δ(x, y) < r/2 - 1/ε
• Theorem: CC(f) = Ω(ε^-2)
  – Can prove this from the Indexing function g: Alice has w ∈ {0,1}^r, Bob has i in {1, 2, …, r}, and the output is g(w, i) = w_i
  – Well-known that CC(g) = Ω(r)
• Proof that CC(f) = Ω(r):
  – Alice sends the seed of a pseudorandom generator to Bob, so the parties have common random strings z_1, …, z_r ∈ {0,1}^r
  – Alice sets x = coordinate-wise-majority{z_j | w_j = 1}
  – Bob sets y = z_i
  – Since the z_j are random, if w_i = 1, then by properties of majority, with good probability Δ(x, y) < r/2 - 1/ε; otherwise it is likely that Δ(x, y) > r/2
  – Repeat a few times to get concentration (illustrated in the sketch below)
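A toy run of the majority construction (a sketch: shared randomness replaces the PRG seed, and the separation around r/2 is only about 1/ε against sampling noise of the same order, which is why the slide's final repetition step is needed):

```python
import random

random.seed(2)
r = 400                                   # r = 1/eps^2, so 1/eps = 20
z = [[random.randrange(2) for _ in range(r)] for _ in range(r)]
w = [random.randrange(2) for _ in range(r)]
ones = [j for j in range(r) if w[j] == 1]
# Alice: coordinate-wise majority over the z_j with w_j = 1
x = [int(2 * sum(z[j][t] for j in ones) > len(ones)) for t in range(r)]
for i in random.sample(range(r), 5):      # Bob: y = z_i
    delta = sum(a != b for a, b in zip(x, z[i]))
    print(w[i], delta, "(r/2 =", r // 2, ")")  # w_i = 1 pulls delta below r/2
```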

### The Ω(ε^-2) Bound Continued

• Need to create strings s(x) and s(y) so that F0(s(x) ∘ s(y)) decides whether Δ(x, y) > r/2 or Δ(x, y) < r/2 - 1/ε
• Let s(x) be a string on n characters where the i-th character appears if and only if x_i = 1. Similarly define s(y)
• F0(s(x) ∘ s(y)) = (wt(x) + wt(y) + Δ(x, y))/2 (verified in the sketch below)
  – Alice sends wt(x) to Bob
• A calculation shows that a (1+ε) approximation to F0(s(x) ∘ s(y)), together with wt(x) and wt(y), solves the Gap-Hamming problem
• Total communication is space(A) + log 1/ε = Ω(ε^-2)
• It follows that space(A) = Ω(ε^-2)
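A quick check of the counting identity above: each character of s(x) ∘ s(y) appears iff the corresponding coordinate is 1 in x or in y, and |x ∨ y| = (wt(x) + wt(y) + Δ(x, y))/2 because wt(x) + wt(y) counts common 1s twice while Δ counts one-sided 1s once.

```python
import random

random.seed(0)
r = 100
x = [random.randrange(2) for _ in range(r)]
y = [random.randrange(2) for _ in range(r)]
f0 = sum(a | b for a, b in zip(x, y))       # F0(s(x) o s(y)) = |x OR y|
delta = sum(a != b for a, b in zip(x, y))   # Hamming distance
assert f0 == (sum(x) + sum(y) + delta) // 2
print(f0)
```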

### Conclusion

• Combining upper and lower bounds, the streaming complexity of estimating F0 up to a (1+ε) factor is Θ(ε^-2 + log n) bits of space and Θ(1) update and reporting time
• Upper bounds based on a careful combination of efficient hashing, sampling, and various data structures
• Lower bounds come from 1-way communication complexity