CS Forum Annual Meeting 2003
Randomization for Massive and Streaming Data Sets
Rajeev Motwani
May 21, 2003
CS Forum Annual Meeting
Data Stream Management Systems
Traditional DBMS – data stored in finite, persistent data sets
Data Streams – distributed, continuous, unbounded, rapid, time-varying, noisy, …
Emerging DSMS – variety of modern applications
Network monitoring and traffic engineering
Telecom call records
Network security
Financial applications
Sensor networks
Manufacturing processes
Web logs and clickstreams
Massive data sets
DSMS – Big Picture
[Figure: DSMS architecture – input streams enter the DSMS, which maintains a scratch store, an archive, and stored relations; users register queries and receive streamed or stored results.]
Algorithmic Issues
Computational Model
Streaming data (or, secondary memory)
Bounded main memory
Techniques
New paradigms
Negative Results and Approximation
Randomization
Complexity Measures
Memory
Time per item (online, real-time)
# Passes (linear scan in secondary memory)
Stream Model of Computation
[Figure: a stream of bits arrives; bounded main memory holds synopsis data structures.]
Memory: poly(1/ε, log N)
Query/Update Time: poly(1/ε, log N)
N: # items so far, or window size
ε: error parameter
“Toy” Example – Network Monitoring
[Figure: network measurements and packet traces stream into the DSMS, which uses a scratch store, an archive, and lookup tables; registered monitoring queries produce intrusion warnings and online performance metrics.]
Frequency Related Problems
Analytics on Packet Headers – IP Addresses
Top-k most frequent elements
Find all elements with frequency > 0.1%
Find elements that occupy 0.1% of the tail
What is the frequency of element 3?
What is the total frequency of elements between 8 and 14?
How many elements have non-zero frequency?
Mean + Variance? Median?
[Figure: frequency histogram over elements 1–20.]
Example 1 – Distinct Values
Input Sequence X = x1, x2, …, xn, …
Domain U = {0, 1, 2, …, u-1}
Compute D(X), the number of distinct values
Remarks
Assume stream size n is finite/known
(generally, n is window size)
Domain could be arbitrary (e.g., text, tuples)
Naïve Approach
Counter C(i) for each domain value i
Initialize counters C(i) ← 0
Scan X incrementing appropriate counters
Problem
Memory size M << n
Space O(u) – possibly u >> n
(e.g., when counting distinct words in web crawl)
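For reference, a minimal sketch of this naïve method (illustrative Python, not from the talk): one counter per domain value, hence O(u) space.

```python
def distinct_values_exact(stream, u):
    """Naive approach: one counter C(i) per domain value i in {0, ..., u-1}."""
    C = [0] * u                         # initialize counters C(i) <- 0; O(u) space
    for x in stream:                    # scan X, incrementing the appropriate counter
        C[x] += 1
    return sum(1 for c in C if c > 0)   # D(X) = number of non-zero counters
```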
Negative Result
Theorem:
Deterministic algorithms need M = Ω(n log u) bits
Proof: Information-theoretic arguments
Note: Leaves open randomization/approximation
Randomized Algorithm
h: U → [1..t]
[Figure: the input stream is hashed into a t-bucket hash table with chaining.]
Analysis
Random h ⇒ few collisions & average list-size O(n/t)
Thus
Space: O(n) – since we need t = Ω(n)
Time: O(1) per item [expected]
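A minimal sketch of this exact hash-based approach (Python's built-in hash set stands in for the random h and the chained table); fast per item, but the space still grows linearly.

```python
def distinct_values_hashed(stream):
    """Exact distinct count via hashing; Python's set plays the role of the
    hash table (collisions are resolved internally)."""
    table = set()
    for x in stream:        # expected O(1) time per item
        table.add(x)
    return len(table)       # space still grows as Omega(n) in the worst case
```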
Improvement via Sampling?
Sample-based Estimation
Random Sample R (of size r) of n values in X
Compute D(R)
Estimator E = D(R) x n/r
Benefit – sublinear space
Cost – estimation error is high
Why? – low-frequency values underrepresented
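A sketch of this estimator (illustrative; X is assumed to be materialized as a list so a uniform sample can be drawn):

```python
import random

def sampled_estimate(X, r, seed=0):
    """Sample-based estimator: draw a uniform random sample R of size r from X
    and scale the number of distinct values in R up to the whole input."""
    R = random.Random(seed).sample(X, r)    # random sample R of size r
    return len(set(R)) * len(X) / r         # E = D(R) * n / r
```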
Negative Result for Sampling
Consider estimator E of D(X) examining r items in X,
possibly in an adaptive/randomized fashion.
Theorem: For any δ > e^(−r), E has relative error at least
√( (n−r)/(2r) · ln(1/δ) )
with probability at least δ.
Remarks
r = n/10 ⇒ Error 75% with probability ½
Leaves open randomization/approximation on full scans
Randomized Approximation
Simplified Problem – For fixed t, is D(X) >> t?
Choose hash function h: U → [1..t]
Initialize answer to NO
For each xi, if h(xi) = t, set answer to YES
[Figure: each stream item is hashed by h: U → [1..t]; a single Boolean flag records YES/NO.]
Observe – need only 1 bit of memory!
Theorem:
If D(X) < t, P[output NO] > 0.25
If D(X) > 2t, P[output NO] < 0.14
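A minimal sketch of this test (Python's salted hash stands in for the random h; the salting trick is an implementation assumption, not from the talk):

```python
import random

def one_bit_test(stream, t, seed=0):
    """Single 1-bit test: answer YES iff some item hashes to bucket t
    under a (salted) hash function h: U -> [1..t]."""
    salt = random.Random(seed).getrandbits(64)   # fixes one random hash function
    answer = False                               # the single bit of state
    for x in stream:
        if hash((salt, x)) % t == t - 1:         # h(x) == t
            answer = True
    return answer
```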
Analysis
Let – Y be set of distinct elements of X
output NO ⇔ no element of Y hashes to t
P [element hashes to t] = 1/t
Thus – P[output NO] = (1 − 1/t)^|Y|
Since |Y| = D(X):
D(X) < t ⇒ P[output NO] > (1 − 1/t)^t > 0.25
D(X) > 2t ⇒ P[output NO] < (1 − 1/t)^(2t) < 1/e² ≈ 0.14
Boosting Accuracy
With 1 bit ⇒ distinguish D(X) < t from D(X) > 2t
Running O(log 1/δ) instances in parallel
⇒ reduces the error probability to any desired δ > 0
Running O(log n) instances in parallel for t = 1, 2, 4, 8, …, n
⇒ can estimate D(X) within factor 2
Choice of multiplier 2 is arbitrary
can use factor (1+ε) to reduce error to ε
Theorem: Can estimate D(X) within factor (1±ε)
with probability (1-δ) using space
O((log n)/ε² · log(1/δ))
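One way to turn the boosting idea into code; a rough single-pass sketch under illustrative choices of my own (the 0.8 vote threshold, the salted-hash trick, and the parameter max_n bounding the stream size are all assumptions), not a tuned estimator:

```python
import random

def estimate_distinct(stream, max_n, k=30, seed=0):
    """Rough factor-2 estimate of D(X) by boosting the 1-bit test:
    for t = 2, 4, 8, ... run k independent tests in parallel (single pass)
    and report the largest t at which most tests still answer YES."""
    rng = random.Random(seed)
    thresholds = []
    t = 2
    while t <= 2 * max_n:
        thresholds.append(t)
        t *= 2
    salts = [[rng.getrandbits(64) for _ in range(k)] for _ in thresholds]
    flags = [[False] * k for _ in thresholds]

    for x in stream:                                   # one pass over the stream
        for i, ti in enumerate(thresholds):
            for j in range(k):
                if hash((salts[i][j], x)) % ti == 0:   # x hits the marked bucket
                    flags[i][j] = True

    estimate = 1
    for i, ti in enumerate(thresholds):
        if sum(flags[i]) / k >= 0.8:    # vote threshold between 0.75 and ~0.86
            estimate = ti               # D(X) is unlikely to be far below ti
    return estimate
```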
Example 2 – Elephants-and-Ants
Given a stream of items, identify those whose current frequency exceeds the support threshold s = 0.1%.
[Jacobson 2000, Estan-Verghese 2001]
Algorithm 1: Lossy Counting
Step 1: Divide the stream into ‘windows’
[Figure: the stream split into consecutive windows 1, 2, 3, …]
Window-size W is a function of the support s – specified later…
Lossy Counting in Action ...
[Figure: an initially empty table of frequency counts absorbs the counts of the first window.]
At the window boundary, decrement all counters by 1
Lossy Counting continued ...
[Figure: the counts of the next window are added to the existing frequency counts.]
At the window boundary, decrement all counters by 1
Error Analysis
How much do we undercount?
If current size of stream = N and window-size W = 1/ε,
then frequency error ≤ # windows = εN
Rule of thumb:
Set ε = 10% of support s
Example:
Given support frequency s = 1%,
set error frequency
ε = 0.1%
Putting it all together…
Output:
Elements with counter values exceeding (s-ε)N
Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s–ε)N
How many counters do we need?
Worst case bound: 1/ε log εN counters
Implementation details…
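A compact sketch of the windowed algorithm as presented (variable names are illustrative; dropping counters that reach zero is a detail taken from the Manku-Motwani paper rather than these slides):

```python
def lossy_counting(stream, s, epsilon):
    """Sketch of windowed Lossy Counting. Windows have size W = 1/epsilon;
    at each window boundary every counter is decremented by 1, and counters
    that reach 0 are dropped."""
    W = int(1 / epsilon)                 # window size
    counts = {}                          # item -> counter
    N = 0                                # stream length so far
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
        N += 1
        if N % W == 0:                   # window boundary
            for item in list(counts):
                counts[item] -= 1
                if counts[item] == 0:
                    del counts[item]
    # output elements whose counter exceeds (s - epsilon) * N
    return {item: c for item, c in counts.items() if c >= (s - epsilon) * N}
```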
Algorithm 2: Sticky Sampling
Create counters by sampling
Maintain exact counts thereafter
What is the sampling rate?
[Figure: a few items sampled from the stream (e.g., 28, 31, 41, 23, 35, …) receive counters.]
Sticky Sampling contd...
For a finite stream of length N
Sampling rate = 2/(εN) · log(1/(sδ)), where δ = probability of failure
Output:
Elements with counter values exceeding (s-ε)N
Approximation guarantees (probabilistic)
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s-ε)N
Same rule of thumb: set ε = 10% of support s
Same error guarantees as Lossy Counting, but probabilistic
Example:
Given support threshold s = 1%,
set error threshold ε = 0.1%
set failure probability δ = 0.01%
Number of counters?
Finite stream of length N
Sampling rate: 2/(εN) · log(1/(sδ))
Infinite stream with unknown N
Gradually adjust sampling rate
In either case,
Expected number of counters = 2/ε · log(1/(sδ))
Independent of N
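A minimal sketch of the finite-stream case (the infinite-stream variant that gradually adjusts the rate is omitted; the rate formula is the reconstructed one above and should be read accordingly):

```python
import math
import random

def sticky_sampling_finite(stream, N, s, epsilon, delta, seed=0):
    """Sketch of Sticky Sampling for a finite stream of known length N.
    An unseen item gets a counter with the sampling probability below;
    once an item has a counter, it is counted exactly."""
    rng = random.Random(seed)
    rate = (2.0 / (epsilon * N)) * math.log(1.0 / (s * delta))   # sampling rate
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1               # exact counting after the first sample
        elif rng.random() < rate:
            counts[x] = 1                # create a counter by sampling
    # output elements whose counter exceeds (s - epsilon) * N
    return {item: c for item, c in counts.items() if c >= (s - epsilon) * N}
```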
Example 3 – Correlated Attributes
      C1  C2  C3  C4  C5
R1     1   1   1   1   0
R2     1   1   0   1   0
R3     1   0   0   1   0
R4     0   0   1   0   1
R5     1   1   1   0   1
R6     1   1   1   1   1
R7     0   1   1   1   1
R8     0   1   1   1   0
…      …   …   …   …   …
Input Stream – items with boolean attributes
Matrix – M(r,c) = 1 ⇔ Row r has Attribute c
Identify – Highly-correlated column-pairs
Correlation Similarity
View column as set of row-indexes
(where it has 1’s)
Set Similarity (Jaccard measure)
sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
Example
Ci = (0, 1, 1, 0, 1, 0)
Cj = (1, 0, 1, 0, 1, 1)
sim(Ci, Cj) = 2/5 = 0.4
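The measure is small enough to state directly in code; a sketch that reproduces the example:

```python
def jaccard(ci, cj):
    """Jaccard similarity of two boolean columns, viewed as sets of row indexes."""
    rows_i = {r for r, bit in enumerate(ci) if bit}
    rows_j = {r for r, bit in enumerate(cj) if bit}
    return len(rows_i & rows_j) / len(rows_i | rows_j)

# reproduces the example above:
# jaccard([0, 1, 1, 0, 1, 0], [1, 0, 1, 0, 1, 1]) == 0.4
```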
Identifying Similar Columns?
Goal – finding candidate pairs in small memory
Signature Idea
Hash columns Ci to small signature sig(Ci)
Set of signatures fits in memory
sim(Ci,Cj) approximated by sim(sig(Ci),sig(Cj))
Naïve Approach
Sample P rows uniformly at random
Define sig(Ci) as P bits of Ci in sample
Problem
sparsity – sample would get only 0's in most columns,
missing the interesting parts
Key Observation
For columns Ci, Cj, four types of rows
Type   Ci   Cj
A       1    1
B       1    0
C       0    1
D       0    0
Overload notation: A = # rows of type A
Observation
sim(Ci, Cj) = A / (A + B + C)
Min Hashing
Randomly permute rows
Hash h(Ci) = index of first row with 1 in column Ci
Surprising Property
P[h(Ci) = h(Cj)] = sim(Ci, Cj)
Why? Both equal A/(A+B+C):
look down columns Ci, Cj until the first non-Type-D row;
h(Ci) = h(Cj) exactly when that row is of Type A
Min-Hash Signatures
Pick – k random row permutations
Min-Hash Signature
sig(C) = k indexes of first rows with 1 in column C
Similarity of signatures
Define: sim(sig(Ci),sig(Cj)) = fraction of
permutations where Min-Hash values agree
Lemma E[sim(sig(Ci),sig(Cj))] = sim(Ci,Cj)
Example
      C1  C2  C3
R1     1   0   1
R2     0   1   1
R3     1   0   0
R4     1   0   1
R5     0   1   0

Signatures
                   S1  S2  S3
Perm 1 = (12345):   1   2   1
Perm 2 = (54321):   4   5   4
Perm 3 = (34512):   3   5   4

Similarities   1-2   1-3   2-3
Col-Col        0.00  0.50  0.25
Sig-Sig        0.00  0.67  0.00
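A tiny sketch that reproduces the example (rows are 1-indexed; reading a permutation as the order in which rows are visited is how the numbers above are obtained):

```python
def minhash_signature(column, perm):
    """Min-hash value: the original index of the first row (in permuted order)
    that has a 1 in this column. Rows are 1-indexed to match the example."""
    for row in perm:
        if column[row - 1] == 1:
            return row
    return None

def signature_similarity(sig_a, sig_b):
    """Fraction of permutations on which the min-hash values agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

C1, C2, C3 = [1, 0, 1, 1, 0], [0, 1, 0, 0, 1], [1, 1, 0, 1, 0]
perms = [(1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 4, 5, 1, 2)]
sigs = {name: [minhash_signature(col, p) for p in perms]
        for name, col in [("S1", C1), ("S2", C2), ("S3", C3)]}
# sigs == {'S1': [1, 4, 3], 'S2': [2, 5, 5], 'S3': [1, 4, 4]}
# signature_similarity(sigs['S1'], sigs['S3']) == 2/3, matching the 0.67 above
```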
Implementation Trick
Permuting rows even once is prohibitive
Row Hashing
Pick k hash functions hk: {1,…,n} → {1,…,O(n)}
Ordering under hk gives random row permutation
One-pass implementation
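A sketch of the one-pass, hash-based variant (Python's salted hash stands in for the functions hk; storing the minimum hash value per column, rather than a row index, is a common simplification and not necessarily what the talk did):

```python
import random

def minhash_signatures(stream, num_columns, k, seed=0):
    """One-pass min-hash via row hashing instead of explicit permutations.
    `stream` yields (row, column) pairs for every 1-entry of the matrix;
    ordering rows by hash value plays the role of a random permutation."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]     # k hash functions
    sig = [[float("inf")] * num_columns for _ in range(k)]
    for row, col in stream:
        for j, salt in enumerate(salts):
            hval = hash((salt, row))                    # hj(row)
            if hval < sig[j][col]:
                sig[j][col] = hval                      # keep the minimum per column
    return sig
```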
Comparing Signatures
Signature Matrix S
Rows = Hash Functions
Columns = Columns
Entries = Signatures
Need – Pair-wise similarity of signature columns
Problem
MinHash fits column signatures in memory
But comparing signature-pairs takes too much time
Limiting candidate pairs – Locality Sensitive Hashing
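The talk only names LSH here; one standard (textbook) way to limit candidate pairs is banding, sketched below as an illustration rather than as the talk's exact scheme:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sig, bands):
    """Banding trick: split the signature matrix's rows into bands; columns whose
    signatures agree on an entire band fall into the same bucket and become
    candidate pairs, which are then compared exactly."""
    k, num_columns = len(sig), len(sig[0])
    rows_per_band = k // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for c in range(num_columns):
            key = tuple(sig[j][c]
                        for j in range(b * rows_per_band, (b + 1) * rows_per_band))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates
```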
Summary
New algorithmic paradigms needed for
streams and massive data sets
Negative results abound
Need to approximate
Power of randomization
Thank You!
References
Rajeev Motwani (http://theory.stanford.edu/~rajeev)
STREAM Project (http://www-db.stanford.edu/stream)
STREAM: The Stanford Stream Data Manager. Bulletin of
the Technical Committee on Data Engineering 2003.
Motwani et al. Query Processing, Approximation, and
Resource Management in a Data Stream Management System.
CIDR 2003.
Babcock-Babu-Datar-Motwani-Widom. Models and Issues in
Data Stream Systems. PODS 2002.
Manku-Motwani. Approximate Frequency Counts over
Streaming Data. VLDB 2003.
Babcock-Datar-Motwani-O’Callahan. Maintaining Variance and
K-Medians over Data Stream Windows. PODS 2003.
Guha-Meyerson-Mishra-Motwani-O’Callahan. Clustering Data
Streams: Theory and Practice. IEEE TKDE 2003.
References (contd)
Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics
over Sliding Windows. SIAM Journal on Computing 2002.
Babcock-Datar-Motwani. Sampling From a Moving Window
Over Streaming Data. SODA 2002.
O’Callahan-Guha-Mishra-Meyerson-Motwani. HighPerformance Clustering of Streams and Large Data Sets.
ICDE 2003.
Guha-Mishra-Motwani-O’Callagahan. Clustering Data Streams.
FOCS 2000.
Cohen et al. Finding Interesting Associations without
Support Pruning. ICDE 2000.
Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation
Error Guarantees for Distinct Values. PODS 2000.
Gionis-Indyk-Motwani. Similarity Search in High Dimensions
via Hashing. VLDB 1999.
Indyk-Motwani. Approximate Nearest Neighbors: Towards
Removing the Curse of Dimensionality. STOC 1998.