Spark Streaming Preview
Fault-Tolerant Stream Processing at Scale
Matei Zaharia, Tathagata Das,
Haoyuan Li, Scott Shenker, Ion Stoica
UC BERKELEY
Motivation
• Many important applications need to process
large data streams arriving in real time
– User activity statistics (e.g. Facebook’s Puma)
– Spam detection
– Traffic estimation
– Network intrusion detection
• Our target: large-scale apps that need to run
on tens to hundreds of nodes with O(1 sec) latency
System Goals
• Simple programming interface
• Automatic fault recovery (including state)
• Automatic straggler recovery
• Integration with batch & ad-hoc queries (want one API for all your data analysis)
Traditional Streaming Systems
• “Record-at-a-time” processing model
– Each node has mutable state
– Event-driven API: for each record, update state
and send out new records
[Diagram: input records flow into node 1 and node 2, each holding mutable state; updated records are pushed on to node 3]
Challenges with Traditional Systems
• Fault tolerance
– Either replicate the whole system (costly) or use
upstream backup (slow to recover)
• Stragglers (typically not handled)
• Consistency (few guarantees across nodes)
• Hard to unify with batch processing
Our Model: “Discretized Streams”
• Run each streaming computation as a series of
very small, deterministic batch jobs
– E.g. a MapReduce every second to count tweets
• Keep state in memory across jobs
– New Spark operators allow “stateful” processing
• Recover from faults/stragglers in same way as
MapReduce (by rerunning tasks in parallel)
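To make the micro-batch model concrete, here is a minimal sketch in plain Scala using only core Spark (not the streaming API described later): each interval's records become an immutable RDD, a small deterministic batch job runs on them, and the resulting state RDD is carried into the next interval. The receiveOneInterval helper, the local master, and the ten-interval loop are illustrative assumptions, not part of the system.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object DiscretizationSketch {
  // Hypothetical stand-in for a receiver: the raw records that arrived during one interval.
  def receiveOneInterval(): Seq[String] = Seq.empty
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "DiscretizationSketch")
    // Running state, kept in memory across intervals as an immutable RDD of (word, count) pairs.
    var counts: RDD[(String, Int)] = sc.emptyRDD[(String, Int)]
    for (t <- 1 to 10) {                                      // one iteration per 1-second interval
      val input = sc.parallelize(receiveOneInterval())        // interval t's immutable input dataset
      val batch = input.map(w => (w, 1)).reduceByKey(_ + _)   // small, deterministic batch job
      counts = counts.union(batch).reduceByKey(_ + _).cache() // new state = old state merged with batch
    }
    println(counts.collect().mkString(", "))
    sc.stop()
  }
}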
Discretized Streams in Action
[Diagram: at each interval (t = 1, t = 2, …), input from stream 1 and stream 2 is stored reliably as an immutable dataset; a batch operation then produces an immutable output/state dataset, kept in memory as a Spark RDD]
Example: View Count
• Keep a running count of views to each webpage
views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)
[Diagram: at each interval (t = 1, t = 2, …), the views dataset is mapped into ones and reduced into counts; each box is a dataset made up of partitions]
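readStream and runningReduce above are preview names from this talk; as a rough sketch, the same running view count in the DStream API that Spark Streaming later shipped could look like the following, where the socket host, port, and checkpoint directory are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object ViewCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ViewCountSketch")
    val ssc = new StreamingContext(conf, Seconds(1))     // 1-second batches
    ssc.checkpoint("/tmp/viewcount-checkpoints")          // required by stateful operators
    // Placeholder source: one viewed URL per line from a socket.
    val views = ssc.socketTextStream("localhost", 9999)
    val ones = views.map(url => (url, 1))
    // Running count per URL, the analogue of the slide's runningReduce.
    val counts = ones.updateStateByKey[Int] { (newOnes: Seq[Int], total: Option[Int]) =>
      Some(total.getOrElse(0) + newOnes.sum)
    }
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}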
Fault Recovery
• Checkpoint state datasets periodically
• If a node fails or straggles, rebuild its data in parallel on other nodes using the dependency graph
[Diagram: lost partitions of the output dataset are recomputed in parallel by re-running the map on the corresponding input partitions]
Fast recovery without the cost of full replication
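Continuing the view-count sketch above, periodic state checkpointing in the shipped API amounts to pointing the context at a reliable directory and giving the state D-stream a checkpoint interval; the directory is a placeholder and the 30-second interval mirrors the checkpoint interval used in the evaluation later in the talk.
// ssc and counts are the StreamingContext and stateful D-stream from the sketch above.
ssc.checkpoint("/tmp/viewcount-checkpoints")   // directory where checkpointed state RDDs are written
counts.checkpoint(Seconds(30))                 // write the counts state to reliable storage every 30 s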
How Fast Can It Go?
• Currently handles 4 GB/s of data (42 million
records/s) on 100 nodes at sub-second latency
• Recovers from failures/stragglers within 1 sec
Outline
• Introduction
• Programming interface
• Implementation
• Early results
• Future development
D-Streams
• A discretized stream is a sequence of immutable,
partitioned datasets
– Specifically, each dataset is an RDD (resilient
distributed dataset), the storage abstraction in Spark
– Each RDD remembers how it was created, and can
recover if any part of the data is lost
D-Streams
• D-Streams can be created…
– either from live streaming data
– or by transforming other D-streams
• Programming with D-Streams is very similar to
programming with RDDs in Spark
D-Stream Operators
• Transformations
– Build new streams from existing streams
– Include existing Spark operators, which act on each
interval in isolation, plus new “stateful” operators
• Output operators
– Send data to outside world (save results to external
storage, print to screen, etc)
Example 1
Count the words received every second
words = readStream("http://...", Seconds(1))
counts = words.count()
[Diagram: words and counts are D-Streams; in each interval (time = 0-1, 1-2, 2-3, …), the count transformation turns that interval's words RDD into a counts RDD]
Demo
• Setup
– 10 EC2 m1.xlarge instances
– Each instance receives a stream of sentences at 1 MB/s, for a total of 10 MB/s
• Spark Streaming receives the sentences and
processes them
Example 2
Count frequency of words received every second
words = readStream("http://...", Seconds(1))
ones = words.map(w => (w, 1))          // w => (w, 1) is a Scala function literal
freqs = ones.reduceByKey(_ + _)
[Diagram: in each interval (time = 0-1, 1-2, 2-3, …), words is mapped into ones, which is reduced into freqs]
Demo
Example 3
Count frequency of words received in last minute
ones = words.map(w => (w, 1))
freqs = ones.reduceByKey(_ + _)
freqs_60s = freqs.window(Seconds(60), Seconds(1))   // sliding window operator: 60 s window length, 1 s window movement
                 .reduceByKey(_ + _)
[Diagram: in each interval, words is mapped into ones and reduced into freqs; a 60-second sliding window over freqs is reduced again into freqs_60s]
Simpler running reduce
freqs = ones.reduceByKey(_ + _)
freqs_60s = freqs.window(Seconds(60), Seconds(1))
                 .reduceByKey(_ + _)
freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))
Demo
“Incremental” window operators
[Diagram: a plain windowed reduce re-aggregates every per-second freqs dataset in the window with the aggregation function (+); the incremental version derives each window's freqs_60s from the previous window by adding the newest interval and subtracting the one that dropped out, which requires an invertible aggregation function (+ and –)]
Smarter running reduce
freqs = ones.reduceByKey(_ + _)
freqs_60s = freqs.window(Seconds(60), Seconds(1))
                 .reduceByKey(_ + _)
freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))
freqs = ones.reduceByKeyAndWindow(
  _ + _, _ - _, Seconds(60), Seconds(1))
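For reference, the incremental (invertible) form carries over almost unchanged to the API that Spark Streaming later shipped. The sketch below is self-contained but uses a placeholder host, port, and checkpoint directory; checkpointing must be enabled for the inverse-function variant.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SlidingWordFreqSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SlidingWordFreqSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/sliding-checkpoints")          // required when an inverse function is used
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val ones = words.map(w => (w, 1))
    // Sliding count over the last 60 s, updated every 1 s by adding the newest second's
    // counts and subtracting the counts of the second that just left the window.
    val freqs60s = ones.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(1))
    freqs60s.print()
    ssc.start()
    ssc.awaitTermination()
  }
}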
Output Operators
• save: write results to any Hadoop-compatible
storage system (e.g. HDFS, HBase)
freqs.save("hdfs://...")
• foreachRDD: run a Spark function on each RDD
words.foreachRDD(wordsRDD => {
  // any Spark/Scala processing, maybe save to a database
})
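save and foreachRDD are shown here with the preview names; in the API that later shipped, the closest counterparts are saveAsTextFiles (or saveAsHadoopFiles) and foreachRDD. A sketch, reusing the freqs stream of (word, count) pairs from the earlier examples; the output prefix is a placeholder.
// Each interval writes one directory of part files under the (placeholder) prefix.
freqs.saveAsTextFiles("hdfs://namenode:8020/output/freqs")
freqs.foreachRDD { rdd =>
  // This closure runs on the driver once per interval; actions on rdd run on the cluster.
  // Here we just pull a small sample back, where a real job might write to a database.
  rdd.take(10).foreach(println)
}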
Live + Batch + Interactive
• Combining D-streams with historical datasets
pageViews.join(historicCounts).map(...)
• Interactive queries on stream state from the
Spark interpreter
pageViews.slice("21:00", "21:05").topK(10)
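As a rough sketch of the live-plus-batch combination in the shipped API: an ordinary RDD of historical counts can be joined against each interval of a D-stream via transform. Here pageViews is assumed to be a D-stream of (url, count) pairs, and the history path and its tab-separated file format are assumptions for illustration.
// The history file layout "<url>\t<count>" is assumed, as is the path.
val historicCounts = ssc.sparkContext
  .textFile("hdfs://namenode:8020/history/page-counts")
  .map { line => val Array(url, n) = line.split("\t"); (url, n.toInt) }
// transform exposes each interval's RDD, so ordinary batch operators such as join apply.
val joined = pageViews.transform(rdd => rdd.join(historicCounts))
joined.print()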
Outline
• Introduction
• Programming interface
• Implementation
• Early results
• Future development
System Architecture
Built on an optimized version of Spark
[Diagram: clients send records to workers; each worker runs an input receiver, task execution, and a block manager; the master maintains D-stream lineage, a task scheduler, and a block tracker; input and checkpoint RDDs are replicated across workers]
Implementation
Optimizations on current Spark:
– New block store
• APIs: Put(key, value, storage level), Get(key)
– Optimized scheduling for <100ms tasks
• Bypass Mesos cluster scheduler (tens of ms)
– Fast NIO communication library
– Pipelining of jobs from different time intervals
Evaluation
• Ran on up to 100 “m1.xlarge” machines on
EC2
– 4 cores, 15 GB RAM each
• Three applications:
– Grep: count lines matching a pattern
– Sliding word count
– Sliding top K words
Scalability
[Charts: maximum cluster throughput (GB/s) vs. number of nodes (0-100) for Grep, WordCount, and TopKWords, each with 1 sec and 2 sec latency targets; 100-byte records (100K-500K records/s/node)]
Performance vs Storm and S4
[Charts: Grep throughput and TopK throughput (MB/s/node) for Spark and Storm at record sizes of 100 and 10,000 bytes]
• Storm limited to 10,000 records/s/node
• Also tried S4: 7000 records/s/node
• Commercial systems report ~100K records/s aggregated
Fault Recovery
• Recovers from failures within 1 second
[Chart: interval processing time (s) vs. time (0-75 s) for sliding WordCount on 10 nodes with a 30 s checkpoint interval; a marker shows when the failure happens]
Failures: [Chart: interval processing time (s) before failure and at the time of failure, for WordCount with 30 s checkpoints, WordMax with no checkpoints, and WordMax with 10 s checkpoints; reported values range from 1.47 s to 2.64 s]
Stragglers: [Chart: interval processing time (s) for WordCount and Grep with no straggler vs. with a straggler and speculation enabled; reported values range from 0.66 s to 1.09 s]
Interactive Ad-Hoc Queries
Outline
• Introduction
• Programming interface
• Implementation
• Early results
• Future development
Future Development
• An alpha of discretized streams will go into
Spark by the end of the summer
• Engine improvements from the Spark Streaming project are already there (in the "dev" branch)
• Together, these make Spark a powerful platform for both batch and near-real-time analytics
Future Development
• Other things we’re working on/thinking of:
– Easier deployment options (standalone & YARN)
– Hadoop-based deployment (run as Hadoop job)?
– Run Hadoop mappers/reducers on Spark?
– Java API?
• Need your feedback to prioritize these!
More Details
• You can find more about Spark Streaming in
our paper: http://tinyurl.com/dstreams
Related Work
• Bulk incremental processing (CBP, Comet)
– Periodic (~5 min) batch jobs on Hadoop/Dryad
– On-disk, replicated FS for storage instead of RDDs
• Hadoop Online
– Does not recover stateful ops or allow multi-stage jobs
• Streaming databases
– Record-at-a-time processing, generally replication for FT
• Approximate query processing, load shedding
– Do not support the loss of arbitrary nodes
– Different math because drop rate is known exactly
• Parallel recovery (MapReduce, GFS, RAMCloud, etc)
Timing Considerations
• D-streams group input into intervals based on
when records arrive at the system
• For apps that need to group by an “external”
time and tolerate network delays, support:
– Slack time: delay starting a batch for a short fixed
time to give records a chance to arrive
– Application-level correction: e.g. give a result for
time t at time t+1, then use later records to update
incrementally at time t+5
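As a rough illustration of the application-level correction idea (not code from the talk): if every record carries its own timestamp, state can be keyed by the record's event-time bucket, so a record that arrives a few intervals late still updates the bucket it belongs to. The "<epochMillis> <word>" line format, the socket source, and the surrounding StreamingContext (with checkpointing enabled, as in the earlier sketches) are assumptions.
// Each line is assumed to be "<epochMillis> <word>".
val events = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(ts, word) = line.split(" ", 2)
  ((ts.toLong / 60000, word), 1)               // bucket by the record's own time, not arrival time
}
// Late records simply add to the running count of the (minute, word) key they belong to.
val countsByEventMinute = events.updateStateByKey[Int] { (newOnes: Seq[Int], total: Option[Int]) =>
  Some(total.getOrElse(0) + newOnes.sum)
}
countsByEventMinute.print()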