Berkley_Data_Analysis_Stack_(BDAS).
Download
Report
Transcript Berkley_Data_Analysis_Stack_(BDAS).
Berkley Data Analysis Stack
(BDAS)
Mesos, Spark, Shark, Spark Streaming
Current Data Analysis Open Stack
Application
Data Processing
Storage
Infrastructure
Characteristics:
• Batch Processing on on-disk data..
• Not very efficient with “Interactive” and “Streaming” computations.
Goal
Berkeley Data Analytics Stack (BDAS)
New apps: AMP-Genomics, Carat, …
Application
Data Processing
• in-memory processing
• trade between time, quality, and cost
Data Management
Storage
Efficient data sharing across frameworks
Resource
Infrastructure
Management
Share infrastructure across frameworks
(multi-programming for datacenters)
BDAS Components
Mesos
• A platform for sharing commodity clusters between diverse
computing frameworks.
B. Hindman et. al, Mesos: A Platform for Fine-Grained
Resource Sharing in the Data Center, tech report, UCB,
2010
Mesos
• “Resource Offers” to publish available resources
• Has to deal with “framework” specific constraints
(without knowing the specific constraints).
• e.g. data locality.
• Allows the framework scheduler to “reject
offer” if constraints are not met.
B. Hindman et. al, Mesos: A Platform for Fine-Grained
Resource Sharing in the Data Center, tech report, UCB,
2010
Mesos
• Other Issues:
• Resource Allocation Strategies: Pluggable
• fair sharing plugin implemented
• Revocation
• Isolation:
• “Existing OS isolation techniques: Linux Containers”.
• Fault Tolerance:
• Master: stand by master nodes and zoo keeper
• Slaves: Reports task/slave failures to the framework, the latter handles.
• Framework scheduler failure: Replicate
B. Hindman et. al, Mesos: A Platform for Fine-Grained
Resource Sharing in the Data Center, tech report, UCB,
2010
Spark
Current popular programming models for clusters transform data flowing from
stable storage to stable storage
• Spark: In-Memory Cluster Computing for Iterative and Interactive Applications.
– Assumption
• Same working data set is used across iterations or for a number of interactive queries.
• Commodity Cluster
• Local Data partitions fit in Memory
Some Slides taken from presentation:
Spark
• Acyclic data flows – powerful abstractions
– But not efficient for Iterative/interactive applications that
repeatedly use the same “working data set”.
Solution: augment data flow model with in-memory
“resilient distributed datasets” (RDDs)
RDDs
• An RDD is an immutable, partitioned, logical collection of
records
– Need not be materialized, but rather contains information to rebuild a
dataset from stable storage (lazy-loading and lineage)
– can be rebuilt if a partition is lost (Transform once read many)
• Partitioning can be based on a key in each record (using hash or
range partitioning)
• Created by transforming data in stable storage using data flow operators (map,
filter, group-by, …)
• Can be cached for future reuse
Generality of RDDs
• Claim: Spark’s combination of data flow with RDDs
unifies many proposed cluster programming models
– General data flow models: MapReduce, Dryad, SQL
– Specialized models for stateful apps: Pregel (BSP), HaLoop
(iterative MR), Continuous Bulk Processing
• Instead of specialized APIs for one type of app, give
user first-class control of distributed datasets
Programming Model
Transformations
(define a new RDD)
map
filter
sample
union
groupByKey
reduceByKey
join
cache
…
Parallel operations
(return a result to driver)
reduce
collect
count
save
lookupKey
…
Example: Log Mining
• Load error messages from a log into memory,
then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
messages = errors.map(_.split(‘\t’)(2))
Worker
results
errors = lines.filter(_.startsWith(“ERROR”))
cachedMsgs = messages.cache()
Cache 1
BaseTransformed
RDD
RDD
tasks
Block 1
Driver
Cached RDD
Parallel operation
cachedMsgs.filter(_.contains(“foo”)).count
Cache 2
cachedMsgs.filter(_.contains(“bar”)).count
Worker
. . .
Cache 3
Result: full-text search of Wikipedia in <1 sec (vs
20 sec for on-disk data)
Worker
Block 3
Block 2
RDD Fault Tolerance
• RDDs maintain lineage information that can be
used to reconstruct lost partitions
• Ex: cachedMsgs
= textFile(...).filter(_.contains(“error”))
.map(_.split(‘\t’)(2))
.cache()
HdfsRDD
FilteredRDD
MappedRDD
path: hdfs://…
func: contains(...)
func: split(…)
CachedRDD
Benefits of RDD Model
• Consistency is easy due to immutability
• Inexpensive fault tolerance (log lineage rather than
replicating/checkpointing data)
• Locality-aware scheduling of tasks on partitions
• Despite being restricted, model seems applicable to a broad
variety of applications
Example: Logistic Regression
• Goal: find best line separating two sets of
points
random initial line
target
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
val gradient = data.map(p =>
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
w -= gradient
}
println("Final w: " + w)
Logistic Regression Performance
127 s / iteration
first iteration 174 s
further iterations 6 s
Page Rank: Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
val contribs = links.join(ranks).flatMap {
case (url, (links, rank)) =>
links.map(dest => (dest, rank/links.size))
}
ranks = contribs.reduceByKey(_ + _)
.mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
Spark Summary
• Fast, expressive cluster computing system compatible with
Apache Hadoop
– Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
• Improves efficiency through:
Up to 100× faster
– In-memory computing primitives
– General computation graphs
• Improves usability through:
– Rich APIs in Java, Scala, Python
– Interactive shell
Often 2-10× less code
Spark Streaming
• Framework for large scale stream processing
–
–
–
–
–
Scales to 100s of nodes
Can achieve second scale latencies
Integrates with Spark’s batch and interactive processing
Provides a simple batch-like API for implementing complex algorithm
Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
Requirements
Scalable to large clusters
Second-scale latencies
Simple programming model
Integrated with batch & interactive processing
Stateful Stream Processing
Traditional streaming systems have a eventdriven record-at-a-time processing model
- Each node has mutable state
- For each record, update state & send new
records
State is lost if node dies!
Making stateful stream processing be faulttolerant is challenging
mutable state
input
records
node 1
node 3
input
records
node 2
24
Discretized Stream Processing
Run a streaming computation as a series of very
small, deterministic batch jobs
live data stream
Chop up the live stream into batches of X seconds
Spark treats each batch of data as RDDs and
processes them using RDD operations
Finally, the processed results of the RDD
operations are returned in batches
Spark
Streaming
batches of X seconds
processed
results
25
Spark
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of RDD representing a stream of data
Twitter Streaming API
batch @ t
batch @ t+1
batch @ t+2
tweets DStream
stored in memory as an RDD
(immutable, distributed)
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
new DStream
transformation: modify data in one Dstream to create another DStream
batch @ t
batch @ t+1
batch @ t+2
tweets DStream
hashTags Dstream
[#cat, #dog, … ]
flatMap
flatMap
…
flatMap
new RDDs created for
every batch
Example 1 – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: to push data to external storage
batch @ t
batch @ t+1
batch @ t+2
tweets DStream
hashTags DStream
flatMa
p
flatMa
p
flatMa
p
save
save
save
every batch saved
to HDFS
Fault-tolerance
RDDs remember the sequence of
operations that created it from the
original fault-tolerant input data
Batches of input data are replicated in
memory of multiple worker nodes,
therefore fault-tolerant
Data lost due to worker failure, can be
recomputed from input data
tweets
RDD
input data
replicated
in memory
flatMap
hashTags
RDD
lost partitions
recomputed on
other workers
Key concepts
• DStream – sequence of RDDs representing a stream of data
– Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
• Transformations – modify data from on DStream to another
– Standard RDD operations – map, countByValue, reduce, join, …
– Stateful operations – window, countByValueAndWindow, …
• Output Operations – send data to external entity
– saveAsHadoopFiles – saves to HDFS
– foreach – do anything with each batch of results
Higher throughput than Storm
Spark Streaming: 670k records/second/node
Throughput per node
(MB/s)
Comparison with Storm and S4
Storm: 115k records/second/node
120
Grep
Spark
80
40
Storm
0
100
1000
Record Size (bytes)
Throughput per node
(MB/s)
Apache S4: 7.5k records/second/node
30
WordCount
Spark
20
10
Storm
0
100
1000
Record Size (bytes)
31