MillWheel: Fault-Tolerant Stream Processing at Internet Scale

Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman,
Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle
Google








Motivation and Requirements
Introduction
High level Overview of System
Fundamental abstractions of the MillWheel model
API
Fault tolerance
System implementation
Experimental results
Related work
2
Motivating example: detecting spikes and dips in search-query traffic
Record counts over one-second intervals (buckets) are compared
against the expected traffic that the model predicts
A consistent mismatch over n consecutive windows indicates that a
query is spiking or dipping
The model is updated with newly received data
3
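As a rough illustration of that check (a toy Python sketch with made-up names and thresholds, not the actual pipeline code), one can picture per-bucket counts being compared against the model's prediction, flagging a query once n consecutive buckets disagree:

    # Toy sketch of the spike/dip check; names and thresholds are hypothetical.
    def classify(observed, predicted, n=5, tolerance=0.5):
        """Flag a query once n consecutive one-second buckets deviate from
        the model's prediction by more than `tolerance` (fractional error)."""
        streak, last = 0, 0
        for obs, pred in zip(observed, predicted):
            if obs > pred * (1 + tolerance):
                d = 1                      # traffic above expectation
            elif obs < pred * (1 - tolerance):
                d = -1                     # traffic below expectation
            else:
                d = 0
            streak = streak + 1 if (d != 0 and d == last) else (1 if d != 0 else 0)
            last = d
            if streak >= n:
                return "spiking" if d > 0 else "dipping"
        return "normal"

    print(classify([10, 80, 90, 95, 100, 120], [10, 12, 11, 10, 12, 11]))  # spiking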



Requires both short-term and long-term storage
Needs duplicate prevention, since duplicate record deliveries could
cause spurious spikes
Should distinguish whether expected data is delayed or genuinely absent
◦ MillWheel uses the low watermark for this
4

Real-time processing of data
Persistent state abstractions exposed to the user
Handling of out-of-order data
Constant latency as the system scales to more machines
Guaranteed exactly-once delivery of records
5


Framework for building low-latency data-processing applications
The system manages
◦ Persistent state
◦ Continuous flow of records
◦ Fault-tolerance guarantees
Provides a notion of logical time
6

Provides fault tolerance at the framework level
◦ Correctness is ensured in the case of failure
◦ Records are handled in an idempotent fashion
 Ensures exactly-once delivery of records from the user's perspective
Checkpointing is at a fine granularity
◦ Eliminates buffering of pending data for long periods between checkpoints
7

At a high level, MillWheel is a graph of computation nodes
◦ Users specify
 A directed computation graph
 Application code for the individual nodes
◦ Each node takes inputs and produces outputs
◦ Computations are also called transformations
Transformations are parallelized
◦ Users are not concerned with load balancing at a fine-grained level
9
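A minimal way to picture the user-supplied part, i.e. the topology plus the per-node application code (illustrative Python with invented names; the real topology is declared through MillWheel's own configuration, not code like this):

    # Illustrative only: the user supplies a directed graph of named computations
    # and the code that runs at each node. All names here are invented.
    def bucket_counter(record):
        """Aggregates records into one-second buckets (stub)."""
        return record

    def spike_detector(record):
        """Compares bucket counts against a model (stub)."""
        return record

    pipeline = {
        "nodes": {"bucket_counter": bucket_counter,   # per-node application code
                  "spike_detector": spike_detector},
        "edges": [("queries_in", "bucket_counter"),   # directed computation graph
                  ("bucket_counter", "spike_detector")],
    }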


Users can add and remove computations dynamically
All internal updates are atomically checkpointed per key
◦ User code can access a per-key, per-computation persistent store
◦ Allows for powerful per-key aggregations
◦ Uses a replicated, highly available data store (e.g. Spanner)
10

Inputs and outputs in MillWheel are represented by triples
◦ (key, value, timestamp)
 key is a metadata field with semantic meaning in the system
 value is an arbitrary byte string
 timestamp is an arbitrary value assigned by the user
11
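For illustration, a record can be thought of as nothing more than this triple (a Python rendering of the shape, not MillWheel's actual wire format):

    from collections import namedtuple

    # key drives routing and state access, value is an opaque byte string,
    # and the timestamp is assigned by the producer of the record.
    Record = namedtuple("Record", ["key", "value", "timestamp"])

    r = Record(key="some search query", value=b"<serialized payload>", timestamp=1370044800.0)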



Computations hold the application logic
Code is invoked upon receipt of input data
Code operates in the context of a single key
13



Keys are the abstraction for aggregation and comparison between
different records (similar to MapReduce)
A key-extraction function (specified by the consumer) assigns a key to each record
Computation code is run in the context of a specific key and accesses
state for that specific key only
14
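A sketch of what a consumer-specified key extractor and the key-scoped state access might look like (hypothetical names; this is not the real API):

    import json

    # Hypothetical key extractor chosen by the consumer: group records by query.
    def search_query_key(value: bytes) -> str:
        return json.loads(value)["query"]

    per_key_state = {}   # stands in for the per-key, per-computation store

    def process(value: bytes):
        key = search_query_key(value)
        state = per_key_state.setdefault(key, {"count": 0})
        state["count"] += 1          # the code only ever sees the state for `key`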


The low watermark provides a bound on the timestamps of future records
Low watermark of A is
◦ min(oldest work of A, low watermark of C)
 oldest work of A is the timestamp of the oldest unfinished record in A
 C is any computation that produces output which A consumes
15
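The definition is recursive: a computation's low watermark folds together its own oldest pending work and the low watermarks of whatever feeds it. A toy calculation under that definition (injectors, which have no inputs, seed the recursion; all data below is made up):

    # Toy low-watermark calculation. `pending` maps a computation to the
    # timestamps of its unfinished records; `inputs` maps it to its upstream
    # computations.
    def low_watermark(node, pending, inputs):
        oldest_work = min(pending.get(node, []), default=float("inf"))
        upstream = [low_watermark(u, pending, inputs) for u in inputs.get(node, [])]
        return min([oldest_work] + upstream)

    pending = {"C": [17.0, 19.0], "A": [21.0]}
    inputs = {"A": ["C"]}                         # C produces output that A consumes
    print(low_watermark("A", pending, inputs))    # 17.0: A is bounded by C's oldest work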



Timers are per-key programmatic hooks that trigger at a specific
wall time or low watermark value
Timers are journaled in persistent state and can survive process
restarts and machine failures
When a timer fires, it runs the specified user function and provides
the same exactly-once guarantees as record delivery
17

The user implements a custom subclass of the Computation class
◦ Provides methods for accessing MillWheel abstractions
◦ The ProcessRecord and ProcessTimer hooks provide the two main
entry points into user code
◦ Hooks are triggered in reaction to record receipt and timer expiration
Per-key serialization is handled at the framework level
19
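The paper's interface is C++; the Python below is only a schematic of the two hooks and the framework-provided handles (state, timers, productions), with invented helper names:

    # Schematic of the user-facing hooks; the surrounding framework is imaginary.
    class Computation:
        def process_record(self, record):          # invoked on record receipt
            raise NotImplementedError
        def process_timer(self, timer):            # invoked when a timer fires
            raise NotImplementedError

    class SpikeDetector(Computation):
        def __init__(self, state, set_timer, produce):
            self.state = state            # per-key persistent state (framework-managed)
            self.set_timer = set_timer    # schedule a wall-time or low-watermark timer
            self.produce = produce        # emit a (key, value, timestamp) record

        def process_record(self, record):
            # Runs in the context of record.key; only that key's state is visible.
            self.state["count"] = self.state.get("count", 0) + 1

        def process_timer(self, timer):
            # By the time a low-watermark timer fires, the bucket is complete.
            self.produce(("counts", self.state.get("count", 0), timer))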

Each computation calculates a low watermark value for all of its pending work
◦ Users rarely deal with low watermarks directly
◦ Users manipulate them indirectly through the timestamps they assign to records
Injectors bring external data into the system and seed low watermark
values for the rest of the pipeline
◦ If an injector is distributed across multiple processes, the least
watermark among all processes is used
23

Exactly-Once Delivery
◦ Steps performed on receipt of an input record for a computation are:
 Checked for duplication
 User code is run for the input
 Pending changes are committed to the backing store
 Sender is ACKed
 Pending productions are sent (retried until they are ACKed)
◦ The system assigns unique IDs to all records at production time, which
are stored to identify duplicate records during retries
24
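A compressed sketch of that receive path (the store, the ACK channel, and the downstream sender are in-memory stand-ins; the record IDs carried with each delivery are what make deduplication on retry possible):

    # Sketch of the per-record receive path; everything here is a stand-in.
    seen_ids = set()      # IDs of already-processed records (dedup on retries)
    store = {}            # per-key persistent state
    pending_out = []      # productions waiting to be sent downstream

    def run_user_code(key, value, state):
        count = (state or 0) + 1
        return count, [(key, count)]          # new state, produced records

    def on_receive(record_id, key, value, ack, send_downstream):
        if record_id in seen_ids:             # 1. duplicate: drop it, still ACK
            ack()
            return
        new_state, productions = run_user_code(key, value, store.get(key))
        store[key] = new_state                # 2-3. commit state and the dedup ID
        seen_ids.add(record_id)               #      (one atomic write in MillWheel)
        pending_out.extend(productions)
        ack()                                 # 4. ACK the sender
        for p in list(pending_out):           # 5. send productions; retry until ACKed
            if send_downstream(p):
                pending_out.remove(p)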

Strong Productions
◦ Produced records are checkpointed before delivery
 Checkpointing is done in the same atomic write as the state modification
 If a process restarts, checkpoints are scanned into memory and replayed
 Checkpoint data is deleted once productions are ACKed
Together, Exactly-Once Delivery and Strong Productions make user logic
effectively idempotent, even when it is not idempotent by itself
25
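The checkpoint's life cycle, roughly (hypothetical helpers; the point is that a production is journaled before it is ever sent, and survives a restart until its receiver ACKs it):

    # Hypothetical sketch of a strong production's life cycle.
    def produce_strongly(commit, checkpoints, key, new_state, productions, deliver):
        checkpoints.setdefault(key, []).extend(productions)
        commit(key, new_state, checkpoints[key])   # one atomic write: state + checkpoints
        for p in list(checkpoints[key]):           # deliver; delete each only once ACKed
            if deliver(p):
                checkpoints[key].remove(p)

    def replay_after_restart(checkpoints, deliver):
        # After a restart, unacked checkpoints are scanned back in and re-delivered.
        for productions in checkpoints.values():
            for p in list(productions):
                if deliver(p):
                    productions.remove(p)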

Some computations may already be idempotent
◦ Strong productions and/or exactly-once delivery can then be disabled
Weak Productions
◦ Broadcast downstream deliveries optimistically, prior to persisting state
◦ Each stage waits for the downstream ACKs of its records
◦ Completion time grows with each consecutive stage, so the chance of
experiencing a failure while waiting increases
26
To overcome this, a small fraction of productions are
checkpointed, allowing those stages to ACK their senders.
This selective checkpointing can both improve end-to-end
latency and reduce overall resource consumption.
27

The following user-visible guarantees must be satisfied:
◦ No data loss
◦ Updates must ensure exactly-once semantics
◦ All persisted data must be consistent
◦ Low watermarks must reflect all pending state in the system
◦ Timers must fire in order for a given key
28

To avoid inconsistencies in persistent state
◦ Per-key updates are wrapped in a single atomic operation
To avoid stale writes from network remnants (zombie writers)
◦ A sequencer token is attached to each write
◦ A mediator at the backing store checks the sequencer before allowing the write
◦ A new worker invalidates any extant sequencer
29
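A toy version of that fencing check, with made-up names (each write carries a sequencer token; the store's mediator rejects writes whose token has been superseded):

    # Toy sketch of sequencer fencing against stale ("zombie") writers.
    class BackingStore:
        def __init__(self):
            self.current = {}       # latest valid sequencer token per key range
            self.data = {}

        def take_ownership(self, key_range):
            """A new worker invalidates any extant sequencer for the range."""
            token = self.current.get(key_range, 0) + 1
            self.current[key_range] = token
            return token

        def write(self, key_range, key, value, token):
            """The mediator checks the sequencer before allowing the write."""
            if token != self.current.get(key_range):
                return False        # network remnant from a superseded worker
            self.data[key] = value
            return True

    store = BackingStore()
    old = store.take_ownership("a-m")
    new = store.take_ownership("a-m")              # reassignment invalidates `old`
    assert store.write("a-m", "apple", 1, new)     # accepted
    assert not store.write("a-m", "apple", 2, old) # stale write rejected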


MillWheel is a distributed system with a dynamic set of host servers
Each computation in a pipeline runs on one or more machines
◦ Streams are delivered via RPC
On each machine, the MillWheel system
◦ Marshals incoming work
◦ Manages process-level metadata
◦ Delegates data to the appropriate computation
Load distribution and balancing are handled by a master
◦ Each computation is divided into a set of lexicographic key intervals
◦ Intervals are assigned to a set of machines
◦ Depending on load, intervals can be merged or split
Low Watermarks
◦ A central authority tracks all low watermark values in the system and
journals them to persistent state
In-memory data structures are used to store aggregated timestamps of pending work
Consumer computations subscribe to the low watermarks of all their senders
◦ They use the minimum of all such values
◦ The central authority's low watermark values are at least as conservative
as those of the workers
 Latency distribution for records when running over 200 CPUs
 Median record delay is 3.6 milliseconds and 95th-percentile latency is
30 milliseconds
 These figures are with strong productions and exactly-once delivery
disabled; with both enabled, median latency jumps to 33.7 milliseconds
 Median latency stays roughly constant, regardless of system size
 99th-percentile latency does get significantly worse as system size increases
 A simple three-stage MillWheel pipeline was run on 200 CPUs, polling each
computation's low watermark value once per second
 CPU usage improves roughly linearly as available cache increases (beyond
550 MB most data is cached, so further increases did not help)