Survey on Programming and Tasking in Cloud Computing Environments
PhD Qualifying Exam
Zhiqiang Ma
Supervisor: Lin Gu
Feb. 18, 2011
Outline
Introduction
Approaches
Application framework level approach
Language level approach
Instruction level approach
Our work: MRlite
Conclusion
Cloud computing
Internet services are the most popular applications nowadays
Millions of users
Computation is large and complex
Google already processed 20 TB of data in 2004
Cloud computing provides massive computing resources
Available on demand
A promising model to support the processing of large datasets housed on clusters
How to program and task?
Challenges
Parallelize the execution
Schedule the large-scale distributed computation
Handle faults
Achieve high performance
Ensure fairness
Programming models for the Grid
Do not automatically parallelize users’ programs
Pass the fault-tolerance work on to applications
Outline
Introduction
Approaches
Application framework level approach
Language level approach
Instruction level approach
Our work: MRlite
Conclusion
Approaches

Approach                    | Advantage | Disadvantage
Application framework level |           |
Language level              |           |
Instruction level           |           |
MapReduce
MapReduce: a parallel computing framework for large-scale data processing
Successfully used in datacenters comprising commodity computers
A fundamental piece of software in the Google architecture for many years
An open-source variant already exists: Hadoop
Widely used in solving data-intensive problems
[Figure: MapReduce, Hadoop, and other Hadoop variants in wide use]
MapReduce
Map and Reduce are higher-order functions
Map: apply an operation to every element in a list
Reduce: like “fold”; aggregate the elements of a list
Example: 1² + 2² + 3² + 4² + 5² = ?
m: x → x²
r: + (initial value 0)
[Figure: m maps (1, 2, 3, 4, 5) to (1, 4, 9, 16, 25); r folds the squares into the running sums 1, 5, 14, 30, 55; final value 55]
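A minimal sketch of this example in Python (not from the original slides), using the built-in map and functools.reduce as the higher-order functions m and r:

  from functools import reduce

  nums = [1, 2, 3, 4, 5]
  squares = map(lambda x: x * x, nums)                # m: square every element
  total = reduce(lambda acc, x: acc + x, squares, 0)  # r: fold with +, initial value 0
  print(total)  # 55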
MapReduce’s data flow
[Figure: MapReduce’s data flow]
MapReduce
Massively parallel processing made simple
Example: word count
Map: parse a document and generate <word, 1> pairs
Reduce: receive all pairs for a specific word, and count

Map:
  // D is a document
  for each word w in D:
    output <w, 1>

Reduce, for key w:
  count = 0
  for each input item:
    count = count + 1
  output <w, count>
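As a runnable illustration (mine, not the slides’), the same word count in single-process Python, with a dictionary standing in for the shuffle that MapReduce performs between the two phases:

  from collections import defaultdict

  def map_doc(doc):
      # Map: emit a <word, 1> pair for every word in the document
      for word in doc.split():
          yield (word, 1)

  def reduce_word(word, counts):
      # Reduce: count all pairs received for one word
      return (word, sum(counts))

  docs = ["the quick brown fox", "the lazy dog"]
  grouped = defaultdict(list)
  for doc in docs:                       # map phase
      for word, one in map_doc(doc):
          grouped[word].append(one)      # stand-in for the shuffle
  print([reduce_word(w, c) for w, c in grouped.items()])  # reduce phase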
MapReduce easily scales up
[Figure: input files → map phase → intermediate files → reduce phase → output files]
MapReduce
[Figure: input → computation → output]
Dryad
A general-purpose execution environment for distributed, data-parallel applications
Concentrates on throughput, not latency
An application written in Dryad is modeled as a directed acyclic graph (DAG)
Many programs can be represented as a distributed execution graph
Dryad
[Figure: a Dryad job; inputs flow through processing vertices connected by channels (file, pipe, shared memory) to outputs]
Dryad
Concurrency arises from vertices running simultaneously across multiple machines
Vertex subroutines are usually quite simple sequential programs
Users have control over the communication graph
Each vertex can have multiple inputs and outputs
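To make the DAG idea concrete, here is a small Python sketch; Vertex, channel, and run are invented names for illustration, not Dryad’s actual API:

  class Vertex:
      """A processing vertex: a simple sequential subroutine."""
      def __init__(self, name, fn):
          self.name, self.fn, self.inputs = name, fn, []

  def channel(src, dst):
      # A channel (a file, pipe, or shared memory in Dryad) is just an edge here
      dst.inputs.append(src)

  def run(vertex, done=None):
      # Evaluate a vertex after all of its inputs (topological order of the DAG)
      done = {} if done is None else done
      if vertex.name not in done:
          done[vertex.name] = vertex.fn([run(v, done) for v in vertex.inputs])
      return done[vertex.name]

  read = Vertex("read", lambda _: [3, 1, 2])
  sort = Vertex("sort", lambda ins: sorted(ins[0]))
  channel(read, sort)          # read → sort
  print(run(sort))             # [1, 2, 3]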
Approaches

Approach                    | Advantage                                                    | Disadvantage
Application framework level | Automatically parallelizes users’ programs; users are relieved of the details of distributing the execution | Programs must follow the specific model
Language level              |                                                              |
Instruction level           |                                                              |
Tasking of execution
Performance
  Locality is crucial
  Speculative execution
Fairness
  The same cluster is shared by multiple users
  Small jobs require short response times, while throughput is what matters for big jobs
Correctness
  Fault tolerance
Locality and fairness
Locality is crucial
  Bandwidth is a scarce resource
  Input data is stored, with replication, in the same cluster that runs the executions
Fairness
  Short jobs require short response times
Locality and fairness conflict with each other
FIFO scheduler in Hadoop
Jobs wait in a queue in priority order
  FIFO by default
When there are available slots
  Assign slots, in priority order, to tasks that have local data
  Limit the assignment of non-local tasks to optimize locality
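A toy Python sketch of this policy (the job/task fields are invented; Hadoop’s real JobTracker logic is far more involved). The caller grants a small nonlocal_budget per scheduling pass, matching the “only dispatch one non-local task at a time” optimization shown below:

  def fifo_assign(job_queue, free_node, nonlocal_budget=1):
      """Pick one task for free_node; job_queue is in priority (FIFO) order."""
      for job in job_queue:
          for task in job.pending_tasks:
              if free_node in task.local_nodes:   # prefer data-local tasks
                  return task
          if nonlocal_budget > 0 and job.pending_tasks:
              return job.pending_tasks[0]         # limited non-local dispatch
      return None                                 # leave the slot idle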
FIFO scheduler
[Figure: a 2-task job and a 1-task job in the JobQueue being assigned to slots on Node 1 through Node 4]
FIFO scheduler – locality optimization
[Figure: a 4-task job and a 1-task job in the JobQueue; Node 4 is far away in the network topology, so only one non-local task is dispatched at a time]
Problem: fairness
[Figure: two 3-task jobs in the JobQueue; the job at the head takes the slots on Node 1 through Node 4 while the other waits]
Problem: response time
[Figure: two 3-task jobs in the JobQueue ahead of a small job with only 1 task; the small job waits behind them for a slot on Node 1 through Node 4]
Fair scheduling
Assign free slots to the job that has the fewest running tasks
Strict fairness
  Running jobs get a nearly equal number of slots
  Small jobs finish quickly
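A minimal Python sketch of that rule (illustrative only; production fair schedulers add pools, weights, and minimum shares):

  def fair_assign(jobs):
      """Give a free slot to the job with the fewest running tasks."""
      candidates = [j for j in jobs if j.pending_tasks]
      if not candidates:
          return None
      neediest = min(candidates, key=lambda j: j.running_tasks)
      return neediest.pending_tasks[0]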
Fair Scheduling
[Figure: the slots on Node 1 through Node 4 are shared nearly equally among the jobs in the JobQueue]
Problem: locality
[Figure: fair sharing gives a job slots on Node 1 through Node 4 even when its input data is not local to them]
Delay Scheduling
Skip a job that cannot launch a local task
  Relaxes fairness slightly
Allow a job to launch non-local tasks if it has been skipped long enough
  Avoids starvation
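In Python, the core of delay scheduling looks roughly like this (field names are invented; the published algorithm also distinguishes locality levels and uses wait times rather than a bare skip count):

  def delay_assign(job_queue, free_node, skip_threshold=2):
      """Prefer local tasks; let a job go non-local once skipped often enough."""
      for job in job_queue:                        # fairness order
          local = [t for t in job.pending_tasks
                   if free_node in t.local_nodes]
          if local:
              job.skip_count = 0                   # launched locally: reset
              return local[0]
          if job.pending_tasks:
              job.skip_count += 1                  # cannot launch locally: skip
              if job.skip_count >= skip_threshold:
                  return job.pending_tasks[0]      # relax locality, avoid starvation
      return None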
Delay Scheduling
[Figure: jobs in the JobQueue with their skip counts (threshold: 2) and slots on Node 1 through Node 4]
The waiting time is short: tasks finish quickly, and the skipped job stays at the head of the queue
“Fault” Tolerance
Nodes fail
  Re-run tasks
Nodes are slow (stragglers)
  Run backup tasks (speculative execution)
  To minimize a job’s response time
  Important for short jobs
Speculative execution
The scheduler schedules backup executions of the remaining in-progress tasks
A task is marked as completed whenever either the primary or the backup execution completes
Improves job response time by 44% according to Google’s experiments
Speculative execution mechanism
Seems a simple problem, but
  Resources for speculative tasks are not free
  How to choose the nodes to run speculative tasks on?
  How to distinguish “stragglers” from nodes that are only slightly slower?
  Stragglers should be found out early
Hadoop’s scheduler
Starts speculative tasks based on a simple heuristic
  Compares each task’s progress to the average
Assumes a homogeneous environment
  There the default scheduler works well
  Broken in utility computing
  Virtualized “utility computing” environments, such as EC2
How to robustly perform speculative execution (backup tasks) in heterogeneous environments?
Speculative execution in Hadoop
When there are no “higher priority” tasks, look for a task to execute speculatively
  Assumption: there is no cost to launching a speculative task
Compare each task’s progress to the average progress
  Assumption: nodes perform similarly (“a slow node is faulty”; “nodes that ask for new tasks are fast”)
  In “utility computing”, nodes may be slightly (2-3x) slower without hurting the response time, and a node that asks for tasks is not necessarily fast
Speculative execution in Hadoop
Threshold for speculative execution
  (average progress score of each category of tasks) − 0.2
  Tasks below the threshold are treated as “equally slow”
  Candidates are ranked by locality
Wrong tasks may be chosen
  A 35%-completed 2x-slower task with data available on an idle node, or a 5%-completed 10x-slower task?
Too many speculative tasks, and thrashing
  Taking away resources from useful tasks
Speculative execution in Hadoop
Progress score
  Map: the fraction of input data processed
  Reduce: three phases (1/3 each), and within each the fraction of data processed
Incorrect speculation of reduce tasks
  The copy phase takes most of the time, but accounts for only 1/3
  Example: 30% of the tasks finish quickly and 70% are still in the copy phase: average progress score = 30% × 1 + 70% × 1/3 ≈ 53%, so the threshold is 33%, and every copy-phase task falls below it
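The arithmetic, spelled out in a few lines of Python (assuming, as the slide does, that the copy-phase reducers are at the end of the first of their three phases):

  # 30% of the reduce tasks are done (progress score 1.0);
  # 70% are still copying, each at most 1/3 through.
  done, copying = 0.30, 0.70
  avg = done * 1.0 + copying * (1 / 3)   # average progress score
  threshold = avg - 0.2                  # Hadoop's speculation cutoff
  print(f"avg = {avg:.0%}, threshold = {threshold:.0%}")  # avg = 53%, threshold = 33%
  # Every copy-phase task scores <= 33%, so all of them look "slow".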
LATE
Longest Approximate Time to End
Principles
  Rank candidates by the longest time to end
    Choose the tasks that really hurt the job’s response time; slow nodes can be utilized as long as that doesn’t hurt the response time
  Only launch speculative tasks on fast nodes
    Not every node that asks for a task is fast
  Cap speculative tasks
    Limits resource contention and thrashing
LATE algorithm
If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running (cap speculative tasks):
  Ignore the request if the node’s total progress is below SlowNodeThreshold (only launch speculative tasks on fast nodes)
  Rank currently running tasks by estimated time left (candidates with the longest time to end first)
  Launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold
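A Python sketch of that loop (the data-structure and parameter names are assumptions for illustration; in the LATE paper the thresholds are percentiles of node and task speed):

  def late_assign(node, running_tasks, speculating,
                  speculative_cap=10, slow_node=0.25, slow_task=0.25):
      """Choose a task to back up when `node` asks for work."""
      if len(speculating) >= speculative_cap:
          return None                              # cap speculative tasks
      if node.total_progress < slow_node:
          return None                              # only use fast nodes
      def time_left(t):                            # estimated time left
          return (1 - t.progress_score) / t.progress_rate
      for task in sorted(running_tasks, key=time_left, reverse=True):
          if task.progress_rate < slow_task and task not in speculating:
              return task                          # back up the worst laggard
      return None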
Approaches

Approach                    | Advantage                                                    | Disadvantage
Application framework level | Automatically parallelizes users’ programs; users are relieved of the details of distributing the execution | Programs must follow the specific model
Language level              |                                                              |
Instruction level           |                                                              |
Language level approach
Programming frameworks and traditional programming languages
  Still not clear and compact enough
  No special focus on high parallelism on large computing clusters
A new language can be
  Clear, compact and expressive
  Able to automatically parallelize “normal” programs
  A comfortable way for users to think about data processing problems on large distributed datasets
Sawzall
An interpreted, procedural, high-level programming language
  Exploits high parallelism
  Automates the analysis of very large data sets
  Gives users a way to clearly and expressively design distributed data processing programs
Overall flow
Filtering (the Map step)
  Analyzes each record individually
  Expressed in Sawzall
Aggregation (the Reduce step)
  Collates and reduces the intermediate values
  Uses predefined aggregators
An example
Find the most-linked-to page of each domain
  Aggregator: highest value
  Stores a URL
  Indexed by domain
  Weighted by pagerank

max_pagerank_url:
  table maximum(1)[domain: string] of url: string
  weight pagerank: int;
doc: Document = input;
emit max_pagerank_url[domain(doc.url)] <- doc.url
  weight doc.pagerank;

input: a pre-defined variable, initialized by Sawzall and interpreted into the Document type
emit: sends an intermediate value to the aggregator
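For intuition, a hedged Python analogue of that aggregator (the domain helper and the sample data are mine, not Sawzall semantics): it keeps, per domain, the URL with the maximum weight, which is what the maximum(1) table indexed by domain does:

  from urllib.parse import urlparse

  def domain(url):
      # Stand-in for Sawzall's domain(): the host part of a URL
      return urlparse(url).netloc

  pages = [("http://a.com/x", 5), ("http://a.com/y", 9), ("http://b.com/z", 2)]

  best = {}  # domain -> (pagerank, url), like table maximum(1)[domain]
  for url, pagerank in pages:
      key = domain(url)
      if key not in best or pagerank > best[key][0]:
          best[key] = (pagerank, url)  # "emit ... weight pagerank"
  print(best)  # {'a.com': (9, 'http://a.com/y'), 'b.com': (2, 'http://b.com/z')}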
Unusual features
Sawzall runs on one record at a time
  Nothing in the language lets one input record influence another
The emit statement is the only output primitive
  Draws an explicit line between filtering and aggregation
These enable a high degree of parallelism, even though it is hidden from the language
Approaches

Approach                    | Advantage                                                    | Disadvantage
Application framework level | Automatically parallelizes users’ programs; users are relieved of the details of distributing the execution | Programs must follow the specific model
Language level              | Clearer, more expressive; a comfortable way of programming   | A more restrictive programming model
Instruction level           |                                                              |
Instruction level approach
Provides an instruction-level abstraction and compatibility for users’ applications
May choose a traditional ISA such as x86/x86-64
  Runs traditional applications without any modification
  Easier to migrate applications to cloud computing environments
Amazon Elastic Compute Cloud (EC2)
Provides virtual machines that run traditional OSes
  Traditional programs can work on EC2
Amazon Machine Image (AMI)
  Used to boot instances
  The unit of deployment: a packaged-up environment
Users design and implement the application logic in an AMI; EC2 handles the deployment and resource allocation
vNUMA
A virtual shared-memory multiprocessor machine built from commodity workstations
Makes the computational power available to legacy applications and OSes
[Figure: conventional virtualization runs several VMs on one PM; vNUMA runs one VM across several PMs]
Architecture
A hypervisor runs on each node
CPU
  Virtual CPUs are mapped to real CPUs on the nodes
Memory
  Divided between the nodes in equal-sized portions
  Each node manages a subset of the pages
Memory mapping
Application: reads *a, where a is an address in the application’s virtual memory
OS: translates a to the VM’s physical memory address b
VMM: maps b to the real physical address c on some node
That node services the access and returns *c
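A toy Python sketch of this two-level translation (the page size, table contents, and node naming are all invented for illustration):

  PAGE = 4096  # assumed page size

  app_page_table = {0x10: 0x22}            # app virtual page -> VM physical page (OS)
  host_page_map = {0x22: ("node3", 0x07)}  # VM physical page -> (node, machine page) (VMM)

  def read(vaddr):
      page, offset = divmod(vaddr, PAGE)
      vm_phys_page = app_page_table[page]        # OS: a -> b
      node, mpage = host_page_map[vm_phys_page]  # VMM: b -> c, on some node
      return node, mpage * PAGE + offset         # that node services *c

  print(read(0x10 * PAGE + 42))  # ('node3', 28714)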
Approaches

Approach                    | Advantage                                                    | Disadvantage
Application framework level | Automatically parallelizes users’ programs; users are relieved of the details of distributing the execution | Programs must follow the specific model
Language level              | Clearer, more expressive; a comfortable way of programming   | A more restrictive programming model
Instruction level           | Supports traditional applications                            | Users handle the tasking; hard to scale up
Outline
Introduction
Approaches
Application framework level approach
Language level approach
Instruction level approach
Our work: MRlite
Conclusion
Our work
Analyze MapReduce’s design and use a case study to probe its limitations
  One-way scalability
  Difficult to handle dynamic, interactive and semantic-rich applications
Design a new parallelization framework, MRlite
  Able to scale “up” like MapReduce, and to scale “down” to process moderate-size data
  Low latency and massive parallelism
  Small run-time system overhead
Design a general parallelization framework and programming paradigm for cloud computing
Architecture of MRlite
[Figure: the MRlite architecture, with data flow and command flow between the following components]
  The MRlite master accepts jobs from clients and schedules them to execute on slaves
  Linked with the application, the MRlite client library accepts calls from the application and submits jobs to the master
  Slaves are distributed nodes that accept tasks from the master and execute them
  A high-speed distributed storage system stores the intermediate files
Result
[Bar chart: compilation time in seconds for the Linux kernel, ImageMagick, and Xen tools under gcc (on one node), mrcc/Hadoop, and mrcc/MRlite; the bar labels are 9044, 2936, 1419, 506, 653, 312, 50, 128 and 65 seconds]
The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty in handling.
Outline
Introduction
Approaches
Application framework level approach
Language level approach
Instruction level approach
Our work: MRlite
Conclusion
Conclusion
Cloud computing needs a general programming framework
  Cloud computing should not be a platform for running just simple OLAP applications; it is important to support complex computation and even OLTP on large data sets
Design of MRlite: a general parallelization framework for cloud computing
  Handles applications with complex logic flows and data dependencies
  Mitigates the one-way scalability problem
  Able to handle all MapReduce tasks with comparable (if not better) performance
Conclusion
Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU
MRlite respects applications’ natural logic flow and data dependencies
This modularization of parallelization capability away from application logic enables MRlite to integrate GPGPU processing very easily (future work)
Thank you!
Appendix
LATE: Estimate finish times

progress rate = progress score / execution time
estimated time left = (1 − progress score) / progress rate = (1 / progress score − 1) × execution time

The smaller the progress score, the longer the estimated time left.
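The same estimate as two small Python helpers (the argument names are mine):

  def progress_rate(progress_score, execution_time):
      # Progress accumulated per unit of execution time
      return progress_score / execution_time

  def estimated_time_left(progress_score, execution_time):
      # (1 - score) / rate == (1 / score - 1) * execution_time
      return (1 - progress_score) / progress_rate(progress_score, execution_time)

  print(estimated_time_left(0.25, 60))  # 180.0: three times the elapsed time remains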
Appendix
LATE: solving the problems in Hadoop’s default scheduler
  Nodes may be slightly (2-3x) slower in “utility computing” without the response time being hurt, and a node that asks for tasks is not necessarily fast
  Too many speculative tasks and thrashing
  Ranking candidates by locality, so the wrong tasks may be chosen
  Incorrect speculation of reducers