Survey on Programming and Tasking in Cloud Computing Environments
PhD Qualifying Exam
Zhiqiang Ma
Supervisor: Lin Gu
Feb. 18, 2011
Outline
Introduction
Approaches
Application framework level approach
Language level approach
Instruction level approach
Our work: MRlite
Conclusion
Cloud computing
Internet services are the most popular applications nowadays
Millions of users
Computation is large and complex
Google already processed 20 TB of data in 2004
Cloud computing provides massive computing resources
Available on demand
A promising model to support the processing of large datasets housed on clusters
How to program and task?
Challenges
Parallelize the execution
Schedule the large-scale distributed computation
Handle faults
Achieve high performance
Ensure fairness
Programming models for the Grid
Do not automatically parallelize users’ programs
Pass the fault-tolerance work on to applications
Outline
Introduction
Approaches
Application framework level approach
Language level approach
Instruction level approach
Our work: MRlite
Conclusion
Approaches

Approach                    | Advantage | Disadvantage
Application framework level |           |
Language level              |           |
Instruction level           |           |
MapReduce
MapReduce: a parallel computing framework for large-scale data processing
Successfully used in datacenters comprising commodity computers
A fundamental piece of software in the Google architecture for many years
An open-source variant already exists: Hadoop
Widely used in solving data-intensive problems
[Figure: MapReduce, Hadoop, and other Hadoop variants in wide use]
MapReduce
Map and Reduce are higher-order functions
Map: apply an operation to every element in a list
Reduce: like “fold”; aggregate the elements of a list
Example: 1² + 2² + 3² + 4² + 5² = ?
m: x → x²
r: + (initial value 0)
[Figure: m maps (1, 2, 3, 4, 5) to (1, 4, 9, 16, 25); r folds the squares into the running sums 1, 5, 14, 30, 55; final value 55]
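A minimal sketch of this example in Python (not from the original slides), using the built-in map and functools.reduce as the higher-order functions m and r:

  from functools import reduce

  nums = [1, 2, 3, 4, 5]
  squares = map(lambda x: x * x, nums)                # m: square every element
  total = reduce(lambda acc, x: acc + x, squares, 0)  # r: fold with +, initial value 0
  print(total)  # 55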
MapReduce’s data flow
[Figure: MapReduce’s data flow]
MapReduce
Massively parallel processing made simple
Example: word count
Map: parse a document and generate <word, 1> pairs
Reduce: receive all pairs for a specific word, and count

Map:
  // D is a document
  for each word w in D:
    output <w, 1>

Reduce, for key w:
  count = 0
  for each input item:
    count = count + 1
  output <w, count>
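As a runnable illustration (mine, not the slides’), the same word count in single-process Python, with a dictionary standing in for the shuffle that MapReduce performs between the two phases:

  from collections import defaultdict

  def map_doc(doc):
      # Map: emit a <word, 1> pair for every word in the document
      for word in doc.split():
          yield (word, 1)

  def reduce_word(word, counts):
      # Reduce: count all pairs received for one word
      return (word, sum(counts))

  docs = ["the quick brown fox", "the lazy dog"]
  grouped = defaultdict(list)
  for doc in docs:                       # map phase
      for word, one in map_doc(doc):
          grouped[word].append(one)      # stand-in for the shuffle
  print([reduce_word(w, c) for w, c in grouped.items()])  # reduce phase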
MapReduce easily scales up
[Figure: input files → map phase → intermediate files → reduce phase → output files]
MapReduce
[Figure: input → computation → output]
Dryad
A general-purpose execution environment for distributed, data-parallel applications
Concentrates on throughput, not latency
An application written in Dryad is modeled as a directed acyclic graph (DAG)
Many programs can be represented as a distributed execution graph
Dryad
[Figure: a Dryad job; inputs flow through processing vertices connected by channels (file, pipe, shared memory) to outputs]
Dryad
Concurrency arises from vertices running simultaneously across multiple machines
Vertex subroutines are usually quite simple sequential programs
Users have control over the communication graph
Each vertex can have multiple inputs and outputs
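To make the DAG idea concrete, here is a small Python sketch; Vertex, channel, and run are invented names for illustration, not Dryad’s actual API:

  class Vertex:
      """A processing vertex: a simple sequential subroutine."""
      def __init__(self, name, fn):
          self.name, self.fn, self.inputs = name, fn, []

  def channel(src, dst):
      # A channel (a file, pipe, or shared memory in Dryad) is just an edge here
      dst.inputs.append(src)

  def run(vertex, done=None):
      # Evaluate a vertex after all of its inputs (topological order of the DAG)
      done = {} if done is None else done
      if vertex.name not in done:
          done[vertex.name] = vertex.fn([run(v, done) for v in vertex.inputs])
      return done[vertex.name]

  read = Vertex("read", lambda _: [3, 1, 2])
  sort = Vertex("sort", lambda ins: sorted(ins[0]))
  channel(read, sort)          # read → sort
  print(run(sort))             # [1, 2, 3]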
Approaches

Approach                    | Advantage                                                    | Disadvantage
Application framework level | Automatically parallelizes users’ programs; users are relieved of the details of distributing the execution | Programs must follow the specific model
Language level              |                                                              |
Instruction level           |                                                              |
Tasking of execution
Performance
  Locality is crucial
  Speculative execution
Fairness
  The same cluster is shared by multiple users
  Small jobs require short response times, while throughput is what matters for big jobs
Correctness
  Fault tolerance
Locality and fairness
Locality is crucial
  Bandwidth is a scarce resource
  Input data is stored, with replication, in the same cluster that runs the executions
Fairness
  Short jobs require short response times
Locality and fairness conflict with each other
FIFO scheduler in Hadoop
Jobs wait in a queue in priority order
  FIFO by default
When there are available slots
  Assign slots, in priority order, to tasks that have local data
  Limit the assignment of non-local tasks to optimize locality
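A toy Python sketch of this policy (the job/task fields are invented; Hadoop’s real JobTracker logic is far more involved). The caller grants a small nonlocal_budget per scheduling pass, matching the “only dispatch one non-local task at a time” optimization shown below:

  def fifo_assign(job_queue, free_node, nonlocal_budget=1):
      """Pick one task for free_node; job_queue is in priority (FIFO) order."""
      for job in job_queue:
          for task in job.pending_tasks:
              if free_node in task.local_nodes:   # prefer data-local tasks
                  return task
          if nonlocal_budget > 0 and job.pending_tasks:
              return job.pending_tasks[0]         # limited non-local dispatch
      return None                                 # leave the slot idle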
FIFO scheduler
[Figure: a 2-task job and a 1-task job in the JobQueue being assigned to slots on Node 1 through Node 4]
FIFO scheduler – locality optimization
[Figure: a 4-task job and a 1-task job in the JobQueue; Node 4 is far away in the network topology, so only one non-local task is dispatched at a time]
Problem: fairness
[Figure: two 3-task jobs in the JobQueue; the job at the head takes the slots on Node 1 through Node 4 while the other waits]
Problem: response time
[Figure: two 3-task jobs in the JobQueue ahead of a small job with only 1 task; the small job waits behind them for a slot on Node 1 through Node 4]
Fair scheduling
Assign free slots to the job that has the fewest running tasks
Strict fairness
  Running jobs get a nearly equal number of slots
  Small jobs finish quickly
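A minimal Python sketch of that rule (illustrative only; production fair schedulers add pools, weights, and minimum shares):

  def fair_assign(jobs):
      """Give a free slot to the job with the fewest running tasks."""
      candidates = [j for j in jobs if j.pending_tasks]
      if not candidates:
          return None
      neediest = min(candidates, key=lambda j: j.running_tasks)
      return neediest.pending_tasks[0]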
Fair Scheduling
[Figure: the slots on Node 1 through Node 4 are shared nearly equally among the jobs in the JobQueue]
Problem: locality
[Figure: fair sharing gives a job slots on Node 1 through Node 4 even when its input data is not local to them]
Delay Scheduling
Skip a job that cannot launch a local task
  Relaxes fairness slightly
Allow a job to launch non-local tasks if it has been skipped long enough
  Avoids starvation
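In Python, the core of delay scheduling looks roughly like this (field names are invented; the published algorithm also distinguishes locality levels and uses wait times rather than a bare skip count):

  def delay_assign(job_queue, free_node, skip_threshold=2):
      """Prefer local tasks; let a job go non-local once skipped often enough."""
      for job in job_queue:                        # fairness order
          local = [t for t in job.pending_tasks
                   if free_node in t.local_nodes]
          if local:
              job.skip_count = 0                   # launched locally: reset
              return local[0]
          if job.pending_tasks:
              job.skip_count += 1                  # cannot launch locally: skip
              if job.skip_count >= skip_threshold:
                  return job.pending_tasks[0]      # relax locality, avoid starvation
      return None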
Delay Scheduling
[Figure: jobs in the JobQueue with their skip counts (threshold: 2) and slots on Node 1 through Node 4]
The waiting time is short: tasks finish quickly, and the skipped job stays at the head of the queue
“Fault” Tolerance
Nodes fail
  Re-run tasks
Nodes are slow (stragglers)
  Run backup tasks (speculative execution)
  To minimize a job’s response time
  Important for short jobs
Speculative execution
The scheduler schedules backup executions of the remaining in-progress tasks
A task is marked as completed whenever either the primary or the backup execution completes
Improves job response time by 44% according to Google’s experiments
Speculative execution mechanism
Seems a simple problem, but
  Resources for speculative tasks are not free
  How to choose the nodes to run speculative tasks on?
  How to distinguish “stragglers” from nodes that are only slightly slower?
  Stragglers should be found out early
Hadoop’s scheduler
Starts speculative tasks based on a simple heuristic
  Compares each task’s progress to the average
Assumes a homogeneous environment
  There the default scheduler works well
  Broken in utility computing
  Virtualized “utility computing” environments, such as EC2
How to robustly perform speculative execution (backup tasks) in heterogeneous environments?
Speculative execution in Hadoop
When there are no “higher priority” tasks, look for a task to execute speculatively
  Assumption: there is no cost to launching a speculative task
Compare each task’s progress to the average progress
  Assumption: nodes perform similarly (“a slow node is faulty”; “nodes that ask for new tasks are fast”)
  In “utility computing”, nodes may be slightly (2-3x) slower without hurting the response time, and a node that asks for tasks is not necessarily fast
Speculative execution in Hadoop
Threshold for speculative execution
  (average progress score of each category of tasks) − 0.2
  Tasks below the threshold are treated as “equally slow”
  Candidates are ranked by locality
Wrong tasks may be chosen
  A 35%-completed 2x-slower task with data available on an idle node, or a 5%-completed 10x-slower task?
Too many speculative tasks, and thrashing
  Taking away resources from useful tasks
Speculative execution in Hadoop
Progress score
  Map: the fraction of input data processed
  Reduce: three phases (1/3 each), and within each the fraction of data processed
Incorrect speculation of reduce tasks
  The copy phase takes most of the time, but accounts for only 1/3
  Example: 30% of the tasks finish quickly and 70% are still in the copy phase: average progress score = 30% × 1 + 70% × 1/3 ≈ 53%, so the threshold is 33%, and every copy-phase task falls below it
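The arithmetic, spelled out in a few lines of Python (assuming, as the slide does, that the copy-phase reducers are at the end of the first of their three phases):

  # 30% of the reduce tasks are done (progress score 1.0);
  # 70% are still copying, each at most 1/3 through.
  done, copying = 0.30, 0.70
  avg = done * 1.0 + copying * (1 / 3)   # average progress score
  threshold = avg - 0.2                  # Hadoop's speculation cutoff
  print(f"avg = {avg:.0%}, threshold = {threshold:.0%}")  # avg = 53%, threshold = 33%
  # Every copy-phase task scores <= 33%, so all of them look "slow".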
LATE
Longest Approximate Time to End
Principles
  Rank candidates by the longest time to end
    Choose the tasks that really hurt the job’s response time; slow nodes can be utilized as long as that doesn’t hurt the response time
  Only launch speculative tasks on fast nodes
    Not every node that asks for a task is fast
  Cap speculative tasks
    Limits resource contention and thrashing
LATE algorithm
If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running (cap speculative tasks):
  Ignore the request if the node’s total progress is below SlowNodeThreshold (only launch speculative tasks on fast nodes)
  Rank currently running tasks by estimated time left (candidates with the longest time to end first)
  Launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold
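A Python sketch of that loop (the data-structure and parameter names are assumptions for illustration; in the LATE paper the thresholds are percentiles of node and task speed):

  def late_assign(node, running_tasks, speculating,
                  speculative_cap=10, slow_node=0.25, slow_task=0.25):
      """Choose a task to back up when `node` asks for work."""
      if len(speculating) >= speculative_cap:
          return None                              # cap speculative tasks
      if node.total_progress < slow_node:
          return None                              # only use fast nodes
      def time_left(t):                            # estimated time left
          return (1 - t.progress_score) / t.progress_rate
      for task in sorted(running_tasks, key=time_left, reverse=True):
          if task.progress_rate < slow_task and task not in speculating:
              return task                          # back up the worst laggard
      return None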
Approaches

Approach                    | Advantage                                                    | Disadvantage
Application framework level | Automatically parallelizes users’ programs; users are relieved of the details of distributing the execution | Programs must follow the specific model
Language level              |                                                              |
Instruction level           |                                                              |
Language level approach
Programming frameworks and traditional programming languages
  Still not clear and compact enough
  No special focus on high parallelism on large computing clusters
A new language can be
  Clear, compact and expressive
  Able to automatically parallelize “normal” programs
  A comfortable way for users to think about data processing problems on large distributed datasets
Sawzall
An interpreted, procedural, high-level programming language
  Exploits high parallelism
  Automates the analysis of very large data sets
  Gives users a way to clearly and expressively design distributed data processing programs
Overall flow
Filtering (the Map step)
  Analyzes each record individually
  Expressed in Sawzall
Aggregation (the Reduce step)
  Collates and reduces the intermediate values
  Uses predefined aggregators
An example
Find the most-linked-to page of each domain
  Aggregator: highest value
  Stores a URL
  Indexed by domain
  Weighted by pagerank

max_pagerank_url:
  table maximum(1)[domain: string] of url: string
  weight pagerank: int;
doc: Document = input;
emit max_pagerank_url[domain(doc.url)] <- doc.url
  weight doc.pagerank;

input: a pre-defined variable, initialized by Sawzall and interpreted into the Document type
emit: sends an intermediate value to the aggregator
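For intuition, a hedged Python analogue of that aggregator (the domain helper and the sample data are mine, not Sawzall semantics): it keeps, per domain, the URL with the maximum weight, which is what the maximum(1) table indexed by domain does:

  from urllib.parse import urlparse

  def domain(url):
      # Stand-in for Sawzall's domain(): the host part of a URL
      return urlparse(url).netloc

  pages = [("http://a.com/x", 5), ("http://a.com/y", 9), ("http://b.com/z", 2)]

  best = {}  # domain -> (pagerank, url), like table maximum(1)[domain]
  for url, pagerank in pages:
      key = domain(url)
      if key not in best or pagerank > best[key][0]:
          best[key] = (pagerank, url)  # "emit ... weight pagerank"
  print(best)  # {'a.com': (9, 'http://a.com/y'), 'b.com': (2, 'http://b.com/z')}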
Unusual features
Sawzall runs on one record at a time
  Nothing in the language lets one input record influence another
The emit statement is the only output primitive
  Draws an explicit line between filtering and aggregation
These enable a high degree of parallelism, even though it is hidden from the language
Approaches

Approach                    | Advantage                                                    | Disadvantage
Application framework level | Automatically parallelizes users’ programs; users are relieved of the details of distributing the execution | Programs must follow the specific model
Language level              | Clearer, more expressive; a comfortable way of programming   | A more restrictive programming model
Instruction level           |                                                              |
Instruction level approach
Provides an instruction-level abstraction and compatibility for users’ applications
May choose a traditional ISA such as x86/x86-64
  Runs traditional applications without any modification
  Easier to migrate applications to cloud computing environments
Amazon Elastic Compute Cloud (EC2)
Provides virtual machines that run traditional OSes
  Traditional programs can work on EC2
Amazon Machine Image (AMI)
  Used to boot instances
  The unit of deployment: a packaged-up environment
Users design and implement the application logic in an AMI; EC2 handles the deployment and resource allocation
vNUMA
A virtual shared-memory multiprocessor machine built from commodity workstations
Makes the computational power available to legacy applications and OSes
[Figure: conventional virtualization runs several VMs on one PM; vNUMA runs one VM across several PMs]
Architecture
A hypervisor runs on each node
CPU
  Virtual CPUs are mapped to real CPUs on the nodes
Memory
  Divided between the nodes in equal-sized portions
  Each node manages a subset of the pages
Memory mapping
Application: reads *a, where a is an address in the application’s virtual memory
OS: translates a to the VM’s physical memory address b
VMM: maps b to the real physical address c on some node
That node services the access and returns *c
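A toy Python sketch of this two-level translation (the page size, table contents, and node naming are all invented for illustration):

  PAGE = 4096  # assumed page size

  app_page_table = {0x10: 0x22}            # app virtual page -> VM physical page (OS)
  host_page_map = {0x22: ("node3", 0x07)}  # VM physical page -> (node, machine page) (VMM)

  def read(vaddr):
      page, offset = divmod(vaddr, PAGE)
      vm_phys_page = app_page_table[page]        # OS: a -> b
      node, mpage = host_page_map[vm_phys_page]  # VMM: b -> c, on some node
      return node, mpage * PAGE + offset         # that node services *c

  print(read(0x10 * PAGE + 42))  # ('node3', 28714)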
Approaches

Approach                    | Advantage                                                    | Disadvantage
Application framework level | Automatically parallelizes users’ programs; users are relieved of the details of distributing the execution | Programs must follow the specific model
Language level              | Clearer, more expressive; a comfortable way of programming   | A more restrictive programming model
Instruction level           | Supports traditional applications                            | Users handle the tasking; hard to scale up
Outline
Introduction
Approaches
Application framework level approach
Language level approach
Instruction level approach
Our work: MRlite
Conclusion
Our work
Analyze MapReduce’s design and use a case study to probe its limitations
  One-way scalability
  Difficult to handle dynamic, interactive and semantic-rich applications
Design a new parallelization framework, MRlite
  Able to scale “up” like MapReduce, and to scale “down” to process moderate-size data
  Low latency and massive parallelism
  Small run-time system overhead
Design a general parallelization framework and programming paradigm for cloud computing
Architecture of MRlite
[Figure: the MRlite architecture, with data flow and command flow between the following components]
  The MRlite master accepts jobs from clients and schedules them to execute on slaves
  Linked with the application, the MRlite client library accepts calls from the application and submits jobs to the master
  Slaves are distributed nodes that accept tasks from the master and execute them
  A high-speed distributed storage system stores the intermediate files
Result
[Bar chart: compilation time in seconds for the Linux kernel, ImageMagick, and Xen tools under gcc (on one node), mrcc/Hadoop, and mrcc/MRlite; the bar labels are 9044, 2936, 1419, 506, 653, 312, 50, 128 and 65 seconds]
The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty in handling.
Outline
Introduction
Approaches
Application framework level approach
Language level approach
Instruction level approach
Our work: MRlite
Conclusion
Conclusion
Cloud computing needs a general programming framework
  Cloud computing should not be a platform for running just simple OLAP applications; it is important to support complex computation and even OLTP on large data sets
Design of MRlite: a general parallelization framework for cloud computing
  Handles applications with complex logic flows and data dependencies
  Mitigates the one-way scalability problem
  Able to handle all MapReduce tasks with comparable (if not better) performance
Conclusion
Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU
MRlite respects applications’ natural logic flow and data dependencies
This modularization of parallelization capability away from application logic enables MRlite to integrate GPGPU processing very easily (future work)
Thank you!
Appendix
LATE: Estimate finish times

progress rate = progress score / execution time
estimated time left = (1 − progress score) / progress rate = (1 / progress score − 1) × execution time

The smaller the progress score, the longer the estimated time left.
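The same estimate as two small Python helpers (the argument names are mine):

  def progress_rate(progress_score, execution_time):
      # Progress accumulated per unit of execution time
      return progress_score / execution_time

  def estimated_time_left(progress_score, execution_time):
      # (1 - score) / rate == (1 / score - 1) * execution_time
      return (1 - progress_score) / progress_rate(progress_score, execution_time)

  print(estimated_time_left(0.25, 60))  # 180.0: three times the elapsed time remains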
Appendix
LATE: solving the problems in Hadoop’s default scheduler
  Nodes may be slightly (2-3x) slower in “utility computing” without the response time being hurt, and a node that asks for tasks is not necessarily fast
  Too many speculative tasks and thrashing
  Ranking candidates by locality, so the wrong tasks may be chosen
  Incorrect speculation of reducers