Document 9656591

Why static is bad!
Today: static partitioning
Want dynamic sharing
[Figure: Hadoop, Pregel, and MPI on a shared cluster]
Comparing Sharing Frameworks: Choice
• Choice of resources
» Can a framework pick between all resources?
» A predefined subset?
» Or a randomly chosen subset?
• Why important?
» Policies may need to be global (e.g. locality)
» If you can preempt, you can get your preference
Comparing Sharing Frameworks: Interference
• Can frameworks try to use the same machines?
» Can a framework pick between all resources?
• How to avoid this?
» Offer resources to frameworks one at a time
» Statically partition
» Offer in parallel and arbitrate when a conflict arises
Comparing Sharing Frameworks: Granularity
• Allocation granularity
» MPI tasks: gang-scheduled; a job can’t run until all slots are acquired
» Hadoop: elastic; a job can start running once it has been allocated a few slots
• Why important?
» With gang scheduling, the framework will hoard slots until it gets all the slots it needs
» The cluster may or may not be underutilized
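The difference between the two allocation styles can be sketched in a few lines of Python. This is an illustrative toy, not code from MPI, Hadoop, or Mesos: the class names `GangJob` and `ElasticJob` and the one-slot-per-offer model are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GangJob:
    """Gang-scheduled job (MPI-style): needs every slot before it can run."""
    slots_needed: int
    slots_held: int = 0

    def offer_slot(self) -> None:
        # Accept and hold the slot even though no work can start yet.
        if self.slots_held < self.slots_needed:
            self.slots_held += 1

    @property
    def running_tasks(self) -> int:
        # All-or-nothing: zero tasks run until the full gang is acquired.
        return self.slots_needed if self.slots_held == self.slots_needed else 0

    @property
    def idle_slots(self) -> int:
        # Slots that are held (hoarded) but doing no work.
        return self.slots_held - self.running_tasks


@dataclass
class ElasticJob:
    """Elastic job (Hadoop-style): each acquired slot runs a task right away."""
    slots_needed: int
    slots_held: int = 0

    def offer_slot(self) -> None:
        if self.slots_held < self.slots_needed:
            self.slots_held += 1

    @property
    def running_tasks(self) -> int:
        return self.slots_held

    @property
    def idle_slots(self) -> int:
        return 0


gang, elastic = GangJob(slots_needed=4), ElasticJob(slots_needed=4)
for _ in range(3):  # only 3 of the 4 required slots are available
    gang.offer_slot()
    elastic.offer_slot()

print(gang.running_tasks, gang.idle_slots)        # 0 3
print(elastic.running_tasks, elastic.idle_slots)  # 3 0
```

With three of four slots, the gang-scheduled job holds three idle slots while the elastic job is already running three tasks, which is why hoarding can leave a cluster underutilized.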
Mesos
Other Benefits of Mesos
Run multiple instances of the same framework
» Isolate production and experimental jobs
» Run multiple versions of the framework concurrently
Build specialized frameworks targeting particular
problem domains
» Better performance than general-purpose abstractions
Goals
High utilization of resources
Support diverse frameworks (current & future)
Scalability to 10,000s of nodes
Reliability in the face of failures
Resulting design: Small microkernel-like core
that pushes scheduling logic to frameworks
Design Elements
Fine-grained sharing:
» Allocation at the level of tasks within a job
» Improves utilization, latency, and data locality
Resource offers:
» Simple, scalable application-controlled scheduling
mechanism
Element 1: Fine-Grained Sharing
[Figure: two panels. Coarse-Grained Sharing (HPC): Framework 1, Framework 2, and Framework 3 each occupy their own block of nodes above a Storage System (e.g. HDFS). Fine-Grained Sharing (Mesos): tasks from Fw. 1, Fw. 2, and Fw. 3 are interleaved across the nodes above a Storage System (e.g. HDFS).]
+ Improved utilization, responsiveness, data locality
Element 2: Resource Offers
Option: Global scheduler
» Frameworks express needs in a specification language,
global scheduler matches them to resources
+ Can make optimal decisions
– Complex: language must support all framework needs
– Difficult to scale and to make robust
– Future frameworks may have unanticipated needs
Element 2: Resource Offers
Mesos: Resource offers
» Offer available resources to frameworks, let them pick
which resources to use and which tasks to launch
+ Keeps Mesos simple, lets it support future frameworks
– Decentralized decisions might not be optimal
Mesos Architecture
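[Figure: an MPI job and a Hadoop job, each with its own scheduler (MPI scheduler, Hadoop scheduler), sit above the Mesos master and its allocation module. The allocation module picks a framework to offer resources to and sends it a resource offer. Two Mesos slaves each run an MPI executor with tasks.]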
Mesos Architecture
[Figure: an MPI job and a Hadoop job, each with its own scheduler, above the Mesos master and its allocation module, which picks a framework to offer resources to. Two Mesos slaves each run an MPI executor with tasks.]
Resource offer = list of (node, availableResources)
E.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
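A resource offer as described on this slide can be modeled as plain data, with the framework deciding what to accept. The following is a minimal sketch under stated assumptions, not the Mesos API: `Resources`, `Offer`, and `choose_tasks` are hypothetical names, and the placement policy (only use preferred nodes, pack as many tasks as fit) is invented for illustration.

```python
from typing import NamedTuple


class Resources(NamedTuple):
    cpus: int
    mem_gb: int


# A resource offer: list of (node, availableResources), as on the slide.
Offer = list[tuple[str, Resources]]


def choose_tasks(offer: Offer, task_need: Resources,
                 preferred_nodes: set[str]) -> dict[str, int]:
    """Hypothetical framework-side policy.

    Launch as many tasks as fit, but only on nodes the framework prefers
    (for example, nodes that hold its input data). Returns
    {node: number_of_tasks}; nodes left out of the result are declined.
    """
    launches = {}
    for node, avail in offer:
        if node not in preferred_nodes:
            continue  # decline: these resources go back to the allocator
        fit = min(avail.cpus // task_need.cpus,
                  avail.mem_gb // task_need.mem_gb)
        if fit > 0:
            launches[node] = fit
    return launches


# The example offer from the slide.
offer = [("node1", Resources(2, 4)), ("node2", Resources(3, 2))]
print(choose_tasks(offer, Resources(cpus=1, mem_gb=2), {"node1", "node2"}))
# {'node1': 2, 'node2': 1}
```

The point of the design is that the master only needs to know what is free; which resources are worth taking is decided entirely inside the framework's own function.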
Mesos Architecture
[Figure: the Hadoop scheduler performs framework-specific scheduling and answers the resource offer with a task. The Mesos master's allocation module picks a framework to offer resources to. The Mesos slaves launch and isolate executors: one slave runs an MPI executor with a task, the other runs an MPI executor and a Hadoop executor with a task.]
Drawbacks
• Poor fairness
» Jobs with long tasks can dominate
» There is NO preemption!
• Sticky slots
» Jobs with higher priority can dominate a set of preferred slots
» Mesos uses lottery scheduling: the probability of being offered a slot is proportional to the framework's priority
• Head-of-line blocking
» Mesos offers resources to one framework at a time
» Prevents frameworks from trying to use the same slots
» Based on the assumption that scheduling decisions are quick
» Mesos revokes offers if a scheduler takes too long
» Essentially leads to a queue
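The lottery scheduling mentioned above (chance of receiving an offer proportional to priority) can be sketched as a weighted random draw. This is a generic illustration, not Mesos code: `pick_framework` and the priority values are assumptions made for the example.

```python
import random
from collections import Counter


def pick_framework(priorities: dict[str, float], rng: random.Random) -> str:
    """Draw one framework; its chance of winning is proportional to its priority."""
    names = list(priorities)
    weights = [priorities[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
priorities = {"hadoop": 3, "mpi": 1}  # hypothetical priorities
wins = Counter(pick_framework(priorities, rng) for _ in range(10_000))
print(wins["hadoop"] / 10_000)  # close to 0.75
```

A framework with three times the priority receives roughly three times as many offers, so a high-priority job can keep winning the slots it prefers.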
Omega
Omega
• Scales
» Central layer only does optimistic conflict resolution
» No head-of-line blocking
• Allows for flexible and evolvable scheduling
» Framework can implement any arbitrary form of scheduling
» Each framework has a global view
» Frameworks can preempt each other
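Optimistic conflict resolution over shared state can be illustrated with per-machine version counters. This is a minimal sketch of the general technique, not Omega's actual implementation: the `CellState` class, its methods, and the all-or-nothing commit rule are assumptions made for the example.

```python
class CellState:
    """Shared cluster state; each machine carries a version counter."""

    def __init__(self, machines):
        self.owner = {m: None for m in machines}
        self.version = {m: 0 for m in machines}

    def snapshot(self):
        # Each framework scheduler plans against its own copy of the versions.
        return dict(self.version)

    def commit(self, framework, machines, seen):
        """Claim `machines` all at once; fail if any changed since `seen`."""
        if any(self.version[m] != seen[m] for m in machines):
            return False  # conflict: caller must re-snapshot and retry
        for m in machines:
            self.owner[m] = framework
            self.version[m] += 1
        return True


cell = CellState(["m1", "m2", "m3"])
seen_a = cell.snapshot()  # both schedulers plan in parallel
seen_b = cell.snapshot()  # against the same view of the cluster

print(cell.commit("A", ["m1", "m2"], seen_a))     # True
print(cell.commit("B", ["m2", "m3"], seen_b))     # False, m2 changed
print(cell.commit("B", ["m3"], cell.snapshot()))  # True after retrying
```

No scheduler ever waits for another to finish deciding; the cost of a conflict is a retry rather than a queue.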
Comparing Sharing Frameworks
• Choice of resources
• Interference
• Allocation Granularity
• Cluster-wide behaviors
Comparing Frameworks