
mClock: Handling Throughput Variability for Hypervisor IO Scheduling

Ajay Gulati, VMware Inc.
Arif Merchant, HP Labs
Peter Varman, Rice University

In the USENIX Conference on Operating Systems Design and Implementation (OSDI), 2010.

Outline

• Introduction
• Scheduling goals of mClock
• mClock Algorithm
• Distributed mClock
• Performance Evaluation
• Conclusion

Introduction

• Hypervisors are responsible for multiplexing the underlying hardware resources among VMs: CPU, memory, network, and storage IO.
• The amounts of CPU and memory resources on a host are fixed and time-invariant.

[Figure: two hosts, each running several VMs over a storage IO scheduler, share a storage array. Unlike CPU and RAM, the IO throughput available to a host is not under its own control.]

Introduction (cont’d)

• Existing methods provide many knobs for allocating CPU and memory to VMs.
• The current state of the art for IO resource allocation is much more rudimentary, limited to providing proportional shares to different VMs.
• Lack of QoS support for IO resources can have widespread effects, rendering existing CPU and memory controls ineffective when applications block on IO requests.

Introduction (cont’d)

• The amount of IO throughput available to any particular host can fluctuate widely based on the behavior of other hosts accessing the shared device.

[Figure: IO throughput (IOPS) observed by one host over time as VMs on other hosts start and stop: VM1 starts, VM2 and VM3 start, VM4 starts, VM5 starts; then VM1 stops, VM2 and VM3 stop, VM4 stops.]

Introduction (cont’d)

• Three main controls in resource allocation (a minimal parameter sketch follows):
  – Shares (a.k.a. weights): proportional resource allocation.
  – Reservations: a minimum amount of resource allocation, used to provide latency guarantees.
  – Limits: a maximum allowed resource allocation, to prevent competing IO-intensive applications from consuming all the spare bandwidth in the system.
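The three controls can be viewed as a per-VM parameter triple. A minimal sketch in Python (the names VMParams, reservation, limit, and weight are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class VMParams:
    """Per-VM QoS controls (illustrative names, not from the paper)."""
    reservation: float  # minimum IOPS the VM should receive (r_i)
    limit: float        # maximum IOPS the VM may receive (l_i)
    weight: float       # relative share of spare throughput (w_i)

# Example: an OLTP VM that needs at least 250 IOPS, has no upper cap,
# and receives twice the spare capacity of a weight-1 VM.
oltp = VMParams(reservation=250, limit=float("inf"), weight=2)
```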

Scheduling goals of mClock

VM                                    IO Throughput  IO Latency
Remote Desktop (RD)                   Low            Low
Online Transaction Processing (OLTP)  High           Low
Data Migration (DM)                   High           Insensitive

• When reservations cannot be met: allocate throughput in proportion to the reservations.
• When reservations can be met: satisfy reservations first, then allocate the rest in proportion to weights.
• Limit the maximum throughput of DM.

Scheduling goals of mClock (cont’d)

• Each VM i has three parameters:
  – Reservation (r_i), Limit (l_i), and Weight (w_i).
• VMs are partitioned into three sets: reservation-clamped (R), limit-clamped (L), or proportional (P), based on whether their current allocation is clamped at the lower bound, clamped at the upper bound, or in between.
• Define the resulting allocation: each VM in R receives its reservation r_i, each VM in L receives its limit l_i, and the remaining capacity is divided among the VMs in P in proportion to their weights (a sketch of this rule follows).
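A minimal sketch of this allocation rule, assuming the total throughput T is known in advance (the real scheduler achieves this implicitly through tags rather than by solving for allocations; the helper name allocate and the dict layout are mine):

```python
def allocate(total, vms):
    """Split `total` IOPS among VMs given as {name: (reservation, limit, weight)}.

    Assumes total >= sum of reservations. Clamps one VM at a time to its
    reservation (set R) or limit (set L), then divides what is left among
    the unclamped VMs (set P) in proportion to their weights.
    """
    alloc, free, remaining = {}, dict(vms), total
    while free:
        wsum = sum(w for _, _, w in free.values())
        clamped = None
        for name, (r, l, w) in free.items():
            share = remaining * w / wsum
            if share < r:
                clamped = (name, r)   # allocation clamped at lower bound -> R
                break
            if share > l:
                clamped = (name, l)   # allocation clamped at upper bound -> L
                break
        if clamped is None:
            break                     # no one else is clamped
        name, amount = clamped
        alloc[name] = amount
        remaining -= amount
        del free[name]
    wsum = sum(w for _, _, w in free.values())
    for name, (r, l, w) in free.items():  # set P: proportional to weight
        alloc[name] = remaining * w / wsum
    return alloc

# Example: DM's proportional share (500 of 1000 IOPS) exceeds its 300 IOPS
# limit, so it is clamped and the remaining 700 IOPS is split 350/350.
print(allocate(1000, {"RD": (100, float("inf"), 1),
                      "OLTP": (100, float("inf"), 1),
                      "DM": (0, 300, 2)}))
```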

mClock Algorithm

• mClock uses two main ideas:
  – Multiple real-time clocks: reservation-based, limit-based, and weight-based tags.
  – Dynamic clock selection: dynamically select one of the clocks for scheduling, depending on system state.
• The tag assignment method is similar to Virtual Clock scheduling (a sketch of the tag updates follows).
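Each per-request tag follows the Virtual Clock pattern: the previous tag advanced by the inverse of the corresponding rate, floored at the current time. A sketch (the class name VMTags is mine; the origin adjustment of P tags described on the next slide is omitted here):

```python
import time

class VMTags:
    """Per-VM tag state; r, l, w are the reservation, limit, and weight."""
    def __init__(self, r, l, w):
        self.r, self.l, self.w = r, l, w
        self.R = self.L = self.P = 0.0  # tags of the most recent request

    def tag_request(self, now=None):
        """Tag a newly arrived request: each clock advances by its own
        spacing (1/rate) but never lags behind real time."""
        t = time.monotonic() if now is None else now
        self.R = max(self.R + 1.0 / self.r, t)  # reservation-based clock
        self.L = max(self.L + 1.0 / self.l, t)  # limit-based clock
        self.P = max(self.P + 1.0 / self.w, t)  # weight-based clock
        return (self.R, self.L, self.P)
```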

mClock Algorithm (cont’d)

• Tag Adjustment: calibrate the proportional-share (P) tags against real time
  – To prevent starvation.
  – In virtual-time-based scheduling, this synchronization is done using global virtual time: S_{i,k} = max{F_{i,k-1}, V(a_{i,k})}.
  – In mClock, the reservation and limit tags must be based on real time, so instead the origin of the existing P tags is adjusted to real time.

mClock Algorithm (cont’d)

• Request scheduling (see the sketch below):
  – Tag adjustment is applied when a request arrives.
  – Reservations are served first: if the smallest R tag is due, dispatch that request.
  – Otherwise, select a request only from the VMs not constrained by their limit (L tag not beyond the current time).
  – Active_IOs counts the per-VM queue length.
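A condensed sketch of that dispatch decision (simplified from the paper's pseudocode; the function name pick_next and the tag-dict layout are mine):

```python
def pick_next(vms, now):
    """Choose the next VM to dispatch from.

    vms: {name: (R, L, P)} -- tags of each VM's earliest pending request.
    """
    # Constraint-satisfying phase: serve the smallest R tag that is due.
    due = [(tags[0], name) for name, tags in vms.items() if tags[0] <= now]
    if due:
        return min(due)[1]
    # Weight-based phase: smallest P tag among VMs whose L tag is <= now
    # (VMs currently throttled by their limit are skipped).
    eligible = [(tags[2], name) for name, tags in vms.items() if tags[1] <= now]
    if eligible:
        return min(eligible)[1]
    return None  # every VM is limit-clamped right now
```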

mClock Algorithm (cont’d)

• When one of a VM's requests is served in the weight-based phase, the R tags of that VM's outstanding requests are decreased by 1/r_i. This maintains the condition that R tags are always spaced apart by 1/r_i, so that reserved service is not affected by the service provided in the weight-based phase (a sketch of this adjustment follows).

[Figure: timeline of R tags R_k^1 through R_k^5, spaced 1/r_k apart; request r_k^3 is served at current time t, after which the waiting time of r_k^4 may be longer than 1/r_k without the adjustment.]

Storage-specific Issues

• Burst Handling
  – Storage workloads are known to be bursty.
  – Requests from the same VM often have high spatial locality.
  – mClock gives bursty workloads that were idle a limited preference in scheduling when the system next has spare capacity.
  – To accomplish this, VMs are allowed to gain idle credits: the P tag of the first request after an idle period may be set back by up to σ_i/w_i relative to the current time, i.e., P = max(P_prev + 1/w_i, t − σ_i/w_i) (see the sketch after the figure).

[Figure: P-tag timeline showing P_k^1, P_k^2, P_k^2 + 1/w_i, then an idle period; request r_k^3 arrives at current time t and receives a P tag σ_i/w_i before t.]
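A sketch of the idle-credit tag assignment (parameter names follow the slide; clamping against the previous tag plus 1/w_i is my reading of the figure):

```python
def p_tag_after_idle(p_prev, w_i, sigma_i, now):
    """P tag for the first request after an idle period: the VM may be
    backdated by up to sigma_i/w_i (its idle credits), but never earlier
    than the normal 1/w_i spacing from its previous tag."""
    return max(p_prev + 1.0 / w_i, now - sigma_i / w_i)
```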

Storage-specific Issues (cont’d)

• IO size
  – Since larger IOs take longer to complete, differently sized IOs should not be treated equally by the IO scheduler.
  – The latency of n random outstanding IOs of size S each can be written as:
    Lat = n × (T_m + S / B_peak)
    where T_m is the mechanical delay due to seek and disk rotation, and B_peak is the peak transfer bandwidth of the disk.
  – Converting the latency observed for an IO of size S_1 to a reference IO of size S_2:
    Lat_2 / Lat_1 = (T_m + S_2 / B_peak) / (T_m + S_1 / B_peak)
    For a small reference size S_2, the S_2 / B_peak term is negligible.
  – A single request of IO size S is therefore treated as equivalent to (1 + S / (T_m × B_peak)) reference-size IO requests (a worked example follows).
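A worked example of this normalization (the values T_m = 5 ms and B_peak = 60 MB/s are illustrative assumptions, not from the paper):

```python
def io_cost(size_bytes, t_m=0.005, b_peak=60e6):
    """Cost of one IO in units of small reference IOs: 1 + S/(T_m * B_peak)."""
    return 1.0 + size_bytes / (t_m * b_peak)

# T_m * B_peak = 300 KB, so small IOs cost ~1 and large IOs scale up:
print(io_cost(4 * 1024))     # ~1.01: a 4KB IO counts as about one request
print(io_cost(256 * 1024))   # ~1.87: a 256KB IO counts as almost two
```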

Storage-specific Issues (cont’d)

• Request Location
  – mClock improves the overall efficiency of the system by scheduling IOs with high locality as a batch.
  – A VM is allowed to issue IO requests in a batch as long as the requests are close in logical block number space.
• Reservation Setting
  – By Little's law, IOPS = Outstanding IOs / Latency.
  – Example: an application that keeps 8 IOs outstanding and requires 25 ms latency needs a reservation of 8 / 0.025 = 320 IOPS.

Distributed mClock

• dmClock targets cluster-based storage systems; it runs a modified version of mClock.
• Each request from VM v_i to a storage server s_j piggybacks two integers, ρ_i and δ_i:
  – δ_i: the number of IO requests from v_i that have completed service at all the servers between the previous request (from v_i) to the server s_j and the current request.
  – ρ_i: the number of IO requests from v_i that have been served as part of the constraint-satisfying (reservation) phase between the previous request to s_j and the current request.
• The modified tag formulas are sketched below.
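The change to the tag formulas is that the spacing is scaled by the piggybacked counts, so service received at other servers advances a VM's clocks here as well (a sketch following the paper's description; variable names are mine):

```python
def dmclock_tags(prev, rho, delta, r_i, l_i, w_i, now):
    """Tags at one server s_j for VM v_i's new request.

    prev: (R, L, P) tags of v_i's previous request at this server.
    rho/delta: requests served elsewhere since then (reservation phase /
    total), which stretch the spacing so the per-VM rates r_i and w_i
    are enforced cluster-wide rather than per server.
    """
    R_prev, L_prev, P_prev = prev
    R = max(R_prev + rho / r_i, now)      # reservation tag
    L = max(L_prev + delta / l_i, now)    # limit tag
    P = max(P_prev + delta / w_i, now)    # weight (proportional-share) tag
    return (R, L, P)
```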

Performance Evaluation

• Implemented in the VMware ESX server hypervisor by modifying the SCSI scheduling layer in its IO stack.
• The host is a Dell PowerEdge 2950 server with two QLogic HBAs connected to an EMC CLARiiON CX3-40 storage array over a FC SAN.
• Two different storage volumes were used:
  – A 10-disk RAID 0 disk group
  – A 10-disk RAID 5 disk group

Performance Evaluation (cont’d)

• Two kinds of VMs:
  – Linux VMs with a 10GB virtual disk, one VCPU, and 512MB memory
  – Windows Server 2003 VMs with a 16GB virtual disk, one VCPU, and 1GB memory
• Workload generators:
  – Iometer (http://www.iometer.org/) in the Windows Server VMs
  – A self-designed workload generator in the Linux VMs

Performance Evaluation (cont’d)

• Limit Enforcement

VM    Workload                                 IO size  Latency bound  Weight
RD    32 random IOs (75% read) every 250 ms    4KB      30 ms          2
OLTP  Always backlogged (75% read)             8KB      30 ms          2
DM    Always backlogged (all sequential read)  32KB     none           1

• At t = 140, the limit for DM is set to 300 IOPS.

Performance Evaluation (cont’d)

• Reservations Enforcement
  – Five VMs with weights in the ratio 1:1:2:2:2.
  – VMs are started at 60-second intervals.

[Figure: SFQ only does proportional allocation, while mClock enforces the 300 IOPS and 250 IOPS reservations.]

Performance Evaluation (cont’d)

• Bursty VM Workloads
  – VM1: 128 IOs every 400 ms, all 4KB reads, 80% random.
  – VM2: 16KB reads, 20% of them random and the rest sequential, with 32 outstanding IOs.
  – Idle credits do not impact the overall bandwidth allocation over time.
  – The latency seen by the bursty VM1 decreases as the idle credits are increased.

Performance Evaluation (cont’d)

• Filebench Workloads
  – Emulate the workload of OLTP VMs using Filebench [25].

[25] R. McDougall. Filebench: Application level file system benchmark. http://www.solarisinternals.com/si/tools/filebench/index.php

Performance Evaluation (cont’d)

• dmClock Evaluation
  – Implemented in a distributed storage system consisting of multiple storage servers (nodes).
  – Each node is a virtual machine running RHEL Linux with a 10GB OS disk and a 10GB experimental disk.

Conclusion

• mClock provides per-VM quality of service. The QoS requirements are expressed as:
  – a minimum reservation
  – a maximum limit
  – proportional shares (weight)
• The controls provided by mClock allow stronger isolation among VMs.
• The techniques are quite generic and can be applied to array-level scheduling and to other resources, such as network bandwidth allocation, as well.

Comments

• Existing VM services only provide resource controls for CPU, memory, and storage capacity, yet IO throughput may be the largest factor in QoS provisioning.
  – In terms of response time or delay.
• It is a good idea to combine reservations, limits, and proportional shares in a scheduling algorithm.
  – WF2Q-M considered the limit but not reservations.
• How should reservations, limits, and proportional shares be handled between VMs on different hosts?

Comments (cont’d)

• The experiments only validate the correctness of mClock.
  – What about short-term fairness, latency distribution, and computation overhead?
• The experiments use only one host machine.
  – They cannot reflect the throughput variability that arises when multiple hosts share the array.