Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing [10 minutes]
Introduction to Windows Azure [35 minutes]
Research Applications on Azure, demos [10 minutes]
How They Were Built [15 minutes]
A Closer Look at Azure [15 minutes]
Cloud Research Engagement Initiative [5 minutes]
Q&A [ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as “killer micros” and inexpensive clusters did

Range in size from “edge” facilities to megascale.
Economies of scale: approximate costs for a small data center (1000 servers) and a larger, 100K-server data center.

Technology       Cost in Small Data Center      Cost in Large Data Center      Ratio
Network          $95 per Mbps/month             $13 per Mbps/month             7.1
Storage          $2.20 per GB/month             $0.40 per GB/month             5.7
Administration   ~140 servers/administrator     >1000 servers/administrator    7.1

Each data center is 11.5 times the size of a football field.

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

Comparing HPC systems and data centers (DC), dimension by dimension:

o Node and system architectures
  Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the node.
o Communication fabric
o Storage systems
  HPC: local scratch small or non-existent, secondary is SAN or PFS, PB-scale tertiary storage.
  DC: TBs of local storage, secondary is JBOD, tertiary is non-existent.
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable.
  DC: loosely consistent models, designed to transparently recover from failures.
o Programming model and services

Azure FC Owns this Hardware

Highly available Fabric Controller (FC)

At minimum:
CPU: 1.5-1.7 GHz x64
Memory: 1.7 GB
Network: 100+ Mbps
Local storage: 500 GB

Up to:
CPU: 8 cores
Memory: 14.2 GB
Local storage: 2+ TB

Azure Platform: Compute and Storage

A closer look at compute:
[Diagram: HTTP requests arrive through a load balancer. A Web Role runs IIS hosting ASP.NET, WCF, etc.; a Worker Role runs a main() loop. Each role instance runs in its own VM alongside an agent, all managed by the Fabric.]

Using queues for reliable messaging. To scale, add more of either role.
[Diagram: 1) the Web Role (ASP.NET, WCF, etc.) receives work; 2) puts work in a queue; 3) the Worker Role (main()) gets work from the queue; 4) does the work.]

Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging; see the sketch below)

Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
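The queue pattern above, as a minimal C# sketch against the Storage Client Library (Microsoft.WindowsAzure.StorageClient); the queue name "workitems" and the message payload are illustrative, and development storage stands in for a cloud account:

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class QueueGlueSketch
{
    static void Main()
    {
        // Development storage here; swap in your cloud account credentials.
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudQueue queue = account.CreateCloudQueueClient()
                                  .GetQueueReference("workitems");
        queue.CreateIfNotExist();

        // Web role side: put work in the queue (step 2).
        queue.AddMessage(new CloudQueueMessage("process-partition-0001"));

        // Worker role side: get work (step 3), do it (step 4), and delete the
        // message only after the work succeeds; if the worker dies first, the
        // message reappears after the visibility timeout and is retried.
        CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromMinutes(5));
        if (msg != null)
        {
            Console.WriteLine("Doing work: " + msg.AsString);
            queue.DeleteMessage(msg);
        }
    }
}

Deleting the message only after the work completes is what makes the queue mask worker-role faults.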

A closer look at storage:
[Diagram: applications and the compute fabric reach Blob, Drive, Table, and Queue storage through a load balancer, over an HTTP REST API.]

Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational – entities contain a set of properties
Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces (see the sketch below)
Data can be accessed by:
Windows Azure apps
Other on-premises or cloud applications
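For example, a blob in a publicly readable container is a plain HTTP resource, so any on-premises client can fetch it; a minimal sketch, using the illustrative blob URL that appears later in this deck:

using System.Net;

class RestAccessSketch
{
    static void Main()
    {
        // A blob in a public container needs no authentication; authenticated
        // requests instead sign each request with the account's secret key.
        using (WebClient client = new WebClient())
        {
            client.DownloadFile(
                "http://jared.blob.core.windows.net/images/PIC01.JPG",
                "PIC01.JPG");
        }
    }
}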

[Diagram: the developer workflow, at work or at home. Develop your app against the local Development Fabric and Development Storage, with source control for versioning, until the application works locally; then deploy to the cloud and verify the application works in staging.]

What’s the ‘Value Add’?
Provide a platform that is scalable and available

Services are always running; rolling upgrades/downgrades
Failure of any node is expected; state has to be replicated
Failure of a role (app code) is expected; automatic recovery
Services can grow to be large; provide state management that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains health of hardware
Manages the service life cycle starting from bare metal

Fault Domains

Purpose: avoid single points of failure

Allocation is across fault domains

Update Domains

Purpose: ensure the service stays up while undergoing an update

Update domains
Unit of software/configuration update; example: a set of nodes to update
Used when rolling forward or backward
Developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains

Allocation is across update domains

Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state

Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud-based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

[Diagram: the AzureMODIS pipeline. A portal fronts the AzureMODIS service Web Role; imagery flows through a download queue into the data collection stage, then the reprojection, derivation reduction, and analysis reduction stages, producing research results for download.]



Statistical tool used to analyze DNA of HIV from large studies of infected patients

PhyloD was developed by Microsoft Research and has been highly impactful
Small but important group of researchers
100s of HIV and HepC researchers actively use it
1000s of research communities rely on the results

Cover of PLoS Biology, November 2008

Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
Requires a large number of test runs for a given job (1-10M tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data (the local sequence database).
2. Upload it to the Azure store.
3. Deploy the Worker Roles and the BLAST executable; each role’s Init() function downloads and decompresses the data to its local disk (a sketch of such an Init() follows).
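A sketch of what such an Init() step might look like in a worker role's OnStart, assuming the database was uploaded as a gzip-compressed blob; the local resource name "BlastData", container "staging", and blob names are illustrative:

using System.IO;
using System.IO.Compression;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

public class BlastWorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Local scratch space declared in the service definition.
        LocalResource scratch = RoleEnvironment.GetLocalResource("BlastData");
        string packedPath = Path.Combine(scratch.RootPath, "seqdb.gz");
        string dbPath = Path.Combine(scratch.RootPath, "seqdb");

        // Download the compressed sequence database from blob storage.
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        account.CreateCloudBlobClient()
               .GetContainerReference("staging")
               .GetBlobReference("seqdb.gz")
               .DownloadToFile(packedPath);

        // Decompress it to local disk for the BLAST executable to use.
        using (GZipStream packed = new GZipStream(File.OpenRead(packedPath),
                                                  CompressionMode.Decompress))
        using (FileStream unpacked = File.Create(dbPath))
        {
            byte[] buffer = new byte[64 * 1024];
            int n;
            while ((n = packed.Read(buffer, 0, buffer.Length)) > 0)
                unpacked.Write(buffer, 0, n);
        }
        return base.OnStart();
    }
}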



Step 2. Partitioning a Job
[Diagram: the Web Role takes the user input, writes input partitions to Azure Storage, and enqueues a queue message for the single partitioning Worker Role.]

Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partition from Azure Storage, run BLAST, and write the BLAST output and logs back to Azure Storage.]



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it

Resources

Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Diagram: time-space fungibility in the cloud: the same resource-time area can be spent as many workers for a short time or as few workers for a long time.]

Uses a general jobs-based task manager, which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks that run on an (HPC) cluster or in Azure data centers; a registry records jobs and their data products. On the user premises (or the internet), an administrator runs a registry broker in front of highly sensitive data, while users work against a local registry and manage jobs and results through web management.]

Client Visualization / Cloud Data and Computation

The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose the geo-location that hosts the storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Example hierarchy:
Account: jared
Container: images → blobs PIC01.JPG, PIC02.JPG
Container: movies → blob MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible

Associate Metadata with a Container
Metadata are <name, value> pairs
Up to 8 KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob

[Diagram: the storage hierarchy extended one level. In account jared, container movies holds blob MOV1.AVI, a 10 GB movie, which is itself made up of blocks or pages: Block Id 1, Block Id 2, …, Block Id N.]

// Upload each block, then commit the block list to make the blob readable.
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob

Block Operation
PutBlock
Puts an uncommitted block, identified by its block ID, for the blob

Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size are returned for each block
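A minimal sketch of this upload pattern with the .NET Storage Client Library; the container name, file name, and sequential block IDs are illustrative:

using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlockBlobSketch
{
    static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlobContainer container = account.CreateCloudBlobClient()
                                              .GetContainerReference("movies");
        container.CreateIfNotExist();
        CloudBlockBlob blob = container.GetBlockBlobReference("TheBlob.wmv");

        List<string> blockIds = new List<string>();
        byte[] buffer = new byte[4 * 1024 * 1024];   // blocks can be up to 4 MB
        using (FileStream file = File.OpenRead("TheBlob.wmv"))
        {
            int n, i = 0;
            while ((n = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Block IDs are base64-encoded and equal length within a blob.
                string blockId = Convert.ToBase64String(BitConverter.GetBytes(i++));
                blob.PutBlock(blockId, new MemoryStream(buffer, 0, n), null);
                blockIds.Add(blockId);
            }
        }
        // Nothing is readable until the block list is committed.
        blob.PutBlockList(blockIds);
    }
}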

Create MyBlob
Specify blob size = 10 GB; fixed page size = 512 bytes

Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0, 512) and [1536, 2560)

GetBlob [1000, 2048) returns:
All 0 for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)
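A sketch of the same kind of operations against the Storage Client Library's CloudPageBlob (Create, WritePages, GetPageRanges); offsets and lengths must be multiples of the 512-byte page size, and the container name is illustrative:

using System;
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class PageBlobSketch
{
    static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlobContainer container = account.CreateCloudBlobClient()
                                              .GetContainerReference("pageblobs");
        container.CreateIfNotExist();

        CloudPageBlob blob = container.GetPageBlobReference("MyBlob");
        blob.Create(10L * 1024 * 1024 * 1024);     // reserve a 10 GB address space

        // PutPage [0, 1024): write two 512-byte pages at offset 0.
        blob.WritePages(new MemoryStream(new byte[1024]), 0);

        // Enumerate which ranges actually hold data (cf. GetPageRange above).
        foreach (PageRange range in blob.GetPageRanges())
            Console.WriteLine(range.StartOffset + " - " + range.EndOffset);
    }
}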

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob


A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can be mounted by only one VM at a time for read/write

Remote access via the Page Blob
Can upload a VHD to its Page Blob using the blob interface, then mount it as a Drive
Can download the Drive through the Page Blob interface
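A sketch of creating and mounting a drive from within a role, assuming the CloudDrive type from the Microsoft.WindowsAzure.CloudDrive assembly; the page blob URL, 64 MB size, and 25 MB cache are illustrative, and the exact signatures should be checked against the SDK:

using System;
using Microsoft.WindowsAzure;

class DriveSketch
{
    static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        // The drive's backing store is a page blob in the account.
        CloudDrive drive = account.CreateCloudDrive(
            account.BlobEndpoint + "/drives/mydata.vhd");
        drive.Create(64);                                       // size in MB
        string path = drive.Mount(25, DriveMountOptions.None);  // local cache in MB
        Console.WriteLine("Mounted durable NTFS volume at " + path);
        drive.Unmount();
    }
}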

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy-to-Use API
ADO.NET Data Services (.NET 3.5 SP1)
.NET classes and LINQ
REST, with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
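A minimal sketch of defining, inserting, and querying an entity using the .NET classes above; TableServiceEntity supplies the required PartitionKey, RowKey, and Timestamp, and the "Tasks" table and Status property are illustrative:

using System;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// An entity is a set of properties; the base class provides the required ones.
public class TaskEntity : TableServiceEntity
{
    public TaskEntity() { }
    public TaskEntity(string jobId, string taskId)
        : base(jobId, taskId) { }      // PartitionKey = job, RowKey = task
    public string Status { get; set; }
}

class TableSketch
{
    static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudTableClient tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("Tasks");

        // Insert via ADO.NET Data Services.
        TableServiceContext context = tables.GetDataServiceContext();
        context.AddObject("Tasks",
            new TaskEntity("job42", "task001") { Status = "queued" });
        context.SaveChangesWithRetries();

        // LINQ query; filtering on PartitionKey/RowKey uses the index.
        var pending = from t in context.CreateQuery<TaskEntity>("Tasks")
                      where t.PartitionKey == "job42"
                      select t;
        foreach (TaskEntity t in pending)
            Console.WriteLine(t.RowKey + ": " + t.Status);
    }
}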

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data (see the sketch after this list)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
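The retry item above, as a small generic helper (plain C#, no Azure dependency); three attempts and exponential backoff are arbitrary choices:

using System;
using System.Threading;

static class Retry
{
    // Run an action up to maxAttempts times, backing off between failures.
    public static void Run(Action action, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try { action(); return; }
            catch (Exception)
            {
                if (attempt >= maxAttempts) throw;
                // Exponential backoff: 1s, 2s, 4s, ...
                Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}
// Usage: Retry.Run(() => queue.AddMessage(new CloudQueueMessage("task")), 3);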

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can ask grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 5

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 6

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 10

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing [10 minutes]
Introduction to Windows Azure [35 minutes]
Research Applications on Azure, demos [10 minutes]
How They Were Built [15 minutes]
A Closer Look at Azure [15 minutes]
Cloud Research Engagement Initiative [ 5 minutes]
Q&A [ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Data centers range in size from “edge” facilities to megascale.
Economies of scale: approximate costs for a small data center (~1,000 servers)
versus a large data center (~100,000 servers).

Technology       Cost in small data center     Cost in large data center     Ratio
Network          $95 per Mbps/month            $13 per Mbps/month            7.1
Storage          $2.20 per GB/month            $0.40 per GB/month            5.7
Administration   ~140 servers/administrator    >1000 servers/administrator   7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers and complex cooling systems separately is not efficient.
Package and deploy into bigger units, just in time (JITD).

How HPC systems and data centers (DC) differ:

o Node and system architectures — node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the node.
o Communication fabric
o Storage systems — HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage. DC: TB local storage; secondary is JBOD; tertiary is non-existent.
o Reliability and resilience — HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable. DC: loosely consistent models, designed to transparently recover from failures.
o Programming model and services

Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)

At minimum: CPU 1.5–1.7 GHz x64; memory 1.7 GB; network 100+ Mbps; local storage 500 GB.
Up to: 8 CPU cores; 14.2 GB memory; 2+ TB local storage.

Azure platform: Compute and Storage.

A closer look
[Diagram] HTTP requests pass through a load balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } program); each role instance runs in its own VM with an agent, on the Fabric.

Using queues for reliable messaging
To scale, add more of either role:
1) The Web Role receives work
2) The Web Role puts the work in a queue
3) A Worker Role gets work from the queue
4) The Worker Role does the work
(A code sketch of this round trip follows the next list.)

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
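
To make the round trip concrete, here is a minimal C# sketch against the Storage Client Library; the queue name and the DoWork function are illustrative placeholders, not part of the deck:

// Assumes: using Microsoft.WindowsAzure; using Microsoft.WindowsAzure.StorageClient;
var account = CloudStorageAccount.DevelopmentStorageAccount;   // local Development Storage; use Parse(...) in the cloud
CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("workitems");
queue.CreateIfNotExist();

// Web Role: put work in the queue (step 2)
queue.AddMessage(new CloudQueueMessage("partition-42"));

// Worker Role: get work, do it, then delete the message (steps 3 and 4)
CloudQueueMessage msg = queue.GetMessage();
if (msg != null)
{
    DoWork(msg.AsString);        // hypothetical application function
    queue.DeleteMessage(msg);    // if the worker dies first, the message reappears
}

Deleting only after the work completes is what masks worker-role faults: an unprocessed message becomes visible again after its timeout.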

A closer look
[Diagram] Application storage: Blobs, Drives, Tables, and Queues sit behind a load balancer and are reached through a REST API over HTTP, from compute roles in the fabric or from outside the data center.

Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage — not relational; entities contain a set of properties
Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps, other on-premises applications, or cloud applications

[Diagram] Develop at work or at home against the Development Fabric and Development Storage, with your app in source control (versioned locally). The application works locally, then works in staging, and then runs in the cloud.

What’s the ‘Value Add’?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains
Purpose: avoid single points of failure.
Allocation is across fault domains.

Update Domains
Purpose: ensure the service stays up while undergoing an update.
An update domain is the unit of software/configuration update — for example, a set of nodes to update — used when rolling forward or backward.
The developer assigns the number required by each role; example: 10 front-ends across 5 update domains.
Allocation is across update domains.

Push-button Deployment
Step 1: Allocate nodes — across fault domains and across update domains.
Step 2: Place OS and role images on nodes.
Step 3: Configure settings.
Step 4: Start roles.
Step 5: Configure load balancers.
Step 6: Maintain the desired number of roles — failed roles are automatically restarted; node failure results in new nodes being automatically allocated.

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles:
FC detects if a role dies; a role can also indicate that it is unhealthy.
The current state of the node is updated appropriately, and the state machine kicks in again to drive us back to the goal state.

Windows Azure FC monitors the health of the host:
If the node goes offline, FC will try to recover it.
If a failed node can’t be recovered, FC migrates role instances to a new node — a suitable replacement location is found, and existing role instances are notified of the change.

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud-based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM.
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM.
Near-polar orbits, day/night mode, ~2300 km swath.
L0 (raw) and L1 (calibrated) data held at Goddard DAAC.
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers.

[Diagram] AzureMODIS pipeline, coordinated by a service web role portal: Data Collection Stage → Reprojection Stage → Derivation Reduction Stage → Analysis Reduction Stage → research results available through a download queue.



Statistical tool used to analyze DNA of HIV from large studies of infected patients.
PhyloD was developed by Microsoft Research and has been highly impactful for a small but important group of researchers:
100s of HIV and HepC researchers actively use it;
1000s of research communities rely on its results.
Cover of PLoS Biology, November 2008.

Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours.
Requires a large number of test runs for a given job (1–10M tests).
Highly compressed data per job (~100 KB per job).

Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure store
3. Deploy worker roles — an Init() function downloads and decompresses the data to each node’s local disk; the BLAST executable is deployed with the roles



Step 2. Partitioning a Job
[Diagram] The web role takes the user input, and a single partitioning worker role splits it into input partitions in Azure storage, posting one queue message per partition.

Step 3. Doing the Work
[Diagram] BLAST-ready worker roles pick up the queue messages, read their input partitions from Azure storage, and write BLAST output and logs back to it. (A skeletal worker loop is sketched below.)
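
A skeletal worker-role loop in this spirit — RunBlast and the queue wiring are hypothetical stand-ins, not the actual AzureBLAST source:

// Assumes: using System.Threading; using Microsoft.WindowsAzure.ServiceRuntime;
//          using Microsoft.WindowsAzure.StorageClient;
public class BlastWorker : RoleEntryPoint
{
    public override void Run()
    {
        CloudQueue tasks = GetTaskQueue();    // hypothetical helper, wired as in the queue sketch above
        while (true)
        {
            CloudQueueMessage msg = tasks.GetMessage();
            if (msg == null) { Thread.Sleep(5000); continue; }
            RunBlast(msg.AsString);           // hypothetical: run BLAST over one input partition
            tasks.DeleteMessage(msg);         // delete only after output and logs are safely in storage
        }
    }
}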



Lessons learned:
• Always design with failure in mind — on large jobs it will happen, and it can happen anywhere.
• Factoring work into optimal sizes has large performance impacts — the optimal size may change depending on the scope of the job.
• Test runs are your friend — blowing $20,000 of computation is not a good idea.
• Make ample use of logging features — when failure does happen, it’s good to know where.
• Cutting 10 years of computation down to 1 week is great — the little cloud-development headaches are probably worth it.

Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
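
Reading the table: going from 2 to 25 workers (12.5× the resources) cuts the clock duration from 1:27:00 to 0:12:00, a speedup of about 87/12 ≈ 7.3×, while the total and computational run times (apparently aggregate work across workers) hold roughly constant at about 2½ and 2 hours respectively — the time–space trade the next chart illustrates.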

[Chart] Resources vs. time: time–space fungibility in the cloud — the same aggregate work can be spent as many workers for a short time or few workers for a long time.

The system utilizes a general jobs-based task manager which registers jobs and their resulting data products.
[Diagram] A job definition fans out into tasks recorded in a registry. A registry broker connects the user premises (or internet) — an (HPC) cluster holding highly sensitive data, a local registry, the administrator, and users working through web management — with the Azure data centers, where results are produced.

Client Visualization / Cloud Data and Computation

The Cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
applications using peripheral devices;
applications with heavy graphics requirements;
legacy user interfaces that would be difficult to port.
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

A user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account:
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

[Diagram] Account → Container → Blob: account “jared” holds containers “images” (PIC01.JPG, PIC02.JPG) and “movies” (MOV1.AVI); e.g.
http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit.

Blob Container
A container holds a set of blobs.
Set access policies at the container level: private or publicly accessible.

Associate Metadata with Container
Metadata are name/value pairs, up to 8 KB per container.
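
As a small illustration, a sketch of creating a public container and attaching metadata with the .NET client (the container name and metadata values are made up; account is set up as in the earlier queue sketch):

CloudBlobClient blobClient = account.CreateCloudBlobClient();
CloudBlobContainer container = blobClient.GetContainerReference("images");
container.CreateIfNotExist();

// Container-level access policy: allow public read access to blobs
container.SetPermissions(new BlobContainerPermissions
    { PublicAccess = BlobContainerPublicAccessType.Blob });

// Name/value metadata, up to 8 KB per container
container.Metadata["owner"] = "jared";
container.SetMetadata();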

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

[Diagram] The same account → container → blob hierarchy, with each blob made up of blocks (Block Id 1 … Block Id N) or pages (Page 1, Page 2, Page 3, …) — e.g. a 10 GB movie stored as a sequence of blocks.

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);   // commits TheBlob.wmv to Windows Azure Storage

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob.
The block ID and size of each block is returned.
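
The earlier pseudocode maps naturally onto these operations; a hedged sketch with the .NET client, where ReadInChunks is a hypothetical helper yielding streams of at most 4 MB, and container is obtained as above:

CloudBlockBlob blob = container.GetBlockBlobReference("TheBlob.wmv");
var blockIds = new List<string>();
int n = 0;
foreach (Stream block in ReadInChunks("TheBlob.wmv", 4 * 1024 * 1024))
{
    // Block IDs must be Base64 strings of equal length within a blob
    string id = Convert.ToBase64String(BitConverter.GetBytes(n++));
    blob.PutBlock(id, block, null);   // upload an uncommitted block
    blockIds.Add(id);
}
blob.PutBlockList(blockIds);          // commit: this list becomes the readable blob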

Create MyBlob
Specify blob size = 10 GB; fixed page size = 512 bytes; a 10 GB address space, initially zeros.

Random-access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560).
GetBlob [1000, 2048) returns all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536, 2048).
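
The corresponding page-blob calls, sketched with the .NET client (pageData is a placeholder stream whose length is a multiple of the 512-byte page size):

CloudPageBlob pageBlob = container.GetPageBlobReference("MyBlob");
pageBlob.Create(10L * 1024 * 1024 * 1024);   // reserve a 10 GB address space, initially zeros

pageBlob.WritePages(pageData, 512);          // PutPage starting at offset 512
pageBlob.ClearPages(512, 1024);              // ClearPage [512, 1536)

foreach (PageRange range in pageBlob.GetPageRanges())
    Console.WriteLine("[{0}, {1}]", range.StartOffset, range.EndOffset);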

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

Blob snapshots
All writes are applied to the base blob name; only delta changes are maintained across snapshots.
Restore to a prior version via snapshot promotion.
Use ListBlobs to enumerate the snapshots of a blob.

A Windows Azure Drive is a page blob formatted as a single-volume NTFS Virtual Hard Drive (VHD).
Drives can be up to 1 TB.
A VM can dynamically mount up to 8 drives; a page blob can only be mounted by one VM at a time for read/write.
Remote access via the page blob: you can upload a VHD to its page blob using the blob interface and then mount it as a drive, or download the drive through the page blob interface.
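
A hedged sketch of the mount sequence with the 1.x SDK’s CloudDrive API (the local cache resource, blob path, and sizes are illustrative):

// Assumes a LocalStorage resource named "DriveCache" in the service definition
LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

CloudDrive drive = account.CreateCloudDrive(
    blobClient.GetContainerReference("drives")
              .GetPageBlobReference("data.vhd").Uri.ToString());
drive.Create(1024);                                     // size in MB; skip if the VHD already exists
string path = drive.Mount(25, DriveMountOptions.None);  // returns a mounted drive path
// ... ordinary NTFS file I/O against 'path', then drive.Unmount()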

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
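
For example, a minimal entity-plus-query sketch with the .NET client (the Jobs table, entity type, and Status property are invented for illustration):

public class JobEntity : TableServiceEntity  // supplies PartitionKey, RowKey, Timestamp
{
    public string Status { get; set; }
    public JobEntity() { }                   // required by the serializer
    public JobEntity(string project, string jobId) : base(project, jobId) { }
}

CloudTableClient tables = account.CreateCloudTableClient();
tables.CreateTableIfNotExist("Jobs");
TableServiceContext ctx = tables.GetDataServiceContext();
ctx.AddObject("Jobs", new JobEntity("modis", "job-0001") { Status = "queued" });
ctx.SaveChanges();

// LINQ over the table; only PartitionKey and RowKey are indexed
var queued = from j in ctx.CreateQuery<JobEntity>("Jobs")
             where j.PartitionKey == "modis" && j.Status == "queued"
             select j;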

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (a sketch follows below)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
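
One possible shape for the retry bullet — a generic wrapper with a simple back-off; note the 1.x client also ships built-in policies such as RetryPolicies.RetryExponential, which is usually the simpler route:

static T WithRetry<T>(Func<T> operation)
{
    int attempts = 0;
    while (true)
    {
        try { return operation(); }
        catch (StorageServerException)       // transient server-side failure
        {
            if (++attempts >= 3) throw;
            Thread.Sleep(TimeSpan.FromSeconds(2 * attempts));   // linear back-off
        }
    }
}
// e.g. CloudQueueMessage msg = WithRetry(() => queue.GetMessage());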

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back this offering with a technical engagement team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data — now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
And when a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand, without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop:
persist and share data from the client in the cloud;
analyze data initially captured in client tools, such as Excel;
analysis as a service (think SQL, MapReduce, R/MATLAB);
data visualization generated in the cloud, displayed on the client;
provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 11

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 12

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 13

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

Five dimensions where HPC and data center (DC) designs compare:

o Node and system architectures
  Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes.
o Communication fabric
o Storage systems
  HPC: local scratch small or non-existent; secondary is SAN or PFS; PB tertiary storage.
  DC: TBs of local storage; secondary is JBOD; tertiary is non-existent.
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable.
  DC: loosely consistent models, designed to transparently recover from failures.
o Programming model and services

Azure FC Owns this Hardware

Highly-available Fabric Controller (FC)

At minimum:
CPU: 1.5–1.7 GHz x64
Memory: 1.7 GB
Network: 100+ Mbps
Local storage: 500 GB

Up to:
CPU: 8 cores
Memory: 14.2 GB
Local storage: 2+ TB

Azure Platform: Compute and Storage

A closer look at compute:
[Diagram: HTTP traffic passes through a load balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.); Worker Role instances run application code (main() { … }). Each role instance runs in its own VM alongside an agent, all managed by the Fabric.]

Using queues for reliable messaging. To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in a queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work

Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Allocate resources with different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
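
To make the queue pattern above concrete, here is a minimal sketch in Python of the get/process/delete loop a worker runs against a queue with visibility timeouts. It is illustrative only: the real service is driven through the .NET Storage Client Library or the REST API, and this Queue class is a stand-in, not an Azure type.

import time

class Queue:
    """Stand-in for an Azure-style queue with visibility timeouts.

    get() hides a message for `visibility` seconds instead of removing it;
    if the worker crashes before delete(), the message reappears and another
    worker can pick it up -- this is what masks worker-role faults."""
    def __init__(self, visibility=30):
        self.visibility = visibility
        self.messages = []              # entries are [visible_at, id, body]
        self.next_id = 0

    def put(self, body):
        self.messages.append([0, self.next_id, body])
        self.next_id += 1

    def get(self):
        now = time.time()
        for m in self.messages:
            if m[0] <= now:
                m[0] = now + self.visibility   # hide it, don't delete it
                return m[1], m[2]
        return None

    def delete(self, msg_id):
        self.messages = [m for m in self.messages if m[1] != msg_id]

def worker_loop(queue, do_work):
    """The canonical worker-role loop: get, do the work, then delete."""
    while True:
        msg = queue.get()
        if msg is None:
            time.sleep(1)               # back off while the queue is empty
            continue
        msg_id, body = msg
        do_work(body)                   # may run more than once on failure
        queue.delete(msg_id)            # only now is the message gone

Because a message is deleted only after the work completes, every task is attempted at least once even if a worker dies mid-task; the flip side is that the work must tolerate being run more than once.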

A closer look at storage:
[Diagram: applications, whether on the compute fabric or elsewhere, reach Blobs, Tables, Queues, and Drives over HTTP through a REST API that sits behind a load balancer.]

Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premises applications or cloud applications

[Diagram: the development workflow. Whether at work or at home, develop your app against the local Development Fabric and Development Storage, with source control managing versions. Once the application works locally, deploy it to cloud staging; once it works in staging, promote it to production.]

What’s the ‘Value Add’?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal

Fault Domains

Purpose: avoid single points of failure
Allocation is across fault domains

Update Domains

Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update – for example, the set of nodes to update together
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains

Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on the nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state

Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud-based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

[Diagram: the AzureMODIS pipeline. The AzureMODIS service web role portal feeds a download queue; imagery then flows through the data collection stage, the reprojection stage, the derivation reduction stage, and the analysis reduction stage to produce research results.]



Statistical tool used to analyze DNA of HIV from large studies of infected patients.

PhyloD was developed by Microsoft Research and has been highly impactful for a small but important group of researchers:
100s of HIV and HepC researchers actively use it
1000s of research communities rely on its results
Cover of PLoS Biology, November 2008

Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
Requires a large number of test runs for a given job (1–10M tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure store
3. Deploy the worker roles – each role’s Init() function downloads the data and the BLAST executable and decompresses them to the local disk
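
A minimal sketch of that Init() step in Python, assuming a hypothetical blob URL and archive layout (the deck’s actual implementation is .NET):

import tarfile
import urllib.request

def init(db_url, dest_dir):
    """Worker-role startup: fetch the compressed sequence database once,
    then decompress it to local disk for fast repeated reads."""
    archive = dest_dir + "/seqdb.tar.gz"
    urllib.request.urlretrieve(db_url, archive)   # download from blob storage
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)                  # local disk, not cloud storage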



Step 2. Partitioning a Job
A web role accepts the user input and hands it to a single partitioning worker role, which writes input partitions to Azure storage and enqueues one queue message per partition.

Step 3. Doing the Work
BLAST-ready worker roles pull partition messages from the queue, read their input partition from Azure storage, run BLAST, and write the BLAST output and logs back to Azure storage.
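
A sketch of the partitioning step, with made-up helper objects (store, queue) standing in for blob and queue storage; the real AzureBLAST code is .NET, but the shape is the same – split, store, enqueue:

def partition_job(sequences, partition_size, store, queue):
    """Split a list of query sequences into fixed-size partitions, save each
    partition to storage, and enqueue one message per partition so any idle
    worker can claim it."""
    for i in range(0, len(sequences), partition_size):
        part = sequences[i:i + partition_size]
        name = "partition-%05d" % (i // partition_size)
        store.put(name, "\n".join(part))   # input partition in blob storage
        queue.put(name)                    # queue message names the partition

The partition_size here is exactly the knob the ‘factoring work into optimal sizes’ lesson below refers to.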



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere

Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job

Test runs are your friend
- Blowing $20,000 of computation is not a good idea

Make ample use of logging features
- When failure does happen, it’s good to know where

Cutting 10 years of computation down to 1 week is great!
- Little cloud development headaches are probably worth it

Resource scaling results:

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
 8        0:26:00          2:33:23          2:00:14
 4        0:47:00          2:34:17          2:01:06
 2        1:27:00          2:31:39          1:59:13

[Chart: resources vs. time, illustrating time–space fungibility in the cloud: roughly the same total core-hours can be spent as many workers for a short time or as few workers for a long time.]
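
A quick back-of-the-envelope on the table above, as a Python sketch (the efficiency measure relative to the 2-worker run is mine, not from the deck):

def minutes(hms):
    h, m, s = map(int, hms.split(":"))
    return h * 60 + m + s / 60

runs = {2: "1:27:00", 4: "0:47:00", 8: "0:26:00", 16: "0:15:00", 25: "0:12:00"}
base_workers, base_wall = 2, minutes(runs[2])
for w, wall in sorted(runs.items()):
    speedup = base_wall / minutes(wall)            # vs. the 2-worker run
    efficiency = speedup / (w / base_workers)
    print(w, "workers: speedup %.2fx, efficiency %.0f%%" % (speedup, efficiency * 100))

Going from 2 to 25 workers (12.5x the resources) cuts the wall clock from 1:27:00 to 0:12:00, a 7.25x speedup, which is the fungibility trade the chart illustrates.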

Azure Ocean utilizes a general jobs-based task manager, which registers jobs and their resulting data.

[Diagram: a job definition fans out into tasks that run on an (HPC) cluster or in the Azure datacenters, producing data products tracked in a registry. A registry broker mediates between the administrator, highly sensitive data, and the user’s local registry on the user premises (or internet); web management returns results to the user.]

Client Visualization / Cloud Data and Computation

The cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port

Our goal, then:
Make the best use of the capabilities of both client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

A user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account
e.g. “US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Namespace example: the account “jared” holds the container “images” (blobs PIC01.JPG and PIC02.JPG) and the container “movies” (blob MOV1.AVI), addressed as:
http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible

Associate Metadata with a Container
Metadata are <name, value> pairs
Up to 8 KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob


A blob is stored as a sequence of blocks or pages (Block/Page 1, 2, 3, …, identified by Block Id 1 through Block Id N).

Example: uploading a 10 GB movie, TheBlob.wmv, to Windows Azure Storage block by block:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
// ... one PutBlock call per block ...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block, identified by its block ID, for the blob

Block list operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update the blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size of each block are returned
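
For readers outside .NET, here is a hedged sketch of the same two operations against the storage REST API, in Python. It assumes you already hold a writable SAS URL for the blob (to sidestep Shared Key request signing) and uses the comp=block / comp=blocklist query parameters; treat the details (headers, XML shape) as illustrative rather than authoritative.

import base64
import urllib.parse
import urllib.request

def put_block(sas_url, block_id, data):
    """PUT <blob>?comp=block&blockid=<base64 id> uploads one uncommitted block.
    `data` is the block's bytes (up to 4 MB)."""
    bid = base64.b64encode(block_id.encode()).decode()
    req = urllib.request.Request(
        sas_url + "&comp=block&blockid=" + urllib.parse.quote(bid),
        data=data, method="PUT")
    urllib.request.urlopen(req)

def put_block_list(sas_url, block_ids):
    """PUT <blob>?comp=blocklist commits blocks into the readable blob."""
    body = "<?xml version='1.0' encoding='utf-8'?><BlockList>"
    for block_id in block_ids:
        bid = base64.b64encode(block_id.encode()).decode()
        body += "<Latest>%s</Latest>" % bid
    body += "</BlockList>"
    req = urllib.request.Request(
        sas_url + "&comp=blocklist", data=body.encode(), method="PUT")
    urllib.request.urlopen(req)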

Example: create MyBlob as a page blob with a blob size of 10 GB (a 10 GB address space) and a fixed page size of 512 bytes, then issue these random-access operations:

PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0, 512) and [1536, 2560)

GetBlob [1000, 2048) returns:
All 0 for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)
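
To see why those answers come out, here is a small Python model of the page-blob semantics (my own sketch, tracking validity per 512-byte page; the real service works on byte ranges, not a Python set):

PAGE = 512

valid = set()                                    # indices of valid pages

def put_page(start, end):
    valid.update(range(start // PAGE, end // PAGE))

def clear_page(start, end):
    valid.difference_update(range(start // PAGE, end // PAGE))

put_page(512, 2048)
put_page(0, 1024)
clear_page(512, 1536)
put_page(2048, 2560)

# Coalesce valid pages into byte ranges, as GetPageRange would report them.
ranges, run = [], None
for p in sorted(valid):
    if run and p * PAGE == run[1]:
        run[1] += PAGE
    else:
        run = [p * PAGE, (p + 1) * PAGE]
        ranges.append(run)
print(ranges)    # [[0, 512], [1536, 2560]] -- matching the slide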

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

Snapshots
All writes are applied to the base blob name
Only the delta changes are maintained across snapshots
Restore a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob

A Windows Azure Drive is a page blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A page blob can be mounted by only one VM at a time for read/write

Remote access via the page blob
Can upload the VHD to its page blob using the blob interface, and then mount it as a drive
Can download the drive through the page blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
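
A sketch of how an application might shape entities for such a table, using AzureBLAST’s per-task logs as the example (my own illustration; the property names are made up). Putting all tasks of one job in one partition keeps them together and queryable as a unit, while RowKey orders tasks within the partition:

# One entity (row) per BLAST task; only PartitionKey, RowKey and Timestamp
# are required, everything else is up to the application.
entity = {
    "PartitionKey": "job-2010-0042",      # all tasks of a job share a partition
    "RowKey": "task-00017",               # unique within the partition
    "Timestamp": "2010-03-15T10:30:00Z",  # maintained by the service
    "State": "Done",
    "DurationSeconds": 418,
    "OutputBlob": "http://jared.blob.core.windows.net/results/task-00017",
}

Because tables index only on PartitionKey and RowKey (as the best-practice list below notes), a query filtering on State alone would scan the partition.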

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember that Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing the VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic everywhere you access data
• Use the built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
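
As a concrete reading of the retry advice, a minimal exponential-backoff wrapper in Python (a generic sketch, not an Azure API; real storage clients ship their own retry policies):

import random
import time

def with_retries(op, attempts=5, base_delay=0.5):
    """Run op(); on failure, wait and retry with exponential backoff
    plus jitter before giving up."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of retries: surface the error
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)              # ~0.5s, 1s, 2s, ... plus jitter

# e.g. with_retries(lambda: put_block(sas_url, "b1", data))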

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop:
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 14

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 15

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 16

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform: Compute and Storage

A closer look at compute
[Diagram: HTTP traffic passes through a load balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } loop); each role instance runs with an agent in a VM on the Fabric.]

Using queues for reliable messaging
To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work

Queues are the application glue
• Decouple parts of the application, easier to scale independently;
• Resource allocation: different priority queues and backend servers;
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
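
To make the queue pattern above concrete, the fragment below sketches the put/get/delete cycle with the .NET Storage Client Library of this era; the connection string, queue name, message payload, and ProcessWorkItem are placeholders, so treat the exact names as assumptions to check against your SDK version.

// Sketch: queue-based work dispatch between a web role and a worker role.
// connectionString, "workitems", and ProcessWorkItem are hypothetical.
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse(connectionString);
var queue = account.CreateCloudQueueClient().GetQueueReference("workitems");
queue.CreateIfNotExist();

// Web role side: put work in the queue (step 2).
queue.AddMessage(new CloudQueueMessage("job-partition-42"));

// Worker role side: get work (step 3), do it (step 4), and delete the
// message only after the work succeeds; if the worker crashes first, the
// message reappears after the visibility timeout (reliable messaging).
CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromMinutes(5));
if (msg != null)
{
    ProcessWorkItem(msg.AsString);
    queue.DeleteMessage(msg);
}

Deleting the message after, not before, doing the work is what masks worker-role faults: an unfinished work item is never lost, at the cost of occasional duplicate processing.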

A closer look at storage
[Diagram: applications reach Blob, Drive, Table, and Queue storage over HTTP through a load balancer via the REST API; application storage and compute both run on the Fabric.]

Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational – entities contain a set of properties
Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
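
Because blobs in a public container are plain HTTP resources, any client can read them; here is a minimal sketch using the example URL that appears on a later slide (the account, container, and blob names are illustrative).

// Sketch: read a publicly accessible blob over the REST interface.
using System;
using System.Net;

var url = "http://jared.blob.core.windows.net/images/PIC01.JPG";
using (var client = new WebClient())
{
    byte[] bytes = client.DownloadData(url);
    Console.WriteLine("Downloaded {0} bytes", bytes.Length);
}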

[Diagram: develop your app at work or at home against the local Development Fabric and Development Storage, keeping versions in source control; the application works locally, then in staging, then in the cloud.]

What’s the ‘value add’?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: avoid single points of failure
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on the nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state

Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud-based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2,300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

[Pipeline diagram: the AzureMODIS service web role portal feeds a download queue; data flows through the data collection, reprojection, derivation reduction, and analysis reduction stages to produce research results.]



Statistical tool used to analyze DNA of HIV from large studies of infected patients
PhyloD was developed by Microsoft Research and has been highly impactful
A small but important group of researchers:
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on the results
Cover of PLoS Biology, November 2008

Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
Requires a large number of test runs for a given job (1–10M tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data
2. Upload it to the Azure store
3. Deploy worker roles; the Init() function downloads and decompresses the data to the local disk
[Diagram: the local sequence database is compressed and uploaded to Azure Storage, and the BLAST executable is deployed to the worker roles.]



Step 2. Partitioning a Job
[Diagram: user input arrives at the web role; a single partitioning worker role writes input partitions to Azure Storage and a queue message per partition.]

Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to storage.]



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere

Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job

Test runs are your friend
- Blowing $20,000 of computation is not a good idea

Make ample use of logging features
- When failure does happen, it’s good to know where

Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Chart: time–space fungibility in the cloud – trading resources against time.]

Utilizes a general jobs-based task manager which registers jobs and their resulting data products
[Architecture diagram: a job definition fans out into tasks on an (HPC) cluster; an administrator-run registry broker mediates between the user’s local registry, web management, and the registry of data products; highly sensitive data stays on the user premises (or internet) while results are returned from the Azure datacenters.]

Client Visualization / Cloud Data and Computation

The cloud is not a jack-of-all-trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal, then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

A user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account
“US Anywhere”, “US North Central”, “US South Central”, …

Can co-locate a storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account: jared
Containers: images (blobs PIC01.JPG, PIC02.JPG) and movies (blob MOV1.AVI)
Example blob URL: http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

[Diagram: in account jared, container images holds blobs PIC01.JPG and PIC02.JPG, and container movies holds MOV1.AVI; each blob is a sequence of blocks or pages (Block/Page 1, 2, 3, …; Block Id 1 … Block Id N).]

Example: uploading a 10 GB movie, TheBlob.wmv, to Windows Azure Storage:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
// ... one PutBlock call per remaining block ...
PutBlock(blobName, blockIdN, blockNBits);
// Commit: the blob only becomes readable once the block list is put
PutBlockList(blobName, blockId1, ..., blockIdN);

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by the blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The Block ID and size of each block are returned

Create MyBlob
Specify blob size = 10 GB; fixed page size = 512 bytes

Random-access operations against the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)

GetBlob [1000, 2048) returns all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536, 2048)
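
To show what a sparse page write looks like from .NET, here is a rough sketch against the page-blob client of the SDK 1.1 era; the method names and the in-scope container variable are assumptions to verify against your SDK.

// Sketch: create a 10 GB page blob and write one 512-byte-aligned page range.
using System.IO;
using Microsoft.WindowsAzure.StorageClient;

CloudPageBlob pageBlob = container.GetPageBlobReference("MyBlob");
pageBlob.Create(10L * 1024 * 1024 * 1024);   // zeroed 10 GB address space

// PutPage [0, 1024): offsets and lengths must be 512-byte aligned.
using (var data = new MemoryStream(new byte[1024]))
{
    pageBlob.WritePages(data, 0);
}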

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: promoting a snapshot of MyBlob back over the base blob.]
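
From the .NET client of the day this looks roughly as follows; the container variable is assumed to be in scope, and the option names should be checked against your StorageClient version.

// Sketch: snapshot a blob, then list the base blob together with snapshots.
using Microsoft.WindowsAzure.StorageClient;

CloudBlob blob = container.GetBlobReference("MyBlob");
CloudBlob snapshot = blob.CreateSnapshot();   // read-only, point-in-time

var options = new BlobRequestOptions
{
    UseFlatBlobListing = true,
    BlobListingDetails = BlobListingDetails.Snapshots
};
foreach (IListBlobItem item in container.ListBlobs(options))
{
    // Snapshots carry a snapshot timestamp; the base blob does not.
}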

A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and easy-to-use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
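
As a minimal sketch of an entity and a query in the ADO.NET Data Services style described above: the JobResult class, its properties, the table name, and the in-scope tableClient are all invented for illustration.

// Sketch: an Azure Table entity plus a LINQ query on the indexed keys.
using System.Linq;
using Microsoft.WindowsAzure.StorageClient;

public class JobResult : TableServiceEntity  // supplies PartitionKey, RowKey, Timestamp
{
    public string Status { get; set; }        // illustrative property
    public double RuntimeHours { get; set; }  // illustrative property
}

// Only PartitionKey and RowKey are indexed, so filter on them first.
var context = tableClient.GetDataServiceContext();
var query = from r in context.CreateQuery<JobResult>("JobResults")
            where r.PartitionKey == "run-2010-03" && r.RowKey == "task-0001"
            select r;
foreach (var r in query)
{
    // process the matching entity
}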

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates for all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
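
The retry-logic advice is easy to start on with a plain wrapper like the sketch below; the SDK also ships its own retry policies, and this generic helper is only an illustration.

// Sketch: retry a storage call with exponential backoff (1s, 2s, 4s, ...).
using System;
using System.Threading;

static T WithRetries<T>(Func<T> action, int maxAttempts)
{
    for (int attempt = 1; ; attempt++)
    {
        try { return action(); }
        catch (Exception)
        {
            if (attempt >= maxAttempts) throw;
            Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
        }
    }
}

// Usage: var msg = WithRetries(() => queue.GetMessage(), 4);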

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 17

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 18

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 19

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of the application, so each is easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
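
In code, the queue glue is small. A minimal sketch against the Microsoft.WindowsAzure.StorageClient library from the early Azure SDKs, assuming the development storage account; the queue name and job format are illustrative:

using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class QueueGlue
{
    static CloudQueue GetWorkQueue()
    {
        // A deployed service would read its account settings from configuration.
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("workitems");
        queue.CreateIfNotExist();
        return queue;
    }

    // Web Role side (steps 1-2): receive work, put it in the queue.
    public static void Enqueue(string jobId)
    {
        GetWorkQueue().AddMessage(new CloudQueueMessage(jobId));
    }

    // Worker Role side (steps 3-4): get work from the queue, do it.
    public static void WorkerLoop()
    {
        CloudQueue queue = GetWorkQueue();
        while (true)
        {
            CloudQueueMessage msg = queue.GetMessage();
            if (msg == null) { Thread.Sleep(1000); continue; }
            ProcessJob(msg.AsString);  // hypothetical application-specific work
            queue.DeleteMessage(msg);  // delete only on success: a crashed worker
        }                              // lets the message become visible again
    }

    static void ProcessJob(string jobId) { /* application-specific */ }
}

Deleting the message only after the work succeeds is what masks worker faults: an unprocessed message reappears after its visibility timeout.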

A closer look at storage
[Diagram: applications on the Compute Fabric, and external clients over HTTP, reach Application Storage through a Load Balancer and the REST API; behind it sit the storage types – Blobs, Drives, Tables, and Queues.]

Points of interest
Storage types
Blobs: simple interface for storing named files, along with metadata for each file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premises applications or cloud applications
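
Because the REST interface is plain HTTP, data in a public container can be read from any platform or language; a minimal sketch (the URL reuses the example blob shown later in this deck):

using System.Net;

public class RestAccess
{
    public static byte[] DownloadPublicBlob()
    {
        // Blob URIs follow http://<account>.blob.core.windows.net/<container>/<blob>.
        // Writes and private reads additionally sign the request with the
        // account's secret key; this sketch assumes a publicly readable blob.
        using (var client = new WebClient())
        {
            return client.DownloadData(
                "http://jared.blob.core.windows.net/images/PIC01.JPG");
        }
    }
}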

[Diagram: the development life cycle. Develop at work or at home against the local Development Fabric and Development Storage, keeping your app under source control with local versioning. Once the application works locally, deploy the same package to staging; once it works in staging, go live in the cloud.]

What's the 'Value Add'?
Provide a platform that is scalable and available:

Services are always running; rolling upgrades/downgrades
Failure of any node is expected; state has to be replicated
Failure of a role (app code) is expected; automatic recovery
Services can grow large; provide state management that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers

Fabric Controller
Owns all data center hardware
Uses the inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal

Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains

Update Domains
Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update – for example, a set of nodes to update together
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles:
FC detects if a role dies
A role can indicate that it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state

Windows Azure FC monitors the health of the host:
If the node goes offline, FC will try to recover it

If a failed node can't be recovered, FC migrates role instances to a new node:
A suitable replacement location is found
Existing role instances are notified of the change

Key takeaways

Cloud services have specific design considerations:
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture:
Stateless roles and durable queues

Windows Azure frees service developers from many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud-based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, "EOS AM", launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, "EOS PM", launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

[Diagram: the AzureMODIS pipeline – a service web role portal and download queue feed the Data Collection Stage, followed by the Reprojection, Derivation Reduction, and Analysis Reduction stages, producing the research results.]


A statistical tool used to analyze DNA of HIV from large studies of infected patients.
PhyloD was developed by Microsoft Research and has been highly impactful for a small but important group of researchers:
100's of HIV and HepC researchers actively use it
1000's of research communities rely on its results
Cover of PLoS Biology, November 2008

Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
Requires a large number of test runs for a given job (1–10M tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure store
3. Deploy worker roles carrying the BLAST executable – an Init() function downloads and decompresses the data to the local disk



Step 2. Partitioning a Job
[Diagram: user input arrives at the web role, which hands it to a single partitioning worker role; the resulting input partitions are written to Azure storage and a queue message is posted for each.]

Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pick up queue messages, read their input partition from Azure storage, run BLAST, and write the BLAST output and logs back to Azure storage.]
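
Sketched in code, one BLAST-ready worker's loop might look like the following; this is not the AzureBLAST source – the queue name, container name, and command line are assumptions for illustration:

using System.Diagnostics;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class BlastWorker
{
    public static void Run()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudQueue tasks = account.CreateCloudQueueClient().GetQueueReference("blasttasks");
        CloudBlobContainer data = account.CreateCloudBlobClient().GetContainerReference("partitions");

        CloudQueueMessage msg;
        while ((msg = tasks.GetMessage()) != null)
        {
            string partition = msg.AsString;  // e.g. "input-0042"
            data.GetBlobReference(partition).DownloadToFile(partition);

            // Run the BLAST executable staged during Init(); arguments are illustrative.
            Process.Start("blastall.exe", "-i " + partition + " -o " + partition + ".out")
                   .WaitForExit();

            data.GetBlobReference(partition + ".out").UploadFile(partition + ".out");
            tasks.DeleteMessage(msg);  // only after the output is durable
        }
    }
}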



Lessons learned:
• Always design with failure in mind – on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the optimal size may change depending on the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it's good to know where
• Cutting 10 years of computation down to 1 week is great!! – the little cloud development headaches are probably worth it

Scaling a fixed job across workers:

Workers | Clock Duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13

[Chart: resources versus time – the same job can use many workers briefly or few workers for longer, illustrating time-space fungibility in the cloud.]

Utilizes a general jobs-based task manager, which registers jobs and their resulting data products.
[Diagram: a job definition is split into tasks that run on an (HPC) cluster or in Azure data centers; a registry, maintained by an administrator through a registry broker, tracks jobs and data products. Highly sensitive data stays on the user premises (or internet) in a local registry, and the user manages jobs and retrieves results through a web management interface.]

Client Visualization / Cloud Data and Computation

The Cloud is not a Jack-of-All-Trades.
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port

Our goal, then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

The user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account:
"US Anywhere", "US North Central", "US South Central", …
Can co-locate a storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

The blob namespace: Account → Container → Blob
Example: the account "jared" holds the containers "images" (blobs PIC01.JPG, PIC02.JPG) and "movies" (blob MOV1.AVI), giving the URL:
http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level:
Privately or publicly accessible

Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container
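
In the .NET client this is a few calls; a minimal sketch, with illustrative names:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class ContainerSetup
{
    public static void CreatePublicContainer()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlobContainer images = account.CreateCloudBlobClient()
                                           .GetContainerReference("images");
        images.CreateIfNotExist();

        // Access policy is set at the container level.
        images.SetPermissions(new BlobContainerPermissions
        {
            PublicAccess = BlobContainerPublicAccessType.Blob
        });

        // Metadata are name/value pairs, up to 8 KB per container.
        images.Metadata["owner"] = "jared";
        images.SetMetadata();
    }
}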

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob

[Diagram: in the account → container → blob hierarchy (e.g. account "jared", containers "images" and "movies"), each blob is made up of blocks or pages: Block/Page 1, 2, 3, … identified as Block Id 1 … Block Id N.]

Uploading a 10 GB movie, "TheBlob.wmv", to Windows Azure Storage (pseudocode):

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob

Block operations
PutBlock
Puts an uncommitted block, defined by its block ID, for the blob

Block list operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update the blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size of each block is returned
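
Mapped onto the .NET Storage Client library, the pseudocode above looks roughly like this; block IDs must be Base64-encoded strings, and the names and block numbering scheme are illustrative:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class BlockUpload
{
    public static void Upload(string path)
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlobContainer movies = account.CreateCloudBlobClient()
                                           .GetContainerReference("movies");
        movies.CreateIfNotExist();
        CloudBlockBlob blob = movies.GetBlockBlobReference("TheBlob.wmv");

        var blockIds = new List<string>();
        var buffer = new byte[4 * 1024 * 1024];  // blocks can be up to 4 MB
        using (FileStream file = File.OpenRead(path))
        {
            int n, i = 0;
            while ((n = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Block IDs are opaque; Base64 of a fixed-width counter works.
                string id = Convert.ToBase64String(
                    Encoding.UTF8.GetBytes(i++.ToString("d6")));
                blob.PutBlock(id, new MemoryStream(buffer, 0, n), null);
                blockIds.Add(id);  // uncommitted until PutBlockList
            }
        }
        blob.PutBlockList(blockIds);  // commit: this defines the readable blob
    }
}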

Page blob example: create MyBlob, specifying a blob size of 10 GB (a 10 GB address space) with a fixed page size of 512 bytes, then issue random-access operations:

PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)

GetBlob [1000, 2048) returns all zeros for the first 536 bytes (1536 - 1000 = 536, since no valid data is stored below offset 1536 in that range), then the 512 bytes of data stored in [1536, 2048)
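
A rough .NET equivalent of the operations above; CloudPageBlob exposes Create, WritePages, and GetPageRanges, and writes must cover whole 512-byte pages. A sketch, not production code:

using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class PageBlobDemo
{
    public static void Run()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlobContainer c = account.CreateCloudBlobClient()
                                      .GetContainerReference("demo");
        c.CreateIfNotExist();

        CloudPageBlob blob = c.GetPageBlobReference("MyBlob");
        blob.Create(10L * 1024 * 1024 * 1024);  // reserve a 10 GB address space

        // Writes are immediate and must be 512-byte aligned.
        var page = new byte[512];
        blob.WritePages(new MemoryStream(page), 0);     // PutPage [0, 512)
        blob.WritePages(new MemoryStream(page), 1536);  // PutPage [1536, 2048)

        // Only pages that were written count as valid data ranges.
        foreach (PageRange r in blob.GetPageRanges())
            System.Console.WriteLine("[{0}, {1}]", r.StartOffset, r.EndOffset);
    }
}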

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob
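
In the .NET client a snapshot is one call, and promotion can be expressed as a copy back over the base blob; a minimal sketch, assuming the early StorageClient calls:

using Microsoft.WindowsAzure.StorageClient;

public class Snapshots
{
    public static void SnapshotAndPromote(CloudBlob blob)
    {
        // Capture a read-only, point-in-time version; only deltas are stored.
        CloudBlob snapshot = blob.CreateSnapshot();

        // ...later, restore the base blob to that version ("promotion").
        blob.CopyFromBlob(snapshot);
    }
}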

A Windows Azure Drive is a page blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A page blob can be mounted by only one VM at a time for read/write

Remote access via the page blob:
Can upload a VHD to its page blob using the blob interface, then mount it as a drive
Can download the drive through the page blob interface
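
Mounting a drive from a role looks roughly like this, assuming the CloudDrive API from the Microsoft.WindowsAzure.CloudDrive assembly in the early SDK; sizes and names are placeholders, and a real role must first dedicate local storage to the drive cache:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class DriveMount
{
    public static string MountDataDrive(CloudStorageAccount account)
    {
        // CreateCloudDrive is an extension method supplied by the CloudDrive assembly.
        CloudDrive drive = account.CreateCloudDrive(
            account.BlobEndpoint + "/drives/data.vhd");
        drive.Create(64);  // size in MB; fails if the page blob already exists

        // Mount returns the local NTFS path of the volume inside the VM.
        return drive.Mount(25, DriveMountOptions.None);  // 25 MB local read cache
    }
}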

Provides Structured Storage
Massively scalable tables:
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly available & durable:
Data is replicated several times

Familiar and easy-to-use API:
ADO.NET Data Services (.NET 3.5 SP1)
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can hold many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)

Entity
An entity is a set of properties (columns)
Required properties: PartitionKey, RowKey, and Timestamp
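
A minimal sketch of an entity and a LINQ query using the ADO.NET Data Services-based client from the early SDK; the table and property names are illustrative:

using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Entities inherit PartitionKey, RowKey and Timestamp from TableServiceEntity.
public class JobEntity : TableServiceEntity
{
    public string Status { get; set; }
    public JobEntity() { }
    public JobEntity(string project, string jobId)
        : base(project, jobId) { }  // partition by project, row key = job id
}

public class TableDemo
{
    public static void Run()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudTableClient tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("jobs");

        TableServiceContext ctx = tables.GetDataServiceContext();
        ctx.AddObject("jobs", new JobEntity("modis", "job-0001") { Status = "queued" });
        ctx.SaveChanges();

        // Only PartitionKey and RowKey are indexed; other filters scan.
        var queued = (from j in ctx.CreateQuery<JobEntity>("jobs")
                      where j.PartitionKey == "modis" && j.Status == "queued"
                      select j).ToList();
    }
}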

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember that Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates in all of your data stores

Network Communication
• Increasing the VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (see the sketch below)
• Use the built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use 'heartbeat' mechanisms when debugging your applications
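
The retry advice can be packaged once and reused everywhere data is touched; a minimal sketch with hypothetical parameters (the StorageClient library also ships its own RetryPolicies):

using System;
using System.Threading;

public static class Retry
{
    // Retries an operation with simple exponential backoff: 1s, 2s, 4s, ...
    public static T WithBackoff<T>(Func<T> operation, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts) throw;
                Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}

// Usage, e.g.: var msg = Retry.WithBackoff(() => queue.GetMessage(), 4);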

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications that make it easy to upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other 'core' services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier-one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 20

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 21

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 22

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

[Diagram – the development lifecycle: develop your app at work or at home
against the local Development Fabric and Development Storage, keep versions
in source control, confirm the application works locally, then in staging,
and finally run it in the cloud.]

What’s the ‘Value Add’?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: avoid single points of failure
Allocation is across fault domains

Update Domains

Purpose: ensure the service stays up while undergoing an update

Update domains are the unit of software/configuration update – for
example, a set of nodes to update. They are used when rolling forward
or backward. The developer assigns the number required by each role;
for example, 10 front-ends, across 5 update domains.

Allocation is across update domains

Push-button Deployment
Step 1: Allocate nodes – across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles – failed roles are
automatically restarted; node failure results in new nodes being
automatically allocated

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
The state machine kicks in again to drive us back to the goal state

Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud-based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided
by a number of different researchers

[Diagram – the AzureMODIS pipeline: a service web role portal feeds a
download queue; data moves through the data collection stage, reprojection
stage, derivation reduction stage, and analysis reduction stage to produce
research results.]



Statistical tool used to analyze DNA of HIV from large studies of
infected patients.

PhyloD was developed by Microsoft Research and has been highly impactful
with a small but important group of researchers:
• 100’s of HIV and HepC researchers actively use it
• 1000’s of research communities rely on its results
• Cover of PLoS Biology, November 2008

A typical job takes 10–20 CPU hours; extreme jobs require 1K–2K CPU hours.
• Requires a large number of test runs for a given job (1–10M tests)
• Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure store
3. Deploy the worker roles – each role’s Init() function downloads and
decompresses the data to the local disk, alongside the deployed BLAST
executable



Step 2. Partitioning a Job
[Diagram – the web role takes user input and hands it to a single
partitioning worker role, which writes input partitions to Azure storage
and posts a queue message for each.]

Step 3. Doing the Work
[Diagram – BLAST-ready worker roles pick up the queue messages, read
their input partitions from Azure storage, and write BLAST output and
logs back to storage.]
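A sketch of what such a worker’s main loop might look like, assuming the StorageClient library; the queue and container names and the RunBlast helper are hypothetical – this is not the actual AzureBLAST code.

using System;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class BlastWorkerSketch
{
    public static void Run()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudQueue jobs = account.CreateCloudQueueClient().GetQueueReference("partitions");
        // Container assumed to have been created during staging.
        CloudBlobContainer container =
            account.CreateCloudBlobClient().GetContainerReference("blast");

        while (true)
        {
            // Long visibility timeout: a BLAST partition can take a while.
            CloudQueueMessage msg = jobs.GetMessage(TimeSpan.FromHours(1));
            if (msg == null) { Thread.Sleep(5000); continue; }   // queue drained

            string partition = msg.AsString;
            container.GetBlobReference("input/" + partition)
                     .DownloadToFile(partition + ".in");

            RunBlast(partition + ".in", partition + ".out");     // hypothetical helper

            container.GetBlobReference("output/" + partition)
                     .UploadFile(partition + ".out");
            jobs.DeleteMessage(msg);   // commit only after the output is durable
        }
    }

    static void RunBlast(string input, string output) { /* launch blast.exe */ }
}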



• Always design with failure in mind – on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the optimal size may change depending on the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! – the little cloud development headaches are probably worth it

Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Chart – resources versus time: time–space fungibility in the cloud. The
same total computation can be done by many workers for a short time or by
few workers for a long time.]

Utilizes a general jobs-based task manager which registers jobs and their
resulting data.

[Diagram – a job definition fans out into tasks, with data products
recorded in a registry. A registry broker connects the user premises (or
internet) – the user, administrator, local registry, (HPC) cluster, and
web management, where highly sensitive data stays – to the Azure
datacenters, and results flow back to the user.]

Client Visualization / Cloud Data and Computation

The Cloud is not a jack-of-all-trades. Client-side tools are particularly
appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port

Our goal, then: make the best use of the capabilities of client and cloud
computing, often by making the cloud invisible to the end user.

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

A user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account:
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate a storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Example namespace:
Account: jared
  Container: images – Blobs: PIC01.JPG, PIC02.JPG
  Container: movies – Blob: MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible

Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container
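As a sketch, creating a container, making its blobs publicly readable, and attaching metadata with the StorageClient library; the container name and metadata values are illustrative.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class ContainerSketch
{
    public static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlobContainer images =
            account.CreateCloudBlobClient().GetContainerReference("images");
        images.CreateIfNotExist();

        // Access policy is set at the container level: make blobs publicly readable.
        images.SetPermissions(new BlobContainerPermissions
        {
            PublicAccess = BlobContainerPublicAccessType.Blob
        });

        // Container metadata: name/value pairs, up to 8 KB in total.
        images.Metadata["owner"] = "jared";
        images.SetMetadata();
    }
}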

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

[Diagram – within the account jared, each blob in a container (e.g. a
10 GB movie) is composed of blocks or pages: Block Id 1, Block Id 2,
Block Id 3, … Block Id N.]

Uploading a 10 GB movie as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

The committed blob, TheBlob.wmv, now lives in Windows Azure Storage.

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block, identified by its block ID, for the blob

Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed lists to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size of each block are returned
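A runnable sketch of the same upload using the StorageClient library’s CloudBlockBlob; the file and blob names are illustrative. Note that block IDs are Base64 strings and, within one blob, should be of equal length.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class BlockBlobSketch
{
    public static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlockBlob blob = account.CreateCloudBlobClient()
            .GetBlockBlobReference("movies/TheBlob.wmv");

        var blockIds = new List<string>();
        using (FileStream file = File.OpenRead("TheBlob.wmv"))
        {
            byte[] buffer = new byte[4 * 1024 * 1024];   // 4 MB maximum per block
            int n, i = 0;
            while ((n = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Fixed-width counter keeps all Base64 block IDs the same length.
                string id = Convert.ToBase64String(
                    Encoding.UTF8.GetBytes(i++.ToString("d6")));
                blob.PutBlock(id, new MemoryStream(buffer, 0, n), null);
                blockIds.Add(id);
            }
        }
        // Nothing is readable until the block list is committed.
        blob.PutBlockList(blockIds);
    }
}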

Create MyBlob
Specify blob size = 10 GBytes; fixed page size = 512 bytes
[Diagram – a 10 GB address space with offsets 0, 512, 1024, 1536, 2048, 2560, …]

Random access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)

GetBlob [1000, 2048) returns all zeros for the first 536 bytes (offsets
1000–1535 were cleared), then 512 bytes of the data stored in [1536, 2048)
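The corresponding StorageClient calls, as a sketch (blob name and sizes illustrative): WritePages, ClearPages, and GetPageRanges map to PutPage, ClearPage, and GetPageRange, and offsets must be 512-byte aligned.

using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class PageBlobSketch
{
    public static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudPageBlob blob = account.CreateCloudBlobClient()
            .GetPageBlobReference("data/MyBlob");

        blob.Create(10L * 1024 * 1024 * 1024);          // 10 GB address space

        byte[] pages = new byte[1024];                  // two 512-byte pages
        blob.WritePages(new MemoryStream(pages), 512);  // PutPage [512, 1536)
        blob.ClearPages(512, 512);                      // ClearPage [512, 1024)

        // Only ranges still holding data are returned – here the single
        // remaining range covering [1024, 1536); everything else reads as zeros.
        foreach (PageRange range in blob.GetPageRanges())
            System.Console.WriteLine(range.StartOffset + " - " + range.EndOffset);
    }
}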

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob
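A small sketch of snapshot and promote with the StorageClient library; the blob name is illustrative, and promotion is expressed here as copying the snapshot back over the base blob.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class SnapshotSketch
{
    public static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlob blob = account.CreateCloudBlobClient()
            .GetBlobReference("images/PIC01.JPG");

        // Take a read-only, point-in-time snapshot; only deltas are stored.
        CloudBlob snapshot = blob.CreateSnapshot();

        // "Promote" the prior version by copying the snapshot over the base blob.
        blob.CopyFromBlob(snapshot);
    }
}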

A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS
Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write

Remote access via the Page Blob
Can upload a VHD to its Page Blob using the blob interface, then mount it
as a Drive
Can download the Drive through the Page Blob interface
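A sketch of the drive workflow, assuming the CloudDrive API from the Windows Azure SDK; the blob path, the cache resource name (“DriveCache”, which would be declared as local storage in the service model), and the sizes are all illustrative.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;   // CloudDrive ships in the CloudDrive assembly

public class DriveSketch
{
    public static void MountDrive()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;

        // The drive cache lives in local storage declared in the service model.
        LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
        CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

        CloudDrive drive = account.CreateCloudDrive("drives/data.vhd");
        drive.Create(1024);                        // 1 GB page blob, NTFS-formatted
        string path = drive.Mount(cache.MaximumSizeInMegabytes,
                                  DriveMountOptions.None);
        // path now behaves as an ordinary NTFS volume for this VM.
    }
}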

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table names are scoped by account
A table is a set of entities (i.e., rows)

Entity
A set of properties (columns)
Required properties: PartitionKey, RowKey and Timestamp
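A sketch of a table entity and the ADO.NET Data Services access path, using the StorageClient library; the table name (“Tasks”), entity shape, and key choices are illustrative.

using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// PartitionKey, RowKey and Timestamp come from the TableServiceEntity base class.
public class TaskEntity : TableServiceEntity
{
    public TaskEntity() { }
    public TaskEntity(string job, string task)
    {
        PartitionKey = job;    // entities in one job share a partition
        RowKey = task;         // unique within the partition
    }
    public string Status { get; set; }
}

public class TableSketch
{
    public static void Main()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudTableClient tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("Tasks");

        TableServiceContext ctx = tables.GetDataServiceContext();
        ctx.AddObject("Tasks", new TaskEntity("job42", "task-001") { Status = "queued" });
        ctx.SaveChanges();

        // LINQ over ADO.NET Data Services; filtering on the keys is the
        // only indexed access path.
        var pending = ctx.CreateQuery<TaskEntity>("Tasks")
                         .Where(t => t.PartitionKey == "job42")
                         .ToList();
    }
}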

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember that Azure tables only index on the partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing the VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs – note
that you lose durable messaging when you do this

Testing & Development
• Include retry logic everywhere you access data (see the sketch below)
• Use the built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
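For the retry guidance above, a small generic wrapper is often enough; this sketch (the attempt count and back-off schedule are illustrative) retries any storage call with exponential back-off.

using System;
using System.Threading;

public static class RetrySketch
{
    public static T WithRetry<T>(Func<T> call, int attempts)
    {
        for (int i = 1; ; i++)
        {
            try { return call(); }
            catch (Exception)
            {
                if (i >= attempts) throw;            // give up after N tries
                // Exponential back-off: 1s, 2s, 4s, ...
                Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, i - 1)));
            }
        }
    }

    // Example use, wrapping a blob read:
    //   string text = RetrySketch.WithRetry(() => blob.DownloadText(), 4);
}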

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, and back the offering with a technical engagement
team. Lower the barrier to entry through tutorials, accelerators, and
developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications that make it easy to upload
data and samples for reuse. Let the community use these to host their own
data sets.
• Services for Research. Provide applications and core services for research
as coherent solution accelerators. Pull through MS products and MSR
technologies, partner with ISVs, and make these technologies discoverable
and usable.
• Ask the question: what does it take to catalyze a community of researchers –
what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what? And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished, and when a data
collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 23

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 24

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 25

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal

Fault Domains

Purpose: avoid single points of failure
Allocation is across fault domains

Update Domains

Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update
Example: a set of nodes to update
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains

Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated


The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state

Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud-based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

AzureMODIS pipeline
Diagram: an AzureMODIS service web role portal feeds a download queue; data
flows through a data collection stage, a reprojection stage, a derivation
reduction stage and an analysis reduction stage to produce research results.


PhyloD is a statistical tool used to analyze DNA of HIV from large studies of infected patients.

PhyloD was developed by Microsoft Research and has been highly impactful.
A small but important group of researchers uses it: 100s of HIV and HepC
researchers actively use it, and 1000s of research communities rely on its
results. It made the cover of PLoS Biology, November 2008.

A typical job takes 10–20 CPU hours; extreme jobs require 1K–2K CPU hours.
A job requires a large number of test runs (1–10M tests), with highly
compressed data per job (~100 KB per job).

Step 1. Staging
1. Compress the required data from the local sequence database
2. Upload the compressed data to Azure storage
3. Deploy the worker roles with the BLAST executable – each role’s Init()
   function downloads and decompresses the data to the local disk

Step 2. Partitioning a Job
Diagram: the web role takes the user input, writes an input partition to
Azure storage, and hands a queue message to a single partitioning worker role.

Step 3. Doing the Work
Diagram: BLAST-ready worker roles pick up queue messages, read their input
partition from Azure storage, run BLAST, and write the BLAST output and logs
back to Azure storage.
Lessons learned
• Always design with failure in mind – on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the optimal size may change with the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! – the little cloud development headaches are probably worth it

Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

Chart: resources versus time – time–space fungibility in the cloud; the same
job can run on many workers for a short time or on few workers for a long time.

Utilizes a general jobs-based task manager which registers jobs and their
resulting data.

Diagram: a registry broker connects user premises (or the internet) with the
Azure datacenters – a job definition fans out into tasks on an (HPC) cluster,
data products and results are recorded in a registry, highly sensitive data
stays in a local registry on the user's premises, and an administrator runs
the broker through a web management interface.

Client Visualization / Cloud Data and Computation

The Cloud is not a jack-of-all-trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port

Our goal then:
Make the best use of the capabilities of client and cloud computing, often by
making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose a geo-location to host the storage account:
“US Anywhere”, “US North Central”, “US South Central”, …

Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Example namespace hierarchy:
Account: jared
  Container: images – Blobs: PIC01.JPG, PIC02.JPG
  Container: movies – Blob: MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are <name, value> pairs
Up to 8 KB per container
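
A sketch of creating a container and setting its access policy and metadata, assuming Microsoft.WindowsAzure.StorageClient; the names are illustrative.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
CloudBlobContainer container =
    account.CreateCloudBlobClient().GetContainerReference("images");
container.CreateIfNotExist();

// Access policy is set at the container level; private by default,
// here opened for public read access to individual blobs.
container.SetPermissions(new BlobContainerPermissions
{
    PublicAccess = BlobContainerPublicAccessType.Blob
});

// Metadata: name/value pairs, up to 8 KB per container.
container.Metadata["project"] = "azureblast";  // illustrative pair
container.SetMetadata();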

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit: 1 TB per blob

Diagram: a 10 GB movie is uploaded as a blob (e.g. movies/MOV1.AVI in the
jared account) made up of blocks or pages – Block Id 1, Block Id 2,
Block Id 3, …, Block Id N.

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

The committed blocks become TheBlob.wmv in Windows Azure Storage.

Blocks can be up to 4 MB each
Each block can be a different size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size of each block are returned
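
A runnable sketch of the upload sequence shown above, assuming Microsoft.WindowsAzure.StorageClient; note that block IDs must be Base64-encoded strings, which the pseudocode glosses over.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
CloudBlockBlob blob = account.CreateCloudBlobClient()
                             .GetBlockBlobReference("movies/TheBlob.wmv");

var blockIds = new List<string>();
using (FileStream file = File.OpenRead("TheBlob.wmv"))
{
    byte[] buffer = new byte[4 * 1024 * 1024];  // blocks can be up to 4 MB
    int read, n = 0;
    while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Block IDs are Base64 strings of equal length, up to 64 bytes.
        string blockId = Convert.ToBase64String(
            Encoding.UTF8.GetBytes(string.Format("block-{0:D6}", n++)));
        blob.PutBlock(blockId, new MemoryStream(buffer, 0, read), null);
        blockIds.Add(blockId);  // uncommitted until PutBlockList
    }
}
blob.PutBlockList(blockIds);    // commits the readable version of the blob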

Create MyBlob
Specify blob size = 10 GB
Fixed page size = 512 bytes

Random access operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)

GetPageRange[0, 4096) returns the valid data ranges:
[0,512), [1536,2560)

GetBlob[1000, 2048) returns
all 0 for the first 536 bytes,
then the next 512 bytes are the data stored in [1536,2048)
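
A sketch of these operations through the .NET client, assuming Microsoft.WindowsAzure.StorageClient; the blob name is illustrative and offsets must be 512-byte aligned.

using System;
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
CloudPageBlob pageBlob =
    account.CreateCloudBlobClient().GetPageBlobReference("disks/MyBlob");

pageBlob.Create(10L * 1024 * 1024 * 1024);       // 10 GB of zeroed address space

byte[] data = new byte[1024];                    // two 512-byte pages
pageBlob.WritePages(new MemoryStream(data), 0);  // PutPage[0, 1024)

// Enumerate the ranges that actually hold data; the rest reads as zeros.
foreach (PageRange range in pageBlob.GetPageRanges())
    Console.WriteLine("[{0}, {1}]", range.StartOffset, range.EndOffset);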

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob
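
A sketch of taking a snapshot before an update, assuming Microsoft.WindowsAzure.StorageClient; the blob name is illustrative.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
CloudBlob blob =
    account.CreateCloudBlobClient().GetBlobReference("images/PIC01.JPG");

// Read-only, point-in-time copy; only deltas are stored afterwards.
CloudBlob snapshot = blob.CreateSnapshot();

// The base blob keeps taking writes; the snapshot preserves the old bytes.
blob.UploadText("new content");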


A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume
virtual hard drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write

Remote access via the Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
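
A sketch of mounting a drive from role code, assuming the v1.x SDK's Microsoft.WindowsAzure.CloudDrive assembly; treat the exact names here ("DriveCache" local resource, blob path) as assumptions.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");

// The drive cache must be initialized once per role instance.
LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

CloudDrive drive = account.CreateCloudDrive("drives/mydata.vhd"); // a Page Blob
drive.Create(1024);  // size in MB; only needed the first time
string path = drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);

// Ordinary NTFS file I/O now works under 'path'; call drive.Unmount() when done.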

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
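
A sketch of an entity and a key-based query, assuming Microsoft.WindowsAzure.StorageClient over ADO.NET Data Services; the Task entity and table name are illustrative.

using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Create the table once, then query by the indexed keys.
var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
account.CreateCloudTableClient().CreateTableIfNotExist("Tasks");

TableServiceContext context =
    account.CreateCloudTableClient().GetDataServiceContext();
var tasks = from t in context.CreateQuery<TaskEntity>("Tasks")
            where t.PartitionKey == "job42"   // indexed; everything else scans
            select t;

// An entity is a set of properties plus the required keys.
public class TaskEntity : TableServiceEntity
{
    public TaskEntity() { }            // required for deserialization
    public TaskEntity(string jobId, string taskId)
    {
        PartitionKey = jobId;          // all tasks of a job share a partition
        RowKey = taskId;               // unique within the partition
    }
    public string State { get; set; }  // an ordinary property (column)
}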

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
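
As a minimal sketch of the retry advice above (a generic helper, not a platform API):

using System;
using System.Threading;

static class Retry
{
    public static T WithRetry<T>(Func<T> operation, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts) throw;  // give up eventually
                // Exponential backoff: 1 s, 2 s, 4 s, ...
                Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}

// Usage: wrap any data access, e.g. a queue read.
// CloudQueueMessage msg = Retry.WithRetry(() => queue.GetMessage(), 4);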

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 26

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 27

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 28

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers

Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated

[Diagram: allocation across fault and update domains, behind load balancers]

The FC Keeps Your Service Running
The Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate that it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state

The Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change

Key takeaways

Cloud services have specific design considerations:
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture:
Stateless roles and durable queues

Windows Azure frees service developers from many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

[Diagram: the AzureMODIS pipeline. A service web role portal feeds a download queue; data flows through a data collection stage, a reprojection stage, a derivation reduction stage, and an analysis reduction stage to produce research results.]
Statistical tool used to analyze DNA of HIV from large studies of infected patients

PhyloD was developed by Microsoft Research and has been highly impactful, serving a small but important group of researchers

Hundreds of HIV and HepC researchers actively use it
Thousands of research communities rely on its results

Cover of PLoS Biology, November 2008

Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours

Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data from the local sequence database
2. Upload it to the Azure store
3. Deploy the worker roles; an Init() function downloads and decompresses the data (and the BLAST executable) to the local disk
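A hedged sketch of what such an Init() step might look like with the v1 Storage Client Library blob API; the container name, blob name, and local paths are assumptions:

using System.IO;
using System.IO.Compression;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Download the compressed reference database, then inflate it to local disk.
var account = CloudStorageAccount.DevelopmentStorageAccount;
var container = account.CreateCloudBlobClient().GetContainerReference("blastdata");
var blob = container.GetBlobReference("refdb.gz");   // assumed blob name
blob.DownloadToFile(@"C:\local\refdb.gz");

using (var inFile = File.OpenRead(@"C:\local\refdb.gz"))
using (var gz = new GZipStream(inFile, CompressionMode.Decompress))
using (var outFile = File.Create(@"C:\local\refdb.fasta"))
{
    gz.CopyTo(outFile);   // .NET 4; earlier frameworks need a manual copy loop
}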

Step 2. Partitioning a Job
[Diagram: user input arrives at the web role; a single partitioning worker role splits it into input partitions in Azure storage and posts a queue message for each one.]

Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pull queue messages, read their input partitions from Azure storage, and write BLAST output and logs back to storage.]
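For a sense of shape, a minimal worker-role skeleton for step 3. RoleEntryPoint is the real service-runtime hook, but the queue name, tool path, and argument handling below are assumptions; AzureBLAST's actual code surely differs.

using System.Diagnostics;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

public class BlastWorker : RoleEntryPoint
{
    public override void Run()
    {
        var account = CloudStorageAccount.DevelopmentStorageAccount;
        var queue = account.CreateCloudQueueClient().GetQueueReference("partitions");
        while (true)
        {
            CloudQueueMessage msg = queue.GetMessage();
            if (msg == null) { Thread.Sleep(5000); continue; }   // back off when idle

            // The message body names the input partition; run BLAST over it.
            Process p = Process.Start(@"C:\local\blastall.exe",
                                      "-i " + msg.AsString);     // assumed arguments
            p.WaitForExit();

            queue.DeleteMessage(msg);   // only after the work has completed
        }
    }
}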

Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere

Factoring work into optimal sizes has a large performance impact
- The optimal size may change depending on the scope of the job

Test runs are your friend
- Blowing $20,000 of computation is not a good idea

Make ample use of logging features
- When failure does happen, it’s good to know where

Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it

Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Figure: time-space fungibility in the cloud. The same total work can run on more resources for less time, or on fewer resources for more time.]

Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks whose data products are tracked in a registry. On the user premises (or the internet), an administrator, a local registry, highly sensitive data, and an (HPC) cluster sit behind a registry broker; users reach the Azure datacenters through web management, and results flow back.]

Client Visualization / Cloud Data and Computation

The Cloud is not a jack-of-all-trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port

Our goal then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

A user creates a globally unique storage account name
Can choose a geo-location to host the storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate a storage account with a compute account
Receives a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

[Diagram: storage account “jared” contains containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI).]
http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Privately or publicly accessible

Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob

[Diagram: the same account/container/blob hierarchy, one level deeper. Each blob is composed of blocks or pages; a 10 GB movie, for example, is uploaded as blocks Block Id 1 through Block Id N.]

string blobName = "TheBlob.wmv";
// Upload the blocks, then commit them as the readable blob in one step.
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

[Diagram: the committed blocks become TheBlob.wmv in Windows Azure Storage.]

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block, identified by its block ID, for the blob

Block list operations
PutBlockList
Provides the list of blocks comprising the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size of each block are returned

Create MyBlob
Specify blob size = 10 GBytes
Fixed page size = 512 bytes, over the 10 GB address space

Random access operations (byte offsets):
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)

GetBlob [1000, 2048) returns:
All 0 for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob

[Diagram: promoting a snapshot restores MyBlob to that version.]
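A hedged one-liner with the v1 library, given any CloudBlob reference (like the blob in the earlier sketch); CreateSnapshot is the real call:

// Take a read-only, point-in-time snapshot; subsequent writes still go to the base blob.
CloudBlob snapshot = blob.CreateSnapshot();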

A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write

Remote access via the Page Blob
Can upload the VHD to its Page Blob using the blob interface, then mount it as a drive
Can download the drive through the Page Blob interface
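A hedged mounting sketch: CloudDrive shipped in the SDK of this era, but the cache sizing, page-blob URI, and local-resource name below are assumptions, and the exact signatures may differ.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

// A local read cache backs the mounted drive (the resource name is assumed).
LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

var account = CloudStorageAccount.DevelopmentStorageAccount;
CloudDrive drive = account.CreateCloudDrive(
    "http://127.0.0.1:10000/devstoreaccount1/drives/mydata.vhd");   // assumed URI
string drivePath = drive.Mount(cache.MaximumSizeInMegabytes,
                               DriveMountOptions.None);
// drivePath now names an NTFS volume; use normal file I/O against it.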

Provides Structured Storage
Massively scalable tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly available & durable
Data is replicated several times

Familiar and easy-to-use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can contain many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)

Entity
A set of properties (columns)
Required properties: PartitionKey, RowKey, and Timestamp
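A hedged sketch with ADO.NET Data Services and the v1 library; the table name and the Status property are assumptions:

using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// An entity is a set of properties; PartitionKey, RowKey, Timestamp are required.
public class TaskEntity : TableServiceEntity
{
    public string Status { get; set; }   // assumed application property
}

// Insert, then query via LINQ over the data service context.
var account = CloudStorageAccount.DevelopmentStorageAccount;
var tables = account.CreateCloudTableClient();
tables.CreateTableIfNotExist("Tasks");

TableServiceContext ctx = tables.GetDataServiceContext();
ctx.AddObject("Tasks", new TaskEntity
{
    PartitionKey = "job-1", RowKey = "task-001", Status = "queued"
});
ctx.SaveChanges();

// Queries are efficient only on PartitionKey and RowKey (the sole index).
var tasksForJob = ctx.CreateQuery<TaskEntity>("Tasks")
                     .Where(t => t.PartitionKey == "job-1")
                     .ToList();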

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (a minimal sketch follows this list)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
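For the retry-logic item, a minimal, library-agnostic sketch with simple backoff; the attempt count and delays are assumptions, and the Storage Client Library's own RetryPolicies cover the common cases.

using System;
using System.Threading;

// Retry a storage operation a few times with linear backoff.
static void WithRetries(Action operation, int maxAttempts)
{
    for (int attempt = 1; ; attempt++)
    {
        try { operation(); return; }
        catch
        {
            if (attempt >= maxAttempts) throw;
            Thread.Sleep(1000 * attempt);   // transient faults often clear quickly
        }
    }
}

// Usage: WithRetries(() => queue.AddMessage(new CloudQueueMessage("x")), 3);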

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier-one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 29

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 30

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 31

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud-based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

[Diagram: the AzureMODIS pipeline; a service web role portal feeds a download queue, and data flows through a data collection stage, a reprojection stage, a derivation reduction stage, and an analysis reduction stage to produce research results.]



Statistical tool used to analyze DNA of HIV from large studies of infected patients

PhyloD was developed by Microsoft Research and has been highly impactful
Small but important group of researchers
Hundreds of HIV and HepC researchers actively use it
Thousands of research communities rely on the results

Cover of PLoS Biology, November 2008

A typical job takes 10–20 CPU hours; extreme jobs require 1,000–2,000 CPU hours
Requires a large number of test runs for a given job (1–10 million tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure store
3. Deploy worker roles; an Init() function downloads and decompresses the data to the local disk, alongside the BLAST executable

[Diagram: local sequence database → compressed → uploaded to Azure Storage → deployed to worker roles.]
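As a hedged illustration of step 3 (not AzureBLAST's actual source; the container and file names are assumed), the Init() download might look like:

using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class Staging
{
    // Downloads the compressed sequence database to local disk on role start-up.
    public static void Init()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudBlobClient blobs = account.CreateCloudBlobClient();
        CloudBlob db = blobs.GetBlobReference("blastdata/seqdb.zip");   // assumed names

        string localPath = Path.Combine(Path.GetTempPath(), "seqdb.zip");
        db.DownloadToFile(localPath);
        // ... decompress localPath next to the BLAST executable
        //     (a real role would use a LocalResource for its scratch disk) ...
    }
}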



Step 2. Partitioning a Job

[Diagram: the web role takes the user input, a single partitioning worker role splits it into input partitions in Azure Storage, and a queue message is written for each partition.]

Step 3. Doing the Work

[Diagram: BLAST-ready worker roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to storage; the web role and the single partitioning worker role continue to feed the queue.]
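A hedged sketch of the worker loop for step 3; RunBlast() is a hypothetical helper that shells out to the BLAST executable and uploads its output, and the queue name is assumed:

using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure.ServiceRuntime;

public class BlastWorker : RoleEntryPoint
{
    public override void Run()
    {
        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudQueue tasks = account.CreateCloudQueueClient().GetQueueReference("partitions");

        while (true)
        {
            CloudQueueMessage msg = tasks.GetMessage();
            if (msg == null) { Thread.Sleep(5000); continue; }   // queue drained; poll again

            RunBlast(msg.AsString);        // hypothetical: run BLAST on this partition,
                                           // upload output and logs to blob storage
            tasks.DeleteMessage(msg);      // delete only after the output is safely stored
        }
    }

    private void RunBlast(string partition) { /* assumed helper */ }
}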



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere

Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job

Test runs are your friend
- Blowing $20,000 of computation is not a good idea

Make ample use of logging features
- When failure does happen, it's good to know where

Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it

Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Chart: time-space fungibility in the cloud; trading resources against time.]

Uses a general job-based task manager that registers jobs and their resulting data

[Diagram: a job definition fans out into tasks on an (HPC) cluster; an administrator-managed registry broker connects the user's local registry, which can keep highly sensitive data on the user's premises (or the internet), with web management, the registry, and the data products/results in the Azure datacenters.]

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal, then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

A user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account
("US Anywhere", "US North Central", "US South Central", …)
Can co-locate a storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

[Diagram: the storage account namespace; account "jared" holds container "images" (blobs PIC01.JPG, PIC02.JPG) and container "movies" (blob MOV1.AVI), e.g.
http://jared.blob.core.windows.net/images/PIC01.JPG]

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible

Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container
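A hedged sketch with the Storage Client Library (the container name and metadata values are illustrative):

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class Containers
{
    static void Main()
    {
        CloudBlobClient blobs =
            CloudStorageAccount.DevelopmentStorageAccount.CreateCloudBlobClient();
        CloudBlobContainer images = blobs.GetContainerReference("images");
        images.CreateIfNotExist();

        // Public access policy: anyone can read blobs in this container.
        images.SetPermissions(new BlobContainerPermissions {
            PublicAccess = BlobContainerPublicAccessType.Blob
        });

        // Name/value metadata, up to 8 KB per container.
        images.Metadata["owner"] = "jared";
        images.SetMetadata();
    }
}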

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob

[Diagram: within the account/container/blob namespace, each blob is made up of blocks or pages; e.g. a 10 GB movie stored as blocks with Block IDs 1…N.]

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);   // upload each block (up to 4 MB)
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);   // commit: TheBlob.wmv becomes readable in Windows Azure Storage

Blocks can be up to 4 MB each
Each block can be a different size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block, identified by its Block ID, for the blob

Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The Block ID and size of each block are returned

Create MyBlob
Specify blob size = 10 GB
Fixed page size = 512 bytes

Random-access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0, 512), [1536, 2560)

GetBlob [1000, 2048) returns:
all 0 for the first 536 bytes (the cleared range)
the next 512 bytes are the data stored in [1536, 2048)

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

All writes are applied to the base blob name
Only the delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob

[Diagram: promoting a snapshot restores MyBlob to that prior version.]
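A hedged sketch of creating and enumerating snapshots (the container and blob names are assumed):

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class Snapshots
{
    static void Main()
    {
        CloudBlobContainer container = CloudStorageAccount.DevelopmentStorageAccount
            .CreateCloudBlobClient().GetContainerReference("images");
        CloudBlob blob = container.GetBlobReference("MyBlob");

        CloudBlob snap = blob.CreateSnapshot();   // read-only, point-in-time version

        // Enumerate the base blob together with its snapshots.
        var options = new BlobRequestOptions {
            UseFlatBlobListing = true,
            BlobListingDetails = BlobListingDetails.Snapshots
        };
        foreach (IListBlobItem item in container.ListBlobs(options))
        {
            // snapshots carry a SnapshotTime that distinguishes them from the base blob
        }
    }
}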

A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob
Can upload a VHD to its Page Blob using the blob interface, and then mount it as a drive
Can download the drive through the Page Blob interface
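A hedged sketch of mounting a drive, assuming the Microsoft.WindowsAzure.CloudDrive assembly is referenced; the page blob path, cache resource name, and sizes are all illustrative:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure.ServiceRuntime;

class Drives
{
    static void MountDrive()
    {
        // The drive cache must live on a local resource declared for the role.
        LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
        CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

        CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;
        CloudDrive drive = account.CreateCloudDrive(
            account.BlobEndpoint + "/drives/mydata.vhd");
        // drive.Create(1024);    // first time only: format a 1 GB NTFS volume

        string path = drive.Mount(25, DriveMountOptions.None);   // returns a drive letter path
        // ... ordinary NTFS file I/O under 'path'; call drive.Unmount() when done ...
    }
}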

Provides Structured Storage
Massively scalable tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy-to-use API
ADO.NET Data Services (.NET 3.5 SP1)
.NET classes and LINQ
REST, with any platform or language

Table
A storage account can hold many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)

Entity
A set of properties (columns)
Required properties:
PartitionKey, RowKey, and Timestamp
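A hedged sketch of an entity class and a LINQ query over it, assuming a table named "Jobs" (all names are illustrative):

using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class JobEntity : TableServiceEntity
{
    public JobEntity() { }                          // required for serialization
    public JobEntity(string project, string jobId)
        : base(project, jobId) { }                  // PartitionKey, RowKey
    public string Status { get; set; }              // an additional property (column)
}

class Tables
{
    static void Main()
    {
        var account = CloudStorageAccount.DevelopmentStorageAccount;
        var tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("Jobs");

        TableServiceContext ctx = tables.GetDataServiceContext();
        ctx.AddObject("Jobs", new JobEntity("modis", "job-42") { Status = "queued" });
        ctx.SaveChanges();

        var queued = from j in ctx.CreateQuery<JobEntity>("Jobs")
                     where j.PartitionKey == "modis"   // the keys are the only index
                     select j;
    }
}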

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates in all of your data stores

Network Communication
• Increasing the VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data
• Use the built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use 'heartbeat' mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagement team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can ask grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other 'core' services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 32

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 33

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 34

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

[Diagram: the AzureMODIS pipeline. A service web role portal drives a download queue through the data collection, reprojection, derivation reduction, and analysis reduction stages to produce the research results.]



Statistical tool used to analyze DNA of HIV from large studies of infected patients

PhyloD was developed by Microsoft Research and has been highly impactful for a small but important group of researchers:
Hundreds of HIV and HepC researchers actively use it
Thousands of research communities rely on its results
(Cover of PLoS Biology, November 2008)

A typical job takes 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
Requires a large number of test runs for a given job (1–10M tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure store
3. Deploy the Worker Roles; each role’s Init() function downloads and decompresses the data to its local disk
[Diagram: local sequence database, compressed, uploaded to Azure Storage, then deployed to the worker roles alongside the BLAST executable]
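
To make step 3 concrete, here is a minimal sketch of what such an Init() routine could look like with the .NET Storage Client library; the blob name, the "LocalCache" local-resource name, and the use of gzip are illustrative assumptions, not the actual AzureBLAST code:

// Hypothetical Init(): pull the compressed sequence database out of blob
// storage and unpack it onto this role instance's local disk.
using System.IO;
using System.IO.Compression;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

public static class BlastStaging
{
    public static void Init(CloudStorageAccount account)
    {
        // Disk space reserved for this instance; "LocalCache" is an assumed
        // LocalStorage resource name declared in ServiceDefinition.csdef.
        string root = RoleEnvironment.GetLocalResource("LocalCache").RootPath;

        CloudBlob db = account.CreateCloudBlobClient()
            .GetBlobReference("staging/seqdb.fasta.gz");   // assumed blob name

        using (Stream download = db.OpenRead())
        using (GZipStream unzip = new GZipStream(download, CompressionMode.Decompress))
        using (FileStream target = File.Create(Path.Combine(root, "seqdb.fasta")))
        {
            unzip.CopyTo(target);   // decompress straight onto the local disk
        }
    }
}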



Step 2. Partitioning a Job
[Diagram: the web role accepts the user input; a single partitioning worker role splits it into input partitions in Azure Storage and enqueues one queue message per partition.]

Step 3. Doing the Work
[Diagram: the BLAST-ready worker roles pick up queue messages, read their input partitions from Azure Storage, and write the BLAST output and logs back to Azure Storage.]
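
The pattern behind steps 2 and 3 is the standard queue-driven worker loop. The sketch below illustrates that loop with the .NET Storage Client library; it is not the AzureBLAST source, and the queue name, message format, and RunBlast helper are assumptions:

using System;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class BlastWorker
{
    public void Run(CloudStorageAccount account)
    {
        CloudQueue tasks = account.CreateCloudQueueClient().GetQueueReference("blast-tasks");
        CloudBlobClient blobs = account.CreateCloudBlobClient();

        while (true)
        {
            // Hide the message for 30 minutes; if this worker dies mid-task,
            // the message reappears and another worker picks it up.
            CloudQueueMessage msg = tasks.GetMessage(TimeSpan.FromMinutes(30));
            if (msg == null) { Thread.Sleep(5000); continue; }

            string partition = msg.AsString;   // name of one input-partition blob
            string outputFile = RunBlast(partition);
            blobs.GetBlobReference("output/" + partition).UploadFile(outputFile);

            tasks.DeleteMessage(msg);          // delete only after success
        }
    }

    // Assumed helper: shells out to the BLAST executable and returns
    // the local path of the output file it produced.
    string RunBlast(string partition) { throw new NotImplementedException(); }
}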



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere

Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job

Test runs are your friend
- Blowing $20,000 of computation is not a good idea

Make ample use of logging features
- When failure does happen, it’s good to know where

Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it

Workers    Clock Duration    Total run time    Computational run time
25         0:12:00           2:19:39           1:49:43
16         0:15:00           2:25:12           1:53:47
8          0:26:00           2:33:23           2:00:14
4          0:47:00           2:34:17           2:01:06
2          1:27:00           2:31:39           1:59:13

[Chart: time-space fungibility in the cloud. The same job can use more workers for a shorter time or fewer workers for a longer time, for a roughly constant total computational cost.]

Utilizes a general jobs-based task manager, which registers jobs and their resulting data.

[Diagram: a job definition fans out into tasks, and a registry tracks the jobs and their data products. On the user premises (or internet), a local registry and registry broker keep highly sensitive data local, with web management for the administrator; the tasks run on an (HPC) cluster or in the Azure datacenters, and results flow back to the user.]

Client Visualization / Cloud Data and Computation

The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal, then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose the geo-location that hosts the storage account
(“US Anywhere”, “US North Central”, “US South Central”, …)
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

[Diagram: the storage namespace. The account “jared” holds the containers “images” (blobs PIC01.JPG and PIC02.JPG) and “movies” (blob MOV1.AVI), giving each blob a URL such as:]

http://jared.blob.core.windows.net/images/PIC01.JPG
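
Because every blob is addressable this way, a blob in a container with a public access policy can be fetched with a plain HTTP GET, with no Azure library involved; a minimal sketch:

using System.Net;

class PublicRead
{
    static void Main()
    {
        // Anonymous REST read of a blob in a publicly readable container.
        WebClient web = new WebClient();
        byte[] picture = web.DownloadData(
            "http://jared.blob.core.windows.net/images/PIC01.JPG");
    }
}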

Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible

Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container
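
A sketch of those container-level settings with the .NET Storage Client library; the container name, metadata key, and access level are illustrative:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class ContainerSetup
{
    static void Main()
    {
        CloudStorageAccount account =
            CloudStorageAccount.Parse("UseDevelopmentStorage=true"); // dev fabric storage
        CloudBlobContainer images =
            account.CreateCloudBlobClient().GetContainerReference("images");
        images.CreateIfNotExist();

        // Access policy: allow anonymous reads of blobs in this container.
        images.SetPermissions(new BlobContainerPermissions
        {
            PublicAccess = BlobContainerPublicAccessType.Blob
        });

        // Container metadata: name/value pairs, up to 8 KB per container.
        images.Metadata["owner"] = "jared";
        images.SetMetadata();
    }
}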

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob

[Diagram: within the account/container namespace, each blob is stored as a sequence of blocks (Block ID 1..N) for block blobs, or as an array of pages (Page 1..N) for page blobs.]

Example: uploading a 10 GB movie to Windows Azure Storage as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block, identified by its block ID, for the blob

Block list operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update the blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size of each block are returned
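
Here is the same upload flow as a sketch against the .NET Storage Client library, reading the file in 4 MB chunks; the file and blob names are illustrative, and the block IDs are simply Base64-encoded counters:

using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlockUpload
{
    static void Main()
    {
        CloudStorageAccount account =
            CloudStorageAccount.Parse("UseDevelopmentStorage=true");
        CloudBlockBlob blob = account.CreateCloudBlobClient()
            .GetBlockBlobReference("movies/TheBlob.wmv");

        List<string> blockIds = new List<string>();
        byte[] buffer = new byte[4 * 1024 * 1024];   // blocks up to 4 MB each
        using (FileStream source = File.OpenRead("TheBlob.wmv"))
        {
            int read, n = 0;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                string id = Convert.ToBase64String(BitConverter.GetBytes(n++));
                blob.PutBlock(id, new MemoryStream(buffer, 0, read), null);
                blockIds.Add(id);                     // remember the commit order
            }
        }
        blob.PutBlockList(blockIds);   // commit: this version becomes readable
    }
}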

Example: create MyBlob with a blob size of 10 GB (a 10 GB address space) and a fixed page size of 512 bytes, then issue random-access operations in this order:

PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0, 512) and [1536, 2560)

GetBlob [1000, 2048) returns:
all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536, 2048)
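
The corresponding calls in the .NET Storage Client library look roughly like the sketch below (the blob name is illustrative); WritePages and ClearPages take 512-byte-aligned ranges, and GetPageRanges reports which ranges currently hold data:

using System;
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class PageBlobDemo
{
    static void Main()
    {
        CloudStorageAccount account =
            CloudStorageAccount.Parse("UseDevelopmentStorage=true");
        CloudPageBlob blob = account.CreateCloudBlobClient()
            .GetPageBlobReference("data/MyBlob");

        blob.Create(10L * 1024 * 1024 * 1024);   // reserve a 10 GB address space

        byte[] data = new byte[1024];            // two 512-byte pages of data
        blob.WritePages(new MemoryStream(data), 0);   // PutPage [0, 1024)
        blob.ClearPages(512, 512);                    // ClearPage [512, 1024)

        // Only ranges that still hold data are reported.
        foreach (PageRange range in blob.GetPageRanges())
            Console.WriteLine("[{0}, {1}]", range.StartOffset, range.EndOffset);
    }
}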

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob
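
A sketch of the snapshot flow with the Storage Client library (blob name and contents illustrative); a snapshot is cheap because only deltas are stored, and promotion amounts to copying the snapshot back over the base blob:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class SnapshotDemo
{
    static void Main()
    {
        CloudStorageAccount account =
            CloudStorageAccount.Parse("UseDevelopmentStorage=true");
        CloudBlob blob = account.CreateCloudBlobClient()
            .GetBlobReference("data/MyBlob");

        blob.UploadText("version 1");
        CloudBlob snapshot = blob.CreateSnapshot(); // read-only point-in-time view

        blob.UploadText("version 2");               // writes go to the base blob
        string v1 = snapshot.DownloadText();        // snapshot still serves "version 1"

        blob.CopyFromBlob(snapshot);                // "promote": restore the base blob
    }
}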

A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write

Remote access via the Page Blob:
Can upload the VHD to its Page Blob using the blob interface, then mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and easy-to-use APIs
ADO.NET Data Services (.NET 3.5 SP1)
.NET classes and LINQ
REST, usable from any platform or language

Table
A storage account can contain many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)

Entity
A set of properties (columns)
Required properties:
PartitionKey, RowKey and Timestamp
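
To make the entity model concrete, here is a minimal sketch using the Storage Client's ADO.NET Data Services support; the class, table name, and Score property are illustrative:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Any extra public properties beyond PartitionKey/RowKey/Timestamp
// become the entity's columns.
public class JobResult : TableServiceEntity
{
    public JobResult() { }                  // required by the data service
    public JobResult(string jobId, string taskId)
        : base(jobId, taskId) { }           // PartitionKey, RowKey
    public double Score { get; set; }
}

class TableDemo
{
    static void Main()
    {
        CloudStorageAccount account =
            CloudStorageAccount.Parse("UseDevelopmentStorage=true");
        CloudTableClient tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("jobresults");

        TableServiceContext ctx = tables.GetDataServiceContext();
        ctx.AddObject("jobresults", new JobResult("job42", "task7") { Score = 0.93 });
        ctx.SaveChanges();
    }
}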

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember that Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing the VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (see the sketch below)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
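
For the retry-logic point, a minimal sketch of a generic retry wrapper with exponential backoff; the attempt count and delays are illustrative, and the Storage Client library also lets you set a RetryPolicy directly on its client objects:

using System;
using System.Threading;

public static class Retry
{
    // Run an operation, retrying transient failures with 1s, 2s, 4s... backoff.
    public static T Do<T>(Func<T> operation, int attempts)
    {
        for (int i = 0; ; i++)
        {
            try { return operation(); }
            catch (Exception)
            {
                if (i >= attempts - 1) throw;   // out of retries: surface the error
                Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, i)));
            }
        }
    }
}

// Usage: string text = Retry.Do(() => blob.DownloadText(), 4);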

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop:
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with access to a research-oriented technical team

Azure resource offering:
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier-one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 35

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 36

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 37

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
1. Compress the required data from the local sequence database (Compressed)
2. Upload it to the Azure store (Uploaded)
3. Deploy the worker roles (Deployed); each role's Init() function downloads and decompresses the data and the BLAST executable to the local disk; a sketch follows this list
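A minimal sketch of that download-and-decompress step, assuming the .NET Storage Client library; the container and blob names are illustrative:

using System.IO;
using System.IO.Compression;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

static void StageDatabase(string connectionString, string localDbPath)
{
    var account = CloudStorageAccount.Parse(connectionString);
    var container = account.CreateCloudBlobClient().GetContainerReference("refdata");
    var blob = container.GetBlobReference("sequencedb.gz");
    blob.DownloadToFile(localDbPath + ".gz");                  // fetch the compressed database
    using (var src = File.OpenRead(localDbPath + ".gz"))
    using (var gz = new GZipStream(src, CompressionMode.Decompress))
    using (var dst = File.Create(localDbPath))
        gz.CopyTo(dst);                                        // decompress to the local disk
}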



Step 2. Partitioning a Job
The web role takes the user input and hands it to a single partitioning worker role, which writes one input partition per task to Azure storage and enqueues a matching queue message.

Step 3. Doing the Work
The BLAST-ready worker roles pick up partition messages from the queue, read their input partitions from Azure storage, and write the BLAST output and logs back to Azure storage; a sketch of this queue-driven pattern follows.
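A minimal sketch of the queue hand-off, assuming the .NET Storage Client library; the queue name and RunBlast() are illustrative:

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

var account = CloudStorageAccount.Parse(connectionString);
var queue = account.CreateCloudQueueClient().GetQueueReference("partitions");
queue.CreateIfNotExist();

// Partitioning side: one message per input partition.
queue.AddMessage(new CloudQueueMessage("partition-0042"));

// BLAST worker side: dequeue, compute, delete only on success.
var msg = queue.GetMessage(TimeSpan.FromMinutes(30));   // invisible to other workers while processed
if (msg != null)
{
    RunBlast(msg.AsString);                             // RunBlast(): placeholder for the real work
    queue.DeleteMessage(msg);                           // if the worker crashes first, the message reappears
}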



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere

Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job

Test runs are your friend
- Blowing $20,000 of computation is not a good idea

Make ample use of logging features
- When failure does happen, it's good to know where

Cutting 10 years of computation down to 1 week is great!
- Little cloud development headaches are probably worth it

Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

(Chart: resources vs. time, illustrating time-space fungibility in the Cloud: the same total computation can be spread across more workers to finish in far less wall-clock time.)

Azure Ocean utilizes a general jobs-based task manager, which registers jobs and their resulting data products: a job definition fans out into tasks, and the registry records the data products they produce.

(Architecture diagram) On the user premises (or internet): the user, a local registry, web management, highly sensitive data, and an administrator-run registry broker, optionally backed by an (HPC) cluster. In the Azure datacenters: the registry, the tasks, and the data products. Results flow back to the user.

Client Visualization / Cloud Data and Computation

The Cloud is not a jack-of-all-trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal, then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

The user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account:
"US Anywhere", "US North Central", "US South Central", ...
Can co-locate a storage account with a compute account
Receives a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account: jared
Container images holds blobs PIC01.JPG and PIC02.JPG
Container movies holds blob MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG
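How those names compose into the URI above, as a minimal sketch assuming the .NET Storage Client library (the local path is illustrative):

var account = CloudStorageAccount.Parse(connectionString);
var container = account.CreateCloudBlobClient().GetContainerReference("images");
container.CreateIfNotExist();
var blob = container.GetBlobReference("PIC01.JPG");
blob.UploadFile(@"C:\photos\PIC01.JPG");
// blob.Uri is now http://jared.blob.core.windows.net/images/PIC01.JPG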

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible

Associate Metadata with a Container
Metadata are <name, value> pairs
Up to 8 KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob

(Diagram: within the account/container/blob hierarchy above, each blob is itself composed of blocks or pages, Block/Page 1 through N, identified by Block Id 1 ... Block Id N; the example is a 10 GB movie.)

// Uploading the 10 GB movie block by block (slide pseudocode, tidied):
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
// Commit the staged blocks; TheBlob.wmv now exists in Windows Azure Storage.
PutBlockList(blobName, blockId1, ..., blockIdN);

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block, identified by its Block ID, for the blob

Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob
GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The Block ID and size of each block are returned
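The same upload, sketched against the .NET Storage Client library; Split() is a hypothetical helper that yields 4 MB chunks of the source file:

var blob = container.GetBlockBlobReference("TheBlob.wmv");
var blockIds = new List<string>();
int n = 0;
foreach (Stream chunk in Split(sourceFile, 4 * 1024 * 1024))
{
    // Block IDs must be Base64 strings; fixed-width IDs keep them equal length.
    string id = Convert.ToBase64String(BitConverter.GetBytes(n++));
    blob.PutBlock(id, chunk, null);    // stage an uncommitted block
    blockIds.Add(id);
}
blob.PutBlockList(blockIds);           // commit: the blob becomes readable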

Create MyBlob
Specify blob size = 10 GBytes, with a fixed page size of 512 bytes; this reserves a 10 GB address space (offsets 0, 512, 1024, 1536, 2048, 2560, ...)

Random-access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0,512), [1536,2560)

GetBlob [1000, 2048) returns:
all 0s for the first 536 bytes, then
512 bytes of data stored in [1536, 2048)
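Those operations map onto the page blob client like so; a minimal sketch, assuming the .NET Storage Client library:

var pageBlob = container.GetPageBlobReference("MyBlob");
pageBlob.Create(10L * 1024 * 1024 * 1024);            // 10 GB of zeroed address space

var data = new byte[1536];                            // writes must be 512-byte aligned
pageBlob.WritePages(new MemoryStream(data), 512);     // PutPage [512, 2048)
pageBlob.ClearPages(512, 1024);                       // ClearPage [512, 1536)

foreach (var range in pageBlob.GetPageRanges())       // only valid ranges come back
    Console.WriteLine("[{0}, {1}]", range.StartOffset, range.EndOffset);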

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
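For example, assuming the .NET Storage Client library, a promote amounts to copying the snapshot back over the base blob:

CloudBlob snapshot = blob.CreateSnapshot();   // read-only point-in-time version
// ... later, restore (promote) that version:
blob.CopyFromBlob(snapshot);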


A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write

Remote Access via the Page Blob
Can upload a VHD to its Page Blob using the blob interface, then mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
Highly available and durable: data is replicated several times
Familiar and easy-to-use API:
ADO.NET Data Services (.NET 3.5 SP1), .NET classes and LINQ
REST, with any platform or language

Table
A storage account can contain many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)

Entity
A set of properties (columns)
Required properties: PartitionKey, RowKey and Timestamp
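A minimal entity and insert, sketched against ADO.NET Data Services via the .NET Storage Client library; the table and property names are illustrative:

public class JobEntity : TableServiceEntity
{
    public JobEntity() { }                      // required by the serializer
    public JobEntity(string jobId, string taskId)
    {
        PartitionKey = jobId;                   // the two required keys
        RowKey = taskId;
    }
    public string Status { get; set; }          // an ordinary column
}

var tables = account.CreateCloudTableClient();
tables.CreateTableIfNotExist("Jobs");
var ctx = tables.GetDataServiceContext();
ctx.AddObject("Jobs", new JobEntity("job42", "task7") { Status = "Queued" });
ctx.SaveChangesWithRetries();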

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (a sketch follows this list)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use 'heartbeat' mechanisms when debugging your applications
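One way to package that retry advice; plain application code, not an SDK API, with the back-off schedule as an assumption:

static T WithRetries<T>(Func<T> operation, int attempts)
{
    for (int i = 1; ; i++)
    {
        try { return operation(); }
        catch
        {
            if (i >= attempts) throw;                             // give up after N tries
            Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, i)));   // back off: 2s, 4s, 8s, ...
        }
    }
}
// usage: var msg = WithRetries(() => queue.GetMessage(), 4);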

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS...
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done...
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other 'core' services...

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier-one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 38

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 39

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 40

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere

Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job

Test runs are your friend
- Blowing $20,000 of computation is not a good idea

Make ample use of logging features
- When failure does happen, it's good to know where

Cutting 10 years of computation down to 1 week is great!
- The little cloud-development headaches are probably worth it

Resources

Workers | Clock duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13

[Chart: time-space fungibility in the cloud (resources vs. time); roughly the same total resource consumption can be spent as many workers for a short time or as few workers for a long time]

Utilizes a general jobs-based task manager which registers jobs and their resulting data

[Architecture diagram: a job definition fans out into tasks on an (HPC) cluster; a Registry Broker connects a local registry holding highly sensitive data on the user premises (or internet) with the registry and data products in the Azure datacenters; an administrator and users interact through web management to retrieve results]

Client Visualization / Cloud Data and Computation

The Cloud is not a jack-of-all-trades
Client-side tools are particularly appropriate for
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose the geo-location to host the storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account (jared)
  Container “images”: blobs PIC01.JPG, PIC02.JPG
  Container “movies”: blob MOV1.AVI

Example blob URL: http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible

Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container
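
A small sketch of container creation, access policy, and metadata, again assuming the classic StorageClient API; the container name and metadata values are illustrative:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class ContainerSetup
{
    public static void Configure(CloudStorageAccount account)
    {
        CloudBlobClient blobs = account.CreateCloudBlobClient();
        CloudBlobContainer container = blobs.GetContainerReference("images");
        container.CreateIfNotExist();

        // Access policy: make blobs publicly readable (containers are private by default).
        BlobContainerPermissions perms = new BlobContainerPermissions();
        perms.PublicAccess = BlobContainerPublicAccessType.Blob;
        container.SetPermissions(perms);

        // Metadata: name/value pairs, up to 8 KB per container.
        container.Metadata["owner"] = "jared";
        container.SetMetadata();
    }
}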

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob

[Diagram: within a container, each blob is made up of blocks or pages (Block/Page 1, 2, 3, ..., Block Id N); the example below uploads a 10 GB movie as TheBlob.wmv to Windows Azure Storage]

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The Block ID and size of each block are returned
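
A minimal sketch of the block upload pattern above with the .NET Storage Client Library, assuming the classic Microsoft.WindowsAzure.StorageClient API; note that block IDs must be Base64-encoded strings of equal length:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure.StorageClient;

public static class BlockUpload
{
    public static void Upload(CloudBlobContainer container, string localFile)
    {
        CloudBlockBlob blob = container.GetBlockBlobReference("TheBlob.wmv");
        List<string> blockIds = new List<string>();
        byte[] buffer = new byte[4 * 1024 * 1024];   // blocks can be up to 4 MB

        using (FileStream file = File.OpenRead(localFile))
        {
            int n, i = 0;
            while ((n = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Block IDs are Base64 strings, scoped to this blob; keep them equal length.
                string blockId = Convert.ToBase64String(
                    Encoding.UTF8.GetBytes(i.ToString("d6")));
                using (MemoryStream block = new MemoryStream(buffer, 0, n))
                {
                    blob.PutBlock(blockId, block, null);   // uncommitted until PutBlockList
                }
                blockIds.Add(blockId);
                i++;
            }
        }

        // Commit: this list becomes the readable version of the blob.
        blob.PutBlockList(blockIds);
    }
}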

Create MyBlob with a specified blob size of 10 GB (a 10 GB address space)
Fixed page size: 512 bytes

Random-access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0, 512), [1536, 2560)

GetBlob [1000, 2048) returns:
All zeros for the first 536 bytes (that range was cleared)
The next 512 bytes are the data stored in [1536, 2048)
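
A sketch of the same kind of sequence with the .NET client, again assuming the classic StorageClient API; all pages and offsets must be 512-byte aligned:

using System;
using System.IO;
using Microsoft.WindowsAzure.StorageClient;

public static class PageBlobDemo
{
    public static void Run(CloudBlobContainer container)
    {
        CloudPageBlob blob = container.GetPageBlobReference("MyBlob");
        blob.Create(10L * 1024 * 1024 * 1024);   // reserve a 10 GB address space

        byte[] pages = new byte[1536];            // must be a multiple of 512 bytes
        using (MemoryStream data = new MemoryStream(pages))
        {
            blob.WritePages(data, 512);           // PutPage [512, 2048)
        }

        blob.ClearPages(512, 1024);               // ClearPage [512, 1536)

        // Enumerate the ranges that actually hold data.
        foreach (PageRange range in blob.GetPageRanges())
        {
            Console.WriteLine(range.StartOffset + " .. " + range.EndOffset);
        }
    }
}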

Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics
Immediate update
Concurrency: leases

All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

[Diagram: promoting a snapshot of MyBlob back to the base blob]
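
A sketch of taking a snapshot and promoting it, under the assumption that promotion is done by copying the snapshot back over the base blob with the classic StorageClient API:

using Microsoft.WindowsAzure.StorageClient;

public static class Snapshots
{
    public static void SnapshotAndPromote(CloudBlob blob)
    {
        // Take a read-only, point-in-time snapshot; writes continue to hit the base blob.
        CloudBlob snapshot = blob.CreateSnapshot();

        // ... later: promote (restore) the snapshot by copying it over the base blob.
        blob.CopyFromBlob(snapshot);
    }
}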

A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
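
A minimal mount sketch, assuming the Microsoft.WindowsAzure.CloudDrive API from the era's SDK; the "DriveCache" local resource, the blob URI, and the sizes are illustrative:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class DriveDemo
{
    public static string MountDrive(CloudStorageAccount account)
    {
        // Local disk cache for drive reads ("DriveCache" must be declared
        // as a local resource in the service definition).
        LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
        CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

        // The drive is a page blob holding an NTFS-formatted VHD.
        CloudDrive drive = account.CreateCloudDrive(
            "http://jared.blob.core.windows.net/drives/mydata.vhd");
        drive.Create(1024);   // size in MB; throws if the VHD already exists

        // Mount returns a drive letter path usable with normal NTFS file APIs.
        return drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);
    }
}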

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy-to-Use API
ADO.NET Data Services (.NET 3.5 SP1)
.NET classes and LINQ
REST, with any platform or language

Table
A storage account can create many tables
Table names are scoped by account
A table is a set of entities (i.e., rows)

Entity
A set of properties (columns)
Required properties: PartitionKey, RowKey and Timestamp
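
A sketch of defining and inserting an entity with the .NET 3.5 SP1 ADO.NET Data Services classes, assuming the classic StorageClient API; the TaskEntity type, its property, and the key scheme are hypothetical:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// PartitionKey, RowKey and Timestamp come from the TableServiceEntity base class.
public class TaskEntity : TableServiceEntity
{
    public TaskEntity(string jobId, string taskId)
        : base(jobId, taskId) { }          // partition by job, one row per task
    public TaskEntity() { }                // required for serialization
    public string Status { get; set; }
}

public static class TableDemo
{
    public static void Insert(CloudStorageAccount account)
    {
        CloudTableClient tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("tasks");

        TableServiceContext ctx = tables.GetDataServiceContext();
        ctx.AddObject("tasks", new TaskEntity("job-001", "task-042") { Status = "queued" });
        ctx.SaveChanges();
    }
}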

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (see the sketch below)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
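
For the retry-logic advice above, a minimal generic helper in plain C# (no library assumed) with simple exponential backoff:

using System;
using System.Threading;

public static class Retry
{
    // Runs an action, retrying transient failures with exponential backoff.
    public static void WithBackoff(Action action, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return;
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts) throw;
                // Back off 1s, 2s, 4s, ... before retrying.
                Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}

// Usage, e.g. around a storage call:
// Retry.WithBackoff(() => blob.DownloadToFile(path), 5);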

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 41

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 42

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 43

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation

The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port

Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose a geo-location to host the storage account:
"US Anywhere", "US North Central", "US South Central", …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

[Diagram: blob storage namespace]
Account: jared
  Container: images, holding blobs PIC01.JPG and PIC02.JPG
  Container: movies, holding blob MOV1.AVI

Example blob URL:
http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Can be private or publicly accessible

Associate Metadata with Container
Metadata are <name, value> pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

[Diagram: within the jared account, each blob (PIC01.JPG, PIC02.JPG, MOV1.AVI) is composed of blocks or pages: Block/Page 1, Block/Page 2, Block/Page 3, ..., identified by Block IDs 1 through N]

Example: uploading a 10 GB movie as a block blob

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

[Diagram: the committed blocks are assembled into TheBlob.wmv in Windows Azure Storage]

Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob

Block operations
PutBlock
Puts an uncommitted block, identified by its block ID, for the blob

Block list operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size are returned for each block
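The PutBlock/PutBlockList pseudocode above maps onto the supported .NET Storage Client Library roughly as follows. A hedged sketch: the "movies" container and the 4 MB chunking loop are illustrative choices, not prescribed by the library.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class BlockUploadSketch
{
    // Upload a large local file as a block blob: stage blocks of up to 4 MB,
    // then commit the ordered block list to make the blob readable.
    public static void Upload(CloudStorageAccount account, string localPath)
    {
        CloudBlobContainer container =
            account.CreateCloudBlobClient().GetContainerReference("movies");
        container.CreateIfNotExist();
        CloudBlockBlob blob = container.GetBlockBlobReference("TheBlob.wmv");

        var blockIds = new List<string>();
        byte[] buffer = new byte[4 * 1024 * 1024]; // 4 MB maximum block size
        using (FileStream file = File.OpenRead(localPath))
        {
            int bytesRead, index = 0;
            while ((bytesRead = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Block IDs must be Base64 strings of equal length within one blob.
                string blockId = Convert.ToBase64String(
                    Encoding.UTF8.GetBytes((index++).ToString("d6")));
                blob.PutBlock(blockId, new MemoryStream(buffer, 0, bytesRead), null);
                blockIds.Add(blockId);
            }
        }
        blob.PutBlockList(blockIds); // commit the readable version of the blob
    }
}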

Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes

[Diagram: random-access operations against the 10 GB address space]

PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0, 512) and [1536, 2560)

GetBlob [1000, 2048) returns:
All 0s for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)
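With the .NET client, the page operations above look roughly like this. A sketch under stated assumptions: CloudPageBlob.WritePages and GetPageRanges are the classic library counterparts of PutPage and GetPageRange, and the "pageblobs" container name is illustrative.

using System;
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class PageBlobSketch
{
    public static void Demo(CloudStorageAccount account)
    {
        CloudBlobContainer container =
            account.CreateCloudBlobClient().GetContainerReference("pageblobs");
        container.CreateIfNotExist();

        // Create MyBlob with a 10 GB address space; pages are allocated lazily.
        CloudPageBlob blob = container.GetPageBlobReference("MyBlob");
        blob.Create(10L * 1024 * 1024 * 1024);

        // Offsets and lengths must be multiples of the 512-byte page size.
        blob.WritePages(new MemoryStream(new byte[1024]), 0);   // PutPage [0, 1024)
        blob.WritePages(new MemoryStream(new byte[512]), 2048); // PutPage [2048, 2560)

        // Enumerate only the ranges that actually hold data.
        foreach (PageRange range in blob.GetPageRanges())
            Console.WriteLine("[{0}, {1}]", range.StartOffset, range.EndOffset);
    }
}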

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

[Diagram: promoting a snapshot of MyBlob back to the base blob]
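A minimal snapshot sketch with the classic .NET client, assuming an existing container. CreateSnapshot, CopyFromBlob and the ListBlobs options shown are the classic Storage Client Library calls; rolling back by copying the snapshot over the base blob is one way to realize the promotion described above.

using System;
using Microsoft.WindowsAzure.StorageClient;

public static class SnapshotSketch
{
    public static void SnapshotAndRestore(CloudBlobContainer container)
    {
        CloudBlob blob = container.GetBlobReference("MyBlob");

        // Take a read-only, point-in-time snapshot before a risky update.
        CloudBlob snapshot = blob.CreateSnapshot();

        // ... writes against the base blob happen here ...

        // Promote: copy the snapshot back over the base blob to restore it.
        blob.CopyFromBlob(snapshot);

        // Enumerate the base blob together with its snapshots.
        var options = new BlobRequestOptions
        {
            UseFlatBlobListing = true,
            BlobListingDetails = BlobListingDetails.Snapshots
        };
        foreach (IListBlobItem item in container.ListBlobs(options))
            Console.WriteLine(item.Uri);
    }
}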

A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write

Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
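Mounting a drive inside a role instance follows a fixed pattern: initialize a local cache, then mount the page blob and use the returned drive letter. A sketch from the documented CloudDrive API, assuming a LocalStorage resource named "DriveCache" is declared in the service definition and that the placeholder page blob URI is filled in.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

public static class DriveSketch
{
    public static string MountDataDrive(CloudStorageAccount account)
    {
        // Local disk cache backing the mounted drive.
        LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
        CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

        // The drive is a page blob; "<account>" and the blob path are placeholders.
        CloudDrive drive = account.CreateCloudDrive(
            "http://<account>.blob.core.windows.net/drives/data.vhd");
        try { drive.Create(1024); }              // size in MB
        catch (CloudDriveException) { /* drive already exists */ }

        // Returns the path (e.g. "a:\") where the NTFS volume is mounted.
        return drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);
    }
}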

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
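Entities are plain classes whose required PartitionKey, RowKey and Timestamp properties come from a base class. A sketch using the ADO.NET Data Services classes named above; the JobEntity shape and the "Jobs" table are illustrative assumptions.

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// PartitionKey, RowKey and Timestamp are inherited from TableServiceEntity.
public class JobEntity : TableServiceEntity
{
    public JobEntity() { }                       // required for serialization
    public JobEntity(string jobType, string jobId) : base(jobType, jobId) { }
    public string Status { get; set; }           // an ordinary property (column)
}

public static class TableSketch
{
    public static void InsertJob(CloudStorageAccount account)
    {
        CloudTableClient tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("Jobs");    // table name is scoped by the account

        TableServiceContext context = tables.GetDataServiceContext();
        context.AddObject("Jobs",
            new JobEntity("blast", Guid.NewGuid().ToString()) { Status = "queued" });
        context.SaveChangesWithRetries();        // one round trip to the table service
    }
}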

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (see the sketch below)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use 'heartbeat' mechanisms when debugging your applications
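The retry guidance can be as simple as a generic wrapper. This is an illustrative sketch, not a library API (the Storage Client Library also ships its own retry policies).

using System;
using System.Threading;

public static class RetrySketch
{
    // Retry a data-access operation with exponential backoff: 1s, 2s, 4s, ...
    public static T WithRetry<T>(Func<T> operation, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try { return operation(); }
            catch (Exception)
            {
                if (attempt >= maxAttempts) throw;
                Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}

// Example usage: string text = RetrySketch.WithRetry(() => blob.DownloadText(), 5);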

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 44

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 45

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 46

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose the geo-location that hosts the storage account
“US Anywhere”, “US North Central”, “US South Central”, …

Can co-locate a storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

[Diagram: Account → Container → Blob. The storage account “jared” holds containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI); each blob gets a URL of the form:]

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Privately or publicly accessible

Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob

[Diagram: the same Account → Container → Blob hierarchy, with each blob further decomposed into blocks or pages (Block Id 1 … Block Id N, or Page 1 … Page N).]

Example: uploading a 10 GB movie as a block blob

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);    // upload blocks; each is uncommitted
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);   // commit: the blob becomes readable

[Diagram: the committed blocks form TheBlob.wmv in Windows Azure Storage.]

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed lists to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The Block ID and size of each block are returned (a client-library sketch follows)
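As a concrete illustration, here is a sketch of the block operations above through the classic .NET Storage Client Library (SDK 1.x, Microsoft.WindowsAzure.StorageClient); the container and file names are placeholders.

// Sketch, assuming the classic SDK 1.x StorageClient; names are illustrative.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlockUpload
{
    static void Main()
    {
        var account = CloudStorageAccount.DevelopmentStorageAccount;
        var container = account.CreateCloudBlobClient().GetContainerReference("movies");
        container.CreateIfNotExist();

        CloudBlockBlob blob = container.GetBlockBlobReference("TheBlob.wmv");
        var blockIds = new List<string>();
        byte[] buffer = new byte[4 * 1024 * 1024];           // up to 4 MB per block

        using (FileStream file = File.OpenRead("TheBlob.wmv"))
        {
            int n, i = 0;
            while ((n = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Block IDs must be Base64 strings of equal length within a blob.
                string id = Convert.ToBase64String(
                    Encoding.UTF8.GetBytes((i++).ToString("d6")));
                blob.PutBlock(id, new MemoryStream(buffer, 0, n), null);
                blockIds.Add(id);                            // uncommitted so far
            }
        }
        blob.PutBlockList(blockIds);                         // commit the readable blob
    }
}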

Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
The result is a 10 GB address space supporting random access operations

PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0, 512) and [1536, 2560)

GetBlob [1000, 2048) returns
all zeros for the first 536 bytes, then
the 512 bytes of data stored in [1536, 2048)

(A client-library sketch of this sequence follows.)
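The sketch below drives the same sequence, assuming the SDK 1.x CloudPageBlob surface (Create, WritePages, ClearPages, GetPageRanges); the blob and container names are placeholders.

// Sketch, assuming classic SDK 1.x CloudPageBlob; writes are 512-byte aligned.
using System;
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class PageBlobDemo
{
    static void Main()
    {
        var account = CloudStorageAccount.DevelopmentStorageAccount;
        var container = account.CreateCloudBlobClient().GetContainerReference("demo");
        container.CreateIfNotExist();

        CloudPageBlob blob = container.GetPageBlobReference("MyBlob");
        blob.Create(10L * 1024 * 1024 * 1024);              // 10 GB address space

        Write(blob, 512, 2048 - 512);                       // PutPage [512, 2048)
        Write(blob, 0, 1024);                               // PutPage [0, 1024)
        blob.ClearPages(512, 1536 - 512);                   // ClearPage [512, 1536)
        Write(blob, 2048, 2560 - 2048);                     // PutPage [2048, 2560)

        foreach (PageRange r in blob.GetPageRanges())       // enumerate valid ranges
            Console.WriteLine("[{0}, {1})", r.StartOffset, r.EndOffset + 1);
        // Expected output: [0, 512) and [1536, 2560)
    }

    static void Write(CloudPageBlob blob, long offset, long length)
    {
        blob.WritePages(new MemoryStream(new byte[length]), offset);
    }
}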

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob

[Diagram: MyBlob with a chain of snapshots; promoting a snapshot restores it as the base blob. A sketch of this flow follows.]
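A hedged sketch of the snapshot flow with the classic library; modeling “promotion” as a copy from the snapshot back over the base blob is this sketch’s interpretation, not a dedicated API.

// Sketch, assuming classic SDK 1.x; "promotion" is modeled as a copy-back.
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class SnapshotDemo
{
    static void Main()
    {
        var account = CloudStorageAccount.DevelopmentStorageAccount;
        var container = account.CreateCloudBlobClient().GetContainerReference("demo");
        container.CreateIfNotExist();

        CloudBlockBlob blob = container.GetBlockBlobReference("MyBlob");
        blob.UploadText("version 1");

        CloudBlob snap = blob.CreateSnapshot();   // read-only point-in-time version

        blob.UploadText("version 2");             // all writes go to the base name

        // Restore ("promote") the earlier version by copying the snapshot back.
        blob.CopyFromBlob(snap);
        Console.WriteLine(blob.DownloadText());   // prints "version 1"
    }
}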

A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can be mounted by only one VM at a time for read/write

Remote Access via Page Blob
Can upload a VHD to its Page Blob using the blob interface, then mount it as a Drive
Can download the Drive through the Page Blob interface
(A mounting sketch follows.)
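A rough sketch of creating and mounting a drive from inside a role instance; the CloudDrive calls reflect the SDK 1.x Microsoft.WindowsAzure.CloudDrive assembly as best recalled here, so treat the exact names and signatures as assumptions.

// Rough sketch; assumes SDK 1.x CloudDrive and runs inside a role instance.
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

class DriveMount
{
    static void Main()
    {
        var account = CloudStorageAccount.DevelopmentStorageAccount;

        // Local resource used as the drive's read cache (declared in the service model).
        LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
        CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

        // The page blob backing the drive; Create throws if it already exists,
        // so real code would wrap this in a try/catch.
        CloudDrive drive = account.CreateCloudDrive(
            account.BlobEndpoint + "/drives/mydata.vhd");
        drive.Create(1024);                                  // size in MB

        // Exactly one VM may hold the read/write mount at a time.
        string path = drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);
        Console.WriteLine("Drive mounted at " + path);       // e.g. "d:\"

        drive.Unmount();
    }
}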

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and easy-to-use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can contain many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)

Entity
An entity is a set of properties (columns)
Required properties (a short sketch follows):
PartitionKey, RowKey and Timestamp
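To ground this, a sketch of defining and querying an entity with the classic SDK 1.x table support (TableServiceEntity plus ADO.NET Data Services); the entity type, table name, and values are illustrative.

// Sketch, assuming classic SDK 1.x table support; names are illustrative.
using System;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class RunEntity : TableServiceEntity
{
    public string Status { get; set; }
    public RunEntity() { }
    public RunEntity(string job, string run)
    {
        PartitionKey = job;   // queries that fix the PartitionKey stay on one server
        RowKey = run;         // (PartitionKey, RowKey) is the only index
    }
}

class TableDemo
{
    static void Main()
    {
        var account = CloudStorageAccount.DevelopmentStorageAccount;
        var tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("runs");

        TableServiceContext ctx = tables.GetDataServiceContext();
        ctx.AddObject("runs", new RunEntity("job42", "run001") { Status = "queued" });
        ctx.SaveChanges();

        // LINQ query translated to a REST filter on the indexed keys.
        var queued = from e in ctx.CreateQuery<RunEntity>("runs")
                     where e.PartitionKey == "job42"
                     select e;
        foreach (RunEntity e in queued)
            Console.WriteLine(e.RowKey + ": " + e.Status);
    }
}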

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember that Azure tables index only on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (a minimal sketch follows this list)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
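The retry-logic bullet deserves a concrete shape; below is a minimal, generic retry wrapper with illustrative back-off constants, not a prescribed Azure API.

// Generic retry helper; the constants and policy are illustrative assumptions.
using System;
using System.Threading;

static class Retry
{
    // Runs an action, retrying transient failures with exponential back-off.
    public static T WithRetry<T>(Func<T> action, int maxAttempts)
    {
        TimeSpan delay = TimeSpan.FromSeconds(1);
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts) throw;   // out of attempts: surface it
                Thread.Sleep(delay);                 // wait, then try again
                delay += delay;                      // double the back-off
            }
        }
    }
}

// Usage (illustrative): var text = Retry.WithRetry(() => blob.DownloadText(), 4);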

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, and back this offering with a technical engagements
team. Lower the barrier to entry through tutorials, accelerators, and developer
best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers,
and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 50

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o Reliability and resilience
- HPC: periodic checkpoints, with rollback and resume in response to failures; system-level MTBF approaching zero as machines scale, checkpoint frequency increasing, I/O demand intolerable
- DC: loosely consistent models, designed to transparently recover from failures

o Node and system architectures
o Communication fabric
o Storage systems
o Reliability and resilience
o Programming model and services

Azure FC Owns this Hardware
Highly available Fabric Controller (FC)

At minimum: CPU 1.5–1.7 GHz x64, 1.7 GB memory, 100+ Mbps network, 500 GB local storage
Up to: 8 CPU cores, 14.2 GB memory, 2+ TB local storage

Azure Platform: Compute and Storage

A closer look at compute
[Diagram: HTTP requests arrive at a load balancer, which routes them to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (a main() { … } entry point). Each instance runs on a VM with an agent, all managed by the fabric.]

Using queues for reliable messaging
To scale, add more of either role (see the sketch after the next set of bullets)

1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in a queue
3) A Worker Role (main() { … }) gets the work from the queue
4) The Worker Role does the work

Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Allocate resources with different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
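
A minimal sketch of the queue pattern above, using the Windows Azure StorageClient library of this era. The connection string, queue name, message payload, and DoWork helper are illustrative assumptions, not part of the platform.

// Hedged sketch: a web role enqueues work; a worker role dequeues and does it.
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

CloudStorageAccount account =
    CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...");
CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("tasks");
queue.CreateIfNotExist();

// Web role, step 2: put work in the queue.
queue.AddMessage(new CloudQueueMessage("partition-42"));

// Worker role, steps 3 and 4: get work from the queue, do it, then delete it.
CloudQueueMessage msg = queue.GetMessage();
if (msg != null)
{
    DoWork(msg.AsString);      // hypothetical application-specific handler
    queue.DeleteMessage(msg);  // delete only after success, so a failed worker's message reappears
}

Deleting the message only after the work completes is what makes the messaging reliable: if the worker dies mid-task, the message becomes visible again and another worker picks it up.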

A closer look at storage
[Diagram: applications on the compute fabric reach storage (Blobs, Drives, Tables, Queues) through a REST API over HTTP, behind a load balancer.]

Points of interest
Storage types
Blobs: a simple interface for storing named files, along with metadata for each file
Drives: durable NTFS volumes
Tables: entity-based storage (not relational); entities contain a set of properties
Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps, and by other on-premises or cloud applications

[Diagram: the development loop. Develop your app at work or home against the Development Fabric and Development Storage, keeping versions in local source control; confirm the application works locally, then in staging, then deploy to the cloud.]

What’s the ‘Value Add’?
Provide a platform that is scalable and available

Services are always running, with rolling upgrades/downgrades
Failure of any node is expected, so state has to be replicated
Failure of a role (app code) is expected, with automatic recovery
Services can grow to be large, so provide state management that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal

Fault Domains

Purpose: avoid single points of failure
Allocation is across fault domains

Update Domains

Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update (for example, the set of nodes to update), used when rolling forward or backward
The developer assigns the number required by each role (for example, 10 front-ends across 5 update domains)
Allocation is across update domains

Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on the nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate that it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive the service back into the goal state

Windows Azure FC monitors the health of the host
If a node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates its role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change

Key takeaways

Cloud services have specific design considerations:
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture:
Stateless roles and durable queues

Windows Azure frees service developers from many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud-based data and computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

[Diagram: the AzureMODIS pipeline. A service web role portal feeds a download queue; imagery flows through a data collection stage, a reprojection stage, a derivation reduction stage, and an analysis reduction stage to produce research results.]



Statistical tool used to analyze the DNA of HIV from large studies of infected patients.

PhyloD was developed by Microsoft Research and has been highly impactful within a small but important group of researchers:
Hundreds of HIV and HepC researchers actively use it
Thousands of research communities rely on its results

[Image: cover of PLoS Biology, November 2008]

A typical job takes 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
Requires a large number of test runs for a given job (1–10M tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data from the local sequence database
2. Upload the compressed data to Azure Storage
3. Deploy worker roles carrying the BLAST executable; each role’s Init() function downloads and decompresses the data to the local disk (a sketch follows)
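
The Init() step might look like the following. This is a hedged sketch, not AzureBLAST’s actual code: the container name, blob name, local paths, and the use of GZip are assumptions.

// Hedged sketch of a worker role's Init(): fetch the compressed sequence
// database from blob storage and decompress it to the local disk.
using System.IO;
using System.IO.Compression;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

void Init(CloudStorageAccount account)
{
    CloudBlob blob = account.CreateCloudBlobClient()
                            .GetContainerReference("blastdata")   // assumed container name
                            .GetBlobReference("seqdb.gz");        // assumed blob name
    blob.DownloadToFile(@"C:\local\seqdb.gz");

    using (FileStream compressed = File.OpenRead(@"C:\local\seqdb.gz"))
    using (GZipStream gzip = new GZipStream(compressed, CompressionMode.Decompress))
    using (FileStream output = File.Create(@"C:\local\seqdb.fasta"))
    {
        byte[] buffer = new byte[64 * 1024];
        int read;
        while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
            output.Write(buffer, 0, read);   // BLAST later reads this local copy
    }
}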



Step 2. Partitioning a Job
[Diagram: a web role takes the user input, and a single partitioning worker role splits it into input partitions in Azure Storage, placing one queue message per partition.]

Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pick up the queue messages, read their input partitions from Azure Storage, run BLAST, and write the BLAST output and logs back to Azure Storage.]
A sketch of one worker’s loop follows.
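
A hedged sketch of one such worker’s loop. The queue and container wiring, the blastall command line, and the DownloadPartition helper are illustrative, not AzureBLAST’s actual configuration.

// Hedged sketch of a BLAST-ready worker role (step 3): dequeue a partition,
// run BLAST on it, upload the output, then delete the message.
using System.Diagnostics;
using System.Threading;
using Microsoft.WindowsAzure.StorageClient;

void Run(CloudQueue partitions, CloudBlobContainer results)
{
    while (true)
    {
        CloudQueueMessage msg = partitions.GetMessage();
        if (msg == null) { Thread.Sleep(5000); continue; }   // queue drained; poll again

        string partition = msg.AsString;          // assumed: message names an input-partition file
        DownloadPartition(partition);             // hypothetical helper: blob -> local file

        Process blast = Process.Start("blastall.exe",
            "-p blastp -d seqdb.fasta -i " + partition + " -o " + partition + ".out");
        blast.WaitForExit();

        results.GetBlobReference(partition + ".out").UploadFile(partition + ".out");
        partitions.DeleteMessage(msg);            // commit only after the upload succeeds
    }
}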



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere

Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job

Test runs are your friend
- Blowing $20,000 of computation is not a good idea

Make ample use of logging features
- When failure does happen, it’s good to know where

Cutting 10 years of computation down to 1 week is great!
- Little cloud-development headaches are probably worth it

Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Chart: resources versus time, illustrating time-space fungibility in the cloud.]

Utilizes a general jobs-based task manager, which registers jobs and their resulting data.

[Diagram: a job definition fans out into tasks that run on an (HPC) cluster or in Azure datacenters, with data products recorded in a registry. A registry broker connects the user’s local registry and web management on the user premises (or internet) to the cloud, so highly sensitive data stays with the user while results flow back.]

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose the geo-location that hosts the storage account
(“US Anywhere”, “US North Central”, “US South Central”, …)
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

[Diagram: storage namespace. The account “jared” holds the containers “images” (blobs PIC01.JPG and PIC02.JPG) and “movies” (blob MOV1.AVI). Example blob URL:]

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible

Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob

[Diagram: within the account/container/blob hierarchy, each blob (e.g. a 10 GB movie) is composed of blocks or pages: Block/Page 1, Block/Page 2, Block/Page 3, …, identified by Block Id 1 through Block Id N.]

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

The committed result is TheBlob.wmv in Windows Azure Storage.

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has an ID of up to 64 bytes, scoped by the blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block, identified by its block ID, for the blob

Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and the size of the block are returned for each block

A concrete sketch of this upload pattern follows.
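
A hedged sketch of the same upload with the .NET Storage Client Library. The 4 MB chunking and base64 block IDs follow the rules above; the file and blob names are illustrative.

// Hedged sketch: upload a large file as a block blob via PutBlock + PutBlockList.
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.WindowsAzure.StorageClient;

void UploadAsBlockBlob(CloudBlobContainer container, string path)
{
    CloudBlockBlob blob = container.GetBlockBlobReference("TheBlob.wmv");
    List<string> blockIds = new List<string>();
    byte[] buffer = new byte[4 * 1024 * 1024];           // at most 4 MB per block

    using (FileStream file = File.OpenRead(path))
    {
        int n, i = 0;
        while ((n = file.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Block IDs are opaque base64 strings; fixed-width IDs keep them equal length.
            string id = Convert.ToBase64String(BitConverter.GetBytes(i++));
            blob.PutBlock(id, new MemoryStream(buffer, 0, n), null);   // uncommitted block
            blockIds.Add(id);
        }
    }
    blob.PutBlockList(blockIds);   // commit: this list becomes the readable blob
}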

Create MyBlob
Specify blob size = 10 GB
Fixed page size = 512 bytes

Random access operations against the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512) and [1536, 2560)

GetBlob [1000, 2048) returns:
All zeros for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)
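
A hedged sketch of the same sequence with the .NET client library; “myblob” is an illustrative name, and the calls assume the v1.x CloudPageBlob API.

// Hedged sketch: the page-blob operations above, at 512-byte page alignment.
using System;
using System.IO;
using Microsoft.WindowsAzure.StorageClient;

void PageBlobDemo(CloudBlobContainer container)
{
    CloudPageBlob blob = container.GetPageBlobReference("myblob");
    blob.Create(10L * 1024 * 1024 * 1024);                   // 10 GB address space

    byte[] data = new byte[1536];
    blob.WritePages(new MemoryStream(data), 512);            // PutPage [512, 2048)
    blob.WritePages(new MemoryStream(data, 0, 1024), 0);     // PutPage [0, 1024)
    blob.ClearPages(512, 1024);                              // ClearPage [512, 1536)
    blob.WritePages(new MemoryStream(data, 0, 512), 2048);   // PutPage [2048, 2560)

    foreach (PageRange range in blob.GetPageRanges())        // expect [0,512) and [1536,2560)
        Console.WriteLine(range.StartOffset + " - " + range.EndOffset);
}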

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob

[Diagram: MyBlob with its snapshots; Promote restores a chosen snapshot to the base blob.]

A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can be mounted by only one VM at a time for read/write

Remote access via the Page Blob
Can upload the VHD to its Page Blob using the blob interface, then mount it as a drive
Can download the drive through the Page Blob interface

Provides Structured Storage
Massively scalable tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly available and durable
Data is replicated several times

Familiar and easy-to-use API
ADO.NET Data Services (.NET 3.5 SP1)
.NET classes and LINQ
REST, with any platform or language

Table
A storage account can contain many tables
The table name is scoped by the account
A table is a set of entities (i.e., rows)

Entity
An entity is a set of properties (columns)
Required properties: PartitionKey, RowKey, and Timestamp (a sketch follows)
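
A hedged sketch of defining and querying such an entity through ADO.NET Data Services and LINQ; the JobEntity type, its properties, and the table name are invented for illustration.

// Hedged sketch: a table entity plus a LINQ query over one partition.
using System;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class JobEntity : TableServiceEntity   // supplies PartitionKey, RowKey, Timestamp
{
    public string Status { get; set; }
    public int CpuHours { get; set; }
}

void ListJobs(CloudStorageAccount account)
{
    TableServiceContext context = new TableServiceContext(
        account.TableEndpoint.ToString(), account.Credentials);

    // Filtering on PartitionKey keeps the query on one partition server,
    // since tables index only on partition and row keys.
    var jobs = from j in context.CreateQuery<JobEntity>("Jobs")
               where j.PartitionKey == "blast-run-7"
               select j;
    foreach (JobEntity j in jobs)
        Console.WriteLine(j.RowKey + ": " + j.Status);
}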

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember that Azure tables index only on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (a sketch follows)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
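
The retry-logic bullet might be realized as below; a minimal, hedged sketch with invented names. (The StorageClient library also has its own RetryPolicy settings; a wrapper like this covers the cases those don’t.)

// Hedged sketch: application-level retry with exponential backoff.
using System;
using System.Threading;

T WithRetry<T>(Func<T> action, int attempts)
{
    for (int i = 1; ; i++)
    {
        try { return action(); }
        catch (Exception)
        {
            if (i >= attempts) throw;   // out of retries: surface the fault
            // Transient storage and network faults often clear quickly; back off and retry.
            Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, i)));
        }
    }
}

// Usage: CloudQueueMessage msg = WithRetry(() => queue.GetMessage(), 3);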

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier-one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 51

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 52

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Blocks can be up to 4 MB each
Each block can be a different size
Each block has a 64-byte ID
Block IDs are scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block, identified by its block ID, into the blob

Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size of each block are returned
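
These block operations map directly onto REST calls (PUT with ?comp=block and ?comp=blocklist). A minimal sketch in Python; the account, container, file name, and SAS token are illustrative assumptions, and error handling is reduced to raise_for_status:

import base64
import requests

account, container, blob = "jared", "movies", "TheBlob.wmv"   # hypothetical names
base = f"http://{account}.blob.core.windows.net/{container}/{blob}"
sas = "?sv=..."   # hypothetical SAS token granting write access

def put_block(index, data):
    # Block IDs must be base64-encoded and the same length within one blob
    block_id = base64.b64encode(f"block-{index:06d}".encode()).decode()
    r = requests.put(f"{base}{sas}&comp=block&blockid={block_id}", data=data)
    r.raise_for_status()
    return block_id

def put_block_list(block_ids):
    # Committing the list makes this block sequence the readable blob version
    body = ("<?xml version='1.0' encoding='utf-8'?><BlockList>"
            + "".join(f"<Latest>{b}</Latest>" for b in block_ids)
            + "</BlockList>")
    requests.put(f"{base}{sas}&comp=blocklist", data=body).raise_for_status()

ids = []
with open("TheBlob.wmv", "rb") as f:      # upload as 4 MB blocks, then commit
    i = 0
    while chunk := f.read(4 * 1024 * 1024):
        ids.append(put_block(i, chunk))
        i += 1
put_block_list(ids)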

Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes

Random-access operations against the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0,512), [1536,2560)

GetBlob [1000, 2048) returns:
All 0s for the first 536 bytes
The next 512 bytes are the data stored in [1536,2048)
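
The bookkeeping in this example is easy to model. A toy in-memory sketch of the page-validity rules described above (not the real service):

class ToyPageBlob:
    # Models the page-blob rules above: 512-byte pages, page-aligned writes
    # and clears, and reads of never-written ranges returning zeros.
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.pages = {}                      # page offset -> 512 bytes

    def put_page(self, start, end, data):
        assert start % self.PAGE == 0 and end % self.PAGE == 0
        for i, off in enumerate(range(start, end, self.PAGE)):
            self.pages[off] = data[i * self.PAGE:(i + 1) * self.PAGE]

    def clear_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.pages.pop(off, None)        # cleared pages read back as zeros

    def get_page_range(self, start, end):
        ranges, run = [], None               # coalesce valid pages into ranges
        for off in range(start, end, self.PAGE):
            if off in self.pages:
                run = [off, off + self.PAGE] if run is None else [run[0], off + self.PAGE]
            elif run is not None:
                ranges.append(tuple(run))
                run = None
        if run is not None:
            ranges.append(tuple(run))
        return ranges

    def get_blob(self, start, end):
        aligned = start - start % self.PAGE
        out = bytearray()
        for off in range(aligned, end, self.PAGE):
            out += self.pages.get(off, b"\x00" * self.PAGE)
        skip = start - aligned
        return bytes(out[skip:skip + (end - start)])

b = ToyPageBlob(10 * 2**30)                  # "10 GB" address space
b.put_page(512, 2048, b"\x01" * 1536)
b.put_page(0, 1024, b"\x02" * 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560, b"\x03" * 512)
assert b.get_page_range(0, 4096) == [(0, 512), (1536, 2560)]
assert b.get_blob(1000, 2048)[:536] == b"\x00" * 536   # matches the example above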

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases
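
The ETag check for block blobs is ordinary HTTP optimistic concurrency: remember the ETag of the version you read, then make the overwrite conditional on it. A sketch, assuming a hypothetical SAS-authorized URL:

import requests

url = "http://jared.blob.core.windows.net/images/PIC01.JPG?sv=..."  # hypothetical SAS

resp = requests.get(url)
resp.raise_for_status()
etag = resp.headers["ETag"]          # identifies the version we just read

new_bytes = resp.content             # a real edit of the content would go here
put = requests.put(url, data=new_bytes,
                   headers={"If-Match": etag,            # only if unchanged
                            "x-ms-blob-type": "BlockBlob"})
if put.status_code == 412:           # Precondition Failed: another writer won
    print("Blob changed underneath us; re-read and retry")
else:
    put.raise_for_status()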

Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob

[Diagram: promoting a snapshot of MyBlob restores the blob to that prior version.]

A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, then mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and easy-to-use APIs
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can contain many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)

Entity
An entity is a set of properties (columns)
Required properties: PartitionKey, RowKey, and Timestamp
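
Because the table service is also exposed over REST (OData), entities can be queried from any platform by filtering on these properties. A minimal sketch; the tasks table, the Status property, and the SAS token are illustrative assumptions, and only PartitionKey/RowKey filters use the index:

import requests

base = "http://jared.table.core.windows.net/tasks()"   # hypothetical table
sas = "?sv=..."                                        # hypothetical SAS token

# Filtering on PartitionKey (and RowKey) uses the only index the service keeps
resp = requests.get(base + sas,
                    params={"$filter": "PartitionKey eq 'job42'"},
                    headers={"Accept": "application/json;odata=nometadata"})
resp.raise_for_status()

for entity in resp.json()["value"]:
    print(entity["RowKey"], entity.get("Status"))      # Status: assumed property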

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message (sketched below)
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
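
Two of these practices, batching small tasks into one queue message and retrying storage access, can be sketched independently of any particular client library. Here enqueue is a hypothetical stand-in for whatever queue client is in use:

import json
import time

def enqueue(message):
    ...   # hypothetical: hand the message to the queue client

def with_retries(fn, attempts=4, base_delay=1.0):
    # Retry a storage call with exponential backoff before giving up
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Pack many small tasks into one message to cut per-message
# transaction costs and round trips; workers unpack and loop.
tasks = [{"seq": i, "db": "nr"} for i in range(100)]   # illustrative descriptors
BATCH = 10
for start in range(0, len(tasks), BATCH):
    payload = json.dumps(tasks[start:start + BATCH])
    with_retries(lambda p=payload: enqueue(p))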

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back the offering with a technical engagement team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications that make it easy to upload data and samples for reuse. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through Microsoft products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can ask grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 53

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 54

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 55

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob [1000, 2048) returns
all zeros for the first 536 bytes, then
the 512 bytes of data stored in [1536, 2048)
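
The page operations above map directly onto REST calls; a sketch under the same assumptions as before (illustrative names, SAS token in place of signing). Note the x-ms-range header takes an inclusive end offset, so the half-open range [512, 2048) becomes bytes=512-2047:

import requests

BASE = "https://jared.blob.core.windows.net/data/MyBlob"  # illustrative
SAS = "sv=..."  # hypothetical shared-access signature
VERSION = {"x-ms-version": "2009-09-19"}

def create_page_blob(size):
    # Reserve the sparse address space; size must be a multiple of 512.
    headers = dict(VERSION, **{"x-ms-blob-type": "PageBlob",
                               "x-ms-blob-content-length": str(size)})
    requests.put(f"{BASE}?{SAS}", headers=headers).raise_for_status()

def put_page(start, end, data=b"", clear=False):
    # Write or clear the half-open page range [start, end).
    headers = dict(VERSION, **{
        "x-ms-range": f"bytes={start}-{end - 1}",
        "x-ms-page-write": "clear" if clear else "update",
    })
    requests.put(f"{BASE}?comp=page&{SAS}", headers=headers,
                 data=None if clear else data).raise_for_status()

create_page_blob(10 * 2**30)         # 10 GB address space
put_page(512, 2048, b"\x01" * 1536)  # PutPage [512, 2048)
put_page(0, 1024, b"\x02" * 1024)    # PutPage [0, 1024)
put_page(512, 1536, clear=True)      # ClearPage [512, 1536)
put_page(2048, 2560, b"\x03" * 512)  # PutPage [2048, 2560)
# A GET on {BASE}?comp=pagelist would now report [0,512) and [1536,2560).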

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

Blob Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob

[Diagram: MyBlob and its snapshots; promoting a snapshot restores it as the base blob.]
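
Taking a snapshot is a single REST call, and reading an old version just adds a snapshot query parameter. A sketch, under the same illustrative assumptions as the blob sketches above:

import requests

BASE = "https://jared.blob.core.windows.net/data/MyBlob"  # illustrative
SAS = "sv=..."  # hypothetical shared-access signature

# Create a snapshot; the service returns an opaque version timestamp.
r = requests.put(f"{BASE}?comp=snapshot&{SAS}",
                 headers={"x-ms-version": "2009-09-19"})
r.raise_for_status()
snap = r.headers["x-ms-snapshot"]

# Read the frozen, read-only version alongside the live base blob.
old = requests.get(f"{BASE}?snapshot={snap}&{SAS}")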

A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can be mounted for read/write by only one VM at a time
Remote Access via Page Blob
Can upload a VHD to its Page Blob using the blob interface, then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
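
As an illustration of the data model (not the wire format), an entity is a property bag whose (PartitionKey, RowKey) pair is its unique key; the partition key also determines how entities are spread across servers. The property names below are invented for the example:

# Entities in one partition live together and can be scanned efficiently;
# Timestamp is maintained by the service, not the client.
entity = {
    "PartitionKey": "hiv-study-42",  # unit of scale-out
    "RowKey": "patient-0001",        # unique within the partition
    "SequenceLength": 9181,          # arbitrary typed properties
    "Subtype": "B",
}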

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic in all instances where you are accessing data (see the sketch after this list)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
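
A minimal retry sketch, assuming transient storage faults (timeouts, throttling) surface as exceptions and that simple exponential backoff is acceptable:

import time

def with_retries(op, attempts=5, base_delay=0.5):
    # Retry a storage operation with exponential backoff: 0.5s, 1s, 2s, ...
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

# e.g. with_retries(lambda: put_block(block_id, chunk))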

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, and back the offering with a technical engagement
team. Lower the barrier to entry through tutorials, accelerators, and
developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications that make it easy to upload
data and samples for reuse. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 58

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS
Virtual Hard Drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can be mounted by only one VM at a time for read/write
Remote access via the Page Blob interface
Can upload a VHD to its Page Blob using the blob interface, then
mount it as a Drive
Can download the Drive through the Page Blob interface
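
Roughly how a drive is created and mounted with the CloudDrive API (shipped in Microsoft.WindowsAzure.CloudDrive.dll). This is a sketch from memory; the blob URI, cache size, and exact signatures are assumptions to verify against the SDK.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;  // CloudDrive lives in this namespace

class DriveDemo
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=http;AccountName=jared;AccountKey=<key>");

        // The drive is backed by a Page Blob; create it as a 1 GB NTFS volume
        CloudDrive drive = account.CreateCloudDrive(
            account.BlobEndpoint + "/drives/mydrive.vhd");
        drive.Create(1024);                          // size in MB, up to 1 TB

        // In a role, CloudDrive.InitializeCache(...) must precede Mount (omitted)
        string path = drive.Mount(25, DriveMountOptions.None);  // e.g. "X:\"
        // ... ordinary NTFS file I/O against 'path' ...
        drive.Unmount();
    }
}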

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy-to-Use APIs
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
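
To make the key scheme concrete, here is a hedged sketch that defines a hypothetical TaskEntity, inserts it through the ADO.NET Data Services context, and queries it back by partition key; the table name, entity type, and key values are invented for illustration.

using System;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Hypothetical entity: one row per task, grouped into partitions by job
public class TaskEntity : TableServiceEntity
{
    public TaskEntity() { }
    public TaskEntity(string jobId, string taskId)
    {
        PartitionKey = jobId;  // with RowKey, the only indexed properties
        RowKey = taskId;       // unique within the partition; Timestamp is service-set
    }
    public string Status { get; set; }
}

class TableDemo
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=http;AccountName=jared;AccountKey=<key>");
        CloudTableClient tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("tasks");

        TableServiceContext ctx = tables.GetDataServiceContext();
        ctx.AddObject("tasks", new TaskEntity("job42", "task001") { Status = "queued" });
        ctx.SaveChanges();

        // LINQ over the keys only; filtering on other properties forces a scan
        var pending = from t in ctx.CreateQuery<TaskEntity>("tasks")
                      where t.PartitionKey == "job42"
                      select t;
        foreach (TaskEntity t in pending)
            Console.WriteLine("{0}: {1}", t.RowKey, t.Status);
    }
}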

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications

Azure Storage
• Remember that Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this

Testing & Development
• Include retry logic everywhere you access data (see the sketch after this list)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
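
Several of these tips fit in one short sketch: batching small tasks into a single message, a retry loop around data access, and deleting a message only after the work succeeds. The queue name, task format, and retry parameters below are illustrative assumptions.

using System;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class QueueWorker
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=http;AccountName=jared;AccountKey=<key>");
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("tasks");
        queue.CreateIfNotExist();

        // Batch several small task IDs into one message: fewer storage
        // transactions, less latency
        queue.AddMessage(new CloudQueueMessage("task001;task002;task003"));

        // Simple illustrative retry loop around data access
        CloudQueueMessage msg = null;
        for (int attempt = 0; attempt < 3 && msg == null; attempt++)
        {
            try { msg = queue.GetMessage(); }
            catch (StorageClientException) { Thread.Sleep(1000); }
        }
        if (msg != null)
        {
            // ... process each task in msg.AsString exactly once ...
            queue.DeleteMessage(msg);  // delete only after the work succeeds
        }
    }
}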

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, and back the offering with a technical engagement
team. Lower the barrier to entry through tutorials, accelerators, and developer
best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications that make it easy to upload data
and samples for reuse. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 59

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 60

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 61

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing            [10 minutes]
Introduction to Windows Azure            [35 minutes]
Research Applications on Azure, demos    [10 minutes]
How They Were Built                      [15 minutes]
A Closer Look at Azure                   [15 minutes]
Cloud Research Engagement Initiative     [ 5 minutes]
Q&A                                      [ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small data center (1,000 servers) and a larger, 100K-server data center:

Technology       Cost in Small-Sized Data Center   Cost in Large Data Center     Ratio
Network          $95 per Mbps/month                $13 per Mbps/month            7.1
Storage          $2.20 per GB/month                $0.40 per GB/month            5.7
Administration   ~140 servers/administrator        >1000 servers/administrator   7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers and complex cooling systems separately is not
efficient. Package and deploy into bigger units, JITD.

Five dimensions on which HPC systems and data centers (DC) differ:

o Node and system architectures
  Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
  Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
  HPC: local scratch small or non-existent, secondary is SAN or PFS, PB tertiary storage
  DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures; MTBF
  approaching zero, checkpoint frequency increasing, I/O demand intolerable
  DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform: Compute and Storage

A closer look at compute: HTTP requests pass through a load balancer to
Web Role instances (IIS hosting ASP.NET, WCF, etc.); Worker Role instances
run application code (main() { … }); an Agent on each VM connects the role
to the Fabric.

Using queues for reliable messaging
To scale, add more of either role:
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
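
Below is a minimal sketch of this hand-off, assuming the Microsoft.WindowsAzure.StorageClient library that shipped with the Azure SDK of this era; the queue name “workitems”, the message text, and DoWork are illustrative, and the account key is elided:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Shared setup: one storage account object for queue/blob/table clients
CloudStorageAccount account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=jared;AccountKey=...");
CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("workitems");
queue.CreateIfNotExist();

// Web Role: enqueue a unit of work
queue.AddMessage(new CloudQueueMessage("process:PIC01.JPG"));

// Worker Role: poll, process, then delete; deleting only after success
// means a crashed worker's message reappears and is retried
CloudQueueMessage msg = queue.GetMessage();
if (msg != null)
{
    DoWork(msg.AsString);   // DoWork: hypothetical application processing
    queue.DeleteMessage(msg);
}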

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
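
For inter-role TCP, a sketch assuming the Microsoft.WindowsAzure.ServiceRuntime API, a role named “Worker”, and an internal endpoint named “Internal” declared in the service model (all three names illustrative):

using System.Net;
using System.Net.Sockets;
using Microsoft.WindowsAzure.ServiceRuntime;

// Connect directly to each Worker instance's internal endpoint
foreach (RoleInstance instance in RoleEnvironment.Roles["Worker"].Instances)
{
    IPEndPoint ep = instance.InstanceEndpoints["Internal"].IPEndpoint;
    // Direct TCP skips the queue's latency cost, but a message in flight
    // is lost if either instance fails (no durable messaging)
    TcpClient client = new TcpClient(ep.Address.ToString(), ep.Port);
}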

A closer look at storage: applications reach Blobs, Drives, Tables, and
Queues over HTTP through a REST API, behind a load balancer; both the
application (compute) and storage run on the Fabric.

Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational – entities contain a set of properties
Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

The development cycle: develop your app at work or home against the local
Development Fabric and Development Storage, with source control versioning;
once the application works locally, deploy it to staging in the cloud.

What’s the ‘Value Add’?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: avoid single points of failure
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain desired number of roles
  Failed roles automatically restarted
  Node failure results in new nodes automatically allocated

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state

Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
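
As a sketch of “a role can indicate it is unhealthy”, assuming the ServiceRuntime status-check event of this SDK generation; IsOverloaded is a hypothetical application health check:

using Microsoft.WindowsAzure.ServiceRuntime;

public override bool OnStart()
{
    // The agent polls each instance; reporting Busy takes the instance
    // out of the load-balancer rotation until it reports healthy again
    RoleEnvironment.StatusCheck += (sender, e) =>
    {
        if (IsOverloaded())   // hypothetical health check
            e.SetBusy();
    };
    return base.OnStart();
}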

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms
provided by a number of different researchers

AzureMODIS pipeline (coordinated by a Service Web Role portal):
Data Collection Stage (download queue) → Reprojection Stage →
Derivation Reduction Stage → Analysis Reduction Stage → research results.



Statistical tool used to analyze DNA of HIV
from large studies of infected patients

PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
100s of HIV and HepC researchers actively use it
1000s of research communities rely on results

Cover of PLoS Biology,
November 2008

Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
Requires a large number of test runs for a given job (1–10M tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress required data from the local sequence database
2. Upload the compressed data and the BLAST executable to the Azure store
3. Deploy worker roles
   - The Init() function downloads and decompresses the data to the local disk
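
A sketch of step 3's Init() staging, reusing the storage account from the earlier queue sketch and assuming the StorageClient blob API; the container and file names are illustrative, localScratch is a hypothetical local disk path, and a GZip archive stands in for whatever compression the pipeline actually used:

using System.IO;
using System.IO.Compression;

// Download the compressed sequence database, then inflate it locally
CloudBlobContainer blastData =
    account.CreateCloudBlobClient().GetContainerReference("blastdata");
string localGz = Path.Combine(localScratch, "seqdb.gz");   // localScratch: hypothetical
blastData.GetBlobReference("seqdb.gz").DownloadToFile(localGz);

using (FileStream compressed = File.OpenRead(localGz))
using (GZipStream gzip = new GZipStream(compressed, CompressionMode.Decompress))
using (FileStream output = File.Create(Path.Combine(localScratch, "seqdb.fasta")))
{
    byte[] buffer = new byte[64 * 1024];
    int read;
    while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
        output.Write(buffer, 0, read);
}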



Step 2. Partitioning a Job
The web role accepts the user input, a single partitioning worker role
splits it into input partitions in Azure storage, and a queue message
is written for each partition.

Step 3. Doing the Work
BLAST-ready worker roles pick up queue messages, read their input
partitions from Azure storage, and write BLAST output and logs back
to Azure storage.



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources versus time: time–space fungibility in the cloud.

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

Utilizes a general jobs-based task manager which registers jobs and their
resulting data. A job definition expands into tasks whose data products are
recorded in a registry. A registry broker connects a local registry on the
user premises (or internet), where highly sensitive data, the administrator,
and an HPC cluster remain, with the Azure datacenters; users submit jobs
and collect results through a web management interface.

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal, then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose a geo-location to host the storage account:
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
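
Tying these together in code, a sketch assuming the StorageClient library; the account name matches the example below and the key is a placeholder:

// One account object; blob, queue, and table clients all hang off it
CloudStorageAccount account = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=jared;AccountKey=<256-bit key>");
CloudBlobClient  blobClient  = account.CreateCloudBlobClient();
CloudQueueClient queueClient = account.CreateCloudQueueClient();
CloudTableClient tableClient = account.CreateCloudTableClient();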

Namespace hierarchy — Account → Container → Blob:
account “jared”; containers “images” and “movies”;
blobs PIC01.JPG and PIC02.JPG in images, MOV1.AVI in movies.

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level:
private or publicly accessible

Associate Metadata with Container
Metadata are name/value pairs
Up to 8 KB per container
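
A sketch of creating the “images” container from the example above, making it public and attaching metadata (StorageClient library assumed; the metadata values are illustrative):

CloudBlobContainer container = blobClient.GetContainerReference("images");
container.CreateIfNotExist();

// Public access policy: anyone may read blobs in this container
container.SetPermissions(new BlobContainerPermissions
{
    PublicAccess = BlobContainerPublicAccessType.Blob
});

// Name/value metadata, up to 8 KB total per container
container.Metadata["owner"] = "jared";
container.SetMetadata();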

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Within the same Account → Container → Blob namespace, each blob is
composed of either a sequence of blocks (Block Id 1, Block Id 2, …
Block Id N) or an array of pages (Page 1, Page 2, Page 3, …).

Uploading a 10 GB movie to Windows Azure Storage as a block blob:

// Upload each block, then commit the list; the blob becomes
// readable as TheBlob.wmv only after PutBlockList succeeds
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
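
The same sequence through the .NET client, as a sketch: CloudBlockBlob.PutBlock/PutBlockList from the StorageClient library. Block IDs must be equal-length Base64 strings; blockCount and GetBlockStream are hypothetical helpers supplying the data:

using System;
using System.Collections.Generic;

CloudBlockBlob blob = container.GetBlockBlobReference("TheBlob.wmv");
List<string> blockIds = new List<string>();

for (int i = 0; i < blockCount; i++)   // blockCount: hypothetical
{
    // Equal-length, Base64-encoded IDs, scoped to this blob
    string blockId = Convert.ToBase64String(BitConverter.GetBytes(i));
    blob.PutBlock(blockId, GetBlockStream(i), null);   // null: skip MD5 check
    blockIds.Add(blockId);
}

// Commit: the readable blob becomes exactly this ordered block list
blob.PutBlockList(blockIds);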

Create MyBlob
Specify blob size = 10 GB (a 10 GB address space)
Fixed page size = 512 bytes

Random access operations, in order:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0, 512) and [1536, 2560)

GetBlob [1000, 2048) returns:
all 0 for the first 536 bytes (the cleared range),
then 512 bytes of data stored in [1536, 2048)
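
A sketch of the same operations via CloudPageBlob; treat the exact method signatures (Create, WritePages, ClearPages) as assumptions about this SDK generation, and PageData as a hypothetical helper returning a stream of the given length:

CloudPageBlob pageBlob = container.GetPageBlobReference("MyBlob");
pageBlob.Create(10L * 1024 * 1024 * 1024);    // reserve the 10 GB address space

pageBlob.WritePages(PageData(1536), 512);     // PutPage [512, 2048)
pageBlob.WritePages(PageData(1024), 0);       // PutPage [0, 1024)
pageBlob.ClearPages(512, 1024);               // ClearPage [512, 1536)
pageBlob.WritePages(PageData(512), 2048);     // PutPage [2048, 2560)
// Only 512-byte-aligned ranges are valid; unwritten pages read as zeros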

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

Blob snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion (promote a snapshot
of MyBlob over the base blob)
Can use ListBlobs to enumerate the snapshots for a blob
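
A sketch of snapshot use with the StorageClient library; the listing options shown are an assumption about this SDK generation:

CloudBlob myBlob = container.GetBlobReference("MyBlob");
CloudBlob snapshot = myBlob.CreateSnapshot();   // read-only, point-in-time

// Enumerate the base blob together with its snapshots
BlobRequestOptions options = new BlobRequestOptions
{
    UseFlatBlobListing = true,
    BlobListingDetails = BlobListingDetails.Snapshots
};
foreach (IListBlobItem item in container.ListBlobs(options))
    Console.WriteLine(item.Uri);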

A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob:
Can upload a VHD to its Page Blob using the blob interface, then
mount it as a Drive
Can download the Drive through the Page Blob interface
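
A sketch of creating and mounting a drive, assuming the CloudDrive API that accompanied this feature; the cache size, container, and VHD name are illustrative and the exact signatures are an assumption:

using Microsoft.WindowsAzure.ServiceRuntime;

// A local cache backs reads; the page blob remains the durable store
LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

CloudDrive drive = account.CreateCloudDrive(
    blobClient.GetContainerReference("drives")
              .GetPageBlobReference("data.vhd").Uri.ToString());
drive.Create(1024);   // 1 GB single-volume NTFS VHD in the page blob

string root = drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);
// ... ordinary NTFS file I/O under 'root' ...
drive.Unmount();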

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
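
A sketch of an entity and an indexed query through the StorageClient/ADO.NET Data Services classes; TaskEntity, the table name, and the key values are illustrative:

using System.Linq;
using Microsoft.WindowsAzure.StorageClient;

public class TaskEntity : TableServiceEntity
{
    public string Status { get; set; }
    public TaskEntity() { }                        // required for serialization
    public TaskEntity(string jobId, string taskId)
        : base(jobId, taskId) { }                  // PartitionKey, RowKey
}

CloudTableClient tableClient = account.CreateCloudTableClient();
tableClient.CreateTableIfNotExist("Tasks");

TableServiceContext ctx = tableClient.GetDataServiceContext();
ctx.AddObject("Tasks", new TaskEntity("job42", "task001") { Status = "queued" });
ctx.SaveChangesWithRetries();

// Filtering on PartitionKey (and RowKey) is the indexed, fast path
var queued = ctx.CreateQuery<TaskEntity>("Tasks")
                .Where(t => t.PartitionKey == "job42");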

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
(see the sketch after this list)
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
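
For the retry bullet above, a minimal self-contained sketch in plain C# (no particular library assumed) of wrapping a storage call with backoff:

using System;
using System.Threading;

static T WithRetry<T>(Func<T> storageCall)
{
    const int maxAttempts = 4;
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return storageCall();
        }
        catch (Exception)
        {
            if (attempt == maxAttempts) throw;
            // Exponential backoff: transient storage faults usually clear
            Thread.Sleep(TimeSpan.FromMilliseconds(250 * Math.Pow(2, attempt)));
        }
    }
}

// Usage: CloudQueueMessage msg = WithRetry(() => queue.GetMessage());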

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, and back the offering with a technical engagements
team. Lower the barrier to entry through tutorials, accelerators, and developer
best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers,
and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
And when a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 63

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 64

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, and back up this offering with a technical engagement
team. Lower the barrier to entry through tutorials, accelerators, and developer
best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 65

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 66

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 67

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, backed by a technical engagements team. Lower the
barrier to entry through tutorials, accelerators and developer best practices.
Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications that make it easy to upload data
and samples that can be repurposed. Let the community use these to host their
own data sets.
• Services for Research. Provide applications and core services for research as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers,
and what are the core services and key products to pull through to support research?

The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can ask grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 68

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 69

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 70

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]



Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 75

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 76

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A

[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]

[15 minutes]
[ 5 minutes]
[ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology

Cost in smallsized Data
Center

Cost in Large
Data Center

Ratio

Network

$95 per Mbps/
month

$13 per Mbps/
month

7.1

Storage

$2.20 per GB/
month

$0.40 per GB/
month

5.7

Administration

~140 servers/
Administrator

>1000 Servers/
Administrator

7.1

Each data center is
11.5 times
the size of a football field

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience
o
o

HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

Azure FC Owns this Hardware

Highly-available
Fabric Controller (FC)

At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB

Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB

Azure Platform

Compute

Storage

A closer look
Web Role

HTTP
Load
Balancer

IIS

Worker Role

ASP.NET, WCF,
etc.
Agent

main()
{ … }
Agent

Fabric

VM

Using queues for reliable messaging
To scale, add more of either

1) Receive work

Worker Role

Web Role

main()
{ … }

ASP.NET, WCF,
etc.
2) Put work in
queue

3) Get work
from queue

Queue

4) Do
work

Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).

Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models

Blob

REST
API
Load Balancer

Queue

Table

A closer look

HTTP
Blobs

Application
Storage

Compute
Fabric



Drives

Tables

Queues

Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties

Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications

Work

Home

Develop
Development Fabric
Develop

Your
App

Run
Development Storage

Source
Control

Version
Local

Application Works Locally

Application Works Locally

Application Works
In Staging

Cloud

What the ‘Value Add’ ?
Provide a platform that is scalable and available

Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.

Fabric Controller

Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources

Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal

Fault Domains

Purpose: Avoid single points of failures
Fault domains

Allocation is across
fault domains

Update Domains

Purpose: ensure the service stays up
while undergoing an update

Update domains

Unit of software/configuration update
Example: set of nodes to update

Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains

Allocation is across
update domains

Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains

Step 2: Place OS and role images on nodes

Allocation across
fault and update
domains

Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers

Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated

LoadBalancers

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state

Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it

If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers

...
Research Results
Download
Queue

Data Collection Stage

Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal

Reprojection Stage

Derivation Reduction Stage



Statistical tool used to analyze DNA of HIV
from large studies of infected patients



PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers






100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results

Cover of PLoS Biology
November 2008

Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours



Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)

Step 1. Staging
Local
Sequence
Database
1.

2.
3.

Compressed

Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk

Uploaded

Azure Storage

Deployed
BLAST
Executable



Step 2. Partitioning a Job
User Input

Input Partition

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Step 3. Doing the Work
User Input

Input Partition

BLAST Output

Azure Storage
Queue Message
Web Role

Single Partitioning
Worker Role

Logs


BLAST ready Worker Roles



Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere



Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job



Test runs are your friend
- Blowing $20,000 of computation is not a good idea



Make ample use of logging features
- When failure does happen, it’s good to know where



Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it

Resources
Workers

25
16
8
4
2

Clock
Duration

0:12:00
0:15:00
0:26:00
0:47:00
1:27:00

Total run time

2:19:39
2:25:12
2:33:23
2:34:17
2:31:39

Computational run time

1:49:43
1:53:47
2:00:14
2:01:06
1:59:13

Resources

Time
Time-Space
fungibility in the
Cloud

Time

Utilizes a general jobs based task manager
which registers jobs and their resulting data

Data
Products
Job definition

Task
Task
Task
Task
Task

Registry

(HPC) Cluster

Administrator
Registry Broker
Highly Sensitive Data

User
Local
Registry

Web Management

Results

User Premises (or internet)

Azure Datacenters

Client Visualization / Cloud Data and Computation





The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive

User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,

Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible

Associate Metadata with Container
Metadata are pairs
Up to 8KB per container

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

Account

Container
images

jared

Blob
PIC01.JPG
PIC02.JPG

movies
MOV1.AVI

Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3

Block Id N

Block Id 1
Block Id 2
Block Id 3

10 GB Movie

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);

TheBlob.wmv

Windows Azure
Storage

Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block

Create MyBlob
Specify Blob Size = 10 GBytes

Fixed Page Size = 512 bytes

0

Random Access Operations

512

1536
2048
2560

10 GB

10 GB Address Space

1024

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)

GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)

GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048

Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks

Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases

All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob

MyBlob

Promote

A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB

A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.

The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…

Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]


Slide 80

Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

Microsoft and Cloud Computing [10 minutes]
Introduction to Windows Azure [35 minutes]
Research Applications on Azure, demos [10 minutes]
How They Were Built [15 minutes]
A Closer Look at Azure [15 minutes]
Cloud Research Engagement Initiative [ 5 minutes]
Q&A [ * ]

“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”

Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers

Scientific computing analog
Available systems shape research agendas

Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids

Today’s examples
multicore, sensors, clouds and services …

What lessons can we draw?

Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born

Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion

This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters

Data centers range in size from “edge” facilities to megascale.
Economies of scale: approximate costs for a small data center (~1,000 servers) and a large data center (~100,000 servers).

Technology        Cost in small data center     Cost in large data center     Ratio
Network           $95 per Mbps/month            $13 per Mbps/month            7.1
Storage           $2.20 per GB/month            $0.40 per GB/month            5.7
Administration    ~140 servers/administrator    >1000 servers/administrator   7.1

Each data center is 11.5 times the size of a football field.

Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD

o

Node and system architectures

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures
o

Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes

o

Communication fabric

o

Storage systems

o

Reliability and resilience

o

Programming model and services

o

Node and system architectures

o

Communication fabric

o

Node and system architectures

o

Communication fabric

o

Storage systems
o
o

HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent

o

Node and system architectures

o

Communication fabric

o

Storage systems

o Reliability and resilience
  o HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  o DC: loosely consistent models, designed to transparently recover from failures

o Node and system architectures
o Communication fabric
o Storage systems
o Reliability and resilience
o Programming model and services

Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)

At minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7 GB
Network: 100+ Mbps
Local storage: 500 GB

Up to
CPU: 8 cores
Memory: 14.2 GB
Local storage: 2+ TB

Azure Platform: Compute and Storage

A closer look (diagram): HTTP traffic passes through a load balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (a main() { … } entry point); each role instance runs in a VM with an agent, managed by the fabric.

Using queues for reliable messaging
To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in a queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
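To make the flow concrete, here is a minimal sketch of the enqueue/dequeue pattern, assuming the v1.x Microsoft.WindowsAzure.StorageClient library of this era; the queue name "workitems" and the ProcessWorkItem helper are invented for illustration, and exact method names can vary between SDK versions.

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class QueueGlue
{
    // Web role side: 2) put work in the queue.
    public static void Enqueue(CloudStorageAccount account, string payload)
    {
        CloudQueue queue = account.CreateCloudQueueClient()
                                  .GetQueueReference("workitems");
        queue.CreateIfNotExist();                         // idempotent setup
        queue.AddMessage(new CloudQueueMessage(payload));
    }

    // Worker role side: 3) get work from the queue, 4) do the work.
    public static void WorkerLoop(CloudStorageAccount account)
    {
        CloudQueue queue = account.CreateCloudQueueClient()
                                  .GetQueueReference("workitems");
        while (true)
        {
            // The message stays invisible to other workers for 5 minutes.
            CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromMinutes(5));
            if (msg == null) { System.Threading.Thread.Sleep(1000); continue; }

            ProcessWorkItem(msg.AsString);                // application-specific
            queue.DeleteMessage(msg);                     // delete only after success
        }
    }

    static void ProcessWorkItem(string item) { /* hypothetical work */ }
}

If a worker dies before DeleteMessage, the message reappears and another worker picks it up; faults are masked at the cost of at-least-once (rather than exactly-once) delivery.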

Queues are the application glue
• Decouple parts of the application, so each part is easier to scale independently
• Enable resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)

Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model

A closer look (diagram): applications on the compute fabric reach storage over HTTP through a load balancer, via a REST API, to the four storage types: Blobs, Drives, Tables, and Queues.

Points of interest
Storage types
Blobs: a simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational, just entities which contain a set of properties
Queues: reliable message-based communication

Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premises applications or cloud applications

Development lifecycle (diagram): develop your app at work or at home against the local Development Fabric and Development Storage, keeping versions in source control. Once the application works locally, run it in staging in the cloud.

What’s the ‘value add’?
Provide a platform that is scalable and available

Services are always running, with rolling upgrades/downgrades
Failure of any node is expected, so state has to be replicated
Failure of a role (app code) is expected, with automatic recovery
Services can grow to be large, so provide state management that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers

Fabric Controller
Owns all data center hardware
Uses the inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal

Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains

Update Domains
Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update
Example: a set of nodes to update together
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains

Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on the nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure the load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated

The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back to the goal state

Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change

Key takeaways

Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues

Windows Azure frees service developers from
many platform issues

Windows Azure manages both services and servers

Demonstrating Scientific Research Applications in the Cloud

AzureBLAST
- Finding similarities in genetic sequences

Azure Ocean
- Rich client visualization with cloud based data computation

Azure MODIS
- Imagery analysis from satellite photos

PhyloD
- Finding relationships in phylogenetic trees

Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

AzureMODIS pipeline (diagram): a download queue feeds the data collection stage, followed by the reprojection, derivation reduction, and analysis reduction stages; research results are served through the AzureMODIS service web role portal.
PhyloD is a statistical tool used to analyze DNA of HIV from large studies of infected patients.
PhyloD was developed by Microsoft Research and has been highly impactful with a small but important group of researchers:
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on the results
Cover of PLoS Biology, November 2008

Typical job: 10 - 20 CPU hours; extreme jobs require 1K - 2K CPU hours
Requires a large number of test runs for a given job (1 - 10M tests)
Highly compressed data per job (~100 KB per job)

Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure store
3. Deploy the worker roles; each role’s Init() function downloads and decompresses the data, and the BLAST executable, to the local disk

Step 2. Partitioning a Job (diagram): user input reaches the web role; a single partitioning worker role splits it into input partitions in Azure storage, announcing each partition with a queue message.

Step 3. Doing the Work (diagram): BLAST-ready worker roles pick up the queue messages, process their input partitions, and write the BLAST output and logs back to Azure storage.

Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere

Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job

Test runs are your friend
- Blowing $20,000 of computation is not a good idea

Make ample use of logging features
- When failure does happen, it’s good to know where

Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it

Resources

Workers | Clock duration | Total run time | Computational run time
   25   |    0:12:00     |    2:19:39     |        1:49:43
   16   |    0:15:00     |    2:25:12     |        1:53:47
    8   |    0:26:00     |    2:33:23     |        2:00:14
    4   |    0:47:00     |    2:34:17     |        2:01:06
    2   |    1:27:00     |    2:31:39     |        1:59:13

(Chart: resources vs. time, illustrating time-space fungibility in the cloud: total computational time stays roughly constant while wall-clock duration shrinks as workers are added.)

Utilizes a general jobs-based task manager which registers jobs and their resulting data.

(Diagram) A job definition expands into tasks; a registry tracks the resulting data products across an (HPC) cluster and the Azure data centers. An administrator runs a registry broker; highly sensitive data stays in a local registry on the user premises (or internet), and users reach results through web management.

Client Visualization / Cloud Data and Computation

The Cloud is not a jack-of-all-trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port

Our goal, then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user

A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals

A sampling of best practices

Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives

Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported storage client library: .NET APIs
NTFS: Azure Drive

A user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account:
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate a storage account with a compute account
Receive a 256-bit secret key when creating the account

Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
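For instance, a minimal sketch of wiring an account into code, assuming the connection-string format used by the v1.x SDK; the account name is taken from the example below and the key is a placeholder.

using Microsoft.WindowsAzure;

public class AccountDemo
{
    public static CloudStorageAccount GetAccount()
    {
        // Local testing against development storage:
        // return CloudStorageAccount.DevelopmentStorageAccount;

        // A real account; the 256-bit secret key appears Base64-encoded.
        return CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;" +
            "AccountName=jared;" +            // globally unique account name
            "AccountKey=<base64-key>");       // placeholder, not a real key
    }
}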

Namespace example (diagram): the account “jared” holds a container “images” with blobs PIC01.JPG and PIC02.JPG, and a container “movies” with the blob MOV1.AVI, addressable as:
http://jared.blob.core.windows.net/images/PIC01.JPG

Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit

Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible

Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container
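As a rough illustration, creating a public container and attaching metadata might look like the sketch below, assuming the v1.x storage client library; the container name, metadata key, and value are invented.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class ContainerDemo
{
    public static void Setup(CloudStorageAccount account)
    {
        CloudBlobContainer container = account.CreateCloudBlobClient()
                                              .GetContainerReference("images");
        container.CreateIfNotExist();

        // Access policy at the container level: blobs are publicly readable.
        container.SetPermissions(new BlobContainerPermissions
        {
            PublicAccess = BlobContainerPublicAccessType.Blob
        });

        // Metadata: name/value pairs, up to 8 KB per container in total.
        container.Metadata["owner"] = "jared";   // hypothetical pair
        container.SetMetadata();
    }
}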

Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID

Size limit 200GB per blob

Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob

Size limit 1TB per blob

(Diagram) The same account/container/blob namespace, now showing each blob as a sequence of blocks or pages: Block/Page 1, 2, 3, …, identified by Block ID 1 … Block ID N.

Uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage, block by block:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by the blob name and stored with the blob
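The pseudo-code above maps onto the .NET storage client roughly as in this sketch, assuming the v1.x library, where block IDs must be Base64-encoded strings; the container, blob name, and chunking scheme are illustrative.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class BlockUpload
{
    const int BlockSize = 4 * 1024 * 1024;   // blocks can be up to 4 MB each

    public static void Upload(CloudStorageAccount account, string path)
    {
        CloudBlobContainer container = account.CreateCloudBlobClient()
                                              .GetContainerReference("movies");
        container.CreateIfNotExist();
        CloudBlockBlob blob = container.GetBlockBlobReference("TheBlob.wmv");

        var blockIds = new List<string>();
        using (FileStream file = File.OpenRead(path))
        {
            byte[] buffer = new byte[BlockSize];
            int read, n = 0;
            while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Block IDs are scoped to the blob and must be Base64 strings.
                string blockId = Convert.ToBase64String(
                    Encoding.UTF8.GetBytes(n++.ToString("d6")));
                blob.PutBlock(blockId, new MemoryStream(buffer, 0, read), null);
                blockIds.Add(blockId);
            }
        }
        // Commit: the readable blob becomes the listed sequence of blocks.
        blob.PutBlockList(blockIds);
    }
}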

Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob

Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob

GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size of each block are returned

Page blob example:
Create MyBlob, specifying a blob size of 10 GB with a fixed page size of 512 bytes, giving a 10 GB address space for random-access operations.

PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0, 512) and [1536, 2560)

GetBlob [1000, 2048) returns:
All 0s for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)
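A hedged sketch of the same sequence against the v1.x .NET library, where PutPage/ClearPage surface as WritePages/ClearPages; the container name is invented and write lengths/offsets must stay 512-byte aligned.

using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class PageBlobDemo
{
    public static void Run(CloudStorageAccount account)
    {
        CloudBlobContainer container = account.CreateCloudBlobClient()
                                              .GetContainerReference("demo");
        container.CreateIfNotExist();

        CloudPageBlob blob = container.GetPageBlobReference("MyBlob");
        blob.Create(10L * 1024 * 1024 * 1024);       // 10 GB address space

        byte[] pages = new byte[2048];                // payload, multiple of 512
        blob.WritePages(new MemoryStream(pages, 0, 1536), 512); // PutPage [512, 2048)
        blob.WritePages(new MemoryStream(pages, 0, 1024), 0);   // PutPage [0, 1024)
        blob.ClearPages(512, 1024);                              // ClearPage [512, 1536)
        blob.WritePages(new MemoryStream(pages, 0, 512), 2048);  // PutPage [2048, 2560)

        // GetPageRange: enumerate the valid (non-cleared) ranges;
        // expect ranges covering [0, 512) and [1536, 2560).
        foreach (PageRange range in blob.GetPageRanges())
        {
            System.Console.WriteLine("{0} - {1}", range.StartOffset, range.EndOffset);
        }
    }
}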

Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks

Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases

Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
(Diagram: MyBlob, with a snapshot being promoted back to the base blob.)
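A small sketch, assuming the v1.x library: CreateSnapshot exists there, and "promotion" is modeled here as copying a snapshot back over the base blob, which may differ from the exact mechanism the slide has in mind.

using Microsoft.WindowsAzure.StorageClient;

public class SnapshotDemo
{
    public static void PromoteSnapshot(CloudBlob blob)
    {
        // Take a read-only, point-in-time snapshot; only deltas are stored.
        CloudBlob snapshot = blob.CreateSnapshot();

        // ... later, "promote": restore the prior version by copying the
        // snapshot back over the base blob name.
        blob.CopyFromBlob(snapshot);
    }
}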

A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob interface:
Can upload a VHD to its Page Blob using the blob interface, then mount it as a Drive
Can download the Drive through the Page Blob interface
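A hedged sketch of mounting a drive, assuming the CloudDrive API that shipped with the SDK of this era; the blob URI, size, and cache size are illustrative, and signatures may differ by SDK version.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
// CloudDrive lives in the Microsoft.WindowsAzure.CloudDrive assembly.

public class DriveDemo
{
    public static string MountDataDrive(CloudStorageAccount account)
    {
        // The drive is backed by a page blob holding an NTFS-formatted VHD.
        CloudDrive drive = account.CreateCloudDrive(
            "http://jared.blob.core.windows.net/drives/data.vhd"); // illustrative URI
        drive.Create(1024);                  // create a 1 GB drive (throws if it exists)

        // Mount with a 64 MB local read cache; returns a drive letter path.
        string path = drive.Mount(64, DriveMountOptions.None);
        return path;                         // e.g. "X:\"
    }
}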

Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows

Highly Available & Durable
Data is replicated several times

Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language

Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)

Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
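As a concrete sketch (v1.x storage client on top of ADO.NET Data Services; the task-flavored entity and table name are invented for illustration):

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// An entity is a set of properties; PartitionKey, RowKey and Timestamp
// are required, and TableServiceEntity supplies them.
public class TaskEntity : TableServiceEntity
{
    public TaskEntity(string jobId, string taskId) : base(jobId, taskId) { }
    public TaskEntity() { }                 // required for serialization
    public string Status { get; set; }      // an additional property (column)
}

public class TableDemo
{
    public static void InsertTask(CloudStorageAccount account)
    {
        CloudTableClient tables = account.CreateCloudTableClient();
        tables.CreateTableIfNotExist("tasks");

        TableServiceContext ctx = tables.GetDataServiceContext();
        ctx.AddObject("tasks", new TaskEntity("job42", "task001") { Status = "queued" });
        ctx.SaveChangesWithRetries();        // built-in retry support
    }
}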

Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores

Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic everywhere you access data (see the sketch below)
• Use the built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
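For instance, a minimal sketch of two retry approaches against the v1.x library: the client’s pluggable retry policy, plus a hand-rolled wrapper for operations without built-in retries. The retry counts and backoff values are arbitrary assumptions.

using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class RetryDemo
{
    public static void Configure(CloudStorageAccount account)
    {
        // v1.x clients carry a pluggable retry policy; exponential backoff here.
        CloudBlobClient blobClient = account.CreateCloudBlobClient();
        blobClient.RetryPolicy = RetryPolicies.RetryExponential(
            5, TimeSpan.FromSeconds(2));
    }

    // Hand-rolled fallback: retry transient server-side failures with backoff.
    public static void WithRetries(Action operation, int attempts)
    {
        for (int i = 1; ; i++)
        {
            try { operation(); return; }
            catch (StorageServerException)
            {
                if (i >= attempts) throw;
                // Exponential backoff before the next attempt.
                System.Threading.Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, i)));
            }
        }
    }
}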

http://research.microsoft.com/azure
http://azurescope.cloudapp.net

• International Engagements. Offer cloud resources to academic and research communities worldwide, and back the offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?

The Rest of Us
We use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.

Paradigm shift for research
The ability to marshal needed resources on demand.

Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.

Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…

Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards

Coupled with
Access to a research-oriented technical team

Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support

International program, discussions underway…

http://research.microsoft.com/azure
[email protected]