Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing          [10 minutes]
Introduction to Windows Azure          [35 minutes]
Research Applications on Azure, demos  [10 minutes]
How They Were Built                    [15 minutes]
A Closer Look at Azure                 [15 minutes]
Cloud Research Engagement Initiative   [ 5 minutes]
Q&A                                    [ * ]
“In the last two decades advances in computing technology, from processing speed to network capacity and the Internet, have revolutionized the way scientists work. From sequencing genomes to monitoring the Earth's climate, many recent scientific advances would not have been possible without a parallel increase in computing power - and with revolutionary technologies such as the quantum computer edging towards reality, what will the relationship between computing and science bring us over the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing, just as “killer micros” and inexpensive clusters did.
Data centers range in size from “edge” facilities to megascale.
Economies of scale: approximate costs for a small data center (~1,000 servers) and a larger, 100K-server data center:

Technology      Cost in small data center     Cost in large data center     Ratio
Network         $95 per Mbps/month            $13 per Mbps/month            7.1
Storage         $2.20 per GB/month            $0.40 per GB/month            5.7
Administration  ~140 servers/administrator    >1000 servers/administrator   7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers and complex cooling systems all separately is not efficient. Package and deploy into bigger units, JITD.
Comparing HPC systems and cloud data centers (DC) along five dimensions:
o Node and system architectures
  Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes.
o Communication fabric
o Storage systems
  HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage.
  DC: TB local storage; secondary is JBOD; tertiary is non-existent.
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable.
  DC: loosely consistent models, designed to transparently recover from failures.
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local storage 2+ TB
Azure Platform: Compute and Storage. A closer look at Compute:
[Diagram: HTTP requests pass through a load balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (a main() { … } entry point); each runs in a VM alongside an Agent, all managed by the Fabric.]
Using queues for reliable messaging. To scale, add more of either role.
[Diagram: 1) the Web Role (ASP.NET, WCF, etc.) receives work; 2) puts work in the queue; 3) the Worker Role (main() { … }) gets work from the queue; 4) does the work.]
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Enable resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
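The queue pattern above can be sketched in miniature. Here Python's standard queue module stands in for an Azure Storage Queue, and web_role_put / worker_role_loop are illustrative names, not SDK calls:

```python
import queue
import threading

# In-process stand-in for an Azure Storage Queue: the same
# put / get / complete pattern applies.
work_queue = queue.Queue()
results = []

def web_role_put(task):
    """Step 2: the web role serializes work into a queue message."""
    work_queue.put(task)

def worker_role_loop():
    """Steps 3-4: a worker role polls the queue and does the work."""
    while True:
        task = work_queue.get()
        if task is None:              # sentinel: shut this worker down
            work_queue.task_done()
            break
        results.append(task * task)   # "do work": square the number
        work_queue.task_done()        # analogous to deleting the message

workers = [threading.Thread(target=worker_role_loop) for _ in range(2)]
for w in workers:
    w.start()

for task in range(5):                 # step 1: receive work; step 2: enqueue
    web_role_put(task)
for _ in workers:
    work_queue.put(None)
for w in workers:
    w.join()

print(sorted(results))                # [0, 1, 4, 9, 16]
```

To scale, you add more producer or consumer instances; the queue decouples the two sides exactly as the slide describes.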
A closer look at Storage:
[Diagram: applications reach Blobs, Drives, Tables and Queues through a REST API over HTTP, behind a load balancer; Compute and Storage both run on the Fabric.]
Points of interest
Storage types
  Blobs: simple interface for storing named files along with metadata for the file
  Drives: durable NTFS volumes
  Tables: entity-based storage; not relational, just entities that contain a set of properties
  Queues: reliable message-based communication
Access
  Data is exposed via .NET and RESTful interfaces
  Data can be accessed by Windows Azure apps and by other on-premises or cloud applications
[Diagram: the development cycle. Develop your app, at work or at home, against the local Development Fabric and Development Storage, with versioned source control; verify the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
  Services are always running; rolling upgrades/downgrades
  Failure of any node is expected; state has to be replicated
  Failure of a role (app code) is expected; recovery is automatic
  Services can grow to be large; provide state management that scales automatically
  Handle dynamic configuration changes due to load or failure
  Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers
Fabric Controller
  Owns all data center hardware
  Uses inventory to host services
  Deploys applications to free resources
  Maintains the health of those applications
  Maintains the health of the hardware
  Manages the service life cycle starting from bare metal
Fault Domains
  Purpose: avoid single points of failure
  Allocation is across fault domains
Update Domains
  Purpose: ensure the service stays up while undergoing an update
  Unit of software/configuration update; example: a set of nodes to update
  Used when rolling forward or backward
  Developer assigns the number required by each role; example: 10 front-ends, across 5 update domains
  Allocation is across update domains
Push-button Deployment
  Step 1: Allocate nodes, across fault domains and across update domains
  Step 2: Place OS and role images on nodes
  Step 3: Configure settings
  Step 4: Start roles
  Step 5: Configure load balancers
  Step 6: Maintain the desired number of roles
    Failed roles are automatically restarted
    Node failure results in new nodes being automatically allocated
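The allocation in step 1 can be sketched as a simple round-robin placement across fault and update domains. This is a toy model, not the real allocator; the domain counts below echo the earlier "10 front-ends, 5 update domains" example:

```python
# Toy placement: assign role instances round-robin across fault and
# update domains so no single domain holds too many instances.
def allocate(instances, fault_domains, update_domains):
    placement = []
    for i in range(instances):
        placement.append({
            "instance": i,
            "fault_domain": i % fault_domains,
            "update_domain": i % update_domains,
        })
    return placement

plan = allocate(instances=10, fault_domains=2, update_domains=5)

# Rolling update domain 0 forward touches only a fifth of the front-ends:
domain0 = [p["instance"] for p in plan if p["update_domain"] == 0]
print(domain0)   # [0, 5]
```

With this spread, a rack failure (one fault domain) or a rolling update (one update domain) never takes down more than a fraction of the role's instances.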
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
  FC detects if a role dies
  A role can indicate that it is unhealthy
  The current state of the node is updated appropriately
  The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
  If the node goes offline, FC will try to recover it
  If a failed node can’t be recovered, FC migrates role instances to a new node
    A suitable replacement location is found
    Existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
  Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
  Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram: the AzureMODIS service web role portal drives a pipeline: download queue → data collection stage → reprojection stage → derivation reduction stage → analysis reduction stage → research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Serves a small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on its results
  (Cover of PLoS Biology, November 2008)
• Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  - Requires a large number of test runs for a given job (1 – 10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
  1. Compress the required data (the local sequence database)
  2. Upload it to Azure storage
  3. Deploy the worker roles; each role’s Init() function downloads and decompresses the data to the local disk, next to the BLAST executable
Step 2. Partitioning a Job
  [Diagram: the web role takes user input and hands it to a single partitioning worker role, which writes input partitions to Azure storage and a queue message for each partition.]
Step 3. Doing the Work
  [Diagram: BLAST-ready worker roles pick up queue messages, read their input partitions from Azure storage, and write BLAST output and logs back to storage.]
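Steps 2 and 3 can be sketched with a hypothetical partitioning helper: split the job's input into fixed-size partitions and emit one queue message per partition. All names here (partition_job, make_queue_messages, the job id) are illustrative, not part of AzureBLAST:

```python
# Split query sequences into fixed-size partitions, one per worker task.
def partition_job(sequences, partition_size):
    partitions = []
    for start in range(0, len(sequences), partition_size):
        partitions.append(sequences[start:start + partition_size])
    return partitions

def make_queue_messages(job_id, partitions):
    # Each message only points at a partition; workers fetch the actual
    # data from storage rather than receiving it inline in the message.
    return [{"job": job_id, "partition": i} for i in range(len(partitions))]

seqs = [f"seq{i}" for i in range(10)]
parts = partition_job(seqs, partition_size=4)
msgs = make_queue_messages("blast-001", parts)
print(len(parts), len(msgs))   # 3 3
```

The partition size is the tuning knob the "lessons learned" slide below warns about: too small and per-message overhead dominates, too large and a single failure wastes hours of work.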
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Resources versus time:

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Chart: resources against time, illustrating time-space fungibility in the cloud.]
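One way to read the table above: multiplying workers by clock duration gives the core-minutes consumed, which grows from roughly 174 (2 workers) to 300 (25 workers). The extra core-minutes are the overhead traded for a much shorter wall-clock time, which is the time-space fungibility the chart illustrates:

```python
# Derive core-minutes consumed from the clock-duration column above.
def to_minutes(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

clock = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}

core_minutes = {w: w * to_minutes(t) for w, t in clock.items()}
print(core_minutes[2], core_minutes[25])   # 174.0 300.0
```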
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: on the user premises (or the internet), a user with highly sensitive data works against a local registry and an (HPC) cluster run by an administrator; a registry broker and web management layer connect to the Azure data centers, where job definitions fan out into tasks whose data products and results are recorded in the registry.]
Client Visualization / Cloud Data and Computation
The Cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
  Can choose the geo-location that hosts the storage account: “US Anywhere”, “US North Central”, “US South Central”, …
  Can co-locate the storage account with a compute account
  Receives a 256-bit secret key when creating the account
Storage Account Capacity
  Each storage account can store up to 100 TB
  Default limit of 5 storage accounts per subscription
[Diagram: the storage account “jared” holds the containers “images” (blobs PIC01.JPG and PIC02.JPG) and “movies” (blob MOV1.AVI), e.g.:]
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
  You can have as many blob containers as will fit within the storage account limit
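The URI above composes the account, container and blob names; a one-line helper makes the pattern explicit (blob_uri is a made-up helper for illustration, not part of any SDK):

```python
# Blob URIs follow the pattern shown above:
#   http://<account>.blob.core.windows.net/<container>/<blob>
def blob_uri(account, container, blob):
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

print(blob_uri("jared", "images", "PIC01.JPG"))
# http://jared.blob.core.windows.net/images/PIC01.JPG
```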
Blob Container
  A container holds a set of blobs
  Set access policies at the container level: private or publicly accessible
  Associate metadata with the container: metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: each blob in the account (PIC01.JPG, PIC02.JPG, MOV1.AVI) is made up of blocks or pages: Block or Page 1, 2, 3, …, with blocks identified by Block Id 1 through Block Id N.]
Uploading a 10 GB movie to Windows Azure Storage as a block blob:

  blobName = “TheBlob.wmv”;
  PutBlock(blobName, blockId1, block1Bits);
  PutBlock(blobName, blockId2, block2Bits);
  …
  PutBlock(blobName, blockIdN, blockNBits);
  PutBlockList(blobName, blockId1, …, blockIdN);

Blocks can be up to 4 MB each, and each block can be a variable size. Each block has a 64-byte ID, scoped by the blob name and stored with the blob.
Block operation
  PutBlock: puts an uncommitted block, identified by its block ID, for the blob
Block list operations
  PutBlockList: provides the list of blocks that comprise the readable version of the blob; blocks from the uncommitted or committed list can be used to update the blob
  GetBlockList: returns the list of blocks, committed or uncommitted, for a blob; the block ID and size of each block is returned
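The PutBlock / PutBlockList split can be modeled in a few lines. This toy class only mimics the commit semantics described above; it is not the real service API:

```python
# Toy model of block blob semantics: PutBlock stages uncommitted blocks,
# PutBlockList chooses which blocks form the readable blob (drawing from
# either the uncommitted or the already-committed set).
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes
        self.committed = {}     # insertion order = readable order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        new = {}
        for bid in block_ids:
            if bid in self.uncommitted:
                new[bid] = self.uncommitted[bid]
            else:
                new[bid] = self.committed[bid]
        self.committed = new
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed.values())

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
blob.put_block_list(["b1", "b2"])
print(blob.read())   # b'hello world'
```

Until the block list is committed, readers never see the staged blocks, which is what makes multi-block uploads safe to retry.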
Random access operations on a page blob:
  Create MyBlob, specifying a blob size of 10 GB (a 10 GB address space) with a fixed page size of 512 bytes.
  Then, in order:
    PutPage [512, 2048)
    PutPage [0, 1024)
    ClearPage [512, 1536)
    PutPage [2048, 2560)
  GetPageRange [0, 4096) returns the valid data ranges [0, 512) and [1536, 2560).
  GetBlob [1000, 2048) returns all zeros for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048).
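The bookkeeping in this example can be reproduced with a toy model that only tracks which 512-byte pages hold valid data (illustrative only; the real service also stores the page contents and enforces alignment):

```python
# Toy model of page blob valid-range tracking, at 512-byte page granularity.
class PageBlob:
    def __init__(self, size):
        self.size = size
        self.valid = set()              # offsets of valid 512-byte pages

    def put_page(self, start, end):
        self.valid |= set(range(start, end, 512))

    def clear_page(self, start, end):
        self.valid -= set(range(start, end, 512))

    def get_page_range(self, start, end):
        """Coalesce valid pages in [start, end) into (start, end) ranges."""
        ranges = []
        for off in sorted(self.valid):
            if start <= off < end:
                if ranges and ranges[-1][1] == off:
                    ranges[-1] = (ranges[-1][0], off + 512)
                else:
                    ranges.append((off, off + 512))
        return ranges

blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_range(0, 4096))   # [(0, 512), (1536, 2560)]
```

Running the slide's four operations through this model reproduces exactly the ranges the slide reports.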
Block Blob
  Targeted at streaming workloads
  Update semantics: upload a set of blocks, then commit the change
  Concurrency: ETag checks
Page Blob
  Targeted at random read/write workloads
  Update semantics: immediate update
  Concurrency: leases
Snapshots (e.g. of MyBlob)
  All writes are applied to the base blob name
  Only delta changes are maintained across snapshots
  Restore to a prior version via snapshot promotion
  Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
  Drives can be up to 1 TB
  A VM can dynamically mount up to 8 drives
  A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob interface
  Can upload the VHD to its Page Blob using the blob interface, and then mount it as a drive
  Can download the drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
  A storage account can create many tables; the table name is scoped by the account
  A table is a set of entities (i.e. rows)
Entity
  A set of properties (columns)
  Required properties: PartitionKey, RowKey and Timestamp
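The entity model can be sketched as property bags keyed by (PartitionKey, RowKey). This in-memory dictionary is a stand-in for the service, and insert_entity is a made-up helper:

```python
# Entities are schema-free property bags; the only indexed access path
# is the (PartitionKey, RowKey) pair, modeled here as a dict key.
table = {}

def insert_entity(entity):
    key = (entity["PartitionKey"], entity["RowKey"])
    table[key] = entity

insert_entity({"PartitionKey": "blast-001", "RowKey": "task-07",
               "Status": "done", "Hits": 42})
insert_entity({"PartitionKey": "blast-001", "RowKey": "task-08",
               "Status": "running"})

# Point lookup by (PartitionKey, RowKey) is fast:
print(table[("blast-001", "task-07")]["Status"])   # done

# Anything else is a scan over the partition:
running = [e["RowKey"] for e in table.values()
           if e["PartitionKey"] == "blast-001" and e.get("Status") == "running"]
print(running)   # ['task-08']
```

This is why the best-practices slide below warns that tables only index on the partition and row keys: any other predicate is a scan.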
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
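The first testing guideline, retry logic around every data access, can be sketched with exponential backoff. with_retries, the delay values, and flaky_fetch are all illustrative, not an Azure API:

```python
import time

# Retry a storage call with exponential backoff; attempt counts and
# delays are illustrative defaults.
def with_retries(operation, attempts=4, base_delay=0.1):
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:                    # treat as a transient failure
            if attempt == attempts - 1:
                raise                      # out of attempts: surface it
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:                     # fail twice, then succeed
        raise OSError("transient storage error")
    return "blob-bytes"

print(with_retries(flaky_fetch, base_delay=0))   # blob-bytes
```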
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?
The Rest of Us
  Use laptops. Got data, now what?
  And it really is about data, not the FLOPS…
  Our data collections are not as big as we wished.
  When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
  The ability to marshal needed resources on demand, without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
  Available over a three-year period
  To be allocated by NSF with new NSF awards
Coupled with access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 2
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block, identified by its block ID, into the blob
Block list operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update the blob
GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
Block ID and size of block are returned for each block
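The two-phase update above — stage blocks with PutBlock, then make a chosen list of them readable with PutBlockList — can be made concrete with a small local simulation. This is plain Python written for illustration, not the real service API; the method names merely mirror the operations described:

```python
class BlockBlob:
    """Local simulation of block blob commit semantics (a sketch, not the Azure API)."""

    def __init__(self):
        self.uncommitted = {}  # block id -> bytes, staged by put_block but not yet readable
        self.committed = {}    # block id -> bytes belonging to the readable version
        self.block_list = []   # ordered ids forming the readable version

    def put_block(self, block_id, data):
        # Stages an uncommitted block; the readable blob is unchanged.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Each id may be taken from the uncommitted or the committed list.
        self.committed = {b: self.uncommitted.get(b, self.committed.get(b))
                          for b in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[b] for b in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
before = blob.read()                # empty: nothing is readable until the list is committed
blob.put_block_list(["b1", "b2"])
after = blob.read()                 # now the full content is visible
```

Note how an update can replace just one block: stage a new "b1" and commit a list that reuses the already-committed "b2".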
Example: random access operations on a page blob.
Create MyBlob with blob size = 10 GB and a fixed page size of 512 bytes, giving a 10 GB address space. Then issue:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512) and [1536, 2560)
GetBlob [1000, 2048) returns all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536, 2048)
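The valid-range bookkeeping in that example can be checked with a local simulation. Again this is illustrative Python, not the service API; it models written bytes directly rather than real 512-byte pages, which is enough to reproduce the ranges above:

```python
class PageBlob:
    """Local simulation of page blob valid-range tracking (a sketch, not the Azure API)."""

    def __init__(self, size):
        self.size = size
        self.data = {}  # offset -> written byte value; absent offsets read as zero

    def put_page(self, start, end, value=1):
        for i in range(start, end):
            self.data[i] = value

    def clear_page(self, start, end):
        for i in range(start, end):
            self.data.pop(i, None)

    def get_page_range(self, start, end):
        # Collect maximal contiguous runs of written offsets.
        ranges, run = [], None
        for i in range(start, end):
            if i in self.data:
                run = [i, i + 1] if run is None else [run[0], i + 1]
            elif run is not None:
                ranges.append(tuple(run))
                run = None
        if run is not None:
            ranges.append(tuple(run))
        return ranges

    def get_blob(self, start, end):
        # Bytes never written (or since cleared) read back as zeros.
        return bytes(self.data.get(i, 0) for i in range(start, end))

pb = PageBlob(size=10 * 2**30)      # 10 GB address space
pb.put_page(512, 2048)
pb.put_page(0, 1024)
pb.clear_page(512, 1536)
pb.put_page(2048, 2560)
```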
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
Blob snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion (e.g. promote a snapshot of MyBlob over the base blob)
Can use ListBlobs to enumerate the snapshots for a blob
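The snapshot semantics just described — writes always target the base blob, snapshots are read-only versions that can be promoted back — can be sketched locally. This is an illustration only: the real service keeps only deltas and keys snapshots by timestamp, while this sketch copies content and uses integer ids:

```python
class SnapshottableBlob:
    """Sketch of blob snapshot semantics (not the Azure API)."""

    def __init__(self, content=b""):
        self.base = content
        self.snapshots = []  # read-only versions; real service keys these by timestamp

    def snapshot(self):
        # Real service stores only delta changes; we copy the content for clarity.
        self.snapshots.append(self.base)
        return len(self.snapshots) - 1

    def write(self, content):
        self.base = content  # all writes are applied to the base blob name

    def promote(self, snapshot_id):
        self.base = self.snapshots[snapshot_id]  # restore a prior version

    def list_snapshots(self):
        # Analogous to using ListBlobs to enumerate a blob's snapshots.
        return list(range(len(self.snapshots)))

my_blob = SnapshottableBlob(b"version 1")
snap = my_blob.snapshot()
my_blob.write(b"version 2")   # base blob moves on; the snapshot is untouched
my_blob.promote(snap)         # restore the prior version
```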
A Windows Azure Drive is a Page Blob formatted as an NTFS
single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and easy-to-use APIs
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
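An entity is just a set of properties addressed by the two required keys, which is easy to model locally. A sketch under those assumptions (plain Python, not the real Table service or its client library; the `Table` class and its methods are invented for illustration):

```python
import time

class Table:
    """Local sketch of Azure Table semantics: entities are property sets
    addressed by (PartitionKey, RowKey); Timestamp is maintained by the store."""

    def __init__(self):
        self.entities = {}  # (PartitionKey, RowKey) -> property dict

    def insert(self, partition_key, row_key, **properties):
        entity = dict(properties, PartitionKey=partition_key, RowKey=row_key,
                      Timestamp=time.time())
        self.entities[(partition_key, row_key)] = entity
        return entity

    def get(self, partition_key, row_key):
        # Point lookups on the two indexed keys are the efficient access path;
        # any other filter would require a scan.
        return self.entities[(partition_key, row_key)]

movies = Table()
movies.insert("movies", "MOV1.AVI", size_gb=10)
entity = movies.get("movies", "MOV1.AVI")
```

This also illustrates the best-practice note below: only PartitionKey and RowKey are indexed, so queries should be designed around them.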
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
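The first testing tip — retry logic around every data access — can be sketched as a small wrapper with exponential backoff. The helper and the flaky call below are hypothetical, written only to illustrate the pattern:

```python
import time

def with_retries(op, attempts=3, base_delay=0.01):
    """Run op(); on transient failure, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                          # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky storage call that succeeds on the third attempt.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

result = with_retries(flaky_read)
```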
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 4
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on results
• Cover of PLoS Biology, November 2008
• Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  - Requires a large number of test runs for a given job (1–10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database)
2. Upload to Azure Storage
3. Deploy Worker Roles with the BLAST executable
   - The Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
[Diagram: the Web Role accepts the user input; a single partitioning Worker Role writes input partitions to Azure Storage and enqueues a queue message for each]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage]
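The three steps above can be sketched end to end. All names here are hypothetical, the in-memory dict and list stand in for Azure blob storage and queues, and the "BLAST" step is a dummy substring search rather than the real executable:

```python
# Sketch of the AzureBLAST flow: partition a job into blobs plus queue
# messages, then let workers drain the queue and write output and logs.
storage = {}   # blob name -> contents (stand-in for Azure blob storage)
queue = []     # FIFO of queue messages (stand-in for an Azure queue)

def partition_job(sequences, partition_size):
    """Single partitioning worker role: split the user input into
    partition blobs and enqueue one message per partition."""
    for i in range(0, len(sequences), partition_size):
        name = f"input/partition-{i // partition_size}"
        storage[name] = sequences[i:i + partition_size]
        queue.append(name)

def blast_worker():
    """BLAST-ready worker role: process every queued partition."""
    while queue:
        name = queue.pop(0)
        # Dummy stand-in for running BLAST against the local database.
        hits = [s for s in storage[name] if "GATTACA" in s]
        storage[name.replace("input/", "output/")] = hits
        storage[name.replace("input/", "logs/")] = (
            f"processed {len(storage[name])} sequences")

partition_job(["GATTACAAT", "CCCT", "AGATTACA", "TTTT"], partition_size=2)
blast_worker()
print(storage["output/partition-0"])  # ['GATTACAAT']
print(storage["output/partition-1"])  # ['AGATTACA']
```

Scaling out is then just running blast_worker on more role instances; the queue naturally load-balances the partitions among them.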
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - The little cloud development headaches are probably worth it
Resources

  Workers   Clock Duration   Total run time   Computational run time
       25          0:12:00          2:19:39                  1:49:43
       16          0:15:00          2:25:12                  1:53:47
        8          0:26:00          2:33:23                  2:00:14
        4          0:47:00          2:34:17                  2:01:06
        2          1:27:00          2:31:39                  1:59:13

[Chart: resources vs. time, illustrating time-space fungibility in the Cloud]
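A quick back-of-the-envelope on the table above shows what time-space fungibility means in practice: going from 2 to 25 workers cuts wall-clock time by about 7x, while the aggregate worker-minutes consumed (a rough proxy for cost) grow by well under 2x:

```python
# Workers -> clock duration in minutes, taken from the scaling table above.
runs = {25: 12, 16: 15, 8: 26, 4: 47, 2: 87}

for workers, minutes in runs.items():
    print(f"{workers:2d} workers x {minutes:2d} min "
          f"= {workers * minutes:3d} worker-minutes")

# 25 workers finish the same job much faster than 2 workers,
# at only a modest premium in aggregate resource consumption.
speedup = runs[2] / runs[25]
overhead = (25 * runs[25]) / (2 * runs[2])
print(f"speedup: {speedup:.2f}x, extra aggregate cost: {overhead:.2f}x")
```

In a pay-per-core-hour model, renting 25 machines for 12 minutes and 2 machines for 87 minutes cost nearly the same, which is why the cloud lets you trade space for time so freely.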
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a Job definition fans out into Tasks on an (HPC) Cluster; a Registry and Registry Broker track jobs and Data Products; Highly Sensitive Data stays in a Local Registry on the User Premises (or internet); the User and Administrator reach Results through Web Management against the Azure Datacenters]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then:
  - Make best use of the capabilities of client and cloud computing
  - Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Example namespace: Account “jared” → Container “images” (Blobs PIC01.JPG, PIC02.JPG) and Container “movies” (Blob MOV1.AVI)]
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the account/container/blob namespace (e.g. Account “jared”, Containers “images” and “movies”), each blob is composed of blocks or pages: Block/Page 1, Block/Page 2, Block/Page 3, …, each identified by a Block ID (Block Id 1 … Block Id N)]
Uploading a 10 GB movie:
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
[Diagram: TheBlob.wmv assembled in Windows Azure Storage]
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
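The PutBlock/PutBlockList semantics above can be modeled in a few lines. This is a toy in-memory sketch of the commit behavior, not the storage service API: staged blocks are invisible until a block list commits them, and a new commit may mix freshly staged blocks with blocks carried over from the previous committed version.

```python
class BlockBlob:
    """Toy model of block-blob update semantics: PutBlock stages
    uncommitted blocks; PutBlockList atomically commits a readable version."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged but not readable
        self.committed = {}     # block id -> bytes from the last commit
        self.block_list = []    # ordered ids making up the readable version

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Each id may come from the uncommitted or the committed set.
        new = {bid: (self.uncommitted[bid] if bid in self.uncommitted
                     else self.committed[bid])
               for bid in block_ids}
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""          # nothing readable until the commit
blob.put_block_list(["b1", "b2"])
print(blob.read())                  # b'hello world'
blob.put_block("b2", b"azure")      # restage one block, reuse the other
blob.put_block_list(["b1", "b2"])
print(blob.read())                  # b'hello azure'
```

The second commit illustrates the "blocks from uncommitted or committed list" rule: only the changed block is re-uploaded, yet readers always see a complete, consistent blob.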
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536, 2048)
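The page-blob sequence above can be replayed against a toy model to confirm the returned ranges. This is an illustrative simulation (512-byte pages over a 4 KB blob rather than 10 GB); the class and the fill byte are inventions for the sketch, not the service API:

```python
PAGE = 512  # fixed page size from the slide

class PageBlob:
    """Toy page blob: 512-byte pages, immediate writes, zero-filled reads."""
    def __init__(self, size):
        self.data = bytearray(size)  # unwritten ranges read back as zeros
        self.valid = set()           # indices of pages holding data

    def put_page(self, start, end, fill=0xAB):
        self.data[start:end] = bytes([fill]) * (end - start)
        self.valid |= set(range(start // PAGE, end // PAGE))

    def clear_page(self, start, end):
        self.data[start:end] = bytes(end - start)
        self.valid -= set(range(start // PAGE, end // PAGE))

    def get_page_ranges(self):
        """Merge contiguous valid pages into [start, end) byte ranges."""
        ranges, run = [], None
        for p in sorted(self.valid):
            if run and p == run[1]:
                run = (run[0], p + 1)
            else:
                if run:
                    ranges.append((run[0] * PAGE, run[1] * PAGE))
                run = (p, p + 1)
        if run:
            ranges.append((run[0] * PAGE, run[1] * PAGE))
        return ranges

# Replay the operation sequence from the slide:
blob = PageBlob(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges())   # [(0, 512), (1536, 2560)]
```

A read of [1000, 2048) against this model likewise sees 536 zero bytes (the cleared region) followed by 512 bytes of data, matching the GetBlob result on the slide.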
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: snapshots of MyBlob; promote a snapshot to restore a prior version]
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
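The entity model above can be illustrated with a minimal in-memory stand-in (not the real table service or its ADO.NET/REST API). The point of the sketch is the practical consequence of the key scheme: lookups by (PartitionKey, RowKey) are index hits, while filtering on any other property is a scan over the entities:

```python
import time

class Table:
    """Toy table: entities are property bags keyed by (PartitionKey, RowKey)."""
    def __init__(self):
        self.index = {}   # (PartitionKey, RowKey) -> entity

    def insert(self, entity):
        entity["Timestamp"] = time.time()   # required system property
        self.index[(entity["PartitionKey"], entity["RowKey"])] = entity

    def get(self, partition_key, row_key):
        # Key lookup: served directly by the index, cheap at any scale.
        return self.index.get((partition_key, row_key))

    def scan(self, predicate):
        # Query on any other property: a full scan over all entities.
        return [e for e in self.index.values() if predicate(e)]

papers = Table()
papers.insert({"PartitionKey": "2008", "RowKey": "phylod",
               "Journal": "PLoS Biology"})
papers.insert({"PartitionKey": "2010", "RowKey": "azureblast",
               "Journal": "eScience"})

print(papers.get("2008", "phylod")["Journal"])                 # PLoS Biology
print(len(papers.scan(lambda e: e["Journal"] == "eScience")))  # 1
```

Choosing a PartitionKey that matches your dominant query pattern is therefore the central schema decision, which the best-practices list below returns to.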
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
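The "include retry logic" advice above can be sketched as a small exponential-backoff helper. This is an illustrative pattern, assuming transient faults surface as exceptions; real storage client libraries ship their own configurable retry policies:

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.01):
    """Retry a flaky call with exponential backoff plus jitter.
    Raises the last error if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise
            # Back off 2^attempt with random jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

calls = {"n": 0}
def flaky_read():
    """Hypothetical storage read that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "payload"

result = with_retries(flaky_read)
print(result)  # payload
```

Pairing this with the logging and heartbeat advice above makes transient faults visible without letting them fail the whole job.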
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab);
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 6
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; memory 1.7 GB; network 100+ Mbps; local storage 500 GB
Up to: CPU 8 cores; memory 14.2 GB; local storage 2+ TB
Azure Platform: Compute and Storage

A closer look at Compute
[Diagram: HTTP requests pass through a Load Balancer to Web Roles (IIS; ASP.NET, WCF, etc.) and Worker Roles (main() { … }); each role instance runs in a VM with an Agent, coordinated by the Fabric.]
Using queues for reliable messaging (to scale, add more of either role):
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in the queue
3) The Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
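The four-step queue pattern above can be sketched locally with an ordinary in-process queue. This is a minimal Python sketch of the pattern, not the Azure queue API: handle_request and worker_loop are invented stand-ins for the web and worker roles, and doubling a number stands in for real work.

```python
import queue
import threading

# "Web role" enqueues work; independent "worker roles" drain the queue.
work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def handle_request(payload):
    # 1) receive work, 2) put it in the queue
    work_queue.put(payload)

def worker_loop():
    # 3) get work from the queue, 4) do the work
    while True:
        item = work_queue.get()
        if item is None:            # sentinel: shut this worker down
            work_queue.task_done()
            return
        with results_lock:
            results.append(item * 2)   # stand-in for real processing
        work_queue.task_done()

# To scale, add more workers (or more enqueuing front ends).
workers = [threading.Thread(target=worker_loop) for _ in range(4)]
for w in workers:
    w.start()

for n in range(10):
    handle_request(n)

work_queue.join()                   # wait until all work is processed
for _ in workers:
    work_queue.put(None)            # stop the workers
for w in workers:
    w.join()
```

Because the queue decouples the two sides, either side can be scaled independently, and a crashed worker merely leaves its message for another worker, which is the fault-masking property the slide describes.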
A closer look at Storage
[Diagram: applications reach Windows Azure Storage (Blobs, Drives, Tables, Queues) over HTTP through a Load Balancer, via a REST API; Compute and Storage both run on the Fabric.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage (not relational); entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app (at work or home) against the local Development Fabric and Development Storage, with source control for versioning; the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update (for example, a set of nodes to update), used when rolling forward or backward
The developer assigns the number required by each role (for example, 10 front-ends across 5 update domains)
Allocation is across update domains
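The arithmetic behind update domains can be sketched in a few lines. This is an illustrative Python model of the rolling-update idea, not the fabric controller's actual algorithm: instances are spread round-robin across domains, and only one domain is offline at a time, so a known fraction of capacity survives every step of the upgrade.

```python
def assign_update_domains(instances, num_domains):
    """Spread instances round-robin across update domains."""
    domains = [[] for _ in range(num_domains)]
    for i, inst in enumerate(instances):
        domains[i % num_domains].append(inst)
    return domains

def rolling_update(instances, num_domains, new_version):
    """Upgrade one update domain at a time; track worst-case capacity."""
    versions = {inst: "v1" for inst in instances}
    min_live = len(instances)
    for domain in assign_update_domains(instances, num_domains):
        # Take this one domain offline, upgrade it, bring it back.
        live = len(instances) - len(domain)
        min_live = min(min_live, live)
        for inst in domain:
            versions[inst] = new_version
    return versions, min_live

# The slide's example: 10 front ends across 5 update domains.
front_ends = [f"fe{i}" for i in range(10)]
versions, min_live = rolling_update(front_ends, 5, "v2")
```

With 10 front ends in 5 domains of 2, at least 8 instances stay up at every step, which is exactly the availability guarantee the developer buys by choosing the domain count.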
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles: failed roles are automatically restarted, and node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
- FC detects if a role dies
- A role can indicate it is unhealthy
- The current state of the node is updated appropriately, and the state machine kicks in again to drive us back to the goal state
Windows Azure FC monitors the health of the host
- If the node goes offline, FC will try to recover it
- If a failed node can’t be recovered, FC migrates role instances to a new node: a suitable replacement location is found and existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
- Terra (“EOS AM”), launched 12/1999; descending, equator crossing at 10:30 AM
- Aqua (“EOS PM”), launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2,300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram: the AzureMODIS pipeline: a download queue feeds the Data Collection Stage, followed by the Reprojection, Derivation Reduction, and Analysis Reduction stages; research results are exposed through the AzureMODIS Service Web Role portal.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• A small but important group of researchers: 100s of HIV and HepC researchers actively use it, and 1000s of research communities rely on the results
(Cover of PLoS Biology, November 2008)
• A typical job takes 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
  - Requires a large number of test runs for a given job (1-10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to Azure Storage
3. Deploy Worker Roles; each role’s Init() function downloads and decompresses the data to the local disk, alongside the BLAST executable
Step 2. Partitioning a Job
[Diagram: the Web Role takes the user input, writes an input partition to Azure Storage, and sends a queue message to a single partitioning Worker Role.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read the user input and input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.]
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little Cloud development headaches are probably worth it
Resources:

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
[Chart: time-space fungibility in the Cloud, plotting resources against time.]
Utilizes a general jobs-based task manager which registers jobs and their resulting data
[Diagram: a job definition fans out into tasks whose data products are recorded in a registry; a registry broker links the user’s local registry, web management, and highly sensitive data on the user premises (or internet), where the (HPC) cluster administrator operates, with the Azure data centers, and results flow back to the user.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account “jared”
- Container “images”: blobs PIC01.JPG, PIC02.JPG
- Container “movies”: blob MOV1.AVI
Example URL: http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within account “jared”, container “images” holds blobs PIC01.JPG and PIC02.JPG, and container “movies” holds MOV1.AVI; each blob is composed of blocks or pages (Block Id 1 … Block Id N).]
Uploading a 10 GB movie block by block:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

The committed blocks then form TheBlob.wmv in Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
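The commit semantics described above, where PutBlock stages blocks invisibly and PutBlockList atomically swaps in a new readable version, can be modeled in a few lines. This is a local Python simulation of that behavior for illustration, not the Azure SDK; the class and method names are invented.

```python
class BlockBlob:
    """Toy model: readers only ever see the committed block list."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged via PutBlock
        self.committed = {}     # block id -> bytes, visible to readers
        self.block_list = []    # committed ordering

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # A new version may draw blocks from either the uncommitted
        # or the previously committed set, as the slide states.
        new_committed = {}
        for bid in block_ids:
            if bid in self.uncommitted:
                new_committed[bid] = self.uncommitted[bid]
            else:
                new_committed[bid] = self.committed[bid]
        self.committed = new_committed
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""            # nothing visible before commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"
# Replace one block: stage it, then recommit, reusing committed "b1".
blob.put_block("b2", b"azure")
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello azure"
```

The key property the model shows is that a half-finished upload is never readable: the blob flips from one complete version to the next only at PutBlockList time.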
Create MyBlob: specify blob size = 10 GB, fixed page size = 512 bytes

Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob [1000, 2048) returns all 0s for the first 536 bytes, then the 512 bytes of data stored in [1536,2048)
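The page-blob trace above can be checked with a small sparse-page model: writes land on 512-byte-aligned pages, clears remove them, and reads return zeros for unwritten bytes. This Python sketch reproduces the slide's example locally; it is not the Azure API, and the class is invented for illustration.

```python
PAGE = 512

class PageBlob:
    def __init__(self, size):
        self.size = size
        self.pages = {}   # page-aligned offset -> bytes of length PAGE

    def put_page(self, start, end, fill=b"\x01"):
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self, start, end):
        """Report contiguous runs of valid (written) pages."""
        ranges, run_start = [], None
        for off in range(start, end, PAGE):
            if off in self.pages and run_start is None:
                run_start = off
            if off not in self.pages and run_start is not None:
                ranges.append((run_start, off))
                run_start = None
        if run_start is not None:
            ranges.append((run_start, end))
        return ranges

    def get_blob(self, start, end):
        """Read bytes; unwritten regions come back as zeros."""
        out = bytearray()
        for off in range(start, end):
            page = self.pages.get(off - off % PAGE)
            out.append(page[off % PAGE] if page else 0)
        return bytes(out)

blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)

assert blob.get_page_ranges(0, 4096) == [(0, 512), (1536, 2560)]
data = blob.get_blob(1000, 2048)
assert data[:536] == bytes(536)          # zeros for the cleared gap
assert all(b != 0 for b in data[536:])   # 512 bytes of stored data
```

Walking through it: the two PutPage calls make [0, 2048) valid, ClearPage punches out [512, 1536), and the final PutPage adds [2048, 2560), leaving exactly the two ranges the slide reports.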
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion: a snapshot of MyBlob can be promoted
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
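The table model above, entities as property bags addressed by (PartitionKey, RowKey), has a direct consequence the best-practices section returns to: those two keys are the only index. A toy in-memory Python model makes the cost difference visible; the class and the sample job entities are invented for illustration, and this is not the ADO.NET Data Services API.

```python
class Table:
    def __init__(self):
        self.entities = {}   # (partition_key, row_key) -> properties

    def insert(self, partition_key, row_key, **properties):
        self.entities[(partition_key, row_key)] = properties

    def get(self, partition_key, row_key):
        # Point lookup: cheap, uses the only index
        return self.entities[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # Scan within one partition: still index-friendly
        return {k: v for k, v in self.entities.items()
                if k[0] == partition_key}

    def query_property(self, name, value):
        # Filtering on any other property is a full table scan
        return {k: v for k, v in self.entities.items()
                if v.get(name) == value}

jobs = Table()
jobs.insert("blast", "job-001", status="done", hours=12)
jobs.insert("blast", "job-002", status="running", hours=3)
jobs.insert("modis", "job-001", status="done", hours=40)

assert jobs.get("blast", "job-002")["status"] == "running"
assert len(jobs.query_partition("blast")) == 2
assert len(jobs.query_property("status", "done")) == 2
```

Choosing PartitionKey and RowKey so that the common queries become point lookups or single-partition scans is therefore the main schema-design decision.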
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
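The first item under Testing & Development, include retry logic everywhere you touch data, usually means exponential backoff with jitter. This Python sketch shows the shape of such a wrapper; flaky_read is an invented stand-in for a real storage call, and the policy numbers are illustrative, not Azure defaults.

```python
import random

def with_retries(fn, attempts=5, base_delay=0.5):
    """Call fn, retrying transient IOErrors with exponential backoff."""
    delays = []
    for attempt in range(attempts):
        try:
            return fn(), delays
        except IOError:
            if attempt == attempts - 1:
                raise                 # out of retries: surface the fault
            # Exponential backoff plus jitter to avoid thundering herds.
            delays.append(base_delay * (2 ** attempt)
                          + random.uniform(0, base_delay))
            # time.sleep(delays[-1]) in real code; skipped in this sketch
    raise RuntimeError("unreachable")

calls = {"n": 0}

def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:                # first two calls fail transiently
        raise IOError("transient storage fault")
    return b"payload"

result, delays = with_retries(flaky_read)
assert result == b"payload"
assert calls["n"] == 3                # succeeded on the third try
assert len(delays) == 2               # backed off twice before success
```

Pairing this with idempotent workers (the "execute a task only once" advice above) means a retried call that actually succeeded the first time does no harm.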
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 7
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center.
Technology       Cost in small-sized DC     Cost in large DC       Ratio
Network          $95 per Mbps/month         $13 per Mbps/month     7.1
Storage          $2.20 per GB/month         $0.40 per GB/month     5.7
Administration   ~140 servers/admin         >1000 servers/admin    7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the node
o Communication fabric
o Storage systems
  - HPC: local scratch small or non-existent, secondary is SAN or PFS, PB tertiary storage
  - DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
A Web Role (IIS hosting ASP.NET, WCF, etc.) receives HTTP requests through the Load Balancer, while a Worker Role runs application code (main() { … }); each role instance runs in a VM alongside an Agent, all managed by the Fabric.
Using queues for reliable messaging
To scale, add more of either
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in a queue
3) A Worker Role (main() { … }) gets the work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
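The queue-as-glue pattern above can be sketched in miniature. This in-memory model uses hypothetical names, not the Azure queue API, but it shows the mechanism that masks worker faults: a dequeued message is hidden rather than removed, and if the worker crashes before deleting it, the message reappears after its visibility timeout.

```python
import heapq

class Queue:
    """Toy model of a cloud queue with visibility timeouts."""
    def __init__(self):
        self._now = 0.0
        self._heap = []            # (visible_at, id, body)
        self._next_id = 0

    def put(self, body):
        heapq.heappush(self._heap, (self._now, self._next_id, body))
        self._next_id += 1

    def get(self, visibility_timeout=30.0):
        """Dequeue the next visible message; it stays in the queue,
        hidden until the timeout, and must be deleted after processing."""
        if not self._heap or self._heap[0][0] > self._now:
            return None
        visible_at, mid, body = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (self._now + visibility_timeout, mid, body))
        return mid, body

    def delete(self, mid):
        self._heap = [e for e in self._heap if e[1] != mid]
        heapq.heapify(self._heap)

    def advance(self, seconds):
        """Simulated clock, standing in for real elapsed time."""
        self._now += seconds

q = Queue()
q.put("resize PIC01.JPG")
mid, body = q.get()        # worker A takes the message, then crashes
q.advance(31)              # visibility timeout elapses
mid2, body2 = q.get()      # worker B sees the same message again
q.delete(mid2)             # work done: now it is removed for good
```

This is why worker roles should be idempotent: reliable messaging here means at-least-once delivery, not exactly-once.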
A closer look at Storage: the application reaches Blobs, Drives, Tables, and Queues over HTTP through a REST API behind the Load Balancer; both Compute and Storage run on the shared Fabric.
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Develop your app locally (at work or at home) against the Development Fabric and Development Storage, keeping it under source control for local versioning; once the application works locally, verify that it works in staging in the cloud.
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
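The update-domain example (10 front-ends across 5 update domains) can be worked through in code. This is a sketch under the assumption that instances are placed round-robin and the fabric upgrades one update domain at a time:

```python
def assign_update_domains(instances, domains):
    """Round-robin instance placement across update domains."""
    return {i: i % domains for i in range(instances)}

def rolling_upgrade(instances=10, domains=5):
    """Upgrade one update domain at a time; return the minimum number
    of instances that remained up at any point during the upgrade."""
    placement = assign_update_domains(instances, domains)
    min_up = instances
    for d in range(domains):
        down = sum(1 for ud in placement.values() if ud == d)
        min_up = min(min_up, instances - down)
    return min_up

# 10 front-ends in 5 update domains: at most 2 are down at once,
# so at least 8 stay up throughout the rolling upgrade.
```

More update domains mean higher availability during upgrades but a longer rollout, which is exactly the trade-off the developer controls with that number.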
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
The AzureMODIS Service Web Role Portal drives a download queue and a staged pipeline (Data Collection Stage, Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage) that produces the research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100s of HIV and HepC researchers actively use it
  - 1000s of research communities rely on its results
Cover of PLoS Biology, November 2008
Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  - Requires a large number of test runs for a given job (1–10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure Store
3. Deploy Worker Roles with the BLAST executable
   - The Init() function downloads and decompresses the data to the local disk
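The staging steps can be sketched end to end. This is an illustrative round trip: gzip stands in for the compression step, a dict stands in for Azure blob storage, and the `stage`/`worker_init` names are hypothetical, echoing the deck's Init() function.

```python
import gzip

blob_store = {}   # stands in for Azure Storage

def stage(name, data: bytes):
    """Steps 1-2: compress the sequence database and upload it."""
    blob_store[name] = gzip.compress(data)

def worker_init(name) -> bytes:
    """Step 3, Init(): each worker downloads the compressed blob and
    decompresses it to its local disk."""
    return gzip.decompress(blob_store[name])

db = b">seq1\nACGTACGT\n" * 10_000
stage("seqdb.gz", db)
local_copy = worker_init("seqdb.gz")
```

Compressing before upload matters because ingress bandwidth and storage transactions are both billed; the highly redundant sequence data compresses well.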
Step 2. Partitioning a Job
The Web Role takes the user input, and a single partitioning Worker Role splits it into input partitions in Azure Storage, posting a queue message for each partition.
Step 3. Doing the Work
BLAST-ready Worker Roles pick up the queue messages, read their input partitions, and write BLAST output and logs back to Azure Storage.
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
Time-space fungibility in the Cloud: trade more workers for less wall-clock time, or fewer workers for more time.
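The run times above illustrate time-space fungibility; as a quick check (an illustrative calculation, not from the deck), wall-clock time drops about 7x going from 2 to 25 workers, at the cost of somewhat more total machine time, which is the overhead of parallelism:

```python
def to_minutes(hms):
    """Convert an h:mm:ss duration string to minutes."""
    h, m, s = map(int, hms.split(":"))
    return h * 60 + m + s / 60

clock = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00",
         4: "0:47:00", 2: "1:27:00"}

# Speedup of 25 workers over 2 workers on the same job:
speedup = to_minutes(clock[2]) / to_minutes(clock[25])   # 87 / 12 = 7.25

# Machine time consumed (workers x wall clock), in minutes:
machine_minutes = {w: w * to_minutes(t) for w, t in clock.items()}
```

Since cloud cores are billed by the core hour, the machine-time column is roughly the cost column: the 25-worker run finishes in 12 minutes but burns 300 core minutes versus 174 for the 2-worker run.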
Utilizes a general job-based task manager, which registers jobs and their resulting data products; a job definition is broken into tasks tracked in a registry.
A registry broker connects the user premises (or internet) with the Azure datacenters: highly sensitive data stays in a local registry on the user's (HPC) cluster, managed by an administrator, while users reach their results through web management.
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal, then:
  - Make the best use of the capabilities of client and cloud computing
  - Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
The user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account
“US Anywhere”, “US North Central”, “US South Central”, etc.
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example: account “jared” contains the containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate metadata with the container
Metadata are name/value pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
The same hierarchy, one level deeper: each blob in the account (e.g., images/PIC01.JPG) is made up of blocks or pages, Block Id 1 through Block Id N for a block blob, or Page 1, Page 2, Page 3, … for a page blob.
Uploading a 10 GB movie to Windows Azure Storage, block by block:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
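The PutBlock / PutBlockList semantics can be modeled in a few lines. This is an illustrative in-memory sketch, not the real REST API: uncommitted blocks become readable only once the block list is committed, and a later commit can mix previously committed blocks with fresh uncommitted ones.

```python
class BlockBlob:
    """Toy model of a block blob's two-phase upload."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes
        self.committed = {}     # block_id -> bytes
        self.block_list = []    # committed order

    def put_block(self, block_id, data):
        """PutBlock: stage an uncommitted block."""
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        """PutBlockList: the readable blob becomes these blocks, in order.
        Blocks may come from the uncommitted or the committed set."""
        new = {}
        for bid in block_ids:
            new[bid] = self.uncommitted.get(bid, self.committed.get(bid))
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted = {}

    def read(self):
        """GetBlob: concatenate the committed blocks."""
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"AAAA")
blob.put_block("b2", b"BBBB")
# nothing readable yet: the blocks are uncommitted
blob.put_block_list(["b1", "b2"])
```

This two-phase design is what makes parallel, out-of-order uploads safe: readers never see a half-uploaded blob.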
Create MyBlob
Specify blob size = 10 GBytes (a 10 GB address space)
Fixed page size = 512 bytes
Random access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512) and [1536, 2560)
GetBlob [1000, 2048) returns all 0s for the first 536 bytes, then the 512 bytes of data stored in [1536, 2048)
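The worked example above can be checked mechanically. This sketch models a page blob as a sparse set of valid 512-byte pages (illustrative only, not the storage service; it assumes page-aligned offsets, as in the example):

```python
PAGE = 512

class PageBlob:
    """Toy model of a sparse page blob: unwritten pages read as zeros."""
    def __init__(self, size):
        self.size = size
        self.pages = {}   # page index -> page contents

    def put_page(self, start, end, fill=b"\x01"):
        """PutPage over a page-aligned byte range [start, end)."""
        for p in range(start // PAGE, end // PAGE):
            self.pages[p] = fill * PAGE

    def clear_page(self, start, end):
        """ClearPage: the range reverts to zeros and stops being 'valid'."""
        for p in range(start // PAGE, end // PAGE):
            self.pages.pop(p, None)

    def get_page_ranges(self):
        """GetPageRange: return the valid (written) byte ranges, merged."""
        ranges = []
        for p in sorted(self.pages):
            if ranges and ranges[-1][1] == p * PAGE:
                ranges[-1][1] = (p + 1) * PAGE
            else:
                ranges.append([p * PAGE, (p + 1) * PAGE])
        return [tuple(r) for r in ranges]

# Replay the slide's operations in order:
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
```

Replaying the four operations reproduces exactly the valid ranges the example lists, [0, 512) and [1536, 2560).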
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
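The required properties above determine how a table is addressed. This illustrative sketch (plain Python, not ADO.NET Data Services) shows why lookups by (PartitionKey, RowKey) are fast while any other filter is a scan, which is also the earlier best-practice warning that tables index only on those two keys:

```python
import time

class Table:
    """Toy model: entities indexed only by (PartitionKey, RowKey)."""
    def __init__(self):
        self.partitions = {}   # PartitionKey -> {RowKey -> entity}

    def insert(self, entity):
        """Store an entity; Timestamp is filled in by the store."""
        part = self.partitions.setdefault(entity["PartitionKey"], {})
        part[entity["RowKey"]] = {**entity, "Timestamp": time.time()}

    def point_query(self, pk, rk):
        """Fast path: two dictionary lookups."""
        return self.partitions[pk][rk]

    def scan(self, predicate):
        """Any other filter touches every entity in every partition."""
        return [e for part in self.partitions.values()
                  for e in part.values() if predicate(e)]

t = Table()
t.insert({"PartitionKey": "HIV", "RowKey": "job-001", "cpu_hours": 12})
t.insert({"PartitionKey": "HIV", "RowKey": "job-002", "cpu_hours": 18})
t.insert({"PartitionKey": "HepC", "RowKey": "job-001", "cpu_hours": 9})
```

The partition key also governs scale-out: entities sharing a PartitionKey stay together, so choosing it well spreads traffic across the thousands of servers the store can grow onto.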
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 9
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the account/container namespace, each blob (PIC01.JPG, PIC02.JPG, MOV1.AVI) is composed of a sequence of blocks or pages — Block ID 1 through Block ID N, or Page 1 through Page N]
Example: uploading a 10 GB movie to Windows Azure Storage as the block blob TheBlob.wmv:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a different size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
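The PutBlock / PutBlockList semantics above can be modeled with two lists: uncommitted blocks staged by PutBlock, and the committed list that defines the readable blob. A toy in-memory sketch (illustrative only — the real service is reached via REST or the Storage Client Library):

```python
class BlockBlob:
    """Toy model of block-blob commit semantics (not the real service)."""
    def __init__(self):
        self.uncommitted = {}  # block id -> bytes, staged by PutBlock
        self.committed = {}    # block id -> bytes, the readable version
        self.order = []        # committed block ids, in list order

    def put_block(self, block_id, data):
        # Puts an uncommitted block; not readable until committed.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit a readable version; blocks may come from either the
        # uncommitted or the previously committed list.
        self.committed = {bid: self.uncommitted.get(bid, self.committed.get(bid))
                          for bid in block_ids}
        self.order = list(block_ids)
        self.uncommitted = {}

    def get_block_list(self):
        # Returns (committed ids, uncommitted ids), as GetBlockList does.
        return self.order, sorted(self.uncommitted)

    def read(self):
        # The readable blob is the committed blocks in list order.
        return b"".join(self.committed[bid] for bid in self.order)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""        # nothing is readable before the commit
blob.put_block_list(["b1", "b2"])
```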
Example: create MyBlob with Blob Size = 10 GB and a fixed page size of 512 bytes, giving a 10 GB address space for random-access operations. Apply, in order:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) then returns the valid data ranges [0, 512) and [1536, 2560).
GetBlob [1000, 2048) returns all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536, 2048).
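The valid-range arithmetic in that example can be checked with a small simulation that tracks which 512-byte pages hold data (a toy model, not the service itself):

```python
PAGE = 512  # fixed page size from the example

def valid_ranges_after_ops(size_pages):
    """Replay the PutPage / ClearPage sequence from the example and
    report the valid byte ranges, as GetPageRange would."""
    valid = [False] * size_pages          # which pages currently hold data

    def put(start, end):                  # PutPage [start, end)
        for p in range(start // PAGE, end // PAGE):
            valid[p] = True

    def clear(start, end):                # ClearPage [start, end)
        for p in range(start // PAGE, end // PAGE):
            valid[p] = False

    put(512, 2048); put(0, 1024); clear(512, 1536); put(2048, 2560)

    # Collapse the page flags back into contiguous byte ranges
    ranges, start = [], None
    for p, v in enumerate(valid + [False]):
        if v and start is None:
            start = p * PAGE
        elif not v and start is not None:
            ranges.append((start, p * PAGE))
            start = None
    return ranges

# GetPageRange[0, 4096) covers the first 8 pages:
result = valid_ranges_after_ops(8)   # [(0, 512), (1536, 2560)]
```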
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore MyBlob to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and easy-to-use APIs
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
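Since a table indexes only on PartitionKey and RowKey, a lookup by that pair is direct while any other predicate must scan entities. A toy illustration of the difference (not the real table service; entity shapes are made up):

```python
# Toy table: the only index is the (PartitionKey, RowKey) pair.
table = {}

def insert(partition_key, row_key, properties):
    table[(partition_key, row_key)] = properties

def point_lookup(partition_key, row_key):
    # Indexed path: direct access on the composite key.
    return table[(partition_key, row_key)]

def scan(predicate):
    # A filter on any non-key property must examine every entity.
    return [props for props in table.values() if predicate(props)]

insert("images", "PIC01.JPG", {"size_mb": 2})
insert("movies", "MOV1.AVI", {"size_mb": 700})
big = scan(lambda p: p["size_mb"] > 100)
```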
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
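The advice to batch multiple small tasks into a single queue message amounts to paying one storage transaction for many tasks. A hedged sketch of the encode/decode step (task shapes and the 8 KB message bound are assumptions for illustration; check the current queue message size limit):

```python
import json

def pack_tasks(tasks, max_bytes=8192):
    """Group small task descriptions into batched queue-message bodies,
    filling each message up to the assumed size bound."""
    batches, current = [], []
    for task in tasks:
        candidate = current + [task]
        if len(json.dumps(candidate).encode()) > max_bytes and current:
            batches.append(json.dumps(current))   # flush the full batch
            current = [task]
        else:
            current = candidate
    if current:
        batches.append(json.dumps(current))
    return batches

def unpack_tasks(message_body):
    return json.loads(message_body)

# 100 tiny tasks fit in one message: 1 transaction instead of 100.
msgs = pack_tasks([{"seq": i} for i in range(100)])
```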
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
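Retry logic matters because transient storage failures are normal at scale, not exceptional. A generic retry-with-backoff wrapper as a sketch (the Storage Client Library ships its own retry policies; this only illustrates the pattern):

```python
import time

def with_retries(operation, attempts=4, base_delay=0.1):
    """Retry a zero-argument storage operation with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                    # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Illustration with a flaky operation that succeeds on the third try:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "data"

result = with_retries(flaky, base_delay=0.01)
```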
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What
are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 11
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
AzureMODIS pipeline: Portal → AzureMODIS Service Web Role → Download Queue → Data Collection Stage → Reprojection Stage → Derivation Reduction Stage → Analysis Reduction Stage → Research Results
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• A small but important group of researchers: 100’s of HIV and HepC researchers actively use it, and 1000’s of research communities rely on its results
Cover of PLoS Biology, November 2008
• A typical job takes 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
• Requires a large number of test runs for a given job (1 – 10M tests)
• Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure storage
3. Deploy worker roles, whose Init() function downloads and decompresses the data, along with the BLAST executable, to the local disk
Step 2. Partitioning a Job
A web role accepts the user input; a single partitioning worker role splits it into input partitions in Azure storage and posts a queue message for each partition.
Step 3. Doing the Work
BLAST-ready worker roles pick up the queue messages, read their input partitions from Azure storage, and write BLAST output and logs back to Azure storage.
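The partition/queue/worker pattern of Steps 2 and 3 can be sketched as an in-process simulation, with Python's standard `queue.Queue` standing in for the Azure queue. The function names and message layout are illustrative assumptions, and counting sequences stands in for actually running BLAST:

```python
from queue import Queue

def partition_job(user_input, partition_size):
    """Single partitioning role: split the input into fixed-size partitions."""
    return [user_input[i:i + partition_size]
            for i in range(0, len(user_input), partition_size)]

def enqueue_partitions(partitions, queue):
    """Web role: post one queue message per input partition."""
    for index, partition in enumerate(partitions):
        queue.put({"partition_id": index, "sequences": partition})

def worker(queue, results):
    """BLAST-ready worker role: drain messages and record output."""
    while not queue.empty():
        message = queue.get()
        # Stand-in for running BLAST against the local sequence database.
        results[message["partition_id"]] = len(message["sequences"])

queue, results = Queue(), {}
sequences = [f"seq-{i}" for i in range(10)]
enqueue_partitions(partition_job(sequences, 3), queue)
worker(queue, results)
print(results)  # {0: 3, 1: 3, 2: 3, 3: 1}
```

Because each partition travels as its own message, adding workers that drain the same queue scales the job out with no coordination beyond the queue itself.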
• Always design with failure in mind: on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts; the optimal size may change depending on the scope of the job
• Test runs are your friend: blowing $20,000 of computation is not a good idea
• Make ample use of logging features: when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! Little cloud development headaches are probably worth it
Resources

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13

Time-space fungibility in the cloud: total computational cost stays roughly constant as workers are added, while wall-clock duration shrinks almost in proportion.
Utilizes a general job-based task manager, which registers jobs and their resulting data products.
Diagram: a job definition fans out into tasks executed on an (HPC) cluster or in the Azure datacenters; a registry broker links the local registry on the user premises (or internet), where highly sensitive data stays with the user and the (HPC) cluster administrator, to the cloud; web management returns results.
Client Visualization / Cloud Data and Computation
The cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
The user creates a globally unique storage account name
Can choose the geo-location to host the storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example namespace: the account jared holds the containers images (PIC01.JPG, PIC02.JPG) and movies (MOV1.AVI); a blob is addressed as
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Privately or publicly accessible
Associate metadata with a container
Metadata are name/value pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Each blob in a container consists of blocks (Block Id 1, Block Id 2, …, Block Id N) or pages (Page 1, Page 2, Page 3, …).
Example: uploading a 10 GB movie to Windows Azure storage as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
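The commit semantics described above can be sketched with an in-memory model. This is an illustrative simulation of the PutBlock/PutBlockList behavior, not the Azure storage implementation; the class and method names are assumptions:

```python
class BlockBlob:
    """In-memory sketch of block-blob commit semantics (illustrative only)."""

    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, invisible to readers
        self.committed = {}     # block_id -> bytes, the readable version
        self.block_list = []    # committed order of block ids

    def put_block(self, block_id, data):
        # Uploaded blocks stay uncommitted until a block list commits them.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: listed blocks (from the uncommitted or committed list)
        # become the readable version of the blob, in the given order.
        new_committed = {}
        for block_id in block_ids:
            data = self.uncommitted.get(block_id, self.committed.get(block_id))
            if data is None:
                raise KeyError(f"unknown block {block_id}")
            new_committed[block_id] = data
        self.committed, self.block_list = new_committed, list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[b] for b in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
print(blob.read())                  # b'' - nothing visible before commit
blob.put_block_list(["b1", "b2"])
print(blob.read())                  # b'hello world'
```

The two-phase shape is what makes parallel upload safe: blocks can arrive in any order, and readers only ever see a fully committed block list.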
Create MyBlob
Specify blob size = 10 GB, fixed page size = 512 bytes
Random-access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512) and [1536, 2560)
GetBlob [1000, 2048) returns all zeros for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
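The worked example above can be reproduced with a small model that tracks which 512-byte pages hold valid data. This is an illustrative simulation of page-blob semantics, not the storage service itself:

```python
class PageBlob:
    """Sketch of page-blob semantics at 512-byte page granularity
    (illustrative model only)."""

    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.valid = set()  # page numbers that currently hold written data

    def put_page(self, start, end):
        # Writes apply immediately; [start, end) must be page-aligned.
        for page in range(start // self.PAGE, end // self.PAGE):
            self.valid.add(page)

    def clear_page(self, start, end):
        for page in range(start // self.PAGE, end // self.PAGE):
            self.valid.discard(page)

    def get_page_ranges(self, start, end):
        """Return the contiguous [start, end) byte ranges of valid data."""
        ranges, run_start = [], None
        for page in range(start // self.PAGE, end // self.PAGE):
            if page in self.valid and run_start is None:
                run_start = page * self.PAGE
            elif page not in self.valid and run_start is not None:
                ranges.append((run_start, page * self.PAGE))
                run_start = None
        if run_start is not None:
            ranges.append((run_start, (end // self.PAGE) * self.PAGE))
        return ranges

blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges(0, 4096))  # [(0, 512), (1536, 2560)]
```

Replaying the four operations yields exactly the valid ranges from the slide, [0, 512) and [1536, 2560): the second PutPage re-covers [512, 1024), the ClearPage then removes [512, 1536), and the final PutPage adds [2048, 2560).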
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and easy-to-use APIs
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
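The entity model can be sketched as a two-level map keyed by (PartitionKey, RowKey). This is an illustrative in-memory model of the data model only, not the table service; the method and property names beyond the three required properties are assumptions:

```python
from datetime import datetime, timezone

class Table:
    """Sketch of the table data model: entities are property bags
    addressed by (PartitionKey, RowKey). Illustrative model only."""

    def __init__(self):
        self.partitions = {}  # partition_key -> {row_key -> entity}

    def insert(self, partition_key, row_key, **properties):
        entity = {"PartitionKey": partition_key, "RowKey": row_key,
                  "Timestamp": datetime.now(timezone.utc), **properties}
        self.partitions.setdefault(partition_key, {})[row_key] = entity
        return entity

    def get(self, partition_key, row_key):
        # Point lookup: the key pair is the only indexed access path.
        return self.partitions[partition_key][row_key]

    def query_partition(self, partition_key):
        # Scanning one partition is cheap; filtering on any other
        # property means scanning every entity examined.
        return list(self.partitions.get(partition_key, {}).values())

table = Table()
table.insert("experiment-42", "run-001", status="done", score=0.93)
table.insert("experiment-42", "run-002", status="queued")
print(table.get("experiment-42", "run-001")["status"])  # done
print(len(table.query_partition("experiment-42")))      # 2
```

The shape of the model explains the best-practice warning below: only PartitionKey and RowKey are indexed, so queries should be designed around them.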
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember that Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing the VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs (note that you lose durable messaging when you do this)
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use the built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
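The retry-logic advice above is commonly implemented as exponential backoff with jitter. The sketch below is one illustrative policy under assumed parameters (4 attempts, doubling delays), not a prescribed Azure mechanism, and the flaky call is a stand-in for a real storage request:

```python
import random
import time

def with_retries(operation, attempts=4, base_delay=0.1):
    """Retry a data-access call with exponential backoff plus jitter
    (illustrative policy; tune attempts and delays for your workload)."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the transient error
            # Back off 0.1s, 0.2s, 0.4s, ... plus up to 50% jitter,
            # so many retrying workers do not hammer storage in lockstep.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay * (1 + random.random() / 2))

# Simulated flaky storage call that succeeds on the third try.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "blob contents"

print(with_retries(flaky_read, base_delay=0.01))  # blob contents
```

Pairing this with the "execute a task only once" rule matters: retried operations should be idempotent, since a retry may rerun work whose first attempt actually succeeded.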
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 13
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch small or non-existent, secondary is SAN or PFS, PB tertiary storage
  - DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local storage 2+ TB
The Azure Platform provides Compute and Storage.
A closer look at Compute: [Diagram] HTTP requests pass through a load balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (main() { … }); each role instance runs in a VM with an agent, all managed by the Fabric.
Using queues for reliable messaging (to scale, add more of either role):
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
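The numbered queue pattern above can be sketched with Python's standard-library queue standing in for an Azure queue; the role names and the squaring "work" are illustrative, not Azure APIs:

```python
import queue
import threading

# Stand-in for an Azure queue: durable, decoupled message passing.
work_queue = queue.Queue()
results = []

def web_role(items):
    # 1) Receive work, 2) put work in the queue.
    for item in items:
        work_queue.put(item)

def worker_role():
    # 3) Get work from the queue, 4) do the work.
    while True:
        item = work_queue.get()
        if item is None:            # sentinel: no more work
            break
        results.append(item * item)
        work_queue.task_done()

workers = [threading.Thread(target=worker_role) for _ in range(4)]
for w in workers:
    w.start()

web_role(range(10))                 # front end enqueues work
for _ in workers:                   # one sentinel per worker
    work_queue.put(None)
for w in workers:
    w.join()

print(sorted(results))              # [0, 1, 4, 9, ..., 81]
```

Because the queue decouples the two roles, either side can be scaled out independently, which is exactly the point the slide makes.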
A closer look at Storage: [Diagram] Applications reach Blobs, Drives, Tables, and Queues through a load balancer via a REST API over HTTP; Storage, like Compute, runs on the Fabric.
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram] Development workflow: develop your app (at work or at home) against the local Development Fabric and Development Storage, keep it under source control, confirm the application works locally, then confirm it works in cloud staging.
What is the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update
Example: a set of nodes to update
Used when rolling forward or backward
Developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
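A sketch of how the 10-front-ends-across-5-update-domains example might be allocated; instance names and the round-robin policy are illustrative, not how the Fabric Controller actually works:

```python
# Hypothetical round-robin allocation of 10 front-end instances
# across 5 update domains.
INSTANCES = 10
UPDATE_DOMAINS = 5

allocation = {d: [] for d in range(UPDATE_DOMAINS)}
for i in range(INSTANCES):
    allocation[i % UPDATE_DOMAINS].append(f"frontend-{i}")

# A rolling update walks the domains one at a time; the other
# domains keep serving traffic, so the service stays up.
for domain in sorted(allocation):
    updating = allocation[domain]
    print(f"updating domain {domain}: {updating}, "
          f"{INSTANCES - len(updating)} instances still serving")
```

With 5 update domains, a rolling upgrade takes down at most 2 of the 10 instances at a time.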
Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram] AzureMODIS pipeline: the AzureMODIS Service Web Role Portal drives a download queue through a Data Collection Stage, Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage to produce research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  – 100’s of HIV and HepC researchers actively use it
  – 1000’s of research communities rely on the results
• Cover of PLoS Biology, November 2008
• Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  – Requires a large number of test runs for a given job (1 – 10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure store
3. Deploy Worker Roles – the Init() function downloads and decompresses the data, along with the BLAST executable, to the local disk
Step 2. Partitioning a Job
[Diagram] The Web Role takes the user input; a single partitioning Worker Role writes input partitions to Azure storage and posts queue messages.
Step 3. Doing the Work
[Diagram] BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure storage, and write BLAST output and logs back to Azure storage.
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Time-space fungibility in the Cloud: the same computation can trade resources (workers) for wall-clock time.

Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
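The trade-off in the table can be made concrete; a small sketch computing the speedup from the clock durations above (the total computational work stays roughly constant while wall-clock time shrinks):

```python
# Time-space fungibility: the same job finishes faster with more
# workers. Clock durations are taken from the table above.
def to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}

for workers in sorted(runs):
    secs = to_seconds(runs[workers])
    speedup = to_seconds(runs[2]) / secs
    print(f"{workers:>2} workers: {secs:>5}s wall clock, "
          f"{speedup:.2f}x faster than 2 workers")
```

Going from 2 to 25 workers cuts the wall clock from 5,220 s to 720 s, a 7.25x speedup, while the computational run time column barely moves.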
Utilizes a general job-based task manager which registers jobs and their resulting data.
[Diagram] On the user premises (or internet): the user, an administrator, a local registry, an (HPC) cluster, and highly sensitive data. A registry broker bridges to the Azure data centers, which hold a registry, job definitions broken into tasks, and data products exposed through web management; results flow back to the user.
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Example] Account jared contains the container images (blobs PIC01.JPG, PIC02.JPG) and the container movies (blob MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram] Each blob within a container consists of blocks or pages (Block/Page 1, Block/Page 2, Block/Page 3, …), with each block identified by a Block ID (Block Id 1 … Block Id N).
Uploading a 10 GB movie block by block:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

The committed blob TheBlob.wmv is then readable from Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
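The PutBlock/PutBlockList commit semantics can be modeled in a few lines. This is an illustrative in-memory sketch, not the Azure storage API; the class and method names are invented for the example:

```python
# Model of block-blob semantics: PutBlock stages uncommitted blocks;
# PutBlockList makes a chosen sequence of blocks the readable version.
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged but not visible
        self.committed = {}     # block id -> bytes, part of a committed version
        self.block_list = []    # ordered ids forming the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Blocks may come from the uncommitted or the committed list.
        merged = {**self.committed, **self.uncommitted}
        self.committed = {bid: merged[bid] for bid in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""                  # nothing readable until committed
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"       # commit makes the blocks visible
```

The key behavior the sketch captures: uploads are invisible until PutBlockList commits them, which is what makes the streaming upload pattern safe to retry.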
Create MyBlob: specify blob size = 10 GBytes with a fixed page size of 512 bytes (a 10 GB address space).
Random access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512) and [1536,2560)
GetBlob[1000, 2048) returns all 0s for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
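The page operations can be simulated with a sparse in-memory model using the same half-open [start, end) ranges. This is an illustrative sketch, not the real service; the class and constant names are invented:

```python
# Model of page-blob semantics: a sparse dict of 512-byte pages.
PAGE = 512

class PageBlob:
    def __init__(self, size):
        self.size = size
        self.pages = {}                      # page offset -> page bytes

    def put_page(self, start, end, fill=b"x"):
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)        # cleared pages read back as zeros

    def get_page_ranges(self):
        # Coalesce adjacent written pages into [start, end) ranges.
        offsets = sorted(self.pages)
        ranges, i = [], 0
        while i < len(offsets):
            start = offsets[i]
            while i + 1 < len(offsets) and offsets[i + 1] == offsets[i] + PAGE:
                i += 1
            ranges.append((start, offsets[i] + PAGE))
            i += 1
        return ranges

blob = PageBlob(10 * 2**30)                  # 10 GB address space
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
assert blob.get_page_ranges() == [(0, 512), (1536, 2560)]
```

Replaying the four operations from the example reproduces exactly the valid data ranges the slide gives for GetPageRange[0, 4096).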
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
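The entity model can be sketched as a dictionary keyed by (PartitionKey, RowKey). This is an illustrative model, not the Azure Table API; it shows why lookups on the required keys are cheap while anything else is a scan:

```python
import time

# Sketch of Azure Table semantics: entities are property bags
# addressed by (PartitionKey, RowKey); only those keys are indexed.
table = {}   # (partition_key, row_key) -> entity

def insert(entity):
    key = (entity["PartitionKey"], entity["RowKey"])
    entity["Timestamp"] = time.time()        # maintained by the store
    table[key] = entity

insert({"PartitionKey": "barga", "RowKey": "job-001", "Status": "done"})
insert({"PartitionKey": "barga", "RowKey": "job-002", "Status": "running"})

# Point lookup on the indexed keys is cheap...
assert table[("barga", "job-001")]["Status"] == "done"
# ...while filtering on any other property is a full scan.
running = [e for e in table.values() if e["Status"] == "running"]
assert len(running) == 1
```

This is also the basis of the best-practice note later in the deck: design partition and row keys around your dominant queries, because nothing else is indexed.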
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
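The retry advice above can be sketched as a small wrapper with exponential backoff; `flaky_fetch` is a made-up stand-in for any storage call that can fail transiently, not a real API:

```python
import time

def with_retries(fn, attempts=5, base_delay=0.01):
    # Retry fn on transient failures, doubling the delay each attempt.
    for attempt in range(attempts):
        try:
            return fn()
        except IOError:
            if attempt == attempts - 1:
                raise                        # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}
def flaky_fetch():
    # Simulated storage access: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return b"blob bytes"

assert with_retries(flaky_fetch) == b"blob bytes"
assert calls["n"] == 3
```

In a real role you would wrap every storage access this way, since in a large deployment transient faults are expected rather than exceptional.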
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 14
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology       Cost in small-sized DC    Cost in large DC        Ratio
Network          $95 per Mbps/month        $13 per Mbps/month      7.1
Storage          $2.20 per GB/month        $0.40 per GB/month      5.7
Administration   ~140 servers/admin        >1000 servers/admin     7.1
Each data center is 11.5 times the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  - DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform: Compute and Storage
A closer look: Compute
[Diagram: HTTP requests pass through a load balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } loop); each role instance runs in a VM with an agent, managed by the fabric.]
Using queues for reliable messaging
To scale, add more of either role:
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in queue
3) Worker Role (main() { … }) gets work from queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
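The reliable-messaging behavior behind that advice is that a dequeued message is hidden, not removed; it reappears after a visibility timeout unless the worker deletes it. An illustrative Python model of that contract (not the Azure queue client; timeout values are made up):

```python
import time

# Illustrative queue: get() hides a message until its visibility timeout;
# if the worker dies before delete(), the message reappears for another
# worker -- this is what masks worker-role faults.
class Queue:
    def __init__(self, visibility_timeout=0.05):
        self._messages = []          # each entry: [visible_at, id, body]
        self._timeout = visibility_timeout
        self._next_id = 0

    def put(self, body):
        self._messages.append([0.0, self._next_id, body])
        self._next_id += 1

    def get(self):
        now = time.time()
        for msg in self._messages:
            if msg[0] <= now:                    # visible?
                msg[0] = now + self._timeout     # hide until timeout
                return msg[1], msg[2]
        return None

    def delete(self, msg_id):
        self._messages = [m for m in self._messages if m[1] != msg_id]

q = Queue()
q.put("task: run BLAST on partition 7")   # web role enqueues work

msg_id, body = q.get()                    # worker 1 takes it... and crashes
time.sleep(0.06)                          # visibility timeout elapses

redelivered = q.get()                     # worker 2 sees the same task
q.delete(redelivered[0])                  # done: remove it for good
```

Node-to-node TCP skips this machinery, which is exactly why the bullet above notes you lose durable messaging when you use it.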
A closer look: Storage
[Diagram: applications and compute roles on the fabric reach Blobs, Drives, Tables, and Queues over HTTP through a REST API behind a load balancer.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app at work or home against the Development Fabric and Development Storage, keeping versions in source control; verify the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
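The allocation in steps 1 and 2 can be sketched as a round-robin spread. This is illustrative only (the real Fabric Controller placement logic is not described here, and the fault-domain count of 2 is an assumption; the deck's example gives only the 10 front-ends and 5 update domains):

```python
# Illustrative round-robin placement: spread role instances so no single
# fault domain (shared rack/switch) or update domain (upgrade batch)
# holds too many of them.
def allocate(instances, fault_domains, update_domains):
    placement = []
    for i in range(instances):
        placement.append({"instance": i,
                          "fault_domain": i % fault_domains,
                          "update_domain": i % update_domains})
    return placement

# The slide's example: 10 front-ends across 5 update domains
nodes = allocate(10, fault_domains=2, update_domains=5)

per_update = {}
for n in nodes:
    per_update.setdefault(n["update_domain"], []).append(n["instance"])
# Rolling an upgrade one update domain at a time then touches only
# 2 of the 10 front-ends at once, so the service stays up.
```

The same spread across fault domains means a rack or switch failure takes down only a fraction of the role's instances.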
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: the AzureMODIS service – a Web Role portal feeds a download queue; data flows through a Data Collection Stage, Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage to produce research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job, 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  - Requires a large number of test runs for a given job (1 – 10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database)
2. Upload to Azure Store
3. Deploy Worker Roles; the Init() function downloads and decompresses the data to the local disk
[Diagram: the compressed database is uploaded to Azure Storage; the BLAST executable is deployed to the worker roles.]
Step 2. Partitioning a Job
[Diagram: the Web Role takes user input; a single partitioning Worker Role writes input partitions to Azure Storage and enqueues a queue message for each.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to storage.]
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - Little Cloud development headaches are probably worth it
Time-space fungibility in the Cloud (resources vs. time):

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
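Reading the measurements back: clock duration drops as workers are added while total run time stays nearly flat. A quick arithmetic check (the times come from the table; "efficiency" here is the usual speedup-over-ideal ratio, not a figure from the deck):

```python
# Wall-clock (clock duration) from the table, in minutes
clock = {25: 12, 16: 15, 8: 26, 4: 47, 2: 87}

# Speedup of the 25-worker run over the 2-worker run
speedup = clock[2] / clock[25]        # 87 / 12 = 7.25x

# Ideal speedup from 2 to 25 workers would be 25/2 = 12.5x,
# so the observed parallel efficiency of the scale-up is:
efficiency = speedup / (25 / 2)       # 0.58
```

Less-than-ideal efficiency is expected here: partitioning, queueing, and staging overheads grow with worker count, which is the earlier point about factoring work into optimal sizes.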
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks; a Registry Broker links the local registry on the user premises (or internet), where the user, administrator, (HPC) cluster, and highly sensitive data live, to the registry in the Azure datacenters; results and data products flow back through web management.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client side tools are particularly appropriate for
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then: make best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account: jared
  Container: images – Blobs: PIC01.JPG, PIC02.JPG
  Container: movies – Blob: MOV1.AVI
Example blob URL: http://jared.blob.core.windows.net/images/PIC01.JPG
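The naming scheme is mechanical (account, container, blob name), so the URL above can be rebuilt like this. A sketch only: real clients also handle URL encoding and the authentication signature derived from the account's secret key:

```python
# Illustrative builder for the public blob addressing scheme:
#   http://<account>.blob.core.windows.net/<container>/<blob>
def blob_url(account, container, blob):
    return "http://%s.blob.core.windows.net/%s/%s" % (account, container, blob)

url = blob_url("jared", "images", "PIC01.JPG")
```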
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are <name, value> pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same namespace – account jared, containers images and movies – with each blob composed of blocks or pages: a 10 GB movie is stored as Block/Page 1, 2, 3, …, N, each block identified by a Block Id.]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
[Diagram: TheBlob.wmv assembled from its blocks in Windows Azure Storage.]
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
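The two-phase semantics above (PutBlock stages bytes, PutBlockList commits an ordered list) can be sketched with an in-memory model. Illustrative only, not the storage service:

```python
# Illustrative model of block blob commit semantics: PutBlock stages
# bytes under a block ID; nothing is readable until PutBlockList
# commits an ordered list of IDs (drawn from either the uncommitted
# or the previously committed set).
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> staged bytes
        self.committed = {}     # block_id -> committed bytes
        self.block_list = []    # committed order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        new = {}
        for b in block_ids:
            new[b] = self.uncommitted[b] if b in self.uncommitted else self.committed[b]
        self.committed = new
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[b] for b in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"The")
blob.put_block("b2", b"Blob")
before = blob.read()            # b"" -- nothing committed yet
blob.put_block_list(["b1", "b2"])
```

Because readers only ever see the last committed block list, a failed multi-block upload leaves the previous version of the blob intact.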
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes
Random access operations over the 10 GB address space:
  PutPage [512, 2048)
  PutPage [0, 1024)
  ClearPage [512, 1536)
  PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob [1000, 2048) returns all 0s for the first 536 bytes, then 512 bytes of data stored in [1536,2048)
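That example sequence can be checked mechanically with a small tracker of which 512-byte pages hold valid data. Illustrative Python, not the service (a small 4 KB blob stands in for the 10 GB one):

```python
# Illustrative page-range tracker: PutPage marks a page-aligned byte
# range valid, ClearPage marks it invalid, GetPageRange reports the
# contiguous valid ranges.
class PageBlob:
    def __init__(self, size, page=512):
        self.page = page
        self.valid = [False] * (size // page)   # one flag per page

    def _pages(self, start, end):
        return range(start // self.page, end // self.page)

    def put_page(self, start, end):
        for p in self._pages(start, end):
            self.valid[p] = True

    def clear_page(self, start, end):
        for p in self._pages(start, end):
            self.valid[p] = False

    def get_page_range(self, start, end):
        ranges, run_start = [], None
        for p in self._pages(start, end):
            if self.valid[p] and run_start is None:
                run_start = p * self.page
            elif not self.valid[p] and run_start is not None:
                ranges.append((run_start, p * self.page))
                run_start = None
        if run_start is not None:
            ranges.append((run_start, end))
        return ranges

blob = PageBlob(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
ranges = blob.get_page_range(0, 4096)   # [(0, 512), (1536, 2560)]
```

Running the slide's four operations reproduces its answer: valid data in [0,512) and [1536,2560), which is also why a GetBlob starting at offset 1000 sees 536 zero bytes before real data.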
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
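The snapshot semantics above reduce to a simple model: snapshots are read-only point-in-time versions, writes always hit the base blob, and promotion restores the base from a chosen snapshot. An illustrative sketch (it ignores the delta-storage optimization and keeps full copies):

```python
# Illustrative snapshot model: a snapshot captures the blob's contents
# at a point in time; all writes go to the base blob; promote() restores
# the base from a snapshot.
class Blob:
    def __init__(self, data):
        self.data = data
        self.snapshots = []

    def snapshot(self):
        self.snapshots.append(self.data)   # point-in-time, read-only copy
        return len(self.snapshots) - 1

    def write(self, data):
        self.data = data                   # writes apply to the base blob

    def promote(self, snap_id):
        self.data = self.snapshots[snap_id]

b = Blob(b"v1")
s = b.snapshot()
b.write(b"v2")
b.promote(s)        # base restored to the snapshot's contents
```

The real service stores only the deltas between snapshots, so keeping many snapshots of a mostly-unchanged blob is cheap.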
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Slide 16
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same account → container → blob hierarchy (account “jared”, containers “images” and “movies”); each blob, e.g. a 10 GB movie, is composed of blocks or pages (Block or Page 1, 2, 3, ...) identified by Block Id 1 through Block Id N]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
[Diagram: the uploaded blocks are assembled into TheBlob.wmv in Windows Azure Storage]
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
Block ID and Size of Block is returned for each block
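As an illustration of what PutBlock and PutBlockList look like over the REST interface, the requests can be sketched by hand: blocks are uploaded with a `comp=block` query parameter and a base64-encoded block ID, and the commit sends a `comp=blocklist` request with an XML body. This is a sketch that builds the requests without sending them (the account, container, and blob names are hypothetical, and this is not the supported client library):

```python
# Sketch of the REST requests behind PutBlock / PutBlockList.
# Builds the request URL and body only; no network calls are made.
import base64

def put_block_url(account: str, container: str, blob: str, block_id: bytes) -> str:
    # Block IDs are base64-encoded in the URL; raw IDs may be up to
    # 64 bytes and are scoped to this blob.
    encoded = base64.b64encode(block_id).decode("ascii")
    return (f"http://{account}.blob.core.windows.net/{container}/{blob}"
            f"?comp=block&blockid={encoded}")

def put_block_list_body(block_ids: list) -> str:
    # The commit step sends an XML list naming the blocks, in order.
    latest = "".join(
        f"<Latest>{base64.b64encode(b).decode('ascii')}</Latest>" for b in block_ids
    )
    return f'<?xml version="1.0" encoding="utf-8"?><BlockList>{latest}</BlockList>'

ids = [b"block-0001", b"block-0002", b"block-0003"]
urls = [put_block_url("jared", "movies", "TheBlob.wmv", i) for i in ids]
body = put_block_list_body(ids)
print(urls[0])
```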
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
Random-access operations over the 10 GB address space (page boundaries at 0, 512, 1024, 1536, 2048, 2560, ...):
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns:
All 0 for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)
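The semantics of that example can be modeled with a sparse set of written 512-byte pages, where unwritten ranges read back as zeros. This toy model (not Azure code) replays the sequence above and reproduces the valid ranges:

```python
# Toy model of page-blob writes: a sparse map from page-aligned
# offsets to 512-byte pages. Unwritten ranges read back as zeros.
PAGE = 512

pages = {}

def put_page(start: int, end: int) -> None:
    for off in range(start, end, PAGE):
        pages[off] = b"\x01" * PAGE  # stand-in for real data

def clear_page(start: int, end: int) -> None:
    for off in range(start, end, PAGE):
        pages.pop(off, None)

def get_page_ranges(start: int, end: int) -> list:
    """Merge written pages into contiguous [start, end) ranges."""
    ranges = []
    for off in range(start, end, PAGE):
        if off in pages:
            if ranges and ranges[-1][1] == off:
                ranges[-1] = (ranges[-1][0], off + PAGE)  # extend run
            else:
                ranges.append((off, off + PAGE))          # new run
    return ranges

# Replaying the sequence from the example:
put_page(512, 2048)
put_page(0, 1024)
clear_page(512, 1536)
put_page(2048, 2560)
print(get_page_ranges(0, 4096))  # [(0, 512), (1536, 2560)]
```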
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: snapshots of MyBlob, with a prior snapshot promoted to become the current version]
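The snapshot behavior described above (writes always hit the base blob; promotion restores a prior version) can be pictured with a toy version-history model. This is illustrative only, not the storage service's implementation:

```python
# Toy model of blob snapshots: the base blob takes all writes,
# snapshots are immutable copies, and "promoting" a snapshot
# restores it as the current base-blob contents.
class Blob:
    def __init__(self, data: bytes = b""):
        self.data = data              # current (base) contents
        self.snapshots = []           # immutable prior versions

    def write(self, data: bytes) -> None:
        self.data = data              # writes always go to the base blob

    def snapshot(self) -> int:
        self.snapshots.append(self.data)
        return len(self.snapshots) - 1  # handle for later promotion

    def promote(self, handle: int) -> None:
        self.data = self.snapshots[handle]  # restore a prior version

blob = Blob(b"v1")
h = blob.snapshot()
blob.write(b"v2")     # base blob now differs from the snapshot
blob.promote(h)       # snapshot promotion: roll back to v1
print(blob.data)      # b'v1'
```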
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
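A table can be thought of as a dictionary keyed by the required (PartitionKey, RowKey) pair, with entities in one partition stored and served together. A minimal sketch of that model (hypothetical entity shapes, not the ADO.NET Data Services API):

```python
# Minimal model of an Azure table: entities are property dicts,
# addressed by the required (PartitionKey, RowKey) pair.
import time

table = {}

def insert(entity: dict) -> None:
    entity["Timestamp"] = time.time()  # maintained by the store
    table[(entity["PartitionKey"], entity["RowKey"])] = entity

def get(pk: str, rk: str) -> dict:
    # Point lookup on the keys: the only indexed access path
    return table[(pk, rk)]

def query_partition(pk: str) -> list:
    # Entities sharing a PartitionKey live together, so a
    # single-partition scan stays on one server.
    return [e for (p, _), e in table.items() if p == pk]

insert({"PartitionKey": "movies", "RowKey": "MOV1.AVI", "SizeGB": 10})
insert({"PartitionKey": "images", "RowKey": "PIC01.JPG", "SizeGB": 1})
print(get("movies", "MOV1.AVI")["SizeGB"])  # 10
```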
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
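Because queue messages can be delivered more than once, "execute a task only once" in practice means making workers idempotent. One common approach is a de-duplication record keyed by task ID; a sketch under that assumption (names hypothetical):

```python
# Idempotent worker sketch: at-least-once queue delivery means the
# same task can arrive twice; record completed task IDs and skip
# duplicates so the work's effect happens only once.
completed = set()   # in production: a durable record, e.g. a table row per task
results = []

def handle(task_id: str, payload: str) -> None:
    if task_id in completed:
        return                          # duplicate delivery: do nothing
    results.append(payload.upper())     # the actual work
    completed.add(task_id)              # mark done only after the work succeeds

# Simulated deliveries, with task t1 redelivered once:
for task_id, payload in [("t1", "a"), ("t2", "b"), ("t1", "a")]:
    handle(task_id, payload)
print(results)  # ['A', 'B']
```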
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
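Batching multiple small tasks into a single queue message amortizes per-message storage transactions. A sketch of the packing step, assuming a small per-message byte budget (queue messages are size-limited; the budget below is an illustrative figure):

```python
# Sketch of batching many small tasks into fewer queue messages.
# Each message carries as many task descriptors as fit in a byte
# budget, cutting the number of per-message storage transactions.
import json

def batch_tasks(tasks: list, budget: int = 8 * 1024) -> list:
    messages, current = [], []
    for task in tasks:
        candidate = current + [task]
        if len(json.dumps(candidate).encode()) > budget and current:
            messages.append(json.dumps(current))  # flush a full message
            current = [task]
        else:
            current = candidate
    if current:
        messages.append(json.dumps(current))
    return messages

# 1000 tiny tasks collapse into a handful of messages:
tasks = [{"blob": f"part-{i:04}", "op": "blast"} for i in range(1000)]
msgs = batch_tasks(tasks)
print(len(msgs))
```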
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
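Retry logic around data access usually means retrying transient faults (timeouts, throttling) with backoff before surfacing the error. A generic sketch, independent of any particular storage client (the delay is kept at zero here for brevity):

```python
# Sketch of retry-with-backoff around a storage call: transient
# faults are retried a few times before the error is surfaced.
import time

def with_retries(fn, attempts: int = 4, base_delay: float = 0.0):
    for attempt in range(attempts):
        try:
            return fn()
        except IOError:
            if attempt == attempts - 1:
                raise                               # out of retries: surface it
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# A stand-in for a flaky storage read that fails twice, then succeeds:
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "blob bytes"

print(with_retries(flaky_read))  # 'blob bytes' after two transient failures
```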
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 18
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
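The FC behavior described above is a goal-state reconciliation loop: compare the desired instance count against the observed healthy instances, and allocate replacements until they match. A hedged sketch of that loop (the shape of the behavior, not Microsoft’s implementation):

```python
# Minimal goal-state reconciliation sketch: restore the desired number
# of healthy role instances after failures.
import itertools

def reconcile(desired_count, healthy, allocate_node):
    """Return the instance set after one reconciliation pass."""
    healthy = set(healthy)
    while len(healthy) < desired_count:
        # A failed role is restarted, or a new node is allocated.
        healthy.add(allocate_node())
    return healthy

ids = itertools.count()
def new_node():
    return f"node-{next(ids)}"

# Two of five instances died; the FC drives back to the goal of five.
state = reconcile(5, ["n1", "n2", "n3"], new_node)
```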
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
(Diagram: the AzureMODIS service web role portal feeds a download queue; data flows through the Data Collection, Reprojection, Derivation Reduction and Analysis Reduction stages to produce research results.)
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers: 100s of HIV and HepC researchers actively use it; 1000s of research communities rely on the results
Cover of PLoS Biology, November 2008
Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
– Requires a large number of test runs for a given job (1 – 10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data
2. Upload to Azure Store
3. Deploy Worker Roles – the Init() function downloads and decompresses data to the local disk
(Diagram: the local sequence database is compressed, uploaded to Azure Storage, and the BLAST executable is deployed to the worker roles.)
Step 2. Partitioning a Job
(Diagram: the Web Role stores user input in Azure Storage; a single partitioning Worker Role splits it into input partitions and posts one queue message per partition.)
Step 3. Doing the Work
(Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to storage.)
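Steps 2 and 3 above can be sketched as a partition-and-drain pattern. This is a conceptual model: the queue is a plain in-memory deque standing in for an Azure queue, and `run_blast` is a hypothetical stand-in for invoking the BLAST executable.

```python
# Conceptual sketch: a web role partitions user input into queue
# messages; worker roles drain the queue, producing output and logs.
from collections import deque

def partition_job(user_input, partition_size):
    """Split the input into fixed-size partitions."""
    return [user_input[i:i + partition_size]
            for i in range(0, len(user_input), partition_size)]

def enqueue_partitions(queue, partitions):
    for idx, part in enumerate(partitions):
        queue.append({"partition_id": idx, "data": part})

def worker(queue, run_blast):
    """Drain the queue, recording output and a log line per message."""
    outputs, logs = [], []
    while queue:
        msg = queue.popleft()
        logs.append(f"processing partition {msg['partition_id']}")
        outputs.append(run_blast(msg["data"]))
    return outputs, logs

queue = deque()
enqueue_partitions(queue, partition_job("ACGT" * 10, partition_size=8))
outputs, logs = worker(queue, run_blast=lambda seq: f"hits({seq[:4]}...)")
```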
• Always design with failure in mind – on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the optimal size may change depending on the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! – little Cloud development headaches are probably worth it
Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
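The table makes the time-space trade concrete: more workers shrink the wall clock while the aggregate compute consumed (and hence the cost) stays roughly flat. A quick check using the reported numbers:

```python
# Verify time-space fungibility from the measured runs: wall clock
# falls from 87 to 12 minutes, but total run time stays nearly constant.

def to_minutes(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

runs = {25: ("0:12:00", "2:19:39"), 16: ("0:15:00", "2:25:12"),
        8:  ("0:26:00", "2:33:23"), 4:  ("0:47:00", "2:34:17"),
        2:  ("1:27:00", "2:31:39")}

for workers, (clock, total) in runs.items():
    print(workers, round(to_minutes(clock), 1), round(to_minutes(total), 1))
# Total run time stays within about 10% of 2.5 hours across all runs:
# the same work, bought faster by renting more workers.
```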
(Chart: time-space fungibility in the Cloud – resources traded against time.)
Utilizes a general jobs-based task manager, which registers jobs and their resulting data
(Diagram: a job definition fans out into tasks; a Registry Broker links a Local Registry on user premises (or the internet) – where highly sensitive data and the administrator’s (HPC) cluster remain – with the Registry and Data Products in Azure datacenters; the user works through Web Management and receives Results.)
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Privately or publicly accessible
Associate metadata with a container
Metadata are name/value pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
TheBlob.wmv
Windows Azure
Storage
Blocks can be up to 4 MB each
Each block can be a different size
Each block has an ID of up to 64 bytes
Scoped by blob name and stored with the blob
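The PutBlock / PutBlockList pattern above can be sketched end to end. This is a simulation: the “service” is a dict, not the Azure REST API or Storage Client Library; the block-naming scheme is an illustrative assumption.

```python
# Sketch of block-blob upload: split data into blocks of at most 4 MB,
# give each a block ID (<= 64 bytes, scoped to the blob), then commit
# the ordered block list to make the blob readable.
import base64

BLOCK_LIMIT = 4 * 1024 * 1024  # 4 MB per block

def split_into_blocks(data, block_size=BLOCK_LIMIT):
    assert block_size <= BLOCK_LIMIT
    blocks, order = {}, []
    for i in range(0, len(data), block_size):
        # Hypothetical ID scheme: base64 of the block's byte offset.
        block_id = base64.b64encode(f"block-{i:08d}".encode()).decode()
        assert len(block_id) <= 64
        blocks[block_id] = data[i:i + block_size]
        order.append(block_id)
    return blocks, order

def put_block_list(store, blob_name, blocks, order):
    """Commit: the readable blob is the listed blocks, concatenated."""
    store[blob_name] = b"".join(blocks[bid] for bid in order)

store = {}
blocks, order = split_into_blocks(b"x" * 10_000, block_size=4096)
put_block_list(store, "TheBlob.wmv", blocks, order)
```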
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
Random Access Operations
(Diagram: a 10 GB address space with 512-byte page boundaries at offsets 0, 512, 1024, 1536, 2048, 2560, …)
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048)
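The example above can be replayed with a small model of page-blob semantics: a set of valid 512-byte pages, written by PutPage and removed by ClearPage. This is a simulation of the behavior described on the slide, not the storage service itself.

```python
# Model page-blob valid ranges and reproduce the slide's example.
PAGE = 512

def pages(start, end):
    return range(start // PAGE, end // PAGE)

class PageBlob:
    def __init__(self):
        self.valid = set()  # indices of pages that hold data

    def put_page(self, start, end):
        self.valid |= set(pages(start, end))

    def clear_page(self, start, end):
        self.valid -= set(pages(start, end))

    def get_page_ranges(self, start, end):
        """Coalesce valid pages into [start, end) byte ranges."""
        ranges, run = [], None
        for p in pages(start, end):
            if p in self.valid:
                if run is None:
                    run = [p * PAGE, (p + 1) * PAGE]
                else:
                    run[1] = (p + 1) * PAGE
            elif run:
                ranges.append(tuple(run)); run = None
        if run:
            ranges.append(tuple(run))
        return ranges

blob = PageBlob()
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges(0, 4096))  # [(0, 512), (1536, 2560)]
```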
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
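The ETag check named above is optimistic concurrency: a write is conditional on the version the caller last saw, and a stale version is rejected. A minimal sketch of that check (page-blob leases, by contrast, grant one writer exclusive access for a period); this models the pattern, not an Azure SDK API:

```python
# Sketch of ETag-based optimistic concurrency on a blob.
import uuid

class Blob:
    def __init__(self, data=b""):
        self.data, self.etag = data, uuid.uuid4().hex

    def conditional_put(self, data, if_match):
        """Apply the write only if the caller saw the latest version."""
        if if_match != self.etag:
            raise ValueError("412 Precondition Failed: ETag mismatch")
        self.data, self.etag = data, uuid.uuid4().hex
        return self.etag

blob = Blob(b"v1")
tag = blob.etag
tag = blob.conditional_put(b"v2", if_match=tag)    # succeeds
try:
    blob.conditional_put(b"v3", if_match="stale")  # loses the race
except ValueError as e:
    print(e)
```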
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as an NTFS
single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
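The entity model above can be sketched in miniature: every entity carries PartitionKey, RowKey and Timestamp, the pair (PartitionKey, RowKey) identifies it uniquely, and only those keys are indexed – a point the best practices later return to. An illustrative model, not the ADO.NET Data Services API:

```python
# Minimal model of the table abstraction: keyed lookups are cheap,
# anything else is a full scan.
import time

class Table:
    def __init__(self):
        self.entities = {}  # (PartitionKey, RowKey) -> properties

    def insert(self, partition_key, row_key, **props):
        props.update(PartitionKey=partition_key, RowKey=row_key,
                     Timestamp=time.time())
        self.entities[(partition_key, row_key)] = props

    def get(self, partition_key, row_key):
        # Fast: a lookup on the only indexed keys.
        return self.entities[(partition_key, row_key)]

    def scan(self, predicate):
        # Any other query shape degrades to scanning every entity.
        return [e for e in self.entities.values() if predicate(e)]

t = Table()
t.insert("movies", "MOV1.AVI", size_mb=700)
t.insert("images", "PIC01.JPG", size_mb=2)
entity = t.get("movies", "MOV1.AVI")
```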
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
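The first checklist item above – retry logic on every data access – most often takes the shape of exponential backoff. A generic sketch of that shape (not an Azure SDK API):

```python
# Retry a flaky operation with exponential backoff: wait
# base_delay * 2**n after the n-th failure, give up after `attempts`.
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    for n in range(attempts):
        try:
            return operation()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** n)

calls = {"n": 0}
def flaky_read():
    """Stand-in for a storage read that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "data"

result = with_retries(flaky_read)
```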
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center:

Technology       Cost in small-sized DC   Cost in large DC       Ratio
Network          $95 per Mbps/month       $13 per Mbps/month     7.1
Storage          $2.20 per GB/month       $0.40 per GB/month     5.7
Administration   ~140 servers/admin       >1000 servers/admin    7.1
Each data center is 11.5 times the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures – indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems – HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage; DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience – HPC: periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable; DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
(Diagram: the load balancer routes HTTP to Web Roles – IIS hosting ASP.NET, WCF, etc. – while Worker Roles run application code (main() { … }); each role runs in a VM with an agent, on the Fabric.)
Using queues for reliable messaging (to scale, add more of either role):
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
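The fault-masking claim above rests on queue visibility timeouts: a dequeued message becomes invisible rather than deleted, and reappears if the worker dies before explicitly deleting it. A simplified model of that semantics (an illustration of the behavior, not the Azure queue service):

```python
# Model of a durable queue with visibility timeouts: messages survive
# worker crashes because only an explicit delete removes them.
import time

class Queue:
    def __init__(self, visibility_timeout=0.05):
        self.messages = {}          # id -> [body, invisible_until]
        self.timeout = visibility_timeout
        self.next_id = 0

    def put(self, body):
        self.messages[self.next_id] = [body, 0.0]
        self.next_id += 1

    def get(self):
        """Return the first visible message, hiding it for a while."""
        now = time.time()
        for mid, slot in self.messages.items():
            if slot[1] <= now:
                slot[1] = now + self.timeout
                return mid, slot[0]
        return None

    def delete(self, mid):
        del self.messages[mid]

q = Queue()
q.put("task")
mid, body = q.get()    # worker 1 takes the message, then crashes
time.sleep(0.06)       # visibility timeout expires
mid2, body2 = q.get()  # worker 2 sees the same message again
q.delete(mid2)         # only the explicit delete removes it
```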
A closer look
(Diagram: applications reach Blobs, Drives, Tables and Queues over HTTP through a REST API behind the load balancer; Storage sits alongside Compute on the Fabric.)
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
(Diagram: develop your app – at work or at home – against the local Development Fabric and Development Storage, run it locally, and keep versions in source control.)
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 20
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
  - DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform: Compute and Storage, a closer look
[Diagram: behind the HTTP load balancer, a Web Role (IIS hosting ASP.NET, WCF, etc.) and a Worker Role (main() { … }) each run in a VM alongside an Agent, on top of the Fabric.]
Using queues for reliable messaging (to scale, add more of either role):
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
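The four-step flow above can be sketched with an in-memory queue. This is a Python stand-in for the pattern only (real roles talk to the Azure Queue service over REST), not Azure SDK code:

```python
import queue
import threading

# In-memory stand-in for an Azure Queue; the real service is accessed
# over REST, but the decoupling pattern is the same.
work_queue = queue.Queue()
results = []

def web_role(items):
    # 1-2) Receive work and put a message in the queue.
    for item in items:
        work_queue.put(item)

def worker_role():
    # 3-4) Get a message from the queue and do the work.
    while True:
        msg = work_queue.get()
        if msg is None:            # sentinel: shut down
            break
        results.append(msg * msg)  # the "work"
        work_queue.task_done()

t = threading.Thread(target=worker_role)
t.start()
web_role([1, 2, 3])
work_queue.put(None)
t.join()
print(sorted(results))  # [1, 4, 9]
```

Because the web and worker roles share nothing but the queue, either side can be scaled out independently, which is exactly the point of the slide.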
A closer look at Storage
[Diagram: Compute applications access Blobs, Drives, Tables, and Queues in Azure Storage over HTTP via a REST API, through a load balancer, on top of the Fabric.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage (not relational; entities contain a set of properties)
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: development workflow. Develop and run your app, at work or home, against the local Development Fabric and Development Storage; keep versions in source control; verify the application works locally, then in staging, then in the cloud.]
What's the 'Value Add'?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
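The slide's example (10 front-ends across 5 update domains) can be illustrated with a simple round-robin assignment. The function below is a hypothetical sketch of the idea, not the Fabric Controller's actual placement algorithm:

```python
# Round-robin assignment of role instances to update domains.
# With 10 front-ends across 5 update domains, rolling an update
# touches only 2 instances at a time, so the service stays up.
def assign_update_domains(instances, domains):
    return {i: i % domains for i in range(instances)}

assignment = assign_update_domains(10, 5)
# Instances that are taken down together when domain 0 is updated:
print([i for i, d in assignment.items() if d == 0])  # [0, 5]
```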
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can't be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of the change
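The goal-state behavior described above can be sketched as a reconciliation loop: compare the desired role counts against what is observed and emit repair actions. This is a hypothetical illustration of the idea, not Fabric Controller code:

```python
# Compare desired vs. observed instance counts per role and emit
# the repair actions that would drive the service back to goal state.
def reconcile(desired, observed):
    actions = []
    for role, want in desired.items():
        have = observed.get(role, 0)
        if have < want:
            actions.append(("start", role, want - have))  # e.g. restart a dead role
        elif have > want:
            actions.append(("stop", role, have - want))
    return actions

# One worker instance has died; the FC would start a replacement.
print(reconcile({"web": 3, "worker": 5}, {"web": 3, "worker": 4}))
# [('start', 'worker', 1)]
```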
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, "EOS AM", launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, "EOS PM", launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: AzureMODIS pipeline. The AzureMODIS Service Web Role Portal feeds a download queue; data flows through the Data Collection Stage, Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage to produce research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100's of HIV and HepC researchers actively use it
  - 1000's of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
  - Requires a large number of test runs for a given job (1-10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database)
2. Upload to Azure Storage
3. Deploy Worker Roles: an Init() function downloads and decompresses the data to each node's local disk, alongside the BLAST executable
Step 2. Partitioning a Job
[Diagram: the Web Role takes user input; a single partitioning Worker Role writes input partitions to Azure Storage and enqueues a queue message for each.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pull queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.]
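The three steps can be sketched end to end: partition the input, turn each partition into a queue message, and let workers consume the messages. Here run_blast is a hypothetical stand-in for invoking the real BLAST executable on a partition:

```python
from collections import deque

# Step 2: split the user input into fixed-size partitions.
def partition(sequences, size):
    return [sequences[i:i + size] for i in range(0, len(sequences), size)]

# Hypothetical stand-in for running BLAST over one partition.
def run_blast(part):
    return [f"hit:{s}" for s in part]

# One queue message per partition (an in-memory queue stand-in).
queue_msgs = deque(partition(["seqA", "seqB", "seqC", "seqD"], 2))

# Step 3: each worker pops a message and processes its partition.
outputs = []
while queue_msgs:
    outputs.extend(run_blast(queue_msgs.popleft()))
print(outputs)  # ['hit:seqA', 'hit:seqB', 'hit:seqC', 'hit:seqD']
```

The partition size is the tuning knob the lessons below call out: too small and queue overhead dominates, too large and a single failure wastes a lot of work.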
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it's good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Resources

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
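From the clock durations in the table, the speedup relative to the 2-worker run can be computed directly:

```python
# Wall-clock speedup versus the 2-worker baseline, using the
# "Clock duration" row of the table above.
def to_minutes(hms):
    h, m, s = map(int, hms.split(":"))
    return h * 60 + m + s / 60

clock = {2: "1:27:00", 4: "0:47:00", 8: "0:26:00", 16: "0:15:00", 25: "0:12:00"}
base = to_minutes(clock[2])
speedup = {w: round(base / to_minutes(t), 2) for w, t in clock.items()}
print(speedup)  # {2: 1.0, 4: 1.85, 8: 3.35, 16: 5.8, 25: 7.25}
```

Note the total run time stays roughly constant (~2.5 hours of aggregate work) while the clock duration shrinks: the same work spread over more workers, which is the time-space fungibility point below.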
[Chart: time-space fungibility in the cloud; the same total work can run on more resources for less time, or fewer resources for more time.]
Client Visualization / Cloud Data and Computation
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: on the user premises (or internet), a user and an (HPC) cluster administrator keep highly sensitive data behind a local registry with web management; a registry broker connects to the Azure datacenters, where each job definition fans out into tasks whose data products and results are tracked in the registry.]
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then:
• Make the best use of the capabilities of client and cloud computing
• Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account: jared
  Container "images": blobs PIC01.JPG, PIC02.JPG
  Container "movies": blob MOV1.AVI
Example URL: http://jared.blob.core.windows.net/images/PIC01.JPG
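The URL in the example is composed directly from the account, container, and blob names:

```python
# Blob addresses follow the account/container/blob hierarchy directly.
def blob_url(account, container, blob):
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

print(blob_url("jared", "images", "PIC01.JPG"))
# http://jared.blob.core.windows.net/images/PIC01.JPG
```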
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Privately or publicly accessible
Associate metadata with a container
Metadata are name/value pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the account (jared), container (images, movies), and blob (PIC01.JPG, PIC02.JPG, MOV1.AVI) hierarchy, each blob, e.g. a 10 GB movie, is composed of blocks (Block Id 1 … Block Id N) or pages (Page 1, Page 2, Page 3, …).]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
(uploads TheBlob.wmv to Windows Azure Storage)
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
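The PutBlock / PutBlockList semantics can be modeled in a few lines. This is a simplified in-memory sketch of the commit behavior (it ignores reuse of already-committed blocks), not a storage client:

```python
# Minimal model of block-blob semantics: PutBlock stages an
# uncommitted block; PutBlockList commits a chosen sequence of
# blocks as the readable version of the blob.
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block ID -> staged data
        self.committed = []     # ordered committed blocks

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The blob becomes readable only after the commit.
        self.committed = [self.uncommitted[b] for b in block_ids]
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
blob.put_block_list(["b1", "b2"])
print(blob.read())  # b'hello world'
```

Until PutBlockList runs, readers never see the staged blocks, which is why block blobs suit streaming uploads: blocks can arrive in any order, even in parallel, and the commit makes the whole file visible atomically.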
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes
[Diagram: 10 GB address space, with offsets 0, 512, 1024, 1536, 2048, 2560, … marked.]
Random access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns:
  All 0 for the first 536 bytes
  The next 512 bytes are the data stored in [1536,2048)
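The operation sequence above can be replayed against a small in-memory model to confirm the GetBlob result. This is a sketch of the semantics only, with an arbitrary payload byte standing in for written data:

```python
SIZE = 4096
data = bytearray(SIZE)     # blob contents; cleared pages read as 0
valid = [False] * SIZE     # which bytes currently hold written data

def put_page(start, end):
    for i in range(start, end):
        data[i] = 0xAB     # arbitrary payload byte
        valid[i] = True

def clear_page(start, end):
    for i in range(start, end):
        data[i] = 0
        valid[i] = False

# Replay the slide's sequence.
put_page(512, 2048)
put_page(0, 1024)
clear_page(512, 1536)
put_page(2048, 2560)

# GetBlob[1000, 2048): bytes 1000-1535 were cleared (read as 0),
# bytes 1536-2047 hold written data, matching the slide.
chunk = data[1000:2048]
print(chunk[:536] == bytes(536), all(valid[1536:2048]))  # True True
```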
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: snapshots of MyBlob, with a Promote operation restoring one.]
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
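The entity model can be sketched as a dictionary keyed by (PartitionKey, RowKey). An in-memory stand-in for illustration, not the ADO.NET Data Services client:

```python
import datetime

# A table is a set of entities; (PartitionKey, RowKey) uniquely
# identifies an entity, and each entity carries its own properties.
table = {}

def insert(entity):
    key = (entity["PartitionKey"], entity["RowKey"])
    entity["Timestamp"] = datetime.datetime.now(datetime.timezone.utc)
    table[key] = entity

insert({"PartitionKey": "barga", "RowKey": "doc1", "Title": "Azure for Research"})
insert({"PartitionKey": "barga", "RowKey": "doc2", "Title": "AzureBLAST"})

# Point lookup by the two required keys:
print(table[("barga", "doc1")]["Title"])  # Azure for Research
```

This is also why the best-practices slide warns that tables only index on the partition and row keys: a lookup by any other property would have to scan entities.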
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
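The retry guidance above, sketched with exponential backoff. Here flaky_read is a hypothetical stand-in for a storage call that fails transiently:

```python
import time

def with_retries(fn, attempts=4, base_delay=0.01):
    # Retry transient failures with exponential backoff.
    for attempt in range(attempts):
        try:
            return fn()
        except IOError:
            if attempt == attempts - 1:
                raise                      # out of attempts: propagate
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_read():
    # Hypothetical storage call: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "data"

result = with_retries(flaky_read)
print(result)  # data
```

Pairing this with idempotent workers (the "execute a task only once" design point above) means a retried call that actually succeeded the first time does no harm.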
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 21
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
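Two of the practices above — batching many small tasks into one queue message, and wrapping every data access in retry logic — can be sketched as follows. The `FakeQueue` class and the function names are hypothetical stand-ins, not the real Storage Client Library:

```python
# Hypothetical sketch of two queue best practices: pack several small tasks
# into one message (fewer storage transactions), and retry data-access
# operations a few times before giving up on transient failures.
import json
import time

def enqueue_batched(queue, tasks, batch_size=10):
    """Pack small tasks into one JSON message per batch_size tasks."""
    for i in range(0, len(tasks), batch_size):
        queue.put(json.dumps(tasks[i:i + batch_size]))

def with_retries(operation, attempts=3, delay=0.1):
    """Run a data-access operation, retrying on transient IOError."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(delay)

class FakeQueue:  # stand-in for a real Azure queue
    def __init__(self):
        self.messages = []
    def put(self, msg):
        self.messages.append(msg)

q = FakeQueue()
enqueue_batched(q, [f"task-{i}" for i in range(25)], batch_size=10)
print(len(q.messages))  # 3 messages instead of 25 storage transactions
```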
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, and back this offering with a technical engagement
team. Lower the barrier to entry through tutorials, accelerators, and developer
best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications that make it easy to upload data
and samples for reuse. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as
coherent solution accelerators. Pull through Microsoft products and MSR
technologies, partner with ISVs, and make these technologies discoverable and
usable.
• Ask the question: what does it take to catalyze a community of researchers, and
what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are unable to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is a lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 23
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, etc.
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
TheBlob.wmv
Windows Azure
Storage
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
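The two-phase commit semantics above (stage blocks with PutBlock, then make a chosen sequence readable with PutBlockList) can be modeled in a few lines. This is a toy in-memory model, not the Azure Storage API:

```python
class BlockBlob:
    """Toy model of block-blob commit semantics: PutBlock stages
    uncommitted blocks; PutBlockList commits an ordered sequence,
    which becomes the readable version of the blob."""

    def __init__(self):
        self.uncommitted = {}  # block id -> bytes, staged but unreadable
        self.committed = {}    # block id -> bytes, part of the blob
        self.block_list = []   # ordered ids of the readable version

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # A new block list may reuse blocks from either the
        # uncommitted or the committed list.
        new_committed = {
            bid: self.uncommitted.get(bid, self.committed.get(bid))
            for bid in block_ids
        }
        self.committed, self.uncommitted = new_committed, {}
        self.block_list = list(block_ids)

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
blob.put_block_list(["b1", "b2"])
```

Until `put_block_list` runs, readers never see the staged blocks, which is what makes parallel uploads of a large movie safe.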
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
all zeros for the first 536 bytes
and the 512 bytes of data stored in [1536,2048)
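The PutPage / ClearPage / GetPageRange sequence above can be simulated to check which ranges remain valid. This is a toy model with 512-byte pages, not the Azure page blob API:

```python
class PageBlob:
    """Toy model of page-blob semantics: track which 512-byte pages
    hold valid data; unwritten or cleared pages read back as zeros."""

    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.valid = set()  # offsets of valid pages

    def put_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.valid.add(off)

    def clear_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.valid.discard(off)

    def get_page_ranges(self, start, end):
        """Coalesce valid pages into [start, end) byte ranges."""
        ranges, cur = [], None
        for off in range(start, end, self.PAGE):
            if off in self.valid:
                if cur is None:
                    cur = [off, off + self.PAGE]
                else:
                    cur[1] = off + self.PAGE
            elif cur is not None:
                ranges.append(tuple(cur))
                cur = None
        if cur is not None:
            ranges.append(tuple(cur))
        return ranges

# Replaying the slide's sequence of operations:
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
```

Replaying the slide's operations reproduces the valid ranges [0,512) and [1536,2560).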
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
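The addressing model above can be sketched as a dictionary keyed by (PartitionKey, RowKey). The point of the sketch is the best-practice note that follows: a key lookup is a direct, indexed access, while any other predicate is a scan. This is a toy model, not the Table service API:

```python
# Toy model of Azure Table addressing: entities are located by
# (PartitionKey, RowKey); any other lookup is a full scan.
table = {}

def insert(partition_key, row_key, **properties):
    table[(partition_key, row_key)] = properties

def get(partition_key, row_key):
    # Indexed path: direct lookup by the two keys.
    return table[(partition_key, row_key)]

def query(predicate):
    # Unindexed path: scan every entity.
    return [e for e in table.values() if predicate(e)]

insert("blast", "job-001", status="done")
insert("blast", "job-002", status="running")
```

Choosing PartitionKey and RowKey so that common queries hit `get` rather than `query` is the single biggest design lever.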
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
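The "include retry logic" advice above can be captured in one small helper. This is a generic sketch of retry-with-backoff, not an Azure library feature; `with_retries` and `flaky_read` are illustrative names:

```python
import time

def with_retries(operation, attempts=3, base_delay=0.01):
    """Retry a storage call with exponential backoff, re-raising
    the last error if every attempt fails."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Usage: a flaky operation that succeeds on the third call.
calls = {"n": 0}

def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "data"

result = with_retries(flaky_read)
```

Wrapping every storage access this way turns transient faults, which are expected at scale, into slightly slower successes.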
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, and back up this offering with a technical engagements
team. Lower the barrier to entry through tutorials, accelerators, and developer
best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers,
and what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized
center (1,000 servers) and a larger,
100K-server center.

Technology     | Cost in small-sized Data Center | Cost in Large Data Center   | Ratio
Network        | $95 per Mbps/month              | $13 per Mbps/month          | 7.1
Storage        | $2.20 per GB/month              | $0.40 per GB/month          | 5.7
Administration | ~140 servers/Administrator      | >1000 Servers/Administrator | 7.1

Each data center is 11.5 times the size of a football field
Conquering complexity
Building racks of servers & complex cooling systems all separately is not
efficient. Package and deploy into bigger units, JITD.
o Node and system architectures
  - Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
    Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  - DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
    approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of the application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
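The reliable-messaging behavior described above can be illustrated with a toy queue: Get hides a message rather than deleting it, and only an explicit Delete after successful processing removes it, so a crashed worker's message reappears. This is a simplified model of the visibility-timeout pattern, not the Azure Queue API:

```python
import collections

class ReliableQueue:
    """Toy reliable queue: Get makes a message invisible instead of
    removing it; Delete after successful processing removes it for
    good. A crashed worker's message reappears after its timeout,
    which is how faults in worker roles are masked."""

    def __init__(self):
        self.visible = collections.deque()
        self.invisible = {}
        self._next_id = 0

    def put(self, body):
        self.visible.append((self._next_id, body))
        self._next_id += 1

    def get(self):
        msg = self.visible.popleft()
        self.invisible[msg[0]] = msg  # hidden, not deleted
        return msg

    def delete(self, msg_id):
        self.invisible.pop(msg_id)

    def timeout(self, msg_id):
        # Visibility timeout expired without a delete: requeue it.
        self.visible.append(self.invisible.pop(msg_id))

q = ReliableQueue()
q.put("work-item")
mid, body = q.get()      # a worker takes the message...
q.timeout(mid)           # ...but crashes before deleting it
mid2, body2 = q.get()    # another worker picks the same work up
q.delete(mid2)           # processed successfully this time
```

Note that this gives at-least-once delivery: the design bullet earlier ("design your workers to execute a task only once", i.e. make tasks idempotent) exists precisely because a message can be processed twice.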
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
  Not relational – entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
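The allocation rule above (spread a role's instances across both fault domains and update domains) can be sketched as a round-robin placement. The fault-domain count below is an assumption for illustration; only the "10 front-ends across 5 update domains" figure comes from the slide:

```python
def allocate(instances, fault_domains, update_domains):
    """Spread role instances across fault and update domains
    round-robin, so no single rack failure and no single update
    step takes down more than its share of the role."""
    return [(i % fault_domains, i % update_domains)
            for i in range(instances)]

# Slide example: 10 front-ends across 5 update domains
# (2 fault domains is an illustrative assumption).
plan = allocate(10, fault_domains=2, update_domains=5)
per_update_domain = [sum(1 for f, u in plan if u == d)
                     for d in range(5)]
```

With this placement, rolling an update one update domain at a time takes out only 2 of the 10 front-ends at once.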
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
Load Balancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
• Statistical tool used to analyze DNA of HIV
from large studies of infected patients
• PhyloD was developed by Microsoft
Research and has been highly impactful
• Small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on results
Cover of PLoS Biology,
November 2008
Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
  - Requires a large number of test runs for a given job (1-10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 25
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra (“EOS AM”), launched 12/1999, descending, equator crossing at 10:30 AM
Aqua (“EOS PM”), launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram: the AzureMODIS service. A web role portal feeds a download queue; data flows through the data collection, reprojection, derivation reduction, and analysis reduction stages to produce research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers: 100s of HIV and HepC researchers actively use it; 1000s of research communities rely on results
Cover of PLoS Biology, November 2008
Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
– Requires a large number of test runs for a given job (1–10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database and BLAST executable)
2. Upload to Azure Storage
3. Deploy Worker Roles; the Init() function downloads and decompresses data to the local disk
Step 2. Partitioning a Job
[Diagram: the web role takes user input; a single partitioning worker role writes input partitions to Azure Storage and posts a queue message per partition.]
Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pull queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.]
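The partition/queue/worker pattern in Steps 2 and 3 can be mimicked locally, with Python's standard queue module standing in for an Azure queue and the BLAST run replaced by a placeholder:

```python
import queue

# Stand-in for an Azure queue: the web role enqueues one message per
# input partition; worker roles pull messages and process them.
work_queue = queue.Queue()

def partition_job(user_input, partition_size):
    """Web role side: split user input and enqueue a message per partition."""
    partitions = [user_input[i:i + partition_size]
                  for i in range(0, len(user_input), partition_size)]
    for n, part in enumerate(partitions):
        work_queue.put({"partition": n, "data": part})
    return len(partitions)

def run_worker(results):
    """Worker role side: drain the queue, doing placeholder 'BLAST' work."""
    while not work_queue.empty():
        msg = work_queue.get()
        # Placeholder for the real BLAST run against this partition
        results[msg["partition"]] = msg["data"].upper()

results = {}
n = partition_job("acgtacgtacgt", partition_size=4)
run_worker(results)
print(n, results)  # 3 partitions, each processed once
```

In the real service the queue is durable, so a worker that dies mid-partition simply leaves the message to be picked up by another worker.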
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Results (resources vs. time):

Workers | Clock Duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13
[Chart: time-space fungibility in the cloud; trading resources against time.]
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks; a registry broker connects the user premises (or internet) side, with its local registry, (HPC) cluster, administrator, and highly sensitive data, to the Azure datacenters; web management returns data products and results to the user.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
- Applications using peripheral devices
- Applications with heavy graphics requirements
- Legacy user interfaces that would be difficult to port
• Our goal then: make best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Diagram: storage hierarchy. The account “jared” holds containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI), e.g.:]
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Privately or publicly accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the same account/container/blob hierarchy, each blob is made up of blocks (Block Id 1 … Block Id N) or pages (Page 1, Page 2, Page 3, …).]
Uploading a 10 GB movie, TheBlob.wmv, to Windows Azure Storage:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
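An in-memory sketch of the commit semantics just described (an illustration only, not the real storage API; the class and block names are made up):

```python
# In-memory sketch of block blob semantics: blocks are uploaded
# uncommitted, and PutBlockList atomically commits the readable version.
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes
        self.committed = []     # ordered list of (block id, bytes)

    def put_block(self, block_id, data):
        # Blocks can be variable size (up to 4 MB each in the real service)
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: the listed blocks, in order, become the readable blob
        self.committed = [(b, self.uncommitted[b]) for b in block_ids]
        self.uncommitted.clear()

    def read(self):
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""        # nothing readable until commit
blob.put_block_list(["b1", "b2"])
print(blob.read())               # b'hello world'
```

Because readers only ever see a committed block list, blocks can be uploaded in parallel and in any order before the single commit.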
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
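A toy model of these page operations (pages tracked only as 512-byte offsets; not the real API) reproduces the example's result:

```python
# Toy model of page blob semantics: a page is valid once written by
# PutPage and invalid after ClearPage. Ranges are half-open [start, end)
# and aligned to the 512-byte page size, as in the example above.
class PageBlob:
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.valid = set()  # byte offsets of valid pages

    def put_page(self, start, end):
        self.valid.update(range(start, end, self.PAGE))

    def clear_page(self, start, end):
        self.valid.difference_update(range(start, end, self.PAGE))

    def get_page_range(self, start, end):
        # Coalesce consecutive valid pages into [start, end) ranges
        ranges = []
        for off in range(start, end, self.PAGE):
            if off in self.valid:
                if ranges and ranges[-1][1] == off:
                    ranges[-1] = (ranges[-1][0], off + self.PAGE)
                else:
                    ranges.append((off, off + self.PAGE))
        return ranges

blob = PageBlob(10 * 2**30)          # 10 GB address space (metadata only)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_range(0, 4096))  # [(0, 512), (1536, 2560)]
```

The GetBlob[1000, 2048) arithmetic follows from the same ranges: bytes 1000–1535 fall in cleared pages (536 zero bytes), and bytes 1536–2047 fall in the valid range (512 data bytes).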
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
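A toy sketch of the snapshot semantics above (illustrative only; real snapshots are addressed by a timestamp, simplified here to an integer id, and the service stores only deltas rather than full copies):

```python
# Toy sketch of blob snapshot semantics: writes go to the base blob,
# snapshots capture read-only versions, and "promoting" a snapshot
# restores the base blob to that version.
class SnapshottableBlob:
    def __init__(self, data):
        self.base = data
        self.snapshots = []   # read-only versions, oldest first

    def write(self, data):
        self.base = data      # all writes applied to the base blob

    def snapshot(self):
        self.snapshots.append(self.base)
        return len(self.snapshots) - 1   # snapshot id (simplified)

    def promote(self, snap_id):
        self.base = self.snapshots[snap_id]

blob = SnapshottableBlob("v1")
sid = blob.snapshot()
blob.write("v2")
blob.promote(sid)   # restore the prior version via snapshot promotion
print(blob.base)    # v1
```

Listing the snapshots (the slide's ListBlobs) would just enumerate the read-only versions held alongside the base blob.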
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
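A minimal sketch of the entity model above (the property "CpuHours" and the key values are made up for illustration):

```python
# Minimal sketch of the table entity model. A table is a set of
# entities (rows); each entity is a set of properties (columns) and
# must carry PartitionKey, RowKey and Timestamp. The pair
# (PartitionKey, RowKey) uniquely identifies an entity.
table = {}

def insert_entity(entity):
    key = (entity["PartitionKey"], entity["RowKey"])
    table[key] = entity

insert_entity({"PartitionKey": "HIV", "RowKey": "job-001",
               "Timestamp": "2010-04-01T00:00:00Z", "CpuHours": 15})
insert_entity({"PartitionKey": "HIV", "RowKey": "job-002",
               "Timestamp": "2010-04-01T01:00:00Z", "CpuHours": 1800})

# Point lookups on the indexed keys are cheap; filtering on any other
# property means a scan over the entities
print(table[("HIV", "job-001")]["CpuHours"])  # 15
```

This is also why the best-practices section below warns that tables only index on the partition and row keys.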
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
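The retry guideline above can be sketched as a small wrapper. This is a generic pattern, not an Azure API; with_retries and flaky_get are made-up names, and flaky_get stands in for a real storage call:

```python
import time

# Sketch: retry-with-backoff wrapper for data access, per the
# "include retry logic in all instances where you are accessing data"
# guideline above.
def with_retries(fn, attempts=3, delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except IOError:
            if attempt == attempts - 1:
                raise                           # out of retries: surface the failure
            time.sleep(delay * (2 ** attempt))  # exponential backoff

calls = {"n": 0}
def flaky_get():
    # Stand-in for a storage call that fails transiently twice
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "blob-bytes"

result = with_retries(flaky_get)
print(result, "after", calls["n"], "attempts")  # blob-bytes after 3 attempts
```

Transient faults are expected at cloud scale, so wrapping every storage access this way is cheaper than debugging sporadic failures after the fact.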
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 27
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small data center (1,000 servers) and a larger, 100K-server data center:

Technology     | Cost in small data center | Cost in large data center | Ratio
Network        | $95 per Mbps/month        | $13 per Mbps/month        | 7.1
Storage        | $2.20 per GB/month        | $0.40 per GB/month        | 5.7
Administration | ~140 servers/administrator | >1000 servers/administrator | 7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
  HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
[Diagram: HTTP requests pass through a load balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (main() { … }); an agent on each VM reports to the fabric.]
Using queues for reliable messaging
To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) The Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
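The four-step queue flow above can be simulated with a plain in-process queue. This is a sketch of the decoupling pattern only, not the Azure Queue API; the `web_role`/`worker_role` names and the squaring "work" are illustrative:

```python
import queue
import threading

# Minimal sketch of the web-role / worker-role pattern: the "web role"
# enqueues work items, decoupled "worker roles" drain the queue, and
# either side can be scaled independently.

work_queue = queue.Queue()          # stands in for an Azure queue
results = []
results_lock = threading.Lock()

def web_role():
    """1) Receive work, 2) put work in the queue."""
    for job_id in range(5):
        work_queue.put(job_id)

def worker_role():
    """3) Get work from the queue, 4) do the work."""
    while True:
        job_id = work_queue.get()
        if job_id is None:          # sentinel: no more work
            break
        with results_lock:
            results.append(job_id * job_id)   # the "work"

web_role()
workers = [threading.Thread(target=worker_role) for _ in range(2)]
for w in workers:
    w.start()
for _ in workers:
    work_queue.put(None)            # one sentinel per worker
for w in workers:
    w.join()

print(sorted(results))              # [0, 1, 4, 9, 16]
```

Because the queue is the only coupling point, adding a third worker thread (or a second producer) requires no change to the rest of the program — the property the slide is after.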
A closer look
[Diagram: applications reach Azure Storage — Blobs, Drives, Tables, Queues — over HTTP through a load balancer and the REST API; compute and storage both run on the fabric.]
Points of interest
Storage types
Blobs: a simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop at work or home against the Development Fabric and Development Storage, keep versions in source control, confirm the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
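The allocation idea above can be sketched with a simple round-robin placement. The real Fabric Controller's placement algorithm is not public, so this is only an illustration; the choice of 2 fault domains and 5 update domains mirrors the slide's "10 front-ends, across 5 update domains" example, with the fault-domain count assumed:

```python
# Hedged sketch of spreading role instances across fault and update
# domains (illustrative only, not the actual FC algorithm). Round-robin
# over both domain kinds means losing any one fault domain, or updating
# any one update domain, takes out only a fraction of the instances.

def allocate(instances, fault_domains, update_domains):
    placement = []
    for i in range(instances):
        placement.append({
            "instance": i,
            "fault_domain": i % fault_domains,
            "update_domain": i % update_domains,
        })
    return placement

# The slide's example: 10 front-ends across 5 update domains.
plan = allocate(instances=10, fault_domains=2, update_domains=5)

# Rolling an update through update domain 0 touches only 2 of the 10.
in_ud0 = [p["instance"] for p in plan if p["update_domain"] == 0]
print(in_ud0)   # [0, 5]
```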
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: the AzureMODIS service web role portal drives the pipeline — download queue → data collection stage → reprojection stage → derivation reduction stage → analysis reduction stage → research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  – 100’s of HIV and HepC researchers actively use it
  – 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  – Requires a large number of test runs for a given job (1 – 10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload to Azure Store
3. Deploy Worker Roles – each role’s Init() function downloads and decompresses the data to the local disk alongside the BLAST executable
Step 2. Partitioning a Job
[Diagram: the Web Role takes the user input, writes an input partition to Azure Storage, and enqueues a queue message for a single partitioning Worker Role.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write the BLAST output and logs back to storage.]
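The partitioning step can be sketched as follows. The `partition_job` helper, partition naming, and partition size are hypothetical illustrations, not AzureBLAST's actual code:

```python
# Hedged sketch of AzureBLAST-style job partitioning: a web role splits
# the user's query sequences into fixed-size partitions, stores each
# partition, and would enqueue one queue message per partition so the
# BLAST-ready worker roles can process them independently.

def partition_job(sequences, partition_size):
    """Split query sequences into partitions of at most partition_size
    items, returning (partition_name, chunk) pairs."""
    partitions = []
    for start in range(0, len(sequences), partition_size):
        chunk = sequences[start:start + partition_size]
        partitions.append((f"partition-{start // partition_size}", chunk))
    return partitions

queries = [f"seq{i}" for i in range(10)]
parts = partition_job(queries, partition_size=4)
print([name for name, _ in parts])   # ['partition-0', 'partition-1', 'partition-2']
print(len(parts[-1][1]))             # 2 sequences in the last partition
```

As the lessons below note, the choice of `partition_size` has a large performance impact and the optimal value depends on the scope of the job.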
• Always design with failure in mind
  – On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  – The optimal size may change depending on the scope of the job
• Test runs are your friend
  – Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  – When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  – Little cloud development headaches are probably worth it
Resources

Workers | Clock duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13

[Chart: resources vs. time — time–space fungibility in the cloud.]
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks; on the user premises (or internet), an (HPC) cluster administrator and a local registry keep highly sensitive data on-premises, while a registry broker, web management and the data-products registry in the Azure datacenters return results to the user.]
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example namespace:
Account jared → container images (blobs PIC01.JPG, PIC02.JPG) and container movies (blob MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the account/container/blob hierarchy one level deeper — a blob such as a 10 GB movie is made up of blocks or pages (Block Id 1, Block Id 2, … Block Id N).]
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
[Diagram: the blocks are assembled into TheBlob.wmv in Windows Azure Storage.]
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
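The commit semantics above can be modeled with a toy in-memory class. `BlockBlob` and its methods are illustrative stand-ins for the PutBlock / PutBlockList REST operations, not the real storage client:

```python
# Toy model of block blob semantics: PutBlock stages uncommitted blocks
# invisibly; the blob's readable content only changes when PutBlockList
# commits an ordered list of block IDs (which may mix newly staged and
# previously committed blocks).

class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged but not readable
        self.committed = {}     # block_id -> bytes, part of the readable blob
        self.block_list = []    # ordered IDs making up the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        new_committed = {}
        for bid in block_ids:
            # prefer a freshly staged block, else reuse a committed one
            new_committed[bid] = self.uncommitted.get(bid, self.committed.get(bid))
        self.committed = new_committed
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
print(blob.read())                  # b'' — nothing committed yet
blob.put_block_list(["b1", "b2"])
print(blob.read())                  # b'hello world'
```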
Create MyBlob: blob size = 10 GBytes, fixed page size = 512 bytes — a 10 GB address space for random-access operations.
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
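That sequence of operations can be replayed with a small in-memory model. The `PageBlob` class is an illustrative sketch of the semantics, not the storage API itself:

```python
# Sketch of page blob semantics over a set of valid pages. Pages are
# fixed at 512 bytes; PutPage marks pages valid, ClearPage invalidates
# them, and GetPageRange reports the contiguous valid byte ranges.

PAGE = 512

class PageBlob:
    def __init__(self, size):
        self.size = size
        self.valid = set()      # indices of pages holding data

    def put_page(self, start, end):
        self.valid |= set(range(start // PAGE, end // PAGE))

    def clear_page(self, start, end):
        self.valid -= set(range(start // PAGE, end // PAGE))

    def get_page_ranges(self, start, end):
        """Merge valid pages overlapping [start, end) into byte ranges."""
        ranges = []
        for p in sorted(self.valid):
            lo, hi = p * PAGE, (p + 1) * PAGE
            if hi <= start or lo >= end:
                continue
            if ranges and ranges[-1][1] == lo:
                ranges[-1] = (ranges[-1][0], hi)   # extend contiguous run
            else:
                ranges.append((lo, hi))
        return ranges

# Replay the example on a 10 GB blob.
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges(0, 4096))   # [(0, 512), (1536, 2560)]
```

The result matches the slide's GetPageRange answer: clearing [512, 1536) split the earlier writes, leaving [0, 512) and [1536, 2560) valid.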
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
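The ETag-check concurrency model for block blobs can be sketched as follows. `VersionedBlob` and `write_if_match` are hypothetical stand-ins for the conditional-update (If-Match) behavior, not real API names:

```python
# Sketch of optimistic concurrency with ETag checks: a write succeeds
# only if the caller's ETag still matches the blob's current one;
# otherwise another writer got in between and the caller must re-read.

import itertools

_etags = itertools.count(1)

class VersionedBlob:
    def __init__(self, data=b""):
        self.data = data
        self.etag = next(_etags)

    def read(self):
        return self.data, self.etag

    def write_if_match(self, data, etag):
        if etag != self.etag:
            return False            # 412 Precondition Failed, in REST terms
        self.data = data
        self.etag = next(_etags)    # every successful write bumps the ETag
        return True

blob = VersionedBlob(b"v1")
_, tag = blob.read()
print(blob.write_if_match(b"v2", tag))   # True  — ETag still current
print(blob.write_if_match(b"v3", tag))   # False — stale ETag, write rejected
```

Page blobs use leases instead: a lease grants one writer exclusive access for a period, rather than detecting conflicts after the fact.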
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
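The entity model above can be sketched with a plain dict keyed by the two keys. `insert_entity` and the sample data are hypothetical illustrations, not the Table service API:

```python
# Sketch of the Azure Table data model: every entity carries the
# required PartitionKey, RowKey and Timestamp properties plus an open
# set of other properties, and is addressed by (PartitionKey, RowKey).

import time

table = {}   # (partition_key, row_key) -> entity

def insert_entity(partition_key, row_key, **properties):
    entity = {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Timestamp": time.time(),
        **properties,
    }
    table[(partition_key, row_key)] = entity
    return entity

insert_entity("genomes", "seq-001", organism="E. coli")
insert_entity("genomes", "seq-002", organism="S. cerevisiae")

# Point lookup by the indexed key pair is the efficient access path;
# queries on any other property would scan.
e = table[("genomes", "seq-001")]
print(e["organism"])    # E. coli
```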
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
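The "include retry logic" advice above can be sketched with a generic wrapper. `with_retries` is a hypothetical helper for illustration; real storage client libraries ship their own retry policies:

```python
# Sketch of retry-with-exponential-backoff around any data access:
# transient faults (throttling, brief network blips) are retried a few
# times with growing delays before the error is surfaced.

import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Call operation(); on failure, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def flaky_read():
    """Simulated storage call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "blob bytes"

result = with_retries(flaky_read)
print(result)   # blob bytes (after 2 transient failures)
```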
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 28
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
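The entity model above can be illustrated with a toy table (Python, hypothetical names; the real service is reached via ADO.NET Data Services, LINQ, or REST). It also shows why only PartitionKey and RowKey give indexed access:

```python
class Table:
    """Toy model of an Azure table: entities are property bags (dicts),
    indexed server-side solely by (PartitionKey, RowKey)."""
    def __init__(self):
        self.partitions = {}    # PartitionKey -> {RowKey -> entity}

    def insert(self, entity):
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        self.partitions.setdefault(pk, {})[rk] = entity

    def get(self, pk, rk):
        # Fast path: direct lookup on the two indexed keys.
        return self.partitions[pk][rk]

    def scan(self, predicate):
        # Queries on any other property degrade to a full scan.
        return [e for rows in self.partitions.values()
                for e in rows.values() if predicate(e)]

t = Table()
t.insert({"PartitionKey": "genome", "RowKey": "chr1", "Length": 248_956_422})
t.insert({"PartitionKey": "genome", "RowKey": "chr2", "Length": 242_193_529})
assert t.get("genome", "chr1")["Length"] == 248_956_422
assert len(t.scan(lambda e: e["Length"] > 245_000_000)) == 1
```

Choosing a PartitionKey that matches the dominant query pattern is therefore the central design decision for scalable tables.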
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
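One of the practices above, retry logic on every data access, can be sketched as a small wrapper with exponential backoff (illustrative only; `with_retries` and its parameters are assumptions, not an Azure API):

```python
import time

def with_retries(op, attempts=3, base_delay=0.01):
    """Run a storage operation, retrying transient faults with
    exponential backoff; the final failure is re-raised."""
    for i in range(attempts):
        try:
            return op()
        except IOError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# A fault injector standing in for a flaky storage call:
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return b"payload"

assert with_retries(flaky_read) == b"payload"
assert calls["n"] == 3   # succeeded on the third attempt
```

In a large cloud deployment transient faults are routine rather than exceptional, so wrapping every storage and queue access this way is cheap insurance.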
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 30
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs-based task manager which registers jobs and their resulting data
[Diagram: a job definition fans out into tasks on an (HPC) cluster; an administrator and a registry broker connect a local registry holding highly sensitive data on the user premises (or internet) with data products, results and web management in the Azure datacenters]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then:
  - Make best use of the capabilities of client and cloud computing
  - Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
e.g. "US Anywhere", "US North Central", "US South Central"
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
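The account/container/blob naming above composes directly into the blob URI; a small sketch (the helper name blob_uri is ours):

```python
def blob_uri(account, container, blob):
    # <account>.blob.core.windows.net is the public blob endpoint;
    # the container and blob names form the URL path.
    return "http://%s.blob.core.windows.net/%s/%s" % (account, container, blob)
```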
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Blocks can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
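The PutBlock / PutBlockList semantics above can be modeled with a toy in-memory class (a sketch of the semantics only, not the storage API; blocks stay invisible until a block list commits them, and a new list may reuse already-committed blocks):

```python
class BlockBlob:
    """Toy model of block-blob commit semantics."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, not yet readable
        self.committed = {}     # block id -> bytes in the current version
        self.block_list = []    # committed order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data           # PutBlock: uncommitted

    def put_block_list(self, block_ids):
        # PutBlockList: choose the readable version; blocks may come
        # from either the uncommitted or the committed list
        new = {}
        for bid in block_ids:
            new[bid] = self.uncommitted.get(bid, self.committed.get(bid))
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[b] for b in self.block_list)
```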
Create MyBlob
  Specify Blob Size = 10 GBytes, fixed Page Size = 512 bytes (a 10 GB address space)
Random Access Operations
  PutPage[512, 2048)
  PutPage[0, 1024)
  ClearPage[512, 1536)
  PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges:
  [0,512) , [1536,2560)
GetBlob[1000, 2048) returns
  All 0 for the first 536 bytes
  Next 512 bytes are the data stored in [1536,2048)
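The page-blob example can be replayed against a toy model (a sketch only; the 512-byte page size and operation names mirror the example, not the real API, and unwritten or cleared pages read back as zeros):

```python
class PageBlob:
    """Toy page blob: a sparse array of fixed 512-byte pages."""
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.data = {}                 # page offset -> 512 bytes

    def put_page(self, start, end, fill=b"D"):
        for off in range(start, end, self.PAGE):
            self.data[off] = fill * self.PAGE

    def clear_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.data.pop(off, None)

    def get_page_ranges(self, start, end):
        """Merge adjacent pages holding data into [start, end) ranges."""
        ranges = []
        for off in range(start, end, self.PAGE):
            if off in self.data:
                if ranges and ranges[-1][1] == off:
                    ranges[-1] = (ranges[-1][0], off + self.PAGE)
                else:
                    ranges.append((off, off + self.PAGE))
        return ranges

    def read(self, start, end):
        """Unwritten or cleared pages read back as zeros."""
        out, pos = b"", start
        while pos < end:
            off = pos - pos % self.PAGE
            chunk = self.data.get(off, b"\x00" * self.PAGE)
            take = min(end, off + self.PAGE) - pos
            out += chunk[pos % self.PAGE : pos % self.PAGE + take]
            pos += take
        return out
```

Replaying the slide's four operations reproduces its answers: valid ranges [0,512) and [1536,2560), and a read of [1000,2048) that is 536 zero bytes followed by 512 bytes of data.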
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: promoting a snapshot of MyBlob back to the base blob]
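Snapshot promotion can be sketched the same way (an illustrative in-memory model; snapshot IDs and method names are ours, and the real service stores only deltas rather than full copies):

```python
class SnapshottableBlob:
    """Toy snapshot semantics: writes go to the base blob; snapshots
    are read-only versions; promotion restores an old version."""
    def __init__(self, content=b""):
        self.content = content
        self.snapshots = {}      # snapshot id -> content at snapshot time
        self._next = 0

    def write(self, content):
        self.content = content   # all writes apply to the base blob name

    def snapshot(self):
        sid = self._next
        self._next += 1
        self.snapshots[sid] = self.content
        return sid

    def list_blobs(self):
        # base blob plus its snapshots, as ListBlobs would enumerate them
        return ["base"] + sorted(self.snapshots)

    def promote(self, sid):
        self.content = self.snapshots[sid]   # restore a prior version
```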
A Windows Azure Drive is a Page Blob formatted as an NTFS
single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
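A toy model of those required properties (a sketch, not the table API: PartitionKey and RowKey together form the only index, so point lookups need both keys and efficient queries stay within one partition):

```python
class Table:
    """Toy entity store indexed the way Azure tables are."""
    def __init__(self):
        self.rows = {}                          # (PartitionKey, RowKey) -> entity

    def insert(self, entity):
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        self.rows[(pk, rk)] = entity

    def get(self, pk, rk):
        return self.rows[(pk, rk)]              # point lookup: both keys required

    def query_partition(self, pk):
        # efficient in the real service: touches a single partition
        return [e for (p, _), e in sorted(self.rows.items()) if p == pk]
```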
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
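The retry advice above, as a minimal sketch (with_retries is our name and the delays are illustrative; the point is backoff on transient storage errors, re-raising only when attempts are exhausted):

```python
import time

def with_retries(fn, attempts=4, base_delay=0.01):
    """Retry a flaky call with exponential backoff; re-raise when exhausted."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise                       # out of attempts: surface the error
            time.sleep(base_delay * 2 ** i) # back off: 1x, 2x, 4x, ...
```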
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as "killer micros" and inexpensive clusters did before, today's economics are driving change in research and technical computing.

Data centers range in size from "edge" facilities to megascale.

Economies of scale: approximate costs for a small-sized center (1000 servers) and a larger, 100K server center.

Technology     | Cost in small-sized Data Center | Cost in Large Data Center   | Ratio
Network        | $95 per Mbps/month              | $13 per Mbps/month          | 7.1
Storage        | $2.20 per GB/month              | $0.40 per GB/month          | 5.7
Administration | ~140 servers/Administrator      | >1000 Servers/Administrator | 7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
    Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  - DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
    approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; Memory 1.7GB; Network 100+ Mbps; Local Storage 500GB
Up to: CPU 8 cores; Memory 14.2 GB; Local Storage 2+ TB
Azure Platform (Compute and Storage): a closer look
[Diagram: HTTP requests pass through the Load Balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (main() { … }); each VM runs an Agent and is managed by the Fabric]
Using queues for reliable messaging
To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
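The get/delete contract that makes queue messaging reliable can be sketched in a few lines (a toy model: the real service's visibility timeout is simulated here by an explicit expire_timeouts call, and the names are ours):

```python
class ReliableQueue:
    """Toy reliable queue: a message taken by a worker becomes invisible;
    unless the worker deletes it, it reappears after its visibility
    timeout, masking worker failures."""
    def __init__(self):
        self.visible = []        # [(id, body)] waiting to be taken
        self.invisible = {}      # id -> body, currently being processed
        self._next = 0

    def put(self, body):
        self.visible.append((self._next, body))
        self._next += 1

    def get(self):
        if not self.visible:
            return None
        mid, body = self.visible.pop(0)
        self.invisible[mid] = body           # hidden, not gone
        return mid, body

    def delete(self, mid):
        del self.invisible[mid]              # worker finished: remove for good

    def expire_timeouts(self):
        # simulate visibility timeouts lapsing for crashed workers
        for mid, body in self.invisible.items():
            self.visible.append((mid, body))
        self.invisible.clear()
```

This is why stateless workers plus durable queues tolerate failure: a crashed worker's message simply resurfaces for another worker.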
A closer look at Storage
[Diagram: applications on the Compute fabric reach Storage (Blobs, Drives, Tables, Queues) over HTTP through a Load Balancer via the REST API]
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app at work or home against the local Development Fabric and Development Storage, with versions in source control; verify the application works locally, then in staging, then in the cloud]
What's the 'Value Add'?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
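The 10-front-ends-across-5-update-domains example, sketched (allocate and rolling_update are our names; the point is that updating one domain at a time never takes the whole role down):

```python
def allocate(instances, update_domains):
    """Spread role instances round-robin across update domains."""
    domains = [[] for _ in range(update_domains)]
    for i in range(instances):
        domains[i % update_domains].append("fe-%d" % i)
    return domains

def rolling_update(domains):
    """Walk one update domain at a time; yield how many instances keep serving."""
    total = sum(len(d) for d in domains)
    for domain in domains:
        yield domain, total - len(domain)    # the other domains stay up
```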
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, "EOS AM", launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, "EOS PM", launched 5/2002,
ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Pipeline: the AzureMODIS Service Web Role (Portal) feeds a Download Queue into the Data Collection Stage, followed by the Reprojection, Derivation Reduction and Analysis Reduction Stages, producing Research Results]
Slide 32
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
  - Services are always running; rolling upgrades/downgrades
  - Failure of any node is expected, so state has to be replicated
  - Failure of a role (app code) is expected, with automatic recovery
  - Services can grow to be large; provide state management that scales automatically
  - Handle dynamic configuration changes due to load or failure
  - Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers
Fabric Controller
  - Owns all data center hardware
  - Uses inventory to host services
  - Deploys applications to free resources
  - Maintains the health of those applications
  - Maintains the health of the hardware
  - Manages the service life cycle, starting from bare metal
Fault Domains
  - Purpose: avoid single points of failure
  - Allocation is across fault domains
Update Domains
  - Purpose: ensure the service stays up while undergoing an update
  - Unit of software/configuration update; example: a set of nodes to update
  - Used when rolling forward or backward
  - The developer assigns the number required by each role; example: 10 front-ends across 5 update domains
  - Allocation is across update domains
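The "10 front-ends across 5 update domains" example amounts to round-robin placement, so that updating any single domain takes down only 1/N of the instances. A sketch (the allocation policy here is an illustrative assumption):

```python
def allocate(instance_count, update_domain_count):
    """Round-robin role instances across update domains so that a
    rolling update of one domain affects only 1/N of the instances."""
    domains = [[] for _ in range(update_domain_count)]
    for i in range(instance_count):
        domains[i % update_domain_count].append("frontend-%d" % i)
    return domains

domains = allocate(10, 5)
print([len(d) for d in domains])  # [2, 2, 2, 2, 2]
```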
Push-button Deployment
  Step 1: Allocate nodes, across fault domains and across update domains
  Step 2: Place OS and role images on nodes
  Step 3: Configure settings
  Step 4: Start roles
  Step 5: Configure load balancers
  Step 6: Maintain the desired number of roles
    - Failed roles are automatically restarted
    - Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
  - FC detects if a role dies
  - A role can indicate it is unhealthy
  - The current state of the node is updated appropriately
  - The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
  - If the node goes offline, FC will try to recover it
  - If a failed node can’t be recovered, FC migrates role instances to a new node
    - A suitable replacement location is found
    - Existing role instances are notified of the change
Key takeaways
  - Cloud services have specific design considerations: always on, distributed state, large scale, fault tolerance
  - Scalable infrastructure demands a scalable architecture: stateless roles and durable queues
  - Windows Azure frees service developers from many platform issues
  - Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data and computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
  - Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
  - Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2,300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
...
AzureMODIS pipeline (fronted by a Service Web Role Portal):
  Data Collection Stage (download queue) → Reprojection Stage → Derivation Reduction Stage → Analysis Reduction Stage → Research Results
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100s of HIV and HepC researchers actively use it
  - 1000s of research communities rely on the results
Cover of PLoS Biology, November 2008
• Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
  - Requires a large number of test runs for a given job (1-10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
  1. Compress the required data (the local sequence database)
  2. Upload to Azure Storage
  3. Deploy Worker Roles
     - An Init() function downloads and decompresses the data to the local disk, alongside the BLAST executable
Step 2. Partitioning a Job
  - The Web Role takes user input and writes input partitions to Azure Storage
  - A single partitioning Worker Role creates one queue message per partition

Step 3. Doing the Work
  - BLAST-ready Worker Roles take queue messages, read their input partition from Azure Storage, and write the BLAST output and logs back to Azure Storage
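Steps 2 and 3 boil down to: split the user input into partitions in storage, and enqueue one message per partition for the workers. A conceptual Python sketch (the partition size and the partition naming are illustrative assumptions):

```python
def partition_job(sequences, partition_size):
    """A single partitioning worker splits the input into fixed-size
    partitions and produces one queue message per partition."""
    storage, queue = {}, []
    for n, start in enumerate(range(0, len(sequences), partition_size)):
        key = "input-partition-%d" % n
        storage[key] = sequences[start:start + partition_size]
        queue.append(key)   # message tells a worker which partition to run
    return storage, queue

storage, queue = partition_job(["seq%d" % i for i in range(10)], 4)
print(len(queue))  # 3 partitions: sizes 4, 4, 2
```

As the lessons below note, choosing the partition size well matters: too small and queue/transaction overhead dominates, too large and a single worker failure wastes hours of work.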
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Scaling results (resources vs. time):

Workers | Clock duration | Total run time | Computational run time
     25 |        0:12:00 |        2:19:39 |               1:49:43
     16 |        0:15:00 |        2:25:12 |               1:53:47
      8 |        0:26:00 |        2:33:23 |               2:00:14
      4 |        0:47:00 |        2:34:17 |               2:01:06
      2 |        1:27:00 |        2:31:39 |               1:59:13
Time-space fungibility in the Cloud: the same total computation can trade resources (more workers) for time (shorter runs).
Utilizes a general jobs-based task manager, which registers jobs and their resulting data.
Architecture sketch: a job definition fans out into tasks; a Registry Broker bridges the local registry on the user premises (or the internet), where highly sensitive data stays with the user or an (HPC) cluster administrator, and the Azure datacenters; data products and results are exposed through web management.
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
  - Can choose a geo-location to host the storage account: “US Anywhere”, “US North Central”, “US South Central”, …
  - Can co-locate the storage account with a compute account
  - Receive a 256-bit secret key when creating the account
Storage Account Capacity
  - Each storage account can store up to 100 TB
  - Default limit of 5 storage accounts per subscription
Blob namespace example: account “jared” contains containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI), addressed as:
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
  - Can have as many Blob Containers as will fit within the storage account limit
Blob Container
  - A container holds a set of blobs
  - Set access policies at the container level: private or publicly accessible
  - Associate metadata with a container: metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Diagram: within account “jared”, a blob (e.g. MOV1.AVI in the “movies” container) is made up of blocks or pages; each block is identified by a Block ID (Block Id 1 … Block Id N).
Example: uploading a 10 GB movie as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
The blob (TheBlob.wmv) then lives in Windows Azure Storage.
  - Blocks can be up to 4 MB each, and each block can be a variable size
  - Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
  - PutBlock: puts an uncommitted block, defined by its block ID, for the blob
Block list operations
  - PutBlockList: provide the list of blocks to comprise the readable version of the blob; can use blocks from the uncommitted or committed list to update the blob
  - GetBlockList: returns the list of blocks, committed or uncommitted, for a blob; the Block ID and size of each block is returned
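The PutBlock/PutBlockList semantics can be modeled in a few lines: blocks land in an uncommitted set, and only the committed block list is readable. A conceptual sketch, not the real client library:

```python
class BlockBlob:
    """Conceptual model of a block blob: PutBlock stages blocks,
    PutBlockList commits an ordered list of them as the readable blob."""
    def __init__(self):
        self.uncommitted = {}   # block ID -> bytes, not yet readable
        self.committed = []     # ordered list of (block ID, bytes)

    def put_block(self, block_id, data):          # PutBlock
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):          # PutBlockList: commit
        # Blocks may come from the committed or the uncommitted list.
        lookup = dict(self.committed)
        lookup.update(self.uncommitted)
        self.committed = [(b, lookup[b]) for b in block_ids]
        self.uncommitted = {}

    def read(self):
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
print(blob.read())          # b'' -- nothing readable until commit
blob.put_block_list(["b1", "b2"])
print(blob.read())          # b'hello world'
```

The two-phase upload is what makes the streaming workload safe: a reader never sees a half-uploaded blob.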
Create MyBlob
  - Specify blob size = 10 GB; fixed page size = 512 bytes
Random access operations over the 10 GB address space:
  PutPage[512, 2048)
  PutPage[0, 1024)
  ClearPage[512, 1536)
  PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns:
  - All 0s for the first 536 bytes
  - The next 512 bytes are the data stored in [1536,2048)
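The sequence above can be checked with a small simulation of a page blob: writes and clears at 512-byte page granularity, with GetPageRange reporting which pages hold data. Conceptual only; the class and method names are illustrative:

```python
class PageBlob:
    PAGE = 512

    def __init__(self, size):
        self.data = bytearray(size)   # unwritten pages read back as 0s
        self.valid = set()            # page-aligned offsets holding data

    def put_page(self, start, end, fill=b"x"):
        self.data[start:end] = fill * (end - start)
        self.valid.update(range(start, end, self.PAGE))

    def clear_page(self, start, end):
        self.data[start:end] = bytes(end - start)
        self.valid -= set(range(start, end, self.PAGE))

    def get_page_ranges(self):
        """Coalesce valid pages into [start, end) ranges."""
        ranges, run = [], None
        for off in sorted(self.valid):
            if run and off == run[1]:
                run[1] = off + self.PAGE
            else:
                run = [off, off + self.PAGE]
                ranges.append(run)
        return [tuple(r) for r in ranges]

blob = PageBlob(4096)        # 4 KB is enough to replay the example
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges())                 # [(0, 512), (1536, 2560)]
# GetBlob[1000, 2048): zeros until the valid range begins at 1536
print(bytes(blob.data[1000:1536]) == bytes(536))   # True -- 536 zero bytes
```

The simulation reproduces the slide's result: after the four operations, only [0,512) and [1536,2560) hold data.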
Block Blob
  - Targeted at streaming workloads
  - Update semantics: upload a set of blocks, then commit the change
  - Concurrency: ETag checks
Page Blob
  - Targeted at random read/write workloads
  - Update semantics: immediate update
  - Concurrency: leases
Snapshots
  - All writes are applied to the base blob name
  - Only delta changes are maintained across snapshots
  - Restore to a prior version via snapshot promotion (e.g. promote a snapshot of MyBlob)
  - Can use ListBlobs to enumerate the snapshots of a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
  - Drives can be up to 1 TB
  - A VM can dynamically mount up to 8 drives
  - A Page Blob can only be mounted by one VM at a time for read/write
Remote access via Page Blob
  - Can upload the VHD to its Page Blob using the blob interface, then mount it as a Drive
  - Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
  - A storage account can create many tables; the table name is scoped by the account
  - A table is a set of entities (i.e., rows)
Entity
  - A set of properties (columns)
  - Required properties: PartitionKey, RowKey, and Timestamp
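Because entities are addressed by PartitionKey and RowKey, a table behaves like a two-level dictionary: lookups by the key pair are cheap, while queries on any other property scan. A conceptual sketch, not the real ADO.NET Data Services / REST API:

```python
class Table:
    """Conceptual model of an Azure table: entities are property dicts,
    addressed by the required PartitionKey and RowKey properties."""
    def __init__(self):
        self.partitions = {}   # PartitionKey -> {RowKey -> entity}

    def insert(self, entity):
        part = self.partitions.setdefault(entity["PartitionKey"], {})
        part[entity["RowKey"]] = entity

    def get(self, partition_key, row_key):
        # Point lookup on the only indexed properties: fast.
        return self.partitions[partition_key][row_key]

    def scan(self, predicate):
        # Query on any other property: a full scan, no secondary index.
        return [e for part in self.partitions.values()
                for e in part.values() if predicate(e)]

t = Table()
t.insert({"PartitionKey": "hiv", "RowKey": "job-001", "Status": "done"})
t.insert({"PartitionKey": "hiv", "RowKey": "job-002", "Status": "queued"})
print(t.get("hiv", "job-002")["Status"])              # queued
print(len(t.scan(lambda e: e["Status"] == "done")))   # 1
```

This is why the best-practices list below says to remember that tables only index on partition and row keys: anything else is a scan you pay for in storage transactions.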
Design and Planning
  • Design your workers to execute a task only once
  • Optimize against storage transactions as well as data size
  • Use Azure Drive for distributing existing non-Azure applications
Azure Storage
  • Remember Azure tables only index on partition and row keys
  • Batch multiple small tasks into a single queue message
  • Use snapshots when you need read-only access to a blob
  • Use batch updates to all of your data stores
Network Communication
  • Increasing VM size will also increase your network throughput
  • Use node-to-node communication to save on message latency costs
    - Note that you lose durable messaging when you do this
Testing & Development
  • Include retry logic in all instances where you are accessing data
  • Use built-in logging and performance measurement APIs
  • Use multiple worker nodes to add tasks to the message queue
  • Use ‘heartbeat’ mechanisms when debugging your applications
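"Include retry logic" usually means retrying transient storage failures with increasing backoff. A minimal sketch (the backoff policy here is an assumption, not something Azure prescribes):

```python
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Retry a storage call on transient failure, doubling the delay
    between attempts; re-raise once the attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky storage read: fails twice, then succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "blob bytes"

result = with_retries(flaky_read)
print(result)  # blob bytes (after two transient failures)
```

Pairing this with idempotent workers (the "execute a task only once" advice above) means a retried operation never corrupts state.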
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
  - Use laptops.
  - Got data, now what? And it really is about data, not the FLOPS…
  - Our data collections are not as big as we wished.
  - When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
  - The ability to marshal needed resources on demand, without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
  - Available over a three-year period
  - To be allocated by NSF with new NSF awards
Coupled with access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 33
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Just as did “killer micros” and inexpensive clusters
Range in size from “edge” facilities to megascale.
Economies of scale: approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center.
Technology        Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network           $95 per Mbps/month                $13 per Mbps/month            7.1
Storage           $2.20 per GB/month                $0.40 per GB/month            5.7
Administration    ~140 servers/Administrator        >1000 servers/Administrator   7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity: building racks of servers & complex cooling systems all separately is not efficient.
Package and deploy into bigger units, JITD
Comparing HPC systems and data centers (DC) along five dimensions:
o Node and system architectures
- Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
- HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
- DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
- HPC: periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
- DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5–1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local storage 2+ TB
Azure Platform: Compute and Storage
A closer look (diagram): HTTP requests enter through a load balancer and reach Web Role VMs (IIS hosting ASP.NET, WCF, etc.) and Worker Role VMs (application code in main() { … }); an agent on each VM reports to the Fabric.
Using queues for reliable messaging
To scale, add more of either role:
1) Web Role receives work
2) Web Role puts work in queue
3) Worker Role gets work from queue
4) Worker Role does work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
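The queue pattern above can be sketched in a few lines. This is a minimal toy model in plain Python (not the Azure Storage SDK); the visibility timeout shows why the messaging is reliable: a message taken by a worker is hidden rather than deleted, so if the worker dies before deleting it, the message reappears for another worker.

```python
import time

class Queue:
    """Toy queue with Azure-style visibility timeouts (not the real SDK)."""
    def __init__(self, visibility_timeout=1.0):
        self.messages = {}          # id -> (payload, invisible_until)
        self.timeout = visibility_timeout
        self.next_id = 0

    def put(self, payload):
        self.messages[self.next_id] = (payload, 0.0)
        self.next_id += 1

    def get(self):
        now = time.time()
        for mid, (payload, until) in self.messages.items():
            if until <= now:
                # the message becomes invisible, not deleted: if the worker
                # crashes before delete(), it reappears after the timeout
                self.messages[mid] = (payload, now + self.timeout)
                return mid, payload
        return None

    def delete(self, mid):
        self.messages.pop(mid, None)

# 1) web role receives work, 2) puts it in the queue
q = Queue(visibility_timeout=0.1)
q.put("reproject tile 42")

# 3) a worker gets it but crashes before deleting it
mid, work = q.get()
assert q.get() is None            # invisible while being processed
time.sleep(0.15)                  # the worker never deletes -> timeout expires

# another worker picks the same message up and 4) does the work
mid2, work2 = q.get()
assert work2 == "reproject tile 42"
q.delete(mid2)
```

Because roles only meet at the queue, web and worker instances can be scaled independently, which is the decoupling the bullets describe.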
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
A closer look (diagram): applications on Compute reach Storage over HTTP through a load balancer, via a REST API, to Blobs, Drives, Tables, and Queues; both Compute and Storage run on the Fabric.
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Development lifecycle (diagram): at work or home, develop your app against the local Development Fabric and Development Storage, keeping versions in source control; once the application works locally, run it in staging, then in the cloud.
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
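The allocation idea can be sketched with the slide’s own example of 10 front-ends across 5 update domains. The round-robin assignment below is a hypothetical illustration, not the fabric’s actual placement logic (which also weighs fault domains and utilization):

```python
def assign_update_domains(instance_count, domain_count):
    """Round-robin instances across update domains, so taking any one
    domain down for an update leaves the other domains serving."""
    return {i: i % domain_count for i in range(instance_count)}

# 10 front-ends, across 5 update domains -> 2 instances per domain
domains = assign_update_domains(10, 5)

# rolling forward: update one domain at a time
down = [i for i, d in domains.items() if d == 0]
up = [i for i, d in domains.items() if d != 0]
assert len(down) == 2 and len(up) == 8   # 8 of 10 front-ends stay up
```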
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
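The monitoring behaviour described above is a goal-state loop. A toy model of one reconciliation pass (names and structure are illustrative, not the FC’s actual implementation):

```python
def reconcile(goal_count, instances):
    """One pass of a goal-state loop: drop failed role instances and
    allocate replacements until the desired count is restored."""
    alive = [inst for inst in instances if inst["healthy"]]
    replacement = len(alive)
    while len(alive) < goal_count:
        # in the real system this is a fresh node with the role image deployed
        alive.append({"id": f"replacement-{replacement}", "healthy": True})
        replacement += 1
    return alive

roles = [{"id": "web-0", "healthy": True}, {"id": "web-1", "healthy": False}]
healed = reconcile(2, roles)
assert len(healed) == 2 and all(r["healthy"] for r in healed)
```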
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
AzureMODIS pipeline (diagram): a portal fronts the AzureMODIS Service Web Role; imagery moves through a download queue into the Data Collection Stage, then the Reprojection Stage, the Derivation Reduction Stage, and the Analysis Reduction Stage, yielding research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
- 100’s of HIV and HepC researchers actively use it
- 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
- Requires a large number of test runs for a given job (1 – 10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database)
2. Upload to Azure Storage
3. Deploy worker roles – the Init() function downloads and decompresses the data and the BLAST executable to the local disk

Step 2. Partitioning a Job
(Diagram: the Web Role takes user input, and a single partitioning Worker Role writes input partitions to Azure Storage, with a queue message per partition.)
Step 3. Doing the Work
(Diagram: BLAST-ready Worker Roles take queue messages, read their input partitions from Azure Storage, run BLAST, and write BLAST output and logs back to Azure Storage.)
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- Little cloud development headaches are probably worth it
Scaling results:

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

(Charts: resources vs. time, illustrating time-space fungibility in the Cloud.)
Utilizes a general job-based task manager which registers jobs and their resulting data.
(Diagram: a job definition fans out into tasks; a registry and data products live in the Azure datacenters; on the user premises (or internet), a registry broker, a local registry, and web management connect the user and an (HPC) cluster holding highly sensitive data, run by an administrator; results flow back to the user.)
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades. Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then: make best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example hierarchy – Account → Container → Blob: account “jared” holds containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI).
http://jared.blob.core.windows.net/images/PIC01.JPG
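The URL follows directly from the naming hierarchy; a small sketch of how such addresses compose, assuming the http://&lt;account&gt;.blob.core.windows.net scheme shown above:

```python
def blob_url(account, container, blob):
    # account -> container -> blob, mirroring the hierarchy above
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

assert blob_url("jared", "images", "PIC01.JPG") == \
    "http://jared.blob.core.windows.net/images/PIC01.JPG"
```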
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
(Diagram: the same Account → Container → Blob hierarchy; each blob consists of blocks or pages – Block or Page 1, 2, 3, …, identified by Block Id 1 … Block Id N.)
10 GB movie example – upload blocks, then commit the block list to Windows Azure Storage:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
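A minimal model of these commit semantics (plain Python, not the storage API): PutBlock stages blocks as uncommitted, and only PutBlockList makes a readable version of the blob, drawing on both uncommitted and previously committed blocks.

```python
import base64

class BlockBlob:
    """Toy model of block-blob commit semantics (not the real service)."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged by put_block
        self.committed = {}     # block id -> bytes, part of some commit
        self.block_list = []    # ordered ids forming the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # a commit may use blocks from the uncommitted or committed list
        for bid in block_ids:
            if bid in self.uncommitted:
                self.committed[bid] = self.uncommitted[bid]
            elif bid not in self.committed:
                raise KeyError(bid)
        self.block_list = list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
# block IDs are opaque strings scoped to the blob; base64 of a counter here
ids = [base64.b64encode(f"{i:08d}".encode()).decode() for i in range(3)]
for bid, chunk in zip(ids, [b"part1-", b"part2-", b"part3"]):
    blob.put_block(bid, chunk)
blob.put_block_list(ids)        # nothing is readable until this commit
assert blob.read() == b"part1-part2-part3"
```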
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space).
Random access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are data stored in [1536,2048)
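The worked example can be checked with a toy model of page-range tracking (plain Python, 512-byte pages; illustrative, not the storage API):

```python
class PageBlob:
    """Toy model of page-blob valid-range tracking (512-byte pages)."""
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.valid = set()                  # valid page-aligned offsets

    def put_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.valid.add(off)             # writes apply immediately

    def clear_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.valid.discard(off)

    def get_page_ranges(self):
        """Sorted, coalesced [start, end) ranges of valid pages."""
        ranges = []
        for off in sorted(self.valid):
            if ranges and ranges[-1][1] == off:
                ranges[-1][1] = off + self.PAGE
            else:
                ranges.append([off, off + self.PAGE])
        return [tuple(r) for r in ranges]

# replay the slide's sequence of operations
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
assert blob.get_page_ranges() == [(0, 512), (1536, 2560)]
```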
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
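Since entities are just property sets keyed by PartitionKey and RowKey, a toy model is a dictionary keyed on that pair (illustrative only; the real service is reached through ADO.NET Data Services or REST, and the entity names here are made up):

```python
# table: (PartitionKey, RowKey) -> entity; entities are free-form property sets
table = {}

def insert(entity):
    table[(entity["PartitionKey"], entity["RowKey"])] = entity

# entities in one table need not share the same properties (not relational)
insert({"PartitionKey": "HIV", "RowKey": "job-001",
        "Timestamp": "2010-01-01T00:00:00Z", "Status": "Done"})
insert({"PartitionKey": "HIV", "RowKey": "job-002",
        "Timestamp": "2010-01-02T00:00:00Z", "CpuHours": 18})

# a point lookup on (partition, row) is the indexed, efficient access path
assert table[("HIV", "job-001")]["Status"] == "Done"

# any other predicate is a scan
assert len([e for e in table.values() if e["PartitionKey"] == "HIV"]) == 2
```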
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
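The retry-logic advice above can be sketched as a small wrapper with exponential backoff (a common pattern for transient storage faults, not a specific Azure API):

```python
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Run op(); on failure, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                     # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# a data access that fails twice with a transient fault, then succeeds
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

assert with_retries(flaky_read) == "data"
assert calls["n"] == 3
```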
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 35
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can be mounted by only one VM at a time for read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar, easy-to-use APIs
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)
Entity
An entity is a set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
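A rough in-memory stand-in for this entity model (hypothetical Python, not the Table service itself, which distributes partitions across servers and persists them) shows why lookups want to be addressed by PartitionKey and RowKey:

```python
# Toy model of the table data model: entities are property dicts
# addressed by the required PartitionKey and RowKey properties.

import time

class Table:
    def __init__(self):
        self.partitions = {}  # PartitionKey -> {RowKey -> entity}

    def insert(self, entity):
        # PartitionKey, RowKey and Timestamp are the required properties.
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        entity["Timestamp"] = time.time()
        self.partitions.setdefault(pk, {})[rk] = entity

    def get(self, pk, rk):
        # Point lookup on the only indexed properties: the two keys.
        return self.partitions[pk][rk]

    def query_partition(self, pk):
        # A query scoped to one partition can be served by one server;
        # anything else becomes a scan.
        return list(self.partitions.get(pk, {}).values())

t = Table()
t.insert({"PartitionKey": "movies", "RowKey": "MOV1.AVI", "SizeGB": 10})
print(t.get("movies", "MOV1.AVI")["SizeGB"])  # 10
```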
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember that Azure tables index only on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates for all of your data stores
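One plausible way to apply the batching advice (the JSON message format here is an assumption for illustration, not an Azure convention): pack several small task descriptors into one queue message, so N tasks cost one enqueue/dequeue transaction instead of N, while respecting the queue's message-size limit.

```python
# Sketch: batch small task descriptors into a single queue message.

import json

def pack_tasks(tasks, max_bytes=8192):
    # Queue messages are size-limited, so enforce a byte budget.
    msg = json.dumps(tasks)
    if len(msg.encode("utf-8")) > max_bytes:
        raise ValueError("batch too large for one message")
    return msg

def unpack_tasks(message):
    # A worker dequeues one message and recovers every task in it.
    return json.loads(message)

msg = pack_tasks([{"blob": "PIC01.JPG"}, {"blob": "PIC02.JPG"}])
print(unpack_tasks(msg))  # [{'blob': 'PIC01.JPG'}, {'blob': 'PIC02.JPG'}]
```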
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
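The retry advice above can be sketched as a generic helper (hypothetical, not part of any Azure SDK): retry a storage call a few times with exponential backoff before giving up, treating only designated exception types as transient.

```python
# Sketch: wrap a data-access call in retries with exponential backoff.

import time

def with_retries(fn, attempts=4, base_delay=0.1, transient=(IOError,)):
    for i in range(attempts):
        try:
            return fn()
        except transient:
            if i == attempts - 1:
                raise  # exhausted: surface the last failure
            time.sleep(base_delay * (2 ** i))  # 0.1s, 0.2s, 0.4s, ...

# A flaky call that succeeds on the third attempt:
calls = {"n": 0}

def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return b"data"

print(with_retries(flaky_read))  # b'data' after two failed attempts
```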
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, and back this offering with a technical engagement
team. Lower the barrier to entry through tutorials, accelerators, and developer
best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers,
and what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 37
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
• Always design with failure in mind – on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the optimal size may change depending on the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!! – little Cloud development headaches are probably worth it
Resources
Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
[Chart: time–space fungibility in the Cloud – the same total work can run on many resources for a short time or on few resources for a long time.]
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition is split into tasks; a Registry Broker connects a local registry – on user premises (or the internet), where an administrator’s (HPC) cluster holds highly sensitive data – with the Azure datacenters; the user drives jobs through web management and retrieves results and data products.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for applications using peripheral devices, applications with heavy graphics requirements, and legacy user interfaces that would be difficult to port
• Our goal then: make best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account (“US Anywhere”, “US North Central”, “US South Central”, …)
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example namespace: account jared contains containers images and movies, which hold blobs PIC01.JPG, PIC02.JPG, and MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: each blob in the account/container namespace is composed of either blocks (Block Id 1 … Block Id N, e.g. a 10 GB movie) or pages (Page 1, Page 2, Page 3, …).]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
TheBlob.wmv is then readable from Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
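The two-phase PutBlock/PutBlockList protocol can be modeled in a few lines. The class below is a toy sketch of the semantics described above – uncommitted blocks stay invisible until a block list commits them, and a new list may mix uncommitted and previously committed blocks – not the real storage API:

```python
class BlockBlob:
    """Toy model of block-blob update semantics: PutBlock stages
    uncommitted blocks; PutBlockList atomically commits an ordered list."""
    def __init__(self):
        self.uncommitted = {}   # block ID -> bytes, not yet readable
        self.committed = {}     # block ID -> bytes in the readable blob
        self.block_list = []    # ordered IDs composing the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Each ID may refer to an uncommitted or an already-committed block
        self.committed = {b: self.uncommitted.get(b, self.committed.get(b))
                          for b in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[b] for b in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"AAAA")
blob.put_block("b2", b"BBBB")
before = blob.read()               # uncommitted blocks are invisible
blob.put_block_list(["b1", "b2"])
after = blob.read()
blob.put_block("b2", b"CCCC")      # stage an update to one block...
blob.put_block_list(["b1", "b2"])  # ...reusing the committed "b1"
updated = blob.read()
```

This is why a partially failed upload never corrupts a readable blob: readers only ever see the last committed block list.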
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
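The page-blob operations above can be reproduced with a small model that tracks which 512-byte pages hold valid data. The sketch below replays the slide’s exact sequence and recovers the same valid ranges; `PageBlob` and its method names are illustrative, not the real API:

```python
PAGE = 512

class PageBlob:
    """Toy model of page-blob semantics: writes apply immediately and the
    blob tracks which fixed-size pages currently hold valid data."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.valid = set()                       # start offsets of valid pages

    def put_page(self, start, end, fill=b"x"):
        self.data[start:end] = fill * (end - start)
        self.valid |= set(range(start, end, PAGE))

    def clear_page(self, start, end):
        self.data[start:end] = b"\x00" * (end - start)
        self.valid -= set(range(start, end, PAGE))

    def get_page_range(self, start, end):
        """Coalesce valid pages in [start, end) into contiguous ranges."""
        ranges = []
        for off in range(start, end, PAGE):
            if off in self.valid:
                if ranges and ranges[-1][1] == off:
                    ranges[-1][1] = off + PAGE   # extend the current run
                else:
                    ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

# Replay the slide's sequence on a (scaled-down) blob
blob = PageBlob(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
ranges = blob.get_page_range(0, 4096)   # [(0, 512), (1536, 2560)]
```

Reads over unwritten or cleared pages come back as zeros, which matches the GetBlob behavior the slide describes.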
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
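The entity model can be illustrated with a toy table keyed only on (PartitionKey, RowKey) – which is also why the best practices below warn that tables index on nothing else. Names here are illustrative:

```python
class ToyTable:
    """Toy model of an Azure table: entities are property sets and the
    only index is (PartitionKey, RowKey); everything else is a scan."""
    def __init__(self):
        self.index = {}

    def insert(self, entity):
        self.index[(entity["PartitionKey"], entity["RowKey"])] = entity

    def get(self, partition_key, row_key):
        # Indexed point lookup: cheap
        return self.index.get((partition_key, row_key))

    def query(self, predicate):
        # Filter on any other property: a full scan, expensive at scale
        return [e for e in self.index.values() if predicate(e)]

table = ToyTable()
table.insert({"PartitionKey": "hiv", "RowKey": "job-001",
              "Timestamp": "2010-01-01", "Status": "done"})
table.insert({"PartitionKey": "hiv", "RowKey": "job-002",
              "Timestamp": "2010-01-02", "Status": "running"})
hit = table.get("hiv", "job-002")                      # indexed lookup
running = table.query(lambda e: e["Status"] == "running")  # scan
```

Choosing PartitionKey and RowKey so that hot queries become point lookups (or single-partition range scans) is the central design decision for table storage.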
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
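The retry advice above can be made concrete with a small exponential-backoff wrapper. This is a generic sketch, not an Azure API; the injectable sleep function keeps the example fast and testable:

```python
def with_retries(operation, attempts=4, base_delay=0.01, sleep=lambda s: None):
    """Retry a flaky storage call with exponential backoff; pass
    sleep=time.sleep in production, a no-op in tests."""
    for n in range(attempts):
        try:
            return operation()
        except IOError:
            if n == attempts - 1:
                raise                          # out of attempts: surface the failure
            sleep(base_delay * (2 ** n))       # 10 ms, 20 ms, 40 ms, ...

calls = {"n": 0}
def flaky_download():
    """Hypothetical storage read that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return b"blob contents"

result = with_retries(flaky_download)
```

Backoff matters because transient storage faults are often load-related; immediate tight-loop retries can make the congestion worse.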
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are unable to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center.
Technology       Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network          $95 per Mbps/month                $13 per Mbps/month            7.1
Storage          $2.20 per GB/month                $0.40 per GB/month            5.7
Administration   ~140 servers/Administrator        >1000 servers/Administrator   7.1
Each data center is 11.5 times the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures – node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems – HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage; DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience – HPC: periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable; DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At minimum: CPU 1.5–1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local storage 2+ TB
Azure Platform: Compute and Storage – a closer look
[Diagram: HTTP requests pass through a Load Balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } entry point); each role instance runs in a VM alongside an Agent, all managed by the Fabric.]
Using queues for reliable messaging – to scale, add more of either role:
1) Web Role receives work
2) Web Role puts work in the queue
3) Worker Role gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
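The fault-masking property of queues comes from visibility timeouts: a retrieved message is hidden, not removed, and reappears if the worker never deletes it. Below is a toy model of that at-least-once contract (illustrative names; the clock is passed explicitly so the behavior is deterministic):

```python
class ToyQueue:
    """Toy model of Azure-queue semantics: GetMessage hides a message for a
    visibility timeout; it reappears unless the worker deletes it, which is
    what masks worker-role faults (at-least-once delivery)."""
    def __init__(self, visibility_timeout=30.0):
        self.timeout = visibility_timeout
        self.messages = {}          # message id -> (body, invisible_until)
        self.next_id = 0

    def put(self, body):
        self.messages[self.next_id] = (body, 0.0)
        self.next_id += 1

    def get(self, now):
        for mid, (body, until) in sorted(self.messages.items()):
            if until <= now:
                self.messages[mid] = (body, now + self.timeout)
                return mid, body
        return None

    def delete(self, mid):
        # Only call once the work is durably recorded
        self.messages.pop(mid, None)

q = ToyQueue(visibility_timeout=30.0)
q.put("render tile 17")
first = q.get(now=0.0)     # a worker takes the message...
hidden = q.get(now=10.0)   # ...so other workers cannot see it
# the worker crashes without deleting; the message reappears after 30 s
second = q.get(now=31.0)
q.delete(second[0])        # work finished durably: now remove it
drained = q.get(now=32.0)
```

Because a message can be delivered more than once, workers should be idempotent – exactly the “design your workers to execute a task only once” practice listed earlier.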
A closer look at Storage
[Diagram: applications reach Blobs, Drives, Tables, and Queues over HTTP through a REST API behind the Load Balancer; Storage sits alongside the Compute and Fabric components of the platform.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app – at work or at home – against the local Development Fabric and Development Storage, under source-control versioning; the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Unit of software/configuration update; example: a set of nodes to update
Used when rolling forward or backward
Developer assigns the number required by each role; example: 10 front-ends, across 5 update domains
Allocation is across update domains
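The allocation rule – spread each role’s instances across both fault and update domains – can be sketched with simple round-robin placement. The 5 update domains match the slide’s example; the 2 fault domains are an assumed figure for illustration:

```python
def allocate(instances, fault_domains, update_domains):
    """Round-robin role instances across fault and update domains so no
    single rack failure or rolling-update step stops the whole service."""
    return [{"instance": i,
             "fault_domain": i % fault_domains,
             "update_domain": i % update_domains}
            for i in range(instances)]

# Slide's example: 10 front-ends across 5 update domains (2 fault domains assumed)
placement = allocate(10, 2, 5)
per_update = {d: sum(1 for p in placement if p["update_domain"] == d)
              for d in range(5)}
per_fault = {d: sum(1 for p in placement if p["fault_domain"] == d)
             for d in range(2)}
```

With this spread, a rolling update takes down at most 2 of the 10 front-ends at a time, and a rack (fault-domain) failure leaves half the instances serving.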
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Slide 39
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premises applications or cloud applications
[Diagram: the development lifecycle. Develop your app (at work or at home) and run it against the local Development Fabric and Development Storage, with source and version control; verify the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update
Example: a set of nodes to update
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
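The allocation policy above can be sketched as a simple round-robin placement. Only the 10 front-ends across 5 update domains come from the slide; the 2 fault domains are an assumed figure for illustration:

```python
def allocate(instances, fault_domains, update_domains):
    """Round-robin placement of one role's instances across fault
    and update domains (simplified sketch of the FC policy)."""
    return [
        {"instance": i,
         "fault_domain": i % fault_domains,
         "update_domain": i % update_domains}
        for i in range(instances)
    ]

# The slide's example: 10 front-ends across 5 update domains
# (2 fault domains is an assumed figure).
plan = allocate(instances=10, fault_domains=2, update_domains=5)

# Rolling an update one update domain at a time therefore takes
# down at most 10 / 5 = 2 front-ends at once.
per_update_domain = {}
for p in plan:
    per_update_domain.setdefault(p["update_domain"], []).append(p["instance"])
```

With this spread, losing a rack (one fault domain) or updating one update domain never takes the whole role offline.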
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
Research results are produced by the AzureMODIS service: a web-role portal feeds a download queue, and data then flows through a Data Collection Stage, a Reprojection Stage, a Derivation Reduction Stage, and an Analysis Reduction Stage.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
  - Requires a large number of test runs for a given job (1-10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure store
3. Deploy the worker roles: an Init() function downloads and decompresses the data, alongside the BLAST executable, to the local disk
Step 2. Partitioning a Job
The web role takes the user input and hands it to a single partitioning worker role, which writes input partitions to Azure storage and enqueues one queue message per partition.
Step 3. Doing the Work
BLAST-ready worker roles pick up queue messages, read their input partitions from Azure storage, and write the BLAST output and logs back to Azure storage.
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
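The second lesson, factoring work into optimal sizes, often reduces to batching: group many small tests into one queue message so a million-test job does not cost a million storage transactions. A hypothetical sketch (names illustrative, not from AzureBLAST itself):

```python
def partition(tasks, batch_size):
    """Chunk a job's tasks into fixed-size batches: one queue
    message per batch rather than one per task."""
    return [tasks[i:i + batch_size]
            for i in range(0, len(tasks), batch_size)]

# A job of 1,000,000 test runs at 100 tasks per message needs
# 10,000 enqueue transactions instead of 1,000,000.
batches = partition(list(range(1_000_000)), batch_size=100)
```

The right batch size balances per-message overhead against losing a whole batch when a worker fails, which is why it may change with the scope of the job.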
Resources vs. workers:

Workers | Clock duration | Total run time | Computational run time
--------|----------------|----------------|-----------------------
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13

[Charts: resources vs. time for each configuration, illustrating time-space fungibility in the cloud: many workers for a short time consume roughly the same total resources as few workers for a long time.]
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks that run on an (HPC) cluster or in the Azure data centers; a registry broker connects the local registry on the user premises (or internet), where highly sensitive data stays with the user and the cluster administrator, to the cloud registry; data products and results flow back to the user through web management.]
Client Visualization / Cloud Data and Computation
The Cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
e.g., “US Anywhere”, “US North Central”, “US South Central”
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
The storage namespace: Account → Container → Blob. For example, account “jared” has containers “images” (PIC01.JPG, PIC02.JPG) and “movies” (MOV1.AVI), addressed as:
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many blob containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are <name, value> pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same Account → Container → Blob hierarchy, where each blob in turn consists of blocks (Block Id 1, Block Id 2, … Block Id N) or pages (Page 1, Page 2, Page 3, …).]
Example: uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage, block by block:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
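The commit semantics described above, where staged blocks become readable only once PutBlockList names them, can be sketched with an in-memory stand-in (hypothetical names; the real operations are REST calls against the blob service):

```python
class BlockBlob:
    """Sketch of block-blob commit semantics: put_block stages
    uncommitted blocks; put_block_list makes a chosen sequence
    of blocks the readable version of the blob."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged but unreadable
        self.committed = {}     # block_id -> bytes, readable
        self.block_list = []    # committed order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # May draw each block from the uncommitted or the committed list.
        blocks = {bid: self.uncommitted.get(bid, self.committed.get(bid))
                  for bid in block_ids}
        self.committed = blocks
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
partial = blob.read()              # nothing readable before the commit
blob.put_block_list(["b1", "b2"])  # atomically publish the new version
content = blob.read()
```

Readers never observe a half-uploaded blob: until the block list is committed, they keep seeing the previous version (here, an empty one).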
Create MyBlob: specify blob size = 10 GB, fixed page size = 512 bytes.
Random-access operations against the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
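The sequence above can be replayed with a small sketch of page-blob semantics: a sparse set of fixed-size pages, where unwritten ranges read back as zeros. This is an in-memory illustration, not the real API; the marker byte 0x01 stands in for whatever data PutPage wrote:

```python
PAGE = 512  # fixed page size from the example

class PageBlob:
    """Sketch of page-blob semantics: sparse fixed-size pages written
    in place; unwritten ranges read back as zeros."""
    def __init__(self, size):
        self.size = size
        self.pages = {}  # page index -> PAGE bytes of data

    def put_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.pages[p] = b"\x01" * PAGE  # marker for "written data"

    def clear_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.pages.pop(p, None)

    def get_page_ranges(self, start, end):
        """Return the valid (written) byte ranges, merging adjacent pages."""
        ranges, run = [], None
        for p in range(start // PAGE, end // PAGE):
            if p in self.pages:
                run = [p * PAGE, (p + 1) * PAGE] if run is None else [run[0], (p + 1) * PAGE]
            elif run is not None:
                ranges.append(tuple(run)); run = None
        if run is not None:
            ranges.append(tuple(run))
        return ranges

    def get_blob(self, start, end):
        """Read a byte range; holes come back as zeros."""
        out = bytearray()
        for off in range(start, end):
            page = self.pages.get(off // PAGE)
            out.append(page[off % PAGE] if page else 0)
        return bytes(out)

# Replay the example's operation sequence.
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
ranges = blob.get_page_ranges(0, 4096)   # expect [0,512) and [1536,2560)
data = blob.get_blob(1000, 2048)         # 536 zero bytes, then 512 data bytes
```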
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion (promote a snapshot of MyBlob over the base blob)
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
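The entity model above, property bags addressed by (PartitionKey, RowKey), can be sketched with a dictionary-of-dictionaries stand-in (hypothetical API; the real table is queried via ADO.NET Data Services or REST). Since only the two keys are indexed, point lookups and partition scans are the cheap queries:

```python
class Table:
    """Sketch of Azure Table semantics: entities are property bags
    addressed by (PartitionKey, RowKey); only those keys are indexed."""
    def __init__(self):
        self.partitions = {}  # PartitionKey -> {RowKey: entity}

    def insert(self, partition_key, row_key, **properties):
        entity = {"PartitionKey": partition_key, "RowKey": row_key, **properties}
        self.partitions.setdefault(partition_key, {})[row_key] = entity
        return entity

    def get(self, partition_key, row_key):
        """Point lookup by both keys: the cheapest query."""
        return self.partitions.get(partition_key, {}).get(row_key)

    def query_partition(self, partition_key):
        """Scan one partition: still cheap, served by one partition server."""
        return list(self.partitions.get(partition_key, {}).values())

# Illustrative use: tracking a job's tasks, one partition per job.
jobs = Table()
jobs.insert("blast-job-7", "task-001", status="done")
jobs.insert("blast-job-7", "task-002", status="running")
jobs.insert("blast-job-8", "task-001", status="queued")

one = jobs.get("blast-job-7", "task-002")
rows = jobs.query_partition("blast-job-7")
```

Queries that filter on any other property have to scan, which is why choosing the partition key around your access pattern matters.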
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
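The retry guidance above is usually implemented as exponential backoff with jitter around every storage call. A minimal sketch under assumed names (the flaky operation and the exception type are stand-ins for a transient storage failure):

```python
import random
import time

def with_retries(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry a data-access call with exponential backoff plus jitter;
    re-raise only after the final attempt fails."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Flaky stand-in for a storage call: fails twice, then succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "blob contents"

result = with_retries(flaky_read, sleep=lambda s: None)  # skip real sleeps in the demo
```

Injecting the sleep function keeps the retry logic testable, and the jitter prevents many workers from retrying in lockstep against the same storage partition.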
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 41
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing [10 minutes]
Introduction to Windows Azure [35 minutes]
Research Applications on Azure, demos [10 minutes]
How They Were Built [15 minutes]
A Closer Look at Azure [15 minutes]
Cloud Research Engagement Initiative [5 minutes]
Q&A [*]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small data center (1,000 servers) and a larger, 100K-server data center:

Technology       Cost in small data center     Cost in large data center     Ratio
Network          $95 per Mbps/month            $13 per Mbps/month            7.1
Storage          $2.20 per GB/month            $0.40 per GB/month            5.7
Administration   ~140 servers/administrator    >1000 servers/administrator   7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
How a cloud data center (DC) differs from an HPC cluster, dimension by dimension:
o Node and system architectures: largely indistinguishable (Intel Nehalem, AMD Barcelona or Shanghai), multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems: HPC has local scratch (small or non-existent), secondary storage on a SAN or parallel file system, and PB-scale tertiary storage; a DC has TBs of local storage, JBOD secondary storage, and no tertiary storage
o Reliability and resilience: HPC uses periodic checkpoints with rollback and resume in response to failures, but with MTBF approaching zero, checkpoint frequency is increasing and the I/O demand is becoming intolerable; a DC uses loosely consistent models designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64, Memory 1.7 GB, Network 100+ Mbps, Local storage 500 GB
Up to: CPU 8 cores, Memory 14.2 GB, Local storage 2+ TB
A closer look at the Azure platform, compute and storage (diagram): HTTP requests arrive at a Load Balancer, which routes them to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (main() { … }); each role instance runs in a VM alongside an Agent, all managed by the Fabric.
Using queues for reliable messaging (to scale, add more of either role):
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
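The four-step queue pattern above can be sketched as a small in-process simulation (illustrative only, not the Azure Queue API): a web role enqueues work, worker roles dequeue and process it, and the queue decouples the two sides so either can be scaled independently.

```python
import queue
import threading

def web_role(work_queue, items):
    for item in items:
        work_queue.put(item)          # 2) put work in queue

def worker_role(work_queue, results):
    while True:
        item = work_queue.get()       # 3) get work from queue
        if item is None:              # sentinel: no more work
            break
        results.append(item * item)   # 4) do the work

def run(items, num_workers=3):
    work_queue = queue.Queue()
    results = []
    workers = [threading.Thread(target=worker_role, args=(work_queue, results))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    web_role(work_queue, items)       # 1) receive work, hand it to the queue
    for _ in workers:
        work_queue.put(None)          # one sentinel per worker
    for w in workers:
        w.join()
    return results
```

Adding more worker threads here is the in-process analogue of adding more Worker Role instances: the queue masks the difference.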
A closer look at Azure Storage (diagram): applications access Blobs, Drives, Tables, and Queues over HTTP through a REST API behind the Load Balancer; Storage sits alongside Compute in the Azure platform, on top of the Fabric.
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational (entities contain a set of properties)
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps and by other on-premise or cloud applications
Development workflow (diagram): at work or at home, develop your app and run it locally against the Development Fabric and Development Storage, with versions kept in source control; verify the application works locally, then works in staging in the cloud.
What is the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update (for example, a set of nodes to update), used when rolling forward or backward
The developer assigns the number required by each role (for example, 10 front-ends across 5 update domains)
Allocation is across update domains
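The 10-front-ends-across-5-update-domains example can be sketched as a simple round-robin placement (an illustrative model, not the Fabric Controller's actual algorithm): each domain ends up with 2 instances, so updating one domain at a time takes down at most 20% of the role's capacity.

```python
def allocate(instance_count, domain_count):
    """Place instances across update domains round-robin."""
    domains = {d: [] for d in range(domain_count)}
    for i in range(instance_count):
        domains[i % domain_count].append(i)   # instance i goes to domain i mod K
    return domains

# The slide's example: 10 front-ends across 5 update domains.
placement = allocate(10, 5)
```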
Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
AzureMODIS pipeline (diagram): a portal and the AzureMODIS service Web Role feed a download queue; data moves through a Data Collection stage, a Reprojection stage, a Derivation Reduction stage, and an Analysis Reduction stage to produce research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers: 100’s of HIV and HepC researchers actively use it, and 1000’s of research communities rely on the results
• Cover of PLoS Biology, November 2008
• Typical job: 10 - 20 CPU hours; extreme jobs require 1K - 2K CPU hours
- Requires a large number of test runs for a given job (1 - 10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure Store
3. Deploy Worker Roles; each role’s Init() function downloads and decompresses the data to the local disk, yielding BLAST-ready Worker Roles with the BLAST executable in place
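The staging flow can be sketched as a gzip round-trip against an in-memory store (the store and function names are hypothetical stand-ins; a real deployment would use the Azure blob APIs): compress and "upload" the sequence database once, then have each worker's Init() download and decompress it locally.

```python
import gzip

blob_store = {}                      # stands in for Azure Storage

def stage(name, data):
    """Steps 1-2: compress the data and upload it to the store."""
    blob_store[name] = gzip.compress(data)

def worker_init(name):
    """Step 3: each worker's Init() downloads and decompresses to local disk."""
    return gzip.decompress(blob_store[name])

stage("seqdb", b"ACGT" * 1000)       # a toy, highly compressible "database"
local_copy = worker_init("seqdb")
```

Compressing once before upload pays off twice: less ingress bandwidth, and every worker pulls a smaller object.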
Step 2. Partitioning a Job
The Web Role takes the user input; a single partitioning Worker Role splits it into input partitions in Azure Storage and places a queue message for each partition.
Step 3. Doing the Work
The BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it
Resources vs. time:

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

Time-space fungibility in the cloud: the same computation can trade more workers for less wall-clock time.
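Working the table's numbers (taking its row pairing at face value) makes the fungibility concrete: the aggregate computational run time stays near two hours in every configuration, while wall-clock duration shrinks as workers are added.

```python
def to_minutes(hms):
    """Convert an h:mm:ss string to minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

# Workers -> clock duration, as reported in the table.
runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}

baseline = to_minutes(runs[2])                      # 87 minutes with 2 workers
speedups = {w: baseline / to_minutes(d) for w, d in runs.items()}
# Going from 2 to 25 workers: 87 / 12 = 7.25x less wall-clock time.
```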
Azure Ocean utilizes a general jobs-based task manager, which registers jobs and their resulting data products; a job definition is broken into tasks. A registry broker connects the user premises (or internet) side, holding the user, a local registry, an (HPC) cluster, an administrator, and highly sensitive data, with the Azure datacenters side, holding the registry, web management, and results.
Client Visualization / Cloud Data and Computation
The cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose the geo-location that hosts the storage account: “US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example storage namespace: the account jared holds a container images (blobs PIC01.JPG and PIC02.JPG) and a container movies (blob MOV1.AVI); a blob is addressed as
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with a Container
Metadata are name/value pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Within the namespace above, each blob (for example, images/PIC01.JPG in the account jared) consists of a sequence of blocks or pages: Block/Page 1, 2, 3, …, with each block identified by a Block ID (Block ID 1 through Block ID N).
Uploading a 10 GB movie as a block blob to Windows Azure Storage:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
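The PutBlock / PutBlockList semantics above can be sketched as an in-memory model (illustrative only, not the Azure API): PutBlock stages uncommitted blocks, and PutBlockList commits a chosen sequence of block IDs as the readable version of the blob, drawing from either the uncommitted or the previously committed list.

```python
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged by PutBlock
        self.committed = {}     # blocks referenced by the last commit
        self.block_list = []    # committed block order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: each id may come from the uncommitted or committed set.
        self.committed = {b: self.uncommitted.get(b, self.committed.get(b))
                          for b in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        # The readable blob is the committed blocks in list order.
        return b"".join(self.committed[b] for b in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
blob.put_block_list(["b1", "b2"])
```

Because nothing is readable until the commit, a failed upload never exposes a half-written blob.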
Create MyBlob: specify a blob size of 10 GBytes with a fixed page size of 512 bytes, giving a 10 GB address space.
Random access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes, then 512 bytes of data stored in [1536,2048)
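Replaying that sequence against a tiny sparse-page model (an illustrative sketch, not the Azure API) reproduces the valid ranges the example reports, with a 512-byte page as the unit of writing and clearing.

```python
PAGE = 512  # fixed page size from the example

class PageBlob:
    def __init__(self, size):
        self.size = size
        self.pages = {}                           # page index -> data (sparse)

    def put_page(self, start, end, data=b"\x01" * PAGE):
        for p in range(start // PAGE, end // PAGE):
            self.pages[p] = data                  # pages become valid

    def clear_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.pages.pop(p, None)               # pages become invalid again

    def get_page_range(self, start, end):
        """Return the contiguous valid byte ranges, like GetPageRange."""
        ranges, run = [], None
        for p in range(start // PAGE, end // PAGE):
            if p in self.pages:
                if run is None:
                    run = [p * PAGE, (p + 1) * PAGE]
                else:
                    run[1] = (p + 1) * PAGE
            elif run is not None:
                ranges.append(tuple(run))
                run = None
        if run is not None:
            ranges.append(tuple(run))
        return ranges

blob = PageBlob(10 * 2**30)          # 10 GB address space
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
```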
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore a prior version of MyBlob via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
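The required-property scheme can be sketched as a property bag keyed by (PartitionKey, RowKey), a toy model of how entities are addressed (illustrative only; keys and property names here are made up): entities sharing a PartitionKey are stored together, which is also the only indexed access path.

```python
table = {}  # (PartitionKey, RowKey) -> property bag

def insert(partition_key, row_key, **properties):
    """Insert an entity; the two keys uniquely identify it."""
    table[(partition_key, row_key)] = properties

def query_partition(partition_key):
    """Efficient in the real service: all entities share one partition."""
    return {k: v for k, v in table.items() if k[0] == partition_key}

insert("barga", "talk-001", title="Azure for Research", views=10)
insert("barga", "talk-002", title="AzureBLAST", views=25)
insert("jackson", "talk-001", title="Azure MODIS", views=7)
```

Querying by any other property would require a scan, which is why the best-practices list below warns that tables only index on partition and row keys.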
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
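The "include retry logic" guideline can be sketched as a small wrapper with bounded retries and exponential backoff (a hypothetical helper, not part of any Azure SDK): transient storage faults are retried, and only a persistent failure is surfaced to the caller.

```python
import time

def with_retries(fn, attempts=4, base_delay=0.01):
    """Call fn, retrying transient IOErrors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except IOError:
            if attempt == attempts - 1:
                raise                               # out of retries: surface it
            time.sleep(base_delay * 2 ** attempt)   # 0.01s, 0.02s, 0.04s, ...

# A stand-in for a storage call that fails transiently twice, then succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return b"payload"

data = with_retries(flaky_read)
```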
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized data center (~1,000 servers) and a large data center (~100,000 servers):

Technology      | Cost in small data center    | Cost in large data center     | Ratio
Network         | $95 per Mbps/month           | $13 per Mbps/month            | 7.1
Storage         | $2.20 per GB/month           | $0.40 per GB/month            | 5.7
Administration  | ~140 servers/administrator   | >1,000 servers/administrator  | 7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers and complex cooling systems separately is not efficient.
Package and deploy into bigger units (JITD).
Comparing HPC systems and data centers along five dimensions:
o Node and system architectures: node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems: HPC uses local scratch (or none), SAN or parallel file systems for secondary storage, and petabytes of tertiary storage; data centers use terabytes of local storage, JBOD for secondary storage, and no tertiary storage
o Reliability and resilience: HPC relies on periodic checkpoints with rollback and resume in response to failures – but MTBF is approaching zero, checkpoint frequency is increasing, and the I/O demand is intolerable; data centers use loosely consistent models designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; memory 1.7 GB; network 100+ Mbps; local storage 500 GB
Up to: CPU 8 cores; memory 14.2 GB; local storage 2+ TB
A closer look at the Azure platform: compute and storage.
Diagram: HTTP requests pass through a load balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() entry point); each role instance runs in a VM alongside an agent, all managed by the Fabric.
Using queues for reliable messaging (to scale, add more of either role):
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of the application, so they are easier to scale independently
• Control resource allocation with different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
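The four-step queue pattern above can be sketched in-process. This is a minimal simulation using Python's standard `queue` module, not the Azure SDK; the names `web_role` and `worker_role` are illustrative only.

```python
import queue
import threading

# In-process stand-in for an Azure queue: web-role code enqueues work
# items, worker-role code dequeues and processes them.
work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def web_role(items):
    """Receive work (step 1) and put it in the queue (step 2)."""
    for item in items:
        work_queue.put(item)

def worker_role():
    """Get work from the queue (step 3) and do it (step 4)."""
    while True:
        item = work_queue.get()
        if item is None:              # sentinel: no more work
            work_queue.task_done()
            return
        with results_lock:
            results.append(item * item)   # the "work"
        work_queue.task_done()

web_role(range(10))
workers = [threading.Thread(target=worker_role) for _ in range(4)]
for w in workers:
    work_queue.put(None)              # one sentinel per worker
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results))                # squares of 0..9
```

Because each message is an independent unit of work, scaling is a matter of adding more worker threads here – or more worker role instances in the real system.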
A closer look at Azure storage.
Diagram: applications on the compute fabric access Blobs, Drives, Tables, and Queues through a REST API over HTTP, behind a load balancer.
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Diagram: develop your app at work or home against the Development Fabric and Development Storage, keeping versions in source control. Verify that the application works locally, then that it works in cloud staging.
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
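The allocation in steps 1-2 can be illustrated with a simple round-robin placement. This is a sketch of the idea only, not the Fabric Controller's actual placement algorithm.

```python
def allocate(instances, fault_domains, update_domains):
    """Spread role instances across fault and update domains
    round-robin, so no single rack failure (fault domain) or
    single rolling update (update domain) takes down every
    instance. Illustrative only."""
    placement = []
    for i in range(instances):
        placement.append({
            "instance": i,
            "fault_domain": i % fault_domains,
            "update_domain": i % update_domains,
        })
    return placement

# Example from the slides: 10 front-ends across 5 update domains.
plan = allocate(10, fault_domains=2, update_domains=5)
per_ud = {}
for p in plan:
    per_ud.setdefault(p["update_domain"], []).append(p["instance"])
print(per_ud)   # 2 instances in each of the 5 update domains
```

During a rolling upgrade, only one update domain is taken offline at a time, so 8 of the 10 front-ends keep serving traffic.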
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
Diagram: the AzureMODIS pipeline – a service web role portal drives a download queue through a data collection stage, a reprojection stage, a derivation reduction stage, and an analysis reduction stage to produce research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers: 100’s of HIV and HepC researchers actively use it, and 1000’s of research communities rely on the results
Cover of PLoS Biology, November 2008
Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
– Requires a large number of test runs for a given job (1–10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure store
3. Deploy the worker roles – each role’s Init() function downloads the data and the BLAST executable and decompresses them to the local disk
Step 2. Partitioning a Job
A web role accepts the user input, and a single partitioning worker role splits it into input partitions in Azure storage, placing one queue message per partition.
Step 3. Doing the Work
BLAST-ready worker roles pick up the queue messages, run BLAST over their input partitions, and write the BLAST output and logs back to Azure storage.
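The partitioning step can be sketched as chunking the input and building one message per chunk. The message fields here are invented for illustration; AzureBLAST's actual message format is not described in the slides.

```python
def partition_job(user_input, partition_size):
    """Split a job's input into fixed-size partitions and build one
    queue message per partition, as in steps 2-3 above."""
    partitions = [user_input[i:i + partition_size]
                  for i in range(0, len(user_input), partition_size)]
    # One queue message per partition; the fields are hypothetical.
    messages = [{"partition_id": i, "size": len(p)}
                for i, p in enumerate(partitions)]
    return partitions, messages

sequences = [f"seq{i}" for i in range(10)]
parts, msgs = partition_job(sequences, partition_size=4)
print(len(parts), [m["size"] for m in msgs])   # 3 partitions: 4, 4, 2
```

As the next slide notes, choosing the partition size well matters: too small and queue-transaction overhead dominates, too large and a single failure wastes a lot of work.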
• Always design with failure in mind – on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the optimal size may change depending on the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! – the little cloud development headaches are probably worth it
Resources vs. time:

Workers | Clock duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13

Time-space fungibility in the cloud: trading resources against time.
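Reading the clock-duration column of the table above (my interpretation of the figures), scaling from 2 to 25 workers cuts the wall clock from 87 to 12 minutes while the computational run time stays roughly constant – which is the time-space trade being claimed.

```python
def to_minutes(hms):
    """Parse an H:MM:SS string into minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

# (workers, clock duration) pairs taken from the table above.
runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00",
        4: "0:47:00", 2: "1:27:00"}

base = to_minutes(runs[2])
for workers, clock in sorted(runs.items()):
    speedup = base / to_minutes(clock)
    print(f"{workers:2d} workers: {to_minutes(clock):5.1f} min, "
          f"speedup {speedup:.2f}x over 2 workers")
```

The 25-worker run is 87/12 = 7.25x faster than the 2-worker run – less than the 12.5x ideal, the gap being coordination and queue overhead.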
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
Diagram: a job definition fans out into tasks against a registry; a registry broker connects the local registry on the user premises (holding highly sensitive data, alongside an (HPC) cluster and its administrator) with web management in the Azure datacenters, returning results and data products to the user.
Client Visualization / Cloud Data and Computation
The Cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example storage namespace: account “jared” contains containers “images” (blobs PIC01.JPG and PIC02.JPG) and “movies” (blob MOV1.AVI), addressed as
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible
Associate metadata with the container: metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Diagram: within an account and container, each blob is stored either as a sequence of blocks (Block Id 1 … Block Id N) or as an array of pages (Page 1, Page 2, Page 3, …).
Uploading a 10 GB movie as a block blob:

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

The committed result is TheBlob.wmv in Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
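The PutBlock/PutBlockList semantics above can be modeled in a few lines: staged blocks are invisible until a block list commits them. This is an in-memory toy, not the Azure storage API.

```python
class BlockBlob:
    """Toy model of block-blob semantics: PutBlock stages uncommitted
    blocks; PutBlockList commits a chosen list, which becomes the
    readable version of the blob."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged only
        self.committed = {}     # block id -> bytes, readable
        self.block_list = []    # committed block order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        new_committed = {}
        for bid in block_ids:
            # a commit may draw from the uncommitted or committed list
            new_committed[bid] = self.uncommitted.get(
                bid, self.committed.get(bid))
        self.committed, self.block_list = new_committed, list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""            # nothing readable before commit
blob.put_block_list(["b1", "b2"])
print(blob.read())                   # b'hello world'
```

Because readers only ever see a committed block list, an interrupted upload never exposes a half-written blob.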
Create MyBlob: specify blob size = 10 GBytes with a fixed page size of 512 bytes, giving a 10 GB address space.

Random access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512) and [1536, 2560)
GetBlob [1000, 2048) returns all 0s for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
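The snapshot/promote lifecycle can be sketched as follows. A minimal illustration of the semantics only (it copies whole versions rather than storing deltas, and is not the Azure blob API).

```python
class SnapshottableBlob:
    """Toy model of blob snapshots: writes go to the base blob,
    snapshots capture read-only versions, and promoting a snapshot
    restores the base blob to that version."""
    def __init__(self, content=b""):
        self.base = content
        self.snapshots = {}     # snapshot id -> frozen content
        self._next_id = 0

    def write(self, content):
        self.base = content     # all writes applied to the base blob

    def snapshot(self):
        sid = self._next_id
        self._next_id += 1
        self.snapshots[sid] = self.base
        return sid

    def promote(self, sid):
        self.base = self.snapshots[sid]   # restore a prior version

blob = SnapshottableBlob(b"v1")
s = blob.snapshot()
blob.write(b"v2")
blob.promote(s)
print(blob.base)    # b'v1'
```

In the real service the snapshots share unchanged pages with the base blob, which is why only deltas cost additional storage.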
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
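The entity model above, keyed on PartitionKey and RowKey, can be modeled in memory. A toy sketch, not the Azure Table service API; it exists to show why point lookups on the keys are cheap while everything else is a scan.

```python
class EntityTable:
    """Toy model of table storage: each entity is a property bag
    indexed only by (PartitionKey, RowKey)."""
    def __init__(self):
        self.entities = {}   # (PartitionKey, RowKey) -> entity

    def insert(self, entity):
        key = (entity["PartitionKey"], entity["RowKey"])
        self.entities[key] = entity

    def get(self, partition_key, row_key):
        """Point lookup on the only indexed properties - fast."""
        return self.entities.get((partition_key, row_key))

    def scan(self, predicate):
        """Any filter on non-key properties is a full scan - this is
        why such queries are expensive at scale."""
        return [e for e in self.entities.values() if predicate(e)]

t = EntityTable()
t.insert({"PartitionKey": "movies", "RowKey": "MOV1.AVI", "Size": 700})
t.insert({"PartitionKey": "images", "RowKey": "PIC01.JPG", "Size": 2})
print(t.get("movies", "MOV1.AVI")["Size"])       # 700
print(len(t.scan(lambda e: e["Size"] < 100)))    # 1
```

In the real service the partition key also governs data placement, so entities sharing a partition key stay on one server and can grow across thousands of servers as partitions multiply.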
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
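The "include retry logic" guideline above is commonly implemented as exponential backoff. A generic sketch under stated assumptions: real code would catch the storage client's specific transient exceptions rather than IOError.

```python
import random
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Retry a data-access operation with exponential backoff plus
    jitter; re-raise only after the final attempt fails."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise
            # back off: base * 2^attempt, with a little random jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# A hypothetical flaky read that fails twice, then succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

result = with_retries(flaky_read)
print(result)   # 'data' after two transient failures
```

The retried operation must be safe to repeat, which is the same reason the design guidance above says workers should execute each task idempotently.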
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications that make it easy to upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 44
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
– Requires a large number of test runs for a given job (1–10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload to the Azure Store
3. Deploy Worker Roles — each role’s Init() function downloads and decompresses the data and BLAST executable to its local disk
Step 2. Partitioning a Job
Diagram: the Web Role takes the user input; a single partitioning Worker Role splits it into input partitions in Azure Storage and posts a queue message for each partition.
Step 3. Doing the Work
Diagram: the BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to storage.
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- Little cloud development headaches are probably worth it
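The first and fourth lessons above — expect failure anywhere, and log enough to find it — combine naturally into a retry wrapper. A minimal Python sketch (the task names and retry count are illustrative, not from the AzureBLAST code):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker")

def run_task(task_id, work, attempts=3):
    """Run one unit of work, retrying on failure and logging where it failed.

    Keeping tasks small (lesson two) is what makes re-running them cheap."""
    for attempt in range(1, attempts + 1):
        try:
            result = work()
            log.info("task %s succeeded on attempt %d", task_id, attempt)
            return result
        except Exception:
            # logging.exception records the traceback, i.e. *where* it failed
            log.exception("task %s failed on attempt %d", task_id, attempt)
    raise RuntimeError(f"task {task_id} failed after {attempts} attempts")

# A flaky task that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "done"

assert run_task("blast-partition-7", flaky) == "done"
assert calls["n"] == 3
```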
Resources
Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
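The timings above can be sanity-checked in a few lines: aggregate run time stays roughly constant as workers are added, while wall-clock duration falls — trading space for time. A quick Python check, with the values copied from the table:

```python
def to_seconds(hms):
    # "h:mm:ss" -> seconds
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# (workers, wall-clock duration, total run time), from the table above
runs = [
    (25, "0:12:00", "2:19:39"),
    (16, "0:15:00", "2:25:12"),
    ( 8, "0:26:00", "2:33:23"),
    ( 4, "0:47:00", "2:34:17"),
    ( 2, "1:27:00", "2:31:39"),
]

totals = [to_seconds(total) for _, _, total in runs]
# Aggregate compute stays within ~11% of its minimum regardless of workers...
assert max(totals) / min(totals) < 1.11
# ...while wall clock improves about 7.25x going from 2 workers to 25.
speedup = to_seconds("1:27:00") / to_seconds("0:12:00")
assert 7.0 < speedup < 7.5
```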
Chart: time–space fungibility in the cloud — the same total work can be run as fewer resources over more time, or more resources over less time.
Utilizes a general jobs-based task manager which registers jobs and their resulting data
Diagram: a job definition fans out into tasks on an (HPC) cluster; a Registry Broker mediates between a local registry on the user premises or internet (where the user and administrator keep highly sensitive data) and the registry, data products, web management, and results hosted in the Azure datacenters.
Client Visualization / Cloud Data and Computation
• The Cloud is not a jack-of-all-trades
• Client-side tools are particularly appropriate for
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
• Our goal then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example namespace — Account: jared; Containers: images (blobs PIC01.JPG, PIC02.JPG) and movies (blob MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
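The account/container/blob hierarchy maps directly onto the URL, as the example above shows. A hypothetical helper (naive: no percent-escaping and no shared-access signatures):

```python
def blob_url(account, container, blob, secure=False):
    """Build the public URL for a blob from the account/container/blob
    hierarchy: http(s)://<account>.blob.core.windows.net/<container>/<blob>."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{account}.blob.core.windows.net/{container}/{blob}"

# Reproduces the URL on the slide above.
assert (blob_url("jared", "images", "PIC01.JPG")
        == "http://jared.blob.core.windows.net/images/PIC01.JPG")
```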
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob
Diagram: each blob in the account/container namespace (e.g. a 10 GB movie) is stored as a sequence of blocks or pages — Block/Page 1, 2, 3, …, N — with each block identified by a Block ID.
Uploading a 10 GB movie to Windows Azure Storage, block by block:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
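The PutBlock / PutBlockList semantics above can be modeled in a few lines of Python. This is a toy in-memory model of the commit behavior, not the real service API: staged blocks are invisible to readers until a block list commits them, and a commit can mix committed and uncommitted blocks.

```python
class ToyBlockBlob:
    """In-memory model of block-blob commit semantics: put_block stages
    uncommitted blocks; put_block_list atomically selects which blocks
    form the readable blob; uncommitted blocks never affect reads."""

    def __init__(self):
        self.uncommitted = {}    # block id -> bytes, staged only
        self.committed = {}      # block id -> bytes, readable
        self.block_list = []     # ordered ids of the readable version

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # May draw from either the committed or the uncommitted set.
        store = {**self.committed, **self.uncommitted}
        self.committed = {bid: store[bid] for bid in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = ToyBlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""                 # nothing committed yet
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"
blob.put_block("b2", b"azure")            # staged, not yet visible...
assert blob.read() == b"hello world"
blob.put_block_list(["b1", "b2"])         # ...until committed
assert blob.read() == b"hello azure"
```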
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges:
[0, 512), [1536, 2560)
GetBlob [1000, 2048) returns:
All 0 for the first 536 bytes
Next 512 bytes are the data stored in [1536, 2048)
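The worked example above can be replayed with a toy sparse page-blob model in Python. Assumptions, not the real REST API: 512-byte pages, immediate writes, and zero-filled reads of unwritten ranges; the byte values written are arbitrary markers.

```python
PAGE = 512

class ToyPageBlob:
    """Sparse page-blob model: put_page writes immediately, clear_page
    frees pages, reads of unwritten ranges return zeros."""

    def __init__(self, size):
        self.size = size
        self.pages = {}          # page-aligned offset -> 512 bytes

    def put_page(self, start, data):
        for off in range(0, len(data), PAGE):
            self.pages[start + off] = data[off:off + PAGE]

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self):
        """Return the valid (written) ranges, coalescing adjacent pages."""
        ranges, run = [], None
        for off in sorted(self.pages):
            if run and off == run[1]:
                run[1] = off + PAGE
            else:
                run = [off, off + PAGE]
                ranges.append(run)
        return [tuple(r) for r in ranges]

    def read(self, start, end):
        out = bytearray()
        for off in range(start, end):
            page = self.pages.get(off - off % PAGE)
            out.append(page[off % PAGE] if page else 0)
        return bytes(out)

b = ToyPageBlob(10 * 2**30)
b.put_page(512, b"\x02" * 1536)     # PutPage [512, 2048)
b.put_page(0, b"\x01" * 1024)       # PutPage [0, 1024)
b.clear_page(512, 1536)             # ClearPage [512, 1536)
b.put_page(2048, b"\x03" * 512)     # PutPage [2048, 2560)
assert b.get_page_ranges() == [(0, 512), (1536, 2560)]   # as on the slide
data = b.read(1000, 2048)           # GetBlob [1000, 2048)
assert data[:536] == b"\x00" * 536  # zeros up to offset 1536
assert data[536:] == b"\x02" * 512  # then the data surviving in [1536, 2048)
```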
Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks
Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore a blob (e.g. MyBlob) to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS
single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
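The entity model above — PartitionKey, RowKey, and Timestamp required, everything else a schemaless property bag — can be sketched with an in-memory stand-in (illustrative only, not the real table service or its .NET/REST API):

```python
import time

class ToyTable:
    """Schemaless table keyed by (PartitionKey, RowKey). Entities are
    plain property bags; the two keys plus Timestamp are the only
    required properties."""

    def __init__(self):
        self.entities = {}

    def insert(self, entity):
        key = (entity["PartitionKey"], entity["RowKey"])
        self.entities[key] = {**entity, "Timestamp": time.time()}

    def get(self, partition_key, row_key):
        # Point lookup by both keys: the efficient access path.
        return self.entities.get((partition_key, row_key))

    def query_partition(self, partition_key):
        # Queries on anything other than the keys degrade to a scan.
        return [e for (pk, _), e in sorted(self.entities.items())
                if pk == partition_key]

t = ToyTable()
t.insert({"PartitionKey": "movies", "RowKey": "MOV1.AVI", "SizeMB": 700})
t.insert({"PartitionKey": "images", "RowKey": "PIC01.JPG", "Width": 1024})
t.insert({"PartitionKey": "images", "RowKey": "PIC02.JPG", "Width": 800})

assert t.get("images", "PIC01.JPG")["Width"] == 1024
assert [e["RowKey"] for e in t.query_partition("images")] == ["PIC01.JPG", "PIC02.JPG"]
```

The partition key also determines how entities spread across servers as the table scales, which is why the best-practices list below stresses that tables only index on the partition and row keys.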
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
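The advice above to batch multiple small tasks into a single queue message (cutting per-message storage transactions) can be sketched as a simple packer. The 8 KB budget is an assumed message-size limit for illustration, and the task shape is hypothetical:

```python
import json

def batch_tasks(tasks, max_bytes=8 * 1024):
    """Pack many small tasks into as few queue messages as will fit,
    so one queue transaction carries many units of work."""
    batches, current = [], []
    for task in tasks:
        candidate = current + [task]
        if len(json.dumps(candidate).encode()) > max_bytes and current:
            batches.append(json.dumps(current))   # flush the full message
            current = [task]
        else:
            current = candidate
    if current:
        batches.append(json.dumps(current))
    return batches

tasks = [{"seq": i, "op": "blast"} for i in range(1000)]
messages = batch_tasks(tasks)
# Far fewer storage transactions than one message per task:
assert len(messages) < 50
assert sum(len(json.loads(m)) for m in messages) == 1000
```

The trade-off is the same one noted for node-to-node communication: larger batches mean more work is redelivered if a worker fails mid-message, so batch size should track how cheaply a task can be re-run.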
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 45
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 46
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
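The four-step queue pattern above can be sketched with an in-memory queue standing in for an Azure queue (the `web_role`/`worker_role` names are illustrative, not the Azure API):

```python
import queue
import threading

work_queue = queue.Queue()  # stands in for an Azure queue
results = []

def web_role(item):
    # 1) Receive work, 2) put it in the queue
    work_queue.put(item)

def worker_role():
    # 3) Get work from the queue, 4) do the work
    while True:
        item = work_queue.get()
        if item is None:          # sentinel: shut this worker down
            work_queue.task_done()
            break
        results.append(item * 2)  # placeholder for real work
        work_queue.task_done()    # analogous to deleting the message

# "To scale, add more of either": here, two workers drain one queue.
workers = [threading.Thread(target=worker_role) for _ in range(2)]
for w in workers:
    w.start()
for n in range(5):
    web_role(n)
for _ in workers:
    work_queue.put(None)
for w in workers:
    w.join()
print(sorted(results))  # [0, 2, 4, 6, 8]
```

Because the queue decouples the two roles, either side can be scaled independently, which is exactly the "application glue" point above.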
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
A closer look at storage
[Diagram: compute and other applications access Windows Azure Storage – Blobs, Drives, Tables, Queues – over HTTP through a load balancer and a REST API; storage runs on the same Fabric.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
The development lifecycle
[Diagram: at work or home, develop your app against the local Development Fabric and Development Storage, using source and version control; verify the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update
Example: a set of nodes to update
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
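The allocation rule above can be sketched as a round-robin assignment of role instances to update domains (the function name is illustrative, not the Fabric Controller's actual algorithm):

```python
def allocate(instance_count, update_domain_count):
    """Assign role instances round-robin across update domains, so taking
    any one domain down for an update leaves the others still serving."""
    return {i: i % update_domain_count for i in range(instance_count)}

# The example from the slide: 10 front-ends across 5 update domains.
domains = allocate(10, 5)
per_domain = [list(domains.values()).count(d) for d in range(5)]
print(per_domain)  # [2, 2, 2, 2, 2] – each domain holds 2 instances
```

Updating one domain at a time therefore takes at most 20% of the front-ends offline at any moment.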
Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 km
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: AzureMODIS architecture – the AzureMODIS service web role portal feeds a download queue; imagery flows through the data collection stage, reprojection stage, derivation reduction stage, and analysis reduction stage to produce research results.]
PhyloD
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
- 100’s of HIV and HepC researchers actively use it
- 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job, 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
- Requires a large number of test runs for a given job (1 – 10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure store
3. Deploy worker roles – the Init() function downloads and decompresses the data to the local disk
[Diagram: local sequence database → compressed → uploaded to Azure Storage → deployed to worker roles running the BLAST executable]
Step 2. Partitioning a Job
[Diagram: the web role takes the user input, writes input partitions to Azure Storage, and sends a queue message to a single partitioning worker role.]
Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.]
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- Little Cloud development headaches are probably worth it
Resources vs. time for a fixed AzureBLAST job:

Workers | Clock duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13

[Chart: time–space fungibility in the cloud – roughly the same total computation can be run as many workers for a short time or as few workers for a long time.]
Utilizes a general jobs-based task manager, which registers jobs and their resulting data.
[Diagram: a job definition is split into tasks; a registry broker links a local registry on the user premises (or internet) – where highly sensitive data stays under an (HPC) cluster administrator – with a registry in the Azure data centers; the user submits jobs and retrieves results and data products through web management.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose a geo-location to host the storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account: jared
Container: images → Blobs: PIC01.JPG, PIC02.JPG
Container: movies → Blob: MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the Account → Container → Blob hierarchy, each blob is composed of a sequence of blocks (Block Id 1 … Block Id N) or an array of pages.]
Example: uploading a 10 GB movie
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
The committed blocks become TheBlob.wmv in Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
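These commit semantics can be modeled with a toy class (a sketch of the behavior described above, not the real storage API): blocks are staged uncommitted, and only PutBlockList atomically decides which blocks form the readable blob.

```python
class BlockBlob:
    """Toy model of block-blob commit semantics (not the Azure API)."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged by put_block
        self.committed = {}     # block_id -> bytes, part of the readable blob
        self.block_list = []    # ordered block IDs of the readable version

    def put_block(self, block_id, data):
        assert len(data) <= 4 * 1024 * 1024, "blocks can be up to 4 MB"
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Blocks may come from the uncommitted or the committed list.
        self.committed = {b: self.uncommitted.get(b, self.committed.get(b))
                          for b in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[b] for b in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""          # nothing readable before the commit
blob.put_block_list(["b1", "b2"])
print(blob.read())                 # b'hello world'
```

The two-phase shape is why readers never see a half-uploaded streaming blob: until the block list is put, the previous committed version (here, empty) stays readable.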
Create MyBlob
Specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space)
Random access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
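The range bookkeeping above can be sketched by tracking which 512-byte pages hold data (a toy model of the behavior, not the service API):

```python
PAGE = 512

class PageBlob:
    """Toy model of page-blob valid-range tracking (not the Azure API)."""
    def __init__(self, size):
        self.size = size
        self.valid = set()              # indices of pages holding data

    def put_page(self, start, end):
        self.valid |= set(range(start // PAGE, end // PAGE))

    def clear_page(self, start, end):
        self.valid -= set(range(start // PAGE, end // PAGE))

    def get_page_ranges(self):
        # Coalesce valid page indices into [start, end) byte ranges.
        ranges, run = [], []
        for p in sorted(self.valid):
            if run and p != run[-1] + 1:
                ranges.append((run[0] * PAGE, (run[-1] + 1) * PAGE))
                run = []
            run.append(p)
        if run:
            ranges.append((run[0] * PAGE, (run[-1] + 1) * PAGE))
        return ranges

# The exact operation sequence from the example above:
b = PageBlob(10 * 2**30)
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
print(b.get_page_ranges())  # [(0, 512), (1536, 2560)]
```

Running the slide's sequence reproduces its answer: the valid ranges are [0, 512) and [1536, 2560), so a read at [1000, 2048) sees zeros until offset 1536.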
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a
single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
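The two required keys are the whole access path: a (PartitionKey, RowKey) lookup is a direct hit, while any other filter scans. A toy model of that data model (not the table service itself):

```python
class Table:
    """Toy model of an Azure-style table: entities are located by
    (PartitionKey, RowKey); all other properties are schema-free."""
    def __init__(self):
        self.partitions = {}  # partition_key -> {row_key -> entity}

    def insert(self, entity):
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        self.partitions.setdefault(pk, {})[rk] = entity

    def get(self, pk, rk):
        # Point query: the only indexed access path.
        return self.partitions[pk][rk]

    def scan(self, predicate):
        # Any filter on other properties is a full scan.
        return [e for rows in self.partitions.values()
                for e in rows.values() if predicate(e)]

t = Table()
t.insert({"PartitionKey": "movies", "RowKey": "MOV1", "Title": "Ocean"})
t.insert({"PartitionKey": "images", "RowKey": "PIC01", "Size": 42})
print(t.get("movies", "MOV1")["Title"])    # Ocean
print(len(t.scan(lambda e: "Size" in e)))  # 1
```

Partitioning by PartitionKey is also what lets the service spread a table over many servers as traffic grows: each partition can live on a different server.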
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
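"Batch multiple small tasks into a single queue message" trades storage transactions (which are billed and rate-limited) for slightly coarser work units; a sketch of the idea, with an illustrative JSON message format:

```python
import json

def batch_tasks(tasks, batch_size):
    """Pack several small tasks into one queue message body, cutting
    storage transactions by roughly a factor of batch_size."""
    for i in range(0, len(tasks), batch_size):
        yield json.dumps(tasks[i:i + batch_size])

tasks = [{"seq_id": n} for n in range(10)]
messages = list(batch_tasks(tasks, 4))
print(len(messages))                 # 3 messages instead of 10
print(len(json.loads(messages[0])))  # the first message carries 4 tasks
```

The worker side simply loops over the tasks decoded from one message before deleting it.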
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
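"Include retry logic in all instances where you are accessing data" might look like this generic exponential-backoff wrapper (names are illustrative; the Storage Client Library ships its own retry policies):

```python
import random
import time

def with_retries(operation, attempts=4, base_delay=0.1):
    """Retry a storage call with exponential backoff and jitter;
    transient faults are expected in any cloud service."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise                  # out of attempts: surface the fault
            time.sleep(base_delay * 2 ** attempt * random.random())

calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:                 # fail twice, then succeed
        raise IOError("transient storage fault")
    return "blob bytes"

result = with_retries(flaky_read)
print(result)                          # succeeds on the third attempt
```

The jitter matters at scale: without it, many worker roles retrying in lockstep can hammer a recovering store at the same instant.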
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What
are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 47
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
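For scale, the compute figure works out as follows (simple arithmetic on the numbers above):

```python
# Rough scale of the award, from the figures stated above
core_hours_per_year = 20_000_000
hours_per_year = 365 * 24                  # 8,760
continuous_cores = core_hours_per_year / hours_per_year
# about 2,283 cores running around the clock for a full year
```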
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 49
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
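The PutBlock / PutBlockList semantics above can be sketched as a toy in-memory model (Python; the class and method names are illustrative, not the Storage Client Library API): blocks are staged uncommitted, and only PutBlockList makes a chosen sequence readable.

```python
class BlockBlob:
    """Toy model of block-blob semantics: blocks are staged
    uncommitted, then put_block_list makes a chosen sequence readable."""

    def __init__(self, name):
        self.name = name
        self.uncommitted = {}   # block_id -> bytes, staged but not readable
        self.committed = {}     # block_id -> bytes, part of the readable blob
        self.block_list = []    # ordered ids defining the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Each id may come from the uncommitted or the committed set.
        new_committed = {}
        for bid in block_ids:
            if bid in self.uncommitted:
                new_committed[bid] = self.uncommitted[bid]
            elif bid in self.committed:
                new_committed[bid] = self.committed[bid]
            else:
                raise KeyError(bid)
        self.committed = new_committed
        self.block_list = list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob("TheBlob.wmv")
blob.put_block("b1", b"AAAA")
blob.put_block("b2", b"BBBB")
assert blob.read() == b""            # nothing readable before commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"AAAABBBB"
```

Updating the blob then reuses committed blocks: staging one new block and committing `["b2", "b3"]` keeps the old `b2` bytes without re-uploading them.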
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are data stored in [1536, 2048)
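The operation sequence above can be replayed against a small in-memory sketch (Python; a hypothetical model of page tracking, not the real service) to confirm the valid-range result:

```python
PAGE = 512

class PageBlob:
    """Toy page-blob model: a sparse set of written pages; reads return
    zeros for pages that were never written or were cleared."""

    def __init__(self, size):
        self.size = size
        self.pages = {}  # page-aligned offset -> PAGE bytes of data

    def put_page(self, start, end, fill=b"\x01"):
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self, start, end):
        """Return [start, end) intervals that currently hold valid data."""
        ranges, cur = [], None
        for off in range(start, end, PAGE):
            if off in self.pages:
                if cur is None:
                    cur = [off, off + PAGE]
                else:
                    cur[1] = off + PAGE
            elif cur is not None:
                ranges.append(tuple(cur))
                cur = None
        if cur is not None:
            ranges.append(tuple(cur))
        return ranges

# Replay the sequence from the slide:
b = PageBlob(10 * 2**30)
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
assert b.get_page_ranges(0, 4096) == [(0, 512), (1536, 2560)]
```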
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
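A minimal sketch of the snapshot/promote behavior described above (Python; illustrative only — the real service stores only deltas, while this toy keeps full copies):

```python
class SnapshottableBlob:
    """Toy model: a snapshot captures the blob's current contents;
    promoting a snapshot restores the blob to that version."""

    def __init__(self, data=b""):
        self.data = data
        self.snapshots = []  # prior versions, indexed by snapshot id

    def write(self, data):
        # All writes are applied to the base blob name.
        self.data = data

    def snapshot(self):
        self.snapshots.append(self.data)
        return len(self.snapshots) - 1

    def promote(self, snap_id):
        self.data = self.snapshots[snap_id]

b = SnapshottableBlob(b"v1")
s = b.snapshot()
b.write(b"v2")     # base blob moves on...
b.promote(s)       # ...and is restored to the prior version
assert b.data == b"v1"
```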
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
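The entity model above can be sketched as a dictionary-backed store (Python; purely illustrative) where every entity carries the three required properties and is addressed by (PartitionKey, RowKey):

```python
import time

class Table:
    """Toy Azure-table model: entities are property bags addressed by
    (PartitionKey, RowKey); full-key lookups are direct."""

    def __init__(self):
        self.entities = {}  # (partition_key, row_key) -> property dict

    def insert(self, partition_key, row_key, **props):
        props.update(PartitionKey=partition_key, RowKey=row_key,
                     Timestamp=time.time())
        self.entities[(partition_key, row_key)] = props

    def get(self, partition_key, row_key):
        return self.entities[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # Cheap in the real service too: touches only one partition.
        return [e for (pk, _), e in self.entities.items()
                if pk == partition_key]

t = Table()
t.insert("movies", "MOV1.AVI", size_mb=700)
t.insert("images", "PIC01.JPG", size_mb=2)
assert t.get("movies", "MOV1.AVI")["size_mb"] == 700
assert len(t.query_partition("images")) == 1
```

This is also why the best practice below says tables only index on partition and row keys: any other predicate forces a scan.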
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
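The first Testing & Development bullet above — retry logic around every data access — can be sketched as a small wrapper (Python; the attempt count and backoff constants are illustrative choices, not recommended values):

```python
import random
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Call op(); on failure, retry with jittered exponential backoff,
    re-raising only after the final attempt."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i) * random.uniform(0.5, 1.5))

# A stand-in for a storage call that fails transiently twice:
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "data"

assert with_retries(flaky_read) == "data"
assert calls["n"] == 3
```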
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized
center (1,000 servers) and a larger,
100K-server center.
Technology      Cost in Small-sized Data Center   Cost in Large Data Center     Ratio
Network         $95 per Mbps/month                $13 per Mbps/month            7.1
Storage         $2.20 per GB/month                $0.40 per GB/month            5.7
Administration  ~140 servers/Administrator        >1000 servers/Administrator   7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
- Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
- HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
- DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
- HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable
- DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform: Compute and Storage
A closer look at Compute: HTTP requests pass through a Load Balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (main() { … }); each role instance runs in a VM with an Agent, managed by the Fabric.
Using queues for reliable messaging
To scale, add more of either role:
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
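The web-role/queue/worker-role flow above maps onto a few lines with a shared queue (Python's standard `queue` standing in for an Azure queue; a toy sketch, not durable messaging):

```python
import queue
import threading

work = queue.Queue()   # the application glue between the roles
results = []
lock = threading.Lock()

def web_role(n_items):
    """1) receive work, 2) put work in queue."""
    for i in range(n_items):
        work.put(i)

def worker_role():
    """3) get work from queue, 4) do work."""
    while True:
        item = work.get()
        if item is None:        # sentinel: no more work
            break
        with lock:
            results.append(item * item)

web_role(10)
workers = [threading.Thread(target=worker_role) for _ in range(3)]
for t in workers:
    t.start()
for _ in workers:               # one sentinel per worker
    work.put(None)
for t in workers:
    t.join()
assert sorted(results) == [i * i for i in range(10)]
```

Scaling either side is just "add more of either": more web roles enqueue faster, more worker roles drain faster, and neither needs to know about the other.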
A closer look at Storage: applications and Compute roles on the Fabric access Blobs, Drives, Tables, and Queues over HTTP through a REST API behind the Load Balancer.
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Develop your app at work or home against the local Development Fabric and Development Storage, with source control for versioning; once the application works locally, promote it to staging in the cloud.
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
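The "10 front-ends across 5 update domains" example above falls out of simple round-robin placement, sketched here (Python; the Fabric Controller's actual algorithm also balances fault domains and capacity):

```python
from collections import Counter

def allocate(instances, update_domains):
    """Round-robin role instances across update domains so updating any
    one domain takes down at most ceil(instances / domains) of them."""
    return {i: i % update_domains for i in range(instances)}

placement = allocate(10, 5)
per_domain = Counter(placement.values())
# 10 front-ends across 5 update domains -> 2 instances per domain,
# so a rolling update never takes down more than 2 front-ends at once.
assert len(per_domain) == 5
assert all(count == 2 for count in per_domain.values())
```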
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
Load Balancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
AzureMODIS pipeline (driven through the Service Web Role Portal): Download Queue → Data Collection Stage → Reprojection Stage → Derivation Reduction Stage → Analysis Reduction Stage → Research Results
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
- 100’s of HIV and HepC researchers actively use it
- 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
– Requires a large number of test runs for a given job (1 – 10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (local sequence database → compressed)
2. Upload to Azure Storage
3. Deploy Worker Roles with the BLAST executable
- Init() function downloads and decompresses data to the local disk
Step 2. Partitioning a Job
User input goes to the Web Role; a single partitioning Worker Role writes input partitions and queue messages to Azure Storage.
Step 3. Doing the Work
BLAST-ready Worker Roles pick up queue messages, read their input partitions, and write BLAST output and logs back to Azure Storage.
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
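The scaling in the timing figures above can be checked with quick arithmetic (Python; treating the 2-worker run as the baseline — efficiency falls off as workers increase, since clock duration includes fixed overheads):

```python
# Clock durations in minutes, taken from the timing data above.
clock = {25: 12, 16: 15, 8: 26, 4: 47, 2: 87}

base_workers, base_time = 2, clock[2]
summary = {}
for workers in (4, 8, 16, 25):
    speedup = base_time / clock[workers]
    efficiency = speedup / (workers / base_workers)
    summary[workers] = (speedup, efficiency)
    print(f"{workers:2d} workers: {speedup:.2f}x speedup, "
          f"{efficiency:.0%} parallel efficiency")
```

Total run time barely moves because it is dominated by the aggregate computational work, which is roughly constant; only the wall-clock duration shrinks as workers are added.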
Time-Space fungibility in the Cloud
Utilizes a general jobs-based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 51
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, "EOS AM", launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, "EOS PM", launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Pipeline diagram: AzureMODIS Service Web Role Portal. Download Queue feeds the Data Collection Stage, followed by the Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage, producing the Research Results]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
- 100s of HIV and HepC researchers actively use it
- 1000s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10 - 20 CPU hours; extreme jobs require 1K - 2K CPU hours
- Requires a large number of test runs for a given job (1 - 10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload to Azure Storage
3. Deploy Worker Roles with the BLAST executable
- An Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
[Diagram: the Web Role receives the user input; a single partitioning Worker Role writes each input partition to Azure Storage and enqueues one queue message per partition]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles dequeue the messages, read their input partitions from Azure Storage, and write the BLAST output and logs back to Azure Storage]
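Steps 2 and 3 amount to a producer/consumer pattern over a queue. A minimal sketch in Python, with an in-memory queue and dicts standing in for Azure Queue and blob storage (all names are illustrative):

```python
from queue import Queue

def partition_job(sequences, partition_size, queue, storage):
    """Step 2: a single partitioning role splits the user input,
    stores each partition, and enqueues one message per partition."""
    for i in range(0, len(sequences), partition_size):
        key = f"partition-{i // partition_size}"
        storage[key] = sequences[i:i + partition_size]
        queue.put(key)

def blast_worker(queue, storage, results):
    """Step 3: a BLAST-ready worker drains messages, reads its
    partition from storage, and writes output back."""
    while not queue.empty():
        key = queue.get()
        results[key] = [f"hit:{seq}" for seq in storage[key]]

q, store, out = Queue(), {}, {}
partition_job(["s1", "s2", "s3", "s4", "s5"], 2, q, store)
blast_worker(q, store, out)   # one result set per partition
```

Because each message names an independent partition, adding more workers scales the run without changing the partitioning role.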
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it's good to know where
• Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it
Resources
Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
[Chart: resources vs. time. Time-space fungibility in the cloud: the same job can run on many workers for a short time or on few workers for a long time]
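The rough arithmetic behind time-space fungibility, computed from the clock-duration column (a small Python helper for illustration):

```python
def to_minutes(hms):
    """'H:MM:SS' -> minutes, for the clock-duration column above."""
    h, m, s = map(int, hms.split(":"))
    return h * 60 + m + s / 60

clock = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}
baseline = to_minutes(clock[2])
for workers, duration in clock.items():
    # Speedup relative to the 2-worker run: trading space for time
    # while the computational run time stays near two hours.
    print(workers, "workers:", round(baseline / to_minutes(duration), 2), "x")
```

For example, 25 workers finish in 12 minutes against 87 minutes for 2 workers, a 7.25x reduction in wall-clock time for the same total computation.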
Uses a general job-based task manager that registers jobs and their resulting data products
[Architecture diagram, components: a job definition expanded into tasks; a registry and Registry Broker; an (HPC) cluster and its administrator; a user with a local registry of highly sensitive data, web management, and results on the user premises (or internet); data products in the Azure datacenters]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal, then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Diagram: storage account jared contains the containers images (blobs PIC01.JPG, PIC02.JPG) and movies (blob MOV1.AVI)]
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within account jared, container images holds blobs PIC01.JPG and PIC02.JPG, and container movies holds MOV1.AVI; a blob such as a 10 GB movie is stored as a sequence of blocks or pages 1, 2, 3, ..., N, each block carrying its own Block ID]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
[Diagram: the committed blocks become TheBlob.wmv, a 10 GB movie in Windows Azure Storage]
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The Block ID and size of each block are returned
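The commit semantics above can be captured in a toy model. This is an illustrative Python class, not the Storage Client Library; real access is via the REST or .NET APIs.

```python
class BlockBlob:
    """Minimal model of block-blob commit semantics: PutBlock stages
    uncommitted blocks; PutBlockList atomically publishes a new
    readable version assembled from staged or committed blocks."""
    def __init__(self):
        self.uncommitted, self.committed = {}, {}
        self.block_list = []

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data      # staged, not yet readable

    def put_block_list(self, block_ids):
        pool = {**self.committed, **self.uncommitted}
        self.committed = {bid: pool[bid] for bid in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"AAAA")
blob.put_block("b2", b"BBBB")
blob.put_block_list(["b1", "b2"])   # readable version is now AAAABBBB
```

Readers never see a half-uploaded blob: staged blocks are invisible until the block list commits them.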
Create MyBlob
Specify blob size = 10 GB, fixed page size = 512 bytes
Random-access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns:
All 0 for the first 536 bytes
Next 512 bytes are the data stored in [1536, 2048)
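That sequence can be checked against a toy model of page-blob semantics (illustrative Python with invented class names; writes land immediately on 512-byte pages, and unwritten or cleared pages read back as zeros):

```python
PAGE = 512

class PageBlob:
    """Toy model: a sparse map of 512-byte pages over a large address space."""
    def __init__(self, size):
        self.size, self.pages = size, {}

    def put_page(self, start, end, data):
        for off in range(start, end, PAGE):
            self.pages[off] = data[off - start: off - start + PAGE]

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self):
        ranges = []
        for off in sorted(self.pages):          # merge adjacent valid pages
            if ranges and ranges[-1][1] == off:
                ranges[-1][1] = off + PAGE
            else:
                ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

    def get_blob(self, start, end):
        out = bytearray()
        for pos in range(start, end):
            page = self.pages.get(pos - pos % PAGE)
            out.append(page[pos % PAGE] if page else 0)  # zeros where invalid
        return bytes(out)

blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048, b"x" * 1536)
blob.put_page(0, 1024, b"y" * 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560, b"z" * 512)
blob.get_page_ranges()   # [(0, 512), (1536, 2560)], as on the slide
```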
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
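The PartitionKey/RowKey pair is the table's only index, which shapes how queries perform. A minimal in-memory model (illustrative Python, not the ADO.NET Data Services API):

```python
class Table:
    """Entities are property dicts addressed by (PartitionKey, RowKey);
    that composite key is the only index, so keyed lookups are cheap
    and any other filter is a scan."""
    def __init__(self):
        self.partitions = {}

    def insert(self, entity):
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        self.partitions.setdefault(pk, {})[rk] = entity

    def get(self, pk, rk):                      # indexed point lookup
        return self.partitions[pk][rk]

    def scan(self, predicate):                  # non-key filter: full scan
        return [e for p in self.partitions.values()
                for e in p.values() if predicate(e)]

t = Table()
t.insert({"PartitionKey": "movies", "RowKey": "MOV1.AVI", "Size": 700})
t.insert({"PartitionKey": "images", "RowKey": "PIC01.JPG", "Size": 2})
t.get("movies", "MOV1.AVI")["Size"]   # fast, keyed access
```

This is why the later best practice says to remember that Azure tables only index on partition and row keys: queries on other properties behave like the scan above.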
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
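The retry-logic bullet above can be sketched as exponential backoff with jitter around any data access. A minimal Python sketch; the helper and its parameters are invented for illustration:

```python
import random
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Retry a call on transient failure with exponential backoff and
    jitter; re-raise only after the final attempt fails."""
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.random())

calls = {"n": 0}
def flaky_read():
    """Stand-in for a storage call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "blob-bytes"

result = with_retries(flaky_read)   # succeeds on the third attempt
```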
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 52
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 53
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  – Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on each node
o Communication fabric
o Storage systems
  – HPC: local scratch small or non-existent, secondary is SAN or PFS, petabytes of tertiary storage
  – Data center: terabytes of local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  – HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  – Data center: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
A highly-available Fabric Controller (FC) manages the nodes.
At minimum: CPU 1.5–1.7 GHz x64, 1.7 GB memory, 100+ Mbps network, 500 GB local storage
Up to: 8 cores, 14.2 GB memory, 2+ TB local storage
Azure Platform: Compute and Storage. A closer look at Compute:
[Diagram: HTTP requests pass through a load balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } loop); an Agent on each VM reports to the Fabric.]
Using queues for reliable messaging (to scale, add more of either role):
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in a queue
3) A Worker Role (main() { … }) gets the work from the queue
4) The Worker Role does the work
Queues are the application glue:
• Decouple parts of the application, so each is easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance:
• TCP communication between role instances
• Define your ports in the service model
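The web-role/worker-role queue pattern above can be sketched with an in-memory analogue. Python's standard `queue` stands in for an Azure queue here (a real Azure queue adds visibility timeouts and explicit message deletion); all names are illustrative.

```python
import queue
import threading

# Minimal in-memory analogue of the web-role / worker-role pattern.
work_queue = queue.Queue()
results = []

def web_role(items):
    """1) Receive work, 2) put each item in the queue."""
    for item in items:
        work_queue.put(item)

def worker_role():
    """3) Get work from the queue, 4) do the work."""
    while True:
        item = work_queue.get()
        if item is None:          # sentinel: no more work
            break
        results.append(item * item)

web_role([1, 2, 3, 4])
workers = [threading.Thread(target=worker_role) for _ in range(2)]
for _ in workers:
    work_queue.put(None)          # one stop sentinel per worker (FIFO: after the work)
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results))            # [1, 4, 9, 16]
```

To scale, you add more producers or more consumers; neither side needs to know how many of the other exist, which is exactly the decoupling the slide describes.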
A closer look at Storage:
[Diagram: Applications and Compute reach Blobs, Drives, Tables, and Queues in Azure Storage through a load balancer, via a REST API over HTTP.]
Points of interest
Storage types:
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access:
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps and by other on-premises or cloud applications
[Diagram: Development workflow, at work or at home. Develop and run your app locally against the Development Fabric and Development Storage, with versions in source control; verify the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update – for example, a set of nodes to update – used when rolling forward or backward
The developer assigns the number required by each role; for example, 10 front-ends across 5 update domains
Allocation is across update domains
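The 10-front-ends-across-5-domains example can be sketched as a round-robin allocation. This is a hypothetical illustration of the idea, not the Fabric Controller's actual placement algorithm; names are invented.

```python
# Spread role instances across update domains round-robin:
# 10 front-ends across 5 update domains, as in the example.

def allocate(instances, num_domains):
    domains = {d: [] for d in range(num_domains)}
    for i in range(instances):
        domains[i % num_domains].append(f"frontend-{i}")
    return domains

domains = allocate(10, 5)
# Each update domain holds 2 of the 10 front-ends, so updating one
# domain at a time leaves 80% of capacity serving traffic.
assert all(len(members) == 2 for members in domains.values())
print(domains[0])   # ['frontend-0', 'frontend-5']
```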
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles – failed roles are automatically restarted, and node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles:
The FC detects if a role dies, and a role can indicate that it is unhealthy
The current state of the node is updated appropriately, and the state machine kicks in again to drive us back to the goal state
Windows Azure FC monitors the health of hosts:
If a node goes offline, the FC will try to recover it
If a failed node can’t be recovered, the FC migrates role instances to a new node: a suitable replacement location is found and existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
[Diagram: AzureMODIS pipeline. A portal and the AzureMODIS service web role drive a download queue through a data collection stage, a reprojection stage, a derivation reduction stage, and an analysis reduction stage to produce research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• A small but important group of researchers: 100s of HIV and HepC researchers actively use it, and 1000s of research communities rely on its results
Cover of PLoS Biology, November 2008
• Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  – Requires a large number of test runs for a given job (1–10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure store
3. Deploy worker roles – the Init() function downloads the compressed data and the BLAST executable and decompresses them to the local disk
Step 2. Partitioning a Job
A web role takes the user input, and a single partitioning worker role writes input partitions to Azure storage, with a queue message for each.
Step 3. Doing the Work
BLAST-ready worker roles pick up queue messages, read their input partition from Azure storage, and write BLAST output and logs back to Azure storage.
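The partitioning step can be sketched as slicing the input into fixed-size chunks, one queue message per chunk. The chunk size and names here are made up for illustration; AzureBLAST's actual partitioning logic is not shown.

```python
# Sketch of a partitioning role: slice user input into fixed-size
# chunks, each destined for storage plus one queue message.

def partition(sequences, chunk_size):
    """Yield (partition_id, chunk) pairs to be stored and enqueued."""
    for i in range(0, len(sequences), chunk_size):
        yield i // chunk_size, sequences[i:i + chunk_size]

user_input = [f"seq{n}" for n in range(10)]
messages = list(partition(user_input, 4))
print(messages[0])    # (0, ['seq0', 'seq1', 'seq2', 'seq3'])
print(len(messages))  # 3 partitions: sizes 4, 4, 2
```

As the lessons below note, the choice of chunk size has a large performance impact, so in practice it would be tuned per job.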
• Always design with failure in mind – on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the optimal size may change depending on the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! – little cloud development headaches are probably worth it
Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
 8        0:26:00          2:33:23          2:00:14
 4        0:47:00          2:34:17          2:01:06
 2        1:27:00          2:31:39          1:59:13
[Chart: time–space fungibility in the cloud – resources vs. time.]
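One way to read the table: multiplying workers by clock duration gives the resources consumed at each scale. The short sketch below computes this from the clock-duration column (function names are invented), illustrating the time–space trade the chart depicts.

```python
# Worker-minutes consumed at each scale, from the clock durations above.
# Wall-clock time shrinks almost linearly as workers are added, while
# the total resource cost (workers x time) changes only modestly:
# time and space are fungible in the cloud.

def minutes(hms):
    h, m, s = map(int, hms.split(":"))
    return h * 60 + m + s / 60

runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}
worker_minutes = {w: w * minutes(t) for w, t in runs.items()}
print(worker_minutes)  # {25: 300.0, 16: 240.0, 8: 208.0, 4: 188.0, 2: 174.0}
```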
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: On the user premises (or internet), the user and the (HPC) cluster administrator work against a local registry, web management, and highly sensitive data; a registry broker in the Azure datacenters registers job definitions, their tasks, and the resulting data products and results.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a jack-of-all-trades
• Client-side tools are particularly appropriate for:
  Applications using peripheral devices
  Applications with heavy graphics requirements
  Legacy user interfaces that would be difficult to port
• Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
The user creates a globally unique storage account name
Can choose a geo-location to host the storage account: “US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Diagram: Storage namespace. The account “jared” contains containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI), e.g.
http://jared.blob.core.windows.net/images/PIC01.JPG]
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible
Associate metadata with a container: metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: As above, each blob in a container consists of a sequence of blocks (Block Id 1 … Block Id N) or an array of pages (Page 1, Page 2, Page 3, …).]
Uploading a 10 GB movie as a block blob to Windows Azure Storage:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

Blocks can be up to 4 MB each, and each block can be a different size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operations
PutBlock: puts an uncommitted block, identified by its block ID, for the blob
Block list operations
PutBlockList: provides the list of blocks that comprise the readable version of the blob; blocks from the uncommitted or committed list can be used to update the blob
GetBlockList: returns the list of blocks, committed or uncommitted, for a blob; the block ID and size of each block is returned
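The two-phase semantics of PutBlock and PutBlockList can be modeled with a toy class. This is a sketch of the shape of the API, not the real Azure storage interface; names are illustrative.

```python
# Toy model of block-blob updates: PutBlock stages uncommitted blocks;
# PutBlockList atomically publishes the readable version of the blob.

class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged but not visible
        self.committed = {}     # block_id -> bytes, part of a committed version
        self.block_list = []    # ordered ids forming the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, ids):
        # Ids may come from the uncommitted or the committed set.
        self.committed = {i: self.uncommitted.get(i, self.committed.get(i))
                          for i in ids}
        self.block_list = list(ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[i] for i in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""        # nothing readable before the commit
blob.put_block_list(["b1", "b2"])
print(blob.read())               # b'hello world'
```

The point of the design is that readers never see a half-uploaded blob: the new version appears only when the block list is committed.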
Create MyBlob: specify blob size = 10 GB, with a fixed page size of 512 bytes.
Random-access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512) and [1536, 2560)
GetBlob [1000, 2048) returns all zeros for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
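The bookkeeping behind this example can be reproduced with a toy page tracker that only records which 512-byte pages hold valid data. This is a sketch, not the real API; function names mirror the operations above.

```python
# Toy model of page-blob valid ranges: PutPage marks 512-byte pages
# valid, ClearPage invalidates them, GetPageRange coalesces valid
# pages into byte ranges.

PAGE = 512
valid = set()  # indices of pages currently holding data

def put_page(start, end):
    valid.update(range(start // PAGE, end // PAGE))

def clear_page(start, end):
    valid.difference_update(range(start // PAGE, end // PAGE))

def get_page_range(start, end):
    """Coalesce valid pages in [start, end) into byte ranges."""
    ranges, run = [], None
    for p in range(start // PAGE, end // PAGE):
        if p in valid:
            run = (p * PAGE, (p + 1) * PAGE) if run is None else (run[0], (p + 1) * PAGE)
        elif run:
            ranges.append(run)
            run = None
    if run:
        ranges.append(run)
    return ranges

put_page(512, 2048)
put_page(0, 1024)
clear_page(512, 1536)
put_page(2048, 2560)
print(get_page_range(0, 4096))   # [(0, 512), (1536, 2560)]
```

Replaying the slide's four operations yields exactly the valid ranges [0, 512) and [1536, 2560) stated above.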
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
Snapshots:
All writes are applied to the base blob name; only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots of a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB, and a VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides structured storage: massively scalable tables
Billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
Highly available and durable: data is replicated several times
Familiar and easy-to-use APIs:
ADO.NET Data Services (.NET 3.5 SP1), .NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
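A minimal model of table addressing: every entity lives at a (PartitionKey, RowKey) pair, and those are the only indexed properties (a point later echoed in the best practices). A plain dict stands in for Azure Table storage here; the entity names and values are invented.

```python
# Toy model of an Azure-style table: entities are property bags
# located by the composite key (PartitionKey, RowKey).

table = {}

def insert(partition_key, row_key, **properties):
    table[(partition_key, row_key)] = properties

def point_query(partition_key, row_key):
    """Efficient lookup: both indexed keys are given."""
    return table.get((partition_key, row_key))

insert("barga", "post-001", title="Azure for Research", views=100)
insert("jackson", "post-001", title="AzureBLAST", views=250)

print(point_query("barga", "post-001")["title"])   # Azure for Research
```

Queries that cannot supply both keys degrade to scans, which is why key design matters so much in this model.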
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates for all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
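The retry advice above might look like the following sketch: retry transient failures with exponential backoff. `flaky_fetch` and the backoff parameters are hypothetical stand-ins for any data-access call.

```python
import time

# Exponential-backoff retry wrapper for data access calls that may
# fail transiently, as cloud storage calls sometimes do.

def with_retries(fn, attempts=4, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except IOError:
            if attempt == attempts - 1:
                raise                     # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def flaky_fetch():
    """Stand-in for a storage call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "payload"

result = with_retries(flaky_fetch)
print(result)   # payload, after 2 transient failures
```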
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
We use laptops.
We’ve got data – now what? And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished, and when a data collection does grow large, we are not able to analyze it.
Paradigm shift for research:
The ability to marshal needed resources on demand, without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 54
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 56
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same account → container → blob hierarchy, with each blob composed of a sequence of blocks or pages (Block/Page 1, 2, 3, …), each block identified by its Block ID.]
Uploading a 10 GB movie to Windows Azure Storage as a block blob:
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be of variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The Block ID and size of each block are returned
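A toy in-memory model of these commit semantics (an illustration, not the real storage API): PutBlock only stages uncommitted blocks, and nothing becomes readable until PutBlockList commits an ordered list of block IDs.

```python
class BlockBlob:
    """Toy model of block blob semantics: staged blocks become readable
    only after put_block_list commits an ordered list of block IDs."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged but not visible
        self.committed = {}     # block_id -> bytes, part of the readable blob
        self.block_list = []    # ordered committed block IDs

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Blocks may come from the uncommitted or previously committed set.
        new = {}
        for bid in block_ids:
            new[bid] = self.uncommitted[bid] if bid in self.uncommitted else self.committed[bid]
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted.clear()

    def get_blob(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.get_blob() == b""           # nothing committed yet
blob.put_block_list(["b1", "b2"])
assert blob.get_blob() == b"hello world"
```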
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
Random access operations against the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
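The sequence above can be replayed with a toy model of page-blob bookkeeping (512-byte pages, immediate updates; again an illustration, not the service itself):

```python
PAGE = 512

class PageBlob:
    """Toy model of page blob semantics: fixed 512-byte pages, immediate
    updates, and get_page_ranges reporting which byte ranges hold data."""
    def __init__(self, size):
        self.size = size    # capacity, not enforced in this toy model
        self.valid = set()  # indices of pages that currently hold data

    def put_page(self, start, end):      # byte range [start, end)
        self.valid |= set(range(start // PAGE, end // PAGE))

    def clear_page(self, start, end):
        self.valid -= set(range(start // PAGE, end // PAGE))

    def get_page_ranges(self):
        # Merge consecutive valid pages into [start, end) byte ranges.
        ranges, run = [], None
        for i in sorted(self.valid):
            if run and i == run[1]:
                run = (run[0], i + 1)
            else:
                if run:
                    ranges.append(run)
                run = (i, i + 1)
        if run:
            ranges.append(run)
        return [(s * PAGE, e * PAGE) for s, e in ranges]

# Replaying the sequence from the slide:
b = PageBlob(10 * 2**30)
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
assert b.get_page_ranges() == [(0, 512), (1536, 2560)]
```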
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
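The ETag check used for block blobs is classic optimistic concurrency; a minimal sketch of the pattern against a hypothetical in-memory store (not the storage API):

```python
import uuid

class Store:
    """Toy optimistic-concurrency store: every write produces a new ETag,
    and a conditional write fails if the caller's ETag is stale."""
    def __init__(self):
        self.data, self.etag = None, None

    def put(self, data, if_match=None):
        if if_match is not None and if_match != self.etag:
            raise RuntimeError("412 Precondition Failed: ETag mismatch")
        self.data, self.etag = data, uuid.uuid4().hex
        return self.etag

s = Store()
tag1 = s.put(b"v1")                  # unconditional first write
tag2 = s.put(b"v2", if_match=tag1)   # succeeds: tag1 is current
try:
    s.put(b"v3", if_match=tag1)      # fails: tag1 is now stale
except RuntimeError as e:
    print(e)
```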
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS
single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
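A toy sketch of this entity model (illustrative, not the real table API): entities are property bags addressed by PartitionKey and RowKey, with the required Timestamp filled in on insert.

```python
import time

class Table:
    """Toy model of an Azure-style table: entities are property bags
    keyed (and efficiently looked up) only by PartitionKey + RowKey."""
    def __init__(self):
        self.entities = {}  # (partition_key, row_key) -> properties

    def insert(self, partition_key, row_key, **props):
        self.entities[(partition_key, row_key)] = {
            "PartitionKey": partition_key,
            "RowKey": row_key,
            "Timestamp": time.time(),  # required property, set by the store
            **props,
        }

    def get(self, partition_key, row_key):
        return self.entities[(partition_key, row_key)]

t = Table()
t.insert("genomes", "seq-001", Length=1523, Organism="E. coli")
entity = t.get("genomes", "seq-001")
assert entity["Organism"] == "E. coli"
```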
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates for all of your data stores
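The "batch multiple small tasks into a single queue message" practice can be sketched as follows (the 8 KB per-message limit is assumed here for illustration):

```python
import json

def batch_tasks(tasks, max_bytes=8192):
    """Pack many small task descriptions into as few queue messages as
    possible, respecting a per-message size limit (8 KB assumed here)."""
    batches, current = [], []
    for task in tasks:
        candidate = current + [task]
        if len(json.dumps(candidate).encode()) > max_bytes and current:
            batches.append(json.dumps(current))  # flush the full batch
            current = [task]
        else:
            current = candidate
    if current:
        batches.append(json.dumps(current))
    return batches

msgs = batch_tasks([{"seq": i} for i in range(1000)])
print(len(msgs), "queue messages instead of 1000")
```

This trades one storage transaction per task for one per batch, which matters because Azure bills storage transactions as well as data size.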
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
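The "include retry logic" practice typically means retrying transient storage failures with exponential backoff and jitter; a minimal, hypothetical sketch:

```python
import random
import time

def with_retries(op, attempts=5, base_delay=0.01):
    """Retry an operation with exponential backoff plus jitter,
    re-raising the last error once the attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise
            # 2**attempt backoff, randomized to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "data"

assert with_retries(flaky_read) == "data"
assert calls["n"] == 3
```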
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 58
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2,300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
(Diagram: the AzureMODIS pipeline — an AzureMODIS service web role portal feeds a download queue; data moves through the Data Collection Stage, Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage to produce research results.)
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on results
• Cover of PLoS Biology, November 2008
• Typical job, 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  - Requires a large number of test runs for a given job (1–10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database)
2. Upload to Azure Store
3. Deploy Worker Roles
   - Init() function downloads and decompresses data to the local disk
(Diagram: local sequence database → compressed → uploaded to Azure Storage → deployed to the BLAST-ready worker roles, each with the BLAST executable.)
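The staging steps can be sketched as follows, with a plain dict standing in for Azure blob storage (the helper names are hypothetical): compress the database, "upload" it, and have each worker's Init() download and decompress it to local disk.

```python
# Hypothetical sketch of the staging flow; a dict models blob storage.
import gzip

def stage(data: bytes, store: dict, name: str) -> None:
    store[name] = gzip.compress(data)      # 1. compress, 2. upload

def worker_init(store: dict, name: str) -> bytes:
    return gzip.decompress(store[name])    # 3. Init(): download + decompress

store = {}
db = b">seq1\nACGTACGT\n"
stage(db, store, "blast-db.gz")
local_copy = worker_init(store, "blast-db.gz")
```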
Step 2. Partitioning a Job
(Diagram: the web role passes user input to a single partitioning worker role, which writes input partitions to Azure Storage and posts one queue message per partition.)
Step 3. Doing the Work
(Diagram: the BLAST-ready worker roles pick up queue messages, read their input partition from Azure Storage, and write BLAST output and logs back to Azure Storage.)
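Steps 2 and 3 can be sketched together (the partition size, message format, and helper names here are hypothetical): the partitioner splits the user input into fixed-size partitions and enqueues one message per partition for the workers.

```python
# Hypothetical sketch of job partitioning: one queue message per partition.
from collections import deque

def partition(sequences, per_partition):
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

queue = deque()
seqs = [f"seq{i}" for i in range(10)]
for n, part in enumerate(partition(seqs, 4)):
    queue.append({"partition": n, "count": len(part)})
# Workers then dequeue messages and run BLAST on their partition.
```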
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little Cloud development headaches are probably worth it
Resources

Workers | Clock Duration | Total run time | Computational run time
     25 |        0:12:00 |        2:19:39 |               1:49:43
     16 |        0:15:00 |        2:25:12 |               1:53:47
      8 |        0:26:00 |        2:33:23 |               2:00:14
      4 |        0:47:00 |        2:34:17 |               2:01:06
      2 |        1:27:00 |        2:31:39 |               1:59:13
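The table rewards a closer look: total run time stays nearly flat while wall-clock time shrinks, so parallel efficiency (total run time divided by workers × wall clock) falls as workers are added. A quick check of that arithmetic, using two rows from the table:

```python
# Scaling arithmetic from the results table: efficiency drops as workers
# are added, even though wall-clock duration improves.

def to_min(hms):
    h, m, s = map(int, hms.split(":"))
    return h * 60 + m + s / 60

runs = {25: ("0:12:00", "2:19:39"), 2: ("1:27:00", "2:31:39")}
eff = {w: to_min(total) / (w * to_min(clock))
       for w, (clock, total) in runs.items()}
# 2 workers use their machine-time far more efficiently than 25.
```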
(Chart: resources vs. time — illustrating time–space fungibility in the Cloud.)
Utilizes a general jobs-based task manager which registers jobs and their resulting data
(Diagram: a job definition is broken into tasks and tracked in a registry; the user, administrator, local registry, and highly sensitive data remain on the user premises (or internet), while a registry broker, (HPC) cluster, and web management run in the Azure datacenters; results and data products are returned to the user.)
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then:
  - Make best use of the capabilities of client and cloud computing
  - Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example namespace — Account “jared” → Containers “images” and “movies” → Blobs PIC01.JPG, PIC02.JPG and MOV1.AVI:
http://jared.blob.core.windows.net/images/PIC01.JPG
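The naming scheme composes directly into the URL shown above; as a sketch:

```python
# Compose a blob URL from account, container, and blob name,
# matching the example on the slide.

def blob_url(account, container, blob):
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

url = blob_url("jared", "images", "PIC01.JPG")
# → http://jared.blob.core.windows.net/images/PIC01.JPG
```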
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are <name, value> pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
(Diagram: the same account → container → blob hierarchy, one level deeper — each blob consists of Block or Page 1, 2, 3, …, identified as Block Id 1 … Block Id N.)
Uploading a 10 GB movie as a block blob:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
(The blocks are uploaded to Windows Azure Storage and become readable as TheBlob.wmv once the block list is committed.)
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
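A toy in-memory model of these semantics (the class and method names are illustrative, not the real storage client library): blocks staged with PutBlock stay invisible until PutBlockList commits an ordering, which becomes the readable blob.

```python
# Toy model (not the real service) of block blob commit semantics.

class BlockBlob:
    def __init__(self):
        self.uncommitted, self.committed, self.order = {}, {}, []

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data        # staged, not yet readable

    def put_block_list(self, block_ids):
        for bid in block_ids:                    # commit from either list
            self.committed[bid] = self.uncommitted.pop(
                bid, self.committed.get(bid))
        self.order = list(block_ids)

    def read(self):
        return b"".join(self.committed[bid] for bid in self.order)

blob = BlockBlob()
blob.put_block("b1", b"Hello, ")
blob.put_block("b2", b"world")
assert blob.read() == b""                        # nothing committed yet
blob.put_block_list(["b1", "b2"])                # now readable
```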
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes
Random Access Operations against the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
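The worked example above can be checked with a toy model (assumed semantics, not the real REST API): track which 512-byte pages hold valid data and coalesce them into ranges.

```python
# Toy model of page-blob operations at a 512-byte page size; it
# reproduces the valid-range result from the slide's example.
PAGE = 512

def apply_ops(ops):
    valid = set()
    for op, start, end in ops:
        pages = range(start // PAGE, end // PAGE)
        if op == "clear":
            valid.difference_update(pages)
        else:
            valid.update(pages)
    return sorted(valid)

def ranges(pages):
    """Coalesce page numbers into [start, end) byte ranges."""
    out = []
    for p in pages:
        if out and out[-1][1] == p * PAGE:
            out[-1] = (out[-1][0], (p + 1) * PAGE)
        else:
            out.append((p * PAGE, (p + 1) * PAGE))
    return out

ops = [("put", 512, 2048), ("put", 0, 1024),
       ("clear", 512, 1536), ("put", 2048, 2560)]
result = ranges(apply_ops(ops))
```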
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
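The ETag check mentioned above can be modeled in a few lines (the class and method names are illustrative, not the storage API): a write succeeds only if the caller's ETag still matches the blob's current one.

```python
# Hypothetical sketch of ETag-style optimistic concurrency.
import itertools

class Blob:
    _etags = itertools.count(1)

    def __init__(self, data=b""):
        self.data, self.etag = data, next(self._etags)

    def write(self, data, if_match):
        if if_match != self.etag:
            return False        # 412 Precondition Failed, in REST terms
        self.data, self.etag = data, next(self._etags)
        return True

blob = Blob(b"v1")
tag = blob.etag
first = blob.write(b"v2", if_match=tag)    # first writer wins
second = blob.write(b"v3", if_match=tag)   # stale ETag is rejected
```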
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
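A minimal sketch of that entity model (illustrative only, not the real table service): entities are addressed by the composite (PartitionKey, RowKey), which is also the only index, so point lookups are fast while other filters scan.

```python
# Hypothetical in-memory sketch of the table entity model.
import time

class Table:
    def __init__(self):
        self.rows = {}

    def insert(self, pk, rk, **props):
        self.rows[(pk, rk)] = {"PartitionKey": pk, "RowKey": rk,
                               "Timestamp": time.time(), **props}

    def get(self, pk, rk):
        return self.rows[(pk, rk)]       # indexed point lookup

    def scan(self, pk):                  # everything else is a scan
        return [e for (p, _), e in self.rows.items() if p == pk]

t = Table()
t.insert("blast", "job-001", status="running")
t.insert("blast", "job-002", status="done")
```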
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
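The retry advice above can be sketched as follows (a hypothetical helper, with `OSError` standing in for a transient storage fault): wrap data access in bounded retries with exponential backoff.

```python
# Hypothetical sketch of retry logic with exponential backoff;
# the sleep is injectable so the example runs instantly.

def with_retries(op, attempts=4, base_delay=0.1, sleep=lambda s: None):
    for n in range(attempts):
        try:
            return op()
        except OSError:                  # stand-in for a transient fault
            if n == attempts - 1:
                raise
            sleep(base_delay * 2 ** n)   # 0.1s, 0.2s, 0.4s, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient")
    return "ok"

result = with_retries(flaky)
```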
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size center (1000 servers) and a larger, 100K server center:

Technology     | Cost in small-sized Data Center | Cost in Large Data Center   | Ratio
Network        | $95 per Mbps/month              | $13 per Mbps/month          | 7.1
Storage        | $2.20 per GB/month              | $0.40 per GB/month          | 5.7
Administration | ~140 servers/Administrator      | >1000 servers/Administrator | 7.1
Each data center is 11.5 times the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  - DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable.
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5–1.7 GHz x64
Memory: 1.7 GB
Network: 100+ Mbps
Local Storage: 500 GB
Up to
CPU: 8 cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform — Compute and Storage: a closer look at Compute
(Diagram: HTTP requests pass through the Load Balancer to Web Role instances — IIS hosting ASP.NET, WCF, etc. — and Worker Role instances — a main() { … } entry point; an Agent runs in each VM, all managed by the Fabric.)
Using queues for reliable messaging — to scale, add more of either role:
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
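That fault-masking behavior can be sketched with a toy queue (an illustration of the pattern, not the Azure queue API): a dequeued message is hidden rather than deleted, and reappears if the worker never confirms completion.

```python
# Toy model of reliable messaging: get() hides a message; delete() is
# only called after the work succeeds, so a crashed worker's message
# becomes visible again when its timeout expires.
from collections import deque

class ReliableQueue:
    def __init__(self):
        self.visible, self.invisible = deque(), {}

    def put(self, msg):
        self.visible.append(msg)

    def get(self):
        msg = self.visible.popleft()
        self.invisible[id(msg)] = msg    # hidden, not gone
        return msg

    def delete(self, msg):
        del self.invisible[id(msg)]      # work done: remove for good

    def requeue_expired(self):           # visibility timeout elapsed
        while self.invisible:
            _, msg = self.invisible.popitem()
            self.visible.append(msg)

q = ReliableQueue()
q.put({"job": 1})
msg = q.get()                            # worker takes the message...
q.requeue_expired()                      # ...crashes; message reappears
assert q.get() == {"job": 1}             # another worker retries it
q.delete(msg)                            # this time the work completes
```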
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Azure Platform — Compute and Storage: a closer look at Storage
(Diagram: applications — Windows Azure compute or external — reach Blobs, Drives, Tables, and Queues through a REST API over HTTP, behind the Load Balancer; the Storage, Compute, and Fabric layers sit beneath the application.)
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
(Diagram: the developer workflow — develop and run your app at work or home against the Development Fabric and Development Storage, keep versions in local source control, confirm the application works locally, then in staging, then in the cloud.)
What is the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 60
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
Comparing HPC systems and data centers across five design dimensions:
o Node and system architectures: indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems: HPC uses local scratch (or none), a SAN or parallel file system as secondary storage, and petabytes of tertiary storage; data centers use terabytes of local storage, JBOD as secondary storage, and no tertiary storage
o Reliability and resilience: HPC relies on periodic checkpoints, with rollback and resume in response to failures; with MTBF approaching zero, checkpoint frequency increases and the I/O demand becomes intolerable. Data centers use loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: 1.5–1.7 GHz x64 CPU, 1.7 GB memory, 100+ Mbps network, 500 GB local storage
Up to: 8 cores, 14.2 GB memory, 2+ TB local storage
Azure Platform: Compute and Storage
A closer look at compute:
[Diagram: HTTP traffic passes through a load balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } loop); each role instance runs in a VM with an agent, coordinated by the Fabric.]
Using queues for reliable messaging
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in a queue
3) A Worker Role (main() { … }) gets the work from the queue
4) The Worker Role does the work
To scale, add more of either role.
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Enable resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
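The reliable-messaging pattern above can be sketched in a few lines. This is an in-memory illustration of the semantics, not the Azure SDK; the `Queue` class and its methods are hypothetical stand-ins. The key property is the visibility timeout: a dequeued message is hidden rather than removed, and reappears if the worker dies before deleting it.

```python
import time

class Queue:
    """In-memory sketch of queue semantics: a dequeued message is hidden
    for a visibility timeout, then reappears unless explicitly deleted."""
    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = []  # each entry: [visible_at, id, body]
        self._next_id = 0

    def put(self, body):
        self._messages.append([0.0, self._next_id, body])
        self._next_id += 1

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        for msg in self._messages:
            if msg[0] <= now:
                msg[0] = now + self.visibility_timeout  # hide, don't remove
                return msg[1], msg[2]
        return None

    def delete(self, msg_id):
        self._messages = [m for m in self._messages if m[1] != msg_id]

# Web role puts work in; a worker gets it, does it, then deletes it.
q = Queue(visibility_timeout=30.0)
q.put("blast partition 7")
msg_id, body = q.get(now=0.0)
# Suppose the worker crashes before deleting:
assert q.get(now=10.0) is None          # still hidden from other workers
recovered = q.get(now=31.0)             # timeout passed: visible again
assert recovered[1] == "blast partition 7"
q.delete(recovered[0])                  # work completed: remove for good
assert q.get(now=100.0) is None
```

This is why worker roles can be stateless: an unfinished message simply returns to the queue for another instance to pick up.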
A closer look at storage:
[Diagram: applications reach Azure Storage over HTTP through a REST API behind a load balancer; storage provides Blobs, Drives, Tables, and Queues alongside the Compute fabric.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational – entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps as well as other on-premises or cloud applications
The development lifecycle:
[Diagram: develop your app, at work or at home, against the local Development Fabric and Development Storage, with source and version control, until the application works locally; then move it to staging until it works in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, with rolling upgrades/downgrades
Failure of any node is expected, so state has to be replicated
Failure of a role (app code) is expected, with automatic recovery
Services can grow to be large, so provide state management that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update – for example, a set of nodes to update
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
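The 10-front-ends-across-5-update-domains example amounts to round-robin placement; a minimal sketch of the idea (the `allocate` helper and instance names are illustrative, not platform APIs):

```python
def allocate(num_instances, num_update_domains):
    """Round-robin role instances across update domains, so taking any one
    domain down for an update leaves the rest of the service running."""
    domains = [[] for _ in range(num_update_domains)]
    for i in range(num_instances):
        domains[i % num_update_domains].append(f"frontend-{i}")
    return domains

domains = allocate(10, 5)
assert all(len(d) == 2 for d in domains)   # 10 front-ends, 2 per domain

# Rolling forward: update one domain at a time; 8 of 10 instances stay up.
for updating in domains:
    running = 10 - len(updating)
    assert running == 8
```

The same round-robin idea applies to fault domains: spreading instances means no single rack or switch failure takes the whole role offline.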
Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on the nodes
Step 3: Configure settings
Step 4: Start the roles
Step 5: Configure the load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate that it is unhealthy
The current state of the node is updated appropriately, and the state machine kicks in again to drive the service back to its goal state
Windows Azure FC monitors the health of the host
If a node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates the role instances to a new node: a suitable replacement location is found, and the existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram: the AzureMODIS service Web Role portal drives a download queue through the data collection, reprojection, derivation reduction, and analysis reduction stages to produce the research results.]
• A statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Used by a small but important group of researchers: hundreds of HIV and HepC researchers actively use it, and thousands of research communities rely on its results
Featured on the cover of PLoS Biology, November 2008
A typical job takes 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
– Requires a large number of test runs for a given job (1–10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data from the local sequence database
2. Upload the compressed data to the Azure store
3. Deploy the worker roles (with the BLAST executable); each role’s Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
A single partitioning Worker Role takes the user input from the Web Role, writes input partitions to Azure storage, and posts a queue message per partition.
Step 3. Doing the Work
BLAST-ready Worker Roles pick up the queue messages, read their input partitions from Azure storage, run BLAST, and write the BLAST output and logs back to Azure storage.
• Always design with failure in mind: on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts; the optimal size may change depending on the scope of the job
• Test runs are your friend: blowing $20,000 of computation is not a good idea
• Make ample use of logging features: when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! Little cloud development headaches are probably worth it
Resources
Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
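A quick calculation over the clock durations above shows how speedup and parallel efficiency change as workers are added (a sketch; durations are taken from the table and rounded to whole minutes):

```python
# Clock durations from the table above, in minutes (0:12:00 -> 12, etc.).
clock_minutes = {25: 12, 16: 15, 8: 26, 4: 47, 2: 87}

base_workers = 2
base_minutes = clock_minutes[base_workers]
for workers in sorted(clock_minutes):
    # Speedup relative to the 2-worker run; efficiency relative to ideal scaling.
    speedup = base_minutes / clock_minutes[workers]
    efficiency = speedup / (workers / base_workers)
    print(f"{workers:2d} workers: {speedup:4.2f}x speedup, "
          f"{efficiency:.0%} parallel efficiency")
```

Going from 2 to 25 workers cuts the wall clock from 1:27:00 to 0:12:00, a 7.25× speedup at 58% parallel efficiency, while the total computational run time stays roughly flat: the same work, spread across more machines.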
[Chart: time–space fungibility in the cloud – trading resources against time.]
Utilizes a general job-based task manager, which registers jobs and their resulting data products.
[Diagram: a job definition spawns tasks against a registry; a registry broker bridges the user premises (or internet) – with a local registry, highly sensitive data, an (HPC) cluster, and its administrator – and the Azure datacenters, where web management returns results to the user.]
Client Visualization / Cloud Data and Computation
The Cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
A user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account: “US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate a storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Example: the storage account “jared” holds the containers “images” (blobs PIC01.JPG and PIC02.JPG) and “movies” (blob MOV1.AVI); a blob is addressed as http://jared.blob.core.windows.net/images/PIC01.JPG]
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible
Associate metadata with the container
Metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks; each block is identified by a Block ID
Size limit: 200 GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob
[Diagram: within the account/container/blob hierarchy, each blob is stored as a sequence of blocks or pages (Block or Page 1, 2, 3, …), each block carrying a Block ID.]
Uploading a 10 GB movie as a block blob:

    blobName = "TheBlob.wmv";
    PutBlock(blobName, blockId1, block1Bits);
    PutBlock(blobName, blockId2, block2Bits);
    ...
    PutBlock(blobName, blockIdN, blockNBits);
    PutBlockList(blobName, blockId1, ..., blockIdN);

Blocks can be up to 4 MB each, and each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock: puts an uncommitted block, defined by its block ID, for the blob
Block list operations
PutBlockList: provides the list of blocks that comprise the readable version of the blob; blocks from the uncommitted or committed list can be used to update the blob
GetBlockList: returns the list of blocks, committed or uncommitted, for a blob; the Block ID and size of each block is returned
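The PutBlock/PutBlockList semantics can be illustrated with a small in-memory model (a sketch, not the storage service or its SDK): staged blocks are invisible until a block list commits them, and a commit may mix already-committed blocks with newly staged ones.

```python
class BlockBlob:
    """In-memory sketch of block-blob update semantics: PutBlock stages
    uncommitted blocks; PutBlockList atomically defines the readable blob."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged but not visible
        self.committed = {}     # block id -> bytes, part of a committed version
        self.block_list = []    # ordered ids forming the readable version

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Each id may come from the uncommitted or the committed set.
        new_committed = {}
        for bid in block_ids:
            source = self.uncommitted if bid in self.uncommitted else self.committed
            new_committed[bid] = source[bid]
        self.committed, self.uncommitted = new_committed, {}
        self.block_list = list(block_ids)

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""                 # nothing readable until commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"
# Update: stage a replacement for b2, reuse committed b1, recommit.
blob.put_block("b2", b"azure")
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello azure"
```

The atomic commit is what makes parallel uploads safe: any number of PutBlock calls can race, and readers only ever see a complete, consistent block list.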
Create MyBlob with a blob size of 10 GB and a fixed page size of 512 bytes, giving a 10 GB address space for random access operations:

    PutPage[512, 2048)
    PutPage[0, 1024)
    ClearPage[512, 1536)
    PutPage[2048, 2560)

GetPageRange[0, 4096) returns the valid data ranges: [0,512) and [1536,2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes, then 512 bytes of data stored in [1536,2048)
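The page-blob example above can be replayed against a small in-memory model (an illustrative sketch, not the service or its SDK); the final assertion reproduces the valid data ranges from the example.

```python
class PageBlob:
    """In-memory sketch of page-blob semantics: a sparse set of 512-byte
    pages over a fixed address space, with immediate writes."""
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.pages = set()  # page-aligned offsets that currently hold data

    def put_page(self, start, end):     # write pages covering [start, end)
        self.pages.update(range(start, end, self.PAGE))

    def clear_page(self, start, end):   # clear pages covering [start, end)
        self.pages.difference_update(range(start, end, self.PAGE))

    def valid_ranges(self, start, end):
        """Coalesced [lo, hi) ranges of written pages, as GetPageRange reports."""
        ranges, run_start = [], None
        for off in range(start, end, self.PAGE):
            if off in self.pages:
                run_start = off if run_start is None else run_start
            elif run_start is not None:
                ranges.append((run_start, off))
                run_start = None
        if run_start is not None:
            ranges.append((run_start, end))
        return ranges

# Replay the operations from the example above.
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
assert blob.valid_ranges(0, 4096) == [(0, 512), (1536, 2560)]
```

Only written pages consume storage; reads over unwritten ranges come back as zeros, which is why GetBlob[1000, 2048) returns 536 zero bytes before the real data.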
Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks
Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and easy-to-use APIs
ADO.NET Data Services (.NET 3.5 SP1): .NET classes and LINQ
REST: usable from any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
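A toy model makes the indexing point concrete (a sketch; the class and method names are illustrative, not the ADO.NET Data Services API): entities are schemaless property sets, and only the (PartitionKey, RowKey) pair is indexed.

```python
class Table:
    """In-memory sketch of table semantics: schemaless entities addressed
    only by (PartitionKey, RowKey); other properties vary per entity."""
    def __init__(self):
        self.entities = {}  # (partition_key, row_key) -> dict of properties

    def insert(self, partition_key, row_key, **properties):
        self.entities[(partition_key, row_key)] = dict(properties)

    def get(self, partition_key, row_key):
        # Point lookup on the only indexed keys: fast in the real service.
        return self.entities.get((partition_key, row_key))

    def query_partition(self, partition_key):
        # Partition scans are efficient; anything else is a full table scan.
        return [v for (pk, _), v in self.entities.items() if pk == partition_key]

t = Table()
t.insert("movies", "MOV1.AVI", size_mb=700)
t.insert("images", "PIC01.JPG", width=1024, height=768)
t.insert("images", "PIC02.JPG", width=640)   # schemaless: properties differ
assert t.get("images", "PIC01.JPG")["width"] == 1024
assert len(t.query_partition("images")) == 2
```

Choosing the partition key is the central design decision: it sets both the unit of scale-out and the scope of efficient queries.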
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs; note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
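The retry-logic advice can be sketched as a small wrapper with exponential backoff and jitter (illustrative code, not an Azure SDK API; the `with_retries` helper and the exception choice are assumptions for the example):

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.1):
    """Retry a data-access call with exponential backoff and jitter.
    Transient faults are expected at scale, so wrap every storage access."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the fault to the caller
            # Back off 2^attempt * base_delay, randomized to avoid thundering herd.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# A flaky operation that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "ok"

assert with_retries(flaky, base_delay=0.001) == "ok"
assert calls["n"] == 3
```

Pairing this with the design-for-failure rule above: retries handle transient faults, while idempotent, execute-once worker tasks keep the retries safe.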
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, backed by a technical engagement team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand, without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop:
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards
Coupled with access to a research-oriented technical team
The Azure resource offering:
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier-one support
An international program; discussions are underway…
http://research.microsoft.com/azure
[email protected]
Slide 61
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is a lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
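A toy illustration of "analysis as a service": a MapReduce-style word count over records captured on the client and shipped to the cloud (hypothetical data, just the shape of the pattern):

```python
# Toy MapReduce sketch: map each record to (key, 1) pairs, then
# reduce by summing per key -- the analysis-as-a-service pattern
# named above, in miniature.
from collections import Counter
from itertools import chain

def map_phase(record):
    # emit one (word, 1) pair per token
    return [(w.lower(), 1) for w in record.split()]

def reduce_phase(pairs):
    # sum counts per key
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

records = ["modis tile ocean", "ocean temperature tile"]
counts = reduce_phase(chain.from_iterable(map_phase(r) for r in records))

assert counts["ocean"] == 2
assert counts["tile"] == 2
```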
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 63
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
e.g. “US Anywhere”, “US North Central”, “US South Central”, ...
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example namespace for account “jared”:
Container “images”: blobs PIC01.JPG, PIC02.JPG
Container “movies”: blob MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
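The blob namespace composes mechanically from account, container, and blob names. A minimal helper makes this concrete (the `<account>.blob.core.windows.net` host pattern is taken from the example above; the helper itself is illustrative):

```python
# Sketch: composing a blob URI from the account/container/blob hierarchy.
def blob_uri(account: str, container: str, blob: str) -> str:
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

print(blob_uri("jared", "images", "PIC01.JPG"))
# http://jared.blob.core.windows.net/images/PIC01.JPG
```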
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate metadata with the container
Metadata are name/value pairs
Up to 8 KB of metadata per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the account/container namespace shown above, each blob
(e.g. PIC01.JPG) is composed of either blocks, identified as Block Id 1
through Block Id N, or pages (Page 1, Page 2, Page 3, ...)]
Example: uploading a 10 GB movie to Windows Azure Storage as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
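The commit semantics of PutBlock/PutBlockList can be modeled without any Azure dependency. The sketch below is a toy in-memory stand-in (function names mirror the operations above; everything else is illustrative): staged blocks stay invisible until PutBlockList names them, in order, as the readable blob.

```python
import base64

# Toy in-memory model of the PutBlock / PutBlockList commit semantics.
uncommitted = {}   # (blob_name, block_id) -> bytes, staged but unreadable
committed = {}     # blob_name -> ordered list of (block_id, bytes)

def make_block_id(n: int) -> str:
    # Real block IDs are base64 strings of equal length within a blob.
    return base64.b64encode(f"{n:06d}".encode()).decode()

def put_block(blob: str, bid: str, data: bytes) -> None:
    uncommitted[(blob, bid)] = data            # staged, not yet readable

def put_block_list(blob: str, ids: list[str]) -> None:
    # Commit: the listed blocks, in order, become the readable blob.
    committed[blob] = [(bid, uncommitted.pop((blob, bid))) for bid in ids]

def read_blob(blob: str) -> bytes:
    return b"".join(data for _, data in committed.get(blob, []))

ids = [make_block_id(i) for i in range(3)]
for i, bid in enumerate(ids):
    put_block("TheBlob.wmv", bid, f"chunk{i}".encode())
put_block_list("TheBlob.wmv", ids)
print(read_blob("TheBlob.wmv"))   # b'chunk0chunk1chunk2'
```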
Example: create MyBlob with Blob Size = 10 GBytes and a fixed page size of
512 bytes, giving a 10 GB address space. Random-access operations:

PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all zeros for the first 536 bytes, then the
512 bytes of data stored in [1536, 2048)
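The sequence of page operations can be traced with a toy model of 512-byte page semantics (purely illustrative, not the Azure API), which reproduces the valid ranges given above:

```python
# Toy model of page-blob semantics with 512-byte pages.
PAGE = 512

class PageBlob:
    def __init__(self):
        self.pages = {}                    # page offset -> 512 bytes of data

    def put_page(self, start, end, fill=b"x"):
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)      # cleared pages read back as zeros

    def get_page_range(self, start, end):
        # Coalesce written pages into half-open [start, end) ranges.
        ranges = []
        for off in range(start, end, PAGE):
            if off in self.pages:
                if ranges and ranges[-1][1] == off:
                    ranges[-1] = (ranges[-1][0], off + PAGE)
                else:
                    ranges.append((off, off + PAGE))
        return ranges

b = PageBlob()
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
print(b.get_page_range(0, 4096))   # [(0, 512), (1536, 2560)]
```

Note how ClearPage [512, 1536) removes part of what the first PutPage wrote, so only [0, 512) and [1536, 2560) remain valid.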
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS
Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
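Concretely, an entity is just a property bag in which two of the properties, PartitionKey and RowKey, together form the key. A sketch follows; the table and property names beyond the three required ones are invented for illustration:

```python
from datetime import datetime, timezone

# Illustrative entity for a hypothetical "JobResults" table.
# PartitionKey + RowKey uniquely identify the entity; Timestamp is
# maintained by the service (faked here); other properties are arbitrary.
entity = {
    "PartitionKey": "blast-job-2010-03",   # groups related rows together
    "RowKey": "task-00042",                # unique within the partition
    "Timestamp": datetime.now(timezone.utc).isoformat(),
    "Status": "completed",
    "Matches": 1972,
}
```

Choosing PartitionKey well matters: queries that specify both keys are point lookups, while anything else may scan.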
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
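The "retry logic" bullet is worth making concrete. A generic retry wrapper with exponential backoff (the pattern, not any particular Azure client API) might look like:

```python
import time

def with_retries(operation, attempts=4, base_delay=0.5):
    """Run operation(); on failure wait 0.5s, 1s, 2s, ... then re-raise."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                           # out of attempts: surface it
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch: a storage read that fails transiently twice, then succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "data"

print(with_retries(flaky_read, base_delay=0.01))   # data
```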
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 65
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
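The takeaway about stateless roles and durable queues amounts to a worker loop like the following sketch. The queue here is an in-memory stand-in, not the Azure Queue API; with real Azure Queues, an unacknowledged message reappears after its visibility timeout:

```python
# Minimal stateless worker-role loop against a durable queue.
# The deque is a stand-in for a durable queue service.
from collections import deque

def worker_loop(queue, handle, max_messages=None):
    """Process messages, deleting each one only after success."""
    processed = 0
    while queue and (max_messages is None or processed < max_messages):
        msg = queue[0]          # peek: the message stays queued until acknowledged
        try:
            handle(msg)
            queue.popleft()     # delete only after successful processing
            processed += 1
        except Exception:
            # Leave the message in place; with Azure Queues it would
            # become visible again after the visibility timeout and
            # be retried, possibly by a different worker instance.
            break
    return processed

jobs = deque(["blast partition 1", "blast partition 2"])
done = worker_loop(jobs, handle=lambda m: None)
```

Because the worker holds no state of its own, a crashed instance can simply be restarted and the queue masks the fault.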
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: AzureMODIS pipeline — the AzureMODIS Service Web Role Portal feeds a download queue through a Data Collection Stage, Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage to produce research results]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  – 100’s of HIV and HepC researchers actively use it
  – 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  – Requires a large number of test runs for a given job (1 – 10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database and BLAST executable)
2. Upload to Azure Storage
3. Deploy Worker Roles
   - Init() function downloads and decompresses data to the local disk
Step 2. Partitioning a Job
[Diagram: the Web Role takes user input and hands it to a single partitioning Worker Role, which writes input partitions to Azure Storage and a queue message per partition]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage]
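The partitioning step above might look like the following sketch. The partition size and message format are invented for illustration, and a real partitioner would split on sequence-record boundaries rather than raw line counts:

```python
# Sketch of partitioning a job into fixed-size chunks, with one
# small queue message naming each chunk. Partition size and the
# "partition-<n>" message format are illustrative inventions.

def partition_job(lines, lines_per_partition):
    """Yield (partition_index, chunk_of_lines) pairs."""
    for i in range(0, len(lines), lines_per_partition):
        yield i // lines_per_partition, lines[i:i + lines_per_partition]

def make_queue_messages(lines, lines_per_partition):
    # Each message only names the partition a worker should fetch
    # from blob storage, keeping the queue payload small.
    return [f"partition-{idx}" for idx, _ in
            partition_job(lines, lines_per_partition)]

msgs = make_queue_messages([f">seq{i}" for i in range(10)], 4)
# 10 input lines in chunks of 4 -> partitions 0, 1, 2
```

Keeping the bulk data in blob storage and only a partition name in the queue is what lets a single partitioning worker feed arbitrarily many BLAST workers.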
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - Little Cloud development headaches are probably worth it
Resources

Workers                | 25      | 16      | 8       | 4       | 2
Clock duration         | 0:12:00 | 0:15:00 | 0:26:00 | 0:47:00 | 1:27:00
Total run time         | 2:19:39 | 2:25:12 | 2:33:23 | 2:34:17 | 2:31:39
Computational run time | 1:49:43 | 1:53:47 | 2:00:14 | 2:01:06 | 1:59:13
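From the table above, the gap between total run time and computational run time gives the per-configuration overhead. A quick sketch to check that this overhead stays roughly constant as workers are added (times copied from the table):

```python
# Compute overhead from the benchmark table:
# overhead = total run time - computational run time.

def to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

totals  = ["2:19:39", "2:25:12", "2:33:23", "2:34:17", "2:31:39"]
compute = ["1:49:43", "1:53:47", "2:00:14", "2:01:06", "1:59:13"]

overheads = [to_seconds(t) - to_seconds(c) for t, c in zip(totals, compute)]
# Overhead is roughly half an hour of aggregate time in every
# configuration: it does not blow up as workers are added.
```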
[Chart: resources vs. time — time-space fungibility in the Cloud: the same total work can be run as many workers for a short time or as few workers for a long time]
Utilizes a general jobs-based task manager which registers jobs and their resulting data
[Diagram: a job definition fans out into tasks; components include data products, a registry on an (HPC) cluster, an administrator, a Registry Broker, a user, a local registry, web management, and results, split between the user premises (or internet), where highly sensitive data stays, and the Azure datacenters]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal, then:
  - Make the best use of the capabilities of client and cloud computing
  - Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example storage hierarchy:
Account: jared
Containers: images (blobs PIC01.JPG, PIC02.JPG), movies (blob MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
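Given the URL scheme above, a blob in a publicly readable container is just an HTTP resource. A minimal sketch (the account, container, and blob names are the slide's own example, and the container is assumed to have public read access):

```python
# Blobs in a public container are plain HTTP resources:
# http://<account>.blob.core.windows.net/<container>/<blob>
import urllib.request

def blob_url(account, container, blob):
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

url = blob_url("jared", "images", "PIC01.JPG")

# For a container with public read access, fetching the blob
# is a single GET (not executed here):
# data = urllib.request.urlopen(url).read()
```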
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: account jared contains containers images (PIC01.JPG, PIC02.JPG) and movies (MOV1.AVI); each blob, e.g. a 10 GB movie, is a sequence of blocks (Block Id 1 ... Block Id N) or pages (Page 1, Page 2, Page 3, ...)]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
(the committed blocks become TheBlob.wmv in Windows Azure Storage)
Blocks can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
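The PutBlock / PutBlockList sequence maps onto the blob REST interface (`comp=block` and `comp=blocklist`). Below is a sketch of the chunking and commit-body construction, assuming a pre-signed (SAS) blob URL so that request signing can be left out; the upload requests themselves are shown only as comments:

```python
# Sketch of a block-blob upload over REST. Assumes a pre-signed
# blob URL, so no Shared Key signing is shown. Block IDs must be
# base64-encoded, and all IDs in one blob must be equal length.
import base64

BLOCK_SIZE = 4 * 1024 * 1024  # blocks can be up to 4MB each

def make_block_id(n):
    # Fixed-width index, base64-encoded as the service requires.
    return base64.b64encode(f"{n:08d}".encode()).decode()

def split_blocks(data, block_size=BLOCK_SIZE):
    return [(make_block_id(i // block_size), data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def block_list_body(block_ids):
    # Body for Put Block List (comp=blocklist): commits the named
    # blocks, in order, as the readable version of the blob.
    inner = "".join(f"<Latest>{bid}</Latest>" for bid in block_ids)
    return f'<?xml version="1.0" encoding="utf-8"?><BlockList>{inner}</BlockList>'

blocks = split_blocks(b"x" * (9 * 1024 * 1024))  # 9MB -> 3 blocks
# For each (bid, chunk): PUT {sas_url}&comp=block&blockid={bid}
# Then commit:          PUT {sas_url}&comp=blocklist  with block_list_body(...)
```

Until Put Block List is called, the uploaded blocks sit in the uncommitted list and readers still see the previous version of the blob.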
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes

Random Access Operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)

GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0s for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
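The operation sequence above can be replayed with a tiny simulation of page-blob semantics. This models the valid-range bookkeeping only, not the service itself:

```python
# Toy model of page-blob valid ranges: PutPage marks an interval
# valid, ClearPage marks it invalid, and GetPageRange reports the
# remaining valid intervals. Intervals are half-open [start, end).

def apply_ops(ops, size):
    valid = bytearray(size)              # 1 = byte holds page data
    for op, start, end in ops:
        fill = 1 if op == "put" else 0
        valid[start:end] = bytes([fill]) * (end - start)
    # Collapse the bitmap back into [start, end) ranges.
    ranges, start = [], None
    for i, bit in enumerate(valid):
        if bit and start is None:
            start = i
        elif not bit and start is not None:
            ranges.append((start, i)); start = None
    if start is not None:
        ranges.append((start, size))
    return ranges

ops = [("put", 512, 2048), ("put", 0, 1024),
       ("clear", 512, 1536), ("put", 2048, 2560)]
# Replaying the slide's sequence yields [0,512) and [1536,2560),
# matching the GetPageRange result above.
```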
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
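The (PartitionKey, RowKey) addressing can be pictured with a minimal in-memory model. This sketches the data model only, not the table service API:

```python
# Minimal model of an Azure-style table: entities are property
# dicts addressed by the pair (PartitionKey, RowKey), which is
# also the only indexed access path.

class Table:
    def __init__(self):
        self._rows = {}

    def insert(self, entity):
        key = (entity["PartitionKey"], entity["RowKey"])
        self._rows[key] = entity

    def get(self, partition_key, row_key):
        # Point lookup on the key pair: the fast, indexed query.
        return self._rows.get((partition_key, row_key))

    def scan_partition(self, partition_key):
        # Queries on any other property degrade to a scan.
        return [e for (pk, _), e in self._rows.items() if pk == partition_key]

t = Table()
t.insert({"PartitionKey": "blast", "RowKey": "job-001", "Status": "done"})
entity = t.get("blast", "job-001")
```

This is why the best-practices slide below warns that tables only index on partition and row keys: any other filter forces a scan.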
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
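The retry-logic point above is usually implemented as exponential backoff. A generic sketch; the attempt count and delays are arbitrary illustrative choices, not service-recommended values:

```python
# Generic retry wrapper with exponential backoff for transient
# storage/network failures. Attempt count and base delay are
# illustrative defaults.
import time

def with_retries(op, attempts=4, base_delay=0.5, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                            # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))   # 0.5s, 1s, 2s, ...

# Example: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)
```

Injecting `sleep` as a parameter keeps the wrapper testable without real delays.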
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 67
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small center (1000 servers) and a larger, 100K server center:

Technology      Cost in small data center    Cost in large data center    Ratio
Network         $95 per Mbps/month           $13 per Mbps/month           7.1
Storage         $2.20 per GB/month           $0.40 per GB/month           5.7
Administration  ~140 servers/administrator   >1000 servers/administrator  7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch small or non-existent; secondary is SAN or PFS; PB tertiary storage
  - DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
[Diagram: HTTP traffic enters through the Load Balancer to a Web Role (IIS hosting ASP.NET, WCF, etc.) and a Worker Role (main() { … }); each role runs in a VM with an Agent, managed by the Fabric]
Using queues for reliable messaging
To scale, add more of either
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
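The four-step flow above can be sketched in miniature with an in-process queue standing in for an Azure queue. This is illustrative only: the job list and the squaring "work" are made up, and the real service is reached over REST, not a local queue.

```python
import queue
import threading

# Stand-in for an Azure queue: the web role enqueues work items,
# worker roles dequeue and process them independently.
work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def web_role(jobs):
    # 1) Receive work  2) Put work in queue
    for job in jobs:
        work_queue.put(job)

def worker_role():
    # 3) Get work from queue  4) Do work
    while True:
        try:
            job = work_queue.get(timeout=0.1)
        except queue.Empty:
            return  # queue drained; a real worker would keep polling
        with results_lock:
            results.append(job * job)  # placeholder "work"
        work_queue.task_done()

web_role(range(10))
workers = [threading.Thread(target=worker_role) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(results))
```

Scaling either side independently is exactly the "add more of either" point: more web roles push faster, more worker roles drain faster, and the queue decouples the two.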
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
A closer look
[Diagram: applications reach Storage over HTTP through the Load Balancer and a REST API; the storage types are Blobs, Drives, Tables, and Queues; Compute and Storage both sit on the Fabric]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage
  Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app at work or home against the local Development Fabric and Development Storage, with versioned source control; the application works locally, then in staging, then in the cloud]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
Load Balancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: the AzureMODIS Service Web Role Portal feeds a Download Queue; data flows through the Data Collection Stage, Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage to produce Research Results]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
– Requires a large number of test runs for a given job (1–10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (local sequence database → compressed)
2. Upload to Azure Storage
3. Deploy Worker Roles – the Init() function downloads and decompresses data to the local disk
[Diagram: the compressed database is uploaded to Azure Storage, then deployed alongside the BLAST executable on each worker]
Step 2. Partitioning a Job
[Diagram: the Web Role takes User Input; a single partitioning Worker Role writes Input Partitions to Azure Storage and a Queue Message for each]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up Queue Messages, read their Input Partitions, and write BLAST Output and Logs back to Azure Storage]
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Small cloud development headaches are probably worth it
Resources
Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
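One way to read the scaling numbers above is to compute speedup and parallel efficiency relative to the 2-worker run, treating clock duration as the parallel region. This is illustrative arithmetic only; the durations are taken from the table.

```python
# Clock durations from the AzureBLAST scaling runs, in minutes.
runs = {25: 12, 16: 15, 8: 26, 4: 47, 2: 87}

base_workers, base_minutes = 2, runs[2]
for workers, minutes in sorted(runs.items()):
    speedup = base_minutes / minutes          # vs. the 2-worker run
    ideal = workers / base_workers            # perfect linear scaling
    efficiency = speedup / ideal
    print(f"{workers:>2} workers: speedup {speedup:4.2f}x "
          f"(ideal {ideal:5.2f}x), efficiency {efficiency:.0%}")
```

Efficiency falls as workers are added (58% at 25 workers versus 93% at 4), which is the practical side of time-space fungibility: the same total core-hours can be spent quickly at lower efficiency or slowly at higher efficiency.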
[Chart: time-space fungibility in the cloud, with axes of resources vs. time]
Utilizes a general jobs-based task manager which registers jobs and their resulting data
[Diagram: a job definition is split into tasks; a Registry Broker connects the user-premises (or internet) side – user, local registry, highly sensitive data, administrator, (HPC) cluster – with the Azure datacenters; results and data products are exposed through web management and the registry]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example namespace: account “jared” contains containers “images” (PIC01.JPG, PIC02.JPG) and “movies” (MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: account “jared” → containers “images”, “movies” → blobs PIC01.JPG, PIC02.JPG, MOV1.AVI → each blob is made up of blocks or pages (Block or Page 1, 2, 3, …)]
Example: a 10 GB movie is uploaded as blocks Block Id 1 through Block Id N
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
TheBlob.wmv is then committed in Windows Azure Storage
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
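The commit semantics described above can be modeled with a toy in-memory class. This is a sketch of the behavior the slides describe (blocks stage as uncommitted until a PutBlockList commits an ordered list), not the real Azure REST API or client library.

```python
class BlockBlob:
    """Toy model of block-blob commit semantics (not the Azure API)."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged by put_block
        self.committed = {}     # block_id -> bytes, part of the readable blob
        self.block_list = []    # ordered committed block ids

    def put_block(self, block_id, data):
        # Blocks are staged as uncommitted; they are invisible to readers.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The readable blob becomes exactly this ordered list of blocks;
        # ids may come from either the uncommitted or the committed set.
        new_committed = {}
        for bid in block_ids:
            if bid in self.uncommitted:
                new_committed[bid] = self.uncommitted[bid]
            elif bid in self.committed:
                new_committed[bid] = self.committed[bid]
            else:
                raise KeyError(f"unknown block id: {bid}")
        self.committed = new_committed
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""        # nothing readable until commit
blob.put_block_list(["b1", "b2"])
print(blob.read())
```

The key property this demonstrates is atomic visibility: readers never see a half-uploaded blob, because the switch from old to new content happens at PutBlockList.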
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are data stored in [1536, 2048)
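The operation sequence above can be replayed against a toy sparse-page store. This is a sketch of the semantics described on the slide (absent pages read as zeros, updates are immediate), not the Azure interface; the class and method names are made up.

```python
PAGE = 512

class PageBlob:
    """Toy model of page-blob semantics: a sparse array of 512-byte pages."""
    def __init__(self, size):
        self.size = size
        self.pages = {}  # page index -> bytes; absent pages read as zeros

    def put_page(self, start, end, fill=b"\x01"):
        # Immediate update, no commit step (unlike block blobs).
        for i in range(start // PAGE, end // PAGE):
            self.pages[i] = fill * PAGE

    def clear_page(self, start, end):
        for i in range(start // PAGE, end // PAGE):
            self.pages.pop(i, None)

    def get_page_ranges(self):
        # Coalesce adjacent valid pages into [start, end) byte ranges.
        ranges, run = [], None
        for i in sorted(self.pages):
            if run and run[1] == i * PAGE:
                run[1] = (i + 1) * PAGE
            else:
                if run:
                    ranges.append(tuple(run))
                run = [i * PAGE, (i + 1) * PAGE]
        if run:
            ranges.append(tuple(run))
        return ranges

blob = PageBlob(10 * 2**30)        # 10 GB address space
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges())      # [(0, 512), (1536, 2560)]
```

Replaying the slide's four operations reproduces its GetPageRange answer, which is a useful sanity check on the semantics: clears punch holes, later puts fill them back in, and only written pages count as valid data.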
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
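A minimal sketch of this entity model, using a hypothetical Table class: point lookups go through the (PartitionKey, RowKey) index, while any other filter is a scan over entities, which is the behavior behind the later best-practice warning that tables only index on partition and row keys.

```python
import time

class Table:
    """Toy model of Azure Table semantics: entities are property bags
    addressed by (PartitionKey, RowKey); no other property is indexed."""
    def __init__(self):
        self.entities = {}  # (partition_key, row_key) -> property dict

    def insert(self, partition_key, row_key, **props):
        # Every entity carries the three required properties.
        props.update(PartitionKey=partition_key, RowKey=row_key,
                     Timestamp=time.time())
        self.entities[(partition_key, row_key)] = props

    def get(self, partition_key, row_key):
        # Point lookup: served directly from the key index.
        return self.entities[(partition_key, row_key)]

    def query(self, predicate):
        # Filtering on any other property is a full scan.
        return [e for e in self.entities.values() if predicate(e)]

t = Table()
t.insert("barga", "post-001", title="Azure for Research")
t.insert("jackson", "post-002", title="AzureBLAST")
print(t.get("barga", "post-001")["title"])
print([e["RowKey"] for e in t.query(lambda e: "BLAST" in e["title"])])
```

Choosing PartitionKey well matters in the real service because entities sharing a partition can be served (and batch-updated) together, while distinct partitions can be spread across servers as traffic grows.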
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
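The retry advice above can be sketched as a small wrapper with exponential backoff and jitter. The names (with_retries, flaky_get) and the failure pattern are hypothetical; this is not a library API, just the shape every data access should take.

```python
import random
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Run op(), retrying transient faults with exponential backoff
    plus jitter; re-raise after the final attempt."""
    for attempt in range(attempts):
        try:
            return op()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Hypothetical flaky storage call: fails twice, then succeeds.
calls = {"n": 0}
def flaky_get():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "blob-bytes"

result = with_retries(flaky_get)
print(result)
```

In a large deployment transient faults are a certainty rather than an exception, so wrapping every storage and network call this way (and logging each retry) is cheap insurance.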
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small data center (1,000 servers) and a larger, 100K-server center:

Technology       Cost in small data center   Cost in large data center   Ratio
Network          $95 per Mbps/month          $13 per Mbps/month          7.1
Storage          $2.20 per GB/month          $0.40 per GB/month          5.7
Administration   ~140 servers/admin          >1000 servers/admin         7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
  - DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly available Fabric Controller (FC)
At minimum: CPU 1.5–1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local storage 2+ TB
Azure Platform: Compute and Storage – a closer look
[Diagram: HTTP requests pass through a Load Balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } loop); each role instance runs in a VM alongside an Agent, on top of the Fabric.]
Using queues for reliable messaging
To scale, add more of either role.
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
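The four-step flow above can be sketched with a local stand-in for the cloud queue. This is purely illustrative – Python's thread-safe queue.Queue plays the part of the Azure queue, and the role functions are hypothetical names, not the Azure SDK:

```python
import queue
import threading

work_queue = queue.Queue()   # local stand-in for an Azure queue
results = []

def web_role(items):
    # Steps 1 and 2: receive work and put it in the queue.
    for item in items:
        work_queue.put(item)

def worker_role():
    # Steps 3 and 4: get work from the queue and do it.
    while True:
        item = work_queue.get()
        if item is None:          # shutdown signal, for this sketch only
            break
        results.append(item * item)
        work_queue.task_done()

workers = [threading.Thread(target=worker_role) for _ in range(2)]
for w in workers:
    w.start()

web_role([1, 2, 3, 4])
work_queue.join()                 # wait until every message is processed
for _ in workers:
    work_queue.put(None)
for w in workers:
    w.join()

print(sorted(results))            # [1, 4, 9, 16]
```

Scaling means adding more worker threads here; in Azure it means adding more worker role instances, with the queue decoupling the two tiers exactly as the bullets above describe.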
Windows Azure storage – a closer look
[Diagram: applications and compute reach Blobs, Drives, Tables and Queues through a Load Balancer over HTTP via the REST API; storage runs on the Fabric alongside Compute.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premises applications or cloud applications
[Diagram: development workflow – develop your app at work or home against the local Development Fabric and Development Storage, keeping versions in source control; the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
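The "10 front-ends across 5 update domains" example works out to two instances per domain, so rolling one domain at a time keeps 80% of capacity serving. A minimal sketch – the round-robin (modulo) policy is our assumption for illustration, not the fabric controller's actual placement algorithm:

```python
def allocate(instances, update_domains):
    """Assign instance indices to update domains round-robin (illustrative policy)."""
    domains = {d: [] for d in range(update_domains)}
    for i in range(instances):
        domains[i % update_domains].append(i)
    return domains

# The slide's example: 10 front-ends, across 5 update domains.
placement = allocate(10, 5)
print([len(v) for v in placement.values()])   # [2, 2, 2, 2, 2]
```

Fault domains work the same way but partition by shared hardware failure (rack, power) rather than by update batch; an instance belongs to one of each.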
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: AzureMODIS pipeline – from the service web role portal, imagery enters a download queue, then flows through the data collection stage, reprojection stage, derivation reduction stage, and analysis reduction stage to produce research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100s of HIV and HepC researchers actively use it
  - 1000s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  - Requires a large number of test runs for a given job (1 – 10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure store
3. Deploy worker roles with the BLAST executable
   - The Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
The web role accepts the user input; a single partitioning worker role writes input partitions to Azure storage and enqueues one queue message per partition.
Step 3. Doing the Work
BLAST-ready worker roles pick up the queue messages, read their input partitions from Azure storage, and write BLAST output and logs back to storage.
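The partitioning step boils down to slicing the query set into fixed-size work items with one queue message each. A hypothetical sketch – function and field names here are ours, not AzureBLAST's:

```python
def partition_sequences(sequences, partition_size):
    """Split a list of query sequences into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def make_queue_messages(partitions):
    """One message per partition; a worker role consumes each message."""
    return [{"partition_id": i, "count": len(p)}
            for i, p in enumerate(partitions)]

seqs = ["seq%d" % n for n in range(10)]
parts = partition_sequences(seqs, 4)    # partition sizes: 4, 4, 2
msgs = make_queue_messages(parts)
print(len(msgs))                        # 3
```

Note how the partition size is a tuning knob: the "factoring work into optimal sizes" lesson below is about choosing it well for a given job.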
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - Little Cloud development headaches are probably worth it
Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
[Chart: resources vs. time – time–space fungibility in the cloud]
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks in the Azure datacenters, where data products and results are registered; a registry broker with web management links the user's local registry – and highly sensitive data – on the user premises (or internet), along with an (HPC) cluster and its administrator.]
Client Visualization / Cloud Data and Computation
The Cloud is not a jack-of-all-trades.
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Diagram: the storage account "jared" contains containers "images" (blobs PIC01.JPG and PIC02.JPG) and "movies" (blob MOV1.AVI); each blob is addressable by URL:]
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are <name, value> pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: each blob (e.g. jared/images/PIC01.JPG) is made up of blocks or pages – Block or Page 1, 2, 3, … – with each block identified by a Block ID.]
Uploading a 10 GB movie as a block blob to Windows Azure Storage:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
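The PutBlock / PutBlockList semantics can be modeled in a few lines: uncommitted blocks are invisible until a block list commits them, and a commit may draw on both uncommitted and previously committed blocks. An in-memory sketch of the behavior, not the storage service API:

```python
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}     # block id -> bytes, staged by PutBlock
        self.committed = {}       # block id -> bytes, named by the block list
        self.block_list = []      # ordered ids forming the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: each listed id may come from the uncommitted or committed set.
        new = {}
        for bid in block_ids:
            new[bid] = self.uncommitted.get(bid, self.committed.get(bid))
        self.committed = new
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""          # nothing readable before the commit
blob.put_block_list(["b1", "b2"])
print(blob.read())                 # b'hello world'
```

This is what makes block blobs suit streaming uploads: blocks can arrive in any order, in parallel, and the blob only changes atomically at PutBlockList time.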
Create MyBlob
Specify blob size = 10 GB; fixed page size = 512 bytes
Random-access operations over the 10 GB address space (page boundaries at 0, 512, 1024, 1536, 2048, 2560, …):
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
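The worked example above can be checked with a small model that tracks which 512-byte pages hold valid data (a sketch of the semantics only, simplified to page granularity; not the REST operations themselves):

```python
PAGE = 512

class PageBlob:
    def __init__(self, size):
        self.size = size
        self.pages = {}           # page start offset -> PAGE bytes of data

    def put_page(self, start, end, fill=b"\x01"):
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_range(self, start, end):
        """Merge adjacent valid pages into [start, end) ranges."""
        ranges = []
        for off in range(start, end, PAGE):
            if off in self.pages:
                if ranges and ranges[-1][1] == off:
                    ranges[-1][1] = off + PAGE
                else:
                    ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

    def get_blob(self, start, end):
        """Pages never written (or cleared) read back as zeros."""
        out = bytearray()
        for off in range(start - start % PAGE, end, PAGE):
            chunk = self.pages.get(off, b"\x00" * PAGE)
            out += chunk[max(start, off) - off:min(end, off + PAGE) - off]
        return bytes(out)

blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_range(0, 4096))     # [(0, 512), (1536, 2560)]
data = blob.get_blob(1000, 2048)
print(all(b == 0 for b in data[:536]))  # True: zeros until offset 1536
```

The 536 comes straight from the geometry: valid data resumes at 1536, and 1536 − 1000 = 536 zero bytes precede the 512 bytes stored in [1536,2048).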
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
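The ETag check on block blobs amounts to compare-and-swap: a writer supplies the ETag it last read, and the update is rejected if the blob changed underneath it. A local sketch of the idea – not the actual HTTP If-Match header mechanics:

```python
class ETagBlob:
    def __init__(self, data=b""):
        self.data = data
        self.etag = 0             # changes on every successful write

    def read(self):
        return self.data, self.etag

    def write(self, data, if_match):
        """Conditional update: succeeds only if the ETag still matches."""
        if if_match != self.etag:
            raise RuntimeError("412 Precondition Failed: blob was modified")
        self.data = data
        self.etag += 1

blob = ETagBlob(b"v1")
_, tag = blob.read()
blob.write(b"v2", if_match=tag)        # succeeds; ETag advances
try:
    blob.write(b"v3", if_match=tag)    # stale ETag: rejected
except RuntimeError as e:
    print(e)
```

Page blob leases solve the same race differently: instead of detecting a conflicting write after the fact, a lease gives one writer exclusive access up front, which suits the immediate-update semantics above.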
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: promoting a snapshot of MyBlob over the base blob]
A Windows Azure Drive is a Page Blob formatted as a
single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
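Entities keyed by (PartitionKey, RowKey) can be modeled as a nested map. Since only these two keys are indexed (a point the best practices below return to), point lookups and single-partition queries are cheap, while anything else scans. A sketch of the data model, not the ADO.NET Data Services or REST API:

```python
import time

class Table:
    def __init__(self):
        self.partitions = {}      # PartitionKey -> {RowKey -> entity}

    def insert(self, partition_key, row_key, **properties):
        entity = dict(properties,
                      PartitionKey=partition_key,
                      RowKey=row_key,
                      Timestamp=time.time())   # the three required properties
        self.partitions.setdefault(partition_key, {})[row_key] = entity

    def get(self, partition_key, row_key):
        """Point lookup on the two indexed keys."""
        return self.partitions[partition_key][row_key]

    def query_partition(self, partition_key):
        """Efficient: confined to a single partition."""
        return list(self.partitions.get(partition_key, {}).values())

t = Table()
t.insert("seattle", "alice", temp=18)
t.insert("seattle", "bob", temp=17)
t.insert("boston", "carol", temp=9)
print(len(t.query_partition("seattle")))   # 2
print(t.get("boston", "carol")["temp"])    # 9
```

The partition key is also the unit of scale-out: the service can spread different partitions across different servers, which is how tables grow to billions of entities.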
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
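"Include retry logic in all instances where you are accessing data" usually means bounded retries with exponential backoff, since most storage failures are transient. A generic sketch of the pattern – not the storage client library's built-in retry policy:

```python
import time

def with_retries(operation, max_attempts=4, base_delay=0.01):
    """Retry a flaky operation, doubling the delay between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                     # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulated transient failure: the first two calls fail, the third succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "data"

result = with_retries(flaky_read)
print(result)   # 'data', after two transient failures
```

Pair this with the idempotency advice above ("design your workers to execute a task only once"): because queue messages can be redelivered, the retried operation itself must be safe to repeat.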
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 70
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
e.g. "US Anywhere", "US North Central", "US South Central", …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account → Container → Blob hierarchy, e.g. account "jared" with containers "images" (PIC01.JPG, PIC02.JPG) and "movies" (MOV1.AVI):
http://jared.blob.core.windows.net/images/PIC01.JPG
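The naming scheme above composes directly into the blob URI. A tiny helper (illustrative only, not part of any SDK) makes the account/container/blob hierarchy explicit:

```python
def blob_uri(account, container, blob):
    # Account and container scope the blob name, as in the example URL above.
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

print(blob_uri("jared", "images", "PIC01.JPG"))
# http://jared.blob.core.windows.net/images/PIC01.JPG
```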
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
(Diagram: within an account such as "jared", each blob in a container, e.g. a 10 GB movie, is stored as a sequence of blocks or pages 1…N, each block identified by a Block ID)
Example: uploading a 10 GB movie to Windows Azure Storage as TheBlob.wmv:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update the blob
GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
Block ID and size are returned for each block
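The client side of this protocol can be sketched without assuming anything beyond what the slides state: blocks of at most 4 MB, block IDs of at most 64 bytes. `split_into_blocks` below is an invented helper, not an SDK call; it chunks a payload and generates base64-encoded counter IDs, after which each pair would go to PutBlock and the ID list to PutBlockList.

```python
import base64

BLOCK_LIMIT = 4 * 1024 * 1024  # blocks can be up to 4 MB each

def split_into_blocks(data, block_size=BLOCK_LIMIT):
    """Return (block_id, chunk) pairs; base64-encoded counters keep IDs <= 64 bytes."""
    blocks = []
    for i in range(0, len(data), block_size):
        block_id = base64.b64encode(f"{i // block_size:08d}".encode()).decode()
        blocks.append((block_id, data[i:i + block_size]))
    return blocks

payload = b"x" * (9 * 1024 * 1024)  # a 9 MB payload -> blocks of 4 MB, 4 MB, 1 MB
blocks = split_into_blocks(payload)
print([len(chunk) for _, chunk in blocks])  # [4194304, 4194304, 1048576]
# Real upload: PutBlock(blobName, block_id, chunk) for each pair (uncommitted),
# then one PutBlockList(blobName, ids-in-read-order) to commit the readable blob.
```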
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
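The example above can be checked with a toy in-memory model of page-blob semantics. `PageBlobModel` is purely illustrative (the real service exposes these as REST operations): writes and clears flip 512-byte pages between valid and invalid, and reads return zeros for pages never written or since cleared.

```python
PAGE = 512

class PageBlobModel:
    """Toy in-memory model of page-blob semantics; not the real storage API."""

    def __init__(self, size):
        self.data = bytearray(size)             # page contents (zeros until written)
        self.valid = [False] * (size // PAGE)   # which 512-byte pages hold data

    def put_page(self, start, end, fill=b"\xab"):
        # Writes [start, end); offsets must be page aligned.
        self.data[start:end] = fill * (end - start)
        for p in range(start // PAGE, end // PAGE):
            self.valid[p] = True

    def clear_page(self, start, end):
        self.data[start:end] = bytes(end - start)
        for p in range(start // PAGE, end // PAGE):
            self.valid[p] = False

    def get_page_ranges(self, start, end):
        # Maximal [lo, hi) ranges of valid pages inside [start, end).
        ranges, lo = [], None
        for p in range(start // PAGE, end // PAGE):
            if self.valid[p] and lo is None:
                lo = p * PAGE
            elif not self.valid[p] and lo is not None:
                ranges.append((lo, p * PAGE))
                lo = None
        if lo is not None:
            ranges.append((lo, end))
        return ranges

    def get_blob(self, start, end):
        # Reads return zeros for pages never written (or cleared).
        return bytes(self.data[start:end])

blob = PageBlobModel(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges(0, 4096))  # [(0, 512), (1536, 2560)]
```

Replaying the slide's four operations reproduces its stated result, including the GetBlob behavior: the first 536 bytes of [1000, 2048) come back as zeros and the remaining 512 bytes as stored data.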
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
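The ETag-check style of concurrency for block blobs can be sketched as optimistic versioning. The `Blob` class below is a toy model, not the storage API: each committed write bumps the ETag, and a write conditioned on a stale ETag is rejected, which in the real service surfaces as a 412 Precondition Failed response.

```python
class Blob:
    """Toy optimistic-concurrency model of ETag-checked blob updates."""

    def __init__(self, content=b""):
        self.content, self.etag = content, 0

    def read(self):
        return self.content, self.etag

    def write(self, content, if_match):
        # Succeed only if the blob is unchanged since the caller's read.
        if if_match != self.etag:
            raise RuntimeError("412 precondition failed: blob changed since read")
        self.content = content
        self.etag += 1
        return self.etag

b = Blob(b"v0")
_, tag = b.read()
b.write(b"v1", if_match=tag)       # succeeds; etag advances
try:
    b.write(b"v2", if_match=tag)   # stale etag -> conflict detected
except RuntimeError as e:
    print(e)
```

Page blobs instead take a lease (pessimistic concurrency): the writer holds exclusive access up front rather than detecting conflicts at commit time.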
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
(Diagram: promoting a snapshot of MyBlob over the base blob)
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
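A sketch of that data model (the `Table` class and its method names are invented for illustration; the real service is reached through ADO.NET Data Services or REST): entities are property bags addressed by the required PartitionKey and RowKey, with a system-set Timestamp.

```python
import time

class Table:
    """Toy model of the table data model above; not a client library."""

    def __init__(self):
        self.rows = {}

    def insert(self, entity):
        # PartitionKey + RowKey uniquely address an entity; Timestamp is system-set.
        key = (entity["PartitionKey"], entity["RowKey"])
        self.rows[key] = dict(entity, Timestamp=time.time())

    def get(self, partition_key, row_key):
        return self.rows[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # Only the keys are indexed, so efficient queries stay within one partition.
        return [e for (pk, _), e in sorted(self.rows.items()) if pk == partition_key]

t = Table()
t.insert({"PartitionKey": "images", "RowKey": "PIC01.JPG", "Width": 1024})
t.insert({"PartitionKey": "images", "RowKey": "PIC02.JPG", "Width": 800})
t.insert({"PartitionKey": "movies", "RowKey": "MOV1.AVI", "SizeMB": 700})
print([e["RowKey"] for e in t.query_partition("images")])  # ['PIC01.JPG', 'PIC02.JPG']
```

Note that entities in the same table need not share a schema: the two "images" entities carry Width while the "movies" entity carries SizeMB, which is exactly the "not relational" point above.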
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
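The retry guidance above can be captured in a small helper. `with_retries` is illustrative only (production code would use a client library's retry policy): it retries transient failures with exponential backoff plus jitter and re-raises once the attempts are exhausted.

```python
import random
import time

def with_retries(operation, attempts=4, base_delay=0.1):
    """Retry a data-access call on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            # Back off exponentially, with jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

calls = {"n": 0}
def flaky_read():
    # Simulated storage call that fails transiently twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "data"

print(with_retries(flaky_read))  # 'data', after two transient failures
```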
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center:

Technology     | Cost in small-sized Data Center | Cost in Large Data Center   | Ratio
Network        | $95 per Mbps/month              | $13 per Mbps/month          | 7.1
Storage        | $2.20 per GB/month              | $0.40 per GB/month          | 5.7
Administration | ~140 servers/administrator      | >1000 servers/administrator | 7.1

Each data center is 11.5 times the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  - DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64, Memory 1.7 GB, Network 100+ Mbps, Local Storage 500 GB
Up to: CPU 8 cores, Memory 14.2 GB, Local Storage 2+ TB
Azure Platform: Compute and Storage
A closer look at Compute
(Diagram: HTTP traffic passes through a Load Balancer to Web Roles, where IIS hosts ASP.NET, WCF, etc., and to Worker Roles running a main() { … } loop; each role runs in a VM with an Agent, managed by the Fabric)
Using queues for reliable messaging
To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
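A local sketch of the pattern in the diagram above, using Python's in-process queue in place of an Azure queue (so no visibility timeouts or durable messaging, which the real service adds): the web role enqueues work items, and any number of worker roles drain the shared queue independently.

```python
import queue
import threading

work = queue.Queue()   # stands in for the Azure queue
results = []
lock = threading.Lock()

def web_role(n_items):
    for i in range(n_items):
        work.put(i)                        # 2) put work in queue

def worker_role():
    while True:
        try:
            item = work.get(timeout=0.2)   # 3) get work from queue
        except queue.Empty:
            return                         # queue drained: worker exits
        with lock:
            results.append(item * item)    # 4) do work
        work.task_done()

web_role(100)                              # 1) receive work
workers = [threading.Thread(target=worker_role) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(results))  # 100; adding workers scales throughput, not correctness
```

Because the roles only share the queue, scaling is a deployment decision: change the worker count without touching the code, which is exactly why queues decouple the parts of the application.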
A closer look at Storage
(Diagram: applications on the Compute fabric reach Blobs, Drives, Tables, and Queues in Storage through a Load Balancer, over HTTP via the REST API)
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational, but entities which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps, and by other on-premise applications or cloud applications
(Diagram: develop your app at work or at home against the Development Fabric and Development Storage, under local source/version control; the application works locally, then in staging, then in the Cloud)
What is the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
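The allocation idea can be sketched with a round-robin policy, matching the slide's "10 front-ends across 5 update domains" example. This is one simple policy for illustration; the fabric controller's real placement logic is richer and also spans fault domains.

```python
def allocate(instances, update_domains):
    """Round-robin role instances across update domains."""
    placement = {d: [] for d in range(update_domains)}
    for i in range(instances):
        placement[i % update_domains].append(f"frontend-{i}")
    return placement

# 10 front-ends across 5 update domains -> 2 per domain, so updating one
# domain at a time keeps 8 of 10 instances serving traffic.
placement = allocate(10, 5)
print({d: len(v) for d, v in placement.items()})  # {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}
```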
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain desired number of roles
  - Failed roles automatically restarted
  - Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, "EOS AM", launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, "EOS PM", launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
(Diagram: the AzureMODIS Service Web Role portal coordinates a download queue and the data collection, reprojection, derivation reduction, and analysis reduction stages, producing research results)
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100s of HIV and HepC researchers actively use it
  - 1000s of research communities rely on the results
Cover of PLoS Biology, November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 72
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
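The allocation in Step 1 can be illustrated with a toy round-robin placement, matching the example of 10 front-ends across 5 update domains. This is a simplified model for intuition only; the real Fabric Controller's placement algorithm is more involved.

```python
def allocate(instances, fault_domains, update_domains):
    """Round-robin placement: instance i lands in fault domain i % F and
    update domain i % U, so no single domain holds a disproportionate
    share of the role's instances (a simplified model of FC allocation)."""
    return [(i % fault_domains, i % update_domains) for i in range(instances)]

# 10 front-ends, 2 fault domains, 5 update domains (as in the slide example)
placement = allocate(10, fault_domains=2, update_domains=5)

# Group by update domain: walking update domains 0..4 one at a time
# during a rolling upgrade takes down only 2 of the 10 instances.
per_update_domain = {}
for fd, ud in placement:
    per_update_domain.setdefault(ud, []).append(fd)
```

Note that each update domain also spans both fault domains, so neither a rolling upgrade nor a rack failure can take the whole role offline at once.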
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2,300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
(Diagram: AzureMODIS pipeline. A service web role portal feeds a download queue; data moves through a data collection stage, a reprojection stage, a derivation reduction stage, and an analysis reduction stage to produce research results.)
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100s of HIV and HepC researchers actively use it
  - 1000s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job, 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  - Requires a large number of test runs for a given job (1–10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure Storage
3. Deploy worker roles with the BLAST executable; each role’s Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
(Diagram: the web role places user input into Azure Storage; a single partitioning worker role splits it into input partitions and posts a queue message for each.)
Step 3. Doing the Work
(Diagram: BLAST-ready worker roles pull queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to storage.)
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - Little Cloud development headaches are probably worth it
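The "optimal size" lesson above amounts to batching fine-grained tests into larger work items, one queue message each, so queue and scheduling overhead is amortized. A minimal sketch; the helper name and batch size are illustrative, and the right size must be found by test runs:

```python
def partition(tasks, batch_size):
    """Group fine-grained tasks into batches of at most batch_size;
    each batch would become one queue message for a worker role.
    batch_size is the tuning knob the lessons call 'optimal size'."""
    return [tasks[i:i + batch_size] for i in range(0, len(tasks), batch_size)]

# 10 hypothetical test runs batched 4 at a time
batches = partition(list(range(10)), 4)
```

Larger batches mean fewer storage transactions and queue messages; smaller batches mean finer-grained load balancing and less work lost when a worker fails, which is why the optimum shifts with job scope.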
Resources

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
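From the table, parallel efficiency relative to the 2-worker run can be computed directly. A small sketch; "efficiency" here is defined as baseline core-minutes divided by core-minutes at N workers, which is one common convention, not a figure from the slides:

```python
def minutes(hms):
    """Convert an h:mm:ss string to minutes."""
    h, m, s = map(int, hms.split(":"))
    return h * 60 + m + s / 60

# Clock durations from the table above
clock = {2: "1:27:00", 4: "0:47:00", 8: "0:26:00", 16: "0:15:00", 25: "0:12:00"}

# Core-minutes consumed at each scale
core_minutes = {n: n * minutes(t) for n, t in clock.items()}

# Parallel efficiency relative to the 2-worker run
efficiency = {n: core_minutes[2] / core_minutes[n] for n in clock}
```

The 25-worker run is 87/12 ≈ 7.25× faster than the 2-worker run on 12.5× the workers, i.e. about 58% efficiency, illustrating the overhead that grows with scale even as wall-clock time keeps falling.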
(Chart: time-space fungibility in the Cloud; the same total core-hours can be spent as many workers for a short time or as few workers for a long time.)
Utilizes a general jobs-based task manager, which registers jobs and their resulting data
(Diagram: a job definition is broken into tasks; a registry, registry broker, and local registry track jobs, their data products, and results. The system spans the user premises (or internet), including web management, an (HPC) cluster and its administrator, and highly sensitive data kept locally, and the Azure datacenters.)
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
  Applications using peripheral devices
  Applications with heavy graphics requirements
  Legacy user interfaces that would be difficult to port
Our goal then:
  Make best use of the capabilities of client and cloud computing
  Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
  “US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
(Diagram: storage account “jared” holds blob containers “images”, with blobs PIC01.JPG and PIC02.JPG, and “movies”, with blob MOV1.AVI. Example blob URL:)
http://jared.blob.core.windows.net/images/PIC01.JPG
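The naming scheme simply composes account, container, and blob name into a URL, as a one-line sketch shows (the account, container, and blob names are the example values from the diagram):

```python
def blob_uri(account, container, blob):
    # Blob addressing scheme: http://<account>.blob.core.windows.net/<container>/<blob>
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

uri = blob_uri("jared", "images", "PIC01.JPG")
```

Because the account name is part of the hostname, it must be globally unique and DNS-safe.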
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate metadata with a container
Metadata are name/value pairs
Up to 8KB of metadata per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
(Diagram: within the account/container hierarchy, each blob, e.g. a 10 GB movie, consists of a sequence of blocks or pages; blocks are addressed by Block IDs 1 through N, pages by offset.)
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
(Diagram: the blocks are assembled into TheBlob.wmv in Windows Azure Storage.)
Blocks can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
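The commit semantics (uncommitted blocks become readable only once PutBlockList names them) can be modeled in a few lines. This is a toy model for intuition, not the storage API; real block IDs are opaque strings up to 64 bytes:

```python
class BlockBlob:
    """Toy model of block-blob semantics: put_block stages uncommitted
    blocks; put_block_list atomically commits a chosen ordering drawn
    from committed and uncommitted blocks. Readers only ever see the
    last committed list."""

    def __init__(self):
        self.uncommitted = {}   # block_id -> bits, staged but invisible
        self.committed = []     # ordered list of (block_id, bits)

    def put_block(self, block_id, bits):
        self.uncommitted[block_id] = bits

    def put_block_list(self, block_ids):
        # Blocks may come from the committed or uncommitted set
        available = dict(self.committed)
        available.update(self.uncommitted)
        self.committed = [(b, available[b]) for b in block_ids]
        self.uncommitted = {}

    def read(self):
        return b"".join(bits for _, bits in self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""          # nothing visible before the commit
blob.put_block_list(["b1", "b2"])  # atomic switch to the new version
```

The atomic commit is what makes block blobs safe for streaming uploads: a reader never observes a half-uploaded blob.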
Create MyBlob
Specify Blob Size = 10 GBytes, fixed page size = 512 bytes
Random access operations over the 10 GB address space:
  PutPage[512, 2048)
  PutPage[0, 1024)
  ClearPage[512, 1536)
  PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes, then 512 bytes of data stored in [1536,2048)
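The worked example above can be checked with a toy page-blob model that tracks which 512-byte pages are valid. Offsets are assumed page-aligned, as in the example; this is an illustration of the semantics, not the storage API:

```python
PAGE = 512

class PageBlob:
    """Toy page blob: writes take effect immediately; only pages that
    have been written (and not cleared) count as valid data."""

    def __init__(self, size):
        self.size = size
        self.valid = set()   # indices of valid 512-byte pages

    def put_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.valid.add(p)

    def clear_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.valid.discard(p)

    def get_page_ranges(self):
        """Coalesce valid pages into byte ranges, like GetPageRange."""
        ranges, run = [], None
        for p in sorted(self.valid):
            if run and p == run[1]:
                run = (run[0], p + 1)      # extend the current run
            else:
                if run:
                    ranges.append(run)
                run = (p, p + 1)           # start a new run
        if run:
            ranges.append(run)
        return [(a * PAGE, b * PAGE) for a, b in ranges]

blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
```

Running the four operations in the slide's order reproduces its answer: valid ranges [0,512) and [1536,2560).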
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
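A sketch of the entity model: a dict keyed by (PartitionKey, RowKey), with the required properties attached to each entity. This is a toy in-memory model, not the Table service; the fixed Timestamp and the sample entities are placeholders:

```python
table = {}  # (PartitionKey, RowKey) -> entity

def insert(partition_key, row_key, **props):
    """Every entity carries the required PartitionKey / RowKey /
    Timestamp properties plus an open-ended bag of user properties
    (no fixed schema, since tables are not relational)."""
    entity = {"PartitionKey": partition_key, "RowKey": row_key,
              "Timestamp": "2010-01-01T00:00:00Z",  # placeholder value
              **props}
    table[(partition_key, row_key)] = entity
    return entity

def query_partition(partition_key):
    # Queries scoped to one partition are the efficient, indexed path;
    # (PartitionKey, RowKey) is the only index the service maintains.
    return [e for (pk, _), e in sorted(table.items()) if pk == partition_key]

insert("images", "PIC01.JPG", size_mb=2)
insert("images", "PIC02.JPG", size_mb=3)
insert("movies", "MOV1.AVI", size_mb=700)
```

The PartitionKey also determines how entities are distributed across servers, which is why choosing it well matters for scale.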
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
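The retry advice above can be sketched as exponential backoff wrapped around a storage call. The helper name, attempt count, and delays are illustrative; `flaky` stands in for any data-access operation that can fail transiently:

```python
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Call op(), retrying with exponential backoff on failure;
    re-raise the error after the final attempt."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)  # 0.01s, 0.02s, 0.04s, ...

# Simulated transient failure: succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "ok"

result = with_retries(flaky)
```

Backoff matters because in a large deployment many instances retrying in lockstep can prolong the very outage they are reacting to.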
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 74
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center:
Technology       Cost in small-sized DC     Cost in large DC         Ratio
Network          $95 per Mbps/month         $13 per Mbps/month       7.1
Storage          $2.20 per GB/month         $0.40 per GB/month       5.7
Administration   ~140 servers/admin         >1000 servers/admin      7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
  - DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; memory 1.7 GB; network 100+ Mbps; local storage 500 GB
Up to: CPU 8 cores; memory 14.2 GB; local storage 2+ TB
Azure Platform: Compute and Storage
A closer look
[Diagram: HTTP requests pass through a Load Balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } loop); each role instance runs in a VM with an Agent, coordinated by the Fabric.]
Using queues for reliable messaging
To scale, add more of either role:
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
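The queue pattern above can be sketched as a toy model. This is illustrative only, not the real Azure Queue API: the class name and method signatures are ours, but the visibility-timeout behavior is what makes the messaging reliable — a worker that crashes before deleting its message simply lets the message reappear for another worker.

```python
import time

class Queue:
    """Toy model of queue-based reliable messaging (not the Azure SDK):
    get() hides a message for a visibility timeout; only an explicit
    delete() removes it for good."""
    def __init__(self):
        self._messages = []  # each entry: [body, invisible_until]

    def put(self, body):
        self._messages.append([body, 0.0])

    def get(self, visibility_timeout=30.0, now=None):
        now = time.time() if now is None else now
        for msg in self._messages:
            if msg[1] <= now:                   # currently visible
                msg[1] = now + visibility_timeout
                return msg[0]
        return None

    def delete(self, body):
        self._messages = [m for m in self._messages if m[0] != body]

q = Queue()
q.put("render frame 17")                    # web role enqueues work
job = q.get(visibility_timeout=30, now=0)   # worker role dequeues it
# ...worker crashes before deleting; after the timeout the message reappears
job_again = q.get(now=31)                   # a second worker picks it up
q.delete(job_again)                         # work finished: remove for good
```

Note the fault-masking: no message is lost when a worker role dies mid-task, at the cost of possible at-least-once (duplicate) processing.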
A closer look
[Diagram: compute applications and external clients reach Azure Storage (Blobs, Drives, Tables, Queues) through a Load Balancer via a REST API over HTTP.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage (not relational; entities contain a set of properties)
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app at work or home against the local Development Fabric and Development Storage, with source control for versioning; the application works locally first, then in staging in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update (for example, a set of nodes to update), used when rolling forward or backward
The developer assigns the number required by each role (for example, 10 front-ends across 5 update domains)
Allocation is across update domains
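A minimal sketch of the allocation idea, under our own assumptions (the real Fabric Controller placement algorithm is not described here): spread instances round-robin so that neither a hardware failure nor a rolling update ever touches every instance at once.

```python
def allocate(instances, fault_domains, update_domains):
    """Illustrative round-robin placement (not the real FC algorithm):
    returns a (fault_domain, update_domain) pair per instance."""
    return [(i % fault_domains, i % update_domains) for i in range(instances)]

# The slide's example: 10 front-ends across 5 update domains,
# assuming 2 fault domains for illustration.
placement = allocate(10, fault_domains=2, update_domains=5)
# Rolling an update through one update domain touches only 2 of the
# 10 instances, and each fault domain holds only half of them.
```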
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a new node: a suitable replacement location is found and existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
AzureMODIS Service Web Role Portal
[Pipeline: Download Queue → Data Collection Stage → Reprojection Stage → Derivation Reduction Stage → Analysis Reduction Stage → Research Results]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
- 100’s of HIV and HepC researchers actively use it
- 1000’s of research communities rely on the results
Cover of PLoS Biology, November 2008
• Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
- Requires a large number of test runs for a given job (1-10M tests)
- Highly compressed data per job (~100 KB per job)
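A job of 1-10M small tests is not enqueued one message per test; it is split into batches, one queue message per batch. A hedged sketch of that partitioning (the function and batch size are our illustration, not the PhyloD implementation):

```python
def partition(num_tests, batch_size):
    """Split a large run of small tests into [start, end) batches,
    one queue message per batch. batch_size is a tuning knob:
    too small wastes storage transactions, too large loses parallelism."""
    return [(start, min(start + batch_size, num_tests))
            for start in range(0, num_tests, batch_size)]

# Hypothetical example: a 1M-test job in batches of 2,500 tests
batches = partition(1_000_000, batch_size=2_500)  # 400 queue messages
```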
Step 1. Staging
1. Compress required data (the local Sequence Database)
2. Upload to the Azure Store
3. Deploy Worker Roles
- the Init() function downloads and decompresses the data to the local disk
[Diagram: data moves from Compressed to Uploaded (Azure Storage) to Deployed, alongside the BLAST executable.]
Step 2. Partitioning a Job
[Diagram: the Web Role takes User Input; a single partitioning Worker Role writes Input Partitions to Azure Storage, with a Queue Message for each.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up Queue Messages, read their Input Partitions from Azure Storage, and write BLAST Output and Logs back.]
• Always design with failure in mind - on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts - the optimal size may change depending on the scope of the job
• Test runs are your friend - blowing $20,000 of computation is not a good idea
• Make ample use of logging features - when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! - little Cloud development headaches are probably worth it
Resources

Workers    Clock duration    Total run time    Computational run time
25         0:12:00           2:19:39           1:49:43
16         0:15:00           2:25:12           1:53:47
8          0:26:00           2:33:23           2:00:14
4          0:47:00           2:34:17           2:01:06
2          1:27:00           2:31:39           1:59:13

[Chart: time-space fungibility in the Cloud - resources can be traded against time.]
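A quick sanity check on the clock durations above (the `minutes` helper is our illustration, not from the talk): going from 2 workers to 25 workers cuts wall-clock time from 1:27:00 to 0:12:00, a 7.25x speedup from 12.5x the workers - scaling is useful but not perfect, which is exactly the time-space trade-off the chart describes.

```python
def minutes(hms):
    """Convert an H:MM:SS clock duration to minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

# Clock durations from the run-time table above
two_workers = minutes("1:27:00")      # 87.0 minutes
twenty_five = minutes("0:12:00")      # 12.0 minutes
speedup = two_workers / twenty_five   # 7.25x from 12.5x more workers
```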
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a Job definition fans out into Tasks on an (HPC) Cluster; a Registry Broker links the Local Registry on the user premises (holding the Highly Sensitive Data) to the Registry and Data Products in the Azure datacenters; the Administrator and User work through Web Management, and Results flow back to the user.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account: jared
  Container: images → Blobs: PIC01.JPG, PIC02.JPG
  Container: movies → Blob: MOV1.AVI
Example URL: http://jared.blob.core.windows.net/images/PIC01.JPG
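The naming hierarchy maps directly onto the URL shown above. A small sketch of that composition (the helper function is ours; the host/path scheme is taken from this slide's example):

```python
def blob_url(account, container, blob):
    """Compose the public blob URI from account, container and blob
    names, following the scheme in the slide's example URL."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

url = blob_url("jared", "images", "PIC01.JPG")
# → "http://jared.blob.core.windows.net/images/PIC01.JPG"
```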
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the account/container/blob hierarchy above, each blob is composed of Blocks or Pages 1, 2, 3, … N, with each block identified by its Block ID.]
Uploading a 10 GB movie to Windows Azure Storage, block by block:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
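The PutBlock/PutBlockList semantics can be sketched as a toy in-memory model (illustrative only, not the real service API): blocks are staged as uncommitted until a block list commits an ordered selection as the readable blob.

```python
class BlockBlob:
    """Toy model of block-blob update semantics: put_block stages
    uncommitted blocks; put_block_list commits an ordered list of
    block IDs (drawn from committed or uncommitted blocks) as the
    readable version of the blob."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes
        self.committed = []     # ordered (block id, bytes) pairs

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        staged = dict(self.committed)       # committed blocks are reusable
        staged.update(self.uncommitted)     # uncommitted blocks too
        self.committed = [(b, staged[b]) for b in block_ids]
        self.uncommitted = {}

    def read(self):
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
# Nothing is readable until the block list is committed
blob.put_block_list(["b1", "b2"])
```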
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space).
Random access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
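The page-blob sequence above can be reproduced with a toy model (illustrative only, not the real API) that just tracks which 512-byte pages hold valid data:

```python
class PageBlob:
    """Toy model of page-blob writes: put_page/clear_page operate on
    512-byte-aligned ranges; get_page_range coalesces valid pages
    into [start, end) ranges."""
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.valid = set()          # valid page-aligned offsets

    def put_page(self, start, end):
        self.valid |= set(range(start, end, self.PAGE))

    def clear_page(self, start, end):
        self.valid -= set(range(start, end, self.PAGE))

    def get_page_range(self):
        ranges, run = [], None
        for off in sorted(self.valid):
            if run and off == run[1]:
                run[1] = off + self.PAGE    # extend the current range
            else:
                run = [off, off + self.PAGE]
                ranges.append(run)
        return [tuple(r) for r in ranges]

b = PageBlob(10 * 2**30)            # 10 GB address space
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
# Matches the slide: valid ranges are [0,512) and [1536,2560)
```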
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
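The entity model can be sketched with a toy table (our illustration, not the Table service API): every entity is addressed by its (PartitionKey, RowKey) pair, lookups on both keys are direct, and anything else amounts to a scan - which is why the best-practices slide below warns that tables only index on partition and row keys.

```python
class Table:
    """Toy entity store keyed by (PartitionKey, RowKey)."""
    def __init__(self):
        self.entities = {}   # (PartitionKey, RowKey) -> properties dict

    def insert(self, partition_key, row_key, **properties):
        self.entities[(partition_key, row_key)] = properties

    def get(self, partition_key, row_key):
        # Direct lookup: both keys known
        return self.entities.get((partition_key, row_key))

    def query_partition(self, partition_key):
        # Scan restricted to one partition
        return [props for (pk, _), props in self.entities.items()
                if pk == partition_key]

t = Table()
t.insert("movies", "MOV1.AVI", length_sec=5400)
t.insert("images", "PIC01.JPG", width=1024)
```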
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
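The "include retry logic" guidance above can be sketched as a minimal retry-with-backoff wrapper (our sketch, not a library API):

```python
import time

def with_retries(op, attempts=3, base_delay=0.1):
    """Call op(); on failure, retry with exponential backoff,
    surfacing the error only after the final attempt."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                               # out of retries
            time.sleep(base_delay * 2 ** attempt)   # 0.1s, 0.2s, ...

# Hypothetical flaky storage read that succeeds on the third attempt
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

result = with_retries(flaky_read)
```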
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 75
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and easy-to-use APIs
ADO.NET Data Services (.NET 3.5 SP1)
.NET classes and LINQ
REST, from any platform or language
Table
A storage account can contain many tables
Table names are scoped by account
A table is a set of entities (i.e., rows)
Entity
A set of properties (columns)
Required properties: PartitionKey, RowKey and Timestamp
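Because tables index only on PartitionKey and RowKey, a lookup that supplies both keys is a single indexed read, while anything else degrades to a scan. A toy in-memory model makes the distinction concrete (illustrative only; the real service is reached via ADO.NET Data Services or REST, and the entity names here are made up):

```python
table = {}  # (PartitionKey, RowKey) -> dict of properties

def insert(entity):
    # Every entity must carry the two required key properties.
    table[(entity["PartitionKey"], entity["RowKey"])] = entity

def point_query(pk, rk):
    # Fast path: both keys known, one indexed lookup.
    return table.get((pk, rk))

def partition_scan(pk):
    # Slower path: enumerate everything in one partition.
    return [e for (p, _), e in table.items() if p == pk]

insert({"PartitionKey": "movies", "RowKey": "MOV1.AVI", "size_mb": 700})
insert({"PartitionKey": "images", "RowKey": "PIC01.JPG", "size_mb": 2})

assert point_query("movies", "MOV1.AVI")["size_mb"] == 700
assert len(partition_scan("images")) == 1
```

Choosing keys so that the common queries become point lookups or single-partition scans is the main design decision when using tables.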
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember that Azure tables index only on the partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates for all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic on every path that accesses data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
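The retry advice above can be sketched as a small helper with exponential backoff. This is a generic pattern, not a specific Azure SDK API; the names and the simulated fault are illustrative:

```python
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Run op(), retrying transient failures with exponential backoff."""
    for i in range(attempts):
        try:
            return op()
        except IOError:
            if i == attempts - 1:
                raise                     # out of retries: surface the error
            time.sleep(base_delay * (2 ** i))

calls = {"n": 0}
def flaky_read():
    # Fails twice, then succeeds -- stands in for a transient storage fault.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return b"data"

assert with_retries(flaky_read) == b"data"
assert calls["n"] == 3
```

Wrapping every storage call this way turns the transient faults that are routine at cloud scale into invisible delays instead of job failures.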
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and
what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 77
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Scaling measurements:

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Figure: time-space fungibility in the Cloud (resources vs. time)]
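The measurements above show the trade directly: the aggregate run time stays roughly flat while the wall-clock time shrinks with worker count. A quick calculation of speedup and parallel efficiency from the clock durations, taking the 2-worker run as the baseline:

```python
def to_minutes(hms):
    """Convert an h:mm:ss string to minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

# Clock durations from the table above
runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}

base_workers, base = 2, to_minutes(runs[2])   # 87 minutes with 2 workers
for w, t in sorted(runs.items()):
    speedup = base / to_minutes(t)
    efficiency = speedup / (w / base_workers)
    print(f"{w:2d} workers: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

At 25 workers the job finishes 7.25x faster than at 2, for an efficiency near 0.6: paying a modest overhead buys a large reduction in time-to-result.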
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: on the user premises (or internet), the user, administrator, local registry, and highly sensitive data sit alongside an (HPC) cluster running the tasks of a job definition; a registry broker links them to the registry, web management, data products, and results in the Azure datacenters]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades. Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose a geo-location to host the storage account (“US Anywhere”, “US North Central”, “US South Central”, …)
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example blob namespace: account “jared” contains container “images” (blobs PIC01.JPG, PIC02.JPG) and container “movies” (blob MOV1.AVI):
http://jared.blob.core.windows.net/images/PIC01.JPG
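The URI above is entirely determined by the account, container, and blob names. A one-line helper makes the structure explicit (this sketch ignores the service's naming rules and authenticated access):

```python
def blob_uri(account, container, blob):
    """Compose a blob URI: http://<account>.blob.core.windows.net/<container>/<blob>"""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

uri = blob_uri("jared", "images", "PIC01.JPG")
```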
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level (private or publicly accessible)
Associate metadata with a container
Metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
Example: uploading a 10 GB movie to Windows Azure Storage as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
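The two-phase semantics above (stage blocks, then atomically publish a block list drawn from either the uncommitted or committed set) can be captured in a toy in-memory model. This illustrates only the commit behavior, not the real REST API:

```python
class BlockBlobSim:
    """In-memory illustration of block-blob commit semantics."""
    def __init__(self):
        self.committed = {}    # block id -> bytes backing the readable version
        self.uncommitted = {}  # staged blocks awaiting a commit
        self.block_list = []   # ordered ids of the readable version

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data  # staging never disturbs readers

    def put_block_list(self, ids):
        # Each id may come from the uncommitted or the committed list
        blocks = {i: self.uncommitted.get(i, self.committed.get(i)) for i in ids}
        if any(v is None for v in blocks.values()):
            raise KeyError("unknown block id")
        self.committed, self.block_list, self.uncommitted = blocks, list(ids), {}

    def read(self):
        return b"".join(self.committed[i] for i in self.block_list)

b = BlockBlobSim()
b.put_block("01", b"hello ")
b.put_block("02", b"world")
b.put_block_list(["01", "02"])
b.put_block("02", b"azure")        # staged update; readers still see old data
assert b.read() == b"hello world"
b.put_block_list(["01", "02"])     # "01" from committed, "02" from uncommitted
assert b.read() == b"hello azure"
```

Because the readable version only changes at PutBlockList, a failed upload never leaves a half-written blob visible.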
Create MyBlob: specify Blob Size = 10 GBytes, fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
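The operation sequence above can be replayed against a toy model of page-blob semantics to confirm the stated results. This is an illustration only; a small 4 KB blob stands in for the 10 GB one, and pages are marked as a set of valid indices:

```python
class PageBlobSim:
    """Toy model of page-blob semantics with fixed 512-byte pages."""
    PAGE = 512

    def __init__(self, size):
        self.data = bytearray(size)  # unwritten ranges read back as zeros
        self.valid = set()           # page indices that currently hold data

    def put_page(self, start, end):
        self.data[start:end] = b"\x01" * (end - start)
        self.valid.update(range(start // self.PAGE, end // self.PAGE))

    def clear_page(self, start, end):
        self.data[start:end] = bytes(end - start)
        self.valid.difference_update(range(start // self.PAGE, end // self.PAGE))

    def get_page_range(self, start, end):
        """Coalesce consecutive valid pages into [start, end) ranges."""
        ranges, run = [], None
        for p in range(start // self.PAGE, end // self.PAGE):
            if p in self.valid:
                run = [p * self.PAGE, (p + 1) * self.PAGE] if run is None \
                    else [run[0], (p + 1) * self.PAGE]
            elif run:
                ranges.append(tuple(run))
                run = None
        if run:
            ranges.append(tuple(run))
        return ranges

    def get_blob(self, start, end):
        return bytes(self.data[start:end])

blob = PageBlobSim(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
assert blob.get_page_range(0, 4096) == [(0, 512), (1536, 2560)]
piece = blob.get_blob(1000, 2048)
assert piece[:536] == bytes(536)       # zeros, as the slide describes
assert piece[536:] == b"\x01" * 512    # the data stored in [1536, 2048)
```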
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
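Conceptually, an entity is a property bag addressed by (PartitionKey, RowKey): lookups on those keys are direct, while a filter on any other property must scan. A minimal sketch with a dict standing in for a table (the entities and properties are invented):

```python
table = {}  # (PartitionKey, RowKey) -> dict of properties

def insert(entity):
    table[(entity["PartitionKey"], entity["RowKey"])] = entity

insert({"PartitionKey": "HIV",  "RowKey": "run-001", "Hours": 12})
insert({"PartitionKey": "HIV",  "RowKey": "run-002", "Hours": 19})
insert({"PartitionKey": "HepC", "RowKey": "run-001", "Hours": 7})

# Keyed access: a direct lookup
found = table[("HIV", "run-002")]

# Non-key filter: a full scan -- this is what an unindexed query costs
long_runs = [e for e in table.values() if e["Hours"] > 10]
```

This is why the best-practices list below stresses that Azure tables only index on partition and row keys: any other predicate behaves like the list comprehension, not like the dict lookup.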
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
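The "include retry logic" advice can be packaged once and reused everywhere data is accessed. A minimal exponential-backoff wrapper; the delays, attempt count, and the flaky operation are illustrative, not prescriptions:

```python
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Run op(); on failure, sleep base_delay * 2**n and try again."""
    for n in range(attempts):
        try:
            return op()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** n)

calls = {"n": 0}
def flaky_read():
    """Stand-in for a storage call that fails transiently twice."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage failure")
    return "data"

result = with_retries(flaky_read)
```

Combined with logging inside the except branch, this also tells you where failures cluster when a large job goes wrong.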
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as “killer micros” and inexpensive clusters did before them, data centers are reshaping technical computing. They range in size from “edge” facilities to megascale.
Economies of scale: approximate costs for a small-sized data center (1000 servers) and a larger, 100K-server data center.

Technology       Cost in small-sized DC   Cost in large DC     Ratio
Network          $95 per Mbps/month       $13 per Mbps/month   7.1
Storage          $2.20 per GB/month       $0.40 per GB/month   5.7
Administration   ~140 servers/admin       >1000 servers/admin  7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity: building racks of servers & complex cooling systems all separately is not efficient. Package and deploy into bigger units, JITD.
How HPC systems and data centers (DC) compare, layer by layer:
o Node and system architectures
  - Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
  - DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local storage 2+ TB
Azure Platform: Compute and Storage, a closer look
[Diagram: HTTP requests pass through a load balancer to Web Role VMs (IIS hosting ASP.NET, WCF, etc.) and Worker Role VMs (a main() { … } loop); each VM runs an agent, and all are managed by the Fabric]
Using queues for reliable messaging; to scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
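The four steps map onto a classic producer/consumer loop. A local sketch using Python's queue module in place of the Azure queue (the real service adds durability and a visibility timeout, so a crashed worker's message reappears for another worker):

```python
import queue

work_queue = queue.Queue()
results = []

def web_role(requests):
    for r in requests:        # 1) receive work
        work_queue.put(r)     # 2) put work in queue

def worker_role():
    while True:
        try:
            item = work_queue.get_nowait()  # 3) get work from queue
        except queue.Empty:
            return
        results.append(item.upper())        # 4) do the work

web_role(["job-a", "job-b", "job-c"])
worker_role()
```

Because the two roles only share the queue, either side can be scaled out independently, which is exactly the decoupling the next slide describes.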
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Azure Storage: a closer look
[Diagram: applications reach Blobs, Drives, Tables, and Queues through a load balancer via the REST API over HTTP, on top of the Compute, Storage, and Fabric layers]
Points of interest
Storage types
  Blobs: simple interface for storing named files along with metadata for the file
  Drives: durable NTFS volumes
  Tables: entity-based storage; not relational, but entities which contain a set of properties
  Queues: reliable message-based communication
Access
  Data is exposed via .NET and RESTful interfaces
  Data can be accessed by Windows Azure apps and by other on-premises or cloud applications
Developer workflow
[Diagram: develop your app at work or home against the local Development Fabric and Development Storage, with versioned source control; verify the application works locally, then works in staging in the cloud]
What’s the ‘value add’?
Provide a platform that is scalable and available:
Services are always running; rolling upgrades/downgrades
Failure of any node is expected; state has to be replicated
Failure of a role (app code) is expected; automatic recovery
Services can grow to be large; provide state management that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Unit of software/configuration update; example: a set of nodes to update
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
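The "10 front-ends across 5 update domains" example amounts to round-robin assignment, so at most one domain's worth of instances is offline during a rolling update. A sketch (instance names are invented):

```python
def allocate(instances, domains):
    """Round-robin role instances across update domains."""
    assignment = {d: [] for d in range(domains)}
    for i in range(instances):
        assignment[i % domains].append(f"frontend-{i}")
    return assignment

a = allocate(10, 5)
# Updating one domain at a time takes down only 2 of the 10 front-ends
```

A real allocator also spreads the same instances across fault domains, so that one rack failure and one in-flight update can overlap without taking the service below its required capacity.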
Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
  - Failed roles automatically restarted
  - Node failure results in new nodes automatically allocated
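Step 6 ("maintain the desired number of roles") is a goal-state reconciliation loop: compare what is running against what was asked for, and start replacements for anything unhealthy. A toy version, with invented role records standing in for real instances:

```python
def reconcile(running, desired_count, next_id):
    """Drive the set of healthy role instances back to the goal state."""
    healthy = [r for r in running if r["healthy"]]    # drop failed roles
    while len(healthy) < desired_count:               # start replacements
        healthy.append({"id": next_id, "healthy": True})
        next_id += 1
    return healthy, next_id

running = [{"id": 0, "healthy": True},
           {"id": 1, "healthy": False},   # this instance has died
           {"id": 2, "healthy": True}]
running, next_id = reconcile(running, desired_count=3, next_id=3)
```

Running such a loop continuously is what lets the FC recover from both role crashes and node loss without operator intervention, as the next slide describes.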
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles:
  - FC detects if a role dies; a role can also indicate it is unhealthy
  - The current state of the node is updated appropriately
  - The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host:
  - If the node goes offline, FC will try to recover it
  - If a failed node can’t be recovered, FC migrates role instances to a new node: a suitable replacement location is found and existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations: always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture: stateless roles and durable queues
Windows Azure frees service developers from many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 79
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
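The example above, 10 front-ends across 5 update domains, amounts to round-robin placement. A minimal sketch of that allocation (the function and instance names are mine, not a Fabric Controller API):

```python
def allocate(instance_count, update_domains):
    """Round-robin role instances across update domains, as the FC does
    conceptually: instance i lands in domain i mod N."""
    placement = {d: [] for d in range(update_domains)}
    for i in range(instance_count):
        placement[i % update_domains].append(f"frontend-{i}")
    return placement

# 10 front-ends across 5 update domains -> 2 per domain, so rolling
# one domain at a time keeps 80% of capacity serving traffic.
p = allocate(10, 5)
print([len(v) for v in p.values()])  # [2, 2, 2, 2, 2]
```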
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
Load Balancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
(diagram: the AzureMODIS Service Web Role Portal drives the pipeline: a download queue feeds the data collection stage, then the reprojection, derivation reduction, and analysis reduction stages, producing research results)
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
– Requires a large number of test runs for a given job (1 – 10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (local sequence database → compressed)
2. Upload to Azure Storage (compressed → uploaded)
3. Deploy Worker Roles (BLAST executable deployed)
   - the Init() function downloads and decompresses data to the local disk
Step 2. Partitioning a Job
(diagram: the Web Role passes user input to a single partitioning Worker Role, which writes input partitions to Azure Storage and posts queue messages)
Step 3. Doing the Work
(diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions, and write BLAST output and logs back to Azure Storage)
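Steps 2–3 above hinge on splitting the user input into partitions, one queue message per partition. A hedged sketch of that chunking (the function name, message shape, and partition size are illustrative, not AzureBLAST's actual code):

```python
def partition_job(query_ids, partition_size):
    """Split a job's queries into partitions; each partition becomes
    one queue message for a BLAST-ready worker role to pick up."""
    partitions = [query_ids[i:i + partition_size]
                  for i in range(0, len(query_ids), partition_size)]
    # One queue message per partition, as in Step 2.
    return [{"partition": n, "queries": p} for n, p in enumerate(partitions)]

messages = partition_job(list(range(100)), partition_size=32)
print(len(messages))            # 4 partitions: 32 + 32 + 32 + 4
print(messages[-1]["queries"])  # [96, 97, 98, 99]
```

As the lessons below note, the choice of partition size has a large performance impact, so in practice it would be tuned per job.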
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little Cloud development headaches are probably worth it
Resources
Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
 8        0:26:00          2:33:23          2:00:14
 4        0:47:00          2:34:17          2:01:06
 2        1:27:00          2:31:39          1:59:13
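Parsing the timings above makes the trade explicit: wall-clock time drops steeply with worker count while computational time stays nearly flat, at the cost of slightly more total worker-minutes. The parsing code is mine, applied to the numbers on the slide:

```python
def to_minutes(hms):
    """Convert an h:mm:ss string to minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

# Workers -> clock duration, from the table above.
runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}
for workers, clock in runs.items():
    # Worker-minutes consumed: a rough proxy for cost at each scale.
    print(workers, round(workers * to_minutes(clock)))
# 25 workers: 300 worker-minutes; 2 workers: 174 -- scaling out
# finishes ~7x sooner for under 2x the aggregate resource time.
```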
Time-Space fungibility in the Cloud (chart: resources vs. time)
Utilizes a general jobs-based task manager which registers jobs and their resulting data
(diagram: a job definition fans out into tasks on an (HPC) cluster; a Registry Broker connects a local registry of highly sensitive data on the user premises (or internet) with the registry, web management, and data products in the Azure datacenters; results flow back to the user; an administrator oversees the cluster)
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then:
  - Make the best use of the capabilities of client and cloud computing
  - Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
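The account/container/blob hierarchy above composes directly into the blob URI. A sketch of the naming scheme (in practice the storage client library builds these addresses for you):

```python
def blob_url(account, container, blob):
    """Blob addressing: http://<account>.blob.core.windows.net/<container>/<blob>"""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

print(blob_url("jared", "images", "PIC01.JPG"))
# http://jared.blob.core.windows.net/images/PIC01.JPG
```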
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob
Account → Container → Blob → Block or Page
(diagram: as above, the jared account’s containers (images, movies) hold blobs; each blob is made up of Block or Page 1, 2, 3, …, with Block IDs 1 … N)
Uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
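The PutBlock / PutBlockList split above is a two-phase update: blocks land in an uncommitted set, and only PutBlockList makes a new readable version. A toy model of those semantics (this mimics the behavior described on the slide, not the storage service's implementation):

```python
class BlockBlob:
    """Toy model of block-blob semantics: readers only ever see the
    last committed block list; uncommitted blocks are invisible."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged by PutBlock
        self.committed = {}     # block_id -> bytes, from the last commit
        self.block_list = []    # ordered ids making up the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Each id may come from the uncommitted or the committed set.
        store = {**self.committed, **self.uncommitted}
        self.committed = {bid: store[bid] for bid in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

b = BlockBlob()
b.put_block("b1", b"hello ")
b.put_block("b2", b"world")
print(b.read())                 # b'' -- nothing committed yet
b.put_block_list(["b1", "b2"])
print(b.read())                 # b'hello world'
```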
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
Random Access Operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns:
  All 0 for the first 536 bytes
  Next 512 bytes are the data stored in [1536, 2048)
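The operation sequence above can be replayed with a toy page-range model; it reproduces the valid ranges [0,512) and [1536,2560) from the slide. This is a sketch of the semantics only, not the storage service:

```python
class PageBlob:
    """Toy page blob: tracks which 512-byte pages hold valid data."""
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.valid = set()          # offsets of valid pages

    def put_page(self, start, end):
        self.valid |= set(range(start, end, self.PAGE))

    def clear_page(self, start, end):
        self.valid -= set(range(start, end, self.PAGE))

    def get_page_ranges(self):
        """Coalesce valid pages into sorted [start, end) ranges."""
        ranges, run = [], None
        for off in sorted(self.valid):
            if run and off == run[1]:
                run = (run[0], off + self.PAGE)
            else:
                if run:
                    ranges.append(run)
                run = (off, off + self.PAGE)
        if run:
            ranges.append(run)
        return ranges

b = PageBlob(10 * 2**30)
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
print(b.get_page_ranges())  # [(0, 512), (1536, 2560)] -- as on the slide
```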
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
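The ETag checks mentioned above are optimistic concurrency: every write bumps a version tag, and a conditional write fails if the caller's tag is no longer current. An illustrative model (the class and the "412" message echo HTTP's If-Match behavior; they are not the storage client API):

```python
class ETagStore:
    """Toy optimistic-concurrency store: writes succeed only if the
    caller's etag matches the current one (If-Match semantics)."""
    def __init__(self):
        self.value, self.etag = None, 0

    def read(self):
        return self.value, self.etag

    def write(self, value, if_match):
        if if_match != self.etag:
            raise RuntimeError("412 Precondition Failed")  # stale etag
        self.value, self.etag = value, self.etag + 1

s = ETagStore()
_, tag = s.read()
s.write("v1", if_match=tag)       # succeeds, etag is now 1
try:
    s.write("v2", if_match=tag)   # stale tag: another writer got in first
except RuntimeError as e:
    print(e)                      # 412 Precondition Failed
```

Leases, used by page blobs, are the pessimistic alternative: one writer holds exclusive access for a period instead of detecting conflicts after the fact.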
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
(diagram: promoting a snapshot of MyBlob)
A Windows Azure Drive is a Page Blob formatted as an NTFS
single-volume Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
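The required properties above make (PartitionKey, RowKey) the effective primary key, with Timestamp maintained by the store. A minimal in-memory model (class and method names are mine, not the table service API):

```python
import time

class Table:
    """Toy Azure-table model: entities are property dicts keyed by
    (PartitionKey, RowKey); Timestamp is maintained by the store."""
    def __init__(self):
        self.entities = {}

    def insert(self, partition_key, row_key, **props):
        self.entities[(partition_key, row_key)] = {
            "PartitionKey": partition_key,
            "RowKey": row_key,
            "Timestamp": time.time(),
            **props,          # entities need not share a schema
        }

    def query_partition(self, partition_key):
        # Efficient in the real service: the partition is the unit
        # of indexing and load distribution across servers.
        return [e for (pk, _), e in self.entities.items() if pk == partition_key]

t = Table()
t.insert("jackson", "jared", role="speaker")
t.insert("jackson", "mary", role="guest")
print(len(t.query_partition("jackson")))  # 2
```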
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
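"Batch multiple small tasks into a single queue message" can be as simple as packing task IDs into one message payload, so one storage transaction covers many tasks. A sketch (the JSON message format is my choice, not a prescribed one):

```python
import json

def pack_tasks(task_ids, batch_size):
    """Group small tasks into batched queue messages: one storage
    transaction now covers up to batch_size tasks instead of one."""
    return [json.dumps(task_ids[i:i + batch_size])
            for i in range(0, len(task_ids), batch_size)]

def unpack_message(message):
    """Worker side: recover the task list from one message."""
    return json.loads(message)

msgs = pack_tasks(list(range(10)), batch_size=4)
print(len(msgs))                 # 3 messages instead of 10
print(unpack_message(msgs[0]))   # [0, 1, 2, 3]
```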
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
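The first point above, retry logic around every data access, usually means bounded retries with exponential backoff, since transient faults (timeouts, throttling) are expected at scale. A sketch; the helper name, delays, and exception type are placeholders:

```python
import time

def with_retries(operation, max_attempts=4, base_delay=0.01):
    """Call a storage operation, retrying transient faults with
    exponential backoff; re-raise after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

print(with_retries(flaky_read))  # data  (succeeds after two transient failures)
```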
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 80
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as “killer micros” and inexpensive clusters did before
Range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center:

Technology      Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network         $95 per Mbps/month                $13 per Mbps/month            7.1
Storage         $2.20 per GB/month                $0.40 per GB/month            5.7
Administration  ~140 servers/administrator        >1000 servers/administrator   7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex cooling systems all separately is not efficient.
Package and deploy into bigger units, JITD
Comparing HPC and data center (DC) systems along five dimensions:
o Node and system architectures – largely indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on each node
o Communication fabric
o Storage systems
  HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
  DC: TBs of local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly available Fabric Controller (FC)
At minimum: CPU 1.5–1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local storage 2+ TB
Azure Platform: Compute and Storage – a closer look
[Diagram: HTTP requests pass through a load balancer to Web Role instances (IIS running ASP.NET, WCF, etc.) and Worker Role instances (main() { … }); each instance runs in a VM with an agent, managed by the fabric.]
Using queues for reliable messaging
To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in a queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
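The four-step flow above can be sketched with Python’s in-memory queue standing in for an Azure queue – an illustrative toy, not the Azure SDK (the role functions and the squaring “work” are made up for the example):

```python
import queue

def web_role_receive(work_items, q):
    """Web role: receive work and put each item in the queue (steps 1-2)."""
    for item in work_items:
        q.put(item)

def worker_role(q, results):
    """Worker role: get work from the queue and do it (steps 3-4)."""
    while not q.empty():
        item = q.get()               # in Azure, the message becomes visible here
        results.append(item * item)  # the "work": square the number
        q.task_done()                # in Azure: delete the message after success

q = queue.Queue()
results = []
web_role_receive([1, 2, 3], q)
worker_role(q, results)
# results now holds the processed work: [1, 4, 9]
```

Because the queue decouples the two roles, either side can be scaled out independently, which is exactly the point of the slide.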
[Diagram: a closer look at Azure Storage – applications reach Blobs, Drives, Tables, and Queues over HTTP through a REST API behind the load balancer, alongside the Compute and Fabric services.]
Points of interest
Storage types
Blobs: a simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps and by other on-premises or cloud applications
[Diagram: the development workflow – develop your app at work or home against the Development Fabric and Development Storage, keep versions in local source control, verify the application works locally, then that it works in cloud staging.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update
Example: a set of nodes to update
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
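The update-domain example above (10 front-ends across 5 update domains) can be sketched as a round-robin assignment – an illustrative model of the idea, not how the FC actually allocates:

```python
def allocate(instance_count, update_domain_count):
    """Assign each instance to an update domain round-robin, so updating
    one domain takes down at most ceil(instance_count / update_domain_count)
    instances at a time."""
    return {i: i % update_domain_count for i in range(instance_count)}

# The slide's example: 10 front-ends across 5 update domains.
domains = allocate(10, 5)
# Count how many instances land in each domain: 2 per domain,
# so any single rolling update touches only 2 of the 10 front-ends.
per_domain = [list(domains.values()).count(d) for d in range(5)]
```

The same round-robin idea, applied over racks/power units instead of update batches, is what spreads a role across fault domains.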
Push-button Deployment
Step 1: Allocate nodes – across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back to the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change
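The goal-state behavior can be modeled as a reconciliation loop – a toy sketch of the idea, not the actual FC logic:

```python
def reconcile(desired_count, healthy_instances, free_nodes):
    """One pass of a goal-state loop: if there are fewer healthy
    instances than desired, start replacements on free nodes."""
    actions = []
    missing = desired_count - len(healthy_instances)
    for _ in range(missing):
        if not free_nodes:
            break                      # no capacity: stay degraded for now
        node = free_nodes.pop(0)
        healthy_instances.append(node)
        actions.append(f"start role on {node}")
    return actions

# Two of five instances have failed; one pass restores them on free nodes.
healthy = ["n1", "n2", "n3"]
actions = reconcile(5, healthy, free_nodes=["n7", "n8", "n9"])
```

Running this pass repeatedly against the observed state is what "drives us back to the goal state" whenever roles or nodes fail.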
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram: the AzureMODIS pipeline – a service web role portal feeds a download queue, and data flows through the data collection, reprojection, derivation reduction, and analysis reduction stages to produce research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Serves a small but important group of researchers: 100’s of HIV and HepC researchers actively use it, and 1000’s of research communities rely on its results
Cover of PLoS Biology, November 2008
Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
– Requires a large number of test runs for a given job (1–10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure store
3. Deploy worker roles – each role’s Init() function downloads and decompresses the data to its local disk, yielding BLAST-ready worker roles with the BLAST executable in place
Step 2. Partitioning a Job
[Diagram: the web role stores the user input in Azure Storage; a single partitioning worker role splits it into input partitions and posts a queue message for each.]
Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.]
• Always design with failure in mind
  – On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  – The optimal size may change depending on the scope of the job
• Test runs are your friend
  – Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  – When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  – Little cloud development headaches are probably worth it
AzureBLAST scaling runs:

Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
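From the timing data above, a quick back-of-the-envelope check of scaling efficiency (the durations are read off the table; the helper is just arithmetic):

```python
def minutes(hms):
    """Convert an H:MM:SS string to minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

# Wall-clock durations from the scaling table, by worker count.
clock = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}

# Speedup of the 25-worker run relative to the 2-worker run:
# 12.5x the workers give roughly 7.25x the speedup (~58% efficiency).
speedup = minutes(clock[2]) / minutes(clock[25])
efficiency = speedup / (25 / 2)
```

The sub-linear speedup is consistent with the lesson above that how the work is factored into partitions has a large performance impact.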
[Chart: time–space fungibility in the cloud – the same total work can run on more resources for a short time or fewer resources for a long time.]
Utilizes a general jobs-based task manager which registers jobs and their resulting data products.
[Diagram: a job definition fans out into tasks on an (HPC) cluster; a registry broker connects the user’s local registry and web management – on the user premises (or internet), where highly sensitive data stays – with the Azure datacenters, which return results.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose the geo-location that hosts the storage account: “US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example hierarchy – the account “jared” contains the containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI), addressable as:
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
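The blob URL scheme shown above composes mechanically from the account, container, and blob names (a trivial helper, shown for clarity):

```python
def blob_url(account, container, blob):
    """Compose the public URL for a blob, following the
    <account>.blob.core.windows.net/<container>/<blob> scheme."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

# The example from the slide: account "jared", container "images".
url = blob_url("jared", "images", "PIC01.JPG")
# → "http://jared.blob.core.windows.net/images/PIC01.JPG"
```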
Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible
Associate metadata with a container: up to 8 KB of metadata per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: uploading a 10 GB movie – the blob TheBlob.wmv in the account’s container is built from blocks Block Id 1 … Block Id N pushed into Windows Azure Storage.]

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
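The block-commit semantics can be modeled in a few lines – an in-memory toy, not the storage service: put_block stages uncommitted blocks, and put_block_list atomically defines the readable version of the blob:

```python
class BlockBlob:
    """Toy model of block-blob commit semantics."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged by put_block
        self.committed = {}     # block_id -> bytes of the readable version
        self.block_list = []    # ordered block ids of the readable version

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Blocks may come from the uncommitted or the committed list.
        self.committed = {b: self.uncommitted.get(b, self.committed.get(b))
                          for b in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[b] for b in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
blob.put_block_list(["b1", "b2"])   # only now does the blob become readable
```

This is why a partially failed upload never corrupts a block blob: readers only ever see the last committed block list.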
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space)
Random access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all zeros for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
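The worked example above can be reproduced with a small in-memory model of a page blob (a toy with 512-byte pages, not the storage service):

```python
PAGE = 512

class PageBlob:
    """Toy model of page-blob semantics: a sparse array of fixed-size
    pages; only pages that have been written are 'valid'."""
    def __init__(self, size):
        self.size = size
        self.pages = {}                       # page index -> 512 bytes

    def put_page(self, start, end, fill=b"x"):
        for p in range(start // PAGE, end // PAGE):
            self.pages[p] = fill * PAGE

    def clear_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.pages.pop(p, None)           # cleared pages read as zeros

    def get_page_range(self, start, end):
        """Return the valid byte ranges within [start, end)."""
        ranges, run = [], None
        for p in range(start // PAGE, end // PAGE):
            if p in self.pages:
                if run is None:
                    run = [p * PAGE, (p + 1) * PAGE]
                else:
                    run[1] = (p + 1) * PAGE
            elif run is not None:
                ranges.append(tuple(run))
                run = None
        if run is not None:
            ranges.append(tuple(run))
        return ranges

# Replay the slide's sequence of random-access operations.
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
ranges = blob.get_page_range(0, 4096)
# → [(0, 512), (1536, 2560)], exactly the slide's answer
```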
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version (e.g. of MyBlob) via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob interface
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
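The entity model can be sketched as a dictionary keyed by (PartitionKey, RowKey) – an in-memory stand-in, not the table service API; the entity names and property values are made up for the example:

```python
table = {}  # (PartitionKey, RowKey) -> dict of properties

def insert(partition_key, row_key, **properties):
    """Store an entity; PartitionKey + RowKey uniquely identify it."""
    table[(partition_key, row_key)] = properties

def point_query(partition_key, row_key):
    """The cheapest lookup: both keys given, a single-entity fetch."""
    return table[(partition_key, row_key)]

# Entities in the same partition are served together and can differ
# in which properties they carry (the store is not relational).
insert("barga", "post-001", title="Azure for Research", views=42)
entity = point_query("barga", "post-001")
```

Since these two keys are the only index, choosing them well (see the best practices below) determines which queries are cheap.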
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
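The retry advice can be as simple as a wrapper with exponential backoff – a generic sketch; real code would catch specific storage exceptions and tune the delays:

```python
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Call operation(); on failure, wait base_delay * 2^n and retry."""
    for n in range(attempts):
        try:
            return operation()
        except Exception:
            if n == attempts - 1:
                raise                     # out of retries: surface the error
            time.sleep(base_delay * 2 ** n)

# A flaky stand-in for a storage call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "ok"

result = with_retries(flaky)
```

Pairing this with idempotent workers (the first bullet under Design and Planning) makes a retried task safe even when it had partially run.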
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
persist and share data from the client in the cloud;
analyze data initially captured in client tools, such as Excel;
analysis as a service (think SQL, MapReduce, R/MATLAB);
data visualization generated in the cloud, displayed on the client;
provenance, collaboration, and other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 2
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
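The snapshot-and-promote behavior can be modeled as follows (a hypothetical sketch; the real service keeps snapshots as deltas against the base blob rather than full copies):

```python
class SnapshottableBlob:
    """Toy model: writes go to the base blob; snapshots are read-only versions."""

    def __init__(self, content=b""):
        self.content = content   # the base blob, which all writes target
        self.snapshots = []      # ordered read-only prior versions

    def snapshot(self):
        # The real service stores only deltas; a full copy keeps the model simple
        self.snapshots.append(self.content)
        return len(self.snapshots) - 1

    def write(self, content):
        self.content = content   # all writes apply to the base blob name

    def promote(self, snapshot_id):
        # Snapshot promotion: restore the base blob to a prior version
        self.content = self.snapshots[snapshot_id]
```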
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
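The entity model can be sketched as a dictionary keyed by (PartitionKey, RowKey). A toy illustration with hypothetical names, not the ADO.NET Data Services client: entities are property bags, the store maintains Timestamp, and lookups that supply both keys are the indexed access path:

```python
import time

class AzureTable:
    """Toy model of an Azure table: entities keyed by (PartitionKey, RowKey)."""

    def __init__(self):
        self.entities = {}  # (PartitionKey, RowKey) -> property dict

    def insert(self, partition_key, row_key, **properties):
        # Timestamp is a system-maintained required property
        properties.update(PartitionKey=partition_key, RowKey=row_key,
                          Timestamp=time.time())
        self.entities[(partition_key, row_key)] = properties

    def get(self, partition_key, row_key):
        # Point lookup on both keys: the cheap, indexed path
        return self.entities[(partition_key, row_key)]

    def scan_partition(self, partition_key):
        # Queries on other properties scan (here, within one partition)
        return [e for (p, _), e in self.entities.items() if p == partition_key]
```

This is also why the best practice below says to remember that tables index only on the partition and row keys: any other predicate is a scan.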
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
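The retry advice above can be sketched as a small helper with exponential backoff (a hypothetical sketch — the storage client library ships its own configurable retry policies, which are usually the better choice):

```python
import time

def with_retries(op, attempts=4, base_delay=0.1):
    """Call op(), retrying transient I/O failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
```

Wrapping every storage access this way matters because, at cloud scale, transient faults (throttling, node recycling, network blips) are routine rather than exceptional.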
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 4
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
- The Web Role takes the user input and hands it to a single partitioning Worker Role
- The partitioning Worker Role writes input partitions to Azure Storage and enqueues one queue message per partition

Step 3. Doing the Work
- BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage
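The partitioning step can be sketched in a few lines. This is an illustrative model, not the AzureBLAST source: the function names are hypothetical and a plain list stands in for an Azure queue.

```python
# Hypothetical sketch of Step 2: split the user's FASTA input into
# fixed-size partitions and enqueue one message per partition for the
# BLAST-ready worker roles to pick up.
def partition_fasta(text, seqs_per_partition):
    """Split FASTA text into partitions of at most seqs_per_partition sequences."""
    sequences, current = [], []
    for line in text.strip().splitlines():
        if line.startswith(">"):          # a header line starts a new sequence
            if current:
                sequences.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        sequences.append("\n".join(current))
    return ["\n".join(sequences[i:i + seqs_per_partition])
            for i in range(0, len(sequences), seqs_per_partition)]

def enqueue_partitions(partitions, queue):
    """One queue message per partition; workers dequeue and run BLAST on it."""
    for i, part in enumerate(partitions):
        queue.append({"partition_id": i, "payload": part})
```

The partition size matters, as the lessons below note: too few partitions wastes workers, too many inflates queue and storage transaction overhead.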
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it
Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
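A quick way to read the table is to convert the durations and compare workers times wall-clock time across configurations. This sketch assumes “Clock duration” is the wall-clock time per run (an assumption about the table, not stated in the slides); the helper names are illustrative.

```python
# Convert the table's h:mm:ss durations and estimate core-minutes
# consumed at each worker count (workers x wall-clock time).
def to_minutes(hms):
    """Convert an h:mm:ss string to minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

# Clock durations from the table above, keyed by worker count.
runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}

def core_minutes(workers):
    """Approximate total core-minutes consumed at a given worker count."""
    return workers * to_minutes(runs[workers])
```

Under that assumption, 25 workers consume roughly 300 core-minutes against 174 for 2 workers: the 7x speedup in wall-clock time is bought with extra aggregate resource time, which is the trade-off the chart below illustrates.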
[Chart: time-space fungibility in the cloud; the same work can be done with more resources over less time, or fewer resources over more time.]
Azure Ocean utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Architecture diagram: a job definition is broken into tasks and tracked in a registry; a registry broker connects the user premises (or internet), where the user keeps highly sensitive data, a local registry, web management, and an (HPC) cluster administrator, with the Azure datacenters, which return results and data products.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
- Applications using peripheral devices
- Applications with heavy graphics requirements
- Legacy user interfaces that would be difficult to port
• Our goal then: make best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Namespace: Account -> Container -> Blob
Example: the account jared contains the containers images (PIC01.JPG, PIC02.JPG) and movies (MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible
Associate metadata with a container: up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same account/container/blob hierarchy, showing that each blob is composed of blocks or pages (Block Id 1 … Block Id N).]
Uploading a 10 GB movie as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

After the PutBlockList commit, TheBlob.wmv is readable from Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
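The PutBlock/PutBlockList protocol can be modeled in memory to make the commit semantics concrete. This is an illustrative sketch of the semantics described above, not the storage service or its client library; the class and method names are hypothetical.

```python
# Toy model of block-blob semantics: PutBlock stages uncommitted blocks,
# PutBlockList atomically commits an ordered list of them, and only
# committed blocks are readable.
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged but not yet readable
        self.committed = []     # ordered list of (block_id, bytes)

    def put_block(self, block_id, data):
        assert len(data) <= 4 * 1024 * 1024, "blocks are limited to 4 MB"
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        committed_map = dict(self.committed)
        # a commit may reuse blocks from either the uncommitted or committed list
        self.committed = [(bid, self.uncommitted.get(bid, committed_map.get(bid)))
                          for bid in block_ids]
        self.uncommitted = {}

    def read(self):
        """The readable blob is the concatenation of the committed block list."""
        return b"".join(data for _, data in self.committed)
```

The key property the model shows: uploads can proceed in parallel and in any order, because nothing becomes visible until the single PutBlockList commit names the blocks and their order.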
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space).

Random access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
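A toy model makes the page operations above concrete. This is a sketch of the semantics (a sparse page map), not Azure's implementation; replaying the example's four operations reproduces the GetPageRange and GetBlob results.

```python
# Sketch of page-blob semantics over a sparse dict of 512-byte pages.
PAGE = 512

class PageBlob:
    def __init__(self, size):
        self.size = size
        self.pages = {}                      # page offset -> 512 bytes

    def put_page(self, start, end, data):
        for i, off in enumerate(range(start, end, PAGE)):
            self.pages[off] = data[i * PAGE:(i + 1) * PAGE]

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)        # cleared pages read back as zeros

    def get_page_ranges(self, start, end):
        """Return the contiguous ranges that hold valid data."""
        ranges, run = [], None
        for off in range(start, end, PAGE):
            if off in self.pages:
                run = [off, off + PAGE] if run is None else [run[0], off + PAGE]
            elif run:
                ranges.append(tuple(run))
                run = None
        if run:
            ranges.append(tuple(run))
        return ranges

    def get_blob(self, start, end):
        """Read a byte range; unset pages are returned as zeros."""
        out = bytearray()
        for off in range(start - start % PAGE, end, PAGE):
            out += self.pages.get(off, b"\x00" * PAGE)
        return bytes(out[start % PAGE:start % PAGE + (end - start)])
```

Running the example's operation sequence against this model yields valid ranges [0, 512) and [1536, 2560), and a GetBlob [1000, 2048) of 536 zero bytes followed by the 512 data bytes from [1536, 2048), matching the slide.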
Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks
Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
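The required-property scheme can be illustrated with a toy table keyed on (PartitionKey, RowKey). This sketch models the semantics only; it is not the Azure Table service API, and the method names are hypothetical.

```python
import time

# Toy model of an Azure table: entities are property bags, and the only
# index is the (PartitionKey, RowKey) pair, so point lookups are cheap
# and everything else is a scan.
class Table:
    def __init__(self):
        self.entities = {}   # (partition_key, row_key) -> properties dict

    def insert(self, partition_key, row_key, **properties):
        properties.update(PartitionKey=partition_key, RowKey=row_key,
                          Timestamp=time.time())     # the three required properties
        self.entities[(partition_key, row_key)] = properties

    def get(self, partition_key, row_key):
        """Point lookup on the only indexed key pair."""
        return self.entities[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # A scan limited to one partition: still the efficient access path,
        # since partitions are the unit of locality and load balancing.
        return [e for (pk, _), e in self.entities.items() if pk == partition_key]
```

This is also why the best practices below warn that tables only index on partition and row keys: a filter on any other property degenerates to the scan in query_partition, or worse, a full-table scan.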
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
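The first testing guideline, including retry logic everywhere data is accessed, can be sketched as a small helper. The names and backoff constants here are illustrative, not from any Azure SDK.

```python
import time

# Retry a transient-failure-prone operation with exponential backoff.
def with_retries(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)   # back off: 0.1s, 0.2s, 0.4s, ...
```

Passing `sleep` in as a parameter keeps the helper testable without real delays; in a worker role you would wrap each storage call, e.g. `with_retries(lambda: queue.get_message())`.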
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research:
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center.
Technology      Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network         $95 per Mbps/month                $13 per Mbps/month            7.1
Storage         $2.20 per GB/month                $0.40 per GB/month            5.7
Administration  ~140 servers/Administrator        >1000 Servers/Administrator   7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
The same five dimensions recur when comparing HPC systems and data centers (DC):
o Node and system architectures
- Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
- HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
- DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
- HPC: periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
- DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform: Compute and Storage
A closer look at Compute:
[Diagram: HTTP requests pass through a Load Balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (a main() { … } entry point); each role instance runs in a VM alongside an Agent, all managed by the Fabric.]
Using queues for reliable messaging (to scale, add more of either):
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
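The four numbered steps above can be sketched with a plain in-process queue standing in for an Azure queue. Visibility timeouts, REST calls, and durable storage are omitted; all names are illustrative.

```python
from collections import deque

queue = deque()   # stands in for an Azure queue

def web_role_receive(work_item):
    """Steps 1 and 2: the web role receives work and puts it in the queue."""
    queue.append(work_item)

def do_work(item):
    """Step 4: the actual processing (trivially, uppercasing here)."""
    return item.upper()

def worker_role_poll():
    """Step 3: a worker role pulls one message and does the work."""
    if not queue:
        return None
    item = queue.popleft()
    return do_work(item)
```

Because web and worker roles touch only the queue, either side can be scaled independently, which is the decoupling the bullets above describe. A real Azure queue adds one crucial twist this sketch lacks: a dequeued message becomes invisible rather than deleted, and reappears if the worker dies before confirming completion.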
A closer look at Storage:
[Diagram: applications reach Azure Storage (Blobs, Drives, Tables, Queues) through a Load Balancer via the REST API over HTTP, alongside the Compute and Fabric layers.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational, entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app at work or home against the local Development Fabric and Development Storage, using source control and versioning; once the application works locally, move it to staging and then to the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
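The allocation rule in the example (10 front-ends across 5 update domains) can be sketched as a round-robin assignment. This is an illustrative model of the idea, not the Fabric Controller's actual algorithm.

```python
# Spread N instances of a role round-robin across update domains (the
# same idea applies to fault domains), so that updating or losing one
# domain takes down only a slice of the role.
def allocate(instances, update_domains):
    assignment = {d: [] for d in range(update_domains)}
    for i in range(instances):
        assignment[i % update_domains].append(i)
    return assignment
```

With 10 instances and 5 domains, each domain holds 2 instances, so a rolling update that walks the domains one at a time never removes more than 20% of the role's capacity.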
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Slide 6
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
A closer look
[diagram] Compute roles (applications on the Fabric) reach the storage services – Blobs, Drives, Tables, Queues – through a load balancer, via a REST API over HTTP.
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Develop at work or home against the local Development Fabric and Development Storage, with versions kept in source control. Once the application works locally, verify it works in staging, then run it in the cloud.
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
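The “10 front-ends, across 5 update domains” example can be sketched as a simple round-robin placement (illustrative only; the real Fabric Controller allocator is more sophisticated):

```python
# Illustrative round-robin allocation of role instances across update
# domains, matching the "10 front-ends, across 5 update domains" example.
from collections import Counter

def allocate(instances, update_domains):
    # instance i lands in update domain i mod update_domains
    return {i: i % update_domains for i in range(instances)}

placement = allocate(10, 5)
# each update domain gets 10/5 = 2 instances, so rolling an update one
# domain at a time never takes down more than 2 front-ends at once
print(Counter(placement.values()))   # Counter({0: 2, 1: 2, 2: 2, 3: 2, 4: 2})
```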
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[diagram] AzureMODIS pipeline: the AzureMODIS Service Web Role portal drives a download queue; data flows through the Data Collection Stage, the Reprojection Stage, the Derivation Reduction Stage, and the Analysis Reduction Stage to produce research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  – 100’s of HIV and HepC researchers actively use it
  – 1000’s of research communities rely on results
• Cover of PLoS Biology, November 2008
• Typical job, 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  – Requires a large number of test runs for a given job (1–10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure Store
3. Deploy Worker Roles – the Init() function downloads and decompresses the data and the BLAST executable to the local disk
Step 2. Partitioning a Job
The Web Role takes the user input; a single partitioning Worker Role writes input partitions to Azure Storage and posts a queue message for each partition.
Step 3. Doing the Work
BLAST-ready Worker Roles pick up queue messages, process their input partitions, and write BLAST output and logs back to Azure Storage.
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little Cloud development headaches are probably worth it
Resources vs. workers for a fixed job:

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
[charts: Resources vs. Time] Time-Space fungibility in the Cloud – the same total computation can be bought as many workers for a short time or as few workers for a long time.
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[diagram] A job definition fans out into tasks; jobs and their data products are recorded in a registry. A Registry Broker links the local registry on the user premises (or internet) – an (HPC) cluster with its administrator, where highly sensitive data stays – to the Azure datacenters; users submit work and collect results through web management.
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[diagram] Storage namespace: Account (jared) → Containers (images, movies) → Blobs (PIC01.JPG, PIC02.JPG; MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
  Private or publicly accessible
Associate metadata with a container
  Metadata can be up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[diagram] Account (jared) → Containers (images, movies) → Blobs (PIC01.JPG, PIC02.JPG; MOV1.AVI) → Blocks or Pages (Block Id 1, Block Id 2, Block Id 3, … Block Id N)
Example: uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
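The PutBlock / PutBlockList semantics above can be sketched with a toy in-memory model (illustrative, not the real Azure API): blocks stay uncommitted and unreadable until a block list commits them.

```python
# Illustrative sketch of block blob semantics: blocks are uploaded
# uncommitted, then PutBlockList makes a chosen sequence the readable
# version of the blob.
class BlockBlobSim:
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes
        self.committed = {}     # block id -> bytes
        self.block_list = []    # ordered ids forming the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, ids):
        # each id may come from the uncommitted or the committed store
        store = {**self.committed, **self.uncommitted}
        self.committed = {i: store[i] for i in ids}
        self.block_list = list(ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[i] for i in self.block_list)

blob = BlockBlobSim()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""          # nothing readable before commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"
```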
Create MyBlob
Specify Blob Size = 10 GBytes (a 10 GB address space)
Fixed Page Size = 512 bytes
Random Access Operations
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns:
  All 0 for first 536 bytes
  Next 512 bytes are data stored in [1536,2048)
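A toy simulation of this page-range bookkeeping (illustrative, not the Azure API); run against the exact sequence of operations above it reproduces the GetPageRange result:

```python
# Illustrative sketch: simulate page blob valid-range semantics with
# 512-byte pages tracked in a dict; absent pages read as zeros.
PAGE = 512

class PageBlobSim:
    def __init__(self, size):
        self.size = size
        self.data = {}          # page offset -> bytes written

    def put_page(self, start, end):
        for off in range(start, end, PAGE):
            self.data[off] = b'x' * PAGE   # stand-in payload

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.data.pop(off, None)

    def get_page_range(self, start, end):
        """Return maximal contiguous valid ranges within [start, end)."""
        ranges = []
        off = start
        while off < end:
            if off in self.data:
                run = off
                while run < end and run in self.data:
                    run += PAGE
                ranges.append((off, run))
                off = run
            else:
                off += PAGE
        return ranges

blob = PageBlobSim(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_range(0, 4096))   # [(0, 512), (1536, 2560)]
```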
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
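The ETag check for block blobs is classic optimistic concurrency: a write succeeds only if the caller still holds the blob’s current ETag. A minimal sketch (the method name and the 412 message are illustrative, not Azure API names):

```python
# Illustrative sketch of ETag-style optimistic concurrency: a write is
# accepted only if the caller's ETag matches the blob's current ETag.
import itertools

_etags = itertools.count(1)

class Blob:
    def __init__(self, data=b""):
        self.data, self.etag = data, next(_etags)

    def write_if_match(self, data, etag):
        if etag != self.etag:
            raise RuntimeError("412 precondition failed")  # lost the race
        self.data, self.etag = data, next(_etags)          # etag advances

b = Blob(b"v1")
tag = b.etag
b.write_if_match(b"v2", tag)        # succeeds, etag advances
try:
    b.write_if_match(b"v3", tag)    # stale etag -> rejected
except RuntimeError as e:
    print(e)                        # 412 precondition failed
```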
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
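A sketch of the entity model (illustrative, not the ADO.NET Data Services API): entities are property bags keyed by (PartitionKey, RowKey), with a server-maintained Timestamp.

```python
# Illustrative sketch of Azure Table entities: each entity is a set of
# properties and must carry PartitionKey and RowKey; Timestamp is
# maintained by the store. (PartitionKey, RowKey) identifies the entity.
from datetime import datetime, timezone

table = {}   # (PartitionKey, RowKey) -> entity

def insert(entity):
    for required in ("PartitionKey", "RowKey"):
        if required not in entity:
            raise ValueError(f"missing {required}")
    entity["Timestamp"] = datetime.now(timezone.utc)
    table[(entity["PartitionKey"], entity["RowKey"])] = entity

insert({"PartitionKey": "HIV", "RowKey": "run-42", "Score": 0.93})
print(table[("HIV", "run-42")]["Score"])   # 0.93
```

Since the store indexes only on these two keys, choosing the partition key well (as the best practices below note) is what makes lookups cheap.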
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
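The “include retry logic in all instances where you are accessing data” guideline can be sketched as a small helper with exponential backoff (the transient exception type, attempt count, and delays are assumptions for the sketch):

```python
# Illustrative retry helper with exponential backoff for transient
# data-access failures.
import time

def with_retries(op, attempts=4, base_delay=0.01):
    for i in range(attempts):
        try:
            return op()
        except ConnectionError:
            if i == attempts - 1:
                raise                       # out of attempts: surface it
            time.sleep(base_delay * 2**i)   # back off: d, 2d, 4d, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky))   # ok (after two transient failures)
```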
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized data center (1,000 servers) and a larger, 100K-server center:

Technology       Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network          $95 per Mbps/month                $13 per Mbps/month            7.1
Storage          $2.20 per GB/month                $0.40 per GB/month            5.7
Administration   ~140 servers/Administrator        >1000 servers/Administrator   7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing [10 minutes]
Introduction to Windows Azure [35 minutes]
Research Applications on Azure, demos [10 minutes]
How They Were Built [15 minutes]
A Closer Look at Azure [15 minutes]
Cloud Research Engagement Initiative [5 minutes]
Q&A [ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology       Cost in small-sized data center   Cost in large data center   Ratio
Network          $95 per Mbps/month                $13 per Mbps/month          7.1
Storage          $2.20 per GB/month                $0.40 per GB/month          5.7
Administration   ~140 servers/administrator        >1000 servers/administrator 7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  – Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or
    Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
  – HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  – DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  – HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
    approaching zero, checkpoint frequency increasing, I/O demand intolerable
  – DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform: Compute and Storage, a closer look
[Diagram: HTTP requests arrive at a load balancer, which routes them to Web Role
instances (IIS hosting ASP.NET, WCF, etc.) and on to Worker Role instances (a
main() { … } loop); each role runs in a VM with an agent, coordinated by the Fabric.]
Using queues for reliable messaging
To scale, add more of either role:
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
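The queue pattern above can be sketched in a few lines. This is an illustrative Python toy, not the Azure Queue API: a message handed to a worker stays invisible for a visibility timeout and reappears if the worker fails before deleting it, which is how worker-role faults are masked.

```python
import time
from collections import deque

class ToyQueue:
    """Toy queue with Azure-style visibility timeouts (illustrative only)."""
    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self._messages = deque()              # each entry: [body, invisible_until]

    def put(self, body):
        self._messages.append([body, 0.0])    # visible immediately

    def get(self, now=None):
        now = time.time() if now is None else now
        for msg in self._messages:
            if msg[1] <= now:                 # visible: lease it to the caller
                msg[1] = now + self.visibility_timeout
                return msg[0]
        return None

    def delete(self, body):
        for msg in list(self._messages):
            if msg[0] == body:                # worker finished: remove for good
                self._messages.remove(msg)
                return
```

A worker that crashes after `get` but before `delete` loses its lease, and the message is redelivered to another worker, so the queue, not the worker, is the durable record of outstanding work.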
A closer look at storage
[Diagram: applications reach Windows Azure Storage (Blobs, Drives, Tables, Queues)
over HTTP through a load balancer and a REST API, alongside the Compute fabric.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage (not relational; entities contain a set of properties)
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: development workflow. Develop your app at work or at home against the
local Development Fabric and Development Storage; keep versions in source control.
The application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
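The allocation rule can be sketched as simple round-robin placement. This is an illustrative Python sketch, not the Fabric Controller's actual algorithm, and the fault-domain count of 2 is an assumption; the slide's own example is 10 front-ends across 5 update domains.

```python
from collections import Counter

def allocate(instances, fault_domains, update_domains):
    """Spread role instances round-robin across fault domains (separate racks,
    power, switches) and update domains (units of rolling upgrade)."""
    return [(i % fault_domains, i % update_domains) for i in range(instances)]

# Slide example: 10 front-ends across 5 update domains (2 fault domains assumed).
placement = allocate(10, fault_domains=2, update_domains=5)
```

With this placement a rack failure takes out at most half the instances, and a rolling update touches only 2 of the 10 front-ends at a time.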
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: AzureMODIS pipeline. An AzureMODIS Service Web Role portal drives a
download queue through a Data Collection stage, then Reprojection, Derivation
Reduction and Analysis Reduction stages, producing research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  – 100’s of HIV and HepC researchers actively use it
  – 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
Typical job, 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  – Requires a large number of test runs for a given job (1 – 10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (local sequence database, BLAST executable)
2. Upload to Azure Store
3. Deploy Worker Roles; the Init() function downloads and decompresses the data
   to the local disk
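The Init() step can be sketched as follows. This is an illustrative Python stand-in, not AzureBLAST's actual code: the blob download is represented by bytes passed in, and the gzip format and the file name `sequence.db` are assumptions for the sketch.

```python
import gzip
import os

def init_worker(blob_bytes: bytes, local_dir: str) -> str:
    """Sketch of a worker's Init(): take the compressed reference data that was
    staged in blob storage (here handed in as bytes) and decompress it onto the
    node's local disk before any BLAST work runs."""
    os.makedirs(local_dir, exist_ok=True)
    target = os.path.join(local_dir, "sequence.db")   # hypothetical file name
    with open(target, "wb") as out:
        out.write(gzip.decompress(blob_bytes))
    return target
```

Doing the decompression once per node, rather than per task, is the point of the staging step: every subsequent queue message finds the database already on local disk.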
Step 2. Partitioning a Job
[Diagram: a Web Role accepts the user input; a single partitioning Worker Role
splits it into input partitions in Azure Storage and posts a queue message per
partition.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input
partition from Azure Storage, and write BLAST output and logs back to Azure
Storage.]
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - Little Cloud development headaches are probably worth it
Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
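A quick calculation makes the table's scaling story concrete: taking the 2-worker run as baseline, wall-clock speedup can be computed from the clock durations. This is just arithmetic on the numbers above, sketched in Python.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string from the table into seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

# Clock duration per worker count, taken from the scaling table above.
clock = {2: "1:27:00", 4: "0:47:00", 8: "0:26:00", 16: "0:15:00", 25: "0:12:00"}

base = to_seconds(clock[2])
speedup = {w: round(base / to_seconds(t), 2) for w, t in clock.items()}
```

Relative to 2 workers, 25 workers give a 7.25x wall-clock speedup where perfect scaling would give 12.5x; meanwhile the total run time stays roughly flat, which is the time-space trade the next chart illustrates.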
[Chart: time-space fungibility in the cloud. Roughly the same total resource
consumption can be spent over much less wall-clock time by adding workers.]
Utilizes a general jobs-based task manager which registers jobs and their
resulting data.
[Diagram: a job definition fans out into tasks that produce data products,
tracked in a registry. A Registry Broker spans the user premises (or internet),
where the user keeps a local registry, web management, results and highly
sensitive data under an administrator, and the Azure datacenters, which host the
(HPC) cluster and its registry.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
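The 256-bit secret key is what authenticates every REST request against the account. The sketch below shows the general shared-key idea, an HMAC-SHA256 signature over a canonicalized request string; the string-to-sign shown is illustrative, and the real Azure canonicalization rules are more involved than this.

```python
import base64
import hashlib
import hmac

def sign_request(secret_key_b64: str, string_to_sign: str) -> str:
    """Illustrative shared-key signing: HMAC-SHA256 of a canonicalized request
    string under the account's base64-encoded secret key. Not the exact Azure
    canonicalization, just the shape of the scheme."""
    key = base64.b64decode(secret_key_b64)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")
```

The resulting signature travels in the request's Authorization header, so possession of the account key, never the key itself, is what each request proves.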
Account: jared
  Container: images, with blobs PIC01.JPG and PIC02.JPG
  Container: movies, with blob MOV1.AVI
Example URL: http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the account "jared", each blob in a container (PIC01.JPG,
PIC02.JPG, MOV1.AVI) consists of blocks or pages; a 10 GB movie, for example,
is stored as Block Id 1 through Block Id N.]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
[Diagram: the blocks are uploaded to Windows Azure Storage and assembled into
TheBlob.wmv.]
Blocks can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
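These semantics, stage uncommitted blocks, then commit an ordered list atomically, can be simulated in a few lines. This is an illustrative Python model, not the storage service's implementation; the class and method names mirror the operations above.

```python
class BlockBlobModel:
    """Toy model of block-blob semantics: PutBlock stages uncommitted blocks;
    PutBlockList atomically commits a chosen sequence as the readable blob."""
    def __init__(self):
        self.uncommitted = {}   # block id -> staged bytes
        self.committed = {}     # block id -> bytes in the committed version
        self.block_list = []    # committed order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # A commit may mix committed and uncommitted blocks; staged bytes win.
        store = {**self.committed, **self.uncommitted}
        self.committed = {bid: store[bid] for bid in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)
```

Readers never see a half-uploaded blob: until PutBlockList runs, the staged blocks are invisible, and each commit replaces the readable version in one step.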
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes
are the data stored in [1536,2048)
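The page-blob example above can be replayed with a small model that tracks which 512-byte pages hold data. This is an illustrative Python sketch, not the storage service itself; it reproduces the valid ranges the slide derives.

```python
class PageBlobModel:
    """Toy model of page-blob semantics: 512-byte pages written or cleared in
    place; get_page_range reports the byte ranges that currently hold data."""
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.valid = set()   # page numbers that currently hold data

    def put_page(self, start, end):
        self.valid |= set(range(start // self.PAGE, end // self.PAGE))

    def clear_page(self, start, end):
        self.valid -= set(range(start // self.PAGE, end // self.PAGE))

    def get_page_range(self):
        # Coalesce consecutive valid pages into [start, end) byte ranges.
        ranges, run = [], None
        for p in sorted(self.valid):
            if run and p == run[1]:
                run[1] = p + 1
            else:
                run = [p, p + 1]
                ranges.append(run)
        return [(a * self.PAGE, b * self.PAGE) for a, b in ranges]
```

Replaying the four operations from the slide yields exactly the ranges it lists, [0,512) and [1536,2560): updates are immediate and in place, unlike the staged commits of block blobs.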
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
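The entity model can be sketched as a property bag addressed by its two key properties. This is an illustrative Python sketch, not an Azure SDK; the helper names and sample entities are hypothetical. The point it shows is that a (PartitionKey, RowKey) lookup is the indexed, efficient access path.

```python
# Toy table: entities are dicts of properties, addressed by the pair
# (PartitionKey, RowKey), which is the only index Azure tables maintain.
table = {}

def insert(entity):
    """Store an entity under its (PartitionKey, RowKey) pair."""
    table[(entity["PartitionKey"], entity["RowKey"])] = entity

def point_query(partition_key, row_key):
    """The fast path: direct lookup on both keys. Anything else is a scan."""
    return table.get((partition_key, row_key))

# Hypothetical sample entities for a PhyloD-style job log.
insert({"PartitionKey": "HIV", "RowKey": "job-0001", "CpuHours": 18})
insert({"PartitionKey": "HIV", "RowKey": "job-0002", "CpuHours": 12})
```

This is why the best-practice list below says to remember that tables only index on partition and row keys: queries on any other property cannot use this lookup and degrade to scans.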
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
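The retry advice above can be sketched as a small wrapper. This is an illustrative Python helper, not an Azure SDK facility: exponential backoff with jitter around any data-access call, re-raising once the attempts are exhausted.

```python
import random
import time

def with_retries(operation, attempts=5, base_delay=0.5):
    """Retry a storage call with exponential backoff plus jitter, as the
    best-practice list above suggests for all data access."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            # Back off 2^attempt, scaled by jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

Catching bare Exception is deliberate only in this sketch; in a real worker you would retry transient storage errors and let programming errors propagate immediately.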
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 9
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify blob size = 10 GB, fixed page size = 512 bytes (a 10 GB address space)
Random access operations, in order:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes, then the 512 bytes
of data stored in [1536, 2048)
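The valid-range bookkeeping in this example can be reproduced with simple interval arithmetic. A sketch: the `PageBlob` class is a hypothetical stand-in for the service's behavior, not the storage API:

```python
class PageBlob:
    """Toy model of page-blob valid-range tracking."""
    def __init__(self, size):
        self.size = size
        self.valid = []  # sorted, disjoint [start, end) ranges

    def _apply(self, start, end, write):
        # Punch out the affected span from existing ranges, then re-add
        # it if this is a write (PutPage) rather than a ClearPage.
        new = []
        for s, e in self.valid:
            if e <= start or s >= end:
                new.append((s, e))          # untouched range
            else:
                if s < start: new.append((s, start))   # left remainder
                if e > end:   new.append((end, e))     # right remainder
        if write:
            new.append((start, end))
        new.sort()
        merged = []                          # merge touching/overlapping ranges
        for s, e in new:
            if merged and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        self.valid = merged

    def put_page(self, start, end):   self._apply(start, end, True)
    def clear_page(self, start, end): self._apply(start, end, False)
    def get_page_range(self, start, end):
        return [(max(s, start), min(e, end)) for s, e in self.valid
                if s < end and e > start]

# Replaying the operations from the example above:
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
assert blob.get_page_range(0, 4096) == [(0, 512), (1536, 2560)]
```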
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
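The ETag check on block blobs is classic optimistic concurrency: read a version tag, and let a write succeed only if the tag is unchanged. A minimal sketch (a toy class with illustrative names, not the storage API):

```python
class ETagBlob:
    """Toy model of optimistic concurrency via ETag checks: a
    conditional write succeeds only if the caller's ETag still
    matches the blob's current ETag."""
    def __init__(self):
        self.data, self.etag = b"", 0

    def read(self):
        return self.data, self.etag

    def conditional_write(self, data, if_match):
        if if_match != self.etag:
            return False          # someone else updated the blob first
        self.data, self.etag = data, self.etag + 1
        return True

b = ETagBlob()
_, tag = b.read()                                # two writers read etag 0
assert b.conditional_write(b"first", tag)        # first writer wins
assert not b.conditional_write(b"second", tag)   # stale ETag is rejected
assert b.read() == (b"first", 1)
```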
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
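A toy model of these snapshot semantics: writes go to the base blob, snapshots capture read-only versions, and promotion restores one of them. The class and method names are illustrative, not the real API:

```python
class SnapshottedBlob:
    """Toy model: all writes apply to the base blob; snapshots are
    read-only versions; promoting a snapshot restores the base."""
    def __init__(self, data=b""):
        self.base = data
        self.snapshots = []

    def write(self, data):
        self.base = data

    def snapshot(self):
        self.snapshots.append(self.base)   # capture current version
        return len(self.snapshots) - 1     # snapshot id

    def promote(self, snap_id):
        self.base = self.snapshots[snap_id]

b = SnapshottedBlob(b"v1")
s0 = b.snapshot()
b.write(b"v2")        # write applies to the base blob
b.promote(s0)         # restore the prior version
assert b.base == b"v1"
```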
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume
Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
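A sketch of what this entity model implies for queries: a lookup by (PartitionKey, RowKey) is a direct hit, a partition scan is bounded, and anything else is a full scan. This is a toy in-memory model, not the table service:

```python
# Toy model: entities addressed by (PartitionKey, RowKey).
table = {}

def insert(pk, rk, **props):
    table[(pk, rk)] = {"PartitionKey": pk, "RowKey": rk, **props}

def point_query(pk, rk):
    # Fast path: direct lookup by the two indexed keys.
    return table.get((pk, rk))

def partition_scan(pk):
    # Bounded: only entities within one partition.
    return [e for (p, _), e in table.items() if p == pk]

insert("movies", "MOV1.AVI", size_mb=700)
insert("images", "PIC01.JPG", size_mb=2)
assert point_query("movies", "MOV1.AVI")["size_mb"] == 700
assert len(partition_scan("images")) == 1
```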
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
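The batching advice can be as simple as packing several task descriptions into one serialized queue message. A sketch, assuming JSON-serializable tasks (the `batch_tasks` helper is hypothetical):

```python
import json

def batch_tasks(tasks, batch_size=10):
    """Pack many small tasks into fewer queue messages, cutting the
    number of per-message storage transactions."""
    return [json.dumps(tasks[i:i + batch_size])
            for i in range(0, len(tasks), batch_size)]

msgs = batch_tasks([{"id": i} for i in range(25)])
assert len(msgs) == 3                                  # 10 + 10 + 5 tasks
assert [t["id"] for t in json.loads(msgs[0])] == list(range(10))
```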
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
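The retry advice usually means wrapping every storage call in retry-with-backoff logic. A minimal sketch; the `with_retries` helper and the use of `IOError` to model transient faults are illustrative assumptions:

```python
import time, random

def with_retries(op, attempts=4, base_delay=0.01):
    """Retry a transient failure with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise                      # out of attempts, surface the error
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Simulated flaky storage call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "ok"

assert with_retries(flaky) == "ok"
assert calls["n"] == 3
```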
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger,
100K-server center:

Technology       Cost in small-sized DC   Cost in large DC      Ratio
Network          $95 per Mbps/month       $13 per Mbps/month    7.1
Storage          $2.20 per GB/month       $0.40 per GB/month    5.7
Administration   ~140 servers/admin       >1000 servers/admin   7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
  Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures; MTBF
  approaching zero, checkpoint frequency increasing, I/O demand intolerable
  DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure platform: compute and storage. A closer look at compute (diagram): HTTP
requests pass through a load balancer to Web Role instances (IIS hosting
ASP.NET, WCF, etc.) and Worker Role instances (a main() loop); each role
instance runs in a VM alongside an agent, managed by the Fabric.
Using queues for reliable messaging (to scale, add more of either role):
1) The Web Role receives work
2) The Web Role puts work in the queue
3) A Worker Role gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
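The four-step queue pattern above maps directly onto a producer/consumer setup. A local sketch using Python's standard-library queue and threads in place of Azure queues and roles:

```python
import queue, threading

work = queue.Queue()
results = []

def web_role(items):
    for item in items:            # 1) receive work, 2) put it in the queue
        work.put(item)

def worker_role():
    while True:                   # 3) get work from the queue, 4) do it
        item = work.get()
        if item is None:          # sentinel: shut down
            break
        results.append(item * item)
        work.task_done()

threads = [threading.Thread(target=worker_role) for _ in range(4)]
for t in threads: t.start()
web_role(range(10))
work.join()                       # wait until all queued work is done
for _ in threads: work.put(None)
for t in threads: t.join()
assert sorted(results) == [i * i for i in range(10)]
```

Adding more worker threads here (more worker role instances in Azure) scales throughput without changing the producer, which is the decoupling the bullets above describe.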
A closer look at storage (diagram): applications access Azure Storage – Blobs,
Drives, Tables, and Queues – through a load balancer via an HTTP REST API,
from Azure compute roles or from elsewhere.
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
The development cycle (diagram): at work or at home, develop your app locally
against the Development Fabric and Development Storage, keeping versions in
source control; verify the application works locally, then in staging, then in
the cloud.
What is the ‘value add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network
infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
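A sketch of the allocation idea: spread instances round-robin across fault and update domains so that no single rack failure or single update step takes out every instance of a role. The `allocate` function is illustrative, not the Fabric Controller's actual algorithm:

```python
def allocate(instances, fault_domains, update_domains):
    """Place role instances round-robin across fault and update domains."""
    return [(i % fault_domains, i % update_domains)
            for i in range(instances)]

# The slide's example: 10 front-ends across 5 update domains
# (assuming 2 fault domains for illustration).
placement = allocate(10, fault_domains=2, update_domains=5)
assert len({ud for _, ud in placement}) == 5     # every update domain is used
assert max(sum(1 for _, u in placement if u == ud)
           for ud in range(5)) == 2              # at most 2 instances per UD
```

Updating one update domain at a time therefore takes down at most 2 of the 10 front-ends, keeping the service up through the rolling upgrade.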
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
AzureMODIS pipeline (diagram): a download queue feeds the data collection
stage, followed by reprojection, derivation reduction, and analysis reduction
stages; research results are exposed through the AzureMODIS service web role
portal.
• Statistical tool used to analyze DNA of HIV from large studies of infected
patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers: 100’s of HIV and HepC researchers
actively use it, and 1000’s of research communities rely on its results
Cover of PLoS Biology, November 2008
Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
– Requires a large number of test runs for a given job (1 – 10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure Storage
3. Deploy the worker roles – each role’s Init() function downloads and
decompresses the data to its local disk, together with the BLAST executable,
yielding BLAST-ready worker roles
Step 2. Partitioning a Job (diagram): the Web Role accepts the user input; a
single partitioning Worker Role writes input partitions to Azure Storage and
enqueues a queue message for each partition.
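The partitioning step amounts to splitting the job's input into roughly equal chunks, one queue message each. A sketch (the `partition` helper is hypothetical, not AzureBLAST's actual code):

```python
def partition(sequences, n_partitions):
    """Split a job's input sequences into roughly equal partitions,
    one per queue message."""
    parts = [[] for _ in range(n_partitions)]
    for i, seq in enumerate(sequences):
        parts[i % n_partitions].append(seq)   # round-robin assignment
    return parts

parts = partition([f"seq{i}" for i in range(10)], 4)
assert sum(len(p) for p in parts) == 10                        # nothing lost
assert max(len(p) for p in parts) - min(len(p) for p in parts) <= 1
```

As the lessons below note, the chosen partition size has a large performance impact, so `n_partitions` would be tuned per job.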
Step 3. Doing the Work (diagram): the BLAST-ready Worker Roles pick up queue
messages, read their input partitions from Azure Storage, and write BLAST
output and logs back to storage.
• Always design with failure in mind – on large jobs it will happen, and it
can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the
optimal size may change depending on the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it’s good to
know where
• Cutting 10 years of computation down to 1 week is great! – little cloud
development headaches are probably worth it
Resources vs. time – time–space fungibility in the cloud:

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
Utilizes a general jobs-based task manager which registers jobs and their
resulting data. (Diagram: on the user premises or the internet, a user holding
highly sensitive data works with a local registry, web management, and an
(HPC) cluster run by an administrator; a registry broker connects the local
registry to the Azure data centers, where each job definition fans out into
tasks producing data products and results.)
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades. Client-side tools are particularly
appropriate for:
• Applications using periphery devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then: make the best use of the capabilities of client and cloud
computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
Slide 11
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful (cover of PLoS Biology, November 2008)
• Small but important group of researchers
  - 100s of HIV and HepC researchers actively use it
  - 1000s of research communities rely on its results
• Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  - Requires a large number of test runs for a given job (1–10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload to the Azure Store
3. Deploy Worker Roles
   - The Init() function downloads and decompresses the data to the local disk
[Diagram: the local sequence database is compressed and uploaded to Azure Storage; the BLAST executable is deployed to the worker roles]
Step 2. Partitioning a Job
[Diagram: the Web Role accepts the user input; a single partitioning Worker Role writes input partitions to Azure Storage and enqueues a queue message for each]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage]
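The partitioning pattern in Steps 2 and 3 can be sketched as follows (an illustration only — the names here are invented, and the actual AzureBLAST roles are .NET code):

```python
def partition_job(sequences, partition_size):
    """Split a job's input sequences into fixed-size partitions.

    Each partition becomes one queue message; a BLAST-ready worker role
    later picks up a message, runs BLAST on that partition, and writes
    output and logs back to Azure storage.
    """
    return [
        {"partition_id": i // partition_size,
         "sequences": sequences[i:i + partition_size]}
        for i in range(0, len(sequences), partition_size)
    ]

job = ["seq-%d" % n for n in range(10)]
messages = partition_job(job, 4)   # partitions of size 4, 4, 2
```

Keeping each message small and self-describing is what lets any idle worker pick up any partition, which matters once roles start failing mid-job.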
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little Cloud development headaches are probably worth it
Resources
Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
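One reading of these numbers: wall-clock time falls as workers are added while the total resource cost (workers × clock duration) rises — the time-space trade-off of the next slide. A small Python check, with the clock durations transcribed from the table:

```python
def to_minutes(hms):
    """Convert an h:mm:ss duration string to minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60.0

# Clock durations transcribed from the table above
runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}

# Resource cost in worker-minutes = workers x wall-clock minutes
cost = {w: w * to_minutes(t) for w, t in runs.items()}
# 2 workers: 174 worker-minutes; 25 workers: 300 worker-minutes
```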
[Chart: resources vs. time, illustrating time-space fungibility in the Cloud]
Utilizes a general job-based task manager which registers jobs and their resulting data
[Diagram: on the user premises (or internet), a user works through a local registry and web management console; a Registry Broker connects to the Azure datacenters, where a registry tracks job definitions, their tasks, and the resulting data products; an (HPC) cluster administrator manages the compute side; highly sensitive data stays local while results flow back to the user]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
  Applications using peripheral devices
  Applications with heavy graphics requirements
  Legacy user interfaces that would be difficult to port
Our goal then:
  Make best use of the capabilities of client and cloud computing
  Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Diagram: the storage account “jared” contains containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI)]
http://jared.blob.core.windows.net/images/PIC01.JPG
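The URL above follows the pattern http://&lt;account&gt;.blob.core.windows.net/&lt;container&gt;/&lt;blob&gt;, which a trivial helper can assemble:

```python
def blob_url(account, container, blob):
    """Build the REST address for a blob; the account name scopes the host."""
    return "http://%s.blob.core.windows.net/%s/%s" % (account, container, blob)

url = blob_url("jared", "images", "PIC01.JPG")
# 'http://jared.blob.core.windows.net/images/PIC01.JPG'
```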
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are <name, value> pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same account/container/blob hierarchy, with each blob stored as a sequence of blocks (Block Id 1, Block Id 2, … Block Id N) or pages (Page 1, Page 2, Page 3, …)]
Example: uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage, block by block:

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
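These commit semantics can be modeled with a toy in-memory class (an illustration only, not the Storage Client Library API):

```python
class BlockBlob:
    """Toy model of block-blob update semantics: uncommitted blocks become
    readable only after PutBlockList commits an ordered block list."""

    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged by PutBlock
        self.committed = []     # ordered (block id, bytes) pairs, readable

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # A new list may mix freshly uploaded blocks with already
        # committed ones, mirroring the PutBlockList rules above.
        prev = dict(self.committed)
        self.committed = [(b, self.uncommitted.get(b, prev.get(b)))
                          for b in block_ids]
        self.uncommitted = {}

    def read(self):
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
nothing_yet = blob.read()          # empty -- no commit has happened
blob.put_block_list(["b1", "b2"])
committed = blob.read()            # now readable as one blob
```

Because readers only ever see the last committed list, an upload can fail halfway through its PutBlock calls without corrupting the readable blob.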
Create MyBlob: specify Blob Size = 10 GBytes, fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0s for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
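The page-blob sequence above can be replayed with a toy model (a sketch only; real page blobs work through the REST API, but the 512-byte page bookkeeping is the same idea):

```python
class PageBlob:
    """Toy page blob: track which 512-byte pages hold valid data,
    and return zeros when unwritten ranges are read."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray(size)   # unwritten pages read back as zeros
        self.valid = set()            # offsets of valid 512-byte pages

    def put_page(self, start, end, byte=b"x"):
        self.data[start:end] = byte * (end - start)
        self.valid.update(range(start, end, 512))

    def clear_page(self, start, end):
        self.data[start:end] = b"\x00" * (end - start)
        self.valid.difference_update(range(start, end, 512))

    def get_page_ranges(self):
        # Coalesce contiguous valid pages into [start, end) ranges
        ranges, run = [], None
        for off in range(0, self.size, 512):
            if off in self.valid:
                run = [off, off + 512] if run is None else [run[0], off + 512]
            elif run:
                ranges.append(tuple(run))
                run = None
        if run:
            ranges.append(tuple(run))
        return ranges

pb = PageBlob(4096)
pb.put_page(512, 2048)
pb.put_page(0, 1024)
pb.clear_page(512, 1536)
pb.put_page(2048, 2560)
valid_ranges = pb.get_page_ranges()   # [(0, 512), (1536, 2560)]
```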
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
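An entity is just a property bag carrying those three required properties plus whatever the application adds. A minimal sketch (names invented for illustration; real access goes through ADO.NET Data Services or REST):

```python
import time

def make_entity(partition_key, row_key, **properties):
    """Build a table entity as a property dict. PartitionKey, RowKey and
    Timestamp are the three required properties, and the pair
    (PartitionKey, RowKey) identifies the entity within its table."""
    entity = {"PartitionKey": partition_key,
              "RowKey": row_key,
              "Timestamp": time.time()}
    entity.update(properties)
    return entity

row = make_entity("azureblast", "job-0007", status="done", hits=42)
```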
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
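The retry advice above can be sketched as a small wrapper with exponential backoff (illustrative only; the Storage Client Library ships its own retry policies):

```python
import time

def with_retries(operation, attempts=4, base_delay=0.5):
    """Run operation(); on failure, back off exponentially and retry,
    re-raising only after the final attempt fails."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

calls = []
def flaky_fetch():
    """Stand-in for a storage call that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise IOError("transient storage error")
    return "blob-bytes"

result = with_retries(flaky_fetch, base_delay=0.01)
```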
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are often unable to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center:
Technology      Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network         $95 per Mbps/month                $13 per Mbps/month            7.1
Storage         $2.20 per GB/month                $0.40 per GB/month            5.7
Administration  ~140 servers/Administrator        >1000 Servers/Administrator   7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
How do HPC and data center (DC) designs compare?
o Node and system architectures
  Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  HPC: local scratch small or non-existent, secondary is SAN or PFS, PB tertiary storage
  DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform: Compute and Storage
A closer look at Compute:
[Diagram: HTTP traffic passes through the Load Balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (main() { … }); each role instance runs in a VM with an Agent, all hosted on the Fabric]
Using queues for reliable messaging
To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) It puts the work in a queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
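The web-role/worker-role queue flow can be modeled with Python's standard in-process queue as a stand-in for the durable Azure queue (the real service uses REST operations with visibility timeouts, and the in-process version of course loses the durability):

```python
import queue

# Stand-in for the durable Azure queue that decouples the two role types
work_queue = queue.Queue()

def web_role_receive(request):
    # Receive work and put a small message describing it in the queue
    work_queue.put({"task": request})

def worker_role_poll():
    # Get a message from the queue and do the work
    msg = work_queue.get()
    return "processed:" + msg["task"]

web_role_receive("render-tile-17")
outcome = worker_role_poll()
```

Because either side only touches the queue, web roles and worker roles can be scaled (or fail and restart) independently.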
A closer look at Storage:
[Diagram: applications reach Blobs, Drives, Tables, and Queues over HTTP through the Load Balancer and REST API; Storage sits alongside Compute on the Fabric]
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: the development cycle. At work or home, develop your app against the local Development Fabric and Development Storage, with versions kept in source control; verify the application works locally, then that it works in cloud staging]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
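The 10-front-ends example can be sketched as round-robin placement (a simplification; the fabric's real allocator also balances fault domains and node capacity):

```python
def allocate(instance_count, update_domains):
    """Round-robin role instances across update domains, so taking any
    one domain down for an update leaves the others still serving."""
    placement = {d: [] for d in range(update_domains)}
    for i in range(instance_count):
        placement[i % update_domains].append("instance-%d" % i)
    return placement

placement = allocate(10, 5)   # the slide's example: 10 front-ends, 5 domains
```

With this layout, a rolling upgrade walks the domains one at a time, and at most 2 of the 10 front-ends are out of service at any moment.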
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 13
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform: Compute and Storage
A closer look (diagram): HTTP requests pass through the load balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (main() { … }); each runs in a VM with an agent, on top of the fabric.
Using queues for reliable messaging (to scale, add more of either role):
1) Web Role receives work
2) Web Role puts work in the queue
3) Worker Role gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of the application, so they are easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
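The reliable-messaging pattern above can be sketched as a short Python stand-in for the queue API (the class and method names here are illustrative, not the actual client library): a get hides a message for a visibility timeout rather than deleting it, which is exactly what masks worker-role faults.

```python
import time

class ReliableQueue:
    """In-memory sketch of an Azure-style queue: messages are hidden,
    not removed, on get; an explicit delete confirms completion."""
    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}   # id -> (body, time when visible again)
        self._next_id = 0

    def put(self, body):
        self._messages[self._next_id] = (body, 0.0)
        self._next_id += 1

    def get(self, now=None):
        now = time.time() if now is None else now
        for mid, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                # Hide the message; if the worker dies before delete(),
                # it becomes visible again after the timeout.
                self._messages[mid] = (body, now + self.visibility_timeout)
                return mid, body
        return None

    def delete(self, mid):
        self._messages.pop(mid, None)

# Web role puts work in; worker role gets, does work, then deletes.
q = ReliableQueue(visibility_timeout=30.0)
q.put("job-1")
mid, body = q.get(now=0.0)        # a worker receives the message
assert q.get(now=0.0) is None     # hidden: no other worker sees it
# Simulate a crashed worker: no delete; the message reappears later.
mid2, body2 = q.get(now=31.0)
assert body2 == "job-1"
q.delete(mid2)                    # a successful worker confirms completion
assert q.get(now=100.0) is None
```

This is why stateless workers plus durable queues scale: any worker can pick up any reappearing message.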
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
A closer look (diagram): applications access storage over HTTP through a REST API behind the load balancer; the storage side holds blobs, drives, tables, and queues, alongside the compute fabric.
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational, but entities that contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Development workflow (diagram): develop your app, at work or home, against the local Development Fabric and Development Storage, keeping versions in source control; confirm the application works locally, then in staging, then move to the cloud.
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update (example: a set of nodes to update), used when rolling forward or backward
The developer assigns the number required by each role (example: 10 front-ends, across 5 update domains)
Allocation is across update domains
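The allocation rule can be illustrated with a short Python sketch (a hypothetical helper, not the Fabric Controller's actual algorithm): instances are dealt round-robin across update domains, so taking any one domain down for an update leaves the rest of the role serving.

```python
def allocate(instance_count, update_domains):
    """Deal role instances round-robin across update domains."""
    domains = [[] for _ in range(update_domains)]
    for i in range(instance_count):
        domains[i % update_domains].append(f"frontend-{i}")
    return domains

# The slide's example: 10 front-ends across 5 update domains.
domains = allocate(10, 5)
assert [len(d) for d in domains] == [2, 2, 2, 2, 2]
# Updating one domain at a time keeps 8 of 10 instances serving.
still_up = sum(len(d) for d in domains[1:])
assert still_up == 8
```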
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on the nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles: failed roles are automatically restarted; node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
The state machine kicks in again to drive the service back to its goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2,300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
AzureMODIS (diagram): a service web role portal feeds a download queue; data flows through the data collection stage, reprojection stage, derivation reduction stage, and analysis reduction stage to produce the research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful (cover of PLoS Biology, November 2008)
• Small but important group of researchers: 100’s of HIV and HepC researchers actively use it, and 1000’s of research communities rely on its results
• Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  – Requires a large number of test runs for a given job (1 – 10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure store
3. Deploy worker roles; the Init() function downloads and decompresses the data, along with the BLAST executable, to the local disk
Step 2. Partitioning a Job (diagram): user input reaches the web role; a single partitioning worker role writes input partitions to Azure storage and posts queue messages.
Step 3. Doing the Work (diagram): BLAST-ready worker roles pick up queue messages, read their input partitions from Azure storage, and write BLAST output and logs back.
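The partitioning step can be sketched in Python (a toy helper with hypothetical sizes; the slides do not give AzureBLAST's actual partition sizes): the user's query sequences are split into fixed-size partitions, each becoming one queue message for a worker.

```python
def partition(sequences, partition_size):
    """Split query sequences into fixed-size partitions; each
    partition becomes one queue message / one unit of worker work."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

# Hypothetical job: 10 query sequences, 4 per partition.
queries = [f"seq-{i}" for i in range(10)]
parts = partition(queries, 4)
assert len(parts) == 3                  # 4 + 4 + 2 sequences
assert parts[-1] == ["seq-8", "seq-9"]
# Partition size is the tuning knob: smaller partitions mean more
# queue messages (overhead); larger ones mean worse load balancing.
```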
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - Little cloud-development headaches are probably worth it
Resources
Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
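The table shows sub-linear scaling: more workers cut the clock time, but not proportionally. A quick calculation over the clock durations in the table makes that concrete:

```python
def seconds(hms):
    """Convert a h:mm:ss string from the table to seconds."""
    h, m, s = map(int, hms.split(":"))
    return h * 3600 + m * 60 + s

# Clock durations from the table, keyed by worker count.
clock = {25: seconds("0:12:00"), 16: seconds("0:15:00"),
         8: seconds("0:26:00"), 4: seconds("0:47:00"),
         2: seconds("1:27:00")}

# Speedup of 25 workers relative to 2, and the parallel efficiency
# given that 12.5x the resources were used.
speedup = clock[2] / clock[25]
efficiency = speedup / (25 / 2)
assert round(speedup, 2) == 7.25
assert 0.5 < efficiency < 0.6   # roughly 58% parallel efficiency
```

The near-constant "total run time" row is the flip side: the total worker-hours consumed barely change, which is the time-space fungibility the next slide illustrates.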
Time-space fungibility in the Cloud (chart of resources vs. time): the same job can trade more resources for less time, or fewer resources for more time.
Azure Ocean utilizes a general jobs-based task manager which registers jobs and their resulting data.
(Diagram) On the user premises (or internet): the user, highly sensitive data, a local registry, an (HPC) cluster, a registry broker, and an administrator. In the Azure datacenters: the registry, a job definition with its tasks, data products, and web management; results flow back to the user.
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades. Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
The user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account:
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example hierarchy: account “jared” contains the containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI); a blob is addressed as
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible
Associate metadata with a container: metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
(Diagram) Each blob under account “jared” (containers “images” and “movies”) consists of blocks or pages: Block or Page 1, 2, 3, …, with each block identified by a Block ID (Block Id 1 … Block Id N).
10 GB Movie
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
TheBlob.wmv is now in Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
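These semantics can be modeled in a few lines of Python (an in-memory stand-in, not the Storage Client Library): PutBlock stages uncommitted blocks that readers cannot see, and PutBlockList atomically defines the readable version of the blob.

```python
class BlockBlob:
    """Sketch of block-blob semantics: staged blocks are invisible
    to readers until PutBlockList commits an ordered list of IDs."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged only
        self.committed = {}     # block_id -> bytes, readable
        self.block_list = []    # ordered committed block IDs

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Blocks may come from the uncommitted or the committed list.
        new = {bid: self.uncommitted.get(bid, self.committed.get(bid))
               for bid in block_ids}
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted = {}   # staged-but-unused blocks are dropped

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""             # nothing committed yet
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"  # readable version now defined
```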
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space)
Random access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
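A small Python model of these semantics (an in-memory stand-in, not the REST operations) reproduces the example's answers: pages are a sparse set of written byte ranges, and unwritten or cleared ranges read back as zeros.

```python
class PageBlob:
    """Sketch of page-blob semantics: a sparse map of written bytes;
    anything unwritten or cleared reads back as zero."""
    def __init__(self, size):
        self.size = size
        self.data = {}          # offset -> byte value, sparse

    def put_page(self, start, end, value=120):
        for i in range(start, end):
            self.data[i] = value

    def clear_page(self, start, end):
        for i in range(start, end):
            self.data.pop(i, None)

    def get_page_ranges(self, start, end):
        ranges, run = [], None
        for i in range(start, end):
            if i in self.data:
                run = [i, i + 1] if run is None else [run[0], i + 1]
            elif run:
                ranges.append(tuple(run)); run = None
        if run:
            ranges.append(tuple(run))
        return ranges

    def get_blob(self, start, end):
        return bytes(self.data.get(i, 0) for i in range(start, end))

blob = PageBlob(10 * 2**30)          # "MyBlob", 10 GB address space
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
assert blob.get_page_ranges(0, 4096) == [(0, 512), (1536, 2560)]
read = blob.get_blob(1000, 2048)
assert read[:536] == bytes(536)          # first 536 bytes are zeros
assert all(b != 0 for b in read[536:])   # next 512 bytes are data
```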
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
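A minimal in-memory sketch of these snapshot semantics (assuming whole-blob copies for simplicity, where the real service keeps only deltas):

```python
class SnapshottableBlob:
    """Writes always go to the base blob; snapshots are read-only
    versions; promote restores the base from a chosen snapshot."""
    def __init__(self, content=b""):
        self.base = content
        self.snapshots = []      # list of (snapshot_id, content)

    def write(self, content):
        self.base = content      # all writes applied to the base name

    def snapshot(self):
        sid = len(self.snapshots)
        self.snapshots.append((sid, self.base))
        return sid

    def promote(self, sid):
        self.base = dict(self.snapshots)[sid]

blob = SnapshottableBlob(b"v1")
s0 = blob.snapshot()
blob.write(b"v2 (bad edit)")
blob.promote(s0)                 # restore the prior version
assert blob.base == b"v1"
assert len(blob.snapshots) == 1  # ListBlobs-style enumeration
```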
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
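The entity model can be sketched as a store keyed on (PartitionKey, RowKey), the two properties the service indexes (a toy in-memory stand-in, not the ADO.NET Data Services client):

```python
import time

class Table:
    """Entities are property bags; (PartitionKey, RowKey) is the
    unique key and the only indexed lookup path."""
    def __init__(self):
        self.entities = {}   # (partition_key, row_key) -> properties

    def insert(self, partition_key, row_key, **properties):
        properties["Timestamp"] = time.time()   # system-maintained
        self.entities[(partition_key, row_key)] = properties

    def get(self, partition_key, row_key):
        return self.entities.get((partition_key, row_key))

    def query_partition(self, partition_key):
        # Efficient pattern: scan one partition, not the whole table.
        return [props for (pk, rk), props in self.entities.items()
                if pk == partition_key]

t = Table()
t.insert("customers-US", "alice", city="Seattle")
t.insert("customers-US", "bob", city="Portland")
t.insert("customers-EU", "carol", city="Dublin")
assert t.get("customers-US", "alice")["city"] == "Seattle"
assert len(t.query_partition("customers-US")) == 2
```

Choosing the partition key is the main design decision: it is both the unit of indexing and the unit the service can spread across servers as traffic grows.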
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
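The "include retry logic" advice can be sketched as a small helper (a hypothetical function, not a platform API) that retries transient data-access failures with exponential backoff:

```python
import time

def with_retries(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation(); on failure, wait base_delay * 2^n and retry.
    Re-raises the last error once attempts are exhausted."""
    for n in range(attempts):
        try:
            return operation()
        except IOError:
            if n == attempts - 1:
                raise
            sleep(base_delay * (2 ** n))   # 0.1s, 0.2s, 0.4s, ...

# A flaky data access that succeeds on the third try.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "data"

result = with_retries(flaky_read, sleep=lambda s: None)
assert result == "data"
assert calls["n"] == 3
```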
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized data center (1,000 servers) and a larger, 100K-server center:

Technology       Cost in small-sized data center   Cost in large data center   Ratio
Network          $95 per Mbps/month                $13 per Mbps/month          7.1
Storage          $2.20 per GB/month                $0.40 per GB/month          5.7
Administration   ~140 servers/administrator        >1000 servers/administrator 7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
Comparing HPC and data center (DC) designs:
o Node and system architectures
  - Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch small or non-existent; secondary is SAN or PFS; PB tertiary storage
  - DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64, Memory 1.7 GB, Network 100+ Mbps, Local Storage 500 GB
Up to: CPU 8 cores, Memory 14.2 GB, Local Storage 2+ TB
Azure Platform: Compute and Storage — a closer look
Diagram: HTTP traffic passes through a load balancer to web roles (IIS hosting ASP.NET, WCF, etc.) and worker roles (a main() { … } loop); each role runs in a VM alongside an agent, all managed by the fabric.
Using queues for reliable messaging
To scale, add more of either role:
1) Web role (ASP.NET, WCF, etc.) receives work
2) Web role puts work in the queue
3) Worker role (main() { … }) gets work from the queue
4) Worker role does the work
Queues are the application glue
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
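The web role / queue / worker role pattern above can be sketched as a small simulation. This is a conceptual model (plain Python standing in for Azure queues, not the Azure SDK); the visibility-timeout behavior shown is what masks worker faults:

```python
from collections import deque

class WorkQueue:
    """Toy stand-in for an Azure queue with visibility-timeout semantics:
    a message that is dequeued but never deleted reappears for another worker."""
    def __init__(self):
        self._messages = deque()
        self._invisible = {}
        self._next_id = 0

    def put(self, body):
        self._messages.append((self._next_id, body))
        self._next_id += 1

    def get(self):
        # Message becomes invisible, not deleted; caller must delete() on success.
        if not self._messages:
            return None
        msg = self._messages.popleft()
        self._invisible[msg[0]] = msg
        return msg

    def delete(self, msg_id):
        self._invisible.pop(msg_id, None)

    def restore_expired(self):
        # Simulate the visibility timeout expiring for all in-flight messages.
        for msg in self._invisible.values():
            self._messages.append(msg)
        self._invisible.clear()

def worker(queue, results, fail_first=False):
    """Worker role loop: get work, do work, delete message only on success."""
    failed_once = not fail_first
    while True:
        msg = queue.get()
        if msg is None:
            break
        msg_id, body = msg
        if not failed_once:
            failed_once = True   # "crash" before deleting: message will reappear
            continue
        results.append(body * 2)  # do the work
        queue.delete(msg_id)      # success: remove the message for good

queue = WorkQueue()
for task in [1, 2, 3]:
    queue.put(task)            # web role: put work in the queue

results = []
worker(queue, results, fail_first=True)  # first dequeue "crashes" mid-task
queue.restore_expired()                  # timeout masks the fault
worker(queue, results)                   # a healthy worker finishes the job
print(sorted(results))                   # every task completed exactly once
```

The key design point: a worker deletes a message only after finishing it, so a crashed worker's task is simply redelivered.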
Azure Storage — a closer look
Diagram: applications reach Blob, Queue, and Table storage (plus Drives) over HTTP through a load balancer, using the REST API; Storage sits alongside Compute on the Fabric.
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage — not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premises applications or cloud applications
Develop, run, deploy
Diagram: develop your app at work or home against the local Development Fabric and Development Storage, with versions kept in source control; once the application works locally, move it to staging in the cloud.
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
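The allocation rule above — spread a role's instances across both fault and update domains — can be sketched as follows. This is an illustrative round-robin allocator, not the actual Fabric Controller algorithm; the 2-fault-domain figure is an assumption for the example:

```python
def allocate(instances, fault_domains, update_domains):
    """Spread role instances round-robin across fault and update domains,
    so no single rack failure or single rolling update stops the service."""
    return [(i % fault_domains, i % update_domains) for i in range(instances)]

# Slide example: 10 front-ends across 5 update domains (assume 2 fault domains)
placement = allocate(10, fault_domains=2, update_domains=5)

per_update_domain = {}
for fd, ud in placement:
    per_update_domain.setdefault(ud, []).append(fd)

# Each of the 5 update domains holds 2 instances, and those 2 sit in
# different fault domains — updating one domain leaves 8 instances running.
print(per_update_domain)
```

Rolling an update one update domain at a time then never touches more than a fifth of the front-ends at once.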
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
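The FC's monitoring behavior can be caricatured as a goal-state reconciliation loop. This is a sketch of the idea, not FC internals; node structure and counts are invented for illustration:

```python
def reconcile(desired_count, nodes):
    """Drive the service back to its goal state: roles on dead nodes are
    lost, so replacements are started on the remaining healthy nodes."""
    healthy = [n for n in nodes if n["alive"]]
    running = sum(n["roles"] for n in healthy)
    # Start replacement roles on healthy nodes until the goal state is met.
    i = 0
    while running < desired_count and healthy:
        healthy[i % len(healthy)]["roles"] += 1
        running += 1
        i += 1
    return running

nodes = [
    {"alive": True,  "roles": 3},
    {"alive": False, "roles": 3},   # node offline: its role instances are lost
    {"alive": True,  "roles": 3},
]
restored = reconcile(9, nodes)
print(restored)   # back to 9 running roles, all on healthy nodes
```

The same loop handles both failure modes on the slide: a dead role (running count drops) and a dead node (all of its roles drop at once).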
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra (“EOS AM”), launched 12/1999, descending, equator crossing at 10:30 AM
Aqua (“EOS PM”), launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
Diagram: the AzureMODIS service web role portal feeds a download queue; imagery then flows through the data collection stage, reprojection stage, derivation reduction stage, and analysis reduction stage to produce research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  - Requires a large number of test runs for a given job (1–10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to Azure storage
3. Deploy worker roles — each role’s Init() function downloads and decompresses the data to the local disk, leaving BLAST-ready worker roles with the BLAST executable
Step 2. Partitioning a Job
A single partitioning worker role takes the user input from the web role, splits it into input partitions in Azure storage, and posts one queue message per partition.
Step 3. Doing the Work
BLAST-ready worker roles pick up queue messages, read their input partitions from Azure storage, and write BLAST output and logs back to storage.
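Steps 2 and 3 can be sketched together: partition the input, make each partition a queue message, and let workers consume them. This is a conceptual sketch; the partition size and output format are invented for illustration:

```python
def partition_job(sequences, partition_size):
    """Split the user's input into fixed-size partitions; each partition
    becomes one queue message for a BLAST-ready worker role."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_workers(partitions):
    """Each worker consumes one message and writes its output to storage
    (here, just a placeholder string per partition)."""
    return [f"output-for-{len(p)}-sequences" for p in partitions]

sequences = [f"seq{i}" for i in range(10)]
messages = partition_job(sequences, partition_size=4)  # partitions of 4, 4, 2
outputs = run_workers(messages)
print(len(messages), len(outputs))
```

As the lessons below note, the choice of partition size is the main performance knob: too small and queue-transaction overhead dominates, too large and a single worker failure wastes hours of work.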
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Resources vs. time:

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
Chart: time–space fungibility in the cloud — the same total computation can trade resources for time, using many workers briefly or few workers for longer.
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
Diagram: a job definition is broken into tasks and registered; on the user premises (or internet), the user keeps highly sensitive data in a local registry alongside an (HPC) cluster; a registry broker, web management, and an administrator connect the registry to the Azure datacenters, which return data products and results.
Client Visualization / Cloud Data and Computation
The cloud is not a jack-of-all-trades.
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose the geo-location to host the storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example hierarchy — Account: jared; Containers: images (Blobs PIC01.JPG, PIC02.JPG) and movies (Blob MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate metadata with a container
Metadata: up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Diagram: within the account (jared) / container (images, movies) / blob (PIC01.JPG, PIC02.JPG, MOV1.AVI) hierarchy, each blob — e.g. a 10 GB movie — consists of blocks or pages: Block Id 1, Block Id 2, Block Id 3, …, Block Id N.
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
The committed blob appears as TheBlob.wmv in Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
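The PutBlock / PutBlockList semantics — staged blocks stay uncommitted and invisible until a block list commits them — can be modeled in a few lines. This is a conceptual sketch of the update semantics, not the REST API or its wire format:

```python
class BlockBlob:
    """Model of block-blob update semantics: put_block stages data, and
    put_block_list atomically commits a chosen sequence of blocks."""
    def __init__(self):
        self.uncommitted = {}      # staged blocks, not yet readable
        self.committed = {}        # block_id -> bytes
        self.block_list = []       # ordered ids forming the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The list may mix already-committed blocks with newly staged ones.
        self.committed = {bid: self.uncommitted.get(bid, self.committed.get(bid))
                          for bid in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""          # nothing is readable before the commit
blob.put_block_list(["b1", "b2"])  # atomic switch to the new version
print(blob.read())                 # b'hello world'
```

This is why block blobs suit streaming uploads: readers only ever see a fully committed version of the blob.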
Create MyBlob
Specify blob size = 10 GB, fixed page size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns:
  all 0 for the first 536 bytes
  the next 512 bytes are the data stored in [1536, 2048)
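The page-blob example above can be checked with a small simulation of the valid-range bookkeeping. This is a byte-level model for illustration (real page blobs track validity per 512-byte page, not per byte):

```python
class PageBlob:
    """Model page-blob semantics with a byte buffer plus a validity mask:
    unwritten or cleared ranges read back as zeros."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.valid = [False] * size

    def put_page(self, start, end, fill=b"\x01"):
        self.data[start:end] = fill * (end - start)   # immediate update
        for i in range(start, end):
            self.valid[i] = True

    def clear_page(self, start, end):
        for i in range(start, end):
            self.valid[i] = False

    def valid_ranges(self, start, end):
        ranges, run_start = [], None
        for i in range(start, end):
            if self.valid[i] and run_start is None:
                run_start = i
            elif not self.valid[i] and run_start is not None:
                ranges.append((run_start, i))
                run_start = None
        if run_start is not None:
            ranges.append((run_start, end))
        return ranges

# Replay the slide's sequence of operations.
blob = PageBlob(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.valid_ranges(0, 4096))   # [(0, 512), (1536, 2560)], as on the slide
```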
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
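The two concurrency models differ in kind: ETag checks are optimistic (detect a conflicting write after the fact), while leases are pessimistic (lock the blob up front). The ETag side can be illustrated with a conceptual sketch (not the storage service's actual headers or API):

```python
class VersionedBlob:
    """Optimistic concurrency via ETag: a write succeeds only if the
    caller's etag still matches the blob's current etag."""
    def __init__(self):
        self.value, self.etag = None, 0

    def read(self):
        return self.value, self.etag

    def write(self, value, if_match):
        if if_match != self.etag:
            return False             # someone else updated first: re-read, retry
        self.value, self.etag = value, self.etag + 1
        return True

blob = VersionedBlob()
_, etag = blob.read()
assert blob.write("v1", if_match=etag)      # first writer wins
assert not blob.write("v2", if_match=etag)  # stale etag is rejected
print(blob.value)                            # v1
```

A rejected write means the caller must re-read to get the new etag before retrying, which keeps lost updates from going unnoticed.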
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
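Snapshot promotion can be sketched as follows. For simplicity this model stores full copies; as noted above, the real service keeps only delta changes across snapshots:

```python
class SnapshotBlob:
    """Writes apply to the base blob; snapshot() records a restorable
    version; promote() restores the base blob to a prior snapshot."""
    def __init__(self, data):
        self.base = data
        self.snapshots = []

    def write(self, data):
        self.base = data            # all writes go to the base blob name

    def snapshot(self):
        self.snapshots.append(self.base)
        return len(self.snapshots) - 1

    def promote(self, snapshot_id):
        self.base = self.snapshots[snapshot_id]

blob = SnapshotBlob("v1")
sid = blob.snapshot()   # capture v1 as a read-only point-in-time version
blob.write("v2")        # base blob moves on
blob.promote(sid)       # restore the base blob to the prior version
print(blob.base)        # v1
```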
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates for all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
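The first testing guideline — include retry logic everywhere you access data — can be sketched as a generic wrapper with exponential backoff. This is an illustrative pattern, not an Azure library; the flaky read is simulated:

```python
import time

def with_retries(operation, attempts=3, base_delay=0.01):
    """Retry a storage call that may hit transient faults, backing off
    exponentially between attempts."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise                        # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_read():
    """Simulated storage read that fails transiently twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

result = with_retries(flaky_read)
print(result)   # data (after two transient failures were absorbed)
```

Pair this with the logging guideline: record each retried fault so that a pattern of transient failures is visible rather than silently absorbed.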
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 16
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
Example: uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage as a block blob:
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
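The PutBlock / PutBlockList flow can be sketched with a toy in-memory model (not the real storage client; names are illustrative). The point it shows: staged blocks are invisible until a block list is committed.

```python
class BlockBlob:
    """Toy model of block-blob semantics: blocks are staged uncommitted,
    and the blob's readable content is defined only by the committed list."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged but not readable
        self.committed = []     # ordered list of (block id, bytes)

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data  # staged only

    def put_block_list(self, block_ids):
        # Commit: assemble the readable blob from staged or previously
        # committed blocks, in the order the caller specifies.
        lookup = dict(self.committed)
        lookup.update(self.uncommitted)
        self.committed = [(bid, lookup[bid]) for bid in block_ids]
        self.uncommitted.clear()

    def read(self):
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""          # nothing committed yet
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"
```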
Blocks can be up to 4 MB each
Each block can be a different size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0, 512), [1536, 2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536, 2048)
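The outcome of that sequence can be checked with a small sketch that models fixed-size pages and a validity bitmap (a simplification of the real service; class and method names are illustrative):

```python
PAGE = 512  # fixed page size, as in the example

class PageBlob:
    """Toy page blob: a fixed-size byte array plus a validity flag per page."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.valid = [False] * (size // PAGE)

    def put_page(self, start, end, fill=b"x"):
        self.data[start:end] = fill * (end - start)
        for p in range(start // PAGE, end // PAGE):
            self.valid[p] = True

    def clear_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.valid[p] = False

    def get_page_ranges(self, start, end):
        # Coalesce consecutive valid pages into [start, end) ranges
        ranges, run = [], None
        for p in range(start // PAGE, end // PAGE):
            if self.valid[p]:
                run = [p * PAGE, (p + 1) * PAGE] if run is None else run
                run[1] = (p + 1) * PAGE
            elif run:
                ranges.append(tuple(run))
                run = None
        if run:
            ranges.append(tuple(run))
        return ranges

b = PageBlob(4096)
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
assert b.get_page_ranges(0, 4096) == [(0, 512), (1536, 2560)]
```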
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
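The ETag style of concurrency can be sketched as a toy model (not the storage API; the 412 message mirrors the HTTP convention for failed preconditions): a conditional write succeeds only if the blob is unchanged since it was read.

```python
import uuid

class Blob:
    """Toy optimistic concurrency: every write rotates the ETag, and a
    conditional write is rejected when the caller's ETag is stale."""
    def __init__(self, data=b""):
        self.data, self.etag = data, str(uuid.uuid4())

    def write_if_match(self, data, etag):
        if etag != self.etag:
            raise RuntimeError("412 Precondition Failed: blob changed")
        self.data, self.etag = data, str(uuid.uuid4())

b = Blob(b"v1")
stale = b.etag
b.write_if_match(b"v2", b.etag)      # succeeds, and the etag rotates
try:
    b.write_if_match(b"v3", stale)   # fails: this etag is now stale
    raced = False
except RuntimeError:
    raced = True
assert raced and b.data == b"v2"
```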
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
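A toy model (illustrative names) shows why supplying both keys gives a cheap point lookup, while any other filter degrades to a scan over the entities:

```python
class Table:
    """Toy entity table keyed on (PartitionKey, RowKey); a query that
    does not supply both keys has to scan."""
    def __init__(self):
        self.rows = {}  # (partition_key, row_key) -> dict of properties

    def insert(self, pk, rk, **props):
        self.rows[(pk, rk)] = {"PartitionKey": pk, "RowKey": rk, **props}

    def point_lookup(self, pk, rk):
        return self.rows.get((pk, rk))          # O(1): both keys known

    def scan(self, predicate):
        return [e for e in self.rows.values() if predicate(e)]  # O(n)

t = Table()
t.insert("blast", "job-001", status="done")
t.insert("blast", "job-002", status="running")
assert t.point_lookup("blast", "job-001")["status"] == "done"
assert len(t.scan(lambda e: e["status"] == "running")) == 1
```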
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
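Batching multiple small tasks into a single queue message trades per-message storage transactions against message size; a sketch (the 8 KB cap and the JSON encoding are illustrative assumptions):

```python
import json

def pack_tasks(tasks, max_bytes=8192):
    """Greedily pack small task descriptors into as few queue messages
    as possible, respecting a per-message size cap."""
    messages, batch = [], []
    for task in tasks:
        candidate = batch + [task]
        if len(json.dumps(candidate).encode()) > max_bytes and batch:
            messages.append(json.dumps(batch))  # flush the full batch
            batch = [task]
        else:
            batch = candidate
    if batch:
        messages.append(json.dumps(batch))
    return messages

tasks = [{"id": i} for i in range(1000)]
msgs = pack_tasks(tasks)
assert len(msgs) < 1000                               # far fewer transactions
assert sum(len(json.loads(m)) for m in msgs) == 1000  # nothing lost
```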
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
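The retry advice above might look like the following sketch: exponential backoff with a capped attempt count (the attempt count and delays are illustrative):

```python
import time

def with_retries(op, attempts=4, base_delay=0.1):
    """Run op(); on failure, back off exponentially and retry.
    Transient storage faults are expected and must be absorbed."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise                          # out of attempts
            time.sleep(base_delay * (2 ** i))  # 0.1s, 0.2s, 0.4s, ...

calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return b"data"

assert with_retries(flaky_read, base_delay=0.001) == b"data"
assert calls["n"] == 3
```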
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1000 servers) and a larger, 100K-server center.
Technology       Cost in small-sized DC     Cost in large DC        Ratio
Network          $95 per Mbps/month         $13 per Mbps/month      7.1
Storage          $2.20 per GB/month         $0.40 per GB/month      5.7
Administration   ~140 servers/admin         >1000 servers/admin     7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
– Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, a big chunk of memory on the node
o Communication fabric
o Storage systems
– HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
– DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
– HPC: periodic checkpoints, rollback and resume in response to failures;
MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
– DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
[Diagram: HTTP requests pass through a load balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } loop); an agent runs in each VM, all managed by the Fabric]
Using queues for reliable messaging
To scale, add more of either role
[Diagram: 1) the Web Role (ASP.NET, WCF, etc.) receives work; 2) puts work in the queue; 3) the Worker Role (main() { … }) gets work from the queue; 4) does the work]
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
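The four-step flow above can be sketched with an in-process queue standing in for an Azure queue (a toy model; the real queue adds visibility timeouts so that a failed worker's message reappears):

```python
import queue

work_queue = queue.Queue()   # stand-in for an Azure queue

def web_role(request):
    # Steps 1-2: receive work and put it in the queue
    work_queue.put(request)

def worker_role():
    # Steps 3-4: get work from the queue and do it
    item = work_queue.get()
    result = item.upper()    # "do work" placeholder
    work_queue.task_done()
    return result

web_role("render frame 42")
assert worker_role() == "RENDER FRAME 42"
```

Because the two roles only share the queue, either side can be scaled out independently, which is exactly the decoupling the bullets above describe.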
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
A closer look
[Diagram: applications access Storage (Blobs, Drives, Tables, Queues, …) through a REST API over HTTP, behind a load balancer; Compute and Storage both run on the Fabric]
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop and run your app at work or at home against the local Development Fabric and Development Storage, under source/version control; the application works locally, then in staging, then in the cloud]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
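The 10-front-ends-across-5-update-domains example is essentially round-robin placement; a sketch (instance names are illustrative):

```python
def allocate(instances, domains):
    """Round-robin role instances across update domains, so taking one
    domain down for an update affects only 1/domains of capacity."""
    placement = {d: [] for d in range(domains)}
    for i in range(instances):
        placement[i % domains].append(f"frontend-{i}")
    return placement

p = allocate(10, 5)
assert all(len(nodes) == 2 for nodes in p.values())  # 2 instances per domain
# Updating domain 0 takes down only frontend-0 and frontend-5
assert p[0] == ["frontend-0", "frontend-5"]
```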
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of the change
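The recovery behavior just described is a goal-state loop; a much-simplified sketch (the data shapes are illustrative):

```python
def reconcile(goal_count, instances):
    """One pass of a goal-state loop: drop dead role instances and
    allocate replacements until the desired count is met again."""
    alive = [i for i in instances if i["healthy"]]
    next_id = max([i["id"] for i in instances], default=-1) + 1
    while len(alive) < goal_count:
        alive.append({"id": next_id, "healthy": True})  # new node allocated
        next_id += 1
    return alive

running = [{"id": 0, "healthy": True},
           {"id": 1, "healthy": False},   # this role died
           {"id": 2, "healthy": True}]
state = reconcile(3, running)
assert len(state) == 3 and all(i["healthy"] for i in state)
```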
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: the AzureMODIS service web role portal coordinates a pipeline of stages: download queue and data collection stage, reprojection stage, derivation reduction stage, and analysis reduction stage, yielding research results]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
– 100’s of HIV and HepC researchers actively use it
– 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
– Requires a large number of test runs for a given job (1 – 10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database and BLAST executable)
2. Upload to Azure Storage
3. Deploy Worker Roles
- Init() function downloads and decompresses data to the local disk
Step 2. Partitioning a Job
[Diagram: a Web Role takes the user input; a single partitioning Worker Role writes input partitions to Azure Storage and posts queue messages]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, process their input partitions, and write BLAST output and logs back to Azure Storage]
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers                  25        16        8         4         2
Clock Duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
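A quick sanity check on those numbers: parsing the clock durations and computing the speedup over the two-worker run shows good but sub-linear scaling (the parsing helper is illustrative):

```python
def minutes(hms):
    """Convert an h:mm:ss duration string to minutes."""
    h, m, s = map(int, hms.split(":"))
    return h * 60 + m + s / 60

runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}
base = minutes(runs[2])
speedup = {w: base / minutes(t) for w, t in runs.items()}
# 25 workers give ~7.3x over 2 workers, not the ideal 12.5x:
# overhead grows as the work is spread thinner.
assert round(speedup[25], 2) == 7.25
assert speedup[25] < 25 / 2
```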
[Chart: time-space fungibility in the Cloud, trading resources against time]
Utilizes a general jobs-based task manager which registers jobs and their resulting data
[Architecture diagram: a job definition fans out into tasks; a registry and registry broker connect the user, an (HPC) cluster administrator, and a local registry holding highly sensitive data on the user premises (or internet) with the Azure data centers, where data products and results are exposed through web management]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
  Terra, "EOS AM", launched 12/1999; descending, equator crossing at 10:30 AM
  Aqua, "EOS PM", launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2,300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram: AzureMODIS Service Web Role Portal. A download queue feeds the Data Collection Stage; Reprojection, Derivation Reduction, and Analysis Reduction stages then produce the research results.]
PhyloD
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100's of HIV and HepC researchers actively use it
  - 1000's of research communities rely on results
  (Cover of PLoS Biology, November 2008)
• Typical job: 10 - 20 CPU hours; extreme jobs require 1K - 2K CPU hours
  - Requires a large number of test runs for a given job (1 - 10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure Storage
3. Deploy Worker Roles: the Init() function downloads and decompresses the data to the local disk, alongside the deployed BLAST executable
Step 2. Partitioning a Job
The Web Role takes the user input; a single partitioning Worker Role writes input partitions to Azure Storage and posts a queue message for each partition.

Step 3. Doing the Work
BLAST-ready Worker Roles pick up queue messages, read their input partition from Azure Storage, run BLAST, and write the BLAST output and logs back to Azure Storage.
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it's good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Resources

Workers | Clock Duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13
[Chart: Time-Space fungibility in the Cloud, trading resources against time.]
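The timing numbers above illustrate the time-space trade: multiplying workers by clock duration shows total worker-time rising with parallelism (coordination overhead), even as wall-clock time falls. A quick check of the arithmetic:

```python
# Workers vs. clock duration (in minutes), taken from the timing table above
runs = {25: 12, 16: 15, 8: 26, 4: 47, 2: 87}

# Total worker-minutes consumed by each configuration
worker_minutes = {w: w * minutes for w, minutes in runs.items()}

# Wall-clock time drops about 7x (87 min -> 12 min) while total resource use
# grows less than 2x (174 -> 300 worker-minutes). In the cloud you pay for
# worker-time, so workers (space) and wall-clock time are fungible.
```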
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks on an (HPC) cluster; a Registry Broker connects the local registry on the user premises (or internet), where highly sensitive data stays with the user and administrator, to the Azure datacenters, which provide web management of registered data products and results.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then:
  Make best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose a geo-location to host the storage account
  e.g. "US Anywhere", "US North Central", "US South Central"
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
  Each storage account can store up to 100 TB
  Default limit of 5 storage accounts per subscription
Example namespace: account "jared" contains container "images" (blobs PIC01.JPG, PIC02.JPG) and container "movies" (blob MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
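The namespace composes directly into the blob URL: the account name is the hostname, and the container and blob name form the path. A small helper reproducing the example above:

```python
def blob_url(account, container, blob):
    """Compose the public URL for a blob: account as hostname,
    container and blob name as the path."""
    return "http://{0}.blob.core.windows.net/{1}/{2}".format(account, container, blob)
```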
Number of Blob Containers
  Can have as many Blob Containers as will fit within the storage account limit
Blob Container
  A container holds a set of blobs
  Set access policies at the container level: private or publicly accessible
  Associate metadata with a container: up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: a blob is addressed as account / container / blob (e.g. jared / images / PIC01.JPG) and is composed of blocks or pages; each block carries a Block ID (Block Id 1 through Block Id N).]
Uploading a 10 GB movie to Windows Azure Storage, block by block:

  blobName = "TheBlob.wmv";
  PutBlock(blobName, blockId1, block1Bits);
  PutBlock(blobName, blockId2, block2Bits);
  ...
  PutBlock(blobName, blockIdN, blockNBits);
  PutBlockList(blobName, blockId1, ..., blockIdN);

The committed result is the blob TheBlob.wmv.
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
  PutBlock: puts an uncommitted block, identified by its block ID, for the blob
Block List Operations
  PutBlockList: provides the list of blocks to comprise the readable version of the blob; can use blocks from the uncommitted or committed list to update the blob
  GetBlockList: returns the list of blocks, committed or uncommitted, for a blob; the Block ID and size of each block is returned
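The PutBlock/PutBlockList split means readers never see a half-uploaded blob: blocks accumulate in an uncommitted list and only become readable when a block list is committed. An in-memory sketch of those semantics (our own toy class, not the real storage client):

```python
class BlockBlob:
    """Toy model of block-blob commit semantics."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, invisible to readers
        self.committed = {}     # block id -> bytes, referenced by the block list
        self.block_list = []    # ordered block ids forming the readable blob

    def put_block(self, block_id, data):
        # Uploads a block; readers of the blob are unaffected until commit
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: each id may resolve from the uncommitted or committed list
        new_committed = {
            bid: self.uncommitted.get(bid, self.committed.get(bid))
            for bid in block_ids
        }
        self.committed, self.uncommitted = new_committed, {}
        self.block_list = list(block_ids)

    def read(self):
        # Readers only ever see the last committed block list
        return b"".join(self.committed[bid] for bid in self.block_list)
```

Note how an update can mix already-committed blocks with freshly uploaded ones, exactly as PutBlockList allows.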
Create MyBlob
  Specify blob size = 10 GBytes
  Fixed page size = 512 bytes
Random access operations over the 10 GB address space:
  PutPage[512, 2048)
  PutPage[0, 1024)
  ClearPage[512, 1536)
  PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns:
  All 0 for the first 536 bytes
  Next 512 bytes are the data stored in [1536,2048)
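The page-blob example above can be simulated with a byte array plus a per-page validity mask: never-written or cleared pages read as zeros. A sketch using a tiny 4 KB blob in place of the 10 GB example (class and method names are ours):

```python
class PageBlob:
    """Toy model of page-blob reads and writes over 512-byte pages."""
    PAGE = 512

    def __init__(self, size):
        self.data = bytearray(size)
        self.valid = [False] * (size // self.PAGE)

    def put_page(self, start, end, fill=0xAB):
        # Write pages covering [start, end); offsets are page-aligned
        self.data[start:end] = bytes([fill]) * (end - start)
        for p in range(start // self.PAGE, end // self.PAGE):
            self.valid[p] = True

    def clear_page(self, start, end):
        for p in range(start // self.PAGE, end // self.PAGE):
            self.valid[p] = False

    def get_page_ranges(self):
        # Coalesce runs of valid pages into [start, end) byte ranges
        ranges, run = [], None
        for i, v in enumerate(self.valid):
            if v and run is None:
                run = i * self.PAGE
            if not v and run is not None:
                ranges.append((run, i * self.PAGE))
                run = None
        if run is not None:
            ranges.append((run, len(self.valid) * self.PAGE))
        return ranges

    def get_blob(self, start, end):
        # Invalid pages read as zeros
        out = bytearray(end - start)
        for off in range(start, end):
            if self.valid[off // self.PAGE]:
                out[off - start] = self.data[off]
        return bytes(out)
```

Replaying the slide's operations reproduces its stated results: valid ranges [0,512) and [1536,2560), and a GetBlob[1000,2048) whose first 536 bytes are zero.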
Block Blob
  Targeted at streaming workloads
  Update semantics: upload a set of blocks, then commit the change
  Concurrency: ETag checks
Page Blob
  Targeted at random read/write workloads
  Update semantics: immediate update
  Concurrency: leases
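ETag checks are optimistic concurrency: read the blob with its ETag, and make the write conditional on the ETag being unchanged, so a concurrent writer's update is not silently overwritten. A minimal sketch of the idea (our own class, not the storage API; real Azure returns HTTP 412 on a mismatch):

```python
import uuid

class VersionedBlob:
    """Optimistic concurrency via ETag checks, block-blob style."""
    def __init__(self, content=b""):
        self.content = content
        self.etag = str(uuid.uuid4())

    def read(self):
        return self.content, self.etag

    def conditional_update(self, new_content, if_match):
        # Reject the write if someone else updated the blob since our read
        if if_match != self.etag:
            raise RuntimeError("412 Precondition Failed: ETag mismatch")
        self.content = new_content
        self.etag = str(uuid.uuid4())  # every successful write mints a new ETag
        return self.etag
```

A stale ETag fails the write, and the caller then re-reads and retries; page blobs instead take a lease for exclusive access.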
Snapshots
  All writes are applied to the base blob name
  Only delta changes are maintained across snapshots
  Restore to a prior version via snapshot promotion
  Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
  Drives can be up to 1 TB
  A VM can dynamically mount up to 8 drives
  A Page Blob can only be mounted by one VM at a time for read/write
Remote Access via Page Blob
  Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
  Can download the Drive through the Page Blob interface
Provides Structured Storage
  Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
  Highly available & durable: data is replicated several times
Familiar and easy-to-use API
  ADO.NET Data Services (.NET 3.5 SP1): .NET classes and LINQ
  REST: with any platform or language
Table
  A storage account can create many tables; the table name is scoped by the account
  A table is a set of entities (i.e. rows)
Entity
  A set of properties (columns)
  Required properties: PartitionKey, RowKey and Timestamp
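An entity is therefore addressed by the pair (PartitionKey, RowKey), and only those keys are indexed, so the efficient query shapes are a point lookup or a partition scan. A toy in-memory sketch of that data model (our own class, not the Table service client):

```python
class Table:
    """Sketch of Azure Table semantics: entities keyed by (PartitionKey, RowKey)."""
    def __init__(self):
        self.entities = {}  # (partition_key, row_key) -> property dict

    def insert(self, partition_key, row_key, **properties):
        key = (partition_key, row_key)
        if key in self.entities:
            raise KeyError("entity already exists")
        # Entities are schema-free: each carries its own property set
        self.entities[key] = dict(properties)

    def get(self, partition_key, row_key):
        # Point lookup: the cheapest query, both keys given
        return self.entities[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # Partition scan: efficient because PartitionKey is indexed
        return {rk: props for (pk, rk), props in self.entities.items()
                if pk == partition_key}
```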
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use 'heartbeat' mechanisms when debugging your applications
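The retry-logic advice above is typically exponential backoff with jitter around every storage call. A generic sketch (the helper name and parameters are ours, not part of any Azure client library):

```python
import random
import time

def with_retries(operation, attempts=4, base_delay=0.5):
    """Retry a flaky data-access call with exponential backoff plus jitter.

    `operation` is any zero-argument callable that raises IOError on
    a transient failure.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Back off 0.5s, 1s, 2s, ... with jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```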
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?
The Rest of Us
  Use laptops.
  Got data, now what?
  And it really is about data, not the FLOPS…
  Our data collections are not as big as we wished.
  When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
  The ability to marshal needed resources on demand, without caring or knowing how it gets done…
  Funding agencies can request grantees to archive research data.
  The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
  Available over a three-year period
  To be allocated by NSF with new NSF awards
Coupled with access to a research-oriented technical team
Azure resource offering
  20 million core hours per year
  200 terabytes of triply replicated storage
  1 terabyte/day/project of aggregate ingress/egress bandwidth
  Tier-one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did "killer micros" and inexpensive clusters
Data centers range in size from "edge" facilities to megascale.
Economies of scale: approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center.
Technology     | Cost in small-sized Data Center | Cost in Large Data Center   | Ratio
Network        | $95 per Mbps/month              | $13 per Mbps/month          | 7.1
Storage        | $2.20 per GB/month              | $0.40 per GB/month          | 5.7
Administration | ~140 servers/Administrator      | >1000 Servers/Administrator | 7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
  Building racks of servers & complex cooling systems all separately is not efficient.
  Package and deploy into bigger units, JITD
Comparing HPC and data center (DC) designs along five dimensions:
o Node and system architectures
  Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
Node capacity, at minimum: CPU 1.5-1.7 GHz x64; memory 1.7 GB; network 100+ Mbps; local storage 500 GB
Up to: CPU 8 cores; memory 14.2 GB; local storage 2+ TB
Azure Platform: Compute, a closer look
[Diagram: HTTP traffic passes through the load balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (main() { … }), each running in a VM with an Agent, all managed by the Fabric.]
Using queues for reliable messaging
To scale, add more of either role:
  1) The Web Role (ASP.NET, WCF, etc.) receives work
  2) The Web Role puts the work in the queue
  3) A Worker Role (main() { … }) gets work from the queue
  4) The Worker Role does the work
Queues are the application glue
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
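The "mask faults" point rests on the queue's visibility-timeout behavior: a message that is read but never deleted reappears, so a crashed worker's task is retried by another worker (at-least-once delivery). A toy sketch of that mechanism (our own class, not the Azure queue client):

```python
import uuid
from collections import deque

class ReliableQueue:
    """Sketch of queue-based reliable messaging with a visibility timeout."""
    def __init__(self, visibility_timeout=30.0):
        self.visible = deque()
        self.invisible = {}  # receipt -> (reappear_time, message)
        self.timeout = visibility_timeout

    def put(self, message):
        self.visible.append(message)

    def get(self, now):
        # First, return any expired in-flight messages to the visible queue
        for receipt, (reappear, msg) in list(self.invisible.items()):
            if now >= reappear:
                del self.invisible[receipt]
                self.visible.append(msg)
        if not self.visible:
            return None, None
        msg = self.visible.popleft()
        receipt = str(uuid.uuid4())
        self.invisible[receipt] = (now + self.timeout, msg)
        return msg, receipt

    def delete(self, receipt):
        # Called only after the work is durably finished
        self.invisible.pop(receipt, None)
```

Note the trade-off flagged in the best practices later in the deck: switching to direct inter-role TCP removes this retry safety net.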
Azure Platform: Storage, a closer look
[Diagram: applications reach Blobs, Drives, Tables, and Queues through a REST API over HTTP, behind the load balancer, alongside the Compute and Fabric layers.]
Points of interest
Storage types
  Blobs: simple interface for storing named files along with metadata for the file
  Drives: durable NTFS volumes
  Tables: entity-based storage (not relational): entities, which contain a set of properties
  Queues: reliable message-based communication
Access
  Data is exposed via .NET and RESTful interfaces
  Data can be accessed by Windows Azure apps and by other on-premise or cloud applications
[Diagram: the development workflow. Develop your app at work or home against the local Development Fabric and Development Storage, with source control managing versions; the application works locally, then in staging, then in the cloud.]
What is the 'Value Add'?
Provide a platform that is scalable and available
  Services are always running; rolling upgrades/downgrades
  Failure of any node is expected; state has to be replicated
  Failure of a role (app code) is expected; automatic recovery
  Services can grow to be large; provide state management that scales automatically
  Handle dynamic configuration changes due to load or failure
  Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers.
Fabric Controller
  Owns all data center hardware
  Uses inventory to host services
  Deploys applications to free resources
  Maintains the health of those applications
  Maintains the health of the hardware
  Manages the service life cycle starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 20
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
Design dimensions, HPC vs. data center (DC):
o Node and system architectures: node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems: HPC – local scratch small or non-existent, secondary is SAN or PFS, PB tertiary storage; DC – TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience: HPC – periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable; DC – loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
[Diagram] HTTP requests enter through the load balancer and are routed to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (main() { … }); an agent runs on every VM, and the Fabric manages them all.
Using queues for reliable messaging
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
To scale, add more of either role.
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently
• Enable resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
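The queue pattern above can be sketched concretely. Below is a minimal in-memory stand-in for an Azure-style queue (the real service is reached over REST or the .NET storage client; the class and function names here are illustrative), showing how reliable messaging masks worker faults: a dequeued message stays invisible until deleted, so a crashed worker's message reappears after the visibility timeout and is retried.

```python
import itertools

class ReliableQueue:
    """Minimal in-memory stand-in for a cloud queue with visibility timeouts."""
    _ids = itertools.count()

    def __init__(self):
        self.visible = []      # messages ready to be dequeued
        self.invisible = {}    # msg_id -> body, dequeued but not yet deleted

    def put(self, body):
        self.visible.append((next(self._ids), body))

    def get(self):
        """Dequeue a message; it stays invisible until deleted or timed out."""
        if not self.visible:
            return None
        msg_id, body = self.visible.pop(0)
        self.invisible[msg_id] = body
        return msg_id, body

    def delete(self, msg_id):
        """The worker finished: remove the message for good."""
        self.invisible.pop(msg_id)

    def expire_timeouts(self):
        """Visibility timeout elapsed: undeleted messages become visible again."""
        while self.invisible:
            msg_id, body = self.invisible.popitem()
            self.visible.append((msg_id, body))

def flaky_worker(queue, results, fail_first):
    """3) get work from the queue, 4) do the work; a 'crash' skips the delete."""
    msg = queue.get()
    if msg is None:
        return
    msg_id, body = msg
    if fail_first:
        return                   # crashed mid-task: message was never deleted
    results.append(body.upper())
    queue.delete(msg_id)         # delete only after the work is done

q = ReliableQueue()
q.put("render frame 1")          # 2) the web role puts work in the queue
done = []
flaky_worker(q, done, fail_first=True)    # first worker dies
q.expire_timeouts()                       # message reappears
flaky_worker(q, done, fail_first=False)   # another worker retries it
print(done)                               # -> ['RENDER FRAME 1']
```

Deleting only after the work completes is what makes the messaging reliable: the failure mode is at-least-once delivery, which is why workers should be designed to tolerate processing a task twice.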
A closer look
[Diagram] Applications in the compute fabric reach Windows Azure Storage – Blobs, Drives, Tables, Queues, … – over HTTP through a load balancer, via the REST API.
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational – entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram] Develop your app at work or home against the Development Fabric and Development Storage, with local source/version control; once the application works locally, run it in staging in the cloud.
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
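Steps 1–2 above amount to spreading role instances across fault and update domains. A sketch of the idea (simple round-robin placement, not the Fabric Controller's actual algorithm), using the slide's example of 10 front-ends across 5 update domains:

```python
def allocate(instances, fault_domains, update_domains):
    """Round-robin placement: instance i lands in fault domain i % F and
    update domain i % U, spreading the role across both dimensions."""
    return [
        {"instance": i,
         "fault_domain": i % fault_domains,
         "update_domain": i % update_domains}
        for i in range(instances)
    ]

# The slide's example: 10 front-ends across 5 update domains
placement = allocate(instances=10, fault_domains=2, update_domains=5)

# A rolling upgrade takes down one update domain at a time,
# so 8 of the 10 front-ends keep serving during each step.
for ud in range(5):
    serving = [p for p in placement if p["update_domain"] != ud]
    assert len(serving) == 8
```

Because the two modulo assignments are independent, losing any one fault domain (e.g. a rack) or updating any one update domain only removes a bounded fraction of the role's instances.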
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram] AzureMODIS pipeline: the AzureMODIS service web role portal feeds a download queue; data moves through the data collection stage, reprojection stage, derivation reduction stage, and analysis reduction stage to produce research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  – 100’s of HIV and HepC researchers actively use it
  – 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  – Requires a large number of test runs for a given job (1–10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure storage
3. Deploy worker roles – the Init() function downloads and decompresses the data and the BLAST executable to the local disk
Step 2. Partitioning a Job
[Diagram] The web role takes the user input, writes input partitions to Azure storage, and posts one queue message per partition to a single partitioning worker role.
Step 3. Doing the Work
[Diagram] BLAST-ready worker roles pick up queue messages, read their input partitions from Azure storage, and write BLAST output and logs back to storage.
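The partitioning step can be sketched as follows. This is an illustration, not AzureBLAST's actual code; the partition and batch sizes are made up, and it also shows the later best practice of batching several small tasks into one queue message:

```python
def partition_job(inputs, partition_size, batch_size):
    """Split a job's inputs into partitions, then batch several partitions
    into each queue message to amortize per-message overhead."""
    partitions = [inputs[i:i + partition_size]
                  for i in range(0, len(inputs), partition_size)]
    messages = [partitions[i:i + batch_size]
                for i in range(0, len(partitions), batch_size)]
    return partitions, messages

# Hypothetical input: 1000 query sequences
sequences = [f"seq-{i:04d}" for i in range(1000)]
partitions, messages = partition_job(sequences, partition_size=25, batch_size=4)
print(len(partitions), "partitions in", len(messages), "queue messages")
# -> 40 partitions in 10 queue messages
```

As the lessons below note, picking the partition size is a real performance decision: too small and queue/storage overhead dominates, too large and a single worker failure wastes hours of computation.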
• Always design with failure in mind – on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the optimal size may change depending on the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! – little Cloud development headaches are probably worth it
AzureBLAST scaling results:
Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
 8        0:26:00          2:33:23          2:00:14
 4        0:47:00          2:34:17          2:01:06
 2        1:27:00          2:31:39          1:59:13
[Chart: resources vs. time] Time-space fungibility in the Cloud – the same computation can use many workers briefly or few workers for longer.
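The time–space trade in the table can be quantified by computing speedup and parallel efficiency relative to the 2-worker run (clock durations taken from the table above):

```python
def to_minutes(hms):
    """Parse an h:mm:ss clock duration into minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

# Workers -> clock duration, read from the table above
runs = {2: "1:27:00", 4: "0:47:00", 8: "0:26:00", 16: "0:15:00", 25: "0:12:00"}

base_workers = 2
base = to_minutes(runs[base_workers])
for workers in sorted(runs):
    t = to_minutes(runs[workers])
    speedup = base / t
    efficiency = speedup * base_workers / workers
    print(f"{workers:>2} workers: {t:5.1f} min, "
          f"speedup {speedup:4.2f}x, efficiency {efficiency:5.1%}")
```

Efficiency falls as workers are added (the nearly constant "Total run time" column already hints at the fixed overhead), but since cloud billing is per core-hour, trading some efficiency for a much shorter wall clock is often a sensible deal.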
Utilizes a general job-based task manager which registers jobs and their resulting data.
[Diagram] A job definition is broken into tasks and registered in a registry; a registry broker bridges the local registry on the user premises (or internet) – where highly sensitive data stays with the user, alongside an (HPC) cluster, an administrator, and web management – and the Azure datacenters; users receive results and data products.
Client Visualization / Cloud Data and Computation
The Cloud is not a jack-of-all-trades.
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Diagram] Account → container → blob: account “jared” holds containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI), e.g.
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate metadata with a container
Metadata can be up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram] The same account → container → blob hierarchy, one level deeper: each blob consists of blocks or pages (Block or Page 1, 2, 3, …); blocks are identified by Block IDs (Block Id 1 … Block Id N).
Uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each; each block can be variable size; each block has a 64-byte ID, scoped by blob name and stored with the blob.
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
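The two-phase block semantics – upload uncommitted blocks, then commit a block list – can be modeled in a few lines. This is a sketch of the semantics, not the storage service's implementation:

```python
class BlockBlob:
    """Sketch of block-blob commit semantics: readers only ever see the
    last committed block list; uploads stage uncommitted blocks."""
    def __init__(self):
        self.uncommitted = {}    # block_id -> bytes, staged by PutBlock
        self.committed = {}      # block_id -> bytes, in the committed set
        self.block_list = []     # ordered ids making up the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        """Commit: ids may come from the uncommitted or committed set."""
        new = {bid: self.uncommitted.get(bid, self.committed.get(bid))
               for bid in block_ids}
        self.committed, self.uncommitted = new, {}
        self.block_list = list(block_ids)

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""            # nothing visible before commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"
blob.put_block("b2", b"azure")       # replace one block...
blob.put_block_list(["b1", "b2"])    # ...reusing the committed b1
assert blob.read() == b"hello azure"
```

The payoff of the design is that a partially failed upload never corrupts the readable blob, and an update can reuse already-committed blocks instead of re-uploading them.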
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space).
Random access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
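The worked example above can be replayed against a small model of a sparse page store (a sketch with illustrative fill bytes; the real service tracks 512-byte pages server-side):

```python
class PageBlob:
    """Sparse page store: only written pages consume space; reads of
    unwritten ranges return zeros."""
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.pages = {}           # page offset -> 512 bytes

    def put_page(self, start, end, fill):
        for off in range(start, end, self.PAGE):
            self.pages[off] = bytes([fill]) * self.PAGE

    def clear_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self):
        """Coalesce written pages into [start, end) ranges."""
        ranges = []
        for off in sorted(self.pages):
            if ranges and ranges[-1][1] == off:
                ranges[-1] = (ranges[-1][0], off + self.PAGE)
            else:
                ranges.append((off, off + self.PAGE))
        return ranges

    def get_blob(self, start, end):
        out = bytearray()
        for off in range(start, end):
            page = self.pages.get(off - off % self.PAGE)
            out.append(page[off % self.PAGE] if page else 0)
        return bytes(out)

# The sequence from the slide:
b = PageBlob(10 * 2**30)
b.put_page(512, 2048, fill=1)
b.put_page(0, 1024, fill=2)
b.clear_page(512, 1536)
b.put_page(2048, 2560, fill=3)
print(b.get_page_ranges())            # -> [(0, 512), (1536, 2560)]
head = b.get_blob(1000, 2048)
assert head[:536] == bytes(536)       # zeros for [1000, 1536)
assert head[536:] == bytes([1]) * 512 # the data stored in [1536, 2048)
```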
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
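Because only PartitionKey and RowKey are indexed, point lookups and single-partition scans are cheap while anything else is a full scan. A minimal model (keys and property names are illustrative):

```python
class Table:
    """Entities keyed by (PartitionKey, RowKey); other properties form a
    free-form bag, as in Azure tables (no fixed schema, no other index)."""
    def __init__(self):
        self.rows = {}    # (partition_key, row_key) -> properties dict

    def insert(self, partition_key, row_key, **properties):
        self.rows[(partition_key, row_key)] = properties

    def get(self, partition_key, row_key):
        """Indexed point lookup on the two key properties."""
        return self.rows.get((partition_key, row_key))

    def scan_partition(self, partition_key):
        """Efficient: all entities in one partition live together."""
        return [props for (pk, _), props in sorted(self.rows.items())
                if pk == partition_key]

t = Table()
t.insert("genre-scifi", "0001", title="Solaris", year=1972)
t.insert("genre-scifi", "0002", title="Sunshine", year=2007)
t.insert("genre-drama", "0001", title="Ikiru", year=1952)
assert t.get("genre-scifi", "0002")["title"] == "Sunshine"
assert len(t.scan_partition("genre-scifi")) == 2
```

Choosing the partition key is therefore the main design decision: it determines both which queries are fast and how the table can be spread across servers.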
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
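The first item in the list – retry logic around every data access – usually means retrying transient failures with exponential backoff. A generic sketch (the exception type and delays are illustrative):

```python
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Call operation(); on a transient error, wait and retry with
    exponential backoff; re-raise once all attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# A stand-in for a storage read that fails transiently twice
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient storage error")
    return "blob contents"

assert with_retries(flaky_read) == "blob contents"
assert calls["n"] == 3
```

In a real service the caught exception should be narrowed to the storage client's transient error types, and the base delay raised to a human-scale value.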
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wish.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 21
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are often unable to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can ask grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 23
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs, up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the account/container/blob namespace (account "jared", containers "images" and "movies"), each blob is itself composed of blocks or pages: Block or Page 1, 2, 3, ..., identified by Block Id 1 through Block Id N.]
Example: uploading a 10 GB movie as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

TheBlob.wmv is then committed in Windows Azure Storage.
Blocks can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
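A toy model of these commit semantics (our own Python sketch; the real service exposes PutBlock/PutBlockList over REST):

```python
class BlockBlob:
    """Toy model of block blob update semantics: blocks are staged
    uncommitted, then PutBlockList commits a readable version."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged but not readable
        self.committed = {}     # block id -> bytes, part of a committed version
        self.block_list = []    # ordered ids making up the readable blob

    def put_block(self, block_id, data):
        # PutBlock: stage an uncommitted block under its block id.
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # PutBlockList: ids may come from the uncommitted or committed set.
        store = {**self.committed, **self.uncommitted}
        self.committed = {bid: store[bid] for bid in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        # Only the committed block list is readable.
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""               # staged blocks are not yet readable
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"    # commit makes the version readable
```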
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes.

Random access operations over the 10 GB address space:
  PutPage [512, 2048)
  PutPage [0, 1024)
  ClearPage [512, 1536)
  PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)

GetBlob [1000, 2048) returns:
  all 0 for the first 536 bytes
  the next 512 bytes are the data stored in [1536, 2048)
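The example above can be reproduced with a toy page store (a sketch of the semantics, not the service API; a fill byte of 0x01 stands in for written data):

```python
PAGE = 512  # fixed page size in bytes

class PageBlob:
    """Toy model of page blob semantics: a sparse set of fixed-size pages;
    unwritten or cleared regions read back as zeros."""
    def __init__(self, size):
        self.size = size
        self.pages = {}                       # page start offset -> bytes

    def put_page(self, start, end, fill=b"\x01"):
        # Writes are page-aligned and take effect immediately.
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_range(self, start, end):
        # Coalesce consecutive valid pages into [start, end) ranges.
        ranges = []
        for off in range(start, end, PAGE):
            if off in self.pages:
                if ranges and ranges[-1][1] == off:
                    ranges[-1][1] = off + PAGE
                else:
                    ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

    def get_blob(self, start, end):
        # Byte-granular read; invalid regions are all zeros.
        out = bytearray()
        for off in range(start, end):
            page = self.pages.get(off - off % PAGE)
            out.append(page[off % PAGE] if page else 0)
        return bytes(out)

blob = PageBlob(10 * 2**30)                  # 10 GB address space
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
assert blob.get_page_range(0, 4096) == [(0, 512), (1536, 2560)]
data = blob.get_blob(1000, 2048)
assert data[:536] == bytes(536)              # zeros up to offset 1536
assert all(x == 1 for x in data[536:])       # 512 bytes of stored data
```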
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
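A toy model of entity addressing (our own sketch; real tables are accessed via ADO.NET Data Services or REST). PartitionKey and RowKey are the only indexed properties, so point lookups and single-partition scans are the efficient operations:

```python
from datetime import datetime, timezone

class Table:
    """Toy model of Azure Table semantics: each entity is a property bag
    addressed by the required PartitionKey and RowKey properties."""
    def __init__(self):
        self.entities = {}                    # (PartitionKey, RowKey) -> entity

    def insert(self, entity):
        key = (entity["PartitionKey"], entity["RowKey"])
        assert key not in self.entities, "PartitionKey+RowKey must be unique"
        entity["Timestamp"] = datetime.now(timezone.utc)  # maintained by store
        self.entities[key] = entity

    def get(self, partition_key, row_key):
        # Point lookup on the only indexed properties.
        return self.entities[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # A scan within a single partition stays on one partition server.
        return [e for (pk, _), e in self.entities.items() if pk == partition_key]

movies = Table()
movies.insert({"PartitionKey": "2008", "RowKey": "MOV1.AVI", "Title": "Ocean"})
movies.insert({"PartitionKey": "2008", "RowKey": "MOV2.AVI", "Title": "MODIS"})
assert movies.get("2008", "MOV1.AVI")["Title"] == "Ocean"
assert len(movies.query_partition("2008")) == 2
```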
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
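The first testing guideline can be sketched as a wrapper that retries every data access with exponential backoff (a generic pattern, not an Azure API; names are ours):

```python
import time

def with_retries(op, attempts=4, base_delay=0.05):
    """Run a storage operation, retrying transient faults with
    exponential backoff before giving up."""
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise                          # out of attempts: surface fault
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}
def flaky_read():
    # Fails twice with a transient fault, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

assert with_retries(flaky_read) == "data"
assert calls["n"] == 3
```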
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center:

Technology       Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network          $95 per Mbps/month                $13 per Mbps/month            7.1
Storage          $2.20 per GB/month                $0.40 per GB/month            5.7
Administration   ~140 servers/Administrator        >1000 servers/Administrator   7.1
Each data center is 11.5 times the size of a football field
Conquering complexity
Building racks of servers & complex cooling systems all separately is not efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
  - DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At Minimum: CPU 1.5-1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local Storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local Storage 2+ TB
Azure Platform: Compute and Storage. A closer look at Compute:
[Diagram: HTTP requests pass through the Load Balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (main() { … }); each role instance runs in a VM with an Agent, on the Fabric.]
Using queues for reliable messaging (to scale, add more of either role):
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
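The four numbered steps can be sketched with Python's standard-library queue standing in for an Azure queue (an illustration of the decoupling pattern, not the Azure API):

```python
import queue
import threading

# Stand-ins for Azure queues: the Web Role enqueues work items, Worker
# Roles dequeue and process them; scale by adding more of either role.
work = queue.Queue()
results = queue.Queue()

def web_role(n_items):
    for item in range(n_items):       # 1) receive work  2) put it in the queue
        work.put(item)

def worker_role():
    while True:
        item = work.get()             # 3) get work from the queue
        if item is None:              # shutdown signal
            break
        results.put(item * item)      # 4) do the work
        work.task_done()

workers = [threading.Thread(target=worker_role) for _ in range(4)]
for w in workers:
    w.start()
web_role(100)
work.join()                           # wait until every item is processed
for _ in workers:
    work.put(None)
for w in workers:
    w.join()

assert sorted(results.queue) == [i * i for i in range(100)]
```

Because the roles share nothing but the queue, either side can be scaled out independently, and a crashed worker simply leaves its message for another worker.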
[Diagram: Azure storage, a closer look. Applications on Compute reach Blobs, Drives, Tables, and Queues through a REST API over HTTP, via the Load Balancer, on the storage Fabric.]
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app at work or home against the Development Fabric and Development Storage, under local source/version control; the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
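The deck's 10-front-ends example can be sketched as simple round-robin placement (illustrative only; the real FC also balances across fault domains and racks):

```python
def allocate(instances, update_domains):
    """Round-robin role instances across update domains so a rolling
    update (one domain at a time) never stops the whole role."""
    domains = [[] for _ in range(update_domains)]
    for i in range(instances):
        domains[i % update_domains].append(f"frontend-{i}")
    return domains

# The deck's example: 10 front-ends across 5 update domains.
domains = allocate(10, 5)
assert [len(d) for d in domains] == [2, 2, 2, 2, 2]

# Rolling forward one domain at a time keeps 8 of 10 instances serving.
for rolling in domains:
    assert 10 - len(rolling) == 8
```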
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: AzureMODIS service. A Portal (service Web Role) drives a download queue; the data collection stage feeds the reprojection, derivation reduction, and analysis reduction stages, producing research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers: 100’s of HIV and HepC researchers actively use it; 1000’s of research communities rely on the results
(Cover of PLoS Biology, November 2008)
• Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
• Requires a large number of test runs for a given job (1-10M tests)
• Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure Store
3. Deploy Worker Roles; the Init() function downloads and decompresses the data to the local disk, alongside the BLAST executable
Step 2. Partitioning a Job
[Diagram: the Web Role places the user input in Azure Storage; a single partitioning Worker Role splits it into input partitions, with a queue message per partition.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partition from Azure Storage, and write BLAST output and logs back to Azure Storage.]
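A sketch of the partitioning step (the message format here is our own invention for illustration; the deck does not specify one):

```python
def partition_job(sequences, partition_size):
    """Split a job's input sequences into fixed-size partitions and
    build one queue message per partition."""
    partitions = [sequences[i:i + partition_size]
                  for i in range(0, len(sequences), partition_size)]
    messages = [{"job": "blast-001", "partition": n, "count": len(p)}
                for n, p in enumerate(partitions)]
    return partitions, messages

sequences = [f"seq{i}" for i in range(10)]
partitions, messages = partition_job(sequences, 4)
assert [len(p) for p in partitions] == [4, 4, 2]
assert messages[2] == {"job": "blast-001", "partition": 2, "count": 2}
```

Partition size matters: as the lessons below note, factoring work into optimal sizes has a large performance impact, and the optimum changes with the scope of the job.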
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 25
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations:
always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture:
stateless roles and durable queues
Windows Azure frees service developers from many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
...
AzureMODIS Service Web Role Portal
Download Queue → Data Collection Stage → Reprojection Stage → Derivation Reduction Stage → Analysis Reduction Stage → Research Results
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
Typical job, 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
– Requires a large number of test runs for a given job (1 – 10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure Storage
3. Deploy Worker Roles carrying the BLAST executable; the Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
The Web Role takes the user input; a single partitioning Worker Role writes each input partition to Azure Storage and enqueues a queue message for it.
Step 3. Doing the Work
BLAST-ready Worker Roles pick up the queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
[Chart: Resources vs. Time, illustrating time-space fungibility in the Cloud]
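One way to read the timing table above is to check the time-space fungibility claim numerically: parallel efficiency (computational run time divided by workers × clock duration) can be computed directly from the rows. The helper below is illustrative, not from the AzureBLAST code.

```python
# Reading the AzureBLAST timing table: efficiency =
# computational run time / (workers x wall-clock duration).

def seconds(hms):
    """Parse an H:MM:SS string into seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

runs = [  # (workers, clock duration, computational run time)
    (25, "0:12:00", "1:49:43"),
    (16, "0:15:00", "1:53:47"),
    (8,  "0:26:00", "2:00:14"),
    (4,  "0:47:00", "2:01:06"),
    (2,  "1:27:00", "1:59:13"),
]

for workers, clock, comp in runs:
    eff = seconds(comp) / (workers * seconds(clock))
    print(f"{workers:>2} workers: efficiency {eff:.2f}")
# Efficiency falls as workers grow (coordination overhead), but wall-clock
# time drops from 1:27 to 0:12 for roughly the same total computation.
```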
Utilizes a general job-based task manager, which registers jobs and their resulting data
[Diagram: a job definition fans out into tasks whose data products feed a registry; a Registry Broker and an (HPC) cluster administrator connect the user premises (or internet), where highly sensitive data and a local registry remain, to the Azure datacenters; the user reaches results through web management.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose the geo-location that hosts the storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account: jared
Containers: images, movies
Blobs: PIC01.JPG and PIC02.JPG (in images); MOV1.AVI (in movies)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate metadata with the container
Metadata: up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account: jared
Containers: images (PIC01.JPG, PIC02.JPG), movies (MOV1.AVI)
Each blob is made up of blocks or pages: Block or Page 1, 2, 3, …, identified by Block ID 1 through Block ID N
Uploading a 10 GB movie to Windows Azure Storage:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block, identified by its block ID, for the blob
Block List Operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update the blob
GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
Block ID and size are returned for each block
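The two-phase PutBlock / PutBlockList behavior can be sketched as a toy in-memory model; this is an illustration of the semantics described above, not the real service API.

```python
# Toy model of block-blob semantics: blocks are staged uncommitted by
# PutBlock and only become readable once PutBlockList commits an ordered
# list of block IDs.

class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged by PutBlock
        self.committed = {}     # block id -> bytes, part of the readable blob
        self.block_list = []    # ordered ids making up the readable version

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, ids):
        # May draw on committed or uncommitted blocks to build the new version
        store = {**self.committed, **self.uncommitted}
        self.committed = {i: store[i] for i in ids}
        self.block_list = list(ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[i] for i in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""            # nothing readable until commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"
```

Because readers only ever see a committed block list, an interrupted upload leaves the previous version of the blob intact.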
Create MyBlob
Specify blob size = 10 GBytes; fixed page size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
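The page-blob example above can be traced with a toy valid-range tracker; the function names are illustrative, not the storage API.

```python
# Trace the PutPage / ClearPage / GetPageRange example with a small
# valid-range tracker over half-open byte ranges [start, end).

def apply(ranges, start, end, valid):
    """Mark [start, end) valid (PutPage) or invalid (ClearPage)."""
    out = []
    for s, e in ranges:
        if e <= start or s >= end:       # untouched by this operation
            out.append((s, e))
        else:                            # clip away the overlapping part
            if s < start:
                out.append((s, start))
            if e > end:
                out.append((end, e))
    if valid:
        out.append((start, end))
    out.sort()
    merged = []                          # merge adjacent/overlapping ranges
    for s, e in out:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

ranges = []
ranges = apply(ranges, 512, 2048, True)    # PutPage [512, 2048)
ranges = apply(ranges, 0, 1024, True)      # PutPage [0, 1024)
ranges = apply(ranges, 512, 1536, False)   # ClearPage [512, 1536)
ranges = apply(ranges, 2048, 2560, True)   # PutPage [2048, 2560)
assert ranges == [(0, 512), (1536, 2560)]  # matches GetPageRange [0, 4096)
# GetBlob [1000, 2048): bytes [1000, 1536) are invalid, so 536 zeros,
# followed by the 512 valid bytes stored in [1536, 2048).
```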
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by the account
A table is a set of entities (i.e., rows)
Entity
A set of properties (columns)
Required properties: PartitionKey, RowKey and Timestamp
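A quick way to see why PartitionKey and RowKey matter: a table behaves like a store keyed on exactly that pair, so lookups by those keys are direct while filters on any other property must scan. The entity shapes below are invented for illustration.

```python
# Sketch of the Table data model: entities are property bags addressed
# by (PartitionKey, RowKey); only that pair is indexed.

table = {}  # (PartitionKey, RowKey) -> entity

def insert(entity):
    table[(entity["PartitionKey"], entity["RowKey"])] = entity

insert({"PartitionKey": "HIV", "RowKey": "job-001", "Status": "done"})
insert({"PartitionKey": "HIV", "RowKey": "job-002", "Status": "queued"})

# Point lookup on the indexed keys: direct access
entity = table[("HIV", "job-001")]
assert entity["Status"] == "done"

# Filtering on any other property means scanning every entity
queued = [v for v in table.values() if v["Status"] == "queued"]
assert len(queued) == 1
```

This is also the intuition behind the later best practice that Azure tables only index on partition and row keys: choose them so your common queries become point lookups or single-partition scans.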
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
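The "include retry logic in all instances where you are accessing data" practice is usually implemented as retry with exponential backoff; a minimal sketch follows, where the flaky_read function and the retry limits are invented for illustration.

```python
# Minimal retry-with-backoff sketch for transient storage faults.
import time

def with_retries(op, attempts=4, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise                              # out of retries: surface the fault
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

calls = {"n": 0}

def flaky_read():
    """Stand-in for a storage read that fails transiently twice."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return b"payload"

assert with_retries(flaky_read) == b"payload"
assert calls["n"] == 3
```

Pairing this with idempotent workers keeps a retried task from corrupting state when the original attempt actually succeeded.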
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as “killer micros” and inexpensive clusters did
Data centers range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1000 servers) and a larger, 100K server center:
Technology       Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network          $95 per Mbps/month                $13 per Mbps/month            7.1
Storage          $2.20 per GB/month                $0.40 per GB/month            5.7
Administration   ~140 servers/Administrator        >1000 Servers/Administrator   7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers and complex cooling systems all separately is not efficient.
Package and deploy into bigger units, JITD.
o Node and system architectures
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local Storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local Storage 2+ TB
Azure Platform: Compute and Storage
A closer look at Compute:
HTTP requests pass through the Load Balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (main() { … }); an Agent runs in each VM, managed by the Fabric.
Using queues for reliable messaging
To scale, add more of either role
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) The Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
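The four-step flow above can be simulated with a local queue standing in for an Azure queue; the role functions here are stand-ins, not Azure SDK calls.

```python
# Web role / queue / worker role pattern, simulated in-process:
# the queue decouples the two roles so each can scale independently.
import queue

work_queue = queue.Queue()

def web_role_receive(request):
    """1) receive work; 2) put work in the queue."""
    work_queue.put(request)

def worker_role_step():
    """3) get work from the queue; 4) do the work."""
    item = work_queue.get()
    return item.upper()         # stand-in for the real processing

web_role_receive("blast partition 7")
assert worker_role_step() == "BLAST PARTITION 7"
# Scaling means more web roles enqueueing and more worker roles
# dequeuing; the queue buffers bursts and masks worker failures.
```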
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
A closer look at Storage:
Applications access Blobs, Drives, Tables, and Queues over HTTP through a REST API behind the Load Balancer.
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Lifecycle diagram: develop your app at work or home against the local Development Fabric and Development Storage, with source control for versioning; the application works locally, then in staging in the cloud.]
Slide 27
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
Comparing data centers (DC) with classic HPC systems along five dimensions:
o Node and system architectures
  Node architectures are largely indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  HPC: local scratch small or non-existent; secondary is SAN or PFS; PB tertiary storage
  DC: TBs of local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
(Diagram: HTTP requests pass through a load balancer to Web Roles (IIS hosting ASP.NET, WCF, etc.); Worker Roles run a main() { … } loop; each VM runs an agent, all managed by the fabric.)
Using queues for reliable messaging
(Diagram: 1) the Web Role (ASP.NET, WCF, etc.) receives work; 2) puts work in a queue; 3) a Worker Role (main() { … }) gets work from the queue; 4) does the work. To scale, add more of either role.)
Queues are the application glue
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
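The queue pattern above can be sketched with a minimal in-memory stand-in for an Azure queue. This is illustrative only (class and method names like `WorkQueue.requeue_invisible` are invented here, not the Azure SDK); the point is the reliability contract: a message only disappears when the worker deletes it after finishing the work.

```python
import collections

class WorkQueue:
    """Toy stand-in for an Azure queue: a message must be deleted after
    processing, or it becomes visible again (reliable messaging)."""
    def __init__(self):
        self._visible = collections.deque()
        self._invisible = {}
        self._next_id = 0

    def put(self, body):
        self._visible.append(body)

    def get(self):
        # Message becomes invisible until deleted; a crashed worker
        # never deletes it, so requeue_invisible() can restore it.
        if not self._visible:
            return None
        body = self._visible.popleft()
        self._next_id += 1
        self._invisible[self._next_id] = body
        return self._next_id, body

    def delete(self, msg_id):
        self._invisible.pop(msg_id)

    def requeue_invisible(self):
        # Simulates the visibility timeout expiring for unfinished work.
        for body in self._invisible.values():
            self._visible.append(body)
        self._invisible.clear()

# Web role puts work in the queue; worker roles get, do, then delete.
q = WorkQueue()
for task in ["align:chunk1", "align:chunk2"]:
    q.put(task)

msg_id, body = q.get()
result = body.upper()      # stand-in for "do work"
q.delete(msg_id)           # delete only after the work succeeded
```

Because the delete happens after the work, a worker that dies mid-task simply lets the message reappear for another worker: that is what masks worker-role faults.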
A closer look
(Diagram: applications reach Windows Azure Storage – Blobs, Drives, Tables, Queues – through a load balancer, via a REST API over HTTP; compute and storage both sit on the fabric.)
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
(Diagram: develop your app at work or at home against the local Development Fabric and Development Storage, keep versions in source control, confirm the application works locally, then that it works in cloud staging.)
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
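The example above – 10 front-ends spread across 5 update domains – can be sketched as a round-robin assignment. This is a toy model of the allocation idea, not the Fabric Controller's actual algorithm, and the `frontend-N` names are invented:

```python
def allocate(instances, update_domains):
    """Round-robin role instances across update domains so that
    updating one domain at a time leaves the rest serving traffic."""
    domains = {d: [] for d in range(update_domains)}
    for i in range(instances):
        domains[i % update_domains].append(f"frontend-{i}")
    return domains

domains = allocate(10, 5)
# Each update domain gets 2 of the 10 front-ends; rolling an update
# forward one domain at a time keeps 8 of 10 instances up.
```

The same round-robin idea applies to fault domains, except there the goal is that a rack or power failure takes out only one domain's worth of instances.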
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
(Diagram: the AzureMODIS service – a web role portal – drives a pipeline of stages: download queue, data collection stage, reprojection stage, derivation reduction stage, analysis reduction stage, yielding research results.)
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  100’s of HIV and HepC researchers actively use it
  1000’s of research communities rely on the results
Cover of PLoS Biology, November 2008
Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
– Requires a large number of test runs for a given job (1–10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload to the Azure store
3. Deploy worker roles, along with the BLAST executable – an Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
(Diagram: the Web Role takes the user input; a single partitioning Worker Role writes input partitions to Azure storage and enqueues one queue message per partition.)
Step 3. Doing the Work
(Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure storage, and write BLAST output and logs back to storage.)
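The partition-then-queue flow of Steps 2 and 3 can be sketched as follows. This is a simplified stand-in, not the actual AzureBLAST code; `partition` and `worker` are invented names, and the worker just tags sequences instead of running the BLAST executable:

```python
def partition(sequences, partition_size):
    """Split the user's input into fixed-size partitions; each partition
    becomes one queue message for a BLAST-ready worker role."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def worker(part):
    # Stand-in for running the BLAST executable over one partition
    # and writing its output back to storage.
    return [f"hit:{seq}" for seq in part]

queries = [f"seq{i}" for i in range(10)]
messages = partition(queries, 4)        # 3 messages: 4 + 4 + 2 sequences
results = [worker(m) for m in messages]
```

As the lessons below note, the choice of `partition_size` is the main performance knob: too small and queue overhead dominates, too large and a single failure wastes a lot of work.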
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Resources
Workers | Clock duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13
(Chart: time–space fungibility in the cloud – resources versus time; the same total computation can run on more workers for less wall-clock time.)
Utilizes a general jobs-based task manager which registers jobs and their resulting data
(Diagram: a job definition fans out into tasks; a registry and registry broker bridge the user premises (or internet) – user, local registry, web management, administrator, an (HPC) cluster, and highly sensitive data that stays local – with the Azure data centers, which return data products and results.)
Client Visualization / Cloud Data and Computation
The Cloud is not a jack-of-all-trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
(Diagram: storage account “jared” holds containers “images” and “movies”; “images” contains blobs PIC01.JPG and PIC02.JPG, “movies” contains MOV1.AVI. Blob URL: http://jared.blob.core.windows.net/images/PIC01.JPG)
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate metadata with a container
Metadata are up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
(Diagram: within account “jared”, containers “images” and “movies” hold blobs PIC01.JPG, PIC02.JPG and MOV1.AVI; each blob is a sequence of blocks or pages – Block Id 1, Block Id 2, …, Block Id N.)
10 GB movie example – upload the blocks, then commit the block list:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

(Diagram: the committed blocks become TheBlob.wmv in Windows Azure Storage.)
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
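The PutBlock / PutBlockList semantics can be sketched with a toy model. This is illustrative only (not the storage client library): the key behavior is that staged blocks are invisible to readers until a block list commits them, and a commit may mix uncommitted and previously committed blocks.

```python
class BlockBlob:
    """Toy model of block blob semantics: put_block stages blocks,
    put_block_list commits an ordered list as the readable blob."""
    def __init__(self):
        self.uncommitted = {}      # staged blocks, not readable yet
        self.committed = {}        # block_id -> bytes
        self.order = []            # committed block order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Each listed ID may come from the uncommitted or committed set.
        new = {}
        for bid in block_ids:
            new[bid] = self.uncommitted.get(bid, self.committed.get(bid))
        self.committed, self.order = new, list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.order)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""          # nothing readable before the commit
blob.put_block_list(["b1", "b2"])
```

This commit-at-the-end model is what makes a partial upload safe: readers never see a half-written blob, only the last committed block list.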
Create MyBlob
Specify blob size = 10 GB, fixed page size = 512 bytes (a 10 GB address space)
Random access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
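The sequence above can be checked with a small simulation of page blob semantics. A sketch only (the real service tracks pages server-side): pages are 512 bytes, all ranges are half-open, and PutPage/ClearPage apply immediately.

```python
PAGE = 512

class PageBlob:
    """Toy page blob: put_page writes whole pages immediately, clear_page
    frees them; get_page_range reports which byte ranges hold valid data."""
    def __init__(self, size):
        self.size = size
        self.pages = {}             # page index -> 512-byte bytearray

    def put_page(self, start, end, fill=b"x"):
        for p in range(start // PAGE, end // PAGE):
            self.pages[p] = bytearray(fill * PAGE)

    def clear_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.pages.pop(p, None)

    def get_page_range(self, start, end):
        # Coalesce adjacent valid pages into [start, end) byte ranges.
        ranges, run = [], None
        for p in range(start // PAGE, end // PAGE):
            if p in self.pages:
                if run is None:
                    run = [p * PAGE, (p + 1) * PAGE]
                else:
                    run[1] = (p + 1) * PAGE
            elif run is not None:
                ranges.append(tuple(run))
                run = None
        if run is not None:
            ranges.append(tuple(run))
        return ranges

# Replay the slide's example against the model.
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
```

Running GetPageRange[0, 4096) against this model reproduces the slide's answer, [0,512) and [1536,2560): the clear wiped out pages at 512 and 1024, while the second PutPage had re-written [0,1024) first.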
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
(Diagram: promoting a snapshot of MyBlob restores that prior version as the base blob.)
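The snapshot semantics above can be sketched as follows. This is a toy model (the real service stores only deltas and exposes snapshots via ListBlobs); it shows just the contract: writes hit the base blob name, snapshots are read-only versions, and promotion restores one.

```python
class SnapshotBlob:
    """Toy model of blob snapshots: writes go to the base blob,
    snapshots capture read-only versions, promotion restores one."""
    def __init__(self, content=b""):
        self.base = content
        self.snapshots = []          # oldest first

    def write(self, content):
        self.base = content          # all writes apply to the base name

    def snapshot(self):
        self.snapshots.append(self.base)
        return len(self.snapshots) - 1

    def promote(self, snap_id):
        self.base = self.snapshots[snap_id]  # restore a prior version

blob = SnapshotBlob(b"v1")
s = blob.snapshot()
blob.write(b"v2")
blob.promote(s)                      # base is b"v1" again
```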
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
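The required properties can be illustrated with a toy table keyed by (PartitionKey, RowKey). A sketch of the data model only, not the Table service API; the entity property names beyond the three required ones are invented:

```python
import time

class Table:
    """Entities are property dicts; (PartitionKey, RowKey) uniquely
    identifies an entity, and the service stamps a Timestamp."""
    def __init__(self):
        self.entities = {}

    def insert(self, entity):
        key = (entity["PartitionKey"], entity["RowKey"])
        entity["Timestamp"] = time.time()
        self.entities[key] = entity

    def query_partition(self, partition_key):
        # Only PartitionKey/RowKey are indexed, so key-based lookups are
        # the efficient access path; other properties require a scan.
        return [e for (pk, _), e in self.entities.items()
                if pk == partition_key]

t = Table()
t.insert({"PartitionKey": "hiv", "RowKey": "job-001", "cpu_hours": 12})
t.insert({"PartitionKey": "hiv", "RowKey": "job-002", "cpu_hours": 18})
t.insert({"PartitionKey": "hepc", "RowKey": "job-001", "cpu_hours": 9})
```

Partitions are also the unit of scale-out: entities sharing a PartitionKey live together, which is why the best-practice list below warns that tables index only on partition and row keys.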
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
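The retry advice above can be sketched as a generic helper with exponential backoff. Illustrative only (names like `with_retries` are invented); a real service would also distinguish transient from permanent errors before retrying:

```python
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Retry a data-access operation, doubling the delay each attempt."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise                     # out of retries: surface the fault
            time.sleep(base_delay * (2 ** attempt))

# A flaky operation that fails twice before succeeding.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return b"payload"

data = with_retries(flaky_read)
```

Wrapping every storage and queue access this way is what turns "failure of any node is expected" from a platform warning into routine behavior in your own code.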
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 28
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
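The "analysis as a service (think SQL, Map-Reduce, R/MatLab)" idea above can be illustrated with a minimal map-reduce skeleton in plain Python — this is the style of analysis, not any specific Azure service:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal map-reduce skeleton: map each record to (key, value)
    pairs, group values by key, then reduce each group to one value."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# word count over lines of client-captured text (e.g. exported from Excel)
lines = ["cloud data", "cloud analysis"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=sum)
```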
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
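A back-of-envelope check on the offering above: 20 million core-hours per year corresponds to roughly 2,283 cores running around the clock (the interpretation as sustained cores is mine, not the slide's):

```python
# 20 million core-hours per year, spread over every hour of the year
core_hours_per_year = 20_000_000
hours_per_year = 365 * 24            # 8,760 hours in a (non-leap) year
sustained_cores = core_hours_per_year / hours_per_year
```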
http://research.microsoft.com/azure
[email protected]
Slide 30
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
("US Anywhere", "US North Central", "US South Central", ...)
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account: jared
  Container: images, holding blobs PIC01.JPG and PIC02.JPG
  Container: movies, holding blob MOV1.AVI
Example blob URL: http://jared.blob.core.windows.net/images/PIC01.JPG
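The account/container/blob hierarchy maps directly onto the blob URL shown above; a tiny helper (hypothetical, not part of any SDK) makes the scheme explicit:

```python
def blob_url(account, container, blob):
    """Hypothetical helper: the URL scheme implied by the hierarchy above."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

url = blob_url("jared", "images", "PIC01.JPG")
```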
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate metadata with a container
Metadata are name/value pairs, up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same account/container/blob hierarchy (account jared; containers images and movies; blobs PIC01.JPG, PIC02.JPG, MOV1.AVI), with each blob, e.g. a 10 GB movie, stored as a sequence of blocks or pages: Block/Page 1, 2, 3, ..., N, each block identified by its Block ID.]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
[Diagram: the blocks are committed as TheBlob.wmv in Windows Azure Storage.]
Blocks can be up to 4MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
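The point of the two-phase PutBlock/PutBlockList protocol is that readers never observe a partially uploaded blob. A toy Python model of just those commit semantics (not the real service API):

```python
class BlockBlob:
    """Toy model: blocks are staged uncommitted; PutBlockList atomically
    defines the readable version of the blob."""
    def __init__(self):
        self.uncommitted = {}  # block id -> data, staged but invisible
        self.committed = {}    # block id -> data, part of the readable blob
        self.block_list = []   # ordered ids defining the readable content

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The list may mix uncommitted and previously committed blocks.
        self.committed = {
            b: self.uncommitted[b] if b in self.uncommitted else self.committed[b]
            for b in block_ids
        }
        self.block_list = list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[b] for b in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""              # nothing visible before the commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"   # the commit makes both blocks readable
```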
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
[Diagram: random access operations against the 10 GB address space, with page boundaries at 0, 512, 1024, 1536, 2048, 2560, ...]
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536, 2048)
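The valid-range behavior above can be modeled in a few lines; replaying the slide's operation sequence reproduces the GetPageRange result. This is an illustrative model, not the storage service:

```python
PAGE = 512

class PageBlob:
    """Toy model of page-blob valid ranges (offsets must be page-aligned)."""
    def __init__(self, size):
        self.size = size
        self.valid = set()   # set of valid page-start offsets

    def put_page(self, start, end):
        self.valid |= set(range(start, end, PAGE))

    def clear_page(self, start, end):
        self.valid -= set(range(start, end, PAGE))

    def get_page_ranges(self):
        """Coalesce valid pages into [start, end) ranges, as GetPageRange does."""
        ranges, run = [], None
        for off in sorted(self.valid):
            if run and off == run[1]:
                run = (run[0], off + PAGE)
            else:
                if run:
                    ranges.append(run)
                run = (off, off + PAGE)
        if run:
            ranges.append(run)
        return ranges

# Replay the operation sequence from the slide:
b = PageBlob(10 * 2**30)
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
assert b.get_page_ranges() == [(0, 512), (1536, 2560)]
```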
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: MyBlob with its snapshots; Promote restores a prior version.]
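A sketch of the snapshot/promote behavior described above. The real service keeps only delta changes across snapshots; this toy model just reproduces the visible semantics:

```python
class SnapshotBlob:
    """Toy model of snapshot semantics: writes always go to the base blob;
    snapshots are read-only versions; promotion restores a prior version.
    (The real service stores only deltas; this keeps full copies.)"""
    def __init__(self, data=b""):
        self.base = data
        self.snapshots = []

    def write(self, data):
        self.base = data           # all writes applied to the base blob name

    def snapshot(self):
        self.snapshots.append(self.base)
        return len(self.snapshots) - 1

    def promote(self, snap_id):
        self.base = self.snapshots[snap_id]   # restore a prior version

b = SnapshotBlob(b"v1")
s = b.snapshot()
b.write(b"v2")
b.promote(s)
assert b.base == b"v1"
```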
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
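Since (PartitionKey, RowKey) is the only index, point lookups are cheap while any other filter is a scan. A toy model of that entity store (illustrative names, not the ADO.NET Data Services API):

```python
class Table:
    """Toy model of an Azure table: entities are property bags, indexed
    only by (PartitionKey, RowKey)."""
    def __init__(self):
        self.entities = {}

    def insert(self, entity):
        self.entities[(entity["PartitionKey"], entity["RowKey"])] = entity

    def get(self, partition_key, row_key):
        """Point lookup on the only index the service maintains: cheap."""
        return self.entities.get((partition_key, row_key))

    def scan(self, predicate):
        """Any filter on other properties is a full scan: expensive."""
        return [e for e in self.entities.values() if predicate(e)]

t = Table()
t.insert({"PartitionKey": "barga", "RowKey": "doc1", "Title": "AzureBLAST"})
assert t.get("barga", "doc1")["Title"] == "AzureBLAST"
assert t.get("barga", "missing") is None
```

This is why the best-practice slide that follows warns that tables index only on partition and row keys: picking those keys well decides which queries stay cheap.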
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
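Several of these practices (design workers to execute a task only once, batch small tasks into one queue message, include retry logic on every data access) combine naturally in a worker loop. A hedged sketch, with in-memory stand-ins for the queue message and storage:

```python
import time

def with_retry(op, attempts=3, delay=0.0):
    """Best practice above: wrap every data access in retry logic."""
    for i in range(attempts):
        try:
            return op()
        except IOError:
            if i == attempts - 1:
                raise
            time.sleep(delay)

completed = set()   # a durable task registry in a real service; a set here

def process_message(msg, storage):
    """Queue messages may be delivered more than once, so completed task
    ids are recorded and reprocessing becomes a no-op."""
    for task_id in msg["tasks"]:   # several small tasks batched per message
        if task_id in completed:
            continue
        with_retry(lambda: storage.update({task_id: f"result-{task_id}"}))
        completed.add(task_id)

storage = {}
process_message({"tasks": ["t1", "t2"]}, storage)
process_message({"tasks": ["t1", "t2"]}, storage)  # duplicate delivery: harmless
assert storage == {"t1": "result-t1", "t2": "result-t2"}
```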
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS...
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 31
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 32
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back to the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
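The goal-state behavior described above can be sketched as a toy reconciliation loop. The names are hypothetical and the real FC handles far more states; this only shows the "drive back to the goal state" idea:

```python
# Toy goal-state reconciliation in the spirit of the FC description above
# (hypothetical names; the real Fabric Controller is far more involved).

def reconcile(goal_count, instances):
    """Drop dead instances and top back up to the desired role count."""
    healthy = [i for i in instances if i["healthy"]]
    # Failed roles are automatically restarted / replaced on new nodes.
    while len(healthy) < goal_count:
        healthy.append({"id": "replacement-%d" % len(healthy), "healthy": True})
    return healthy

roles = [{"id": "web-0", "healthy": True},
         {"id": "web-1", "healthy": False},   # this role died
         {"id": "web-2", "healthy": True}]
repaired = reconcile(3, roles)
```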
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
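The "stateless roles and durable queues" takeaway can be illustrated with a minimal in-memory queue that mimics visibility-timeout semantics. This is a simplified sketch, not the Azure Queue API:

```python
import time

# Minimal in-memory sketch of the durable-queue pattern: a message becomes
# invisible while a worker processes it, and reappears if the worker dies
# before deleting it (simplified; names are illustrative).

class Queue:
    def __init__(self):
        self.messages = {}          # id -> (body, visible_at)
        self.next_id = 0

    def put(self, body):
        self.messages[self.next_id] = (body, 0.0)
        self.next_id += 1

    def get(self, visibility_timeout=30.0, now=None):
        now = time.time() if now is None else now
        for mid, (body, visible_at) in self.messages.items():
            if visible_at <= now:
                # Hide the message until the timeout expires.
                self.messages[mid] = (body, now + visibility_timeout)
                return mid, body
        return None

    def delete(self, mid):
        self.messages.pop(mid, None)

q = Queue()
q.put("work item")
mid, body = q.get(now=0.0)          # a worker takes the message
# Worker crashes without calling delete(): the message reappears later.
reappeared = q.get(now=31.0)
```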
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra (“EOS AM”), launched 12/1999; descending, equator crossing at 10:30 AM
Aqua (“EOS PM”), launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: the AzureMODIS processing pipeline. From the service web role portal, a download queue feeds the Data Collection Stage, followed by the Reprojection Stage, the Derivation Reduction Stage, and the Analysis Reduction Stage, which produce the research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
– Requires a large number of test runs for a given job (1 – 10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database)
2. Upload to Azure Storage
3. Deploy worker roles with the BLAST executable; the Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
[Diagram: the web role takes the user input, and a single partitioning worker role writes input partitions to Azure Storage and posts one queue message per partition.]
Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pull queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to storage.]
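A rough Python sketch of the partition-and-dispatch pattern from Steps 2 and 3. The helper names are invented for illustration; the actual AzureBLAST code is not shown in the talk:

```python
# Sketch of the partition-and-dispatch pattern: split the user input into
# fixed-size partitions and post one queue message per partition
# (illustrative names; not the real AzureBLAST implementation).

def partition(sequences, partition_size):
    """Split the user input into fixed-size partitions."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def enqueue_job(sequences, partition_size, queue):
    for n, part in enumerate(partition(sequences, partition_size)):
        # Workers pick these messages up, run BLAST on their partition,
        # and write output and logs back to storage.
        queue.append({"partition": n, "sequences": part})

queue = []
enqueue_job(["seq%d" % i for i in range(10)], partition_size=4, queue=queue)
```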
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
- The little cloud development headaches are probably worth it
Time-space fungibility in the cloud: more workers shorten the clock duration while total compute stays roughly constant.

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
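Reading the scaling table above programmatically: convert the clock durations to seconds and compute the wall-clock speedup relative to the 2-worker run (a small illustrative script):

```python
# Convert the h:mm:ss clock durations from the table to seconds and
# compute the speedup of each configuration relative to 2 workers.

def seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

workers = [25, 16, 8, 4, 2]
clock = [seconds(t) for t in
         ["0:12:00", "0:15:00", "0:26:00", "0:47:00", "1:27:00"]]
speedup = [clock[-1] / c for c in clock]
# 25 workers finish 7.25x faster than 2 workers, while the computational
# run time stays near two hours: time and resources are fungible.
```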
Utilizes a general jobs-based task manager, which registers jobs and their resulting data
[Diagram: a job definition fans out into tasks in the Azure datacenters; a registry broker links the Azure-side registry to a local registry on the user premises (or internet), where highly sensitive data remains under an (HPC) cluster administrator; the user retrieves results and data products through web management.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
• Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
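The 256-bit secret key mentioned above authenticates REST requests by signing them with HMAC-SHA256. Below is a simplified sketch; the real Shared Key scheme prescribes detailed canonicalization rules that are omitted here, and the key and string-to-sign values are made up:

```python
import base64, hashlib, hmac

# Hedged sketch of how the account's 256-bit secret key authenticates
# REST requests: an HMAC-SHA256 over a canonicalized request string.
# The real Shared Key canonicalization is more detailed than shown here.

def sign_request(account, key_b64, string_to_sign):
    key = base64.b64decode(key_b64)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256)
    signature = base64.b64encode(digest.digest()).decode("ascii")
    return "SharedKey %s:%s" % (account, signature)

# Hypothetical example values (not a real key or request):
demo_key = base64.b64encode(b"0" * 32).decode("ascii")   # 256-bit key
header = sign_request("jared", demo_key, "GET\n/jared/images/PIC01.JPG")
```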
[Diagram: the storage hierarchy. Account “jared” holds container “images” (blobs PIC01.JPG and PIC02.JPG) and container “movies” (blob MOV1.AVI); a blob is addressed as http://jared.blob.core.windows.net/images/PIC01.JPG]
Number of Blob Containers
Can have as many blob containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate metadata with container
Metadata: up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the storage hierarchy extended one level. Each blob (e.g. PIC01.JPG in account “jared”) consists of blocks or pages: Block ID 1 through Block ID N for a block blob, or Page 1, Page 2, Page 3, … for a page blob.]
Example: uploading a 10 GB movie to Windows Azure Storage

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
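The commit semantics above can be modeled in a few lines. This is an in-memory sketch; a real blob also retains blocks committed in the previous version, which is omitted here:

```python
# In-memory model of PutBlock / PutBlockList semantics (simplified: the
# real service also keeps the previously committed blocks available).

class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes
        self.committed = []     # ordered list of (block_id, bytes)

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # The readable version of the blob becomes exactly this list.
        self.committed = [(b, self.uncommitted[b]) for b in block_ids]
        self.uncommitted = {}

    def read(self):
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
blob.put_block_list(["b1", "b2"])   # commit: blob is now readable
```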
Create MyBlob
Specify blob size = 10 GB, fixed page size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns:
All 0 for the first 536 bytes
Next 512 bytes are the data stored in [1536, 2048)
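The example above can be checked with a small in-memory model of page-blob semantics. This is a sketch that tracks validity at 512-byte page granularity only:

```python
PAGE = 512

# Toy model of page-blob operations: writes mark pages valid, clears
# zero them and mark them invalid (simplified sketch of the semantics).

class PageBlob:
    def __init__(self, size):
        self.data = bytearray(size)    # cleared pages read back as zeros
        self.valid_pages = set()

    def put_page(self, start, end, fill=b"\x01"):
        self.data[start:end] = fill * (end - start)
        self.valid_pages.update(range(start // PAGE, end // PAGE))

    def clear_page(self, start, end):
        self.data[start:end] = b"\x00" * (end - start)
        self.valid_pages.difference_update(range(start // PAGE, end // PAGE))

    def get_page_ranges(self):
        """Return the valid data ranges as [start, end) byte offsets."""
        ranges = []
        for p in sorted(self.valid_pages):
            if ranges and ranges[-1][1] == p * PAGE:
                ranges[-1][1] = (p + 1) * PAGE      # extend previous range
            else:
                ranges.append([p * PAGE, (p + 1) * PAGE])
        return [tuple(r) for r in ranges]

blob = PageBlob(10 * 1024)          # small stand-in for the 10 GB blob
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
ranges = blob.get_page_ranges()     # the slide's [0,512), [1536,2560)
```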
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a
single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
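A minimal illustration of the entity model described above. The property values are invented; real access goes through ADO.NET Data Services or REST:

```python
# Every entity carries the required PartitionKey, RowKey and Timestamp
# properties plus arbitrary others; (PartitionKey, RowKey) uniquely
# identifies the entity (hypothetical values for illustration).

table = {}   # (PartitionKey, RowKey) -> entity

def insert(entity):
    key = (entity["PartitionKey"], entity["RowKey"])
    table[key] = entity

insert({"PartitionKey": "HIV", "RowKey": "job-001",
        "Timestamp": "2010-04-01T00:00:00Z", "CpuHours": 18})
found = table[("HIV", "job-001")]   # point lookup by the two keys
```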
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
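The "include retry logic" advice above can be sketched as a small retry helper with exponential backoff (illustrative; delays are shortened, and the flaky operation is simulated):

```python
import random, time

# Retry a storage operation with exponential backoff and jitter
# (illustrative sketch; production delays would be longer).

def with_retries(op, attempts=4, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise                       # out of retries: surface it
            time.sleep(base_delay * (2 ** attempt) * random.random())

calls = {"n": 0}

def flaky_read():
    """Simulated data access that fails transiently twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "data"

result = with_retries(flaky_read)
```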
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 34
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology       Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network          $95 per Mbps/month                $13 per Mbps/month            7.1
Storage          $2.20 per GB/month                $0.40 per GB/month            5.7
Administration   ~140 servers/administrator        >1000 servers/administrator   7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
  HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; Memory 1.7 GB; Network 100+ Mbps; Local Storage 500 GB
Up to: CPU 8 cores; Memory 14.2 GB; Local Storage 2+ TB
Azure Platform: Compute and Storage – a closer look
[Diagram: HTTP traffic enters through a load balancer; Web Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } entry point) each run in a VM with an agent, on top of the Fabric.]
Using queues for reliable messaging
1) Web Role receives work
2) Web Role puts work in the queue
3) Worker Role gets work from the queue
4) Worker Role does the work
To scale, add more of either role.
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
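The four-step queue pattern above can be sketched with an in-memory queue standing in for an Azure queue. This is a language-agnostic illustration of the decoupling, not the Azure storage API; the names `web_role` and `worker_role` are ours:

```python
import queue
import threading

work_queue = queue.Queue()   # stands in for an Azure queue
results = []                 # stands in for durable output storage
results_lock = threading.Lock()

def web_role(items):
    # Steps 1 and 2: receive work and put each unit in the queue.
    for item in items:
        work_queue.put(item)

def worker_role():
    # Steps 3 and 4: get work from the queue and do it.
    while True:
        try:
            item = work_queue.get(timeout=0.1)
        except queue.Empty:
            return                       # queue drained, worker exits
        with results_lock:
            results.append(item * item)  # the "work"

web_role(range(10))
# To scale, add more workers -- they all share the same queue.
workers = [threading.Thread(target=worker_role) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because the queue is the only coupling point, the front end and the workers can be scaled independently, which is exactly the property the bullets above describe.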
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Azure Storage: a closer look
[Diagram: applications reach Blobs, Drives, Tables, and Queues over HTTP through a REST API, behind a load balancer, alongside the Compute and Fabric layers.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app at work or home against the local Development Fabric and Development Storage, under source control; the application works locally, then in staging, then in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
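As a sketch of the example above (10 front-ends across 5 update domains), a simple round-robin placement gives each domain an equal share, so rolling an update one domain at a time never takes down more than a fifth of the role. Round-robin here is our assumption for illustration, not the fabric controller’s actual placement algorithm:

```python
def allocate(num_instances, num_update_domains):
    """Spread role instances across update domains round-robin."""
    domains = [[] for _ in range(num_update_domains)]
    for i in range(num_instances):
        domains[i % num_update_domains].append(f"frontend-{i}")
    return domains

# The slide's example: 10 front-ends across 5 update domains.
domains = allocate(10, 5)
```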
Push-button Deployment
Step 1: Allocate nodes – across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram: AzureMODIS pipeline – through the Service Web Role Portal, work flows from a Download Queue through the Data Collection, Reprojection, Derivation Reduction, and Analysis Reduction stages to Research Results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  – 100’s of HIV and HepC researchers actively use it
  – 1000’s of research communities rely on results
• Cover of PLoS Biology, November 2008
• Typical job, 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  – Requires a large number of test runs for a given job (1 – 10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data from the local sequence database
2. Upload the compressed data to Azure Storage
3. Deploy Worker Roles with the BLAST executable – each role’s Init() function downloads and decompresses the data to its local disk
Step 2. Partitioning a Job
The Web Role takes the user input and hands it to a single partitioning Worker Role, which writes input partitions to Azure Storage and enqueues one queue message per partition.
Step 3. Doing the Work
BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.
• Always design with failure in mind – on large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts – the optimal size may change depending on the scope of the job
• Test runs are your friend – blowing $20,000 of computation is not a good idea
• Make ample use of logging features – when failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great! – little Cloud development headaches are probably worth it
Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
[Chart: Resources vs. Time – time-space fungibility in the Cloud]
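The time-space fungibility can be checked directly from the run table’s numbers: wall-clock time shrinks as workers are added while the aggregate compute stays roughly flat. A small sketch over those figures:

```python
# (workers, clock duration, total run time) from the AzureBLAST runs
runs = [(25, "0:12:00", "2:19:39"), (16, "0:15:00", "2:25:12"),
        (8, "0:26:00", "2:33:23"), (4, "0:47:00", "2:34:17"),
        (2, "1:27:00", "2:31:39")]

def seconds(hms):
    h, m, s = map(int, hms.split(":"))
    return h * 3600 + m * 60 + s

wall = {w: seconds(clock) for w, clock, _ in runs}
total = {w: seconds(t) for w, _, t in runs}

# Adding workers buys wall-clock time...
speedup = wall[2] / wall[25]          # 7.25x going from 2 to 25 workers
# ...while the aggregate compute barely moves (within ~11%).
spread = max(total.values()) / min(total.values())
```

In other words, renting 25 machines for 12 minutes costs about the same total compute as renting 2 machines for an hour and a half, which is the trade the chart is making.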
Utilizes a general jobs-based task manager which registers jobs and their resulting data
[Diagram: a job definition is split into tasks; a Registry Broker connects a local registry on the user premises (holding highly sensitive data, managed by an (HPC) cluster administrator) with a registry in the Azure datacenters; users reach results and data products through web management.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account “jared” → containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate metadata with a container
Metadata: up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same account → container → blob hierarchy, with each blob composed of blocks or pages – Block Id 1 … Block Id N]
10 GB Movie
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
[Diagram: the committed blocks are assembled into TheBlob.wmv in Windows Azure Storage]
Blocks can be up to 4MB each
Each block can be variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
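These semantics can be captured in a toy model (our own sketch, not the storage service’s implementation): blocks staged with PutBlock stay uncommitted and invisible to readers until a PutBlockList names them as the readable blob:

```python
class BlockBlob:
    """Toy model of block-blob update semantics."""
    def __init__(self):
        self.uncommitted = {}   # block ID -> staged bytes
        self.committed = {}     # block ID -> committed bytes
        self.block_list = []    # IDs making up the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Blocks may come from the uncommitted or the committed list.
        for bid in block_ids:
            if bid in self.uncommitted:
                self.committed[bid] = self.uncommitted[bid]
        self.block_list = list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
before_commit = blob.read()          # staged blocks are not yet readable
blob.put_block_list(["b1", "b2"])    # the commit makes them the blob
after_commit = blob.read()
```

The two-phase upload is what makes large parallel uploads safe: blocks can arrive in any order, and readers only ever see a consistent committed list.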
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes (a 10 GB address space)
Random Access Operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
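The example above can be replayed with a toy page-blob model (our sketch; pages are fixed 512-byte units, and unwritten pages read back as zeros):

```python
PAGE = 512

class PageBlob:
    """Toy model of page-blob random-access semantics."""
    def __init__(self, size):
        self.size = size
        self.pages = set()   # offsets of pages holding valid data

    def put_page(self, start, end):
        self.pages.update(range(start, end, PAGE))

    def clear_page(self, start, end):
        self.pages.difference_update(range(start, end, PAGE))

    def get_page_ranges(self):
        # Coalesce adjacent valid pages into [start, end) ranges.
        ranges = []
        for off in sorted(self.pages):
            if ranges and off == ranges[-1][1]:
                ranges[-1][1] = off + PAGE
            else:
                ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

blob = PageBlob(10 * 2**30)          # 10 GB address space
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
ranges = blob.get_page_ranges()      # [(0, 512), (1536, 2560)]
```

Replaying the slide’s four operations reproduces its GetPageRange answer: [0,512) and [1536,2560) are the only ranges still holding valid data.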
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: restoring MyBlob to a prior version by promoting a snapshot]
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
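Because PartitionKey and RowKey are the only indexed properties, a point lookup by both keys is cheap while any other filter is a scan over entities. A minimal sketch of that access pattern, with a plain dict standing in for the table service (the entity names are ours):

```python
# Entities are property sets keyed by (PartitionKey, RowKey).
table = {}

def insert(partition_key, row_key, **properties):
    table[(partition_key, row_key)] = properties

insert("images", "PIC01.JPG", size_mb=2)
insert("images", "PIC02.JPG", size_mb=3)
insert("movies", "MOV1.AVI", size_mb=700)

# Indexed point lookup: both keys given, a single probe.
entity = table[("images", "PIC01.JPG")]

# Filtering on any other property means examining every entity.
big = [key for key, props in table.items() if props["size_mb"] > 2]
```

This is why the later best-practices slide says to remember that Azure tables only index on partition and row keys: queries that can’t name both degrade to scans.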
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
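The retry advice above can be sketched as a small wrapper with exponential backoff. The helper name and delay values are ours for illustration; production code would also distinguish transient from permanent faults:

```python
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Run op(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise                     # out of attempts, surface the fault
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

def flaky_read():
    # Fails twice, then succeeds -- a stand-in for a transient storage fault.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

result = with_retries(flaky_read)
```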
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 36
Just as did “killer micros” and inexpensive clusters
Range in size from “edge” facilities to megascale.
Economies of scale: approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center:

Technology       Cost in small-sized DC    Cost in large DC      Ratio
Network          $95 per Mbps/month        $13 per Mbps/month    7.1
Storage          $2.20 per GB/month        $0.40 per GB/month    5.7
Administration   ~140 servers/admin        >1000 servers/admin   7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers and complex cooling systems separately is not
efficient; instead, package and deploy them in bigger units (JITD).
Comparing HPC systems and data centers along five dimensions:
o Node and system architectures: largely indistinguishable (Intel Nehalem, AMD Barcelona or Shanghai; multiple processors and a big chunk of memory on the nodes)
o Communication fabric
o Storage systems: HPC uses local scratch (or none), SAN or parallel file systems for secondary storage, and petabytes of tertiary storage; data centers use terabytes of local storage, JBOD for secondary storage, and no tertiary storage
o Reliability and resilience: HPC relies on periodic checkpoints, with rollback and resume in response to failures; as MTBF approaches zero, checkpoint frequency increases and the I/O demand becomes intolerable. Data centers use loosely consistent models, designed to recover transparently from failures.
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64, memory 1.7 GB, network 100+ Mbps, local storage 500 GB
Up to: CPU 8 cores, memory 14.2 GB, local storage 2+ TB
Azure Platform: Compute and Storage
A closer look at compute:
[Diagram: HTTP requests pass through a load balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (a main() { … } entry point); each role runs in a VM with an agent, managed by the Fabric.]
Using queues for reliable messaging:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in a queue
3) A Worker Role (main() { … }) gets the work from the queue
4) The Worker Role does the work
To scale, add more of either role.
Queues are the application glue
• Decouple parts of the application, making them easier to scale independently;
• Allocate resources via different priority queues and backend servers;
• Mask faults in worker roles (reliable messaging).
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
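The four-step flow above can be sketched with an in-process queue standing in for the Azure Queue service. This is an illustrative simulation (the names web_role and worker_role are invented for the sketch), not the Azure storage API:

```python
import queue
import threading

# Stand-in for the Azure Queue service: it decouples the web role
# (producer) from the worker roles (consumers).
work_queue = queue.Queue()
results = []

def web_role(tasks, n_workers):
    # 1) Receive work and 2) put one message per task on the queue.
    for task in tasks:
        work_queue.put(task)
    for _ in range(n_workers):       # one sentinel per worker: no more work
        work_queue.put(None)

def worker_role():
    # 3) Get work from the queue and 4) do the work.
    while (task := work_queue.get()) is not None:
        results.append(task * task)  # placeholder for real processing

# To scale, add more worker threads (or more producers).
workers = [threading.Thread(target=worker_role) for _ in range(2)]
for w in workers:
    w.start()
web_role([1, 2, 3, 4], n_workers=len(workers))
for w in workers:
    w.join()
print(sorted(results))  # [1, 4, 9, 16]
```

The queue is the only coupling point, so the producing and consuming roles can be scaled or restarted independently.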
A closer look at storage:
[Diagram: applications reach Azure Storage over HTTP through a REST API behind a load balancer; the storage types are Blobs, Drives, Tables, and Queues, hosted on the same Fabric as Compute.]
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
The developer experience, local to cloud:
[Diagram: develop and run your app, at work or at home, against the local Development Fabric and Development Storage, with source control for versioning. Once the application works locally, deploy it to the cloud and verify that it works in staging.]
What's the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, with rolling upgrades/downgrades
Failure of any node is expected; state has to be replicated
Failure of a role (app code) is expected; recovery is automatic
Services can grow to be large, so provide state management that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
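The 10-front-ends-across-5-update-domains example amounts to round-robin placement. A minimal sketch of the idea (this is not the actual Fabric Controller allocation algorithm):

```python
def allocate(instances, n_domains):
    # Round-robin placement of role instances across update domains.
    domains = {d: [] for d in range(n_domains)}
    for i in range(instances):
        domains[i % n_domains].append(f"frontend-{i}")
    return domains

domains = allocate(10, 5)
# Each update domain holds 2 of the 10 front-ends, so updating one
# domain at a time leaves 8 of 10 instances serving traffic.
print({d: len(group) for d, group in domains.items()})
# {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}
```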
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on the nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive the service back to its goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: the AzureMODIS service. Requests enter through a web-role portal; a download queue feeds the data collection stage, followed by reprojection, derivation reduction, and analysis reduction stages that produce the research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers: 100s of HIV and HepC researchers actively use it, and 1000s of research communities rely on its results
Cover of PLoS Biology, November 2008
• A typical job takes 10 to 20 CPU hours; extreme jobs require 1K to 2K CPU hours
  - Requires a large number of test runs for a given job (1 to 10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure Store
3. Deploy the Worker Roles, along with the BLAST executable; each role’s Init() function downloads and decompresses the data to its local disk
Step 2. Partitioning a Job
[Diagram: user input arrives at the Web Role, which stores an input partition in Azure Storage and places a queue message for a single partitioning Worker Role.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.]
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - The little cloud development headaches are probably worth it
Scaling results (same job, varying numbers of workers):

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Chart: resources versus time, illustrating time-space fungibility in the cloud: more workers for a shorter time, or fewer workers for a longer time, consume a similar total of resources.]
Utilizes a general jobs-based task manager which registers jobs and their resulting data products.
[Diagram: a job definition fans out into tasks that run on an (HPC) cluster or in Azure data centers. A registry broker connects a local registry on the user premises (or internet), where highly sensitive data remains, to the cloud-side registry; an administrator manages the system through web management, and results flow back to the user.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
The storage namespace: Account → Container → Blob
Account: jared
  Container: images
    Blobs: PIC01.JPG, PIC02.JPG
  Container: movies
    Blob: MOV1.AVI
Example blob URL:
http://jared.blob.core.windows.net/images/PIC01.JPG
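The account/container/blob hierarchy maps directly onto the blob URL. A one-line sketch using the slide’s example names:

```python
def blob_url(account, container, blob):
    # Blobs are addressed as http://<account>.blob.core.windows.net/<container>/<blob>
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

print(blob_url("jared", "images", "PIC01.JPG"))
# http://jared.blob.core.windows.net/images/PIC01.JPG
```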
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate metadata with a container
Metadata are name/value pairs, up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same Account → Container → Blob hierarchy (jared → images, movies → PIC01.JPG, …), with each blob composed of either blocks (Block ID 1 … Block ID N) or pages (Page 1, Page 2, Page 3, …).]
Example: uploading a 10 GB movie to Windows Azure Storage as the block blob TheBlob.wmv:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);

Blocks can be up to 4MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
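The PutBlock/PutBlockList semantics described above can be modeled in memory. A simplified sketch (not the Azure SDK; commits here draw only from the uncommitted list):

```python
import base64

class BlockBlob:
    # In-memory model: blocks stay uncommitted until PutBlockList
    # names them, in order, as the readable version of the blob.
    MAX_BLOCK = 4 * 1024 * 1024          # blocks can be up to 4 MB each

    def __init__(self):
        self.uncommitted = {}            # block id -> bits
        self.committed = []              # ordered (block id, bits) pairs

    def put_block(self, block_id, bits):
        assert len(bits) <= self.MAX_BLOCK
        self.uncommitted[block_id] = bits

    def put_block_list(self, block_ids):
        # Commit: the listed blocks become the readable blob.
        self.committed = [(b, self.uncommitted[b]) for b in block_ids]
        self.uncommitted.clear()

    def read(self):
        return b"".join(bits for _, bits in self.committed)

blob = BlockBlob()
ids = [base64.b64encode(f"block-{i}".encode()).decode() for i in range(3)]
for i, block_id in enumerate(ids):
    blob.put_block(block_id, f"chunk{i}".encode())
assert blob.read() == b""                # nothing readable before commit
blob.put_block_list(ids)
print(blob.read())  # b'chunk0chunk1chunk2'
```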
Create MyBlob
Specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space)
Random access operations, by byte offset:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536,2048)
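The PutPage/ClearPage example can be replayed against a small in-memory model of page-blob semantics. A sketch (hypothetical, not the Azure SDK), scaled down to a 4096-byte address space since the example only touches offsets below 4096:

```python
class PageBlob:
    # Sparse address space of fixed-size 512-byte pages; unwritten or
    # cleared pages read back as zeros.
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.pages = {}                  # page offset -> 512 bytes

    def put_page(self, start, end, fill=b"\x01"):
        for off in range(start, end, self.PAGE):
            self.pages[off] = fill * self.PAGE

    def clear_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self):
        # Coalesce consecutive valid pages into [start, end) ranges.
        ranges, start = [], None
        for off in range(0, self.size, self.PAGE):
            if off in self.pages and start is None:
                start = off
            elif off not in self.pages and start is not None:
                ranges.append((start, off))
                start = None
        if start is not None:
            ranges.append((start, self.size))
        return ranges

    def get_blob(self, start, end):
        # Read byte-by-byte; missing pages supply zeros.
        return bytes(
            self.pages.get(off - off % self.PAGE, bytes(self.PAGE))[off % self.PAGE]
            for off in range(start, end)
        )

blob = PageBlob(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges())            # [(0, 512), (1536, 2560)]
data = blob.get_blob(1000, 2048)
print(data[:536] == b"\x00" * 536, data[536:] == b"\x01" * 512)  # True True
```

Running the slide’s four operations reproduces exactly the valid ranges and the zero-filled read the slide reports.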
Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks
Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and easy-to-use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
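A minimal sketch of how entities are addressed (illustrative only, not the Azure Table API); the table contents, keys, and properties are invented for the example:

```python
import time

table = {}   # (PartitionKey, RowKey) -> entity, unique within a table

def insert_entity(partition_key, row_key, **properties):
    # Every entity carries the three required properties plus its own.
    entity = {"PartitionKey": partition_key, "RowKey": row_key,
              "Timestamp": time.time(), **properties}
    table[(partition_key, row_key)] = entity
    return entity

insert_entity("genomes", "seq-001", length=1200)
insert_entity("genomes", "seq-002", length=950)

# Partition-key lookups are the efficient path: a partition's
# entities are stored and served together.
genomes = [e for (pk, _), e in table.items() if pk == "genomes"]
print(len(genomes))  # 2
```

Because only PartitionKey and RowKey are indexed, queries that filter on other properties scan the table; this is the point behind the best practice below about designing around the two keys.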
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
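The first testing tip, retry logic around every data access, can be sketched as a small decorator. This is an illustrative pattern, not an Azure SDK feature:

```python
import time

def with_retries(attempts=3, delay=0.01):
    # Retry a call that may hit transient storage faults, backing off
    # a little longer after each failed attempt.
    def decorate(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except IOError:
                    if attempt == attempts - 1:
                        raise            # out of retries: surface the error
                    time.sleep(delay * (attempt + 1))
        return wrapper
    return decorate

calls = {"n": 0}

@with_retries(attempts=3)
def flaky_read():
    # Hypothetical storage access that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

print(flaky_read(), calls["n"])  # data 3
```

Wrapping every storage call this way keeps transient faults, which are expected at data-center scale, from failing an entire job.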
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When the data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 37
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then: make the best use of the capabilities of client and
  cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose a geo-location to host the storage account
("US Anywhere", "US North Central", "US South Central", …)
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account: jared
  Container: images
    Blobs: PIC01.JPG, PIC02.JPG
  Container: movies
    Blob: MOV1.AVI
Example URL: http://jared.blob.core.windows.net/images/PIC01.JPG
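The flat account/container/blob namespace maps directly onto the URL; a tiny helper (a hypothetical illustration, not part of any Azure SDK) makes the scheme explicit:

```python
def blob_uri(account, container, blob):
    """Build the public blob URI in the account/container/blob
    scheme shown above (hypothetical helper for illustration)."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

print(blob_uri("jared", "images", "PIC01.JPG"))
# http://jared.blob.core.windows.net/images/PIC01.JPG
```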
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate Metadata with Container
Metadata are name/value pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the same account/container/blob namespace (account
jared; containers images and movies; blobs PIC01.JPG, PIC02.JPG and
MOV1.AVI), each blob is itself composed of blocks or pages: Block/Page 1,
Block/Page 2, Block/Page 3, …, each block identified by Block Id 1
through Block Id N.]
Example: uploading a 10 GB movie to Windows Azure Storage

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
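The PutBlock / PutBlockList split means uploaded blocks stay invisible until a commit names them. A minimal in-memory sketch of that semantics (illustrative only, not the storage service's implementation):

```python
class BlockBlob:
    """Toy model of block-blob update semantics: blocks are staged
    uncommitted, and PutBlockList atomically defines the readable blob."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged by put_block
        self.committed = {}     # block_id -> bytes, in the readable version
        self.block_list = []    # committed block order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # May draw on both the uncommitted and the committed lists.
        new = {}
        for bid in block_ids:
            if bid in self.uncommitted:
                new[bid] = self.uncommitted[bid]
            elif bid in self.committed:
                new[bid] = self.committed[bid]
            else:
                raise KeyError(bid)
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""            # nothing readable before the commit
blob.put_block_list(["b1", "b2"])
print(blob.read())                   # b'hello world'
```

Re-committing with a list that mixes already-committed blocks and freshly staged ones gives the cheap in-place update the slide describes.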
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes
(a 10 GB address space)

Random Access Operations
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)

GetPageRange [0, 4096) returns the valid data ranges:
[0,512) , [1536,2560)

GetBlob [1000, 2048) returns:
All 0 for the first 536 bytes
Next 512 bytes are the data stored in [1536,2048)
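The interval bookkeeping in that walkthrough can be checked with a small sketch that models a page blob as a byte array plus per-byte validity (an illustration of the semantics, not the service itself, and it does not enforce the 512-byte page alignment):

```python
class PageBlob:
    """Toy page blob: track which byte ranges hold valid data."""
    def __init__(self, size):
        self.size = size
        self.data = bytearray(size)      # unset pages read back as zeros
        self.valid = [False] * size      # per-byte validity flag

    def put_page(self, start, end, byte=b"x"):
        self.data[start:end] = byte * (end - start)
        for i in range(start, end):
            self.valid[i] = True

    def clear_page(self, start, end):
        self.data[start:end] = bytes(end - start)
        for i in range(start, end):
            self.valid[i] = False

    def get_page_range(self, start, end):
        """Return the maximal runs of valid bytes in [start, end)."""
        ranges, run_start = [], None
        for i in range(start, end + 1):
            ok = i < end and i < self.size and self.valid[i]
            if ok and run_start is None:
                run_start = i
            elif not ok and run_start is not None:
                ranges.append((run_start, i))
                run_start = None
        return ranges

blob = PageBlob(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_range(0, 4096))  # [(0, 512), (1536, 2560)]
```

Replaying the four operations reproduces exactly the valid ranges the slide reports, and reading [1000, 2048) yields 536 zero bytes followed by 512 bytes of data, as described.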
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: snapshots of MyBlob; Promote restores a chosen snapshot as the base blob]
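The snapshot/promote behavior reduces to a short sketch (a toy model; the real service shares unchanged deltas between versions rather than copying whole contents):

```python
class SnapshotBlob:
    """Toy blob with snapshots: writes go to the base name, a snapshot
    freezes the current content, and promote restores a prior version."""
    def __init__(self, content=b""):
        self.base = content
        self.snapshots = []          # ordered read-only versions

    def write(self, content):
        self.base = content          # all writes applied to the base name

    def snapshot(self):
        self.snapshots.append(self.base)
        return len(self.snapshots) - 1

    def promote(self, snapshot_id):
        self.base = self.snapshots[snapshot_id]

blob = SnapshotBlob(b"v1")
s0 = blob.snapshot()
blob.write(b"v2")
blob.promote(s0)
print(blob.base)  # b'v1'
```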
A Windows Azure Drive is a Page Blob formatted as an NTFS
single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
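Because PartitionKey and RowKey are the only indexed properties, a point lookup on those keys is a direct index hit while any other filter must scan entities; a toy model of that distinction (names are illustrative, not the table service API):

```python
class ToyTable:
    """Model of a table indexed only on (PartitionKey, RowKey)."""
    def __init__(self):
        self.index = {}   # (partition_key, row_key) -> entity dict

    def insert(self, entity):
        self.index[(entity["PartitionKey"], entity["RowKey"])] = entity

    def point_query(self, pk, rk):
        # Fast: served directly from the key index.
        return self.index.get((pk, rk))

    def scan(self, predicate):
        # A filter on any non-key property must examine every entity.
        return [e for e in self.index.values() if predicate(e)]

t = ToyTable()
t.insert({"PartitionKey": "movies", "RowKey": "MOV1.AVI", "size_mb": 700})
t.insert({"PartitionKey": "images", "RowKey": "PIC01.JPG", "size_mb": 2})
print(t.point_query("images", "PIC01.JPG")["size_mb"])  # 2
print(len(t.scan(lambda e: e["size_mb"] > 100)))        # 1
```

This is why the best-practice list below says to remember that Azure tables only index on the partition and row keys.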
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
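"Include retry logic in all instances where you are accessing data" is typically a bounded retry with exponential backoff; a generic sketch (the delay constants and the `retry_call` name are illustrative choices, not from any Azure SDK):

```python
import time

def retry_call(fn, attempts=4, base_delay=0.01):
    """Call fn(); on failure wait base_delay * 2**attempt and retry,
    re-raising the last error once the attempts are exhausted."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_read():
    """Stand-in for a storage call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "blob bytes"

print(retry_call(flaky_read))  # blob bytes
```

Pairing this with the logging APIs mentioned above turns a transient storage hiccup into a log line instead of a failed job.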
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 39
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
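The allocation rule in the example can be sketched as a simple round-robin assignment; this is an illustration of why a rolling upgrade keeps the service up, not the real Fabric Controller logic.

```python
def allocate(num_instances, num_update_domains):
    """Assign each role instance to an update domain, round-robin."""
    return {i: i % num_update_domains for i in range(num_instances)}

placement = allocate(10, 5)                 # 10 front-ends, 5 update domains

per_domain = {}
for inst, dom in placement.items():
    per_domain.setdefault(dom, []).append(inst)

# Each domain holds 2 of the 10 instances, so updating one domain at a
# time leaves 80% of capacity serving traffic at every step.
assert all(len(v) == 2 for v in per_domain.values())
max_down = max(len(v) for v in per_domain.values())
assert max_down / 10 == 0.2
```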
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, the FC will try to recover it
If a failed node can’t be recovered, the FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
The AzureMODIS pipeline: a service Web Role portal feeds a download queue, and data flows through a Data Collection Stage, Reprojection Stage, Derivation Reduction Stage and Analysis Reduction Stage to produce the research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
– 100’s of HIV and HepC researchers actively use it
– 1000’s of research communities rely on its results
Cover of PLoS Biology, November 2008
Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
– Requires a large number of test runs for a given job (1 – 10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data from the local sequence database
2. Upload it to the Azure Store
3. Deploy Worker Roles, whose Init() function downloads and decompresses the data (including the BLAST executable) to the local disk
Step 2. Partitioning a Job
A single partitioning Worker Role takes the user input from the Web Role, writes input partitions to Azure Storage, and enqueues a queue message per partition.
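The partitioning step just described can be sketched as splitting the query sequences into fixed-size partitions and producing one queue message per partition. The function and key names below are invented for illustration; they are not part of AzureBLAST.

```python
def partition_job(sequences, partition_size):
    """Split sequences into partitions and emit one message key per partition."""
    partitions, messages = {}, []
    for i in range(0, len(sequences), partition_size):
        key = f"input-partition-{i // partition_size}"   # hypothetical blob name
        partitions[key] = sequences[i:i + partition_size]  # would go to storage
        messages.append(key)                               # would go to the queue
    return partitions, messages

seqs = [f"query_{n}" for n in range(10)]
partitions, messages = partition_job(seqs, partition_size=4)

assert messages == ["input-partition-0", "input-partition-1", "input-partition-2"]
assert len(partitions["input-partition-2"]) == 2   # the last partition is smaller
```

As the lessons below note, choosing `partition_size` well matters: too small and queue/storage overhead dominates; too large and a single worker failure wastes a lot of work.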
Step 3. Doing the Work
BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it
Time-space fungibility in the cloud: trading resources (workers) against time.

Workers | Clock duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13
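Time-space fungibility can be made concrete with a toy cost model: at a flat per-core-hour rate (the rate below is invented for illustration), many workers for a short wall-clock time cost the same as few workers for a long one, for the same total core-time.

```python
RATE = 0.12  # hypothetical $/core-hour, purely illustrative

def cost(workers, hours_each):
    """Bill for running `workers` instances for `hours_each` hours each."""
    return workers * hours_each * RATE

total_core_hours = 58.0                   # the job's total work, fixed
wide = cost(25, total_core_hours / 25)    # many workers, short wall clock
narrow = cost(2, total_core_hours / 2)    # few workers, long wall clock

assert abs(wide - narrow) < 1e-9          # same bill either way
```

The table above shows the same effect empirically: total run time stays roughly constant while clock duration shrinks as workers are added.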
Utilizes a general jobs-based task manager which registers jobs and their resulting data. A job definition is broken into tasks; a registry broker connects the user’s local registry and web management on the user premises (or internet) with an (HPC) cluster administrator’s registry and the Azure datacenters, keeping highly sensitive data local while results and data products flow back to the user.
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
applications using peripheral devices,
applications with heavy graphics requirements, and
legacy user interfaces that would be difficult to port
• Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Blob namespace hierarchy: the Account (jared) contains Containers (images, movies), which contain Blobs (PIC01.JPG and PIC02.JPG under images; MOV1.AVI under movies). Example URL:
http://jared.blob.core.windows.net/images/PIC01.JPG
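The naming hierarchy maps directly onto the blob URL; a small helper reproduces the example above (the URL scheme comes from the slide, the helper itself is just illustration):

```python
def blob_url(account, container, blob):
    """Build the public URL for a blob from its account/container/name."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

assert blob_url("jared", "images", "PIC01.JPG") == \
    "http://jared.blob.core.windows.net/images/PIC01.JPG"
```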
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate metadata with a container
Metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
The same hierarchy, one level deeper: each blob (e.g. PIC01.JPG) is itself a sequence of blocks (Block Id 1 … Block Id N) or pages (Page 1, Page 2, Page 3, …).
Example: uploading a 10 GB movie to Windows Azure Storage as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
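The PutBlock / PutBlockList behaviour described above can be modelled with a toy class: uncommitted blocks are invisible until a block list commits them, and an update can mix committed and uncommitted blocks. This is a sketch of the semantics, not the real storage API.

```python
class BlockBlob:
    """Toy model of block-blob commit semantics (illustrative only)."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged by put_block
        self.committed = {}     # block_id -> bytes, part of the readable blob
        self.block_list = []    # ordered ids forming the readable version

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Each id may come from the uncommitted or the committed set.
        new = {bid: self.uncommitted.get(bid, self.committed.get(bid))
               for bid in block_ids}
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted = {}   # leftover staged blocks are discarded

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""               # nothing readable before the commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"

# Update: re-upload only b2, reusing the already-committed b1.
blob.put_block("b2", b"azure")
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello azure"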
Example: create MyBlob with a blob size of 10 GB and a fixed page size of 512 bytes, giving a 10 GB address space for random-access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
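The worked example can be replayed on a toy page blob: pages are 512 bytes, PutPage writes data, ClearPage zeroes pages, and only pages written and not cleared count as valid. A 4 KB blob stands in for the 10 GB one, and the method names mirror the slide rather than any real SDK.

```python
PAGE = 512

class PageBlob:
    """Toy page blob (illustrative only)."""
    def __init__(self, size):
        self.data = bytearray(size)     # zero-filled address space
        self.valid = set()              # indices of valid (written) pages

    def put_page(self, start, end, fill=b"x"):
        self.data[start:end] = fill * (end - start)
        self.valid |= set(range(start // PAGE, end // PAGE))

    def clear_page(self, start, end):
        self.data[start:end] = b"\x00" * (end - start)
        self.valid -= set(range(start // PAGE, end // PAGE))

    def get_page_ranges(self):
        """Coalesce valid pages into [start, end) byte ranges."""
        ranges = []
        for p in sorted(self.valid):
            if ranges and p * PAGE == ranges[-1][1]:
                ranges[-1][1] = (p + 1) * PAGE      # extend the current run
            else:
                ranges.append([p * PAGE, (p + 1) * PAGE])
        return [tuple(r) for r in ranges]

blob = PageBlob(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
assert blob.get_page_ranges() == [(0, 512), (1536, 2560)]

# GetBlob[1000, 2048): all zeros for the first 536 bytes, then the
# 512 bytes of data stored in [1536, 2048).
chunk = bytes(blob.data[1000:2048])
assert chunk[:536] == b"\x00" * 536 and chunk[536:] == b"x" * 512
```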
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
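The ETag style of concurrency named above can be sketched as an optimistic check: a writer presents the version it last read, and the write fails if someone else got there first. This toy class illustrates the idea only; it is not the storage service's protocol.

```python
class ETagBlob:
    """Toy blob with optimistic ETag concurrency (illustrative only)."""
    def __init__(self, value=b""):
        self.value, self.etag = value, 0

    def write(self, value, if_match):
        if if_match != self.etag:
            raise RuntimeError("412 Precondition Failed")  # lost the race
        self.value, self.etag = value, self.etag + 1
        return self.etag

blob = ETagBlob(b"v0")
tag_a = blob.etag            # two clients read the same version...
tag_b = blob.etag
blob.write(b"A's update", if_match=tag_a)      # client A wins the race
try:
    blob.write(b"B's update", if_match=tag_b)  # client B must re-read first
    raced = False
except RuntimeError:
    raced = True
assert raced and blob.value == b"A's update"
```

Leases are the opposite trade-off: instead of detecting conflicts after the fact, one writer takes exclusive access up front, which suits the immediate-update semantics of page blobs.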
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
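The table model above (entities as property bags addressed by PartitionKey and RowKey) can be sketched with a minimal in-memory stand-in; the sensor data below is invented for illustration.

```python
class Table:
    """Toy Azure-style table: entities keyed by (PartitionKey, RowKey)."""
    def __init__(self):
        self.rows = {}   # (PartitionKey, RowKey) -> dict of properties

    def insert(self, entity):
        key = (entity["PartitionKey"], entity["RowKey"])
        self.rows[key] = entity

    def get(self, partition_key, row_key):
        """Point lookup: the only fully indexed access path."""
        return self.rows[(partition_key, row_key)]

    def query_partition(self, partition_key):
        """Partition scan: efficient because entities are grouped by
        PartitionKey; filtering on any other property means a full scan."""
        return [e for (pk, _), e in self.rows.items() if pk == partition_key]

t = Table()
t.insert({"PartitionKey": "sensor-07", "RowKey": "2010-03-01T00:00", "Temp": 21.5})
t.insert({"PartitionKey": "sensor-07", "RowKey": "2010-03-01T00:05", "Temp": 21.7})
t.insert({"PartitionKey": "sensor-09", "RowKey": "2010-03-01T00:00", "Temp": 19.2})

assert t.get("sensor-07", "2010-03-01T00:05")["Temp"] == 21.7
assert len(t.query_partition("sensor-07")) == 2
```

This is the design point behind the best practice below: since only the partition and row keys are indexed, choose them to match your most common queries.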
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
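The first bullet ("include retry logic in all instances where you are accessing data") can be sketched as a generic retry-with-backoff wrapper; the attempt counts and delays below are invented, not any particular SDK's policy.

```python
import random, time

def with_retries(fn, attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call fn(), retrying transient IOErrors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except IOError:
            if attempt == attempts - 1:
                raise                    # out of retries: surface the fault
            # Exponential backoff with jitter spreads retries out so many
            # workers don't hammer storage in lockstep.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return b"payload"

result = with_retries(flaky_read, sleep=lambda s: None)  # no real waiting here
assert result == b"payload" and calls["n"] == 3
```

Note this pairs with the first design bullet above: since a retried (or requeued) task may run more than once, workers should be written so executing a task twice is harmless.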
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Data centers range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center:
Technology     | Cost in small-sized Data Center | Cost in Large Data Center   | Ratio
Network        | $95 per Mbps/month              | $13 per Mbps/month          | 7.1
Storage        | $2.20 per GB/month              | $0.40 per GB/month          | 5.7
Administration | ~140 servers/Administrator      | >1000 Servers/Administrator | 7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers and complex cooling systems all separately is not efficient.
Package and deploy into bigger units, JITD
Contrasting HPC and data center (DC) designs, dimension by dimension:
o Node and system architectures
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5–1.7 GHz x64, Memory 1.7 GB, Network 100+ Mbps, Local Storage 500 GB
Up to: CPU 8 cores, Memory 14.2 GB, Local Storage 2+ TB
Azure Platform: a closer look at Compute
HTTP requests pass through a load balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and on to Worker Role instances (a main() loop); each role instance runs in a VM with an agent, managed by the Fabric.
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Just as did “killer micros” and inexpensive clusters.
Data centers range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small data center (1,000 servers) and a large data center (100K servers):
Technology       Cost in small-sized DC     Cost in large DC        Ratio
Network          $95 per Mbps/month         $13 per Mbps/month      7.1
Storage          $2.20 per GB/month         $0.40 per GB/month      5.7
Administration   ~140 servers/admin         >1000 servers/admin     7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers and complex cooling systems all separately is not efficient.
Package and deploy into bigger units (JITD).
HPC systems and data centers (DC) differ along five dimensions:
o Node and system architectures: largely indistinguishable; Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on each node.
o Communication fabric
o Storage systems: HPC has local scratch that is small or non-existent, secondary storage on a SAN or parallel file system, and petabytes of tertiary storage; a DC has terabytes of local storage, JBOD secondary storage, and no tertiary storage.
o Reliability and resilience: HPC uses periodic checkpoints with rollback and resume in response to failures, but MTBF is approaching zero, checkpoint frequency is increasing, and the I/O demand is becoming intolerable; a DC uses loosely consistent models designed to transparently recover from failures.
o Programming model and services
Azure FC Owns this Hardware
Highly available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64, memory 1.7 GB, network 100+ Mbps, local storage 500 GB
Up to: CPU 8 cores, memory 14.2 GB, local storage 2+ TB
Azure Platform: Compute and Storage, a closer look
HTTP requests arrive at a load balancer, which routes them to Web Role instances (IIS hosting ASP.NET, WCF, etc.). Worker Role instances run application code (a main() entry point). Each role instance runs in a VM alongside an agent, all managed by the Fabric.
Using queues for reliable messaging. To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of the application, making each easier to scale independently
• Enable resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
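The four-step flow above can be sketched as a producer/consumer loop. This is a minimal illustration in Python, with the standard library's thread-safe queue standing in for the Azure queue service; the role functions and the squaring "work" are hypothetical.

```python
import queue
import threading

# Stand-in for the Azure queue service: a thread-safe FIFO queue.
work_queue = queue.Queue()
results = []

def web_role(items):
    """Receive work (step 1) and put each item in the queue (step 2)."""
    for item in items:
        work_queue.put(item)

def worker_role():
    """Get work from the queue (step 3) and do it (step 4)."""
    while True:
        try:
            item = work_queue.get(timeout=0.1)
        except queue.Empty:
            return  # queue drained; a real role would keep polling
        results.append(item * item)  # the "work": square the number
        work_queue.task_done()

web_role([1, 2, 3, 4])

# "To scale, add more of either": here, two worker threads drain the queue.
workers = [threading.Thread(target=worker_role) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because the queue, not the workers, holds the pending work, a worker that dies mid-task simply leaves the message to be picked up again, which is the fault-masking the bullet above describes.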
Azure Storage: a closer look
Applications (compute on the Fabric) access Blobs, Drives, Tables and Queues through a REST API over HTTP, behind a load balancer.
Points of interest
Storage types
Blobs: simple interface for storing named files along with per-file metadata
Drives: durable NTFS volumes
Tables: entity-based storage; not relational, entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps, and by other on-premises or cloud applications
Development workflow
Develop your app, at work or at home, against the local Development Fabric and Development Storage, keeping versions in source control. Verify the application works locally, then in staging, then run it in the cloud.
What’s the ‘value add’?
Provide a platform that is scalable and available:
Services are always running; rolling upgrades/downgrades
Failure of any node is expected; state has to be replicated
Failure of a role (app code) is expected; automatic recovery
Services can grow to be large; provide state management that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes and racks to network infrastructure and load balancers
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
An update domain is the unit of software/configuration update, for example a set of nodes to update together, used when rolling forward or backward.
The developer assigns the number required by each role; for example, 10 front-ends across 5 update domains.
Allocation is across update domains
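The front-end example above can be sketched as a simple round-robin placement. A minimal sketch; the `allocate` helper and the instance names are hypothetical, not part of any Azure API.

```python
def allocate(instances, num_domains):
    """Round-robin role instances across update domains, so that rolling
    one domain at a time takes down only instances/num_domains of them."""
    domains = {d: [] for d in range(num_domains)}
    for i in range(instances):
        domains[i % num_domains].append(f"frontend-{i}")
    return domains

# 10 front-ends across 5 update domains: two instances per domain.
domains = allocate(10, 5)
```

Rolling an update one domain at a time then never removes more than a fifth of the front-ends from rotation, which is exactly why the service stays up during the update.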
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on the nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles:
FC detects if a role dies; a role can also indicate that it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive the service back to its goal state
Windows Azure FC monitors the health of the host:
If a node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates its role instances to a new node: a suitable replacement location is found and existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
AzureMODIS pipeline: a download queue feeds the data collection stage, followed by the reprojection, derivation reduction and analysis reduction stages; research results are exposed through the AzureMODIS service web role portal.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers: hundreds of HIV and HepC researchers actively use it, and thousands of research communities rely on its results
Cover of PLoS Biology, November 2008
Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
- Requires a large number of test runs for a given job (1-10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure store
3. Deploy worker roles with the BLAST executable; each role’s Init() function downloads and decompresses the data to its local disk
Step 2. Partitioning a Job
The web role takes the user input; a single partitioning worker role writes input partitions to Azure storage and enqueues one queue message per partition.
Step 3. Doing the Work
BLAST-ready worker roles pick up queue messages, read their input partitions from Azure storage, run BLAST, and write the BLAST output and logs back to Azure storage.
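The partitioning step can be sketched as slicing the input and emitting one queue message per slice. A minimal Python sketch; `partition_job`, the blob names and the message fields are hypothetical stand-ins for whatever the web role actually enqueues.

```python
import json

def partition_job(sequences, partition_size):
    """Split the input into fixed-size partitions and build one queue
    message per partition, each naming the blob holding its slice."""
    messages = []
    for i in range(0, len(sequences), partition_size):
        part = sequences[i:i + partition_size]
        messages.append(json.dumps({
            "partition": i // partition_size,
            "blob": f"input-partition-{i // partition_size}",
            "count": len(part),
        }))
    return messages

# Ten input sequences split into partitions of four.
msgs = partition_job([f"seq{i}" for i in range(10)], partition_size=4)
```

Choosing `partition_size` is the "factoring work into optimal sizes" lesson below: too small and queue-transaction overhead dominates, too large and a single failed partition wastes hours of work.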
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has a large performance impact
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it
Resources vs. time (AzureBLAST scaling runs):

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
Time-space fungibility in the cloud: the computational run time stays roughly constant while adding workers cuts the clock duration, so resources can be traded against time.
Azure Ocean utilizes a general jobs-based task manager which registers jobs and their resulting data products. A job definition fans out into tasks that run in the Azure datacenters; a registry broker links the cloud-side registry and web management with the user premises (or internet), where an administrator runs an (HPC) cluster with a local registry, highly sensitive data stays on-premises, and results flow back to the user.
Client Visualization / Cloud Data and Computation
The Cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
- A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported storage client library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose the geo-location that hosts the storage account: “US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Storage namespace example: the account “jared” contains the containers “images” (blobs PIC01.JPG and PIC02.JPG) and “movies” (blob MOV1.AVI). The blob PIC01.JPG is addressed as:
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of blob containers: you can have as many blob containers as will fit within the storage account limit.
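The addressing scheme above can be captured in a one-line helper. A sketch, assuming the URL pattern shown on the slide; `blob_url` is a hypothetical convenience function, not part of the storage client library.

```python
def blob_url(account, container, blob):
    """Build the public URL for a blob: the account name scopes the
    hostname, and the container and blob name form the path."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

# The slide's example: account "jared", container "images", blob PIC01.JPG.
url = blob_url("jared", "images", "PIC01.JPG")
```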
Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible
Associate metadata with the container: up to 8 KB of metadata per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Within an account’s containers, each blob (PIC01.JPG, PIC02.JPG, MOV1.AVI, …) is itself made up of a sequence of blocks (Block Id 1, Block Id 2, Block Id 3, …, Block Id N) or pages (Page 1, Page 2, Page 3, …).
Uploading a 10 GB movie to Windows Azure Storage:
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each, and each block can be a different size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock: puts an uncommitted block, identified by its block ID, for the blob
Block list operations
PutBlockList: provides the list of blocks that comprise the readable version of the blob; can use blocks from the uncommitted or committed list to update the blob
GetBlockList: returns the list of blocks, committed or uncommitted, for a blob; the block ID and size of each block is returned
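The PutBlock/PutBlockList semantics above can be modeled in a few lines: staged blocks are invisible until a block list commits them. This is an in-memory sketch of the semantics, not the real storage API; the `BlockBlob` class is hypothetical.

```python
class BlockBlob:
    """In-memory model of block-blob update semantics: PutBlock stages
    uncommitted blocks; PutBlockList commits an ordered list of them,
    after which the blob's readable version changes."""
    def __init__(self):
        self.uncommitted = {}  # staged blocks, not yet readable
        self.committed = {}    # block_id -> bytes
        self.order = []        # committed block order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        for bid in block_ids:
            # A commit may reuse blocks from either list.
            if bid in self.uncommitted:
                self.committed[bid] = self.uncommitted.pop(bid)
            elif bid not in self.committed:
                raise KeyError(f"unknown block {bid}")
        self.order = list(block_ids)

    def read(self):
        """The readable blob is the concatenation of committed blocks."""
        return b"".join(self.committed[bid] for bid in self.order)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""            # nothing readable before the commit
blob.put_block_list(["b1", "b2"])    # the blob now reads as both blocks
```

This is why block blobs suit streaming uploads: an interrupted upload leaves only harmless uncommitted blocks, and the readable blob changes atomically at commit time.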
Create MyBlob: specify blob size = 10 GB, fixed page size = 512 bytes.
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512) and [1536, 2560)
GetBlob [1000, 2048) returns all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536, 2048)
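The sequence above can be replayed against a small in-memory model of a page blob. The `PageBlob` class is a hypothetical simulation (512-byte pages, writes filled with 0xFF bytes for visibility), not the real API, but it reproduces the ranges the slide reports.

```python
PAGE = 512  # fixed page size from the example

class PageBlob:
    """In-memory model of page-blob random access: pages become valid
    when written, ClearPage invalidates them, and reads return zeros
    for invalid ranges."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.valid = set()  # indices of valid (written) pages

    def put_page(self, start, end, fill=b"\xff"):
        self.data[start:end] = fill * (end - start)
        self.valid |= set(range(start // PAGE, end // PAGE))

    def clear_page(self, start, end):
        self.valid -= set(range(start // PAGE, end // PAGE))

    def get_page_ranges(self):
        """Coalesce valid pages into [start, end) byte ranges."""
        ranges = []
        for p in sorted(self.valid):
            if ranges and ranges[-1][1] == p * PAGE:
                ranges[-1][1] = (p + 1) * PAGE
            else:
                ranges.append([p * PAGE, (p + 1) * PAGE])
        return [tuple(r) for r in ranges]

    def read(self, start, end):
        """GetBlob-style read: zeros wherever a page is not valid."""
        out = bytearray(end - start)
        for p in range(start // PAGE, (end + PAGE - 1) // PAGE):
            if p in self.valid:
                lo, hi = max(start, p * PAGE), min(end, (p + 1) * PAGE)
                out[lo - start:hi - start] = self.data[lo:hi]
        return bytes(out)

# Replay the slide's sequence of operations.
blob = PageBlob(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
```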
Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks
Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases
Snapshots
All writes are applied to the base blob name; only delta changes are maintained across snapshots
Restore a prior version of MyBlob via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob interface:
Can upload the VHD to its Page Blob using the blob interface, then mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables; the table name is scoped by the account
A table is a set of entities (i.e. rows)
Entity
A set of properties (columns)
Required properties: PartitionKey, RowKey and Timestamp
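The entity model above can be approximated with a dict of dicts keyed by (PartitionKey, RowKey). A simplified in-memory sketch, not the ADO.NET Data Services API; it illustrates why point lookups and single-partition queries are the cheap operations, which the "tables only index on partition and row keys" tip below relies on.

```python
class Table:
    """Minimal model of Azure Table semantics: every entity is addressed
    by (PartitionKey, RowKey); other properties are schema-free."""
    def __init__(self):
        self.partitions = {}

    def insert(self, entity):
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        self.partitions.setdefault(pk, {})[rk] = entity

    def get(self, partition_key, row_key):
        """Point lookup on the only indexed properties: the two keys."""
        return self.partitions[partition_key][row_key]

    def query_partition(self, partition_key):
        """Scanning one partition is efficient; filtering on any other
        property means scanning every entity."""
        return list(self.partitions.get(partition_key, {}).values())

t = Table()
t.insert({"PartitionKey": "run42", "RowKey": "task-001", "status": "done"})
t.insert({"PartitionKey": "run42", "RowKey": "task-002", "status": "queued"})
```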
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember that Azure tables only index on the partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates for all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
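The first bullet under Testing & Development, including retry logic wherever you access data, can be sketched as exponential backoff around a storage call. `with_retries` and `flaky_call` are hypothetical; real code would catch the storage client's transient exceptions rather than a bare IOError.

```python
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Retry a storage call with exponential backoff; transient faults
    (timeouts, throttling) are expected at scale and usually clear up."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the fault
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

def flaky_call():
    """Hypothetical storage call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "ok"

result = with_retries(flaky_call)
```

Pair this with the workers-execute-a-task-only-once design above: retries mean the same operation may run more than once, so it must be idempotent.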
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back this offering with a technical engagement team. Lower the barrier to entry through tutorials, accelerators and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand, without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop:
Persist and share data from the client in the cloud
Analyze data initially captured in client tools, such as Excel
Analysis as a service (think SQL, MapReduce, R/MATLAB)
Data visualization generated in the cloud, displayed on the client
Provenance, collaboration and other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards
Coupled with access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 42
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
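The entity model above can be sketched as a tiny in-memory table, keyed the way the service keys entities: a hypothetical `Table` class for illustration, not ADO.NET Data Services or the real REST interface. Each entity is a schema-free set of properties plus the three required ones, and (PartitionKey, RowKey) is the only indexed lookup.

```python
import time

class Table:
    """Toy model of an Azure-style table: entities are keyed by
    (PartitionKey, RowKey); all other properties are schema-free."""
    def __init__(self):
        self.entities = {}

    def insert(self, partition_key, row_key, **properties):
        # PartitionKey + RowKey must be unique within the table
        key = (partition_key, row_key)
        if key in self.entities:
            raise KeyError("entity already exists")
        properties.update(PartitionKey=partition_key,
                          RowKey=row_key,
                          Timestamp=time.time())  # required properties
        self.entities[key] = properties

    def get(self, partition_key, row_key):
        # Point lookup on the only indexed properties: the two keys
        return self.entities[(partition_key, row_key)]

t = Table()
t.insert("genomes", "seq-001", length=1523, organism="E. coli")
entity = t.get("genomes", "seq-001")
assert entity["organism"] == "E. coli"
assert {"PartitionKey", "RowKey", "Timestamp"} <= set(entity)
```

The point-lookup-only interface is deliberate: it mirrors the fact that queries not using the partition and row keys scan rather than seek.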
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
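The "batch multiple small tasks into a single queue message" tip above can be sketched as follows. This is a hypothetical packing scheme (function names and the JSON encoding are my choice, not from the deck): many tiny tasks share one message, so the front end pays for one storage transaction per batch rather than per task, and the worker unpacks the batch on receipt.

```python
import json

def pack_tasks(tasks, batch_size=10):
    """Front-end side: group small tasks into queue messages of up to
    batch_size tasks each -- one storage transaction per message."""
    return [json.dumps(tasks[i:i + batch_size])
            for i in range(0, len(tasks), batch_size)]

def unpack_message(message):
    """Worker side: recover the individual tasks from one message."""
    return json.loads(message)

tasks = [{"id": i} for i in range(25)]
messages = pack_tasks(tasks)
assert len(messages) == 3            # 25 tasks -> 3 messages, not 25
recovered = [t for m in messages for t in unpack_message(m)]
assert recovered == tasks            # no tasks lost in the round trip
```

The batch size is a tuning knob: larger batches cut transaction costs but make a lost or poisoned message more expensive, and the payload must stay within the queue's message size limit.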
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
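The "include retry logic in all instances where you are accessing data" advice can be sketched as a generic wrapper. This is a minimal illustration, not an Azure SDK facility: the helper name, the exception class used as a stand-in for a transient fault, and the backoff constants are all assumptions.

```python
import time

def with_retries(operation, attempts=3, base_delay=0.01):
    """Run a data-access operation, retrying with exponential backoff.
    Transient faults are expected in the cloud, so every storage call
    should be wrapped in something like this."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise                      # out of retries: surface the fault
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_read():
    # Fails twice, then succeeds -- a stand-in for a transient storage fault
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient fault")
    return "data"

assert with_retries(flaky_read) == "data"
assert calls["n"] == 3                     # two failures, one success
```

Only retry exceptions you believe are transient; retrying a permanent error (bad credentials, missing container) just multiplies the latency of the inevitable failure.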
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 43
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 44
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little Cloud development headaches are probably worth it
Resources

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
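Multiplying workers by clock duration in the table above shows the total core-minutes consumed at each scale; this rough check (ignoring startup overhead) makes the scaling trade-off concrete:

```python
def to_minutes(hms: str) -> int:
    """Convert an h:mm:ss string to whole minutes."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + round(s / 60)

# workers -> clock duration, taken from the table above
runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}
core_minutes = {w: w * to_minutes(t) for w, t in runs.items()}
# More workers finish sooner but burn more total core time:
# {25: 300, 16: 240, 8: 208, 4: 188, 2: 174}
```

That rising total cost at higher worker counts is the overhead of factoring work into more partitions, and it is why the factoring lesson above matters.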
(Chart: resources vs. time, illustrating time-space fungibility in the cloud: the same total work can be done by more workers in less time, or by fewer workers in more time.)
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
(Diagram: a job definition is split into tasks; a registry broker connects a local registry on the user premises (or internet), where highly sensitive data stays, with the Azure datacenters; an administrator manages the (HPC) cluster, and results and data products are returned to the user through web management.)
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then: make best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account (“US Anywhere”, “US North Central”, “US South Central”, ...)
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
(Example hierarchy: account “jared” contains the containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI).)
http://jared.blob.core.windows.net/images/PIC01.JPG
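The URL layout follows directly from the account/container/blob hierarchy; a quick sketch using the slide’s own example names:

```python
def blob_url(account: str, container: str, blob: str) -> str:
    # <account>.blob.core.windows.net scopes the storage account;
    # the path scopes the container, then the blob.
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

assert blob_url("jared", "images", "PIC01.JPG") == \
    "http://jared.blob.core.windows.net/images/PIC01.JPG"
```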
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level (private or publicly accessible)
Associate metadata with a container
Metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
(Diagram: the same account/container/blob hierarchy, with each blob, e.g. PIC01.JPG, composed of blocks or pages: Block ID 1 / Page 1, Block ID 2 / Page 2, Block ID 3 / Page 3, ..., Block ID N.)
Uploading a 10 GB movie to Windows Azure Storage:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
Blocks can be up to 4 MB each
Each block can be variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
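The two-phase PutBlock / PutBlockList semantics described above can be modeled in a few lines. This is a behavioral sketch of the commit protocol, not the storage service’s API:

```python
class BlockBlob:
    """Model of block-blob commit semantics: blocks are invisible until
    a PutBlockList names them as the readable version of the blob."""

    def __init__(self):
        self.uncommitted = {}  # block id -> bytes, uploaded but not readable
        self.committed = {}    # blocks named by the last PutBlockList
        self.block_list = []   # ordered ids forming the readable blob

    def put_block(self, block_id: str, data: bytes) -> None:
        self.uncommitted[block_id] = data

    def put_block_list(self, ids: list) -> None:
        # ids may come from the uncommitted or the committed set
        self.committed = {i: self.uncommitted.get(i, self.committed.get(i))
                          for i in ids}
        self.block_list = list(ids)
        self.uncommitted.clear()

    def read(self) -> bytes:
        return b"".join(self.committed[i] for i in self.block_list)

b = BlockBlob()
b.put_block("b1", b"hello ")
b.put_block("b2", b"world")
assert b.read() == b""          # nothing readable before the commit
b.put_block_list(["b1", "b2"])
assert b.read() == b"hello world"
```

Because readers only ever see the last committed list, block uploads can proceed in parallel and out of order without exposing a partially written blob.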
Create MyBlob: specify blob size = 10 GB, fixed page size = 512 bytes (a 10 GB address space)
Random access operations, in order:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
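The sequence of page operations above can be replayed with a tiny model that tracks which 512-byte pages hold data; it reproduces the ranges the slide lists (a sketch of the range bookkeeping only, not the service):

```python
PAGE = 512
valid = set()  # byte offsets of pages currently holding data

def put_page(start, end):    # write pages in [start, end)
    valid.update(range(start, end, PAGE))

def clear_page(start, end):  # clear pages in [start, end)
    valid.difference_update(range(start, end, PAGE))

def get_page_ranges(start, end):
    """Coalesce valid pages in [start, end) into (start, end) ranges."""
    ranges, cur = [], None
    for off in range(start, end, PAGE):
        if off in valid:
            if cur and cur[1] == off:
                cur[1] = off + PAGE       # extend the current range
            else:
                cur = [off, off + PAGE]   # start a new range
                ranges.append(cur)
    return [tuple(r) for r in ranges]

put_page(512, 2048)
put_page(0, 1024)
clear_page(512, 1536)
put_page(2048, 2560)
assert get_page_ranges(0, 4096) == [(0, 512), (1536, 2560)]
# GetBlob[1000, 2048): offsets 1000..1535 fall in cleared pages, so the
# first 1536 - 1000 = 536 bytes read back as zeros.
```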
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
(Diagram: promoting a snapshot of MyBlob restores it as the base blob.)
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
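A minimal model of the required properties: entities live in partitions, and the (PartitionKey, RowKey) pair uniquely addresses an entity. The sample data below is invented for illustration:

```python
import time

table = {}  # (PartitionKey, RowKey) -> entity properties

def insert_entity(pk: str, rk: str, **props) -> None:
    """Store an entity with the three required properties plus its own."""
    table[(pk, rk)] = {"PartitionKey": pk, "RowKey": rk,
                       "Timestamp": time.time(), **props}

def query_partition(pk: str) -> list:
    # Cheap in the real service too: the partition key is the index
    return [e for (p, _), e in table.items() if p == pk]

insert_entity("movies", "MOV1.AVI", size_mb=700)
insert_entity("movies", "MOV2.AVI", size_mb=650)
insert_entity("images", "PIC01.JPG", size_mb=2)
assert len(query_partition("movies")) == 2
```

The partition key also controls scale-out: entities sharing a partition key stay together, which is why the best practice below says to query on partition and row keys whenever possible.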
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
  - Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
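The “include retry logic” advice above can be sketched as a generic wrapper; the backoff constants and exception type are arbitrary choices for illustration, not Azure guidance:

```python
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Call op(); on transient failure, log and retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except IOError as exc:
            print(f"attempt {attempt} failed: {exc}")  # stand-in for the logging APIs
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky_read():
    """Simulated storage call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return b"data"

assert with_retries(flaky_read) == b"data"
assert calls["n"] == 3
```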
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagement team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS...
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done...
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1000 servers) and a larger, 100K-server center:

Technology       Cost in small-sized Data Center   Cost in Large Data Center   Ratio
Network          $95 per Mbps/month                $13 per Mbps/month          7.1
Storage          $2.20 per GB/month                $0.40 per GB/month          5.7
Administration   ~140 servers/Administrator        >1000 Servers/Administrator 7.1
Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex cooling systems all separately is not efficient.
Package and deploy into bigger units, JITD.
Comparing HPC and data center (DC) systems:
o Node and system architectures: node architectures are indistinguishable. Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes.
o Communication fabric
o Storage systems: HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage. DC: TB local storage, secondary is JBOD, tertiary is non-existent.
o Reliability and resilience: HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable. DC: loosely consistent models, designed to transparently recover from failures.
o Programming model and services
Azure FC Owns this Hardware
Highly-available Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64, Memory 1.7 GB, Network 100+ Mbps, Local Storage 500 GB
Up to: CPU 8 cores, Memory 14.2 GB, Local Storage 2+ TB
Azure Platform: Compute and Storage. A closer look at compute:
(Diagram: HTTP requests pass through a load balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (running main() { ... }); each role VM runs an agent, all managed by the fabric.)
Using queues for reliable messaging. To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work,
2) puts the work in a queue;
3) the Worker Role (main() { ... }) gets work from the queue
4) and does the work.
Queues are the application glue
• Decouple parts of the application, easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
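The fault-masking in the queue pattern comes from visibility timeouts: a message a worker gets but never deletes reappears for another worker. A simplified sketch of that behavior (only get/delete/timeout are modeled, nothing else of the real queue service):

```python
import itertools
import time

class Queue:
    """Toy reliable-messaging queue with a visibility timeout."""

    def __init__(self, visibility_timeout=0.05):
        self.timeout = visibility_timeout
        self.messages = {}  # id -> (body, invisible_until)
        self.ids = itertools.count()

    def put(self, body):
        self.messages[next(self.ids)] = (body, 0.0)

    def get(self):
        now = time.monotonic()
        for mid, (body, until) in self.messages.items():
            if until <= now:  # visible: hide it for the timeout window
                self.messages[mid] = (body, now + self.timeout)
                return mid, body
        return None

    def delete(self, mid):
        self.messages.pop(mid, None)

q = Queue()
q.put("work item")
mid, body = q.get()     # a worker takes the message...
assert q.get() is None  # ...so it is invisible to other workers
time.sleep(0.06)        # worker crashes: it never calls delete
mid2, _ = q.get()       # message reappears for another worker
q.delete(mid2)          # the successful worker deletes it
assert q.get() is None
```

Delete-after-processing is what makes the pattern reliable; it also means a message may be processed more than once, hence the best practice below to design workers to execute a task only once.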
A closer look at storage:
(Diagram: applications access Blobs, Drives, Tables, and Queues over HTTP through a REST API behind a load balancer; the storage service runs on the same compute fabric.)
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational: entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps, and by other on-premise or cloud applications
(Diagram: develop at work or at home against the local Development Fabric and Development Storage, with source control and versioning; the application works locally, then in staging, then in the cloud.)
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running; rolling upgrades/downgrades
Failure of any node is expected; state has to be replicated
Failure of a role (app code) is expected; automatic recovery
Services can grow to be large; provide state management that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update (example: a set of nodes to update), used when rolling forward or backward
The developer assigns the number required by each role (example: 10 front-ends, across 5 update domains)
Allocation is across update domains
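The slide’s example (10 front-ends across 5 update domains) can be checked with a short sketch; the round-robin placement and rolling-update loop are illustrative, not the fabric controller’s actual algorithm:

```python
def allocate(instances: int, update_domains: int) -> dict:
    """Spread role instances round-robin across update domains."""
    domains = {d: [] for d in range(update_domains)}
    for i in range(instances):
        domains[i % update_domains].append(f"frontend-{i}")
    return domains

domains = allocate(10, 5)
assert all(len(v) == 2 for v in domains.values())  # 2 instances per domain

# Rolling update: take down one update domain at a time,
# so 8 of the 10 front-ends stay up throughout the upgrade.
for d, members in domains.items():
    updating = set(members)
    still_up = 10 - len(updating)
    assert still_up == 8
```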
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Slide 46
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop and run your app locally (at work or home) against the Development Fabric and Development Storage, under source/version control; once the application works locally, promote it to staging and then to the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update
Example: a set of nodes to update
Used when rolling forward or backward
The developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on the nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram: the AzureMODIS service web role and portal feed a download queue; data flows through a Data Collection stage, a Reprojection stage, a Derivation Reduction stage, and an Analysis Reduction stage to produce research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
- 100’s of HIV and HepC researchers actively use it
- 1000’s of research communities rely on the results
(Cover of PLoS Biology, November 2008)
• Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
- Requires a large number of test runs for a given job (1–10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload it to the Azure store
3. Deploy worker roles; an Init() function downloads and decompresses the data to the local disk, alongside the BLAST executable
[Diagram: local sequence database → compressed → uploaded to Azure Storage → deployed to BLAST-ready worker roles]
Step 2. Partitioning a Job
[Diagram: the web role writes the user input and input partitions to Azure Storage and sends a queue message to a single partitioning worker role.]
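The partitioning step can be sketched as splitting the user's input into fixed-size chunks and emitting one queue message per chunk; the names here (partition_input, PARTITION_SIZE, the blob naming scheme) are our illustration, not AzureBLAST's actual API.

```python
# Illustrative sketch of the partitioning step: split the user's input into
# fixed-size partitions and emit one queue message per partition.
PARTITION_SIZE = 4  # sequences per partition; tuning this matters (see lessons below)

def partition_input(sequences, size=PARTITION_SIZE):
    """Yield (partition_id, chunk) pairs for worker roles to process."""
    for i in range(0, len(sequences), size):
        yield i // size, sequences[i:i + size]

# one message per partition, pointing at the blob holding its input
queue_messages = [
    {"partition": pid, "blob": f"input-partition-{pid}"}
    for pid, chunk in partition_input([f"seq{n}" for n in range(10)])
]
```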
Step 3. Doing the Work
[Diagram: BLAST-ready worker roles take queue messages, read the user input and input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.]
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it
Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

[Chart: time-space fungibility in the cloud: the same total resource consumption can be spread across fewer workers over more time, or more workers over less time.]
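A quick calculation from the clock-duration column above makes the fungibility concrete: speedup and parallel efficiency relative to the 2-worker run.

```python
# Speedup and parallel efficiency computed from the wall-clock figures in the
# table above, taking the 2-worker run as the baseline.
runs = {2: 87, 4: 47, 8: 26, 16: 15, 25: 12}  # workers -> clock minutes

baseline_workers, baseline_minutes = 2, runs[2]
report = {}
for workers, minutes in sorted(runs.items()):
    speedup = baseline_minutes / minutes
    # efficiency: speedup per unit of added parallelism
    efficiency = speedup / (workers / baseline_workers)
    report[workers] = (round(speedup, 2), round(efficiency, 2))

# e.g. 25 workers: 87/12 = 7.25x speedup for 12.5x the workers
```

Total consumed core time stays roughly constant across the rows, which is the point: in the cloud you can trade space for time at nearly fixed cost.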
Utilizes a general jobs-based task manager which registers jobs and their resulting data products
[Diagram: a job definition fans out into tasks recorded in a registry. A registry broker links the user premises (or internet), where the (HPC) cluster, its administrator, a local registry, and highly sensitive data reside, with the Azure data centers; the user reaches results through web management.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a jack-of-all-trades
• Client-side tools are particularly appropriate for:
applications using peripheral devices;
applications with heavy graphics requirements;
legacy user interfaces that would be difficult to port
• Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
A user creates a globally unique storage account name
Can choose the geo-location that hosts the storage account: “US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Example namespace: account “jared” contains containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI); the blob URL is http://jared.blob.core.windows.net/images/PIC01.JPG]
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible
Associate metadata with the container: up to 8 KB of metadata per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob
[Diagram: within the account/container/blob hierarchy (e.g. account “jared”, containers “images” and “movies”), each blob is composed of blocks or pages; a 10 GB movie, for example, is stored as blocks with Block IDs 1 through N.]
Uploading a 10 GB movie as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

The result is TheBlob.wmv in Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
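The uncommitted/committed split described above can be sketched with a small in-memory model; this is our illustration of the semantics, not the Azure Storage client library.

```python
# In-memory sketch of block-blob semantics: PutBlock stages uncommitted
# blocks; PutBlockList commits an ordered list as the readable blob.
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes
        self.committed = {}     # block_id -> bytes
        self.block_list = []    # ordered committed block ids

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # blocks may come from the uncommitted or the committed list
        new = {bid: self.uncommitted.get(bid, self.committed.get(bid))
               for bid in block_ids}
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted.clear()

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
blob.put_block_list(["b1", "b2"])   # commit: blob is now readable
```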
Create MyBlob
Specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space)
Random access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0s for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
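The worked example above can be replayed with a small model of a sparse page store; the class and method names are ours, chosen to mirror the operations named in the example.

```python
# Sketch of page-blob semantics at 512-byte page granularity: a sparse set of
# valid pages over a large address space; unwritten pages read back as zeros.
PAGE = 512

class PageBlob:
    def __init__(self, size):
        self.size = size
        self.pages = {}  # page offset -> one page of data

    def put_page(self, start, end, fill=b"\x01" * PAGE):
        for off in range(start, end, PAGE):
            self.pages[off] = fill

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self, start, end):
        # merge adjacent valid pages into [start, end) ranges
        ranges = []
        for off in range(start, end, PAGE):
            if off in self.pages:
                if ranges and ranges[-1][1] == off:
                    ranges[-1][1] = off + PAGE
                else:
                    ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

    def read(self, start, end):
        # cleared or never-written pages contribute zeros
        out = bytearray()
        for pos in range(start, end):
            page = self.pages.get(pos - pos % PAGE)
            out.append(page[pos % PAGE] if page else 0)
        return bytes(out)

blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
ranges = blob.get_page_ranges(0, 4096)   # [(0, 512), (1536, 2560)]
```

Running the four operations in order reproduces exactly the valid ranges and the zero-filled read that the example states.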
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
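The snapshot-and-promote behavior can be sketched as follows; this in-memory model (and its whole-content "delta") is our illustration of the semantics, not the Azure API.

```python
# Sketch of snapshot semantics: writes go to the base blob; snapshots freeze
# read-only versions; promotion restores a snapshot over the base blob.
import itertools

class SnapshottableBlob:
    _clock = itertools.count(1)   # monotonically increasing snapshot ids

    def __init__(self, content=b""):
        self.content = content
        self.snapshots = {}       # snapshot id -> frozen content

    def write(self, content):
        self.content = content    # all writes hit the base blob name

    def snapshot(self):
        sid = next(self._clock)
        self.snapshots[sid] = self.content
        return sid

    def promote(self, sid):
        self.content = self.snapshots[sid]  # restore a prior version

blob = SnapshottableBlob(b"v1")
sid = blob.snapshot()
blob.write(b"v2")
blob.promote(sid)   # base blob content is back to the snapshot
```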
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob interface
You can upload a VHD to its Page Blob using the blob interface and then mount it as a Drive
You can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and easy-to-use APIs
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
The table name is scoped by the account
A table is a set of entities (i.e., rows)
Entity
A set of properties (columns)
Required properties: PartitionKey, RowKey and Timestamp
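The data model can be sketched as entities addressed by (PartitionKey, RowKey), with only those keys indexed; this dictionary-backed model is our illustration, not the Azure Table service API.

```python
# Sketch of the Azure table data model: entities are property bags addressed
# by (PartitionKey, RowKey); only those two keys are indexed.
class Table:
    def __init__(self):
        self.entities = {}  # (partition_key, row_key) -> property dict

    def insert(self, partition_key, row_key, **properties):
        self.entities[(partition_key, row_key)] = dict(
            PartitionKey=partition_key, RowKey=row_key, **properties)

    def get(self, partition_key, row_key):
        # point lookup on the indexed keys: efficient
        return self.entities[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # queries on anything other than the keys degrade to a scan,
        # which is why key design matters (see best practices below)
        return [e for (pk, _), e in self.entities.items()
                if pk == partition_key]

jobs = Table()
jobs.insert("blast-run-7", "partition-001", state="done")
jobs.insert("blast-run-7", "partition-002", state="running")
jobs.insert("blast-run-8", "partition-001", state="queued")
```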
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember that Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing the VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use the built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
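The "include retry logic" advice above can be sketched as a retry wrapper with exponential backoff; the flaky storage call here is a stand-in we invented, not an Azure API.

```python
# Retry a (stand-in) storage call with exponential backoff, surfacing the
# fault only after the retry budget is exhausted.
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise                              # out of retries
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

calls = {"n": 0}
def flaky_fetch():
    # fails twice with a transient fault, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "blob-bytes"

result = with_retries(flaky_fetch)  # succeeds on the third attempt
```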
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as “killer micros” and inexpensive clusters did before, cloud data centers are reshaping research and technical computing. They range in size from “edge” facilities to megascale.

Economies of scale: approximate costs for a small data center (1,000 servers) versus a larger, 100K-server data center:

Technology      Cost in small-sized Data Center   Cost in Large Data Center     Ratio
Network         $95 per Mbps/month                $13 per Mbps/month            7.1
Storage         $2.20 per GB/month                $0.40 per GB/month            5.7
Administration  ~140 servers/administrator        >1000 servers/administrator   7.1

Each data center is 11.5 times the size of a football field.

Conquering complexity: building racks of servers and complex cooling systems all separately is not efficient. Package and deploy into bigger units, JITD.

Cloud data centers and HPC systems differ across the stack:
o Node and system architectures: node architectures are indistinguishable (Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes)
o Communication fabric
o Storage systems: HPC uses local scratch or none, secondary is SAN or PFS, with PB tertiary storage; DC nodes have TB local storage, secondary is JBOD, and tertiary is non-existent
o Reliability and resilience
o Programming model and services
Slide 49
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks
Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases
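The ETag check mentioned for block blobs is optimistic concurrency: a conditional write succeeds only if the caller still holds the blob's current ETag. A minimal sketch of that compare-and-swap behavior (a local model, not the storage REST API):

```python
# Sketch of ETag-style optimistic concurrency: a write succeeds only if the
# caller's ETag matches the blob's current one.
import uuid

class Blob:
    def __init__(self, data=b""):
        self.data = data
        self.etag = uuid.uuid4().hex

    def write_if_match(self, new_data, etag):
        if etag != self.etag:
            return False              # someone else updated the blob first
        self.data = new_data
        self.etag = uuid.uuid4().hex  # every successful write gets a new ETag
        return True

b = Blob(b"v1")
tag = b.etag
assert b.write_if_match(b"v2", tag)        # first writer wins
assert not b.write_if_match(b"v3", tag)    # stale ETag is rejected
assert b.data == b"v2"
```

Page blobs use leases instead: a writer takes an exclusive time-limited lock rather than detecting conflicts after the fact.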
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: promoting a snapshot of MyBlob to restore a prior version]
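The snapshot model above (writes go to the base blob, snapshots are read-only versions, promotion restores a prior one) can be sketched as follows. The class and method names are hypothetical; the real service exposes this through the blob API and ListBlobs:

```python
# Toy model of blob snapshots: writes apply to the base blob, snapshots
# capture read-only versions, and "promote" restores the base from one.
import copy

class SnapshottableBlob:
    def __init__(self, data):
        self.base = data
        self.snapshots = []     # list of (snapshot_id, frozen data)

    def snapshot(self):
        sid = len(self.snapshots)
        self.snapshots.append((sid, copy.copy(self.base)))
        return sid              # enumerate these like ListBlobs would

    def promote(self, sid):
        # Restore the base blob to the prior version held by the snapshot.
        self.base = copy.copy(self.snapshots[sid][1])

b = SnapshottableBlob(b"version 1")
s0 = b.snapshot()
b.base = b"version 2"           # write always hits the base blob name
b.promote(s0)
assert b.base == b"version 1"
```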
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy-to-Use APIs
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
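The entity model above can be sketched as a dictionary keyed on (PartitionKey, RowKey), with each entity a free-form property bag. This is an illustrative local model, not the table service, but it shows why only key lookups are cheap: anything else is a scan.

```python
# Sketch of table storage keyed on (PartitionKey, RowKey); each entity is a
# property bag. Lookups by the full key are direct; any other filter scans.

class Table:
    def __init__(self):
        self.entities = {}   # (partition_key, row_key) -> dict of properties

    def insert(self, pk, rk, **props):
        self.entities[(pk, rk)] = {"PartitionKey": pk, "RowKey": rk, **props}

    def get(self, pk, rk):
        return self.entities.get((pk, rk))      # indexed: direct lookup

    def scan(self, predicate):
        # No secondary indexes: filtering on any other property visits
        # every entity.
        return [e for e in self.entities.values() if predicate(e)]

t = Table()
t.insert("movies", "MOV1.AVI", size_mb=700)
t.insert("images", "PIC01.JPG", size_mb=2)
assert t.get("movies", "MOV1.AVI")["size_mb"] == 700
assert len(t.scan(lambda e: e["size_mb"] < 10)) == 1
```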
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
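The "batch multiple small tasks into a single queue message" tip can be sketched as packing task descriptions into JSON payloads until a size limit is reached. The limit value and function names here are illustrative, not part of the queue API:

```python
# Sketch of batching: pack several small task descriptions into one queue
# message to cut per-message storage transactions. Queue messages are
# size-limited, so we batch up to an assumed limit.
import json

def pack_tasks(tasks, max_bytes=8192):
    batches, current = [], []
    for task in tasks:
        candidate = current + [task]
        if len(json.dumps(candidate).encode()) > max_bytes and current:
            batches.append(json.dumps(current))   # flush the full batch
            current = [task]
        else:
            current = candidate
    if current:
        batches.append(json.dumps(current))
    return batches

msgs = pack_tasks([{"seq": i} for i in range(1000)], max_bytes=1024)
restored = [t for m in msgs for t in json.loads(m)]
assert restored == [{"seq": i} for i in range(1000)]   # nothing lost
assert len(msgs) < 1000                                # far fewer messages
```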
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
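The "include retry logic in all instances where you are accessing data" advice usually takes the form of retries with exponential backoff, since most storage errors under load are transient. A minimal sketch, with delays shortened so it runs instantly:

```python
# Sketch of retry logic with exponential backoff for transient storage errors.
import time

def with_retries(op, attempts=4, base_delay=0.001):
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise                               # out of retries: surface it
            time.sleep(base_delay * 2 ** attempt)   # back off exponentially

calls = {"n": 0}
def flaky_read():
    # Simulates a data access that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage failure")
    return b"payload"

assert with_retries(flaky_read) == b"payload"
assert calls["n"] == 3
```

Wrapping every blob, table and queue access this way keeps a single dropped connection from failing an entire job.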
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are unable to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure Storage
3. Deploy the Worker Roles
   - An Init() function downloads and decompresses the data to the local disk
[Diagram: the local sequence database is compressed and uploaded to Azure Storage; the BLAST executable is deployed to the worker roles.]
Step 2. Partitioning a Job
[Diagram: a web role accepts the user input; a single partitioning worker role splits it into input partitions stored in Azure Storage and posts a queue message for each partition.]
Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pick up queue messages, read their input partitions from Azure Storage, and write the BLAST output and logs back to Azure Storage.]
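The staging, partitioning, and work steps above can be sketched as a small queue-driven pipeline. A minimal Python model of the pattern (all names, and the stand-in run_blast, are hypothetical; a real deployment would drive the actual BLAST executable and the Azure queue and blob APIs):

```python
from collections import deque

def partition_job(user_input, partition_size):
    """Split the user's input sequences into fixed-size partitions (Step 2)."""
    return [user_input[i:i + partition_size]
            for i in range(0, len(user_input), partition_size)]

def run_blast(partition):
    """Stand-in for invoking the BLAST executable on one partition."""
    return [seq.upper() for seq in partition]  # placeholder "alignment"

def process_job(user_input, partition_size=2):
    queue = deque()   # models the Azure queue
    storage = {}      # models Azure blob storage
    # Web role: put one message per partition in the queue (Step 2)
    for n, part in enumerate(partition_job(user_input, partition_size)):
        queue.append((n, part))
    # Worker roles: pull messages, do the work, write output (Step 3)
    while queue:
        n, part = queue.popleft()
        storage[f"output/{n}"] = run_blast(part)
    return storage
```

Calling `process_job(["acgt", "ttga", "ccat"], partition_size=2)` produces one output blob per partition, keyed `output/0`, `output/1`, and so on.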
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has a large performance impact
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Time-space fungibility in the cloud: the same job can spend more workers over less time, or fewer workers over more time.

Workers | Clock duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13
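The time-space trade-off can be checked with a few lines of Python (figures copied from the timing table above): adding workers shrinks the wall clock, and the price paid is lower parallel efficiency.

```python
def to_seconds(hms):
    """Convert an h:mm:ss string from the table to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

runs = {  # workers: (clock duration, computational run time)
    25: ("0:12:00", "1:49:43"),
    16: ("0:15:00", "1:53:47"),
    8:  ("0:26:00", "2:00:14"),
    4:  ("0:47:00", "2:01:06"),
    2:  ("1:27:00", "1:59:13"),
}

for workers, (clock, comp) in runs.items():
    # efficiency = useful computation / total core time paid for
    efficiency = to_seconds(comp) / (workers * to_seconds(clock))
    print(f"{workers:2d} workers: efficiency {efficiency:.0%}")
```

The computational run time stays roughly flat (about two hours) across all configurations, which is what makes the time-space trade freely choosable.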
Utilizes a general jobs-based task manager which registers jobs and their resulting data products.
[Diagram: a job definition fans out into tasks tracked in a registry; a registry broker connects a local registry on the user premises (or internet), where highly sensitive data remains under an (HPC) cluster administrator, with the Azure datacenters; the user gets results back through a web management interface.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example storage namespace: account “jared” holds the containers “images” (blobs PIC01.JPG and PIC02.JPG) and “movies” (blob MOV1.AVI), giving blob URLs such as
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or public accessible
Associate metadata with a container
Metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within the account/container/blob hierarchy (account “jared”, containers “images” and “movies”), a blob such as a 10 GB movie is stored as a sequence of blocks or pages, each block identified by a Block ID.]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
[Diagram: the committed blocks form TheBlob.wmv in Windows Azure Storage.]
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
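The commit semantics described above can be modeled in a few lines (an illustrative in-memory sketch, not the real storage API): blocks uploaded with PutBlock stay invisible until a PutBlockList commits an ordered list of them, and that list may mix committed and uncommitted blocks.

```python
class BlockBlob:
    """Toy model of block-blob commit semantics (not the real API)."""
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, staged by PutBlock
        self.committed = {}     # block id -> bytes, part of the readable blob
        self.block_list = []    # ordered ids that make up the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Blocks may come from the uncommitted or the committed list
        pool = {**self.committed, **self.uncommitted}
        self.committed = {bid: pool[bid] for bid in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        # Readers only ever see the committed block list
        return b"".join(self.committed[bid] for bid in self.block_list)
```

Until put_block_list runs, read() returns nothing; after an update commit that reuses old committed blocks plus one new uncommitted block, readers see the new version atomically.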
Create MyBlob: specify blob size = 10 GBytes, fixed page size = 512 bytes (a 10 GB address space)
Random access operations:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512) and [1536, 2560)
GetBlob [1000, 2048) returns all 0s for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
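The worked example above can be replayed against a toy page-granular model (an illustrative sketch, not the storage service itself): only 512-byte pages written by PutPage and not subsequently cleared count as valid.

```python
PAGE = 512

class PageBlob:
    """Toy page-blob model tracking which 512-byte pages hold valid data."""
    def __init__(self, size):
        self.size = size
        self.valid = set()      # indices of pages containing data
        self.data = {}          # page index -> bytes

    def put_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.valid.add(p)
            self.data[p] = bytes([p % 256]) * PAGE  # stand-in contents

    def clear_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.valid.discard(p)
            self.data.pop(p, None)

    def get_page_ranges(self, start, end):
        """Return the valid byte ranges, merged, within [start, end)."""
        ranges = []
        for p in sorted(self.valid):
            lo, hi = p * PAGE, (p + 1) * PAGE
            if hi <= start or lo >= end:
                continue
            if ranges and ranges[-1][1] == lo:
                ranges[-1] = (ranges[-1][0], hi)
            else:
                ranges.append((lo, hi))
        return ranges

blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges(0, 4096))  # prints [(0, 512), (1536, 2560)]
```

The printed ranges match the slide’s GetPageRange result: the clear removed pages [512, 1536), and [1536, 2048) merged with the later write at [2048, 2560).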
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
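A sketch of the entity model (an in-memory stand-in, not the ADO.NET Data Services client): every entity is a property bag carrying the three required properties, and (PartitionKey, RowKey) is the only indexed access path, which echoes the later best-practice reminder that Azure tables only index on partition and row keys.

```python
import time

class Table:
    """Toy Azure-style table: entities keyed by (PartitionKey, RowKey)."""
    def __init__(self):
        self.entities = {}

    def insert(self, partition_key, row_key, **properties):
        properties.update(PartitionKey=partition_key, RowKey=row_key,
                          Timestamp=time.time())  # the three required properties
        self.entities[(partition_key, row_key)] = properties

    def get(self, partition_key, row_key):
        # Point lookup: the only truly indexed access path
        return self.entities[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # Scanning stays within one partition, so it remains efficient
        return [e for (pk, _), e in self.entities.items() if pk == partition_key]
```

Any filter on other properties would be a full scan in this model, just as it is over the real service.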
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
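The first testing tip, retry logic around every data access, is usually implemented as bounded retries with exponential backoff; a generic sketch (the helper name and parameters are illustrative):

```python
import time

def with_retries(operation, attempts=4, base_delay=0.1):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                          # out of retries: surface the failure
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
```

Wrapping each storage call as `with_retries(lambda: fetch_blob(name))` absorbs the transient faults that are routine at cloud scale while still surfacing persistent ones.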
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology     | Cost in small-sized Data Center | Cost in Large Data Center   | Ratio
Network        | $95 per Mbps/month              | $13 per Mbps/month          | 7.1
Storage        | $2.20 per GB/month              | $0.40 per GB/month          | 5.7
Administration | ~140 servers/administrator      | >1000 servers/administrator | 7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
  DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  HPC: periodic checkpoints, rollback and resume in response to failures, MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
The Azure platform: compute and storage. A closer look at compute:
[Diagram: HTTP requests pass through a load balancer to web roles (IIS hosting ASP.NET, WCF, etc.) and worker roles (a main() { … } loop); each role instance runs in a VM alongside an agent, coordinated by the fabric.]
Using queues for reliable messaging; to scale, add more of either role:
1) The web role (ASP.NET, WCF, etc.) receives work
2) The web role puts the work in the queue
3) A worker role (main() { … }) gets work from the queue
4) The worker role does the work
Queues are the application glue
• Decouple parts of the application, easier to scale independently;
• Resource allocation, different priority queues and backend servers;
• Mask faults in worker roles (reliable messaging).
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
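How a queue masks worker-role faults can be modeled in a few lines (an illustrative in-memory sketch, not the real queue API): a retrieved message is hidden rather than deleted, and reappears if the worker never confirms completion.

```python
class ReliableQueue:
    """Toy model of queue-based reliable messaging: a message taken by a
    worker is hidden, not deleted, and becomes visible again if the
    worker crashes before confirming completion."""
    def __init__(self):
        self.visible = []        # messages available to workers
        self.in_flight = {}      # receipt -> message being processed
        self.next_receipt = 0

    def put(self, message):
        self.visible.append(message)

    def get(self):
        # Hide the message while a worker processes it
        message = self.visible.pop(0)
        self.next_receipt += 1
        self.in_flight[self.next_receipt] = message
        return self.next_receipt, message

    def delete(self, receipt):
        # Worker finished: the message is gone for good
        del self.in_flight[receipt]

    def requeue_expired(self):
        # A crashed worker never called delete: show its messages again
        self.visible.extend(self.in_flight.values())
        self.in_flight.clear()
```

A worker that completes its task calls delete; one that dies mid-task simply never does, and the next worker picks the message up after requeue.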
A closer look at storage:
[Diagram: applications reach blobs, drives, tables, and queues over HTTP through a load balancer and a REST API; storage sits alongside the compute fabric.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
The developer experience, from desktop to cloud:
[Diagram: develop your app at work or home against the local development fabric and development storage, with versions kept in source control; the application works locally, then works in staging, then runs in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns the number required by each role
Example: 10 front-ends, across 5 update domains
Allocation is across update domains
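A hypothetical sketch of how “allocation is across fault domains and update domains” can work: stripe the instances round-robin over both, so neither a single rack failure nor one step of a rolling update takes out more than its share of a role.

```python
def allocate(instances, fault_domains, update_domains):
    """Round-robin each role instance onto a (fault domain, update domain)
    pair, spreading instances evenly across both kinds of domain."""
    return [(i % fault_domains, i % update_domains) for i in range(instances)]

# The slide's example: 10 front-ends across 5 update domains
# (the 2 fault domains here are an assumed figure for illustration)
placement = allocate(10, fault_domains=2, update_domains=5)
```

With this placement, taking one update domain offline touches only 2 of the 10 front-ends at a time, and losing one fault domain leaves half the instances running.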
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
Slide 53
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Azure Platform: Compute and Storage. A closer look at Storage:
[Diagram: applications, and compute roles running on the Fabric, reach Azure Storage (Blobs, Drives, Tables, Queues) through a load balancer via a REST API over HTTP.]
Points of interest
Storage types
Blobs: a simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational, but entities that contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by Windows Azure apps and by other on-premises or cloud applications
[Diagram: the development workflow: develop your app (at work or at home) against the local Development Fabric and Development Storage, with source control for versioning; run and verify that the application works locally, then in staging, then in the cloud.]
What is the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free resources
Maintains the health of those applications
Maintains the health of the hardware
Manages the service life cycle starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up while undergoing an update
Update domains are the unit of software/configuration update (example: a set of nodes to update), used when rolling forward or backward
The developer assigns the number required by each role (example: 10 front-ends across 5 update domains)
Allocation is across update domains
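The allocation rule can be illustrated with a toy round-robin placement; this is a sketch of the idea, not the Fabric Controller's actual algorithm.

```python
def allocate(instances, fault_domains, update_domains):
    """Spread role instances across fault and update domains round-robin.

    Toy illustration of the allocation idea, not the real Fabric
    Controller algorithm.
    """
    placement = []
    for i in range(instances):
        placement.append({
            "instance": i,
            "fault_domain": i % fault_domains,
            "update_domain": i % update_domains,
        })
    return placement

# Example from the slide: 10 front-ends across 5 update domains.
plan = allocate(10, fault_domains=2, update_domains=5)

# Each update domain gets 10/5 = 2 instances, so rolling an upgrade
# one update domain at a time takes down only 20% of the role.
per_ud = {}
for p in plan:
    per_ud.setdefault(p["update_domain"], []).append(p["instance"])
print(per_ud)  # {0: [0, 5], 1: [1, 6], 2: [2, 7], 3: [3, 8], 4: [4, 9]}
```

The same spreading across fault domains means no single rack or switch failure can take out every instance of a role.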
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate that it is unhealthy
The current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra (“EOS AM”), launched 12/1999; descending, equator crossing at 10:30 AM
Aqua (“EOS PM”), launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Diagram: the AzureMODIS service web role portal drives a pipeline of stages: a data collection stage (download queue), a reprojection stage, a derivation reduction stage, and an analysis reduction stage, producing research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers: 100s of HIV and HepC researchers actively use it; 1000s of research communities rely on its results
Cover of PLoS Biology, November 2008
Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
- Requires a large number of test runs for a given job (1 – 10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure Storage
3. Deploy worker roles (with the BLAST executable); the Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
[Diagram: user input goes to the Web Role, which writes input partitions to Azure Storage and sends a queue message to a single partitioning Worker Role.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.]
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- Little cloud development headaches are probably worth it
Resources vs. time

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13

Time-space fungibility in the cloud: the same total computation can trade more workers for less wall-clock time.
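The time-space trade-off in these measurements can be quantified: converting the clock durations to seconds shows wall-clock time shrinking as workers are added while total computation stays nearly constant. (Plain Python over the numbers above; the 2-worker run is taken as the baseline.)

```python
def to_seconds(hms):
    """Parse an H:MM:SS duration into seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# (workers, clock duration) pairs from the table above
runs = [(25, "0:12:00"), (16, "0:15:00"), (8, "0:26:00"),
        (4, "0:47:00"), (2, "1:27:00")]

base_workers, base_clock = runs[-1]   # 2 workers as the baseline
base = to_seconds(base_clock)

for workers, clock in runs:
    secs = to_seconds(clock)
    speedup = base / secs
    # parallel efficiency relative to the 2-worker baseline
    efficiency = speedup / (workers / base_workers)
    print(f"{workers:>2} workers: {secs:>5}s  "
          f"speedup {speedup:.1f}x  efficiency {efficiency:.0%}")
```

The declining efficiency at 25 workers reflects the fixed partitioning and queueing overhead noted in the lessons above: factoring work into optimal sizes matters.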
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: on the user premises (or internet), a user with highly sensitive data works with a local registry, a web management front-end, and an (HPC) cluster run by an administrator; a registry broker in the Azure datacenters takes a job definition, fans it out into tasks, and returns data products and results.]
Client Visualization / Cloud Data and Computation
The Cloud is not a jack-of-all-trades. Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal, then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user.
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported storage client library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose the geo-location to host the storage account: “US Anywhere”, “US North Central”, “US South Central”,
Can co-locate a storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Diagram: the blob namespace hierarchy: Account (jared) contains Containers (images, movies); “images” holds blobs PIC01.JPG and PIC02.JPG, “movies” holds MOV1.AVI. Example URL:]
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level: private or publicly accessible
Associate metadata with a container: up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks; each block is identified by a Block ID
Size limit: 200 GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages; each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob
[Diagram: the namespace extends inside each blob: Account (jared) contains Containers (images, movies) containing Blobs (PIC01.JPG, PIC02.JPG, MOV1.AVI), and each blob is composed of blocks or pages, Block ID 1 through Block ID N.]
Uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each; each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operations
PutBlock
Puts an uncommitted block, identified by its block ID, for the blob
Block list operations
PutBlockList
Provides the list of blocks that comprise the readable version of the blob
Can use blocks from the uncommitted or committed list to update a blob
GetBlockList
Returns the list of blocks, committed or uncommitted, for a blob
The block ID and size of each block are returned
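The two-phase semantics (upload uncommitted blocks, then commit a block list) can be modeled in a few lines. This is a toy in-memory model of the behavior described above, not the storage service API:

```python
class BlockBlob:
    """Toy in-memory model of block-blob commit semantics."""
    def __init__(self):
        self.uncommitted = {}   # block ID -> bytes, not yet readable
        self.committed = {}     # block ID -> bytes, part of the blob
        self.block_list = []    # ordered IDs of the readable version

    def put_block(self, block_id, data):
        # PutBlock: stage a block; readers do not see it yet
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # PutBlockList: atomically define the readable version.
        # IDs may come from the uncommitted or the committed list.
        new_committed = {}
        for bid in block_ids:
            if bid in self.uncommitted:
                new_committed[bid] = self.uncommitted[bid]
            else:
                new_committed[bid] = self.committed[bid]
        self.committed = new_committed
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""            # nothing committed yet
blob.put_block_list(["b1", "b2"])
print(blob.read())                   # b'hello world'
```

The key property, which the real service also provides, is that readers never see a half-uploaded blob: only PutBlockList changes the readable version.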
Create MyBlob
Specify blob size = 10 GB; fixed page size = 512 bytes
Random-access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns:
All 0 for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)
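The example above can be checked with a small model of page-blob semantics. This is a sketch only: a sparse dict of 512-byte pages, where unwritten or cleared pages read as zeros, not the real service.

```python
PAGE = 512

class PageBlob:
    """Toy model: a sparse dict of written 512-byte pages."""
    def __init__(self, size):
        self.size = size
        self.pages = {}  # page offset -> bytes

    def put_page(self, start, end, fill=b"\x01"):
        # Write every 512-byte page in [start, end)
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        # Cleared pages revert to reading as zeros
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self, start, end):
        # Coalesce consecutive written pages into [start, end) ranges
        ranges = []
        for off in range(start, end, PAGE):
            if off in self.pages:
                if ranges and ranges[-1][1] == off:
                    ranges[-1][1] = off + PAGE
                else:
                    ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

# Replay the slide's sequence of operations
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_ranges(0, 4096))  # [(0, 512), (1536, 2560)]
```

Replaying the four operations reproduces exactly the valid ranges the slide lists, [0, 512) and [1536, 2560).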
Block Blob
Targeted at streaming workloads
Update semantics: upload a set of blocks, then commit the change
Concurrency: ETag checks
Page Blob
Targeted at random read/write workloads
Update semantics: immediate update
Concurrency: leases
Snapshots
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1 TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for read/write
Remote access via the Page Blob interface:
Can upload the VHD to its Page Blob using the blob interface, and then mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively scalable tables: billions of entities (rows) and TBs of data; can use thousands of servers as traffic grows
Highly available and durable: data is replicated several times
Familiar and easy-to-use API: ADO.NET Data Services (.NET 3.5 SP1), .NET classes and LINQ, or REST with any platform or language
Table
A storage account can create many tables
Table names are scoped by the account
A table is a set of entities (i.e., rows)
Entity
A set of properties (columns)
Required properties: PartitionKey, RowKey, and Timestamp
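The table addressing scheme, every entity keyed by PartitionKey plus RowKey, can be sketched as a nested dictionary. This is illustrative only, not the table service API; the entity names are invented for the example.

```python
class ToyTable:
    """Entities addressed by (PartitionKey, RowKey), as in Azure tables.

    Toy in-memory sketch: a real table also stamps each entity with a
    Timestamp, and partitions may live on different servers for scale.
    """
    def __init__(self):
        self.partitions = {}  # PartitionKey -> {RowKey -> entity}

    def insert(self, partition_key, row_key, **properties):
        entity = {"PartitionKey": partition_key, "RowKey": row_key,
                  **properties}
        self.partitions.setdefault(partition_key, {})[row_key] = entity
        return entity

    def get(self, partition_key, row_key):
        # Point lookup on both keys: the only indexed access path
        return self.partitions[partition_key][row_key]

    def query_partition(self, partition_key):
        # Scanning one partition is cheap; scanning all of them is not
        return list(self.partitions.get(partition_key, {}).values())

t = ToyTable()
t.insert("blast-job-42", "task-0001", status="done", hits=17)
t.insert("blast-job-42", "task-0002", status="running")
print(t.get("blast-job-42", "task-0001")["status"])   # done
print(len(t.query_partition("blast-job-42")))         # 2
```

This is why the best-practices slide below stresses that tables only index on partition and row keys: any query that cannot name a partition degenerates into a scan.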
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure applications
Azure Storage
• Remember that Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read-only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
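The "include retry logic" advice can be sketched as a generic retry-with-backoff wrapper. This is a common pattern written from scratch for illustration, not a specific Azure SDK helper; the flaky_read function simulates a transient storage failure.

```python
import time

def with_retries(operation, attempts=4, base_delay=0.1):
    """Run operation(), retrying transient failures with exponential backoff.

    Generic sketch of the 'retry all data access' advice; a real service
    client would also distinguish retryable errors (timeouts, HTTP 503)
    from permanent ones.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise               # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky storage call: fails twice, then succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "blob contents"

print(with_retries(flaky_read))  # blob contents
```

Exponential backoff matters at scale: thousands of workers retrying in lockstep after a hiccup would otherwise hammer the storage service at exactly the wrong moment.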
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research communities worldwide, and back up this offering with a technical engagements team. Lower the barrier to entry through tutorials, accelerators, and developer best practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers, and what are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about the data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand,
without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop:
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards
Coupled with access to a research-oriented technical team
Azure resource offering:
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier-one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters.
Data centers range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small data center (1,000 servers) and a larger, 100K-server center:

Technology      Cost in small data center    Cost in large data center    Ratio
Network         $95 per Mbps/month           $13 per Mbps/month           7.1
Storage         $2.20 per GB/month           $0.40 per GB/month           5.7
Administration  ~140 servers/administrator   >1000 servers/administrator  7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers and complex cooling systems all separately is not efficient.
Package and deploy into bigger units, JITD.
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
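For a rough sense of scale — illustrative arithmetic only, not part of the offering's stated terms — 20 million core hours per year works out to roughly 2,300 cores running around the clock:

```python
CORE_HOURS_PER_YEAR = 20_000_000
HOURS_PER_YEAR = 24 * 365  # 8,760

# Equivalent number of cores if the allocation were burned continuously.
continuous_cores = CORE_HOURS_PER_YEAR / HOURS_PER_YEAR
print(round(continuous_cores))  # 2283
```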
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 56
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
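The PutBlock/PutBlockList semantics above can be sketched as a toy model (this is an illustrative simulation, not the real storage API; the class and method names are hypothetical):

```python
class BlockBlob:
    """Toy model of block blob commit semantics: PutBlock stages
    uncommitted blocks; PutBlockList selects which blocks (staged
    or already committed) form the readable version of the blob."""

    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged but not yet visible
        self.committed = {}     # block_id -> bytes, part of the readable blob
        self.block_list = []    # ordered committed block IDs

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Each block may come from the uncommitted or the committed list
        new_committed = {bid: self.uncommitted.get(bid, self.committed.get(bid))
                         for bid in block_ids}
        self.committed, self.block_list = new_committed, list(block_ids)
        self.uncommitted = {}   # unreferenced staged blocks are discarded

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

b = BlockBlob()
b.put_block("id1", b"aa")
b.put_block("id2", b"bb")
b.put_block_list(["id1", "id2"])   # commit: blob now readable as b"aabb"
b.put_block("id3", b"cc")          # staged only; readers still see b"aabb"
b.put_block_list(["id1", "id3"])   # update reuses committed "id1" plus new "id3"
print(b.read())                    # b'aacc'
```

Note how an update can mix already-committed blocks with newly staged ones, which is what makes in-place edits of large streaming blobs cheap.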
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes, then the 512 bytes of data stored in [1536,2048)
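The valid-range bookkeeping in that example can be simulated with a few lines of Python (a sketch, not the service's implementation; it assumes page-aligned offsets):

```python
def apply_ops(ops, page=512):
    """Track valid (written) page ranges for a toy page blob.
    ops: list of ("put" | "clear", start, end) with page-aligned,
    half-open [start, end) byte offsets."""
    valid = set()                      # offsets of pages holding data
    for op, start, end in ops:
        pages = set(range(start, end, page))
        valid = valid | pages if op == "put" else valid - pages
    # Coalesce page offsets into contiguous [start, end) ranges
    ranges, run = [], None
    for off in sorted(valid):
        if run and off == run[1]:
            run[1] = off + page        # extend the current run
        else:
            run = [off, off + page]    # start a new run
            ranges.append(run)
    return [tuple(r) for r in ranges]

ops = [("put", 512, 2048), ("put", 0, 1024),
       ("clear", 512, 1536), ("put", 2048, 2560)]
print(apply_ops(ops))   # [(0, 512), (1536, 2560)]
```

Replaying the four operations from the slide reproduces exactly the valid ranges that GetPageRange reports.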
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS
single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
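Since tables index only on PartitionKey and RowKey, an entity lookup is conceptually a two-level dictionary. A toy model of that data shape (illustrative only; the class name and sample values are made up):

```python
from collections import defaultdict

class ToyTable:
    """Toy model of an Azure-style table: entities are property
    dicts, addressed (and indexed) only by (PartitionKey, RowKey)."""

    def __init__(self):
        # PartitionKey -> {RowKey -> entity}; a partition is the
        # unit the service can place on a single server for scale.
        self.partitions = defaultdict(dict)

    def insert(self, entity):
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        self.partitions[pk][rk] = entity

    def get(self, pk, rk):
        return self.partitions[pk][rk]

t = ToyTable()
t.insert({"PartitionKey": "job42", "RowKey": "task-001", "Status": "done"})
print(t.get("job42", "task-001")["Status"])   # done
```

Any query that is not a point lookup on these two keys degenerates into a scan, which is why the best-practice list below stresses choosing keys carefully.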
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
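The first item above, retry logic on every data access, typically means exponential backoff around each storage call. A minimal sketch (the helper name and delays are illustrative, not an Azure API):

```python
import time

def with_retries(op, attempts=3, base_delay=0.01):
    """Run op(), retrying with exponential backoff on failure.
    Transient faults are expected in the cloud, so every data
    access should be wrapped in logic like this."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise                      # out of attempts: surface the error
            time.sleep(base_delay * (2 ** i))

calls = {"n": 0}
def flaky():
    """Simulated storage call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient fault")
    return "ok"

print(with_retries(flaky))   # ok (after two retried failures)
```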
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 59
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Data centers range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small data center (~1,000 servers) and a larger, 100K-server data center:

Technology       Cost in small data center    Cost in large data center    Ratio
Network          $95 per Mbps/month           $13 per Mbps/month           7.1
Storage          $2.20 per GB/month           $0.40 per GB/month           5.7
Administration   ~140 servers/administrator   >1000 servers/administrator  7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers & complex cooling systems all separately is not efficient. Package and deploy into bigger units, JITD.
Comparing HPC systems and data centers (DC), dimension by dimension:
o Node and system architectures
  - Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch small or non-existent, secondary is SAN or PFS, PB tertiary storage
  - DC: TB local storage, secondary is JBOD, tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform: Compute and Storage
A closer look at compute: [Diagram: HTTP traffic passes through a load balancer to web roles (IIS hosting ASP.NET, WCF, etc.) and worker roles (main() { … }); each role instance runs in a VM alongside an agent, on top of the fabric.]
Using queues for reliable messaging (to scale, add more of either role):
1) The web role (ASP.NET, WCF, etc.) receives work
2) The web role puts work in the queue
3) A worker role (main() { … }) gets work from the queue
4) The worker role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
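The reliable-messaging behavior described above (a dequeued message is hidden, not removed, and reappears if the worker dies before deleting it) can be sketched with a toy in-memory queue. This is a minimal illustration of the pattern, not the Azure Queue API; the class and timeout value are hypothetical.

```python
import time

class ToyQueue:
    """Toy queue with a visibility timeout, sketching the get/delete
    pattern: a dequeued message is hidden rather than removed, and it
    reappears if the worker never deletes it (reliable messaging)."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = []          # list of [msg_id, body, invisible_until]
        self._next_id = 0

    def put(self, body):
        self._messages.append([self._next_id, body, 0.0])
        self._next_id += 1

    def get(self, now=None):
        """Return (msg_id, body) of the first visible message, hiding it
        for visibility_timeout seconds; None if nothing is visible."""
        now = time.time() if now is None else now
        for m in self._messages:
            if m[2] <= now:
                m[2] = now + self.visibility_timeout
                return m[0], m[1]
        return None

    def delete(self, msg_id):
        """Called only after the work succeeded; removes the message."""
        self._messages = [m for m in self._messages if m[0] != msg_id]

# A worker that crashes before delete() loses nothing: the message
# becomes visible again after the timeout and another worker retries it.
q = ToyQueue(visibility_timeout=30.0)
q.put("analyze partition 7")
msg = q.get(now=0.0)           # hidden until t=30
assert q.get(now=1.0) is None  # invisible to other workers meanwhile
assert q.get(now=31.0) == msg  # worker died: message reappears
q.delete(msg[0])               # normal path: delete after success
assert q.get(now=100.0) is None
```

This is exactly why queues mask faults in worker roles: nothing is lost if a role dies mid-task.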
A closer look at storage: [Diagram: applications reach Blobs, Drives, Tables, and Queues through a REST API over HTTP, behind a load balancer, on the storage side of the platform alongside Compute and the Fabric.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Develop and run locally, then deploy: [Diagram: at work or home, develop your app against the Development Fabric and Development Storage, with source/version control; the application works locally, then works in staging, then runs in the cloud.]
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
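The slide's example (10 front-ends across 5 update domains) amounts to round-robin placement. A minimal sketch, with a hypothetical instance-naming scheme:

```python
def allocate(instance_count, update_domains):
    """Round-robin instance placement across update domains, so taking
    any one domain down for an update stops only a fraction of the role."""
    placement = {d: [] for d in range(update_domains)}
    for i in range(instance_count):
        placement[i % update_domains].append(f"frontend-{i}")
    return placement

# The slide's example: 10 front-ends across 5 update domains
placement = allocate(10, 5)
assert all(len(v) == 2 for v in placement.values())  # 2 instances per domain
# Rolling one update domain forward touches only 2 of the 10 instances.
```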
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
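At heart, the behavior above is a goal-state reconciliation loop: compare the desired instance count against observed health, then restart or reallocate. The sketch below is a hypothetical simplification, not the actual Fabric Controller logic.

```python
def reconcile(desired_count, instances):
    """One pass of a goal-state loop: restart faulted role instances and
    allocate replacements until the actual state matches the goal.
    `instances` maps instance name -> observed health state."""
    actions = []
    for name, state in instances.items():
        if state == "faulted":
            actions.append(("restart", name))
    # Instances lost with their node are simply missing from the map.
    for n in range(desired_count - len(instances)):
        actions.append(("allocate", f"replacement-{n}"))
    return actions

# Goal: 3 instances. One faulted; one vanished with its node entirely.
actions = reconcile(3, {"role-0": "healthy", "role-1": "faulted"})
assert ("restart", "role-1") in actions
assert ("allocate", "replacement-0") in actions
```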
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  - 100’s of HIV and HepC researchers actively use it
  - 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job, 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
  – Requires a large number of test runs for a given job (1 – 10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database is compressed)
2. Upload to Azure Store (the compressed data is uploaded to Azure Storage)
3. Deploy Worker Roles (the BLAST executable is deployed)
   - Each role's Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
[Diagram: the web role takes the user input; a single partitioning worker role writes input partitions to Azure Storage and enqueues one queue message per partition.]
Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pick up queue messages, read their input partitions from Azure Storage, and write the BLAST output and logs back to Azure Storage.]
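The partitioning step can be sketched as follows: split the input into fixed-size partitions, stage each partition, and enqueue one message per partition. The job and blob names here are hypothetical, and the upload call is stubbed out.

```python
import json

def partition_job(sequences, partition_size):
    """Split a job's input sequences into fixed-size partitions and
    build one queue message per partition (the step-2 pattern above)."""
    partitions = [sequences[i:i + partition_size]
                  for i in range(0, len(sequences), partition_size)]
    messages = []
    for n, part in enumerate(partitions):
        blob_name = f"job-42/partition-{n}"   # hypothetical naming scheme
        # upload_blob(blob_name, part) would stage the partition here
        messages.append(json.dumps({"partition": blob_name,
                                    "count": len(part)}))
    return messages

msgs = partition_job([f"seq-{i}" for i in range(10)], partition_size=4)
assert len(msgs) == 3                     # partitions of 4, 4, and 2
assert json.loads(msgs[2])["count"] == 2
```

Choosing `partition_size` is the "factoring work into optimal sizes" decision discussed in the lessons below.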
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - Little Cloud development headaches are probably worth it
Resources vs. time for the same job:

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
[Chart: time-space fungibility in the cloud; the same aggregate resources can be spent as many workers for a short time or as few workers for a long time.]
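The fungibility can be quantified from the run-time table: parallel efficiency is the computational run time divided by (workers × wall-clock duration). A small sketch, using two rows of the table:

```python
def seconds(hms):
    """Parse an h:mm:ss string from the table into seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# (workers, wall-clock duration, computational run time) from the table
runs = [(25, "0:12:00", "1:49:43"), (2, "1:27:00", "1:59:13")]

efficiencies = {}
for workers, clock, comp in runs:
    efficiencies[workers] = seconds(comp) / (workers * seconds(clock))

# 2 workers use their cores ~69% efficiently; 25 workers only ~37%,
# but finish roughly seven times sooner: time and space are fungible.
assert round(efficiencies[25], 2) == 0.37
assert round(efficiencies[2], 2) == 0.69
```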
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks on an (HPC) cluster; a registry broker connects the user's local registry and web management, on the user premises (or internet), to the registry and data products in the Azure datacenters; highly sensitive data stays with the administrator on-premises; results flow back to the user.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then: make the best use of the capabilities of client and cloud computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose the geo-location to host the storage account (e.g. “US Anywhere”, “US North Central”, “US South Central”)
Can co-locate the storage account with a compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Blob namespace: an account holds containers, which hold blobs. Example: account “jared” has containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI), addressed as:
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate metadata with a container
Metadata is up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: each blob within an account’s containers is stored as a sequence of blocks or pages: Block/Page 1, Block/Page 2, Block/Page 3, …, identified by Block Id 1 through Block Id N.]
Example: uploading a 10 GB movie as a block blob:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);

The committed result is TheBlob.wmv in Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
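The PutBlock/PutBlockList flow above (stage uncommitted blocks, then commit an ordered list) can be modeled against a toy store. This is a sketch of the semantics, not the real REST API; the base64 block-ID convention is an assumption for illustration.

```python
import base64

BLOCK_LIMIT = 4 * 1024 * 1024   # blocks can be up to 4 MB each

class ToyBlockBlob:
    """Toy model of block-blob semantics: put_block stages uncommitted
    blocks; put_block_list commits an ordering as the readable blob."""
    def __init__(self):
        self.uncommitted = {}
        self.committed = []         # ordered list of committed block IDs

    def put_block(self, block_id, data):
        assert len(data) <= BLOCK_LIMIT
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        self.committed = list(block_ids)

    def read(self):
        return b"".join(self.uncommitted[b] for b in self.committed)

def upload(blob, payload, block_size):
    """Chunk a payload into blocks and commit them in order."""
    ids = []
    for n in range(0, len(payload), block_size):
        # Block IDs are opaque strings scoped to the blob; base64 of a
        # zero-padded index is one common convention (an assumption here)
        block_id = base64.b64encode(f"{n:08d}".encode()).decode()
        blob.put_block(block_id, payload[n:n + block_size])
        ids.append(block_id)
    blob.put_block_list(ids)    # nothing is readable until this commit

blob = ToyBlockBlob()
upload(blob, b"x" * 10, block_size=4)   # 3 blocks: 4 + 4 + 2 bytes
assert blob.read() == b"x" * 10
```

Because readers only ever see the committed list, a failed upload leaves the blob's previous contents intact.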
Create MyBlob
Specify blob size = 10 GBytes, fixed page size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0,512), [1536,2560)
GetBlob [1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
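Replaying the sequence of operations above against a toy page store shows why GetPageRange reports exactly [0,512) and [1536,2560). A minimal sketch of the semantics, not the actual service:

```python
PAGE = 512

class ToyPageBlob:
    """Toy page blob: tracks which 512-byte pages hold valid data."""
    def __init__(self, size):
        self.size = size
        self.pages = {}                      # page offset -> page bytes

    def put_page(self, start, end, fill=b"\x01"):
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)        # cleared pages read as zeros

    def get_page_range(self, start, end):
        """Merge contiguous valid pages into [start, end) ranges."""
        ranges = []
        for off in range(start, min(end, self.size), PAGE):
            if off in self.pages:
                if ranges and ranges[-1][1] == off:
                    ranges[-1][1] = off + PAGE
                else:
                    ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

b = ToyPageBlob(10 * 2**30)
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
assert b.get_page_range(0, 4096) == [(0, 512), (1536, 2560)]
```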
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
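The block-blob concurrency model (ETag checks) is optimistic: read the blob with its ETag, and write back only if the ETag still matches. The sketch below uses an integer counter as a stand-in for the opaque ETag strings a real service returns.

```python
class ETagConflict(Exception):
    pass

class ToyBlobStore:
    """Sketch of ETag-checked writes: every successful write bumps the
    ETag, and a conditional write fails if the blob changed underneath."""
    def __init__(self):
        self.data, self.etag = None, 0

    def read(self):
        return self.data, self.etag

    def write_if_match(self, data, expected_etag):
        if expected_etag != self.etag:
            raise ETagConflict("blob was modified since it was read")
        self.data = data
        self.etag += 1
        return self.etag

store = ToyBlobStore()
_, tag = store.read()
tag = store.write_if_match(b"v1", tag)       # succeeds, ETag bumps
store.write_if_match(b"other writer", tag)   # a second writer gets in
try:
    store.write_if_match(b"stale v2", tag)   # first writer's tag is stale
except ETagConflict:
    pass                                     # lost the race: re-read, retry
```

Page blobs use leases instead: a writer holds an exclusive lease for a period rather than checking versions after the fact.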
Snapshots: all writes are applied to the base blob name; only delta changes are maintained across snapshots; restore to a prior version via snapshot promotion; use ListBlobs to enumerate the snapshots for a blob.
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
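The entity model above (property bags addressed by PartitionKey and RowKey) can be sketched with a dictionary keyed on that compound key. A toy illustration of the semantics, not the table service API:

```python
class ToyTable:
    """Sketch of table semantics: entities are property dicts addressed
    by (PartitionKey, RowKey); only that compound key is indexed."""
    def __init__(self):
        self.entities = {}

    def insert(self, entity):
        key = (entity["PartitionKey"], entity["RowKey"])
        self.entities[key] = entity

    def get(self, partition_key, row_key):
        """Point lookup on the indexed keys: fast."""
        return self.entities[(partition_key, row_key)]

    def query(self, predicate):
        """Filtering on any other property is a scan, which is why the
        best practices below say to design around the two keys."""
        return [e for e in self.entities.values() if predicate(e)]

t = ToyTable()
t.insert({"PartitionKey": "jobs-2010", "RowKey": "0042",
          "Status": "done", "CpuHours": 14})
assert t.get("jobs-2010", "0042")["Status"] == "done"
assert t.query(lambda e: e["CpuHours"] > 10)[0]["RowKey"] == "0042"
```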
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
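"Execute a task only once" matters because queue messages are delivered at least once: a message can reappear if a worker dies or runs slowly. A minimal idempotency sketch, where the completed-set is a stand-in for a durable marker (for example a table entity or blob):

```python
def process(task_id, completed, run_task):
    """At-least-once delivery means repeats happen, so the worker
    records completed task IDs and skips any redelivered message."""
    if task_id in completed:
        return "skipped"
    run_task(task_id)
    completed.add(task_id)      # durably record success before deleting
    return "ran"                # the queue message in a real service

done, log = set(), []
assert process("t1", done, log.append) == "ran"
assert process("t1", done, log.append) == "skipped"  # redelivered message
assert log == ["t1"]
```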
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
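Batching small tasks into one queue message, as suggested above, is a few lines: each message is one billable storage transaction, so packing tiny tasks together cuts transaction counts and per-message latency. A sketch with hypothetical task names (message bodies must still fit the queue's size limit):

```python
import json

def batch_tasks(task_ids, batch_size):
    """Pack several small tasks into each queue message so one dequeue
    yields a batch of work instead of a single tiny task."""
    return [json.dumps({"tasks": task_ids[i:i + batch_size]})
            for i in range(0, len(task_ids), batch_size)]

messages = batch_tasks([f"tile-{n}" for n in range(100)], batch_size=32)
assert len(messages) == 4          # batches of 32 + 32 + 32 + 4 tasks
assert json.loads(messages[-1])["tasks"] == ["tile-96", "tile-97",
                                             "tile-98", "tile-99"]
```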
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
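The retry-logic advice above usually takes the form of a wrapper with exponential backoff, since transient faults (timeouts, throttling) are expected at scale. A generic sketch, not tied to any particular client library:

```python
import time

def with_retries(operation, attempts=4, base_delay=0.5):
    """Retry a data-access call with exponential backoff, surfacing the
    fault only after the last attempt fails."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise               # out of retries: surface the fault
            time.sleep(base_delay * 2 ** attempt)

calls = []
def flaky():
    """Simulated storage call that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise IOError("transient storage fault")
    return "ok"

assert with_retries(flaky, base_delay=0.0) == "ok"
assert len(calls) == 3
```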
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 60
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform Compute: a closer look
[Diagram: HTTP requests arrive through a load balancer. A Web Role runs IIS hosting ASP.NET, WCF, etc.; a Worker Role runs application code (main() { … }). Each role instance runs in its own VM with an Agent, on the Fabric.]
Using queues for reliable messaging
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
To scale, add more of either role.
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
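The four-step queue pattern above can be sketched with a plain in-process queue standing in for an Azure queue (a minimal simulation; the role names and doubling "work" are illustrative only, not the Azure SDK):

```python
# Sketch of the web-role/worker-role queue pattern: producers enqueue,
# workers dequeue and process, and scaling means adding more of either.
import queue
import threading

work_queue = queue.Queue()
results = []

def web_role(items):
    for item in items:            # 1) receive work
        work_queue.put(item)      # 2) put work in the queue

def worker_role():
    while True:
        item = work_queue.get()   # 3) get work from the queue
        if item is None:          # stop signal for shutdown
            break
        results.append(item * 2)  # 4) do the work
        work_queue.task_done()

workers = [threading.Thread(target=worker_role) for _ in range(4)]
for w in workers:
    w.start()
web_role(range(10))
work_queue.join()                 # wait until every item is processed
for _ in workers:                 # one stop signal per worker
    work_queue.put(None)
for w in workers:
    w.join()
print(sorted(results))            # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because the queue decouples the two roles, either side can be scaled independently, which is exactly the point the slide makes.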
Azure Platform Storage: a closer look
[Diagram: applications, whether on Azure compute or elsewhere, reach Blobs, Drives, Tables, and Queues through a load balancer via an HTTP REST API.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational, entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Develop at work or at home
[Diagram: develop your app against the local Development Fabric and Development Storage, with source control and versioning; verify the application works locally, then in staging, then in the cloud.]
What’s the ‘value add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
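The example above (10 front-ends across 5 update domains) can be sketched as a round-robin allocation (a minimal model of the idea, not the Fabric Controller's actual algorithm):

```python
# Sketch of allocating role instances across update domains round-robin,
# so taking one domain down for an update leaves the rest serving traffic.
def allocate(instances, domains):
    """Map each instance index to an update domain."""
    return {i: i % domains for i in range(instances)}

placement = allocate(10, 5)

per_domain = {}
for inst, dom in placement.items():
    per_domain.setdefault(dom, []).append(inst)

# Each of the 5 domains gets 2 of the 10 front-ends: updating one domain
# at a time keeps 8 of 10 instances up throughout the rolling update.
print({d: len(v) for d, v in sorted(per_domain.items())})  # {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}
```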
Push-button Deployment
Step 1: Allocate nodes, across fault domains and across update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
Failed roles are automatically restarted
Node failure results in new nodes being automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
The state machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a new node
A suitable replacement location is found
Existing role instances are notified of the change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: AzureMODIS pipeline. A service web role portal feeds a download queue; data flows through the data collection stage, reprojection stage, derivation reduction stage, and analysis reduction stage to produce research results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
– 100’s of HIV and HepC researchers actively use it
– 1000’s of research communities rely on results
• Cover of PLoS Biology, November 2008
• Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
– Requires a large number of test runs for a given job (1–10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to the Azure store
3. Deploy worker roles, each carrying the BLAST executable; the Init() function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
[Diagram: the web role takes the user input, and a single partitioning worker role writes input partitions to Azure storage, placing one queue message per partition.]
Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pick up queue messages, read their input partition from Azure storage, run BLAST, and write the BLAST output and logs back to Azure storage.]
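The partitioning step can be sketched as splitting the input into fixed-size chunks, one queue message per chunk (an illustrative stand-in; the partition size and message shape are assumptions, not AzureBLAST's actual format):

```python
# Sketch of the partitioning worker role: split user input into partitions
# and emit one (index, partition) message per chunk for the queue.
def partition(sequences, partition_size):
    """Yield (partition_index, slice_of_work) queue messages."""
    for i in range(0, len(sequences), partition_size):
        yield (i // partition_size, sequences[i:i + partition_size])

messages = list(partition(list(range(10)), 4))
print(messages)   # [(0, [0, 1, 2, 3]), (1, [4, 5, 6, 7]), (2, [8, 9])]
```

As the lessons below note, choosing the partition size well has a large performance impact: too small and queue overhead dominates, too large and a single failure wastes hours of work.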
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Scaling a run across worker counts:

Workers                 25        16        8         4         2
Clock duration          0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time          2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time  1:49:43   1:53:47   2:00:14   2:01:06   1:59:13

Time-space fungibility in the Cloud: the same total computation can trade more resources for less elapsed time.
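The scaling numbers above can be checked with a short script, assuming the clock-duration column is wall-clock time; speedup here is measured relative to the 2-worker run:

```python
# Compute wall-clock speedup from the worker-count scaling table.
def minutes(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 60 + m + s / 60

clock = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}
baseline = minutes(clock[2])
speedup = {w: round(baseline / minutes(t), 2) for w, t in clock.items()}
print(speedup)   # {25: 7.25, 16: 5.8, 8: 3.35, 4: 1.85, 2: 1.0}
```

Note that the computational run time stays nearly flat (~2 hours) at every scale, which is the time-space fungibility point: adding workers buys elapsed time, not less total computation.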
Utilizes a general jobs-based task manager which registers jobs and their resulting data.
[Diagram: a job definition fans out into tasks whose data products are registered. A registry broker in the Azure datacenters mediates between the user premises (or internet), where the user, administrator, (HPC) cluster, local registry, and highly sensitive data remain, and web management of results.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for:
• Applications using peripheral devices
• Applications with heavy graphics requirements
• Legacy user interfaces that would be difficult to port
Our goal then:
Make the best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Diagram: storage account “jared” contains containers “images” (blobs PIC01.JPG, PIC02.JPG) and “movies” (blob MOV1.AVI); a blob is addressed as http://jared.blob.core.windows.net/images/PIC01.JPG]
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata can be up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within account “jared”, each blob in a container (e.g. images/PIC01.JPG, images/PIC02.JPG, movies/MOV1.AVI) is composed of blocks or pages, identified as Block Id 1, Block Id 2, Block Id 3, … Block Id N.]
Uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
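The commit semantics above can be modeled with a small in-memory sketch (a hypothetical simulation of the behavior described, not the Azure Storage API): blocks stay invisible until PutBlockList commits them, and a new block list may reuse already-committed blocks.

```python
# In-memory model of block-blob semantics: uploaded blocks are
# uncommitted until PutBlockList names them as the readable version.
class BlockBlobSketch:
    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, not yet readable
        self.committed = {}     # block id -> bytes, readable
        self.block_list = []    # ordered ids forming the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # A listed block may come from the uncommitted or committed set.
        self.committed = {b: self.uncommitted.get(b, self.committed.get(b))
                          for b in block_ids}
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[b] for b in self.block_list)

blob = BlockBlobSketch()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""               # nothing readable before commit
blob.put_block_list(["b1", "b2"])
print(blob.read())                      # b'hello world'

blob.put_block("b2", b"azure")          # re-upload only the changed block
blob.put_block_list(["b1", "b2"])       # b1 reused from the committed list
print(blob.read())                      # b'hello azure'
```

The second commit shows why this design suits streaming uploads: only the changed block is re-sent, and the blob flips atomically from one readable version to the next.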
Create MyBlob
Specify blob size = 10 GB (a 10 GB address space), fixed page size = 512 bytes
Random-access operations, in order:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges: [0, 512) and [1536, 2560)
GetBlob [1000, 2048) returns:
All 0 for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)
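The sequence above can be replayed with a sparse in-memory page store (a hypothetical model of the semantics described, not the Azure Storage API); it reproduces exactly the GetPageRange and GetBlob results from the slide:

```python
# Sparse model of page-blob semantics: a dict of 512-byte pages;
# absent pages read back as zeros.
PAGE = 512

class PageBlobSketch:
    def __init__(self, size):
        self.size = size
        self.pages = {}          # page offset -> bytes

    def put_page(self, start, end, fill=b"\x01"):
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_range(self, start, end):
        """Return the valid (written) ranges, merging adjacent pages."""
        ranges = []
        for off in range(start, end, PAGE):
            if off in self.pages:
                if ranges and ranges[-1][1] == off:
                    ranges[-1][1] = off + PAGE
                else:
                    ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

    def get_blob(self, start, end):
        """Read bytes; unwritten or cleared pages come back as zeros."""
        out = bytearray()
        for pos in range(start, end):
            page = self.pages.get(pos - pos % PAGE)
            out.append(page[pos % PAGE] if page else 0)
        return bytes(out)

blob = PageBlobSketch(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
print(blob.get_page_range(0, 4096))   # [(0, 512), (1536, 2560)]
data = blob.get_blob(1000, 2048)
print(data[:536] == b"\x00" * 536)    # True: the cleared region reads as zeros
```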
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
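Because PartitionKey and RowKey are the only indexed properties, a point lookup by both keys is fast while any other filter is a scan; a minimal in-memory model (hypothetical, not the Azure Table API) makes the difference concrete:

```python
# Model of table-storage addressing: every entity is stored at
# (PartitionKey, RowKey); properties are an open set of name/value pairs.
table = {}

def insert_entity(partition_key, row_key, **properties):
    table[(partition_key, row_key)] = properties

insert_entity("hiv-study", "job-001", status="done")
insert_entity("hiv-study", "job-002", status="running")
insert_entity("hepc-study", "job-001", status="done")

# Point lookup by PartitionKey + RowKey: a direct, indexed read.
print(table[("hiv-study", "job-002")]["status"])   # running

# Filtering on any other property means scanning every entity.
done = [k for k, v in table.items() if v["status"] == "done"]
print(len(done))   # 2
```

This is why the best-practices slide below says to remember that tables only index on partition and row keys: schema design should put the common query path into those two keys.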
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
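The retry-logic practice above can be sketched as a small wrapper with exponential backoff (the `flaky_fetch` function and delay values are illustrative stand-ins for a real storage call):

```python
# Retry a transient-failure-prone operation with exponential backoff.
import time

def with_retries(op, attempts=4, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise                                 # out of retries
            time.sleep(base_delay * 2 ** attempt)     # back off and retry

calls = {"n": 0}
def flaky_fetch():
    """Stand-in for a storage read that fails transiently twice."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return "blob-bytes"

print(with_retries(flaky_fetch))   # "blob-bytes" after two transient failures
```

The same wrapper pairs naturally with the design rule above that workers must tolerate executing a task more than once: a retried operation may have partially succeeded, so the wrapped call should be idempotent.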
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of researchers. Invest in services and applications to easily upload data and samples that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 61
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from the client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, displayed on the client;
Provenance, collaboration, and other ‘core’ services…
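As a toy illustration of the "analysis as a service" point above, here is the MapReduce programming model in miniature. The `map_reduce` helper is hypothetical, in-memory, and single-machine; a real cloud service would distribute the mapper and reducer across many worker roles over data already persisted in the cloud.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy MapReduce: the mapper emits (key, value) pairs for each
    record; the reducer folds all values for a key into one result."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# Count words across 'documents' that would live in cloud storage.
docs = ["cloud data analysis", "cloud data"]
counts = map_reduce(docs,
                    mapper=lambda doc: [(w, 1) for w in doc.split()],
                    reducer=sum)
assert counts["cloud"] == 2 and counts["analysis"] == 1
```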
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 62
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 63
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account: jared
  Container: images → Blobs: PIC01.JPG, PIC02.JPG
  Container: movies → Blob: MOV1.AVI
Each blob consists of blocks or pages: Block/Page 1, Block/Page 2, Block/Page 3, …, identified by Block Id 1 through Block Id N
Example: uploading a 10 GB movie as a block blob to Windows Azure Storage:

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
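The PutBlock / PutBlockList semantics above can be illustrated with a minimal in-memory sketch (this simulates the commit behavior; it is not the actual storage API):

```python
class BlockBlob:
    """Sketch of block-blob semantics: blocks are uncommitted until
    PutBlockList names the ordered sequence that forms the readable blob."""

    def __init__(self):
        self.uncommitted = {}   # block id -> bytes, not yet readable
        self.committed = []     # ordered list of (block id, bytes)

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # Commit: listed blocks may come from the committed or
        # uncommitted set; in order, they become the readable blob.
        pool = {**dict(self.committed), **self.uncommitted}
        self.committed = [(bid, pool[bid]) for bid in block_ids]
        self.uncommitted = {}

    def read(self):
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""              # nothing is readable before commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"   # the committed list is the blob
```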
Create MyBlob
  Specify blob size = 10 GB (a 10 GB address space)
  Fixed page size = 512 bytes
Random access operations, in order:
  PutPage [512, 2048)
  PutPage [0, 1024)
  ClearPage [512, 1536)
  PutPage [2048, 2560)
GetPageRange [0, 4096) returns the valid data ranges:
  [0, 512) and [1536, 2560)
GetBlob [1000, 2048) returns:
  all zeros for the first 536 bytes
  the next 512 bytes are the data stored in [1536, 2048)
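The worked example can be replayed with a small simulation of page-blob semantics (an illustration on a 4 KB blob, not the real API; the operations only touch [0, 2560)):

```python
class PageBlob:
    """Sketch: a zero-filled address space of 512-byte pages, where only
    written pages hold valid data."""
    PAGE = 512

    def __init__(self, size):
        self.data = bytearray(size)   # all zeros initially
        self.valid = set()            # indices of pages holding data

    def put_page(self, start, end, fill=b"x"):
        self.data[start:end] = fill * (end - start)
        self.valid |= set(range(start // self.PAGE, end // self.PAGE))

    def clear_page(self, start, end):
        self.data[start:end] = bytes(end - start)
        self.valid -= set(range(start // self.PAGE, end // self.PAGE))

    def get_page_ranges(self):
        # Coalesce consecutive valid pages into [start, end) byte ranges.
        ranges, run = [], None
        for p in sorted(self.valid):
            if run and p == run[1]:
                run = (run[0], p + 1)
            else:
                if run:
                    ranges.append(run)
                run = (p, p + 1)
        if run:
            ranges.append(run)
        return [(s * self.PAGE, e * self.PAGE) for s, e in ranges]

# Replay the operations from the example:
blob = PageBlob(4096)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
assert blob.get_page_ranges() == [(0, 512), (1536, 2560)]
# GetBlob[1000, 2048): 536 zero bytes, then 512 bytes of stored data.
assert bytes(blob.data[1000:1536]) == bytes(536)
assert bytes(blob.data[1536:2048]) == b"x" * 512
```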
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: promoting a snapshot of MyBlob]
A Windows Azure Drive is a Page Blob formatted as a single-volume
NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
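The required keys above can be sketched with a toy table (a simulation of the addressing model, not the storage service): entities are property sets addressed by the (PartitionKey, RowKey) pair.

```python
class Table:
    """Sketch: entities are property dicts addressed by the unique pair
    (PartitionKey, RowKey); lookups within one partition are cheap."""

    def __init__(self):
        self.partitions = {}  # PartitionKey -> {RowKey: entity}

    def insert(self, entity):
        pk, rk = entity["PartitionKey"], entity["RowKey"]
        self.partitions.setdefault(pk, {})[rk] = entity

    def get(self, pk, rk):
        # Point lookup by the only indexed properties: the two keys.
        return self.partitions[pk][rk]

    def query_partition(self, pk):
        # Anything beyond the keys is a scan within the partition.
        return list(self.partitions.get(pk, {}).values())

t = Table()
t.insert({"PartitionKey": "movies", "RowKey": "MOV1.AVI", "Size": 10})
t.insert({"PartitionKey": "images", "RowKey": "PIC01.JPG", "Size": 1})
assert t.get("movies", "MOV1.AVI")["Size"] == 10
assert len(t.query_partition("images")) == 1
```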
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
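The "include retry logic" advice above can be sketched as exponential backoff with jitter around any data access call (names and the error type are illustrative):

```python
import random
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Retry a storage operation with exponential backoff and jitter
    before giving up; transient faults are expected in the cloud."""
    for i in range(attempts):
        try:
            return op()
        except IOError:
            if i == attempts - 1:
                raise               # out of attempts: surface the error
            time.sleep(base_delay * 2**i * random.random())

# A flaky operation that succeeds on the third call:
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage error")
    return b"data"

assert with_retries(flaky_read) == b"data"
assert calls["n"] == 3
```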
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wish.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
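"Analysis as a service (think SQL, Map-Reduce, R/MatLab)" can be made concrete with a toy map-reduce over the standard library (a sketch of the idea, not any particular service):

```python
from collections import Counter
from functools import reduce

# Toy map-reduce word count: mappers count words per document,
# a reducer merges the partial counts into one result.
docs = ["cloud data analysis", "cloud analysis service"]
mapped = [Counter(d.split()) for d in docs]            # map phase
total = reduce(lambda a, b: a + b, mapped, Counter())  # reduce phase
assert total["cloud"] == 2 and total["analysis"] == 2
```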
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1,000 servers) and a larger, 100K-server center:

Technology      Cost in Small-sized Data Center   Cost in Large Data Center     Ratio
Network         $95 per Mbps/month                $13 per Mbps/month            7.1
Storage         $2.20 per GB/month                $0.40 per GB/month            5.7
Administration  ~140 servers/administrator        >1000 servers/administrator   7.1

Each data center is 11.5 times the size of a football field.
Conquering complexity
Building racks of servers and complex cooling systems separately is not efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
  - Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or
    Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
  - HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
  - DC: TBs of local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
  - HPC: periodic checkpoints, rollback and resume in response to failures;
    MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  - DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
[Diagram: HTTP requests pass through a load balancer to Web Role instances (IIS running ASP.NET, WCF, etc.) and Worker Role instances (main() { … }); each role runs in a VM alongside an agent, managed by the Fabric]
Using queues for reliable messaging (to scale, add more of either role):
  1) The Web Role (ASP.NET, WCF, etc.) receives work
  2) The Web Role puts work in the queue
  3) A Worker Role (main() { … }) gets work from the queue
  4) The Worker Role does the work
Queues are the application glue
• Decouple parts of the application, so each is easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
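The web-role / worker-role pattern above can be sketched with the standard library (an illustration of the pattern, not Azure's queue API):

```python
import queue
import threading

# The "web role" enqueues work items; "worker roles" pull from the
# shared queue and process independently, so either side scales out.
work = queue.Queue()
results = []

def worker():
    while True:
        item = work.get()          # 3) get work from the queue
        if item is None:
            break                  # sentinel: shut down
        results.append(item * 2)   # 4) do the work
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

for item in [1, 2, 3]:             # 1) + 2) receive work, enqueue it
    work.put(item)
work.join()                        # wait until all work is done

for _ in threads:
    work.put(None)                 # stop the workers
for t in threads:
    t.join()

assert sorted(results) == [2, 4, 6]
```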
A closer look at storage
[Diagram: applications access Blobs, Drives, Tables, and Queues through a load balancer via a REST API over HTTP, layered on the storage, compute, and fabric tiers]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage
  Not relational: entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: develop your app at work or at home against the local Development Fabric and Development Storage, with versions kept in source control; once the application works locally, run it in staging in the cloud]
What is the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Allocation is across fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
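The example above (10 front-ends across 5 update domains) amounts to round-robin placement; a minimal sketch (instance names are illustrative):

```python
def allocate(instances, domains):
    """Spread role instances round-robin across update (or fault)
    domains, as in the 10-front-ends / 5-update-domains example."""
    placement = {d: [] for d in range(domains)}
    for i in range(instances):
        placement[i % domains].append(f"frontend-{i}")
    return placement

placement = allocate(10, 5)
assert all(len(v) == 2 for v in placement.values())
# Rolling an update one domain at a time keeps 8 of 10 instances serving.
```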
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
Load balancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
  Terra, “EOS AM”, launched 12/1999; descending, equator crossing at 10:30 AM
  Aqua, “EOS PM”, launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: the AzureMODIS Service Web Role Portal drives the pipeline: a download queue feeds the data collection stage, followed by the reprojection, derivation reduction, and analysis reduction stages, producing research results]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
  – 100’s of HIV and HepC researchers actively use it
  – 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
  – Requires a large number of test runs for a given job (1–10M tests)
  – Highly compressed data per job (~100 KB per job)
Step 1. Staging
  1. Compress the required data (the local sequence database)
  2. Upload the compressed data to Azure Storage
  3. Deploy Worker Roles (with the BLAST executable): each role’s Init()
     function downloads and decompresses the data to the local disk
Step 2. Partitioning a Job
[Diagram: user input arrives via the Web Role and is stored in Azure Storage; a single partitioning Worker Role splits it into input partitions and posts a queue message for each]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage]
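The partitioning step can be sketched as splitting the input into fixed-size chunks with one queue message per chunk (names and the message format here are illustrative, not AzureBLAST's own):

```python
import json

def partition_job(sequences, chunk_size):
    """Split the user's input sequences into partitions and produce
    one queue message per partition for the worker roles to claim."""
    partitions = [sequences[i:i + chunk_size]
                  for i in range(0, len(sequences), chunk_size)]
    messages = [json.dumps({"partition": i, "count": len(p)})
                for i, p in enumerate(partitions)]
    return partitions, messages

seqs = [f"seq-{i}" for i in range(10)]
parts, msgs = partition_job(seqs, 4)
assert len(parts) == 3 and len(msgs) == 3   # 4 + 4 + 2 sequences
```

Chunk size matters: as the lessons below note, factoring work into the right partition size has a large performance impact.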
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
  - Little Cloud development headaches are probably worth it
Resources

Workers   Clock Duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
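As a quick check on the table, a few lines compute the speedup and parallel efficiency implied by the clock durations (treating the 2-worker run as the baseline is our choice, not the slide's):

```python
def to_seconds(hms):
    """Convert an h:mm:ss duration string to seconds."""
    h, m, s = map(int, hms.split(":"))
    return 3600 * h + 60 * m + s

# Clock durations from the table above.
runs = {25: "0:12:00", 16: "0:15:00", 8: "0:26:00", 4: "0:47:00", 2: "1:27:00"}
base_workers, base = 2, to_seconds(runs[2])
for n, t in sorted(runs.items()):
    speedup = base / to_seconds(t)
    efficiency = speedup * base_workers / n
    print(f"{n:2d} workers: speedup {speedup:4.2f}x, efficiency {efficiency:.0%}")
```

Scaling is good but sublinear, which is the time-space fungibility trade illustrated next: more workers buy shorter wall-clock time at slightly higher total cost.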
[Chart: time-space fungibility in the Cloud (resources vs. time)]
Utilizes a general jobs-based task manager, which registers jobs and their resulting data
[Diagram: a job definition is broken into tasks; a Registry Broker connects a Local Registry (holding highly sensitive data, used by the user and the (HPC) cluster administrator on the user premises or internet) with the Registry, task execution, data products, web management, and results in the Azure datacenters]
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 65
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: AzureMODIS processing pipeline. A service Web Role portal feeds a download queue; data flows through a Data Collection Stage, Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage to produce research results]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
- 100’s of HIV and HepC researchers actively use it
- 1000’s of research communities rely on the results
Cover of PLoS Biology, November 2008
• Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
- Requires a large number of test runs for a given job (1 – 10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database)
2. Upload to Azure Storage
3. Deploy Worker Roles
- Init() function downloads and decompresses data to the local disk
[Diagram: local sequence database → compressed → uploaded to Azure Storage → BLAST executable deployed to worker roles]
Step 2. Partitioning a Job
- The Web Role takes the user input; a single partitioning Worker Role writes each input partition to Azure Storage and adds one queue message per partition
Step 3. Doing the Work
- BLAST-ready Worker Roles pick up queue messages, read their input partition from Azure Storage, and write BLAST output and logs back to storage
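The partition-and-queue pattern in Steps 2 and 3 can be replayed in miniature with an in-memory stand-in for the Azure queue and store. Names like `partition_job` and the dict-as-storage layout are illustrative, not AzureBLAST's actual code; uppercasing stands in for running BLAST.

```python
from collections import deque

def partition_job(sequences, partition_size):
    """Split the user input into fixed-size partitions (Step 2)."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def run_workers(queue, storage):
    """Each worker pops a queue message naming its partition,
    processes it, and writes output back to storage (Step 3)."""
    while queue:
        partition_id = queue.popleft()
        data = storage[f"input/{partition_id}"]
        # stand-in for running BLAST on the partition
        storage[f"output/{partition_id}"] = [s.upper() for s in data]

storage = {}
queue = deque()
partitions = partition_job(["acgt", "ttga", "ccat"], partition_size=2)
for pid, part in enumerate(partitions):
    storage[f"input/{pid}"] = part   # upload partition
    queue.append(pid)                # one queue message per partition
run_workers(queue, storage)
print(storage["output/0"])  # ['ACGT', 'TTGA']
```

The queue decouples the single partitioner from however many workers are deployed, which is what lets the job scale by adding worker instances.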
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- Minor cloud development headaches are probably worth it
Resources

Workers | Clock duration | Total run time | Computational run time
25      | 0:12:00        | 2:19:39        | 1:49:43
16      | 0:15:00        | 2:25:12        | 1:53:47
8       | 0:26:00        | 2:33:23        | 2:00:14
4       | 0:47:00        | 2:34:17        | 2:01:06
2       | 1:27:00        | 2:31:39        | 1:59:13
[Chart: time-space fungibility in the cloud: the same resource total can be consumed by many workers for a short time or by few workers for a long time]
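Reading the table above back: total computational time stays near two hours regardless of worker count, while clock duration shrinks as workers are added; that is the time-space trade. A quick check, assuming the table's figures:

```python
def to_minutes(hms):
    """Convert an h:mm:ss string to minutes."""
    h, m, s = map(int, hms.split(":"))
    return h * 60 + m + s / 60

workers = [25, 16, 8, 4, 2]
clock   = [to_minutes(t) for t in
           ["0:12:00", "0:15:00", "0:26:00", "0:47:00", "1:27:00"]]
compute = [to_minutes(t) for t in
           ["1:49:43", "1:53:47", "2:00:14", "2:01:06", "1:59:13"]]

# Computational time is nearly flat: within ~10% across a
# 12.5x change in worker count
print(max(compute) / min(compute))
# Clock duration grows monotonically as workers are removed
print(clock == sorted(clock))
```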
Utilizes a general jobs-based task manager which registers jobs and their resulting data
[Diagram: a job definition fans out into tasks run on an (HPC) cluster or in Azure; a registry broker mediates between the user’s local registry and highly sensitive data on the user premises (or internet) and the registry, data products, and web management of results in the Azure datacenters]
Client Visualization / Cloud Data and Computation
• The Cloud is not a jack-of-all-trades
• Client-side tools are particularly appropriate for
- Applications using peripheral devices
- Applications with heavy graphics requirements
- Legacy user interfaces that would be difficult to port
• Our goal then:
- Make the best use of the capabilities of client and cloud computing
- Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example: the account “jared” holds a container “images” (blobs PIC01.JPG, PIC02.JPG) and a container “movies” (blob MOV1.AVI):
http://jared.blob.core.windows.net/images/PIC01.JPG
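The account/container/blob hierarchy maps directly onto the URI, which can be composed mechanically (a small sketch of the naming pattern shown above):

```python
def blob_uri(account, container, blob):
    """Compose the public blob URI from account name, container,
    and blob name, per the pattern shown above."""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

print(blob_uri("jared", "images", "PIC01.JPG"))
# http://jared.blob.core.windows.net/images/PIC01.JPG
```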
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or public accessible
Associate metadata with a container
Metadata: up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit: 200 GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit: 1 TB per blob
[Diagram: within an account’s containers, each blob (e.g. PIC01.JPG, MOV1.AVI) is composed of blocks or pages, with blocks identified by Block IDs 1 through N]
Example: uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage as a block blob:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
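The two-phase PutBlock/PutBlockList semantics can be modeled in a few lines. This is an in-memory sketch of the commit behavior only, not the storage service; the class and method names mirror the operations above but are otherwise invented for illustration.

```python
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}  # block_id -> bytes, staged by PutBlock
        self.committed = {}    # block_id -> bytes in the readable blob
        self.block_list = []   # ordered ids committed by PutBlockList

    def put_block(self, block_id, data):
        """Stage an uncommitted block for the blob."""
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        """Commit the readable blob; ids may come from the
        uncommitted or the committed list."""
        new = {bid: self.uncommitted.get(bid, self.committed.get(bid))
               for bid in block_ids}
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
print(blob.read())  # b'' -- nothing readable until the commit
blob.put_block_list(["b1", "b2"])
print(blob.read())  # b'hello world'
```

The point of the two phases is that readers never see a half-uploaded blob: the blob's readable content changes atomically at PutBlockList.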
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
Random access operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all zeros for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
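The PutPage/ClearPage/GetPageRange example above can be replayed with a small model that tracks which 512-byte pages hold data. This is a sketch of the semantics only (byte ranges assumed page-aligned), not the service API:

```python
PAGE = 512

def put_page(valid, start, end):
    """Mark the pages covering [start, end) as holding data."""
    valid |= set(range(start // PAGE, end // PAGE))

def clear_page(valid, start, end):
    """Clear the pages covering [start, end)."""
    valid -= set(range(start // PAGE, end // PAGE))

def get_page_ranges(valid):
    """Coalesce valid page indices into [start, end) byte ranges."""
    ranges, run = [], []
    for p in sorted(valid):
        if run and p == run[-1] + 1:
            run.append(p)
        else:
            if run:
                ranges.append((run[0] * PAGE, (run[-1] + 1) * PAGE))
            run = [p]
    if run:
        ranges.append((run[0] * PAGE, (run[-1] + 1) * PAGE))
    return ranges

valid = set()
put_page(valid, 512, 2048)
put_page(valid, 0, 1024)
clear_page(valid, 512, 1536)
put_page(valid, 2048, 2560)
print(get_page_ranges(valid))  # [(0, 512), (1536, 2560)]
```

The result matches the slide: after the four operations, the valid data ranges are [0,512) and [1536,2560).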
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
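Snapshot and promote semantics in miniature. This illustrative model stores full copies for clarity; as noted above, the real service keeps only delta changes across snapshots.

```python
class SnapshottableBlob:
    def __init__(self, data=b""):
        self.data = data      # writes always apply to the base blob
        self.snapshots = []   # read-only prior versions

    def write(self, data):
        self.data = data

    def snapshot(self):
        """Record the current version; returns a snapshot id."""
        self.snapshots.append(self.data)
        return len(self.snapshots) - 1

    def promote(self, snap_id):
        """Restore the base blob to a prior version."""
        self.data = self.snapshots[snap_id]

blob = SnapshottableBlob(b"v1")
s = blob.snapshot()
blob.write(b"v2")   # write applies to the base blob name
blob.promote(s)     # roll back via snapshot promotion
print(blob.data)    # b'v1'
```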
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
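A minimal model of the entity addressing scheme, keyed on (PartitionKey, RowKey) with a system-maintained Timestamp. The class name and methods are invented for the sketch; the real API is ADO.NET Data Services or REST.

```python
from datetime import datetime, timezone

class AzureTable:
    """Sketch of an Azure table: entities are property dicts
    addressed by the (PartitionKey, RowKey) pair."""
    def __init__(self):
        self.entities = {}

    def insert(self, entity):
        key = (entity["PartitionKey"], entity["RowKey"])
        # Timestamp is maintained by the system, not the caller
        self.entities[key] = {
            **entity,
            "Timestamp": datetime.now(timezone.utc).isoformat(),
        }

    def get(self, partition_key, row_key):
        return self.entities[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # cheap in the real service, since the partition key is indexed
        return [e for (pk, _), e in self.entities.items()
                if pk == partition_key]

t = AzureTable()
t.insert({"PartitionKey": "jackson", "RowKey": "jared", "Role": "Speaker"})
t.insert({"PartitionKey": "jackson", "RowKey": "anna", "Role": "Guest"})
print(len(t.query_partition("jackson")))  # 2
print(t.get("jackson", "jared")["Role"])  # Speaker
```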
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
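The retry-logic bullet might look like the generic helper below. This is a sketch: the helper name is invented, and real code should catch only the transient error types its storage client actually raises, rather than bare Exception.

```python
import time

def with_retries(operation, attempts=4, base_delay=0.5):
    """Call operation(); on failure, back off exponentially and
    retry, re-raising after the final attempt."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# usage: flaky() fails twice with a transient error, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```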
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request that grantees archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, MapReduce, R/MATLAB);
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 66
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 67
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o  Node and system architectures
   - Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or
     Shanghai, multiple processors, a big chunk of memory on the nodes
o  Communication fabric
o  Storage systems
   - HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
   - DC: TB local storage, secondary is JBOD, tertiary is non-existent
o  Reliability and resilience
   - HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
     approaching zero, checkpoint frequency increasing, I/O demand intolerable
   - DC: loosely consistent models, designed to transparently recover from failures
o  Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At minimum: CPU 1.5-1.7 GHz x64; Memory: 1.7 GB; Network: 100+ Mbps; Local storage: 500 GB
Up to: CPU 8 cores; Memory: 14.2 GB; Local storage: 2+ TB
Azure Platform: Compute and Storage
A closer look at Compute: HTTP traffic passes through a load balancer to Web
Roles (IIS hosting ASP.NET, WCF, etc.) and Worker Roles (a main() { … } entry
point); each role instance runs in a VM alongside an Agent, all managed by the
Fabric.
Using queues for reliable messaging (to scale, add more of either role):
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts work in the queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of the application, so each part is easier to scale independently
• Resource allocation: different priority queues and backend servers
• Mask faults in worker roles (reliable messaging)
Use inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service model
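The reliable-messaging semantics behind this pattern can be sketched with a toy simulation (Python; the class and names are illustrative, not the Azure API): a dequeued message only becomes invisible for a timeout, and is removed for good only when the worker explicitly deletes it after finishing, so a crashed worker's message reappears rather than being lost.

```python
import time

class ToyQueue:
    """Toy queue with Azure-style visibility timeouts (simulation only)."""
    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self.messages = {}          # id -> [body, invisible_until]
        self.next_id = 0

    def put(self, body):            # 2) web role puts work in queue
        self.messages[self.next_id] = [body, 0.0]
        self.next_id += 1

    def get(self, now=None):        # 3) worker gets work from queue
        now = time.monotonic() if now is None else now
        for msg_id, slot in self.messages.items():
            if slot[1] <= now:
                slot[1] = now + self.visibility_timeout
                return msg_id, slot[0]
        return None

    def delete(self, msg_id):       # worker deletes only after 4) doing the work
        self.messages.pop(msg_id, None)

q = ToyQueue(visibility_timeout=30.0)
q.put("resize image 17")
msg = q.get(now=0.0)        # worker A dequeues; message hidden for 30s
# worker A crashes before delete(); after the timeout the message reappears
again = q.get(now=31.0)     # worker B sees the same work item
assert msg[1] == again[1] == "resize image 17"
q.delete(again[0])          # completed work is removed for good
assert q.get(now=100.0) is None
```

This is why the slide calls queues "reliable messaging": the fault in worker A is masked without the web role ever knowing.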
A closer look at Storage: applications (whether in the Compute fabric or
external) reach Blobs, Drives, Tables, and Queues through a load balancer via a
REST API over HTTP.
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational (entities contain a set of properties)
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Development workflow: develop your app at work or at home, running it locally
against the Development Fabric and Development Storage, with versions kept in
source control. Once the application works locally, promote it to staging in
the cloud.
What's the 'Value Add'?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
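What "allocation is across update domains" buys can be made concrete with a short sketch (Python; the round-robin placement function is a hypothetical stand-in for the Fabric Controller's allocator). For the example of 10 front-ends across 5 update domains, rolling an update one domain at a time always leaves 8 of 10 instances serving.

```python
def allocate(instances, update_domains):
    """Round-robin instance placement across update domains (illustrative)."""
    return {i: i % update_domains for i in range(instances)}

placement = allocate(instances=10, update_domains=5)

# Roll the update forward one domain at a time
for ud in range(5):
    down = [i for i, d in placement.items() if d == ud]
    up = 10 - len(down)
    assert len(down) == 2 and up == 8   # service keeps 8/10 instances running
```

The same spreading logic applied to fault domains is what avoids single points of failure: no single rack or switch takes out more than its share of a role's instances.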
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes (allocation across fault and update domains)
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud-based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, "EOS AM", launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, "EOS PM", launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a
number of different researchers
AzureMODIS pipeline: a service Web Role portal feeds a download queue; data
flows through a Data Collection Stage, Reprojection Stage, Derivation Reduction
Stage, and Analysis Reduction Stage to produce research results.
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers:
  100's of HIV and HepC researchers actively use it
  1000's of research communities rely on the results
  (Cover of PLoS Biology, November 2008)
• Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
  - Requires a large number of test runs for a given job (1-10M tests)
  - Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure Storage
3. Deploy Worker Roles with the BLAST executable; the Init() function downloads
   and decompresses the data to the local disk
Step 2. Partitioning a job
The Web Role stores user input in Azure Storage; a single partitioning Worker
Role splits it into input partitions and enqueues a queue message per partition.
Step 3. Doing the work
BLAST-ready Worker Roles pick up queue messages, read the corresponding input
partitions from Azure Storage, and write BLAST output and logs back to storage.
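The three steps can be sketched end to end in a toy simulation (Python; dicts and lists stand in for Azure Storage and the queue, and the partitioning/worker functions are hypothetical names, not AzureBLAST's code), keeping the shape of the design: a single partitioning role splits the input, one queue message per partition, and workers drain the queue writing output and logs.

```python
storage = {}    # stands in for Azure blob storage
queue = []      # stands in for the Azure queue

def partition_job(user_input, partition_size):
    """Step 2: one partitioning role splits input, one message per partition."""
    for n, start in enumerate(range(0, len(user_input), partition_size)):
        key = f"input/partition-{n}"
        storage[key] = user_input[start:start + partition_size]
        queue.append(key)

def run_workers(work):
    """Step 3: workers drain the queue, writing output and logs to storage."""
    while queue:
        key = queue.pop(0)
        storage[key.replace("input", "output")] = work(storage[key])
        storage.setdefault("logs", []).append(f"done {key}")

sequences = ["ACGT", "GGCA", "TTAG", "CATG", "AACC"]
partition_job(sequences, partition_size=2)
run_workers(work=lambda seqs: [s[::-1] for s in seqs])  # stand-in for BLAST
assert storage["output/partition-0"] == ["TGCA", "ACGG"]
assert len(storage["logs"]) == 3
```

Because each partition's output is keyed by its input, a worker that crashes and re-processes a message simply overwrites the same key, which is one way to honor the "execute a task only once" guidance later in the deck.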
• Always design with failure in mind
  - On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
  - The optimal size may change depending on the scope of the job
• Test runs are your friend
  - Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
  - When failure does happen, it's good to know where
• Cutting 10 years of computation down to 1 week is great!
  - Little cloud development headaches are probably worth it
Resources

Workers   Clock duration   Total run time   Computational run time
25        0:12:00          2:19:39          1:49:43
16        0:15:00          2:25:12          1:53:47
8         0:26:00          2:33:23          2:00:14
4         0:47:00          2:34:17          2:01:06
2         1:27:00          2:31:39          1:59:13
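The table reads as a time-space trade: total and computational run time stay roughly constant (about 2.5 and 2 hours) while wall-clock time shrinks as workers are added. A quick speedup and parallel-efficiency check (Python; minutes taken from the clock-duration column, relative to the 2-worker run):

```python
# Wall-clock durations from the table, in minutes
clock_minutes = {25: 12, 16: 15, 8: 26, 4: 47, 2: 87}

def speedup(workers):
    """Speedup relative to the 2-worker baseline."""
    return clock_minutes[2] / clock_minutes[workers]

def efficiency(workers):
    """Parallel efficiency: speedup divided by the increase in workers."""
    return speedup(workers) / (workers / 2)

assert speedup(25) == 7.25               # 12.5x the workers buys ~7.3x the speed
assert round(efficiency(25), 2) == 0.58  # efficiency drops at high worker counts
assert round(efficiency(4), 2) == 0.93   # near-linear scaling at small counts
```

The falling efficiency at 25 workers is the earlier lesson about factoring work into optimal sizes showing up in the numbers.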
Resources
[Chart: time-space fungibility in the Cloud (resources vs. time)]
Utilizes a general jobs-based task manager which registers jobs and their
resulting data products.
[Diagram: a job definition fans out into tasks; a Registry Broker connects the
Web Management interface, an (HPC) Cluster and its Administrator, and a Local
Registry holding Highly Sensitive Data on the User Premises (or internet) to
the Azure Datacenters, where users retrieve Results.]
Client Visualization / Cloud Data and Computation
• The Cloud is not a Jack-of-All-Trades
• Client-side tools are particularly appropriate for:
  - Applications using peripheral devices
  - Applications with heavy graphics requirements
  - Legacy user interfaces that would be difficult to port
• Our goal then: make the best use of the capabilities of client and cloud
  computing, often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account / Container / Blob namespace, e.g. account "jared" with containers
"images" (blobs PIC01.JPG, PIC02.JPG) and "movies" (blob MOV1.AVI):
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate metadata with a container
Metadata: up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Within the same account/container/blob namespace (e.g. jared/images/PIC01.JPG),
each blob is made up of blocks or pages: Block/Page 1, 2, 3, …, N, with each
block identified by a Block ID.
Uploading a 10 GB movie (TheBlob.wmv) as a block blob to Windows Azure Storage:

blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
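A minimal simulation (Python; the class is illustrative, not the storage client library) of the PutBlock / PutBlockList semantics described above: blocks land in an uncommitted set, and only PutBlockList makes a chosen sequence the readable version of the blob, drawing IDs from the uncommitted or committed set.

```python
class BlockBlob:
    """Toy model of block blob commit semantics (simulation only)."""
    def __init__(self):
        self.uncommitted = {}   # block_id -> bytes, staged by PutBlock
        self.committed = {}     # block_id -> bytes, in the readable version
        self.block_list = []    # ordered ids forming the readable blob

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # IDs may come from the uncommitted or the committed set
        new = {bid: self.uncommitted.get(bid, self.committed.get(bid))
               for bid in block_ids}
        self.committed, self.block_list = new, list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""              # nothing readable until the commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"
blob.put_block("b2", b"azure")         # re-upload just one block, recommit
blob.put_block_list(["b1", "b2"])      # "b1" is reused from the committed set
assert blob.read() == b"hello azure"
```

The all-at-once commit is what makes block blobs fit the streaming workload: readers never observe a half-uploaded movie.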
Create MyBlob
Specify blob size = 10 GBytes; fixed page size = 512 bytes (a 10 GB address space)
Random access operations:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes
are the data stored in [1536,2048)
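The example above can be checked with a small simulation (Python; the function names mirror the slide's operations but are not a real API, and the blob is scaled down to 10 KB): written pages are tracked as valid byte ranges, and reads return zeros outside them.

```python
SIZE = 10 * 1024                  # scaled-down page blob address space
valid = bytearray(SIZE)          # 1 = page byte written, 0 = clear
data = bytearray(SIZE)           # the stored bytes themselves

def put_page(start, end):
    for i in range(start, end):
        valid[i], data[i] = 1, 0xAB   # 0xAB stands in for payload bytes

def clear_page(start, end):
    for i in range(start, end):
        valid[i], data[i] = 0, 0

def get_page_ranges(start, end):
    """Return the valid data ranges as half-open [a, b) intervals."""
    ranges, run = [], None
    for i in range(start, end):
        if valid[i] and run is None:
            run = i
        if not valid[i] and run is not None:
            ranges.append((run, i)); run = None
    if run is not None:
        ranges.append((run, end))
    return ranges

put_page(512, 2048); put_page(0, 1024)
clear_page(512, 1536); put_page(2048, 2560)
assert get_page_ranges(0, 4096) == [(0, 512), (1536, 2560)]
blob_read = bytes(data[1000:2048])           # GetBlob[1000, 2048)
assert blob_read[:536] == bytes(536)         # zeros up to offset 1536
assert blob_read[536:] == bytes([0xAB]) * 512
```

This immediate-update, sparse-range behavior is why page blobs suit random read/write workloads, in contrast to the staged commit of block blobs.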
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual
Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
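A sketch of the entity model (Python; the dict-backed table and helper are illustrative, not the .NET client): every entity carries PartitionKey, RowKey, and Timestamp plus arbitrary properties, and the (PartitionKey, RowKey) pair is the only indexed, unique address, which is why queries on any other property scan.

```python
import time

# (PartitionKey, RowKey) -> entity: the only index the table keeps
table = {}

def insert(partition_key, row_key, **properties):
    """Insert an entity with the three required properties plus its own."""
    entity = {"PartitionKey": partition_key, "RowKey": row_key,
              "Timestamp": time.time(), **properties}
    table[(partition_key, row_key)] = entity
    return entity

insert("barga", "2010-03-01", title="Azure for Research", minutes=90)
insert("barga", "2010-04-15", title="Cloud Futures", minutes=60)
insert("jackson", "2010-03-01", title="AzureBLAST demo", minutes=30)

# A point lookup by both keys is a cheap indexed read...
assert table[("barga", "2010-04-15")]["title"] == "Cloud Futures"
# ...while a query on any other property is a scan over the whole table
long_talks = [e for e in table.values() if e["minutes"] > 45]
assert len(long_talks) == 2
```

Partitioning also governs scale-out: entities sharing a PartitionKey stay together, so a good key design spreads load across the thousands of servers the slide mentions.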
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
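The first testing bullet, retry logic wherever you access data, might look like this minimal sketch (Python; the helper name, the use of IOError as the transient error, and the delays are all placeholders for whatever your storage client actually raises):

```python
import time

def with_retries(operation, attempts=4, base_delay=0.1):
    """Retry a flaky storage call with exponential backoff, logging failures."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError as err:          # stand-in for a transient storage error
            print(f"attempt {attempt + 1} failed: {err}")
            if attempt == attempts - 1:
                raise                   # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:                  # first two calls hit a transient fault
        raise IOError("connection reset")
    return b"blob bytes"

assert with_retries(flaky_read, base_delay=0.0) == b"blob bytes"
assert calls["n"] == 3
```

The printed failures double as the logging the previous bullet asks for: when a large job dies, the retry log tells you where.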
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 68
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as coherent solution accelerators. Pull through MS products and MSR technologies, partner with ISVs, and make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are unable to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
The cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
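The "analysis as a service" idea above can be reduced to its essence: the client submits a small job description and receives a summary, never handling the raw data itself. A toy map-reduce stand-in, with made-up file records, looks like this:

```python
# A toy stand-in for a cloud-side map-reduce analysis service: the client
# ships map and reduce functions; only the summary comes back.
from functools import reduce

def run_job(records, mapper, reducer, initial):
    return reduce(reducer, map(mapper, records), initial)

# e.g. total bases sequenced across many uploaded files (hypothetical data)
files = [{"name": "a.fasta", "bases": 1200}, {"name": "b.fasta", "bases": 800}]
total = run_job(files, lambda f: f["bases"], lambda acc, n: acc + n, 0)
```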
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 70
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate Metadata with Container
Metadata are name/value pairs, up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the Account → Container → Blob hierarchy again, with each blob (e.g. a 10 GB movie) stored as an ordered sequence of blocks or pages, Block Id 1 through Block Id N.]
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
The committed block list becomes TheBlob.wmv in Windows Azure Storage.
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
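The client-side half of the PutBlock/PutBlockList protocol can be sketched as follows (Python for illustration; the block-ID format here is an assumption: the service requires only that IDs be unique within the blob and at most 64 bytes):

```python
import base64

BLOCK_LIMIT = 4 * 1024 * 1024   # blocks can be up to 4 MB each

def split_into_blocks(data: bytes, block_size: int = BLOCK_LIMIT):
    """Carve a payload into (block_id, bytes) pairs, mirroring the
    sequence of PutBlock calls. The fixed-width base64 ID scheme is one
    choice that keeps IDs unique and of equal length."""
    blocks = []
    for offset in range(0, len(data), block_size):
        block_id = base64.b64encode(
            f"block-{offset // block_size:08d}".encode()).decode()
        blocks.append((block_id, data[offset:offset + block_size]))
    return blocks

payload = b"x" * (9 * 1024 * 1024)                  # a 9 MB payload
blocks = split_into_blocks(payload)
block_list = [block_id for block_id, _ in blocks]   # order for PutBlockList
assert len(blocks) == 3                             # 4 MB + 4 MB + 1 MB
assert b"".join(chunk for _, chunk in blocks) == payload
```

Committing `block_list` in order is what makes the uploaded blocks readable as one blob.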
Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes
[Diagram: random-access operations against the 10 GB address space, with byte offsets 0, 512, 1024, 1536, 2048, 2560 marked.]
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for the first 536 bytes
The next 512 bytes are the data stored in [1536, 2048)
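The sequence above can be reproduced with a small simulation of the page semantics (a sketch, not the service API: one dict entry per written 512-byte page):

```python
PAGE = 512

class PageBlobSim:
    """Toy model of page-blob writes: one dict entry per written page."""

    def __init__(self):
        self.pages = {}                     # page index -> 512 bytes of data

    def put_page(self, start, end, fill=b"\x01"):
        for p in range(start // PAGE, end // PAGE):
            self.pages[p] = fill * PAGE

    def clear_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.pages.pop(p, None)

    def get_page_range(self, start, end):
        """Coalesce written pages into [start, end) byte ranges."""
        ranges, run = [], None
        for p in range(start // PAGE, end // PAGE):
            if p in self.pages:
                if run is None:
                    run = [p * PAGE, (p + 1) * PAGE]
                else:
                    run[1] = (p + 1) * PAGE
            elif run is not None:
                ranges.append(tuple(run))
                run = None
        if run is not None:
            ranges.append(tuple(run))
        return ranges

b = PageBlobSim()
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
print(b.get_page_range(0, 4096))    # [(0, 512), (1536, 2560)]
```

Running the four operations in order yields exactly the valid ranges the slide reports.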
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
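The ETag check mentioned for block blobs is ordinary optimistic concurrency: a write succeeds only if the caller's ETag still matches the stored one. A minimal model (not the storage API; names and the error message are invented for the sketch):

```python
import itertools

class ETagStore:
    """Value plus a version tag that changes on every successful write."""
    _tags = itertools.count(1)

    def __init__(self, value=b""):
        self.value, self.etag = value, next(self._tags)

    def write(self, new_value, if_match):
        if if_match != self.etag:           # someone else committed first
            raise RuntimeError("412 Precondition Failed")
        self.value, self.etag = new_value, next(self._tags)
        return self.etag

blob = ETagStore(b"v1")
tag = blob.etag
blob.write(b"v2", if_match=tag)             # succeeds; etag changes
try:
    blob.write(b"v3", if_match=tag)         # stale etag: rejected
except RuntimeError as err:
    print(err)                              # 412 Precondition Failed
```

Page blobs use leases instead: a writer takes an exclusive time-limited lock rather than detecting conflicts after the fact.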
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: promoting a snapshot of MyBlob to restore a prior version.]
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and easy-to-use APIs
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
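The entity model above can be sketched as a property bag keyed by (PartitionKey, RowKey), which is the unique identifier and, as the best practices below note, the only indexed access path. All entity names and values here are invented for the illustration:

```python
# Toy model of the table data model: dict keyed by (PartitionKey, RowKey).
table = {}

def insert_entity(entity: dict) -> None:
    """Store an entity; the key pair must be unique within the table."""
    key = (entity["PartitionKey"], entity["RowKey"])
    if key in table:
        raise KeyError("entity with this PartitionKey/RowKey already exists")
    table[key] = entity

insert_entity({"PartitionKey": "movies", "RowKey": "MOV1.AVI",
               "Timestamp": "2010-03-01T00:00:00Z", "SizeGB": 10})
insert_entity({"PartitionKey": "images", "RowKey": "PIC01.JPG",
               "Timestamp": "2010-03-01T00:00:00Z", "SizeGB": 1})

hit = table[("movies", "MOV1.AVI")]                   # point lookup: index hit
big = [e for e in table.values() if e["SizeGB"] > 5]  # other filters: a scan
```

Queries that supply both keys are cheap; filtering on any other property forces a scan, which is why partition/row key design matters.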
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
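The "include retry logic" guidance can be sketched as a generic wrapper (all names here are invented for illustration; real code would catch the storage client library's transient exceptions rather than ConnectionError):

```python
import random
import time

def with_retries(operation, attempts=4, base_delay=0.1):
    """Retry a transient-failure-prone call with exponential backoff
    plus jitter, re-raising once the attempt budget is exhausted."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                   # out of retries: surface the fault
            # back off base, 2*base, 4*base, ... with jitter
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.05))

calls = {"n": 0}

def flaky_fetch():
    """Stand-in for a storage read that fails transiently twice."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient storage fault")
    return "blob bytes"

print(with_retries(flaky_fetch))        # blob bytes
```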
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers? What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 71
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 72
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002, ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers
[Pipeline diagram: an AzureMODIS Service Web Role portal feeds a download queue; data flows through the Data Collection Stage, Reprojection Stage, Derivation Reduction Stage and Analysis Reduction Stage to produce the Research Results.]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
– 100’s of HIV and HepC researchers actively use it
– 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
– Requires a large number of test runs for a given job (1–10M tests)
– Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress the required data (the local sequence database)
2. Upload the compressed data to Azure Storage
3. Deploy Worker Roles – each role’s Init() function downloads and decompresses the data, alongside the BLAST executable, to the local disk
Step 2. Partitioning a Job
[Diagram: the Web Role takes the user input; a single partitioning Worker Role splits it into input partitions in Azure Storage and posts a queue message per partition.]
Step 3. Doing the Work
[Diagram: BLAST-ready Worker Roles pick up queue messages, read their input partitions from Azure Storage, and write the BLAST output and logs back to Azure Storage.]
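The partition-then-work pattern of Steps 2 and 3 can be sketched as follows. This is a hypothetical stand-in: the queue and storage are plain Python containers, and `run_blast` merely marks a partition as processed rather than invoking the real BLAST executable.

```python
# Hypothetical sketch of the AzureBLAST pattern: the web role splits user
# input into partitions and enqueues one message per partition; each worker
# dequeues a message, runs BLAST on it, and stores the output plus a log line.

from collections import deque

def partition_job(user_input, partition_size):
    """Web role: split the input into fixed-size partitions (queue messages)."""
    return [user_input[i:i + partition_size]
            for i in range(0, len(user_input), partition_size)]

def run_blast(partition):
    """Stand-in for invoking the BLAST executable on one partition."""
    return f"blast-output({partition})"

def worker_loop(queue, output_store, logs):
    """Worker role: drain the queue, one partition per message."""
    while queue:
        partition = queue.popleft()
        output_store.append(run_blast(partition))
        logs.append(f"done: {partition}")

queue = deque(partition_job("ACGTACGTACGT", partition_size=4))
outputs, logs = [], []
worker_loop(queue, outputs, logs)
```

The choice of `partition_size` is exactly the "factoring work into optimal sizes" lesson below: too small and queue-message overhead dominates, too large and a single failure loses a lot of work.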
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers                    25        16         8         4         2
Clock duration        0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time        2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time  1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
[Chart: time–space fungibility in the Cloud – resources vs. time; the same total computation can use more workers for less time, or fewer workers for more time.]
Utilizes a general jobs-based task manager which registers jobs and their resulting data
[Architecture diagram: a job definition fans out into tasks running in the Azure datacenters, with data products and results tracked in a registry; a Registry Broker and web management connect the Azure side to the user premises (or internet), where the user, the administrator, the local registry, the (HPC) cluster and any highly sensitive data remain.]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
e.g. “US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Example namespace for the account “jared”:
Account: jared
Container: images – Blobs: PIC01.JPG, PIC02.JPG
Container: movies – Blob: MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Privately or publicly accessible
Associate metadata with a container
Metadata are name/value pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: each blob in the account (e.g. PIC01.JPG and PIC02.JPG in the “images” container, MOV1.AVI in “movies”) is composed of blocks or pages – Block/Page 1, 2, 3, … – identified by Block IDs 1 through N.]
Example – uploading a 10 GB movie to Windows Azure Storage as the block blob TheBlob.wmv:

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID, scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Example – Create MyBlob with Blob Size = 10 GBytes and Fixed Page Size = 512 bytes, giving a 10 GB address space for random access operations:

PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)

GetPageRange[0, 4096) returns the valid data ranges: [0,512), [1536,2560)

GetBlob[1000, 2048) returns:
all 0 for the first 536 bytes
the next 512 bytes are the data stored in [1536,2048)
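The PutPage/ClearPage walkthrough above can be checked with a small sparse-page model. The operation names mirror the slide; the class itself is an illustrative sketch, not the storage service.

```python
# Illustrative sparse model of page blob semantics: pages are written at
# 512-byte offsets; cleared or never-written ranges read back as zeros.

class PageBlob:
    PAGE = 512

    def __init__(self, size):
        self.size = size
        self.pages = {}  # page-aligned offset -> True if written

    def put_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.pages[off] = True

    def clear_page(self, start, end):
        for off in range(start, end, self.PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self):
        """Return merged [start, end) ranges that hold valid data."""
        ranges = []
        for off in sorted(self.pages):
            if ranges and ranges[-1][1] == off:
                ranges[-1][1] = off + self.PAGE
            else:
                ranges.append([off, off + self.PAGE])
        return [tuple(r) for r in ranges]

# Replay the slide's sequence of operations on a 10 GB blob.
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
```

Replaying the four operations yields exactly the valid ranges from the slide, [0,512) and [1536,2560); and since the first valid byte after offset 1000 is at 1536, GetBlob[1000, 2048) indeed returns 1536 − 1000 = 536 zero bytes before the stored data.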
Block Blob
Targeted at streaming workloads
Update semantics
Upload a set of blocks, then commit the change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
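The ETag check used for block blob concurrency can be sketched as conditional writes against an in-memory blob (a hypothetical stand-in, not the REST API): a write succeeds only if the caller's ETag still matches the blob's current one, i.e. nobody else updated the blob in between.

```python
# Sketch of optimistic concurrency via ETag checks: every successful write
# issues a new ETag; a write carrying a stale ETag is rejected.

import itertools

_etags = itertools.count(1)

class Blob:
    def __init__(self, data=b""):
        self.data = data
        self.etag = next(_etags)

    def write_if_match(self, data, etag):
        if etag != self.etag:
            return False             # precondition failed (HTTP 412 in REST terms)
        self.data = data
        self.etag = next(_etags)     # successful write -> new ETag
        return True

blob = Blob(b"v1")
seen = blob.etag                         # client A reads the blob and its ETag
blob.write_if_match(b"v2", seen)         # client A's conditional write succeeds
stale = blob.write_if_match(b"v3", seen) # client B, holding the stale ETag, fails
```

Leases on page blobs solve the same problem pessimistically: one writer holds the lease, others wait, rather than detecting the conflict after the fact.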
All writes are applied to the base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion (promote a snapshot of MyBlob over the base blob)
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as an NTFS single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
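The required properties above can be illustrated with a toy table model (not the real .NET or REST API): every entity carries a PartitionKey, a RowKey and a Timestamp, and the (PartitionKey, RowKey) pair uniquely identifies an entity within a table. The job-tracking names in the example are hypothetical.

```python
# Illustrative model of Azure table entities: each entity is a property set
# keyed by (PartitionKey, RowKey), with a service-assigned Timestamp.

import time

class Table:
    def __init__(self):
        self.entities = {}  # (PartitionKey, RowKey) -> property dict

    def insert(self, partition_key, row_key, **properties):
        entity = {"PartitionKey": partition_key, "RowKey": row_key,
                  "Timestamp": time.time(), **properties}
        self.entities[(partition_key, row_key)] = entity
        return entity

    def get(self, partition_key, row_key):
        """Point lookup on the only indexed keys: partition key + row key."""
        return self.entities.get((partition_key, row_key))

jobs = Table()
jobs.insert("blast-job-7", "partition-001", status="queued")
entity = jobs.get("blast-job-7", "partition-001")
```

This also motivates the best practice below that tables only index on the partition and row keys: any query that cannot be expressed as this kind of key lookup (or key-range scan) pays for a table scan.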
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
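The "include retry logic in all instances where you are accessing data" advice above amounts to a helper like the following. This is a minimal sketch: the exception type and delay values are placeholders, and real code would catch the storage client's specific transient errors rather than bare `Exception`.

```python
# A minimal retry-with-exponential-backoff helper of the kind the best
# practices call for around every data access.

import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Call operation(); on failure, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# A flaky operation that fails twice before succeeding.
calls = []
def flaky_read():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("transient storage error")
    return "data"

result = with_retries(flaky_read)
```

Pairing this with idempotent workers (the "execute a task only once" guideline) keeps retries safe: re-running a failed operation must not corrupt state.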
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three-year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Just as did “killer micros” and inexpensive clusters
Range in size from “edge” facilities to megascale.
Economies of scale
Approximate costs for a small-sized center (1000 servers) and a larger, 100K-server center:

Technology        Cost in small-sized Data Center   Cost in Large Data Center   Ratio
Network           $95 per Mbps/month                $13 per Mbps/month          7.1
Storage           $2.20 per GB/month                $0.40 per GB/month          5.7
Administration    ~140 servers/administrator        >1000 servers/administrator 7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
– Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
– HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
– DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
– HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
– DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
[Diagram: HTTP traffic passes through the Load Balancer to Web Role instances (IIS hosting ASP.NET, WCF, etc.) and Worker Role instances (main() { … }); each role runs in a VM alongside an Agent, managed by the Fabric.]
Using queues for reliable messaging
To scale, add more of either role:
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in the queue
3) The Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
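The reliable-messaging property that lets queues mask worker faults can be sketched with a toy queue (a simplified in-memory stand-in for the Azure queue semantics, with hypothetical method names): a dequeued message becomes invisible rather than deleted, and reappears if the worker crashes before explicitly deleting it.

```python
# Sketch of the web role / queue / worker role pattern with visibility
# timeouts: a message taken by a worker stays invisible until deleted, so a
# crashed worker's message reappears and another worker can process it.

class Queue:
    def __init__(self):
        self.visible = []
        self.invisible = {}     # receipt -> message
        self._next_receipt = 0

    def put(self, message):
        self.visible.append(message)

    def get(self):
        """Dequeue: the message becomes invisible; the caller gets a receipt."""
        if not self.visible:
            return None, None
        message = self.visible.pop(0)
        self._next_receipt += 1
        self.invisible[self._next_receipt] = message
        return message, self._next_receipt

    def delete(self, receipt):
        """Worker finished: remove the message for good."""
        self.invisible.pop(receipt, None)

    def expire_visibility(self):
        """Visibility timeout elapsed: undeleted messages reappear."""
        self.visible.extend(self.invisible.values())
        self.invisible.clear()

q = Queue()
q.put("work-item-1")
msg, receipt = q.get()      # a worker takes the item...
q.expire_visibility()       # ...and crashes before deleting it
msg2, receipt2 = q.get()    # another worker picks the same item up
q.delete(receipt2)          # this time the work completes
```

This is also why workers must be idempotent: a message can be delivered more than once, so doing the work twice has to be harmless. Inter-role TCP communication trades away exactly this durability for lower latency.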
A closer look
[Diagram: applications access storage – Blobs, Drives, Tables, Queues – over HTTP through a Load Balancer, via the REST API; Compute, Storage and the Fabric make up the platform.]
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage – not relational; entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
[Diagram: whether at work or at home, develop your app against the local Development Fabric and Development Storage, with source control for versioning; verify the application works locally, then in staging, then in the cloud.]
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 74
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o Node and system architectures
- Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory on the nodes
o Communication fabric
o Storage systems
- HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
- DC: TB local storage; secondary is JBOD; tertiary is non-existent
o Reliability and resilience
- HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
- DC: loosely consistent models, designed to transparently recover from failures
o Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
(Diagram: HTTP requests pass through the Load Balancer to Web Roles, which run IIS hosting ASP.NET, WCF, etc.; Worker Roles run main() { … }. Each role instance runs with an Agent on a VM in the Fabric.)
Using queues for reliable messaging
To scale, add more of either role
1) Web Role (ASP.NET, WCF, etc.) receives work
2) Web Role puts work in the queue
3) Worker Role (main() { … }) gets work from the queue
4) Worker Role does the work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
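The queue-glue pattern above can be sketched with an in-memory queue (a hypothetical stand-in for an Azure queue; the real service adds visibility timeouts and at-least-once delivery, which this sketch omits):

```python
import queue

def web_role_receive(work_items, q):
    """Web role side: receive work and put one message per item in the queue."""
    for item in work_items:
        q.put(item)

def worker_role_drain(q, results):
    """Worker role side: get work from the queue and do it, until empty."""
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return
        results.append(item.upper())  # stand-in for the real work

q = queue.Queue()
results = []
web_role_receive(["job-a", "job-b", "job-c"], q)
worker_role_drain(q, results)
```

Because the queue decouples the two roles, either side can be scaled by adding instances without changing the other.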
(Diagram: a closer look at storage. Applications running on Compute reach Azure Storage (Blobs, Drives, Tables, and Queues) through a Load Balancer, using an HTTP REST API; both run on the Fabric.)
Points of interest
Storage types
Blobs: simple interface for storing named files along with metadata for the file
Drives: durable NTFS volumes
Tables: entity-based storage; not relational, entities contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
(Diagram: the development cycle. Whether at work or home, develop your app against the local Development Fabric and Development Storage, with source control and versioning; the application works locally, then works in staging in the cloud.)
What’s the ‘Value Add’?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, and racks
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failure
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
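The allocation idea can be sketched as round-robin placement (an illustration only, not the actual Fabric Controller algorithm; the fault-domain count below is a hypothetical value, the update-domain count is the slide's example):

```python
def allocate(instance_count, fault_domains, update_domains):
    """Spread role instances across fault and update domains round-robin,
    so neither a hardware failure nor a rolling update takes out the service."""
    return [(i % fault_domains, i % update_domains)
            for i in range(instance_count)]

# Slide's example: 10 front-ends across 5 update domains
placements = allocate(10, fault_domains=2, update_domains=5)
```

With this spread, updating one update domain touches only 2 of the 10 front-ends at a time.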
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into the goal state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999, descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
(Diagram: the AzureMODIS pipeline. A Service Web Role Portal feeds a Download Queue; data passes through the Data Collection Stage, Reprojection Stage, Derivation Reduction Stage, and Analysis Reduction Stage to produce Research Results.)
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
- 100’s of HIV and HepC researchers actively use it
- 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10 – 20 CPU hours; extreme jobs require 1K – 2K CPU hours
- Requires a large number of test runs for a given job (1 – 10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database)
2. Upload to Azure Storage
3. Deploy Worker Roles: the Init() function downloads and decompresses the data, along with the BLAST executable, to the local disk
Step 2. Partitioning a Job
(Diagram: the Web Role accepts user input; a single partitioning Worker Role writes input partitions to Azure Storage and posts one queue message per partition.)
Step 3. Doing the Work
(Diagram: BLAST-ready Worker Roles pick up the queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage.)
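The partition-and-queue flow can be sketched as follows (the partition size, message shape, and names are hypothetical illustrations, not the AzureBLAST implementation):

```python
def partition_job(sequences, partition_size):
    """Split a job's input sequences into fixed-size partitions;
    each partition becomes one unit of work for a worker role."""
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]

def to_queue_messages(partitions, job_id):
    """One queue message per partition; workers pull these and run BLAST."""
    return [{"job": job_id, "partition": n, "sequences": part}
            for n, part in enumerate(partitions)]

parts = partition_job([f"seq{i}" for i in range(10)], partition_size=4)
messages = to_queue_messages(parts, job_id="blast-001")
```

As the lessons below note, the choice of partition size has large performance impacts and may need tuning per job.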
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Time-space fungibility in the Cloud (chart: Resources vs. Time)

Workers                  25        16        8         4         2
Clock duration           0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time           2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time   1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
Utilizes a general job-based task manager, which registers jobs and their resulting data
(Diagram: a job definition comprising tasks flows through a Registry Broker to a Registry in the Azure datacenters; highly sensitive data stays in a local Registry on the user premises (or internet); the (HPC) cluster administrator and users interact via Web Management, and results and data products are returned.)
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, etc.
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
(Diagram: the account → container → blob hierarchy. Account “jared” holds container “images” with blobs PIC01.JPG and PIC02.JPG, and container “movies” with blob MOV1.AVI.)
http://jared.blob.core.windows.net/images/PIC01.JPG
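The URI above is composed from the account, container, and blob names; a minimal sketch of that naming pattern:

```python
def blob_url(account, container, blob):
    """Compose the public blob URI from the slide's naming pattern:
    http://<account>.blob.core.windows.net/<container>/<blob>"""
    return f"http://{account}.blob.core.windows.net/{container}/{blob}"

url = blob_url("jared", "images", "PIC01.JPG")
```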
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate metadata with a container
Metadata can be up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
(Diagram: the account → container → blob hierarchy extended one level: each blob, e.g. a 10 GB movie, consists of blocks or pages, identified as Block Id 1, Block Id 2, Block Id 3, … Block Id N.)
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
...
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, ..., blockIdN);
(The committed blocks become TheBlob.wmv in Windows Azure Storage.)
Blocks can be up to 4MB each, and each block can be a different size
Each block has a 64-byte ID, scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes, fixed page size = 512 bytes
Random access operations over the 10 GB address space:
PutPage [512, 2048)
PutPage [0, 1024)
ClearPage [512, 1536)
PutPage [2048, 2560)
GetPageRange [0, 4096) returns valid data ranges: [0, 512), [1536, 2560)
GetBlob [1000, 2048) returns all zeros for the first 536 bytes; the next 512 bytes are the data stored in [1536, 2048)
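The page-blob example above can be checked with a toy page-validity model (for brevity it tracks only the first 4 KB of the 10 GB address space; the class and method names are illustrative, not the Storage API):

```python
PAGE = 512  # fixed page size from the example

class PageBlob:
    """Toy model of page-blob valid-range tracking, one flag per page."""
    def __init__(self, size):
        self.valid = [False] * (size // PAGE)

    def put_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.valid[p] = True

    def clear_page(self, start, end):
        for p in range(start // PAGE, end // PAGE):
            self.valid[p] = False

    def get_page_ranges(self):
        """Return the valid byte ranges, merging adjacent valid pages."""
        ranges, run = [], None
        for p, ok in enumerate(self.valid):
            if ok and run is None:
                run = p * PAGE
            if not ok and run is not None:
                ranges.append((run, p * PAGE))
                run = None
        if run is not None:
            ranges.append((run, len(self.valid) * PAGE))
        return ranges

blob = PageBlob(4096)           # first 4 KB of the 10 GB blob
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
```

Replaying the slide's four operations leaves exactly the two valid ranges the slide reports.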
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
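A toy sketch of entity storage keyed by the required properties (illustrative only; the real table service offers richer queries via ADO.NET Data Services and REST, and its timestamps are server-assigned):

```python
import time

class Table:
    """Toy entity table: each entity is a property dict addressed by
    the required (PartitionKey, RowKey) pair; Timestamp set on insert."""
    def __init__(self):
        self.entities = {}

    def insert(self, partition_key, row_key, **properties):
        properties.update(PartitionKey=partition_key, RowKey=row_key,
                          Timestamp=time.time())
        self.entities[(partition_key, row_key)] = properties

    def get(self, partition_key, row_key):
        # Point lookup on the only indexed keys: fast in the real service
        return self.entities[(partition_key, row_key)]

    def query_partition(self, partition_key):
        # Any other filter implies a scan; here, over one partition
        return [e for (pk, _), e in self.entities.items()
                if pk == partition_key]

t = Table()
t.insert("images", "PIC01.JPG", size_mb=2)
t.insert("images", "PIC02.JPG", size_mb=3)
t.insert("movies", "MOV1.AVI", size_mb=700)
```

This is why the best practices below stress that tables only index on partition and row keys: anything else degenerates into the scan that query_partition illustrates.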
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
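The retry advice above can be sketched as exponential backoff (the delays, attempt count, and IOError fault type are illustrative choices, not Storage Client Library behavior):

```python
import time

def with_retries(operation, attempts=4, base_delay=0.01):
    """Retry a data-access call with exponential backoff,
    re-raising only after the final attempt fails."""
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# A stand-in for a flaky storage call that succeeds on the third try
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

result = with_retries(flaky_read)
```

Pairing this with idempotent workers (the "execute a task only once" advice above) keeps retries safe even when a message is delivered more than once.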
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When a data collection does grow large, we are not able to analyze it.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of the desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 75
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
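The retry-logic advice above can be sketched generically. This is a minimal illustration, not part of any Azure SDK; the exception type, attempt count, and delays would depend on the client library in use.

```python
# Generic sketch of "include retry logic in all instances where you are
# accessing data": retry transient failures with exponential backoff.
import time

def with_retries(operation, attempts=4, base_delay=0.5):
    """Call operation(); on failure wait, double the delay, and retry."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise            # out of retries: surface the error
            time.sleep(delay)
            delay *= 2           # exponential backoff between attempts

# Usage: wrap any storage call, e.g. with_retries(lambda: fetch_blob(url))
```

Combined with idempotent workers, this pattern lets a role ride out the transient storage and network faults that are expected at data-center scale.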
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower the barrier to entry through tutorials, accelerators, and developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question: what does it take to catalyze a community of researchers?
What are the core services and key products to pull through to support research?
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 77
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256-bit secret key when creating the account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
[Example namespace: account “jared” contains containers “images” (PIC01.JPG, PIC02.JPG) and “movies” (MOV1.AVI).]
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many blob containers as will fit within the storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or publicly accessible
Associate metadata with a container
Metadata are <name, value> pairs
Up to 8 KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: the same account/container/blob namespace, one level deeper: each blob (e.g. PIC01.JPG) is composed of blocks or pages (Block/Page 1, 2, 3, …, N), each block carrying a Block ID.]
Uploading a 10 GB movie:
blobName = "TheBlob.wmv";
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
[Diagram: the committed blocks form TheBlob.wmv in Windows Azure Storage.]
Blocks can be up to 4 MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
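The two-phase commit these operations describe can be simulated in a few lines: blocks staged with PutBlock are invisible to readers until PutBlockList commits them, and a later commit may mix committed and uncommitted blocks. An in-memory sketch of the semantics, not the storage service itself:

```python
class BlockBlob:
    def __init__(self):
        self.uncommitted = {}   # block ID -> bytes, staged by put_block
        self.committed = []     # ordered (block ID, bytes) after put_block_list

    def put_block(self, block_id, data):
        """Stage an uncommitted block; readers cannot see it yet."""
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        """Commit: blocks may come from the uncommitted or committed list."""
        source = {**dict(self.committed), **self.uncommitted}
        self.committed = [(bid, source[bid]) for bid in block_ids]
        self.uncommitted = {}

    def read(self):
        """Readable version of the blob: committed blocks only, in order."""
        return b"".join(data for _, data in self.committed)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")
assert blob.read() == b""          # nothing readable before the commit
blob.put_block_list(["b1", "b2"])
assert blob.read() == b"hello world"
```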
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
[Diagram: a 10 GB address space with page boundaries at 0, 512, 1024, 1536, 2048, 2560, …, supporting random access operations.]
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)
GetPageRange[0, 4096) returns valid data ranges:
[0, 512), [1536, 2560)
GetBlob[1000, 2048) returns
All 0 for the first 536 bytes
Next 512 bytes are the data stored in [1536, 2048)
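The sequence above can be checked with a small model that tracks which 512-byte pages currently hold data (a sketch of the semantics only; real page blobs use the REST operations named above):

```python
PAGE = 512

class PageBlobModel:
    def __init__(self):
        self.pages = {}  # page offset -> bytes; absent pages read as zeros

    def put_page(self, start, end, fill=b"x"):
        """Write every page in [start, end)."""
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        """Clear every page in [start, end); cleared pages read as zeros."""
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)

    def get_page_ranges(self):
        """Coalesce populated pages into [start, end) valid-data ranges."""
        ranges = []
        for off in sorted(self.pages):
            if ranges and ranges[-1][1] == off:
                ranges[-1][1] = off + PAGE
            else:
                ranges.append([off, off + PAGE])
        return [tuple(r) for r in ranges]

b = PageBlobModel()
b.put_page(512, 2048)
b.put_page(0, 1024)
b.clear_page(512, 1536)
b.put_page(2048, 2560)
# Matches the slide: valid data ranges [0,512) and [1536,2560)
assert b.get_page_ranges() == [(0, 512), (1536, 2560)]
```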
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
[Diagram: snapshots of MyBlob; promoting a snapshot restores that version.]
A Windows Azure Drive is a Page Blob formatted as an NTFS
single-volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
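A toy model of the entity layout, with the three required properties. `Table`, `insert`, and `scan` are illustrative names; the real service is reached through ADO.NET Data Services or REST:

```python
from datetime import datetime, timezone

class Table:
    """Entities are property bags addressed by (PartitionKey, RowKey)."""
    def __init__(self):
        self.entities = {}

    def insert(self, partition_key, row_key, **properties):
        """Store an entity with the three required properties filled in."""
        properties.update(PartitionKey=partition_key, RowKey=row_key,
                          Timestamp=datetime.now(timezone.utc))
        self.entities[(partition_key, row_key)] = properties

    def get(self, partition_key, row_key):
        # Point lookup on the only indexed properties: efficient.
        return self.entities[(partition_key, row_key)]

    def scan(self, predicate):
        # Filtering on any other property means a full scan.
        return [e for e in self.entities.values() if predicate(e)]

t = Table()
t.insert("sequences", "seq-001", length=1200, organism="E. coli")
```

The get/scan asymmetry is the reason the best-practices advice warns that tables only index on partition and row keys.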
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
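The retry advice can be sketched as a generic exponential-backoff wrapper (a common pattern, not the platform's built-in policy; `IOError` stands in for whatever transient fault your storage client raises):

```python
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Call op(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise          # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_read():
    """Simulated storage read that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient storage fault")
    return "blob bytes"

result = with_retries(flaky_read)
```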
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 78
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
LoadBalancers
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back into goals state
Windows Azure FC monitors the health of host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra,“EOS AM” , launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
...
Research Results
Download
Queue
Data Collection Stage
Analysis Reduction Stage
AzureMODIS
Service Web Role
Portal
Reprojection Stage
Derivation Reduction Stage
•
Statistical tool used to analyze DNA of HIV
from large studies of infected patients
•
PhyloD was developed by Microsoft
Research and has been highly impactful
Small but important group of researchers
•
100’s of HIV and HepC researchers actively use it
1000’s of research communities rely on results
Cover of PLoS Biology
November 2008
Typical job, 10 – 20 CPU hours, extreme jobs require 1K – 2K CPU hours
–
–
Requires a large number of test runs for a given job (1 – 10M tests)
Highly compressed data per job ( ~100 KB per job)
Step 1. Staging
Local
Sequence
Database
1.
2.
3.
Compressed
Compress required data
Upload to Azure Store
Deploy Worker Roles
- Init() function downloads
and decompresses data
to the local disk
Uploaded
Azure Storage
Deployed
BLAST
Executable
…
Step 2. Partitioning a Job
User Input
Input Partition
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Step 3. Doing the Work
User Input
Input Partition
BLAST Output
Azure Storage
Queue Message
Web Role
Single Partitioning
Worker Role
Logs
…
BLAST ready Worker Roles
•
Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
•
Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
•
Test runs are your friend
- Blowing $20,000 of computation is not a good idea
•
Make ample use of logging features
- When failure does happen, it’s good to know where
•
Cutting 10 years of computation down to 1 week is great!!
- Little Cloud development headaches are probably worth it
Resources
Workers
25
16
8
4
2
Clock
Duration
0:12:00
0:15:00
0:26:00
0:47:00
1:27:00
Total run time
2:19:39
2:25:12
2:33:23
2:34:17
2:31:39
Computational run time
1:49:43
1:53:47
2:00:14
2:01:06
1:59:13
Resources
Time
Time-Space
fungibility in the
Cloud
Time
Utilizes a general jobs based task manager
which registers jobs and their resulting data
Data
Products
Job definition
Task
Task
Task
Task
Task
Registry
(HPC) Cluster
Administrator
Registry Broker
Highly Sensitive Data
User
Local
Registry
Web Management
Results
User Premises (or internet)
Azure Datacenters
Client Visualization / Cloud Data and Computation
•
•
•
The Cloud is not a Jack-of-All-Trades
Client side tools are particularly appropriate for
Applications using periphery devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .Net APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”,
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have has many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are
Up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
Account
Container
images
jared
Blob
PIC01.JPG
PIC02.JPG
movies
MOV1.AVI
Block or
Page
Block or
Page 1
Block or
Page 2
Block or
Page 3
Block Id N
Block Id 1
Block Id 2
Block Id 3
10 GB Movie
blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…………
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName,
blockId1,…,blockIdN);
TheBlob.wmv
Windows Azure
Storage
Block can be up to 4MB each
Each block can be variable size
Each block has a 64 byte ID
Scoped by blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
Create MyBlob
Specify Blob Size = 10 GBytes
Fixed Page Size = 512 bytes
0
Random Access Operations
512
1536
2048
2560
10 GB
10 GB Address Space
1024
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048,2560)
GetPageRange[0, 4096) returns
valid data ranges:
[0,512) , [1536,2560)
GetBlob[1000, 2048) returns
All 0 for first 536 bytes
Next 512 bytes are data stored in
[1536,2048
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
MyBlob
Promote
A Windows Azure Drive is a Page Blob formatted as a NTFS
single volume Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can create many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it is really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collection does grow large, not able to analyze.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to a substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]
Slide 79
Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research
Microsoft and Cloud Computing
Introduction to Windows Azure
Research Applications on Azure, demos
How They Were Built
A Closer Look at Azure
Cloud Research Engagement Initiative
Q&A
[10 minutes]
[35 minutes]
[10 minutes]
[15 minutes]
[15 minutes]
[ 5 minutes]
[ * ]
“In the last two decades advances in computing
technology, from processing speed to network
capacity and the Internet, have revolutionized
the way scientists work.
From sequencing genomes to monitoring the
Earth's climate, many recent scientific
advances would not have been possible
without a parallel increase in computing
power - and with revolutionary technologies
such as the quantum computer edging
towards reality, what will the relationship
between computing and science bring us over
the next 15 years?”
Sapir–Whorf Hypothesis (SWH)
Language influences the habitual thought of its speakers
Scientific computing analog
Available systems shape research agendas
Consider some past examples
Cray-1 and vector computing
VAX 11/780 and UNIX
Workstations and Ethernet
PCs and web
Inexpensive clusters and Grids
Today’s examples
multicore, sensors, clouds and services …
What lessons can we draw?
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Manycore processors/accelerators
Software as a service/cloud computing
Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
Just as did “killer micros” and inexpensive clusters
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1000 servers) and a larger,
100K server center.
Technology
Cost in smallsized Data
Center
Cost in Large
Data Center
Ratio
Network
$95 per Mbps/
month
$13 per Mbps/
month
7.1
Storage
$2.20 per GB/
month
$0.40 per GB/
month
5.7
Administration
~140 servers/
Administrator
>1000 Servers/
Administrator
7.1
Each data center is
11.5 times
the size of a football field
Conquering complexity
Building racks of servers & complex
cooling systems all separately is not
efficient.
Package and deploy into bigger units, JITD
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Node architectures are indistinguishable – Intel Nehalem, AMD Barcelona or
Shanghai, multiple processors, big chunk of memory on the nodes
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
o
Node and system architectures
o
Communication fabric
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
o
HPC: local scratch or non-existent, secondary is SAN or PFS, PB tertiary storage
DC: TB local storage, secondary is JBOD, tertiary is non-existent
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
o
HPC: periodic checkpoints, rollback and resume in response to failures, MTBF
approaching zero, checkpoint frequency increasing, I/O demand intolerable.
DC: loosely consistent models, designed to transparently recover from failures
o
Node and system architectures
o
Communication fabric
o
Storage systems
o
Reliability and resilience
o
Programming model and services
Azure FC Owns this Hardware
Highly-available
Fabric Controller (FC)
At Minimum
CPU: 1.5-1.7 GHz x64
Memory: 1.7GB
Network: 100+ Mbps
Local Storage: 500GB
Up to
CPU: 8 Cores
Memory: 14.2 GB
Local Storage: 2+ TB
Azure Platform
Compute
Storage
A closer look
Web Role
HTTP
Load
Balancer
IIS
Worker Role
ASP.NET, WCF,
etc.
Agent
main()
{ … }
Agent
Fabric
VM
Using queues for reliable messaging
To scale, add more of either
1) Receive work
Worker Role
Web Role
main()
{ … }
ASP.NET, WCF,
etc.
2) Put work in
queue
3) Get work
from queue
Queue
4) Do
work
Queues are the application glue
• Decouple parts of application, easier to scale independently;
• Resource allocation, different priority queues and backend servers
• Mask faults in worker roles (reliable messaging).
Use Inter-role communication for performance
• TCP communication between role instances
• Define your ports in the service models
Blob
REST
API
Load Balancer
Queue
Table
A closer look
HTTP
Blobs
Application
Storage
Compute
Fabric
…
Drives
Tables
Queues
Points of interest
Storage types
Blobs: Simple interface for storing named files along with metadata for
the file
Drives – Durable NTFS volumes
Tables: entity-based storage
Not relational – entities, which contain a set of properties
Queues: reliable message-based communication
Access
Data is exposed via .NET and RESTful interfaces
Data can be accessed by:
Windows Azure apps
Other on-premise applications or cloud applications
Work
Home
Develop
Development Fabric
Develop
Your
App
Run
Development Storage
Source
Control
Version
Local
Application Works Locally
Application Works Locally
Application Works
In Staging
Cloud
What the ‘Value Add’ ?
Provide a platform that is scalable and available
Services are always running, rolling upgrades/downgrades
Failure of any node is expected, state has to be replicated
Failure of a role (app code) is expected, automatic recovery
Services can grow to be large, provide state management
that scales automatically
Handle dynamic configuration changes due to load or failure
Manage data center hardware: from CPU cores, nodes, rack,
to network infrastructure and load balancers.
Fabric Controller
Owns all data center hardware
Uses inventory to host services
Deploys applications to free
resources
Maintains the health of those
applications
Maintains health of hardware
Manages the service life cycle
starting from bare metal
Fault Domains
Purpose: Avoid single points of failures
Fault domains
Allocation is across
fault domains
Update Domains
Purpose: ensure the service stays up
while undergoing an update
Update domains
Unit of software/configuration update
Example: set of nodes to update
Used when rolling forward or backward
Developer assigns number required by each
role
Example: 10 front-ends, across 5 update domains
Allocation is across
update domains
Push-button Deployment
Step 1: Allocate nodes
Across fault domains
Across update domains
Step 2: Place OS and role images on nodes
Allocation across
fault and update
domains
Step 3: Configure settings
Step 4: Start Roles
Step 5: Configure load-balancers
Step 6: Maintain desired number of roles
Failed roles automatically restarted
Node failure results in new nodes automatically allocated
The FC Keeps Your Service Running
Windows Azure FC monitors the health of roles
FC detects if a role dies
A role can indicate it is unhealthy
Current state of the node is updated appropriately
State machine kicks in again to drive us back to the goal state
Windows Azure FC monitors the health of the host
If the node goes offline, FC will try to recover it
If a failed node can’t be recovered, FC migrates role instances to a
new node
A suitable replacement location is found
Existing role instances are notified of change
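The goal-state behavior described above can be sketched as a reconciliation loop: compare observed state to the goal and repair the difference. The role and node names are hypothetical, and the real FC logic is far richer than this:

```python
# Sketch: FC-style reconciliation toward a goal state.
# Roles on unrecoverable nodes are migrated; missing instances started.

def reconcile(goal_count, running_roles, healthy_nodes):
    """Return the actions needed to get back to goal_count healthy roles."""
    actions = []
    alive = [r for r in running_roles if r["node"] in healthy_nodes]
    # Migrate role instances stranded on failed nodes.
    for r in running_roles:
        if r["node"] not in healthy_nodes:
            actions.append(("migrate", r["role"]))
    # Start replacements until the desired count is reached.
    for _ in range(goal_count - len(alive)):
        actions.append(("start", "new-instance"))
    return actions

# Goal: 3 instances. Node n2 has failed, taking web-1 with it.
actions = reconcile(
    goal_count=3,
    running_roles=[{"role": "web-0", "node": "n1"},
                   {"role": "web-1", "node": "n2"}],
    healthy_nodes={"n1"},
)
```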
Key takeaways
Cloud services have specific design considerations
Always on, distributed state, large scale, fault tolerance
Scalable infrastructure demands a scalable architecture
Stateless roles and durable queues
Windows Azure frees service developers from
many platform issues
Windows Azure manages both services and servers
Demonstrating Scientific Research Applications in the Cloud
AzureBLAST
- Finding similarities in genetic sequences
Azure Ocean
- Rich client visualization with cloud based data computation
Azure MODIS
- Imagery analysis from satellite photos
PhyloD
- Finding relationships in phylogenetic trees
Two satellites:
Terra, “EOS AM”, launched 12/1999,
descending, equator crossing at 10:30 AM
Aqua, “EOS PM”, launched 5/2002,
ascending, equator crossing at 1:30 PM
Near polar orbits, day/night mode, ~2300 KM
swath
L0 (raw) and L1 (calibrated) data held at
Goddard DAAC
L2 and L3 products made by a collection of
different algorithms provided by a number
of different researchers
[Diagram: the AzureMODIS service web role portal drives a pipeline: download queue, data collection stage, reprojection stage, derivation reduction stage, and analysis reduction stage, producing research results]
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• PhyloD was developed by Microsoft Research and has been highly impactful
• Small but important group of researchers
- 100’s of HIV and HepC researchers actively use it
- 1000’s of research communities rely on results
Cover of PLoS Biology, November 2008
• Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
- Requires a large number of test runs for a given job (1-10M tests)
- Highly compressed data per job (~100 KB per job)
Step 1. Staging
1. Compress required data (the local sequence database)
2. Upload to Azure Storage
3. Deploy Worker Roles with the BLAST executable
- Init() function downloads and decompresses data to the local disk
Step 2. Partitioning a Job
[Diagram: the web role takes user input; a single partitioning worker role writes input partitions to Azure Storage and posts queue messages]
Step 3. Doing the Work
[Diagram: BLAST-ready worker roles pick up queue messages, read their input partitions from Azure Storage, and write BLAST output and logs back to Azure Storage]
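The steps above boil down to a partition-then-queue pattern. A minimal in-memory sketch, with illustrative sequence names and chunk size; the real system uses Azure queues and worker roles running BLAST:

```python
# Sketch: AzureBLAST's partition-then-queue pattern, in memory.
# One partitioning step posts a message per chunk; workers drain them.

from collections import deque

def partition(sequences, chunk_size):
    """Split the input into fixed-size partitions (last may be short)."""
    return [sequences[i:i + chunk_size]
            for i in range(0, len(sequences), chunk_size)]

queue = deque()
for part in partition(["seq%d" % i for i in range(10)], chunk_size=3):
    queue.append({"partition": part})        # the "queue message"

results = []
while queue:                                 # each worker: get work, do work
    msg = queue.popleft()
    results.append(len(msg["partition"]))    # stand-in for running BLAST
```

Chunk size is exactly the "factoring work into optimal sizes" knob from the lessons below: too small and queue overhead dominates, too large and a single failure wastes hours.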
• Always design with failure in mind
- On large jobs it will happen, and it can happen anywhere
• Factoring work into optimal sizes has large performance impacts
- The optimal size may change depending on the scope of the job
• Test runs are your friend
- Blowing $20,000 of computation is not a good idea
• Make ample use of logging features
- When failure does happen, it’s good to know where
• Cutting 10 years of computation down to 1 week is great!
- The little cloud development headaches are probably worth it
Resources

Workers                   25        16         8         4         2
Clock duration         0:12:00   0:15:00   0:26:00   0:47:00   1:27:00
Total run time         2:19:39   2:25:12   2:33:23   2:34:17   2:31:39
Computational run time 1:49:43   1:53:47   2:00:14   2:01:06   1:59:13
[Chart: time-space fungibility in the Cloud; plotted as Resources vs. Time, the same total computation can run on many workers for a short time or on few workers for a long time]
Utilizes a general jobs-based task manager which registers jobs and their resulting data
[Diagram: a job definition fans out into tasks registered in the registry; a registry broker in the Azure datacenters connects a local registry on the user premises (or internet), where highly sensitive data stays with the user and the (HPC) cluster administrator, to web management of data products and results]
Client Visualization / Cloud Data and Computation
The Cloud is not a Jack-of-All-Trades
Client-side tools are particularly appropriate for
Applications using peripheral devices
Applications with heavy graphics requirements
Legacy user interfaces that would be difficult to port
Our goal then:
Make best use of the capabilities of client and cloud computing
Often by making the cloud invisible to the end user
A deeper dive into Windows Azure’s inner workings
- Focus on Azure Storage internals
A sampling of best practices
Rich Data Abstractions
Large user data items: blobs
Service state: tables
Service workflow: queues
Existing NTFS service migration: drives
Simple and Familiar Programming Interfaces
REST: HTTP and HTTPS
Supported Storage Client Library: .NET APIs
NTFS: Azure Drive
User creates a globally unique storage account name
Can choose geo-location to host storage account
“US Anywhere”, “US North Central”, “US South Central”, …
Can co-locate storage account with compute account
Receive a 256 bit secret key when creating account
Storage Account Capacity
Each storage account can store up to 100 TB
Default limit of 5 storage accounts per subscription
Naming hierarchy: Account, Container, Blob
Example: account “jared” with containers “images” (PIC01.JPG, PIC02.JPG) and “movies” (MOV1.AVI)
http://jared.blob.core.windows.net/images/PIC01.JPG
Number of Blob Containers
Can have as many Blob Containers as will fit within the
storage account limit
Blob Container
A container holds a set of blobs
Set access policies at the container level
Private or Public accessible
Associate Metadata with Container
Metadata are name/value pairs, up to 8KB per container
Block Blob
Targeted at streaming workloads
Each blob consists of a sequence of blocks
Each block is identified by a Block ID
Size limit 200GB per blob
Page Blob
Targeted at random read/write workloads
Each blob consists of an array of pages
Each page is identified by its offset from the start of the blob
Size limit 1TB per blob
[Diagram: within account “jared”, containers “images” and “movies” hold blobs PIC01.JPG, PIC02.JPG, and MOV1.AVI; each blob consists of Block or Page 1..N, with each block identified by a Block ID]
Example: uploading a 10 GB movie (TheBlob.wmv) to Windows Azure Storage

blobName = “TheBlob.wmv”;
PutBlock(blobName, blockId1, block1Bits);
PutBlock(blobName, blockId2, block2Bits);
…
PutBlock(blobName, blockIdN, blockNBits);
PutBlockList(blobName, blockId1, …, blockIdN);
Blocks can be up to 4MB each
Each block can be a variable size
Each block has a 64-byte ID
Scoped by the blob name and stored with the blob
Block operation
PutBlock
Puts an uncommitted block defined by the block ID for the blob
Block List Operations
PutBlockList
Provide the list of blocks to comprise the readable version of the blob
Can use blocks from uncommitted or committed list to update blob
GetBlockList
Returns the list of blocks, committed or uncommitted for a blob
Block ID and Size of Block is returned for each block
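The block semantics above (blocks staged uncommitted, then committed as a list by PutBlockList) can be modeled in a few lines. This is an in-memory illustration of the behavior, not the storage API itself:

```python
# Sketch: block blob commit semantics.
# PutBlock stages uncommitted blocks; PutBlockList commits a chosen
# list of block IDs, which becomes the readable version of the blob.

class BlockBlob:
    def __init__(self):
        self.uncommitted = {}     # staged blocks, not yet readable
        self.committed = {}       # blocks in the readable blob
        self.block_list = []      # committed order

    def put_block(self, block_id, data):
        self.uncommitted[block_id] = data

    def put_block_list(self, block_ids):
        # A commit may draw from the uncommitted or committed set.
        self.committed = {
            bid: self.uncommitted.get(bid, self.committed.get(bid))
            for bid in block_ids
        }
        self.block_list = list(block_ids)
        self.uncommitted = {}

    def read(self):
        return b"".join(self.committed[bid] for bid in self.block_list)

blob = BlockBlob()
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"world")     # nothing readable until the commit
blob.put_block_list(["b1", "b2"])
```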
Example: Create MyBlob
Specify Blob Size = 10 GBytes, Fixed Page Size = 512 bytes

Random access operations over the 10 GB address space:
PutPage[512, 2048)
PutPage[0, 1024)
ClearPage[512, 1536)
PutPage[2048, 2560)

GetPageRange[0, 4096) returns valid data ranges: [0,512), [1536,2560)
GetBlob[1000, 2048) returns all 0 for the first 536 bytes; the next 512 bytes are the data stored in [1536,2048)
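The example above can be checked with a small simulation of page-blob semantics; the sparse-page model here is illustrative, not the actual storage implementation:

```python
# Sketch: page blob semantics on a sparse 512-byte-page address space.
# PutPage writes pages, ClearPage removes them, GetPageRange reports
# the valid (written) ranges.

PAGE = 512

class PageBlob:
    def __init__(self, size):
        self.size = size
        self.pages = {}                       # page offset -> contents

    def put_page(self, start, end, fill=b"x"):
        for off in range(start, end, PAGE):
            self.pages[off] = fill * PAGE

    def clear_page(self, start, end):
        for off in range(start, end, PAGE):
            self.pages.pop(off, None)         # cleared pages read as zero

    def get_page_range(self, start, end):
        ranges, cur = [], None
        for off in range(start, end, PAGE):
            if off in self.pages:
                if cur is None:
                    cur = [off, off + PAGE]
                else:
                    cur[1] = off + PAGE
            elif cur:
                ranges.append(tuple(cur))
                cur = None
        if cur:
            ranges.append(tuple(cur))
        return ranges

# Replay the slide's operation sequence.
blob = PageBlob(10 * 2**30)
blob.put_page(512, 2048)
blob.put_page(0, 1024)
blob.clear_page(512, 1536)
blob.put_page(2048, 2560)
```

Replaying the four operations reproduces the valid ranges given on the slide.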
Block Blob
Targeted at streaming workloads
Update semantics
Upload a bunch of blocks. Then commit change.
Concurrency: ETag Checks
Page Blob
Targeted at random read/write workloads
Update Semantics
Immediate update
Concurrency: Leases
All writes applied to base blob name
Only delta changes are maintained across snapshots
Restore to a prior version via snapshot promotion
Can use ListBlobs to enumerate the snapshots for a blob
A Windows Azure Drive is a Page Blob formatted as a single-volume NTFS Virtual Hard Drive (VHD)
Drives can be up to 1TB
A VM can dynamically mount up to 8 drives
A Page Blob can only be mounted by one VM at a time for
read/write
Remote Access via Page Blob
Can upload the VHD to its Page Blob using the blob interface, and then
mount it as a Drive
Can download the Drive through the Page Blob interface
Provides Structured Storage
Massively Scalable Tables
Billions of entities (rows) and TBs of data
Can use thousands of servers as traffic grows
Highly Available & Durable
Data is replicated several times
Familiar and Easy to use API
ADO.NET Data Services – .NET 3.5 SP1
.NET classes and LINQ
REST – with any platform or language
Table
A storage account can contain many tables
Table name is scoped by account
Set of entities (i.e. rows)
Entity
Set of properties (columns)
Required properties
PartitionKey, RowKey and Timestamp
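A minimal sketch of the entity model above: entities are property bags addressed by (PartitionKey, RowKey), and only those keys are indexed, so point and partition queries are the efficient access paths. Table and property names here are illustrative:

```python
# Sketch: the table data model, in memory.
# Entities are sets of properties keyed by (PartitionKey, RowKey).

class Table:
    def __init__(self):
        self.entities = {}        # (PartitionKey, RowKey) -> properties

    def insert(self, partition_key, row_key, **props):
        self.entities[(partition_key, row_key)] = props

    def get(self, partition_key, row_key):
        # Point query on the indexed keys: efficient.
        return self.entities.get((partition_key, row_key))

    def scan_partition(self, partition_key):
        # Query confined to one partition: still key-driven.
        return [props for (pk, _), props in self.entities.items()
                if pk == partition_key]

jobs = Table()
jobs.insert("blast", "job-001", status="running", cpu_hours=12)
jobs.insert("blast", "job-002", status="done", cpu_hours=18)
```

Queries on any other property would require a scan, which is why the best-practices slide below warns that tables only index on partition and row keys.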
Design and Planning
• Design your workers to execute a task only once
• Optimize against storage transactions as well as data size
• Use Azure Drive for distributing existing non-Azure
applications
Azure Storage
• Remember Azure tables only index on partition and row keys
• Batch multiple small tasks into a single queue message
• Use snapshots when you need read only access to a blob
• Use batch updates to all of your data stores
Network Communication
• Increasing VM size will also increase your network throughput
• Use node-to-node communication to save on message latency
costs
- Note that you lose durable messaging when you do this
Testing & Development
• Include retry logic in all instances where you are accessing data
• Use built-in logging and performance measurement APIs
• Use multiple worker nodes to add tasks to the message queue
• Use ‘heartbeat’ mechanisms when debugging your applications
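The retry-logic bullet above can be sketched as a wrapper with exponential backoff. The flaky read below is a stand-in for a storage call, and the backoff delays are recorded rather than slept, for brevity:

```python
# Sketch: retry logic with exponential backoff around a data access.
# IOError stands in for a transient storage fault.

def with_retries(op, max_attempts=4, base_delay=0.5):
    """Run op(), retrying transient faults with doubling delays."""
    delays = []
    for attempt in range(max_attempts):
        try:
            return op(), delays
        except IOError:
            if attempt == max_attempts - 1:
                raise                     # out of attempts: surface it
            delays.append(base_delay * 2 ** attempt)   # 0.5, 1.0, 2.0

attempts = {"n": 0}

def flaky_read():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient storage fault")
    return "data"

result, delays = with_retries(flaky_read)
```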
http://research.microsoft.com/azure
http://azurescope.cloudapp.net
• International Engagements. Offer cloud resources to academic and research
communities worldwide, back up this offering with a technical engagements
team. Lower barrier to entry through tutorials, accelerators, developer best
practices. Support policy change in government funding agencies.
• Data. Provide select reference data sets on Azure to enable communities of
researchers. Invest in services and applications to easily upload data and samples
that can be repurposed. Let the community use these to host their own data sets.
• Services for Research. Provide applications and core services for research, as
coherent solution accelerators. Pull through MS products and MSR technologies,
partner with ISVs, make these technologies discoverable and usable.
• Ask the question, what does it take to catalyze a community of researchers,
what are the core services, key products to pull through to support research.
The Rest of Us
Use laptops.
Got data, now what?
And it really is about data, not the FLOPS…
Our data collections are not as big as we wished.
When data collections do grow large, we are not able to analyze them.
Paradigm shift for research
The ability to marshal needed resources on demand.
Without caring or knowing how it gets done…
Funding agencies can request grantees to archive research data.
The cloud can support very large numbers of users or communities.
Seamless interaction
Cloud is the lens that magnifies the power of desktop;
Persist and share data from client in the cloud;
Analyze data initially captured in client tools, such as Excel;
Analysis as a service (think SQL, Map-Reduce, R/MatLab).
Data visualization generated in the cloud, display on client;
Provenance, collaboration, other ‘core’ services…
Access to substantial Windows Azure resources
Available over a three year period
To be allocated by NSF with new NSF awards
Coupled with
Access to a research-oriented technical team
Azure resource offering
20 million core hours per year
200 terabytes of triply replicated storage
1 terabyte/day/project of aggregate ingress/egress bandwidth
Tier one support
International program, discussions underway…
http://research.microsoft.com/azure
[email protected]