Figure: the management entry points (Visual Studio, the Windows Azure Portal, and the REST APIs / PowerShell cmdlets) and the US-North Central region, made up of Fabric Controllers (FC), servers, top-of-rack (TOR) switches, and power distribution units (PDU).
Figure: datacenter topology. Datacenter routers connect to aggregation routers and load balancers. Cluster network aggregation (Agg) switches serve Clusters 1 to 5, and each cluster is a set of racks (Rack 1, Rack 2, … Rack 20), each holding servers, a top-of-rack (TOR) switch, and power distribution units (PDU).

Figure: inside a physical server. Behind the TOR switch, the host partition sits on the other side of a trust boundary from the guest VMs; CPUs are allocated to PaaS VM role instances and IaaS VM roles, and some CPUs remain unallocated.

The Fabric Controller (FC) deploys the role instances in (at least) two different fault domains.
• Different roles are allocated to fault domains independently.
• An even distribution is maintained when scaling up or down.
• There is no way to control the fault domain mapping, but it can be queried for each role instance through the portal or the REST service management APIs ("FaultDomain").
• Queuing can be defined between the layers (only the load balancer is present by default).

Figure: the Azure Load Balancer in front of a web role (Web Role Instance 0 in Fault Domain 0, Instance 1 in Fault Domain 1, Instance 2 in Fault Domain 0) and a worker role (Worker Role Instance 0 in Fault Domain 0, Instance 1 in Fault Domain 1).

Update Domains (UDs) control how the service is updated: a single UD is updated for a role at a time.
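The fault domain behaviour described above (at least two fault domains, roles placed independently, an even spread kept while scaling) can be illustrated with a small sketch. It assumes simple round-robin placement, which reproduces the web/worker example in the figure; the Fabric Controller's actual placement algorithm is not described here, and the function names are hypothetical.

```python
"""Sketch: spread role instances evenly across fault domains.

Assumption: round-robin placement is used only to illustrate the
"even distribution" property; it is not the Fabric Controller's
documented algorithm. Function names are hypothetical.
"""
from collections import Counter

FAULT_DOMAINS = 2  # the slide states "(at least) two" fault domains per role


def assign_fault_domains(instance_count: int, fd_count: int = FAULT_DOMAINS) -> list[int]:
    """Return the fault domain of each role instance, indexed by instance number."""
    if fd_count < 2:
        raise ValueError("a role needs at least two fault domains")
    return [i % fd_count for i in range(instance_count)]


def is_even(assignment: list[int], fd_count: int = FAULT_DOMAINS) -> bool:
    """True if fault-domain sizes differ by at most one instance."""
    counts = Counter(assignment)
    sizes = [counts.get(fd, 0) for fd in range(fd_count)]
    return max(sizes) - min(sizes) <= 1


# Each role is placed independently, so both roles start at fault domain 0.
web = assign_fault_domains(3)     # [0, 1, 0]
worker = assign_fault_domains(2)  # [0, 1]
print("web:", web, "worker:", worker)

# Every instance count from 1 to 7 yields an even spread under this scheme,
# so scaling the role up or down keeps it balanced.
for n in range(1, 8):
    assert is_even(assign_fault_domains(n))
```

Because the mapping cannot be controlled, a sketch like this is only useful for reasoning about failure impact; the real assignment should be read back per instance from the portal or the management APIs.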
Scenarios:
• User initiated: the PaaS service owner updates the service package or chooses a different Guest OS.
• Platform initiated: the Guest OS of PaaS services is updated when a new version is released (e.g. security fixes), or the server (hypervisor) is updated.

Figure: Web Role Instances 0, 1, and 2 in Update Domains 0, 1, and 2; Worker Role Instances 0 and 1 in Update Domains 0 and 1.

Implementation details:
• Role instances are assigned to UDs circularly.
• UDs are aligned across the different roles.
• Up to 20 UDs per service (5 by default).

Figure: combined placement. Web role: IN_0 in FD0/UD0, IN_1 in FD1/UD1, IN_2 in UD2. Worker role: IN_0 in FD0/UD0, IN_1 in FD1/UD1.

Service management operations:
• Upgrade mode
• Walk upgrade domain
• Rollback update
• Swapping (VIP swap)
• Change deployment configuration
• Delete role instances

Running Highly Available Cloud Virtual Machines

• Sample application demonstrating Windows Azure usage (an application migrated from customer premises).
• Sample application specifics:
  • High redundancy for each component
  • Load balancer for the front end
  • Data layer implemented with SQL Server or SQL Azure (SQL Azure here); alternatively, Windows Azure storage can be used
• Set up the whole application in the same affinity group to gain physical proximity.

Figure: the Azure Load Balancer in front of the frontend availability set (Front End VMs), queueing or load-balancing between the frontend and backend availability sets, and geo-distributed storage or SQL Azure as the data layer.

• Availability sets instruct how to allocate VMs in the datacenters, isolating the impact of hardware faults and infrastructure updates.
• Availability sets are defined through the portal or the REST APIs.

Figure: the frontend availability set, with Front End VMs in Fault Domains 1, 2, and 1.
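The Cloud Services implementation details listed above (circular UD assignment, UD alignment across roles, 5 UDs by default and up to 20) can be illustrated as follows. This is a sketch under the assumption that "circularly" means instance i goes to UD i mod N for every role; the helper names and the "role/IN_i" labels are hypothetical, chosen to match the figure.

```python
"""Sketch: circular update-domain assignment and a UD-by-UD update walk.

Assumption: instance i of every role is placed in UD (i mod N); this is an
illustration of "assigned circularly", and the helper names are hypothetical.
"""
DEFAULT_UDS = 5  # default per the slide
MAX_UDS = 20     # upper limit per service per the slide


def assign_update_domains(instance_count: int, ud_count: int = DEFAULT_UDS) -> list[int]:
    """Return the update domain of each instance, assigned round-robin."""
    if not 1 <= ud_count <= MAX_UDS:
        raise ValueError(f"update domain count must be between 1 and {MAX_UDS}")
    return [i % ud_count for i in range(instance_count)]


def update_walk(roles: dict[str, int], ud_count: int = DEFAULT_UDS):
    """Yield (ud, instances updated together), one update domain at a time.

    Every role uses the same circular rule, so UD k of the web role and
    UD k of the worker role are updated in the same step (UD alignment).
    Empty update domains are skipped.
    """
    placement = {role: assign_update_domains(n, ud_count) for role, n in roles.items()}
    for ud in range(ud_count):
        batch = [
            f"{role}/IN_{i}"
            for role, uds in placement.items()
            for i, d in enumerate(uds)
            if d == ud
        ]
        if batch:
            yield ud, batch


for ud, batch in update_walk({"web": 3, "worker": 2}):
    print(f"UD{ud}: {batch}")
```

With three web instances and two worker instances, each step of the walk takes at most one instance of each role out of service, which is why a role needs at least two instances to stay available during an update.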
• An availability set has to be defined for each redundant application tier to achieve the 99.95% SLA.
• We do not offer an SLA unless at least two VM instances are defined and used in each availability set.
• The application SLA is compositional: it is the product of the SLAs of its components (each tier, compute, networking, etc.).
• There is no correspondence between the fault domains used in different availability sets; e.g. a Front End failure may cause unavailability of the entire service. Thus, queuing or load-balancing is added between the availability sets.

Figure: the backend availability set, with Backend VMs in Fault Domains 2 and 1, reached from the front end through queueing or load-balancing and backed by geo-distributed storage or SQL Azure.

Scenario: platform-initiated update of the servers that run the IaaS VM instances.
Goal: high redundancy for the IaaS service.
• Each VM role is allocated to a different update domain (up to 5).
• When physical servers are updated, only a fraction of the capacity (or less) is touched at a time.
• There is no mapping between update domains in different availability sets.
• Updating the IaaS service itself is the customer's responsibility.
• In some cases a customer VM update and an infrastructure update can happen at the same time; IaaS update notifications are sent to avoid this.
• Hardware failures can occur at any time, so a platform update combined with a hardware failure could still cause a service outage for availability sets with two VMs.
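The compositional SLA rule above (multiply the SLAs of every component that must be up) can be worked through numerically. In this sketch the 99.95% values come from the availability-set SLA stated above; the 99.9% database figure is a hypothetical value chosen only for illustration, and a 30-day month is assumed for the downtime conversion.

```python
"""Sketch: composite SLA of a multi-tier application.

Serial components must all be up, so their availabilities multiply.
Assumptions: the database value is hypothetical; a month is taken as 30 days.
"""
from math import prod

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def composite_sla(components: dict[str, float]) -> float:
    """Multiply the availabilities of components that must all be up."""
    return prod(components.values())


def monthly_downtime_minutes(sla: float) -> float:
    """Downtime allowed in a 30-day month at the given availability."""
    return (1 - sla) * MINUTES_PER_30_DAY_MONTH


tiers = {
    "frontend availability set": 0.9995,  # 99.95% with two or more VMs
    "backend availability set": 0.9995,   # 99.95% with two or more VMs
    "database": 0.999,                    # hypothetical value for illustration
}
sla = composite_sla(tiers)
print(f"composite SLA: {sla:.4%}, ~{monthly_downtime_minutes(sla):.0f} min/month downtime")
```

The product is always lower than the weakest component, which is why each redundant tier needs its own availability set: a single tier without an SLA caps the whole application.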
Figure: the sample application across update domains. The Azure Load Balancer fronts the frontend availability set (Front End VMs in Update Domains 0, 1, and 2), queueing or load-balancing connects it to the backend availability set (Backend VMs in Update Domains 0 and 1), and geo-distributed storage or SQL Azure holds the data.

IaaS VM operations: Capture • Shutdown • Add • Delete

Infrastructure Operations Impacting Customer Services

| Symptom | Healing operation | Potential causes |
|---|---|---|
| Issue with customer code or a customer VM | Reboot the VM(s) | Role instance or Guest OS crash (PaaS); customer OS crash (IaaS) |
| Issue with a physical server or rack | Allocate the impacted customer VMs to different server(s) | Physical server software failure; physical server hardware failure; rack / PDU / ToR failure |

Note: your role instance keeps the same VM and VHDs, preserving cached data in the resource volume.

| Aspect | Cloud Services (PaaS) | Azure VMs (IaaS) |
|---|---|---|
| Fault domain count | Two per role | Two per availability set |
| Update domain count | Five by default; up to twenty | Five |
| Platform update | UD by UD | UD by UD |
| Administrator-initiated update | UD by UD, blast, customer-controlled UD walk, or VIP swap | Administrator controlled (can be automated using PowerShell or the REST management APIs) |
| Highly available frontend and backend addressability | Windows Azure provides a load balancer per role; queuing recommended for backend roles | Administrator defines endpoints in VMs and maps them to a load-balanced set; queuing recommended for backend roles |
| SLA | 99.95% uptime for roles with two or more role instances | 99.95% uptime for availability sets with two or more VMs |
| Multi-service collocation | Yes, using affinity groups | Yes, using affinity groups |
| Automated UD/FD management when the service grows / shrinks | Yes (except when deleting a specific service instance) | Yes when the service grows; no when it shrinks |