• What is service continuity?
• What does Microsoft do to make sure uptime is good?
• What happens when I have an outage?
• What is the Service Health Dashboard?
• What are Post Incident Reports?
• How does Microsoft approach change management communication?
• What is the future direction of Office 365 Service communication?

What is service continuity?
Service continuity is an approach to implement and validate a combination of preventive and recovery controls. Office 365 service continuity includes strategies to:
• Increase the availability of the service
• Build ability to recover from disasters
• Continuously learn and improve the service
A good measure of service continuity is service uptime.

What does uptime mean to my organization?
The objective is to describe the risk of outage to an individual customer based on the aggregate uptime of the service.
• Longer outages have greater impact on the percentage
• Outages that affect a greater number of users have greater impact
• More severe outages in terms of users or duration lead to greater deviations from 100%
The Office 365 service level agreement expresses uptime in this way:

$$\frac{\text{User minutes} - \text{Downtime}}{\text{User minutes}} \times 100\%$$

The aggregate uptime of service components can be expressed similarly.

Service Credits
Customers are eligible for Service Credits whenever monthly uptime falls below 99.9%. Service credits are calculated according to the following table:

Monthly Uptime Percentage    Service Credit
< 99.9%                      25%
< 99%                        50%
< 95%                        100%

Redundancy
• Physical redundancy
• Data redundancy
• Functional redundancy
Resiliency
• Active load balancing
• Recovery across “failure domains” regularly tested
Distributed Workloads
• Distributed components are more resilient
• Most failures are contained to a single service
• Service component isolation
Human backup
• Automated recovery alerts
• 24x7 on-call engineer
• On-call engineers are core product group members
Inspectability and predictability
• Detailed log and tracing
• Deep internal monitoring augmented by extensive outside-in monitoring diagnostics
Complexity avoidance and graceful degradation
• Standardized hardware
• Fully automated deployment
• Built-in workload management mechanisms

Redundancy: physical
Office 365 provides physical redundancy at multiple levels to protect against hardware failures.
• Network and hardware redundancy
• Facilities and power redundancy
• At least 2 datacenters per region
• Physical redundancy at disk, NIC, power supply, and server levels
• Datacenters located in seismically safe zones

Redundancy: functional
Online and offline functionality provide continuity in case of:
• Cloud disruptions
• Network interruptions
• The realities of business life (airplane mode)

Resiliency
• Active load balancing to restructure the system against rare extreme load conditions
• Automated failover to healthy resources in response to:
    • Hardware or software failures
    • Monitoring alerts
• Human-initiated failover to healthy resources in response to:
    • Service incidents
    • Customer-reported incidents
• Recovery across “failure domains” tested regularly

Distributed workloads
Microsoft Online ID, Office 365 Portal, Office 365 Provisioning, EXO, SPO, Lync
• Separation of function with distributed functional components
• Loose coupling serves to further limit the scope and impact of most failures
• Service component isolation to avoid failure cascades
• Replication of directory data across services ensures a seamless experience
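
Returning to the SLA arithmetic from the uptime and Service Credits slides, the calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Outage`, `monthly_uptime_percent`, and `service_credit_percent` are hypothetical, and downtime is assumed to be measured in user minutes (affected users multiplied by outage duration), which matches the slide's point that both longer and wider outages lower the percentage. The actual SLA text governs how downtime is counted.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Outage:
    """One incident (hypothetical type): users affected and duration in minutes."""
    affected_users: int
    duration_minutes: float


def monthly_uptime_percent(total_users: int,
                           minutes_in_month: float,
                           outages: Iterable[Outage]) -> float:
    """(User minutes - Downtime) / User minutes * 100, as on the SLA slide."""
    user_minutes = total_users * minutes_in_month
    if user_minutes <= 0:
        raise ValueError("total_users and minutes_in_month must be positive")
    # Assumption: downtime is counted in user minutes per outage.
    downtime = sum(o.affected_users * o.duration_minutes for o in outages)
    return (user_minutes - downtime) / user_minutes * 100.0


def service_credit_percent(uptime_percent: float) -> int:
    """Map monthly uptime to the credit tiers in the Service Credits table."""
    if uptime_percent < 95.0:
        return 100
    if uptime_percent < 99.0:
        return 50
    if uptime_percent < 99.9:
        return 25
    return 0  # at or above 99.9%: no credit
```

For example, 1,000 users in a 30-day month (43,200 minutes) with one two-hour outage affecting 500 users gives roughly 99.86% uptime, which falls in the 25% credit tier.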
Human backup
• Automated recovery actions
• 24x7 on-call engineer: “Human in loop”
• Rapid response and information collection
• Dedicated support teams

Service incident: Service-interrupting incidents
Planned maintenance: Planned service maintenance, including transitions/upgrades, repair, and update scenarios
Service alteration: Changes to service features, capabilities, or business terms of service
Account life cycle: Milestones in the subscription life cycle

Primary Channels
Additional Channels

Status: Description
• Investigating: Monitors have indicated a service anomaly and/or Microsoft has received reports of a potential service incident. Microsoft is currently investigating.
• Service Interruption: Microsoft has confirmed that normal services are being impacted. Microsoft is taking immediate action to understand the cause of the failure and determine the best course of action to restore service.
• Service Degradation: Services are still active, but service responsiveness and/or delivery times may be slower than usual. Microsoft is working to restore normal service responsiveness.
• Restoring Service: Microsoft has isolated the likely cause of the incident and is in the process of restoring service.
• Extended Recovery: Services are restored and may be slower than usual.
• Service Restored: Normal system services have been restored.
• False Positive: The service is healthy and a service incident did not actually occur.
• Additional Information: There is additional information provided.
• Normal Service: The service is healthy.

Service Health Dashboard
• First and best content
• Regional
• Updated hourly
Emergency Broadcast System
• Will automatically redirect customers
• http://status.office365.com
Click on “View history for past 30 days”
Click on “Incident ID MO2708”

RSS Feed
• Regional
• Tenant admin
• Points to SHD
Community
• http://community.office365.com
• Forums are a helpful resource
• TechNet or local marketing site is used in countries without a full community site
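
The Service Health Dashboard status values listed earlier lend themselves to a small enumeration, for instance when an admin script triages status labels copied from the dashboard or the RSS feed. The sketch below is illustrative only: `ServiceStatus`, `parse_status`, and `is_active_incident` are hypothetical names, and the grouping of statuses into "active" ones is an assumption drawn from the descriptions in the table, not a Microsoft classification.

```python
from enum import Enum


class ServiceStatus(Enum):
    """Status labels as they appear in the Service Health Dashboard table."""
    INVESTIGATING = "Investigating"
    SERVICE_INTERRUPTION = "Service Interruption"
    SERVICE_DEGRADATION = "Service Degradation"
    RESTORING_SERVICE = "Restoring Service"
    EXTENDED_RECOVERY = "Extended Recovery"
    SERVICE_RESTORED = "Service Restored"
    FALSE_POSITIVE = "False Positive"
    ADDITIONAL_INFORMATION = "Additional Information"
    NORMAL_SERVICE = "Normal Service"


# Assumption: these statuses mean an incident is still open or recovering.
ACTIVE_STATUSES = frozenset({
    ServiceStatus.INVESTIGATING,
    ServiceStatus.SERVICE_INTERRUPTION,
    ServiceStatus.SERVICE_DEGRADATION,
    ServiceStatus.RESTORING_SERVICE,
    ServiceStatus.EXTENDED_RECOVERY,
})


def parse_status(label: str) -> ServiceStatus:
    """Convert a free-text label to a ServiceStatus, ignoring case and spacing.

    Raises ValueError for labels that are not in the table.
    """
    normalized = " ".join(label.split()).title()
    return ServiceStatus(normalized)


def is_active_incident(status: ServiceStatus) -> bool:
    """True if the status suggests the incident still needs attention."""
    return status in ACTIVE_STATUSES
```

Keeping the labels as enum values means an unexpected label fails loudly with a `ValueError` instead of being silently treated as healthy.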
Email
• For limited set of service incidents
• Explanation of incident
• Localized content
Twitter
• @Office365

Roles and Responsibilities

Post Incident Reports (PIRs)
• Are published for service availability issues that span multiple customers
• Available within 5 business days
• Downloadable document accessible from SHD
• 30 day historical view in SHD
A PIR includes:
• Incident Information
• Summary
• Customer Impact
• Incident Start Date and Time
• Root Cause
• Next Steps
New survey feedback option
Click on “Post-incident report published”

Post incident review, Service review, Improvement
• Next steps determined
• Focus is on future protection from similar issues
• Solid next steps
• Tracked through delivery
• Within 5 days
• Within 30 days
• 10 additional changes in comprehensive plan
• 1 immediate next step in PIR

Type: Planned Maintenance Update
Description:
• 5 business days prior notification of planned service maintenance
• Notification includes start and end time
Channel:
• Service Health Dashboard
• RSS Admin Feed (for subscribed admins)

Primary service alteration communication channel
• Tailored to your environment: only those actions you must take appear
Supporting service alteration communication channel
Nearly every task has an FAQ covering:
• The technical task required
• Why the change is important
• What happens if you don’t take action

Web browser
• Best experience: Latest version of Internet Explorer
• Recommended: Current and previous versions of Internet Explorer; latest versions of Chrome, Firefox and Safari
Office client
• Best experience: Office 365 ProPlus
• Recommended: Any Office client in mainstream support
• Not recommended: Office clients in extended support
Operating system
• Best experience: Latest version of Windows or MacOS
• Supported: Any supported version of Windows or MacOS

• Commercially reasonable support
• 12 months’ notice of substantial user experience degradation

• More detailed information and programmatic approach around service updates and service incidents
• In Product Notifications
• Transparent non-customer impacting service maintenance
• Tenant Level Reporting
• Service Health Dashboard
• Customer Preview Programs

https://twitter.com/Office365
http://www.linkedin.com/groups/Microsoft-Office-365-3724282
www.microsoft.com/garage
http://fasttrack.office.com/
http://channel9.msdn.com/Events/TechEd
www.microsoft.com/learning
http://microsoft.com/technet
http://microsoft.com/msdn

Thank you!