What is service continuity? What does Microsoft do to make sure uptime is good? What happens when I have an outage? What is.

What is service continuity?
What does Microsoft do to make sure uptime is good?
What happens when I have an outage?
What is the Service Health Dashboard?
What are Post Incident Reports?
How does Microsoft approach change management communication?
What is the future direction of Office 365 Service communication?
What is service continuity?
Service continuity is an approach to implement and validate a combination of preventive and recovery controls.
Office 365 service continuity includes strategies to:
• Increase the availability of the service
• Build ability to recover from disasters
• Continuously learn and improve the service
A good measure of service continuity is Service Uptime
What does uptime mean to my organization?
The objective is to describe the risk of outage to an individual customer based on the aggregate uptime of the service.
• Longer outages have greater impact on the percentage
• Outages that affect a greater number of users have greater impact
• More severe outages in terms of users or duration lead to greater deviations from 100%
The Office 365 service level agreement expresses uptime in this way:
$$\frac{\text{User minutes} - \text{Downtime}}{\text{User minutes}} \times 100\%$$
The aggregate uptime of service components can be expressed similarly.
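As a rough illustration of the formula above, the following Python sketch computes an uptime percentage. It assumes downtime is measured in user minutes (affected users multiplied by outage duration), which matches the bullets about duration and user count; the function name and the example tenant figures are hypothetical.

```python
def uptime_percentage(user_minutes: float, downtime_minutes: float) -> float:
    """Return (user minutes - downtime) / user minutes * 100."""
    if user_minutes <= 0:
        raise ValueError("user_minutes must be positive")
    if not 0 <= downtime_minutes <= user_minutes:
        raise ValueError("downtime_minutes must be between 0 and user_minutes")
    return (user_minutes - downtime_minutes) / user_minutes * 100.0


# Hypothetical tenant: 1,000 users over a 30-day month.
total_user_minutes = 1000 * 30 * 24 * 60  # 43,200,000 user minutes

# Hypothetical outage: 200 users affected for 90 minutes.
downtime = 200 * 90  # 18,000 user minutes

print(round(uptime_percentage(total_user_minutes, downtime), 4))  # 99.9583
```

In this made-up case the outage leaves the month above the 99.9% threshold, because only a fraction of the users were affected.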
Service Credits
Customers are eligible for Service Credits whenever monthly uptime falls below 99.9%.
Service credits are calculated according to the following table:
Monthly Uptime Percentage    Service Credit
< 99.9%                      25%
< 99%                        50%
< 95%                        100%
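The credit tiers can be read as a simple lookup. The sketch below is only an illustration of the three thresholds listed in this section, checking the lowest threshold first so the largest applicable credit wins; the function name is hypothetical, and the actual SLA terms govern how credits are claimed and applied.

```python
def service_credit(monthly_uptime_pct: float) -> int:
    """Return the service credit (in percent) for a monthly uptime percentage."""
    if not 0 <= monthly_uptime_pct <= 100:
        raise ValueError("monthly_uptime_pct must be between 0 and 100")
    if monthly_uptime_pct < 95:
        return 100
    if monthly_uptime_pct < 99:
        return 50
    if monthly_uptime_pct < 99.9:
        return 25
    return 0  # at or above 99.9%: no credit


for pct in (99.95, 99.5, 98.0, 90.0):
    print(pct, service_credit(pct))
```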
Redundancy
• Physical redundancy
• Data redundancy
• Functional redundancy
Resiliency
• Active load balancing
• Recovery across “failure domains” regularly tested
Distributed Workloads
• Distributed components are more resilient
• Most failures are contained to a single service
• Service component isolation
Human backup
• Automated recovery alerts
• 24x7 on-call engineer
• On-call engineers are core product group members
Inspectability and predictability
• Complexity avoidance and graceful degradation
• Detailed log and tracing diagnostics
• Standardized hardware
• Deep internal monitoring augmented by extensive outside-in monitoring
• Fully automated deployment
• Built-in workload management mechanisms
Redundancy: physical
Office 365 provides physical redundancy at multiple levels to protect against hardware failures:
• Network and hardware redundancy
• Facilities and power redundancy
• At least 2 datacenters per region
• Physical redundancy at disk, NIC, power supply, and server levels
• Datacenters located in seismically safe zones
Redundancy: functional
Online and offline functionality provide continuity in case of:
• Cloud disruptions
• Network interruptions
• The realities of business life (airplane mode)
Resiliency
Active load balancing to restructure the system against rare extreme load conditions
Automated failover to healthy resources in response to:
• Hardware or software failures
• Monitoring alerts
Human-initiated failover to healthy resources in response to:
• Service incidents
• Customer-reported incidents
Recovery across “failure domains” tested regularly
Distributed workloads
Microsoft Online ID
Office 365 Portal
Office 365 Provisioning
EXO
SPO
Lync
Separation of function with distributed functional components
Loose coupling serves to further limit the scope and impact of most failures
Service component isolation to avoid failure cascades
Replication of directory data across services ensures a seamless experience.
Human backup
Automated recovery actions
24x7 on-call engineer: “Human in loop”
Rapid response and information collection
Dedicated support teams
Service incident: Service-interrupting incidents
Planned maintenance: Planned service maintenance, including transitions/upgrades, repair, and update scenarios
Service alteration: Changes to service features, capabilities, or business terms of service
Account life cycle: Milestones in the subscription life cycle
Additional Channels
Primary Channels
Status: Description
Investigating: Monitors have indicated a service anomaly and/or Microsoft has received reports of a potential service incident. Microsoft is currently investigating.
Service Interruption: Microsoft has confirmed that normal services are being impacted. Microsoft is taking immediate action to understand the cause of the failure and determine the best course of action to restore service.
Service Degradation: Services are still active, but service responsiveness and/or delivery times may be slower than usual. Microsoft is working to restore normal service responsiveness.
Restoring Service: Microsoft has isolated the likely cause of the incident and is in the process of restoring service.
Extended Recovery: Services are restored but may be slower than usual.
Service Restored: Normal system services have been restored.
False Positive: The service is healthy and a service incident did not actually occur.
Additional Information: Additional information is provided.
Normal Service: The service is healthy.
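For anyone tracking these statuses in their own tooling, the labels can be modeled as an enumeration. The sketch below is illustrative only: the status labels come from the list above, but the grouping into "active incident" statuses is an assumption made for this example, and the helper name is hypothetical.

```python
from enum import Enum


class ServiceStatus(Enum):
    INVESTIGATING = "Investigating"
    SERVICE_INTERRUPTION = "Service Interruption"
    SERVICE_DEGRADATION = "Service Degradation"
    RESTORING_SERVICE = "Restoring Service"
    EXTENDED_RECOVERY = "Extended Recovery"
    SERVICE_RESTORED = "Service Restored"
    FALSE_POSITIVE = "False Positive"
    ADDITIONAL_INFORMATION = "Additional Information"
    NORMAL_SERVICE = "Normal Service"


# Assumption for this sketch: these statuses mean an incident is still open.
ACTIVE_INCIDENT = frozenset({
    ServiceStatus.INVESTIGATING,
    ServiceStatus.SERVICE_INTERRUPTION,
    ServiceStatus.SERVICE_DEGRADATION,
    ServiceStatus.RESTORING_SERVICE,
    ServiceStatus.EXTENDED_RECOVERY,
})


def is_active_incident(label: str) -> bool:
    """Map a dashboard status label to True if the incident is still open."""
    return ServiceStatus(label.strip()) in ACTIVE_INCIDENT


print(is_active_incident("Service Degradation"))  # True
print(is_active_incident("False Positive"))       # False
```

An unknown label raises `ValueError`, which is usually preferable to silently treating an unrecognized status as healthy.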
Service Health Dashboard
First and Best Content
Regional
Updated Hourly
Emergency Broadcast System will automatically redirect customers to http://status.office365.com.
Click on “View history for past 30 days”
Click on “Incident ID MO2708”
RSS Feed
Regional
Tenant Admin
Points to SHD
Community
http://community.office365.com
Forums are a helpful resource
TechNet or a local marketing site is used in countries without a full community site.
Email
To: Customer
For Limited Set of Service Incidents
Explanation of Incident
Localized Content
Twitter
@Office365
Roles and Responsibilities
Post Incident Reports (PIRs) are published for service availability issues that span multiple customers
Available within 5 business days
Downloadable document accessible from SHD
30 day historical view in SHD
A PIR includes:
• Incident Information
• Summary
• Customer Impact
• Incident Start Date and Time
• Root Cause
• Next Steps
New survey feedback option
Click on “Postincident report published”
Post incident review
Service review
Improvement
Next steps determined
Focus is on future protection from similar issues
Solid next steps
Tracked through delivery
within 5 days
within 30 days
10 additional changes in comprehensive plan
1 immediate next step in PIR
Type: Planned Maintenance Update
Description:
• 5 business days prior notification of planned service maintenance.
• Notification includes start and end time.
Channel:
• Service Health Dashboard
• RSS Admin Feed (for subscribed admins)
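To make the "5 business days prior" rule concrete, the following sketch counts back business days from a maintenance date. It assumes business days are Monday to Friday and ignores public holidays, which the source does not define; the function name and the example date are hypothetical.

```python
from datetime import date, timedelta


def subtract_business_days(start: date, business_days: int) -> date:
    """Step back day by day, counting only Monday-Friday."""
    current = start
    remaining = business_days
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


# Hypothetical maintenance window starting Wednesday, 13 March 2024.
maintenance_start = date(2024, 3, 13)
print(subtract_business_days(maintenance_start, 5))  # 2024-03-06
```

Under these assumptions, notification for that window would be expected no later than the preceding Wednesday.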
Primary service alteration communication channel
Tailored to your environment: only those actions you must take appear
Supporting service alteration communication channel
Nearly every task has an FAQ covering:
• The technical task required
• Why the change is important
• What happens if you don’t take action
Web browser
• Best experience: Latest version of Internet Explorer
• Recommended: Current and previous versions of Internet Explorer; latest versions of Chrome, Firefox and Safari
Office client
• Best experience: Office 365 ProPlus
• Recommended: Any Office client in mainstream support
• Not recommended: Office clients in extended support
Operating system
• Best experience: Latest version of Windows or MacOS
• Supported: Any supported version of Windows or MacOS
Commercially reasonable support
12 months’ notice of substantial user experience degradation
More detailed information and programmatic approach around service updates and service incidents
In Product Notifications
Transparent non-customer impacting service maintenance
Tenant Level Reporting
Service Health Dashboard Customer Preview Programs
https://twitter.com/Office365
http://www.linkedin.com/groups/Microsoft-Office-365-3724282
www.microsoft.com/garage
http://fasttrack.office.com//
http://channel9.msdn.com/Events/TechEd
www.microsoft.com/learning
http://microsoft.com/technet
http://microsoft.com/msdn
Thank you!