What is a Data Center?
Current Trends in Data Center Commissioning
Richard L. Sawyer, Strategist - HP Critical Facilities
ACG – Chicago April 2013
AGENDA:
• What is a Data Center?
• Dirty Little Secret
• Risk Mitigation
• Leveraging Commissioning
• Using Failure to Succeed
What is a Data Center?
• By NFPA 70: “Critical Operations Data System”
• By Clients: Wherever I process data.
• By Commissioning Agents: A power intensive critical space.
[Image: Liebert System 3 local alarm panel showing status (heating, cooling) and alarms (low temperature, loss of air flow, high humidity)]
Successful Data Center Operations Start with Commissioning
• Data Centers are designed to a certain availability expectation to meet business goals.
• Whether or not they meet the designed goal depends on the contractor.
• Commissioning is the only way to assure the availability of the design is achieved in practice!
It’s all about availability!
Data Centers have specified design features.
These are investments to deliver a specified availability...
Tier 1
• Single Generator or No Generator
• Basic UPS for LAN Room, non-redundant
• Single Utility or on Radial line from Loop
• 99.671% Availability per Uptime Institute
Tier 2
• Generator
• N+1 UPS with redundant components
• Single Utility Feeders, N+1 Mechanical System
• 99.741% Availability per Uptime Institute
Tier 3 - Concurrently Maintainable
• N+1 Generator System
• N+1 UPS with redundant components
• One Active, One Passive Utility Source, N+1 Mechanical System
• 99.982% Availability per Uptime Institute
Tier 4 - Fault Tolerant
• 2N Generator System
• 2N UPS Systems
• Dual Active Utility Feeders, 2N Mechanical System, compartmentalization
• 99.995% Availability per Uptime Institute
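Each availability percentage translates into expected downtime per year. A minimal sketch of that conversion, assuming an 8,760-hour year (the function and variable names are illustrative, not from the presentation):

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year

# Availability figures as quoted on the slide (per Uptime Institute)
TIER_AVAILABILITY = {
    "Tier 1": 0.99671,
    "Tier 2": 0.99741,
    "Tier 3": 0.99982,
    "Tier 4": 0.99995,
}

def annual_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR

downtime = {tier: annual_downtime_hours(a) for tier, a in TIER_AVAILABILITY.items()}
for tier, hours in downtime.items():
    print(f"{tier}: {hours:.1f} hours of downtime per year")
```

Under these assumptions Tier 1 works out to roughly 28.8 hours per year and Tier 4 to roughly 0.4 hours.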
The cost is huge: Availability is expensive!
[Chart: Tier II, III, IV build costs ($/sq. ft., axis $0 to $5,000) related to power density (50 to 300 w/sf). HP data, based on a 40,000 sq. ft. raised-floor data center.]
1. Build cost per sq. ft. (sq. m) increases with power density.
2. As tier level increases, build cost rises.
3. Costs of Tier IV are almost double those of Tier II.
A 20K sf Tier III data center @ 150 w/sf costs $35 Million.
And the IT investment is even larger!
• A 20,000 square foot data center built to 150 watts/square foot can accommodate 800 racks of IT equipment @3.75 kW per rack.
• This is 3,000 to 10,000 servers depending on architecture, form factor and configuration.
• The IT investment in hardware, software and service can amount to 5 to 8 times the data center facility investment.
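The rack count above follows directly from floor area, power density and per-rack load. A minimal sketch of that arithmetic, which also applies the 5 to 8 times multiplier to the $35 million facility figure quoted earlier for a 20K sf Tier III site at 150 w/sf (pairing those two figures is an assumption made here for illustration; variable names are hypothetical):

```python
floor_area_sf = 20_000
density_w_per_sf = 150
kw_per_rack = 3.75

# Total critical load supported by the floor, in kW
critical_load_kw = floor_area_sf * density_w_per_sf / 1000
rack_count = critical_load_kw / kw_per_rack

# Assumption: facility cost taken from the Tier III cost example ($ millions)
facility_cost_musd = 35
it_low_musd = 5 * facility_cost_musd
it_high_musd = 8 * facility_cost_musd

print(f"Critical load: {critical_load_kw:,.0f} kW")
print(f"Racks at {kw_per_rack} kW each: {rack_count:,.0f}")
print(f"IT investment range: ${it_low_musd}M to ${it_high_musd}M")
```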
Can you safely assume the data center investment will work as designed from Day One?
Availability interdependency
End-to-end availability is the product of the availability of the IT Architecture times the availability of the Facility Infrastructure (FI).
Formula: (Availability of IT) x (Availability of FI) = Total End-to-End Availability
Example: (Tier 3 FI x MS Server) = Total availability: 99.982% x 99.202% = 99.184%
IT architecture and facility infrastructure are interdependent in meeting the data center goal... the speed of IT recovery is dependent on the speed of facility recovery!
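The series-availability product can be checked with a few lines. This is a sketch using the two figures from the slide, with an 8,760-hour year assumed for the downtime conversion (function and variable names are illustrative):

```python
def end_to_end_availability(*availabilities):
    """Multiply the availabilities of components that must all be up (series model)."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

facility = 0.99982  # Tier 3 facility infrastructure, from the slide
it_arch = 0.99202   # server availability figure, from the slide

total = end_to_end_availability(facility, it_arch)
downtime_hours = (1 - total) * 8760  # assumption: 365-day year

print(f"End-to-end availability: {total:.3%}")
print(f"Expected downtime: {downtime_hours:.0f} hours per year")
```

The combined figure is lower than either input, which is why the weaker of the two layers dominates the result: here it comes to about 71 hours of expected downtime per year.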
Dirty Little Secret:
Data Centers Fail
Failure is:
• Expensive
• Inevitable
• Predictable
• Manageable
• Useful
Failure is Inevitable
AFCOM 2007: “Understanding Tier Systems”, Tom Roberts, Rick Sawyer
[Chart: 5-Year Probability of Failure]
Predictability of Failure
[One-line diagram: two utility and generator (G) sources feeding Primary Bus 1 and Primary Bus 2, each with UPS and bypass, through a static switch and PDU to the critical load]
Option 2N: 2 Utilities, 2 Generators, 2 ATS, 2 UPS Systems, STS
• MTBF = 315,766 hours
• Availability = 99.9985%
• Probability of Failure in 5 years = 12.95%
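The presentation does not state how the 5-year probability was derived. One common approach, assumed here, is a constant failure rate (exponential) model, which reproduces the quoted figure from the MTBF. A minimal sketch, with an 8,760-hour year also assumed:

```python
import math

def failure_probability(mtbf_hours, mission_hours):
    """Probability of at least one failure within mission_hours,
    assuming a constant failure rate of 1/MTBF (exponential model)."""
    return 1.0 - math.exp(-mission_hours / mtbf_hours)

mtbf_hours = 315_766       # from the slide
mission_hours = 5 * 8760   # five years, assuming 8,760 hours per year

p = failure_probability(mtbf_hours, mission_hours)
print(f"Probability of failure in 5 years: {p:.2%}")
```

Even with an MTBF of roughly 36 years, the chance of at least one failure inside a 5-year window is better than one in eight.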
Failure is Predictable
Good News!
Failure is Manageable
STRATEGY TO SURVIVE:
• Design to Survive
• Map Foreseeable Failures
• Develop SOP’s, MOP’s, EOP’s
• Commission! Test, Document, Train
Design to Survive
Using ITSM Capability Maturity Model to assess Facility Infrastructure Design (highest level first):
• Optimizing: Fault tolerant system features (2N)
• Managed: Data center has concurrent maintainability features
• Defined: Data center systems have redundant features for resiliency (N+1)
• Repeatable: Data center has dedicated cooling, generators, UPS, fire, security and monitoring systems
• Initial: Data Center is basic server or network room, in a dedicated space having minimal dedicated infrastructure systems
• Absence: No dedicated data center, processing is in office space
Zoned Availability
Scalable Mission Critical infrastructure using Central UPS and Rack based UPS for 2N redundancy
[Diagram: hot aisle/cold aisle rows with CRAC units, PDUs, rack based UPS and battery, heat rejection, fire, security, EPO and system monitoring]
• Rack based UPS Systems as needed for 2N redundancy
• Central UPS for one “N” side, scalable UPS System
• Site Availability – 99.995%
Map Foreseeable Failures
SPOF Matrix - Common Single Points of Failure
Check observed SPOFs found in the survey.
Electrical
• There is one utility supply with no standby generator.
• Multiple generators are connected via a single paralleling switchgear.
• There is one transfer switch where the generator and utility are switched.
• The UPS and Static Bypass are fed off of the same circuit breaker.
• The UPS output distribution is controlled by one circuit breaker.
• The UPS synchronization is controlled by one external circuit.
• There is one electrical path to the critical load with no redundancy or automatic bypass provisions.
• There is one step-down transformer in the critical electrical path, or step-down transformers are in series if multiple.
• There is one static switch in series with the UPS output.
• All power is fed through one piece of supply electrical switchgear.
• There is an EPO circuit that disconnects all electrical power.
• There is a switchgear ground fault protection circuit that disconnects all electrical power distribution.
• All power is fed through one piece of electrical distribution switchgear to the critical load.
• There is one set of electrical cables from utility supply to critical power supplies.
• There is one set of electrical cables from critical power supply to critical power distribution.
HVAC
• The HVAC critical cooling system is supplied from one motor control center.
• The HVAC critical cooling system is supplied from one piece of distribution switchgear.
• The heat rejection system (i.e., cooling towers) is fed from one electrical distribution point.
• Critical pumps are fed/controlled from one electrical distribution point.
• Water supply is from one distribution point.
• The chilled water piping system is a single loop system.
• The condenser water piping system is a single loop system.
• The glycol piping system is non-redundant.
• There are no redundant air handling units supplying the critical load areas.
• The building management system can only be operated/controlled from a single point.
• The building management system is required for default HVAC system operation.
• The water treatment system is not monitored for free chlorine content or biological contamination.
• There is only one method, or piece of equipment, to provide adequate critical space cooling.
• The heat rejection system is non-redundant.
• The fire detection system interrupts air flow to the critical load spaces without verifying sensors.
• There is an EPO circuit that interrupts cooling to the critical load.
• There are common valves that can fail, interrupting chilled water, condenser water or supply water.
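Several items in the matrix amount to "every path to the critical load passes through one component". That check can be sketched by modeling the power path as a directed graph and removing one component at a time. The topology below is a made-up example, not a site from the presentation, and all names are hypothetical:

```python
from collections import deque

def reachable(graph, sources, target, removed=None):
    """Breadth-first search: can target be reached from any source
    when the component `removed` is taken out of service?"""
    queue = deque(s for s in sources if s != removed)
    seen = set(queue)
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(graph, sources, target):
    """Components whose loss alone disconnects the target from all sources."""
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    candidates = nodes - {target}
    return sorted(n for n in candidates
                  if not reachable(graph, sources, target, removed=n))

# Hypothetical topology: two sources, one transfer switch, two UPS modules,
# one static switch and one PDU in series with the load.
power_path = {
    "utility": ["ats"],
    "generator": ["ats"],
    "ats": ["ups_a", "ups_b"],
    "ups_a": ["static_switch"],
    "ups_b": ["static_switch"],
    "static_switch": ["pdu"],
    "pdu": ["critical_load"],
}

spofs = single_points_of_failure(power_path, ["utility", "generator"], "critical_load")
print(spofs)
```

In this example the redundant sources and UPS modules drop out of the result, while the single transfer switch, static switch and PDU are flagged. A real survey would also have to cover controls, EPO circuits and shared breakers, which a simple connectivity graph does not capture.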
Develop MOP’s, SOP’s, EOP’s
[Diagram labels: Test, Document, Train; Automate Networks, Automate Servers, Automate Storage; Runbook Automation]
• Optimizing: Real time monitoring, continuous improvement features
• Managed: Documentation is complete, available, compliance is measured and trended
• Defined: Procedures are associated with asset management systems and are tracked to completion, effectiveness
• Repeatable: Standard, Maintenance and Emergency Operating Procedures exist and are site specific
• Initial: Maintenance and operations are not site specific or complete, ad hoc and depend on staff memory/knowledge
• Absence: No operational processes formally in place or measured
O&M MGE EPS 8000
UPS System A, Module 01
• Simplified One-Line power supply diagram
• Simplified One-Line UPS system diagram
– Normal power flow diagrams
– Emergency power flow diagrams
– Automatic Transfer Control diagram
• Location of equipment
• Start-Up and Shut-Down procedure
• Emergency response procedure
• Recommended maintenance practices
• Reference Engineering Prints
• Reference MGE EPS 8000 Operations and Maintenance Manual
Based on best available data 05/11- Verify against As-Builts
[One-line diagram: Bypass Power Flow to UPS A01, For Maintenance on Modules or Module Failure Mode, UPS Systems A01 & B01. Labeled elements include 13.8 kV / 480V transformers T-31A01 and T-31B01, ATS-31A01 and ATS-31B01, switchgear SG-01A01, SG-01A02, SG-01B01 and SG-01B02, Automatic Transfer Control, Load Bus Synchronization Control, and feeds to SG-01A03 (Critical UPS Load A) and SG-01B03 (Critical UPS Load B).]
Process for installing a new IT server
• Order
• Delivery
• Physical Inspection
• Burn-in functional test
• Integration with existing systems
• Install in rack
• Network assignment
• Firmware verification
• Software verification
• Data test of software
• Online production
Process for “installing” a new datacenter
• Design
• Construct
• Physical inspection
• Equipment startup
• “Pull-the-plug” integrated test
• Failure mode tests
• Controls and monitoring tests
• Equipment tests
• System-level tests
• Capacity tests
• Turn over to IT and Operations
The Value of Commissioning
• Assures design performance is achieved following construction
• Verifies performance levels
– Capacity
– Availability (redundancies)
• Provides documentation base for SOP’s, MOP’s, and EOP’s
• Opportunity for “hands-on” training of operations staff on events they may not see for years!
– Video taping of procedures
– Monitoring and alarm testing with response procedures
– “New Employee” training guide development
IT investment is 3-5X the data center investment. Commissioning assures the IT architecture support systems work, and can be recovered quickly when they fail.
Leverage Facility Commissioning
1. Involve everyone: IT, management, vendors, contractor, engineers and operating staff.
2. Manage your documents – capture everything methodically.
3. Test everything that can be safely tested.
4. Video tape procedures, especially risk mitigation procedures for SPOF’s.
Know your data center!
Commissioning Trends
• Standardized procedures to test standardized systems
• Capacity testing to verify efficiency at all load levels
• Staff training during the commissioning process
• Video taping of test procedures for future training
• Integrated testing of raised floor areas before IT equipment is installed
• Digital data logging of system performance during commissioning to lower cost and provide better information.
Typical Integrated Test
• Generator capacity and redundancy are tested by failing units
• Utility is failed to test transfer switch and generator performance
• UPS redundancy is tested by failing modules and system
• Static switch sources are failed to test performance
• Load banks are installed to simulate critical load
• Digital meters record performance at critical load
[One-line diagram: utility and generator (G) sources, Primary Bus 1, UPS with bypass, static switch, PDU, critical load]
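Load banks are commonly applied in steps rather than at a single point, so performance can be recorded across the load range. A minimal sketch of a step plan, assuming the 3,000 kW load of the earlier 20,000 sq. ft., 150 w/sf example and assuming 25/50/75/100% steps (the step percentages and the function name are illustrative, not from the presentation):

```python
def load_bank_steps(design_load_kw, fractions=(0.25, 0.50, 0.75, 1.00)):
    """Return (percent, kW) pairs for each load bank step."""
    return [(round(f * 100), design_load_kw * f) for f in fractions]

# Design load from the 20,000 sq. ft. @ 150 w/sf example
design_load_kw = 20_000 * 150 / 1000

steps = load_bank_steps(design_load_kw)
for pct, kw in steps:
    print(f"{pct:>3}% load: {kw:,.0f} kW")
```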
Things happen……
Use Failure as an Opportunity
• When you’re down, you’re down.
• Use the downtime to access, maintain or modify systems you can’t get to any other time
– Verify breaker operation – “retro commission”!
– Inspect and repair equipment in a powered down condition
– Tie in valves and breakers for future use
– Test systems and operations procedures
Plan recovery procedures to leverage downtime opportunity for maintenance, testing and training!
Summary
• Modern office buildings contain high power data center spaces
• Availability of those spaces is a key client demand
• Design can only do so much; performance must be proven through commissioning!
• Actual availability is an operational issue.
• Data center performance is contingent on a strong commissioning program from the start!
Questions?
Richard L. Sawyer
Strategist, HP Critical Facility Services [email protected]
518-857-9751