What is a Data Center?


Current Trends in Data Center Commissioning

Richard L. Sawyer, Strategist - HP Critical Facilities

ACG – Chicago April 2013

AGENDA:

WHAT IS A DATA CENTER?

DIRTY LITTLE SECRET

RISK MITIGATION

LEVERAGING COMMISSIONING

USING FAILURE TO SUCCEED

What is a Data Center?

• By NFPA 70: “Critical Operations Data System”
• By Clients: Wherever I process data.
• By Commissioning Agents: A power intensive critical space.

[Figure: Liebert System 3 control panel with status indicators (Heating, Cooling), alarms (Low Temperature, Loss of Air Flow, High Humidity) and Local Alarm / Silence controls]

Successful Data Center Operations Start with Commissioning

• Data Centers are designed to a certain availability expectation to meet business goals.
• Whether or not they meet the designed goal depends on the contractor.
• Commissioning is the only way to assure the availability of the design is achieved in practice!

It’s all about availability!

Data Centers have specified design features.

These are investments to deliver a specified availability…….

Tier 1
• Single Generator or No Generator
• Basic UPS for LAN Room, non-redundant
• Single Utility or on Radial line from Loop
• 99.671% Availability per Uptime Institute

Tier 2
• Generator
• N+1 UPS with redundant components
• Single Utility Feeders, N+1 Mechanical System
• 99.741% Availability per Uptime Institute

Tier 3 - Concurrently Maintainable
• N+1 Generator System
• N+1 UPS with redundant components
• One Active, One Passive Utility Source, N+1 Mechanical System
• 99.982% Availability per Uptime Institute

Tier 4 - Fault Tolerant
• 2N Generator System
• 2N UPS Systems
• Dual Active Utility Feeders, 2N Mechanical System, compartmentalization
• 99.995% Availability per Uptime Institute

1 May 2020
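To make the tier percentages more tangible, the sketch below converts each quoted availability into expected downtime per year. It assumes an 8,760-hour year and treats availability as a simple long-run fraction of uptime; the function and dictionary names are illustrative and not part of the original slides.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year

# Availability figures as quoted on the tier slide (per Uptime Institute).
TIER_AVAILABILITY = {
    "Tier 1": 0.99671,
    "Tier 2": 0.99741,
    "Tier 3": 0.99982,
    "Tier 4": 0.99995,
}


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for tier, availability in TIER_AVAILABILITY.items():
        hours = annual_downtime_hours(availability)
        print(f"{tier}: {hours:.1f} hours of downtime per year")
```

Read this way, the step from Tier 2 to Tier 3 is the largest one: roughly a day of expected downtime per year drops to under two hours.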

The cost is huge: Availability is expensive!

Tier II, III, IV build costs ($/sq. ft.) related to power density

[Chart: build cost in $/sq. ft. ($0 to $5,000) against power density (50 to 300 w/sf) for Tier II, Tier III and Tier IV]

1. Data center tier costs per sq. ft. (sq. m) increase with power density.
2. As tier level increases, build cost rises.
3. Costs of Tier IV are almost double those of Tier II.

A 20K sf Tier III data center costs $35 million @ 150 w/sf.

HP data, based on a 40,000 sq. ft. raised-floor data center.

And the IT investment is even larger!

• A 20,000 square foot data center built to 150 watts/square foot can accommodate 800 racks of IT equipment @ 3.75 kW per rack.

• This is 3,000 to 10,000 servers depending on architecture, form factor and configuration.

• The IT investment in hardware, software and service can amount to 5 to 8 times the data center facility investment.
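The rack count and the investment multiple above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and the $35 million facility cost is taken from the build-cost slide for a 20K sf Tier III data center at 150 w/sf.

```python
def rack_capacity(floor_area_sqft: float, watts_per_sqft: float,
                  kw_per_rack: float) -> int:
    """Whole racks supportable by the design power density."""
    total_kw = floor_area_sqft * watts_per_sqft / 1000.0
    return int(total_kw // kw_per_rack)


def it_investment_range(facility_cost: float,
                        low_multiple: float = 5.0,
                        high_multiple: float = 8.0) -> tuple[float, float]:
    """IT investment range as a multiple of facility cost (5x to 8x per the slide)."""
    return facility_cost * low_multiple, facility_cost * high_multiple


if __name__ == "__main__":
    racks = rack_capacity(20_000, 150, 3.75)
    low, high = it_investment_range(35_000_000)
    print(f"Racks supported: {racks}")
    print(f"IT investment: ${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
```

With the slide's figures, 20,000 sq. ft. at 150 w/sf is 3,000 kW of critical load, which is where the 800 racks at 3.75 kW come from.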

Can you safely assume the data center investment will work as designed from Day One?


Availability interdependency

End-to-end availability is the product of the availability of the IT Architecture times the availability of the Facility Infrastructure (FI).

Formula: (Availability of IT) × (Availability of FI) = Total End-to-End Availability

Example (Tier 3 FI × MS Server): 99.982% × 99.202% = 99.184%

IT architecture and facility infrastructure are interdependent in meeting the data center goal: the speed of IT recovery is dependent on the speed of facility recovery!
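The product rule for availability can be sketched as follows. The function name is illustrative; the two input values are the Tier 3 facility and MS Server figures from the slide, and the rule assumes the two layers fail independently and are both required for service.

```python
from math import prod


def end_to_end_availability(*availabilities: float) -> float:
    """Availability of layers in series: the product of each layer's availability."""
    return prod(availabilities)


if __name__ == "__main__":
    facility = 0.99982   # Tier 3 facility infrastructure (from the slide)
    it_stack = 0.99202   # MS Server figure (from the slide)
    total = end_to_end_availability(facility, it_stack)
    print(f"End-to-end availability: {total * 100:.3f}%")
```

Because the result is a product, the end-to-end figure can never exceed the weakest layer, which is why facility and IT availability have to be planned together.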

Dirty Little Secret:

Data Centers Fail

Failure is:

• Expensive
• Inevitable
• Predictable
• Manageable
• Useful

Failure is Inevitable

AFCOM 2007: “Understanding Tier Systems”, Tom Roberts, Rick Sawyer

5 YEAR PROBABILITY OF FAILURE

Predictability of Failure

Option 2N: 2 Utilities, 2 Generators, 2 ATS, 2 UPS Systems, STS

• MTBF = 315,766 hours
• Availability = 99.9985%
• Probability of Failure in 5 years = 12.95%

[One-line diagram: Utility and Generator (G) sources on Primary Bus 1 and Primary Bus 2, each with UPS and Bypass, feeding a Static Switch, PDU and Critical Load]
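The five-year failure probability can be reproduced from the quoted MTBF under a common reliability assumption. The sketch below assumes a constant failure rate (exponential model) and an 8,760-hour year; the slides do not state which model was used, so treat this as one plausible reading that happens to match the 12.95% figure.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: non-leap year


def probability_of_failure(mtbf_hours: float, years: float) -> float:
    """P(at least one failure within the period), assuming a constant failure rate."""
    mission_hours = years * HOURS_PER_YEAR
    return 1.0 - math.exp(-mission_hours / mtbf_hours)


if __name__ == "__main__":
    p = probability_of_failure(315_766, 5)
    print(f"Probability of failure in 5 years: {p * 100:.2f}%")
```

Even a 2N design with an MTBF above 300,000 hours carries a roughly one-in-eight chance of a failure over five years, which is the basis for treating failure as predictable.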

Failure is Predictable

Good News!

Failure is Manageable

STRATEGY TO SURVIVE:

• Design to Survive
• Map Foreseeable Failures
• Develop SOP’s, MOP’s, EOP’s
• Commission!

Test, Document, Train

Design to Survive

Using ITSM Capability Maturity Model to assess Facility Infrastructure Design

• Optimizing: Fault tolerant system features (2N)
• Managed: Data center has concurrent maintainability features
• Defined: Data center systems have redundant features for resiliency (N+1)
• Repeatable: Data center has dedicated cooling, generators, UPS, fire, security and monitoring systems
• Initial: Data center is basic server or network room, in a dedicated space having minimal dedicated infrastructure systems
• Absence: No dedicated data center, processing is in office space

Zoned Availability

Scalable Mission Critical infrastructure using Central UPS and Rack based UPS for 2N redundancy

• Rack based UPS Systems as needed for 2N redundancy
• Central UPS for one “N” side, scalable UPS System
• Site Availability – 99.995%

[Diagram: hot aisle / cold aisle layout with CRAC units, rack based UPS, PDUs, UPS battery, heat rejection, fire, security, EPO and system monitoring]

Map Foreseeable Failures

SPOF Matrix - Common Single Points of Failure

Check observed SPOFs found in the survey.

Electrical

• There is one utility supply with no standby generator.
• Multiple generators are connected via a single paralleling switchgear.
• There is one transfer switch where the generator and utility are switched.
• The UPS and Static Bypass are fed off of the same circuit breaker.
• The UPS output distribution is controlled by one circuit breaker.
• The UPS synchronization is controlled by one external circuit.
• There is one electrical path to the critical load with no redundancy or automatic bypass provisions.
• There is one step-down transformer in the critical electrical path, or step-down transformers are in series if multiple.
• There is one static switch in series with the UPS output.
• All power is fed through one piece of supply electrical switchgear.
• There is an EPO circuit that disconnects all electrical power.
• There is a switchgear ground fault protection circuit that disconnects all electrical power distribution.
• All power is fed through one piece of electrical distribution switchgear to the critical load.
• There is one set of electrical cables from utility supply to critical power supplies.
• There is one set of electrical cables from critical power supply to critical power distribution.

HVAC

• The HVAC critical cooling system is supplied from one motor control center.
• The HVAC critical cooling system is supplied from one piece of distribution switchgear.
• The heat rejection system (i.e., cooling towers) is fed from one electrical distribution point.
• Critical pumps are fed/controlled from one electrical distribution point.
• Water supply is from one distribution point.
• The chilled water piping system is a single loop system.
• The condenser water piping system is a single loop system.
• The glycol piping system is non-redundant.
• There are no redundant air handling units supplying the critical load areas.
• The building management system can only be operated/controlled from a single point.
• The building management system is required for default HVAC system operation.
• The water treatment system is not monitored for free chlorine content or biological contamination.
• There is only one method or piece of equipment to provide adequate critical space cooling.
• The heat rejection system is non-redundant.
• The fire detection system interrupts air flow to the critical load spaces without verifying sensors.
• There is an EPO circuit that interrupts cooling to the critical load.
• There are common valves that can fail, interrupting chilled water, condenser water or supply water.
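One way to think about the matrix above is that a single point of failure is any component that every path to the critical load has to pass through. The sketch below is a deliberately simplified illustration of that idea, not the survey method used in the presentation: each path is modeled as a plain series chain, and the component names and the `single_points_of_failure` helper are hypothetical.

```python
def single_points_of_failure(paths: list[list[str]], load: str) -> set[str]:
    """Components shared by every path to the load (the load itself excluded)."""
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)  # keep only components present in all paths so far
    common.discard(load)
    return common


if __name__ == "__main__":
    # Hypothetical 2N arrangement: two independent sources and UPS systems
    # that converge on one static switch and one PDU.
    paths_2n = [
        ["Utility 1", "ATS 1", "UPS 1", "Static Switch", "PDU", "Critical Load"],
        ["Utility 2", "ATS 2", "UPS 2", "Static Switch", "PDU", "Critical Load"],
    ]
    spofs = single_points_of_failure(paths_2n, load="Critical Load")
    print("Single points of failure:", sorted(spofs))
```

In this toy layout the shared static switch and PDU are flagged, which matches the matrix entry about one static switch in series with the UPS output.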

Test, Document, Train


Develop MOP’s, SOP’s, EOP’s

[Diagram: Runbook Automation, covering Automate Networks, Automate Servers and Automate Storage]

• Optimizing: Real time monitoring, continuous improvement features
• Managed: Documentation is complete, available, compliance is measured and trended
• Defined: Procedures are associated with asset management systems and are tracked to completion, effectiveness
• Repeatable: Standard, Maintenance and Emergency Operating Procedures exist and are site specific
• Initial: Maintenance and operations are not site specific or complete, ad hoc and depend on staff memory/knowledge
• Absence: No operational processes formally in place or measured

O&M MGE EPS 8000

UPS System A, Module 01

• Simplified One-Line power supply diagram
• Simplified One-Line UPS system diagram
  – Normal power flow diagrams
  – Emergency power flow diagrams
  – Automatic Transfer Control diagram
• Location of equipment
• Start-Up and Shut-Down procedure
• Emergency response procedure
• Recommended maintenance practices
• Reference Engineering Prints
• Reference MGE EPS 8000 Operations and Maintenance Manual

Based on best available data 05/11 - Verify against As-Builts

Bypass Power Flow to UPS A01 for Maintenance on Modules or Module Failure Mode (UPS Systems A01 & B01)

[One-line diagram. Labels include: 13.8 kV / 480V transformers T-31A01 and T-31B01; ATS-31A01 and ATS-31B01; breakers CB-01A001, CB-01A002, CB-01B001 and CB-01B002; switchgear SG-01A01, SG-01A02, SG-01B01 and SG-01B02; Automatic Transfer Control; Load Bus Synchronization Control; outputs to SG-01A03 (Critical UPS Load A) and SG-01B03 (Critical UPS Load B)]

Process for installing a new IT server

• Order
• Delivery
• Physical Inspection
• Burn-in functional test
• Integration with existing systems
• Install in rack
• Network assignment
• Firmware verification
• Software verification
• Data test of software
• Online production

Process for “installing” a new datacenter

• Design
• Construct
• Physical inspection
• Equipment startup
• “Pull-the-plug” integrated test
• Failure mode tests
• Controls and monitoring tests
• Equipment tests
• System-level tests
• Capacity tests
• Turn over to IT and Operations

The Value of Commissioning

• Assures design performance is achieved following construction
• Verifies performance levels
  – Capacity
  – Availability (redundancies)
• Provides documentation base for SOP’s, MOP’s, and EOP’s
• Opportunity for “hands-on” training of operations staff on events they may not see for years!
  – Video taping of procedures
  – Monitoring and alarm testing with response procedures
  – “New Employee” training guide development

IT investment is 3-5X the data center investment. Commissioning assures the IT architecture support systems work, and can be recovered quickly when they fail.

Leverage Facility Commissioning

1. Involve everyone: IT, management, vendors, contractor, engineers and operating staff.
2. Manage your documents – capture everything methodically.
3. Test everything that can be safely tested.
4. Video tape procedures, especially risk mitigation procedures for SPOF’s.

Know your data center!

Commissioning Trends

• Standardized procedures to test standardized systems
• Capacity testing to verify efficiency at all load levels
• Staff training during the commissioning process
• Video taping of test procedures for future training
• Integrated testing of raised floor areas before IT equipment is installed
• Digital data logging of system performance during commissioning to lower cost and provide better information

Typical Integrated Test

• Generator capacity and redundancy are tested by failing units
• Utility is failed to test transfer switch and generator performance
• UPS redundancy is tested by failing modules and system
• Static switch sources are failed to test performance
• Load banks are installed to simulate critical load
• Digital meters record performance at critical load

[One-line diagram: Utility and generators (G) on Primary Bus 1, UPS units with Bypass, Static Switch, PDU and Critical Load]

Things happen……

Use Failure as an Opportunity

• When you’re down, you’re down.
• Use the downtime to access, maintain or modify systems you can’t get to any other time
  – Verify breaker operation – “retro commission”!
  – Inspect and repair equipment in a powered down condition
  – Tie in valves and breakers for future use
  – Test systems and operations procedures

Plan recovery procedures to leverage downtime opportunity for maintenance, testing and training!

Summary

• Modern office buildings contain high power data center spaces
• Availability of those spaces is a key client demand
• Design can only do so much; performance must be proven through commissioning!
• Actual availability is an operational issue.
• Data center performance is contingent on a strong commissioning program from the start!

Questions?

Richard L. Sawyer

Strategist, HP Critical Facility Services

518-857-9751