Transcript Document

ITM1.1 – Increasing Reliability
Through Design and Documentation
Scott Milliken, CDCMP
Oak Ridge National Laboratory
Computer Facilities Manager
Data Center World – Certified Vendor Neutral
Each presenter is required to certify that their
presentation will be vendor-neutral.
As an attendee, you have the right to enforce this
policy of no sales pitches within a session by
alerting the speaker if you feel the session is not
being presented in a vendor-neutral fashion. If the
issue continues to be a problem, please alert Data
Center World staff after the session is complete.
Increasing Reliability Through Design and
Documentation
This presentation is an exploration of what can
happen when a data center grows “organically” and
the steps that ORNL had to take in order to fix it.
ORNL Started with a Common Problem
• Data Center resources were limited
• Customers/Departments were on their own for housing equipment
  • Under someone’s desk (most common)
  • In an unused office
  • In a dedicated closet (if they were lucky)
• Administration knew that something had to be done
ORNL Computational Sciences Building
This will fix ALL of our problems!
• Opened in 2001
• Designed for 40,000 SF of data center across 2
floors (20,000 SF each)
• Multiple feeders from TVA utility
• 1 x 500 kVA UPS
• Brought the ‘servers’ in that had been under desks
and in closets
• Designed to house these newfangled supercomputers
Sidebar – How Does Research Granting Work?
• Researchers compete for grant money to fund their work
  • Their goal is to show how their approach is different – in other words, why give money to them instead of someone else?
  • Oftentimes “different” gets translated into “micromanaged”
    • This leads to under-desk “server” deployment
    • If you’re lucky, they found a closet and at least added a dedicated thermostat for the VAV box
    • Standardization becomes a dirty word, because users feel like they can’t be on the leading edge, or unique
• Facilities is often an afterthought
  • Users have to make choices about things they have no expertise in
  • Users don’t know the right questions to ask
Diversity is good, right?
• Users only pay for enough upgrades to meet their needs
  • If a user paid for an upgrade, they feel entitled to determine who can use it
  • In some grants, the rules state that any upgrades covered by the grant money have to be exclusive to that project, even after it leaves
  • If the project has to pay for upgrades, it will be value engineered to death
Ultimate Diversity = Wild, Wild West
• Fiefdoms are created and maintained
  • Just kidding. They’re not maintained.
  • Users become territorial with space they were given
• Site-wide upgrades don’t happen
  • Who’s going to pay for them?
  • Turns the data center/facility manager into a street beggar
A Tale of Two Data Centers
• Downstairs
  • Basically 3 computers taking up 20,000 SF
    • Jaguar, Kraken, and GAEA – 350 cabinets of Cray
    • File storage – 100 cabinets of DDN disk
  • Very homogeneous
  • Everything engineered as a project due to the massive size and scope
  • A system would come in and be relatively static for 3-5 years at a time
  • Roughly 65 °F temperature
A Tale of Two Data Centers
• Upstairs
  • Comprised of nearly 100% commodity computing equipment
  • A mixture of rack-mounted and shelf-mounted hardware
  • Average footprint for a project was less than 1 cabinet
  • Power was provisioned as needed
    • No live work allowed at a DOE site
    • Plug type was based on what the salesperson bundled
    • Some cabinets had 6 circuits just to support the load in a non-redundant manner
  • Cold enough to hang meat – the only way to deal with hot spots, with some areas showing < 60 °F at rack inlet
Unmanageable Workflows
Over a third of work orders for the instrumentation
technicians had the phrase, “Have the techs see me and I will
show them what to do.” This was mainly due to incorrect or
missing labeling and no common vocabulary with which to
communicate.
This is completely unsustainable, because there is no way to
know whether this is a 1-hour job or a 10-day job.
Additionally, I have no means to follow up and verify
whether or not the task was completed correctly.
Overheard at the Office
• This can’t ever go down, so I need it on the UPS.
• Oh, that’s a small system, just go find a place for
it in the upstairs data center.
• This salesperson told me that it was cheaper if I
bought the cabinet and power strips from them
and they only carry this brand/model.
• Why do I need to waste $300 on a second power
supply? It’s on the UPS, isn’t it?
We Didn’t Design It To Work This Way…
Lesson Learned #1 – Manage Expectations
The lab built a nice data center, but then simply
assumed that the users would know how to act to
keep it that way. Without guidance, facilities always
became the most value-engineered line item,
because users assumed it was 100% reliable – after
all, we have/had the fastest supercomputer in the
world.
To their credit, it took about 7 years before they
were completely painted into a corner.
“Wait, which Peachtree did you mean?”
• Users spoke their language
• Facilities spoke their language
• Users and Facilities interpreted
the same word in different ways
• Labeling was confusing
  • Cables were re-used without relabeling
  • Electrical circuits not updated
  • Systems renamed without relabeling
  • Users were “responsible” for labeling
• Result = Unplanned Outages
Efforts Made to Resolve Problems
• Problems encountered during a maintenance activity were tracked and put on “the list”
  • Mislabeled breaker panels
  • Incorrect power configurations
  • Hot spots / cold spots
• By the time the next activity came around…
  • “The List” was no longer accurate
  • Higher-priority items were addressed
  • Only the squeaky wheels were greased
Responsiveness Made the DMV Look Good
• DOE restrictions on hot work
  • Panel had to be de-energized to add a circuit
  • Systems powered by a panel were not diversified
  • Annual maintenance… really? We have to wait for this once-a-year activity?
• There was so much catch-up work to perform that proactive work was rarely completed
• Kludges were the norm, rather than the exception
Compounding Problems
• Once a project became urgent enough, it would force management’s hand to schedule an outage
• A full center outage was impractical and politically unsound, so they would schedule a partial outage
  • Since labeling was wrong, the affected-systems list was a guess more than anything
  • Dependencies weren’t documented, so even if someone knew that a switch would go down, they had no idea what else was affected
Why think about the future?
• Customers only care about their environment – not how it interacts with the physical world
  • Expansion plans
  • Cable management
  • Labeling
  • Serviceability
Lesson #2 – Decide what your end result should be,
otherwise you are wandering aimlessly
• Determine what your standards are
  • Start with code requirements
  • Build on them with recognized public standards
  • Add a dash of best practices
• Understand that change isn’t a light switch
  • Change in a data center is akin to eating an elephant – one bite at a time
  • Lasting change happens when you can show the benefit, rather than barking orders
How Do We Fix This?
• Get and keep accurate documentation
• Triage your equipment
  • New installations
  • Must fix now
  • Wait until lifecycle
• Extend the line of demarcation
  • Facility includes cabinets
  • Facility includes power distribution
  • Facility includes air-flow management
Be ready to make the tough call
• FDCCI (the Federal Data Center Consolidation Initiative) mandated a reduction in data centers to reduce energy consumption and costs
• ORNL tasked a team with estimating the impact (cost and downtime) of fixing the problems
• Numbers clearly showed that the best course of action was against the FDCCI mandate – we needed to occupy new space
Step 1 – Documentation
• Fully audit the data center
  • Who owns what?
  • What is misconfigured?
  • How long will it stay?
• OWN the process to keep records accurate
  • Users don’t care. Really, they don’t.
  • Find the right point in your process to inject the documentation step.
  • Funnel all MACDs (moves, adds, changes, decommissions) through your process – a sketch follows below.
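As a concrete illustration of what such a record and a single MACD funnel can look like, here is a minimal Python sketch. The field names, types, and the `apply_macd` helper are illustrative assumptions, not openDCIM's actual schema or API:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DeviceRecord:
    """One audited device. Field names are illustrative assumptions."""
    label: str                 # what Facilities calls it
    owner: str                 # person or project that owns it
    user_alias: str            # what the user calls it, if different
    planned_removal: date      # answers "how long will it stay?"
    issues: list[str] = field(default_factory=list)  # e.g., "wrong plug type"

def apply_macd(inventory: dict[str, DeviceRecord], action: str, rec: DeviceRecord) -> None:
    """Single choke point: every move/add/change/decommission updates the records."""
    if action == "decommission":
        inventory.pop(rec.label, None)
    else:  # "move", "add", "change" all overwrite the record with current truth
        inventory[rec.label] = rec

inv: dict[str, DeviceRecord] = {}
apply_macd(inv, "add", DeviceRecord("node-01", "Chemistry", "peachtree", date(2016, 1, 1)))
```

Because every change flows through one function, the records cannot silently drift away from what is actually racked.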
Make documentation part of your process
• Started by creating an Internal Operations Procedure for ITSD
  • All new installations were to be pre-built in DCIM
  • Allowed users and technicians time to learn the system
  • Allowed the developer to see where improvements were needed
  • Essentially banned the phrase, “Have the techs see me”
• Increased the efficiency of technicians
  • Ability to pre-print cable labels dramatically reduced time (see the sketch below)
  • Accurate labeling decreased unplanned outages
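A minimal sketch of the pre-printing idea, assuming connection records with an A end and a B end; the record layout and label format below are assumptions, not openDCIM's actual export format:

```python
# Hypothetical documented connections pulled from the DCIM database.
connections = [
    {"cable_id": "C-0101", "a_end": "RackA01-U12 eth0", "b_end": "SwitchB-03 port 14"},
    {"cable_id": "C-0102", "a_end": "RackA01-U14 eth0", "b_end": "SwitchB-03 port 15"},
]

def label_text(conn: dict) -> str:
    # Both ends appear on the label, so either end of the cable identifies
    # the far side without anyone having to trace it by hand.
    return f'{conn["cable_id"]}: {conn["a_end"]} <-> {conn["b_end"]}'

for conn in connections:
    print(label_text(conn))  # in practice, sent to a label printer before the install
```

Because the labels come straight from the documented build, the work order can go to any technician without the “have the techs see me” step.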
Document what you have today
• Do you really know what’s in your data center? (The sketch below turns these questions into one-line queries.)
  • Do you know how many of each specific model you have?
  • Do you know who owns which system?
  • Does the user call it the same thing you do?
• How close to your design density is your implementation?
• Where does this cable go?
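Once the audit data exists, these questions become trivial aggregations. A sketch over made-up records; the tuple layout, model names, and the “peachtree” alias are illustrative only:

```python
from collections import Counter

# Hypothetical audit rows: (facilities label, model, owner, user's name for it)
inventory = [
    ("node-01", "Dell R740", "Chemistry", "peachtree"),
    ("node-02", "Dell R740", "Physics",   "node-02"),
    ("node-03", "HP DL380",  "Chemistry", "peachtree2"),
]

# How many of each specific model do you have?
print(Counter(model for _, model, _, _ in inventory))

# Who owns which system?
print({label: owner for label, _, owner, _ in inventory})

# Which systems do users call by a different name than Facilities does?
print([label for label, _, _, alias in inventory if alias != label])
```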
Regular Auditing
• Auditing is a continual process, not a one-time event
• Auditing can trigger other events (a sketch of such checks follows below)
  • This server was supposed to leave a year ago, but it’s still on
  • The listed owner of this device retired last June
  • The cabinet is over the designed safe power limit
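These triggers can be mechanical checks run on every audit pass. A sketch, where the field names, the active-owner list, and the power figures are all assumptions for illustration:

```python
from datetime import date

def audit_flags(device: dict, active_owners: set[str], today: date) -> list[str]:
    """Return the follow-up events one device should trigger on an audit pass."""
    flags = []
    if device["planned_removal"] < today:
        flags.append("was supposed to leave, but is still powered on")
    if device["owner"] not in active_owners:
        flags.append("listed owner is no longer here")
    if device["measured_kw"] > device["cabinet_safe_kw"]:
        flags.append("cabinet is over the designed safe power limit")
    return flags

print(audit_flags(
    {"planned_removal": date(2013, 6, 1), "owner": "retiree",
     "measured_kw": 6.2, "cabinet_safe_kw": 5.0},
    active_owners={"someone-else"},
    today=date(2014, 3, 1),
))  # all three example conditions fire for this record
```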
Formulate a plan for repeatable growth
• Standardize on your build-out
  • Cabinets
  • Power Strips
  • Cooling
  • Air-flow Management
• Determine the cost and lead time for each module (a sketch follows below)
• Repeat as necessary
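Treating the build-out as a fixed bill of materials makes cost and lead time a simple lookup, as in this sketch; the quantities, prices, and lead times are made-up placeholders, not ORNL figures:

```python
# One expansion "module": the standardized parts for one cabinet row.
# All figures below are placeholders for illustration only.
MODULE = [
    # (item, quantity, unit cost in USD, lead time in weeks)
    ("cabinet",           8,  2500,  4),
    ("power strip",      16,   900,  6),
    ("in-row cooler",     2, 30000, 12),
    ("aisle containment", 1, 15000,  8),
]

cost = sum(qty * unit for _, qty, unit, _ in MODULE)
lead = max(weeks for *_, weeks in MODULE)  # longest-lead item gates the module
print(f"one module: ${cost:,}, ready in about {lead} weeks")
# Growth then becomes "order N modules" - a predictable, repeatable expense.
```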
What ORNL Standardized On
• Power Strips – Geist RCX Series
• Circuits – 208 VAC 3-phase, 30 A or 50 A
• In-row Coolers – Liebert CRV
• Cabinets – SharkRack T2
• Cold Aisle Containment – Polargy
• DCIM – openDCIM
It’s less important what you standardize on and more
important that you standardize at all.
How Standardizing Helped
• Users no longer bought fixtures
  • No more value engineering
  • No more bad choices
  • No more wildly varying configurations
• Expansion became a predictable and repeatable process and expense
• Safety improved (both physical and service)
• Efficiency increased (both in energy and in labor), and average temperatures are now in the mid-70s °F
• Unplanned downtime was reduced
3 Key Things You Have Learned During this Session
1. Manage your user/customer expectations
2. Unless you have a design for tomorrow, your
efforts to fix things today will be in vain.
Standards are good.
3. Documentation is key to understanding where
you are today and how to get where you want to
be.
Thank you
Scott Milliken, CDCMP
Oak Ridge National Laboratory
[email protected]