Transcript Document
ITM1.1 – Increasing Reliability Through Design and Documentation
Scott Milliken, CDCMP
Oak Ridge National Laboratory, Computer Facilities Manager

Data Center World – Certified Vendor Neutral
Each presenter is required to certify that their presentation will be vendor-neutral. As an attendee, you have the right to enforce this policy of having no sales pitch within a session by alerting the speaker if you feel the session is not being presented in a vendor-neutral fashion. If the issue continues to be a problem, please alert Data Center World staff after the session is complete.

Increasing Reliability Through Design and Documentation
This presentation is an exploration of what can happen when a data center grows "organically" and the steps that ORNL had to take to fix it.

ORNL Started with a Common Problem
• Data center resources were limited
• Customers/departments were on their own for housing equipment
  • Under someone's desk (most common)
  • In an unused office
  • In a dedicated closet (if they were lucky)
• Administration knew that something had to be done

ORNL Computational Sciences Building
This will fix ALL of our problems!
• Opened in 2001
• Designed for 40,000 SF of data center across 2 floors (20,000 SF each)
• Multiple feeders from the TVA utility
• 1 x 500 kVA UPS
• Brought in the "servers" that had been under desks and in closets
• Designed to house these newfangled supercomputers

Sidebar – How Does Research Granting Work?
• Researchers compete for grant money to fund their work
  • Their goal is to show how their approach is different – in other words, why give money to them instead of someone else?
  • Oftentimes "different" gets translated into "micromanaged"
    • This leads to under-desk "server" deployment
    • If you're lucky, they found a closet and at least added a dedicated thermostat for the VAV box
    • Standardization becomes a dirty word, because users feel like they can't be on the leading edge, or unique
• Facilities is often an afterthought
  • Users have to make choices about things they have no expertise in
  • Users don't know the right questions to ask

Diversity is good, right?
• Users only pay for enough upgrades to meet their needs
  • If a user paid for an upgrade, they feel entitled to determine who can use it
  • In some grants, the rules state that any upgrades covered by the grant money have to be exclusive to that project, even after it leaves
  • If the project has to pay for upgrades, it will be value-engineered to death

Ultimate Diversity = Wild, Wild West
• Fiefdoms are created and maintained
  • Just kidding. They're not maintained.
  • Users become territorial with the space they were given
• Site-wide upgrades don't happen
  • Who's going to pay for them?
  • Turns the data center/facility manager into a street beggar

A Tale of Two Data Centers
• Downstairs
  • Basically 3 computers taking up 20,000 SF
    • Jaguar, Kraken, and GAEA – 350 cabinets of Cray
    • File storage – 100 cabinets of DDN disk
  • Very homogeneous
  • Everything engineered as a project due to the massive size and scope
  • A system would come in and be relatively static for 3–5 years at a time
  • Roughly 65 °F temperature

A Tale of Two Data Centers
• Upstairs
  • Comprised of nearly 100% commodity computing equipment
  • A mixture of rack-mounted and shelf-mounted hardware
  • Average footprint for a project was less than 1 cabinet
  • Power was provisioned as needed
    • No live work allowed at a DOE site
    • Plug type was based on what the salesperson bundled
    • Some cabinets had 6 circuits just to support the load in a non-redundant manner
  • Cold enough to hang meat – the only way to deal with hot spots, with some areas showing < 60 °F at rack inlet

Unmanageable Workflows
Over a third of work orders for the instrumentation technicians had the phrase, "Have the techs see me and I will show them what to do." This was mainly due to incorrect or missing labeling and no common vocabulary with which to communicate. This is completely unsustainable, because there is no way to know whether a task is a 1-hour job or a 10-day job. Additionally, I have no means to follow up and confirm that the task was completed correctly.

Overheard at the Office
• "This can't ever go down, so I need it on the UPS."
• "Oh, that's a small system, just go find a place for it in the upstairs data center."
• "This salesperson told me that it was cheaper if I bought the cabinet and power strips from them, and they only carry this brand/model."
• "Why do I need to waste $300 on a second power supply? It's on the UPS, isn't it?"

We Didn't Design It To Work This Way…

Lesson Learned #1 – Manage Expectations
The lab built a nice data center, but then simply assumed that the users would know how to act to keep it that way. Without guidance, facilities always became the most value-engineered line item, because users assumed it was 100% reliable – after all, we have/had the fastest supercomputer in the world. To their credit, it took about 7 years before they were completely painted into a corner.

"Wait, which Peachtree did you mean?"
• Users spoke their language
• Facilities spoke their language
• Users and Facilities interpreted the same word in different ways
• Labeling was confusing
  • Cables were re-used without relabeling
  • Electrical circuit labels were not updated
  • Systems were renamed without relabeling
  • Users were "responsible" for labeling
• Result = unplanned outages

Efforts Made to Resolve Problems
• Problems encountered during a maintenance activity were tracked and put on "the list"
  • Mislabeled breaker panels
  • Incorrect power configurations
  • Hot spots / cold spots
• By the time the next activity came around…
  • "The list" was no longer accurate
  • Higher-priority items were addressed
  • Only the squeaky wheels were greased

Responsiveness Made the DMV Look Good
• DOE restrictions on hot work
  • A panel had to be de-energized to add a circuit
  • Systems powered by a panel were not diversified
  • Annual maintenance… really? We have to wait for this once-a-year activity?
• There was so much catch-up work to perform that pro-active work was rarely completed
• Kludges were the norm, rather than the exception

Compounding Problems
• Once a project became urgent enough, it would force management's hand to schedule an outage
• A full-center outage was impractical and politically unsound, so they would schedule a partial outage
  • Since labeling was wrong, the affected-systems list was a guess more than anything
  • Dependencies weren't documented, so even if someone knew that a switch would go down, they had no idea what else was affected

Why think about the future?
• Customers only care about their environment – not how it interacts with the physical world
  • Expansion plans
  • Cable management
  • Labeling
  • Serviceability

Lesson #2 – Decide what your end result should be; otherwise you are wandering aimlessly
• Determine what your standards are
  • Start with code requirements
  • Build on them with recognized public standards
  • Add a dash of best practices
• Understand that change isn't a light switch
  • Change in a data center is akin to eating an elephant
  • Lasting change happens when you can show the benefit, rather than barking orders

How Do We Fix This?
• Get and keep accurate documentation
• Triage your equipment
  • New installations
  • Must fix now
  • Wait until lifecycle
• Extend the line of demarcation
  • The facility includes cabinets
  • The facility includes power distribution
  • The facility includes air-flow management

Be ready to make the tough call
• FDCCI (the Federal Data Center Consolidation Initiative) mandated a reduction in data centers – to reduce energy consumption and costs
• ORNL tasked a team with estimating the impact (cost and downtime) of fixing the problems
• The numbers clearly showed that the best course of action ran against the FDCCI mandate – we needed to occupy new space

Step 1 – Documentation
• Fully audit the data center
  • Who owns what?
  • What is misconfigured?
  • How long will it stay?
• OWN the process to keep records accurate
  • Users don't care. Really, they don't.
  • Find the right point in your process to inject the documentation step.
  • Funnel all MACDs (moves, adds, changes, decommissions) through your process – a minimal record sketch follows.
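As a concrete illustration of funneling every MACD through the documentation step, the sketch below shows what a minimal record might have to capture before a work order is released to a technician. This is a hypothetical example, not ORNL's actual tooling: the field names and the is_actionable() check are invented for illustration.

```python
# Minimal sketch of a MACD (move/add/change/decommission) record.
# All fields are hypothetical -- the point is that a work order should
# carry enough documentation to be executed without "see me first".
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MACDRecord:
    ticket_id: str
    action: str            # "move" | "add" | "change" | "decommission"
    device_label: str      # label as it physically appears on the device
    owner: str             # an accountable person, not a group alias
    cabinet: str           # target cabinet location
    circuits: list[str] = field(default_factory=list)  # panel/breaker IDs
    due: date | None = None

    def is_actionable(self) -> bool:
        """A technician should be able to act on this record alone:
        no blank labels, owners, or locations allowed."""
        return all([self.ticket_id, self.action, self.device_label,
                    self.owner, self.cabinet])
```

A record that fails is_actionable() goes back to the requester instead of landing on a technician's queue, which is exactly the "inject the documentation step" point above.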
Make documentation part of your process
• Started by creating an Internal Operations Procedure for ITSD
  • All new installations were to be pre-built in DCIM
  • Allowed users and technicians time to learn the system
  • Allowed the developer to see where improvements were needed
  • Essentially banned the phrase, "Have the techs see me"
• Increased the efficiency of technicians
  • The ability to pre-print cable labels dramatically reduced time (see the sketch below)
  • Accurate labeling decreased unplanned outages
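The pre-printed-label step can be as low-tech as generating both ends' label text straight from the DCIM connection data. The sketch below assumes a hypothetical CSV export with from/to device and port columns; openDCIM's real export format may differ, so treat the file name and column names as placeholders.

```python
# Sketch: turn a DCIM connection export into printable cable labels.
# The CSV layout here is assumed, not openDCIM's documented format.
import csv

def cable_labels(export_path: str) -> list[str]:
    labels = []
    with open(export_path, newline="") as f:
        for row in csv.DictReader(f):
            # Put the same text on both ends of the run, so a tech at
            # either end can identify the far side without tracing cable.
            labels.append(f"{row['from_device']}:{row['from_port']} <-> "
                          f"{row['to_device']}:{row['to_port']}")
    return labels

if __name__ == "__main__":
    for text in cable_labels("connections.csv"):
        print(text)  # in practice, feed this to the label printer software
```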
Document what you have today
• Do you really know what's in your data center?
  • Do you know how many of each specific model you have?
  • Do you know who owns which system?
  • Does the user call it the same thing you do?
• How close to your design density is your implementation?
• Where does this cable go?

Regular Auditing
• Auditing is a continual process, not a one-time event
• Auditing can trigger other events (each is mechanically checkable – see the sketch below)
  • This server was supposed to leave a year ago, but it's still on
  • The listed owner of this device retired last June
  • The cabinet is over the designed safe power limit
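Each trigger on the auditing slide above can be turned into an automated check. Below is a sketch over a hypothetical inventory list; the record fields are invented for illustration, not an openDCIM schema. The power check uses the standard three-phase capacity formula, sqrt(3) x V x A, with the usual 80% continuous-load derating; that puts a 208 V / 30 A circuit at roughly 8.6 kW and a 50 A circuit at roughly 14.4 kW, matching the circuit sizes ORNL standardized on (next slides).

```python
# Sketch of automated audit checks over a hypothetical inventory list.
# Record fields (removal_due, owner_active, measured_kw, ...) are
# illustrative, not taken from any specific DCIM schema.
import math
from datetime import date

def circuit_kw(volts: float, amps: float, derate: float = 0.8) -> float:
    """Usable kW of a three-phase circuit with continuous-load derating:
    208 V / 30 A -> ~8.6 kW, 208 V / 50 A -> ~14.4 kW."""
    return math.sqrt(3) * volts * amps * derate / 1000.0

def audit(inventory: list[dict], today: date) -> list[str]:
    findings = []
    for item in inventory:
        if item["removal_due"] and item["removal_due"] < today:
            findings.append(f"{item['label']}: past its scheduled removal date")
        if not item["owner_active"]:
            findings.append(f"{item['label']}: listed owner is no longer here")
        if item["measured_kw"] > item["cabinet_limit_kw"]:
            findings.append(f"{item['label']}: over the designed power limit")
    return findings

print(f"30 A circuit: {circuit_kw(208, 30):.1f} kW usable")  # ~8.6 kW
```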
Formulate a plan for repeatable growth
• Standardize on your build-out
  • Cabinets
  • Power strips
  • Cooling
  • Air-flow management
• Determine the cost and lead time for each module
• Repeat as necessary

What ORNL Standardized On
• Power strips – Geist RCX series
• Circuits – 208 VAC 3-phase, 30 A or 50 A
• In-row coolers – Liebert CRV
• Cabinets – SharkRack T2
• Cold-aisle containment – Polargy
• DCIM – openDCIM
It's less important what you standardize on and more important that you standardize at all.

How Standardizing Helped
• Users no longer bought fixtures
  • No more value engineering
  • No more bad choices
  • No more wildly varying configurations
• Expansion became a predictable and repeatable process and expense
• Safety is improved (both physical and service)
• Efficiency increased (both in energy and in labor), and average temperatures are now in the mid-70s °F
• Unplanned downtime reduced

3 Key Things You Have Learned During this Session
1. Manage your user/customer expectations.
2. Unless you have a design for tomorrow, your efforts to fix things today will be in vain. Standards are good.
3. Documentation is key to understanding where you are today and how to get where you want to be.

Thank you
Scott Milliken, CDCMP
Oak Ridge National Laboratory
[email protected]