Planning the LCG Fabric at CERN openlab TCO Workshop


Planning the LCG Fabric at CERN
openlab TCO Workshop, November 11th 2003
[email protected]
Fabric Area Overview
 Automation, operation, control: installation, configuration + monitoring, fault tolerance
 Infrastructure: electricity, cooling, space
 Batch system (LSF, CPU server)
 Storage system (AFS, CASTOR, disk server)
 Network
 Architecture: benchmarks, R&D, prototypes, testbeds
 Purchase, hardware selection, resource planning
 GRID services !?
 Coupling of components through hardware and software
Agenda
 Building Fabric
 Batch Subsystem
 Storage Subsystem
 Installation and Configuration
 Monitoring and Control
 Hardware Purchase
Building Fabric — I
 B513 was constructed in the early 1970s and the machine room infrastructure has evolved slowly over time.
– Like the eye, the result is often not ideal…
Current Machine Room Layout
[Floor plan] Problem: normabarres run one way, services run the other….
Building Fabric — I (continued)
 With the preparations for LHC we have the opportunity to remodel the infrastructure.
Future Machine Room Layout
[Floor plan:]
– 9m double rows of racks for critical servers, with aligned normabarres.
– 18m double rows of racks: 12 shelf units or 36 19” racks each.
– Capacity: 528 box PCs (105kW), 1440 1U PCs (288kW), 324 disk servers (120kW?) — arithmetic below.
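The quoted loads imply per-box draws of roughly 200W per PC and 370W per disk server; as a sanity check, here is the arithmetic (mine, not the slide's) in a short Python sketch:

```python
# Implied per-box power draw from the layout figures above.
print(105_000 / 528)    # box PCs:      ~199 W each
print(288_000 / 1440)   # 1U PCs:        200 W each
print(120_000 / 324)    # disk servers: ~370 W each (the "?" is the slide's)

# At ~200 W per PC, the 2.5 MW ceiling quoted later caps the farm at
# roughly 12,500 PCs before power, not space, becomes the limit.
```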
Building Fabric — I (continued)
 Remodelling the infrastructure means:
– Arranging services in clear groupings associated with power and network connections.
» Clarity for general operations, plus ease of service restart should there be any power failure.
– Isolating critical infrastructure such as networking, mail and home directory services.
– Clear monitoring of the planned power distribution system.
 This is just “good housekeeping”, but we expect to reap the benefits during LHC operation.
Building Fabric — II
 Beyond good housekeeping, though, there are building fabric issues that are intimately related to recurrent equipment purchases.
– Raw power: we can support a maximum equipment load of 2.5MW. Does the recurrent additional cost of blade systems avoid investment in additional power capacity?
– Power efficiency: early PCs had power factors of ~0.7 and generated high levels of 3rd harmonics. Fortunately, we now see power factors of 0.95 or better, avoiding the need to install filters in the PDUs. Will this continue? (See the sketch below.)
– Many sites need to install 1U or 2U rack-mounted systems for space reasons. This is not a concern for us at present but may become so eventually.
» There is a link here to the previous point: the small power supplies for 1U systems often have poor power factors.
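Why the power factor matters for the 2.5MW ceiling: the distribution system must be sized for apparent power, S = P / pf. A quick illustration (my arithmetic, not the slide's):

```python
# Apparent power to provision for a 2.5 MW real equipment load.
real_power_mw = 2.5
print(real_power_mw / 0.70)   # ~3.57 MVA with early-PC power factors
print(real_power_mw / 0.95)   # ~2.63 MVA with current supplies

# A return to pf ~0.7 would mean ~35% more distribution capacity
# (plus harmonic filters in the PDUs) for the same useful load.
```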
Fabric Architecture
Physical and logical coupling, with increasing level of complexity:
– Hardware: CPU and disk → storage tray, NAS server, SAN element → PC → cluster → world-wide cluster.
– Software: motherboard, backplane, bus, integrating devices (memory, power supply, controller, …) → operating system, drivers → network (Ethernet, Fibre Channel, Myrinet, …), hubs, switches, routers → batch system, load balancing, control software, hierarchical storage systems → wide area network, Grid middleware.
Batch Subsystem
 Looking purely at batch system issues, TCO is reduced as the efficiency of node usage increases (a toy cost model follows). What are the dependencies?
– The load characteristics
» Not much we in IT can do here!
– The batch scheduler
» LSF is pretty good here, fortunately.
– Chip technology
– Processors/box
– The operating system
– Others?
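To make the opening claim concrete, a toy cost model (an illustrative sketch, not from the slides):

```python
def cost_per_useful_cpu_hour(capital, lifetime_years, annual_operations,
                             nodes, efficiency):
    """All-in annual cost divided by usefully delivered CPU-hours."""
    annual_cost = capital / lifetime_years + annual_operations
    useful_hours = nodes * 365 * 24 * efficiency
    return annual_cost / useful_hours

# Doubling node-usage efficiency (0.45 -> 0.90) halves the cost per
# useful CPU-hour for the same farm (all figures invented):
print(cost_per_useful_cpu_hour(2e6, 3, 5e5, 1000, 0.45))
print(cost_per_useful_cpu_hour(2e6, 3, 5e5, 1000, 0.90))
```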
Batch Subsystem — Chip technology
– Take hyperthreading, for example. Tests have shown that, for HEP codes at least, hyperthreading wastes 20% of the system performance when running two tasks on a dual processor machine, and there are no clear benefits to running with hyperthreading enabled when running three tasks. What is the outlook here? (Worked numbers below.)
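Put in numbers, the hyperthreading observation looks like this (an illustrative sketch; only the 20% figure comes from the slide):

```python
# Dual-CPU node running two CPU-bound HEP tasks.
throughput_ht_off = 2.0                      # two full tasks' worth of work
throughput_ht_on = 0.8 * throughput_ht_off   # the observed 20% loss

# Scaled to a 1000-node farm, leaving HT enabled wastes the
# equivalent of ~200 nodes' worth of capacity:
print(1000 * (1 - throughput_ht_on / throughput_ht_off))
```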
Batch Subsystem — Processors/box
– At present, a single 100baseT NIC would support the I/O load of a quad processor CPU server. Quad processor boxes would halve the cost of networking infrastructure, but they come at a hefty price premium (XEON MP vs XEON DP, heftier chassis, …). What is the outlook here? (Port counting sketched below.)
» And total system memory becomes an issue.
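The halving claim is simple port counting; a sketch with an invented per-port cost (only the one-NIC-per-box premise comes from the slide):

```python
cpus_needed = 2000
cost_per_switch_port = 100      # invented figure, CHF

dual_boxes = cpus_needed // 2   # one 100baseT NIC/port per box
quad_boxes = cpus_needed // 4
saving = (dual_boxes - quad_boxes) * cost_per_switch_port

# Quad boxes win only if this saving beats the total premium paid
# for XEON MP chips, heftier chassis and extra memory:
print(saving, "CHF saved on ports; premium per quad box must stay under",
      saving / quad_boxes, "CHF")
```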
Batch Subsystem — The operating system
– Linux is getting better, but things such as processor affinity would be nice (a sketch follows).
» Relationship to hyperthreading…
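Processor affinity did arrive with the 2.6-era kernels; a minimal sketch of pinning a process to one CPU, using modern Python's wrapper for the Linux syscall (anachronistic for 2003, purely illustrative):

```python
import os

# Pin this process to CPU 0 so the scheduler cannot bounce it between
# physical processors (or between the two logical CPUs of one
# hyperthreaded core).
os.sched_setaffinity(0, {0})        # pid 0 means the calling process

# Read back the allowed-CPU set to confirm the pinning took effect.
print(os.sched_getaffinity(0))      # -> {0}
```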
Storage subsystem
 Simple building blocks (sketched below):
– Processors: a “desktop+” node == CPU server
– CPU server + larger case + 6×2 disks == Disk server
– CPU server + Fibre Channel interface + tape drive == Tape server
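Read as a bill of materials, the progression is strictly additive; a hypothetical sketch (the list representation is mine, the parts are the slide's):

```python
# Each server type is the previous building block plus a few parts.
CPU_SERVER = ["desktop+ node"]
DISK_SERVER = CPU_SERVER + ["larger case"] + ["disk"] * 12   # 6 x 2 disks
TAPE_SERVER = CPU_SERVER + ["Fibre Channel interface", "tape drive"]
```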
Storage subsystem — Disk Storage
 TCO: maximise available online capacity within a fixed budget (material & personnel).
– IDE based disk servers are much cheaper than high end SAN servers. But are we spending too much time on maintenance?
» Yes, at present, but we need to analyse carefully the reasons for the current load. The complexities of Linux drivers seem under control, but the numbers have exploded. And are some problems related to a particular batch of hardware?
– Where is the optimum? Switching to fibre channel disks would reduce capacity by a factor of ~5.
» Naively, buy, say, 10% extra systems to cover failures. Sadly, this is not as simple as for CPU servers; active data on the failed servers must be reloaded elsewhere.
» Always have duplicate data? => purchase 2x the required space. Still cheaper than SAN (see the sketch below)? How does this relate to …
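On the slide's own factor of ~5, duplicated IDE still wins on capital cost; a quick check (the normalised prices are placeholders):

```python
# A fixed budget buys capacity C on IDE, or ~C/5 on fibre channel,
# i.e. FC costs ~5x more per usable GB (the slide's factor).
ide_cost_per_gb = 1.0                    # normalised placeholder
fc_cost_per_gb = 5.0 * ide_cost_per_gb
duplicated_ide = 2.0 * ide_cost_per_gb   # hold every byte twice

assert duplicated_ide < fc_cost_per_gb   # 2x IDE still beats FC
# ...which is why the open question is really the personnel cost
# of keeping the larger IDE server park running.
```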
Storage System — Tapes
 The first TCO question is “Do we need them?”
 Disk storage costs are dropping…
Disk Price/Performance Evolution
[Chart: price in SFr per GByte, log scale, against months since Jan 2000, for non-mirrored disk servers built with 40 GB through 200 GB disks. The curves show a factor 2.5 difference and a factor 6 improvement in 3 years.]
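The quoted factor 6 over 3 years implies (my arithmetic, not the slide's) a price per GB falling by roughly 45% a year:

```python
annual_factor = 6 ** (1 / 3)   # ~1.82x cheaper each year
print(1 - 1 / annual_factor)   # ~0.45, i.e. ~45% decline per year
# Equivalently, SFr/GB halves roughly every 14 months.
```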
Storage System — Tapes (continued)
 Disk storage costs are dropping… But:
– Disk servers need system administrators; idle tapes sitting in a tape silo don’t.
– With a disk-only solution, we need storage for at least twice the total data volume to ensure no data loss.
– Server lifetime is 3-5 years, so data must be copied periodically.
» Also an issue for tape, but the lifetime of a disk server is probably still less than the lifetime of a given tape media format.
 The assumption today is that tape storage will be required.
Storage System — Tapes
 Tape robotics is easy.
– Bigger means better cost/slot.
Storage System — Tapes (continued)
 Tape drives: high end vs LTO.
– TCO issue: LTO drives are cheaper than high end IBM and STK drives, but are they reliable enough for our use?
» c.f. the IDE disk server area.
 The real problem, though, is tape media.
– A vast portion of the data is accessed rarely but must be stored for a long period. There is strong pressure to select a solution that minimises an overall cost dominated by tape media.
Storage System — Managed Storage
 Should CERN build or buy software systems? How do we measure the value of a software system?
– Initial cost:
» Build: staff time to create the required functionality.
» Buy: initial purchase cost of the system as delivered, plus staff time to install and configure it for CERN.
– Ongoing cost:
» Build: staff time to maintain the system and add extra functionality.
» Buy: license/maintenance cost plus staff time to track releases. Extra functionality that we consider useful may or may not arrive.
 Choices made:
– Batch system: buy LSF.
– Managed storage system: build CASTOR.
 We use this model as we move on to consider system management software (a toy version below).
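A toy version of the build-vs-buy comparison, with invented staff and license figures purely for illustration (only the cost structure comes from the slide):

```python
FTE_COST = 150_000   # invented annual cost of one staff-year, CHF

def build_cost(years, create_fte, maintain_fte_per_year):
    """Staff time to create the system, plus ongoing maintenance."""
    return (create_fte + years * maintain_fte_per_year) * FTE_COST

def buy_cost(years, initial_license, install_fte, annual_license,
             track_fte_per_year):
    """Purchase and installation, plus licenses and release tracking."""
    return (initial_license + install_fte * FTE_COST
            + years * (annual_license + track_fte_per_year * FTE_COST))

# Over five years: build with 3 FTE-years up front and 1 FTE ongoing,
# versus buy at 200k up front, 0.5 FTE install, 100k/year, 0.3 FTE tracking.
print(build_cost(5, 3, 1.0))              # 1,200,000
print(buy_cost(5, 200_000, 0.5, 100_000, 0.3))   # 1,000,000
```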
Installation and Configuration
 Reproducibility and guaranteed homogeneity of system configuration is a clear way to minimise ongoing system management costs. A management framework is required that can cope with the number of systems we expect.
 We faced the same issues as we moved from mainframes to RISC systems. The vendor solutions offered then were linked to hardware, so we developed our own solution.
 Is a vendor framework acceptable if we have a homogeneous park of Linux systems?
– Being honest, why have we built our own again?
Installation and Configuration (continued)
 Installation and configuration is only part of the overall computer centre management:

ELFms architecture
[Diagram: the node at the centre, surrounded by the Installation System, the Node Configuration System, the Monitoring System and the Fault Mgmt System.]
Installation and Configuration (continued)
 Systems provided by vendors cannot (yet) be integrated into such an overall framework.
 And there is still a tendency to differentiate products on the basis of management software, not raw hardware performance.
– This is a problem for us as we cannot ensure we will always buy brand X rack mounted servers or blade systems.
– In short, life is not so different from the RISC system era.
Monitoring and Control
 Assuming that there are clear interfaces, why not integrate a commercial monitoring package into our overall architecture?
 Two reasons:
– No commercial package meets (met) our requirements in terms of, say, long term data storage and access for analysis.
» This could be considered self serving: we produce requirements that justify a build rather than a buy decision.
– Experience has shown, repeatedly, that monitoring frameworks require effort to install and maintain, but don’t deliver the sensors we require.
» Vendors haven’t heard of LSF, let alone AFS.
» A good reason!
Hardware Management System
 A specific example of the integration problem. Workflows must interface to local procedures for, e.g., LAN address allocation. Can we integrate a vendor solution? Do complete solutions exist?
 The machine installation workflow, implemented with Remedy (HMS/PRMS) tickets and crossing the FIO/OPT, FIO/IS, CS and DCS teams:
– Request New Machine Install [FIO/IS]
– Decide New Identity [FIO/OPT]
– Request Physical Machine Install [FIO/OPT]
– Physically Install Machine [DCS]
– Request Network Connection [FIO/OPT]
– Connect to Network [CS]
– Install [FIO/IS]
– Check and Update Information [FIO/OPT]
[Swimlane diagram: at each step a ticket is raised, observed and closed; network connection and DNS entries are requested and the CS database updated; node maps are imported, database updates and checks performed, and machine status changed, through software install, entry into production and eventual retirement or move.]
Console Management
 Done poorly now:
Console Management (continued)
 We will do better:
– CDB, the configuration service, holds the machine → port @ head node mapping and the user → machine authorisations (sketched below).
– A user application on an lxplus node (lxplusnnn) reaches a server process on a head node (pcitfionnn), which connects over RS/232 through one of ~75 console servers, each serving up to 44 machines (machine 1.1 … machine 75.44).
– Console output is collected centrally in a console log repository.
 TCO issue: do the benefits of a single console management system outweigh the costs of developing our own? How do we integrate vendor supplied racks of preinstalled systems?
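A hypothetical sketch of the two CDB mappings described above (machine names and values invented; only the mappings themselves come from the slide):

```python
# Which console-server port reaches each machine (machine -> port @ head node).
machine_to_port = {
    "lxbatch001": ("consoleserver01", 1),    # RS/232 port 1 of 44
    "lxbatch044": ("consoleserver01", 44),
    "diskserver17": ("consoleserver75", 3),
}

# Which machines each user may open a console on.
user_authorisations = {
    "operator1": {"lxbatch001", "lxbatch044"},
    "sysadmin2": {"diskserver17"},
}

def console_endpoint(user, machine):
    """Return (console server, port) for machine if user is authorised."""
    if machine not in user_authorisations.get(user, set()):
        raise PermissionError(f"{user} may not access {machine}")
    return machine_to_port[machine]
```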
Hardware Purchase
 The issue at hand: how do we work within our purchasing procedures to purchase equipment that minimises our total cost of ownership?
 At present, we eliminate vast areas of the multi-dimensional space by assuming we will rely on ELFms for system management and CASTOR for data management. Simplified[!!!] view:
– CPU: white box vs 1U vs blades; install ourselves or buy ready packaged.
– Disk: IDE vs SAN; level of vendor integration.
 HELP!
 Can we benefit from management software that comes with ready built racks of equipment in a multi-vendor environment?