ACAT2002 Fabric Summary


Building Large Scale Fabrics – A Summary
Marcel Kunze, FZK
Observation
Everybody seems to need unprecedented amounts of CPU, disk and network bandwidth.
Trend toward PC-based computing fabrics and commodity hardware:

LCG (CERN), L. Robertson
CDF (Fermilab), M. Neubauer
D0 (Fermilab), I. Terekhov
Belle (KEK), P. Krokovny
HERA-B (DESY), J. Hernandez
LIGO, P. Shawhan
Virgo, D. Busculic
AMS, A. Klimentov

Considerable cost savings compared with RISC-based farms, which offer not enough 'bang for the buck' (M. Neubauer).
AMS02 Benchmarks 1)

Brand, CPU, Memory                               | OS / Compiler            | "Sim" | "Rec"
-------------------------------------------------|--------------------------|-------|------
Intel PII dual-CPU 450 MHz, 512 MB RAM           | RH Linux 6.2 / gcc 2.95  | 1     | 1
Intel PIII dual-CPU 933 MHz, 512 MB RAM          | RH Linux 6.2 / gcc 2.95  | 0.54  | 0.54
Compaq quad α-ev67 600 MHz, 2 GB RAM             | RH Linux 6.2 / gcc 2.95  | 0.58  | 0.59
AMD Athlon 1.2 GHz, 256 MB RAM                   | RH Linux 6.2 / gcc 2.95  | 0.39  | 0.34
Intel Pentium IV 1.5 GHz, 256 MB RAM             | RH Linux 6.2 / gcc 2.95  | 0.44  | 0.58
Compaq dual-CPU PIV Xeon 1.7 GHz, 2 GB RAM       | RH Linux 6.2 / gcc 2.95  | 0.32  | 0.39
Compaq dual α-ev68 866 MHz, 2 GB RAM             | Tru64 Unix / cxx 6.2     | 0.23  | 0.25
Elonex Intel dual-CPU PIV Xeon 2 GHz, 1 GB RAM   | RH Linux 7.2 / gcc 2.95  | 0.29  | 0.35
AMD Athlon 1800MP dual-CPU 1.53 GHz, 1 GB RAM    | RH Linux 7.2 / gcc 2.95  | 0.24  | 0.23
8-CPU Sun Fire 880, 750 MHz, 8 GB RAM            | Solaris 5.8 / C++ 5.2    | 0.52  | 0.45
24-CPU Sun UltraSPARC-III+, 900 MHz, 96 GB RAM   | RH Linux 6.2 / gcc 2.95  | 0.43  | 0.39
Compaq dual α-ev68 866 MHz, 2 GB RAM             | RH Linux 7.1 / gcc 2.95  | 0.22  | 0.23
"Sim" and "Rec": execution time of the AMS "standard" simulation and reconstruction jobs, compared across CPU types and normalized to the dual PII 450 MHz (= 1).
1) V. Choutko, A. Klimentov, AMS note 2001-11-01
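Reading the table (my note, not from the talk): the numbers are execution times normalized to the reference machine, so lower is faster; a "Sim" value of 0.23 for the dual α-ev68 means the simulation job finishes in 23% of the reference time, i.e. a speedup of 1 / 0.23 ≈ 4.3 over the dual PII 450 MHz.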
Fabrics and Networks: Commodity Equipment
Needed for LHC at CERN in 2006:
Storage: raw recording rate of 0.1–1 GB/s, accumulating 5–8 PetaBytes/year, with 10 PetaBytes of disk (back-of-envelope check below)
Processing: 200,000 of today's (2001) fastest PCs
Networks: 5–10 Gbps between the main Grid nodes
Distributed computing effort to avoid congestion: 1/3 at CERN, 2/3 elsewhere
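Back-of-envelope check (mine, not from the slides), assuming roughly 10^7 live seconds of data taking per year: an average raw rate of 0.5 GB/s gives 0.5 GB/s × 10^7 s ≈ 5 × 10^6 GB = 5 PB per year, consistent with the quoted 5–8 PB/year; running at the 1 GB/s peak rate throughout would give about 10 PB/year.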
[Figure: CERN wide-area connectivity, with mission-oriented and public links to national research networks and exchange points (RENATER, SWITCH, IN2P3, WHO, TEN-155, KPNQwest, C-IXP, the Genesis project, Japan) plus the USLIC link, at rates ranging from about 2 Mb/s to 155 Mb/s.]
PC Cluster 5 (Belle)
1U servers with Pentium III 1.2 GHz: 256 CPUs (128 nodes)
PC Cluster 6
3U blade server: LP Pentium III 700 MHz, 40 CPUs (40 nodes)
Disk Storage
IDE Performance
Basic Questions
Compute farms contain several thousand computing elements.
Storage farms contain thousands of disk drives.
How to build scalable systems?
How to build reliable systems?
How to operate and maintain large fabrics?
How to recover from errors?
EDG deals with the issue (P. Kunszt).
IBM deals with the issue (N. Zheleznykh): Project eLiza, self-healing clusters.
Several ideas and tools are already on the market.
Storage Scalability
It is difficult to scale up to systems of thousands of components while keeping a single system image: NFS automounters, symbolic links, etc.
(M. Neubauer, CAF: ROOTD does not need this and allows direct worldwide access to distributed files without mounts; see the sketch below.)
Scalability in size and throughput by means of storage virtualisation:
allows non-TCP/IP based systems to be set up to handle multi-GB/s.
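To make the ROOTD point concrete, here is a minimal sketch (mine, not from the talk; the server name, file path and histogram name are hypothetical) of a client reading a remote ROOT file served by rootd, with no mount or automounter involved:

    // Minimal sketch (hypothetical host, path and object name): read a remote
    // ROOT file through rootd via the root:// protocol instead of an NFS mount.
    // Build against ROOT, e.g.: g++ remote_read.cxx $(root-config --cflags --libs)
    #include "TFile.h"
    #include "TH1.h"
    #include <cstdio>

    int main() {
      // TFile::Open dispatches root:// URLs to the network file class,
      // so the file is served directly by the remote rootd daemon.
      TFile *f = TFile::Open("root://dataserver.example.org//data/run42/events.root");
      if (!f || f->IsZombie()) {
        std::printf("could not open remote file\n");
        return 1;
      }
      // Fetch an object by name from the remote file (name is hypothetical).
      TH1 *h = (TH1 *) f->Get("hEnergy");
      if (h) std::printf("hEnergy entries: %g\n", h->GetEntries());
      f->Close();
      return 0;
    }

The same root:// URL works from an interactive ROOT session, which is what makes mount-free worldwide file access practical.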
Virtualisation of Storage
Data servers mount the virtual storage as a SCSI device.
[Diagram: input from the Internet/intranet passes through a load-balancing switch to the data servers, which provide shared data access (Oracle, PROOF) and sit on a Storage Area Network (FC-AL, InfiniBand, …); 200 MB/s sustained; scalability.]
Storage Elements (M. Gasthuber)
PNFS = Perfectly Normal FileSystem
Stores metadata with the data: 8 hierarchies of file tags (a sketch of the tag mechanism follows below)
Migration of data (hierarchical storage systems): dCache
A joint development of DESY and FermiLab
ACLs, Kerberos, ROOT-aware
Web monitoring
Cached as well as direct tape access
Fail-safe
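As an illustration of how the metadata travels with the data (my sketch, not from the talk, based on the dCache/PNFS magic-name convention as I recall it): PNFS exposes directory tags through special file names, '.(tags)()' to list a directory's tags and '.(tag)(NAME)' to read one, and tags are inherited down the directory tree. The /pnfs path and the sGroup tag name below are hypothetical examples.

    // Hedged sketch: read PNFS directory tags through the "magic" file names
    // (".(tags)()" lists the tags of a directory, ".(tag)(NAME)" reads one).
    // The /pnfs path and the tag name sGroup are hypothetical examples.
    #include <stdio.h>

    static void dump(const char *path) {
      FILE *f = fopen(path, "r");
      if (!f) { printf("cannot open %s\n", path); return; }
      char line[256];
      while (fgets(line, sizeof line, f)) fputs(line, stdout);
      fclose(f);
    }

    int main(void) {
      // List the tags attached to a directory in the PNFS namespace ...
      dump("/pnfs/example.org/data/exp1/.(tags)()");
      // ... and read one of them, e.g. the storage group used for tape migration.
      dump("/pnfs/example.org/data/exp1/.(tag)(sGroup)");
      return 0;
    }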
Necessary Admin Tools (A. Manabe)
System (SW) installation/update:
Dolly++ (image cloning)
Configuration:
Arusha (http://ark.sourceforge.net)
LCFGng (http://www.lcfg.org)
Status monitoring / system health check:
CPU/memory/disk/network utilization: Ganglia*1, plantir*2
(Sub-)system service sanity check: Pikt*3 / Pica*4 / cfengine
*1 http://ganglia.sourceforge.net   *2 http://www.netsonde.com
*3 http://pikt.org   *4 http://pica.sourceforge.net/wtf.html
Command execution:
WANI, a web-based remote command executor (a generic sketch of the fan-out idea follows below)
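WANI itself sits on top of the Webmin GUI (next slides). Purely as a generic illustration of the underlying task (run one command on many nodes and collect each node's output), here is a minimal sketch that fans a command out over ssh; the node names and the command are hypothetical, and a real tool adds parallelism, time-outs and proper authentication handling.

    // Generic illustration only (not how WANI is implemented): run a single
    // command on a list of nodes via ssh and collect each node's output.
    // Node names and the command are hypothetical.
    #include <stdio.h>
    #include <string>
    #include <vector>

    int main() {
      const std::vector<std::string> nodes = {"node001", "node002", "node003"};
      const std::string cmd = "uptime";

      for (const std::string &host : nodes) {
        // BatchMode=yes makes ssh fail instead of hanging on a password prompt.
        const std::string ssh = "ssh -o BatchMode=yes " + host + " " + cmd + " 2>&1";
        FILE *p = popen(ssh.c_str(), "r");
        if (!p) { printf("=== %s === failed to start ssh\n", host.c_str()); continue; }
        printf("=== %s ===\n", host.c_str());
        char line[512];
        while (fgets(line, sizeof line, p)) fputs(line, stdout);
        pclose(p);
      }
      return 0;
    }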
WANI is implemented on the 'Webmin' GUI.
[Screenshots: the start page with command input and node selection; the command-execution result page, listing results from 200 nodes on one page by host name; clicking a host shows its stdout and stderr output.]
CPU Scalability
The current tools scale up to ~1000 CPUs.
(In the previous example, 10,000 CPUs would require checking 50 pages.)
Autonomous operation is required: intelligent, self-healing clusters.
Resource Scheduling
Problem: how to access local resources from the Grid?
Local batch queues vs. global batch queues
Extension of Dynamite (University of Amsterdam) to work with Globus: Dynamite-G (I. Shoshmina)
Open question: how do we deal with interactive applications on the Grid?
Conclusions
A lot of tools exist.
A lot of work remains to be done in the fabric area in order to get reliable, scalable systems.