Geoff Quigley, Stephen Childs and Brian Coghlan
Trinity College Dublin

e-INIS
Regional Datastore @ TCD
Recent storage procurement
Physical infrastructure
10Gb networking
• Simple lessons learned
STEP09 experiences
Monitoring
• Network (STEP09)
• Storage
The Irish National e-Infrastructure
Funds the Grid-Ireland Operations Centre
Creating a National Datastore
• Multiple Regional Datastores
• Ops Centre runs the TCD regional datastore
For all disciplines
• Not just science & technology
Projects with an (inter)national dimension
Central allocation process
Grid and non-grid use
Grid-Ireland @ TCD already had
• Dell PowerEdge 2950 (2x quad-core Xeon)
• Dell MD1000 (SAS JBOD)
After procurement the datastore has
• 8x Dell PE2950 (6x 1 TB disks, 10GbE)
• 30x MD1000, each with 15x 1 TB disks
• ~11.6 TiB each after RAID6 and XFS format (~350 TiB total)
• 2x Dell blade chassis with 8x M600 blades each
• Dell tape library (24x Ultrium 4 tapes)
• HP ExDS9100 with 4 capacity blocks of 82x 1 TB disks each and 4 blades
• ~233 TiB total available for NFS/HTTP export
DPM installed on Dell hardware
• ~100 TB for Ops Centre to allocate
• Rest for Irish users via allocation process
• May also try to combine with iRODS
HP ExDS high-availability store
• iRODS primarily (example client commands after this list)
• NFS exports
• Not for conventional grid use
• Bridge services on blades for community-specific access patterns
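To make the iRODS access pattern concrete, a client would use the standard iCommands roughly as below; the resource name is purely hypothetical and the details of our zone configuration are not in the slides:

    # One-off: authenticate to the iRODS zone (connection details come from the ~/.irods environment)
    $ iinit

    # Put a dataset into the store, list it, and pull a copy back out
    $ iput -R exdsResc mydata.tar     # -R selects a storage resource; 'exdsResc' is a made-up name
    $ ils
    $ iget mydata.tar /tmp/mydata.tar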
Room needed an upgrade
• Another cooler
• UPS maxed out
New high-current AC circuits added
2x 3kVA UPS per rack acquired for the Dell equipment
ExDS has 4x 16A 3Ø feeds: 2 on the room UPS, 2 on raw power
10GbE to move data!
Benchmarked with netperf (example invocations below)
• http://www.netperf.org
Initially 1-2 Gb/s… not good
Had machines that produced figures of 4 Gb/s+
• What’s the difference?
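As a rough sketch of the kind of runs behind these figures (the server address, port and test lengths are illustrative, not taken from the slides):

    # On the receiving host: start the netperf server
    $ netserver -p 12865

    # On the sending host: 60 s TCP bulk-transfer test (reports throughput in Mbit/s)
    $ netperf -H 10.0.0.2 -p 12865 -l 60 -t TCP_STREAM

    # 60 s TCP request/response test (reports transactions per second)
    $ netperf -H 10.0.0.2 -p 12865 -l 60 -t TCP_RR

These are the same two test types shown in the charts later on.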
Looked at a couple of documents on this:
• http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf
• http://docs.sun.com/source/819-0938-13/D_linux.html
Tested various of these optimisations
• Initially little improvement (~100 Mb/s)
• Then identified the most important changes
Cards fitted to wrong PCI-E port
• Were x4 instead of x8
New kernel version
• New kernel supports MSI-X (multiqueue)
• Was saturating one core, now distributes
Increase MTU (from 1500 to 9216)
• Large difference to netperf
• Smaller difference to real loads
Then compared two switches with a direct connection
[Chart: netperf 60 s transfer test (Mbit/s, 0-9000) for host pairs A-B and C-D over a direct connection, the Force10 switch and the Arista switch (plus an Arista rerun), each run solo and simultaneously.]
[Chart: netperf 60 s TCP request/response test (requests/sec, 0-9000) for host pairs A-B and C-D over a direct connection, the Force10 switch and the Arista switch, each run solo and simultaneously.]
Storage was mostly in place
10GbE was there but still being tested
• Brought into production early in STEP09
Useful exercise for us
• Saw bulk data transfer in conjunction with user access to stored data
• The first large 'real' load on the new equipment
Grid-Ireland Ops Centre at TCD involved as a Tier-2 site
• Associated with the NL Tier-1
Peak traffic observed during STEP09
Data transfers into TCD from NL
• Peaked at 440 Mbit/s (capped at 500)
• Recently upgraded firewall box coped well
HEAnet view of the GEANT link
TCD view of the Grid-Ireland link
Lots of analysis jobs
• Running on cluster nodes
• Accessing large datasets directly from storage
• Caused heavy load on network and disk servers
• Caused problems for other jobs accessing storage
• Now known that the access patterns were pathological
Also production jobs
• ATLAS production, ATLAS analysis and LHCb production (visible in the monitoring graphs)
Almost all data was stored on this one server
• 3x 1Gbit bonded links set up (a sketch of this style of bonding follows below)
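For reference, a minimal sketch of the ifcfg-style configuration used to bond three 1Gbit NICs on an SL/RHEL disk server, assuming 802.3ad link aggregation; the interface names, bonding mode and address are assumptions, not details from the slides:

    # /etc/sysconfig/network-scripts/ifcfg-bond0 (hypothetical addressing)
    DEVICE=bond0
    IPADDR=192.168.1.10
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=802.3ad miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1 and eth2)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

    # On older SL4-era systems the bonding options go in /etc/modprobe.conf instead:
    # alias bond0 bonding
    # options bond0 mode=802.3ad miimon=100

802.3ad needs a matching LAG on the switch side; balance-alb would be an alternative that requires no switch configuration.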
Fixes made to the MonAMI DPM plugin:
• Fix to distinguish filesystems with identical names on different servers
• Fixed display of long labels
• Display space-token stats in TB
• New code for pool stats
Pool stats were the first to use the DPM C API
• Previously everything was done via MySQL
Was able to merge some of these fixes
• Time-consuming to contribute patches
• Single “maintainer” with no dedicated effort …
MonAMI useful but its future is uncertain
• Should UKI contribute effort to plugin development?
• Or should similar functionality be created for “native” Ganglia? (a sketch of that idea follows below)
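As a rough illustration of the “native Ganglia” option, pool statistics could be pushed into Ganglia from a cron job with gmetric; the metric name and the helper that queries DPM are hypothetical placeholders:

    #!/bin/sh
    # Hypothetical cron script on a disk server / head node.
    # get_pool_free_tb stands in for whatever queries DPM state
    # (via MySQL or the C API, as discussed above).
    FREE_TB=$(get_pool_free_tb)    # e.g. "87.3"

    # gmetric ships with Ganglia; these options are standard.
    gmetric --name dpm_pool_free --value "$FREE_TB" \
            --type float --units TB --tmax 600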
Recent procurement gave us a huge increase in capacity
STEP09 was a great test of the data paths into and within our new infrastructure
Identified bottlenecks and tuned the configuration
• Back-ported the SL5 kernel to support 10GbE on SL4
• Spread data across disk servers for load-balancing
• Increased capacity of the cluster-storage link
• Have since upgraded switches
Monitoring crucial to understanding what’s going on
• Weathermap for a quick visual check
• Cacti for detailed information on network traffic
• LEMON and Ganglia for host load, cluster usage, etc.
Thanks for your attention!
Ganglia monitoring system
• http://ganglia.info/
Cacti
• http://www.cacti.net/
Network weathermap
• http://www.network-weathermap.com/
MonAMI
• http://monami.sourceforge.net/
Quotas are close to becoming essential for us
10GbE problems have highlighted that releases on new platforms are needed far more quickly
Firewall: 1Gb outbound, 10Gb internally
M8024 switch in ‘bridge’ blade chassis
• 24-port (16 to blades) layer 3 switch
Force10 switch is the main ‘backbone’
• 10GbE cards in DPM servers
• 10GbE uplink from ‘National Servers’ 6224 switch
10GbE copper (CX4) from ExDS to M6220 in 2nd blade chassis
• Link between the 2 blade chassis: M6220 - M8024
4-way LAG between Force10 and M8024

The Force10 backbone switch:
• 24-port 10Gb switch
• XFP modules
• Dell supplied our XFPs so cost per port was reduced
• 10Gb/s only
• Layer 2 switch
• Same Fulcrum ASIC as the Arista switch we tested
• Uses a standard reference implementation
Arista Networks 7124S 24-port switch
SFP+ modules
• Low cost per port (switches relatively cheap too)
‘Open’ software - Linux
• Even has bash available (see the example session after this list)
• Potential for customisation (e.g. iptables being ported)
Can run 1Gb/s and 10Gb/s simultaneously
• Just plug in the different SFPs
Layer 2/3
• Some docs refer to layer 3 as a software upgrade
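As an illustration of the ‘open software’ point, Arista EOS lets you drop from its CLI into an ordinary Linux shell; the session below is a sketch of how that generally looks, not a transcript from our switch, and the prompts are illustrative:

    switch> enable
    switch# bash                  # enter a standard Linux bash shell on the switch

    [admin@switch ~]$ uname -a    # normal Linux tools are available from here
    [admin@switch ~]$ exit        # back to the EOS CLI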
Our 10GbE cards are Intel PCI-E 10GBASE-SR
Dell had plugged most of them into the x4 PCI-E slot
An error was coming up in dmesg
Trivial solution: moved the cards to x8 slots
Now we can get >5Gb/s on some machines (the negotiated link width can be checked as shown below)
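For reference, a quick way to confirm the negotiated PCI-E link width of a card; the bus address and controller model here are illustrative:

    # Find the 10GbE card's PCI address
    $ lspci | grep -i ethernet
    # e.g. 0b:00.0 Ethernet controller: Intel Corporation 82598EB 10-Gigabit ...

    # Compare what the card supports with what was actually negotiated
    $ lspci -vv -s 0b:00.0 | grep -E 'LnkCap|LnkSta'
    # LnkCap: ... Width x8 ...   <- supported
    # LnkSta: ... Width x4 ...   <- negotiated (symptom of the wrong slot)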
Maximum Transmission Unit (MTU)
• Ethernet spec says 1500
• Most hardware/software can support jumbo frames
ixgbe driver allowed MTU=9216
• Must be set through the whole path
• Different switches have different maximum values
Makes a big difference to netperf
Example on SL5 machines, 30 s tests:
• MTU=1500: TCP stream at 5399 Mb/s
• MTU=9216: TCP stream at 8009 Mb/s
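A minimal sketch of raising the MTU on the Linux side; the interface name is illustrative, and the matching jumbo-frame setting on each switch in the path (which varies by vendor, as noted above) is not shown:

    # Temporary change (lost on reboot)
    $ ip link set dev eth2 mtu 9216

    # Persistent change on SL/RHEL-style systems: add to
    # /etc/sysconfig/network-scripts/ifcfg-eth2
    MTU=9216

    # Verify, and check that large frames really pass end-to-end without fragmentation
    $ ip link show eth2 | grep mtu
    $ ping -M do -s 9188 10.0.0.2   # 9188 = 9216 minus 28 bytes of IP+ICMP headers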
Machines on SL4 kernels had very poor receive performance (50 Mb/s)
One core was 0% idle
• Use mpstat -P ALL (example below)
• Sys/soft time used up the whole core
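A sketch of the kind of check involved; the interface name is illustrative and the commented output is only indicative:

    # Per-CPU utilisation once per second: look for a single CPU at ~0% idle
    # with high %sys/%soft while the others sit idle.
    $ mpstat -P ALL 1

    # Confirm which driver (and version) is handling the 10GbE interface
    $ ethtool -i eth2
    # driver: ixgbe
    # version: ...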
/proc/interrupts showed PCI-MSI in use
• All RX interrupts went to one core
New kernel had MSI-X and multiqueue
• Interrupts distributed across cores, full RX performance (see output below)
-bash-3.1$ grep eth2 /proc/interrupts
114:      247   694613  5597495  1264609   426508  2089709     1103    15322   PCI-MSI-X  eth2:v0-Rx
122:      657  2401390   462620   499858  1660625  1098900   644629      234   PCI-MSI-X  eth2:v1-Rx
130:      220   600108   453070   560354   468223  3059723  1937777   128178   PCI-MSI-X  eth2:v2-Rx
138:       27   764411  1621884  1226975   497416  2110542   839601      473   PCI-MSI-X  eth2:v3-Rx
146:       37   171163   418685   349575   574859  2744006  1809175    17262   PCI-MSI-X  eth2:v4-Rx
154:       27   251647   210168     1889  2018363  2834302   795228   137892   PCI-MSI-X  eth2:v5-Rx
162:       27    85615  2221420   286245   415259  1628786   779341      363   PCI-MSI-X  eth2:v6-Rx
170:       27  1119768  1060578   892101   495187  2266459  1312734      813   PCI-MSI-X  eth2:v7-Rx
178:  1834310   371384   149915   104323      461  2405659    27463 16021786   PCI-MSI-X  eth2:v8-Tx
186:       45        0      158        0       23        0        0        1   PCI-MSI-X  eth2:lsc