Geoff Quigley, Stephen Childs and Brian Coghlan
Trinity College Dublin
e-INIS
Regional Datastore @TCD
• Recent storage procurement
• Physical infrastructure
• 10Gb networking
• Simple lessons learned
STEP09 experiences
Monitoring
• Network (STEP09)
• Storage
The Irish National e-Infrastructure
Funds Grid-Ireland Operations Centre
Creating a National Datastore
• Multiple Regional Datastores
• Ops Centre runs TCD regional datastore
For all disciplines
• Not just science & technology
Projects with (inter)national dimension
Central allocation process
Grid and non-grid use
Grid-Ireland @ TCD already had
• Dell PowerEdge 2950 (2x quad-core Xeon)
• Dell MD1000 (SAS, JBOD)
After procurement the datastore has
• 8x Dell PE2950 (6x 1TB disks, 10GbE)
• 30x MD1000, each with 15x 1TB disks
• ~11.6 TiB each after RAID6 and XFS format (~350 TiB total)
• 2x Dell blade chassis with 8x M600 blades each
• Dell tape library (24x Ultrium 4 tapes)
• HP ExDS9100 with 4 capacity blocks of 82x 1TB disks each and 4 blades
• ~233 TiB total available for NFS/http export

DPM installed on Dell hardware
• ~100 TB for Ops Centre to allocate
• Rest for Irish users via allocation process
• May also try to combine with iRODS
HP ExDS high-availability store
• iRODS primarily
• NFS exports
• Not for conventional grid use
• Bridge services on blades for community-specific access patterns
Room needed upgrade
• Another cooler
• UPS maxed out
New high-current AC circuits added
2x 3kVA UPS per rack acquired for Dell equipment
ExDS has 4x 16A three-phase (3Ø) feeds: 2 on room UPS, 2 raw
10 GbE to move data!
Benchmarked with netperf (example invocations sketched below)
• http://www.netperf.org
Initially 1-2 Gb/s… not good
Had machines that produced figures of 4 Gb/s+
• What’s the difference?
Looked at a couple of documents on this:
• http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf
• http://docs.sun.com/source/819-0938-13/D_linux.html
Tested various of these optimisations
• Initially little improvement (~100 Mb/s)
• Then identified the most important changes
Cards fitted to wrong PCI-E port
• Were x4 instead of x8
New kernel version
• New kernel supports MSI-X (multiqueue)
• Was saturating one core, now distributes interrupts
Increase MTU (from 1500 to 9216)
• Large difference to netperf
• Smaller difference to real loads
Then compared two switches with direct connection
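As a rough illustration of the kind of benchmark runs used here (the slides do not record the exact options used on the day, so treat this as a sketch): netserver runs on the receiving host, and netperf drives 60-second bulk-transfer and request/response tests from the sender.

# On the receiving host: start the netperf server daemon
netserver

# On the sending host: 60 s TCP bulk-transfer test (reports Mbit/s)
netperf -H <receiver> -l 60 -t TCP_STREAM

# 60 s TCP request/response test (reports transactions/s)
netperf -H <receiver> -l 60 -t TCP_RR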
[Chart: netperf 60s transfer test (Mbits/sec), host pairs A-B and C-D, solo and simultaneous, comparing direct connection, Force10 switch and Arista switch, with repeat results for the Arista switch]
[Chart: netperf 60s TCP Request/Response test (requests/sec), same host pairs and configurations]



Storage was mostly in place
10GbE was there but being tested
• Brought into production early in STEP09
Useful exercise for us
• See bulk data transfer in conjunction with user access to stored data
• The first large 'real' load on the new equipment

Grid-Ireland OpsCentre at TCD involved as Tier-2 site
• Associated with NL Tier-1
Peak traffic observed during STEP09
Data transfers into TCD from NL
• Peaked at 440 Mbit/s (capped at 500)
• Recently upgraded firewall box coped well
[Graph: HEAnet view of the GEANT link]
[Graph: TCD view of the Grid-Ireland link]
Lots of analysis jobs
• Running on cluster nodes
• Accessing large datasets directly from storage
• Caused heavy load on network and disk servers
• Caused problems for other jobs accessing storage
• Now known that access patterns were pathological
Also production jobs
[Graph: storage server network traffic during STEP09, with ATLAS production, ATLAS analysis and LHCb production activity marked; almost all data was stored on this one server, so 3x 1Gbit bonded links were set up]
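The slides only note that a 3x 1Gbit bond was configured on the heavily loaded disk server; the bonding mode, interface names and addresses in this SL4/SL5-style sketch are assumptions for illustration, not the actual configuration used.

# Hypothetical three-NIC bond on an SL4/SL5 disk server
# /etc/modprobe.conf
#   alias bond0 bonding
#   options bond0 mode=balance-alb miimon=100      # mode is an assumption
# /etc/sysconfig/network-scripts/ifcfg-bond0
#   DEVICE=bond0  IPADDR=10.0.0.10  NETMASK=255.255.255.0  ONBOOT=yes  BOOTPROTO=none
# /etc/sysconfig/network-scripts/ifcfg-eth0 (likewise eth1, eth2)
#   DEVICE=eth0  MASTER=bond0  SLAVE=yes  ONBOOT=yes  BOOTPROTO=none

# After restarting the network, check slave and link status
cat /proc/net/bonding/bond0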


Fix to distinguish filesystems with identical names on different servers
Fixed display of long labels
Display space token stats in TB
New code for pool stats
Pool stats first to use DPM C API
• Previously everything was done via MySQL
Was able to merge some of these fixes
• Time-consuming to contribute patches
• Single “maintainer” with no dedicated effort …
MonAMI useful but future uncertain
• Should UKI contribute effort to plugin development?
• Or should similar functionality be created for “native” Ganglia? (a sketch follows below)
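If similar functionality were reimplemented for Ganglia directly, one simple route (a sketch only, not what was actually done) is a cron-driven script that publishes pool statistics as custom metrics with gmetric; the metric names and the variables holding the pool figures below are illustrative assumptions, with the real values coming from the DPM API, dpm-qryconf or the database.

# Sketch: push DPM pool statistics into Ganglia as custom metrics
POOL="atlas"            # hypothetical pool name
FREE_TB="42.7"          # value obtained from DPM (API, dpm-qryconf or DB)
CAPACITY_TB="100.0"

gmetric --name "dpm_pool_${POOL}_free"     --value "$FREE_TB"     --type float --units TB
gmetric --name "dpm_pool_${POOL}_capacity" --value "$CAPACITY_TB" --type float --units TB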

Recent procurement gave us a huge increase in capacity
STEP09 great test of data paths into and within our new infrastructure
Identified bottlenecks and tuned configuration
• Back-ported SL5 kernel to support 10GbE on SL4
• Spread data across disk servers for load-balancing
• Increased capacity of cluster-storage link
• Have since upgraded switches
Monitoring crucial to understanding what’s going on
• Weathermap for quick visual check
• Cacti for detailed information on network traffic
• LEMON and Ganglia for host load, cluster usage, etc.
Thanks for your attention!


Ganglia monitoring system
• http://ganglia.info/
Cacti
• http://www.cacti.net/
Network weathermap
• http://www.network-weathermap.com/
MonAMI
• http://monami.sourceforge.net/
Quotas are close to becoming essential for us
10GbE problems have highlighted that releases on new platforms are needed far more quickly
Firewall: 1Gb outbound, 10Gb internally
M8024 switch in ‘bridge’ blade chassis
• 24-port (16 to blades) layer 3 switch
Force10 switch is the main ‘backbone’
• 10GbE cards in DPM servers
• 10GbE uplink from ‘National Servers’ 6224 switch
10GbE copper (CX4) from ExDS to M6220 in 2nd blade chassis
• Link between the 2 blade chassis: M6220 - M8024
4-way LAG Force10 - M8024
 24
port 10Gb switch
 XFP modules
• Dell supplied our XFPs so cost per port reduced
 10Gb/s
only
 Layer 2 switch
 Same Fulcrum ASIC as Arista switch
tested
• Uses a standard reference implementation
Arista Networks 7124S 24-port switch
SFP+ modules
• Low cost per port (switches relatively cheap too)
‘Open’ software - Linux
• Even has bash available
• Potential for customisation (e.g. iptables being ported)
Can run 1Gb/s and 10Gb/s simultaneously
• Just plug in the different SFPs
Layer 2/3
• Some docs refer to layer 3 as a software upgrade
Our 10GbE cards are Intel PCI-E 10GBASE-SR
Dell had plugged most into the x4 PCI-E slot
An error was coming up in dmesg (a quick way to check the negotiated link width is sketched below)
Trivial solution:
• I moved the cards to x8 slots
Now can get >5Gb/s on some machines
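One quick way to confirm the negotiated PCI Express link width is to look at the LnkCap/LnkSta lines in lspci output for the 10GbE card; the bus address below is hypothetical.

# Find the 10GbE NIC's bus address, then inspect its link capability and status
lspci | grep -i ethernet
lspci -vv -s 0a:00.0 | grep -E 'LnkCap|LnkSta'
# LnkCap shows the width the card supports (x8);
# LnkSta shows what was actually negotiated (x4 in the wrongly placed slots)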
Maximum Transmission Unit
• Ethernet spec says 1500
• Most hardware/software can support jumbo frames
ixgbe driver allowed MTU=9216
• Must be set through whole path (see the sketch below)
• Different switches have different max value
Makes a big difference to netperf
Example on SL5 machines, 30s tests:
• MTU=1500, TCP stream at 5399 Mb/s
• MTU=9216, TCP stream at 8009 Mb/s
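A minimal sketch of raising the MTU and verifying that jumbo frames survive the whole path; the interface name and remote host are assumptions, and every switch along the way must also be configured for the larger frame size.

# Raise the MTU on the 10GbE interface (add MTU=9216 to its ifcfg file to persist)
ifconfig eth2 mtu 9216

# Verify end to end: send an unfragmentable jumbo-sized ping.
# 8972 bytes of payload + 28 bytes of IP/ICMP headers exercises a 9000-byte path;
# if any hop is still at MTU 1500 this will fail.
ping -M do -s 8972 <remote-host>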
Machines on SL4 kernels had very poor receive performance (50 Mb/s)
One core was 0% idle
• Use mpstat -P ALL
• Sys/soft used up the whole core
/proc/interrupts showed PCI-MSI used
All RX interrupts went to one core
New kernel had MSI-X and multiqueue
• Interrupts distributed, full RX performance (example listing and checks below)











-bash-3.1$ grep eth2 /proc/interrupts
114:      247   694613  5597495  1264609   426508  2089709     1103    15322   PCI-MSI-X  eth2:v0-Rx
122:      657  2401390   462620   499858  1660625  1098900   644629      234   PCI-MSI-X  eth2:v1-Rx
130:      220   600108   453070   560354   468223  3059723  1937777   128178   PCI-MSI-X  eth2:v2-Rx
138:       27   764411  1621884  1226975   497416  2110542   839601      473   PCI-MSI-X  eth2:v3-Rx
146:       37   171163   418685   349575   574859  2744006  1809175    17262   PCI-MSI-X  eth2:v4-Rx
154:       27   251647   210168     1889  2018363  2834302   795228   137892   PCI-MSI-X  eth2:v5-Rx
162:       27    85615  2221420   286245   415259  1628786   779341      363   PCI-MSI-X  eth2:v6-Rx
170:       27  1119768  1060578   892101   495187  2266459  1312734      813   PCI-MSI-X  eth2:v7-Rx
178:  1834310   371384   149915   104323      461  2405659    27463 16021786   PCI-MSI-X  eth2:v8-Tx
186:       45        0      158        0       23        0        0        1   PCI-MSI-X  eth2:lsc
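To see whether receive interrupts are spread across cores, as in the listing above, or pinned to a single one, the checks described earlier can be run directly; a brief sketch:

# Per-core CPU usage: on the old SL4 kernel one core sat at 0% idle,
# with sys/soft-interrupt time consuming it entirely
mpstat -P ALL 1

# Watch the per-CPU interrupt counters for the 10GbE interface;
# with MSI-X and multiqueue the Rx vectors increment on different cores
watch -n1 'grep eth2 /proc/interrupts'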