Fermilab Site Report
Mark O. Kaletka
Head, Core Support Services Department
Computing Division
CD mission statement
• The Computing Division’s mission is to play a
full part in the mission of the laboratory and in
particular:
• To proudly develop, innovate, and support
excellent and forefront computing solutions
and services, recognizing the essential role of
cooperation and respect in all interactions
between ourselves and with the people and
organizations that we work with and serve.
How we are organized
We participate in all areas
• Accelerator: Tevatron BPM project and many projects to help Run II luminosity goals; Accelerator Simulation
• High Energy Physics experiments: older fixed-target analysis, CDF, DZero, MiniBooNE, MINOS, MIPP, test beam, CMS, BTeV, and future proposals such as MINERvA, NOvA, and future kaon experiments
• Astrophysics experiments: Pierre Auger, CDMS, SDSS, and future proposals such as the SDSS extension, Joint Dark Energy Mission (SNAP), and the Dark Energy Survey
• Theory: Lattice QCD facility
Production system capacities
[Charts: growth in farms usage, growth in farms density, and projected growth of computers by fiscal year. Computing nodes grow from roughly 1,500 in FY04 to about 4,200 projected in FY08; server nodes grow from roughly 150 to about 430 over the same period.]
Projected power growth
[Chart: CD Computer Power Growth, projected vs. actual KVA from 1995 through 2009, on a scale of 0 to 2,500 KVA; the FCC maximum of 750 KVA is marked.]
Computer rooms
• Provide space, power & cooling for central computers
• Problem: increasing luminosity
– ~ 2600 computers in FCC
– Expect to add ~1,000 systems/year
– FCC has run out of power & cooling, cannot add utility
capacity
• New Muon Lab
– 256 systems for Lattice Gauge theory
– CDF early buys of 160 systems + 160 CDF existing systems
from FCC
– Developing plan for another room
• Wide Band
– Long-term phased plan FY04–08
– FY04/05 build: 2,880 computers (~$3M)
– Tape robot room in FY05
– FY06/07: ~3,000 computers
Storage and data movement
• 1.72 PB of data in ATL
– Ingest of ~100 TB/mo
• Many tens of TB fed to analysis programs each day
• Recent work:
– Parameterizing storage
systems for SRM
• Apply to SAM
• Apply more generally
– VO notions in storage
systems
FNAL Starlight dark fiber project
• FNAL dark fiber to Starlight
– Completion: mid-June 2004
– Initial DWDM configuration:
• One 10 Gb/s (LAN PHY) channel
• Two 1 Gb/s (OC-48) channels
• Intended uses of link
– WAN network R&D projects
– Overflow for production traffic:
• ESnet link to remain production network link
– Redundant offsite path
[Network diagram: the Fermilab border router, its 622 Mb/s ESnet production link to the ESnet Chicago PoP, and the new dark fiber to STARLIGHT; from STARLIGHT, 1–10 Gb/s links reach UKLight, I-Wire, CAnet, Abilene (Internet 2), CERN, former MREN sites, research networks, and the general internet. The key distinguishes production traffic from R&D traffic.]
General network improvements
• Core network upgrades
– Switch/router (Catalyst 6500s) supervisors
upgraded:
• 720 Gb/s switching fabric (Sup720s); provides
40Gb/s per slot
– Initial deployment of 10 Gb/s backbone links
• 1000Base-T support expanded
– Ubiquitous on computer room floors:
• New farms acquisitions supported on gigabit
ethernet ports
– Initial deployment in a few office areas
Network security improvements
• Mandatory node registration for network
access
– “Hotel-like” temporary registration utility for visitors
– System vulnerability scan is part of the process
• Automated network scan blocker deployed (a sketch of the idea follows this list)
– Based on quasi-real-time network flow data analysis
– Blocks outbound & inbound scans
• VPN service deployed
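The scan blocker itself is not described in detail on the slide; the following is a minimal Python sketch of the kind of flow analysis it implies, with an assumed CSV flow export, assumed field names, and an arbitrary threshold.

    import csv
    from collections import defaultdict

    SCAN_THRESHOLD = 100  # distinct (dst, port) pairs per source; an assumed cutoff

    def find_scanners(flow_csv_path):
        # Group flow records by source address and count distinct targets touched.
        targets = defaultdict(set)
        with open(flow_csv_path) as f:
            for row in csv.DictReader(f):  # expects src_ip,dst_ip,dst_port columns
                targets[row["src_ip"]].add((row["dst_ip"], row["dst_port"]))
        return [src for src, seen in targets.items() if len(seen) >= SCAN_THRESHOLD]

    if __name__ == "__main__":
        for ip in find_scanners("flows.csv"):
            # A production blocker would install a temporary router block here;
            # printing stands in for that action.
            print("possible scanner:", ip)

Run against each flow-export interval, this approximates the quasi-real-time behavior described above.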
Central services
• Email (a filtering sketch follows this list)
– Spam tagging in place (X-Spam-Flag: YES)
– Capacity upgrades for gateways, IMAP servers, virus scanning
– Redundant load sharing
• AFS
– Completely on OpenAFS
– SAN for backend storage
– TiBS backup system
– DOE-funded SBIR for performance investigations
• Windows
– Two-tier patching system for Windows
• 1st tier under control of OU (PatchLink)
• 2nd tier domain-wide (SUS)
• 0 Sasser infections post-implementation
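Because the gateways only tag suspect mail with X-Spam-Flag: YES, filing is left to a downstream filter. Below is a minimal sketch of such a filter in Python, assuming Maildir-format mail and a local Junk folder; the paths and folder names are hypothetical, not the site's actual setup.

    import mailbox
    import os

    inbox = mailbox.Maildir(os.path.expanduser("~/Maildir"), create=False)
    junk = mailbox.Maildir(os.path.expanduser("~/Maildir/.Junk"), create=True)

    # items() snapshots the mailbox, so removing messages while looping is safe.
    for key, msg in inbox.items():
        if msg.get("X-Spam-Flag", "").strip().upper() == "YES":
            junk.add(msg)      # file the tagged message into the Junk folder
            inbox.remove(key)  # and remove it from the inbox

Equivalent rules are more commonly expressed in a mail client or a server-side filter; the point is only that the added header is the advertised hook.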
Central services -- backups
• Site-wide backup plan is moving forward
– SpectraLogic T950-5
– 8 SAIT-1 drives
– Initial 450 tape capacity for 7TB pilot project
• Plan for modular expansion to over 200 TB
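As a rough consistency check, assuming SAIT-1 cartridges at their nominal native capacity of about 500 GB (a drive-specification figure, not one stated above):

    450 tapes × ~0.5 TB/tape ≈ 225 TB native capacity

which lines up with the planned modular expansion to over 200 TB.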
Computer security
• Missed by Linux rootkit epidemic
– but no theoretical reason for immunity
• Experimenting w/ AFS cross-cell authentication
– w/ Kerberos 5 authentication
– subtle ramifications
• DHCP registration process
– includes security scan, does not (yet) deny access
– a few VIPs have been tapped during meetings
• Vigorous self-scanning program (a results-database sketch follows this list)
– based on Nessus
– maintain database of results
– look especially for “critical vulnerabilities” (& deny access)
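A minimal sketch of the results-database side of such a self-scanning program, in Python with SQLite; the schema, the CSV export format, and the severity labels are assumptions rather than a description of the actual tooling.

    import csv
    import sqlite3

    db = sqlite3.connect("scan_results.db")
    db.execute("""CREATE TABLE IF NOT EXISTS findings
                  (host TEXT, plugin TEXT, severity TEXT, scanned TEXT)""")

    # Load one scan's exported results; column names here are assumptions.
    with open("nessus_export.csv") as f:
        rows = [(r["host"], r["plugin"], r["severity"], r["scanned"])
                for r in csv.DictReader(f)]
    db.executemany("INSERT INTO findings VALUES (?, ?, ?, ?)", rows)
    db.commit()

    # Hosts with critical findings are the candidates for denying network access.
    for (host,) in db.execute(
            "SELECT DISTINCT host FROM findings WHERE severity = 'critical'"):
        print("deny network access:", host)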
Run II – D0
• D0 reprocessed 600M events in fall 2003
– using grid-style tools; 100M of those events were processed offsite at 5 other facilities
– Farm production capacity is roughly 25M events per week
– MC production capacity is 1M events per week
– about 1B events/week on the analysis systems
• Linux SAM station on a 2 TB fileserver to serve the new analysis nodes
– next step in the plan to reduce D0min
– station has been extremely performant, expanding the Linux SAM cache
– station typically delivers about 15 TB of data and 550M events per week
• Rolled out a MC production system that has grid-style job submission
– JIM component of SAM-Grid
• Torque (sPBS) is in use on the most recent analysis nodes (a submission sketch follows this list)
– has been much more robust than PBS
• Linux fileservers are being used as "project" space
– physics group managed storage with high access patterns
– good results
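A minimal sketch of what submission to Torque looks like from a script, using only the standard qsub interface; the queue name, resource request, and executable are placeholders, not D0's actual production configuration.

    import subprocess
    import textwrap

    # A placeholder job script; queue, resources, and executable are made up.
    job_script = textwrap.dedent("""\
        #!/bin/sh
        #PBS -N d0_analysis_example
        #PBS -q analysis
        #PBS -l nodes=1:ppn=1,walltime=04:00:00
        cd $PBS_O_WORKDIR
        ./run_analysis.sh
        """)

    # qsub reads the script from stdin and prints the new job identifier.
    result = subprocess.run(["qsub"], input=job_script, text=True,
                            capture_output=True, check=True)
    print("submitted job", result.stdout.strip())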
MINOS & BTeV status
• MINOS
– data taking in early 2005
– using “standard” tools
• Fermi Linux
• General-purpose farms
• AFS
• Oracle
• enstore & dcache
• ROOT
• BTeV
– preparations for CD-1 review by DOE
• included review of online (but not offline) computing
• novel feature is that much of the Level2/3 trigger
software will be part of the offline reconstruction software
US-CMS computing
• DC04 Data Challenge and the preparation for the
computing TDR
– preparation for the Physics TDR (P-TDR)
– roll out of the LCG Grid service and federating it with the
U.S. facilities
• Develop the required Grid and Facilities infrastructure
– increase the facility capacity through equipment upgrades
– commission Grid capabilities through Grid2003 and LCG-1
efforts
– develop and integrate required functionalities and services
• Increase the capability of User Analysis Facility
– improve how a physicist would use facilities and software
– facilities and environment improvements
– software releases, documentation, web presence, etc.
US-CMS computing – Tier 1
• 136 Worker Nodes (Dual 1 U Xeon Servers and Dual 1U Athlon)
– 240 CPUs for Production (174 kSI2000)
– 32 CPUs for Analysis (26 kSI2000)
• All systems purchased in 2003 are connected over gigabit
• 37 TB of Disk Storage
– 24TB in Production for Mass Storage Disk Cache
• In 2003 we switched to SATA disks in external enclosures connected over Fibre Channel
• Only marginally more expensive than 3ware-based systems, and much easier to administer
– 5TB of User Analysis Space
• Highly available, high performance, backed-up space
– 8TB Production Space
• 70TB of Mass Storage Space
– Limited by tape purchases and not silo space
US-CMS computing
US-CMS computing – DC03 & Grid2003
• Over 72K CPU-hours used in a week
• 100 TB of data transferred across Grid3 sites
• Peak numbers of jobs approaching 900
• Average numbers during the daytime over 500
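A back-of-the-envelope reading of those numbers (an estimate, not a figure from the slide):

    72,000 CPU-hours / (7 days × 24 hours) ≈ 430 CPUs busy around the clock

which is consistent with daytime job counts above 500.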
US-CMS computing – DC04
[Chart: number of transferred files over the weeks from 1-Mar-2004 to 26-Apr-2004, on a scale of 0 to 20,000 files.]
1st LHC magnet leaving FNAL for
CERN
And our science has shown up in
some unusual journals…
“Her sneakers squeaked
as she walked down the
halls where Lederman
had walked. The 7th
floor of the high-rise
was where she did her
work, and she found
her way to the small,
functional desk in the
back of the pen.”