XSEDE-Ops-Overview

April 8, 2015
XSEDE Operations
Patricia Kovatch, Victor Hazlewood, Justin Whitt,
Randy Butler, Chris Jordan, Stephen McNally,
Steve Quinn, Troy Baer, Linda Winkler
XSEDE Operations
• Improve user productivity through enhanced
– Ease of use
– Reliability
– Quality assurance
• Track metrics to gauge our success and
continually improve
Operations (1.5 FTE)
Patricia Kovatch (.5)
Victor Hazlewood (.5)
Justin Whitt (.5)
NICS
Software Support – 3.25 FTEs
Troy Baer, NICS (.5)
Stuart Martin, GRAM, UChicago (.25)
Raj Kettimuthu, GridFTP, UChicago (.25)
Tom Howe, Registry, UChicago (.5)
PSC (1.25)
TACC (.5)
Networking – 3.25 FTEs
Linda Winkler, UChicago (.25)
Paul Wefel, NCSA (.25)
Matt Ezell, NICS (1)
Kathy Benninger, PSC (.5)
Chris Rapier (.25)
Joe Lappa, PSC (.5)
William Jones, TACC (.5)
Accounting and Account Management – 1.5 FTEs
Steve Quinn, NCSA (.5)
Ester Soriano, NCSA (.75)
Ed Hanna, PSC (.25)
Systems Operational Support – 12 FTEs
Stephen McNally, NICS (.5)
Mike Lowe, IU (1)
Justin Miller, IU (1)
Nada Cagle, NCSA (1)
Mark Fredericksen, NCSA (1)
Mike Pingleton, NCSA (1)
Frank Wells, NCSA (1)
Rolf Wilson, NCSA (1)
Tom Johnson, IU (.5)
Dave Lifka, Cornell (.25)
Tim Bouvet, NCSA (.25)
Wayne Louis Hoyenga, NCSA (.25)
Rick Mohr, NICS (.5)
Dave Carver, TACC (.75)
Leo Carson, SDSC (.5)
Shava Smallen, SDSC (.5)
Tom Howe, IaaS/SaaS, UChicago (.5)
Byron Gill, PSC (.1)
Anjana Kar, PSC (.2)
Kevin Sullivan, PSC (.1)
Jared Yanovich, PSC (.1)
Security – 4.25 FTEs
Randy Butler, NCSA (.25)
Jim Marsteller, PSC (.5)
Adam Fest, PSC (.5)
Nathaniel Mendoza, TACC (.75)
Victor Hazlewood, NICS (.5)
Ryan Braby, NICS (.5)
James Barlow, NCSA (1)
Jim Basney, NCSA (.25)
Data Services – 2.25 FTEs
Chris Jordan, TACC (.25)
Jack Kordas, UChicago (.5)
Chad Kerner, NCSA (.25)
Rick Mohr, NICS (.5)
Josephine Palencia, PSC (.5)
Tomislav Urban, TACC (.25)
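The per-person and per-site fractions listed for each area can be checked against the stated FTE totals. The sketch below is only a consistency check on the figures as transcribed from the staffing slides; the `STAFFING` structure and the `check_staffing` helper are illustrative names, not part of any XSEDE tooling.

```python
from decimal import Decimal

# Stated total and individual fractions per area, copied from the staffing lists.
# Decimal is used so that values such as .1 and .2 add up exactly.
STAFFING = {
    "Operations": ("1.5", [".5", ".5", ".5"]),
    "Software Support": ("3.25", [".5", ".25", ".25", ".5", "1.25", ".5"]),
    "Networking": ("3.25", [".25", ".25", "1", ".5", ".25", ".5", ".5"]),
    "Accounting and Account Management": ("1.5", [".5", ".75", ".25"]),
    "Systems Operational Support": (
        "12",
        [".5", "1", "1", "1", "1", "1", "1", "1", ".5", ".25", ".25",
         ".25", ".5", ".75", ".5", ".5", ".5", ".1", ".2", ".1", ".1"],
    ),
    "Security": ("4.25", [".25", ".5", ".5", ".75", ".5", ".5", "1", ".25"]),
    "Data Services": ("2.25", [".25", ".5", ".25", ".5", ".5", ".25"]),
}


def check_staffing(staffing):
    """Return {area: (stated_total, summed_total)} with Decimal values."""
    report = {}
    for area, (stated, shares) in staffing.items():
        report[area] = (Decimal(stated), sum(Decimal(s) for s in shares))
    return report


if __name__ == "__main__":
    for area, (stated, summed) in check_staffing(STAFFING).items():
        flag = "ok" if stated == summed else "MISMATCH"
        print(f"{area}: stated {stated}, summed {summed} ({flag})")
```

With the figures as transcribed, each area's entries add up to its stated total.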
Deliverables and Goals
1. Security
Deploy XSEDE Certificate Authority, deploy two-factor authentication
service, federate two-factor authentication with BW, perform campus
bridging with InCommon, provide security auditing services for
XSEDE-connected hosts, coordinate response to resource intrusion events;
2. Data Services
Deploy XSEDE-wide parallel file system, coordinate data movement and
management services, and develop a framework for distributed archival
replication;
3. Networking
Facilitate end-to-end performance for users, transition to XSEDEnet, peer
with R&E network;
Deliverables and Goals (continued)
4. Software Support
Deploy and perform acceptance testing of new capabilities and services
into the production XSEDE environment, provide feedback to developers;
5. Accounting and Account Management
Maintain the current TeraGrid (TG) automatic distributed accounting and
account management service, streamline the account creation process,
improve user access to statistics;
6. Systems Operational Support
Provide frontline user support, systems administration for all centralized
XSEDE services and monitoring through the 24x7 XSEDE Operations
Center
XSEDE Services
Service | Primary Location | Replication Location
Account and allocation management, usage reporting database servers, and the XD Central Database (XDCDB) | SDSC | PSC
User allocation online request web and database servers (POPS) | NCSA | PSC
XD user portal, collaboration and social networking servers | TACC | NICS
User ticket system database and servers | TACC | NICS
24x7 computing and networking operations servers and displays for monitoring | NCSA | IU
Website, documentation and document repository servers | TACC | NICS
User news mailing list and email servers | TACC | NICS
Online tutorials with CI-tutor and Virtual Workshop servers | NCSA | PSC
xsede.org DNS | NCSA | TACC
XSEDE Services (continued)
Service | Primary Location | Replication Location
Grid identity management including Certificate Authority, Public Key Infrastructure, MyProxy servers | NCSA | PSC
Two-factor authentication servers | NICS | PSC
Two-factor authentication token | NICS | PSC
Inter-SP area parallel file system servers and disk | NICS | Each SP as appropriate
Initial archive replication service | NICS | TACC
XD cross-site security logging aggregation service | PSC | NICS
Grid Interface Unit and Resource Namespace Servers (RNS) | Every SP | N/A
Grid services monitoring servers | PSC | TACC
Knowledge Base | IU | N/A
VM hosting | IU | N/A
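The primary and replication assignments on this slide can be treated as a small lookup table, for example to list the services with no replication location or to count primaries per site. The following is a minimal sketch using the rows as transcribed, with service names shortened; the `SERVICES` list and both helper functions are hypothetical names chosen for illustration, and `None` stands in for the slide's "N/A".

```python
from collections import Counter

# (service, primary location, replication location) rows from the slide above.
# None represents "N/A" in the Replication Location column.
SERVICES = [
    ("Grid identity management (CA, PKI, MyProxy)", "NCSA", "PSC"),
    ("Two-factor authentication servers", "NICS", "PSC"),
    ("Two-factor authentication token", "NICS", "PSC"),
    ("Inter-SP area parallel file system", "NICS", "Each SP as appropriate"),
    ("Initial archive replication service", "NICS", "TACC"),
    ("XD cross-site security logging aggregation", "PSC", "NICS"),
    ("Grid Interface Unit and RNS", "Every SP", None),
    ("Grid services monitoring servers", "PSC", "TACC"),
    ("Knowledge Base", "IU", None),
    ("VM hosting", "IU", None),
]


def without_replica(services):
    """Names of services whose replication location is listed as N/A."""
    return [name for name, _primary, replica in services if replica is None]


def primaries_by_site(services):
    """Count how many services each site hosts as the primary location."""
    return Counter(primary for _name, primary, _replica in services)


if __name__ == "__main__":
    print("No replication location:", without_replica(SERVICES))
    print("Primaries per site:", dict(primaries_by_site(SERVICES)))
```

A service listed as "N/A" is not necessarily a single point of failure: the Grid Interface Unit and RNS row, for instance, is hosted at every SP.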
Operational Metrics
• Cybersecurity
– Security events, logins and login types, security items deployed, security awareness training events
• Data management and coordination
– Wide area parallel file system usage and uptime
• Networking
– Network uptime and usage
• Software maintenance and coordination
– Software deployment issues and resolution
Operational Metrics – cont’d
• Accounting and account management
– Account creation time for PI and non-PI (Goal: Decrease account creation time to within five business days)
• System operational support
– Deliver 95% uptime on critical centralized services
– Respond meaningfully to all tickets within 24 hours
– Close 80% of all tickets within two business days
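The system operational support targets above (95% uptime, a response within 24 hours, 80% of tickets closed within two business days) lend themselves to a simple calculation. The sketch below shows one way such figures might be computed. The `Ticket` record, the function names, and the business-day rule (weekdays only, holidays ignored, counted from the day after a ticket is opened) are all assumptions made for illustration; the slides do not say how XSEDE measures these.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    """Hypothetical ticket record; field names are illustrative."""
    opened: datetime
    first_response: Optional[datetime]
    closed: Optional[datetime]


def business_days_between(start: datetime, end: datetime) -> int:
    """Count weekdays after start's date, up to and including end's date."""
    days = 0
    current = start.date()
    while current < end.date():
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days += 1
    return days


def uptime_percent(total_hours: float, downtime_hours: float) -> float:
    """Percentage of the reporting period during which a service was up."""
    return 100.0 * (total_hours - downtime_hours) / total_hours


def ticket_metrics(tickets):
    """Fractions of tickets answered within 24 hours and closed within two business days."""
    if not tickets:
        raise ValueError("no tickets in the reporting period")
    responded = sum(
        1 for t in tickets
        if t.first_response is not None
        and t.first_response - t.opened <= timedelta(hours=24)
    )
    closed = sum(
        1 for t in tickets
        if t.closed is not None
        and business_days_between(t.opened, t.closed) <= 2
    )
    n = len(tickets)
    return {
        "responded_within_24h": responded / n,
        "closed_within_2_business_days": closed / n,
    }


def meets_goals(uptime: float, metrics: dict) -> bool:
    """Check the three targets stated on the slide."""
    return (
        uptime >= 95.0
        and metrics["responded_within_24h"] == 1.0
        and metrics["closed_within_2_business_days"] >= 0.80
    )
```

Under this counting rule, a ticket opened on a Friday afternoon and closed the following Tuesday counts as closed within two business days.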
Review of activities to July 1
1.1.3.1 Deploy grid middleware infrastructure
1.1.3.3 Deploy account management software
1.1.3.4 Deploy information services infrastructure
1.1.3.5 Deploy common user environment
1.1.3.6 Deploy system of systems test environment
1.1.4.2 Deploy XSEDE website servers
1.2.1.1 Coordinate XSEDE security incident response
1.2.4.1 Test XSEDE software
1.2.6.1 Set up XSEDE Operations Center
1.2.3.1 Transition to XSEDEnet
1.3.2.1 Set up and populate xsede.org DNS
Review of activities to July 1 (continued)
1.2.6.5 Migrate AMIE to a standalone server, off of
XDCDB, at both the primary and secondary sites
1.2.6.5 Upgrade XDCDB hardware at SDSC
1.3.2.1 Deploy XSEDE User Portal (XUP) servers
Preview of year 1 activities
1.1.3.2 Deploy data management software
1.2.1.1 Deploy XSEDE Certificate Authority (CA)
1.2.1.2 Develop security awareness program
1.2.1.3 Deploy security authentication program
1.2.1.4 Deploy security tools
1.2.1.5 Deploy security infrastructure
1.2.1.6 Deploy InCommon authentication service
1.2.2.1 Deploy global parallel file system
1.2.2.2 Design archival replication framework
Ongoing
1.2.3.1 Maintain and monitor XSEDEnet
1.2.3.2 Tune end-to-end performance
1.2.4.1/2 Test and deploy XSEDE software
1.2.5.1 Maintain accounting and account management
databases
1.2.5.2 Provide usage reports
1.2.6.1 Provide frontline user support through the 24x7 XSEDE
Operations Center (XOC)
1.2.6.2 Deploy and support XSEDE system infrastructure
1.2.6.3 Support deployment of security tools/infrastructure
1.2.6.4 Report operational metrics (yearly)
DNS transition plan
• Ops Networking leading the DNS transition
• xsede.org primary service moving to NCSA,
backup at TACC
• Delegation of {site}.xsede.org to sites
• XSEDE staff should review DNS needs
– Determine teragrid.org entries to duplicate
– Determine new xsede.org entries
– Review and coordinate with XSEDE L3 manager
• XSEDE L3 manager or delegate submits DNS
requests in a TG help ticket
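The review step in the plan above (deciding which teragrid.org entries to duplicate, and whether a new name falls under a delegated {site}.xsede.org zone) could be sketched as follows. This is only an illustration under stated assumptions: the `DELEGATED_SITES` set, the host names in the usage section, and both function names are hypothetical, and the slides do not list which sites actually receive a delegated zone.

```python
OLD_ZONE = "teragrid.org"
NEW_ZONE = "xsede.org"

# Assumed set of site labels with a delegated {site}.xsede.org zone.
# The slides name these sites but do not say which ones take a delegation.
DELEGATED_SITES = {"ncsa", "tacc", "psc", "nics", "sdsc", "iu"}


def to_xsede(name: str) -> str:
    """Map a teragrid.org name to the equivalent xsede.org name."""
    name = name.rstrip(".").lower()
    if name == OLD_ZONE:
        return NEW_ZONE
    suffix = "." + OLD_ZONE
    if not name.endswith(suffix):
        raise ValueError(f"{name!r} is not under {OLD_ZONE}")
    return name[: -len(suffix)] + "." + NEW_ZONE


def responsible_zone(name: str, sites=DELEGATED_SITES) -> str:
    """Return the zone that would hold the record for an xsede.org name."""
    name = name.rstrip(".").lower()
    if name != NEW_ZONE and not name.endswith("." + NEW_ZONE):
        raise ValueError(f"{name!r} is not under {NEW_ZONE}")
    # Labels to the left of xsede.org; the rightmost one is the candidate site.
    labels = name[: -len(NEW_ZONE)].rstrip(".").split(".")
    site = labels[-1]
    if site in sites:
        return f"{site}.{NEW_ZONE}"
    return NEW_ZONE


if __name__ == "__main__":
    # Hypothetical entries, used only to show the mapping.
    for old in ["www.teragrid.org", "login.ncsa.teragrid.org"]:
        new = to_xsede(old)
        print(f"{old} -> {new} (zone: {responsible_zone(new)})")
```

In this sketch, names that fall in a delegated zone would be handled by the site, while everything else would go to the central xsede.org service through the help-ticket request described above.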