Grids: TACC Case Study
Ashok Adiga, Ph.D.
Distributed & Grid Computing Group
Texas Advanced Computing Center
The University of Texas at Austin
[email protected]
(512) 471-8196
Outline
• Overview of TACC Grid Computing Activities
• Building a Campus Grid – UT Grid
• Addressing Common Use Cases
  – Scheduling & Workflow
  – Grid Portals
• Conclusions
TACC Grid Program
• Building Grids at TACC
  – Campus Grid (UT Grid)
  – State Grid (TIGRE)
  – National Grid (ETF)
• Grid Hardware Resources
  – Wide range of hardware resources available to the research community at UT and partners
• Grid Software Resources
  – NMI components, NPACKage
  – User portals: GridPort
  – Job schedulers: LSF MultiCluster, Community Scheduler Framework
  – United Devices (desktop grids)
• Significantly leveraging NMI components and experience
TACC Resources: Providing Comprehensive, Balanced Capabilities
HPC
• Cray-Dell cluster: 600 CPUs, 3.67 Tflops, 0.6 TB memory, 25 TB disk
• IBM Power4 system: 224 CPUs, 1.16 Tflops, 0.5 TB memory, 7.1 TB disk

Visualization
• SGI Onyx2: 24 CPUs, 25 GB memory, 6 IR2 graphics pipes
• Sun V880z: 4 CPUs, 2 Zulu graphics pipes
• Dell/Windows cluster: 18 CPUs, 9 NVIDIA NV30 cards (soon)
• Large immersive environment and 10 large, tiled displays

Data storage
• Sun SAN: 12 TB across research and main campuses
• STK PowderHorn silo: 2.8 PB capacity

Networking
• Nortel 10GigE DWDM: between machine room and vislab building
• Force10 switch-routers: 1.2 Tbps, in machine room and vislab building
• TeraBurst V20s: OC48 video capability for remote, collaborative 3D visualization
TeraGrid (National)
• NSF Extensible Terascale Facility (ETF) project
  – Build and deploy the world's largest, fastest distributed computational infrastructure for general scientific research
  – Current members: San Diego Supercomputer Center, NCSA, Argonne National Laboratory, Pittsburgh Supercomputing Center, California Institute of Technology
  – Currently has a 40 Gbps backbone with hubs in Los Angeles & Chicago
  – 3 new members added in September 2003:
    The University of Texas (led by TACC)
    Oak Ridge National Laboratory
    Indiana U/Purdue U
TeraGrid (National)
• UT awarded $3.2M to join NSF ETF in September 2003
  – Establish 10 Gbps network connection to the ETF backbone
  – Provide access to high-end computers capable of 6.2 teraflops, a new terascale visualization system, and a 2.8-petabyte mass storage system
  – Provide access to geoscience data collections used in environmental, geological, climate, and biological research:
    high-resolution digital terrain data
    worldwide hydrological data
    global gravity data
    high-resolution X-ray computed tomography data
• Current software stack includes: Globus (GSI, GRAM, GridFTP), MPICH-G2, Condor-G, GPT, MyProxy, SRB
TIGRE (State-wide Grid)
• Texas Internet Grid for Research and Education
  – Computational grid to integrate computing & storage systems, databases, visualization laboratories and displays, and instruments and sensors across Texas
  – Current TIGRE participants:
    Rice
    Texas A&M
    Texas Tech University
    Univ of Houston
    Univ of Texas at Austin (TACC)
  – Grid software for TIGRE testbed:
    Globus, MPICH-G2, NWS, SRB
    Other local packages must be integrated
    Goal: track NMI GRIDS
UT Grid (Campus Grid)
• Mission: integrate and simplify the usage of the diverse computational, storage, visualization, data, and instrument resources of UT to facilitate new, powerful paradigms for research and education.
• UT Austin participants:
  – Texas Advanced Computing Center (TACC)
  – Institute for Computational Engineering & Sciences (ICES)
  – Information Technology Services (ITS)
  – Center for Instructional Technologies (CIT)
  – College of Engineering (COE)
What is a Campus Grid?
• Important differences from enterprise grids
  – Researchers are generally more independent than in a company with a tight focus on mission and profits
  – No central IT group governs researchers' systems; systems are paid for out of grants, so authority is distributed: owners of PCs and clusters have total control, and reconfigure and participate only if willing
  – Lots of heterogeneity; lots of low-cost, poorly-supported systems
  – Accounting potentially less important: the focus on increasing research effectiveness allows tackling problems early (scheduling, workflow, etc.)
UT Grid: Approach
• Unique characteristics present opportunities
  – Some campus researchers want to be on the bleeding edge, unlike commercial enterprises
  – TACC provides the high-end systems that researchers require
  – Campus users initially have trust relationships with TACC, but not with each other
• How to build a campus grid:
  – Build a hub & spoke grid first
  – Address both productivity and grid R&D
UT Grid: Logical View
1. Integrate distributed TACC resources first (Globus, LSF, NWS, SRB, United Devices, GridPort).
   [Diagram: TACC HPC, Vis, and Storage — actually spread across two campuses]
2. Next add other UT resources, one building at a time, as spokes using the same tools and procedures.
   [Diagram: ICES, GEO, BIO, and PGE clusters added as spokes around the TACC hub]
3. Finally, negotiate connections between spokes for willing participants to develop a P2P grid.
   [Diagram: direct connections negotiated between spoke clusters, e.g. GEO and ICES]
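The hub & spoke build-out above can be sketched as a small graph model: every resource first links only to the TACC hub, and direct peer links are added later only between willing spokes. This is an illustrative sketch; the cluster names mirror the diagram, but the class and its methods are assumptions, not actual UT Grid software.

```python
"""Sketch of the hub & spoke campus-grid build-out (illustrative only)."""

class CampusGrid:
    def __init__(self, hub):
        self.hub = hub
        self.links = set()          # undirected resource-to-resource links

    def add_spoke(self, name):
        # Step 2: each new resource first connects only to the hub.
        self.links.add(frozenset((self.hub, name)))

    def peer(self, a, b):
        # Step 3: willing spokes negotiate direct P2P connections.
        self.links.add(frozenset((a, b)))

grid = CampusGrid("TACC")
for spoke in ("ICES", "GEO", "BIO", "PGE"):
    grid.add_spoke(spoke)
grid.peer("ICES", "GEO")            # only willing participants connect

print(len(grid.links))              # → 5 (4 hub links + 1 peer link)
```

The point of the model is that trust starts at the hub (everyone trusts TACC) and P2P links are opt-in additions, matching the trust relationships described on the "UT Grid: Approach" slide.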
UT Grid: Physical View
[Diagram: research campus and main campus connected via GAATN; TACC clusters, Power4, and storage behind the research-campus NOC and CMS switch, with external network links; PGE clusters behind the PGE switch and TACC Vis plus ICES clusters behind the ACES switch on the main campus]
UT Grid: Focus
• Address users interested only in increased productivity
  – Some users just want to be more productive with TACC resources and their own (and others'): scheduling throughput, data collections, workflow
  – Install only 'lowest common denominator' software on TACC production resources and user spokes for productivity: Globus 2.x, GridPort 2.x, WebSphere, LSF MultiCluster, SRB, NWS, United Devices, etc.
UT Grid: Focus
• Address users interested in grid R&D issues
  – Some users want to conduct grid-related R&D: grid scheduling, performance modeling, meta applications, P2P storage, etc.
  – Also install bleeding-edge software to support grid R&D on the TACC testbed and willing spoke systems: Globus 3.0 and other OGSA software, GridPort 3.x, Community Scheduler Framework, etc.
Scheduling & Workflow
• Use Case: A researcher wants to run a climate modeling job on a compute cluster and view results using a specified visualization resource
• Grid middleware requirements:
  – Schedule the job to the "best" compute cluster
  – Forward results to the specified visualization resource
  – Support advanced reservations on the visualization resource
• Currently solved using LSF MultiCluster & Globus (GSI, GridFTP, GRAM)
• Evaluating the CSF meta-scheduler for future use
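The use case above has two decisions: pick the "best" cluster, then forward results to a named visualization resource. The sketch below models that logic only; cluster names, the load numbers, and the shortest-queue policy are illustrative assumptions (in the real deployment the submission is a GRAM job and the transfer is a GridFTP copy, not Python calls).

```python
"""Sketch of the scheduling & workflow use case (illustrative policy)."""

from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    free_cpus: int      # e.g. as reported by a monitor such as NWS
    queue_depth: int    # jobs currently waiting

def best_cluster(clusters, cpus_needed):
    """Choose a cluster with enough free CPUs and the shortest queue."""
    eligible = [c for c in clusters if c.free_cpus >= cpus_needed]
    if not eligible:
        raise RuntimeError("no cluster can satisfy the request")
    return min(eligible, key=lambda c: c.queue_depth)

def run_and_forward(clusters, cpus_needed, vis_resource):
    """Schedule the job, then report where results would be forwarded."""
    target = best_cluster(clusters, cpus_needed)
    # Real middleware: GRAM submission here, GridFTP transfer after.
    return target.name, f"gridftp://{vis_resource}/results"

clusters = [Cluster("hpc-a", free_cpus=128, queue_depth=5),
            Cluster("hpc-b", free_cpus=64, queue_depth=1)]
print(run_and_forward(clusters, 32, "vislab.example.edu"))
# → ('hpc-b', 'gridftp://vislab.example.edu/results')
```

Both clusters can host the 32-CPU job, so the shorter queue wins; a production meta-scheduler would fold in advanced reservations on the visualization side as well.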
What is CSF?
• CSF (Community Scheduler Framework): open-source meta-scheduler framework contributed by Platform Computing to Globus for possible inclusion in the Globus Toolkit
  – Developed with the latest version of OGSI, a grid guideline being developed within the Global Grid Forum (OGSA)
  – Extensible framework for implementing meta-schedulers:
    Supports heterogeneous workload execution software (LSF, PBS, SGE)
    Negotiates advanced reservations (WS-Agreement)
    Selects the best resource for a given job based on specified policies
  – Provides a standard API to submit and manage jobs
Example CSF Configuration
[Diagram: two GT3.0 virtual organizations, VO A and VO B, each with its own queuing service, job service, reservation service, and CA; GT3.0 RM adapters for PBS and LSF connect the queuing services to the underlying PBS and LSF clusters]
Grid Portals
• Use Case: A researcher logs on using a single grid portal account which enables her to
  – Be authenticated across all resources on the grid
  – Submit and manage job sequences on the entire grid
  – View account allocations and usage
  – View current status of all grid resources
  – Transfer files between grid resources
• GridPort provides base services used to create customized portals (e.g., HotPages). Technologies:
  – Security: GSI, SSH, MyProxy
  – Job execution: GRAM Gatekeeper
  – Information services: MDS, NWS, custom information scripts
  – File management: GridFTP
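The single-sign-on step in the use case above works by exchanging one portal login for a short-lived delegated credential, as a MyProxy server provides, which is then presented to every grid service. The sketch below models only that flow; the class, the dict-shaped "proxy", and all names are illustrative assumptions (a real proxy is an X.509 certificate chain).

```python
"""Sketch of portal single sign-on via a MyProxy-like repository
(illustrative; not the MyProxy protocol or API)."""

import hashlib
import time

class CredentialRepository:
    """Stands in for MyProxy: maps portal accounts to delegated proxies."""
    def __init__(self):
        self._users = {}

    def store(self, user, passphrase):
        self._users[user] = hashlib.sha256(passphrase.encode()).hexdigest()

    def get_proxy(self, user, passphrase, lifetime_s=12 * 3600):
        if self._users.get(user) != hashlib.sha256(passphrase.encode()).hexdigest():
            raise PermissionError("bad portal credentials")
        # A real proxy is an X.509 chain; a dict stands in here.
        return {"subject": f"/O=Grid/CN={user}",
                "expires": time.time() + lifetime_s}

def authenticated(proxy):
    """A service accepts the proxy while it is unexpired."""
    return proxy["expires"] > time.time()

repo = CredentialRepository()
repo.store("researcher", "s3cret")
proxy = repo.get_proxy("researcher", "s3cret")
# The same short-lived proxy is then presented to GRAM, GridFTP, MDS, ...
print(authenticated(proxy))   # → True
```

One login thus yields one credential usable across all grid resources, which is what lets the portal offer job submission, file transfer, and status views without separate per-resource passwords.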
GridPort Application Portals
• UT/Texas grids:
  – http://gridport.tacc.utexas.edu
  – http://tigre.hipcat.net
• NPACI/PACI/TeraGrid HotPages (also @ PACI/NCSA):
  – https://hotpage.npaci.edu
  – http://hotpage.teragrid.org
  – https://hotpage.paci.org
• Telescience/BIRN (Biomedical Informatics Research Network):
  – https://gridport.npaci.edu/Telescience
• DOE Fusion Grid Portal
• Will use a GridPort-based portal to run scheduling experiments using portals and CSF at the upcoming Supercomputing 2003
• Contributing and founding member of NMI Portals Project:
  – Open Grid Computing Environments (OGCE)
Conclusions
• Grid technologies are progressing & improving, but still 'raw'
  – Cautious outreach to campus community
  – UT campus grid under construction; working with beta users now
• Computational science problems have not changed:
  – Users want easier tools and familiar user environments (e.g., command line), or easy portals
• Workflow appears to be a desirable tool:
  – GridFlow/GridSteer project under way
  – Working with advanced file management and scheduling to automate distributed tasks
TACC Grid Computing Activities Participants
• Participants include most of the TACC Distributed & Grid Computing Group:
  – Ashok Adiga
  – Jay Boisseau
  – Maytal Dahan
  – Eric Roberts
  – Akhil Seth
  – Mary Thomas
  – Tomislav Urban
  – David Walling
  – As of Dec. 1, Edward Walker (formerly of Platform Computing)