
Grids: TACC Case Study

Ashok Adiga, Ph.D.

Distributed & Grid Computing Group
Texas Advanced Computing Center
The University of Texas at Austin
[email protected]
(512) 471-8196

Outline

• Overview of TACC Grid Computing Activities
• Building a Campus Grid – UT Grid
• Addressing Common Use Cases
  – Scheduling & Workflow
  – Grid Portals
• Conclusions

TACC Grid Program

• Building Grids at TACC
  – Campus Grid (UT Grid)
  – State Grid (TIGRE)
  – National Grid (ETF)
• Grid Hardware Resources
  – Wide range of hardware resources available to the research community at UT and partners
• Grid Software Resources
  – NMI components, NPACKage
  – User portals: GridPort
  – Job schedulers: LSF MultiCluster, Community Scheduler Framework
  – United Devices (desktop grids)
• Significantly leveraging NMI components and experience

TACC Resources: Providing Comprehensive, Balanced Capabilities

• HPC
  – Cray-Dell cluster: 600 CPUs, 3.67 Tflops, 0.6 TB memory, 25 TB disk
  – IBM Power4 system: 224 CPUs, 1.16 Tflops, 0.5 TB memory, 7.1 TB disk
• Visualization
  – SGI Onyx2: 24 CPUs, 25 GB memory, 6 IR2 graphics pipes
  – Sun V880z: 4 CPUs, 2 Zulu graphics pipes
  – Dell/Windows cluster: 18 CPUs, 9 NVIDIA NV30 cards (soon)
  – Large immersive environment and 10 large, tiled displays
• Data storage
  – Sun SAN: 12 TB across research and main campuses
  – STK PowderHorn silo: 2.8 PB capacity
• Networking
  – Nortel 10GigE DWDM: between machine room and vislab building
  – Force10 switch-routers: 1.2 Tbps, in machine room and vislab building
  – TeraBurst V20s: OC48 video capability for remote, collaborative 3D visualization

TeraGrid (National)

• NSF Extensible Terascale Facility (ETF) project
  – Build and deploy the world's largest, fastest distributed computational infrastructure for general scientific research
  – Current members: San Diego Supercomputer Center, NCSA, Argonne National Laboratory, Pittsburgh Supercomputing Center, California Institute of Technology
  – Currently has a 40 Gbps backbone with hubs in Los Angeles & Chicago
  – 3 new members added in September 2003: The University of Texas (led by TACC), Oak Ridge National Laboratory, and Indiana U/Purdue U

TeraGrid (National)

• UT awarded $3.2M to join the NSF ETF in September 2003
  – Establish a 10 Gbps network connection to the ETF backbone
  – Provide access to high-end computers capable of 6.2 teraflops, a new terascale visualization system, and a 2.8-petabyte mass storage system
  – Provide access to geoscience data collections used in environmental, geological, climate, and biological research: high-resolution digital terrain data, worldwide hydrological data, global gravity data, and high-resolution X-ray computed tomography data
• Current software stack includes Globus (GSI, GRAM, GridFTP), MPICH-G2, Condor-G, GPT, MyProxy, and SRB (see the data-staging sketch below)
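The slide lists GridFTP among the deployed tools; purely as a hedged illustration of how a researcher might stage one of these data collections, the sketch below wraps the standard globus-url-copy client in Python. The hostnames, paths, and user name are hypothetical, and it assumes the Globus client tools and a valid GSI proxy are already in place.

```python
"""Hedged sketch: staging a file from a remote data collection with GridFTP.

Assumes the Globus Toolkit client tools (globus-url-copy) are installed and a
valid GSI proxy already exists; the hostnames and paths are hypothetical.
"""
import subprocess

# Hypothetical source on a TeraGrid data server and a local destination.
source = "gsiftp://data.example.teragrid.org/collections/terrain/tile_042.dat"
destination = "file:///scratch/username/tile_042.dat"

# globus-url-copy copies between the two URLs over GridFTP.
subprocess.run(["globus-url-copy", source, destination], check=True)
```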

TIGRE (State-wide Grid)

• Texas Internet Grid for Research and Education
  – Computational grid to integrate computing & storage systems, databases, visualization laboratories and displays, and instruments and sensors across Texas
  – Current TIGRE participants: Rice, Texas A&M, Texas Tech University, Univ of Houston, and Univ of Texas at Austin (TACC)
  – Grid software for the TIGRE testbed: Globus, MPICH-G2, NWS, SRB
  – Other local packages must be integrated
  – Goal: track NMI GRIDS

UT Grid (Campus Grid)

• Mission: integrate and simplify the usage of the diverse computational, storage, visualization, data, and instrument resources of UT to facilitate new, powerful paradigms for research and education.

• UT Austin participants:
  – Texas Advanced Computing Center (TACC)
  – Institute for Computational Engineering & Sciences (ICES)
  – Information Technology Services (ITS)
  – Center for Instructional Technologies (CIT)
  – College of Engineering (COE)

What is a Campus Grid?

• Important differences from enterprise grids
  – Researchers are generally more independent than in a company with a tight focus on mission and profits
  – No central IT group governs researchers' systems: systems are paid for out of grants, so authority is distributed, and the owners of PCs and clusters have total control, reconfiguring and participating only if willing
  – Lots of heterogeneity; lots of low-cost, poorly supported systems
  – Accounting potentially less important
• Focus on increasing research effectiveness allows tackling problems early (scheduling, workflow, etc.)

UT Grid: Approach

• Unique characteristics present opportunities
  – Some campus researchers want to be on the bleeding edge, unlike commercial enterprises
  – TACC provides the high-end systems that researchers require
  – Campus users initially have trust relationships with TACC, but not with each other
• How to build a campus grid:
  – Build a hub & spoke grid first
  – Address both productivity and grid R&D

UT Grid: Logical View

1. Integrate distributed TACC resources first (Globus, LSF, NWS, SRB, United Devices, GridPort)
[Diagram: TACC HPC, Vis, and Storage resources, actually spread across two campuses]

UT Grid: Logical View

2. Next add other UT resources in one building as a spoke, using the same tools and procedures
[Diagram, built up across three slides: ICES, GEO, BIO, and PGE clusters added as spokes around the TACC HPC, Vis, and Storage hub]

UT Grid: Logical View

3. Finally, negotiate connections between spokes for willing participants to develop a P2P grid
[Diagram: peer-to-peer connections among the ICES, GEO, BIO, and PGE spokes and the TACC HPC, Vis, and Storage hub]

UT Grid: Physical View

[Diagram: physical network layout. The research campus NOC and switch connect TACC storage, the TACC Power4 system, and TACC clusters to external networks and to GAATN; on the main campus, a NOC and switches connect PGE clusters and the ACES building, which houses TACC Vis and ICES clusters]

UT Grid: Focus

• Address users interested only in increased productivity
  – Some users just want to be more productive with TACC resources and their own (and others'): scheduling throughput, data collections, workflow
  – Install 'lowest common denominator' software only on TACC production resources and user spokes for productivity: Globus 2.x, GridPort 2.x, WebSphere, LSF MultiCluster, SRB, NWS, United Devices, etc.

UT Grid: Focus

• Address users interested in grid R&D issues
  – Some users want to conduct grid-related R&D: grid scheduling, performance modeling, meta-applications, P2P storage, etc.
  – Also install bleeding-edge software to support grid R&D on the TACC testbed and willing spoke systems: Globus 3.0 and other OGSA software, GridPort 3.x, Community Scheduler Framework, etc.

Scheduling & Workflow

• Use case: a researcher wants to run a climate modeling job on a compute cluster and view the results using a specified visualization resource (a sketch of this flow follows below)
• Grid middleware requirements:
  – Schedule the job to the "best" compute cluster
  – Forward results to the specified visualization resource
  – Support advanced reservations on the visualization resource
• Currently solved using LSF MultiCluster & Globus (GSI, GridFTP, GRAM)
• Evaluating the CSF meta-scheduler for future use
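As a hedged illustration of the use case above (not TACC's actual workflow scripts), the sketch below runs a job through a GRAM gatekeeper and then forwards its output to a visualization resource with GridFTP. The hostnames, paths, and model executable are hypothetical; it assumes the Globus client tools and a valid GSI proxy.

```python
"""Hedged sketch of the climate-modeling use case: run the job via GRAM,
then forward the results to a visualization resource over GridFTP."""
import subprocess

COMPUTE = "compute.example.utexas.edu"   # hypothetical GRAM gatekeeper host
VIS = "vis.example.utexas.edu"           # hypothetical visualization resource

# 1. Run the climate model on the compute cluster (globus-job-run blocks
#    until the remote job completes).
subprocess.run(
    ["globus-job-run", COMPUTE, "/work/username/bin/climate_model",
     "-o", "/work/username/run42/output.nc"],
    check=True,
)

# 2. Forward the results to the visualization resource for rendering.
subprocess.run(
    ["globus-url-copy",
     f"gsiftp://{COMPUTE}/work/username/run42/output.nc",
     f"gsiftp://{VIS}/data/username/output.nc"],
    check=True,
)
```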

What is CSF?

• CSF (Community Scheduler Framework):
  – Open-source meta-scheduler framework contributed by Platform Computing to Globus for possible inclusion in the Globus Toolkit
  – Developed with the latest version of OGSI, a grid specification being developed within the Global Grid Forum (OGSA)
  – Extensible framework for implementing meta-schedulers: supports heterogeneous workload execution software (LSF, PBS, SGE), negotiates advanced reservations (WS-Agreement), and selects the best resource for a given job based on specified policies (see the selection-policy sketch below)
  – Provides a standard API to submit and manage jobs
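To make "select the best resource based on specified policies" concrete, here is a minimal sketch in plain Python, not the CSF API, of one possible selection policy. The cluster names, load figures, and the policy itself (enough free CPUs, then shortest queue) are illustrative assumptions, not CSF behavior.

```python
"""Hedged sketch of a meta-scheduler selection policy (not the CSF API)."""
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    scheduler: str      # e.g. "LSF", "PBS", or "SGE"
    free_cpus: int
    queue_length: int

def select_cluster(clusters, cpus_needed):
    """Pick a cluster with enough free CPUs and the shortest queue."""
    eligible = [c for c in clusters if c.free_cpus >= cpus_needed]
    if not eligible:
        raise RuntimeError("no cluster can currently satisfy the request")
    return min(eligible, key=lambda c: c.queue_length)

if __name__ == "__main__":
    # Hypothetical snapshot of per-cluster state.
    clusters = [
        Cluster("tacc-cray-dell", "LSF", free_cpus=128, queue_length=12),
        Cluster("ices-cluster", "PBS", free_cpus=32, queue_length=3),
    ]
    print(select_cluster(clusters, cpus_needed=16).name)  # -> ices-cluster
```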

Example CSF Configuration

[Diagram: two GT3.0-based virtual organizations, VO A and VO B, each running queuing, job, and reservation services with its own CA; resource manager adapters for PBS and LSF connect these services to the underlying PBS and LSF clusters]

Grid Portals

• Use case: a researcher logs on using a single grid portal account, which enables her to
  – Be authenticated across all resources on the grid
  – Submit and manage job sequences on the entire grid
  – View account allocations and usage
  – View the current status of all grid resources
  – Transfer files between grid resources
• GridPort provides base services used to create customized portals (e.g., HotPages). Technologies (see the portal-session sketch below):
  – Security: GSI, SSH, MyProxy
  – Job execution: GRAM gatekeeper
  – Information services: MDS, NWS, custom information scripts
  – File management: GridFTP
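The sketch below is a hedged illustration of the kinds of actions a portal performs on the user's behalf after login, using the standard MyProxy and Globus command-line clients rather than the GridPort libraries themselves. The server names, account name, and file paths are hypothetical.

```python
"""Hedged sketch of a portal-style session: fetch a delegated proxy from
MyProxy, run a job through the GRAM gatekeeper, and retrieve a file with
GridFTP. Not the GridPort API; hostnames and accounts are hypothetical."""
import subprocess

MYPROXY_SERVER = "myproxy.example.utexas.edu"   # hypothetical MyProxy host
GATEKEEPER = "compute.example.utexas.edu"       # hypothetical GRAM gatekeeper

# 1. Security: obtain a short-lived delegated GSI proxy for this session
#    (myproxy-logon prompts for the user's MyProxy passphrase).
subprocess.run(
    ["myproxy-logon", "-s", MYPROXY_SERVER, "-l", "griduser", "-t", "2"],
    check=True,
)

# 2. Job execution: run a trivial job through the GRAM gatekeeper.
subprocess.run(["globus-job-run", GATEKEEPER, "/bin/date"], check=True)

# 3. File management: pull a result file back over GridFTP.
subprocess.run(
    ["globus-url-copy",
     f"gsiftp://{GATEKEEPER}/home/griduser/results.txt",
     "file:///tmp/results.txt"],
    check=True,
)
```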


GridPort Application Portals

• UT/Texas grids:
  – http://gridport.tacc.utexas.edu
  – http://tigre.hipcat.net
• NPACI/PACI/TeraGrid HotPages (also @ PACI/NCSA):
  – https://hotpage.npaci.edu
  – http://hotpage.teragrid.org
  – https://hotpage.paci.org
• Telescience/BIRN (Biomedical Informatics Research Network):
  – https://gridport.npaci.edu/Telescience
• DOE Fusion Grid Portal
• Will use a GridPort-based portal to run scheduling experiments using portals and CSF at the upcoming Supercomputing 2003
• Contributing and founding member of the NMI Portals Project:
  – Open Grid Computing Environments (OGCE)

Conclusions

• Grid technologies are progressing & improving but still 'raw'
  – Cautious outreach to the campus community
  – UT campus grid under construction; working with beta users now
• Computational science problems have not changed:
  – Users want easier tools and familiar user environments (e.g., command line) or easy portals
• Workflow appears to be a desirable tool:
  – GridFlow/GridSteer project under way
  – Working with advanced file management and scheduling to automate distributed tasks

TACC Grid Computing Activities Participants

• Participants include most of the TACC Distributed & Grid Computing Group:
  – Ashok Adiga
  – Jay Boisseau
  – Maytal Dahan
  – Eric Roberts
  – Akhil Seth
  – Mary Thomas
  – Tomislav Urban
  – David Walling
  – As of Dec. 1, Edward Walker (formerly of Platform Computing)