The Anatomy of the Grid
Enabling Scalable Virtual Organizations
Ian Foster
Mathematics and Computer Science Division
Argonne National Laboratory
and
Department of Computer Science
The University of Chicago
http://www.mcs.anl.gov/~foster
Grids are “hot” …
Computational, Data, Information, and Knowledge Grids;
DISCOM, the Access Grid, SinRG, APGrid, TeraGrid, …
but what are they really about?
[email protected]
ARGONNE CHICAGO
Issues I Propose to Address
Problem statement
Architecture
Globus Toolkit
Futures
[email protected]
ARGONNE CHICAGO
The Grid Problem
Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
[email protected]
ARGONNE CHICAGO
Elements of the Problem
Resource sharing
– Computers, storage, sensors, networks, …
– Sharing always conditional: issues of trust,
policy, negotiation, payment, …
Coordinated problem solving
– Beyond client-server: distributed data
analysis, computation, collaboration, …
Dynamic, multi-institutional virtual orgs
– Community overlays on classic org structures
– Large or small, static or dynamic
[email protected]
ARGONNE CHICAGO
Grid Communities & Applications:
Data Grids for High Energy Physics
[Figure: the tiered LHC data-distribution hierarchy, reconstructed from the original diagram]
Online System: a “bunch crossing” every 25 nsecs; ~100 “triggers” per second; each triggered event is ~1 MByte in size; ~PBytes/sec off the detector, ~100 MBytes/sec into Tier 0
Tier 0: CERN Computer Centre, offline processor farm of ~20 TIPS (1 TIPS is approximately 25,000 SpecInt95 equivalents)
Tier 1 (~622 Mbits/sec links, or air freight, deprecated): regional centres in France, Germany, and Italy; FermiLab at ~4 TIPS; ~100 MBytes/sec to local storage
Tier 2 (~622 Mbits/sec links): Tier2 centres such as Caltech, ~1 TIPS each
Institutes (~0.25 TIPS): physics data caches; physicists work on analysis “channels”, each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server
Tier 4 (~1 MBytes/sec): physicist workstations
Image courtesy Harvey Newman, Caltech
Grid Communities and Applications:
Network for Earthquake Eng. Simulation
NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other
On-demand access to experiments, data streams, computing, archives, collaboration
[email protected]
NEESgrid: Argonne, Michigan, NCSA, UIUC, USC ARGONNE CHICAGO
Grid Communities and Applications:
Mathematicians Solve NUG30
Community = an informal collaboration of mathematicians and computer scientists
Condor-G delivers 3.46E8 CPU seconds in 7 days (peak 1009 processors) at 8 sites in the U.S. and Italy
Solves the NUG30 quadratic assignment problem; solution:
14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23
[email protected]
MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
ARGONNE CHICAGO
Grid Communities and Applications:
Home Computers Evaluate AIDS Drugs
Community =
– 1000s of home computer users
– Philanthropic computing vendor (Entropia)
– Research group (Scripps)
Common goal = advance AIDS research
[email protected]
ARGONNE CHICAGO
Grid Architecture
Why Discuss Architecture?
Descriptive
– Provide a common vocabulary for use when
describing Grid systems
Guidance
– Identify key areas in which services are
required
Prescriptive
– Define standard “Intergrid” protocols and
APIs to facilitate creation of interoperable
Grid systems and portable applications
[email protected]
ARGONNE CHICAGO
What Sorts of Standards?
Need for interoperability when different
groups want to share resources
– E.g., IP lets me talk to your computer, but how
do we establish & maintain sharing?
– How do I discover, authenticate, authorize,
describe what I want to do, etc., etc.?
Need for shared infrastructure services to
avoid repeated development, installation, e.g.
– One port/service for remote access to
computing, not one per tool/application
– X.509 enables sharing of Certificate Authorities
[email protected]
ARGONNE CHICAGO
So, in Defining Grid Architecture,
We Must Address …
Development of Grid protocols & services
– Protocol-mediated access to remote resources
– New services: e.g., resource brokering
– “On the Grid” = speak Intergrid protocols
– Mostly (extensions to) existing protocols
Development of Grid APIs & SDKs
– Facilitate application development by supplying
higher-level abstractions
The (hugely successful) model is the Internet
The Grid is not a distributed OS!
[email protected]
ARGONNE CHICAGO
The Role of Grid Services
(aka Middleware) and Tools
[Figure: tools and Grid services over the network, reconstructed from the original diagram]
Tools: Collaboration Tools, Data Mgmt Tools, Distributed simulation, …
Grid services (middleware): Remote access, Remote monitor, Information services, Resource mgmt, Fault detection, …
All built over the net
Layered Grid Architecture
(By Analogy to Internet Architecture)
Application
Collective: “Coordinating multiple resources” – ubiquitous infrastructure services, app-specific distributed services
Resource: “Sharing single resources” – negotiating access, controlling use
Connectivity: “Talking to things” – communication (Internet protocols) & security
Fabric: “Controlling things locally” – access to, & control of, resources
Internet protocol architecture analogy: Application & Collective map to the Internet Application layer; Resource & Connectivity to Transport & Internet; Fabric to Link
Protocols, Services, and Interfaces
Occur at Each Level
Applications
Languages/Frameworks
Collective Service APIs and SDKs
Collective Services (speaking Collective Service Protocols)
Resource APIs and SDKs
Resource Services (speaking Resource Service Protocols)
Connectivity APIs
Connectivity Protocols
Local Access APIs and Protocols
Fabric Layer
Where Are We With Architecture?
No “official” standards exist
– Nor is it clear what this would mean
But:
– Globus Toolkit has emerged as the de facto
standard for several important Connectivity,
Resource, and Collective protocols
– GGF has an architecture working group
– Technical specifications are being developed
for architecture elements: e.g., security,
data, resource management, information
[email protected]
ARGONNE CHICAGO
The Globus Toolkit
Grid Services Architecture (1):
Fabric Layer
Just what you would expect: the diverse
mix of resources that may be shared
– Individual computers, Condor pools, file
systems, archives, metadata catalogs,
networks, sensors, etc., etc.
Few constraints on low-level technology:
connectivity and resource level protocols
form the “neck in the hourglass”
The Globus Toolkit provides a few selected components (e.g., bandwidth broker)
[email protected]
ARGONNE CHICAGO
Grid Services Architecture (2):
Connectivity Layer Protocols & Services
Communication
– Internet protocols: IP, DNS, routing, etc.
Security: Grid Security Infrastructure (GSI)
– Uniform authentication & authorization
mechanisms in multi-institutional setting
– Single sign-on, delegation, identity mapping
– Public key technology, SSL, X.509, GSS-API
– Supporting infrastructure: Certificate
Authorities, key management, etc.
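As a rough illustration of the SSL/X.509 machinery GSI builds on, this sketch sets up a mutually authenticated connection with Python’s standard ssl module; the certificate file names are hypothetical, and real GSI layers proxy credentials and delegation on top of this handshake.

import socket
import ssl

# Server (resource) side: require a client certificate signed by a
# trusted CA, so both parties authenticate. File names are hypothetical.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile="resource-cert.pem", keyfile="resource-key.pem")
ctx.load_verify_locations(cafile="trusted-cas.pem")
ctx.verify_mode = ssl.CERT_REQUIRED  # client must also present a certificate

with socket.create_server(("", 8443)) as server:
    with ctx.wrap_socket(server.accept()[0], server_side=True) as conn:
        # Both ends have now proved possession of a CA-signed identity.
        print("authenticated peer:", conn.getpeercert()["subject"])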
[email protected]
ARGONNE CHICAGO
[Figure: GSI single sign-on and delegation, reconstructed from the original diagram]
Single sign-on via “grid-id”: the user’s Globus credential is used to create a “user proxy” holding a delegated credential
Mutual user-resource authentication: the user proxy authenticates to the GRAM/GSI service at each site, which authorizes the request and maps the Grid identity to local ids (e.g., a Kerberos ticket at Site 1, a public-key certificate at Site 2)
Processes created at each site carry credentials, enabling authenticated interprocess communication within and across sites
GSI Futures
Scalability in numbers of users & resources
– Credential management
– Online credential repositories (“MyProxy”)
– Account management
Authorization
– Policy languages
– Community authorization
Protection against compromised resources
– Restricted delegation, smartcards
[email protected]
ARGONNE CHICAGO
GSI Futures:
Community Authorization
1. CAS request, with resource names and operations; the CAS asks whether the collective policy authorizes this request for this user (consulting user/group membership, resource/collective membership, and collective policy information)
2. CAS reply, with capability and resource CA info
3. Resource request, authenticated with capability; the resource asks whether the request is authorized by the capability and whether it is authorized for the CAS (consulting local policy information)
4. Resource reply
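A minimal sketch of the capability check in step 3, assuming a simple shared-key signed token in place of the X.509 mechanisms CAS actually uses; every name below is hypothetical.

import hashlib
import hmac
import json

CAS_KEY = b"secret-shared-with-cas"  # hypothetical trust anchor

def issue_capability(user, resource, operations):
    # Step 2: the CAS signs the (user, resource, operations) it authorized.
    payload = json.dumps({"user": user, "resource": resource,
                          "ops": sorted(operations)}).encode()
    sig = hmac.new(CAS_KEY, payload, hashlib.sha256).hexdigest()
    return payload, sig

def resource_authorizes(payload, sig, requested_op, local_policy):
    # Step 3a: is this capability really from the CAS?
    expected = hmac.new(CAS_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    cap = json.loads(payload)
    # Step 3b: is this request authorized by the capability?
    if requested_op not in cap["ops"]:
        return False
    # Step 3c: is this request authorized for the CAS by local policy?
    return requested_op in local_policy.get(cap["resource"], set())

payload, sig = issue_capability("alice", "storage1", {"read"})
print(resource_authorizes(payload, sig, "read", {"storage1": {"read"}}))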
Grid Services Architecture (3):
Resource Layer Protocols & Services
Resource management: GRAM
– Remote allocation, reservation, monitoring,
control of [compute] resources
Data access: GridFTP
– High-performance data access & transport
Information: MDS (GRRP, GRIP)
– Access to structure & state information
& others emerging: catalog access, code
repository access, accounting, …
All integrated with GSI
[email protected]
ARGONNE CHICAGO
GRAM Resource Management
Protocol
Grid Resource Allocation & Management
– Allocation, monitoring, control of computations
Simple HTTP-based RPC
– Job request:
> Returns a “job contact”: Opaque string that can be passed
between clients, for access to job
– Job cancel, Job status, Job signal
– Event notification (callbacks) for state changes
> Pending, active, done, failed, suspended
Servers for most schedulers; C and Java APIs
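As a rough sketch of that RPC pattern (not the actual wire protocol), the fragment below posts an RSL job description to a hypothetical gatekeeper URL and reads back the job contact; a real GRAM client authenticates the exchange with GSI rather than plain HTTP.

import urllib.request

# Hypothetical GRAM-style endpoint; the port and path are illustrative.
GATEKEEPER = "http://gatekeeper.example.org:2119/jobmanager"

# An RSL job request: run /bin/date on two processors.
rsl = "&(executable=/bin/date)(count=2)"

req = urllib.request.Request(GATEKEEPER, data=rsl.encode(),
                             headers={"Content-Type": "text/plain"})
with urllib.request.urlopen(req) as resp:
    # The reply carries the "job contact": an opaque string that can be
    # passed between clients for later status, cancel, or signal calls.
    job_contact = resp.read().decode().strip()
print("job contact:", job_contact)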
[email protected]
ARGONNE CHICAGO
Resource Management Futures
GRAM-2 protocol (ETA late 2001)
– Advance reservations & multiple resource types
– Recoverable requests, timeout, etc.
– Use of SOAP (RPC using HTTP + XML)
– Policy evaluation points for restricted proxies
[email protected]
ARGONNE CHICAGO
Data Access & Transfer
GridFTP: extended version of popular FTP
protocol for Grid data access and transfer
Secure, efficient, reliable, flexible, extensible,
parallel, concurrent, e.g.:
– Third-party data transfers, partial file transfers
– Parallelism, striping (e.g., on PVFS)
– Reliable, recoverable data transfers
Reference implementations
– Existing clients and servers: wuftpd, ncftp
– Flexible, extensible libraries
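Because GridFTP extends standard FTP, its partial-file feature can be pictured with the stock protocol. A plain-FTP sketch using Python’s ftplib, with a hypothetical host and path (GridFTP adds GSI security, parallel streams, striping, and third-party transfers on top):

from ftplib import FTP

# Partial-file transfer via FTP's REST mechanism; GridFTP builds its
# partial and recoverable transfers on the same idea.
ftp = FTP("ftp.example.org")  # hypothetical server
ftp.login()  # anonymous
with open("chunk.dat", "wb") as out:
    # Start reading 1 MByte into the remote file instead of at offset 0.
    ftp.retrbinary("RETR /data/events.dat", out.write, rest=1024 * 1024)
ftp.quit()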
[email protected]
ARGONNE CHICAGO
Grid Services Architecture (4):
Collective Layer Protocols & Services
Index servers aka metadirectory services
– Custom views on dynamic resource collections
assembled by a community
Resource brokers (e.g., Condor Matchmaker)
– Resource discovery and allocation
Replica management and replica selection
– Optimize aggregate data access performance
Co-reservation and co-allocation services
– End-to-end performance
Etc., etc.
[email protected]
ARGONNE CHICAGO
The Grid Information Problem
Large numbers of distributed “sensors” with
different properties
Need for different “views” of this information,
depending on community membership, security
constraints, intended purpose, sensor type
[email protected]
ARGONNE CHICAGO
The Globus Toolkit Solution: MDS-2
Registration & enquiry protocols, information
models, query languages
– Provides standard interfaces to sensors
– Supports different “directory” structures enabling various discovery/access strategies
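To make the registration & enquiry split concrete, here is a toy soft-state index in Python: resources register with a time-to-live and silently vanish unless they re-register, and clients query by attribute. All names are hypothetical; MDS-2’s actual registration and enquiry protocols (GRRP, GRIP) are LDAP-based.

import time

class ToyIndex:
    """Soft-state index: registrations expire unless renewed."""
    def __init__(self):
        self.entries = {}  # resource name -> (attributes, expiry time)

    def register(self, name, attributes, ttl=60):
        # Registration: the sender promises to re-register before the
        # TTL lapses, or the entry quietly disappears.
        self.entries[name] = (attributes, time.time() + ttl)

    def query(self, **wanted):
        # Enquiry: return live entries matching all requested attributes.
        now = time.time()
        return [name for name, (attrs, expiry) in self.entries.items()
                if expiry > now and all(attrs.get(k) == v
                                        for k, v in wanted.items())]

index = ToyIndex()
index.register("cluster1.example.org", {"arch": "ia64", "free_cpus": 12})
print(index.query(arch="ia64"))  # -> ['cluster1.example.org']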
[email protected]
ARGONNE CHICAGO
Resource Management Architecture
[Figure: resource management architecture, reconstructed from the original diagram]
Applications and brokers (e.g., ASCI DISCOM, Condor-G, Nimrod-G, the Poznan* broker, U. Lecce) express requests in RSL
RSL specialization: brokers refine abstract RSL into ground RSL, guided by queries & info from the Information Service
A co-allocator (e.g., DUROC, as used by MPICH-G2) decomposes ground RSL into simple ground RSL requests
GRAM servers accept simple ground RSL and drive local resource managers such as LSF, Condor, and NQE
* See talk by Jarek Nabrzyski et al.
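A rough sketch of RSL specialization under stated assumptions: a toy broker picks a concrete site for an abstract request and emits ground RSL naming a resourceManagerContact. The site table stands in for Information Service queries, and every host name is hypothetical.

# Toy broker: turn abstract RSL attributes into ground RSL by choosing
# a site. The table below stands in for the Information Service.
SITES = {
    "gatekeeper.site-a.example.org": {"free_cpus": 64},
    "gatekeeper.site-b.example.org": {"free_cpus": 8},
}

def specialize(request):
    # A real broker would rank candidates using live queries and policy;
    # here we take the first site with enough free processors.
    for contact, info in SITES.items():
        if info["free_cpus"] >= request["count"]:
            return (f"&(resourceManagerContact={contact})"
                    f"(executable={request['executable']})"
                    f"(count={request['count']})")
    raise RuntimeError("no site can satisfy the request")

print(specialize({"executable": "/bin/date", "count": 16}))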
Data Grid Architecture
(See talk by Sudharshan Vazhkudai)
[Figure: data grid request flow, reconstructed from the original diagram]
The application presents an attribute specification to the Metadata Catalog, which resolves it to a logical collection and logical file name
The Replica Catalog maps the logical name to multiple locations (Replica Location 1: disk cache + tape library; Replica Location 2: disk array; Replica Location 3: disk cache)
Replica Selection chooses a replica, using performance information & predictions from MDS and NWS
GridFTP commands then move data from the selected replica
+ “Virtual data”: transparency wrt location and materialization (www.griphyn.org)
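A minimal sketch of the replica-selection step, assuming bandwidth predictions of the kind NWS supplies; the locations and numbers are hypothetical.

# Choose the replica with the best predicted transfer rate (MBytes/sec).
def select_replica(replicas, predicted_bandwidth):
    return max(replicas, key=lambda loc: predicted_bandwidth.get(loc, 0.0))

replicas = ["loc1.example.org", "loc2.example.org", "loc3.example.org"]
predictions = {"loc1.example.org": 3.2,   # tape-backed disk cache
               "loc2.example.org": 11.5,  # disk array
               "loc3.example.org": 7.8}   # remote disk cache
print(select_replica(replicas, predictions))  # -> loc2.example.org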
Grid Futures
Large Grid projects are in place:
DOE ASCI DISCOM
DOE Particle Physics Data Grid
DOE Earth Systems Grid
DOE Science Grid
DOE Fusion Collaboratory
European Data Grid
Egrid (see talk by G. Allen et al.)
NASA Information Power Grid
NSF National Technology Grid
NSF Network for Earthquake Eng Simulation
NSF Grid Application Development Software
NSF Grid Physics Network
[email protected]
Large Grid
Projects
are in Place
ARGONNE CHICAGO
Problem Evolution
Past-present: O(10^2) high-end systems; Mb/s networks; centralized (or entirely local) control
– I-WAY (1995): 17 sites, week-long; 155 Mb/s
– GUSTO (1998): 80 sites, long-term experiment
– NASA IPG, NSF NTG: O(10) sites, production
Present: O(10^4-10^6) data systems, computers; Gb/s networks; scaling, decentralized control
– Scalable resource discovery; restricted delegation; community policy
– GriPhyN Data Grid: 100s of sites, O(10^4) computers; complex policies
Future: O(10^6-10^9) data, sensors, computers; Tb/s networks; highly flexible policy, control
[email protected]
ARGONNE CHICAGO
The Future:
All Software is Network-Centric
We don’t build or buy “computers” anymore,
we borrow or lease required resources
– When I walk into a room, need to solve a
problem, need to communicate
A “computer” is a dynamically, often
collaboratively constructed collection of
processors, data sources, sensors, networks
– Similar observations apply for software
[email protected]
ARGONNE CHICAGO
And Thus …
Reduced barriers to access mean that we
do much more computing, and more
interesting computing, than today =>
Many more components (& services);
massive parallelism
All resources are owned by others =>
Sharing (for fun or profit) is fundamental;
trust, policy, negotiation, payment
All computing is performed on unfamiliar
systems => Dynamic behaviors, discovery,
adaptivity, failure
[email protected]
ARGONNE CHICAGO
Summary
The Grid problem: Resource sharing &
coordinated problem solving in dynamic,
multi-institutional virtual organizations
Grid architecture: Emphasize protocol and
service definition to enable interoperability
and resource sharing
Globus Toolkit as a source of protocol and
API definitions, reference implementations
For more info: www.globus.org,
www.griphyn.org, www.gridforum.org
[email protected]
ARGONNE CHICAGO