Grid Infrastructure

[email protected]
What is it?
(Diagram: servers and clients)
It's all about IT
Hardware utilization
SOA & Web services
• Decompose processing into services
• Each service works independently
• Main components:
  – Universal Description, Discovery and Integration (UDDI)
  – Simple Object Access Protocol (SOAP; see the sketch below)
  – Web Services Description Language (WSDL)
• W3C standards
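A minimal sketch of what a SOAP call looks like on the wire, using only the Python standard library. The endpoint, namespace and operation name (getTemperature) are hypothetical placeholders, not something described in the slides:

 import urllib.request

 # SOAP messages are plain XML envelopes posted over HTTP.
 envelope = """<?xml version="1.0"?>
 <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
   <soap:Body>
     <getTemperature xmlns="http://example.org/weather">
       <city>Tel Aviv</city>
     </getTemperature>
   </soap:Body>
 </soap:Envelope>"""

 req = urllib.request.Request(
     "http://example.org/weather-service",          # hypothetical service endpoint
     data=envelope.encode("utf-8"),
     headers={"Content-Type": "text/xml; charset=utf-8"},
 )
 with urllib.request.urlopen(req) as resp:          # the reply is another XML envelope
     print(resp.read().decode("utf-8"))

The operation, its parameters and the endpoint would normally be discovered from the service's WSDL description (and the service itself located via UDDI).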
THE WORLD NEEDS ONLY FIVE COMPUTERS
(Thomas J. Watson)
• Google grid
• Microsoft's live.com
• Yahoo!
• Amazon.com
• eBay
• Salesforce.com
Well, that's O(5) ;)
Greg Matter (http://blogs.sun.com/Gregp/entry/the_world_needs_only_five)
Scaling
• Scale-up
  – Add more resources within the system
  – Does not require changes in the applications
  – Limited extension
  – Single point of failure
• Scale-out
  – Add more systems
  – Architecture dependent (needs change of code)
  – Economical
• How to? (see the sketch below)
  – Split the operation into groups
  – Perform each group on a different machine
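A minimal sketch of the "split into groups, process each group separately" idea, using a Python multiprocessing pool of local workers as a stand-in for separate machines; the work function and input data are made up for illustration:

 from multiprocessing import Pool

 def process_group(group):
     # Stand-in for the real per-group work (which would run on another machine).
     return sum(x * x for x in group)

 if __name__ == "__main__":
     data = list(range(1_000_000))
     group_size = 250_000
     # Split the operation into groups...
     groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
     # ...and perform each group on a different worker, then combine the results.
     with Pool(processes=len(groups)) as pool:
         partial_results = pool.map(process_group, groups)
     print(sum(partial_results))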
How fast can parallelization be?
• Let:
  – α be the proportion of the process that cannot be parallelized
  – P be the number of processors
  – S be the system speedup
Amdahl's law:
S = 1 / (α + (1 - α) / P)
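For example (illustrative numbers): with α = 0.05 and P = 16,
S = 1 / (0.05 + 0.95 / 16) ≈ 9.1,
and no matter how many processors are added, S can never exceed 1 / α = 20.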
Cluster types
• High availability
  – Active-Active
  – Active-Passive
  – Heartbeat
• Load balancing cluster
  – Round robin (weighted/non-weighted; see the sketch below)
  – System-status aware (session, CPU load, etc.)
• Compute cluster
  – Queuing system (Condor, hadoop, OpenPBS, LSF, etc.)
  – Single system image (ScaleMP, SSI, MOSIX, Nomad, etc.)
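A minimal sketch of non-weighted and weighted round-robin selection, as a load balancer might pick the next backend; the server names and weights are made-up examples:

 import itertools

 servers = ["node1", "node2", "node3"]

 # Non-weighted round robin: cycle through the servers in order.
 rr = itertools.cycle(servers)

 # Weighted round robin: repeat each server according to its weight.
 weights = {"node1": 3, "node2": 1, "node3": 1}
 weighted_rr = itertools.cycle(
     [name for name, w in weights.items() for _ in range(w)]
 )

 for _ in range(5):
     print("plain:", next(rr), " weighted:", next(weighted_rr))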
Condor script
 #################
 # Sample script #
 #################
 Executable              = /bin/hostname
 when_to_transfer_output = ON_EXIT_OR_EVICT
 Log                     = {file name}.log
 Error                   = err.$(Process)
 Output                  = out.$(Process)
 Requirements            = substr(Machine,0,4)=="dopp" && ARCH=="X86_64"
 Arguments               = +-u
 notification            = Complete
 Universe                = VANILLA
 Queue 10
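A submit file like this is handed to the scheduler with condor_submit; "Queue 10" enqueues ten instances of the job, each with its own $(Process) number (0-9), which is why the Error and Output file names include $(Process).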
From a single PC to a Grid
• Farm of PCs
• Enterprise grid: mutualization of resources in a company
• Volunteer computing: CPU cycles made available by PC owners
  – Examples: Seti@home, Africa@home
• Grid infrastructure: Internet + disk and storage resources + services for
  information management (data collection, transfer and analysis)
  – Example: EGEE
(Diagram spans a batch to on-line scale, from dedicated resources and batch systems
such as PBS Torque, utility computing (Condor) and hadoop up to gLite & Globus.)
Key Cloud Services Attributes
• Off-site, third-party provider
• Access via Internet
• Minimal/no IT skills required to “implement”
• Provisioning – self-service requesting; near real-time deployment;
  dynamic & fine-grained scaling
• Fine-grained usage-based pricing model
• UI – browser and successors
• Web services APIs as system interface
• Shared resources/common versions
Source: IDC, Sep 2008
What is “Grid”
What is Grid Computing?
The definition is not widely agreed upon.
Foster & Kesselman:
• Computing resources are not administered centrally.
• Open standards are used.
• Non-trivial quality of service is achieved.
Other definitions
• "the technology that enables resource virtualization, on-demand provisioning,
  and service (resource) sharing between organizations." (Plaszczak/Wellner)
• "a type of parallel and distributed system that enables the sharing, selection,
  and aggregation of geographically distributed autonomous resources dynamically at
  runtime depending on their availability, capability, performance, cost, and users'
  quality-of-service requirements" (Buyya)
• "a service for sharing computer power and data storage capacity over the
  Internet." (CERN)
Virtual Organization
• What’s a VO?
  – People in different organisations seeking to cooperate and share resources
    across their organisational boundaries
• Why establish a Grid?
  – Share data
  – Pool computers
  – Collaborate
• The initial vision: “The Grid”
• The present reality: many “grids”
• Each grid is an infrastructure enabling one or more “virtual organisations”
  to share computing resources
(Diagram: institutes A–F grouped into two virtual organisations, VO1 and VO2)
The Grid Metaphor
(Diagram: mobile access, workstations, supercomputers and PC-clusters, data storage,
sensors, experiments and visualisation tools all plug into the GRID through a layer
of MIDDLEWARE running over the Internet and other networks.)
Stand alone computer
Middleware components – The batch approach
(Diagram: the “user interface” exchanges the input and output “sandboxes” with the
Resource Broker; the broker uses the Replica Catalogue (dataset info), the
Information Service (resources publish themselves there), Authorisation &
Authentication and Logging & Book-keeping, and forwards the job to a Computing
Element and Storage Element. Job submit events, job queries and job status flow
between the components.)
The Resource Broker (RB) node
(Diagram: the RB node hosts the Network Server, the Workload Manager, the
Information Service, the Job Controller/CondorG and the RB storage; it talks to the
Replica Location Server and to the Computing and Storage Elements, which publish
their characteristics and status, and the UI follows the job status.)
UI: allows users to access the functionalities of the WMS
(via command line, GUI, C++ and Java APIs).
Job submission – submitting the job
From the UI the user submits the job description file:
  edg-job-submit myjob.jdl
Job status: submitted
The Job Description Language (JDL) is used to specify the job characteristics and
requirements. Myjob.jdl:
  JobType       = "Normal";
  Executable    = "$(CMS)/exe/sum.exe";
  InputSandbox  = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
  OutputSandbox = {"sim.err", "test.out", "sim.log"};
  Requirements  = other.GlueHostOperatingSystemName == "linux" &&
                  other.GlueHostOperatingSystemRelease == "Red Hat 7.3" &&
                  other.GlueCEPolicyMaxCPUTime > 10000;
  Rank          = other.GlueCEStateFreeCPUs;
Job submission – the Network Server
NS: network daemon responsible for accepting incoming requests.
The input sandbox files are copied from the UI to the RB storage.
Job status: submitted → waiting
Job submission – the Workload Manager
WM: acts to satisfy the request.
Job status: waiting
Job submission – matchmaking
MatchMaker/Broker: responsible for finding the “best” CE for the job –
“Where must the job be executed?”
Job status: waiting
Job submission – matchmaking queries
The broker asks the Replica Location Server “where are the needed data (on which
SEs)?” and the Information Service “what is the status of the Grid?”
Job status: waiting
Job submission – CE choice
Based on the answers, the Workload Manager makes the CE choice.
Job status: waiting
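A minimal sketch of what the matchmaking step conceptually does: filter the CEs that satisfy the job's Requirements and order the survivors by Rank (here, free CPUs). The attribute names mirror the Glue attributes in myjob.jdl above, but the CE list and its values are invented for illustration:

 # Candidate CEs as they might be published in the Information Service (made-up values).
 ces = [
     {"name": "ce01.example.org", "os": "linux",   "release": "Red Hat 7.3",
      "max_cpu_time": 20000, "free_cpus": 4},
     {"name": "ce02.example.org", "os": "linux",   "release": "Red Hat 7.3",
      "max_cpu_time": 5000,  "free_cpus": 30},
     {"name": "ce03.example.org", "os": "solaris", "release": "9",
      "max_cpu_time": 50000, "free_cpus": 12},
 ]

 def satisfies_requirements(ce):
     # Mirrors the Requirements expression of myjob.jdl.
     return (ce["os"] == "linux"
             and ce["release"] == "Red Hat 7.3"
             and ce["max_cpu_time"] > 10000)

 # Keep only the matching CEs, then pick the one with the best Rank (most free CPUs).
 candidates = [ce for ce in ces if satisfies_requirements(ce)]
 best = max(candidates, key=lambda ce: ce["free_cpus"])
 print("chosen CE:", best["name"])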
Job submission – the Job Adapter
Job Adapter: responsible for the final “touches” to the job before performing
submission (e.g. creation of the wrapper script, PFN, etc.).
Job submission – the Job Controller
Job Controller: responsible for the actual job management operations
(done via CondorG).
Job status: waiting → ready
Job submission – scheduling
CondorG hands the job to the chosen Computing Element.
Job status: ready → scheduled
“Compute element” – reminder!
(Diagram: job requests arrive at the Grid gate node, which runs the Globus
gatekeeper with its gridmapfile, logging and an information system; behind it a
local resource management system – a Condor / PBS / LSF master – drives a
homogeneous set of worker nodes.)
Job submission – running
The input sandbox files are transferred to the Computing Element; the running job
performs “Grid enabled” data transfers/accesses to the Storage Element.
Job status: scheduled → running
Job submission – completion
The output sandbox files are copied back to the RB storage.
Job status: running → done
Job submission – retrieving the output
From the UI the user retrieves the results:
  edg-job-get-output <dg-job-id>
Job submission – clearing
The output sandbox files are delivered to the UI.
Job status: done → cleared
Job monitoring
  edg-job-status <dg-job-id>
  edg-job-get-logging-info <dg-job-id>
LB (Logging & Bookkeeping): receives and stores job events; processes the
corresponding job status.
LM (Log Monitor): parses the CondorG log file (where CondorG logs info about jobs)
and notifies the LB.
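Putting the slides together, a job moves through the states
submitted → waiting → ready → scheduled → running → done → cleared,
and edg-job-status reports the current state at any point along the way.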
Approaches to Security: 1
The Poor Security House
Approaches to Security: 2
The Paranoid Security House
Approaches to Security: 3
The Realistic Security House
Mapping certificate to local user
• The site uses its local accounting system
• A pool of local users is dedicated to the Grid
• Each Grid user is mapped to a local account using a grid-mapfile
  (see the example below) or VOMS
• The mapping can implement local policy on external users
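As an illustration (the DNs and account names below are made up, not from the slides), a grid-mapfile is a plain text file that maps certificate subject DNs to local accounts, one entry per line:

 "/O=Grid/OU=Example/CN=Alice Example" griduser001
 "/O=Grid/OU=Example/CN=Bob Example"   griduser002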
Certificate Request
• User generates a public/private key pair; the private key stays encrypted on the
  local disk.
• User sends the public key to the CA along with proof of identity
  (the certificate request).
• CA confirms the identity, signs the certificate and sends it back to the user.
slide based on presentation given by Carl Kesselman at GGF Summer School 2004
Inside the Certificate
• Standard (X.509) defined format.
• User identification (e.g. full name).
• User's public key.
• A “signature” from a CA, created by encoding a unique string (a hash) generated
  from the user's identification, the user's public key and the name of the CA.
  The signature is encoded using the CA's private key. This has the effect of:
  – Proving that the certificate came from the CA.
  – Vouching for the user's identification.
  – Vouching for the binding of the user's public key to their identification.
(Certificate fields: Name, Issuer: CA, Public Key, Signature – see the sketch below)
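A small sketch of how those fields can be inspected programmatically with Python's cryptography package; the certificate file name is a made-up placeholder:

 from cryptography import x509

 # Load a PEM-encoded certificate from disk (placeholder path).
 with open("usercert.pem", "rb") as f:
     cert = x509.load_pem_x509_certificate(f.read())

 print("Subject (name):", cert.subject.rfc4514_string())    # user identification
 print("Issuer (CA):   ", cert.issuer.rfc4514_string())
 print("Public key:    ", cert.public_key())
 print("Signature:     ", cert.signature.hex()[:32], "...")  # made with the CA's private key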
Mutual Authentication
• A sends their certificate.
• B verifies the CA signature in A’s certificate.
• B sends A a challenge string (a random phrase).
• A encrypts the challenge string with A’s private key.
• A sends the encrypted challenge to B.
• B uses A’s public key to decrypt the challenge.
• B compares the decrypted string with the original challenge.
• If they match, B has verified A’s identity and A cannot repudiate it.
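A minimal sketch of the challenge-response step using Python's cryptography package, purely as an illustration rather than grid middleware code: instead of literally "encrypting with the private key" it uses the equivalent sign/verify operations, and the key pair is generated on the spot rather than taken from A's certificate:

 import os
 from cryptography.hazmat.primitives import hashes
 from cryptography.hazmat.primitives.asymmetric import rsa, padding

 # A's key pair (in the real exchange the public key comes from A's CA-signed certificate).
 private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
 public_key = private_key.public_key()

 # B sends A a random challenge string.
 challenge = os.urandom(32)

 # A signs the challenge with A's private key.
 signature = private_key.sign(challenge, padding.PKCS1v15(), hashes.SHA256())

 # B checks the signature with A's public key; raises InvalidSignature on mismatch.
 public_key.verify(signature, challenge, padding.PKCS1v15(), hashes.SHA256())
 print("challenge verified - A's identity confirmed")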
Proxy certificate
• Avoid re-entering the passphrase by creating a proxy
• A proxy consists of a new certificate and a private key
• The proxy certificate contains the owner's identity (modified)
• The remote party receives the proxy's certificate (signed by the owner) and the
  owner's certificate
• The proxy certificate is lifetime-limited
• Chain of trust runs from the CA to the proxy through the owner
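In Globus-based middleware such a proxy is typically created with grid-proxy-init (or voms-proxy-init when VO membership attributes are needed), which asks for the certificate passphrase once and writes a short-lived proxy credential to local disk.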
Grids in Europe
(Slide from Prof. Dieter Kranzlmueller, EGEE08, Istanbul, Turkey – www.eu-egi.eu)
To be continued