FermiGrid - PRIMA, VOMS, GUMS & SAZ
Keith Chadwick
Fermilab
[email protected]
What is FermiGrid?
FermiGrid is:
- The Fermilab campus Grid.
- A set of common services to support the campus Grid:
  - The site globus gateway, VOMS, VOMRS, GUMS, SAZ, MyProxy, Gratia Accounting, etc.
- A forum for promoting stakeholder interoperability and resource sharing within Fermilab.
- The portal from the Open Science Grid to Fermilab Compute and Storage Services:
  - Production: fermigrid1, fngp-osg, fcdfosg1, fcdfosg2, docabosg2, sdsstam, FNAL_FERMIGRID_SE (public dcache), stken, etc.
  - Integration: fgtest1, fnpcg, etc.
FermiGrid Web Site & Additional Documentation:
- http://fermigrid.fnal.gov/
23 Oct 2006
Keith Chadwick
2
FermiGrid - Infrastructure
Site Globus Gateway:
- Job forwarding gateway using Condor-G and CEMon.
- Makes use of the "accept limited" globus gatekeeper option.
VOMS & VOMRS:
- VO Membership Service & VO Management Registration Service.
- Allows users to select roles.
GUMS:
- Grid User Mapping Service.
- Maps the FQAN in the x509 proxy to a site-specific UID/GID.
SAZ:
- Site AuthoriZation Service.
- Allows the site to make fine-grained job authorization decisions.
MyProxy:
- Service to securely store and retrieve signed x509 proxies.
Site Gatekeeper Job Forwarding
Why?
- Single point of control.
- Hide site internal details.
- Facilitate resource sharing.
- Allow (some) load balancing.
- Support specification of user job requirements (via ClassAds).
Why not?
- Complicates problem diagnosis.
- Non-standard configuration.
- Can confuse users.
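The matchmaking idea behind the gateway can be sketched in a few lines of Python: each cluster advertises attributes (as its ClassAds do), each job states requirements, and the gateway forwards to a matching, least-loaded cluster. The attribute names and cluster data below are illustrative, not the real FermiGrid ClassAds.

```python
# Minimal sketch of ClassAd-style matchmaking in a job-forwarding gateway:
# clusters advertise attributes, jobs state requirements, and the gateway
# forwards each job to a matching, least-loaded cluster.
# Attribute names and numbers are invented for illustration.

clusters = [
    {"Name": "fngp-osg",  "Arch": "X86", "Memory": 2048, "LoadAvg": 0.8},
    {"Name": "fcdfosg1",  "Arch": "X86", "Memory": 1024, "LoadAvg": 0.2},
    {"Name": "docabosg2", "Arch": "X86", "Memory": 512,  "LoadAvg": 0.1},
]

def forward(job_requirements):
    """Return the least-loaded cluster satisfying the job's requirements."""
    matches = [c for c in clusters if job_requirements(c)]
    if not matches:
        return None
    return min(matches, key=lambda c: c["LoadAvg"])["Name"]

# A job requiring at least 1 GB of memory:
target = forward(lambda c: c["Memory"] >= 1024)
```

This also illustrates the "why not" column: once the gateway has chosen a cluster, the user sees only the gateway, which complicates problem diagnosis.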
Site Gateway Job Forwarding with CEMon and BlueArc - Animation
[Diagram: the VOMS, GUMS, and SAZ servers synchronize periodically; the clusters (CMS WC1, CDF OSG1, CDF OSG2, D0 CAB2, SDSS TAM, GP Farm, LQCD) send ClassAds via CEMon to the site-wide gateway, which shares BlueArc storage; the user submits their grid job via globus-job-run, globus-job-submit, or condor-g.]
Globus gatekeeper - GUMS & SAZ interface
GUMS and SAZ are interfaced to the globus
gatekeeper through the gsi_authz callout:
/etc/grid-security/gsi_authz.conf
##### PRIMA
globus_mapping /usr/local/vdt/prima/lib/libprima_authz_module_gcc32dbg globus_gridmap_callout
##### SAZ
globus_authorization /usr/local/vdt/saz/client/lib/libSAZ-gt3.2_gcc32dbg globus_saz_access_control_callout
SAZ - Site AuthoriZation Service
We deployed the Fermilab Site AuthoriZation (SAZ) service on the Fermilab Site Globus Gatekeeper (fermigrid1) on Monday October 2, 2006.
SAZ allows Fermilab to make Grid job authorization decisions for the Fermilab site using the DN, VO, Role and CA information contained in the proxy certificate provided by the user.
Fermilab has currently configured SAZ to operate in a default accept mode for user proxy credentials that are associated with VOs (user proxy credentials generated by voms-proxy-init).
Users that continue to use grid-proxy-init may no longer be able to execute jobs on Fermilab Compute Elements.
SAZ Database Table Structure
DN:
- user_name, enabled, trusted, changedAt
VO:
- vo_name, enabled, trusted, changedAt
Role:
- role_name, enabled, trusted, changedAt
CA:
- ca_name, enabled, trusted, changedAt
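The four tables share one shape, so they can be created in a loop. A sketch using Python's sqlite3 module (the sqlite backend and column types are assumptions for illustration; the slide does not name the actual database engine):

```python
import sqlite3

# Sketch of the four SAZ tables described above: each holds a name column,
# an enabled flag, a trusted flag, and a change timestamp.
# SQLite and the column types are assumptions, not the real SAZDB schema.
conn = sqlite3.connect(":memory:")
for table, col in [("user", "user_name"), ("vo", "vo_name"),
                   ("role", "role_name"), ("ca", "ca_name")]:
    conn.execute(
        f'CREATE TABLE "{table}" ({col} TEXT PRIMARY KEY, '
        "enabled TEXT DEFAULT 'Y', trusted TEXT DEFAULT 'F', "
        "changedAt TEXT)"
    )

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```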
SAZ - Site AuthoriZation Pseudo-Code
Site authorization callout on the globus gateway sends a SAZ authorization request (example):
user: /DC=org/DC=doegrids/OU=People/CN=Keith Chadwick 800325
VO:   fermilab
Role: /fermilab/Role=NULL/Capability=NULL
CA:   /DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1
The SAZ server on fermigrid4 receives the SAZ authorization request, and:
1.  Verifies the certificate and trust chain.
2.  if [ the certificate does not verify or the trust chain is invalid ]; then
        SAZ returns "Not-Authorized"
    fi
3.  Issues a select on "user:" against the SAZDB user table.
4.  if [ the select on "user:" fails ]; then
        a record corresponding to the "user:" is inserted into the SAZDB user table with (user.enabled=Y, user.trusted=F)
    fi
5.  Issues a select on "VO:" against the local SAZDB vo table.
6.  if [ the select on "VO:" fails ]; then
        a record corresponding to the "VO:" is inserted into the SAZDB vo table with (vo.enabled=Y, vo.trusted=F)
    fi
7.  Issues a select on "Role:" against the local SAZDB role table.
8.  if [ the select on "Role:" fails ]; then
        a record corresponding to the "Role:" is inserted into the SAZDB role table with (role.enabled=Y, role.trusted=F)
    fi
9.  Issues a select on "CA:" against the local SAZDB ca table.
10. if [ the select on "CA:" fails ]; then
        a record corresponding to the "CA:" is inserted into the SAZDB ca table with (ca.enabled=Y, ca.trusted=F)
    fi
11. The SAZ server then returns the logical AND of (user.enabled, vo.enabled, role.enabled, ca.enabled) to the SAZ client (which was called by either the globus gatekeeper or glexec).
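The steps above can be sketched compactly in Python. An in-memory dict stands in for the SAZDB tables; the real service also verifies the certificate chain before reaching this point (steps 1-2), which is omitted here.

```python
# Sketch of the SAZ decision logic from the pseudo-code above.
# An in-memory dict stands in for the SAZDB tables; certificate/trust-chain
# verification (steps 1-2) is omitted.

tables = {"user": {}, "vo": {}, "role": {}, "ca": {}}

def lookup(table, name):
    """Select the record; on a miss, insert it with enabled=Y, trusted=F."""
    if name not in tables[table]:
        tables[table][name] = {"enabled": "Y", "trusted": "F"}
    return tables[table][name]

def saz_authorize(user, vo, role, ca):
    records = [lookup("user", user), lookup("vo", vo),
               lookup("role", role), lookup("ca", ca)]
    # Step 11: logical AND of the four enabled flags.
    return all(r["enabled"] == "Y" for r in records)

ok = saz_authorize(
    "/DC=org/DC=doegrids/OU=People/CN=Keith Chadwick 800325",
    "fermilab",
    "/fermilab/Role=NULL/Capability=NULL",
    "/DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1")
```

Note how this implements the default-accept mode: unknown names are auto-inserted as enabled, so a request is only rejected after an administrator flips one of the four enabled flags to "N".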
SAZ - Animation
[Diagram: the gatekeeper extracts the DN, VO, Role, and CA from the job's proxy and queries the SAZ server, whose records are maintained by an administrator, before the job is allowed to run.]
SAZ - A Couple of Caveats
What about grid-proxy-init or voms-proxy-init without a VO?
- The "NULL" VO is specifically disabled (vo.enabled="F", vo.trusted="F").
- If a user has user.trusted="Y" in their user record, then:
  - >>> we allow them to execute jobs without VO "sponsorship" <<<.
  - This granting of user.trusted="Y" is not automatic.
  - The number of users with this privilege will be VERY limited.
What about pilot jobs / glide-in operation?
- To comply with the (draft) Fermilab policy on pilot jobs, VOs that submit pilot jobs will shortly be required to use glexec to launch the user portion of their glide-in jobs.
- SAZ authorization requests from glexec may require the VO to have role.trusted="Y" in the VO-specific role record that they are using for glide-in operations.
- The granting of role.trusted="Y" will not be automatic.
Authorization for trusted="Y" flags in the SAZ database tables is granted and revoked by the Fermilab Computer Security Executive based on explicit trust relationships.
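Read this way, the trusted flag adds an escape hatch to the plain AND of enabled flags: a trusted user passes even though the NULL VO record is disabled. A hedged sketch of that combination rule (this is my reading of the slide, not confirmed SAZ source code):

```python
# Hedged sketch of the caveat above: the "NULL" VO record is disabled,
# but a user record with trusted="Y" is allowed through anyway.
# This combination rule is an interpretation of the slide, not SAZ code.

def authorize(user_rec, vo_rec):
    if vo_rec["enabled"] == "Y":
        return True                      # normal VO-sponsored job
    return user_rec["trusted"] == "Y"    # VO disabled: only trusted users pass

null_vo  = {"enabled": "F", "trusted": "F"}  # per the slide: NULL VO disabled
ordinary = {"enabled": "Y", "trusted": "F"}
trusted  = {"enabled": "Y", "trusted": "Y"}
```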
SAZ - Open Issues
Extra /CN=<random number> in DN.
- Examples:
  /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100/CN=1173547087
  /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100/CN=1642479879
  /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100/CN=1769868279
- Result of the user issuing grid-proxy-init.
- Does not occur with voms-proxy-init.
- Looking at code changes to handle the "extra CN problem".
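One plausible shape for such a code change: strip a trailing all-numeric /CN= component from the subject DN before the lookup, recovering the underlying identity DN. This is an illustration, not necessarily the fix the team adopted:

```python
import re

# Sketch for the "extra CN problem": grid-proxy-init appends a numeric
# /CN=<random number> proxy component to the subject DN. Stripping a
# trailing all-digit /CN= component recovers the identity DN.
# Illustrative only; not necessarily the code change the slide refers to.

def strip_proxy_cn(dn):
    return re.sub(r"/CN=\d+$", "", dn)

base = "/DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100"
stripped = strip_proxy_cn(base + "/CN=1173547087")
```

Because the identity CN above ends in digits but is not all digits, the regex anchored to the end of the string leaves the base DN untouched.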
Condor fails to properly delegate the full voms proxy attributes.
- This can be worked around in condor_config by setting:
  DELEGATE_JOB_GSI_CREDENTIALS=FALSE
- A ticket on this issue has been opened with the Condor developers.
Testing by Chris Green and John Weigand shows that Reliable File Transfer (RFT) with WS-GRAM is also failing to properly delegate the full voms attributes:
- RFT uses the full voms proxy for the first transaction, but uses a cached copy without the role information for the second transaction.
- A ticket on this issue has been opened with the Globus developers.
Draft Fermilab VO Trust Relationship Policy
Fermilab will only accept jobs from Virtual Organizations (VOs) which have established trust relationships in good
standing. Trust relationships can be requested by VO management by contacting Fermilab Computer Security, and are
granted and revoked by the Fermilab Computer Security Executive.
Some VOs such as CDF, D0, MINOS, LQCD, already possess a valid trust relationship with Fermilab due to overlap of
staff or the umbrella of Fermilab's own operational and management controls. Other VOs will be expected to establish
the trust relationship as described below in order to continue using Fermilab resources.
Criteria for Establishing Trust Relationships:
- Policies and practices for mutual security are continually adjusted to meet changes in risk perceptions. (NIST)
- Acceptable use of Fermilab resources is governed by both the VO's and Fermilab's Acceptable Use Policies. The Open Science Grid's User AUP (V2.0, February 9, 2006) is an example of an AUP acceptable to Fermilab and applies to users operating under OSG's auspices.
- A VO must describe and operate its technical infrastructure in a transparent manner which permits verification of its functioning.
- A VO must have an operational organization with an appropriate number of staff members who respond to Fermilab requests (email and/or phone calls) within a reasonable time, generally during the normal business hours of its home site.
- A VO must have an established and published response plan to deal with security incidents and reports of unauthorized use, and the staff to implement the plan.
- Non-compliance with site policies by a VO or its members may trigger early or frequent re-examination of the trust relationship with the VO.
Draft Pilot Job Policy
A Pilot Job (also called a glide-in or late-binding job) is a batch job which starts on a grid worker node
but loads some other job, termed the User Job, which has been created by another user.
Rules:
- Pilot Jobs will only be acceptable from VOs whose trust relationships with Fermilab include authorization to use them.
- A Pilot Job must use the site-provided glexec facility to map the application and data files to the actual owner of the User Job. glexec will perform the necessary callout to the Grid User Management System (GUMS) and Site Authorization Service (SAZ), and the Pilot Job must respect the result of these Policy Decision Points.
- A Pilot Job and the User Job will not attempt to circumvent job accounting or limits placed on system resources by the batch system.
- A Pilot Job may launch multiple User Jobs in serial fashion, but must not attempt to maintain data files between jobs belonging to different users.
- When transferring a User Job into the worker node, the Pilot Job will use a level of security equivalent to that of the original job submission process.
Consequences:
- Fermilab reserves the right to terminate any batch jobs that appear to be operating beyond their authorization, including Pilot Jobs and User Jobs not in compliance with this policy.
- The DN of the Job Manager or the entire VO may be placed on the Site Black List until the situation is rectified.
- Fermilab expects any VO authorized to run Pilot Jobs to assure compliance by its users.
glexec
Joint development by David Groep / Gerben Venekamp / Oscar Koeroo
(NIKHEF) and Dan Yocum / Igor Sfiligoi (Fermilab).
Integrated (via “plugins”) with LCAS / LCMAPS infrastructure (for LCG) and
GUMS / SAZ infrastructure (for OSG).
glexec is currently deployed on a couple of small clusters at Fermilab, moving
towards a “significant” deployment at Fermilab this week.
Will be included in Condor 6.9.x.
glexec block diagram
High Availability / Service Redundancy Plans
Gatekeeper:
- Redundant Condor_Master and Condor_Negotiator.
VOMS:
- Sticky problem.
- Have requested a change to VOMRS that will make things much easier.
GUMS:
- Have a test active/standby GUMS service operating with Linux-HA.
- Believe that we know how to implement an active/active service.
SAZ:
- Can implement either active/standby or active/active.
MyProxy:
- Need for MyProxy will be eliminated by the new CEMon-based job forwarding mechanism.
Metrics
In addition to the normal operational effort of installing, running and upgrading the various FermiGrid services over the past year, we have spent significant effort to collect and publish operational metrics. Examples:
- Globus gatekeeper calls by jobmanager per day
- Globus gatekeeper IP connections per day
- VOMS calls per day
- VOMS server IP connections per day
- GUMS calls per day
- GUMS server IP connections per day
- GUMS server unique Certificates and Mappings per day
- SAZ Authorizations and Rejections per day
- SAZ server IP connections per day
- SAZ server unique DN, VO, Role & CA per day
Metrics collection scripts run once a day and collect information for the previous day.
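The per-day counts above boil down to grouping timestamped log lines by date. A minimal sketch of that aggregation (the log format here is invented for illustration; real gatekeeper/VOMS/GUMS/SAZ logs differ, which is exactly the pain point noted in the parting comments):

```python
from collections import Counter

# Sketch of the aggregation the metrics scripts perform: count service
# calls per day from timestamped log lines. The log format is invented
# for illustration; real service logs differ.

log_lines = [
    "2006-10-22 09:14:03 GUMS mapping request",
    "2006-10-22 11:40:51 GUMS mapping request",
    "2006-10-23 08:02:17 GUMS mapping request",
]

# The first whitespace-separated field is the date; count lines per date.
calls_per_day = Counter(line.split()[0] for line in log_lines)
```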
Metrics - fermigrid1
Service Monitoring
Service Monitor scripts run multiple times per day (typically once per hour).
They gather detailed information about the service that they are monitoring.
They also verify the health of the service that they are monitoring (together with any dependent services), notify administrators, and automatically restart the service(s) as necessary to ensure continuous operations.
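The probe/notify/restart loop described above can be sketched as follows; the probe, notifier, and restarter are placeholders standing in for the actual FermiGrid scripts:

```python
# Sketch of the monitor logic described above: probe a service, and if the
# probe fails, notify the administrators and restart the service.
# The probe, notify, and restart callables are placeholders.

def monitor(service, probe, notify, restart):
    """Return True if the service was healthy, False if a restart was needed."""
    if probe(service):
        return True
    notify(f"{service} is down; restarting")
    restart(service)
    return False

# Exercise the failure path with stand-in callables that record events.
events = []
healthy = monitor(
    "gums",
    probe=lambda s: False,                         # simulate an unhealthy service
    notify=events.append,
    restart=lambda s: events.append(f"restarted {s}"),
)
```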
Service Monitor - fermigrid1
Areas of Current Work within FermiGrid
SAZ and glexec - nearing completion.
BlueArc storage and public dcache storage element - ongoing.
Further Metrics and Service Monitor Development - ongoing.
Gratia Accounting.
Web Services.
XEN.
Service Failover.
Research, Development & Deployment of future ITBs and OSG releases.
Parting Comments
Extracting metrics and service monitor information needs to be easier - trolling
through (globus gatekeeper, voms, gums, saz) log files is not an efficient
method.
Having a uniform standard time format (and some sort of unique process/thread
id) is essential.
Problem diagnosis is also very difficult (our job forwarding gateway does
compound this problem).
David Bianco from Jefferson Lab gave a presentation on Sguil at the Fall 2006
HEPiX conference. Having a similar common interface for the globus
gatekeepers and services log files together with the ability to correlate events
from multiple sources would significantly improve problem diagnosis.
https://indico.fnal.gov/conferenceDisplay.py?confId=384
https://indico.fnal.gov/materialDisplay.py?contribId=9&sessionId=17&materialId=slides&confId=384
fin
Any questions?