
Globus activities within INFN
Massimo Sgaravatto
INFN Padova
for the INFN Globus group
[email protected]
Globus activities within INFN


WP “Installation and Evaluation of the Globus Toolkit” of the INFN-GRID Project

Goal: evaluate the Globus toolkit as a GRID framework providing basic services
- Which services can be useful?
- What is necessary to integrate/modify?
- What is missing?

Duration: 6 months

Results of this first evaluation will be used to plan future activities
Tasks

- Security
- Information Service
- Resource Manager
- Globus deployment
- Data Access and Migration
- Fault Monitoring
- Execution Environment Management
Status

Globus installed on ~35 machines in 11 sites

[Map of Italy showing the INFN sites, from Trento in the north to Catania/LNS in the south]
Security (GSI)

Already done:
- Evaluation of the Globus security architecture
  - We like the general architecture, but:
    - Granting local "identities" based only on certificate subjects allows the existence of multiple valid certificates for the same subject
    - Authentication library not in sync with OpenSSL development
    - Cryptic diagnostics (e.g. "certificate chain too long" when the CA policy check fails)
- Globus certificates (for hosts and users) signed by the INFN certification authority
Security (GSI)

To do:
- Definition and implementation of an architecture of CAs
  - Up to the task force of the DataGrid project
- Make certificate requests easier
- Periodic update of CRLs
- "Management" of grid-mapfile updates
  - E.g.: a certain Globus resource must be available to all members of a specific physics group (see the sketch below)
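As an illustration, the grid-mapfile is a plain text file mapping certificate subjects to local accounts; keeping such entries in sync for a whole physics group is what needs automating. The subjects and account names below are made up:

    # /etc/grid-security/grid-mapfile (illustrative entries only)
    "/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Mario Rossi"   cms001
    "/C=IT/O=INFN/OU=Personal Certificate/L=Bologna/CN=Anna Bianchi" cms002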
Information Service (GIS)

Already done:
- INFN MDS server serving Globus 1.1.1 and 1.1.2 installations
  - Lots of problems using the "default" American MDS server
- Definition and implementation of a test architecture of the GIS (for Globus 1.1.3)
- Web interface for browsing
GIS Architecture (test phase)

[Diagram: hierarchical GIS. A Top Level INFN GIIS (dc=infn,dc=it,o=grid), implemented, collects from site GIISes in Bologna (dc=bo,dc=infn,dc=it,o=grid) and Milano (dc=mi,dc=infn,dc=it,o=grid), implemented using the INFNGRID distribution, each fed by local GRISes. An experiment-level INFN ATLAS GIIS (exp=atlas,o=grid) is still to be implemented. An example query is sketched below.]
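Since MDS is plain LDAP, any LDAP client can browse this hierarchy. A hedged sketch (host name, port, and object class are illustrative, not necessarily the actual INFN setup):

    # Query the top-level INFN GIIS for registered jobmanager services
    # (hostname and port are placeholders)
    ldapsearch -h giis.infn.it -p 2135 \
        -b "dc=infn, dc=it, o=grid" \
        "(objectclass=GlobusServiceJobManager)"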
Information Service (GIS)

To do:
- Netscape LDAP server as Top Level INFN GIIS
- Tests on performance and scalability
  - Results used to define and implement the GIS architecture
- Review the information gathered from the various machines and published in the GIS
- Other tools and interfaces for Grid users and administrators
Resource Management (GRAM)

Already done:
- Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit)
- GRAM as uniform interface to different underlying resource management systems (LSF, Condor, PBS)
- Some bugs found and fixed
  - Standard output and error for vanilla Condor jobs
  - globus-job-status
  - …
Some bugs can be solved without major re-design and/or re-implementation:
- For LSF the RSL parameter (count=x) is translated into: bsub -n x …
  - Should be: bsub … repeated x times (see the sketch below)
- …

Two major problems:
- Scalability
- Fault tolerance
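To make the mistranslation concrete, a minimal sketch (the executable path is reused from the example below; the exact bsub command line generated by the jobmanager may differ):

    # Current translation of (count=3): one parallel job asking for 3 processors
    bsub -n 3 /diskCms/startcmsim.sh

    # Intended meaning of (count=3): three independent jobs
    bsub /diskCms/startcmsim.sh
    bsub /diskCms/startcmsim.sh
    bsub /diskCms/startcmsim.sh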
Globus GRAM Architecture

[Diagram: a client machine (pc1) submits to a Globus front-end machine (pc2); a jobmanager on the front end hands the job to the local resource management system (LSF/Condor/PBS/…), which runs it.]

pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz \
    -f file.rsl

file.rsl:
&
(executable=/diskCms/startcmsim.sh)
(stdin=/diskCms/PythiaOut/filename)
(stdout=/diskCms/Cmsim/filename)
(count=1)
Scalability

- One jobmanager for each globusrun
- If I want to submit 1000 jobs ???
  - 1000 globusrun
  - 1000 jobmanagers running on the front-end machine !!!

%globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl

file.rsl:
&
(executable=/diskCms/startcmsim.sh)
(stdin=/diskCms/PythiaOut/filename)
(stdout=/diskCms/CmsimOut/filename)
(count=1000)

- It is not possible to specify in the RSL file 1000 different input files and 1000 different output files …
  - Unlike $(Process) in Condor (see the sketch below)
- Problems with job monitoring (globus-job-status)
- Therefore (count=x) with x>1 is not very useful!
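For contrast, a minimal Condor submit file sketch showing how $(Process) gives each of the 1000 jobs its own input and output files (file names are illustrative):

    # Condor submit description file: $(Process) expands to 0..999,
    # one value per queued job
    executable = /diskCms/startcmsim.sh
    input      = /diskCms/PythiaOut/filename.$(Process)
    output     = /diskCms/CmsimOut/filename.$(Process)
    queue 1000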
Fault tolerance

- The jobmanager is not persistent
- If the jobmanager can't be contacted, Globus assumes that the job(s) have been completed

Example of problem:
- Submission of n jobs on a cluster managed by a local resource management system
- Reboot of the front-end machine
- The jobmanager(s) don't restart
  - Orphan jobs: Globus assumes that the jobs have been successfully completed
Resource Management (GRAM)

Already done:
- Submission of Condor jobs to Globus resources (Condor-G and GlideIn mechanisms)
- Evaluation of RSL as uniform language to specify resources
  - The RSL syntax model seems suitable to define even complicated resource specification expressions
  - The common set of RSL attributes is often not sufficient
    - The attributes not belonging to the common set are ignored
  - More flexibility is required
    - Resource administrators should be allowed to define new attributes, and users should be allowed to use them in resource specification expressions (Condor Class-Ads model)
  - Using the same language to describe the offered resources and the requested resources (Condor Class-Ads model) seems a better approach (see the sketch below)
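A rough illustration of the Class-Ads idea, with made-up attribute values: jobs and machines advertise themselves in the same attribute/expression language, and each side can reference the other's attributes, so matching is symmetric.

    # Hypothetical Condor Class-Ad for a job request; a machine offer
    # is written in the same language and can constrain the job in turn
    [
      Type         = "Job";
      Owner        = "cms001";
      Requirements = other.Arch == "INTEL" && other.OpSys == "LINUX"
                     && other.Memory >= 256;
      Rank         = other.Mips;
    ]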
Resource Management (GRAM)

Already done:
- "Cooperation" between GRAM and GIS
  - The information on characteristics and status of local resources and on jobs is not enough
    - As local resources we must consider farms, not the single workstations
    - Other information (e.g. total and available CPU power) needed
  - The default schema must be integrated with other info provided by the underlying resource management systems or by specific agents
GRAM & Condor & GIS
GRAM & LSF & GIS
Must be fixed
Jobs & GIS

Info on Globus jobs published in the GIS:

User






Subject of certificate
Local user name
RSL string
Globus job id
LSF/Condor/… job id
Status: Run/Pending/…
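A hedged LDIF-style sketch of what such a job entry might look like; the attribute names and values here are invented for illustration, not the actual MDS schema:

    # Hypothetical job entry in the GIS (attribute names are made up)
    dn: jobid=12345, dc=pd, dc=infn, dc=it, o=grid
    objectclass: GlobusJob
    subject: /C=IT/O=INFN/OU=Personal Certificate/CN=Mario Rossi
    localuser: cms001
    rsl: &(executable=/diskCms/startcmsim.sh)(count=1)
    lsfjobid: 6789
    status: Run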
Resource Management (GRAM)

To do:
- Tests with the GRAM API
- Tests with real applications and real environments (CMS fall production)
  - Already started
    - Memory leak in the jobmanager ?!?!?!?
- Solve the problems
- Identify a set of useful attributes of a Condor pool, LSF cluster, or PBS cluster that should be reported to the GIS, and integrate the default schema
  - Let's start with information provided by the underlying resource management system
  - Second step: specific agents
Globus deployment

Tools to enable local administrators to deploy the GRID software (now Globus 1.1.3 and related packages: OpenLDAP, …)
- Reduce the complexity and manpower necessary for installation
- Decrease errors during installations
- Collect bug fixes
- Include INFN customizations
  - Certificates (for hosts and users) signed by the INFN CA
    - … but user certificates signed by the Globus CA are accepted as well
  - Preliminary architecture for GIS
First step (July 2000)

- Software distribution available on AFS
  - Fixes for bugs found during the first Globus evaluations included
- INFNGRID installation guide
  - Instructions for INFN customizations included
- Scripts to make certain steps (e.g. post-install operations) automatic
Second step (now)

- Pre-compiled distribution (available now for Linux Red Hat 6.1): INFNGRID 1.1
- Script for installation and deployment: infngrid-install
  - Users decide whether to use the INFN customizations or the "standard" setup

    Would you like the INFN setup (Y/N) ?
    (1) Copy INFNGRID tar files from /afs/infn.it/project/infngrid/1.1/Linux to download dir
    (2) Decompress and untar INFNGRID distribution files in install dir
    (3) Configure INFNGRID software
    (4) Globus Setup
    (5) Configure GRAM services (Condor and LSF)
    (6) Globus local deploy
    (7) GIIS Configuration
Second step

- Script for post-install operations: globus-root-setup

    (1) Modify system files and reactivate the inetd daemon
    (2) Change owner to root of certain files for tighter security
    (3) Modify system-wide login files
    (4) Start/restart Globus now
    (5) Configure gsi-wuftpd and restart the inetd daemon

- Installation instructions for special environments (configuration of client machines, shared install-directory) included
- List of included bug fixes
Status

- Tests performed in different environments (INFN, CERN, FNAL)
- "Officially" released
- Available to DATAGRID partners
Next steps

- Configuration of PBS as local resource management system: 1.2
- Support for Solaris 2.6: 1.2
  - We don't plan (at least for now) to support other platforms
- Improvement of the current non-pre-compiled distribution
- Possible use of the infngrid-install script for both the pre-compiled and the non-pre-compiled distribution
- "Unattended" installation
- Management of updates
- Inclusion of GDMP: 1.2
- Inclusion of other GRID software packages ??
- Other work will be "triggered" by local administrators and users
Data Management

Already done:
- Preliminary tests with GASS and gsiftp (a sample transfer is sketched below)

To do:
- Tests with GlobusFTP and the Replica Catalog software (Globus Data Grid Alpha Release 2)
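For reference, the kind of transfer exercised in the gsiftp tests; a minimal sketch, assuming the standard globus-url-copy client and made-up host and paths:

    # Copy a file from a gsiftp server to local disk
    # (host and paths are illustrative)
    globus-url-copy gsiftp://pc2.pd.infn.it/diskCms/CmsimOut/filename \
        file:///tmp/filename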
GARA

[Diagram: test setup with an FE client (sunlab2) and an FE server (sunlab3) connected through CISCO 7200 and CISCO 7500 routers over a 100 Mbps VC, driven through the GARA API and the GARA network resource manager.]

- Preliminary tests considering both network and CPU advance reservation
Other tasks

Fault Monitoring (HBM)
- Evaluation of HBM for fault detection (for "system" and "user" processes)
- Data collectors (implementing automatic recovery mechanisms)
- … but the HBM package is not seeing active development

Execution Environment Management (GEM)
- Evaluation of GEM as service for code migration
- … but the GEM service now provides only limited capabilities (executable staging)
Other info

http://www.pd.infn.it/~sgaravat/INFN-GRID/Globus