Transcript Document
PROOF and Condor
Fons Rademakers
http://root.cern.ch
December, 2003
ACAT'03
PROOF – Parallel ROOT Facility
Collaboration between core ROOT group
at CERN and MIT Heavy Ion Group
Part of and based on ROOT framework
Makes heavy use of ROOT networking and
other infrastructure classes
Main Motivation
Design a system for the interactive analysis of very large
sets of ROOT data files on a cluster of computers
The main idea is to speed up the query processing by
employing parallelism
In the GRID context, this model will be extended from a
local cluster to a wide area “virtual cluster”. The
emphasis in that case is not so much on interactive
response as on transparency
With a single query, a user can analyze a globally
distributed data set and get back a “single” result
The main design goals are:
Transparency, scalability, adaptability
Parallel Chain Analysis

#proof.conf
slave node1
slave node2
slave node3
slave node4

$ root
root [0] tree.Process("ana.C")
root [1] gROOT->Proof("remote")
root [2] chain.Process("ana.C")

proof = master server
proof = slave server

[Diagram: the local PC runs a ROOT session (stdout/objects returned to
the user); ana.C is sent to the remote PROOF master on the cluster,
which drives the proof slave servers on node1..node4; each slave reads
its local *.root files via TFile, with TNetFile used for remote file
access.]
PROOF - Architecture
Data Access Strategies
Transparency
Input objects copied from client
Output objects merged, returned to client
Scalability and Adaptability
Local data first, also rootd, rfio, dCache, SAN/NAS
Vary packet size (specific workload, slave
performance, dynamic load)
Heterogeneous Servers
Support for multi-site configurations
Workflow For Tree Analysis –
Pull Architecture

[Sequence diagram: the master plus slaves 1..N; the master runs the
packet generator and slaves pull work rather than having it pushed.]

Master: receives Process("ana.C"), initializes each slave and starts
the packet generator.
Each slave: after initialization, repeatedly calls GetNextPacket()
and processes the (first entry, number of entries) packet it gets
back, e.g. 0,100  100,100  200,100  340,100  490,100  300,40  440,50
590,60; packet sizes vary with slave speed.
When the data is exhausted, each slave returns its output via
SendObject(histo) and waits for the next command.
Master: adds the histograms from all slaves and displays them.
Data Access Strategies
Each slave gets assigned, as much as
possible, packets representing data in
local files
If a slave has no (more) local data, it reads remote
data via rootd, rfiod or dCache (needs a good
LAN, e.g. Gigabit Ethernet)
In the case of SAN/NAS a simple round-robin
strategy is used
Additional Issues
Error handling
  Death of master and/or slaves
  Ctrl-C interrupt
Authentication
  Globus, ssh, kerb5, SRP, clear passwd, uid/gid matching
Sandbox and package manager
  Remote user environment
Running a PROOF Job
Specify a collection of TTrees or files with objects
(returned by a DB or File Catalog query etc.;
use logical filenames, "lfn:…")

root[0] gROOT->Proof("cluster.cern.ch");
root[1] TDSet *set = new TDSet("TTree", "AOD");
root[2] set->AddQuery("lfn:/alice/simulation/2003-04", "V0.6*.root");
…
root[10] set->Print("a");
root[11] set->Process("mySelector.C");
The Selector
Basic ROOT TSelector
Created via TTree::MakeSelector()
// Abbreviated version
class TSelector : public TObject {
protected:
    TList *fInput;    // objects sent from the client to the slaves
    TList *fOutput;   // objects merged and returned to the client
public:
    void   Init(TTree *);
    void   Begin(TTree *);
    void   SlaveBegin(TTree *);
    Bool_t Process(int entry);
    void   SlaveTerminate();
    void   Terminate();
};
PROOF Scalability
Test data set: 8.8 GB in 128 files, 9 million events;
each of the 32 nodes holds one copy of its part of the
data set (4 files, 277 MB in total)
1 node: 325 s
32 nodes in parallel: 12 s (roughly a 27x speedup)
Hardware: 32 nodes, dual 1 GHz Itanium II CPUs,
2 GB RAM, 2x75 GB 15K SCSI disks, Fast Ethernet
PROOF and Data Grids
Many services are a good fit
Authentication
File Catalog, replication services
Resource brokers
Job schedulers
Monitoring
Use abstract interfaces
The Condor Batch System
Full-featured batch system
Flexible, distributed architecture
Dedicated clusters and/or idle desktops
Transparent I/O and file transfer
Based on 15 years of advanced research
Job queuing, scheduling policy, priority scheme,
resource monitoring and management
Platform for ongoing CS research
Production quality, in use around the world, pools
with hundreds to thousands of nodes
See: http://www.cs.wisc.edu/condor
COD - Computing On Demand
Active, ongoing research and development
Share batch resource with interactive use
Most of the time normal Condor batch use
Interactive job “borrows” the resource for
short time
Integrated into Condor infrastructure
Benefits
Large amounts of resources for interactive bursts
Efficient use of resources (100% utilization)
COD - Operations

Claim lifecycle on a batch node:
Normal batch
Request claim
Activate claim
Suspend claim
Resume
Deactivate
Release

[Diagram: each step shows the node running the batch job alone or the
batch job together with the COD job.]
PROOF and COD
Integrate PROOF and Condor COD
Master starts slaves as COD jobs
Great cooperation with Condor team
Standard connection from master to slave
Master resumes and suspends slaves as
needed around queries
Use Condor or an external resource manager
to allocate nodes (VMs)
PROOF and COD

[Diagram: the client connects to the PROOF master; the master starts
slaves as COD jobs on Condor batch nodes, each node running its normal
Condor batch job alongside the COD slave.]
PROOF and COD Status
Status
Basic implementation finished
Successfully demonstrated at SC’03 with 45
slaves as part of PEAC
TODO
Further improve interface between PROOF
and COD
Implement resource accounting
PEAC –
PROOF Enabled Analysis Cluster
Complete event analysis solution
Data catalog and data management
Resource broker
PROOF
Components used: SAM catalog, dCache,
new global resource broker, Condor+COD,
PROOF
Multiple computing sites with independent
storage systems
PEAC System Overview

[Diagram only: overview of the PEAC system.]
PEAC Status
Successful demo at SC’03
Four sites, up to 25 nodes
Real CDF StNtuple based analysis
COD tested with 45 slaves
Doing a post mortem and planning the next
design and implementation phases
Available manpower will determine the timeline
Plan to use 250 node cluster at FNAL
Other cluster at UCSD
Conclusions
PROOF is maturing
Lots of interest from experiments with large data
sets
COD is essential for sharing batch and interactive
work on the same cluster
Maximizes resource utilization
PROOF turns out to be a powerful application for
exercising and showing the power of Grid middleware
to its full extent
See tomorrow's talk by Andreas Peters on PROOF and
AliEn