
PROOF and Condor
Fons Rademakers
http://root.cern.ch
December, 2003
ACAT'03
PROOF – Parallel ROOT Facility

- Collaboration between the core ROOT group at CERN and the MIT Heavy Ion Group
- Part of, and based on, the ROOT framework
  - Makes heavy use of ROOT networking and other infrastructure classes
Main Motivation

- Design a system for the interactive analysis of very large sets of ROOT data files on a cluster of computers
- The main idea is to speed up query processing by employing parallelism
- In the Grid context, this model will be extended from a local cluster to a wide-area “virtual cluster”. The emphasis in that case is not so much on interactive response as on transparency
- With a single query, a user can analyze a globally distributed data set and get back a “single” result
- The main design goals are: transparency, scalability and adaptability
Parallel Chain Analysis

#proof.conf
slave node1
slave node2
slave node3
slave node4

[Diagram: a local PC running root is connected to a remote PROOF cluster whose slave nodes (node1-node4) are listed in proof.conf. The PROOF master server distributes the analysis macro ana.C to the slave servers, each of which reads its local *.root files through TFile (or remote files through TNetFile) and returns its output (stdout/objects) to the client. In the session below, the tree is first processed locally, then a PROOF session is opened and the chain is processed in parallel on the cluster.]

$ root
root [0] tree.Process("ana.C")
root [1] gROOT->Proof("remote")
root [2] chain.Process("ana.C")
PROOF - Architecture

- Data Access Strategies
  - Local data first; also rootd, rfio, dCache, SAN/NAS
- Transparency
  - Input objects copied from the client
  - Output objects merged and returned to the client
- Scalability and Adaptability
  - Vary the packet size (specific workload, slave performance, dynamic load)
  - Heterogeneous servers
  - Support for multi-site configurations
Workflow For Tree Analysis – Pull Architecture

[Sequence diagram: the master's packet generator and N slaves. Each slave is started with Process("ana.C") and, after initialization, repeatedly calls GetNextPacket() on the master, receiving an (entry offset, number of entries) range such as 0,100 or 200,100, with smaller packets (300,40; 440,50; 590,60) handed out when a slave's performance or load warrants it. When the data set is exhausted, each slave returns its results with SendObject(histo) and waits for the next command; the master adds the histograms and displays them.]
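To make the pull model concrete, here is a minimal, hypothetical C++ sketch of a packet generator of the kind the master runs; the class and variable names are invented, and PROOF's real packetizer also takes data locality and per-slave performance into account.

#include <cstdio>

// Hypothetical packet generator, not the actual PROOF class: it hands out
// (first entry, number of entries) ranges until the data set is exhausted.
class PacketGenerator {
public:
   PacketGenerator(long total, long packetSize)
      : fNext(0), fTotal(total), fSize(packetSize) {}

   // Called on behalf of a slave when it is ready for more work.
   // Returns false once every entry has been handed out.
   bool GetNextPacket(long &first, long &n) {
      if (fNext >= fTotal) return false;
      first = fNext;
      n     = (fTotal - fNext < fSize) ? (fTotal - fNext) : fSize;
      fNext += n;
      return true;
   }

private:
   long fNext, fTotal, fSize;
};

int main() {
   PacketGenerator gen(1000, 100);   // 1000 entries, packets of 100
   long first, n;
   // A single "slave" loop; in PROOF every slave runs this loop in
   // parallel, processing its entries between GetNextPacket() calls.
   while (gen.GetNextPacket(first, n))
      std::printf("process entries %ld-%ld\n", first, first + n - 1);
   return 0;
}

Because slaves only ask for work when they have finished their previous packet, faster slaves automatically end up processing more packets, which is what the varying packet ranges in the diagram illustrate.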
Data Access Strategies

- Each slave gets assigned, as much as possible, packets representing data in local files (see the sketch below)
- If there is no (more) local data, remote data is read via rootd, rfiod or dCache (needs a good LAN, e.g. Gigabit Ethernet)
- In the case of SAN/NAS, a simple round-robin strategy is used
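Below is a minimal sketch of the local-first assignment idea, with invented types and names (PROOF's actual packetizer is more elaborate): packets whose file lives on the requesting slave's host are handed out first, and only then any remaining remote packets.

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical bookkeeping for one packet: which host holds the file,
// which entries to process, and whether it was already handed out.
struct Packet {
   std::string host;
   long        first;
   long        n;
   bool        assigned;
};

// Pick the next packet for a given slave: local data first, then any
// remaining remote packet (served via rootd, rfiod or dCache).
Packet *NextPacketFor(std::vector<Packet> &packets, const std::string &slaveHost) {
   for (auto &p : packets)
      if (!p.assigned && p.host == slaveHost) { p.assigned = true; return &p; }
   for (auto &p : packets)
      if (!p.assigned) { p.assigned = true; return &p; }
   return nullptr;   // nothing left to process
}

int main() {
   std::vector<Packet> packets = {{"node1", 0, 100, false},
                                  {"node2", 100, 100, false},
                                  {"node1", 200, 100, false}};
   // node1 drains its two local packets first, then takes node2's remotely.
   while (Packet *p = NextPacketFor(packets, "node1"))
      std::printf("node1 processes %ld-%ld from %s\n",
                  p->first, p->first + p->n - 1, p->host.c_str());
   return 0;
}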
Additional Issues

- Error handling
  - Death of the master and/or slaves
  - Ctrl-C interrupt
- Authentication
  - Globus, ssh, Kerberos 5, SRP, clear password, uid/gid matching
- Sandbox and package manager
  - Remote user environment
Running a PROOF Job

- Specify a collection of TTrees or files with objects
  - Returned by a DB or File Catalog query, etc.
  - Use logical filenames ("lfn:…")

root[0] gROOT->Proof("cluster.cern.ch");
root[1] TDSet *set = new TDSet("TTree", "AOD");
root[2] set->AddQuery("lfn:/alice/simulation/2003-04", "V0.6*.root");
…
root[10] set->Print("a");
root[11] set->Process("mySelector.C");
The Selector

- Basic ROOT TSelector
  - Created via TTree::MakeSelector() (a concrete sketch follows the abbreviated interface below)

// Abbreviated version
class TSelector : public TObject {
protected:
   TList *fInput;
   TList *fOutput;
public:
   void   Init(TTree*);
   void   Begin(TTree*);
   void   SlaveBegin(TTree*);
   Bool_t Process(int entry);
   void   SlaveTerminate();
   void   Terminate();
};
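As an illustration, here is a hedged sketch of what a concrete user selector might look like; the class name, histogram and its binning are invented, and the sketch only follows the abbreviated interface shown above. Histograms created in SlaveBegin() and registered in fOutput are merged by the master and returned to the client.

#include "TSelector.h"
#include "TH1F.h"

// Hypothetical user selector, normally generated with TTree::MakeSelector()
// and then filled in by the user.
class MySelector : public TSelector {
public:
   TH1F *fHist;                     // one instance per slave

   MySelector() : fHist(0) {}

   void SlaveBegin(TTree *) {
      // Runs once on every slave: book the histogram and register it
      // in fOutput so PROOF can merge it and ship it back.
      fHist = new TH1F("hist", "example distribution", 100, 0., 10.);
      fOutput->Add(fHist);
   }

   Bool_t Process(int entry) {
      // Runs on a slave for every entry of the packets it was assigned;
      // read the entry from the tree here and fill fHist.
      return kTRUE;
   }

   void Terminate() {
      // Runs on the client after the merged fOutput has been returned,
      // e.g. to draw the summed histogram.
   }
};

Begin() and Terminate() run on the client, while SlaveBegin(), Process() and SlaveTerminate() run on the slaves.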
PROOF Scalability

8.8 GB in 128 files, 9 million events
- 1 node: 325 s
- 32 nodes in parallel: 12 s (a speedup of roughly 27x)

Test setup: 32 nodes, each with dual 1 GHz Itanium II CPUs, 2 GB RAM, 2x75 GB 15K SCSI disks and Fast Ethernet. Each node holds 4 files (277 MB) of the data set; over 32 nodes this gives the 8.8 GB in 128 files.
PROOF and Data Grids

- Many services are a good fit
  - Authentication
  - File catalog, replication services
  - Resource brokers
  - Job schedulers
  - Monitoring
- Use abstract interfaces
The Condor Batch System

- Full-featured batch system
  - Job queuing, scheduling policy, priority scheme, resource monitoring and management
- Flexible, distributed architecture
  - Dedicated clusters and/or idle desktops
  - Transparent I/O and file transfer
- Based on 15 years of advanced research
  - Platform for ongoing CS research
  - Production quality, in use around the world, pools with 100s to 1000s of nodes
- See: http://www.cs.wisc.edu/condor
COD - Computing On Demand

- Active, ongoing research and development
- Share batch resources with interactive use
  - Most of the time: normal Condor batch use
  - An interactive job “borrows” the resource for a short time
  - Integrated into the Condor infrastructure
- Benefits
  - Large amount of resources for interactive bursts
  - Efficient use of resources (100% utilization)
COD - Operations

[Diagram: the COD claim life cycle on a batch node, stepping through Normal batch, Request claim, Activate claim, Suspend claim, Resume, Deactivate and Release, and showing at each step whether the node is running the batch job, the COD job, or holding both (the batch job suspended while the COD job runs).]
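To spell out the claim life cycle, here is a small C++ sketch of the state machine implied by the operations above; the state names and transitions are my reading of the diagram, not code taken from Condor.

#include <cstdio>
#include <cstring>

// States of a batch node with respect to a COD claim (inferred, illustrative).
enum NodeState {
   kBatchRunning,   // no COD claim, the batch job runs
   kClaimIdle,      // claim requested but not active, the batch job still runs
   kCodRunning,     // claim activated: COD job runs, batch job suspended
   kCodSuspended    // claim suspended: the batch job runs again
};

static const char *kStateNames[] = {
   "batch running", "claim idle (batch running)",
   "COD running (batch suspended)", "COD suspended (batch running)"};

// Apply one of the operations from the slide to the current state.
NodeState Apply(NodeState s, const char *op) {
   if (!std::strcmp(op, "request")    && s == kBatchRunning) return kClaimIdle;
   if (!std::strcmp(op, "activate")   && s == kClaimIdle)    return kCodRunning;
   if (!std::strcmp(op, "suspend")    && s == kCodRunning)   return kCodSuspended;
   if (!std::strcmp(op, "resume")     && s == kCodSuspended) return kCodRunning;
   if (!std::strcmp(op, "deactivate") && (s == kCodRunning || s == kCodSuspended))
      return kClaimIdle;
   if (!std::strcmp(op, "release"))                          return kBatchRunning;
   return s;   // ignore operations that do not apply in the current state
}

int main() {
   const char *ops[] = {"request", "activate", "suspend",
                        "resume", "deactivate", "release"};
   NodeState s = kBatchRunning;
   for (const char *op : ops) {
      s = Apply(s, op);
      std::printf("%-10s -> %s\n", op, kStateNames[s]);
   }
   return 0;
}

This is how PROOF uses COD later in the talk: the master activates or resumes its slaves' claims around a query and suspends them again afterwards, so the batch work can get the node back in between.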
PROOF and COD

- Integrate PROOF and Condor COD
  - Great cooperation with the Condor team
- Master starts slaves as COD jobs
  - Standard connection from master to slave
  - Master resumes and suspends slaves as needed around queries
- Use Condor or an external resource manager to allocate nodes (vm’s)
PROOF and COD

[Diagram: a client connects to the PROOF master, which drives PROOF slaves started as COD jobs on Condor nodes; the same nodes otherwise run Condor batch work.]
PROOF and COD Status

- Status
  - Basic implementation finished
  - Successfully demonstrated at SC’03 with 45 slaves as part of PEAC
- TODO
  - Further improve the interface between PROOF and COD
  - Implement resource accounting
PEAC – PROOF Enabled Analysis Cluster

- Complete event analysis solution
  - Data catalog and data management
  - Resource broker
  - PROOF
- Components used: SAM catalog, dCache, new global resource broker, Condor+COD, PROOF
- Multiple computing sites with independent storage systems
PEAC System Overview
PEAC Status

- Successful demo at SC’03
  - Four sites, up to 25 nodes
  - Real CDF StNtuple-based analysis
  - COD tested with 45 slaves
- Doing a post mortem and planning the next design and implementation phases
  - Available manpower will determine the time line
  - Plan to use a 250-node cluster at FNAL
  - Another cluster at UCSD
Conclusions

- PROOF is maturing
- A lot of interest from experiments with large data sets
- COD is essential for sharing batch and interactive work on the same cluster
  - Maximizes resource utilization
- PROOF turns out to be a powerful application to use and showcase Grid middleware to its full extent
  - See tomorrow’s talk by Andreas Peters on PROOF and AliEn