An SQL API for Object Oriented Perl

Download Report

Transcript An SQL API for Object Oriented Perl

An SQL API for
Object Oriented Perl
Tim Adye
BaBar Collaboration
Particle Physics Department
eScience Centre informal seminar
12th November 2004
12th November 2004
Tim Adye
1
Talk Plan
• Summary of BaBar and its bookkeeping system
• “large-scale distributed metadata catalogue” in Gridspeak
• The problem
• “Why not just write SQL?”
• The BaBar SQL API
•
•
•
•
User view: Perl classes and command-line tool
Behind the scenes
Table schema configuration classes
Summary of features
• Could this be generalised to other applications?
• Possible improvements
• Comparison with other DBIx packages
• Summary and references
12th November 2004
Tim Adye
2
Feedback and Discussion
• Would this be useful outside BaBar?
• Would it be worth making a public release?
• eg. on CPAN
• Does it need any improvements?
• New features
• Make it compatible with some other standards
• eg. sit on top of another abstraction like DBIx::Table
• A better name!
• or have I just reinvented the wheel?
• If so, maybe I should import some classic wheel
technology into BaBar, rather than try to export my own
“square wheel” idea to the rest of the world
12th November 2004
Tim Adye
3
The BaBar Collaboration
10 Countries
12th November 2004
77 Institutions
Tim Adye
593 Physicists
4
The BaBar Detector at SLAC
Linear Accelerator
26th May 1999: first events recorded by BaBar
2.2 km PEP-II ring
BaBar
The BaBar Bookkeeping
• New bookkeeping system created in parallel with
new data model
• Switch from OO database (Objectivity) to flat files
(ROOT)
• Much simpler to manage, but no longer have the
top-level Objectivity catalogue
• Actually the catalogue didn’t really help us, so random set
of bookkeeping tools had been build up
• The main aim of the project was to create a coherent
bookkeeping system to replace these ad hoc tools
• Successfully deployed Feb 2004. In use ever since.
• Started development in Jan 2003
• beta test in Autumn
12th November 2004
Tim Adye
6
Bookkeeping Concepts
• Currently BaBar analysis data consists of
•
•
•
•
43G events (event = results from one e+e- collision)
in 335k files
total size of 186 TB
156M rows in largest table
• Expect all to grow geometrically over the next few years
• Data grouped in an overlapping hierarchy
• many-to-many relations between tables, eg.
•
•
•
•
1.7M runs: unit of datataking
219k collections: unit of access to event store
335k files: each collection stored in a few physical files
18k datasets: groups of data that analysis users want
• Need to select runs, collections, etc based on attributes
from any of these tables
12th November 2004
Tim Adye
7
8
Features of the BaBar Bookkeeping
• Decoupled from event store
• Job configuration written to site-independent tcl file
• Worker nodes don’t need access to database server
• Mirrored to other sites
• From SLAC to 4 “Tier A”, ~15 “Tier C”, user laptops
• Supports Oracle and MySQL (mirror OracleMySQL)
• Remote access
• Used for mirroring and for sites who don’t want local db
• Connection parameter/key distribution system
• Eg. BookkeepingCommand –-site=slac or –-site=ral
transparently connects to correct Oracle server at SLAC or
MySQL at RAL.
• Allows fallback servers if there are problems
12th November 2004
Tim Adye
9
Other Features of the Bookkeeping
• Import of data to each site managed by local
bookkeeping system
• Each site knows what it has and what it wants
• New data imported automatically in cron job
• Task management system
• Controls job submission, checking, resubmission
• Currently just for “production jobs”, but soon for user
analysis jobs too
• Looking at interfacing this to LCG job submission
• Already use LCG for UK Monte Carlo production
12th November 2004
Tim Adye
10
User Access
• Users need to query database to find out what
data to process
• May also need other information
• eg. luminosity, run numbers, file sizes
• Mostly select by dataset, but may need to limit further
• eg. only data available locally, taken at “peak” energy,
excluding some problem datataking period
• Cannot expect users to know which tables to use,
how to join them, or even the SQL syntax
• Even worse if the schema change
• Cannot expect developers to code for all
combinations of queries with all possible selections
• Previous ad hoc tools (some mine!) tried to do this and it
was a nightmare, even for a simpler table structure
12th November 2004
Tim Adye
11
BaBar SQL API – user view
• Each column that users might want to query or
select on is given a unique logical name – regardless
of which table it lives in
• These names are used to specify query values
$query->addValues('collection', 'gbytes');
and selections
Each of these
happens to be in
a different table
$query->addSelector('dataset', 'Dilepton-*');
$query->addSelector('run', '10000-19999');
• Different types of data allow for different selection
syntax, eg. wildcards for names, or ranges for run numbers.
• Can also use SQL expressions (in terms of logical names)
$query->addValues('SUM(lumi)/1000');
and sorting, row limits, etc
12th November 2004
Tim Adye
12
SQL API – returning results
• That’s enough to generate a valid SQL SELECT
query. To return the results:my $sta = $query->execute();
while (my $row = $sta->fetch()) {
print $row->gbytes(), $row->collection(), "\n";
}
• The $query object collects the user requests
• $query->execute() returns a “statement accessor” (like a
DBI statement handle).
• $sta iterates over row objects, each of which has accessors
for each query value, gbytes and collection.
• That’s all there is to it!
• After the usual DBI connect, and $query object
instantiation (see later), these statements form a
working program
12th November 2004
Tim Adye
13
Command-Line Tools
• Standard BaBar tools use this API to create job
configuration, create datasets, calculate
luminosities, etc.
• Standard tasks, but optionally allowing additional
selections
• Also provide an “expert” tool that allows access to
full API functionality from the command line
• This has proved very popular, with many “non-experts”
making their own unique queries
12th November 2004
Tim Adye
14
Examples
$ BbkUser --dataset=A0-Run4-OnPeak-R14 \
--is_local=1 --file_status=0 \
dse_lumi events gbytes file
DSE_LUMI
========
1250.3
1250.3
1348.4
1348.4
...
156 rows
EVENTS GBYTES FILE
====== ====== ====================================================
526115
1.6 /store/PRskims/R14/14.4.0d/A0/02/A0_0239.01.root
526115
0.8 /store/PRskims/R14/14.4.0d/A0/02/A0_0239.02HBCA.root
576239
1.6 /store/PRskims/R14/14.4.0d/A0/02/A0_0240.01.root
576239
1.0 /store/PRskims/R14/14.4.0d/A0/02/A0_0240.02HBCA.root
returned
$ BbkUser –-collection-file=coll.lis \
tot_gbytes collection
12th November 2004
Tim Adye
15
What happens behind your back
• The SQL API
• translates the logical names to table columns
• selects the required tables and joins
• including otherwise unused tables required for the joins
• generates and executes a valid SQL SELECT statement
• creates a statement accessor object
• dynamically generates a class for the row objects with
accessors for each query value
12th November 2004
Tim Adye
16
BbkUser --dataset=A0-Run4-OnPeak-R14
--is_local=1 --file_status=0
dse_lumi events gbytes file
Our Example
• That first BbkUser command involved 5 tables
• including one that provides the join between dataset and
collection tables
SELECT dse.lumi_sum AS "dse_lumi",
dse.output_nev AS "events",
file.bytes,
dse.name AS "collection",
file.suffix AS "file_suffix", ds.id AS "ds_id",
dse.id AS "dse_id",
dtd.id AS "dtd_id",
dtd.link_status
FROM bbk_dataset ds, bbk_dsentities dse, data_files dfile,
bbk_files file, bbk_dstodse dtd
WHERE ds.id=dtd.ds_id
The SQL API can even
AND dtd.dse_id=dse.id
pretty-print it like this for you 
AND dse.id=file.dse_id
(What’s shown here is somewhat
AND file.id=dfile.file_id
abbreviated: actual command
AND ds.name='A0-Run4-OnPeak-R14'
includes full database and table
AND dse.is_local='1'
names in case of ambiguities)
AND dfile.status='0';
12th November 2004
Tim Adye
17
18
Table schema configuration classes
• Mapping between logical names and table columns is
defined in the configuration classes.
• One class per table
• Can also define special properties of each column (eg. whether
to allow ranges (“100-199”) for selection).
• Possible joins between tables defined here too
• Use logical column names for join conditions, so one table class
does not need to know about column names in other classes.
• In most cases it’s just a matter of listing logical vs.
column names
• with a little Perl syntactic sugar
• Inheritence of config classes expresses commonalities
• eg. common id and created, and modified columns
12th November 2004
Tim Adye
19
Example Table Configuration
sub table { return 'bbk_files' }
sub tableConfig {
return {
alias => 'file',
columns => [
bytes
uuid
checksum
file_suffix
nfiles
gbytes
tot_gbytes
file_dse_id
file
],
joins => [
dse_id
file_id
],
}}
=>
=>
=>
=>
=>
=>
=>
=>
=>
'bytes‘,
'uuid',
'checksum',
'suffix',
'COUNT(DISTINCT file_id)',
'(bytes/1073741824)',
'SUM(bytes)/1073741824',
{ column => 'dse_id', selectorType => 'range' },
{ valueAction => 'addLfnValue', selectorAction => 'lfnSelector' },
=> 'file_dse_id',
=> 'dfile_file_id',
12th November 2004
Tim Adye
20
Putting it all together
• Configuration classes must be registered with
$query object
my $query = new BbkSqlSelect;
$query->addModules(new MyTableClass($dbi));
but of course it is usually simpler to provide a
$query object pre-registered with all the table
configs as part of a specific API.
12th November 2004
Tim Adye
21
Overriding and Synthetic Columns
• A crutial advantage of this system is that it allows us to
override the default behaviour
• Allows us to hide complexities from user
• Make even complex schema changes transparent to users
• A logical name can refer to
•
•
•
•
ordinary database column name
SQL expression (in terms of database columns, or other logical names)
Perl method to pre- or post-process selection or query value
“synthetic” query value or selection
• can return calculated value or alter behaviour
• Global post-processing
• Can be triggered by value, selection, or table inclusion
• Allows global filtering of returned rows
12th November 2004
Tim Adye
22
What happens behind your back 2
•
We already used some of these features without
noticing!
1. Dataset names can be found in the bbk_dataset or the
bbk_aliases table
•
Requires a check and translation using the alias table
2. Datasets can evolve with time, with collection being
added or removed
•
•
Need to query dataset for any time in the past, or use
tagged dataset alias (like a CVS tag)
Implemented by automatically including date selection in
query, and post-processing returned results to remove
deleted collections
3. File names are made from a collection name + a suffix
•
$query->addSelector('file') splits the file name for the
query and the $row->file() accessor rejoins them
12th November 2004
Tim Adye
23
Features
• Supports Oracle 8 and MySQL 3.23
• Most queries that can be expressed in both these dialects
can be expressed by users via the API – without breaking the
paradigm of a flat namespace
•
•
•
•
•
aggregatation and grouping
sorting and distinct
MySQL’s LIMIT emulated in Oracle
inner and outer joins (generates Oracle or MySQL syntax)
Does not support UNION or subqueries
• Could be added, but not in MySQL 3.23
• Convenience features
• automatic Getopt specification
• query results display formatting and summary table generation
• Configuration class summary table generation
12th November 2004
Tim Adye
24
Limitations
• Assumes tables can be joined in a unique way
• ie. the joins form an acyclic graph
• can still select different joins with explicit switches
• Each column must have its own unique logical name
• This is usually a good thing
• but if the same data is held in different columns, it would
be more efficient to automatically select from tables
that are already included
12th November 2004
Tim Adye
25
A public version
• Current version has a few BaBar-specific pieces
•
•
•
•
BaBar Connection manager – can use DBI directly
BaBar Options manager – can use Getopt directly
BaBar base objects – borrow required methods
BaBar table formatting class – remove functionality or publish
this too
… otherwise just uses standard Perl modules
but with different table configs could be used
elsewhere
• Already do this in BaBar – used for QA and TM databases
• Maybe I’m making some other assumptions that are true of our
database and requirements, but not more generally so. I can’t
think of any.
• Needs a better name!
• This is really an SQL API creator
12th November 2004
Tim Adye
26
Possible improvements
• Tidy up code!
• User and config APIs are OK, but in between it’s pretty ugly
• Separate functionality that can be used on its own
• Already true of the DBI statement accessor class
• More SQL dialects: PostgreSQL, MS SQL?
• New SQL syntax: subqueries, UNIONs,…
• INSERT, UPDATE, etc
• these don’t need joins, so hand-coding not such a problem
• Automatic selection of different join possibilities
• Automatic generation of default table classes from SQL
schema
• Could use “The SQL Fairy”
• though not much work to do it by hand
12th November 2004
Tim Adye
27
Why not use another package?
• More than 100 DBIx and other SQL access packages
in CPAN
• Could not find any that do all (or even most) of
•
•
•
•
•
•
hide table structure from user
allow multi-table queries, taking care of joins automatically
do not impose their own conventions on table schema
allow query values and selections to be overridden
allow transparent post-processing of query results
provide accessor functions for query results
• I believe that taken together these features provide
a clear and easy to use abstraction
12th November 2004
Tim Adye
28
References
• The BaBar Experiment
http://hepunx.rl.ac.uk/BFROOT/
• BaBar Bookkeeping project
http://hepunx.rl.ac.uk/BFROOT/www/Computing/Distribu
ted/Bookkeeping/Documentation/
• BaBar Bookkeeping presentation and paper
http://indico.cern.ch/contributionDisplay.py?contrib
Id=338&sessionId=7&confId=0
D.A. Smith et al.,
BaBar Book Keeping project –
a distributed meta-data catalog of the BaBar event store,
Proc. Computing in High Energy and Nuclear Physics 2004 (CHEP04).
• CPAN Database Interfaces
(see particularly DBIx)
http://cpan.uwinnipeg.ca/chapter/Database_Interfaces
12th November 2004
Tim Adye
29