Transcript Slide 1
Analysis Model (A. Myagkov, IHEP)

Derived Physics Data
A DPD is defined as a set of data which is a subset of the ESD or AOD content, with the possible addition of analysis data, where analysis data are quantities derived from data in the ESD or AOD. The importance of the DPD was reinforced by the experience leading up to the Rome physics workshop, when analysis directly on the AOD in Athena was found to be too slow. The large content of the AOD was also regarded as not suitable for most analyses.

Why DPD
• Faster
  – Repeated calculations are done once and stored
  – Can be faster if 'lighter'
• Smaller
  – Store just what you want
  – Store with the precision you need
• More portable
  – Partly because it is smaller
  – Partly because work needing external services is already done and stored
• Other
  – A DPD may span streams
(RWL Jones, 14 Sept 2007, CERN)

Computing Model (AMF, Ian Hinchliffe, 11/28/07)
• Tier 1 and Tier 2 are a common resource available to the entire collaboration.
• Tier 1 cloud
  – Not all Tier 1s are the same size
  – 10% of RAW on disk, the rest on tape
  – Two full copies of the current ESD on disk in the entire cloud
  – A full AOD/TAG at each Tier 1
  – Access is scheduled, through ANALYSIS and PHYSICS groups, and for production
• Tier 2 cloud: approximately 30 sites, with large variations in size
  – Some of the ESD and RAW
  – In 2008: 30% of RAW and 150% of ESD in the Tier 2 cloud
  – In 2009 and after: 10% of RAW and 30% of ESD in the Tier 2 cloud
  – This will largely be 'pre-placed' in early running
  – Recall of small samples through group production at Tier 1
  – 10 copies of the full AOD on disk
  – User data (mainly in scratch)
  – Additional access to ESD and RAW in the CAF: 1/18 of RAW and 10% of ESD
• Note that data are streamed by trigger object
  – Of order 6 streams; there might be fewer (or more) copies of some streams depending on demand

Three different ways of reducing/refining data have been identified (a purely illustrative sketch of all three follows below):
◮ Skimming: selecting only interesting events based on some event-level quantity
  ⋆ number of electrons
  ⋆ missing ET
◮ Thinning: selecting only interesting objects from a container
  ⋆ keep all electrons with pT > 40 GeV
◮ Slimming: selecting only interesting properties of an object
  ⋆ drop some of the calorimeter information from an electron

Skimming: selecting only interesting events based on some event-level quantity
The ATLAS TAG database is intended to be an event-level metadata system. TAGs are intended to support efficient identification and selection of events of interest for a given analysis.
◮ Rather straightforward creation from an AOD file: see RecExCommon/aodtotag.py
◮ Selection of events with simple queries at the jobOptions level:
  EventSelector.InputCollections = ["tag1.pool"]
  EventSelector.Query = "NElectrons>0 && abs(ElectronEta[0])<2.5"
Wiki: Atlas/PhysicsAnalysisWorkBookTAGAnalysis
Wiki: Atlas/TagForEventSelection
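The three reduction operations can be pictured with a short, purely illustrative Python sketch. The Electron class, the event dictionaries and the cut values below are invented for this example; they are not the ATLAS EDM or any Athena tool.

# Illustrative sketch of skimming, thinning and slimming on a toy event record.
# The Electron class and containers are invented for this example and do not
# correspond to the ATLAS EDM or any Athena interface.

class Electron:
    def __init__(self, pt, eta, calo_cells):
        self.pt = pt                    # transverse momentum in GeV
        self.eta = eta
        self.calo_cells = calo_cells    # detailed calorimeter information

def skim(events):
    """Skimming: keep only interesting events (event-level selection)."""
    return [ev for ev in events
            if len(ev["electrons"]) > 0 and ev["missing_et"] > 20.0]

def thin(event):
    """Thinning: keep only interesting objects from a container."""
    event["electrons"] = [el for el in event["electrons"] if el.pt > 40.0]
    return event

def slim(event):
    """Slimming: keep only interesting properties of each object."""
    for el in event["electrons"]:
        el.calo_cells = None            # drop detailed calorimeter information
    return event

def reduce_to_dpd(events):
    """A DPD-like reduced sample is the composition of the three steps."""
    return [slim(thin(ev)) for ev in skim(events)]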
Main types of data
The three main types of data needed in the analysis model are the AOD (Analysis Object Data), envisioned to have 100 kB/event; the TAG (a small set of event-level metadata) with 1 kB/event; and the DPD (Derived Physics Data) with an average size of 10 kB/event, suitable for the final analysis. The storage format of the AOD and ESD is POOL based.

AOD analysis: the AOD is accessed with Athena, locally or on the grid, interactively or in batch mode, and the output of the analysis in the form of histograms is stored. Advantages of this approach are the availability of all Athena tools and services, with access to conditions data and all information in the AOD. Disadvantages are the longer development cycle (code, compile, debug, run) inside the Athena framework compared to other options. Here the DPD (the histograms or a small Ntuple) contains derived data only.

CBNT: a flat Ntuple (CBNT) is derived from the AOD, with largely the same content as the AOD, and then analyzed using ROOT (a minimal ROOT-based loop of this kind is sketched at the end of this section). Advantages are the portability of the ROOT Ntuples and the speed of the development cycle for the analysis in ROOT. Disadvantages are that the ROOT-based code is usually written in a manner that is not portable back to Athena, the duplication of data between the AOD and the Ntuple, and the need to re-run Athena to produce new Ntuples in the event of bugs.

SAN: a structured Ntuple (SAN) is created from the AOD, with largely the same content as the AOD, and analyzed using ROOT. Compared to the CBNT approach this has the additional advantage that the SAN contains objects with access methods similar to those of the native objects in the AOD, but also the additional disadvantage of having to maintain and validate an additional set of classes mirroring the Athena-based ones.

EventView-based Ntuples: flat Ntuples produced with EventView (an analysis framework within Athena) and later analyzed with ROOT. This is a mixed approach, with major parts of the analysis already done in Athena and a refined analysis step on the produced Ntuples. The advantage is that the derived Ntuples are typically smaller than CBNTs or SANs, since they do not contain the full AOD information, but they may carry additional analysis data for the events that are retained. Analysis may be helped by the preparation of several consistent 'views' of the same event. The disadvantages stem from the perceived complexity of the framework and the strict selection criteria applied when the Ntuples are produced.

Analysis Forum in ATLAS
Six open meetings were held, either by phone or in person: http://indico.cern.ch/categoryDisplay.py?categId=1543
Analysis Model Report, draft 3.0, edited by D. Costanzo, I. Hinchliffe, S. Menke

Storage format of DPD
[Rec 1] It is recommended to write out DPDs in the same storage format as AODs and ESDs, thus ensuring portability of the code between data files with different levels of detail.

Storage format of DPD (2)
[Rec 2] If there are use cases for which it is believed that this format cannot be used, either for performance or implementation reasons, the case should be presented to the PAT. Use of any alternative format should not occur unless endorsed by the computing management.
[Rec 2bis] The code used to operate on the last DPD and produce the final histograms used for publication should be publicly available, validated, and simple enough for the analysis reviewer to reproduce the histograms.

Distribution of and access to DnPDs
[Rec 3] DPDs need to be available to all collaborators. Primary DPDs shall be stored on disk at Tier 1/Tier 2. Secondary DPDs, if any, that are used as the basis of physics results shall be exported to Tier 2 and be accessible to all collaborators.

AthenaROOTAccess
[Rec 4] It is recommended to keep and extend AthenaROOTAccess for Ntuple-like analyses on POOL/ROOT files such as ESD, AOD, and DPD. Since conditions data, detector description, magnetic field maps, etc. are not accessible from AthenaROOTAccess, it depends on the analysis whether or not AthenaROOTAccess can be used. Under no circumstances should approximate conditions data be used instead of the conditions data provided inside an Athena job.
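For the CBNT/SAN style of analysis, and more generally for Ntuple-like access to ROOT files, the workflow amounts to an event loop in ROOT. A minimal PyROOT sketch is shown below; the file name, tree name, branch names and units are assumptions made for the illustration and do not correspond to the actual CBNT schema or to the AthenaROOTAccess interface.

import ROOT

# Minimal, generic sketch of a ROOT-based loop over a flat (CBNT-like) Ntuple.
# File name, tree name, branch names and units are assumed for illustration;
# they are not the real CBNT schema nor the AthenaROOTAccess interface.
f = ROOT.TFile.Open("cbnt.root")                 # Ntuple previously produced in Athena
tree = f.Get("CollectionTree")                   # assumed tree name

h_pt = ROOT.TH1F("h_pt", "Leading electron pT;pT [GeV];events", 50, 0.0, 200.0)

for event in tree:                               # PyROOT entry loop
    if event.NElectrons > 0:                     # assumed branch names
        h_pt.Fill(event.ElectronPt[0] / 1000.0)  # assuming pT is stored in MeV

out = ROOT.TFile("histos.root", "RECREATE")      # the derived output here is just histograms
h_pt.Write()
out.Close()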
Code distribution and software infrastructure
The distribution kit should be optimized, and a separate kit should be provided for analysis purposes with a total size of no more than 1 GB. Compilation tools should be improved so that compilation takes less than a few seconds.

Compression of containers
The use of double precision numbers in the persistent representation should be avoided, and floating point numbers should be compressed (an illustrative sketch follows below). The read performance should be of the order of 10 MB/s on an average lxplus machine; an average high-pT analysis should be able to read DPDs from disk at a rate of at least 200 Hz. EDM objects that are candidates for DPDs should have slimming methods, so that users can remove some of their content depending on their analysis requirements.
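As an illustration of what compressing floating point numbers can mean in practice, the sketch below truncates mantissa bits of an IEEE-754 single-precision value. The number of retained bits is an arbitrary choice for the example; this is not the compression scheme actually used by the ATLAS persistency layer.

import struct

def compress_float(x, mantissa_bits=16):
    """Lossy compression sketch: zero the lowest (23 - mantissa_bits) mantissa
    bits of a 32-bit float, making the value cheaper to store and compress.
    The choice of 16 retained bits is arbitrary and purely illustrative."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 -> raw bits
    drop = 23 - mantissa_bits
    bits &= ~((1 << drop) - 1)                            # clear low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

pt = 41.237894                    # some quantity, e.g. a pT in GeV
print(compress_float(pt))         # slightly less precise, but adequate for analysis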
Primary DPD production
[Rec 13] The total data volume allocated to DPDs from the common ATLAS resource should be comparable to that of the AOD. The number of AOD replicas should be reduced to accommodate them. The balance between ESD, AOD and DPD will evolve during the life of the experiment. Initially, while the detector is being understood and the reconstruction software debugged, the usage of DPDs may be limited and more storage should be allocated to ESD and AOD.
[Rec 14] Once the AODs have a significant lifetime, i.e. the reconstruction of complete data sets is stable, DPD production should be expected to occur on the order of once per month. One validated and one "validating" version are expected to exist concurrently on disk. Users who need to retain older versions are expected to copy them to Tier 3. Older versions in common space should be retained for some period on tape.
[Rec 15] A quota system needs to be established to allocate common storage space to the various physics and performance groups. Both the policy to set quotas and the tools to implement them are needed.
[Rec 16] DPD production should be a validated activity occurring within the production system. Each group (or set of groups, if a DPD is shared between groups) must have persons dedicated to preparing and testing the code for DPD making and for submitting the jobs to the production system. This should be recognized as an OTSMU task.
[Rec 17] The number of primary DPDs is expected to be in the range 10-20, fewer in the early stages of the experiment. The data volume of each should therefore be in the range of 5% to 10% of the AOD volume on average. Physics and performance groups are encouraged to share DPDs, particularly in the early stages of the experiment.
[Rec 18] The production may take place in several tasks, each making an individual DPD, or in a single task that makes all of the DPDs in a single pass over the AOD data sets. Performance issues related to data access will need to be considered. However, we recommend that a scheduled-access ("train") model be developed in case it is needed to improve processing time.
[Rec 19] The tools for primary DPD making should be lightweight and transparent.

EventView and analysis framework
The EventView tools and the analysis data EDM should be factorized from the EventView framework and be provided as part of the ATLAS toolkit. Whenever this is not possible, new tools should be provided using EventView as a prototype.
[Rec 21] EventView, as an analysis framework, should be a client of the tools described above.
[Rec 22] The EventView framework should not be used as a tool for primary DPD making (see the previous recommendation on DPD production). It could be used by individuals or small groups for making DnPDs, perhaps in a flat Ntuple.

Validation
[Rec 23] When DnPDs are used, results must be reproducible from a primary DPD or AOD using code that runs in Athena or AthenaROOTAccess.
[Rec 24] EventView should not be used as a validation tool for other packages.
[Rec 25] A strategy to validate EventView needs to be identified.

Conclusion
Interesting times will start very soon and we have to be ready.

T2 Data on Disk
• ~30 Tier 2 sites of very, very different size contain:
• Some of the ESD and RAW
  – In 2007: 10% of RAW and 30% of ESD in the Tier 2 cloud
  – In 2008: 30% of RAW and 150% of ESD in the Tier 2 cloud
  – In 2009 and after: 10% of RAW and 30% of ESD in the Tier 2 cloud
  – This will largely be 'pre-placed' in early running
  – Recall of small samples through group production at Tier 1
• Additional access to ESD and RAW in the CAF: 1/18 of RAW and 10% of ESD
• 10 copies of the full AOD on disk
• A full set of official group DPDs (production area)
• Lots of small group DPDs (in production area)
• User data (in 'SCR$MONTH')
• Access is 'on demand'
[Chart: Tier 2 disk share, 2008. Legend: Raw General, ESD (curr.), AOD, TAG, RAW Sim, ESD Sim (curr.), AOD Sim, Tag Sim, User Group, User Data]
(RWL Jones, 14 Sept 2007, CERN)