Transcript Slide 1
Analysis Model (A. Myagkov, IHEP)

Derived Physics Data
A DPD is defined as a set of data which is a subset of the ESD or AOD content, with the possible addition of analysis data, where analysis data are quantities derived from data in the ESD or AOD. The importance of the DPD was reinforced by the experience leading up to the Rome physics workshop, when analysis directly on the AOD in Athena was found to be too slow. The large content of the AOD was also regarded as not suitable for most analyses.

Why DPD
• Faster
  – Repeated calculations are done once and stored
  – Can be faster if 'lighter'
• Smaller
  – Store just what you want
  – Store with the precision you need
• More portable
  – Partly because it is smaller
  – Partly because work needing external services is already done and stored
• Other
  – A DPD may span streams
(RWL Jones, 14 Sept 2007, CERN)

Computing Model (AMF, Ian Hinchliffe, 11/28/07)
• Tier 1 and Tier 2 are a common resource available to the entire collaboration.
• Tier 1 cloud
  – Not all Tier 1s are the same size
  – 10% of RAW on disk, the rest on tape
  – Two full copies of the current ESD on disk in the entire cloud
  – A full AOD/TAG at each Tier 1
  – Access is scheduled, through ANALYSIS and PHYSICS groups, and for production
• Tier 2 cloud: approximately 30 sites, with large variations in size
  – Some of the ESD and RAW
  – In 2008: 30% of RAW and 150% of ESD in the Tier 2 cloud
  – In 2009 and after: 10% of RAW and 30% of ESD in the Tier 2 cloud
  – This will largely be 'pre-placed' in early running
  – Recall of small samples through group production at Tier 1
  – 10 copies of the full AOD on disk
  – User data (mainly in scratch)
  – Additional access to ESD and RAW in the CAF: 1/18 of RAW and 10% of ESD
• Note that data are streamed by trigger object
  – Of order 6 streams; there might be fewer (or more) copies of some streams depending on demand

Three different ways of reducing/refining data have been identified (a purely illustrative sketch of all three follows below):
◮ Skimming: selecting only interesting events based on some event-level quantity
  ⋆ number of electrons
  ⋆ missing ET
◮ Thinning: selecting only interesting objects from a container
  ⋆ keep all electrons with pT > 40 GeV
◮ Slimming: selecting only interesting properties of an object
  ⋆ drop some of the calorimeter information from an electron

Skimming: selecting only interesting events based on some event-level quantity
The ATLAS TAG database is intended to be an event-level metadata system. TAGs are intended to support efficient identification and selection of events of interest for a given analysis.
◮ Rather straightforward creation from an AOD file: see RecExCommon/aodtotag.py
◮ Selection of events with simple queries at the jobOptions level:
  EventSelector.InputCollections = ["tag1.pool"]
  EventSelector.Query = "NElectrons>0 && abs(ElectronEta[0])<2.5"
Wiki: Atlas/PhysicsAnalysisWorkBookTAGAnalysis
Wiki: Atlas/TagForEventSelection
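The three reduction operations can be pictured with a short, purely illustrative Python sketch. The Electron class, the event dictionaries and the cut values below are invented for this example; they are not the ATLAS EDM or any Athena tool.

# Illustrative sketch of skimming, thinning and slimming on a toy event record.
# The Electron class and containers are invented for this example and do not
# correspond to the ATLAS EDM or any Athena interface.

class Electron:
    def __init__(self, pt, eta, calo_cells):
        self.pt = pt                    # transverse momentum in GeV
        self.eta = eta
        self.calo_cells = calo_cells    # detailed calorimeter information

def skim(events):
    """Skimming: keep only interesting events (event-level selection)."""
    return [ev for ev in events
            if len(ev["electrons"]) > 0 and ev["missing_et"] > 20.0]

def thin(event):
    """Thinning: keep only interesting objects from a container."""
    event["electrons"] = [el for el in event["electrons"] if el.pt > 40.0]
    return event

def slim(event):
    """Slimming: keep only interesting properties of each object."""
    for el in event["electrons"]:
        el.calo_cells = None            # drop detailed calorimeter information
    return event

def reduce_to_dpd(events):
    """A DPD-like reduced sample is the composition of the three steps."""
    return [slim(thin(ev)) for ev in skim(events)]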
Main types of data
The three main types of data needed in the analysis model are the AOD (Analysis Object Data), envisioned to have 100 kB/event; the TAG (a small set of event-level metadata) with 1 kB/event; and the DPD (Derived Physics Data) with an average size of 10 kB/event, suitable for the final analysis. The storage format of the AOD and ESD is POOL based.

AOD analysis: the AOD is accessed with Athena, locally or on the grid, interactively or in batch mode, and the output of the analysis in the form of histograms is stored. Advantages of this approach are the availability of all Athena tools and services, with access to conditions data and all information in the AOD. Disadvantages are the longer development cycle (code, compile, debug, run) inside the Athena framework compared to other options. Here the DPD (the histograms or a small Ntuple) contains derived data only.

CBNT: a flat Ntuple (CBNT) is derived from the AOD, with largely the same content as the AOD, and then analyzed using ROOT (a minimal ROOT-based loop of this kind is sketched at the end of this section). Advantages are the portability of the ROOT Ntuples and the speed of the development cycle for the analysis in ROOT. Disadvantages are that the ROOT-based code is usually written in a manner that is not portable back to Athena, the duplication of data between the AOD and the Ntuple, and the need to re-run Athena to produce new Ntuples in the event of bugs.

SAN: a structured Ntuple (SAN) is created from the AOD, with largely the same content as the AOD, and analyzed using ROOT. Compared to the CBNT approach this has the additional advantage that the SAN contains objects with access methods similar to those of the native objects in the AOD, but also the additional disadvantage of having to maintain and validate an additional set of classes mirroring the Athena-based ones.

EventView-based Ntuples: flat Ntuples produced with EventView (an analysis framework within Athena) and later analyzed with ROOT. This is a mixed approach, with major parts of the analysis already done in Athena and a refined analysis step on the produced Ntuples. The advantage is that the derived Ntuples are typically smaller than CBNTs or SANs, since they do not contain the full AOD information, but they may carry additional analysis data for the events that are retained. Analysis may be helped by the preparation of several consistent 'views' of the same event. The disadvantages stem from the perceived complexity of the framework and the strict selection criteria applied when the Ntuples are produced.

Analysis Forum in ATLAS
Six open meetings were held, either by phone or in person: http://indico.cern.ch/categoryDisplay.py?categId=1543
Analysis Model Report, draft 3.0, edited by D. Costanzo, I. Hinchliffe, S. Menke

Storage format of DPD
[Rec 1] It is recommended to write out DPDs in the same storage format as AODs and ESDs, thus ensuring portability of the code between data files with different levels of detail.

Storage format of DPD (2)
[Rec 2] If there are use cases for which it is believed that this format cannot be used, either for performance or implementation reasons, the case should be presented to the PAT. Use of any alternative format should not occur unless endorsed by the computing management.
[Rec 2bis] The code used to operate on the last DPD and produce the final histograms used for publication should be publicly available, validated, and simple enough for the analysis reviewer to reproduce the histograms.

Distribution of and access to DnPDs
[Rec 3] DPDs need to be available to all collaborators. Primary DPDs shall be stored on disk at Tier 1/Tier 2. Secondary DPDs, if any, that are used as the basis of physics results shall be exported to Tier 2 and be accessible to all collaborators.

AthenaROOTAccess
[Rec 4] It is recommended to keep and extend AthenaROOTAccess for Ntuple-like analyses on POOL/ROOT files such as ESD, AOD, and DPD. Since conditions data, detector description, magnetic field maps, etc. are not accessible from AthenaROOTAccess, it depends on the analysis whether or not AthenaROOTAccess can be used. Under no circumstances should approximate conditions data be used instead of the conditions data provided inside an Athena job.
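For the CBNT/SAN style of analysis, and more generally for Ntuple-like access to ROOT files, the workflow amounts to an event loop in ROOT. A minimal PyROOT sketch is shown below; the file name, tree name, branch names and units are assumptions made for the illustration and do not correspond to the actual CBNT schema or to the AthenaROOTAccess interface.

import ROOT

# Minimal, generic sketch of a ROOT-based loop over a flat (CBNT-like) Ntuple.
# File name, tree name, branch names and units are assumed for illustration;
# they are not the real CBNT schema nor the AthenaROOTAccess interface.
f = ROOT.TFile.Open("cbnt.root")                 # Ntuple previously produced in Athena
tree = f.Get("CollectionTree")                   # assumed tree name

h_pt = ROOT.TH1F("h_pt", "Leading electron pT;pT [GeV];events", 50, 0.0, 200.0)

for event in tree:                               # PyROOT entry loop
    if event.NElectrons > 0:                     # assumed branch names
        h_pt.Fill(event.ElectronPt[0] / 1000.0)  # assuming pT is stored in MeV

out = ROOT.TFile("histos.root", "RECREATE")      # the derived output here is just histograms
h_pt.Write()
out.Close()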
Code distribution and software infrastructure
The distribution kit should be optimized, and a separate kit should be provided for analysis purposes with a total size of no more than 1 GB. Compilation tools should be improved so that compilation takes less than a few seconds.

Compression of containers
The use of double precision numbers in the persistent representation should be avoided, and floating point numbers should be compressed (an illustrative sketch follows below). The read performance should be of the order of 10 MB/s on an average lxplus machine; an average high-pT analysis should be able to read DPDs from disk at a rate of at least 200 Hz. EDM objects that are candidates for DPDs should have slimming methods, so that users can remove some of their content depending on their analysis requirements.
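As an illustration of what compressing floating point numbers can mean in practice, the sketch below truncates mantissa bits of an IEEE-754 single-precision value. The number of retained bits is an arbitrary choice for the example; this is not the compression scheme actually used by the ATLAS persistency layer.

import struct

def compress_float(x, mantissa_bits=16):
    """Lossy compression sketch: zero the lowest (23 - mantissa_bits) mantissa
    bits of a 32-bit float, making the value cheaper to store and compress.
    The choice of 16 retained bits is arbitrary and purely illustrative."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 -> raw bits
    drop = 23 - mantissa_bits
    bits &= ~((1 << drop) - 1)                            # clear low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

pt = 41.237894                    # some quantity, e.g. a pT in GeV
print(compress_float(pt))         # slightly less precise, but adequate for analysis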
Primary DPD production
[Rec 13] The total data volume allocated to DPDs from the common ATLAS resource should be comparable to that of the AOD. The number of AOD replicas should be reduced to accommodate them. The balance between ESD, AOD and DPD will evolve during the life of the experiment. Initially, while the detector is being understood and the reconstruction software debugged, the usage of DPDs may be limited and more storage should be allocated to ESD and AOD.
[Rec 14] Once the AODs have a significant lifetime, i.e. the reconstruction of complete data sets is stable, DPD production should be expected to occur on the order of once per month. One validated and one "validating" version are expected to exist concurrently on disk. Users who need to retain older versions are expected to copy them to Tier 3. Older versions in common space should be retained for some period on tape.
[Rec 15] A quota system needs to be established to allocate common storage space to the various physics and performance groups. Both the policy to set quotas and the tools to implement them are needed.
[Rec 16] DPD production should be a validated activity occurring within the production system. Each group (or set of groups, if a DPD is shared between groups) must have persons dedicated to preparing and testing the code for DPD making and for submitting the jobs to the production system. This should be recognized as an OTSMU task.
[Rec 17] The number of primary DPDs is expected to be in the range 10-20, fewer in the early stages of the experiment. The data volume of each should therefore be in the range of 5% to 10% of the AOD volume on average. Physics and performance groups are encouraged to share DPDs, particularly in the early stages of the experiment.
[Rec 18] The production may take place in several tasks, each making an individual DPD, or in a single task that makes all of the DPDs in a single pass over the AOD data sets. Performance issues related to data access will need to be considered. However, we recommend that a scheduled-access ("train") model be developed in case it is needed to improve processing time.
[Rec 19] The tools for primary DPD making should be lightweight and transparent.

EventView and analysis framework
The EventView tools and the analysis data EDM should be factorized from the EventView framework and be provided as part of the ATLAS toolkit. Whenever this is not possible, new tools should be provided using EventView as a prototype.
[Rec 21] EventView, as an analysis framework, should be a client of the tools described above.
[Rec 22] The EventView framework should not be used as a tool for primary DPD making (see the previous recommendation on DPD production). It could be used by individuals or small groups for making DnPDs, perhaps in a flat Ntuple.

Validation
[Rec 23] When DnPDs are used, results must be reproducible from a primary DPD or AOD using code that runs in Athena or AthenaROOTAccess.
[Rec 24] EventView should not be used as a validation tool for other packages.
[Rec 25] A strategy to validate EventView needs to be identified.

Conclusion
Interesting times will start very soon and we have to be ready.

T2 Data on Disk
• ~30 Tier 2 sites of very, very different size contain:
• Some of the ESD and RAW
  – In 2007: 10% of RAW and 30% of ESD in the Tier 2 cloud
  – In 2008: 30% of RAW and 150% of ESD in the Tier 2 cloud
  – In 2009 and after: 10% of RAW and 30% of ESD in the Tier 2 cloud
  – This will largely be 'pre-placed' in early running
  – Recall of small samples through group production at Tier 1
• Additional access to ESD and RAW in the CAF: 1/18 of RAW and 10% of ESD
• 10 copies of the full AOD on disk
• A full set of official group DPDs (production area)
• Lots of small group DPDs (in production area)
• User data (in 'SCR$MONTH')
• Access is 'on demand'
[Chart: Tier 2 disk share, 2008. Legend: Raw General, ESD (curr.), AOD, TAG, RAW Sim, ESD Sim (curr.), AOD Sim, Tag Sim, User Group, User Data]
(RWL Jones, 14 Sept 2007, CERN)