Transcript Slide 1

Data Acquisition at CBM(FAIR)

Outline:
- Conventional FEE-DAQ-Trigger layout in HEP
- Data rates in ALICE at the LHC
- ALICE DAQ architecture
- What drives the FEE-DAQ architecture in the CBM experiment
- Self-triggered FEE
- The way out: data-push architecture
- Common data flow
- Event building and its network
- Event selection and its processing
- Some results from GSI

Conventional FEE-DAQ-Trigger Layout in HEP

[Diagram: conventional layout. Detector → FEE (in the cave, limited buffer) → DAQ (in the shack) → archive with limited capacity and modest bandwidth. Specially instrumented detectors send trigger primitives over dedicated connections to the L0 trigger (driven by the bunch frequency) and to the L1 trigger, which runs on specialized trigger hardware with limited latency; an L2 trigger sits in the DAQ before the archive.]

Cross Sections and Production Rates in one ALICE year (10^6 s)

[Table: cross-sections and production rates per 10^6 s for Pb-Pb and Ca-Ca collisions, for final states including π0, J/ψ, γ, cc̄ + X and bb̄ + X. The quoted cross-sections range from a few µbarn (heavy-flavour production) up to several barn (π0, e.g. 5200 mbarn in Pb-Pb), with corresponding yields of roughly 10^5 to a few 10^9 per ALICE year.]

Data taking scenarios in Pb-Pb in ALICE

[Table: four Pb-Pb data-taking scenarios, listing Level2 and DAQ rates (Hz) for the Central, Minimum-Bias, Dielectron and Dimuon trigger classes; the total throughput is 1250, 1400, 1400 and 700 MB/s for scenarios 1-4.]

ALICE DAQ-ARCHITECTURE

Data Acquisition at CBM(FAIR)

Hadrons
• measure: π, K — offline
• measure: K, Λ, Σ, Ξ, Ω — offline >10 AGeV, trigger <10 AGeV
• measure: D0, D±, Ds, Λc — trigger
Leptons
• measure: J/ψ, ψ' → e+e− or μ+μ− — trigger for e+e−, trigger for μ+μ−
• measure: ρ, ω, φ → e+e− or μ+μ− — offline for e+e−, trigger for μ+μ− ?
Photons
• measure: γ — offline ?

Assume archive rate: a few GB/s, 20 kevents/s. The trigger on a displaced vertex drives the FEE/DAQ architecture; in addition: μ identification and a trigger on high-pT e+e− pairs.

Data Acquisition at CBM(FAIR)

- The D and J/ψ signals drive the rate capability requirements
- The D signal drives the FEE and DAQ/Trigger requirements
- The problem is similar to B detection, as in LHCb or BTeV

Adopted approach: displaced-vertex 'trigger' in the first level, like in BTeV

Additional problem: DC beam → interactions at random times → time stamps with ns precision needed → explicit event association needed

Current design for FEE and DAQ/Trigger:
- Self-triggered FEE
- Data-push architecture

Typical Self-Triggered Front-End

• Average 10 MHz interaction rate
• Not periodic like in a collider
• On average 100 ns event spacing
Use a sampling ADC on each detector channel, running with an appropriate clock; the hit time is determined to a fraction of the sampling period (sketched below).
[Plot: sampled pulse shapes with a threshold; for each hit an amplitude and a time are extracted, e.g. a: 126, t: 5.6 and a: 114, t: 22.2, with the time axis in sampling periods.]

Limits of Conventional Architecture

• Decision time for the first-level trigger is limited: typ. max. latency 4 μs at the LHC
• Not suitable for complex global triggers like a secondary-vertex search
• Only specially instrumented detectors can contribute to the first-level trigger
• Large variety of very specific trigger hardware
  – limits future trigger development
  – high development cost

The way out .. use Data Push Architecture

[Diagram: the conventional layout again, now with an explicit time-distribution system alongside the bunch clock; dedicated trigger connections from specially instrumented detectors, limited buffers in the cave, limited L1 trigger latency and specialized trigger hardware remain.]

The way out .. use Data Push Architecture

[Diagram: data-push layout. Detector → self-triggered FEE (with distributed clock) → buffer in the cave → high-bandwidth link to the shack → DAQ and L1 trigger.]

The way out .. use Data Push Architecture

[Diagram: data-push layout, annotated.]
• Self-triggered front-end: autonomous hit detection
• No dedicated trigger connectivity; all detectors can contribute to L1
• Clock distribution to the FEE
• Large buffer depth available
• System is throughput-limited, not latency-limited
• Use the term 'event selection': high-bandwidth link from the cave (FEE, buffer) to the shack (DAQ, L1 select)

Front-End for Data Push Architecture

• Each channel detects all hits autonomously
• An absolute time stamp, precise to a fraction of the sampling period, is associated with each hit
• All hits are shipped to the next layer (usually concentrators)
• Association of hits with events is done later, using time correlation (see the sketch below)
• Typical parameters, with a few percent occupancy and 10^7/s interaction rate:
  – some 100 kHz channel hit rate
  – a few MByte/s per channel
  – whole CBM detector: ~1 TByte/s

Basic n-XYTER Readout Chain

[Diagram: Detector → Front-End Board (FEB) → Read-Out Controller (ROC) → Active Buffer Board (ABB). A bond or cable connection brings the detector signals to up to 8 n-XYTER chips (1024 channels) on the FEB; tag and ADC data travel over an LVDS signal cable to the ROC, with clock and control in the other direction; the ROC connects via MGT/SFP over a 2.5 Gbps optical link to the ABB, which sits on a 1-4 lane PCIe interface.]

Scalable n-XYTER Readout Chain

[Diagram: Detector → Front-End Board (FEB) → Read-Out Controller (ROC) → Data Combiner Board (DCB). As before, tag and ADC data go from the FEB to the ROC over LVDS, with clock and control in return; each ROC connects via MGT/SFP optical links to the DCB, which combines several ROCs and forwards the data to the ABB.]

READOUT EXPERIENCE ALICE-PMD

[Diagram: PMD FEE chain. FEE board with MANAS front-end chips, a MARC control chip and ADCs → LVTTL data and control lines → translator board → 80 MB/s LVDS link → concentrator board → DDL (2 GB/s max) → DAQ.]

Total number of cells: 221184; total number of modules: 48; 1 module = 4608 cells.

Connection of a Chain – ALICE-PMD (with ROC)

[Diagram: front-end boards daisy-chained on a patch bus into a concentrator board; 1 CROCUS serves 50 patch buses, 6 CROCUS serve 300 patch buses. The CROCUS/DCB connects over ~40 m (??) LVDS link ports carrying the trigger, BUSY and L0 signals, and over DDL to the LDC/ABB and on to the GDC; the CTP/LTU distributes the L0 trigger and BUSY via VME trigger dispatching.]

Where to re-sort data ?

• The token-ring scheme produces locally unsorted data
• The big advantage of the token-ring scheme is the fair distribution of bandwidth in case of local overload; the system is robust against hot channels etc.
• The n-XYTER doesn't even produce epoch markers
  – the reading stage needs a clock-cycle-precise replica of the time-stamp counter to interpret the data correctly; that clearly only works if there are no additional elasticity buffers
  – some form of 'time stamp expansion' and epoch marking is needed (see the sketch below)
  – re-sort data early? Or use a form of fuzzy epoch boundaries?
• How to build concentrators?
  – conceptually easy if the output bandwidth > sum of the input bandwidths
  – but that is not feasible, at least not in the early stages: the read-out ASIC is in fact the first concentrator stage, and the total bandwidth will always be smaller than the sum of the channel bandwidths
  – in other words: when and where to drop data in case of overload?

Think Big or Throttling ?

• Conventional triggered systems handle overload gracefully
  – there is some form of 'common' dead time
  – in case of overload, whole events are discarded
  – loosely speaking: one gets 100% of the data for 90% of the events
• With a self-triggered front-end the converse might happen
  – data is dropped in an uncorrelated fashion wherever FIFOs overfill
  – loosely speaking: one gets 90% of the data for 100% of the events
  – quite an unpleasant perspective
    • tracking systems might tolerate a few % data loss without a major performance drop
    • in other detectors, like an ECAL, this leads immediately to a loss of efficiency
• What is the proper solution?
  – Build and operate the system with 'enough' bandwidth headroom?
    • Note: extracted beams from synchrotrons are notoriously non-Poissonian!
    • Can that be handled with large enough channel FIFOs alone?
  – Or introduce some form of 'global throttling', to drop data in a correlated fashion (see the sketch below)
    • The time distribution system can easily distribute 'XOFF' and 'XON' messages
    • The problem is to find an easy-to-evaluate throttle criterion

Proposed N-XYTER Readout scheme

[Diagram: FEE board with four n-XYTER chips and an ADC → LVDS control and data lines → ASIC-based ROC, also connected to other FEE boards → DCB with a 10 Gbps SFP link → optical fibre (OFC) link → DAQ.]

Logical Data Flow

• Concentrators: multiplex channels onto high-speed links
• Time distribution
• Buffers
• Build network
• Processing resources for first-level event selection, structured in small farms
• Connection to 'high-level' selection processing

Bandwidth Requirements

• Data flow into event building: ~1 TB/s (Gilder helps)
• First-level selection: ~10^14-10^15 operations/s in ~100 sub-farms (Moore helps)
• Data flow after selection: a few 10 GB/s
• To archive: a few GB/s

Focus on BNet

Event Building

Fast Event Building Networks

• Very tempting to look into InfiniBand
  – used as the interconnect in many HPC clusters
  – offers large bandwidth at low CPU overhead
• Available for some time
  – SDR systems: 4 x 2.5 Gbps per link
  – 1 GByte/s bandwidth per port and direction
  – 288-port switches
    • based on 24-port switch chips (288 = 24 * 12)
    • non-blocking switch, 288 GByte/s switching bandwidth
    • modest cost: ~400 EUR/port
• Perspectives
  – DDR just became available, QDR likely to come
  – one 288-port QDR switch does 1 TByte/s; a few could do CBM
  – the network adapter (HCA) is small and low-power compared to 10 Gbit Ethernet

Conventional Networking

[Diagram: conventional networking stack. Hardware: network adapters connected through a network switch. Kernel: drivers. User: library and application. Both the data flow and the control flow pass through the kernel driver on every transfer.]

Use Zero-Copy RDMA

[Diagram: the same stack with zero-copy RDMA. The data flows directly between the application buffers and the network adapter; only the control flow (setup) passes through the kernel driver.]

How does Zero-copy RDMA work?

• User-side requirements
  – all buffers used for I/O must be locked in memory and made known to the network adapter, which stores the virtual-to-physical mapping
  – this setup involves the OS and the driver (expensive)
• Network adapter requirements
  – exports two types of interfaces: one for kernel interactions, one for user interactions
  – the interface for user interactions is memory mapped, replicated for each connected process, and mapped into the user process address space
• Chain of events for a zero-copy RDMA transfer (sketched below)
  – the user process writes a request descriptor directly into the network adapter (mapped interface)
  – the adapter validates it, builds a scatter-gather list, and transfers directly to/from the user address space

and Real World problems ...

• Usually some application framework is used: ROOT, XDAQ, ...
• It usually has its own 'buffer management'
• Remember:
  – making a user buffer eligible for RDMA is quite expensive (locking, driver calls)
  – thus create/delete of a buffer is expensive
• A framework design with very 'dynamic' handling of buffers, which often creates and deletes them, will not work well with RDMA (see the buffer-pool sketch below).
• Adapting the underlying buffer management in an existing framework can be quite cumbersome:
  – basic execution logic problems
  – methods may not be virtual
  – ...
• Successfully done: XDAQ, the CMS DAQ framework, was adapted to uDAPL (J. Adamczewski, GSI)

Event Building

• Barrel-shift is in practice too rigid a scheme
  – e.g. it works only when processing always takes the same time
• Questions are:
  – How much scheduling is needed? → Does chaotic transfer with many buffers work?
  – What is an 'optimal' scheme? → Precise timing and sizing of each transfer
  – What is a simple and robust scheme? → Get close to optimal with simple means
  (a sketch of the basic barrel-shift schedule follows below)
• A 4-node mini-cluster is nice for developing software
• Go to a larger cluster for real tests
  – Done: 24 nodes at FZ Karlsruhe
  – Later: >100 nodes at the Paderborn cluster
• First results from the FZK tests in March 2007
  – 23 nodes, Opterons with DDR InfiniBand HCAs and switches
  – surprise: peer-to-peer bandwidth of 1160 MB/s unidirectional and 730 MB/s bidirectional
  – memory or PCIe is apparently the limiting factor here

Event Building – Scheduled Transfers

• First results from the FZK tests in March 2007 (cont.)
  – 23 nodes, strictly timed transfers; throughput per node (MB/s) vs. buffer size, for different queue lengths:

      buffer size:       2k    8k    32k   128k
      queue length  2:   255   718   699   732
      queue length  8:   271   695   698   727
      queue length 32:   272   704   696    -

  – best throughput: 718 MB/s per node, 16.5 GB/s total
• Peak throughput same as before
• Tests on a realistic-size cluster are needed
• Buffer size and number are now uncritical (big doesn't hurt, at least...)
data by S. Linev, GSI

Focus on PNet

Event Selection

Event Selection Processing

• In CBM we'll have a tracking trigger
  – certainly for open charm (needed for D)
    • requires reconstruction of tracks in the STS for all events
    • search for displaced vertices
    • identification of open-charm candidates
  – possibly also for muon identification
    • again reconstruction of tracks in the STS for all events
    • forward tracking through the muon absorbers
• So we need high-throughput STS tracking
• Two routes are being followed
  – Cellular automaton / Kalman filter tracker
    • lots of floating-point arithmetic
    • better performance (simply because cuts can be narrower)
    • Is it feasible to do CA/KF in L1 event selection?
  – Hough tracker
    • algorithm is 'bit-oriented' and parallelizable
    • can be implemented in programmable logic (see the sketch below)
    • Does the Hough tracker have the required performance?

CBM DAQ and Online Event Selection

• More than 50% of the total data volume might be relevant for first-level event selection; MVD, STS and TRD data are used in first-level event selection (needed for J/ψ)
• Aim for simplicity
• Ansatz: do (almost) all processing after the build stage
• Simple two-layer approach: 1. event building, 2. event processing
• Other scenarios are possible, putting more emphasis on:
  – doing all processing as early as possible
  – transferring data only when necessary

... and Real World throughput

• Small InfiniBand test cluster at GSI
  – 4 dual-dual Opteron servers
  – Mellanox MHES18-XT HCA (PCIe)
  – Mellanox MTS2400 24-port switch
• Test case: XDAQ peer transport via uDAPL (an RDMA access library for IB and iWARP)
• Results
  – for large (100 kB) buffers the throughput approaches the IB limit of 1 GB/s
  – ~30 kB buffers are needed to reach 500 MB/s
data by J. Adamczewski, GSI

Conclusions:
- Self-triggered FEE is necessary
- High-bandwidth network
- Zero-copy RDMA in the network
- Multi-terabyte buffer memory
- Faster processing farms
- Efficient event selection method
and more …

Game Processors as Supercomputers ?

[2005 slide from CHEP'04, Dave McQueeney, IBM CTO US Federal]

The Cell Processor

PPE: 'normal' PowerPC CPU
• running Linux
• used to orchestrate the SPEs
8 SPEs: Synergistic Processing Elements, each with
• 256 kB local memory
• 128 x 128-bit registers
• 4 single-precision floating-point ops/cycle (SIMD)
Peak performance
• 32 single-precision multiply/adds per clock cycle
• runs at ~3 GHz