
Performance Modeling and Validation of C66x DSP multilevel cache memory system

Rama Venkatasubramanian, Pete Hippleheuser, Oluleye Olorode, Abhijeet Chachad, Dheera Balasubramanian, Naveen Bhoria, Jonathan Tran, Hung Ong and David Thompson

Texas Instruments Inc., Dallas, TX


Pre-silicon Performance Validation

• Improved IPC – better energy efficiency
• Processor memory systems are becoming more and more complex
– Multicore over clock speed is the trend seen across the industry, and these memory systems are becoming difficult to validate.

• Cost of a bug fix: increases exponentially the longer the bug goes undetected through the design flow.

• Performance validation goal: identify and fix all performance bugs during the design development phase.
– Modeling and validation of a multi-level memory system is complex.
• Novelty of this work:
– A unique latency crediting scheme allows pre-silicon performance validation with minimal increase in CPU simulation time.
– A reusable performance validation framework across the DV stack (multiple levels of design verification).

C66x DSP Memory system architecture

(Figure: C66x DSP block diagram – fetch/dispatch/execute pipeline with L, M, S, D units and register files A and B; 32KB L1P SRAM/cache; 32KB L1D SRAM/cache; 1MB L2 SRAM/cache with L2 prefetch and DMA; emulation, embedded debug, interrupt controller, and power management blocks.)

• Two levels of on-die caches (geometry sketched below)
– 32KB direct-mapped L1P (instruction) cache
– 32KB 2-way set-associative writeback L1D cache
– 1MB 4-way private unified L2 cache
• L1/L2 configurable as SRAM, cache, or both
• Controllers operate at the CPU clock rate to minimize CPU read latency
• DMA: slave DMA engine and internal DMA engine
• Stream-based prefetch engine
• Coherency: all-inclusive coherent memory system
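As a quick illustration of the cache organization above, the sketch below computes the set/index geometry of the three caches. The 64-byte line size is an assumed parameter for illustration; the slide does not state it.

```python
# Illustrative sketch of the C66x cache geometries described above.
# The 64-byte line size is an assumed parameter, not taken from the slide.

def cache_geometry(size_bytes, ways, line_bytes=64):
    """Return (number of sets, bits of set index) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    index_bits = sets.bit_length() - 1
    return sets, index_bits

for name, size, ways in [("L1P (direct mapped)", 32 * 1024, 1),
                         ("L1D (2-way)",        32 * 1024, 2),
                         ("L2  (4-way)",        1024 * 1024, 4)]:
    sets, idx = cache_geometry(size, ways)
    print(f"{name}: {sets} sets, {idx} index bits")
```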

Performance bottlenecks

Typical architectural constraints in a processor memory system:
• Memory system pipeline stalls
– Stalls due to movement of data (controlled by the availability of buffer space)
– Stall conditions inserted to avoid hazard scenarios

• Arbitration points
– Memory access arbitrated between multiple requestors, or data arbitration on a shared bus

• FIFOs
• Bank stalls and bank conflicts (sketched after this list)
– Bank conflicts arise because burst-mode SRAMs are used to implement the memories
• Bandwidth management architecture
– Bandwidth requirement dictated by the application (real-time applications)
– A minimum bandwidth may have to be guaranteed

• Miscellaneous stalls
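The bank-conflict bullet above is concrete enough to sketch. The bandwidth slide later in this deck quotes a 4-cycle window in which re-accessing the same bank conflicts; the banking parameters below (8 banks, 8 bytes wide) are hypothetical.

```python
# Minimal sketch of bank-conflict detection. The 8 banks x 8-byte bank width
# are hypothetical; the 4-cycle busy window matches the arbitration slide.
NUM_BANKS, BANK_WIDTH, BUSY_CYCLES = 8, 8, 4

last_access = {}  # bank number -> cycle of last access

def is_bank_conflict(cycle, addr):
    """Return True if this access hits a bank still busy from a prior burst."""
    bank = (addr // BANK_WIDTH) % NUM_BANKS
    conflict = bank in last_access and cycle - last_access[bank] < BUSY_CYCLES
    last_access[bank] = cycle
    return conflict

print(is_bank_conflict(0, 0x100))   # False (first touch of bank 0)
print(is_bank_conflict(2, 0x100))   # True  (same bank within 4 cycles)
print(is_bank_conflict(2, 0x108))   # False (adjacent bank)
```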

Performance Validation framework

• Implementation:

– Theoretical analysis based on the system microarchitecture
– The model framework was developed in the Specman-e language
– Overlaid on top of the functional verification environment (see the structural sketch below)
– Complex, but scalable architecture

• Overall Goal:

– Identify any performance bottlenecks in the memory system pipeline
– Measure worst-case latency for all transactions
– Ensure there are no blocking/hang scenarios at arbitration points
– Reusable framework across the DV stack
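The model itself was written in Specman-e on top of the functional testbench. As a language-neutral illustration, here is a minimal Python sketch of the same structure: monitors stamp each transaction, accumulate stall credits, and check adjusted latency at response time. All class and method names are invented for this sketch, not the TI implementation.

```python
# Structural sketch of the validation overlay (the real model is Specman-e).
from dataclasses import dataclass

@dataclass
class Transaction:
    txn_id: int
    start_cycle: int
    end_cycle: int = 0
    stall_credits: int = 0          # cycles excused by the crediting scheme

class InterfaceMonitor:
    """Passively probes one controller interface in the functional testbench."""
    def __init__(self, name, checker_limit):
        self.name, self.checker_limit = name, checker_limit
        self.open = {}               # txn_id -> in-flight Transaction

    def on_request(self, txn_id, cycle):
        self.open[txn_id] = Transaction(txn_id, cycle)

    def on_stall(self, txn_id, cycles):
        self.open[txn_id].stall_credits += cycles

    def on_response(self, txn_id, cycle):
        t = self.open.pop(txn_id)
        t.end_cycle = cycle
        adjusted = (t.end_cycle - t.start_cycle) - t.stall_credits
        assert adjusted <= self.checker_limit, \
            f"{self.name}: txn {txn_id} adjusted latency {adjusted} over limit"
        return adjusted

mon = InterfaceMonitor("L2 read port", checker_limit=10)
mon.on_request(1, cycle=100)
mon.on_stall(1, cycles=3)
print(mon.on_response(1, cycle=110))   # (110 - 100) - 3 = 7 cycles
```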

Performance Validation framework (contd.)

• The model probes into the design – all the controllers, the internal interfaces, etc.
• Measures the number of cycles for each transaction
– Initiated by the CPU, by DMA, or by cache operations
• Stalls at arbitration points, in bandwidth management, etc. are tracked.
• A novel latency-credit-based transaction modeling system was developed to determine the true latency incurred by a transfer in the system.


Example 1 – Single traffic stream

• CPU load from L2 SRAM.
• Miss in the L1P cache; the request reaches the unified L2 controller.
– Ex.: the transaction goes through A3, P0, and P1, reads the data from L2 SRAM, and the data is returned to the program memory controller.
• Flight time for the entire transfer is calculated inside the memory system (see the toy calculation below).

(Figure: memory system pipeline diagram – pipeline stages drawn as rectangles, arbitration points as circles; the FIFO is shown for illustration purposes.)
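A toy version of the Example 1 flight-time bookkeeping, with invented per-stage cycle counts (the slide does not give actual stage latencies):

```python
# Toy flight-time calculation for Example 1 (L1P miss read from L2 SRAM).
# The per-stage cycle counts below are made-up values for illustration.
path = {"A3": 1, "P0": 1, "P1": 1, "L2SRAM read": 4, "return to PMC": 2}
flight_time = sum(path.values())
print(f"flight time = {flight_time} cycles")  # total cycles in the memory system
```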


Latency Crediting methodology

• The model tracks the transfer through the system, along with buffer space availability in the pipeline stages and arbitration points.

• Assume:
– Total flight time for the transfer within the L2 controller = t_L2lat
– Pipeline stall cycles = t_I0, t_I1, t_P0, t_P1
– Arbitration stall cycles = t_A0, t_A3, etc.
– Unused arbitration stall cycles = 0, e.g. t_A1 = 0 (arbitration path not taken)
– The adjusted latency t_AdjLat inside the L2 controller for this transfer is:

t_AdjLat = t_L2lat - (t_I0 + t_I1 + t_P0 + t_P1) - (t_A0 + t_A3 + ...)

• Ideally, the adjusted latency for the transfer should equal the pipeline depth inside the controller.
– Measuring latency to that level of accuracy would require a cycle-accurate performance validation model, which is impractical.
• Hence the adjusted latency for each transfer was measured and checked to be within an acceptable latency defined by the architecture (see the numeric sketch below).
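A numeric sketch of the crediting arithmetic, with invented cycle counts and invented architectural bounds:

```python
# Numeric sketch of the latency crediting arithmetic (all values invented).
t_L2lat = 18                      # measured flight time inside the L2 controller
pipeline_stalls = {"I0": 2, "I1": 0, "P0": 3, "P1": 1}
arb_stalls = {"A0": 2, "A3": 1}   # A1 not on this path -> contributes 0

t_adj = t_L2lat - sum(pipeline_stalls.values()) - sum(arb_stalls.values())
print(t_adj)                      # 9 cycles of un-credited latency

PIPELINE_DEPTH, ACCEPTABLE = 7, 10   # architecture-defined bounds (assumed)
assert PIPELINE_DEPTH <= t_adj <= ACCEPTABLE, "possible performance bug"
```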


Example 2: Multiple concurrent traffic streams

• Three concurrent streams:
– A CPU program read from L2 SRAM
– A CPU data read over the MDMA path (through the FIFO)
– A coherence transaction – say a writeback-invalidate operation that arbitrates for the L2 cache, checks for a hit or miss, writes back the data (through the MDMA path), and invalidates the cache entry

• The model has to be aware of the interactions of the pipeline stages and apply credits accordingly (see the sketch below).
• Millions of functional tests in the regression suite.
• Every conceivable traffic type is inferred by the model and tracked.
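A sketch of the credit-attribution rule this implies: a stall cycle at an arbitration point is credited only when the point is held by a different stream. The function and field names are hypothetical, mirroring the Transaction record sketched earlier.

```python
# Sketch of credit attribution with concurrent streams: a stall cycle is
# credited only if the arbitration point was held by a *different* stream.
def apply_arb_credit(txn, arb_point, cycle, owner_of):
    """Credit one stall cycle to txn if another requestor owns the arb point."""
    owner = owner_of(arb_point, cycle)
    if owner is not None and owner != txn["id"]:
        txn["stall_credits"] += 1   # excused: lost arbitration to a rival stream
    # otherwise the stall counts against the adjusted latency

# Toy check: stream 1 stalls at A1 while the coherence stream (id 3) holds it.
txn = {"id": 1, "stall_credits": 0}
apply_arb_credit(txn, "A1", cycle=42, owner_of=lambda point, cycle: 3)
print(txn["stall_credits"])   # 1
```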

Performance bugs identification

• The data collected for each transaction type is plotted per memory controller interface or per transaction type.
• If a latency value exceeds its checker value, it is either a design bug or incorrect modeling in the performance validation environment, which is fixed and re-analyzed.
– Checkers are modeled based on theoretical analysis; outliers are analyzed.

• Over time, the resulting plot shows the minimum and maximum number of cycles spent by any given transfer type in that particular memory controller across the various stimuli provided by the testbench (see the post-processing sketch below).
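A sketch of that regression post-processing, with assumed transaction types and checker values:

```python
# Sketch of regression post-processing: per transaction type, collect the
# adjusted latencies and flag outliers above the theoretical checker value.
from collections import defaultdict

checker = {"cpu_l2_read": 10, "dma_l2_write": 14}   # assumed checker limits
samples = defaultdict(list)

def record(txn_type, adjusted_latency):
    samples[txn_type].append(adjusted_latency)
    if adjusted_latency > checker[txn_type]:
        print(f"OUTLIER: {txn_type} took {adjusted_latency} cycles")

record("cpu_l2_read", 8)
record("cpu_l2_read", 12)          # flagged: design bug or model bug
for txn_type, latencies in samples.items():
    print(txn_type, "min/max:", min(latencies), max(latencies))
```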

Bandwidth analysis and validation

• The C66x DSP supports various programmable bandwidth-management configurations.

• Theoretical expectations for the various bandwidth settings, when multiple requestors arbitrate for a resource, are calculated.
– Example: CPU and DMA traffic arbitrate for the L2 SRAM resource (CPU and DMA → bandwidth-management arbiter → L2 SRAM).
– CPU priority > DMA priority; a bank conflict occurs if the same bank is accessed within 4 cycles.
– The various configurations and their throughputs are tabulated as shown below (a toy arbitration model follows the table).

(Figure: cycle-by-cycle arbitration chart over 32 cycles for each bandwidth configuration, with CPU transfers marked in yellow and DMA transfers marked in blue.)

Bandwidth configuration | Expected CPU throughput | Expected DMA throughput
BW config 1 | 50% | 50%
BW config 2 | 66% | 33%
BW config 3 | 60% | 20%
BW config 4 | 66% | 11%
BW config 5 | 70% | 5%
BW config 6 | 72% | 3%
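A toy model of the weighted arbitration, assuming a simple slot-based weighting. It is illustrative only and will not reproduce the exact table percentages, which also reflect CPU/DMA priorities and 4-cycle bank conflicts:

```python
# Toy weighted-arbitration model: out of every `total_slots` arbitration
# slots, the CPU wins `cpu_slots` and the DMA wins the rest. Illustrative
# only; the real throughput table also reflects 4-cycle bank conflicts.
def expected_share(cpu_slots, total_slots, cycles=32):
    wins = {"CPU": 0, "DMA": 0}
    for c in range(cycles):
        winner = "CPU" if (c % total_slots) < cpu_slots else "DMA"
        wins[winner] += 1
    return {k: f"{100 * v / cycles:.0f}%" for k, v in wins.items()}

print(expected_share(1, 2))   # ~50/50 split, like BW config 1
print(expected_share(2, 3))   # a CPU-favoured configuration
```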

Bandwidth validation (contd.)

• Efficiency of bandwidth allocation
– To improve energy efficiency in the system
– Targeted stress tests were written to exercise full-bandwidth scenarios on all the interfaces. With bandwidth arbitration enabled, the total bandwidth utilized is plotted per requestor.

– Ex.: the bandwidth that each of the requestors – CPU, DMA, and the coherence engine – gets when they access the same resource, L2 SRAM, concurrently (see the accounting sketch below).
• The total available L2 SRAM bandwidth is 32 bytes/cycle.
• But when all three requestors are accessing L2 SRAM, the L2 controller provides a maximum of only 24 bytes/cycle, which may or may not be the architectural intent.
• Scenarios like this are highlighted to the design team for review.
• The architecture is revised accordingly during the design phase.
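A sketch of the per-requestor bandwidth accounting; the byte counts are invented to reproduce the 24 bytes/cycle ceiling quoted above:

```python
# Sketch of per-requestor bandwidth accounting over a stress-test window.
# Byte counts are invented to match the 24 B/cycle observation above.
PEAK = 32                                   # L2SRAM peak, bytes/cycle
window_cycles = 1000
bytes_moved = {"CPU": 10_000, "DMA": 8_000, "coherence": 6_000}

total = sum(bytes_moved.values()) / window_cycles
for requestor, nbytes in bytes_moved.items():
    print(f"{requestor}: {nbytes / window_cycles:.1f} B/cycle")
print(f"total {total:.1f} of {PEAK} B/cycle peak")   # 24.0 -> flag for review
```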


Validation of Cache coherency operations

• The C66x DSP core memory system supports both global and block cache coherency operations.
• The DSP core supports a snoop interface between the data memory controller and the L2 controller to implement the all-inclusive coherent memory system.

– For the snoop operations, the latency of each snoop transaction is tracked and reviewed against architectural expectations.

• The latency of each cache coherency operation is a function of the cache size, the number of clean/dirty/valid lines in the cache, the block word count, and the number of empty lines in the cache.
• For different step sizes of cache size and block size, the total number of cycles taken for each operation is determined and a formula is derived.
– The formula is used by the performance validation model with random stimuli: whenever a cache operation is initiated, the latency and credits are checked against the respective formula (a hypothetical example follows below).
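Since the derived formulae themselves are not reproduced here, the example below uses a hypothetical linear form to show how the model consumes one; the coefficients and measured value are invented:

```python
# Hypothetical latency formula for a block writeback-invalidate; in the real
# model the coefficients were derived from sweeps of cache and block sizes.
def expected_block_op_cycles(block_words, dirty_lines, setup=12,
                             per_word=1, per_dirty_line=4):
    return setup + per_word * block_words + per_dirty_line * dirty_lines

measured = 152                      # invented measurement from simulation
expected = expected_block_op_cycles(block_words=64, dirty_lines=19)
assert abs(measured - expected) <= 2, "coherency-op latency outside formula"
```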

Conclusion

• Post-silicon performance validation to identify performance issues comes very late in the design development cycle and can prove very costly.
– There is an ever-increasing need to detect performance issues early, so that the cost of the design fixes needed to resolve them is minimized.
• The Specman-e model overlays on top of the functional simulation framework.
– It collates traffic information for every transfer in the system.
– It computes the total latency incrementally and also calculates the expected latency, either from the theoretical equations or from default values based on pipeline depth.
• Numerous performance bugs were identified and fixed during the design development phase.
• The performance validation model probes into the design and is reused across all levels of the DV stack with minimal simulation-time overhead.
– The framework can guarantee that performance is validated across the entire design/system, rather than only in the unit-level functional verification environment.

• Furthermore, between different revisions of the processor cores, if a feature added at a later stage or a functional bug fix introduces a performance bug, the performance model checkers will fail, thus catching any performance issue created by design changes.

Q&A

Thank you

[email protected]
