Part VI: 1hr: Discussions of using MacSim and Ocelot


MacSim Tutorial (In ICPADS 2013)


The Structural Simulation Toolkit (SST): A Parallel Architectural Simulator (for HPC)
• A parallel simulation environment based on MPI
• Fully modular design that enables extensive exploration of an individual system parameter without the need for intrusive changes to the simulator
• Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model
• SST download link: http://sst.sandia.gov/

• Processor Components: MacSim, Gem5
• Memory Components: DRAMSim2, VaultSim (3D memory model), MemHierarchy
• Network Components: Merlin, Iris

Multiple MacSim components can be instantiated, each of which can act as
• An entire GPU node (composed of multiple SMs)
• A heterogeneous computing node (CPU + GPU)
• A GPU/CPU core
• Any combination of the above

MacSim can talk to memHierarchy
• MacSim can make use of memHierarchy’s cache hierarchy.
• This means MacSim can be configured with whatever memory system is connected to memHierarchy: DRAMSim2 or VaultSim.

[Diagram: pipeline stages with memHierarchy. Front-end, Decode, Rename, Schedule, Execution, Retire; I-Cache (MH) and D-Cache (MH) each connect over an SST Link to VaultSim]

MacSim can directly talk to DRAMSim2 and VaultSim
• Using MacSim’s highly versatile memory controller interface, it can directly talk to DRAMSim2 and VaultSim.

[Diagram: pipeline stages with external memory component. Front-end, Decode, Rename, Schedule, Execution, Retire; I-Cache (MS) and D-Cache (MS) each connect over an SST Link to VaultSim]

memHierarchy: an SST component which models a memory hierarchy, such as multiple cache levels
• Subcomponents: Cache, Bus, Memory Controller
• Usage: Processor Component(s) + memHierarchy(s) + Memory Component(s)
  • MacSim + L1/L2 cache + DRAMSim2
  • MacSim + L1/L2 cache + (3D memory model)
  • (MacSim + private L1 cache) + (Gem5 + private L1 cache) + shared L2 cache + (DRAMSim2 or 3D memory model)

MacSim is encapsulated as an SST component; SST feeds clocks into MacSim and provides communication channels.

By talking to memHierarchy, MacSim can communicate indirectly with a range of memory components without modifying its own interface.

[Diagram: MacSim components (SST::Component) connected through SST::Link to L1 and L2 caches (memHierarchy), and from there to DRAMSim2 or VaultSim (SST::Component)]

[Diagram: further configurations combining MacSim and Gem5 components with L1/L2 caches (memHierarchy), an LLC (VaultSim), and DRAMSim2 or VaultSim (SST::Component)]
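The wiring between processor, cache, and memory components can be pictured as a small graph. The sketch below is only an illustration in plain Python: the `Component` class, `link_to`, and `path_to_memory` are hypothetical names invented here and are not part of the SST or MacSim APIs. It builds two MacSim instances with private L1 caches, a shared L2, and one memory model, then prints the path a request would follow.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for an SST::Component (not the SST API)."""
    name: str
    kind: str
    links: list = field(default_factory=list)  # downstream neighbours

    def link_to(self, other):
        """Record a one-way link, standing in for an SST::Link."""
        self.links.append(other)
        return other


def path_to_memory(start):
    """Follow the first link from each component until none is left."""
    path = [start.name]
    node = start
    while node.links:
        node = node.links[0]
        path.append(node.name)
    return path


# Two MacSim instances with private L1s, a shared L2, and one memory model.
l2 = Component("L2", "memHierarchy")
memory = Component("DRAMSim2", "memory")  # could equally be "VaultSim"
l2.link_to(memory)

cores = []
for i in range(2):
    core = Component(f"MacSim{i}", "processor")
    l1 = Component(f"L1_{i}", "memHierarchy")
    core.link_to(l1)
    l1.link_to(l2)
    cores.append(core)

for core in cores:
    print(" -> ".join(path_to_memory(core)))
```

Swapping the memory model only changes the last node, which mirrors the point that MacSim's own interface stays untouched when it talks to memHierarchy.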

• Make sure macsimComponent doesn’t have a .ignore file; otherwise the SST build system will ignore the component.
• How to build: see the instructions on the SST website.
• How to execute: pay special attention to the following files.
  • SDL (or XML): SST component configuration
  • trace_file_list: which trace to execute. Can be specified in the aforementioned SDL file.
  • params.in: MacSim configuration, in which you can specify
    • whether MacSim uses its internal cache or memHierarchy as cache
    • which DRAM controller to use among its internal FCFS/FRFCFS-based controller, the DRAMSim2 controller, and the VaultSim controller
• Specific examples are elaborated in the following slides.

params.in
• use_memhierarchy = 0
• dram_scheduling_policy = FRFCFS or FCFS

SDL (or XML)
• Nothing except macsimComponent configuration
• In this case, link configuration will not be used

params.in
• use_memhierarchy = 1
• Note: when use_memhierarchy is set to 1, MacSim’s DRAM controller configuration has no effect at all

SDL (or XML)
• Specify memHierarchy’s cache configuration (the slide shows an example listing)
• Similar configuration for D-cache as well

params.in
• use_memhierarchy = 1
• Note: when use_memhierarchy is set to 1, MacSim’s DRAM controller configuration has no effect at all

SDL (or XML)
• Specify MemController configuration for DRAMSim2 (the slide shows an example listing)
• Note: DRAMSim2 configurations should be appended

params.in
• use_memhierarchy = 1
• Note: when use_memhierarchy is set to 1, MacSim’s DRAM controller configuration has no effect at all

SDL (or XML)
• Specify MemController configuration for VaultSim (the slide shows an example listing)
• Note: VaultSim configurations should be appended

params.in
• use_memhierarchy = 0
• dram_scheduling_policy = DRAMSIM

SDL (or XML)
• Specify configurations for DRAMSim2 (the slide shows an example listing)

params.in
• use_memhierarchy = 0
• dram_scheduling_policy = VAULTSIM

SDL (or XML)
• Nothing special, except that macsimComponent’s mem_link must match VaultSim’s toCPU link
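The configurations on these slides reduce to two settings, `use_memhierarchy` and `dram_scheduling_policy`. The sketch below summarises how they select the cache provider and DRAM model. It is an illustration only: `parse_params` and `memory_backend` are hypothetical helpers, not MacSim code, and the `key = value` line format is assumed from how the settings are written on the slides rather than taken from MacSim's actual params.in parser.

```python
def parse_params(text):
    """Parse 'key = value' lines (format assumed from the slides) into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        params[key] = value
    return params


def memory_backend(params):
    """Return (cache provider, DRAM model) for the cases on the slides."""
    if params.get("use_memhierarchy", "0") == "1":
        # MacSim's own DRAM controller setting has no effect here; the
        # memory model is whatever the SDL attaches behind memHierarchy.
        return ("memHierarchy", "chosen in SDL (DRAMSim2 or VaultSim)")
    policy = params.get("dram_scheduling_policy", "").upper()
    if policy in ("FCFS", "FRFCFS"):
        return ("MacSim internal cache", f"MacSim internal {policy} controller")
    if policy == "DRAMSIM":
        return ("MacSim internal cache", "DRAMSim2")
    if policy == "VAULTSIM":
        return ("MacSim internal cache", "VaultSim")
    raise ValueError(f"unrecognised dram_scheduling_policy: {policy!r}")


example = "use_memhierarchy = 0\ndram_scheduling_policy = VAULTSIM\n"
print(memory_backend(parse_params(example)))
```

With `use_memhierarchy = 1` the function ignores `dram_scheduling_policy` entirely, which matches the note that MacSim's DRAM controller configuration has no effect in that mode.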


Front-end
• Thread fetch policies
• Branch predictor

Memory System
• Software and Hardware prefetcher
• Cache studies (sharing, inclusion)
• DRAM scheduling
• Interconnection studies

Misc.
• Power model

[Diagram: MacSim with Trace Generator (PIN, GPUOcelot), Frontend, Memory System, and Hardware Prefetcher]

Software prefetch instructions
• PTX → prefetch, prefetchu
• x86 → prefetcht0, prefetcht1, prefetchnta

Hardware prefetch requests
• Stream, stride, GHB, …

• Many-thread Aware Prefetching Mechanism [Lee et al. MICRO-43, 2010]
• When prefetching works, when it doesn’t, and why [Lee et al. ACM TACO, 2012]
• Spare Register Aware Prefetching for Graph Algorithms on GPUs [Lakshminarayana, HPCA 2014]
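As a rough illustration of what a stride hardware prefetcher does, the sketch below keeps a small table indexed by the program counter of the load and issues prefetches once the same non-zero stride has been seen twice in a row. This is a generic textbook-style stride prefetcher, not MacSim's implementation; the class name, the `degree` parameter, and the confirmation rule are all assumptions made for the example.

```python
class StridePrefetcher:
    """PC-indexed stride prefetcher sketch (illustrative, not MacSim code)."""

    def __init__(self, degree=1):
        self.degree = degree   # how many strides ahead to prefetch
        self.table = {}        # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Train on a demand access and return addresses to prefetch."""
        last = self.table.get(pc)
        if last is None:
            self.table[pc] = (addr, 0)
            return []
        last_addr, last_stride = last
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Prefetch only once the same non-zero stride is seen twice in a row.
        if stride != 0 and stride == last_stride:
            return [addr + stride * i for i in range(1, self.degree + 1)]
        return []


pf = StridePrefetcher(degree=2)
for addr in (0x1000, 0x1040, 0x1080, 0x10C0):
    print(hex(addr), [hex(a) for a in pf.access(pc=0x400, addr=addr)])
```

The first two accesses only train the table; from the third access onward the prefetcher requests the next two lines along the detected 0x40 stride.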


• Cache studies – sharing, inclusion property
• On-chip interconnection studies
[Figure: private caches ($) connected through an interconnection to a shared cache]
• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]

• Heterogeneous link configuration (GPU, CPU)
• Different topologies
[Figure: ring network connecting CPU cores (C0 to C2), GPU cores (G0 to G2), L3 slices, and memory controllers (M0, M1), shown in several node arrangements]
• On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al. JPDC 2013]

[Diagram: Trace Generator (GPUOcelot), Frontend, Execution, DRAM]
• Frontend: RR, ICOUNT, FAIR, LRF, …
• DRAM: FCFS, FRFCFS, FAIR, …
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
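To make the difference between two of the DRAM policies concrete, the sketch below contrasts FCFS with FR-FCFS (first-ready, first-come-first-served), which prefers the oldest request that hits a currently open row. It is a minimal illustration, not MacSim's scheduler: the `Request` tuple, the `open_rows` map, and both function names are hypothetical, and timing constraints are ignored.

```python
from collections import namedtuple

# arrival: arrival order (smaller is older); bank/row: target DRAM location.
Request = namedtuple("Request", ["arrival", "bank", "row"])


def pick_fcfs(queue, open_rows):
    """FCFS: always service the oldest request."""
    return min(queue, key=lambda r: r.arrival)


def pick_frfcfs(queue, open_rows):
    """FR-FCFS: oldest row-hit request first, otherwise the oldest request."""
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    return min(hits or queue, key=lambda r: r.arrival)


queue = [Request(0, 0, 7), Request(1, 0, 3), Request(2, 1, 5)]
open_rows = {0: 3, 1: 9}  # bank -> currently open row

print(pick_fcfs(queue, open_rows))    # oldest request overall
print(pick_frfcfs(queue, open_rows))  # oldest request that hits an open row
```

Here FCFS picks the request to row 7 even though it needs a row change, while FR-FCFS picks the younger request to the already open row 3.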


[Figure: a DRAM controller in front of a DRAM bank, holding request queues W0 to W3 for Core-0 and for Core-1; each queue is a sequence of row hits (RH) and row misses (RM); Tolerance(Core-0) < Tolerance(Core-1)]

Potential function, with $\alpha < 1$:

$$\text{Potential} = |W_0|^{\alpha} + |W_1|^{\alpha} + |W_2|^{\alpha} + |W_3|^{\alpha}$$

(the slide's worked example includes the terms $4^{\alpha}$, $3^{\alpha}$ and $5^{\alpha}$)

Reduction in potential if
• a row hit from a queue of length $L$ is serviced next: $L^{\alpha} - (L - 1)^{\alpha}$
• a row miss from a queue of length $L$ is serviced next: $L^{\alpha} - (L - 1/m)^{\alpha}$
where $m$ = cost of servicing row miss / cost of servicing row hit

Servicing a row hit from W1 (of Core-0) results in the greatest reduction in potential, so row hits from W1 are serviced next.

• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al. IEEE CAL, 2011]
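The selection rule can be tried out numerically. The sketch below assumes the potential is the sum of L^α over the request queues (α < 1), that servicing a row hit shortens a queue by one request, and that a row miss counts as 1/m of a request, where m is the ratio of row-miss cost to row-hit cost. The function names and the example queue lengths are illustrative choices, not values or code from the paper.

```python
def potential(queue_lengths, alpha):
    """Sum of L**alpha over all queues (alpha < 1)."""
    return sum(length ** alpha for length in queue_lengths)


def reduction(length, alpha, is_row_hit, m):
    """Drop in potential from servicing the head of a queue of `length`.

    A row hit removes one request; a row miss is credited with 1/m of a
    request, where m = cost of row miss / cost of row hit.
    """
    step = 1.0 if is_row_hit else 1.0 / m
    return length ** alpha - (length - step) ** alpha


def pick_queue(queues, alpha, m):
    """Return the name of the queue whose head gives the largest reduction.

    `queues` maps a queue name to (length, head_is_row_hit).
    """
    return max(
        queues,
        key=lambda name: reduction(queues[name][0], alpha, queues[name][1], m),
    )


# Illustrative values only: lengths 4, 3 and 5, each with a row hit at the head.
queues = {"W0": (4, True), "W1": (3, True), "W2": (5, True)}
print(pick_queue(queues, alpha=0.5, m=3))
```

Because α < 1 makes L^α concave, the shortest queue yields the largest drop for a row hit, so the sketch selects W1, in the same spirit as a shortest-job-first policy.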

[Diagram: Trace Generator (PIN, GPUOcelot) feeding out-of-the-box MacSim (Frontend, Cache Hierarchy, Memory System); memory requests go to off-chip memory and to a 3D Stacked DRAM Model (new module) with DRAM stacks]
• CPU Traces (X86)
• GPU Traces (CUDA)

Configure 3-D Stack as
• DRAM caches
• Part of main memory

• Resilient Die-stacked DRAM Caches [Sim et al., ISCA-40, 2013]
• A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch [Sim et al., MICRO, 2012]

• Verifying simulator and GTX580
• Modeling X86-CPU power
• Modeling GPU power
• Still on-going research
[Figure: modeled vs. measured power across microbenchmarks, vertical axis from 0 to 1.2]
[Figure: GPU power breakdown: EX_fpu 48%, L1 27%, EX_alu 6%, RF 5%, Fetch 3%, Schedule 3%, EX_LD/ST 3%, Decode 1%, EX_SFU 1%, SharedMem 1%, ConstCache 1%, TextureCache 1%, MMU 0%, Execution 0%]

2013 ~ 2014
• Power/Energy Model
• ARM Architecture
• Mobile Platform