Overview of MacSim


HPArch Research Group

- Part 2: Overview of MacSim. Introduction, for users taking a black-box approach
- Part 3: Details of MacSim. For computer architecture researchers
- Part 4:
  - MacSim-SST case studies
  - Ocelot-MacSim case studies
  - Research using Ocelot
  - Research using MacSim

MacSim Tutorial (In ISCA-39, 2012)

- Heterogeneous architecture simulator (x86 + PTX)
- Developed at Georgia Tech
- Trace-driven simulator
  - Internal RISC-style micro-op generation module
  - x86 traces are generated with Pin; PTX traces are generated with GPUOcelot
- Cycle-level simulator
  - Cores, caches, and memory systems are modeled
- Supports various simulations: single-threaded and multi-threaded applications, multi-program workloads, and heterogeneous (CPU+GPU) workloads

- Flexible design to support various platforms
- Integration with a parallel simulator (SST) to support high-performance computing systems
- From mobile to exascale computing systems

[Figure: trace generation flow. CUDA code (.cu) is compiled by NVCC into PTX code, which goes to the GPUOcelot (Prof. Yalamanchili, Georgia Tech) trace generator; x86 binaries go to the PIN trace generator. The trace generators supply instruction and thread information to the heterogeneous architecture timing and power simulator. Ongoing work: OpenGL code, with PIN (API generator) and Attila (OpenGL emulator).]

- Getting MacSim
  - Stable version (Google Code project): http://macsim.googlecode.com/files/macsim-1.0.tar.gz
  - Latest code from the SVN repository
  - Directions are explained in http://code.google.com/p/macsim/wiki/GettingMacsim
- How to build
  - http://code.google.com/p/macsim/wiki/BuildingMacsim
  - Chapter 2 of the manual provides build instructions
  - README file in the simulator directory

- MacSim package
  - IRIS (NoC simulator from Prof. Yalamanchili’s group) is included
  - CPU trace generator: download PIN separately. The trace generator tool is in the MacSim package
  - GPU trace generator: download Ocelot separately. The trace generator is in the Ocelot package
- MacSim-SST
  - SST needs to be downloaded separately
- Energy Introspector (EI, from Prof. Yalamanchili’s group)
  - EI is a power model based on McPAT and HotSpot
  - Because of a McPAT license issue, EI cannot currently be distributed; we will resolve this issue soon


- Once the build succeeds, the binary is created at macsim-top/trunk/bin/macsim
- [Screenshot of a simulation]
- Next question: how are the simulation models configured?

- Knob variables need to be set up. There are 3 ways:
  - Default value in the source code
  - params.in
  - Command line

[Figure: a system built from cores of three types (core type 1, core type 2, core type 3) and memory]

- Configuration: 4 cores, 2-way SMT
- Knob values come from the .def (param) files, params.in, and the command line

params.in:

    num_sim_cores 4                 // 4 cores
    num_sim_small_cores 0
    num_sim_medium_cores 0
    num_sim_large_cores 4
    max_threads_per_large_core 2
    large_core_type x86
    repeat_trace 1

Command line:

    ./macsim --num_sim_cores=4
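The three knob sources can be combined as in the following Python sketch. It is not MacSim's own parser: the precedence order (command line over params.in over built-in defaults) is an assumption, as are the helper names `parse_params_text`, `parse_command_line`, and `resolve_knobs` and the default values. Only the knob names and the `name value // comment` line layout come from the example above.

```python
def parse_params_text(text):
    """Parse 'name value // comment' lines into a dict of strings."""
    knobs = {}
    for raw in text.splitlines():
        # Drop trailing '//' comments and '#' comment lines.
        line = raw.split("//", 1)[0].split("#", 1)[0].strip()
        parts = line.split(None, 1)
        if len(parts) == 2:
            knobs[parts[0]] = parts[1].strip()
    return knobs


def parse_command_line(argv):
    """Parse '--name=value' arguments into a dict of strings."""
    knobs = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            name, value = arg[2:].split("=", 1)
            knobs[name] = value
    return knobs


def resolve_knobs(defaults, params_text, argv):
    """Assumed precedence: defaults < params.in < command line."""
    resolved = dict(defaults)
    resolved.update(parse_params_text(params_text))
    resolved.update(parse_command_line(argv))
    return resolved


# Hypothetical defaults standing in for the values compiled into the simulator.
DEFAULTS = {"num_sim_cores": "1", "repeat_trace": "0"}

PARAMS_IN = """\
num_sim_cores 2 // overridden on the command line below
num_sim_large_cores 4
max_threads_per_large_core 2
large_core_type x86
repeat_trace 1
"""

knobs = resolve_knobs(DEFAULTS, PARAMS_IN, ["--num_sim_cores=4"])
print(knobs["num_sim_cores"], knobs["repeat_trace"])
```

Keeping every value as a string postpones type handling to the consumer of each knob, which keeps the sketch independent of any particular knob's type.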


- To configure a CPU+GPU architecture, set the number and type of cores accordingly:

    num_sim_cores 8            // 4 CPUs + 4 GPUs
    num_sim_small_cores 4      // 4 GPUs
    num_sim_medium_cores 0
    num_sim_large_cores 4      // 4 CPUs
    core_type ptx              // specify small cores
    large_core_type x86
    cpu_frequency 3
    gpu_frequency 1.5
    repeat_trace 1

- Usually, we use the small core type for GPUs and the large core type for CPUs
- Each GPU core internally has multiple processing elements (N-wide SIMD)

- Multiple applications are set up in trace_file_list:

    4                                 <- number of applications
    /sample/mcf/trace.txt             <- application 1
    /sample/gcc/trace.txt             <- application 2
    /sample/mm/trace.txt              <- application 3
    /sample/blackscholes/trace.txt    <- application 4

[Figure: MCF, GCC, MM thread 1, MM thread 2]
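The list layout (a count followed by one trace path per line) is simple enough to generate from a script. The following Python sketch assumes that the `<-` annotations on the slide are explanatory only and are not part of the file; the function names are hypothetical.

```python
def format_trace_file_list(trace_paths):
    """Return file contents: the application count, then one path per line."""
    lines = [str(len(trace_paths))] + list(trace_paths)
    return "\n".join(lines) + "\n"


def parse_trace_file_list(text):
    """Read the contents back and check that the count matches the paths."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    count = int(lines[0])
    paths = lines[1:]
    if len(paths) != count:
        raise ValueError(f"expected {count} traces, found {len(paths)}")
    return paths


traces = [
    "/sample/mcf/trace.txt",
    "/sample/gcc/trace.txt",
    "/sample/mm/trace.txt",
    "/sample/blackscholes/trace.txt",
]
contents = format_trace_file_list(traces)
print(contents, end="")
```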

- Execution time differs for each application.
- MacSim provides an option to repeat short traces until the longest trace ends.

[Figure: Program 1 runs mcf once; Programs 2 and 3 run the shorter gcc and bfs traces, which are repeated while mcf is still running]

- Open question: is this the right way to simulate?
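To make the repetition policy concrete, the Python sketch below computes how many times each trace would be started before the longest one finishes. It is a deliberately simplified model: it assumes each application has a fixed run length that does not depend on the other applications, and the cycle counts are invented for illustration.

```python
def repeat_counts(run_cycles):
    """Number of times each trace starts before the longest trace ends.

    run_cycles maps an application name to the (hypothetical) number of
    cycles one pass over its trace takes.
    """
    longest = max(run_cycles.values())
    # Integer ceiling division: a partially completed final pass still counts.
    return {name: -(-longest // cycles) for name, cycles in run_cycles.items()}


counts = repeat_counts({"mcf": 1000, "gcc": 400, "bfs": 250})
print(counts)
```

In a real simulation the applications contend for shared caches and memory, so run lengths change from one repetition to the next; that interaction is part of why the slide leaves the methodology as an open question.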

- Sample configuration files in macsim-top/trunk/params:

    File name              Contents
    params_8800gt          GeForce 8800 GT (G80)
    params_gtx280          GeForce GTX 280 (GT200)
    params_gtx465          NVIDIA GeForce GTX 465 (Fermi)
    params_x86             Intel’s Sandy Bridge (CPU part only)
    params_hetero_4c_4g    Intel’s Sandy Bridge (CPU + GPU)

- Thread spawn is modeled.
- Lock is not modeled.

[Figure: main thread / host thread; thread spawn / GPU kernel invocation; barrier; four cores]

- Details will be covered in Part III.
- The trace generator produces thread execution information automatically.
- Users do not need to worry about this.

- MacSim has 5 different clock domains:
  - CPU
  - GPU
  - Last-level cache
  - Interconnection network
  - DRAM

    # Clock
    clock_cpu 3
    clock_gpu 1.5
    clock_l3 1
    clock_noc 1
    clock_mc 1.6
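One common way to drive several clock domains from a single simulation loop is to derive a master clock whose frequency is a common multiple of all domain frequencies and to tick each domain every `master / frequency` master cycles. The Python sketch below illustrates that idea with the knob values above. Whether MacSim implements its clock domains this way is not stated here, and the function names are hypothetical.

```python
from fractions import Fraction
from math import gcd


def clock_dividers(freqs):
    """Return (master frequency, divider per domain) for exact ticking.

    freqs maps a knob name to its frequency given as a decimal string.
    """
    fracs = {name: Fraction(value) for name, value in freqs.items()}
    num_lcm, den_gcd = 1, 0
    for f in fracs.values():
        # lcm of reduced fractions = lcm(numerators) / gcd(denominators)
        num_lcm = num_lcm * f.numerator // gcd(num_lcm, f.numerator)
        den_gcd = gcd(den_gcd, f.denominator)
    master = Fraction(num_lcm, den_gcd)
    return master, {name: int(master / f) for name, f in fracs.items()}


def count_ticks(dividers, master_cycles):
    """Count how often each domain ticks during master_cycles master cycles."""
    ticks = {name: 0 for name in dividers}
    for cycle in range(master_cycles):
        for name, div in dividers.items():
            if cycle % div == 0:
                ticks[name] += 1
    return ticks


master, dividers = clock_dividers({
    "clock_cpu": "3", "clock_gpu": "1.5", "clock_l3": "1",
    "clock_noc": "1", "clock_mc": "1.6",
})
ticks = count_ticks(dividers, 240)
print(master, dividers, ticks)
```

Exact rational arithmetic avoids the drift that accumulates when fractional frequencies such as 1.6 are stepped with floating-point increments.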


- x86 instructions are mapped to uops
- PTX instructions are mapped to uops (almost 1-to-1 mapping)

[Figure: Pin XED supplies macro instructions with decoded information to MacSim, where the trace decoder produces uops for the timing/power simulator]

- Pipeline stages: front-end, decode, rename, schedule, execution, memory, retire

- Front-end, decode/rename: just a simple FIFO queue.

    fetch_latency 5        // front-end depth
    alloc_latency 5        // decode/allocation depth
    width                  // pipeline width (same width for all pipeline stages)
    bp_dir_mech gshare
    bp_hist_length 14      // branch history length

- Rename: creates RAW dependencies (map structure)

    rob_size 96            // ROB size

- Scheduler: in-order scheduler or out-of-order scheduler

    schedule io, ooo       // instruction scheduling policy
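For readers unfamiliar with the `gshare` setting, the Python sketch below shows the textbook scheme: the branch address is XORed with a global history register to index a table of 2-bit saturating counters. It is a generic illustration with a 14-bit history to match `bp_hist_length 14`, not a transcription of MacSim's predictor. The class name, the initial counter value, and the example branch address are assumptions.

```python
class Gshare:
    """Textbook gshare predictor with 2-bit saturating counters."""

    def __init__(self, hist_length=14):
        self.mask = (1 << hist_length) - 1
        self.history = 0
        # Counters start at 1 (weakly not-taken); this choice is an assumption.
        self.counters = [1] * (1 << hist_length)

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """Predict taken when the counter is in one of the two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the indexed counter, then shift the outcome into the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = Gshare(hist_length=14)
pc = 0x400123  # hypothetical branch address
predictions = []
for _ in range(50):  # a branch that is always taken
    predictions.append(predictor.predict(pc))
    predictor.update(pc, True)
print(predictions[-10:])
```

The early predictions are wrong because each new history value selects a fresh counter; once the history saturates, the same counter is trained repeatedly and the predictor settles.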

- Execution latency
  - Fixed uop latency (macsim-top/def/uop_latency_[x86,ptx].def)
  - Variable latency: cache/memory latency
- Instruction scheduling rates

    isched_rate 4    // number of integer instructions that can be executed per cycle
    msched_rate 2    // number of memory instructions that can be executed per cycle
    fsched_rate 2    // number of FP instructions that can be executed per cycle

- Cache configuration
  - Number of sets, associativity, line size, number of banks, etc. (see manual)
  - Cache size = number of sets x associativity x line size x number of tiles (tiles apply to L3 only)
- DRAM configuration
  - Frequency, bus width, column/activate/precharge latency
  - Number of memory controllers, banks, and channels; row buffer size; DRAM scheduling policy
  - Simple but fast DRAM model that captures key features
- MacSim is connected with DRAM-SIM2
  - Users can use DRAM-SIM2 for detailed DRAM timing simulation
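The cache size formula is easy to check with a few lines of Python. The parameter values below are illustrative only and are not taken from any MacSim configuration file; `cache_size_bytes` is a hypothetical helper.

```python
def cache_size_bytes(num_sets, assoc, line_size, num_tiles=1):
    """Cache size = sets x associativity x line size x tiles (tiles: L3 only)."""
    return num_sets * assoc * line_size * num_tiles


# Hypothetical L1: 64 sets, 8-way, 64-byte lines.
l1_bytes = cache_size_bytes(64, 8, 64)
# Hypothetical L3: 1024 sets, 16-way, 64-byte lines, 4 tiles.
l3_bytes = cache_size_bytes(1024, 16, 64, num_tiles=4)

print(l1_bytes // 1024, "KB L1")
print(l3_bytes // (1024 * 1024), "MB L3")
```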

- Statistics
  - Simulation outputs: *.stat.out
  - Files in macsim/trunk/def hold the stat definitions (more details in Part III)
- Important stats
  - IPC = INST_COUNT_TOT / CYC_COUNT_TOT
  - CPI = CYC_COUNT_TOT / INST_COUNT_TOT
- Per-core stats
  - IPC for core 0 = INST_COUNT_CORE_0 / CYC_COUNT_CORE_0
- Multiple-application stats
  - One *.stat.out file per application, suffixed with the application index, e.g. memory.stat.out.0, bp.stat.out.1
  - Each stat file contains stats only for the first run (repeated simulations are ignored)
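The Python sketch below applies the IPC and CPI definitions to a dictionary of counters. The counter names follow the definitions above; the counter values are invented, and reading them out of a `*.stat.out` file is left out because the file layout is not described here.

```python
def ipc(stats, core=None):
    """Instructions per cycle, for the whole run or for a single core."""
    suffix = "TOT" if core is None else f"CORE_{core}"
    return stats[f"INST_COUNT_{suffix}"] / stats[f"CYC_COUNT_{suffix}"]


def cpi(stats, core=None):
    """Cycles per instruction: the reciprocal of IPC."""
    suffix = "TOT" if core is None else f"CORE_{core}"
    return stats[f"CYC_COUNT_{suffix}"] / stats[f"INST_COUNT_{suffix}"]


# Hypothetical counter values for illustration.
stats = {
    "INST_COUNT_TOT": 2000,
    "CYC_COUNT_TOT": 1000,
    "INST_COUNT_CORE_0": 600,
    "CYC_COUNT_CORE_0": 1000,
}

print(ipc(stats), cpi(stats), ipc(stats, core=0))
```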

- Memory systems
  - L[1-3]_HIT_CPU / L[1-3]_HIT_GPU
  - L[1-3]_MISS_CPU / L[1-3]_MISS_GPU
- Front-end
  - BP_ON_PATH_[CORRECT/MISPREDICT/MISFETCH]
- Instruction profiling
  - Based on instruction category (inst.stat.out)
- More details regarding statistics are in the documentation
- We will provide a simple script to fetch stat data


- Multi-threading support is already there.
- Different ISAs: handled using micro-ops
- Warps
  - One warp is treated as one thread. Each thread generates its own trace file.
  - Active bit information is included
  - The trace format will be explained in Part III
- Thread and block scheduling
  - Block-level barrier, block-level scheduling/retirement
  - More details will be explained in Part III
- Different memory structures
  - Memory systems

[Figure: a SIMD load instruction accesses Addr 0 through Addr 7. Coalesced: one memory instruction of 128B. Uncoalesced: one 64B request and two 32B requests.]

[Figure: trace file layouts. One trace file holds a single TraceInst; the other holds TraceInst_begin (start-of-memory-instruction marker), TraceMem1, TraceMem2, TraceMem3, and TraceInst_end (end-of-memory-instruction marker).]

- Include the memory access by each thread of a warp as a separate instruction in the trace
- In the trace, mark these accesses as coming from the same warp

[Figure: the trace file entries TraceInst_begin (start-of-memory-instruction marker), TraceMem1, TraceMem2, TraceMem3, ..., TraceMemN, and TraceInst_end (end-of-memory-instruction marker) become, inside MacSim, a parent uop (Mem_type: ld, #children: 8) with child uops for addr0, addr1, ..., addrN]

- During simulation, MacSim forms a “parent” uop that holds all the individual memory accesses as its child uops
- The parent uop flows through the pipeline; only in the memory stage are the individual child uops issued to the memory system
- The parent uop is ready for retirement when all of its children have completed
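The parent/child relationship can be sketched in Python as follows. The record tuples stand in for trace entries and are a hypothetical encoding (the real trace format is covered in Part III); the class and function names are likewise illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class ChildUop:
    addr: int
    done: bool = False


@dataclass
class ParentUop:
    mem_type: str
    children: list = field(default_factory=list)

    def ready_to_retire(self):
        """The parent may retire only when every child access has completed."""
        return all(child.done for child in self.children)


def build_parent_uops(records):
    """Group the records between begin and end markers under one parent uop."""
    parents, current = [], None
    for record in records:
        kind = record[0]
        if kind == "TraceInst_begin":
            current = ParentUop(mem_type=record[1])
        elif kind == "TraceMem" and current is not None:
            current.children.append(ChildUop(addr=record[1]))
        elif kind == "TraceInst_end" and current is not None:
            parents.append(current)
            current = None
    return parents


# Hypothetical records for one uncoalesced load with three accesses.
records = [
    ("TraceInst_begin", "ld"),
    ("TraceMem", 0x1000),
    ("TraceMem", 0x2040),
    ("TraceMem", 0x3080),
    ("TraceInst_end",),
]
parent = build_parent_uops(records)[0]
before = parent.ready_to_retire()
for child in parent.children:  # the memory stage completes each child access
    child.done = True
after = parent.ready_to_retire()
print(len(parent.children), before, after)
```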


- IRIS (from Prof. Yalamanchili’s group)
  - Flit-level interconnection network simulator
  - Virtual channels, credit-based flow control, deadlock avoidance, ...
  - Part IV will cover more
- MacSim-SST
  - Parallel simulation

[Figure: nodes attached to routers; topology: ring, mesh, torus, ...]