PTX Emulator - CompArch - Georgia Institute of Technology

Download Report

Transcript PTX Emulator - CompArch - Georgia Institute of Technology

OCELOT: PTX EMULATOR
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
1
Overview
 Ocelot
PTX Emulator
 Multicore-Backend
 NVIDIA
 AMD
GPU Backend
GPU Backend
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
2
3
Execution Model
NVIDIA’s PTX Execution Model


Parallel thread Execution (PTX)

Explicit memory hierarchy

Cooperative thread arrays (CTAs) and
ordering constraints

Array of multiprocessors each
executing a CTA (coarse grain)
SIMD multiprocessors (fine grain)


Single instruction multiple thread (SIMT) execution
Enables hardware to exploit control
flow uniformity and data locality
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
4
The Ocelot PTX Emulator
Abstract machine model
 Performs
functional simulation of
the PTX execution model

Access to complete machine state
 Enables
detailed performance
evaluation/debugging

Program correctness checking

Alignment checks, out of bounds etc.
 Workload
modeling
characterization and
 Trace
generation to drive
architecture simulators
PTX 3.0 (Fermi) support

Timing information not available
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
4
Emulator Implementation
 Serialized
execution of CTAs
Implements the CUDA execution model semantics
 CUDA does not guarantee concurrent CTA execution!

 Implements

Multiple reconvergence mechanisms are implemented:

IPDOM, Barrier, Thread frontiers
 Software

abstract reconvergence mechanism
support for special functions:
Texture sampling
1D, 2D, 3D, cube
 Nearest, linear

 New
instructions may be prototyped
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Ocelot Source Code: PTX Emulator Device Backend
• ocelot/
• executive/
•
•
•
•
•
interface/EmulatorDevice.h
interface/EmulatedKernel.h
interface/EmulatorCallStack.h
interface/CooperativeThreadArray.h
interface/TextureOperations.h
• trace/
• interface/TraceEvent.h
• interface/TraceGenerator.h
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
6
7
Trace Generator Interface
 Emulator
broadcasts events to event trace analyzers
during execution
Events provide detailed device state: PC, activity mask, operand data,
thread ID, etc.
 Used for error checking, instrumentation, and simulation

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
7
8
Trace Generator Interface
// ocelot/trace/interface/TraceGenerator.h
//
// Base class for generating traces
class TraceGenerator {
public:
TraceGenerator();
virtual ~TraceGenerator();
// called when a traced kernel is launched to
//
retrieve some
parameters from the kernel
virtual void initialize(
const executive::ExecutableKernel& kernel);
// Called whenever an event takes place.
virtual void event(const TraceEvent & event);
// called when an event is committed
virtual void postEvent(const TraceEvent & event);
// Called when a kernel is finished. There will
//
be no more events for this kernel.
virtual void finish();
};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
8
9
TraceEvent Object
 Captures
execution of a dynamic instruction, including

Device pointer to access device state

Kernel grid, CTA dimensions, parameter values

PC and internal representation of PTX instruction

Set of active threads executing instruction

Memory addresses and size of transfer

Branch target(s) and diverging threads
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
9
trace::TraceEvent
// \file ocelot/trace/interface/TraceEvent.h
class TraceEvent {
public:
// ID of the block that generated the event
ir::Dim3 blockId;
// PC index into EmulatedKernel's packed instruction sequence
ir::PTXU64 PC;
// Depth of call stack [i.e. number of contexts on the runtime
stack]
ir::PTXU32 contextStackSize;
// Instruction const pointer to instruction pointed to by PC
const ir::PTXInstruction* instruction;
// Bit mask of active threads that executed this instruction
BitMask active;
// Taken thread mask in case of a branch
BitMask taken;
// Fall through thread mask in case of a branch
BitMask fallthrough;
// Vector of memory addresses possibly generated for this
instruction
U64Vector memory_addresses;
// Vector of sizes of memory operations possibly issued by this
// instruction
ir::PTXU32 memory_size;
// Dimensions of the kernel grid that generated the event
ir::Dim3 gridDim;
// Dimensions of the kernel block that generated the event
ir::Dim3 blockDim;
// Captures just events related to thread reconvergence
ReconvergenceTraceEvent reconvergence;
};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
11
PTX Emulator – CUDA Debugging – Race Detection
// file: raceCondition.cu
__global__ void raceCondition(int *A) {
__shared__ int SharedMem[64];
SharedMem[threadIdx.x] = A[threadIdx.x];
// no synchronization barrier!
A[threadIdx.x] = SharedMem[64 - threadIdx.x];
// line 9 - faulting load
}
. . .
raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr );
. . .
==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:
==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252]
- Shared memory race condition, address 0xfc was previously written by thread 63
without a memory barrier in between.
==Ocelot== Near raceCondition.cu:9:0
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
11
12
PTX Emulator – CUDA Debugging- Illegal Memory Accesses
// file: memoryCheck.cu
__global__ void badMemoryReference(int *A) {
A[threadIdx.x] = 0;
// line 3 - faulting store
}
int main() {
int *invalidPtr = 0x0234;
// arbitrary pointer does not refer
//
int *validPtr = 0;
to an existing memory allocation
cudaMalloc((void **)&validPtr, sizeof(int)*64);
badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr );
return 0;
}
==Ocelot== Ocelot PTX Emulator failed to run kernel
"_Z18badMemoryReferencePi" with exception:
==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0
- Global memory access 0x234 is not within any allocated or mapped range.
==Ocelot==
==Ocelot== Nearby Device Allocations
==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)
==Ocelot==
==Ocelot== Near memoryCheck.cu:3:0
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
12
Interactive Debugger
 Interactive
command-line debugger implemented using
TraceGenerator interface
• Enables
• Inspection of application
state
• Single-stepping of
instructions
• Breakpoints and
watchpoints
$ ./TestCudaSequence
A_gpu = 0x16dcbe0
(ocelot-dbg) Attaching debugger to kernel
'sequence'
(ocelot-dbg)
(ocelot-dbg) watch global address 0x16dcbe4 s32[3]
set #1: watch global address 0x16dcbe4 s32[3]
- 12 bytes
(ocelot-dbg)
(ocelot-dbg) continue
st.global.s32 [%r11 + 0], %r7
watchpoint #1 - CTA (0, 0)
thread (1, 0, 0) - store to 0x16dcbe4 4 bytes
old value = -1
new value = 2
thread (2, 0, 0) - store to 0x16dcbe8 4 bytes
old value = -1
new value = 4
thread (3, 0, 0) - store to 0x16dcbec 4 bytes
old value = -1
new value = 6
break on watchpoint
(ocelot-dbg)
• Faults in MemoryChecker and
RaceDetector invoke ocelotdbg automatically
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Performance Tuning
Identify critical regions
• Memory demand
• Floating-point intensity
• Shared memory bank conflicts
•
....
for (int offset = 1; offset < n;
offset *= 2)
{
// line 61
pout = 1 - pout;
pin
= 1 - pout;
__syncthreads();
temp[pout*n+thid] = temp[pin*n+thid];
if (thid >= offset) {
temp[pout*n+thid] +=
temp[pin*n+thid - offset];
}
}
....
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
15
Example: Inter-thread Data Flow
• Which kernels exchange computed
results through shared memory?
• Track id of producer thread
• Ensure threads are well
synchronized
• Optionally ignore uses of shared
memory to transfer working sets
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
15
16
Current Trace Generators
Trace Generators
Branch
Instruction
Integrated Debugger
Kernel Dimension
Machine Attributes
Memory
Memory Checker
Memory Race Detector
Parallelism
Performance Bound
Shared Computation
Warp Synchronous
Function
Measures control flow uniformity and branch
divergence
Static and dynamic instruction count
GDB-like interface
Kernel grid and block dimensions
Observe and record machine characteristics
Working set size, memory intensity, memory efficiency
i) Bounds checks, ii) alignment checks, and iii)
uninitialized loads (shared memory)
Race conditions on shared memory
MIMD and SIMD parallelism limits
Compute and memory throughput
Extent of data flow among threads
Hot-paths/regions for warp synchronous execution
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
16
17
Using Trace Generators
 Implement

TraceGenerator interface
override methods:

 Add
initialize( ), event( ),
postEvent( ), finish( )
to Ocelot runtime
explicitly: ocelot::addTraceGenerator( )
 or, add to trace::TraceConfiguration

 Online
 Link
analysis or serialize event traces
applications with
libocelotTrace.so
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
17
18
Configuring & Executing Applications


Controls Ocelot’s initial state

Located in application’s startup directory




trace: {
Edit configure.ocelot
memoryChecker: {
enabled: true,
checkInitialization: false
trace specifies which trace generators are initially
attached
},
raceDetector: {
executive controls device properties
enabled: true,
ignoreIrrelevantWrites: true
trace:

memoryChecker – ensures

raceDetector - enforces synchronized access to .shared

debugger - interactive debugger
},
debugger: {
enabled: true,
kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
executive:


devices:

List of Ocelot backend devices that are enabled

emulated – Ocelot PTX emulator (trace generators)
alwaysAttach: false
},
},
executive: {
devices: [ "emulated" ],
Additional devices:

nvidia – execution on NVIDIA GPUs

llvm – efficient execution of PTX on multicore CPU

amd – translation to AMD IL for PTX on AMD RADEON GPU
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
}
}
19
Dynamic Instruction
Count
Example: Thread Load Imbalance
//! Computes number of dynamic instructions for each thread
class ThreadLoadImbalance: public trace::TraceGenerator {
•Mandelbrot (CUDA SDK)
public:
std::vector< size_t > dynamicInstructions;
// For each dynamic instruction, increment counters of each
// thread that executes it
virtual void event(const TraceEvent & event) {
if (!dynamicInstructions.size())
dynamicInstructions.resize(event.active.size(), 0);
for (int i = 0; i < event.active.size(); i++) {
if (event.active[i])
dynamicInstructions[i]++;
}
}
};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
19
EXAMPLE: CONTROL-FLOW
DIVERGENCE
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
20
Control-Flow Divergence in PTX Emulator


PTX Emulator facilitates customizable handlers for control
flow divergence
Currently implements:




Immediate post-dominator (ipdom)
Barrier divergence
Thread frontiers, sorted stack
Thread frontiers, GEN6

Assumes warp is CTA wide

Abstract handlers implement potentially
divergent control instructions


eg. Bra, Bar, Exit, Vote
Executing instructions drives TraceGenerators


Reconvergence affects active threads, dynamic instruction count, and
instruction trace
Analysis tools can group threads into warps of arbitrary size
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
21
class ReconvergenceMechanism
// ocelot/executive/interface/ReconvergenceMechanism.h
//
virtual void eval_Reconverge(CTAContext &context,
class ReconvergenceMechanism {
const ir::PTXInstruction &instr) = 0;
public:
ReconvergenceMechanism(CooperativeThreadArray *cta);
virtual void eval_Exit(CTAContext &context,
virtual ~ReconvergenceMechanism();
const ir::PTXInstruction &instr) = 0;
virtual void initialize() = 0;
virtual void eval_Vote(CTAContext &context,
const ir::PTXInstruction &instr);
virtual void evalPredicate(CTAContext &context) = 0;
virtual bool nextInstruction(CTAContext &context,
virtual bool eval_Bra(CTAContext &context,
const ir::PTXInstruction &instr,
const ir::PTXInstruction &instr,
const boost::dynamic_bitset<> & branch,
const ir::PTXInstruction::Opcode &) = 0;
const boost::dynamic_bitset<> & fallthrough) = 0;
virtual CTAContext& getContext() = 0;
virtual void eval_Bar(CTAContext &context,
}
const ir::PTXInstruction &instr) = 0;
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
22
Example: Immediate Post-Dominator Reconvergence
// ocelot/executive/interface/ReconvergenceMechanism.h
//
CTAContext& getContext();
size_t stackSize() const;
class ReconvergenceIPDOM: public ReconvergenceMechanism {
void push(CTAContext&);
public:
void pop();
ReconvergenceIPDOM(CooperativeThreadArray *cta);
~ReconvergenceIPDOM();
std::vector<CTAContext> runtimeStack;
std::vector<int> pcStack;
void initialize();
void evalPredicate(CTAContext &context);
bool eval_Bra(CTAContext &context,
unsigned int reconvergeEvents;
};
PTXInstruction &instr,
dynamic_bitset<> & branch,
dynamic_bitset<> & fallthrough);
void eval_Bar(CTAContext &context, PTXInstruction &instr);
void eval_Reconverge(CTAContext &context, PTXInstruction &instr);
void eval_Exit(CTAContext &context, PTXInstruction &instr);
bool nextInstruction(CTAContext &context,
PTXInstruction &instr, PTXInstruction::Opcode &);
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
23
Example: Immediate Post-Dominator Reconvergence
// ocelot/executive/implementation/ReconvergenceMechanism.cpp
//
void executive::ReconvergenceIPDOM::eval_Bar(executive::CTAContext &context,
const ir::PTXInstruction &instr) {
if (context.active.count() < context.active.size()) {
// deadlock - not all threads reach synchronization barrier
#if REPORT_BAR
report(" Bar called - " << context.active.count() << " of "
<< context.active.size() << " threads active");
#endif
std::stringstream message;
message << "barrier deadlock:\n";
throw RuntimeException(message.str(), context.PC, instr);
}
}
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
24
Example: Immediate Post-Dominator Reconvergence
void executive::ReconvergenceIPDOM::eval_Reconverge(
executive::CTAContext &context, const ir::PTXInstruction &instr) {
if(runtimeStack.size() > 1)
{
if(pcStack.back() == context.PC) {
pcStack.pop_back();
runtimeStack.pop_back();
++reconvergeEvents;
}
else {
context.PC++;
}
}
else {
context.PC++;
}
}
void executive::ReconvergenceIPDOM::eval_Exit(executive::CTAContext &context,
const ir::PTXInstruction &instr) {
eval_Bar(context, instr);
context.running = false;
}
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
25
Applications: Thread Frontiers


Evaluate the impact of novel warp reconvergence mechanisms on
unstructured control-flow graphs
Approach:




Control the layout of basic blocks
Select threads with highest priority PC
Model priority queue in hardware
Evaluation: measure impact on



Activity factor - SIMD utilization
Dynamic instruction count
Effective memory bandwidth
Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu, Sudhakar
Yalamanchili. SIMD Reconvergence at Thread Frontiers. 44th International Symposium on Microarchitecture
(MICRO 44). December 2011.
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
26
Applications: Thread Frontiers
// ocelot/executive/interface/ReconvergenceMechanism.h
//
class ReconvergenceTFSortedStack: public ReconvergenceMechanism {
public:
ReconvergenceTFSortedStack(CooperativeThreadArray *cta);
~ReconvergenceTFSortedStack();
// ... omitted
typedef std::map<int, CTAContext> RuntimeStack;
typedef std::vector<RuntimeStack> StackVector;
StackVector stack;
unsigned int reconvergeEvents;
};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
27
Applications: Thread Frontiers
// ocelot/executive/implementation/ReconvergenceMechanism.cpp
//
executive::CTAContext&
executive::ReconvergenceTFSortedStack::getContext() {
bool executive::ReconvergenceTFSortedStack::nextInstruction(
executive::CTAContext &context, PTXInstruction &instr,
return stack.back().begin()->second;
PTXInstruction::Opcode &opcode) {
}
// advance to next instruction if the current instruction
void executive::ReconvergenceTFSortedStack::eval_Exit(
// wasn't a branch
executive::CTAContext &context, PTXInstruction &instr) {
if (opcode != ir::PTXInstruction::Bra
if (stack.back().size() == 1) {
&& opcode != ir::PTXInstruction::Call
context.running = false;
&& opcode != ir::PTXInstruction::Ret) {
}
context.PC++;
else {
}
throw RuntimeException("not all threads hit the exit: ",
context.PC, instr);
}
return context.running;
}
}
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
28
Applications: Thread Frontiers
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
29
Summary: Ocelot PTX Emulator
 Used
 Can

for instruction level analysis
be attached to trace generators and trace analyzers
Simple processing and filtering of machine state
 Forms
the basis for a range of productivity tools
Correctness tools
 Debugging tools
 Workload characterization

 Drives

instruction and address traces to MACSIM
(Part 2 of this tutorial)
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
30