PTX Emulator - CompArch - Georgia Institute of Technology
Download
Report
Transcript PTX Emulator - CompArch - Georgia Institute of Technology
OCELOT: PTX EMULATOR
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
1
Overview
Ocelot
PTX Emulator
Multicore-Backend
NVIDIA
AMD
GPU Backend
GPU Backend
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
2
3
Execution Model
NVIDIA’s PTX Execution Model
Parallel thread Execution (PTX)
Explicit memory hierarchy
Cooperative thread arrays (CTAs) and
ordering constraints
Array of multiprocessors each
executing a CTA (coarse grain)
SIMD multiprocessors (fine grain)
Single instruction multiple thread (SIMT) execution
Enables hardware to exploit control
flow uniformity and data locality
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
4
The Ocelot PTX Emulator
Abstract machine model
Performs
functional simulation of
the PTX execution model
Access to complete machine state
Enables
detailed performance
evaluation/debugging
Program correctness checking
Alignment checks, out of bounds etc.
Workload
modeling
characterization and
Trace
generation to drive
architecture simulators
PTX 3.0 (Fermi) support
Timing information not available
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
4
Emulator Implementation
Serialized
execution of CTAs
Implements the CUDA execution model semantics
CUDA does not guarantee concurrent CTA execution!
Implements
Multiple reconvergence mechanisms are implemented:
IPDOM, Barrier, Thread frontiers
Software
abstract reconvergence mechanism
support for special functions:
Texture sampling
1D, 2D, 3D, cube
Nearest, linear
New
instructions may be prototyped
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Ocelot Source Code: PTX Emulator Device Backend
• ocelot/
• executive/
•
•
•
•
•
interface/EmulatorDevice.h
interface/EmulatedKernel.h
interface/EmulatorCallStack.h
interface/CooperativeThreadArray.h
interface/TextureOperations.h
• trace/
• interface/TraceEvent.h
• interface/TraceGenerator.h
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
6
7
Trace Generator Interface
Emulator
broadcasts events to event trace analyzers
during execution
Events provide detailed device state: PC, activity mask, operand data,
thread ID, etc.
Used for error checking, instrumentation, and simulation
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
7
8
Trace Generator Interface
// ocelot/trace/interface/TraceGenerator.h
//
// Base class for generating traces
class TraceGenerator {
public:
TraceGenerator();
virtual ~TraceGenerator();
// called when a traced kernel is launched to
//
retrieve some
parameters from the kernel
virtual void initialize(
const executive::ExecutableKernel& kernel);
// Called whenever an event takes place.
virtual void event(const TraceEvent & event);
// called when an event is committed
virtual void postEvent(const TraceEvent & event);
// Called when a kernel is finished. There will
//
be no more events for this kernel.
virtual void finish();
};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
8
9
TraceEvent Object
Captures
execution of a dynamic instruction, including
Device pointer to access device state
Kernel grid, CTA dimensions, parameter values
PC and internal representation of PTX instruction
Set of active threads executing instruction
Memory addresses and size of transfer
Branch target(s) and diverging threads
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
9
trace::TraceEvent
// \file ocelot/trace/interface/TraceEvent.h
class TraceEvent {
public:
// ID of the block that generated the event
ir::Dim3 blockId;
// PC index into EmulatedKernel's packed instruction sequence
ir::PTXU64 PC;
// Depth of call stack [i.e. number of contexts on the runtime
stack]
ir::PTXU32 contextStackSize;
// Instruction const pointer to instruction pointed to by PC
const ir::PTXInstruction* instruction;
// Bit mask of active threads that executed this instruction
BitMask active;
// Taken thread mask in case of a branch
BitMask taken;
// Fall through thread mask in case of a branch
BitMask fallthrough;
// Vector of memory addresses possibly generated for this
instruction
U64Vector memory_addresses;
// Vector of sizes of memory operations possibly issued by this
// instruction
ir::PTXU32 memory_size;
// Dimensions of the kernel grid that generated the event
ir::Dim3 gridDim;
// Dimensions of the kernel block that generated the event
ir::Dim3 blockDim;
// Captures just events related to thread reconvergence
ReconvergenceTraceEvent reconvergence;
};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
11
PTX Emulator – CUDA Debugging – Race Detection
// file: raceCondition.cu
__global__ void raceCondition(int *A) {
__shared__ int SharedMem[64];
SharedMem[threadIdx.x] = A[threadIdx.x];
// no synchronization barrier!
A[threadIdx.x] = SharedMem[64 - threadIdx.x];
// line 9 - faulting load
}
. . .
raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr );
. . .
==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:
==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252]
- Shared memory race condition, address 0xfc was previously written by thread 63
without a memory barrier in between.
==Ocelot== Near raceCondition.cu:9:0
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
11
12
PTX Emulator – CUDA Debugging- Illegal Memory Accesses
// file: memoryCheck.cu
__global__ void badMemoryReference(int *A) {
A[threadIdx.x] = 0;
// line 3 - faulting store
}
int main() {
int *invalidPtr = 0x0234;
// arbitrary pointer does not refer
//
int *validPtr = 0;
to an existing memory allocation
cudaMalloc((void **)&validPtr, sizeof(int)*64);
badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr );
return 0;
}
==Ocelot== Ocelot PTX Emulator failed to run kernel
"_Z18badMemoryReferencePi" with exception:
==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0
- Global memory access 0x234 is not within any allocated or mapped range.
==Ocelot==
==Ocelot== Nearby Device Allocations
==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)
==Ocelot==
==Ocelot== Near memoryCheck.cu:3:0
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
12
Interactive Debugger
Interactive
command-line debugger implemented using
TraceGenerator interface
• Enables
• Inspection of application
state
• Single-stepping of
instructions
• Breakpoints and
watchpoints
$ ./TestCudaSequence
A_gpu = 0x16dcbe0
(ocelot-dbg) Attaching debugger to kernel
'sequence'
(ocelot-dbg)
(ocelot-dbg) watch global address 0x16dcbe4 s32[3]
set #1: watch global address 0x16dcbe4 s32[3]
- 12 bytes
(ocelot-dbg)
(ocelot-dbg) continue
st.global.s32 [%r11 + 0], %r7
watchpoint #1 - CTA (0, 0)
thread (1, 0, 0) - store to 0x16dcbe4 4 bytes
old value = -1
new value = 2
thread (2, 0, 0) - store to 0x16dcbe8 4 bytes
old value = -1
new value = 4
thread (3, 0, 0) - store to 0x16dcbec 4 bytes
old value = -1
new value = 6
break on watchpoint
(ocelot-dbg)
• Faults in MemoryChecker and
RaceDetector invoke ocelotdbg automatically
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Performance Tuning
Identify critical regions
• Memory demand
• Floating-point intensity
• Shared memory bank conflicts
•
....
for (int offset = 1; offset < n;
offset *= 2)
{
// line 61
pout = 1 - pout;
pin
= 1 - pout;
__syncthreads();
temp[pout*n+thid] = temp[pin*n+thid];
if (thid >= offset) {
temp[pout*n+thid] +=
temp[pin*n+thid - offset];
}
}
....
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
15
Example: Inter-thread Data Flow
• Which kernels exchange computed
results through shared memory?
• Track id of producer thread
• Ensure threads are well
synchronized
• Optionally ignore uses of shared
memory to transfer working sets
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
15
16
Current Trace Generators
Trace Generators
Branch
Instruction
Integrated Debugger
Kernel Dimension
Machine Attributes
Memory
Memory Checker
Memory Race Detector
Parallelism
Performance Bound
Shared Computation
Warp Synchronous
Function
Measures control flow uniformity and branch
divergence
Static and dynamic instruction count
GDB-like interface
Kernel grid and block dimensions
Observe and record machine characteristics
Working set size, memory intensity, memory efficiency
i) Bounds checks, ii) alignment checks, and iii)
uninitialized loads (shared memory)
Race conditions on shared memory
MIMD and SIMD parallelism limits
Compute and memory throughput
Extent of data flow among threads
Hot-paths/regions for warp synchronous execution
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
16
17
Using Trace Generators
Implement
TraceGenerator interface
override methods:
Add
initialize( ), event( ),
postEvent( ), finish( )
to Ocelot runtime
explicitly: ocelot::addTraceGenerator( )
or, add to trace::TraceConfiguration
Online
Link
analysis or serialize event traces
applications with
libocelotTrace.so
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
17
18
Configuring & Executing Applications
Controls Ocelot’s initial state
Located in application’s startup directory
trace: {
Edit configure.ocelot
memoryChecker: {
enabled: true,
checkInitialization: false
trace specifies which trace generators are initially
attached
},
raceDetector: {
executive controls device properties
enabled: true,
ignoreIrrelevantWrites: true
trace:
memoryChecker – ensures
raceDetector - enforces synchronized access to .shared
debugger - interactive debugger
},
debugger: {
enabled: true,
kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
executive:
devices:
List of Ocelot backend devices that are enabled
emulated – Ocelot PTX emulator (trace generators)
alwaysAttach: false
},
},
executive: {
devices: [ "emulated" ],
Additional devices:
nvidia – execution on NVIDIA GPUs
llvm – efficient execution of PTX on multicore CPU
amd – translation to AMD IL for PTX on AMD RADEON GPU
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
}
}
19
Dynamic Instruction
Count
Example: Thread Load Imbalance
//! Computes number of dynamic instructions for each thread
class ThreadLoadImbalance: public trace::TraceGenerator {
•Mandelbrot (CUDA SDK)
public:
std::vector< size_t > dynamicInstructions;
// For each dynamic instruction, increment counters of each
// thread that executes it
virtual void event(const TraceEvent & event) {
if (!dynamicInstructions.size())
dynamicInstructions.resize(event.active.size(), 0);
for (int i = 0; i < event.active.size(); i++) {
if (event.active[i])
dynamicInstructions[i]++;
}
}
};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
19
EXAMPLE: CONTROL-FLOW
DIVERGENCE
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
20
Control-Flow Divergence in PTX Emulator
PTX Emulator facilitates customizable handlers for control
flow divergence
Currently implements:
Immediate post-dominator (ipdom)
Barrier divergence
Thread frontiers, sorted stack
Thread frontiers, GEN6
Assumes warp is CTA wide
Abstract handlers implement potentially
divergent control instructions
eg. Bra, Bar, Exit, Vote
Executing instructions drives TraceGenerators
Reconvergence affects active threads, dynamic instruction count, and
instruction trace
Analysis tools can group threads into warps of arbitrary size
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
21
class ReconvergenceMechanism
// ocelot/executive/interface/ReconvergenceMechanism.h
//
virtual void eval_Reconverge(CTAContext &context,
class ReconvergenceMechanism {
const ir::PTXInstruction &instr) = 0;
public:
ReconvergenceMechanism(CooperativeThreadArray *cta);
virtual void eval_Exit(CTAContext &context,
virtual ~ReconvergenceMechanism();
const ir::PTXInstruction &instr) = 0;
virtual void initialize() = 0;
virtual void eval_Vote(CTAContext &context,
const ir::PTXInstruction &instr);
virtual void evalPredicate(CTAContext &context) = 0;
virtual bool nextInstruction(CTAContext &context,
virtual bool eval_Bra(CTAContext &context,
const ir::PTXInstruction &instr,
const ir::PTXInstruction &instr,
const boost::dynamic_bitset<> & branch,
const ir::PTXInstruction::Opcode &) = 0;
const boost::dynamic_bitset<> & fallthrough) = 0;
virtual CTAContext& getContext() = 0;
virtual void eval_Bar(CTAContext &context,
}
const ir::PTXInstruction &instr) = 0;
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
22
Example: Immediate Post-Dominator Reconvergence
// ocelot/executive/interface/ReconvergenceMechanism.h
//
CTAContext& getContext();
size_t stackSize() const;
class ReconvergenceIPDOM: public ReconvergenceMechanism {
void push(CTAContext&);
public:
void pop();
ReconvergenceIPDOM(CooperativeThreadArray *cta);
~ReconvergenceIPDOM();
std::vector<CTAContext> runtimeStack;
std::vector<int> pcStack;
void initialize();
void evalPredicate(CTAContext &context);
bool eval_Bra(CTAContext &context,
unsigned int reconvergeEvents;
};
PTXInstruction &instr,
dynamic_bitset<> & branch,
dynamic_bitset<> & fallthrough);
void eval_Bar(CTAContext &context, PTXInstruction &instr);
void eval_Reconverge(CTAContext &context, PTXInstruction &instr);
void eval_Exit(CTAContext &context, PTXInstruction &instr);
bool nextInstruction(CTAContext &context,
PTXInstruction &instr, PTXInstruction::Opcode &);
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
23
Example: Immediate Post-Dominator Reconvergence
// ocelot/executive/implementation/ReconvergenceMechanism.cpp
//
void executive::ReconvergenceIPDOM::eval_Bar(executive::CTAContext &context,
const ir::PTXInstruction &instr) {
if (context.active.count() < context.active.size()) {
// deadlock - not all threads reach synchronization barrier
#if REPORT_BAR
report(" Bar called - " << context.active.count() << " of "
<< context.active.size() << " threads active");
#endif
std::stringstream message;
message << "barrier deadlock:\n";
throw RuntimeException(message.str(), context.PC, instr);
}
}
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
24
Example: Immediate Post-Dominator Reconvergence
void executive::ReconvergenceIPDOM::eval_Reconverge(
executive::CTAContext &context, const ir::PTXInstruction &instr) {
if(runtimeStack.size() > 1)
{
if(pcStack.back() == context.PC) {
pcStack.pop_back();
runtimeStack.pop_back();
++reconvergeEvents;
}
else {
context.PC++;
}
}
else {
context.PC++;
}
}
void executive::ReconvergenceIPDOM::eval_Exit(executive::CTAContext &context,
const ir::PTXInstruction &instr) {
eval_Bar(context, instr);
context.running = false;
}
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
25
Applications: Thread Frontiers
Evaluate the impact of novel warp reconvergence mechanisms on
unstructured control-flow graphs
Approach:
Control the layout of basic blocks
Select threads with highest priority PC
Model priority queue in hardware
Evaluation: measure impact on
Activity factor - SIMD utilization
Dynamic instruction count
Effective memory bandwidth
Gregory Diamos, Benjamin Ashbaugh, Subramaniam Maiyuran, Andrew Kerr, Haicheng Wu, Sudhakar
Yalamanchili. SIMD Reconvergence at Thread Frontiers. 44th International Symposium on Microarchitecture
(MICRO 44). December 2011.
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
26
Applications: Thread Frontiers
// ocelot/executive/interface/ReconvergenceMechanism.h
//
class ReconvergenceTFSortedStack: public ReconvergenceMechanism {
public:
ReconvergenceTFSortedStack(CooperativeThreadArray *cta);
~ReconvergenceTFSortedStack();
// ... omitted
typedef std::map<int, CTAContext> RuntimeStack;
typedef std::vector<RuntimeStack> StackVector;
StackVector stack;
unsigned int reconvergeEvents;
};
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
27
Applications: Thread Frontiers
// ocelot/executive/implementation/ReconvergenceMechanism.cpp
//
executive::CTAContext&
executive::ReconvergenceTFSortedStack::getContext() {
bool executive::ReconvergenceTFSortedStack::nextInstruction(
executive::CTAContext &context, PTXInstruction &instr,
return stack.back().begin()->second;
PTXInstruction::Opcode &opcode) {
}
// advance to next instruction if the current instruction
void executive::ReconvergenceTFSortedStack::eval_Exit(
// wasn't a branch
executive::CTAContext &context, PTXInstruction &instr) {
if (opcode != ir::PTXInstruction::Bra
if (stack.back().size() == 1) {
&& opcode != ir::PTXInstruction::Call
context.running = false;
&& opcode != ir::PTXInstruction::Ret) {
}
context.PC++;
else {
}
throw RuntimeException("not all threads hit the exit: ",
context.PC, instr);
}
return context.running;
}
}
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
28
Applications: Thread Frontiers
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
29
Summary: Ocelot PTX Emulator
Used
Can
for instruction level analysis
be attached to trace generators and trace analyzers
Simple processing and filtering of machine state
Forms
the basis for a range of productivity tools
Correctness tools
Debugging tools
Workload characterization
Drives
instruction and address traces to MACSIM
(Part 2 of this tutorial)
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
30