12JAN07 Talk for I/UCRC Annual Meeting


Debugging and Optimizing RC Applications
Seth Koehler
John Curreri
Presentation Overview

• Introduction
• Background
  – Reconfigurable computing (RC) applications
  – Debug
  – Performance analysis
• Project overview
• Project details
  – ReCAP framework & tool
  – Special features
  – HLL-based debug & performance analysis
• Case studies
• Conclusions
Introduction

• Debugging and optimization are an integral part of application development
  – Typically at end of development cycle (after formulation and design phases)
  – Designers often spend longer debugging the application than designing it! *
• Optimization is often just left for a later version, if ever
  – Every optimization made has to re-pass through the debug phase
• To improve productivity in application design, it is critical to address debug and optimization

[Figure: development cycle — Formulation, Design, Translation, Execution]

* Debugging FPGA systems, ftp://ftp.altera.com/outgoing/download/education/events/highspeed/Tek_ALTERAFPGADEBUG_IPIntegration_final.pdf
Background – RC Applications

• Why reconfigurable computing (RC)?
  – General-purpose architectures can be wasteful in terms of performance and power
  – Impractical to have an ASIC for every application
  – RC ~= FPGAs (Field-Programmable Gate Arrays)
• RC applications typically employ CPUs and FPGAs
  – Leverage strengths of both types of processors
  – Potential for higher performance using less power
• Programmed using Hardware Description Languages (HDLs) or High-Level Languages (HLLs)
  – Application-specific hardware and parallelism
  – Retain flexibility and programmability
  – CPU is programmed with whatever conventional HLL is desired (C, C++, MPI, UPC, etc.)
• System and application complexity can make it difficult to achieve a correct, well-performing application
Background – Debug

• Debug: to detect and remove errors from a program *
• Debugging methods
  – Stare at code
    • At least it helps you "wrap your mind around your code"
  – Insert printf statements
    • Requires some good guessing; can be tedious if more than a few printf's
  – Use debugger (e.g., gdb)
    • Much better – instant access to all data and support for indicating where/why a program crashed
  – Use simulator
    • Can provide more flexibility and information than a debugger, but simulators can be inaccurate and slow, not to mention hard to make
  – Write assertions (see the C example below)
    • Best – application designer documents situations that are impossible
    • Formal and dynamic verification methods check whether assertions hold

* http://dictionary.reference.com
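
As a minimal plain-C illustration of the assertion style (not ReCAP-specific): the designer documents an "impossible" situation, and the runtime halts with file and line if it ever occurs.

#include <assert.h>
#include <stdio.h>

/* Return the index of the first nonzero element; by design it is
   "impossible" to run past the end of the array. */
static int first_nonzero(const int *x, int n)
{
    int i = 0;
    while (x[i] == 0) {
        i++;
        assert(i < n);  /* halts with file:line if the invariant breaks */
    }
    return i;
}

int main(void)
{
    int data[5] = {0, 0, 3, 0, 1};
    printf("first nonzero at index %d\n", first_nonzero(data, 5));
    return 0;
}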
Background – Performance Analysis

• Performance analysis – investigate program behavior using information gathered during execution *
  – Aids designer in locating and remedying application bottlenecks, reducing guesswork in optimization
  – Replaces tedious, error-prone manual analysis methods such as timing routines and printf statements (see the sketch below)

[Figure: unassisted optimization flow vs. performance-analysis optimization flow — in both, the application is instrumented, measured, and optimized against the target system; the unassisted flow relies on manual analysis of raw performance data, while the performance-analysis flow analyzes the data automatically and presents visualizations and potential bottlenecks]

* http://en.wikipedia.org/wiki/Performance_analysis
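
For contrast, a sketch of the manual timing-routine style that such tools replace (ordinary POSIX C; the region being timed is a placeholder):

#include <stdio.h>
#include <sys/time.h>

/* Hand-rolled wall-clock timing around a region of interest. */
static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    double t0 = now_sec();
    /* ... region of interest, e.g., an FPGA transfer or compute loop ... */
    double t1 = now_sec();
    printf("region took %.6f s\n", t1 - t0);
    return 0;
}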
Project Overview

• RC systems and applications are even more complex than in HPC
  – Heterogeneous components
  – Hierarchy of parallelism among components
  – Lack of visibility inside RC devices
• Optimizing applications is crucial for effective use of these systems
  – Debug and performance tools are relied on heavily in HPC to productively verify and optimize applications
  – Debug and performance tools are even more essential in RC due to additional system and application complexity, and yet research is lacking
• Objective: expand the notion and benefits of software debugging and performance analysis into the software-hardware realm of RC

[Figure: RC system hierarchy — a system of networked machines; each machine a set of nodes joined by an interconnect; each node holding CPUs, main memory, and a primary interconnect to FPGA boards; each board holding FPGAs, on-board memory, a board interface, and a secondary interconnect; each FPGA/device containing a top-level app, app cores, embedded CPU(s), and a device interface; legend distinguishes traditional processor communication from FPGA communication]
ReCAP Framework

• Reconfigurable-computing application performance (ReCAP) framework
  – Adds assertion-based verification and performance analysis capabilities to the FPGA portion of an application
  – Builds upon existing assertions in HLLs AND Parallel Performance Wizard (PPW) for performance analysis of the CPU portion of an application
• Three main components
  – HDL Instrumenter
  – Hardware Measurement Module (HMM)
  – RC-enhanced version of PPW (PPW+RC)
    • Backend (instrumentation and measurement)
    • Frontend (analysis and visualization)

[Figure: ReCAP organization — HDL instrumentation & measurement (HDL Instrumenter, HMM), CPU instrumentation & measurement (PPW+RC backend), and CPU-FPGA analysis & visualization (PPW+RC frontend)]
ReCAP: HDL Instrumenter

• Modifies HDL design files to monitor application data at runtime
  – User can define "events" that are of interest
    • e.g., buffer full, cycles spent in a state
  – User can define "monitors" that determine what to record when an event occurs
    • e.g., summary statistics, full trace
  – User can enable a number of automatic analyses
    • e.g., decision coverage, assertions, profiling, automatic bottleneck detection

[Figure: instrumentation process — the HDL Instrumenter takes the original RC application (components A-E beneath a top-level component in the user's HDL) and adds the Hardware Measurement Module, measurement and interface logic, FPGA access methods (wrapper), and modified ports/interfaces; on the CPU side, the PPW+RC backend adds a module query thread, lock, and data manager to the user application (HLL)]
ReCAP: Hardware Measurement Module

• Hardware necessary to record, store, and retrieve data at runtime
  – Profiling, tracing, and sampling
  – Cycle counter and other module statistics (trace records dropped, counter overflow, etc.)
  – Buffers for storing trace data
  – Module control for performance data retrieval and miscellaneous control (e.g., clear and stop)

[Figure: HMM internals — signals tapped from the application's combinational or sequential logic feed signal-analysis blocks whose triggers drive profile counters (profile data 0 through P-1) and trace collectors; sampled trace records are stored with cycle-counter timestamps in collector memory (input and output bitwidths configurable); module statistics accompany the trace data, and module control services performance-data requests from the CPU side]
ReCAP: PPW+RC

• PPW+RC backend adds a thread to the software to query the HMM at runtime (sketched below)
  – Requires a lock (since we now have shared FPGA access)
  – Handles FPGA performance data storage and migration to PPW data structures
  – Monitors FPGA API calls in addition to normal PPW software performance monitoring
• PPW+RC frontend analyzes and presents measured data for CPUs / FPGAs
  – Table and chart views across multiple experiments
  – Export to Jumpshot for timeline views

[Figure: the instrumented CPU/FPGA application as before, highlighting the PPW+RC backend (module query thread, lock, data manager) on the CPU side and the PPW+RC front-end for analysis and visualization]
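
A rough sketch of how such a module-query thread could look in POSIX C; hmm_read_counters() and ppw_record() are hypothetical stand-ins for the HMM access methods and PPW data path, not ReCAP's actual API:

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* The FPGA is now shared between the application and the query thread,
   so every access goes through one lock. */
static pthread_mutex_t fpga_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int app_running = 1;

extern int  hmm_read_counters(uint64_t *buf, int max);  /* hypothetical */
extern void ppw_record(const uint64_t *buf, int n);     /* hypothetical */

static void *module_query_thread(void *arg)
{
    uint64_t buf[64];
    (void)arg;
    while (app_running) {
        pthread_mutex_lock(&fpga_lock);
        int n = hmm_read_counters(buf, 64);  /* drain HMM performance data */
        pthread_mutex_unlock(&fpga_lock);
        ppw_record(buf, n);                  /* migrate into PPW structures */
        usleep(10000);                       /* polling period (tool-chosen) */
    }
    return NULL;
}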
ReCAP Tool-Flow

• HDL source files are instrumented, then synthesized/implemented normally
• HLL source files are instrumented during compilation
  – Use ppwcc instead of gcc, or ppwupcc instead of upcc
• Program is executed normally on the system
• Performance data file produced can be viewed and analyzed with PPW+RC

[Figure: tool flow — HDL source passes through the HDL Instrumenter and then synthesis & implementation, yielding an instrumented FPGA binary/configuration; HLL source is compiled with PPW+RC into an instrumented CPU executable; both execute with PPW+RC to produce program results and performance data files, which are analyzed and visualized with PPW+RC]
Common RC Bottleneck Detection

• Automatically search for common RC bottlenecks
  – Requires some information from user
  – Reduces time and knowledge needed to find bottlenecks
  – We attempt to minimize the amount of information requested
• Currently produces text file containing
  – All detected bottlenecks
  – Potential optimization strategies for each
  – Peak/ideal speedup if bottleneck resolved (see the sketch below)

[Figure: RC bottleneck taxonomy — traditional HPC bottleneck categories on the software side (late sender/receiver, contention, load imbalance, barrier synchronization, excessive communication time, excessive idle time) are extended with software/hardware interface bottlenecks (inefficient transfer type: control vs. polling; inefficient transfer size: infrequent large transfers vs. frequent small transfers; sub-par channel efficiency; buffering full/empty; clear/flush) and hardware bottlenecks (excessive HW stalling, excessive overhead time, sub-par computation time, pipeline-stage stall, control); legend distinguishes HLL-based from HDL-based bottlenecks]
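
As a point of intuition, a "peak/ideal speedup if resolved" figure behaves like an Amdahl-style bound; the function below is an assumed formulation for illustration, not the tool's documented formula:

/* Best case: the bottleneck's time is removed entirely while the rest
   of the run is unchanged. E.g., a bottleneck consuming 8 s of a 10 s
   run bounds the gain at 10 / 2 = 5x. */
double peak_speedup(double t_total, double t_bottleneck)
{
    return t_total / (t_total - t_bottleneck);
}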
Architecture-Aware Visualization

• Visualization within application & system context, with integrated common-bottleneck data
  – Must be scalable to large systems
  – Allow user to experiment with different optimization scenarios to see what provides best performance

[Figure: system-level view — network and CPU-interconnect links between CPUs 0-5 and FPGAs 0-2 annotated with measured bandwidth and utilization (e.g., 1.98 GB/s at 99%, 691 MB/s at 67%, 2.50 GB/s at 100%); node-level view — CPU cores, HyperTransport links (HT0/HT1), PCI-e, SRAM, stream buffers, and FPGA kernels (K1,1 through K3,2, plus GCmp/GTx blocks) annotated with percentage activity over time; legend: idle, overhead, work, external send, external receive, idle external send, idle external receive]
HLL Performance Analysis

• High-level languages
  – Impulse-C and Carte C
    • Convert a subset of C to HDL
    • Employ DMA and streaming communication
  – Speedup gained by
    • Pipelining loops
    • Library functions
    • Replicated functions
  – Impulse C
    • Pipelining of loops, determined by pragmas in code (see the sketch after this list)
  – Carte (SRC)
    • Pipelining of loops: automatic pipelining of the innermost loop
    • Library functions: called as C functions, HDL coded
• Automated instrumentation
  – Computation
    • State machines (used for preserving execution order in C functions and to control pipelines)
    • Control and status signals used by library functions
  – Communication
    • Control and status signals (streaming communication, DMA transfers)
• User-assisted instrumentation
  – Application-specific variables: monitor meaningful values selected by user
• Measurement
  – Employ HMM from HDL framework
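
A minimal Impulse C-style hardware process showing pragma-driven loop pipelining; the pragma placement follows Impulse C's CO PIPELINE directive, and the stream and function names here are illustrative:

#include "co.h"

/* Hardware process: running sum over an input stream; the pragma asks
   the Impulse C compiler to pipeline the loop body. */
void accum_hw(co_stream in, co_stream out)
{
    co_int32 v, sum = 0;
    co_stream_open(in, O_RDONLY, INT_TYPE(32));
    co_stream_open(out, O_WRONLY, INT_TYPE(32));
    while (co_stream_read(in, &v, sizeof(co_int32)) == co_err_none) {
#pragma CO PIPELINE
        sum = sum + v;
        co_stream_write(out, &sum, sizeof(co_int32));
    }
    co_stream_close(in);
    co_stream_close(out);
}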
HLL Instrumentation & Measurement

[Figure: HLL tool flow — instrumentation is added to the application's C source; the software path is compiled for the CPU(s) together with an HLL API wrapper and a measurement extraction process/thread, while the hardware path is projected to HDL, where the added instrumentation is mapped to HDL and implemented for the FPGA(s) along with an HLL hardware wrapper, instrumented signals, and the Hardware Measurement Module; a recorded software-hardware mapping and loopback connections tie the instrumented hardware back to the measurement software in the finished design]
HLL Analysis & Visualizations

• Bottleneck detection (currently user-assisted)
  – Load-balancing of replicated functions
  – Monitoring for pipeline stalls
  – Detecting streaming communication stalls
  – Finding shared-memory contention
• Integration with performance analysis tool
  – Profiling data
  – Pie charts showing time utilization
  – Tree view of CPU and FPGA timing

[Figure: mapping between C source (main MD loop: input stream, pipeline transition, output stream) and the generated HDL state machine (states b4s0-b4s4 and b6s0-b6s1)]
HLL Assertion Debugging

• Based on the ANSI C assert function

#include <assert.h>
int num, i, x[10];
while (num == 0) {
    num = x[i++];
    assert(i < 10);
}

• Failure will halt the application, displaying an error
    test.c:7: main: Assertion `i<10' failed.
• Assertions can be disabled via #define NDEBUG
• Most HLLs do not synthesize standard C library functions on the FPGA
  – Convert the assertion function to an if statement (renamed via a Perl script)
  – Send line number of failed assertions on the FPGA to the CPU (sketched below)
    • Communication stream created and routed between hardware functions with assertion statements and a software function
    • Perform failure actions via a software function (added via Perl script)
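
A sketch of what the script's rewrite might produce; assert_stream and assert_sw() are illustrative names, not the actual generated code:

/* Hardware side, before: assert(i < 10);
   after the Perl rewrite: an if that streams the failing line number. */
if (!(i < 10)) {
    co_int32 line = 31;  /* __LINE__ of the original assertion */
    co_stream_write(assert_stream, &line, sizeof(co_int32));
}

/* Software side, added by the script: perform the failure actions. */
void assert_sw(co_stream assert_stream)
{
    co_int32 line;
    co_stream_open(assert_stream, O_RDONLY, INT_TYPE(32));
    if (co_stream_read(assert_stream, &line, sizeof(co_int32)) == co_err_none) {
        fprintf(stderr, "memtest_hw.c:%d: Assertion failed.\n", line);
        exit(1);
    }
}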
Case Study: N-Queens

• Overview
  – Find number of distinct ways n queens can be placed on an n×n board without attacking each other, via a backtracking algorithm (see the C sketch below)
  – Multi-CPU/FPGA application (UPC/VHDL)
• Overhead
  – <= 6% area (sixteen 32-bit profile counters for state machines)
  – <= 2% memory (96-bit-wide trace buffer for core finish time)
  – Negligible frequency degradation observed
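
For reference, a compact software version of the backtracking search (a plain-C illustration, not the study's UPC/VHDL implementation):

#include <stdio.h>

/* Count distinct n-queens placements by backtracking with bitmasks for
   occupied columns and diagonals. */
static long place(int n, int row, unsigned cols, unsigned d1, unsigned d2)
{
    if (row == n) return 1;
    long count = 0;
    unsigned free = ~(cols | d1 | d2) & ((1u << n) - 1);
    while (free) {
        unsigned bit = free & -free;  /* lowest available column */
        free ^= bit;
        count += place(n, row + 1, cols | bit,
                       (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return count;
}

int main(void)
{
    int n = 8;
    printf("%d-queens solutions: %ld\n", n, place(n, 0, 0, 0, 0));
    return 0;
}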
[Chart: N-Queens results for board size of 16 — application speedup over a single 3.2 GHz Xeon for an 8-node 3.2 GHz Xeon cluster, an 8-node H101 system, and an optimized 8-node H101 system (roughly 7.9x, 33.9x, and 37.1x)]

[Table: instrumentation overhead, original vs. instrumented, on the XD1 and Xeon-H101 — slices: 9,041 vs. 9,901 (XD1) and 23,086 vs. 26,218 (H101), up to +4% relative to device; Block RAM: 11, 21, and 22 blocks (~0% relative to device); frequency: 124, 123, and 101 MHz (-1% to 0% relative to original); communication: under 1 to tens of KB/s]
Case study: 2D-PDF estimation*

• Application
  – Estimate a 2D probability density function (i.e., a nearly smooth histogram) given a set of (x, y) coordinate data; a generic sketch follows below
  – 3.2GHz Xeon, Virtex-4 LX100 FPGA, PCI-X
• Results
  – Automatic bottleneck detection results showed problematic communication and control
  – Based on tool suggestions, buffer sizes were increased and control logic restructured within a day, providing up to a 5.5x speedup for the 10-core design
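
For illustration, one generic way to form such an estimate in C (a Parzen-window sketch over a unit-square grid; not the study's actual kernel, which is credited in the footnote below):

#include <math.h>

#define GRID 64  /* output resolution; illustrative */

/* Kernel density estimate of a 2D PDF from n (x, y) samples in [0,1)^2
   with Gaussian bandwidth h: a "nearly smooth histogram". */
void pdf2d(const double *x, const double *y, int n,
           double h, double est[GRID][GRID])
{
    for (int i = 0; i < GRID; i++) {
        for (int j = 0; j < GRID; j++) {
            double gx = (i + 0.5) / GRID, gy = (j + 0.5) / GRID;
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                double dx = (gx - x[k]) / h, dy = (gy - y[k]) / h;
                sum += exp(-0.5 * (dx * dx + dy * dy));
            }
            est[i][j] = sum / (n * 2.0 * M_PI * h * h);  /* normalize */
        }
    }
}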


[Chart: execution time (seconds) vs. number of FPGA cores (1-10) for the original and improved designs, decomposed into software functions, FPGA write, and FPGA read time; times fall from roughly 31.0 s down to 2.5 s]

* 2D-PDF code written by Karthik Nagarajan
Case Study: Molecular Dynamics

• Molecular dynamics
  – Simulates interaction of molecules over discrete time steps
  – Impulse C version 2.2
• XD1000 platform
  – Dual-processor motherboard
  – Opteron 2.2GHz
  – Stratix-II EP2S180 XD1000 module
• MD communication architecture
  – Chunks of MD data read from SRAM
  – Data streamed to multiple pipelined MD kernels
  – Results stored back to SRAM
• Stream buffer
  – Increased buffer size by 32 times (see the sketch below)
  – Speedup change
    • 6.2x vs. serial baseline before enhancements
    • 7.8x vs. serial baseline after enhancements
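
In Impulse C, a stream's buffer depth is fixed when the stream is created, so a 32x enlargement amounts to a one-line change of the sort sketched here (names and depths are illustrative, not the study's code):

#include "co.h"

/* Configuration function: create the kernel input stream.
   Raising the depth (e.g., 4 -> 128 words) is the kind of 32x
   buffer-size change described above. */
void config_md(void *arg)
{
    co_stream md_in = co_stream_create("md_in", INT_TYPE(32), 128);
    /* ... create CPU and FPGA processes and attach the stream ... */
}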
[Figure: MD communication architecture — SRAM storage feeds input memory access streams to a distributor, which fans out input streams 1-16 to 16 pipelined MD kernels; output streams 1-16 converge at a collector whose output memory access writes results back to SRAM]

[Chart: MD kernel runtime — FPGA runtime (seconds, 0 to 0.8) vs. stream buffer size (128 to 4096 bytes), broken into pipeline, output stream, and other time]
HLL Debug Case Study

• Impulse C performs a 32-bit comparison with 64-bit values

Impulse C code:

void Logcontrol(...)
{
    ...
    co_int64 big, test, update;
    small_1 = 321;
    small_2 = 123;
    big = 5000000000;    /* 33 bits: 100101010000001011111001000000000 */
    test = 1073741824;   /* 2^30:    1000000000000000000000000000000 */
    IF_SIM(printf("HW big:%lld\n", big);)
    IF_SIM(printf("HW test:%lld\n", test);)
    i = 0;
    while (big < test)
    {
        co_stream_write(small_stream, &small_1, sizeof(co_int32));
        IF_SIM(printf("HW if passed\n");)
        small_1 = big & 4294967295;
        small_2 = big >> 32;
        i++;
        assert(i < 10);
    }
}

Generated VHDL compares only the low 32 bits — 705032704 (the low word of 5000000000) vs. 1073741824 — so the hardware condition is true even though the 64-bit comparison is false:

ni192_suif_tmp <= … & cmp_less_s(r_big(31 downto 0), r_test(31 downto 0));
HLL Debug Case Study (cont)

• Results
  – In simulation, the loop does not execute and the assertion is never called
  – In hardware, the loop executes infinitely
  – In hardware with assert, the loop executes and the assertion fails
• Overhead
  – Streaming overhead generated per process
  – Additional FPGA resource usage < 0.1%

EP2S180                      Original          Modified          Difference
Logic Used (143,520)         13,927 (9.71%)    13,974 (9.74%)    +47 (+0.03%)
Comb. ALUT (143,520)         7,930 (5.53%)     8,073 (5.63%)     +143 (+0.10%)
Registers (143,520)          10,013 (6.98%)    10,063 (7.01%)    +50 (+0.03%)
Block RAM (9,383,040 bits)   222,912 (2.37%)   223,488 (2.38%)   +576 (+0.01%)
Frequency (MHz)              143.68            142.03            -1.65 (-1.15%)

Simulation:
C:\hwr\test4-assert>memtest.exe
Small stream Open
HW big:5000000000
HW test:1073741824
Big stream Open
Small lower read:321
Small upper read:123
…

Hardware execution:
[root@xd1000-3 test4]# ./run_sw
Small stream Open
Big stream Open
memtest_hw.c:31: Assertion 'i<10' failed.
Small lower read:705032704
Small upper read:1
…
Conclusions

• Debug and performance analysis of RC applications is critical for improving productivity in obtaining a correctly functioning, well-performing application
• ReCAP framework/tool aids designers with verification and performance analysis
  – Records and monitors application data on CPU and FPGA at runtime while minimizing overhead and user effort
  – Can perform a number of automated analyses, including common bottleneck detection, decision coverage, and assertion monitoring
  – Provides analysis and presentation of CPU/FPGA debug and performance data
• ReCAP represents the first RC application performance framework and tool (per extensive literature review)
  – Debug capabilities are also not currently found in other tools