09 Accelerators ppt


Digital Design:
An Embedded Systems
Approach Using Verilog
Chapter 9
Accelerators
Portions of this work are from the book, Digital Design: An Embedded
Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan
Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
Verilog
Performance and Parallelism

A processor core performs steps in sequence
  Performance limited by the instruction rate
Accelerating performance
  Perform steps in parallel
  Takes less time overall to complete an operation
Instruction-level parallelism
  Within a processor core
  Pipelining, multiple-issue
Accelerators
  Custom hardware for parallel operations
Digital Design — Chapter 9 — Accelerators
Achievable Parallelism


How many steps can be performed at once?
Regularly structured data
  Independent processing steps
Examples
  Video and image pixel processing
  Audio or sensor signal processing
Constrained by data dependencies
  Operations that depend on results of previous steps
Algorithm Kernels

Algorithm: specification of the required processing steps
  Often expressed in a programming language
Kernel: the part that involves the most intensive, repetitive processing
  “10% of operations take 90% of the time”
Accelerating a kernel with parallel hardware gives the best payback
Amdahl’s Law

Time for an algorithm is t
  Fraction f is spent on a kernel
    $t = ft + (1 - f)t$
  Accelerator speeds up kernel by a factor s
    $t' = \dfrac{ft}{s} + (1 - f)t$
Overall speedup factor s'
  $s' = \dfrac{t}{t'} = \dfrac{1}{\dfrac{f}{s} + (1 - f)}$
  For large f, $s' \to s$
  For small f, $s' \to 1$
Amdahl’s Law Example

An algorithm with two kernels
  Kernel 1: 80% of time, can be sped up 10 times
  Kernel 2: 15% of time, can be sped up 100 times
Which speedup gives best overall improvement?
  For kernel 1: $s' = \dfrac{1}{\dfrac{0.8}{10} + (1 - 0.8)} = \dfrac{1}{0.08 + 0.2} = 3.57$
  For kernel 2: $s' = \dfrac{1}{\dfrac{0.15}{100} + (1 - 0.15)} = \dfrac{1}{0.0015 + 0.85} = 1.17$
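A short worked extension of this example, not taken from the original slides: the same formula gives the upper bound for each kernel as its speedup factor grows without limit, and the result of accelerating both kernels together. It assumes the remaining 5% of the time is unaffected by either accelerator.

```latex
\[
\lim_{s \to \infty} s'_1 = \frac{1}{1 - 0.8} = 5,
\qquad
\lim_{s \to \infty} s'_2 = \frac{1}{1 - 0.15} \approx 1.18
\]
\[
s'_{1+2} = \frac{1}{\dfrac{0.8}{10} + \dfrac{0.15}{100} + 0.05}
         = \frac{1}{0.1315} \approx 7.60
\]
```

Even an infinitely fast accelerator for kernel 2 cannot match the 10-times accelerator for kernel 1, because kernel 1 accounts for far more of the total time.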
Parallel Architectures

An architecture for an accelerator specifies
  Processing blocks
  Data flow between them
Parallelism through replication
  Multiple identical blocks operating on different data elements
  Works well when elements can be processed independently
Parallel Architectures

Parallelism through pipelining
  Break a computation into steps, perform them in assembly-line fashion
  Latency (time to complete a single operation) is not reduced
  Throughput (rate of completion of operations) is increased
    Ideally by a factor equal to the number of pipeline stages
[Figure: data in → step 1 → step 2 → step 3 → data out]
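To make the latency/throughput distinction concrete, here is a minimal sketch of a three-stage pipeline. The module name pipe3, the computation ((a + b) × c) / 2, and all port and signal names are assumptions chosen for illustration; they are not from the original slides.

```verilog
// Hypothetical three-stage pipeline computing y = ((a + b) * c) >> 1.
// Each stage's result is held in a register, so all three stages
// work on different data items during the same clock cycle.
module pipe3 #(parameter W = 8) (
  input  wire         clk,
  input  wire [W-1:0] a, b, c,
  output reg  [2*W:0] y
);
  reg [W:0]   sum_s1;   // stage 1 result: a + b (one extra bit for the carry)
  reg [W-1:0] c_s1;     // c delayed one cycle to stay aligned with sum_s1
  reg [2*W:0] prod_s2;  // stage 2 result: (a + b) * c

  always @(posedge clk) begin
    // Stage 1: add
    sum_s1  <= a + b;
    c_s1    <= c;
    // Stage 2: multiply
    prod_s2 <= sum_s1 * c_s1;
    // Stage 3: scale
    y       <= prod_s2 >> 1;
  end
endmodule
```

Once the pipeline is full, a new result appears on y every clock cycle, but each individual result appears three clock edges after its operands were applied.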
Direct Memory Access (DMA)

Input/output data for accelerators must be transferred at high speed
  Using the processor would be too slow
Direct memory access
  I/O controller and accelerator transfer data to and from memory autonomously
  Program supplies starting address and length
Bus Arbitration

Bus masters take turns to use bus to access slaves
  Controlled by a bus arbiter
Arbitration policies
  Priority, round-robin, …
[Figure: processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to memory]
Block-Processing Accelerator

Data arranged in regular groups of contiguous memory locations
  Accelerator works block by block
  E.g., images in blocks of 8 × 8 × 16-bit pixels
Datapath comprises
  Memory access: address generation, counters
  Computation section
  Control section: finite-state machine(s)
Stream-Processing Accelerator

Streams of data from an input source
  E.g., high-speed sensors
Digital signal processing (DSP)
  Analog sensor signal converted to stream of digital sample values
  Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)
Processor/Accelerator Interface

Embedded software controls an accelerator
  Providing control parameters
  Synchronizing operations
Input/output registers and interrupts
  Interact with the control sequencer
Case Study: Edge Detection


Illustration of accelerator design
Edge detection in video processing
  Identify where image intensity changes abruptly
  Typically at the boundary of objects
  First step in identifying objects in a scene
Application areas
  Video surveillance, computer vision, …
For this case study
  Monochrome images of 640 × 480 × 8-bit pixels
  Stored row-by-row in memory
  Pixel values: 0 (black) – 255 (white)
Sobel Edge Detection

Compute derivatives of intensity in x and y directions
Look for minima and maxima (where intensity changes most rapidly)
The Sobel Algorithm

Use convolution to approximate partial derivatives Dx and Dy at each position
  Weighted sum of value of a pixel and its eight nearest neighbors
  Coefficients represented using a 3×3 convolution mask
Sobel masks for x and y derivatives
  $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \quad D_x(i, j) = O(i, j) \ast G_x$
  $G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}, \quad D_y(i, j) = O(i, j) \ast G_y$
The Sobel Algorithm

Combine partial derivatives
  $|D| = \sqrt{D_x^2 + D_y^2}$
Since we just want maxima and minima in magnitude, approximate as:
  $|D| = |D_x| + |D_y|$
Edge pixels don’t have eight neighbors
  Skip computation of |D| for edges
  Just set them to 0 using software
The Algorithm in Pseudocode
for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
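For a single output pixel, the two inner loops of the pseudocode above reduce to a fixed set of additions, subtractions, and shifts. The following is a minimal combinational sketch of that per-pixel computation, which could serve as a reference model when checking results. The module name sobel_window, the port names, and the helper function zext are hypothetical and do not appear in the original slides.

```verilog
// Sketch only: names are hypothetical, not from the slides.
// pRC is the pixel at row offset R-1 and column offset C-1 from the center.
// The center pixel is omitted because its coefficient is 0 in both masks.
module sobel_window (
  input  wire [7:0]  p00, p01, p02,  // previous row: left, middle, right
  input  wire [7:0]  p10,      p12,  // current row:  left, right
  input  wire [7:0]  p20, p21, p22,  // next row:     left, middle, right
  output wire [10:0] mag             // |Dx| + |Dy|, range 0 to 2040
);
  // Zero-extend an 8-bit pixel to a 12-bit signed value
  function signed [11:0] zext (input [7:0] p);
    zext = {4'b0000, p};
  endfunction

  // Weighted sums using the Gx and Gy masks (multiply by 2 as a shift)
  wire signed [11:0] dx = - zext(p00) + zext(p02)
                          - (zext(p10) <<< 1) + (zext(p12) <<< 1)
                          - zext(p20) + zext(p22);
  wire signed [11:0] dy =   zext(p00) + (zext(p01) <<< 1) + zext(p02)
                          - zext(p20) - (zext(p21) <<< 1) - zext(p22);

  // Absolute values fit in 11 bits (maximum 1020 each)
  wire [10:0] adx = (dx < 0) ? -dx : dx;
  wire [10:0] ady = (dy < 0) ? -dy : dy;

  assign mag = adx + ady;
endmodule
```

The sketch is purely combinational, so it says nothing about pipelining or memory access; it only captures the arithmetic for one 3×3 neighborhood.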
Data Formats and Rates

Pixel values: 0 to 255 (8 bits)
  Coefficients are 0, ±1 and ±2
  Partial products: –510 to +510 (10 bits)
  Dx and Dy: –1020 to +1020 (11 bits)
  |D|: 0 to 2040 (11 bits)
  Final pixel value: scale back to 8 bits
Video rate: 30 frames/sec
  640 × 480 = 307,200 pixels
  307,200 × 30 = 9,216,000 ≈ 10 million pixels/sec
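The bit widths listed for the data formats follow from worst-case bounds, which can be sketched as below. The symbol g for a single mask coefficient is introduced here only for illustration. The last line assumes the scaling to 8 bits is done by dropping the three low-order bits, which is what the D[10:3] selection in the pixel datapath code does.

```latex
\begin{align*}
|O \cdot g| &\le 255 \times 2 = 510 < 2^{9}
  && \text{10 bits, signed} \\
|D_x|,\ |D_y| &\le 255 \times (1 + 2 + 1) = 1020 < 2^{10}
  && \text{11 bits, signed} \\
|D| = |D_x| + |D_y| &\le 2 \times 1020 = 2040 < 2^{11}
  && \text{11 bits, unsigned} \\
\lfloor 2040 / 2^{3} \rfloor &= 255 < 2^{8}
  && \text{8 bits after dropping 3 low-order bits}
\end{align*}
```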
Data Dependencies


Pixels can be computed independently
For each pixel:
System Architecture

Data dependencies suggest a pipeline
  Coefficient multiplies are simple shift/negate, so merge with adder stage
Memory Bandwidth

Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
  Memory is 32 bits wide, byte addressable
  Bandwidth = 50M operations/sec
Camera produces 10Mpixels/sec
  Accelerator needs to process at this rate
  (8 reads + 1 write) × 10Mpixel/sec = 90M operations/sec
  Greater than memory bandwidth
Memory Bandwidth

Read 4 pixels at once from each of previous, current, and next rows
  Store in accelerator to compute multiple derivative image pixels
Produce derivative pixels row-by-row, left-to-right
  Read 3 × 32-bit words for every 4th derivative pixel computed
  Write 4 pixels at a time
(3 reads + 1 write) / 4 × 10Mpixel/sec = 10M operations/sec = 20% of available memory bandwidth
Sobel Accelerator Architecture
Accelerator Sequence

Steady state
  Write 4 result pixels
  Read 4 pixels for previous, current, next rows
  Compute for 4 cycles
  Repeat…
Start of row
  Omit writes until pipeline full
End of row
  Omit reads to drain pipeline
Memory Operation Timing

Steady state
Pixel Datapath
// Computation datapath signals
reg        [31:0] prev_row, curr_row, next_row;
reg         [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg         [7:0] abs_D;
reg        [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if      (prev_row_load) prev_row       <= dat_i;
  else if (shift_en)      prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register
function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;
endfunction
...
Pixel Datapath
always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    D = abs(Dx) + abs(Dy);
    abs_D <= D[10:3];
    Dx <= -  $signed({3'b000, O[-1][-1]})
          +  $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          -  $signed({3'b000, O[+1][-1]})
          +  $signed({3'b000, O[+1][+1]});
    Dy <=    $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          +  $signed({3'b000, O[-1][+1]})
          -  $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          -  $signed({3'b000, O[+1][+1]});
    ...
Pixel Datapath
    O[-1][-1] <= O[-1][ 0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[ 0][ 0];
    O[ 0][ 0] <= O[ 0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end
always @(posedge clk_i) // Result row register
  if (shift_en) result_row <= {result_row[23:0], abs_D};
Address Generation

Given an image in memory at base address B
  Address for pixel in row r, column c is B + r × 640 + c
  Base address (B) is fixed
  Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
Use word-aligned addresses
  Two least-significant bits always 00
  Increment word address by 1
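A worked instance of the address formula may help. The base address B = 0x008000 is the value the testbench in this deck writes to the O_base register; the choice of row r = 2 and column c = 8 is arbitrary and made only for illustration.

```latex
\begin{align*}
\text{byte address} &= B + r \times 640 + c
   = \texttt{0x008000} + 2 \times 640 + 8 = \texttt{0x008508} \\
\text{word address} &= \texttt{0x008508} / 4 = \texttt{0x002142} \\
\text{next group of 4 pixels} &: \texttt{0x002142} + 1 = \texttt{0x002143}
   \quad (\text{byte address } \texttt{0x00850C})
\end{align*}
```

In word addresses, the same column in the next image row is 640 / 4 = 160 words further on, and two rows down is 1280 / 4 = 320 words further on.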
Address Generation
Address Generation
always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];
always @(posedge clk_i) // O address offset counter
  if      (offset_reset)    O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;
always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];
always @(posedge clk_i) // D address offset counter
  if      (offset_reset)    D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...
Address Generation
assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr      = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                                     D_addr;
assign adr_o[1:0]  = 2'b00;
Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.
Slave Bus Interface
assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;
always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];
always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // Reading the status register clears done (and so the interrupt)
    done <= 1'b0;
assign int_req = int_en && done;
...
Slave Bus Interface
always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;
// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0;         // other registers read as 0
  else
    dat_o = result_row;      // for master write
Control Sequencing

Use a finite-state machine
  Counters keep track of rows (0 to 477) and columns (0 to 159)
  See textbook for details of FSM output functions
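The row and column counters mentioned above could be sketched as follows. This is only an illustration under assumptions: the module name sobel_counters and all port names are hypothetical, and the actual sequencing signals are defined by the FSM described in the textbook. The counter ranges (columns 0 to 159, rows 0 to 477) are the ones stated on the slide.

```verilog
// Hypothetical counter sketch; all names are assumptions, not from the slides.
module sobel_counters (
  input  wire       clk,
  input  wire       rst,       // synchronous reset, e.g. at the start of an image
  input  wire       col_en,    // advance by one group of 4 pixels
  output reg  [7:0] col,       // 0 to 159 (640 pixels / 4 pixels per word)
  output reg  [8:0] row,       // 0 to 477 (480 rows minus top and bottom edge rows)
  output wire       last_col,  // high while col is at its final value
  output wire       last_row   // high while row is at its final value
);
  assign last_col = (col == 8'd159);
  assign last_row = (row == 9'd477);

  always @(posedge clk)
    if (rst) begin
      col <= 8'd0;
      row <= 9'd0;
    end
    else if (col_en) begin
      if (last_col) begin
        // End of a row: wrap the column count and step to the next row
        col <= 8'd0;
        row <= last_row ? 9'd0 : row + 9'd1;
      end
      else
        col <= col + 8'd1;
    end
endmodule
```

An FSM could use last_col to decide when to drain the pipeline at the end of a row and last_row together with last_col to decide when the whole image is done.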
State Transition Diagram
Accelerator Verification

Simulation-based verification of each section of the accelerator
  Slave bus operations
  Computation sequencing
  Master bus operations
  Address generation
  Pixel computation
Testbench including the accelerator
  Bus functional processor model
  Simplified memory and bus arbiter models
Sobel Verification Testbench
[Figure: testbench block diagram with Processor BFM, Arbiter, Sobel Accelerator, and Memory Model joined by a multiplexed bus (muxes and connections)]
Processor Bus Functional Model
initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations
  ...
Processor Bus Functional Model
  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop;
    end
  end
end
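The bus_write task called by the processor bus-functional model is not shown in this transcript. A minimal sketch of what such a task might look like is given below, assuming a single Wishbone-style write that holds cyc, stb, and we asserted until the slave acknowledges. The wrapper module bus_write_sketch, the stand-in slave, and the address value used in the demonstration are all hypothetical.

```verilog
`timescale 1ns/1ns
// Sketch only: a guess at the shape of bus_write, wrapped in a
// self-contained module so that it can be simulated on its own.
module bus_write_sketch;
  reg         clk = 1'b0;
  reg  [22:0] cpu_adr_o;
  reg  [3:0]  cpu_sel_o;
  reg  [31:0] cpu_dat_o;
  reg         cpu_cyc_o = 1'b0, cpu_stb_o = 1'b0, cpu_we_o = 1'b0;
  reg         cpu_ack_i = 1'b0;

  always #5 clk = ~clk;  // 100 MHz clock

  // Stand-in slave: acknowledge each strobe for one clock cycle
  always @(posedge clk)
    cpu_ack_i <= cpu_cyc_o && cpu_stb_o && !cpu_ack_i;

  // Drive one write cycle and wait for the acknowledgment
  task bus_write (input [22:0] adr, input [31:0] dat);
    begin
      cpu_adr_o <= adr;  cpu_sel_o <= 4'b1111; cpu_dat_o <= dat;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1;    cpu_we_o  <= 1'b1;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0;    cpu_we_o  <= 1'b0;
    end
  endtask

  initial begin
    @(posedge clk);
    // Placeholder register address; the real value depends on sobel_reg_base
    bus_write(23'h000008, 32'h00008000);
    $display("write acknowledged at time %0t", $time);
    $finish;
  end
endmodule
```

The wait loop has the same form as the one the bus-functional model uses when it polls the status register, so the read and write paths handshake in the same way.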
Memory Bus Functional Model
always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;
  @(posedge clk);
end
Bus Arbiter

Uses sobel_cyc_o and cpu_cyc_o as request inputs
  If both request at the same time, give accelerator priority
Mealy FSM
Bus Arbiter
always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else     arbiter_current_state <= arbiter_next_state;
always @* // Arbiter logic
  case (arbiter_current_state)
    sobel: if (sobel_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
           else if (!sobel_cyc_o && cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end
           else begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
    cpu:   if (cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end
           else if (sobel_cyc_o && !cpu_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
           else begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
  endcase
Simulation Results

See waveforms in textbook
  Demonstrates sequencing and address generation
But what about…
  Data values computed correctly
  Interactions between processor and accelerator
Need to use more sophisticated verification techniques
  Due to complexity of the design
Summary

Accelerators boost performance using parallel hardware
  Replication, pipelining, …
Amdahl’s Law
  Best payback from accelerating a kernel
DMA avoids processor overhead
Verification requires advanced techniques