Digital Design:
An Embedded Systems
Approach Using Verilog
Chapter 9
Accelerators
Portions of this work are from the book, Digital Design: An Embedded
Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan
Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
Verilog
Performance and Parallelism
A processor core performs steps in sequence
Accelerating performance
Perform steps in parallel
Takes less time overall to complete an operation
Instruction-level parallelism
Performance limited by the instruction rate
Within a processor core
Pipelining, multiple-issue
Accelerators
Custom hardware for parallel operations
Digital Design — Chapter 9 — Accelerators
Achievable Parallelism
How many steps can be performed at
once?
Regularly structured data
Independent processing steps
Examples
Video and image pixel processing
Audio or sensor signal processing
Constrained by data dependencies
Operations that depend on results of
previous steps
Algorithm Kernels
Algorithm: specification of the required
processing steps
Kernel: the part that involves the most
intensive, repetitive processing
Often expressed in a programming
language
“10% of operations take 90% of the time”
Accelerating a kernel with parallel
hardware gives the best payback
Amdahl’s Law
Time for an algorithm is $t$
Fraction $f$ is spent on a kernel: $t = ft + (1 - f)t$
Accelerator speeds up kernel by a factor $s$: $t' = \dfrac{ft}{s} + (1 - f)t$
Overall speedup factor: $s' = \dfrac{t}{t'} = \dfrac{1}{\dfrac{f}{s} + (1 - f)}$
For large $f$, $s' \to s$
For small $f$, $s' \to 1$
Amdahl’s Law Example
An algorithm with two kernels
Kernel 1: 80% of time, can be sped up 10 times
Kernel 2: 15% of time, can be sped up 100 times
Which speedup gives best overall improvement?
For kernel 1: $s' = \dfrac{1}{\dfrac{0.8}{10} + (1 - 0.8)} = \dfrac{1}{0.08 + 0.2} = 3.57$
For kernel 2: $s' = \dfrac{1}{\dfrac{0.15}{100} + (1 - 0.15)} = \dfrac{1}{0.0015 + 0.85} = 1.17$
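As a rough check on this comparison, one can ask what each kernel could contribute even with an infinitely fast accelerator. The limit below follows from Amdahl's Law, $s' = 1 / (f/s + (1 - f))$; the symbol $s'_{\max}$ is introduced here only for illustration.

```latex
s'_{\max} = \lim_{s \to \infty} \frac{1}{\frac{f}{s} + (1 - f)} = \frac{1}{1 - f}
\qquad
\text{Kernel 1: } \frac{1}{1 - 0.8} = 5
\qquad
\text{Kernel 2: } \frac{1}{1 - 0.15} \approx 1.18
```

So a 10-times speedup of kernel 1 already reaches 3.57 of a possible 5, while no speedup of kernel 2 can exceed about 1.18.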
Parallel Architectures
An architecture for an accelerator
specifies
Processing blocks
Data flow between them
Parallelism through replication
Multiple identical blocks operating on different data elements
Works well when elements can be
processed independently
Parallel Architectures
Parallelism through pipelining
Break a computation into steps, perform them in assembly-line fashion
Latency (time to complete a single operation) is not reduced
Throughput (rate of completion of operations) is increased
Ideally by a factor equal to the number of pipeline stages
[Figure: pipeline, data in → step 1 → step 2 → step 3 → data out]
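The throughput claim can be made concrete with a small idealized model. Assume, purely as an illustration and not from the slides, that an operation taking time $T$ is split into $n$ equal stages with no register overhead, and that $N$ operations are processed back to back.

```latex
t_{\text{unpipelined}} = N\,T
\qquad
t_{\text{pipelined}} = T + (N - 1)\,\frac{T}{n}
\qquad
\frac{t_{\text{unpipelined}}}{t_{\text{pipelined}}} = \frac{N\,n}{n + N - 1} \;\to\; n \quad (N \to \infty)
```

For three stages and $N = 100$ the ratio is $300/102 \approx 2.94$, close to the ideal factor of 3, while each individual operation still takes $T$ from input to output.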
Direct Memory Access (DMA)
Input/output data for accelerators must be transferred at high speed
Using the processor would be too slow
Direct memory access
I/O controller and accelerator transfer data to and from memory autonomously
Program supplies starting address and length
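The "starting address and length" bookkeeping might be sketched as follows. This is a minimal illustration only: the module name dma_addr_gen, its ports, and the default 22-bit width are assumptions, not taken from the slides or the textbook.

```verilog
// Hypothetical DMA address generator (illustrative names and widths).
module dma_addr_gen #(parameter AW = 22) (
  input  wire          clk,
  input  wire          rst,
  input  wire          load,        // program supplies start address and length
  input  wire [AW-1:0] start_addr,  // first word address of the transfer
  input  wire [AW-1:0] length,      // number of words to transfer
  input  wire          xfer_ack,    // one word has been transferred on the bus
  output reg  [AW-1:0] addr,        // address for the next bus transfer
  output wire          busy         // high while words remain to be transferred
);
  reg [AW-1:0] remaining;           // words left to transfer

  assign busy = (remaining != 0);

  always @(posedge clk)
    if (rst) begin
      addr      <= 0;
      remaining <= 0;
    end else if (load) begin
      addr      <= start_addr;
      remaining <= length;
    end else if (busy && xfer_ack) begin
      addr      <= addr + 1;        // step to the next word
      remaining <= remaining - 1;
    end
endmodule
```

Once loaded, the counter advances without processor involvement; software only needs to observe busy (or an interrupt derived from it) to learn that the transfer has finished.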
Bus Arbitration
Bus masters take turns to use bus to access slaves
Controlled by a bus arbiter
Arbitration policies
Priority, round-robin, …
[Figure: arbiter with request/grant lines to the processor, accelerator, and controller, which share the memory bus to memory]
Block-Processing Accelerator
Data arranged in regular groups of
contiguous memory locations
Accelerator works block by block
E.g., images in blocks of 8 × 8 × 16-bit
pixels
Datapath comprises
Memory access: address generation,
counters
Computation section
Control section: finite-state machine(s)
Stream-Processing Accelerator
Streams of data from an input source
E.g., high-speed sensors
Digital signal processing (DSP)
Analog sensor signal converted to stream
of digital sample values
Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)
Processor/Accelerator Interface
Embedded software controls an
accelerator
Providing control parameters
Synchronizing operations
Input/output registers and interrupts
Interact with the control sequencer
Case Study: Edge Detection
Illustration of accelerator design
Edge detection in video processing
Identify where image intensity changes abruptly
Typically at the boundary of objects
First step in identifying objects in a scene
Application areas
Video surveillance, computer vision, …
For this case study
Monochrome images of 640 × 480 × 8-bit pixels
Stored row-by-row in memory
Pixel values: 0 (black) – 255 (white)
Sobel Edge Detection
Compute derivatives of intensity in x
and y directions
Look for minima and maxima (where
intensity changes most rapidly)
The Sobel Algorithm
Use convolution to approximate partial derivatives $D_x$ and $D_y$ at each position
Weighted sum of value of a pixel and its eight nearest neighbors
Coefficients represented using a 3×3 convolution mask
Sobel masks for x and y derivatives
$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$, $D_x(i, j) = O(i, j) * G_x$
$G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$, $D_y(i, j) = O(i, j) * G_y$
The Sobel Algorithm
Combine partial derivatives
$|D| = \sqrt{D_x^2 + D_y^2}$
Since we just want maxima and minima in magnitude, approximate as:
$|D| \approx |D_x| + |D_y|$
Pixels on the image border don’t have eight neighbors
Skip computation of $|D|$ for border pixels
Just set them to 0 using software
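A small worked example may help. Take a 3 × 3 neighborhood lying on a vertical edge, with black pixels (0) in the left and centre columns and white pixels (255) in the right column; these values are chosen purely for illustration. Applying the standard Sobel masks as a weighted sum gives:

```latex
\begin{aligned}
O &= \begin{bmatrix} 0 & 0 & 255 \\ 0 & 0 & 255 \\ 0 & 0 & 255 \end{bmatrix} \\
D_x &= (-1)(0) + (+1)(255) + (-2)(0) + (+2)(255) + (-1)(0) + (+1)(255) = 1020 \\
D_y &= \bigl(0 + 2(0) + 255\bigr) - \bigl(0 + 2(0) + 255\bigr) = 0 \\
|D| &\approx |D_x| + |D_y| = 1020
\end{aligned}
```

In this case the approximation equals the exact value $\sqrt{1020^2 + 0^2} = 1020$; the two differ most for diagonal edges, where both derivatives are nonzero.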
The Algorithm in Pseudocode
for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
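The two inner loops amount to one 3 × 3 window computation. The combinational sketch below is one way to express that in Verilog, for example as a reference when checking pipeline results. The module name sobel_window and its port names are hypothetical; the masks are the standard Sobel masks, and the final divide-by-8 mirrors the D[10:3] selection used in the datapath code later in the deck.

```verilog
// Hypothetical combinational reference for one Sobel window.
// The centre pixel has weight 0 in both masks, so it is not a port.
module sobel_window (
  input  wire [7:0] p00, p01, p02,   // previous row: left, centre, right
  input  wire [7:0] p10,      p12,   // current row: left, right
  input  wire [7:0] p20, p21, p22,   // next row: left, centre, right
  output wire [7:0] d                // |D| scaled back to 8 bits
);
  // Zero-extend a pixel to 11 signed bits so sums up to +/-1020 fit.
  function signed [10:0] ext (input [7:0] p);
    ext = {3'b000, p};
  endfunction

  wire signed [10:0] dx = - ext(p00) + ext(p02)
                          - (ext(p10) <<< 1) + (ext(p12) <<< 1)
                          - ext(p20) + ext(p22);
  wire signed [10:0] dy =   ext(p00) + (ext(p01) <<< 1) + ext(p02)
                          - ext(p20) - (ext(p21) <<< 1) - ext(p22);

  wire [10:0] abs_dx = (dx < 0) ? -dx : dx;
  wire [10:0] abs_dy = (dy < 0) ? -dy : dy;
  wire [10:0] mag    = abs_dx + abs_dy;   // 0 to 2040

  assign d = mag[10:3];                   // divide by 8, truncating
endmodule
```

With the left and centre columns at 0 and the right column at 255, this gives dx = 1020, dy = 0 and d = 127; a uniform window gives d = 0.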
Data Formats and Rates
Pixel values: 0 to 255 (8 bits)
Coefficients are 0, ±1 and ±2
Partial products: –510 to +510 (10 bits)
Dx and Dy: –1020 to +1020 (11 bits)
|D|: 0 to 2040 (11 bits)
Final pixel value: scale back to 8 bits
Video rate: 30 frames/sec
640 × 480 = 307,200 pixels
307,200 × 30 = 9,216,000 ≈ 10 million pixels/sec
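The ranges listed above follow from the 8-bit pixel range and the mask coefficients. The derivation below is a sketch of that arithmetic; the bit counts assume two's-complement representation for the signed quantities.

```latex
\begin{aligned}
\text{partial product:}\quad & |O \cdot G| \le 255 \times 2 = 510 < 2^{9} && \Rightarrow\ 10 \text{ bits (signed)} \\
D_x,\ D_y:\quad & 255 \times (1 + 2 + 1) = 1020 < 2^{10} && \Rightarrow\ 11 \text{ bits (signed)} \\
|D|:\quad & |D_x| + |D_y| \le 2040 < 2^{11} && \Rightarrow\ 11 \text{ bits (unsigned)}
\end{aligned}
```

Dropping the three least-significant bits of $|D|$ maps 0 to 2040 onto 0 to 255, which is one simple way to scale the result back to 8 bits.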
Data Dependencies
Pixels can be computed independently
For each pixel: [Figure: data dependency graph for computing one derivative pixel]
System Architecture
Data dependencies suggest a pipeline
Coefficient multiplies are simple shift/negate, so
merge with adder stage
Memory Bandwidth
Assume memory read/write takes 20ns
(2 cycles of 100MHz clock)
Memory is 32 bits wide, byte addressable
Bandwidth = 50M operations/sec
Camera produces 10Mpixels/sec
Accelerator needs to process at this rate
(8 reads + 1 write) × 10Mpixel/sec
= 90M operations/sec
Greater than memory bandwidth
Memory Bandwidth
Read 4 pixels at once from each of previous,
current, and next rows
Store in accelerator to compute multiple derivative
image pixels
Produce derivative pixels row-by-row, left-to-right
Read 3 × 32-bit words for every 4th derivative
pixel computed
Write 4 pixels at a time
(3 reads + 1 write) / 4 × 10Mpixel/sec
= 10M operations/sec
= 20% of available memory bandwidth
Sobel Accelerator Architecture
Accelerator Sequence
Steady state
Write 4 result pixels
Read 4 pixels for previous, current, next rows
Compute for 4 cycles
Repeat…
Start of row
Omit writes until pipeline full
End of row
Omit reads to drain pipeline
Memory Operation Timing
Steady state
Pixel Datapath
// Computation datapath signals
reg        [31:0] prev_row, curr_row, next_row;
reg         [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg         [7:0] abs_D;
reg        [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row <= dat_i;
  else if (shift_en) prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register
function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;
endfunction
...
Pixel Datapath
always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    D = abs(Dx) + abs(Dy);
    abs_D <= D[10:3];
    Dx <= - $signed({3'b000, O[-1][-1]})
          + $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          - $signed({3'b000, O[+1][-1]})
          + $signed({3'b000, O[+1][+1]});
    Dy <=   $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          + $signed({3'b000, O[-1][+1]})
          - $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          - $signed({3'b000, O[+1][+1]});
...
Pixel Datapath
    O[-1][-1] <= O[-1][ 0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[ 0][ 0];
    O[ 0][ 0] <= O[ 0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end
always @(posedge clk_i) // Result row register
  if (shift_en) result_row <= {result_row[23:0], abs_D};
Address Generation
Given an image in memory at base
address B
Address for pixel in row r, column c is
B + r × 640 + c
Base address (B) is fixed
Offset (r × 640 + c) increments by 4 for
each group of 4 pixels read/written
Use word-aligned addresses
Two least-significant bits always 00
Increment word address by 1
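As an illustration, take the base address 0x008000 that the testbench later writes to the O_base register, and the pixel at row 1, column 4 (row and column chosen arbitrarily here). $W$ denotes the word address, a symbol introduced only for this example.

```latex
\begin{aligned}
\text{byte address} &= B + r \times 640 + c = \texttt{0x008000} + 640 + 4 = \texttt{0x008284} \\
W &= \frac{B}{4} + r \times 160 + \frac{c}{4} = \texttt{0x2000} + 160 + 1 = \texttt{0x20A1} \\
4 \times W &= \texttt{0x008284}
\end{aligned}
```

The two forms agree because $B$ is word-aligned and $c$ is a multiple of 4, so the hardware can count in words and append the two zero bits when driving the address bus.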
Address Generation
Address Generation
always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];
always @(posedge clk_i) // O address offset counter
  if (offset_reset)         O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;
always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];
always @(posedge clk_i) // D address offset counter
  if (offset_reset)         D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...
Address Generation
assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                     D_addr;
assign adr_o[1:0] = 2'b00;
Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.
Slave Bus Interface
assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;
always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];
always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // Reading the status register clears done (and so the interrupt)
    done <= 1'b0;
assign int_req = int_en && done;
...
Slave Bus Interface
always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;
// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0;         // other registers read as 0
  else
    dat_o = result_row;      // for master write
Control Sequencing
Use a finite-state machine
Counters keep track of rows (0 to 477) and
columns (0 to 159)
See textbook for details of FSM output
functions
State Transition Diagram
Accelerator Verification
Simulation-based verification of each section
of the accelerator
Slave bus operations
Computation sequencing
Master bus operations
Address generation
Pixel computation
Testbench including the accelerator
Bus functional processor model
Simplified memory and bus arbiter models
Sobel Verification Testbench
[Figure: testbench block diagram with Processor BFM, Arbiter, Sobel Accelerator, and Memory Model connected by a multiplexed bus (muxes and connections)]
Processor Bus Functional Model
initial begin // Processor bus-functional model
cpu_adr_o <= 23'h000000;
cpu_sel_o <= 4'b0000;
cpu_dat_o <= 32'h00000000;
cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
@(negedge rst);
@(posedge clk);
// Write 008000 (hex) to O_base_addr register
bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
// Write 053000 + 280 (hex) to D_base_addr register
bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
// Write 1 to interrupt control register (enable interrupt)
bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
// Write to start register (data value ignored)
bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
// End of write operations
...
Processor Bus Functional Model
cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
begin: loop
forever begin
#10000;
@(posedge clk);
// Read status register
cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
cpu_sel_o <= 4'b1111;
cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
@(posedge clk); while (!cpu_ack_i) @(posedge clk);
cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
if (cpu_dat_i[0]) disable loop;
end
end
end
Memory Bus Functional Model
always begin // Memory bus-functional model
mem_ack_o <= 1'b0;
mem_dat_o <= 32'h00000000;
@(posedge clk);
while (!(bus_cyc && mem_stb_i)) @(posedge clk);
if (!bus_we)
mem_dat_o <= 32'h00000000; // in place of read data
mem_ack_o <= 1'b1;
@(posedge clk);
end
Bus Arbiter
Uses sobel_cyc_o and cpu_cyc_o
as request inputs
If both request at the same time, give
accelerator priority
Mealy FSM
Bus Arbiter
always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else     arbiter_current_state <= arbiter_next_state;
always @* // Arbiter logic
  case (arbiter_current_state)
    sobel: if (sobel_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
           else if (!sobel_cyc_o && cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end
           else begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
    cpu:   if (cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end else if (sobel_cyc_o && !cpu_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end else begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
  endcase
Simulation Results
See waveforms in textbook
Demonstrates sequencing and address generation
But what about…
Data values computed correctly
Interactions between processor and accelerator
Need to use more sophisticated verification techniques
Due to complexity of the design
Summary
Accelerators boost performance using
parallel hardware
Amdahl’s Law
Replication, pipelining, …
Best payback from accelerating a kernel
DMA avoids processor overhead
Verification requires advanced
techniques