Lithographic Aerial Image Simulation with FPGA based Hardware Acceleration

Lithographic Aerial Image Simulation
with FPGA based Hardware Acceleration
Jason Cong and Yi Zou
UCLA Computer Science Department
Lithography Simulation (Application)
• Simulation of the optical imaging process
• Computationally intensive and quite slow for full-chip simulation
2
XtremeData Inc.'s XD1000™ Coprocessor System (Platform)

• Socket-compatible: replaces one Opteron CPU with the XD1000 coprocessor

• The module connects to the CPU's HyperTransport bus and motherboard DIMMs while utilizing the existing power supply and heat-sink solution for the CPU

• Dedicated DIMM for the FPGA (not shared with the CPU)

• The coprocessor communicates with the CPU via the HyperTransport link and behaves much like a PCI device
3
Approach: Use of C-to-RTL Tools
• Used two tools in our work
  – CoDeveloper (Impulse C) by Impulse Accelerated Technologies
  – AutoPilot by AutoESL Design Technologies
• Advantages
  – Maintain the design at the C level
  – Shorten the development cycle
  – Perform several tuning and refinement steps at the C level:
    • Loop interchange, loop unrolling, and loop pipelining
    • Data distribution and memory partitioning
    • Data prefetching / overlapping computation and communication
4
Imaging Equations

$$I(x,y) = \sum_{k=1}^{K} \lambda_k \left| \sum_{n=1}^{N} \left[ \psi_k(x-x_1^{(n)},\, y-y_1^{(n)}) - \psi_k(x-x_2^{(n)},\, y-y_1^{(n)}) + \psi_k(x-x_2^{(n)},\, y-y_2^{(n)}) - \psi_k(x-x_1^{(n)},\, y-y_2^{(n)}) \right] \right|^2$$

$$\psi_k(x,y) = Q(x,y) \otimes f_k(x,y)$$

The computation is a triple loop nest: a loop over kernels (k), a loop over different rectangles (n), and a loop over pixels (x, y).

• I(x,y): image intensity at (x,y)
• ψ_k(x,y): kth kernel
• f_k(x,y): kth eigenvector
• (x1,y1), (x2,y1), (x1,y2), (x2,y2): layout corners (of one rectangle)
• Q(x,y): mask transmittance

[Figure: pseudo code of the imaging equation]
5
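
For concreteness, here is a minimal C sketch of the loop nest the slide describes, in the original order (pixels, then kernels, then layout corners). All names and sizes (NK, NRECT, W, H, KDIM, psi, lambda, rect) are illustrative assumptions, not the authors' code:

```c
#define NK    10    /* number of kernels                    */
#define NRECT 64    /* rectangles in the layout             */
#define W     128   /* image width in pixels                */
#define H     128   /* image height in pixels               */
#define KDIM  256   /* kernel sampled on a KDIM x KDIM grid */

typedef struct { int x1, y1, x2, y2; } Rect;

static float kernel[NK][KDIM][KDIM];  /* psi_k, precomputed */
static float lambda[NK];              /* weight per kernel  */
static Rect  rect[NRECT];             /* layout rectangles  */
static float image[H][W];             /* output intensity   */

/* Shifted-kernel lookup; shifts outside the table contribute 0. */
static float psi(int k, int dx, int dy)
{
    if (dx < 0 || dx >= KDIM || dy < 0 || dy >= KDIM) return 0.0f;
    return kernel[k][dy][dx];
}

void aerial_image(void)
{
    for (int y = 0; y < H; y++)                 /* loop over pixels  */
        for (int x = 0; x < W; x++) {
            float I = 0.0f;
            for (int k = 0; k < NK; k++) {      /* loop over kernels */
                float s = 0.0f;
                for (int n = 0; n < NRECT; n++) /* loop over corners */
                    s += psi(k, x - rect[n].x1, y - rect[n].y1)
                       - psi(k, x - rect[n].x2, y - rect[n].y1)
                       + psi(k, x - rect[n].x2, y - rect[n].y2)
                       - psi(k, x - rect[n].x1, y - rect[n].y2);
                I += lambda[k] * s * s;         /* lambda_k * |.|^2  */
            }
            image[y][x] = I;
        }
}
```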
Loop Interchange

Before: loop over pixels → loop over kernels → loop over layout corners
After (loop interchange): loop over kernels → loop over layout corners → loop over pixels

• Different kernels do not have much correlation, so the kernel loop is moved to the outermost position
• Fixing one specific layout corner and then looping over pixels gives more regular data access
6
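
A sketch of the same computation after the interchange, reusing the illustrative declarations (psi, rect, lambda, image) from the previous sketch; the kernel loop is now outermost and, for each fixed rectangle, the pixel sweep is a regular pass over a partial-sum accumulator:

```c
/* After loop interchange: kernels -> layout corners -> pixels.
 * `partial` holds the inner sum per pixel so lambda_k * |.|^2
 * can be applied once per kernel. */
static float partial[H][W];

void aerial_image_interchanged(void)
{
    for (int k = 0; k < NK; k++) {           /* kernels outermost   */
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                partial[y][x] = 0.0f;
        for (int n = 0; n < NRECT; n++)      /* fix one rectangle   */
            for (int y = 0; y < H; y++)      /* regular pixel sweep */
                for (int x = 0; x < W; x++)
                    partial[y][x] += psi(k, x - rect[n].x1, y - rect[n].y1)
                                   - psi(k, x - rect[n].x2, y - rect[n].y1)
                                   + psi(k, x - rect[n].x2, y - rect[n].y2)
                                   - psi(k, x - rect[n].x1, y - rect[n].y2);
        for (int y = 0; y < H; y++)          /* fold into image     */
            for (int x = 0; x < W; x++)
                image[y][x] += lambda[k] * partial[y][x] * partial[y][x];
    }
}
```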
Interpretation of Inner Loop after Loop Interchange

• Imaging equation:

$$I(x,y) = \sum_{k=1}^{K} \lambda_k \left| \sum_{n=1}^{N} \left[ \psi_k(x-x_1^{(n)},\, y-y_1^{(n)}) - \psi_k(x-x_2^{(n)},\, y-y_1^{(n)}) + \psi_k(x-x_2^{(n)},\, y-y_2^{(n)}) - \psi_k(x-x_1^{(n)},\, y-y_2^{(n)}) \right] \right|^2$$

• The loop over different layout corners and pixels:

[Figure: each layout corner of a rectangle (object) selects a shifted copy of the kernel array, which is added to or subtracted from the image partial sum]

• The partial image computed by the inner sum is a weighted sum of shifted kernels; how much each copy is shifted is determined by the layout corners (of one rectangle)
7
Loop Unrolling
• Loop unrolling is one option to express parallelism in those tools
• The improvement from loop unrolling is limited due to port conflicts
  – Data accesses to the same array cannot be scheduled in the same cycle due to port conflicts
  – May increase the initiation interval when both loop pipelining and loop unrolling are used

[Figure: loop unrolling]
8
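
A hand-unrolled sketch (factor 4) illustrating the port-conflict problem; the function and names are assumptions for illustration:

```c
/* Factor-4 unrolled reduction. The four reads per iteration all
 * target the same array; if `a` maps to a single dual-port on-chip
 * RAM, at most two reads can be scheduled per cycle, so the unrolled
 * body serializes and a pipelined version's initiation interval grows. */
float sum_unrolled4(const float *a, int n)  /* n assumed multiple of 4 */
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```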
Further Parallelization Needs Memory Partitioning
• Unrolling did not solve the problem completely due to port conflicts
  – Need a multi-port (on-chip) memory with a large number of ports!
• Implement the multi-port memory via memory partitioning
  – Computing tasks can be done in parallel once we can fetch multiple data items in parallel
• Each PE is responsible for computing one partition of the image
  – Each PE is composed of one partition of the kernel and one partition of the image partial sum
• Multiplexing logic gets the data from different partitions of the kernel and provides the data to each PE
  – To compute one partition of the image, a PE might also need kernel data from other partitions

[Figure: 4-PE example — kernel partitions 1-4 feed four computing elements through multiplexing logic; each computing element holds one partition of the kernel and one partition of the image partial sum]
9
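
The sketch below gives one way the 4-PE organization could look in C. The banking of the kernel and partial-sum arrays follows the slide, while the access pattern (src, addr, the weight w) is a placeholder for the address-generation and multiplexing logic described on slide 12; all names and sizes are assumptions:

```c
#define NPE  4      /* processing elements = partitions (illustrative) */
#define BLEN 1024   /* words per bank (illustrative)                   */

static float kern_bank[NPE][BLEN];  /* kernel, split into NPE banks   */
static float img_bank [NPE][BLEN];  /* image partial sums, one per PE */

/* One parallel step: PE p updates its own image partition using kernel
 * data that may live in another PE's bank. `src[p]` (which kernel bank)
 * and `addr[p]` (word address, simplified here to be shared by kernel
 * and image) stand in for the address-generation logic; `w` is the
 * +/-1 corner weight. In hardware this loop body is replicated into
 * NPE concurrent computing elements. */
void pe_step(const int src[NPE], const int addr[NPE], float w)
{
    for (int p = 0; p < NPE; p++)
        img_bank[p][addr[p]] += w * kern_bank[src[p]][addr[p]];
}
```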
Choosing Partitioning Schemes
• A less optimal partitioning design (here a 2 x 2 example)
• Block scheduling avoids data-access contention (at any time, each PE accesses a different kernel partition)
• Might face a load-balancing problem if the required kernel data lie mostly in some partitions
• The computing task is partitioned into blocks/stages, rotating over time:

  Stage 1: PE 1 uses kernel partition 1, PE 2 uses partition 2, PE 3 uses partition 3, PE 4 uses partition 4
  Stage 2: PE 1 uses partition 2, PE 2 uses partition 3, PE 3 uses partition 4, PE 4 uses partition 1
  Stage 3: PE 1 uses partition 3, PE 2 uses partition 4, PE 3 uses partition 1, PE 4 uses partition 2
  Stage 4: PE 1 uses partition 4, PE 2 uses partition 1, PE 3 uses partition 2, PE 4 uses partition 3

  (In every stage, PE i computes image partition i.)
10
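
A sketch of the round-robin block schedule above, reusing NPE, BLEN, kern_bank, and img_bank from the previous sketch: at stage s, PE p reads kernel partition (p + s) mod 4, so no two PEs touch the same bank within a stage.

```c
/* Block-scheduled pass over all stages (illustrative). Stages run
 * sequentially; within a stage the PE loop is parallel in hardware. */
void block_scheduled_pass(float w)
{
    for (int s = 0; s < NPE; s++)            /* stages in sequence   */
        for (int p = 0; p < NPE; p++) {      /* PEs in parallel      */
            int kb = (p + s) % NPE;          /* contention-free bank */
            for (int i = 0; i < BLEN; i++)
                img_bank[p][i] += w * kern_bank[kb][i];
        }
}
```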
Choosing Partitioning Schemes (Cont.)
• Data partitioning for load balancing
  – Here, different colors denote different partitions
  – Memory banking using the lower bits of the address

[Figure: kernel array and image partial-sum array, each interleaved cyclically across partitions 1-4]
11
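
A minimal sketch of lower-bit banking with four banks; the helper names are assumptions. Because consecutive addresses rotate through the banks, the data needed by the PEs tends to spread evenly across partitions:

```c
#define NBANKS 4                     /* must be a power of two */

/* Bank = low log2(NBANKS) bits, offset = remaining high bits, so
 * addresses 0,1,2,3 land in banks 0,1,2,3 at offset 0; addresses
 * 4,5,6,7 land in banks 0,1,2,3 at offset 1; and so on. */
static inline int bank_of(int addr)   { return addr & (NBANKS - 1); }
static inline int offset_of(int addr) { return addr / NBANKS; }
```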
Address Generation and Data Multiplexing
• Need address-generation logic to provide the addresses for the kernel data and image partial sums, since the memory is partitioned
• Need data-multiplexing logic to deliver the data from multiple memory blocks to the correct place
  – Implemented as 2D ring-based shifting (better than a naïve mux for larger partitionings)

[Figure: 2 x 2 example with four configurations. Start from Reg_1 = array_a[..], Reg_2 = array_b[..], Reg_3 = array_c[..], Reg_4 = array_d[..]. Wanted: Reg_1 = array_c[..], Reg_2 = array_d[..], Reg_3 = array_a[..], Reg_4 = array_b[..]. Achieved by shifting 1 step in the Y direction and 0 steps in the X direction.]
12
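
A C model of the 2D ring-based shift for a 2 x 2 PE array (the sizes and register layout are illustrative). Compared with a full crossbar of muxes, only neighbor-to-neighbor moves are needed, which scales better for larger partitionings:

```c
#define PX 2    /* PEs per row (X direction)    */
#define PY 2    /* PEs per column (Y direction) */

/* Rotate the register contents sx steps along X and sy steps along Y.
 * In hardware each step is a neighbor-to-neighbor transfer on a ring. */
void ring_shift(float reg[PY][PX], int sx, int sy)
{
    float next[PY][PX];
    for (int y = 0; y < PY; y++)
        for (int x = 0; x < PX; x++)
            next[(y + sy) % PY][(x + sx) % PX] = reg[y][x];
    for (int y = 0; y < PY; y++)
        for (int x = 0; x < PX; x++)
            reg[y][x] = next[y][x];
}

/* Slide's example: start with {a, b; c, d}; ring_shift(reg, 0, 1)
 * yields {c, d; a, b}, i.e. Reg_1 = array_c[..], Reg_2 = array_d[..],
 * Reg_3 = array_a[..], Reg_4 = array_b[..]. */
```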
Loop Pipelining and Loop Unrolling

• Loop pipelining can still be applied to the code after memory partitioning
  – Can speed up the code by a factor of 10X
• Loop unrolling can be used to compact the code via a multi-dimensional array
  – One way to represent the memory partitioning:

  Flat array:
    kernel[size];
    // loop body with unrolling pragma and pipelining pragma
    { … += kernel[…] …  // computation }

  Partitioned array:
    kernel[4][4][size/16];
    // loop body with unrolling pragma and pipelining pragma
    { … += kernel[i][j][…] …  // if some indices are constant, each [i][j] slice maps to its own memory bank }
13
Overlapping Computation and Communication

• Use ping-pong buffers at the input and output
• Two ways of implementation
  – Function/block pipelining (AutoPilot) or inter-process communication (Impulse C)

[Figure: pipelined timeline across SW and HW. Input blocks DI1, DI2 are transferred from software to SRAM, then from SRAM to the FPGA; output blocks DO1, DO2 are transferred from the FPGA to SRAM, then from SRAM to software. While the FPGA computes on one block, the next input block is being read and the previous output block is being written, so reading input data, computation, and writing output data overlap across successive blocks.]
14
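
A sequential C sketch of the ping-pong scheme; the stage functions are hypothetical placeholders for the DMA transfers and the FPGA datapath. Written this way for Impulse C or AutoPilot, the stages become concurrently running blocks/processes, so fetching block i+1 and writing block i-1 overlap with computing block i:

```c
#include <stdbool.h>

#define BUFLEN 4096                 /* words per block (illustrative) */

static float in_buf [2][BUFLEN];    /* ping-pong input buffers        */
static float out_buf[2][BUFLEN];    /* ping-pong output buffers       */

/* Hypothetical stage functions standing in for the transfers and
 * computation shown in the slide's timeline. */
extern bool fetch_input  (float *dst);                  /* false at end */
extern void compute_block(const float *in, float *out);
extern void write_output (const float *src);

void process_stream(void)
{
    int cur = 0;
    bool have = fetch_input(in_buf[cur]);   /* prologue: first block */
    while (have) {
        int nxt = 1 - cur;
        have = fetch_input(in_buf[nxt]);    /* overlaps with compute */
        compute_block(in_buf[cur], out_buf[cur]);
        write_output(out_buf[cur]);         /* overlaps next compute */
        cur = nxt;                          /* swap ping and pong    */
    }
}
```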
Implementation Flow
• Original code has nested loops
• Loop interchange (manual code refinement)
• Multi-PE implementation: add memory partitioning, address generation, and data-multiplexing logic (manual code refinement)
• Enable loop pipelining for the refined code by specifying pragmas
• Use Impulse C and AutoPilot to compile the refined code
• Use the vendor tool to compile the RTL to a bitstream
• Run the program on the target system
15
Experiment Results
• 15X speedup using a 5 x 5 partitioning over an Opteron 2.2 GHz with 4 GB RAM
• Logic utilization around 25K ALUTs (8K of which is used by the interface framework rather than the design)
• Power utilization less than 15W in the FPGA, compared with 86W for the Opteron 248
• Close to 100X (5.8 x 15) improvement in energy efficiency
  – Assuming similar performance
16
Experience with the Two Commercial Tools
• Impulse C
  – Strong platform-customization support
  – Hardware/software co-design
  – Smaller synthesizable subset of C
• AutoPilot
  – Supports C, C++, and SystemC
  – Larger synthesizable subset
  – Platform customization
17
Discussions
• Performance without the different optimizations
  – Roughly 2~3X worse if we do not do memory partitioning
• Polygon-based versus image-based approach
  – The image-based approach is a 2D FFT
  – Which one is faster depends on the actual layout
• Implementation on GPUs
  – The nested loop itself is already data-parallel
  – The G80 has very fast shared memory for thread blocks, but its size is only 16KB
  – We had to put the kernel array in texture memory, which is cached
18
Acknowledgments
• Financial support from
  – GRC
  – GSRC (FCRP)
  – NSF
• Industrial support and collaboration from
  – the Altera-AMD-SUN-XDI consortium
  – Altera, Magma, and Xilinx under the UC MICRO program
• Valuable discussions and comments from
  – Alfred Wong (Magma)
  – Zhiru Zhang (AutoESL)
19
Q/A
20