System LSI and Architecture Technology (part III: inter-chip

Reconfigurable Architectures
AMANO, Hideharu
hunga@am.ics.keio.ac.jp
Reconfigurable System
(Custom Computing Machine)

A target algorithm is executed directly in
hardware on SRAM-based FPGAs/PLDs.

High performance of special-purpose machines.
High degree of flexibility of general-purpose
machines.
An execution mechanism completely different
from that of stored-program computers.
PLD(Programmable Logic Device)


Integrated Circuit whose logic function can be
defined by users.
Standard IC, ASIC (Application Specific IC)
SPLD (Simple PLD) / PLA (Programmable Logic Array): small-scale IC with an AND-OR array
CPLD (Complex PLD): middle-scale IC with AND-OR arrays
FPGA (Field Programmable Gate Array): large-scale IC with LUTs
Caution! Terms are not well defined!
Rapid development of PLDs
[Figure: gate count (10K to 10M) versus year (1980 to 2000) for fuse PLA, EEPROM SPLD, CPLD, anti-fuse FPGA, and SRAM FPGA. Trends marked on the plot: hierarchical structure, embedded cores, low voltage.]
Increasing performance
From 1991 to 2000: gate count x45, speed x12, cost 1/100
From 1991 to 2004: gate count x200, speed x40, cost 1/500
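SPLD (Simple PLD: AND-OR/Product-term)
[Figure: inputs and their NOT-gate complements feed a programmable AND array followed by an OR array.]
Arbitrary logic is realized by
changing the AND-OR connection
AND/OR connection example
[Figure: inputs A, B, C, D; the AND array forms the product terms A&B and C&D, and the OR array outputs A&B | C&D.]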
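The connection example above can be mimicked in software. The following is a minimal sketch, not a model of any real device: the names PRODUCT_TERMS, and_or_array, and truth_table are hypothetical, and the "programming" of the array is just a list of which (possibly inverted) inputs each AND term connects to.

```python
from itertools import product

# Hypothetical "programming" of a tiny AND-OR (product-term) array.
# Each product term lists the inputs wired into one AND gate as
# (input name, inverted?) pairs. The OR plane sums all product terms.
PRODUCT_TERMS = [
    [("A", False), ("B", False)],  # A & B
    [("C", False), ("D", False)],  # C & D
]


def and_or_array(inputs, product_terms=PRODUCT_TERMS):
    """Evaluate the programmed array for one assignment of 0/1 inputs."""
    def literal(name, inverted):
        value = inputs[name]
        return (not value) if inverted else value

    # AND plane: every literal of a term must be true.
    # OR plane: at least one term must be true.
    return int(any(all(literal(n, inv) for n, inv in term)
                   for term in product_terms))


def truth_table(product_terms=PRODUCT_TERMS, names="ABCD"):
    """Return [(input bits, output)] for every input combination."""
    rows = []
    for bits in product((0, 1), repeat=len(names)):
        assignment = dict(zip(names, bits))
        rows.append((bits, and_or_array(assignment, product_terms)))
    return rows


if __name__ == "__main__":
    for bits, z in truth_table():
        print("".join(map(str, bits)), z)
```

Changing the entries of PRODUCT_TERMS corresponds to changing the AND-OR connections; the evaluation code stays the same.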
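LUT: Look Up Table
[Figure: a ROM/RAM used as a look-up table; the inputs form the address and the stored data is the output.]
A simple ROM/RAM can be used to implement
random logic.
ABC  Z
000  0
001  0
010  0
011  1
100  0
101  0
110  0
111  1
[Figure: the stored Z column feeding multiplexers controlled by A, B, and C.]
A combination of memory and
multiplexers is commonly used.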
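As an illustration of the memory-plus-multiplexer idea, here is a minimal sketch of a 3-input LUT holding the Z column of the table above. The function names mux2 and lut3 and the order in which the inputs drive the multiplexer levels are assumptions made for this example; the slide's figure may wire them differently.

```python
# Memory contents: one bit per address ABC (A is the most significant bit).
# These are the Z values from the truth table above.
LUT_CONTENTS = [0, 0, 0, 1, 0, 0, 0, 1]


def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def lut3(a, b, c, contents=LUT_CONTENTS):
    """3-input LUT built as a tree of 2:1 multiplexers over 8 memory cells."""
    # Level 1: C chooses within each pair of neighbouring cells.
    level1 = [mux2(c, contents[i], contents[i + 1]) for i in range(0, 8, 2)]
    # Level 2: B chooses within each pair of level-1 outputs.
    level2 = [mux2(b, level1[i], level1[i + 1]) for i in range(0, 4, 2)]
    # Level 3: A chooses between the two remaining candidates.
    return mux2(a, level2[0], level2[1])


if __name__ == "__main__":
    for address in range(8):
        a, b, c = (address >> 2) & 1, (address >> 1) & 1, address & 1
        print(f"{a}{b}{c} {lut3(a, b, c)}")
```

Rewriting LUT_CONTENTS changes the logic function without touching the multiplexer tree, which is why any function of the inputs can be realized.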
An example using LUT: Look Up Table
[Figure: the same truth table and multiplexer structure as on the previous slide, with example 0/1 values applied to the inputs A, B, and C.]
AND-OR array vs. LUT

AND-OR array(product-term)




Efficient for logic with multiple outputs
Some logic functions cannot be realized.
Suitable for EEPROM and Flash-ROM
LUT



Any logic can be realized.
Efficient for logic with a single output
Suitable for Flash-ROM, Anti-fuse, and SRAM.
Sequential circuits
[Figure: an AND-OR array or LUT takes the inputs; its outputs drive D flip-flops in the output block, and the flip-flop outputs (Q) are fed back to the array.]
Sequential circuits (state machines) can be built
by attaching flip-flops and feedback loops.
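The structure described here, combinational logic followed by flip-flops whose outputs are fed back, can be sketched as follows. The example state machine (a 2-bit up-counter with an enable input) and the names NEXT_STATE_LUT and run_counter are hypothetical choices for illustration only.

```python
# Next-state logic stored as a look-up table.
# Address: (enable, q1, q0); data: (d1, d0) presented to the flip-flops.
NEXT_STATE_LUT = {}
for enable in (0, 1):
    for q1 in (0, 1):
        for q0 in (0, 1):
            if enable:
                d0 = 1 - q0      # low bit toggles every enabled clock
                d1 = q1 ^ q0     # high bit toggles when low bit is 1
            else:
                d1, d0 = q1, q0  # hold the current state
            NEXT_STATE_LUT[(enable, q1, q0)] = (d1, d0)


def run_counter(enables, state=(0, 0)):
    """Clock the state machine once per entry of `enables`.

    Returns the counter value seen after each clock edge.
    """
    q1, q0 = state
    trace = []
    for enable in enables:
        # Feedback: the current flip-flop outputs address the LUT.
        d1, d0 = NEXT_STATE_LUT[(enable, q1, q0)]
        # Clock edge: the flip-flops capture their D inputs.
        q1, q0 = d1, d0
        trace.append(2 * q1 + q0)
    return trace


if __name__ == "__main__":
    print(run_counter([1, 1, 1, 1, 0, 1]))
```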
CPLD (Complex PLD)
Multiple AND/OR logic blocks are connected with a switch.
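[Figure: AND/OR logic blocks arranged around a central switch and wire region, as in Altera's MAX series.]
2-dimensional CPLD
[Figure: logic blocks and switches in a two-dimensional array surrounded by I/O, configured from SRAM (configuration memory).]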
FPGA
(Field Programmable Gate Array)
[Figure: logic blocks and switches surrounded by I/O. Each logic block contains a 5-input LUT (look up table), a switch set, and 2 flip-flops; a configuration memory holds the settings.]
Device for flexibility (1)

Anti-fuse type
Programmed by breaking down an insulator with high
voltage
High speed but one-time programmable
Actel, QuickLogic
EEPROM / Flash-ROM
Switches for connections are realized by
floating gates.
Re-programmable
Lattice, Altera's MAX series
Device for flexibility (2)

SRAM
Data in SRAM represents the look-up tables and wire
connections.
ISP (In-System Programming) is available.
The configuration data is lost when the power
is turned off.
Suitable for large-scale FPGAs. Has advanced rapidly
in recent years.
Xilinx XC, Altera FLEX, Lucent ORCA
The advanced series: Xilinx Virtex, Altera APEX
Others
Magnetic memory
DRAM
Architectures and devices
Architectures: SPLD, CPLD, FPGA
Anti-fuse: high speed, middle size; one-time programmable; Actel, QuickLogic
EEPROM / Flash-ROM: high speed, small/middle size; re-programmable; delay is predictable; Lattice, Altera, Xilinx
SRAM: large scale; rapid development; Xilinx, Altera
Recent PLDs

A large-scale chip with hierarchical structure:
System on Programmable Device
Providing DLL, CPU, DSP, ROM, RAM, multipliers, high-speed
links, and other hard IPs.
Xilinx's Virtex II Pro, Virtex IV, Altera's APEX20K, APEX-II
Specialized for mass production
Xilinx's Virtex II, Virtex IV, Altera's APEX20K, APEX-II
Low cost: Xilinx's Spartan, Altera's Cyclone
High speed: Altera's Stratix
Low voltage, multiple voltages, and low power
consumption
Island style
FPGA
Structure of a switch
Technical advance of Xilinx's FPGAs
Technology  Series         Product     LUT     Voltage
350nm       XC4000         XC4085KLA   7448    3.3V
250nm       XC4000         XC40250KV   20102   2.5V
220nm       Virtex         XCV1000     27648   2.5V
180nm       Virtex-E       XCV2000E    43200   1.8V
150nm       Virtex-II      XC2V8000    104882  1.5V
130nm       Virtex-II Pro  XC2VP125    125136  1.5V
90nm        Virtex-4       XC4VLX200   200488  1.2V
An example of hierarchical structure
(APEX20K)
[Figure: MegaLAB structures linked by row and column interconnects; within a MegaLAB, the MegaLAB interconnect and local interconnect connect the logic and the ESB.]
ESB: Embedded System Block (CAM, RAM)
EP20K1000C: 1M gates, 38400 LEs, 327680 bits
Altera Stratix II
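[Figure: chip floorplan showing DSP blocks, PLL, Mega RAM blocks, M4K RAM blocks, and M512 RAM blocks.]
LAB: Logic Array Block
Consists of 10 LEs, each made up of a 4-input LUT and a flip-flop.
Local and global interconnects provide
high-speed data transfer.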
Xilinx Virtex II
[Figure: a slice containing two LUTs, carry logic, and two D flip-flops; chip floorplan with global clock MUX, DCM, IOBs, and slices.]
Slice x 2 → CLB (Configurable Logic Block)
100000 CLBs
3 Mbit
Configurable Logic
RAM, Multiplier
Programmable IOs
SoPD (System on Programmable Device)
Xilinx Virtex-II Pro
[Figure: Virtex-II Pro die with DCM, Rocket I/O multi-gigabit transceivers, PowerPC, multipliers, block RAM, and CLBs.]
Various cores are integrated into the FPGA.
[Photos: QuickLogic, Lattice GAL, Altera FLEX10K, Xilinx Virtex, QuickLogic]
Design of PLDs

Mostly designed with common HDLs (Verilog-HDL,
VHDL)
C-level entry has come into use recently: Handel-C (Celoxica)
Synthesis, optimization, and place and route are
done automatically by vendors' tools.
Tools from various vendors are now often integrated
and used in combination.
For a large circuit, a long time is required, especially for place
and route.
When IPs are used, clock/DLL adjustment is done manually.
Optimization techniques differ among
vendors/products.
Glossary 1






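Reconfigurable System: in Japanese, 再構成可能システム or リコンフィギャラブルシステム
(リコンフィギャブルシステム also seems acceptable, but リコンフィギュラブル
seems odd).
PLD (Programmable Logic Device), FPGA (Field Programmable
Gate Array): names of reprogrammable devices; the terms are used as they are.
Configuration Data/Memory: the information that defines the functions and
connections of a PLD. The memory that stores it is the Configuration Memory.
Changing this information changes the hardware configuration.
Anti-fuse: the opposite of a fuse; a connection is programmed by making it
conductive with high voltage.
System on Programmable Device: the idea of mounting an entire system on a PLD.
It is a kind of SoC (System on a Chip), and the term is coined to correspond
to it.
ISP (In-System Programming): reprogramming while the system is kept running.
Possible with SRAM-type and some EEPROM-type FPGAs.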
A short history of reconfigurable systems
[Figure: timeline from 1990 to 2005.
Categories: stand-alone systems, co-processors, systems based on general PLDs, new devices, and dedicated devices for reconfigurable systems.
Systems and devices shown: SPLASH, SPLASH-2, RM-I, RM-II, RM-III, RM-IV, RM-V, YARDS, PRISM-I, PRISM-II, MPLD, WASMII, Cache Logic, DISC, DISC-II, Mult. Context FPGA, HOSMII, ATTRACTOR, FIPSOC, Cont. Switch. FPGA, RASH, PipeRench, PCA, DRL, ACM, CHIMERA, Chameleon, DRP, DAPDNA, PCA-2, DAPDNA2.
Events shown: the 1st FPL, the 1st Japanese FPGA/PLD Conf., the 1st FCCM, and the establishment of the Reconfigurable System Research Group in IEICE.]
Reconfigurable Systems
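[Figure: speed versus flexibility. Hardwired logic (ASIC) has the highest speed; software on a general-purpose CPU (for example, a loop such as "for i=0; i<K; i++ X[i]=X[i+j] .....") has the highest flexibility; a reconfigurable system is high speed but flexible, because the design can be changed among Design A, B, C, and D.]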
How is performance enhanced?

Performance enhancement by hardware
execution itself, which removes:
The overhead of software execution (instruction
fetch, loading data into registers, etc.)
The overhead of using fixed-size data.
The overhead of using only two-way branches.
However, these benefits are not very large, because embedded CPUs and DSPs
are highly optimized.
The key to performance improvement is parallel processing.
Parallel processing in reconfigurable
systems

Various techniques can be used





SIMD execution
Pipelined structure
Systolic algorithm
Data driven control
Parallel execution other than calculation


Parallel data access using internal memory units
Parallel data transfer including I/O accesses
SIMD (Single Instruction-stream/
Multiple Data-stream)-like calculation
The same instruction is applied to different data streams.
In reconfigurable systems, the operations need not be the same
(SIMD-like calculation)
[Figure: stream data in, processing parts with internal memory modules, stream data out.]
Pipelined structure
The stream is divided and inserted periodically.
[Figure: stream data items 1 to 5 advancing through a chain of processing parts, each with an internal memory module.]
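To make the pipelined structure concrete, the following minimal sketch clocks a linear pipeline in which one stream item is inserted per cycle and every item advances one stage per clock. The function name run_pipeline, the register model, and the three example stage functions are assumptions made for illustration; None is used as the "empty slot" marker, so the stream itself must not contain None.

```python
def run_pipeline(stream, stages):
    """Simulate a linear pipeline, one clock per loop iteration.

    stages: list of one-argument functions, one per pipeline stage.
    Returns (results in output order, number of clock cycles used).
    """
    regs = [None] * len(stages)   # output register of each stage
    pending = list(stream)        # items not yet inserted
    results, cycles = [], 0
    # Keep clocking while items wait outside or are still in flight.
    while pending or any(r is not None for r in regs[:-1]):
        # Stage 0 reads the next stream item; stage i reads stage i-1's register.
        inputs = [pending.pop(0) if pending else None] + regs[:-1]
        # All stages work in parallel on the values latched at the last clock.
        regs = [f(v) if v is not None else None
                for f, v in zip(stages, inputs)]
        if regs[-1] is not None:
            results.append(regs[-1])
        cycles += 1
    return results, cycles


if __name__ == "__main__":
    example_stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
    print(run_pipeline([1, 2, 3], example_stages))
```

In this model, N items through S stages take N + S - 1 clocks, compared with N x S clocks if each item had to finish before the next one started.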
Systolic Algorithm
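[Figure: data streams x and y entering a computational array.]
Data streams x and y are inserted at a fixed interval.
When two streams meet, a calculation is executed.
→ Systolic: the beat of the heart
Band matrix multiply y=Ax
$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{43} & a_{44}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
$$
Each processing element (X+) receives a, x, and y_i, and computes
$$ y_o = a x + y_i $$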
Band matrix multiply y=Ax (step-by-step operation of the array)
[Figure sequence: the elements of A are fed into the multiply-add (X+) elements while x1, x2, x3 advance through the array, separated by gaps.]
Step 1: x1 enters the array.
Step 2: y1 = a11 x1
Step 3: y1 = a11 x1 + a12 x2; y2 = a21 x1
Step 4: y2 = a21 x1 + a22 x2
Step 5: y2 = a21 x1 + a22 x2 + a23 x3; y3 = a32 x2
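The band matrix multiplication can be simulated with a small register-level model. This is a sketch under stated assumptions, not necessarily the exact array on the slides: it assumes three multiply-add elements, x moving left to right, partial sums y moving right to left, and one new x and one new y inserted every other clock. The function name systolic_tridiag_matvec is hypothetical, indices are 0-based in the code, and the matrix is passed as a dense list of lists for simplicity.

```python
def systolic_tridiag_matvec(A, x):
    """Compute y = A x for a tri-diagonal A on a simulated 3-element array."""
    n = len(x)
    num_pe = 3
    x_reg = [None] * num_pe   # column index j of the x value held in each element
    y_reg = [None] * num_pe   # (row index i, partial sum) held in each element
    y = [0] * n
    for t in range(2 * (n - 1) + num_pe):
        # A new x and a new (empty) y enter on every even clock.
        feed = t // 2 if (t % 2 == 0 and t // 2 < n) else None
        x_reg = [feed] + x_reg[:-1]                        # x shifts right
        y_reg = y_reg[1:] + [(feed, 0) if feed is not None else None]  # y shifts left
        for k in range(num_pe):
            if y_reg[k] is None or x_reg[k] is None:
                continue
            i, acc = y_reg[k]
            j = x_reg[k]
            # With this schedule, element k always sees i - j == k - 1,
            # so A[i][j] lies inside the band. Each element does yo = a*x + yi.
            y_reg[k] = (i, acc + A[i][j] * x[j])
        # A partial sum that has reached the left-most element is complete.
        if y_reg[0] is not None:
            i, acc = y_reg[0]
            y[i] = acc
    return y


if __name__ == "__main__":
    A = [[1, 2, 0, 0],
         [3, 4, 5, 0],
         [0, 6, 7, 8],
         [0, 0, 9, 10]]
    print(systolic_tridiag_matvec(A, [1, 2, 3, 4]))
```

With this schedule the first row accumulates a11 x1 and then a12 x2 on consecutive clocks while the second row starts with a21 x1, which is the same order of partial results as in the band matrix slides.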
Data flow algorithm
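[Figure: data flow graph for (a+b)x(c+(dxe)), with inputs a, b, c, d, e feeding two "+" nodes and two "x" nodes.]
A process is activated
when its input tokens (data) become available.
(a+b)x(c+(dxe))
The overhead of synchronization is large.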
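The token-driven firing rule can be sketched for the expression (a+b)x(c+(dxe)). The graph encoding, the node names t1, t2, t3, out, and the function run_dataflow are hypothetical; the sketch fires every node whose input tokens are all present, treating nodes that become ready in the same step as running in parallel.

```python
import operator

# Each node: name -> (operation, names of the arcs that supply its inputs).
GRAPH = {
    "t1": (operator.add, ["a", "b"]),      # a + b
    "t2": (operator.mul, ["d", "e"]),      # d x e
    "t3": (operator.add, ["c", "t2"]),     # c + (d x e)
    "out": (operator.mul, ["t1", "t3"]),   # (a + b) x (c + (d x e))
}


def run_dataflow(graph, initial_tokens):
    """Fire nodes as their input tokens become available.

    Returns (all tokens produced, list of node groups fired per step).
    """
    tokens = dict(initial_tokens)
    pending = set(graph)
    fired = []
    while pending:
        # A node is ready when every one of its input arcs carries a token.
        ready = sorted(n for n in pending
                       if all(arc in tokens for arc in graph[n][1]))
        if not ready:
            raise ValueError("deadlock: some input tokens never arrive")
        # All ready nodes fire in the same step (in parallel in hardware).
        produced = {n: graph[n][0](*(tokens[arc] for arc in graph[n][1]))
                    for n in ready}
        tokens.update(produced)
        pending -= set(ready)
        fired.append(ready)
    return tokens, fired


if __name__ == "__main__":
    tokens, fired = run_dataflow(GRAPH, {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
    print(tokens["out"], fired)
```

The check performed before every firing is a software stand-in for the synchronization that the hardware must carry out for each node.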
Data flow analysis and hardware generation
[Figure: flow relating a data flow language, a data flow graph, graph decomposition, an HDL description, and configuration data.]
Suitable for automatic generation of hardware
System Examples

Stand-alone type

Memory structure is a key for classification






Almost no memory : SPLASH / SPLASH II
Local memory for an FPGA: RM IV, ReCSiP
Shared memory for FPGAs: RASH
NUMA: CRAY-XD1, RASC
Homogeneous/Heterogeneous
Co-processor type → Next lesson
Splash-2 (Arnold et al., 1992)

String matching, image
processing, DNA
matching: 330 times
faster than the
supercomputer Cray-2.
Systolic algorithm
VHDL, Parallel C
Annapolis Micro
Systems (WILDFIRE)
RM-IV (Kobe Univ.)
[Figure: an array of FPGAs, each with its own local memory (mem.), together with an FPIC and an interface.]

System with more than one unit
[Figure: RASH units, each containing EXE boards and a CPU board on a CompactPCI bus with disk and CD drives, connected to a console over an Ethernet LAN.]
A stand-alone PC or one of the CPU boards serves as the system
console.
This slide is supported by Dr.Nakajima of Mitsubishi.
ATTRACTOR(NTT)
[Figure: ATTRACTOR system. I/O boards (OC-3 x 4, O/E, SDH/ATM, buffers), an 8x8 ATM-SW board with buffers, FPGA boards (LUT, CAM, SRAM), a FIFO buffer board, and an MPU board with memories, each with a RISC card and Ethernet. Boards are linked by 1-Gbps serial links (front panel wiring), a 32-bit local bus, and 64-bit CompactPCI. Backboard wiring: system clock, point-to-point connect, power/ground. A user terminal (PC/WS) attached via serial, Ethernet, or SCSI provides system control, user interface, signal probing, and diagnosis.]
Flexible System Structure
•Board-level function independency (RISC card/board)
•Flexible board functions constructed using FPGAs
•1-Gbps/ch inter-board connections using coax cables
This slide is supported by Dr.Miyazaki of NTT.
IP Cut-through Router on ATTRACTOR
This slide is supported by Dr.Miyazaki of NTT.
CRAY XD-1 (CRAY Inc.)
AMD Opteron 2.2GHz Dual Core +
XC2VP70 / XC4VLX160 / XC4VSX55
[Figure: nodes with 8GB of memory each, connected through switches (SW) to the RapidArray Interconnect System.]
ReCSiP (Keio Univ.)
Accelerator for bioinformatics
Powerful simultaneous access to external RAMs
[Figure: board with a Virtex-II XC2V6000, local clock generator, 64MB SDRAM, 4MB SSRAMs, and a 64-bit local bus; a QuickPCI PCI controller connects to the 64bit/66MHz PCI bus; configuration via USB, configuration control via PCI.]
Applications

No flexible program change
No IEEE standard floating point → Recently, some systems
are used in scientific applications.
Not memory bound
Image processing, analysis, pattern matching
Logic simulation, fault simulation
Neural network simulation
Encryption / decryption
Queuing models, Markov analysis
Electric power flow
Sensor processing
Efficient use of on-the-fly processing:
Communication control, protocol control
Software radio
Historical flow of computer systems
ENIAC
EDVAC、EDSAC
IBM machines
Reconfigurable
Machine
RISC, Intel’s microprocessors
Glossary 2



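Systolic Algorithm/Array: "Systolic" refers to the beating of the heart.
It is a parallel processing method (array) in which data is fed in at fixed
intervals, like a heartbeat, and is a kind of VLSI algorithm. Research
advanced in the 1980s, and it later came to be used frequently on
reconfigurable systems. Here, band-matrix processing is taken as the example.
Data-flow Model: a parallel processing method in which processing is
triggered by the flow of data. This was also studied actively in the 1980s,
mainly in Japan, and later came to be used on reconfigurable systems.
Encryption/Decryption: one of the areas in which reconfigurable systems
excel.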
Exercise

Consider a systolic array that multiplies an 8 x
8 tri-diagonal matrix A by a vector x of size 8.
Compute the number of clock cycles required for the
multiplication. Assume that the first element
of x reaches the left-most element of the array
at time 0.