Application domain specific processors (ADSP or ASIP)

Platform Design
ASIP
Application Specific Instruction-set Processor
TU/e 5kk70
Henk Corporaal Bart Mesman
Application domain specific processors (ADSP or ASIP)
[Figure: spectrum of processor types, from programmable CPU and programmable DSP through application domain specific processors to application specific processors, set against flexibility and efficiency.]
4/13/2015
Platform Design
H.Corporaal and B. Mesman
2
Application domain specific processors (ADSP or ASIP)
takes a well defined application domain as a starting point
• exploits characteristics of the domain (computation kernels)
• still programmable within the domain
e.g. MPEG-2 coding (uses an 8×8 DCT), DECT, GSM, etc.
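The 8×8 DCT named above is a typical computation kernel that a domain-specific processor would map onto dedicated hardware. As a rough illustration of what such a kernel computes, here is a minimal Python sketch. It assumes the orthonormal DCT-II definition (the slide does not say which variant is meant), and the function names `dct_1d` and `dct_2d` are illustrative only.

```python
import math

N = 8  # block size of the 8x8 transform

def dct_1d(v):
    """Orthonormal DCT-II of a length-N sequence (assumed variant)."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        acc = sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                  for n in range(N))
        out.append(scale * acc)
    return out

def dct_2d(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    # cols[j][k] is coefficient (k, j); transpose back to row-major order
    return [[cols[j][k] for j in range(N)] for k in range(N)]

# A constant block has all its energy in the DC coefficient.
flat = [[10.0] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))
```

The inner multiply-accumulate loop is the part that repeats for every block, which is why it is a natural candidate for a domain-tuned functional unit.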
                GP implementation                ADSP implementation
properties      performance: clock speed + ILP   ILP, DLP, tuning to domain
                flexible dev. (new apps.)        cost effective (high volume)
problems        manual design, large effort      - specification
                                                 - design time and effort
                                                 => synthesized cores
Source: www.adelantetech.com

Part | Description | Size (gates) | Clock (MHz) | ROM (Kbyte) | RAM (Kbyte)

Speech Components
ADPCM | Full duplex ITU-T G.726 compliant and 40 kbit/s speech-compression encoder/decoder. | 5,100 | 4 | 1.3 | 0.128
ADPCM-16 | Full duplex 16 Channel ITU-T G.726 compliant 16, 24, 32 and 40 kbit/s speech-compression encoder/decoder. | 10,200 | 32 | 1.3 | 2.048
IW-ASR Speech Recognition | Template-based speaker-dependent, isolated-word automatic speech recognition | 9,000 | 1.3 | 6 | approx. 1 kbyte/word
G.723.1 | Low bit-rate ITU-T G.723.1 compliant speech-compression at 6.3 kbit/s; can be combined with G.723.1A. | 24,000 | 20 | 22 | 2.3
G.723.1A | Extended version of G.723.1 to reduce bit rate by a silence compression scheme. Uses voice activity detection and comfort-noise generation. Fully compliant with Annex A of speech-compression standard CODEC G.723.1. Yields no additional hardware cost. | 24,000 | 20 | 22 | 2.3
Speech Synthesis | Phrase-concatenated speech synthesis | Depends on compression requirements

Telecommunications
Echo Cancellation | High-performance echo-cancellation and suppression processor. | 6,000 | 4 | 2.80 | 0.15
DTMF | Full-duplex DTMF transceiver. | 4,000 | 2 | 1.00 | 0.15
Caller-ID | On-hook and off-hook caller line identification. Includes DTMF and V.23. | 6,000 | 3 | 2.10 | 0.15
Reed-Solomon | Full-duplex Reed-Solomon codec | 7,000 | - | 3.75 | 0.15
Viterbi Decoder | Configurable rate, code and constraint-length. Configurable traceback depth. Supports soft & hard decision making. Supports code puncturing. | 5,000 to 9,000 | depending on throughput | -- | ---
V.23 modem | ITU-T V.23 compliant 1200 baud FSK modem | 6,000 | - | 0.80 | 0.15

Other
Pink Noise Generator | Low-ripple pink noise filter with filter characteristic of -3 ± 0.08 dB per octave over the bandwidth 20 Hz to 20 kHz | 4,000 | - | 0.10 | 0.10
CCIR 656/601 | Digital video converter: CCIR to raw-video data and vice versa. | 1,500 | - | none | none
Design process
[Figure: design-process flowchart. The application(s) and a processor model instance (e.g. VLIW with shared RFs, plus parameters) feed two estimation paths: SW (code generation), giving estimations of cycles/alg and occupation, and HW design, giving estimations of nsec/cycle, area and power/instr. The loop is closed by two decisions, "OK?" and "more appl.?"; the exit is "go to phase 2".]

3 phases
1. exploration
2. hw design (layout) + processing
3. design appl. sw

Fast, accurate and early feedback
Problem statement
A compiler is retargetable if it can generate code for a ‘new’
processor architecture specified in a machine description file.
A guarded register transfer pattern (GRTP) is a register transfer
pattern (RTP) together with the control bits of the instruction word
that control the RTP.
a := b + c | instr = xxxx0101
GRTPs contain all inter-RT-conflict information.
Instruction set extraction (ISE) is the process of generating all
possible GRTPs for a specific processor.
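To make the role of the guard bits concrete, the following Python sketch models a GRTP as a register transfer plus a bit pattern over `0`, `1` and `x` (don't care), and tests whether two GRTPs can be encoded in the same instruction word. This is an illustrative reading of the definition above, not the algorithm of any particular tool. The class name `GRTP`, the helpers `compatible` and `merge`, and the second and third example patterns are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GRTP:
    transfer: str  # register transfer pattern, e.g. "a := b + c"
    guard: str     # instruction bits: '0', '1' or 'x' (don't care)

def compatible(p: GRTP, q: GRTP) -> bool:
    """True if no instruction bit is required to be both 0 and 1."""
    assert len(p.guard) == len(q.guard)
    return all(a == b or "x" in (a, b) for a, b in zip(p.guard, q.guard))

def merge(p: GRTP, q: GRTP) -> str:
    """Instruction-bit pattern that activates both GRTPs at once."""
    assert compatible(p, q)
    return "".join(b if a == "x" else a for a, b in zip(p.guard, q.guard))

add = GRTP("a := b + c", "xxxx0101")   # pattern from the slide
sub = GRTP("a := b - c", "xxxx0100")   # hypothetical: differs in bit 0
mov = GRTP("d := e", "1xxxxxxx")       # hypothetical: uses another field

print(compatible(add, sub))  # the two patterns disagree on bit 0
print(compatible(add, mov), merge(add, mov))
```

Under this model a conflict between register transfers shows up as a bit position where one guard demands 0 and the other demands 1, which is how the guards can carry the inter-RT-conflict information.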
Problem statement
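[Figure: compilation flow. The algorithm spec passes through the front end (FE) to a CDFG; the processor spec (instance) passes through ISE to the GRTPs; code generation takes the CDFG and the GRTPs and produces machine code. Annotation at ISE: in ch. 4 this is part of the code generator.]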
Example: Simple processor [Leupers]
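[Figure: datapath of the simple processor with input Inp, RAM, program counter PC with a +1 incrementer, instruction memory IM, register REG and output outp. The instruction word I.(20:0) is split into the fields I.(20:13), I.(12:5), I.(4), I.(3:2) and I.(1:0).]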
Example: Simple processor [Leupers]
Instruction                         Instruction bits (20..0)
PC := PC + 1                        xxxxxxxxxxxxxxxxxxxxx
REG := Inp                          xxxxxxxxxxxxxxxxx011x
REG := IM[PC].(20..13)              xxxxxxxxxxxxxxxxx001x
REG := RAM[IM[PC].(12..5)]          xxxxxxxxxxxxxxxxx1x1x
REG := REG - Inp                    xxxxxxxxxxxxxxxxx0101
REG := REG - IM[PC].(20..13)        xxxxxxxxxxxxxxxxx0001
REG := REG - RAM[IM[PC].(12..5)]    xxxxxxxxxxxxxxxxx1x01
REG := REG + Inp                    xxxxxxxxxxxxxxxxx0100
REG := REG + IM[PC].(20..13)        xxxxxxxxxxxxxxxxx0000
REG := REG + RAM[IM[PC].(12..5)]    xxxxxxxxxxxxxxxxx1x00
RAM[IM[PC].(12..5)] := REG          xxxxxxxxxxxxxxxx1xxxx
outp := REG                         xxxxxxxxxxxxxxxxxxxxx
RAM_NOP                             xxxxxxxxxxxxxxxx0xxxx
ASIP/VLIW architectures
A|RT designer template as an example (= set of rules, a model)
Differences with the VLIW processors of ch. 4
1. Parallel FUs
• ASUs = complex application-specific FUs (beyond subword parallelism), e.g. biquad, median, DCT, etc.
• larger grain size, more heterogeneous, more pipelines
2. Register files
• many register files (>5 vs. 1 or 2)
• limited number of ports (3 vs. 15)
• limited size (<16 vs. 128)
3. Issue slots
• all in parallel vs. 5
[Figure: architecture template with register files RF1 to RF8, functional units FU1 to FU4, flags, instruction registers IR1 to IR4, instruction memory and control.]
ASIP/VLIW architectures
Additional characteristics of the A|RT designer template
• interconnect network: busses + input multiplexers
  – mux control is part of the instruction
  – control can change every clock cycle
  – network can be incomplete
  – busses can be merged
• memories are modeled as FUs
  – separate data in and data out
  – 2 inputs (data in and address) and 1 output
• each FU can generate one or more flags
• instruction format (per issue slot):
  mux 1 | mux 2 | read address RF 1 | write address RF 1 | read address RF 2 | write address RF 2 | FU control | output drivers
ASIP/VLIW architectures: example
[Figure: example datapath with an ALU and a MAC, register files RF1 to RF4, and two busses (bus1, bus2). The 20-bit instruction word holds the ALU instruction in bits 19..10 (mux, read RF1, write RF1, read RF2, write RF2) and the MAC instruction in bits 9..0 (mux, read RF3, write RF3, read RF4, write RF4).]
ASIP/VLIW architectures : example
GRTP                   Instruction bits 19..10   Instruction bits 9..0
RF1 = ALU (RF1, RF2)   xccccxxccc                xxxxxxxxxx
RF2 = ALU (RF1, RF2)   xcxccccccc                xxxxxxxxxx
RF3 = ALU (RF1, RF2)   xcxccxxccc                cxxccxxxxx
RF3 = MAC (RF3, RF4)   xxxxxxxxxx                ccccccxccc
RF4 = MAC (RF3, RF4)   xxxxxxxxxx                xccxxccccc
RF2 = MAC (RF3, RF4)   cxxxxccxxx                xccxxcxccc
ASIP/VLIW architectures: design flow
[Figure: design-flow diagram. The algorithm spec goes through datapath synthesis and controller synthesis to estimations of area, power and timing, followed by an "OK?" decision (yes/no). A feedback path is labelled "Change", with "RTs" and "pragmas".]
Pragmas shown in the figure:
assign ( a+b, ALU, fu_alu1)
assign ( a+_, ALU, fu_alu2)
assign ( _+_, ALU, fu_alu3)
RT shown in the figure:
RF1 : x = RF2 : y, RF3 : z | ALU = ADD, Inmux = bus2
VLIW makes relatively simple code selection possible
ASIP/VLIW architectures: list scheduling
[Figure: list-scheduling example on a dataflow graph of multiply (*) and add (+) operations between the input buffer (IPB) and output buffer (OPB). It shows the candidate list, the conflict and priority computation, and the operations scheduled per cycle on the MULT and ALU units.]
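The steps named in the list-scheduling figure (build a candidate list, resolve conflicts by priority, commit the scheduled operations) can be sketched as follows. This is a generic textbook-style list scheduler, not the scheduler of the tool shown in the slides. It assumes single-cycle operations, an acyclic graph, at least one unit per operation type, and a priority equal to the longest path to the end of the graph. The operation names and the small graph are made up for the example.

```python
def list_schedule(ops, deps, units):
    """ops: {name: unit type}; deps: {name: [predecessors]};
    units: {unit type: number of units}. Returns {name: cycle}."""
    succs = {o: [] for o in ops}
    for o, preds in deps.items():
        for p in preds:
            succs[p].append(o)

    prio = {}
    def height(o):
        # priority = length of the longest path from o to a sink
        if o not in prio:
            prio[o] = 1 + max((height(s) for s in succs[o]), default=0)
        return prio[o]
    for o in ops:
        height(o)

    schedule, cycle = {}, 0
    while len(schedule) < len(ops):
        # candidate list: unscheduled ops whose inputs are already available
        candidates = [o for o in ops if o not in schedule and
                      all(p in schedule and schedule[p] < cycle
                          for p in deps.get(o, []))]
        candidates.sort(key=lambda o: (-prio[o], o))
        free = dict(units)
        for o in candidates:          # highest priority first
            if free.get(ops[o], 0) > 0:
                free[ops[o]] -= 1     # resource conflict check
                schedule[o] = cycle
        cycle += 1
    return schedule

ops = {"m1": "MULT", "m2": "MULT", "m3": "MULT", "a1": "ALU", "a2": "ALU"}
deps = {"a1": ["m1", "m2"], "a2": ["a1", "m3"]}
result = list_schedule(ops, deps, {"MULT": 1, "ALU": 1})
print(result)
```

With one multiplier and one ALU, the three multiplications are serialized and the additions fill the ALU as soon as their operands are ready, so the priority function decides which multiplication goes first.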
ASIP/VLIW architectures: feedback
[Figure: tool screenshots giving feedback: resource load, architecture view, cycle-count, bus load, life-time analysis.]
Low power aspects
[Figure: block diagram with the Implementation Independent Design Database, Mistral2, the Estimation Database and the architecture.]
• Estimation of area, speed and power

EXU         ACTIVITY  AREA   POWER
alu_1       20%       261    105
acs_asu_1   83%       2382   3816
or_asu_1    10%       611    122
romctrl_1   16%       65     21
acu_1       36%       294    205
ipb_1       20%       107    43
opb_1       11%       163    35
ctrl        -         1864   3597
total       -         5747   7944
GSM viterbi decoder : default solution
Cycle count: 13750

EXU         ACTIV  AREA    POWER
alu_1       96%    3469    46196
romctrl_1   48%    39      259
acu_1       26%    327     1209
ipb_1       5%     131     105
opb_1       23%    1804    5801
ctrl        -      9821    135035
total       -      15591   188605

• controller responsible for 70% of power consumption
  – maximum resource-sharing
  – heavy decision-making: "main" loop with 16 metrics-computations per iteration
• EXU-numbers include registers for local storage
GSM viterbi decoder : no loop-folding
Cycle count: 14247

EXU         ACTIV  AREA    POWER
alu_1       92%    3411    45073
romctrl_1   45%    39      255
acu_1       25%    294     1087
ipb_1       5%     107     86
opb_1       22%    1661    5340
ctrl        -      4919    70087
total       -      10431   121928

• area down by 33%
• power down by 35%
• next step: reduce # of program-steps with second ALU
GSM viterbi decoder : 2 ALUs
Cycle count: 9739

EXU         ACTIV  AREA    POWER
alu_1       69%    1797    12248
alu_2       65%    1393    8916
romctrl_1   67%    39      255
acu_1       37%    294     1087
ipb_1       8%     149     119
opb_1       33%    2136    6871
ctrl        -      8957    87235
total       -      14766   116731

• cycle count down 30%
• area up 42%
• power down by 5%
• next step: introduce ASU to reduce ALU-load
GSM viterbi decoder : 1 x ACS-ASU
func ACS ( M1, M2, d ) MS, MS8 =
begin
MS = if ( M1+d > M2-d ) -> ( M1+d) || ( M2-d) fi;
MS8 = if ( M1- d > M2+d) -> ( M1- d) || ( M2+d) fi;
end;
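For readers unfamiliar with the notation, the ACS (add-compare-select) function above can be mirrored in Python as below. The sketch assumes that `if c -> a || b fi` means "a if c holds, otherwise b" and that the metrics are plain integers; the function name `acs` and the sample values are illustrative.

```python
def acs(m1, m2, d):
    """Add-compare-select for one butterfly: returns (MS, MS8)."""
    # MS: larger of the two candidate path metrics M1+d and M2-d
    ms = (m1 + d) if (m1 + d) > (m2 - d) else (m2 - d)
    # MS8: larger of the two candidate path metrics M1-d and M2+d
    ms8 = (m1 - d) if (m1 - d) > (m2 + d) else (m2 + d)
    return ms, ms8

print(acs(10, 12, 3))
```

Each call performs four additions, two comparisons and two selections, which a single ALU has to execute one after another; this is the work the ACS-ASU performs as one operation.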
Cycle count: 1930

EXU         ACTIV  AREA   POWER
alu_1       20%    261    105
acs_asu_1   83%    2382   3816
or_asu_1    10%    611    122
romctrl_1   16%    65     21
acu_1       36%    294    205
ipb_1       20%    107    43
opb_1       11%    163    35
ctrl        -      1864   3597
total       -      5747   7944

• cycle count down 5X
• power down 20X !
GSM viterbi decoder : 4 x ACS-ASU
Cycle count: 425

EXU          ACTIV  AREA   POWER
alu_1        94%    243    97
acs_asu_1    95%    1041   420
acs_asu_2    95%    1041   420
acs_asu_3    95%    1041   420
acs_asu_4    95%    1041   420
split_asu_1  47%    90     18
or_asu_1     47%    592    118
romctrl_1    28%    48     6
acu_1        98%    212    85
ipb_1        23%    60     6
opb_1        50%    369    80
ctrl         -      1306   555
total        -      7084   2645

• cycle count down another 5X
• area up 23%
• power down another 3X !
GSM viterbi example : summary
[Figure: bar chart of power, area and cycles (axis 0 to 20000) for the five design points default, loop, 2 ALU, 1 ACS and 4 ACS, annotated "72x !". The Implementation Independent Design Database and Mistral2 blocks are shown alongside.]
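The overall gains can be recomputed from the totals in the per-design tables. The short Python sketch below does this arithmetic. It assumes that the standalone number on each Viterbi slide (13750, 14247, 9739, 1930, 425) is the cycle count, which the slides suggest but do not label explicitly; the dictionary keys are descriptive names chosen for the example.

```python
# (cycles, total area, total power) per design point, taken from the tables
points = {
    "default":         (13750, 15591, 188605),
    "no loop-folding": (14247, 10431, 121928),
    "2 ALUs":          (9739, 14766, 116731),
    "1 x ACS-ASU":     (1930, 5747, 7944),
    "4 x ACS-ASU":     (425, 7084, 2645),
}

base = points["default"]
# improvement factor of each design point relative to the default solution
ratios = {name: tuple(b / v for b, v in zip(base, vals))
          for name, vals in points.items()}

for name, (c, a, p) in ratios.items():
    print(f"{name:16s} cycles x{c:5.1f}  area x{a:4.2f}  power x{p:5.1f}")
```

From these totals the power of the last design point is about 71 times lower than the default, close to the 72x annotation in the chart, although the chart does not say which quantity the annotation refers to.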
Discussion: phase 3
[Figure: the design-process flowchart split in two. Exploration phase: application(s) and processor model feed SW (code generation) and HW design, iterating on "OK?" and "more appl.?". Then the processor model is frozen, and application software development uses SW (code generation) with its own "OK?" check: constraint driven compilation.]
Discussion: problems with VLIWs
code size and instruction bandwidth
• code compaction = reduce code size after scheduling
  possible compaction ratio?
  e.g. $p_0 = 0.9$ and $p_1 = 0.1$
  information content (entropy) $= -\sum_i p_i \log_2 p_i = 0.47$ bits per bit
  maximum compression factor $\approx 2$
• control parallelism during scheduling = switch between different processor models (10% of code = 90% runtime)
• architecture
  reduce number of control bits for operand addresses
  e.g. 128 reg (TM) -> 28 bits/issue slot for addresses only
  => use stacks and fifos
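The two numeric claims above can be checked with a few lines of Python. The entropy part follows directly from the stated probabilities. The address-bit part assumes four register addresses per issue slot, which is one way to arrive at 28 bits with 128 registers; the slide does not give this breakdown, so treat it as an assumption.

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Instruction bits that are 0 with probability 0.9 and 1 with probability 0.1
h = entropy([0.9, 0.1])
max_factor = 1.0 / h  # each stored bit carries only about h bits of information

# Operand addressing cost per issue slot
regs = 128
addresses_per_slot = 4  # assumption, not stated on the slide
addr_bits = addresses_per_slot * math.ceil(math.log2(regs))

print(f"entropy = {h:.2f} bits, max compression factor = {max_factor:.2f}")
print(f"address bits per issue slot = {addr_bits}")
```

The bound of roughly 2 holds only for an ideal coder on independent bits with these probabilities, so practical code compaction would be expected to achieve less.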
[Figure: architecture with register files RF1 to RF4, functional units FU1 to FU4, flags, instruction registers IR1 to IR4, instruction memory and control.]
Conclusions
• ASIPs provide efficient solutions for well-defined application
domains (2 orders of magnitude higher efficiency).
• The methodology is interesting for IP creation.
• The key problem is retargetable compilation.
• A (distributed) VLIW model is a good compromise between
HW and SW.
• Although an automatic process can generate a default
solution, the process usually is interactive and iterative for
efficiency reasons. The key is fast and accurate feedback.