Co-processor for CalmRISC Use


Progress on media processor design
Presented by Chunyue Liu
Xiaolang Yan, Xing Qin, Jian Yang, Xiaohua Luo, Peiyong Zhang, Dake Liu
Embedded DSP Research & Develop Group
Outline

Overview of media processor

Progress on Spock

Progress on Schubert
- Overview
- Key features
- Performance

Conclusions & Problems
Background and Challenges

Media applications have very high computation complexity
- H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS
A media processor is in demand
- Some state-of-the-art media processors (e.g. Nomadik, DaVinci)
Multiple standards coexist
- Flexible & programmable
Our current IC design level constraint ([email protected])
ASIP is the best choice
[Figure: a Nomadik-style platform (ARM core plus audio and video accelerators on a communication network) next to our proposal at IC-DFN’05: General MCU, Enhanced DSP, Vector Processor]
Overview of media processor

Programmable and heterogeneous processors on a SoC platform
- General MCU (CK510, a 32-bit RISC core)
  Interface (GUI), OS (Linux)
- Enhanced DSP (Spock)
  Audio processing, bitstream parsing, data transferring
- Vector processor (Schubert)
  Video processing
[Block diagram: Schubert core with matrix memory and matrix memory controller; Spock with DM, TM and external bus interface; CK510; DMA; AMBA bus; SDRAM controller to off-chip memory; media access controller; mailbox; peripheral]
Progress on Spock

Developed tool chain
- Assembler, Simulator and Debugger
FPGA prototype: real-time decoding
- 128 kb/s OGG @ 40 MHz
To test Spock, a dual-core SoC platform is developed
- Integrated with CK510
- Inter-processor communication uses mailbox and shared memory
- 0.18 um, less than 500 mW, 166 MHz
- CK510 core area: 2 x 2 mm2
- Spock core area: 1.5 x 1.5 mm2
Overview of Spock

Optimization for Control
- Branch optimization: conditional execution, 2-level hardware loop, repeat
Optimization for Signal Processing
- Multiple addressing modes: post-increment/decrement, reverse and modulo addressing
- MAC with parallel load
- VLX instruction set extension: putbits, showbits, getbits, etc.
[Pipeline diagram: stages IF, ID, RF, EX1, EX2/MEM. Labelled blocks: PC, MUX, PCFSM, PM, external bus, pc+2, br, decode, GPR dependency table, issuing logic, operand bypass, alu, mul, vlx, address adder, write buffer, acc, DM/TM, aligner]
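The VLX instruction names (putbits, showbits, getbits) match the usual bitstream primitives of audio and video codecs. The following is a minimal C sketch of what such primitives compute in software. It assumes MSB-first bit order; the `bit_reader` and `bit_writer` types and all function signatures are hypothetical and do not describe Spock's actual instruction semantics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader; not Spock's actual VLX semantics. */
typedef struct {
    const uint8_t *buf;  /* bitstream bytes */
    size_t size;         /* buffer length in bytes */
    size_t pos;          /* current read position in bits */
} bit_reader;

/* Hypothetical MSB-first bit writer; buf must start zero-filled. */
typedef struct {
    uint8_t *buf;
    size_t size;
    size_t pos;          /* current write position in bits */
} bit_writer;

/* showbits: return the next n bits (1..32) without consuming them.
 * Bits past the end of the buffer read as zero. */
static uint32_t showbits(const bit_reader *br, unsigned n)
{
    uint32_t value = 0;
    assert(n >= 1 && n <= 32);
    for (unsigned i = 0; i < n; i++) {
        size_t bit = br->pos + i;
        unsigned b = 0;
        if (bit / 8 < br->size)
            b = (br->buf[bit / 8] >> (7 - bit % 8)) & 1u;
        value = (value << 1) | b;
    }
    return value;
}

/* getbits: return the next n bits and advance the read position. */
static uint32_t getbits(bit_reader *br, unsigned n)
{
    uint32_t value = showbits(br, n);
    br->pos += n;
    return value;
}

/* putbits: append the low n bits of value, most significant bit first.
 * Bits that would fall past the end of the buffer are dropped. */
static void putbits(bit_writer *bw, uint32_t value, unsigned n)
{
    assert(n >= 1 && n <= 32);
    for (unsigned i = 0; i < n; i++) {
        size_t bit = bw->pos + i;
        unsigned b = (value >> (n - 1 - i)) & 1u;
        if (bit / 8 < bw->size)
            bw->buf[bit / 8] |= (uint8_t)(b << (7 - bit % 8));
    }
    bw->pos += n;
}
```

A variable-length decoder typically peeks with showbits, looks up the code length, then consumes exactly that many bits with getbits. Each of these steps is a bit-by-bit loop in plain C, which suggests why a DSP for bitstream parsing would offer them as dedicated instructions.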
Progress on Schubert

Released 316 novel instructions
- SIMD and RISC
Developed tool chain
- Assembler
- Cycle-accurate simulator
Mapped kernels
- H.264/AVC: IT/IIT, intra/inter-prediction, de-blocking, motion estimation
- MPEG2: DCT, motion compensation
Micro-architecture is designed
- Estimated area: 3.5 x 3.5 [email protected] with a 70 KB SRAM
[Figure: Design Methodology flow chart. Boxes: Application coverage to function coverage; SW-HW partition: 10%-90% locality; Assembly instruction set specification; Design of assembler and simulator; Build golden model; Benchmark instruction set; Behavior function verification; decision "Good performance?"; Micro-architecture design; RTL coding; Design for test; Backend design; RTL code verification; Test chip fabrication & test board prototype]
Key features of Schubert




Dual clusters and dual coupling pipelines
- SIMD combined with VLIW architecture
Explicit Data Organization SIMD (EDO-SIMD)
2-Dimensional and byte-align addressing storage
Cycle accurate instruction set simulator
Dual clusters and dual coupling pipelines

Two clusters:
- Cluster0: Computation (+/-, *, &, >/<, etc.)
- Cluster1: Data conversion & LD/ST
- Based on Decoupled Access & Execution (DAE)
Two pipelines:
- Each cluster holds its own execution-level pipeline
- Both share the IF & ID level pipeline
Advantages
- Parallelize computation operations with non-computation operations
- Perform well on cycle count
[Pipeline diagram: shared front end IF, DP, UD (labelled instruction fetch and instruction decoder); Cluster0 execution pipeline RF0, EX0, EX1, EX2, EX3, WB0; Cluster1 execution pipeline RF1, AD0/PERM, AD1, MEM, WB1]
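The cycle-count advantage can be illustrated with a toy issue model. This is a sketch under strong assumptions and not the Schubert issue logic: operations are tagged only as compute (Cluster0) or access (Cluster1), two adjacent operations of different classes are assumed to issue in the same cycle, and data dependencies and stalls are ignored. The `op_class` type and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op classes; the real Schubert encoding is not given in the slides. */
typedef enum { OP_COMPUTE, OP_ACCESS } op_class;

/* Single-issue baseline: one operation per cycle. */
static size_t cycles_single_issue(size_t n_ops)
{
    return n_ops;
}

/* Toy dual-cluster model: walking the stream in program order, a compute op
 * and an access op that are adjacent issue in the same cycle, one per cluster.
 * Data dependencies and pipeline stalls are ignored. */
static size_t cycles_dual_cluster(const op_class *ops, size_t n_ops)
{
    size_t cycles = 0;
    size_t i = 0;
    assert(ops != NULL || n_ops == 0);
    while (i < n_ops) {
        if (i + 1 < n_ops && ops[i] != ops[i + 1])
            i += 2;  /* both clusters are busy in this cycle */
        else
            i += 1;  /* only one cluster is busy */
        cycles++;
    }
    return cycles;
}
```

In this model a kernel that alternates loads with arithmetic approaches half the single-issue cycle count, while a run of operations from one class gains nothing.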
Dual clusters and dual coupling pipelines
[Datapath diagram: General Register File feeding Cluster 0 (RF0, EX0, EX1, EX2, EX3, WB, ACC registers) and Cluster 1 (RF1, AD0/PERM, ADG, AD1, MEM, WB, ACC), with Memory attached to the MEM stage]
Explicit Data Organization SIMD ISA

Bottleneck of conventional SIMD ISA
- SIMD is inefficient if sub-word data are not aligned with each other
- SIMD is less flexible than VLIW
Related works
- Complex streamed instruction, Delft TU
- Stream buffer, Stream processor, Stanford University
- Indirect register addressing, Elite project, IBM

Cycle percent of conventional SIMD ISA

SIMD class     VIS       MMX/SSE   AltiVec
Ld/St          11.70%    21.00%    17.90%
Organize       9.70%     12.60%    17%
Integer ALU    13.60%    18.80%    11.80%
Float ALU      --        9.30%     6.90%

Callouts on the table: "This overhead is reduced by Dual-Cluster" and "How to reduce this overhead?"
Explicit Data Organization SIMD ISA

Proposed EDO-SIMD ISA
- Explicit data organization information (e.g. 3x8|3:4:7:0:1:2:6:5)
  Indicates operand relations (align, merge, extract, broadcast, cross)
- Append permutation network onto the RF pipeline of Cluster0
- Add permutation pipeline in Cluster1 in parallel with AD0
Advantages
- Merge organization with computation to reduce overhead
- As flexible as VLIW
- Simplified implementation
Example: vOADD vR2<3x8|3:4:7:0:1:2:6:5>, vR1, vR0
[Figure: worked example of the instruction above, showing hex sub-word values of the source registers, the pattern 3:4:7:0:1:2:6:5, eight lane-wise adders and the result in vR2. Kernels labelled: DCT, intra predict, interpolate, IIT]
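The idea of merging data organization with computation can be sketched in scalar C. The slides do not define the exact meaning of `<3x8|3:4:7:0:1:2:6:5>`, so the semantics here are an assumption: lane i of the result is lane `pattern[i]` of the first source plus lane i of the second source, with 8-bit wrap-around. `permuted_add` and `shuffle_then_add` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Fused form, assumed semantics (not confirmed by the slides):
 * out[i] = a[pattern[i]] + b[i], with 8-bit wrap-around. */
static void permuted_add(const uint8_t a[LANES], const uint8_t b[LANES],
                         const uint8_t pattern[LANES], uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        assert(pattern[i] < LANES);
        out[i] = (uint8_t)(a[pattern[i]] + b[i]);
    }
}

/* Conventional SIMD for comparison: a separate "organize" pass through a
 * temporary register, followed by a plain lane-wise add. */
static void shuffle_then_add(const uint8_t a[LANES], const uint8_t b[LANES],
                             const uint8_t pattern[LANES], uint8_t out[LANES])
{
    uint8_t tmp[LANES];
    for (int i = 0; i < LANES; i++) {
        assert(pattern[i] < LANES);
        tmp[i] = a[pattern[i]];             /* organize step */
    }
    for (int i = 0; i < LANES; i++)
        out[i] = (uint8_t)(tmp[i] + b[i]);  /* compute step */
}
```

Both functions return the same result. The second spends an extra pass and a temporary register on reordering, which is the "Organize" share in the cycle breakdown of conventional SIMD ISAs.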
2-D stream storage and addressing

Multimedia temporal data behavior
- 2-D block by block
- Row and column access
- Byte alignment
- Flexible block jumping
Conventional 1-D addressing imposes burdens on computation elements for address generation and address alignment tasks
Related works
- Linear addressing with circular buffer, Blackfin
- Special transpose unit, TriMedia
[Figure: block jump, row access and column access over blocks B0 and B1, with offsets ox0, ox1, ox2 and oy0, oy1, oy2]
2-D stream storage and addressing

Proposed storage and addressing mode
- 2-D stream storage (base, 2-D stride, 2-D offset)
- Row and interleave data arrangement (row access & column access)
- Base update for block jump (UPDATE B0, OX0, OY0, B0)
- C-like programming model is friendly to programmer
  asm: vLDOBR B0, 4, 2, vR0;
  C:   for(i=0; i<8; i++)
           r[i] = b[2][4+i];
Advantages
- Reduce addressing and aligning overhead (avoid transpose)
[Figure: logic space addressed by base address, x offset, y offset, x stride and y stride; data arrangement in which each row is rotated by one position relative to the row above:
0 1 2 3 4 5 6 7
1 2 3 4 5 6 7 0
2 3 4 5 6 7 0 1
3 4 5 6 7 0 1 2
4 5 6 7 0 1 2 3
5 6 7 0 1 2 3 4
6 7 0 1 2 3 4 5
7 0 1 2 3 4 5 6]
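The addressing mode can be sketched in C as an address computation plus a bank mapping. The slides do not spell out either one, so two assumptions are made: the address of an element is base + y offset * y stride + x offset * x stride, and the rotated-row figure is read as a skewed layout in which element (x, y) lives in memory bank (x + y) mod 8. The `stream2d` type and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BANKS 8

/* Hypothetical descriptor of a 2-D stream: base address plus strides in bytes. */
typedef struct {
    size_t base;
    size_t x_stride;
    size_t y_stride;
} stream2d;

/* Linear address of the element at offset (ox, oy) from the base. */
static size_t addr2d(const stream2d *s, size_t ox, size_t oy)
{
    assert(s != NULL);
    return s->base + oy * s->y_stride + ox * s->x_stride;
}

/* Skewed bank mapping assumed from the rotated-row figure. */
static unsigned bank_of(size_t x, size_t y)
{
    return (unsigned)((x + y) % BANKS);
}

/* Returns 1 if the 8 elements of a row (is_column == 0) or of a column
 * (is_column != 0) starting at (x, y) fall into 8 distinct banks. */
static int conflict_free(size_t x, size_t y, int is_column)
{
    unsigned seen = 0;  /* bit k set once bank k has been used */
    for (size_t i = 0; i < BANKS; i++) {
        unsigned bank = is_column ? bank_of(x, y + i) : bank_of(x + i, y);
        if (seen & (1u << bank))
            return 0;
        seen |= 1u << bank;
    }
    return 1;
}
```

With x offset 4 and y offset 2, as in the vLDOBR example, `addr2d` gives the address of `b[2][4]`. Under the skewed mapping a row of 8 elements and a column of 8 elements each touch every bank exactly once, so either can be fetched in one access without a transpose step.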
Cycle accurate instruction set simulator

Useful for benchmarking and ISA design space exploration during the early stage
- Input is an assembly text program, not binary code
- Focus on function, not micro-architecture
Consists of
- Resource modeling
- ISA function modeling at each pipeline stage
- Behavior and timing modeling
- Debug and profiling support
3 people for 2 months of work, about 60,000 lines of C++ code
[Figure: simulator structure. Resource model: IF, IS, ID/RF, EX0, EX1, EX2, WB. ISA model: IF, IS, read OP, decode, perm, mult, add, shift/rounding, reduction, write back, logic. Also labelled: behavior & timing model, support]
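To make "behavior and timing modeling" concrete, here is a very small timing model in C. It is not taken from the Schubert simulator. It assumes an in-order, single-issue machine in which an instruction issues once both source registers are ready and its result becomes ready `latency` cycles later. The `insn` record and `simulate_cycles` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS 16

/* Hypothetical instruction record; the fields are illustrative only. */
typedef struct {
    int dst;      /* destination register, 0..NUM_REGS-1 */
    int src0;     /* first source register */
    int src1;     /* second source register */
    int latency;  /* cycles from issue until dst is ready */
} insn;

/* In-order, single-issue timing model. An instruction issues in the first
 * free cycle in which both sources are ready. Returns the cycle at which
 * the last result becomes available. */
static unsigned long simulate_cycles(const insn *prog, size_t n)
{
    unsigned long ready[NUM_REGS] = {0};  /* cycle at which each register is ready */
    unsigned long next_issue = 0;         /* earliest free issue slot */
    unsigned long finish = 0;

    for (size_t i = 0; i < n; i++) {
        const insn *in = &prog[i];
        unsigned long issue = next_issue;
        assert(in->dst >= 0 && in->dst < NUM_REGS);
        assert(in->src0 >= 0 && in->src0 < NUM_REGS);
        assert(in->src1 >= 0 && in->src1 < NUM_REGS);
        assert(in->latency >= 1);
        if (ready[in->src0] > issue) issue = ready[in->src0];  /* stall on src0 */
        if (ready[in->src1] > issue) issue = ready[in->src1];  /* stall on src1 */
        ready[in->dst] = issue + (unsigned long)in->latency;
        if (ready[in->dst] > finish) finish = ready[in->dst];
        next_issue = issue + 1;
    }
    return finish;
}
```

A model of this kind only needs the instruction text and a latency per operation, which fits the slide's point that the simulator works from assembly text and focuses on function rather than micro-architecture.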
Benchmarking and performance

Mapped benchmarks:
- Full H.264 baseline decoder kernels like integer transform, intra predict, interpolation and de-blocking
- H.264 fast motion estimation
- MPEG2 motion compensation and DCT/IDCT
The cycle-accurate and functionally correct programs help:
- Make assembler and simulator more robust
- Demonstrate the performance of the ISA
- Explore and refine the ISA (more than 900 instructions are refined to 316 in the end)
Performance
- 4-CIF (704x576) H.264 baseline real-time decoder @ 200 MHz
- 16 kB code size for H.264 baseline decoder
[Bar chart: Cycles for 8x8 IDCT with IEEE compliant precision; vertical axis 0 to 600 cycles; bars for RISCMedia[10], MMX, TMS320C6x, NEC V830, VIRAM, Proposed]
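The real-time claim can be turned into an average cycle budget per macroblock. The sketch below uses the standard 16x16 H.264 macroblock size. The slide does not state the frame rate, so the frame rate is a caller-supplied assumption, and both function names are hypothetical.

```c
#include <assert.h>

/* Number of 16x16 macroblocks in a frame; dimensions are rounded up. */
static unsigned long macroblocks_per_frame(unsigned width, unsigned height)
{
    assert(width > 0 && height > 0);
    return (unsigned long)((width + 15) / 16) * ((height + 15) / 16);
}

/* Average cycle budget per macroblock for real-time decoding.
 * fps is an assumption supplied by the caller; the slide does not state it. */
static unsigned long cycles_per_macroblock(unsigned long clock_hz,
                                           unsigned width, unsigned height,
                                           unsigned fps)
{
    assert(fps > 0);
    return clock_hz / (macroblocks_per_frame(width, height) * fps);
}
```

A 704x576 frame has 44 x 36 = 1584 macroblocks. At 200 MHz and an assumed 25 frames/s this leaves about 5050 cycles per macroblock on average for all decoding work; at an assumed 30 frames/s it leaves about 4208.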
Conclusions

Integration of a general MCU with heterogeneous ASIPs in a SoC platform is a good choice for media processing in China
- A good trade-off between performance and flexibility
- Overcomes our IC design level constraint ([email protected])
Progress on our media processor
- CK510 and Spock are finished
- A dual-core SoC of CK510 and Spock is taped out
- Novel features of Schubert are verified and the RTL implementation is ongoing
[Block diagram: Schubert core with matrix memory and matrix memory controller; Spock with DM, TM and external bus interface; CK510; DMA; AMBA bus; SDRAM controller to off-chip memory; media access controller; mailbox; peripheral]
Problems

Behavior synthesis tool
- The behavior synthesis stage in our ASIP design depends on human experience, not tools, which takes too much effort.
- It is very valuable to research and develop CAD tools for design space exploration of ASIP ISA and ASIP SoC communication during the early stage.
[Figure: design methodology flow chart. Boxes: Application coverage to function coverage; SW-HW partition: 10%-90% locality; Assembly instruction set specification; Design of assembler and simulator; Build golden model; Benchmark instruction set; Behavior function verification; decision "Good performance?"; Micro-architecture design; RTL coding; Design for test; Backend design; RTL code verification; Test chip fabrication & test board prototype]
Thank you!!!