A High Performance SoC: Pkunity™
Chen Jie
Peking University Microprocessor R&D Center
Contents
• PkUnity SoC Introduction
• PkUnity SoC Low Power Design
ICSoC2005, Aug 05
Introduction
[Figure: Pkunity roadmap. Y axis: Frequency (MHz), 0 to 600; X axis: Year, 00 to 06. Phase labels: finish development platform; processor development; SoC components development; communication chip & router chip development; chip mass production. Designs plotted: UniCore16 Processor, UniCore32 Processor, UniCoreF64, Pkunity-1 SoC, Pkunity-2 SoC, Pkunity-3 SoC.]
PKUnity-3 Architecture
UniCore fixed-point processor
• UniCore frequency: 600 MHz
• 32-bit Harvard-architecture RISC CPU
• Compatible with the UniCore32 instruction set
• Adds conditional mov and BLX instructions
• 8-stage instruction pipeline
• Dynamic branch prediction policy: Gshare
• Pipelined I&D cache
• Two-level TLB
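As an illustration of the Gshare policy listed on this slide, the following is a minimal Python sketch of the textbook scheme, not the UniCore implementation. It assumes a table of 2-bit saturating counters indexed by the branch address XORed with a global history register. The table size (`index_bits`), the initial counter state, the word-aligned PC shift, and all class and function names are hypothetical choices made for the example.

```python
class GsharePredictor:
    """Textbook gshare: 2-bit counters indexed by (PC XOR global history)."""

    def __init__(self, index_bits=10):  # hypothetical table size: 2**10 entries
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # 1 = weakly not-taken
        self.history = 0                         # global branch history register

    def _index(self, pc):
        # Drop the two low PC bits (assumes 4-byte aligned instructions),
        # then fold in the global history with XOR.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, trace):
    """Fraction of correct predictions over a list of (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)
```

On a branch that strictly alternates between taken and not taken, a sketch like this mispredicts only while the history register warms up, because the two steady-state history patterns select two different counters.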
Performance Evaluation
• Unicore-II CPI increases by 10%-15%
• Gshare prediction, the pipelined cache, and the two-level TLB limit the CPI increase caused by the deep pipeline
• UniCore-II MIPS increases by 70%-80%
• The performance improvement comes from improvements in micro-architecture and technology
[Figure: "CPI Increase". Y axis: CPI, 0 to 7; X axis: Benchmark (164.gzip, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 254.gap, 255.vortex, 256.bzip2, 300.twolf). Series: unicore-1, unicore-2. Data labels range from 1.684 to 6.0026.]
[Figure: "MIPS Increase". Y axis: MIPS, 0 to 350; X axis: Benchmark (the same ten benchmarks). Series: MIPS-unicore1, MIPS-unicore2. Data labels range from 61.22 to 295.94.]
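The two headline figures on this slide can be related through the standard identity MIPS = clock frequency (MHz) / CPI. The sketch below is only a back-of-the-envelope check under that identity; the function names are hypothetical, and the resulting clock ratio is an implication of the quoted percentages, not a number stated in the slides.

```python
def mips(freq_mhz, cpi):
    """Standard identity: million instructions per second = f (MHz) / CPI."""
    return freq_mhz / cpi


def implied_frequency_ratio(mips_gain, cpi_gain):
    """Clock ratio f2/f1 implied by fractional MIPS and CPI increases.

    From MIPS = f / CPI:  f2/f1 = (MIPS2/MIPS1) * (CPI2/CPI1).
    """
    return (1.0 + mips_gain) * (1.0 + cpi_gain)


# Slide figures: CPI +10% to +15%, MIPS +70% to +80%.
low = implied_frequency_ratio(0.70, 0.10)   # about 1.87
high = implied_frequency_ratio(0.80, 0.15)  # about 2.07
print(f"implied clock ratio: {low:.2f}x to {high:.2f}x")
```

Read this way, a CPI that worsens by 10%-15% is compatible with a 70%-80% MIPS gain only if the clock rises by roughly a factor of two, which matches the slide's attribution of the gain to both micro-architecture and technology.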
SoC Design Platform
• To build:
– a chip-based infrastructure
– an integrated development environment
– a design and verification flow
• In PkUnity-3:
– CPU configurable
– BUS configurable
– Interrupt system configurable
– DMA configurable
– Frequency configurable
– Power management
Verification
• Coverage-oriented VERA verification flow
• SystemC-based HW/SW co-verification methodology
• FPGA prototype
Contents
• PkUnity SoC Introduction
• PkUnity SoC Low Power Design
– Power research status
– PkUnity low power design and power estimation
– Future work
Power: New Challenge
– Power is a dramatic issue for SoCs with billions of transistors
– Power has to be reduced for portable devices that require a dramatic increase of computation power
– Deep submicron technologies (90 and 65 nm) will present a dramatic increase of leakage power
– Power is still too high for most SoCs
– SoC architectures (HW/SW, multiprocessor, multiple memories) are not well supported by CAD tools
– Reconfigurability and flexibility compromise low power
– Leakage and very low Vdd are dramatic problems
LP Research Condition
Low power design technology and research topics, by design level:
• Technology: feature size shrink, low dielectric constant material, SOI technology
• Circuit: design low power standard cell library
• Gate: design low power logic chain (gated clock, gated Vdd)
• RTL: reduce switching activity (gated clock, state machine & glitch optimization)
• Micro-arch: parallel, pipeline, pre-computing
• Instruction: good task partition between HW/SW, design low power instruction set
• Compiler: saving power while improving performance, memory organization
• OS: dynamic voltage scaling, I/O devices, power and energy analysis of OS
• Application: task partition, algorithm optimization
Power Estimation Research
[Figure: "Power Estimation Hierarchy". Abstraction levels, top to bottom: System, Algorithm, Register Transfer, Logic, Circuit. Axes: Analysis Speed (toward the top), Analysis Precision (toward the bottom). Tools shown: SimplePower, Wattch, CACTI (high level architectural model); PrimePower, PowerCompiler; HSPICE.]
Simulation vs. analysis:
• Simulation with timing info
• Extract circuit parameters
• Adding technology info
• Gate level simulation
• Analysis with extracted parameters
• Embedded processor: high performance vs. low power
• Three methods to reduce chip power:
– Close unused modules
– Frequency scaling
– Close the PLL
• Pkunity-3 objective:
– CPU <[email protected]/600MHz
– SoC <[email protected]/600MHz
[Figure: "Power of Pkunity". Y axis: power (mW), 0 to 1800; X axis: Pkunity1, Pkunity2, Pkunity3. Series: CPU Power, SoC Power.]
[Figure: "Pkunity2 CPU Power", pie chart. Components: Unicore, FPU, CP1, CP0, FPU_reg, BIU, BIUIU, DCache, Icache, DMMU, IMMU. Slice labels: 2%, 2%, 1%, 3%, 0%, 0%, 1%, 14%, 2%, 38%, 37%.]
Power Estimation
[Figure: power estimation flow. Boxes: TestBench, SPEC, VCS, Simulation Executable File, Gate level Netlist, RTL, PowerCompiler, Netlist, Floorplan, Power Report, PrimePower, SAIF file, CTS&Router, ECO, Signoff.]
Power_hier.rpt
Operating conditions: typical    Library: typical
Wire load model mode: top
Global operating Voltage=1.8
Hierarchy    Switch Power   Int Power   Leak Power   Total Power   %
Top_pad      208.267        161.355     6.83e+07     369.691       100
……
U_unity_1    11.482         91.724      6.08e+07     103.268       27.9
……
U_fpu        1.983          16.305      3.25e+07     18.321        5.0
……
U_unicore    0.711          3.572       2.20e+07     4.284         1.2
Why are they so different?
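The % column of the hierarchy report on this slide appears to be each block's total power divided by the top-level total. The short Python check below recomputes it from the Total Power values in the report; the dictionary layout and function name are hypothetical, and the report's units are not assumed.

```python
# Total Power values copied from the Power_hier.rpt excerpt on the slide.
TOTAL_POWER = {
    "Top_pad": 369.691,
    "U_unity_1": 103.268,
    "U_fpu": 18.321,
    "U_unicore": 4.284,
}


def share_of_top(report, top="Top_pad"):
    """Return each block's total power as a percentage of the top-level total."""
    top_total = report[top]
    return {name: round(100.0 * power / top_total, 1)
            for name, power in report.items()}


for name, pct in share_of_top(TOTAL_POWER).items():
    print(f"{name:<10} {pct:5.1f} %")
```

The recomputed shares (100, 27.9, 5.0, and 1.2) agree with the % column of the report.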
Power optimization
• Close unused modules through gated clocks
• Reduce chip power by switching among multiple run modes
– Run
– Idle
– Sleep
• Change chip frequency through dynamic PLL configuration
• Input vector control in execution components
[Figure: "Clock gating vs. non Clock gating". Y axis: 0 to 60; X axis: IMMU, DMMU, Icache, DCache, BIUIU, BIU, FPU_reg, CP0, CP1, FPU, Unicore. Series: None-gated, Gated.]
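To make the effect of clock gating and run modes concrete, here is a minimal Python sketch based on the standard dynamic power relation P = α · C · Vdd² · f. It is an illustration only: the module names, capacitances, and activity factor are hypothetical, the meaning assigned to each mode (Idle gates the CPU clock, Sleep stops the clock entirely) is an assumption, and leakage is ignored. The 1.8 V and 600 MHz defaults are the operating voltage and frequency quoted elsewhere in the slides.

```python
def dynamic_power_mw(alpha, cap_nf, vdd, freq_mhz):
    """Dynamic power P = alpha * C * Vdd^2 * f.

    With C in nF, Vdd in V and f in MHz the result is in mW.
    """
    return alpha * cap_nf * vdd ** 2 * freq_mhz


# Hypothetical switched capacitance per module, in nF.
MODULES = {"cpu": 0.5, "bus": 0.2, "peripherals": 0.1}

# Assumed mode semantics (not defined in the slides).
MODES = {
    "run":   {"freq_scale": 1.0, "gated": set()},
    "idle":  {"freq_scale": 1.0, "gated": {"cpu"}},     # CPU clock gated
    "sleep": {"freq_scale": 0.0, "gated": set(MODULES)},  # PLL closed
}


def chip_power_mw(mode, vdd=1.8, freq_mhz=600.0, alpha=0.1):
    """Sum dynamic power over all modules whose clock is not gated."""
    cfg = MODES[mode]
    freq = freq_mhz * cfg["freq_scale"]
    return sum(dynamic_power_mw(alpha, cap, vdd, freq)
               for name, cap in MODULES.items()
               if name not in cfg["gated"])


for mode in ("run", "idle", "sleep"):
    print(f"{mode:<5} {chip_power_mw(mode):7.2f} mW")
```

In this toy model a gated module contributes no dynamic power, so the ordering run > idle > sleep follows directly from how much capacitance is still being clocked.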
Work Flow
[Figure: "Low power design and estimation flow". Design stages: SPEC, LP microarch design, RTL, PC insert clock gating, Netlist, Floorplan, CTS&Router, ECO, Signoff. Estimation stages: system power simulator, gate-level estimation, estimation with timing and load.]
Future Work
Low Power Design
• Memory architecture (cache, TLB, register file)
• Clock system (synchronous vs. asynchronous)
• Bus system
• Instruction set selection
• Voltage and frequency scaling
• Compiler optimization
• Task movement
Power Estimation
• Pre-analyze architecture and microarchitecture designs with a fast and accurate architectural-level power simulator
• Build a full-chip power simulator
• Make the power simulator parameters reconfigurable
• Build an accurate leakage power estimation model
• Specific component power models
Thank you