University of Michigan
Electrical Engineering and Computer Science
From SODA to Scotch:
The Evolution of a Wireless
Baseband Processor
Mark Woh (University of Michigan - Ann Arbor)
Yuan Lin (University of Michigan - Ann Arbor)
Sangwon Seo (University of Michigan - Ann Arbor)
Scott Mahlke (University of Michigan - Ann Arbor)
Trevor Mudge (University of Michigan - Ann Arbor)
Chaitali Chakrabarti (Arizona State University)
Richard Bruce (ARM Ltd.)
Danny Kershaw (ARM Ltd.)
Alastair Reid (ARM Ltd.)
Mladen Wilder (ARM Ltd.)
Krisztian Flautner (ARM Ltd.)
From SODA to Scotch: What is this talk about?
• Is a fully programmable 3G baseband processor commercially viable?
  ► The SODA processor was the first full research design [ISCA06]
  ► ARM R&D developed the Ardbeg SDR commercial prototype
• What we will present
  ► Comparison study between SODA and Ardbeg
  ► Lessons learned in the evolution
Mobile Computing
• In 2007, world-wide mobile telephone subscriptions: 3.3 billion [1]
  ► ~Half of the world’s population
  ► Some countries have mobile penetration over 100%
  ► Largest consumer electronic device in terms of volume
• Wireless multimedia anywhere at anytime
  ► Cell phones are getting more complex
  ► PCs are getting more mobile
[1] “Global cellphone penetration reaches 50 pct”, Reuters, Nov. 29th, 2007
Wireless Communication
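[Figure: wireless protocols grouped by network range. Network classes: Global Network, Wide Area Network, Local Area Network, Personal Area Network. Protocols shown: GPS, DVB, GSM, W-CDMA, 802.11g, 802.11n, Bluetooth, UWB.]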
Software Defined Radio
[Figure: handset block diagram. GPS, Bluetooth, and WCDMA radios feed an analog frontend, a baseband processor, and application processors, which connect to the camera, keypad, display, speaker, and microphone.]
Software Defined Radio
[Figure: the same handset block diagram annotated with the protocol stack. Layers shown: Transport, Network, Link, MAC, PHY. Hardware labels: GPP and DSP + ASICs.]
Software Defined Radio
[Figure: the same handset block diagram with the baseband processor replaced by an SDR baseband processor.]
Advantages of Soft Radio
• Design factor
  ► Protocol complexity
  ► Multi-mode operation
  ► Prototyping and bug fixes
• Cost factor
  ► Time-to-market
  ► Silicon area
  ► Higher volume
  ► Longevity of platform
[Figure: a single SDR platform serving GPS, 802.11n, DVB, GSM, 802.11g, W-CDMA, UWB, and Bluetooth.]
Mobile SDR Design Challenges
SDR Design Objectives for 3G and WiFi
• Throughput requirements
  ► 40+ Gops peak throughput
• Power budget
  ► 100mW~500mW peak power
[Figure: Peak Performance (Gops) versus Power (Watts) on log axes, with diagonal lines of constant power efficiency in Mops/mW (better power efficiency toward the upper left). Regions and points shown: Mobile SDR Requirements, Embedded DSPs, High-end DSPs, General Purpose Processors, IBM Cell, Pentium M, TI C6x.]
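The two objectives on this slide imply a power-efficiency target. As a rough sketch (the function and constant names are ours, not from the talk), dividing the 40 Gops throughput requirement by each end of the 100 mW to 500 mW power budget gives the Mops/mW range a mobile SDR processor would have to reach:

```python
def mops_per_mw(gops: float, milliwatts: float) -> float:
    """Power efficiency in Mops/mW for a given throughput and power draw."""
    return gops * 1000.0 / milliwatts  # 1 Gops = 1000 Mops


PEAK_GOPS = 40.0                   # "40+ Gops peak throughput" from the slide
POWER_BUDGET_MW = (100.0, 500.0)   # "100mW~500mW peak power" from the slide

for mw in POWER_BUDGET_MW:
    eff = mops_per_mw(PEAK_GOPS, mw)
    print(f"{PEAK_GOPS:.0f} Gops at {mw:.0f} mW -> {eff:.0f} Mops/mW")
```

Under these assumptions the target works out to between 80 and 400 Mops/mW, depending on which end of the power budget is used.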
First Generation SDR Processor : SODA
• Our first attempt was the SODA processor
  ► Designed in 180nm technology
  ► Built with WCDMA and 802.11a in mind
  ► Sub-500mW operation estimated at 90nm
SODA
System:
• Heterogeneous multi-core architecture
• Multi-level scratchpad memories
PE:
• SIMD/Scalar/AGU LIW
• 32-lane 16-bit SIMD
• 16-bit scalar datapath
• Scalar-to-SIMD
• SIMD-to-scalar
• Iterative Perfect Shuffle Network
[Figure: SODA PE block diagram with five numbered parts. 1. Wide SIMD: 512-bit SIMD ALU + Mult, 512-bit SIMD Reg. File, Pred. Regs, SIMD Shuffle Network (SSN), SIMD to Scalar (VtoS) and Scalar to SIMD (StoV) units. 2. Scalar: Scalar RF and Scalar ALU. 3. Local memory: L1 SIMD Data Memory, L1 Scalar Data Memory, L1 Program Memory. 4. AGU: AGU RF and AGU ALU. 5. DMA, connected to the system bus. A controller drives the pipelines.]
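The slide names an iterative perfect shuffle network but does not show how it is wired or controlled. The sketch below models only the textbook perfect shuffle permutation on a 32-lane vector, applied repeatedly; the function names are ours, and nothing here should be read as SODA's actual SSN implementation:

```python
def perfect_shuffle(lanes):
    """One pass: interleave the lower and upper halves of the vector."""
    half = len(lanes) // 2
    out = []
    for lo, hi in zip(lanes[:half], lanes[half:]):
        out.extend((lo, hi))
    return out


def iterate_shuffle(lanes, passes):
    """Apply the perfect shuffle repeatedly, as an iterative network would."""
    for _ in range(passes):
        lanes = perfect_shuffle(lanes)
    return lanes


LANES = 32       # "32-lane 16-bit SIMD" from the slide
LANE_BITS = 16
assert LANES * LANE_BITS == 512   # matches the 512-bit SIMD datapath

vec = list(range(LANES))
one_pass = perfect_shuffle(vec)
print(one_pass[:8])               # [0, 16, 1, 17, 2, 18, 3, 19]

# log2(32) = 5 passes bring every lane back to its starting position.
five_passes = iterate_shuffle(vec, 5)
print(five_passes == vec)         # True
```

A single fixed permutation is cheap in wiring, but reaching an arbitrary lane ordering can take several passes, which is the cost the later shuffle-network study weighs against richer interconnects.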
SODA Summary
[Figure: Peak Performance (Gops) versus Power (Watts) on log axes. Points shown: SODA 180nm, SODA 90nm, Picochip 130nm, Sandbridge 90nm, TI C6x 90nm, NXP EVP 90nm. Regions shown: Mobile SDR requirements, ASICs, Embedded DSPs, High-end DSPs, General Purpose Processors.]
Ardbeg SDR Processor
Ardbeg System:
• PEs, each with an execution unit and L1 memories
• Control processor
• FEC accelerator
• L2 memory
• DMAC and peripherals
• 64-bit AMBA 3 AXI interconnect
Ardbeg PE:
• 1. Wide SIMD
  ► 512-bit SIMD ALU with shuffle
  ► 512-bit SIMD Mult
  ► 512-bit SIMD Reg. File and 1024-bit SIMD ACC RF
  ► SIMD Shuffle Network: 128-lane 8-bit Banyan network
  ► SIMD + Scalar Transfer Unit
  ► Pred. RF
• 2. Scalar & AGU
  ► Scalar ALU + Mult, Scalar RF + ACC
  ► AGUs with AGU RF: multiple data address accesses
• 3. Memory
  ► Combined scalar/vector memory (L1 Data Memory)
  ► L1 Program Memory
• Callouts on the PE diagram
  ► Sparse connected VLIW
  ► 3 Read/2 Write RF for SIMD VLIW
  ► Application specific hardware
  ► Block floating point support
  ► 8, 16, 32 bit fixed point support
  ► Fused Permute ALU operations
[Figure: Ardbeg system and PE block diagrams.]
Evolution to Ardbeg : Lessons Learned
• Ardbeg achieved ~3x speedup overall at 30% lower power than SODA
• Many lessons were learned from the studies behind these improvements
• We will present a few of these studies
  ► 1) Benefit of Wide SIMD
  ► 2) VLIW on SIMD support
  ► 3) Support for Complex Shuffle Network
  ► 4) Application Specific Hardware
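Speedup and power combine into an energy figure, which the slide does not state directly. The sketch below is a back-of-the-envelope estimate only: it assumes the ~3x speedup and the 30% power reduction apply to the same fixed workload, and the function name is ours:

```python
def relative_energy(speedup: float, power_ratio: float) -> float:
    """Energy for a fixed workload relative to the baseline.

    energy = power * time, and run time scales as 1 / speedup.
    """
    return power_ratio / speedup


# ~3x speedup at 30% lower power (power ratio 0.70), as quoted on the slide.
ardbeg_vs_soda = relative_energy(speedup=3.0, power_ratio=0.70)
print(f"Relative energy per task: {ardbeg_vs_soda:.2f}")
print(f"Energy reduction: {1 / ardbeg_vs_soda:.1f}x")
```

Under that assumption Ardbeg would need roughly 0.23 of SODA's energy for the same work, about a 4.3x reduction.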
1) Benefiting from Wide SIMD
[Figure: Normalized Energy-Delay Product and Normalized Area versus SIMD Width (8, 16, 32, 64).]
• Increasing SIMD width is still a good idea for SDR
• But area becomes a big concern
  ► 32-wide 16-bit SIMD at 90nm seems a good fit
2) VLIW Support for Wide SIMD
• VLIW execution on top of the SIMD datapath
  ► 2-issue SIMD LIW
  ► Only support the most frequently used SIMD op pairs
  ► SIMD RF shared between SIMD units
    • 3 read ports, 2 write ports
[Figure: Ardbeg PE datapath showing the AGUs, data memory, SIMD RF, scalar RF, 32-lane SIMD ALU, 128-lane SSN, SIMD-scalar transfer unit, and 16-bit scalar ALU, joined by interconnects.]
2) VLIW on SIMD Support
           Mem.   Arith.  Mult.  Shuffle  Trans.  Move   Comp.
Mem.       NA     High    High   Low      High    Low    Low
Arith.     -      NA      Mid    High     Mid     Low    Low
Mult.      -      -       NA     Mid      High    High   Low
Shuffle    -      -       -      NA       Mid     Low    Low
Trans.     -      -       -      -        NA      Low    Low
Move       -      -       -      -        -       NA     Low
Comp.      -      -       -      -        -       -      NA
• There is a distinct set of instructions that execute
frequently at the same time
• We want to take advantage of this in order to reduce
complexity of VLIW
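One way to read the table is as a shortlist of instruction pairs worth issuing together. The sketch below encodes the upper triangle of the table and extracts the pairs marked High. Treating "High" as the selection rule is our illustrative assumption; the slides do not say exactly which pairs Ardbeg supports, and the names used here are ours:

```python
OPS = ["Mem", "Arith", "Mult", "Shuffle", "Trans", "Move", "Comp"]

# Upper triangle of the co-issue table, row by row (diagonal "NA" omitted).
UPPER = {
    "Mem":     ["High", "High", "Low", "High", "Low", "Low"],
    "Arith":   ["Mid", "High", "Mid", "Low", "Low"],
    "Mult":    ["Mid", "High", "High", "Low"],
    "Shuffle": ["Mid", "Low", "Low"],
    "Trans":   ["Low", "Low"],
    "Move":    ["Low"],
}


def pairs_with(level):
    """Return the (op, op) pairs whose co-issue frequency equals `level`."""
    result = []
    for i, first in enumerate(OPS):
        partners = OPS[i + 1:]
        for second, freq in zip(partners, UPPER.get(first, [])):
            if freq == level:
                result.append((first, second))
    return result


high_pairs = pairs_with("High")
for first, second in high_pairs:
    print(f"{first} + {second}")
print(f"{len(high_pairs)} of 21 possible pairs are marked High")
```

Only 6 of the 21 pairs are marked High, which suggests why a restricted two-issue design can capture most of the benefit without the register ports needed to pair any two operations.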
2) VLIW on SIMD Support
[Figure: Energy-Delay Product for FIR, CFIR, FFT Radix-2, FFT Radix-4, Viterbi K7, Viterbi K9, and the Average, comparing four SIMD register file configurations: 2 Read/2 Write (Single Issue), 3 Read/2 Write (Ardbeg), 4 Read/4 Write (Any two SIMD ops), 6 Read/5 Write (Any three SIMD ops).]
• In most cases, 3 Read/2 Write gives the best overall design point
3) Support for Shuffle Network
• 7-stage single-cycle SSN
  ► Banyan network
  ► 128-lane 8-bit (64-lane 16-bit)
[Figure: Ardbeg PE datapath with the 128-lane SSN highlighted, next to a 2 stage 16-lane Banyan network diagram. Other blocks shown: AGU, scalar and SIMD data memories, SIMD RF, scalar RF, 32-lane SIMD ALU, SIMD-scalar transfer unit, and 16-bit scalar ALU, joined by interconnects.]
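The stage count follows from the lane count: a network of 2x2 switches over 128 lanes has log2(128) = 7 stages. The sketch below is a generic butterfly-style model from the Banyan family, with names and control encoding of our own choosing. It is not Ardbeg's actual SSN wiring, which the slides do not describe:

```python
def num_stages(lanes: int) -> int:
    """Stages of 2x2 switches needed to span `lanes` inputs: log2(lanes)."""
    if lanes < 2 or lanes & (lanes - 1):
        raise ValueError("lane count must be a power of two")
    return lanes.bit_length() - 1


def apply_stage(data, stage, cross):
    """One stage: lanes i and i ^ (1 << stage) share a 2x2 switch.

    cross[i], read at the lower lane of each pair, selects swap (True)
    or pass-through (False).
    """
    out = list(data)
    bit = 1 << stage
    for i in range(len(data)):
        if not i & bit and cross[i]:
            out[i], out[i | bit] = data[i | bit], data[i]
    return out


def route(data, controls):
    """Send a vector through every stage in order."""
    for stage, cross in enumerate(controls):
        data = apply_stage(data, stage, cross)
    return data


LANES = 128                      # "128-lane 8-bit" from the slide
stages = num_stages(LANES)
print(stages)                    # 7

# Every switch crossed: lane i ends up at i XOR 127, i.e. the vector reverses.
all_cross = [[True] * LANES for _ in range(stages)]
reversed_lanes = route(list(range(LANES)), all_cross)
print(reversed_lanes[:4])        # [127, 126, 125, 124]
```

Because each stage has independent switch controls, one pass through the network can realize many permutations that an iterative perfect shuffle would need several passes for, at a fraction of the wiring of a full crossbar.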
3) Support for Shuffle Network
[Figure: Energy-Delay Product for 64pt FFT Radix-2, 2048pt FFT Radix-2, 64pt FFT Radix-4, 2048pt FFT Radix-4, and Viterbi K9, comparing four shuffle networks: 32 Wide Perfect, 64 Wide Perfect, 64 Wide Crossbar, 64 Wide Banyan.]
• The 64-wide Banyan network comes close to the energy of a simple iterative interconnect while giving crossbar-like performance
4) Application Specific Optimizations
• Application specific hardware
  ► Turbo coprocessor
  ► Block-floating point support
  ► Fused Permute-ALU operations
  ► Interleaving support
• Trade-off programmability for performance
  ► Less “soft” than SODA
  ► But more energy efficient for common operations
4) Application Specific Optimizations
• Some kernels are common among many different protocols
  ► Many protocols use the same Error Correction algorithms
• The Turbo coprocessor targets one of them
  ► Tradeoff between programmable and ASIC
• An ASIC implementation is around 5x more efficient than a programmable implementation
  ► SODA PE: 2Mbps with 111mW in 90nm
  ► ASIC: 2Mbps with 21mW in 90nm
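The "around 5x" figure can be checked from the two data points on the slide. The sketch below converts each to energy per decoded bit; the function and variable names are ours:

```python
def energy_per_bit_nj(power_mw: float, throughput_mbps: float) -> float:
    """Energy per bit in nanojoules: mW / Mbps = 1e-3 W / 1e6 bit/s = 1e-9 J/bit."""
    return power_mw / throughput_mbps


# Both figures are for Turbo decoding at 2 Mbps in 90nm, as quoted on the slide.
soda_nj = energy_per_bit_nj(power_mw=111.0, throughput_mbps=2.0)
asic_nj = energy_per_bit_nj(power_mw=21.0, throughput_mbps=2.0)
ratio = soda_nj / asic_nj

print(f"SODA PE: {soda_nj:.1f} nJ/bit")
print(f"ASIC:    {asic_nj:.1f} nJ/bit")
print(f"ASIC advantage: {ratio:.1f}x")
```

At equal throughput the ratio reduces to 111 mW / 21 mW, about 5.3x, which is in line with the "around 5x" quoted above.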
Ardbeg Speedup Over SODA
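[Figure: Ardbeg speedup over baseline SODA (axis from 0 to 4.5) for Filtering, Modulation, Synchronization, and Error Correction (7x, off the scale). Each bar is broken down into Baseline SODA, SIMD ALU, SIMD Shuffle, VLIW, and Compiler Optimization, giving the overall improvement.]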
• Achieves between ~1.5-7x speedup for
wireless algorithms compared to SODA
Summary of Ardbeg
[Figure: Achieved Throughput (Mbps) versus Power (Watts) on log axes. Protocols plotted: 802.11a, W-CDMA 2Mbps, W-CDMA data, W-CDMA voice. Processors: SODA (including 180nm points), ASIC, Sandblaster, TigerSHARC, Pentium M.]
• Power vs Throughput for protocols on different processors
Summary of Ardbeg
[Figure: Achieved Throughput (Mbps) versus Power (Watts) on log axes, now including Ardbeg. Protocols plotted: 802.11a, DVB-H, DVB-T, W-CDMA 2Mbps, W-CDMA data, W-CDMA voice. Processors: Ardbeg, SODA (including 180nm points), ASIC, Sandblaster, TigerSHARC, Pentium M.]
• Ardbeg is lower power at same throughput
• We are getting closer to ASICs
Conclusion
• SODA → Ardbeg
  ► Overall ~1.5-7x improvement across multiple wireless algorithms
  ► 30% less power than SODA (with Turbo also in software)
• Fully programmable research design evolved to a commercial design that is “less soft”
• Feasible to design programmable solutions that start to approach ASIC efficiency
  ► ASICs are locally optimal for single kernels but combined create an inefficient system
• Programmability allows time multiplexing of hardware = less hardware, same amount of work
Questions?
Thanks!