Wireless Communication Extensions
for DSPs and
General Purpose Processors
Sridhar Rajagopal
COMP 625
April 17, 2000
Motivation




Wireless, the next wave after Multimedia
Highly Compute-Intensive Algorithms
Real-Time Requirements
Design based on Time-to-Market
Outline






Processor Core with Reconfigurable Support
Permutation Based Interleaved Memory
Processor Architecture - EPIC
Instruction Set Extensions
Truncated Multipliers
Software Support Needed
Characteristics of Wireless Algorithms






Massive Parallelism
Bit-level Computations
Matrix Based Operations
Memory Intensive
Complex-valued Data
Approximate Computations
What’s wrong with Current Architectures for these applications?
Problems with Current Architectures






UltraSPARC, C6x, MMX, IA-64
Not enough MIPS/FLOPS
Unable to fully exploit parallelism
Bit Level Computations
Memory Bottlenecks
Specialized Instructions for Wireless Communications
Why Reconfigurable


Adapt algorithms to environment
Seamless and Continuous Data Processing during Handoffs
[Figure: handoffs between a Home Area Wireless LAN, a High Speed Office Wireless LAN, and an Outdoor CDMA Cellular Network]
Reconfigurable Support
OSI Layers 3-7: User Interface, Translation, Synchronization, Transport, Network
OSI Layer 2: Data Link Layer (converts frames to bits)
OSI Layer 1: Physical Layer (hardware; raw bit stream)
Different Protocols

MPEG-4, H.723 - Voice, Multimedia

Convolutional, Turbo - Channel Coding
[Figure: block diagram. Labels: Source Coding, Channel Coding, Channel Estimation, Multiuser Detection, Channel Decoding, Source Decoding]
A New Architecture
[Figure: block diagram. Labels: Processor Core (GPP/DSP), Cache, Main Memory, two queues (Q), Crossbar, Reconfigurable Logic, Real-Time I/O, Bit Stream, RF Unit, Add-on PCMCIA Network Interface Card, Processor]
Why Reconfigurable


Process initial bit level computations
Optimize for fast I/O transfer
[Figure: Reconfigurable Logic, Real-Time I/O, Bit Stream, RF Unit]
Reconfigurable Support
[Figure: GARP Architecture at UC Berkeley. Labels: two 64-bit data buses, one 64-bit address bus, Control Blocks, Boolean values, Fast I/O, 64-bit Datapath, Configuration Caches, Sequencer]
Reconfigurable Support

Wide Path to Memory
– Data Transfer
– Minimize Load Times

Configuration Caches
– Recently displaced configurations (5 cycles)
– Can hold 4 full size Configurations

Independent Execution
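
The configuration cache behaviour above can be sketched as a small cost model. This is an illustration only: the 4-entry capacity and the 5-cycle reload come from the slide, while the least-recently-used replacement policy, the `MISS_CYCLES` value, and all names are assumptions made for the example.

```python
from collections import OrderedDict

CACHE_SLOTS = 4      # from the slide: holds 4 full-size configurations
HIT_CYCLES = 5       # from the slide: cached configurations reload in 5 cycles
MISS_CYCLES = 1000   # hypothetical placeholder for a full load from memory


class ConfigCache:
    """Toy model of a configuration cache (LRU replacement is an assumption)."""

    def __init__(self, slots=CACHE_SLOTS):
        self.slots = slots
        self.entries = OrderedDict()  # configuration name -> True, oldest first

    def load(self, name):
        """Return the modelled cycle cost of activating configuration `name`."""
        if name in self.entries:
            self.entries.move_to_end(name)    # mark as most recently used
            return HIT_CYCLES
        if len(self.entries) >= self.slots:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[name] = True
        return MISS_CYCLES


cache = ConfigCache()
costs = [cache.load(n) for n in ["fft", "viterbi", "fft", "corr"]]
```

In this model only the third load hits the cache, so a kernel that alternates between a few configurations pays the long load once per configuration.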
Reconfigurable Support

Access to same Memory System as Processor
– Minimize overhead

When idle
– Load Configurations
– Transfer Data
Operation

Load Configuration
– If in configuration cache, minimal time

Copy initial data with coprocessor move instructions

Start execution

Issue wait that interlocks while active

Copy registers back at kernel completion
Memory Interface

Access to Main Memory and L1 Data Cache
– Large, fast Memory Store

Memory Prefetch Queues for Sequential Accesses
– Read aheads and Write Behinds
[Figure: block diagram. Labels: Instruction Cache, Processor Core (GPP/DSP), L1 Data Cache, Main Memory, two queues (Q), Crossbar, FPGA]
Permutation Based Interleaved Memory (PBI)




High Memory Bandwidth Needed
Stride-Insensitive Memory System for Matrices
Multiple Banks
Sustained Peak Throughput (95%)
[Figure: L1 Data Cache and Main Memory]
PBI Scheme

N = address length (bits)

M = 2^n banks

2^{N-n} words in each bank

To access a word,
– n-bit bank number
– N-n bit address (high-order)

Calculation of the n-bit Bank Number
Calculate Bank Number

Use all N address bits to get the n-bit bank vector
Y = A X, where A is an n x N matrix of 0’s and 1’s (arithmetic modulo 2)

Y = A_h X_h + A_l X_l, with X split into its (N-n, n) high- and low-order bits; A_l has rank n

Each row of A needs an N-bit parity circuit with log_k N levels of XOR gates (fan-in k)

[Figure: the N-bit address feeds n parity circuits, one per row of A (Row 0, Row 1, ..., Row n-1); the n parity bit signals drive a decoder that produces the 2^n bank select signals]
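
The bank-number calculation Y = A X can be sketched in a few lines. This is a minimal illustration, not the scheme evaluated in the referenced study: each row of A is stored as an N-bit mask, and the particular matrix built by `xor_fold_rows` (row i selects address bits i, i+n, i+2n, ...) is an assumption chosen because its low-order block A_l is the identity and therefore has rank n. All function names are hypothetical.

```python
from typing import List


def parity(x: int) -> int:
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(x).count("1") & 1


def xor_fold_rows(N: int, n: int) -> List[int]:
    """Rows of an example n x N matrix A, each encoded as an N-bit mask.

    Row i selects address bits i, i+n, i+2n, ... so the low-order n columns
    (A_l) form the identity matrix. This choice of A is an assumption.
    """
    rows = []
    for i in range(n):
        mask = 0
        for bit in range(i, N, n):
            mask |= 1 << bit
        rows.append(mask)
    return rows


def bank_number(addr: int, rows: List[int]) -> int:
    """n-bit bank number: bit i is the parity of addr masked by row i of A."""
    bank = 0
    for i, mask in enumerate(rows):
        bank |= parity(addr & mask) << i
    return bank


N, n = 6, 2                      # 64 words spread over 4 banks
rows = xor_fold_rows(N, n)
unit_stride = [bank_number(a, rows) for a in range(0, 4)]
stride_four = [bank_number(a, rows) for a in range(0, 16, 4)]
```

With plain low-order interleaving, a stride of 4 over 4 banks would hit bank 0 every time; here both the unit-stride and the stride-4 sequences visit all four banks. The word address inside the bank is the high-order N-n bits, `addr >> n`.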
Interleaved Memory Model
[Figure: interleaved memory model. Labels: Address Source, Input Buffers, Memory Banks M(0), M(1), ..., M(M-1), Output Buffers, Data Sequencer, Data Sink]
Processor Core



64-bit EPIC Architecture with Extensions (IA-64/C6x)
Statically determined parallelism; exploit ILP
Execution Time Predictability
[Figure: block diagram. Labels: Processor Core (GPP/DSP), Cache, two queues (Q), Crossbar, FPGA]
EPIC Principle

Explicitly Parallel Instruction Computing

Evolution of VLIW Computing

Compiler plays a key role

Architecture to assist Compiler

Better cope with dynamic factors
– which limited VLIW Parallelism
Aspects of EPIC

Designing the Plan of Execution (POE) at Compile Time

Permitting the Compiler to play the Statistics
– Conditional Branches, Memory references

Communicating POE to the hardware
– Static Scheduling
– Branch information
Architecture Features in EPIC

Static Scheduling
– MultiOP
– Non-Unit Assumed Latency (NUAL)

The Branch Problem
– Predicated Execution
– Control Speculation
– Predicated Code Motion

The Memory Problem
– Cache Specifiers
– Data Speculation
Instruction Set Extensions

To accelerate Bit level computations in Wireless

Real/Complex Integer - Bit Multiplications
– Used in Multiuser Detection, Decoding

Bit - Bit Multiplications
– Used in Outer Product Updates
– Correlation, Channel Estimation

Complex Integer-Integer Multiplications

Useful in other Signal Processing applications
– Speech, Video, ...
Architecture Support

Support via Instruction Set Extensions

Minimal ALU Modifications necessary

Transparent to Register Files/Memory

Additional 8-bit Special Purpose Registers
Integer - Bit Multiplications
D[i] = D[i] + b[j]*C[j]
E.g.: Cross-Correlation
[Figure: 64-bit Registers A, C, and D; 8-bit Register b; add/subtract (+/-) units]
Register Renaming?
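
An integer-bit multiply-accumulate reduces to a conditional add or subtract, which a short sketch makes concrete. It assumes that a bit value of 1 stands for +1 and 0 for -1 (consistent with the Ex-NOR truth table on the Bit-Bit Multiplications slide), and that bit i of the 8-bit register controls lane i. Lanes are modelled as a Python list rather than packed into a 64-bit register, and `int_bit_mac` is a hypothetical name, not an instruction defined in the slides.

```python
def int_bit_mac(d, c, b):
    """Return a new list: d[i] + c[i] where bit i of b is 1, else d[i] - c[i].

    d, c : lists of integers, one entry per lane
    b    : integer whose bit i encodes the sign for lane i (1 -> +1, 0 -> -1)
    """
    if len(d) != len(c):
        raise ValueError("d and c must have the same number of lanes")
    out = []
    for i, (di, ci) in enumerate(zip(d, c)):
        sign_is_plus = (b >> i) & 1
        out.append(di + ci if sign_is_plus else di - ci)
    return out


# Bits 0 and 2 of b are set, so lanes 0 and 2 add and the others subtract.
acc = int_bit_mac([0] * 8, [1, 2, 3, 4, 5, 6, 7, 8], 0b00000101)
```

No multiplier is involved: the bit only selects between the adder and the subtractor.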
8-bit to 64-bit conversions
D = D + b*b^T
E.g.: Auto-Correlation
b1 = b(1:8), b(1:8), ..., b(1:8)   (the 8-bit vector repeated to fill 64 bits)
b2 = b(1)b(1)...b(8)b(8)   (each bit repeated to fill 64 bits)
[Figure: the 8-bit Register b is expanded into a 64-bit Register A in two ways: as b(1)..b(8), ..., b(1)..b(8) and as b(1) b(1) ... b(8) b(8)]
Bit-Bit Multiplications
D = D + b*b^T
E.g.: Auto-Correlation
b1*b2 computed bitwise:

B1  B2  B1*B2
0   0   1
0   1   0
1   0   0
1   1   1

[Figure: 64-bit Register A = b1 and 64-bit Register B = b2 feed an Ex-NOR, giving 64-bit Register C = b1*b2]
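
The 8-bit to 64-bit expansion followed by a bitwise Ex-NOR yields all 64 entries of b*b^T at once. The sketch below assumes bit 0 of the integer holds b(1) and that entry (i, j) of the result lands at bit position 8*i + j; both conventions, and every function name, are choices made for this illustration.

```python
MASK64 = (1 << 64) - 1


def replicate_vector(b: int) -> int:
    """b1: the 8-bit vector b repeated 8 times to fill 64 bits."""
    b1 = 0
    for k in range(8):
        b1 |= (b & 0xFF) << (8 * k)
    return b1


def replicate_bits(b: int) -> int:
    """b2: each bit of b repeated 8 times to fill 64 bits."""
    b2 = 0
    for i in range(8):
        if (b >> i) & 1:
            b2 |= 0xFF << (8 * i)
    return b2


def outer_product_bits(b: int) -> int:
    """64-bit Ex-NOR of b1 and b2; bit 8*i + j is 1 when b(i) equals b(j)."""
    return ~(replicate_vector(b) ^ replicate_bits(b)) & MASK64


b = 0b10110010
c = outer_product_bits(b)
```

Reading a 1 as +1 and a 0 as -1, bit 8*i + j of `c` is the sign of the product of elements i and j, so the diagonal bits are always 1.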
Increment/Decrement
D = D + b*b^T
E.g.: Auto-Correlation
[Figure: 64-bit Register D is incremented or decremented by 1 (+/- units) under control of the 8-bit Register b1*b2, giving 64-bit Register (D+b1*b2)]
Complex-valued Data Processing



Is it easy to add?
Is this worth additional ALU support?
Typically supported by software!
Truncated Multipliers





Many applications need approximate computations
Adaptive Algorithms: Y = Y + mu*(Y*C)
Truncate lower bits
Truncated Multipliers - half the area/half the delay
Can do 2 truncated multiplies in parallel with regular ALU Multipliers
[Figure: Truncated Multiplier 1 and Truncated Multiplier 2 alongside a regular Multiplier]
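
The effect of truncating the lower bits can be sketched with a bit-level model of an unsigned multiplier that never forms the partial-product bits in the n least significant columns. This is one simple variant chosen for illustration: it applies no correction constant, handles unsigned operands only, and uses the hypothetical name `truncated_multiply`. The slides do not specify the design.

```python
def truncated_multiply(a: int, b: int, n: int = 8) -> int:
    """Upper n bits of an unsigned n x n product, ignoring every
    partial-product bit that falls in the n least significant columns."""
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must fit in n bits")
    acc = 0
    for i in range(n):
        if (b >> i) & 1:
            row = a << i               # partial-product row i
            acc += (row >> n) << n     # keep only columns n and above
    return acc >> n


exact = (255 * 255) >> 8               # upper byte of the full product
approx = truncated_multiply(255, 255)  # upper byte from the truncated array
```

For 8-bit operands the result in this model is never larger than the exact upper byte and falls short by at most 7, which is the kind of error an adaptive update such as Y = Y + mu*(Y*C) may tolerate.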
Software Support

Greater Interaction between Compilers and Architectures
– EPIC
– Reconfigurable Logic


Compiler needs to find and exploit bit level computations
Reconfigurable Logic Programming
Area Estimates

Area increases by 20% over an IA-64 architecture due to reconfigurable support

Instruction Set extensions need minimal hardware support

Parallel Interleaved Memory Banks will need larger area
Other Uses

Reconfigurable Logic
– For accelerating loops of general purpose processors

Bit Level Support
– For other voice, video and multimedia applications
Conclusions




Processor Core with Reconfigurable Support developed for Wireless Applications
Instruction Set Extensions added for accelerating performance of the algorithms
Integration of Wireless Appliances with General Purpose Processors
Great Impact on Performance of Wireless Algorithms
Future Work

Simulations for finding performance improvements

Other Processor Architectures
– Bit Slice Architectures
– Out-of-order
References

The GARP Architecture and C Compiler
– T.C. Callahan, J.R. Hauser, J. Wawrzynek, IEEE Computer, April 2000, pp. 62-69

http://brass.cs.berkeley.edu

EPIC: Explicitly Parallel Instruction Computing
– M.S. Schlansker, B.R. Rau, IEEE Computer, Feb. 2000, pp. 37-45

High-Bandwidth Interleaved Memories for Vector Processors - A Simulation Study
– G.S. Sohi, IEEE Transactions on Computers, Vol. 42, No. 1, Jan. 1993, pp. 34-44
Acknowledgements



Vijay Pai
Partha Ranganathan
Joseph Cavallaro