
IRAM
A Media-oriented Processor with
Embedded DRAM
Christoforos Kozyrakis, David Patterson,
Katherine Yelick
Computer Science Division
University of California at Berkeley
http://iram.cs.berkeley.edu
IRAM Overview
• A processor architecture for embedded/portable
systems running media applications
– Based on media processing and embedded DRAM
– Simple, scalable, and efficient
– Good compiler target
• Microprocessor prototype with
– 256-bit media processor, 16 MBytes DRAM
– 150 million transistors, 290 mm²
– 3.2 Gops, 2W at 200 MHz
– Industrial-strength compiler
– Implemented by 6 graduate students
The IRAM Team
• Hardware:
– Joe Gebis, Christoforos Kozyrakis, Ioannis Mavroidis,
Iakovos Mavroidis, Steve Pope, Sam Williams
• Software:
– Alan Janin, David Judd, David Martin, Randi Thomas
• Advisors:
– David Patterson, Katherine Yelick
• Help from:
– IBM Microelectronics, MIPS Technologies, Cray
Outline
• Motivation and goals
• Instruction set
• IRAM prototype
– Microarchitecture and design
• Compiler
• Performance
– Comparison with SIMD
PostPC processor applications
• Multimedia processing
– image/video processing, voice/pattern recognition, 3D
graphics, animation, digital music, encryption
– narrow data types, streaming data, real-time response
• Embedded and portable systems
– notebooks, PDAs, digital cameras, cellular phones,
pagers, game consoles, set-top boxes
– limited chip count, limited power/energy budget
• Significantly different environment from that of
workstations and servers
Motivation and Goals
• Processor features for PostPC systems:
– High performance on demand for multimedia without
continuous high power consumption
– Tolerance to memory latency
– Scalable
– Mature, HLL-based software model
• Design a prototype processor chip
– Complete proof of concept
– Explore detailed architecture and design issues
– Motivation for software development
Key Technologies
• Media processing
– High performance on demand for media processing
– Low power for issue and control logic
– Low design complexity
– Well-understood compiler technology
• Embedded DRAM
– High bandwidth for media processing
– Low power/energy for memory accesses
– “System on a chip”
Outline
• Motivation and goals
• Instruction set
• IRAM prototype
– Microarchitecture and design
• Compiler
• Performance
– Comparison with SIMD
Potential Multimedia Architecture
• “New” model: VSIW = Very Short Instruction Word!
– Compact: describe N operations with 1 short instruction
– Predictable (real-time) performance vs. statistical performance (cache)
– Multimedia ready: choose N*64b, 2N*32b, 4N*16b
– Easy to get high performance; the N operations:
• are independent
• use the same functional unit
• access disjoint registers
• access registers in the same order as previous instructions
• access contiguous memory words or a known pattern
• hide memory latency (and any other latency)
– Compiler technology already developed, for sale!
Operation & Instruction Count: RISC vs. “VSIW” Processor
(from F. Quintana, U. Barcelona)

Spec92fp        Operations (M)          Instructions (M)
Program         RISC   VSIW   R/V      RISC   VSIW   R/V
swim256         115    95     1.1x     115    0.8    142x
hydro2d         58     40     1.4x     58     0.8    71x
nasa7           69     41     1.7x     69     2.2    31x
su2cor          51     35     1.4x     51     1.8    29x
tomcatv         15     10     1.4x     15     1.3    11x
wave5           27     25     1.1x     27     7.2    4x
mdljdp2         32     52     0.6x     32     15.8   2x

VSIW reduces operations by 1.2X and instructions by 20X!
Revive Vector (VSIW) Architecture!
• Cost: ~$1M each? → Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system? → Embedded DRAM
• Code density? → Much smaller than VLIW/EPIC
• Compilers? → For sale, mature (>20 years)
• Vector performance? → Easy to scale speed with technology
• Power/Energy? → Parallel to save energy, keep performance
• Scalar performance? → Include a modern, modest CPU ⇒ OK scalar
• Real-time? → No caches, no speculation ⇒ repeatable speed as input varies
• Limited to scientific applications? → Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
But ...
• But vectors are in your appendix, not in a chapter
• But my professor told me vectors are dead
• But I know my application doesn’t vectorize
(= “but my application is not a dense matrix”)
• But the latest fashion trend is VLIW,
and I don’t want to be out of style
Vector Surprise
• Use vectors for inner loop parallelism (no surprise)
– One dimension of array: A[0, 0], A[0, 1], A[0, 2], ...
– think of machine as 32 vector regs each with 64 elements
– 1 instruction updates 64 elements of 1 vector register
• and for outer loop parallelism!
– 1 element from each column: A[0,0], A[1,0], A[2,0], ...
– think of machine as 64 “virtual processors” (VPs)
each with 32 scalar registers! (~ multithreaded processor)
– 1 instruction updates 1 scalar register in 64 VPs
• Hardware identical, just 2 compiler perspectives
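To make the two perspectives concrete, here is a minimal C sketch (added for illustration; the function and array names are hypothetical, not from the original slides). Both functions compute the same result; what differs is which loop the compiler strip-mines.

#define N 64

/* Inner-loop view: one vector instruction updates the 64 consecutive
   elements of row i (unit-stride accesses). */
void scale_rows(float a[N][N], float s) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)   /* vectorized: a[i][0..63] at once */
            a[i][j] *= s;
}

/* Outer-loop view: vectorize across columns; 64 "virtual processors"
   each walk down one column (stride-N accesses). */
void scale_cols(float a[N][N], float s) {
    for (int j = 0; j < N; j++)       /* this loop is vectorized: one VP per j */
        for (int i = 0; i < N; i++)   /* each VP iterates down its column */
            a[i][j] *= s;
}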
Vector Architecture State
[Figure: architecture state viewed as virtual processors VP0, VP1, ..., VP$vlr-1:
– 32 general-purpose vector registers (vr0–vr31), element width $vpw
– 32 vector flag registers (vf0–vf31), 1 bit per element
– 16 scalar registers (vs0–vs15), 64b each]
Vector Multiply with dependency
/* Multiply a[m][k] * b[k][n] to get c[m][n]. */
for (i = 0; i < m; i++) {
  for (j = 0; j < n; j++) {
    sum = 0;
    for (t = 0; t < k; t++) {
      sum += a[i][t] * b[t][j];  /* dependency: serial accumulation into sum */
    }
    c[i][j] = sum;
  }
}
Novel Matrix Multiply Solution
• You don't need reductions for matrix multiply
• You can calculate multiple independent sums within one vector register
• You can vectorize the outer (j) loop to perform 32 dot-products at the same time
• Or you can think of it as each of 32 virtual processors computing one of the dot products
– (Assuming the maximum vector length is 32)
• Shown below in C source code, but one can imagine the assembly vector instructions generated from it
Optimized Vector Example
/* Multiply a[m][k] * b[k][n] to get c[m][n]. */
for (i = 0; i < m; i++) {
  for (j = 0; j < n; j += 32) {              /* Step j 32 at a time. */
    sum[0:31] = 0;                           /* Initialize a vector register to zeros. */
    for (t = 0; t < k; t++) {
      a_scalar = a[i][t];                    /* Get scalar from a matrix. */
      b_vector[0:31] = b[t][j:j+31];         /* Get vector from b matrix. */
      prod[0:31] = b_vector[0:31] * a_scalar;  /* Vector-scalar multiply. */
      sum[0:31] += prod[0:31];               /* Vector-vector add into results. */
    }
    c[i][j:j+31] = sum[0:31];                /* Unit-stride store of vector of results. */
  }
}
Vector Instruction Set
• Complete load-store vector instruction set
– Uses the MIPS64™ ISA coprocessor 2 opcode space
• Ideas work with any core CPU: Arm, PowerPC, ...
– Architecture state
• 32 general-purpose vector registers
• 32 vector flag registers
– Data types supported in vectors:
• 64b, 32b, 16b (and 8b)
– 91 arithmetic and memory instructions
• Not specified by the ISA
– Maximum vector register length
– Functional unit datapath width
Vector IRAM ISA Summary
• Scalar: MIPS64 scalar instruction set
• Vector ALU: alu op, with operand forms {.v, .vv, .vs, .sv}, data types {s.int, u.int, s.fp, d.fp}, and widths {8, 16, 32, 64}
– ALU operations: integer, floating-point, convert, logical, vector processing, flag processing
• Vector memory: load/store, with addressing modes {unit stride, constant stride, indexed}, data types {s.int, u.int}, and widths {8, 16, 32, 64}
• 91 instructions, 660 opcodes
Support for DSP
[Figure: DSP datapath: two n/2-bit inputs x and y are multiplied to an n-bit product, rounded, added to an n-bit operand w, and saturated to produce the n-bit result]
• Support for fixed-point numbers, saturation,
rounding modes
• Simple instructions for intra-register permutations
for reductions and butterfly operations
– High performance for dot-products and FFT without the
complexity of a random permutation
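A minimal scalar sketch of the per-element fixed-point operation described above (assumptions: Q15 operands and illustrative function names; this is not the exact VIRAM datapath):

#include <stdint.h>

static int16_t fixed_mac_q15(int16_t x, int16_t y, int16_t w) {
    int32_t prod = (int32_t)x * (int32_t)y;  /* Q15 * Q15 -> Q30 product */
    prod = (prod + (1 << 14)) >> 15;         /* round to nearest, back to Q15
                                                (arithmetic shift assumed) */
    int32_t acc = prod + (int32_t)w;         /* accumulate */
    if (acc > INT16_MAX) acc = INT16_MAX;    /* saturate instead of wrapping */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}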
Compiler/OS Enhancements
• Compiler support
– Conditional execution of vector instructions
• Using the vector flag registers
– Support for software speculation of load operations
• Operating system support
– MMU-based virtual memory
– Restartable arithmetic exceptions
– Valid and dirty bits for vector registers
– Tracking of maximum vector length used
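As a sketch of what conditional execution gives the compiler (the function below is hypothetical, added for illustration): the comparison fills a vector flag register, and the masked update executes only in elements where the flag is set.

#include <stdint.h>

void clamp_negatives(int32_t *a, int n) {
    for (int i = 0; i < n; i++) {
        if (a[i] < 0)   /* comparison sets per-element flags */
            a[i] = 0;   /* update executes under the mask */
    }
}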
Outline
• Motivation and goals
• Vector instruction set
• Vector IRAM prototype
– Microarchitecture and design
• Vectorizing compiler
• Performance
– Comparison with SIMD
VIRAM Prototype Architecture
[Block diagram: MIPS64™ 5Kc core with FPU, 8KB instruction cache, 8KB data cache, 64b SysAD interface, and coprocessor interface; vector unit with an 8KB vector register file and a 512B flag register file feeding Arithmetic Units 0 and 1 and Flag Units 0 and 1 over 256b datapaths, plus a memory unit with TLB; a 256b memory crossbar connects the memory unit, DMA, and JTAG interface to eight 2MB DRAM macros (DRAM0–DRAM7)]
Architecture Details (1)
• MIPS64™ 5Kc core (200 MHz)
– Single-issue core with a 6-stage pipeline
– 8 KByte, direct-mapped instruction and data caches
– Single-precision scalar FPU
• Vector unit (200 MHz)
– 8 KByte register file (32 64b elements per register)
– 4 functional units:
• 2 arithmetic (1 FP), 2 flag processing
• 256b datapaths per functional unit
– Memory unit
• 4 address generators for strided/indexed accesses
• 2-level TLB structure: 4-ported, 4-entry microTLB and single-ported, 32-entry main TLB
• Pipelined to sustain up to 64 pending memory accesses
Architecture Details (2)
• Main memory system
– No SRAM cache for the vector unit
– Eight 2-MByte DRAM macros
• Single bank per macro, 2Kb page size
• 256b synchronous, non-multiplexed I/O interface
• 25ns random access time, 7.5ns page access time
– Crossbar interconnect
• 12.8 GBytes/s peak bandwidth per direction (load/store)
• Up to 5 independent addresses transmitted per cycle
• Off-chip interface
– 64b SysAD bus to external chipset (100 MHz)
– 2-channel DMA engine
Vector Unit Pipeline
• Single-issue, in-order pipeline
• Efficient for short vectors
– Pipelined instruction start-up
– Full support for instruction chaining, the vector
equivalent of result forwarding
• Hides long DRAM access latency
– Random access latency could lead to stalls due to long load-use RAW hazards
– Simple solution: “delayed” vector pipeline
Modular Vector Unit Design
[Figure: four identical vector lanes; each lane contains Integer Datapath 0, an FP datapath, vector register file elements, flag register file elements and datapaths, Integer Datapath 1, and a crossbar interface; 256b control/data distribution in, one 64b crossbar port per lane]
• Single 64b “lane” design replicated 4 times
– Reduces design and testing time
– Provides a simple scaling model (up or down) without major control or datapath redesign
• Most instructions require only intra-lane interconnect
– Tolerance to interconnect delay scaling
Floorplan
• Technology: IBM SA-27E
– 0.18 µm CMOS
– 6 metal layers (copper)
• Die area: 290 mm² (14.5 mm × 20.0 mm)
– 225 mm² for memory/logic
– DRAM: 161 mm²
– Vector lanes: 51 mm²
• Transistor count: ~150M
• Power supply
– 1.2V for logic, 1.8V for DRAM
• Peak vector performance
– 1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b operations)
– 3.2/6.4/12.8 Gops with multiply-add
– 1.6 Gflops (single-precision)
Alternative Floorplans (1)
“VIRAM-8MB”:     4 lanes, 8 MBytes, 190 mm², 3.2 Gops at 200 MHz (32-bit ops)
“VIRAM-2Lanes”:  2 lanes, 4 MBytes, 120 mm², 1.6 Gops at 200 MHz
“VIRAM-Lite”:    1 lane, 2 MBytes, 60 mm², 0.8 Gops at 200 MHz
Alternative Floorplans (2)
• “RAMless” VIRAM
– 2 lanes, 55 mm², 1.6 Gops at 200 MHz
– 2 high-bandwidth DRAM interfaces and decoupling buffers
– Vector processors need high bandwidth, but they can tolerate latency
Power Consumption
• Power saving techniques
– Low power supply for logic (1.2 V)
• Possible because of the low clock rate (200 MHz)
• Wide vector datapaths provide high performance
– Extensive clock gating and datapath disabling
• Utilizing the explicit parallelism information of vector
instructions and conditional execution
– Simple, single-issue, in-order pipeline
• Typical power consumption: 2.0 W
– MIPS core: 0.5 W
– Vector unit: 1.0 W (min ~0 W)
– DRAM: 0.2 W (min ~0 W)
– Misc.: 0.3 W (min ~0 W)
Outline
• Motivation and goals
• Vector instruction set
• Vector IRAM prototype
– Microarchitecture and design
• Vectorizing compiler
• Performance
– Comparison with SIMD
VIRAM Compiler
[Compiler structure: C, C++, and Fortran95 frontends → Cray’s PDGCS optimizer → code generators for T3D/T3E, C90/T90/SV1, and SV2/VIRAM]
• Based on Cray’s PDGCS production environment for vector supercomputers
• Extensive vectorization and optimization capabilities, including outer-loop vectorization
• No need for special libraries or variable types for vectorization
Exploiting On-Chip Bandwidth
• The vector ISA + compiler technology uses high bandwidth
to mask latency
• Compiled matrix-vector multiplication: 2 Flops/element
– Easy compilation problem; stresses memory bandwidth
– Compare to 304 Mflops (64-bit) for Power3 (hand-coded)
[Chart: compiled matrix-vector multiply (mvm) MFLOPS for 1, 2, 4, and 8 lanes, 64-bit and 32-bit data, with 8 or 16 DRAM banks; vertical axis up to 900 MFLOPS]
– Performance normally scales with the number of lanes
– More memory banks are needed than the default DRAM macro provides
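The source the compiler sees is just the textbook loop (a sketch with illustrative names): one multiply and one add per matrix element loaded, i.e., 2 flops/element, which is why the kernel stresses memory bandwidth rather than arithmetic.

void mvm(int n, int m, const double a[n][m], const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < m; j++)    /* vectorized, unit stride through a[i][] */
            sum += a[i][j] * x[j];     /* 1 multiply + 1 add per element loaded */
        y[i] = sum;
    }
}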
Compiling Media Kernels on IRAM
• The compiler generates code for narrow data widths, e.g.,
16-bit integer
• Compilation model is simple, more scalable (across
generations) than MMX, VIS, etc.
– Strided and indexed loads/stores are simpler than pack/unpack
– Maximum vector length is longer than the datapath width (256 bits); all lane scalings run with a single executable
[Chart: performance of the colorspace, composite, and FIR filter kernels for 1, 2, 4, and 8 lanes; vertical axis up to 3500 MFLOPS]
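For example, a 16-bit FIR filter like the one in the chart can be written as plain C (a sketch; the tap count and Q15 scaling are illustrative assumptions): with 16b elements, each 256b datapath holds 16 elements, and no intrinsics or pack/unpack appear in the source.

#include <stdint.h>

void fir4(const int16_t *x, const int16_t c[4], int16_t *y, int n) {
    for (int i = 0; i + 3 < n; i++) {
        int32_t acc = 0;
        for (int k = 0; k < 4; k++)
            acc += (int32_t)x[i + k] * c[k];  /* widened multiply-accumulate */
        y[i] = (int16_t)(acc >> 15);          /* scale back to Q15 */
    }
}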
Compiler Challenges
• Generate code for variable data type width
– Vectorizer starts with the largest width (64b)
– At the end, vectorization is discarded and restarted if the greatest width actually encountered is smaller
– For simplicity, a single loop uses the largest width present in it
• Consistency between scalar cache and DRAM
– Problem when the vector unit writes cached data
– Vector unit invalidates cache entries on writes
– Compiler generates synchronization instructions
• Vector after scalar, scalar after vector
• Read after write, write after read, write after write
Outline
• Motivation and goals
• Vector instruction set
• Vector IRAM prototype
– Microarchitecture and design
• Vectorizing compiler
• Performance
– Comparison with SIMD
Performance: Efficiency
Kernel               Peak         Sustained    % of Peak
Image Composition    6.4 GOPS     6.40 GOPS    100%
iDCT                 6.4 GOPS     3.10 GOPS    48.4%
Color Conversion     3.2 GOPS     3.07 GOPS    96.0%
Image Convolution    3.2 GOPS     3.16 GOPS    98.7%
Integer VM Multiply  3.2 GOPS     3.00 GOPS    93.7%
FP VM Multiply       1.6 GFLOPS   1.59 GFLOPS  99.6%
Average                                        89.4%
Performance: Comparison
                     VIRAM   MMX
iDCT                 0.75    3.75 (5.0x)
Color Conversion     0.78    8.00 (10.2x)
Image Convolution    1.23    5.49 (4.5x)
QCIF (176x144)       7.1M    33M (4.6x)
CIF (352x288)        28M     140M (5.0x)

• QCIF and CIF numbers are in clock cycles per frame
• All other numbers are in clock cycles per pixel
• MMX results assume no first-level cache misses
Vector Vs. SIMD
Vector: One instruction keeps multiple datapaths busy for many cycles
SIMD:   One instruction keeps one datapath busy for one cycle

Vector: Wide datapaths can be used without ISA changes or issue logic redesign
SIMD:   Wide datapaths can be used only after changing the ISA or the issue width

Vector: Strided and indexed vector load and store instructions
SIMD:   Simple scalar loads; multiple instructions needed to load a vector

Vector: No alignment restriction for vectors; only individual elements must be aligned to their width
SIMD:   Short vectors must be aligned in memory; otherwise multiple instructions are needed to load them
Vector Vs. SIMD: Example
• Simple example: conversion from RGB to YUV
Y = [( 9798*R + 19235*G +  3736*B) / 32768]
U = [(-4784*R -  9437*G + 14221*B) / 32768] + 128
V = [(20218*R - 16941*G -  3277*B) / 32768] + 128
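A per-pixel scalar C reference of these equations (a sketch added for comparison with the vector and MMX listings below; the clamp mirrors the unsigned saturation that packuswb provides in the MMX version):

#include <stdint.h>

static int clamp_u8(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

static void rgb_to_yuv(int r, int g, int b,
                       uint8_t *y, uint8_t *u, uint8_t *v) {
    *y = (uint8_t)clamp_u8((  9798 * r + 19235 * g +  3736 * b) / 32768);
    *u = (uint8_t)clamp_u8(( -4784 * r -  9437 * g + 14221 * b) / 32768 + 128);
    *v = (uint8_t)clamp_u8(( 20218 * r - 16941 * g -  3277 * b) / 32768 + 128);
}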
VIRAM Code (22 instructions)
RGBtoYUV:
  vlds.u.b     r_v, r_addr, stride3, addr_inc    # load R
  vlds.u.b     g_v, g_addr, stride3, addr_inc    # load G
  vlds.u.b     b_v, b_addr, stride3, addr_inc    # load B
  xlmul.u.sv   o1_v, t0_s, r_v                   # calculate Y
  xlmadd.u.sv  o1_v, t1_s, g_v
  xlmadd.u.sv  o1_v, t2_s, b_v
  vsra.vs      o1_v, o1_v, s_s
  xlmul.u.sv   o2_v, t3_s, r_v                   # calculate U
  xlmadd.u.sv  o2_v, t4_s, g_v
  xlmadd.u.sv  o2_v, t5_s, b_v
  vsra.vs      o2_v, o2_v, s_s
  vadd.sv      o2_v, a_s, o2_v
  xlmul.u.sv   o3_v, t6_s, r_v                   # calculate V
  xlmadd.u.sv  o3_v, t7_s, g_v
  xlmadd.u.sv  o3_v, t8_s, b_v
  vsra.vs      o3_v, o3_v, s_s
  vadd.sv      o3_v, a_s, o3_v
  vsts.b       o1_v, y_addr, stride3, addr_inc   # store Y
  vsts.b       o2_v, u_addr, stride3, addr_inc   # store U
  vsts.b       o3_v, v_addr, stride3, addr_inc   # store V
  subu         pix_s, pix_s, len_s
  bnez         pix_s, RGBtoYUV
MMX Code (121 instructions, shown across 3 slides)
[Assembly listing: the equivalent MMX RGBtoYUV loop; long sequences of movq, punpcklbw/punpckhbw, pmaddwd (against the YR0GR/YBG0B/UR0GR/UBG0B/VR0GR/VBG0B coefficient constants), paddd, psrad, packssdw, and packuswb over registers mm0–mm7, with intermediates spilled to TEMP0/TEMPY/TEMPU/TEMPV, looping via dec edi / jnz RGBtoYUV]
Performance: FFT (1)
FFT (Floating-point, 1024 points)
[Chart: 1024-point floating-point FFT execution times (16.8–124.3 µs) for VIRAM, Pathfinder-2, Wildstar, TigerSHARC, ADSP-21160, and TMS320C6701]
Performance: FFT (2)
FFT (Fixed-point, 256 points)
[Chart: 256-point fixed-point FFT execution times (7.2–151 µs) for VIRAM, Pathfinder-1, Carmel, TigerSHARC, PPC 604E, and Pentium]
Conclusions
• Vector IRAM
– An integrated architecture for media processing
– Based on vector processing and embedded DRAM
– Simple, scalable, and efficient
• One thing to keep in mind
– Use the most efficient solution to exploit each level of parallelism
– Make the best solutions for each level work together
– Vector processing is very efficient for data level parallelism
Levels of Parallelism    Efficient Solution
Multi-programming        Clusters? NUMA? SMP?
Thread                   MT? SMT? CMP?
Irregular ILP            VLIW? Superscalar?
Data                     VECTOR
Backup slides
Delayed Vector Pipeline
[Pipeline diagram: scalar pipeline stages F D R E M W; a vector load (VLD) passes through its A and T stages and the >25ns DRAM access before writeback (VW); VADD inserts DELAY stages before its VR VX VW stages so the load→add RAW hazard is covered; VST follows with A T ... VR; the vld/vadd/vst loop pattern repeats]
• Random access latency is included in the vector unit pipeline
• Arithmetic operations and stores are delayed to shorten RAW hazards
• Long hazards are eliminated for the common loop cases
• Vector pipeline length: 15 stages
Handling Memory Conflicts
[Chart: IDCT kernel performance vs. sub-banks per DRAM macro, normalized to one sub-bank: 1 → 1.00, 2 → 1.57, 4 → 1.86, 8 → 1.87, 1 with decoupling → 1.58]
• A single sub-bank per DRAM macro can lead to memory conflicts for non-sequential access patterns
• Solution 1: address interleaving
– Selects between 3 address interleaving modes for each virtual page
• Solution 2: address decoupling buffer (128 slots)
– Allows scheduling of long indexed accesses without stalling the arithmetic operations executing in parallel
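As a generic illustration of the interleaving idea (a sketch; the constants are assumptions, not VIRAM's actual modes): spreading consecutive memory lines across sub-banks makes non-sequential streams land in different banks.

#include <stdint.h>

/* With n_banks sub-banks (a power of 2) interleaved on 32-byte lines,
   consecutive lines map to different banks, so strided streams avoid
   hitting the same bank repeatedly (illustrative only). */
static unsigned bank_of(uint32_t addr, unsigned n_banks) {
    return (addr >> 5) & (n_banks - 1);
}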
Hardware Exposed to Software
[Die photo: Pentium® III]
• <25% of the area is for registers and datapaths
• The rest is still useful, but not visible to software
– Cannot be turned off if not needed