CS 152 Computer Architecture and Engineering Lecture 15 -- Advanced CPUs 2014-3-11 John Lazzaro (not a prof - “John” is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ Play: CS 152

CS 152
Computer Architecture and Engineering
Lecture 15 -- Advanced CPUs
2014-3-11
John Lazzaro
(not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
Play:
CS 152 L15: Superscalars and Scoreboards
UC Regents Spring 2014 © UCB
DEC Alpha 21164
Top performing
microprocessor
in its day (1995).
300 MFLOPS
in 0.5µ CMOS,
@ 300 MHz.
DEC Alpha 21164
Uses techniques
we cover in
Part I of lecture.
Lockup-free
cache integration.
Use of many
functional units.
Many instructions
issued per cycle
(superscalar)
DEC Alpha 21164
Most of chip is
cache (in blue).
This 4-issue chip
was the high
watermark for in-order designs.
In 2014,
in-order
superscalar lives
in the cost-sensitive sector
...
Marvell Embedded CPU: In-order dual-core superscalar
$35 retail
implies
Bill of
Materials
(BOM) in
the $20
range ...
ARM CPU
Wi-Fi
(Marvell)
2 GB Flash
512 MB DRAM
Chromecast:
Web browser in a flash-drive form factor. Plugs into
the HDMI port on a TV. Includes a Wi-Fi chip so you
can control the browser from your cell phone.
Key Issue: Overcoming data hazards
Read After Write (RAW) hazards. Instruction
I2 expects to read a data value written by an
earlier instruction, but I2 executes “too early”
and reads the wrong copy of the data.
Write After Read (WAR) hazards. Instruction I2
expects to write over a data value after an earlier
instruction I1 reads it. But instead, I2 writes too
early, and I1 sees the new value.
Write After Write (WAW) hazards. Instruction I2
writes over data an earlier instruction I1 also
writes. But instead, I1 writes after I2, and the
final data value is incorrect.
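The three definitions above can be checked mechanically from register names alone. The following is a minimal sketch, assuming a simplified instruction with one destination and a tuple of sources; the `Instr` class and `hazards` function are illustrative names, not part of the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """A register-register instruction: dest = op(srcs). Hypothetical type."""
    op: str
    dest: str
    srcs: tuple


def hazards(i1: Instr, i2: Instr) -> set:
    """Return the data hazards present when I2 follows I1 in program order."""
    found = set()
    if i1.dest in i2.srcs:
        found.add("RAW")  # I2 reads a value that I1 writes
    if i2.dest in i1.srcs:
        found.add("WAR")  # I2 overwrites a value that I1 reads
    if i1.dest == i2.dest:
        found.add("WAW")  # both write the same register
    return found


div = Instr("DIV", "R1", ("R2", "R3"))
sub = Instr("SUB", "R1", ("R2", "R3"))
add = Instr("ADD", "R4", ("R1", "R5"))
print(hazards(div, sub))  # {'WAW'}
print(hazards(div, add))  # {'RAW'}
```

Whether a detected hazard actually causes a wrong result depends on pipeline timing; this sketch only reports that the dependency exists.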
CS 194-6 L9: Advanced Processors I
UC Regents Fall 2008 © UCB
Key issue: Structural Hazards ...
Floating Point Pipeline of Alpha 21164:
Insufficient register write ports to service
all sources every clock cycle.
Not every arithmetic unit is fully pipelined.
Topic #1: CPU side of our hit-over-miss cache ...
CPU requests a read by placing MTYPE, TAG, MADDR in Queue 1.
[Figure: Queue 1 carries requests from the CPU; Queue 2 carries results to the CPU.]
“We” == L1 D-Cache controller
We do a normal cache access. If there is
a hit, we place the load result in Queue 2 ...
In the case of a miss, we use the
Inverted Miss Status Holding Register.
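To make the hit-over-miss behavior concrete, here is a rough sketch of the two-queue protocol. It is not the Inverted MSHR design: a plain list of parked misses stands in for it, and `MISS_LATENCY`, `run_cache`, and the data strings are assumptions made for illustration.

```python
from collections import deque

MISS_LATENCY = 3  # assumed miss penalty in cycles (illustrative only)


def run_cache(requests, cached):
    """Serve read requests (tag, addr) from Queue 1; return Queue 2 contents.

    `cached` maps addresses already in the cache to their data. A miss is
    parked and completes MISS_LATENCY cycles later, so later hits can
    return to the CPU ahead of it.
    """
    queue1 = deque(requests)
    queue2 = []    # (tag, data) in the order results are returned
    pending = []   # parked misses: (ready_cycle, tag, addr)
    cycle = 0
    while queue1 or pending:
        # Return any parked misses whose data has arrived from memory.
        for entry in [p for p in pending if p[0] <= cycle]:
            pending.remove(entry)
            _, tag, addr = entry
            cached[addr] = f"mem[{addr}]"
            queue2.append((tag, cached[addr]))
        # Accept one new request per cycle.
        if queue1:
            tag, addr = queue1.popleft()
            if addr in cached:
                queue2.append((tag, cached[addr]))                 # hit
            else:
                pending.append((cycle + MISS_LATENCY, tag, addr))  # miss
        cycle += 1
    return queue2


# The load tagged 5 misses; the later load tagged 6 hits and returns first.
print(run_cache([(5, 100), (6, 200)], {200: "B"}))
# [(6, 'B'), (5, 'mem[100]')]
```

Because results can come back out of order, each one carries its TAG so the CPU knows which request it answers.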
Integrating queues into the pipeline ...
A memory pipe
splits off from the
main pipeline,
after ALU
calculates index.
CPU uses
5 bits of TAG
to encode the
target/source
register for
LW/SW.
LockBits: a scoreboard data structure
In decode stage,
we stall any
instruction that
reads or writes
a locked register.
Each register
has a lock bit,
initialized to 0.
An example of a
scoreboard data
structure.
In decode stage,
we lock target
register of any LW
we issue.
[Figure: LockBits array with 5-bit ports rs and ws, 1-bit ports rd and wd, and write enable WE.]
How lock bits are cleared ...
When data is returned to CPU
via Queue 2, CPU writes data
into register file, and clears the
associated lock bit.
Dedicated write ports are needed
to avoid structural hazards.
[Figure: LockBits array (ports rs, ws, rd, wd, WE) with Queue 1 from the CPU and Queue 2 to the CPU.]
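The lock-bit rules (lock on LW issue, stall on any locked source or target, clear when data returns via Queue 2) can be sketched as a small class. The class and method names are hypothetical, and a 32-entry register file is assumed.

```python
class LockBits:
    """One lock bit per register, all initialized to 0 (unlocked)."""

    def __init__(self, num_regs=32):
        self.locked = [False] * num_regs

    def must_stall(self, reads, writes):
        """Decode-stage check: stall if any source or target is locked."""
        return any(self.locked[r] for r in list(reads) + list(writes))

    def issue_load(self, target):
        """Lock the target register of an LW when it issues."""
        self.locked[target] = True

    def load_returned(self, target):
        """Data came back via Queue 2: register is written, lock is cleared."""
        self.locked[target] = False


bits = LockBits()
bits.issue_load(8)                                  # LW into R8 issues
print(bits.must_stall(reads=[8, 9], writes=[10]))   # True: R8 is locked
bits.load_returned(8)                               # data arrives
print(bits.must_stall(reads=[8, 9], writes=[10]))   # False
```

Checking writes as well as reads means a later instruction cannot overwrite a register whose load is still outstanding.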
Memory semantics and lockup-free caches
The CPU expects that loads and stores to the same
memory location are applied in queued order.
The simple (low-performance) approach for the data
cache is to “snoop” Queue 1, and delay
accepting writes to addresses that are being read.
Finally, note the lack of sequential consistency.
Topic #2: Pipelines and latency ...
This pipeline splits after the RF stage,
feeding functional units with
different latencies.
Split pipelines: a write-after-write hazard.
Solution: SUB detects R1 clash in decode
stage and stalls, via a pipe-write scoreboard.
WAW Hazard
DIV R1, R2, R3
SUB R1, R2, R3
If long latency DIV
and short latency
SUB are sent to
parallel pipes, SUB
may finish first.
Register write port: a structural hazard
Solution: A scoreboard structure to reserve future slots of
the write port. Stall SUB in decode until slot opens.
Structural
Hazard
DIV R1, R2, R3
[...]
SUB R5, R2, R3
DIV and SUB may
need to write register
file at the same time.
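One way to picture "reserving future slots of the write port" is a shift register in which position i means "the port is taken i cycles from now". The sketch below assumes a single write port and made-up latencies (`DIV_LATENCY`, `SUB_LATENCY`); the class name is illustrative.

```python
class WritePortScoreboard:
    """slots[i] is True when the write port is reserved i cycles from now."""

    def __init__(self, depth=16):
        self.slots = [False] * depth

    def try_issue(self, latency):
        """Reserve the write-back slot `latency` cycles ahead, or refuse."""
        if self.slots[latency]:
            return False              # slot taken: stall in decode
        self.slots[latency] = True
        return True

    def tick(self):
        """Advance one cycle: every reservation moves one slot closer."""
        self.slots = self.slots[1:] + [False]


DIV_LATENCY, SUB_LATENCY = 4, 1       # assumed latencies, for illustration

wp = WritePortScoreboard()
print(wp.try_issue(DIV_LATENCY))      # True: DIV reserves cycle +4
for _ in range(3):
    wp.tick()                         # three cycles pass
print(wp.try_issue(SUB_LATENCY))      # False: SUB would collide with DIV
wp.tick()
print(wp.try_issue(SUB_LATENCY))      # True: slot is open one cycle later
```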
Other solutions possible ...
above, solution of separate
Functional unit input: a structural hazard
Solution: A scoreboard structure to detect busy functional
units. Stall DIV R5, ... in decode until divider is ready.
Structural
Hazard
DIV R1, R2, R3
DIV R5, R2, R3
Divide is usually not
fully pipelined, and
cannot accept new
operands every cycle.
Imprecise exceptions: A difficult issue
Solutions: Too complicated for a slide.
See page C-58 in CA-AQA
Exceptions
DIV R1, R2, R3
SUB R4, R2, R3
If DIV throws an
exception after SUB
writes back, the
contract with the
programmer breaks.
Superscalar: Multiple issues per cycle
Goal: Improve CPI by issuing
several instructions per cycle.
Example: CPU with floating
point ALUs: Issue 1 FP +
1 Integer instruction per cycle.
Difficulties: Load and branch
delays affect more instructions.
Ultimate Limiter: Programs may
be a poor match to issue rules.
Recall VLIW: Super-sized Instructions
Example: All instructions are 64-bit. Each
instruction consists of two 32-bit MIPS
instructions, that execute in parallel.
opcode | rs | rt | rd | shamt | funct
Syntax: ADD $8 $9 $10   Semantics: $8 = $9 + $10
opcode | rs | rt | rd | shamt | funct
Syntax: ADD $7 $8 $9    Semantics: $7 = $8 + $9
A 64-bit VLIW instruction
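As a worked example of the format above, the two ADDs can be encoded with the standard MIPS R-type layout and concatenated. The helper names are hypothetical, and placing the first instruction in the upper 32 bits is an assumption of this sketch; the slide does not specify the order.

```python
def encode_rtype(rs, rt, rd, shamt=0, funct=0x20, opcode=0):
    """Pack the six MIPS R-type fields into one 32-bit word.

    The defaults (opcode 0, funct 0x20) are the MIPS encoding of ADD.
    """
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def pack_vliw(first, second):
    """Concatenate two 32-bit instructions into one 64-bit VLIW word.

    Putting `first` in the upper half is an assumption of this sketch.
    """
    return (first << 32) | second


add_8 = encode_rtype(rs=9, rt=10, rd=8)   # ADD $8 $9 $10
add_7 = encode_rtype(rs=8, rt=9, rd=7)    # ADD $7 $8 $9
word = pack_vliw(add_8, add_7)
print(f"{add_8:#010x} {add_7:#010x}")     # 0x012a4020 0x01093820
print(f"{word:#018x}")                    # 0x012a402001093820
```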
But what if we can’t change ISA execution semantics ?
CS 194-6 L3: Single-Cycle CPU
UC Regents Fall 2008 © UCB
Superscalar R machine
[Figure: two parallel pipelines, each with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB and an IR per stage. Labeled blocks: PC and Sequencer; Instr Mem (Addr, Data; 64-bit and 32-bit paths); Instruction Issue Logic; RegFile with read ports rs1-rs4 / rd1-rd4 and write ports ws1 / wd1 / WE1 and ws2 / wd2 / WE2; ALU inputs A and B, outputs Y and R.]
Sustaining Dual Instr Issues (no forwarding)
ADD R8,R0,R0       ADD R11,R0,R0
ADD R27,R26,R25    ADD R30,R29,R28
ADD R21,R20,R19    ADD R24,R23,R22
ADD R15,R14,R13    ADD R18,R17,R16
ADD R9,R8,R7       ADD R12,R11,R10
It’s rarely
this good ...
[Figure: pipeline snapshot of the superscalar R machine. IF holds ADD R9,R8,R7 and ADD R12,R11,R10; ID holds ADD R15,R14,R13 and ADD R18,R17,R16; EX holds ADD R21,R20,R19 and ADD R24,R23,R22; MEM holds ADD R27 and ADD R30.]
Worst-Case Instruction Issue
ADD R8,R0,R0
ADD R9,R8,R0
ADD R10,R9,R0
ADD R11,R10,R0
Dependencies force “serialization”
We add 12 forwarding buses (not shown).
(6 to each ID from stages of both pipes).
[Figure: pipeline snapshot of the superscalar R machine. One pipe holds ADD R11,R10,R0 in IF, ADD R10,R9,R0 in ID, ADD R9,R8,R0 in EX, and ADD R8 in MEM; the other pipe holds NOPs in every stage.]
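The contrast between the best-case and worst-case instruction streams above can be reproduced with a small in-order grouping routine. This is a simplified sketch: it only checks read-after-write dependencies between instructions issued in the same cycle and ignores stalls between cycles, forwarding, and WAW checks. The function name and tuple encoding are assumptions.

```python
def issue_groups(program, width=2):
    """Group instructions in program order, up to `width` per cycle.

    Each instruction is (dest, src1, src2), with registers as integers.
    An instruction cannot join a group if it reads a register written by
    an instruction already in that group.
    """
    groups, current = [], []
    for dest, *srcs in program:
        written = {d for d, *_ in current}
        if len(current) == width or written & set(srcs):
            groups.append(current)   # close the group and start a new cycle
            current = []
        current.append((dest, *srcs))
    if current:
        groups.append(current)
    return groups


best = [(8, 0, 0), (11, 0, 0), (27, 26, 25), (30, 29, 28), (21, 20, 19),
        (24, 23, 22), (15, 14, 13), (18, 17, 16), (9, 8, 7), (12, 11, 10)]
worst = [(8, 0, 0), (9, 8, 0), (10, 9, 0), (11, 10, 0)]

print(len(issue_groups(best)))    # 5 cycles for 10 instructions
print(len(issue_groups(worst)))   # 4 cycles for 4 instructions
```

In the worst case every instruction depends on the one before it, so the second issue slot is never used.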
Superscalar: A simple example ...
Example: Superscalar MIPS. Fetches
2 instructions at a time. If the first is an integer
instruction and the second is floating point,
both issue in the same cycle.
Integer instruction    FP instruction
LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)         ADDD F4,F0,F2
LD F14,-24(R1)         ADDD F8,F6,F2
LD F18,-32(R1)         ADDD F12,F10,F2
SD 0(R1),F4            ADDD F16,F14,F2
SD -8(R1),F8           ADDD F20,F18,F2
SD -16(R1),F12
SD -24(R1),F16
Rows with both columns filled: two issues per cycle.
Rows with only an integer instruction: one issue per cycle.
Superscalar: Visualizing the pipeline
Type              Pipe Stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB
Three instructions are potentially affected by
a single cycle of load delay, as FP register
loads are done in the “integer” pipeline.
Limitations of “lockstep” superscalar
Gets 0.5 CPI only for a 50/50 float/int mix with no
hazards. For games/media, may be OK.
Extending scheme to speed up general apps
(Microsoft Office, ...) is complicated.
If one accepts building a complicated machine,
there are better ways to do it.
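The 0.5 CPI figure can be checked with a toy model of the pairing rule. This sketch assumes no hazards and ignores fetch alignment: an integer instruction pairs with the FP instruction immediately after it, and everything else issues alone. The function name and the "int"/"fp" labels are illustrative.

```python
def lockstep_cpi(kinds):
    """CPI of a 2-wide lockstep machine with no hazards.

    `kinds` is a non-empty program-order list of "int" / "fp". Two
    instructions issue together only when an int instruction is
    immediately followed by an fp instruction.
    """
    cycles, i = 0, 0
    while i < len(kinds):
        paired = (kinds[i] == "int"
                  and i + 1 < len(kinds)
                  and kinds[i + 1] == "fp")
        i += 2 if paired else 1
        cycles += 1
    return cycles / len(kinds)


print(lockstep_cpi(["int", "fp"] * 4))                # 0.5
print(lockstep_cpi(["int"] * 8))                      # 1.0
print(lockstep_cpi(["int", "int", "fp", "fp"] * 2))   # 0.75
```

The third case has a 50/50 mix but a poor ordering, which shows how a program can be a poor match to the issue rules even with the right instruction mix.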
Dynamic Scheduling: After spring break.
DEC Alpha 21164
This 4-issue chip was the high watermark
for in-order superscalar designs.
Final paragraph
DEC was sold off to Compaq a
few years later ... who sold off
Digital Semiconductor to Intel
... who still makes
Alpha chips in small batches
for HP (who bought Compaq).
Break
Play:
The CDC 6600 was
the world’s fastest
computer for
5 years (1964-1969).
The design team was
located in a small
town in Wisconsin,
the home town
of its leader, Seymour Cray.
The lab was placed far from CDC
headquarters in Minneapolis, to limit
interference from upper management.
Operator Console
Top-down view:
Transistor-based
design, running at
100 ns clock speed.
64K of 60-bit words,
implemented with
magnetic
core memory.
Entire mainframe
was liquid
cooled with
Freon.
Bus wires:
twisted wire pairs
that were trimmed
by hand to meet
cycle time.
Architecture
Out-of-order
execution.
The first
RISC machine
Peripheral
processor invented
multithreading
“Scoreboard”
10 functional units
Register
File
Includes eight
60-bit floating point
registers
Long, variable latency
Instruction
Fetch and the
Scoreboard
The scoreboard controls the
execution flow of all instructions.
Its goal is to maintain a CPI of 1.
The instruction fetch unit is decoupled.
Its goal is to pass one decoded instruction to
the scoreboard every cycle.
The scoreboard holds decoded copies
of all in-flight instructions, and tracks
the status of all elements cycle-by-cycle.
Lifecycle of an instruction in the scoreboard (part 1)
States, in order: Pending Issue, Awaiting operands, Execution in progress, Execution has completed, Result is written.
State: Pending Issue
Newly arrived instructions placed in this
state, until
(1) a functional unit becomes free, and
(2) no other issued instructions want to
write the register it wants to write.
Prevents WAW hazards.
If an instruction is in pending
issue, the scoreboard stalls the
instruction fetch unit.
Lifecycle of an instruction in the scoreboard (part 2)
State: Awaiting operands
An instruction remains in this state until
neither of its operand registers is
waiting to be written
by a functional unit.
Prevents RAW hazards.
Lifecycle of an instruction in the scoreboard (part 3)
State: Execution in progress
This state can last many cycles,
as functional units have long latency.
Lifecycle of an instruction in the scoreboard (part 4)
State: Execution has completed
Instructions may pass through this
state, unless there is an instruction
in Pending or Awaiting mode that
(1) preceded it in the instruction stream, and
(2) needs to read the register this
instruction plans to write.
Prevents WAR hazards.
What the
scoreboard
keeps score of.
The full status of each functional unit.
(1) Is it running an instruction? Which one?
(2) What are its source/destination registers?
(3) For each source: waiting/ready-to-read/read.
(4) For each source: who will be writing it?
For each register, which functional unit is planning
to write it?
Current state of all in-flight instructions.
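The bookkeeping listed above maps naturally onto two tables: per-unit status and a register-to-writer map. The sketch below is a simplified illustration, not the CDC 6600 design: it covers the issue check (structural and WAW) and the operand check (RAW), and leaves out the WAR check before write-back. All class, field, and unit names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UnitStatus:
    busy: bool = False
    op: Optional[str] = None
    dest: Optional[str] = None
    srcs: tuple = ()
    # For each source: the unit that will write it, or None if it is ready.
    src_writer: dict = field(default_factory=dict)


@dataclass
class Scoreboard:
    units: dict                                      # unit name -> UnitStatus
    reg_writer: dict = field(default_factory=dict)   # register -> unit name

    def can_issue(self, unit, dest):
        """Unit is free and no issued instruction will write `dest` (WAW)."""
        return not self.units[unit].busy and dest not in self.reg_writer

    def issue(self, unit, op, dest, srcs):
        """Record the instruction and which units its sources wait on."""
        writers = {s: self.reg_writer.get(s) for s in srcs}
        self.units[unit] = UnitStatus(True, op, dest, tuple(srcs), writers)
        self.reg_writer[dest] = unit

    def operands_ready(self, unit):
        """True when no source is still waiting on a functional unit (RAW)."""
        return all(w is None for w in self.units[unit].src_writer.values())

    def write_result(self, unit):
        """Free the unit and wake up instructions waiting on its result."""
        del self.reg_writer[self.units[unit].dest]
        self.units[unit] = UnitStatus()
        for other in self.units.values():
            for s, w in other.src_writer.items():
                if w == unit:
                    other.src_writer[s] = None


sb = Scoreboard(units={"div": UnitStatus(), "add": UnitStatus()})
sb.issue("div", "DIV", "R1", ["R2", "R3"])
print(sb.can_issue("add", "R1"))     # False: DIV already plans to write R1
sb.issue("add", "ADD", "R4", ["R1", "R5"])
print(sb.operands_ready("add"))      # False: R1 is still being computed
sb.write_result("div")
print(sb.operands_ready("add"))      # True
```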
Limitations of scoreboard control ...
If one accepts building a complicated machine,
there are better ways to do it.
Dynamic Scheduling: After spring break.
On Thursday
Midterm Review Lecture