Instruction Set Design Part II Intro. To Pipelining


ENGS 116 Lecture 4
Instruction Set Design Part II
Introduction to Pipelining
Vincent H. Berk
September 28, 2005
Reading for today: Chapter 2.1 – 2.12, Wulf article
Reading for Friday: Chapter A.1 – A.3, Patterson&Ditzel
Homework #1 tomorrow
Projects
• Teams of 2
• Two options:
– Research
– Programming
• Proposal due Wednesday 12th October:
– 2 pages
– Introduction to the problem, objectives
– Approach for solving the problem
– Expected working plan, hypothesis
– References to Literature
Projects
• Research Project:
– Exhaustive overview study of a particular topic.
– Research paper with a thesis and an argument (15-20 pages)
– Future vision
• Programming Project:
– Produce a simulator or a benchmark
– Use the produced software to test a thesis
– Present experimental results and analysis (Report)
Review: Instruction Set Design Parameters
• Operand storage in the CPU: Where are operands kept other than in
memory?
• Number of explicit operands named per instruction: How many
operands are named explicitly in a typical instruction?
• Operand location: Can any ALU operand be located in memory or
must some or all of the operands be in internal storage in the CPU? If
an operand is located in memory, how is the memory location
specified?
• Operations: What operations are provided in the instruction set?
• Type and size of operands: What is the type and size of each
operand and how is it specified?
Intel 8086
• Not truly general-purpose register machine because nearly every
register has dedicated use
• 16-bit architecture: internal registers are 16 bits
• 20-bit address space, broken into 64-KB fragments
• Variable-length instructions
• 8086 has 14 registers divided into 4 groups: data registers, address
registers, segment registers, and control registers
• Addressing modes: absolute (16-bit absolute address), register
indirect, based, indexed, and based indexed with displacement
• Operations: data movement, arithmetic and logic, control flow, string
• 80386: 32-bit architecture with 32-bit registers and 32-bit address
space, additional addressing modes and additional operations
• 80x86 is most successful instruction set architecture of all time
• Awkward, old architecture is barrier to improvements
Intel 80x86 Integer Registers
(32-bit names, bits 31 to 0: 80386, 80486, Pentium. 16-bit names, bits 15 to 0, with high byte 15 to 8 and low byte 7 to 0: 8086, 80286.)

Register  32-bit  16-bit  High/Low  Use
GPR 0     EAX     AX      AH / AL   Accumulator
GPR 1     ECX     CX      CH / CL   Count Reg: String, Loop
GPR 2     EDX     DX      DH / DL   Data Reg: Multiply, Divide
GPR 3     EBX     BX      BH / BL   Base Addr. Reg
GPR 4     ESP     SP                Stack Ptr.
GPR 5     EBP     BP                Base Ptr. (for base of stack seg.)
GPR 6     ESI     SI                Index Reg, String Source Ptr.
GPR 7     EDI     DI                Index Reg, String Dest. Ptr.
                  CS                Code Segment Ptr.
                  SS                Stack Segment Ptr. (top of stack)
                  DS                Data Segment Ptr.
                  ES                Extra Data Segment Ptr.
                  FS                Data Segment Ptr. 2
                  GS                Data Segment Ptr. 3
PC        EIP     IP                Instruction Ptr. (PC)
          FLAGS                     Condition Codes
Intel 80x86 Floating Point Registers
• FPR 0 through FPR 7: eight 80-bit registers (bits 79 to 0)
• Status: 16-bit register (bits 15 to 0) holding the top of FP stack
and the FP condition codes
80x86 Length Distribution
[Figure: bar chart of the percentage of instructions at each length, from 1 to 11 bytes, for Espresso, Gcc, Spice, and NASA7.]
Current Design Guidelines
• Use general-purpose registers with a load-store architecture
• Support these addressing modes: displacement, immediate, and
register deferred
• Use a minimalist instruction set
• Support simple, most-commonly used instructions
• Support standard data sizes and types: 8-, 16-, and 32-bit
integers and 64-bit IEEE 754 floating-point numbers
• Use fixed instruction encoding if interested in performance and
variable instruction encoding if interested in code size
• Provide at least 16 general-purpose registers plus separate
floating-point registers; 32 registers of each highly desirable
The Big Picture: The Performance Perspective
• Performance of a machine is determined by:
– Instruction count
– Clock cycle time
– Clock cycles per instruction
• Processor design (datapath and control) will determine:
– Clock cycle time
– Clock cycles per instruction
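The three factors combine multiplicatively: CPU time = instruction count × clock cycles per instruction × clock cycle time. A minimal Python sketch of that product follows; the function name and the sample numbers are illustrative assumptions, not values from the lecture.

```python
def cpu_time_seconds(instruction_count, cpi, clock_cycle_seconds):
    """CPU time = instruction count x cycles per instruction x clock cycle time."""
    return instruction_count * cpi * clock_cycle_seconds


# Illustrative values only (not from the lecture):
# 1 million instructions, CPI of 2, 1 ns clock cycle.
example = cpu_time_seconds(1_000_000, 2.0, 1e-9)
print(f"{example * 1e3:.1f} ms")  # 2.0 ms
```

Processor design can change only the last two arguments; instruction count is set by the instruction set and the compiler.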
Pipelining: It’s Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, and D run one after another in task order, each taking 30 + 40 + 20 minutes.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
Pipelined Laundry
Start work ASAP
[Figure: timeline starting at 6 PM. Loads A, B, C, and D overlap in task order; the segments are 30, 40, 40, 40, 40, and 20 minutes.]
• Pipelined laundry takes 3.5 hours for 4 loads
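The 6-hour and 3.5-hour figures can be checked with a short Python sketch. It assumes one machine per stage, loads processed in order, and that a finished load may wait between stages; the function names are hypothetical.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Finish time when a load enters a stage as soon as both are free.

    Assumes one machine per stage and loads processed in order.
    """
    stage_free_at = [0] * len(stage_minutes)  # when each stage next becomes idle
    finish = 0
    for _ in range(loads):
        ready = 0  # time this load is ready for its next stage
        for i, duration in enumerate(stage_minutes):
            start = max(ready, stage_free_at[i])
            ready = start + duration
            stage_free_at[i] = ready
        finish = ready
    return finish


stages = [30, 40, 20]  # washer, dryer, "folder" (minutes, from the slide)
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: the 40-minute dryer is the slowest stage and sets the rate.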
Pipelining Lessons
[Figure: pipelined laundry timeline from 6 PM for loads A, B, C, and D; the segments are 30, 40, 40, 40, 40, and 20 minutes.]
• Pipelining doesn’t help latency of a single task; it helps
throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce
speedup
Basic MIPS RISC Instruction Set
• All operations on data apply to data in registers
• Only operations that affect memory are load and store operations
that move data from memory to a register or to memory from a
register
• Instruction formats are few in number with all instructions typically
being one size
• 32 registers
• 3 classes of instructions: ALU, Load and Store, Branches and
jumps
Simple Implementation of the MIPS RISC
Instruction Set
• Instruction fetch cycle (IF)
– Send PC to memory
– Fetch current instruction from memory
– Update PC
• Instruction decode/register fetch cycle (ID)
– Decode instruction
– Read registers corresponding to register source specifiers from
register file (in parallel with decoding)
– Look for branch conditions, act accordingly
Simple Implementation of the MIPS RISC
Instruction Set
• Execution/effective address cycle (EX)
– ALU operates on operands prepared in the prior cycle and
performs one of three things:
– Memory reference: ALU adds base register and offset to form
effective address
– Register-register ALU instruction: ALU does operation specified
by ALU opcode on values read from register file
– Register-immediate ALU instruction: ALU does operation
specified by ALU opcode on first value read from register file and
sign-extended immediate
Simple Implementation of the MIPS RISC
Instruction Set
• Memory Access (MEM)
– Performs read using effective address if instruction is a load
– Performs write of data from second register read from register
file using effective address if instruction is a store
• Write-back Cycle (WB)
– Write to register file for either register-register ALU instruction or
load instruction
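The five cycles can be traced in a small Python sketch that pushes one instruction at a time through IF, ID, EX, MEM, and WB. This is a simplified model under stated assumptions, not the lecture's datapath: instructions are hypothetical dicts with made-up field names (`dst`, `src1`, `src2`, `imm`), the PC counts instructions rather than bytes, and branches are left out.

```python
def step(pc, program, regs, mem):
    """Run the instruction at pc through IF, ID, EX, MEM, WB; return the next pc."""
    # IF: fetch the current instruction and update the PC.
    instr = program[pc]
    next_pc = pc + 1  # assumption: PC indexes instructions, not bytes

    # ID: decode and read the source registers from the register file.
    op = instr["op"]
    a = regs[instr["src1"]]
    b = regs[instr.get("src2", 0)]
    imm = instr.get("imm", 0)

    # EX: the ALU does one of three things.
    if op in ("lw", "sw"):
        alu_out = a + imm  # memory reference: base register + offset
    elif op == "add":
        alu_out = a + b    # register-register ALU instruction
    elif op == "addi":
        alu_out = a + imm  # register-immediate ALU instruction
    else:
        raise ValueError(f"unsupported op: {op}")

    # MEM: only loads and stores access memory.
    loaded = None
    if op == "lw":
        loaded = mem[alu_out]
    elif op == "sw":
        mem[alu_out] = b   # data comes from the second register read in ID

    # WB: ALU instructions and loads write the register file.
    if op in ("add", "addi"):
        regs[instr["dst"]] = alu_out
    elif op == "lw":
        regs[instr["dst"]] = loaded
    return next_pc


regs = [0] * 32
mem = {}
program = [
    {"op": "addi", "dst": 1, "src1": 0, "imm": 5},   # r1 = r0 + 5
    {"op": "sw", "src1": 0, "src2": 1, "imm": 100},  # mem[100] = r1
    {"op": "lw", "dst": 2, "src1": 0, "imm": 100},   # r2 = mem[100]
    {"op": "add", "dst": 3, "src1": 1, "src2": 2},   # r3 = r1 + r2
]
pc = 0
while pc < len(program):
    pc = step(pc, program, regs, mem)
print(regs[1:4])  # [5, 5, 10]
```

Here each instruction finishes all five cycles before the next begins, which is the unpipelined baseline that pipelining overlaps.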
Example
Consider a nonpipelined machine with 5 execution steps of lengths
50 ns, 50 ns, 60 ns, 50 ns, and 50 ns. Due to clock skew and setup,
pipelining adds 5 ns of overhead to each instruction stage. Ignoring
latency impact, how much speedup in the instruction execution rate
will we gain from a pipeline?
Sequential Execution
[Figure: three instructions back to back, each taking 50 + 50 + 60 + 50 + 50 = 260 ns.]
Pipelined Execution
[Figure: overlapped instructions; every stage is stretched to 60 ns plus 5 ns of overhead, so one instruction completes every 65 ns.]
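The arithmetic behind the 260 ns and 65 ns figures, as a short Python sketch. It assumes the pipelined clock is set by the slowest stage plus the overhead, and that one instruction completes per cycle with no stalls; the variable names are illustrative.

```python
stage_ns = [50, 50, 60, 50, 50]  # unpipelined step lengths from the example
overhead_ns = 5                  # clock skew and setup added to each stage

unpipelined_instr_ns = sum(stage_ns)              # 260 ns per instruction
pipelined_cycle_ns = max(stage_ns) + overhead_ns  # 65 ns: slowest stage + overhead
speedup = unpipelined_instr_ns / pipelined_cycle_ns

print(unpipelined_instr_ns, pipelined_cycle_ns, speedup)  # 260 65 4.0
```

The speedup is 4, short of the ideal 5 for a five-stage pipeline, because the stages are unbalanced and pipelining adds overhead.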
It’s Not That Easy for Computers
• Limits to pipelining: Hazards prevent next instruction from executing
during its designated clock cycle
– Structural hazards: Hardware cannot support this combination of
instructions
– Data hazards: Instruction depends on result of prior instruction
still in pipeline
– Control hazards: Pipelining of branches and other instructions
that change the PC
• Common solution is to stall the pipeline until the hazard “bubbles”
through the pipeline
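As a rough illustration of the data-hazard case, the Python sketch below flags instructions that read a register written by a recent earlier instruction. The tuple representation, the function name, and the `max_distance` parameter are all assumptions; how close two instructions must be to actually cause a stall depends on the pipeline and on any forwarding hardware.

```python
def find_data_dependences(instrs, max_distance=2):
    """Return (producer_index, consumer_index, register) triples.

    instrs: list of (dest_register_or_None, [source_registers]) tuples.
    max_distance: how many instructions back a producer may still be in
    the pipeline (hypothetical; the real value is pipeline-specific).
    """
    found = []
    for j, (_, sources) in enumerate(instrs):
        for i in range(max(0, j - max_distance), j):
            dest = instrs[i][0]
            if dest is not None and dest in sources:
                found.append((i, j, dest))
    return found


program = [
    ("r1", ["r2", "r3"]),  # add r1, r2, r3
    ("r4", ["r1", "r5"]),  # sub r4, r1, r5  (reads r1 from instruction 0)
    (None, ["r4", "r6"]),  # sw  r4, 0(r6)   (reads r4 from instruction 1)
]
print(find_data_dependences(program))  # [(0, 1, 'r1'), (1, 2, 'r4')]
```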
Speed Up Equation for Pipelining
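$$
\text{Speedup from pipelining}
= \frac{\text{Avg. Instr. Time}_{\text{unpipelined}}}{\text{Avg. Instr. Time}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
$$

$$
\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}}
$$

$$
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
$$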
Speed Up Equation for Pipelining
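$$
\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}
$$

$$
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
$$

With Ideal CPI = 1:

$$
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
$$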
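The speedup equation translates directly into a few lines of Python. The function and parameter names are hypothetical, and the sample numbers are chosen only for illustration.

```python
def pipeline_speedup(depth, stall_cpi, cycle_unpipelined, cycle_pipelined,
                     ideal_cpi=1.0):
    """Speedup = (Ideal CPI x depth / CPI pipelined) x clock cycle ratio."""
    cpi_pipelined = ideal_cpi + stall_cpi
    return (ideal_cpi * depth / cpi_pipelined) * (cycle_unpipelined / cycle_pipelined)


# Illustrative values only: 5 stages, 0.25 stall cycles per instruction,
# and equal clock cycle times.
print(pipeline_speedup(5, 0.25, 1.0, 1.0))  # 4.0
```

With no stalls and equal clock cycles, the result reduces to the pipeline depth, which is the ideal speedup.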