PowerPoint Presentation - German University in Cairo

Computer Architecture
Lec 1: Introduction
Dr. Eng. Amr T. Abdel-Hamid
Spring 2011
Textbook slides: Computer Architecture: A Quantitative Approach, 4th Edition, John L. Hennessy & David A. Patterson, with modifications.
Computer Architecture
CSEN 601
Dr. Amr Talaat
Elect 707
CPU History in a Flash
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
• 125 mm² chip, 0.065 micron CMOS = 2312 RISC II+FPU+Icache+Dcache
– RISC II shrinks to ~ 0.02 mm² at 65 nm
– Caches via DRAM or 1 transistor SRAM
• Processor is the new transistor?
Instruction Set Architecture: Critical Interface
[Figure: software on top, instruction set as the interface in the middle, hardware below]
• Properties of a good abstraction
– Lasts through many generations (portability)
– Used in many different ways (generality)
– Provides convenient functionality to higher levels
– Permits an efficient implementation at lower levels
ISA vs. Computer Architecture
• Old definition of computer architecture = instruction set design
– Other aspects of computer design called implementation
– Insinuates implementation is uninteresting or less challenging
• Our view is: computer architecture >> ISA
• Architect’s job much more than instruction set design; technical hurdles today more challenging than those in instruction set design
Computer Architecture is Design and Analysis
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: loop between Design and Analysis; labels: Creativity, Cost / Performance Analysis, Good Ideas, Mediocre Ideas, Bad Ideas]
Administrivia
Course Focus
Understanding the design techniques, machine structures, technology factors, evaluation methods that will determine the form of computers in 21st Century
[Figure: Computer Architecture (Organization, Hardware/Software Boundary) at the center, surrounded by Technology, Applications, Parallelism, Operating Systems, Measurement & Evaluation, Programming Languages, Interface Design (ISA), Compilers, History]
Why Study Computer Architecture?
• Culture of anticipating and exploiting advances in technology
• Careful, quantitative comparisons
– Define, quantify, and summarize relative performance
– Define and quantify relative cost
– Define and quantify dependability
– Define and quantify power
• Culture of well-defined interfaces that are carefully implemented and thoroughly checked
• Quantitative Principles of Design
1. Take Advantage of Parallelism
2. Principle of Locality
3. Focus on the Common Case
4. Amdahl’s Law
5. The Processor Performance Equation
1) Taking Advantage of Parallelism
• Increasing throughput of server computer via multiple processors or multiple disks
• Detailed HW design (DSD course shortly)
– Carry lookahead adders use parallelism to speed up computing sums from linear to logarithmic in number of bits per operand
– Multiple memory banks searched in parallel in set-associative caches
• Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence.
– Not every instruction depends on its immediate predecessor ⇒ executing instructions completely/partially in parallel is possible
– Classic 5-stage pipeline:
1) Instruction Fetch (Ifetch),
2) Register Read (Reg),
3) Execute (ALU),
4) Data Memory Access (Dmem),
5) Register Write (Reg)
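As a rough illustration of why overlapping instructions reduces total time, the sketch below counts clock cycles for a sequence of instructions with and without pipelining. It assumes every stage takes exactly one cycle and that no stalls occur; the function names are illustrative and not from the slides.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Cycles when each instruction finishes all stages before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Cycles when a new instruction enters the pipeline every cycle (no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles; each later one adds one cycle.
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    for n in (1, 4, 100):
        print(n, unpipelined_cycles(n), pipelined_cycles(n))
```

Under these assumptions the time for a single instruction does not shrink; only the time for the whole sequence does.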
2) The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• Last 30 years, HW relied on locality for memory perf.
[Figure: P (processor), $ (cache), MEM (memory)]
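To make the effect of locality concrete, here is a minimal sketch of a direct-mapped cache that reports the hit rate for a stream of byte addresses. The cache geometry (64 blocks of 32 bytes) and the three access patterns are assumptions chosen for illustration, not values from the slides.

```python
def hit_rate(addresses, num_blocks=64, block_size=32):
    """Hit rate of a direct-mapped cache for a sequence of byte addresses."""
    stored = [None] * num_blocks          # block number held in each cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size        # which memory block the byte lives in
        index = block % num_blocks        # the single line that block may occupy
        if stored[index] == block:
            hits += 1
        else:
            stored[index] = block         # miss: fetch the block, evict the old one
    return hits / len(addresses)


# Spatial locality: walk an array of 4-byte elements in order.
sequential = [4 * i for i in range(4096)]
# Temporal locality: reuse the same 16 elements over and over (a small loop).
looped = [4 * (i % 16) for i in range(4096)]
# Little locality: jump 4 KB on every access, so every access maps to the same line.
strided = [4096 * i for i in range(4096)]

if __name__ == "__main__":
    for name, trace in (("sequential", sequential), ("looped", looped), ("strided", strided)):
        print(name, round(hit_rate(trace), 4))
```

In this sketch the sequential walk misses once per 32-byte block, the loop misses only on its first pass, and the strided pattern never hits.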
Levels of the Memory Hierarchy
Level: capacity; access time; cost
• CPU Registers: 100s Bytes; 300 – 500 ps (0.3-0.5 ns)
• L1 and L2 Cache: 10s-100s K Bytes; ~1 ns - ~10 ns; $1000s/GByte
• Main Memory: G Bytes; 80 ns - 200 ns; ~$100/GByte
• Disk: 10s T Bytes; 10 ms (10,000,000 ns); ~$1/GByte
• Tape: infinite; sec-min; ~$1/GByte
Staging / Xfer Unit between levels (managed by; size)
• Registers ↔ L1 Cache: Instr. Operands (prog./compiler; 1-8 bytes)
• L1 Cache ↔ L2 Cache: Blocks (cache cntl; 32-64 bytes)
• L2 Cache ↔ Memory: Blocks (cache cntl; 64-128 bytes)
• Memory ↔ Disk: Pages (OS; 4K-8K bytes)
• Disk ↔ Tape: Files (user/operator; Mbytes)
Upper levels are faster; lower levels are larger.
3) Focus on the Common Case
• Common sense guides computer design
– Since it's engineering, common sense is valuable
• In making a design trade-off, favor the frequent case over the infrequent case
– E.g., Instruction fetch and decode unit used more frequently than multiplier, so optimize it 1st
– E.g., If database server has 50 disks / processor, storage dependability dominates system dependability, so optimize it 1st
• Frequent case is often simpler and can be done faster than the infrequent case
– E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing more common case of no overflow
– May slow down overflow, but overall performance improved by optimizing for the normal case
• What is the frequent case, and how much is performance improved by making that case faster => Amdahl’s Law
4) Amdahl’s Law
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

Best you could ever hope to do:

$$\text{Speedup}_{maximum} = \frac{1}{1 - \text{Fraction}_{enhanced}}$$
Amdahl’s Law example
• New CPU 10X faster
• I/O bound server, so 60% time waiting for I/O

$$\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} = 1.56$$

• Apparently, it's human nature to be attracted by 10X faster, vs. keeping in perspective it's just 1.6X faster
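The arithmetic in this example can be checked with a short sketch of Amdahl's Law. The function names are illustrative; the inputs are the slide's own numbers (40% of time is CPU-bound and benefits from the 10X faster CPU).

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced: float) -> float:
    """Upper bound on overall speedup as the enhanced part becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    # 60% of time waits on I/O, so only 40% benefits from the faster CPU.
    print(round(amdahl_speedup(0.4, 10), 2))   # about 1.56
    print(round(amdahl_limit(0.4), 2))         # about 1.67
```

Even an infinitely fast CPU could not push this server beyond roughly 1.67X, because the I/O time is untouched.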
5) Processor performance equation
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

What affects each term (Clock Rate = 1 / Cycle time):

                Inst Count   CPI    Clock Rate
Program             X
Compiler            X        (X)
Inst. Set.          X         X
Organization                  X         X
Technology                              X
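A minimal sketch of the processor performance equation follows. The workload numbers in the usage lines are made up for illustration and do not come from the slides.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Seconds per program = instructions x cycles per instruction x seconds per cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


if __name__ == "__main__":
    # Hypothetical program: 1e9 instructions, CPI of 1.5, 2 GHz clock.
    print(cpu_time(1e9, 1.5, 2e9))
    # Halving CPI (an organization change) halves CPU time at the same clock rate.
    print(cpu_time(1e9, 0.75, 2e9))
```

Because the three terms multiply, an improvement in one only helps if it does not worsen another by the same factor.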
5 Steps of MIPS Datapath
Figure A.2, Page A-8
[Figure: MIPS datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled parts: Next PC, Adder (+4), Inst Memory (Address), Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU, Zero?, Data Memory, LMD, Next SEQ PC, WB Data]
5 Steps of MIPS Datapath
Figure A.3, Page A-9
[Figure: pipelined MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled parts: Next PC, Adder (+4), Memory (Address), Reg File (RS1, RS2), Sign Extend (Imm), MUXes, ALU, Zero?, Data Memory, RD carried through each stage, Next SEQ PC, WB Data]
5 Steps of MIPS Datapath
Figure A.3, Page A-9
• Data stationary control
– local decode for each instruction phase / pipeline stage
[Figure: pipelined MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back]
Visualizing Pipelining
Figure A.2, Page A-8
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles), Cycle 1 to Cycle 7. Vertical axis: Instr. Order. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the instruction before it]
Pipelining is not quite that easy!
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
One Memory Port/Structural Hazards
Figure A.4, Page A-14
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles), Cycle 1 to Cycle 7. Vertical axis: Instr. Order (Load, Instr 1, Instr 2, Instr 3, Instr 4). Each instruction passes through Ifetch, Reg, ALU, DMem, Reg; with one memory port, the Load's DMem access and the Ifetch of Instr 3 fall in the same cycle]
One Memory Port/Structural Hazards
(Similar to Figure A.5, Page A-15)
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles), Cycle 1 to Cycle 7. Vertical axis: Instr. Order (Load, Instr 1, Instr 2, Stall, Instr 3). The Stall row is a row of Bubbles, and Instr 3 starts its Ifetch one cycle later]
How do you “bubble” the pipe?
Speed Up Equation for Pipelining
$$\text{CPI}_{pipelined} = \text{Ideal CPI} + \text{Average stall cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}}$$

For simple RISC pipeline, CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}}$$
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)
= Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05))
= (Pipeline Depth/1.4) x 1.05
= 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
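The comparison can be reproduced with a small sketch of the pipeline speedup equation. The pipeline depth of 5 is an assumption (it cancels in the ratio), and `clock_ratio` is an illustrative name for unpipelined cycle time divided by pipelined cycle time.

```python
def pipeline_speedup(depth: float, stall_cpi: float,
                     clock_ratio: float = 1.0, ideal_cpi: float = 1.0) -> float:
    """Speedup over the unpipelined machine.

    clock_ratio is cycle_time_unpipelined / cycle_time_pipelined.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


if __name__ == "__main__":
    depth = 5                                    # assumed; any depth gives the same ratio
    # Machine A: dual-ported memory, no structural stalls.
    speedup_a = pipeline_speedup(depth, stall_cpi=0.0)
    # Machine B: every load (40% of instructions) stalls 1 cycle, clock is 1.05x faster.
    speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)
    print(round(speedup_a / speedup_b, 2))       # about 1.33
```

The 5% clock advantage of Machine B does not make up for stalling on 40% of its instructions.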
Data Hazard on R1
Figure A.6, Page A-17
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles), with stages IF, ID/RF, EX, MEM, WB. Vertical axis: Instr. Order for the five instructions above, each passing through Ifetch, Reg, ALU, DMem, Reg]
Three Generic Data Hazards
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
• Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
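The three cases can be told apart by comparing the destination and source registers of an earlier instruction I and a later instruction J. The sketch below is one simple way to do that; the `Instr` type and `dependences` function are illustrative names, and the check only names the dependence. Whether it becomes an actual hazard depends on the pipeline's read and write timing, as the slides note.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(i: Instr, j: Instr) -> List[str]:
    """Name the register dependences from earlier instruction i to later instruction j."""
    found = []
    if i.dest in j.srcs:
        found.append("RAW")   # j reads what i writes (true dependence)
    if j.dest in i.srcs:
        found.append("WAR")   # j overwrites what i still reads (anti-dependence)
    if i.dest == j.dest:
        found.append("WAW")   # both write the same register (output dependence)
    return found


if __name__ == "__main__":
    print(dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3"))))
    print(dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3"))))
    print(dependences(Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3"))))
```

The three calls use the I/J pairs from the RAW, WAR, and WAW slides in that order.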
Forwarding to Avoid Data Hazard
Figure A.7, Page A-19
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles). Vertical axis: Instr. Order for the five instructions above, each passing through Ifetch, Reg, ALU, DMem, Reg, with the result of add forwarded to the later instructions that read r1]
HW Change for Forwarding
Figure A.23, Page A-37
[Figure: forwarding hardware. Muxes at the ALU inputs select among Registers (via ID/EX), Immediate, NextPC, and values fed back from the EX/MEM and MEM/WB pipeline registers; Data Memory sits between EX/MEM and MEM/WB]
What circuit detects and resolves this hazard?
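One common answer to the question above is a forwarding unit that compares the source register of the instruction in EX against the destination registers held in the EX/MEM and MEM/WB pipeline registers. The sketch below follows that textbook-style rule; it is an assumption about the intended circuit, not something the slide spells out, and the function and parameter names are illustrative. Registers are plain integers and register 0 is treated as the hardwired zero of MIPS.

```python
def forward_select(src_reg: int,
                   ex_mem_rd: int, ex_mem_writes: bool,
                   mem_wb_rd: int, mem_wb_writes: bool) -> str:
    """Choose where one ALU operand should come from for the instruction in EX."""
    # The newest result wins: check the instruction one stage ahead first.
    if ex_mem_writes and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    # Otherwise take the result from the instruction two stages ahead.
    if mem_wb_writes and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    # No in-flight producer: the value read from the register file is current.
    return "REGFILE"


if __name__ == "__main__":
    # sub r4,r1,r3 in EX while add r1,r2,r3 is in MEM: r1 comes from EX/MEM.
    print(forward_select(1, ex_mem_rd=1, ex_mem_writes=True, mem_wb_rd=0, mem_wb_writes=False))
    # and r6,r1,r7 in EX: sub (writes r4) is in MEM, add (writes r1) is in WB.
    print(forward_select(1, ex_mem_rd=4, ex_mem_writes=True, mem_wb_rd=1, mem_wb_writes=True))
```

In hardware the same comparisons would drive the select lines of the muxes in front of the ALU.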
Forwarding to Avoid LW-SW Data Hazard
Figure A.8, Page A-20
add r1,r2,r3
lw r4, 0(r1)
sw r4,12(r1)
or r8,r6,r9
xor r10,r9,r11
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles). Vertical axis: Instr. Order for the five instructions above, each passing through Ifetch, Reg, ALU, DMem, Reg, with forwarding paths for r1 and r4]
Data Hazard Even with Forwarding
Figure A.9, Page A-21
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles). Vertical axis: Instr. Order for the four instructions above, each passing through Ifetch, Reg, ALU, DMem, Reg]
Data Hazard Even with Forwarding
(Similar to Figure A.10, Page A-21)
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
[Figure: pipeline diagram. Horizontal axis: Time (clock cycles). Vertical axis: Instr. Order for the four instructions above; a Bubble is inserted in sub, and, and or, delaying each by one cycle after the load]
How is this detected?
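One way such a load-use hazard could be detected is to check, while an instruction is being decoded, whether the instruction just ahead of it is a load whose destination matches one of its source registers. The sketch below shows that check; it is an assumed design in the usual textbook style, with illustrative names, integer register numbers, and register 0 treated as the hardwired zero.

```python
from typing import Iterable


def needs_load_use_stall(id_ex_is_load: bool, id_ex_dest: int,
                         if_id_srcs: Iterable[int]) -> bool:
    """True if the instruction in decode must wait one cycle for a load ahead of it."""
    return bool(id_ex_is_load and id_ex_dest != 0 and id_ex_dest in tuple(if_id_srcs))


if __name__ == "__main__":
    # lw r1,0(r2) is in EX while sub r4,r1,r6 is being decoded: stall one cycle.
    print(needs_load_use_stall(True, 1, (1, 6)))
    # add r1,r2,r3 ahead of sub r4,r1,r3: forwarding covers it, no stall needed.
    print(needs_load_use_stall(False, 1, (1, 3)))
```

When the check fires, the pipeline would hold the fetch and decode stages and send a bubble down the execute stage for one cycle.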
Control Hazard on Branches
Three Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Figure: pipeline diagram of the five instructions above, each passing through Ifetch, Reg, ALU, DMem, Reg]
What do you do with the 3 instructions in between?
How do you do it?
Where is the “commit”?
Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
– Move Zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
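The CPI figures on this slide follow from adding the average branch stall cycles to the base CPI. A minimal sketch, with an illustrative function name and the slide's numbers as inputs:

```python
def cpi_with_branch_stalls(base_cpi: float, branch_fraction: float,
                           penalty_cycles: float) -> float:
    """Effective CPI when every branch stalls the pipeline for penalty_cycles."""
    return base_cpi + branch_fraction * penalty_cycles


if __name__ == "__main__":
    # 30% branches with a 3-cycle stall: CPI rises from 1 to about 1.9.
    print(round(cpi_with_branch_stalls(1.0, 0.30, 3), 2))
    # Resolving the branch in ID/RF cuts the penalty to 1 cycle: CPI about 1.3.
    print(round(cpi_with_branch_stalls(1.0, 0.30, 1), 2))
```

This assumes every branch pays the full penalty, which matches the simple model on the slide.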
Pipelined MIPS Datapath
Figure A.24, page A-38
• Interplay of instruction set design and cycle time.
[Figure: pipelined MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. The Zero? test and an Adder for the branch target are placed in the Instr. Decode / Reg. Fetch stage]