Computer Architecture - Princess Sumaya University for Technology
Princess Sumaya University for Technology
Computer Architecture
Dr. Esam Al_Qaralleh
Review
computer architecture ~ PSUT
The Von Neumann Machine, 1945
The Von Neumann model consists of five
major components:
input unit
output unit
ALU
memory unit
control unit.
Sequential Execution
Von Neumann Model
A refinement of the Von Neumann model, the system bus model
has a CPU (ALU and control), memory, and an input/output unit.
Communication among components is handled by a shared
pathway called the system bus, which is made up of the data
bus, the address bus, and the control bus. There is also a power
bus, and some architectures may also have a separate I/O bus.
Performance
Both Hardware and Software affect
performance:
Algorithm determines the number of source-level statements
Language/Compiler/Architecture determine the number of machine instructions
Processor/Memory determine how fast instructions are executed
Computer Architecture
Instruction Set Architecture - ISA refers to
the actual programmer-visible machine
interface such as instruction set, registers,
memory organization and exception
handling. Two main approaches: RISC and
CISC architectures.
Applications Change over Time
Data-sets & memory requirements grow larger
Cache & memory architecture become more critical
Standalone → networked
I/O integration & system software become more critical
Single task → multiple tasks
Parallel architectures become critical
Limited I/O requirements → rich I/O requirements
60s: tapes & punch cards
70s: character oriented displays
80s: video displays, audio, hard disks
90s: 3D graphics; networking, high-quality audio
00s: real-time video, immersion, …
Application Properties to
Exploit in Computer Design
Locality in memory/IO references
Programs work on subset of instructions/data at any point in time
Both spatial and temporal locality
Parallelism
Data-level (DLP): same operation on every element of a data
sequence
Instruction-level (ILP): independent instructions within sequential
program
Thread-level (TLP): parallel tasks within one program
Multi-programming: independent programs
Pipelining
Predictability
Control-flow direction, memory references, data values
Levels of Machines
There are a number of levels in a computer,
from the user level down to the transistor level.
How Do the Pieces Fit Together?
Application
Operating System
Compiler    Firmware
---- Instruction Set Architecture ----
Instr. Set Proc.   Memory system   I/O system
Datapath & Control
Digital Design
Circuit Design
(The ISA is the interface between the software layers (application, operating system, compiler, firmware) and the hardware layers (instruction set processor, memory system, I/O system, datapath & control, digital design, circuit design).)
Instruction Set Architecture (ISA)
Complex Instruction Set (CISC)
Single instructions for complex tasks (string
search, block move, FFT, etc.)
Usually have variable length instructions
Registers have specialized functions
Reduced Instruction Set (RISC)
Instructions for simple operations only
Usually fixed length instructions
Large orthogonal register sets
RISC Architecture
RISC designers focused on two critical
performance techniques in computer
design:
the exploitation of instruction-level
parallelism, first through pipelining and later
through multiple instruction issue,
the use of cache, first in simple forms and
later using sophisticated organizations and
optimizations.
RISC ISA Characteristics
All operations on data apply to data in registers and
typically change the entire register;
The only operations that affect memory are load and
store operations that move data from memory to a
register or to memory from a register, respectively;
A small number of memory addressing modes;
The instruction formats are few in number with all
instructions typically being one size;
Large number of registers;
These simple properties lead to dramatic
simplifications in the implementation of advanced
pipelining techniques, which is why RISC architecture
instruction sets were designed this way.
Performance & Cost
Computer Designers and Chip Costs
The computer designer affects die size,
and hence cost, both by what functions
are included on or excluded from the die
and by the number of I/O pins
Measuring and Reporting Performance
Performance
Time to do the task (Execution Time)
– execution time, response time, latency
Tasks per day, hour, week, sec, ns, … (Performance)
– performance, throughput, bandwidth
Response time– the time between the start and the completion of a task
Thus, to maximize performance, need to minimize execution time
performance_X = 1 / execution_time_X
If X is n times faster than Y, then
performance_X / performance_Y = execution_time_Y / execution_time_X = n
Throughput – the total amount of work done in a given time
Important to data center managers
Decreasing response time almost always improves throughput
Calculating CPU Performance
Want to distinguish elapsed time and the time spent on
our task
CPU execution time (CPU time) – time the CPU spends
working on a task
Does not include time waiting for I/O or running other
programs
CPU_Time = CPU_clock_cycles_for_a_program × Clock_cycle_time
OR
CPU_Time = CPU_clock_cycles_for_a_program / Clock_rate
Can improve performance by reducing either the length
of the clock cycle or the number of clock cycles required
for a program
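As a quick sketch (with made-up numbers, not from the slides), the two equivalent forms of the CPU-time equation can be checked in a few lines of Python:

```python
# Two equivalent ways to compute CPU time, per the formulas above.
def cpu_time_from_cycle_time(clock_cycles, cycle_time_s):
    # CPU_Time = clock cycles for the program * clock cycle time
    return clock_cycles * cycle_time_s

def cpu_time_from_clock_rate(clock_cycles, clock_rate_hz):
    # CPU_Time = clock cycles for the program / clock rate
    return clock_cycles / clock_rate_hz

# Hypothetical program: 4,000,000 cycles on a 2 GHz clock (0.5 ns cycle time)
cycles = 4_000_000
print(cpu_time_from_cycle_time(cycles, 0.5e-9))  # ≈ 0.002 s
print(cpu_time_from_clock_rate(cycles, 2e9))     # ≈ 0.002 s
```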
Calculating CPU Performance (Cont.)
We tend to count instructions executed = IC
Note looking at the object code is just a start
What we care about is the dynamic count - e.g. don’t
forget loops, recursion, branches, etc.
CPI (Clock Per Instruction) is a figure of merit
CPI = CPU_clock_cycles_for_a_program / IC
CPU_Time = IC × CPI × Clock_cycle_time = IC × CPI / Clock_rate
Calculating CPU Performance (Cont.)
3 Focus Factors -- Cycle Time, CPI, IC
Sadly - they are interdependent and making one better often
makes another worse (but small or predictable impacts)
• Cycle time depends on HW technology and organization
• CPI depends on organization (pipeline, caching...) and ISA
• IC depends on ISA and compiler technology
Often CPI’s are easier to deal with on a per instruction
basis
CPU_clock_cycles = Σ (i = 1 to n) CPI_i × IC_i
Overall_CPI = [Σ (i = 1 to n) CPI_i × IC_i] / Instruction_count = Σ (i = 1 to n) CPI_i × (IC_i / Instruction_count)
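A small Python sketch of the weighted-CPI formula above (the instruction mix is invented for illustration):

```python
def overall_cpi(classes):
    # classes: list of (CPI_i, IC_i) pairs
    total_ic = sum(ic for _, ic in classes)
    total_cycles = sum(cpi * ic for cpi, ic in classes)
    return total_cycles / total_ic

# Hypothetical mix: 50 ALU ops (CPI 1), 30 loads (CPI 2), 20 branches (CPI 3)
mix = [(1, 50), (2, 30), (3, 20)]
print(overall_cpi(mix))  # (50 + 60 + 60) / 100 = 1.7
```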
Example of Computing CPU time
If a computer has a clock rate of 50 MHz, how long
does it take to execute a program with 1,000
instructions, if the CPI for the program is 3.5?
Using the equation
CPU time = Instruction count x CPI / clock rate
gives
CPU time = 1000 x 3.5 / (50 x 10^6) = 70 μs
If a computer’s clock rate increases from 200 MHz to
250 MHz and the other factors remain the same, how
many times faster will the computer be?
CPU time_old / CPU time_new = clock rate_new / clock rate_old = 250 MHz / 200 MHz = 1.25
Evaluating ISAs
Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?
Static Metrics:
How many bytes does the program occupy in memory?
Dynamic Metrics:
How many instructions are executed? How many bytes does the
processor fetch to execute the program?
CPI
How many clocks are required per instruction?
Best Metric: Time to execute the program!
Time = Inst. Count × CPI × Cycle Time; these depend on the instruction set, the processor organization, and compilation techniques.
Quantitative Principles of Computer Design
Amdahl’s Law
Defines speedup gained from a particular feature
Depends on 2 factors
Fraction of original computation time that can take
advantage of the enhancement - e.g. the commonality
of the feature
Level of improvement gained by the feature
Amdahl’s law is a quantification of the diminishing-returns principle
Amdahl's Law (Cont.)
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
Speedup(E) = 1 / ((1 − F) + F / S)
Simple Example
Important Application:
FPSQRT: 20% of the time
FP instructions: 50%
Other: 30%
(Note: Amdahl’s Law says nothing about cost)
Designers say same cost to speedup:
FPSQRT by 40x
FP by 2x
Other by 8x
Which one should you invest in?
Straightforward: plug in the numbers & compare
BUT what’s your guess?
And the Winner Is…?
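The answer figure is missing from the transcript, but Amdahl’s Law applied to the percentages on the previous slide recovers it:

```python
# Speedup(E) = 1 / ((1 - F) + F / S), applied to the three design options.
def amdahl(fraction, speedup):
    return 1 / ((1 - fraction) + fraction / speedup)

options = {
    "FPSQRT by 40x": amdahl(0.20, 40),
    "FP by 2x":      amdahl(0.50, 2),
    "Other by 8x":   amdahl(0.30, 8),
}
for name, s in sorted(options.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
# Speeding up "Other" by 8x gives the largest overall speedup (~1.356),
# ahead of FP by 2x (~1.333) and FPSQRT by 40x (~1.242).
```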
Example of Amdahl’s Law
Floating point instructions are improved to run twice as fast, but only
10% of the time was spent on these instructions originally. How
much faster is the new machine?
Speedup = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Speedup = 1 / ((1 − 0.1) + 0.1/2) = 1.053
° The new machine is 1.053 times as fast, or 5.3% faster.
° How much faster would the new machine be if floating point
instructions become 100 times faster?
Speedup = 1 / ((1 − 0.1) + 0.1/100) = 1.109
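Both answers can be checked directly; this helper simply restates the slide’s formula:

```python
def amdahl(fraction_enhanced, speedup_enhanced):
    # Speedup = 1 / ((1 - F) + F / S)
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(f"{amdahl(0.10, 2):.3f}")    # 1.053
print(f"{amdahl(0.10, 100):.3f}")  # 1.110 (the slide truncates to 1.109)
```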
Estimating Performance Improvements
Assume a processor currently requires 10 seconds to
execute a program and processor performance
improves by 50 percent per year.
By what factor does processor performance improve in
5 years?
(1 + 0.5)^5 = 7.59
How long will it take a processor to execute the
program after 5 years?
ExTimenew = 10/7.59 = 1.32 seconds
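Both answers follow from compound growth; a two-line check:

```python
# 50% performance improvement per year, compounded over 5 years
factor = (1 + 0.5) ** 5
print(f"{factor:.2f}")       # 7.59
print(f"{10 / factor:.2f}")  # 1.32 seconds
```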
Performance Example
Computers M1 and M2 are two implementations of the
same instruction set.
M1 has a clock rate of 50 MHz and M2 has a clock rate of
75 MHz.
M1 has a CPI of 2.8 and M2 has a CPI of 3.2 for a given
program.
How many times faster is M2 than M1 for this program?
ExTime_M1 / ExTime_M2 = (IC_M1 × CPI_M1 / Clock Rate_M1) / (IC_M2 × CPI_M2 / Clock Rate_M2) = (2.8/50) / (3.2/75) = 1.31
What would the clock rate of M1 have to be for them to have the same execution time?
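Plugging the slide’s numbers into the execution-time ratio (IC cancels since both machines run the same program); the last line answers the slide’s follow-up question:

```python
cpi_m1, clock_m1 = 2.8, 50e6   # M1
cpi_m2, clock_m2 = 3.2, 75e6   # M2

# ExTime_M1 / ExTime_M2 = (CPI_M1 / rate_M1) / (CPI_M2 / rate_M2)
ratio = (cpi_m1 / clock_m1) / (cpi_m2 / clock_m2)
print(f"{ratio:.2f}")  # 1.31

# Equal execution times require CPI_M1 / rate_M1 = CPI_M2 / rate_M2:
rate_needed = cpi_m1 * clock_m2 / cpi_m2
print(rate_needed / 1e6)  # 65.625 MHz
```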
Simple Example
Suppose we have made the following
measurements:
Frequency of FP operations (other than FPSQR)
=25%
Average CPI of FP operations=4.0
Average CPI of other instructions=1.33
Frequency of FPSQR=2%
CPI of FPSQR=20
Two design alternatives
Reduce the CPI of FPSQR to 2
Reduce the average CPI of all FP operations to 2
And The Winner is…
CPI_original = Σ CPI_i × (IC_i / Instruction_count) = (4 × 25%) + (1.33 × 75%) = 2.0
CPI_with_new_FPSQR = CPI_original − 2% × (CPI_old_FPSQR − CPI_new_FPSQR_only) = 2.0 − 2% × (20 − 2) = 1.64
CPI_new_FP = (75% × 1.33) + (25% × 2.0) = 1.5
Reducing the average CPI of all FP operations wins: 1.5 < 1.64.
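The same comparison in Python, with the percentages as given on the previous slide:

```python
freq_fp, cpi_fp = 0.25, 4.0       # FP operations
cpi_other = 1.33                  # everything else
freq_fpsqr, cpi_fpsqr = 0.02, 20  # FPSQR

cpi_original = freq_fp * cpi_fp + (1 - freq_fp) * cpi_other
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2)  # option 1
cpi_new_fp = (1 - freq_fp) * cpi_other + freq_fp * 2.0       # option 2

print(cpi_original)   # ≈ 2.0
print(cpi_new_fpsqr)  # ≈ 1.64
print(cpi_new_fp)     # ≈ 1.5, so lowering all FP CPIs wins
```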
Instruction Set
Architecture
(ISA)
Outline
Introduction
Classifying instruction set architectures
Instruction set measurements
Memory addressing
Addressing modes for signal processing
Type and size of operands
Operations in the instruction set
Operations for media and signal processing
Instructions for control flow
Encoding an instruction set
MIPS architecture
Instruction Set Principles and Examples
Basic Issues in Instruction Set Design
What operations and How many
Load/store/Increment/branch are sufficient to do any
computation, but not useful (programs too long!!).
How (many) operands are specified?
Most operations are dyadic (e.g., A ← B + C); some are monadic (e.g., A ← –B).
How to encode them into instruction format?
Instructions should be multiples of Bytes.
Typical Instruction Set
32-bit word
Basic operand addresses are 32-bit long.
Basic operands (like integer) are 32-bit long.
In general, an instruction could refer to 3 operands (A ← B + C).
Challenge: Encode operations in a small number of
bits.
Brief Introduction to ISA
Instruction Set Architecture: a set of instructions
Each instruction is directly executed by the CPU’s hardware
How is it represented?
By a binary format since the hardware understands only bits
| opcode (6 bits) | rs (5) | rt (5) | immediate (16) |
Options - fixed or variable length formats
Fixed - each instruction encoded in same size field (typically 1
word)
Variable – half-word, whole-word, multiple word instructions are
possible
What Must be Specified?
Instruction Format (encoding)
How is it decoded?
Location of operands and result
Where other than memory?
How many explicit operands?
How are memory operands located?
Data type and Size
Operations
What are supported?
Classifying Instruction Set Architecture
Instruction Set Design
CPU_Time = IC × CPI × Cycle_time
The instruction set influences everything
Instruction Characteristics
Usually a simple operation
Which operation is identified by the op-code field
But operations require operands - 0, 1, or 2
To identify where they are, they must be addressed
• Address is to some piece of storage
• Typical storage possibilities are main memory, registers, or a stack
2 options: explicit or implicit addressing
Implicit - the op-code implies the address of the operands
• ADD on a stack machine - pops the top 2 elements of the stack,
then pushes the result
• HP calculators work this way
Explicit - the address is specified in some field of the instruction
• Note the potential for 3 addresses - 2 operands + the destination
Operand Locations for Four ISA Classes
C = A + B

Stack:
Push A
Push B
Add     (pop the top-2 values of the stack (A, B) and push the result value onto the stack)
Pop C

Accumulator (AC):
Load A
Add B   (add AC (A) with B and store the result into AC)
Store C

Register (register-memory):
Load R1, A
Add R3, R1, B
Store R3, C

Register (load-store):
Load R1, A
Load R2, B
Add R3, R1, R2
Store R3, C
Modern Choice – Load-store Register
(GPR) Architecture
Reasons for choosing GPR (general-purpose registers)
architecture
Registers (stacks and accumulators…) are faster than memory
Registers are easier and more effective for a compiler to use
• (A+B) – (C*D) – (E*F)
– May be evaluated in any order (for pipelining concerns or …)
» But on a stack machine it must be evaluated left to right
Registers can be used to hold variables
• Reduce memory traffic
• Speed up programs
• Improve code density (fewer bits are used to name a register)
Compiler writers prefer that all registers be equivalent
and unreserved
The number of GPR: at least 16
Memory Addressing
Memory Addressing Basics
All architectures must address memory
What is accessed - byte, word, multiple words?
Today’s machine are byte addressable
Main memory is organized in 32 - 64 byte lines
Big-Endian or Little-Endian addressing
Hence there is a natural alignment problem
An access of size s bytes at byte address A is aligned if
A mod s = 0
Misaligned access takes multiple aligned memory
references
Memory addressing mode influences instruction
counts (IC) and clock cycles per instruction (CPI)
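The alignment rule above is a one-liner; a minimal sketch:

```python
def is_aligned(address, size):
    # An access of `size` bytes at byte address A is aligned iff A mod size == 0
    return address % size == 0

print(is_aligned(8, 4))   # True  (4-byte word access at address 8)
print(is_aligned(6, 4))   # False (misaligned word access at address 6)
```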
Big-Endian and Little-Endian Assignments
Big-Endian: lower byte addresses are used for the more significant bytes of the word
Little-Endian: the opposite ordering; lower byte addresses are used for the less significant bytes of the word

[Figure: byte and word addressing for 4-byte words at word addresses 0, 4, …, 2^k − 4. Big-endian numbers the bytes of word 0 as 0, 1, 2, 3 starting from the most significant byte; little-endian numbers the same bytes 3, 2, 1, 0.]
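Python’s struct module makes the two byte orders concrete (an illustration, not from the slides):

```python
import struct

word = 0x01020304  # a 4-byte word
big = struct.pack(">I", word)     # big-endian: MSB at the lowest address
little = struct.pack("<I", word)  # little-endian: LSB at the lowest address

print(big.hex())     # 01020304
print(little.hex())  # 04030201
```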
Addressing Modes
Immediate:          Add R4, #3    ; Regs[R4] ← Regs[R4] + 3
Register:           Add R4, R3    ; Regs[R4] ← Regs[R4] + Regs[R3]
Register Indirect:  Add R4, (R1)  ; Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
Addressing Modes (Cont.)
Direct:             Add R4, (1001)  ; Regs[R4] ← Regs[R4] + Mem[1001]
Memory Indirect:    Add R4, @(R3)   ; Regs[R4] ← Regs[R4] + Mem[Mem[Regs[R3]]]
Addressing Modes (Cont.)
Displacement:       Add R4, 100(R1)      ; Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]]
Scaled:             Add R1, 100(R2)[R3]  ; Regs[R1] ← Regs[R1] + Mem[100 + Regs[R2] + Regs[R3] × d]
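A toy interpreter for the modes above, with a hypothetical register file and memory (values chosen only to make the lookups visible):

```python
regs = {"R1": 100, "R2": 10, "R3": 200}
mem = {100: 7, 110: 8, 200: 100, 1001: 9}

def operand(mode, r=None, imm=None, addr=None, disp=None):
    if mode == "immediate":
        return imm                   # value is in the instruction itself
    if mode == "register":
        return regs[r]               # value is in a register
    if mode == "register_indirect":
        return mem[regs[r]]          # register holds the address
    if mode == "direct":
        return mem[addr]             # instruction holds the address
    if mode == "memory_indirect":
        return mem[mem[regs[r]]]     # memory holds the address
    if mode == "displacement":
        return mem[disp + regs[r]]   # constant offset + register
    raise ValueError(mode)

print(operand("immediate", imm=3))                # 3
print(operand("register", r="R3"))                # 200
print(operand("register_indirect", r="R1"))       # 7 = Mem[100]
print(operand("direct", addr=1001))               # 9 = Mem[1001]
print(operand("memory_indirect", r="R3"))         # 7 = Mem[Mem[200]]
print(operand("displacement", r="R2", disp=100))  # 8 = Mem[100 + 10]
```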
Typical Address Modes (I)
Typical Address Modes (II)
Operand Type & Size
Typical types: assume word= 32 bits
Character - byte - ASCII or EBCDIC (IBM) - 4
per word
Short integer - 2- bytes, 2’s complement
Integer - one word - 2’s complement
Float - one word - usually IEEE 754 these
days
Double precision float - 2 words - IEEE 754
BCD or packed decimal - 4- bit values packed
8 per word
ALU Operations
What Operations are Needed
Arithmetic + Logical
Integer arithmetic: ADD, SUB, MULT, DIV, SHIFT
Logical operation: AND, OR, XOR, NOT
Data Transfer - copy, load, store
Control - branch, jump, call, return, trap
System - OS and memory management
We’ll ignore these for now - but remember they are needed
Floating Point
Same as arithmetic but usually take bigger operands
Decimal
String - move, compare, search
Graphics – pixel and vertex,
compression/decompression operations
Top 10 Instructions for 80x86
load: 22%
conditional branch: 20%
compare: 16%
store: 12%
add: 8%
and: 6%
sub: 5%
move register-register: 4%
call: 1%
return: 1%
The most widely
executed instructions
are the simple
operations of an
instruction set
The top-10
instructions for 80x86
account for 96% of
instructions executed
Make them fast, as
they are the common
case
Control Instructions are a Big Deal
Jumps - unconditional transfer
Conditional Branches
How is condition code set? – by flag or part of the
instruction
How is target specified? How far away is it?
Calls
How is target specified? How far away is it?
Where is return address kept?
How are the arguments passed? Callee vs. Caller
save!
Returns
Where is the return address? How far away is it?
How are the results passed?
Branch Address Specification
Known at compile time for unconditional and
conditional branches - hence specified in the
instruction
As a register containing the target address
As a PC-relative offset
Consider word length addresses, registers, and
instructions
Full address desired? Then pick the register option.
• BUT - setup and effective address will take longer.
If you can deal with smaller offset then PC relative
works
• PC relative is also position independent, which keeps the linker's job simple
Returns and Indirect Jumps
Branch target is not known at compile time
Need a way to specify the target
dynamically
Use a register
Permit any addressing mode
Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
Also useful for
case or switch
Dynamically shared libraries
High-order functions or function pointers
Encoding an Instruction Set
Encoding the ISA
Encode instructions into a binary representation for
execution by CPU
Can pick anything but:
Affects the size of code - so it should be tight
Affects the CPU design - in particular the instruction decode
So it may have a big influence on the CPI or cycle-time
Must balance several competing forces
Desire for lots of addressing modes and registers
Desire to make average program size compact
Desire to have instructions encoded into lengths that will be easy
to handle in a pipelined implementation (multiple of bytes)
3 Popular Encoding Choices
Variable (compact code but difficult to encode)
Primary opcode is fixed in size, but opcode modifiers may exist
Opcode specifies number of arguments - each used as address fields
Best when there are many addressing modes and operations
Use as few bits as possible, but individual instructions can vary widely in
length
e. g. VAX - integer ADD versions vary between 3 and 19 bytes
Fixed (easy to encode, but lengthy code)
Every instruction looks the same - some field may be interpreted
differently
Combine the operation and the addressing mode into the opcode
e. g. all modern RISC machines
Hybrid
Set of fixed formats
e. g. IBM 360 and Intel 80x86
Trade-off between size of program vs. ease of decoding
3 Popular Encoding Choices (Cont.)
An Example of Variable Encoding -- VAX
addl3 r1, 737(r2), (r3): 32-bit integer add
instruction with 3 operands need 6 bytes to
represent it
Opcode for addl3: 1 byte
A VAX address specifier is 1 byte (4-bits: addressing
mode, 4-bits: register)
• r1: 1 byte (register addressing mode + r1)
• 737(r2)
– 1 byte for address specifier (displacement addressing + r2)
– 2 bytes for displacement 737
• (r3): 1 byte for address specifier (register indirect + r3)
Length of VAX instructions: 1—53 bytes
Short Summary – Encoding the
Instruction Set
Choice between variable and fixed
instruction encoding
If code size matters more than performance → variable encoding
If performance matters more than code size → fixed encoding
Role of Compilers
Critical goals in ISA from the compiler
viewpoint
What features will lead to high-quality code
What makes it easy to write efficient
compilers for an architecture
Compiler and ISA
ISA decisions are no longer driven by making assembly-language programming easy
Due to HLL, ISA is a compiler target today
Performance of a computer will be significantly
affected by compiler
Understanding compiler technology today is
critical to designing and efficiently implementing
an instruction set
Architecture choice affects the code quality and
the complexity of building a compiler for it
Optimization Observations
Hard to reduce branches
Biggest reduction is often memory
references
Some ALU operation reduction happens
but it is usually a few %
Implication:
Branch, Call, and Return become a larger
relative % of the instruction mix
Control instructions among the hardest to
speed up
The MIPS Architecture
MIPS Instruction Format
Encode addressing mode into the opcode
All instructions are 32 bits with 6-bit
primary opcode
MIPS Instruction Format (Cont.)
I-Type Instruction
| opcode (6 bits) | rs (5) | rt (5) | immediate (16) |
Loads and Stores
LW R1, 30(R2) ; S.S F0, 40(R4)
ALU ops on immediates
DADDIU R1, R2, #3
rt ← rs op immediate
Conditional branches
BEQZ R3, offset
rs is the register checked; rt is unused; immediate specifies the offset
Jump register, jump and link register
JR R3
rs is the target register; rt and immediate are unused
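The 6/5/5/16 split can be expressed with shifts and masks; a sketch following the field layout above (0x23 is the MIPS LW opcode):

```python
def encode_itype(opcode, rs, rt, imm):
    # | opcode (6) | rs (5) | rt (5) | immediate (16) |
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def decode_itype(word):
    return ((word >> 26) & 0x3F,   # opcode
            (word >> 21) & 0x1F,   # rs
            (word >> 16) & 0x1F,   # rt
            word & 0xFFFF)         # immediate

w = encode_itype(0x23, 2, 1, 30)  # roughly: LW R1, 30(R2)
print(hex(w))
print(decode_itype(w))  # (35, 2, 1, 30)
```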
MIPS Instruction Format (Cont.)
R-Type Instruction
| opcode (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6) |
Register-register ALU operations: rd ← rs funct rt, e.g. DADDU R1, R2, R3
Function encodes the data path operations: Add, Sub...
read/write special registers
Moves
J-Type Instruction: Jump, Jump and Link, Trap and return from exception
| opcode (6 bits) | offset added to PC (26) |
Data path
The processor: Data Path and Control

[Figure: the PC addresses the instruction memory; fetched instructions supply register numbers to the register bank; the ALU operates on register data and computes addresses for the data memory, which reads and writes data.]
Two types of functional units:
elements that operate on data values (combinational)
elements that contain state (state elements)
Single Cycle Implementation
[Figure: state element 1 → combinational logic → state element 2, all within one clock cycle.]
Typical execution:
read contents of some state elements,
send values through some combinational logic
write results to one or more state elements
Using a clock signal for synchronization
Edge triggered methodology
A portion of the datapath used for fetching instructions
The datapath for R-type instructions
The datapath for load and store instructions
The datapath for branch instructions
Complete Data Path
Control
Selecting the operations to perform (ALU, read/write,
etc.)
Controlling the flow of data (multiplexor inputs)
Information comes from the 32 bits of the instruction
Example: lw $1, 100($2)
Value of control signals is dependent upon:
what instruction is being executed
which step is being performed
Data Path with Control
Single Cycle Implementation
Calculate cycle time assuming negligible delays except:
memory (2ns), ALU and adders (2ns), register file access
(1ns)
Pipelining
Basic DataPath
•What do we need to add to actually split the datapath into stages?
Pipeline DataPath
The Five Stages of the Load Instruction
Load: Ifetch (cycle 1) → Reg/Dec (cycle 2) → Exec (cycle 3) → Mem (cycle 4) → Wr (cycle 5)
Ifetch: Instruction Fetch (fetch the instruction from the Instruction Memory)
Reg/Dec: Register Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
Pipelined Execution

[Figure: successive instructions enter the pipeline one cycle apart; each flows through IFetch, Dcd, Exec, Mem, WB, so several instructions occupy different stages simultaneously.]

On a processor multiple instructions are in various stages at the same time.
Assume each instruction takes five cycles
Single Cycle, Multiple Cycle, vs. Pipeline
Graphically Representing Pipelines
Can help with answering questions like:
How many cycles does it take to execute this code?
What is the ALU doing during cycle 4?
Are two instructions trying to use the same resource at the same time?
Why Pipeline? Because the resources are there!
Why Pipeline?
Suppose
100 instructions are executed
The single cycle machine has a cycle time of 45 ns
The multicycle and pipeline machines have cycle times of 10
ns
The multicycle machine has a CPI of 4.6
Single Cycle Machine
45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine
10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
Ideal pipelined machine
10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
Ideal pipelined vs. single cycle speedup
4500 ns / 1040 ns = 4.33
What has not yet been
considered?
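The arithmetic on this slide in executable form:

```python
n = 100                  # instructions executed
single = 45 * 1 * n      # single-cycle: 45 ns/cycle, CPI 1
multi = 10 * 4.6 * n     # multicycle: 10 ns/cycle, CPI 4.6
pipe = 10 * (1 * n + 4)  # ideal pipeline: CPI 1 plus a 4-cycle drain

print(single)         # 4500 ns
print(multi)          # ≈ 4600 ns
print(pipe)           # 1040 ns
print(single / pipe)  # ≈ 4.33
```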
Compare Performance
Compare: Single-cycle, multicycle and pipelined control using
SPECint2000
Single-cycle: memory access = 200ps, ALU = 100ps, register file read
and write = 50ps
200+50+100+200+50=600ps
Multicycle: 25% loads, 10% stores, 11% branches, 2% jumps, 52%
ALU
CPI = 4.12, The clock cycle = 200ps (longest functional unit)
Pipelined
1 clock cycle when there is no load-use dependence
2 when there is, average 1.5 per load
Stores and ALU take 1 clock cycle
Branches - 1 when predicted correctly and 2 when not, average 1.25
Jump – 2
1.5x25%+1x10%+1x52%+1.25x11%+2x2% = 1.17
Average instruction time: single-cycle = 600ps, multicycle =
4.12x200=824, pipelined 1.17x200 = 234ps
Memory access 200ps is the bottleneck. How to improve?
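The pipelined CPI on this slide is a weighted sum over the instruction mix; checking it with the numbers as given:

```python
mix = {"load": 0.25, "store": 0.10, "alu": 0.52, "branch": 0.11, "jump": 0.02}
cpi = {"load": 1.5, "store": 1, "alu": 1, "branch": 1.25, "jump": 2}

pipelined_cpi = sum(mix[k] * cpi[k] for k in mix)
print(pipelined_cpi)        # ≈ 1.17 (exactly 1.1725)
print(pipelined_cpi * 200)  # ≈ 234 ps average instruction time
```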
Can pipelining get us into trouble?
Yes: Pipeline Hazards
structural hazards: attempt to use the same resource two
different ways at the same time
• E.g., two instructions try to read the same memory at the same time
data hazards: attempt to use item before it is ready
• instruction depends on result of prior instruction still in the pipeline
add r1, r2, r3
sub r4, r2, r1
control hazards: attempt to make a decision before the condition is evaluated
• branch instructions
beq r1, loop
add r1, r2, r3
Can always resolve hazards by waiting
pipeline control must detect the hazard
take action (or delay action) to resolve hazards
Single Memory is a Structural Hazard

[Figure: time (clock cycles) vs. instruction order for a Load followed by Instr 1-4; each instruction uses Mem, Reg, ALU, Mem, Reg in successive cycles, so the Load's data access and a later instruction's fetch need the single memory in the same cycle. Right-half highlight means read, left-half write.]

Detection is easy in this case!
Structural Hazards limit performance
Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then
average CPI ≥ 1.3
otherwise the memory resource would be more than 100% utilized
Solution 1: Use separate instruction and
data memories
Solution 2: Allow memory to read and
write more than one word per cycle
Solution 3: Stall
Control Hazard Solutions
Stall: wait until the decision is clear
It’s possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are read
[Figure: an Add, a Beq, and a Load flow through the pipeline; the Load after the branch is held back until the branch outcome is known.]
Impact: 2 clock cycles per branch instruction
=> slow
Control Hazard Solutions
Predict: guess one direction then back up if
wrong
[Figure: with predict-not-taken, the Load after the Beq proceeds down the pipeline; if the prediction is wrong it is squashed and fetch restarts at the branch target.]
Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right 50% of time)
More dynamic scheme: keep a history per branch (right ~90% of time)
Control Hazard Solutions
Redefine branch behavior (takes place after next
instruction) “delayed branch”
[Figure: Add, Beq, Misc, Load: the instruction placed in the delay slot after the Beq always executes, and the branch takes effect afterward.]
Impact: 1 clock cycle per branch instruction if the compiler can find an instruction to put in the “slot” (~50% of time)
Launching more instructions per clock cycle makes this less useful
Data Hazard
Data Hazard---Forwarding
Use temporary results, don’t wait for them to be written
register file forwarding to handle read/write to same register
ALU forwarding
Can’t always forward
Load word can still cause a hazard:
an instruction tries to read a register following a load instruction that writes to the same register.
Thus, we need a hazard detection unit to “stall” the load instruction
Stalling
We can stall the pipeline by keeping an instruction in the same
stage
Memory Hierarchy Design
5.1 Introduction
Memory Hierarchy Design
Motivated by the principle of locality - A 90/10
type of rule
Take advantage of 2 forms of locality
• Spatial - nearby references are likely
• Temporal - same reference is likely soon
Also motivated by cost/performance structures
Smaller hardware is faster: SRAM, DRAM, Disk, Tape
Access vs. bandwidth variations
Fast memory is more expensive
Goal – Provide a memory system with cost
almost as low as the cheapest level and speed
almost as fast as the fastest level
Memory relevance in Computer Design ?
A computer’s performance is given by the number of
instructions executed per time unit
The time for executing an instruction depends on:
The ALU speed (i.e., the data-path cycle duration)
The time it takes for each instruction to load/store its
operands/result from/into the memory (in brief, the time to
access memory)
The processing speed (CPU speed) grows faster than
the memory speed. As a result the CPU speed cannot
be fully exploited. This speed gap leads to an
Unbalanced System !
Levels in A Typical Memory
Hierarchy
Unit of Transfer / Addressable Unit
Unit of Transfer: Number of bits read from, or written
into memory at a time
Internal : usually governed by data bus width
External: usually a block of words, e.g. 512 or more.
Addressable unit: smallest location which can be
uniquely addressed
Internal : word or byte
External : device dependent e.g. a disk “cluster”
Access Method
Sequential
Data is stored in records, access is in linear sequence
(tape)
Direct
Data blocks have a unique and direct access, data
within block is sequential (disk)
Random
Data has unique and direct access (ram)
Associative
Data retrieved based on (partial) match rather than
address (cache)
5.2 Review of the
ABCs of Caches
36 Basic Terms on Caches
Cache
Fully associative
Write allocate
Virtual memory
dirty bit
unified cache
memory stall cycles
block offset
misses per instruction
direct mapped
write back
block
valid bit
data cache
locality
block address
hit time
address trace
write through
cache miss
set
instruction cache
page fault
random placement
average memory access time
miss rate
index field
cache hit
n-way set associative
no-write allocate
page
least-recently used
write buffer
miss penalty
tag field
write stall
Cache
The first level of the memory hierarchy
encountered once the address leaves the CPU
Persistent mismatch between CPU and main-memory
speeds
Exploit the principle of locality by providing a small,
fast memory between CPU and main memory -- the
cache memory
Cache is now applied whenever buffering is
employed to reuse commonly occurring terms
(ex. file caches)
Caching – copying information into faster
storage system
Main memory can be viewed as a cache for
secondary storage
General Hierarchy Concepts
At each level - block concept is present (block is the
caching unit)
Block size may vary depending on level
• Amortize longer access by bringing in larger chunk
• Works if locality principle is true
Hit - access where block is present - hit rate is the probability of a hit
Miss - access where block is absent (in lower levels) - miss rate
Mirroring and consistency
Data residing in higher level is subset of data in lower level
Changes at higher level must be reflected down - sometime
• Policy of sometime is the consistency mechanism
Addressing
Whatever the organization you have to know how to get at it!
Address checking and protection
Physical Address Structure
Key is that you want different block sizes
at different levels
Latency and Bandwidth
The time required for the cache miss depends
on both latency and bandwidth of the memory
(or lower level)
Latency determines the time to retrieve the first
word of the block
Bandwidth determines the time to retrieve the
rest of this block
A cache miss is handled by hardware and
causes processors following in-order execution
to pause or stall until the data are available
Predicting Memory Access Times
On a hit: simple access time to the cache
On a miss: access time + miss penalty
Miss penalty = access time of lower + block transfer
time
Block transfer time depends on
• Block size - bigger blocks mean longer transfers
• Bandwidth between the two levels of memory
– Bandwidth usually dominated by the slower memory and the
bus protocol
Performance
Average-Memory-Access-Time = Hit-Access-Time +
Miss-Rate * Miss-Penalty
Memory-stall-cycles = IC * Memory-references-per-instruction * Miss-Rate * Miss-Penalty
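As a rough sketch, the two formulas can be written as small C helpers (the function and parameter names are mine, not from the slides):

```c
#include <assert.h>
#include <math.h>

/* Average memory access time: hit time plus the expected miss cost. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Total memory stall cycles: instruction count times memory references
 * per instruction times miss rate times miss penalty. */
static double memory_stall_cycles(double ic, double refs_per_instr,
                                  double miss_rate, double miss_penalty) {
    return ic * refs_per_instr * miss_rate * miss_penalty;
}
```

For example, a 1-cycle hit with a 2% miss rate and 50-cycle penalty gives AMAT = 1 + 0.02 * 50 = 2 cycles.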
Four Standard Questions
Block Placement
Where can a block be placed in the upper
level?
Block Identification
How is a block found if it is in the upper level?
Block Replacement
Which block should be replaced on a miss?
Write Strategy
What happens on a write?
Answer the four questions for the first level of the memory hierarchy
Block Placement Options
Direct Mapped
(Block address) MOD (# of cache blocks)
Fully Associative
Can be placed anywhere
Set Associative
Set is a group of n blocks -- each block is called a
way
Block first mapped into a set (Block address)
MOD (# of cache sets)
Placed anywhere in the set
Most caches are direct mapped, 2- or 4-way
set associative
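The placement rule above can be sketched in C (names are mine; direct mapped is the 1-way case, and fully associative degenerates to a single set):

```c
#include <assert.h>

/* Set index for block placement: sets = blocks / ways, and the block
 * maps to (block address) MOD (# of cache sets). */
static unsigned set_index(unsigned block_address, unsigned num_blocks,
                          unsigned ways) {
    unsigned num_sets = num_blocks / ways;
    return block_address % num_sets;
}
```

With 64 cache blocks, block address 200 maps to set 8 whether the cache is direct mapped (64 sets) or 4-way (16 sets), while a fully associative cache has only set 0.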
Block Placement Options (Cont.)
Block Identification
Each cache block carries tags
Address Tags: which block am I?
Many memory blocks may
map to the same cache
block
Physical address now: address tag## set index##
block offset
Note relationship of block size, cache size, and tag
size
The smaller the set tag the cheaper it is to find
Status Tags: what state is the block in?
valid, dirty, etc.
Physical address =
r + m + n bits
r
(address tag)
m
(set index)
n
(block offset)
2m addressable sets
in the cache
2n bytes
per block
Block Identification (Cont.)
Physical address = r + m + n bits
r (address tag)
m
2m addressable sets
in the cache
n
2n bytes
per block
•
Caches have an address tag on each block frame that gives
the block address.
•
A valid bit to say whether or not this entry contains a valid
address.
•
The block frame address can be divided into the tag field and
the index field.
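The r + m + n split can be expressed with shifts and masks (a hypothetical helper, assuming the low n bits are the block offset and the next m bits are the set index):

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a physical address: tag (r bits), set index (m bits),
 * block offset (n bits). */
typedef struct { uint32_t tag, index, offset; } AddrFields;

static AddrFields split_address(uint32_t addr, unsigned m, unsigned n) {
    AddrFields f;
    f.offset = addr & ((1u << n) - 1);         /* low n bits: byte in block */
    f.index  = (addr >> n) & ((1u << m) - 1);  /* next m bits: which set    */
    f.tag    = addr >> (n + m);                /* remaining r bits          */
    return f;
}
```

For instance, with m = 8 and n = 4, address 0xABCD splits into tag 0xA, index 0xBC, offset 0xD.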
Block Replacement
Random: just pick one and chuck it
Simple hash game played on target block frame
address
Some use truly random
• But lack of reproducibility is a problem at debug time
LRU - least recently used
Need to keep time since each block was last
accessed
• Expensive if number of blocks is large due to global compare
• Hence an approximation is often used: a use bit tag and LFU
FIFO
(Only one choice for direct-mapped placement)
Short Summaries from the Previous
Figure
Higher associativity is better for small caches
2- or 4-way associative perform similar to 8-way
associative for larger caches
Larger cache size is better
LRU is the best for small block sizes
Random works fine for large caches
FIFO outperforms random in smaller caches
Little difference between LRU and random for
larger caches
Improving Cache Performance
MIPS mix is 10% stores and 37% loads
Writes are about 10%/(100%+10%+37%) = 7% of
overall memory traffic, and 10%/(10%+37%)=21% of
data cache traffic
Make the common case fast
Implies optimizing caches for reads
Read optimizations
Block can be read concurrent with tag comparison
On a hit the read information is passed on
On a miss - nuke the block and start the miss access
Write optimizations
Can’t modify until after tag check - hence take longer
Write Options
Write through: write posted to cache line and through to next lower
level
Incurs write stall (use an intermediate write buffer to reduce the stall)
Write back
Only write to cache not to lower level
Implies that cache and main memory are now inconsistent
• Mark the line with a dirty bit
• If this block is replaced and dirty then write it back
Pros and cons - both are useful
Write through
• No write on read miss, simpler to implement, no inconsistency with main
memory
Write back
• Uses less main memory bandwidth, write times independent of main
memory speeds
• Multiple writes within a block require only one write to the main memory
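A toy write-back sketch in C (one word per line, direct mapped; all names are mine): a store only updates the cache line and sets the dirty bit, and main memory is written when the dirty line is evicted by a conflicting address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_LINES 4  /* tiny direct-mapped cache, one word per line */

typedef struct { bool valid, dirty; uint32_t tag; int data; } Line;
typedef struct { Line lines[NUM_LINES]; int *mem; } Cache;

static void cache_init(Cache *c, int *mem) {
    memset(c->lines, 0, sizeof c->lines);
    c->mem = mem;
}

/* Write back: a store marks the line dirty; main memory is written
 * only when the dirty line is evicted. */
static void cache_write(Cache *c, uint32_t addr, int value) {
    Line *l = &c->lines[addr % NUM_LINES];
    uint32_t tag = addr / NUM_LINES;
    if (l->valid && l->dirty && l->tag != tag)          /* evict dirty line */
        c->mem[l->tag * NUM_LINES + addr % NUM_LINES] = l->data;
    l->valid = true;
    l->dirty = true;
    l->tag = tag;
    l->data = value;
}

/* Demo: memory stays stale until a conflicting write forces eviction. */
static int write_back_demo(void) {
    int mem[8] = {0};
    Cache c;
    cache_init(&c, mem);
    cache_write(&c, 3, 42);       /* cached only, not written through   */
    int stale = mem[3];           /* still 0: inconsistent with cache   */
    cache_write(&c, 7, 5);        /* address 7 maps to the same line    */
    return stale == 0 && mem[3] == 42;
}
```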
5.3 Cache
Performance
Cache Performance Example
Each instruction takes 2 clock cycles (ignore memory stalls)
Cache miss penalty – 50 clock cycles
Miss rate = 2%
Average 1.33 memory references per instruction
Ideal – IC * 2 * cycle-time
With cache – IC*(2+1.33*2%*50)*cycle-time = IC * 3.33 * cycle-time
No cache – IC * (2+1.33*100%*50)*cycle-time
The importance of cache for CPUs with lower CPI and higher clock
rates is greater – Amdahl’s Law
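The example's arithmetic can be checked with a small helper (function name is mine):

```c
#include <assert.h>
#include <math.h>

/* Effective CPI = base CPI + refs/instruction * miss rate * miss penalty */
static double effective_cpi(double base_cpi, double refs_per_instr,
                            double miss_rate, double miss_penalty) {
    return base_cpi + refs_per_instr * miss_rate * miss_penalty;
}
```

With the slide's numbers, the cached machine runs at 2 + 1.33 * 2% * 50 = 3.33 cycles per instruction, while treating every reference as a miss (no cache) gives 2 + 1.33 * 100% * 50 = 68.5.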
Average Memory Access Time VS
CPU Time
Compare two different cache organizations
Miss rate – direct-mapped (1.4%), 2-way associative (1.0%)
Clock-cycle-time – direct-mapped (2.0ns), 2-way associative
(2.2ns)
CPI with a perfect cache – 2.0; average memory references per instruction – 1.3; miss penalty – 70ns; hit time – 1 clock cycle
• Average Memory Access Time (Hit time + Miss_rate * Miss_penalty)
• AMAT(Direct) = 1 * 2 + (1.4% * 70) = 2.98ns
• AMAT(2-way) = 1 * 2.2 + (1.0% * 70) = 2.90ns
• CPU Time
• CPU(Direct) = IC * (2 * 2 + 1.3 * 1.4% * 70) = 5.27 * IC
• CPU(2-way) = IC * (2 * 2.2 + 1.3 * 1.0% * 70) = 5.31 * IC
Since CPU time is our bottom-line evaluation, and since direct mapped is
simpler to build, the preferred cache is direct mapped in this example
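The two metrics can be computed side by side (a sketch with my own function names, using the slide's parameters):

```c
#include <assert.h>
#include <math.h>

/* AMAT in ns: hit cycles * cycle time + miss rate * miss penalty (ns) */
static double amat_ns(double hit_cc, double cycle_ns,
                      double miss_rate, double penalty_ns) {
    return hit_cc * cycle_ns + miss_rate * penalty_ns;
}

/* CPU time per instruction in ns: base CPI * cycle time +
 * refs/instruction * miss rate * miss penalty (ns) */
static double cpu_ns_per_instr(double cpi, double cycle_ns, double refs,
                               double miss_rate, double penalty_ns) {
    return cpi * cycle_ns + refs * miss_rate * penalty_ns;
}
```

The 2-way cache wins on AMAT (2.90ns vs 2.98ns) yet loses on CPU time (5.31 vs 5.27 per instruction) because its slower clock cycle taxes every instruction, which is exactly the slide's point.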
Unified and Split Cache
Unified – 32KB cache, Split – 16KB IC and 16KB DC
Hit time – 1 clock cycle, miss penalty – 100 clock cycles
Load/Store hit takes 1 extra clock cycle for unified cache
36% load/store – reference to cache: 74% instruction, 26% data
• Miss rate(16KB instruction) = 3.82/1000/1.0 = 0.004
Miss rate (16KB data) = 40.9/1000/0.36 = 0.114
• Miss rate for split cache – (74%*0.004) + (26%*0.114) = 0.0324
Miss rate for unified cache – 43.3/1000/(1+0.36) = 0.0318
• Average-memory-access-time = % inst * (hit-time + inst-miss-rate *
miss-penalty) + % data * (hit-time + data-miss-rate * miss-penalty)
• AMAT(Split) = 74% * (1 + 0.004 * 100) + 26% * (1 + 0.114 * 100) = 4.24
• AMAT(Unified) = 74% * (1 + 0.0318 * 100) + 26% * (1 + 1 + 0.0318* 100)
= 4.44
Improving Cache Performance
Average-memory-access-time = Hit-time + Miss-rate * Miss-penalty
Strategies for improving cache
performance
Reducing the miss penalty
Reducing the miss rate
Reducing the miss penalty or miss rate via
parallelism
Reducing the time to hit in the cache
5.4
Reducing Cache
Miss Penalty
Techniques for Reducing Miss
Penalty
Multilevel Caches (the most important)
Critical Word First and Early Restart
Giving Priority to Read Misses over Writes
Merging Write Buffer
Victim Caches
Multi-Level Caches
Probably the best miss-penalty reduction
Performance measurement for 2-level
caches
AMAT = Hit-time-L1 + Miss-rate-L1 * Miss-penalty-L1
Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2
AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)
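A sketch of the nested formula in C (the test numbers below are hypothetical, not from the slides):

```c
#include <assert.h>
#include <math.h>

/* Two-level AMAT: the L1 miss penalty is itself an AMAT into L2. */
static double amat_two_level(double hit1, double miss_rate1,
                             double hit2, double miss_rate2,
                             double miss_penalty2) {
    double miss_penalty1 = hit2 + miss_rate2 * miss_penalty2;
    return hit1 + miss_rate1 * miss_penalty1;
}
```

For example, assuming a 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit, 20% local L2 miss rate, and 100-cycle memory access: AMAT = 1 + 0.05 * (10 + 0.2 * 100) = 2.5 cycles.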
Critical Word First and Early
Restart
Do not wait for full block to be loaded before restarting
CPU
Critical Word First – request the missed word first from memory
and send it to the CPU as soon as it arrives; let the CPU
continue execution while filling the rest of the words in the block.
Also called wrapped fetch and requested word first
Early restart -- as soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue execution
Benefits of critical word first and early restart depend on
Block size: generally useful only in large blocks
Likelihood of another access to the portion of the block that has
not yet been fetched
• Spatial locality problem: the CPU tends to want the next sequential word, so it is not clear whether restarting early on the requested word actually helps
Victim Caches
Remember what was just discarded in case it is needed again
Add small fully associative cache (called victim cache)
between the cache and the refill path
Contain only blocks discarded from a cache because of a miss
Are checked on a miss to see if they have the desired data
before going to the next lower-level of memory
• If yes, swap the victim block and cache block
Addressing both victim and regular cache at the same time
• The penalty will not increase
Jouppi (DEC SRC) shows miss reduction of 20 - 95%
For a 4KB direct mapped cache with 1-5 victim blocks
Victim Cache Organization
5.5 Reducing Miss
Rate
Classify Cache Misses - 3 C’s
Compulsory - independent of cache size
First access to a block - no choice but to load it
Also called cold-start or first-reference misses
Capacity - decreases as cache size increases
Cache cannot contain all the blocks needed during
execution, so blocks being discarded will be later
retrieved
Conflict (Collision) - decreases as associativity
increases
Side effect of set associative or direct mapping
A block may be discarded and later retrieved if too
many blocks map to the same cache block
Techniques for Reducing Miss Rate
Larger Block Size
Larger Caches
Higher Associativity
Way Prediction Caches
Compiler optimizations
Larger Block Sizes
Obvious advantages: reduce compulsory
misses
Reason is due to spatial locality
Obvious disadvantage
Higher miss penalty: larger block takes longer
to move
May increase conflict misses and capacity misses if the cache is small
Don’t let increase in miss penalty outweigh the decrease in miss rate
Large Caches
Help with both conflict and capacity
misses
May need longer hit time AND/OR higher
HW cost
Popular in off-chip caches
Higher Associativity
8-way set associative is for practical purposes
as effective in reducing misses as fully
associative
2:1 rule of thumb
A 2-way set associative cache of size N/2 has about the same miss rate as a direct-mapped cache of size N (holds for cache sizes < 128 KB)
Greater associativity comes at the cost of
increased hit time
Lengthen the clock cycle
Hill [1988] suggested hit time for 2-way vs. 1-way:
external cache +10%, internal + 2%
Way Prediction
Extra bits are kept in cache to predict the way, or
block within the set of the next cache access
Multiplexor is set early to select the desired
block, and only a single tag comparison is
performed that clock cycle
A miss results in checking the other blocks for
matches in subsequent clock cycles
Alpha 21264 uses way prediction in its 2-way
set-associative instruction cache. Simulation
using SPEC95 suggested way prediction
accuracy is in excess of 85%
Compiler Optimization for Data
Idea – improve the spatial and temporal locality
of the data
Lots of options
Array merging – Allocate arrays so that paired
operands show up in same cache block
Loop interchange – Exchange inner and outer loop
order to improve cache performance
Loop fusion – For independent loops accessing the
same data, fuse these loops into a single aggregate
loop
Blocking – Do as much as possible on a sub- block
before moving on
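Loop interchange from the list above can be illustrated as follows (a sketch; array sizes and function names are arbitrary):

```c
#include <assert.h>
#include <string.h>

#define M 4
#define N 8

/* Before interchange: column-major traversal jumps N ints between
 * consecutive accesses, touching a different cache block each time
 * when rows are large. */
static void scale_column_order(int x[M][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            x[i][j] *= 2;
}

/* After interchange: row-major traversal walks memory with unit stride,
 * so consecutive accesses fall in the same cache block. */
static void scale_row_order(int x[M][N]) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            x[i][j] *= 2;
}

/* Both orders compute the same result; only the access pattern differs. */
static int interchange_same_result(void) {
    int a[M][N], b[M][N];
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = b[i][j] = i * N + j;
    scale_column_order(a);
    scale_row_order(b);
    return memcmp(a, b, sizeof a) == 0;
}
```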
Merging Arrays Example
/* Before: 2 sequential arrays (memory layout: val ... val | key ... key) */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures (layout: val key val key val key ...) */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improves spatial locality
5.7 Reducing Hit
Time
Reducing Hit Time
Hit time is critical because it affects the
clock cycle time
On many machines, cache access time limits
the clock cycle rate
A fast hit time is multiplied in importance
beyond the average memory access time
formula because it helps everything
Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
• Miss-penalty is clock-cycle dependent
Techniques for Reducing Hit Time
Small and Simple Caches
Avoid Address Translation during Indexing
of the Cache
Pipelined Cache Access
Trace Caches
Cache Optimization Summary
5.9 Main Memory
Main Memory -- 3 important issues
Capacity
Latency
Access time: time between when a read is requested and when the word arrives
Cycle time: min time between requests to memory (> access
time)
• Memory needs the address lines to be stable between accesses
By addressing big chunks - like an entire cache block (amortize
the latency)
Critical to cache performance when the miss is to main
Bandwidth -- # of bytes read or written per unit time
Affects the time it takes to transfer the block
3 Examples of Bus Width, Memory Width, and
Memory Interleaving to Achieve Memory Bandwidth
Wider Main Memory
Doubling or quadrupling the width of the cache or memory doubles or quadruples the memory bandwidth
Miss penalty is reduced correspondingly
Cost and Drawback
More cost on memory bus
Multiplexer between the cache and the CPU may be on the critical path (the CPU still accesses the cache one word at a time)
• Multiplexors can be put between L1 and L2
The design of error correction becomes more complicated
• If only a portion of the block is updated, all other portions must be
read for calculating the new error correction code
Since main memory is traditionally expandable by the customer,
the minimum increment is doubled or quadrupled
5.10 Virtual
Memory
Virtual Memory
Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes
With virtual memory, the CPU produces virtual addresses that are translated by a combination of HW and SW to physical addresses, which access main memory. The process is called memory mapping or address translation
Today, the two memory-hierarchy levels
controlled by virtual memory are DRAMs and
magnetic disks
Example of Virtual to Physical
Address Mapping
Mapping by a
page table
Address Translation Hardware for
Paging
Physical address = frame number f (l-n bits) ## frame offset d (n bits)
Cache vs. VM Differences
Replacement
Cache miss handled by hardware
Page fault usually handled by OS
Addresses
VM space is determined by the address size of the
CPU
Cache space is independent of the CPU address size
Lower level memory
For caches - the main memory is not shared by
something else
For VM - most of the disk contains the file system
• File system addressed differently - usually in I/O space
• VM lower level is usually called SWAP space
2 VM Styles - Paged or Segmented?
Virtual systems can be categorized into two classes: pages (fixed-size
blocks), and segments (variable-size blocks)
Words per address: Page - one; Segment - two (segment and offset)
Programmer visible? Page - invisible to application programmer; Segment - may be visible to application programmer
Replacing a block: Page - trivial (all blocks are the same size); Segment - hard (must find contiguous, variable-size, unused portion of main memory)
Memory use inefficiency: Page - internal fragmentation (unused portion of page); Segment - external fragmentation (unused pieces of main memory)
Efficient disk traffic: Page - yes (adjust page size to balance access time and transfer time); Segment - not always (small segments may transfer just a few bytes)
Virtual Memory – The Same 4
Questions
Block Placement
Choice: lower miss rates and complex placement or
vice versa
• Miss penalty is huge, so choose low miss rate: place anywhere
• Similar to fully associative cache model
Block Identification - both use additional data
structure
Fixed size pages - use a page table
Variable sized segments - segment table
Virtual Memory – The Same 4
Questions (Cont.)
Block Replacement -- LRU is the best
However true LRU is a bit complex – so use
approximation
• Page table contains a use tag, and on access the use tag is
set
• OS checks them every so often - records what it sees in a
data structure - then clears them all
• On a miss the OS decides which page has been used the least and replaces that one
Write Strategy -- always write back
Due to the access time to the disk, write through is
silly
Use a dirty bit to only write back pages that have
been modified
Techniques for Fast Address
Translation
Page table is kept in main memory (kernel memory)
Each process has a page table
Every data/instruction access requires two memory
accesses
One for the page table and one for the data/instruction
Can be solved by the use of a special fast-lookup hardware
cache called associative registers or translation look-aside
buffers (TLBs)
If locality applies then cache the recent translation
TLB = translation look-aside buffer
TLB entry: virtual page no, physical page no, protection bit, use
bit, dirty bit
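A fully associative TLB lookup can be sketched as follows (a simplified model with my own names; the protection, use, and dirty bits from the entry above are omitted):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 8

/* One TLB entry: virtual page number -> physical frame number. */
typedef struct { bool valid; uint32_t vpn, pfn; } TlbEntry;

/* Fully associative lookup: real hardware compares all entries in
 * parallel; this sketch searches them sequentially. */
static bool tlb_lookup(const TlbEntry tlb[TLB_ENTRIES],
                       uint32_t vpn, uint32_t *pfn_out) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn_out = tlb[i].pfn;
            return true;              /* hit: no page-table access needed */
        }
    }
    return false;                     /* miss: walk the page table */
}

/* Demo: one cached translation hits, an unknown page misses. */
static int tlb_demo(void) {
    TlbEntry tlb[TLB_ENTRIES] = { { true, 5, 42 } };
    uint32_t pfn = 0;
    int hit  = tlb_lookup(tlb, 5, &pfn) && pfn == 42;
    int miss = !tlb_lookup(tlb, 6, &pfn);
    return hit && miss;
}
```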
TLB = Translation Look-aside
Buffer
The TLB must be on chip; otherwise it is
worthless
Fully associative – parallel search
Typical TLBs
Hit time - 1 cycle
Miss penalty - 10 to 30 cycles
Miss rate - .1% to 2%
TLB size - 32 B to 8 KB
Paging Hardware with TLB
TLB of Alpha 21264
Address Space Number: process
ID to prevent context switch
A total of 128 TLB entries
Page Size – An Architectural Choice
Large pages are good:
Reduces page table size
Amortizes the long disk access
If spatial locality is good then hit rate will
improve
Reduce the number of TLB miss
Large pages are bad:
More internal fragmentation
• If everything is random each structure’s last page
is only half full
Process start-up time takes longer