COSC3330/6308 Computer Architecture


THIRD REVIEW SESSION
Jehan-François Pâris
May 5, 2010
MATERIALS (I)
• Memory hierarchies:
– Caches
– Virtual memory
– Protection
– Virtual machines
– Cache consistency
MATERIALS (II)
• I/O Operations
– More about disks
– I/O operation implementation:
• Busses
• Memory-mapped I/O
• Specific I/O instructions
– RAID organizations
MATERIALS (III)
• Parallel Architectures
– Shared memory multiprocessors
– Computer clusters
– Hardware multithreading
– SISD, SIMD, MIMD, …
– Roofline performance model
CACHING AND VIRTUAL MEMORY
Common objective
• Make a combination of
– Small, fast and expensive memory
– Large, slow and cheap memory
look like
– A single large and fast memory
• Fetch policy is fetch on demand
Questions to ask
• What are the transfer units?
• How are they placed in the faster memory?
• How are they accessed?
• How do we handle misses?
• How do we implement writes?
and more generally
• Are these tasks performed by the hardware or
the OS?
Transfer units
• Blocks or pages containing 2^n bytes
– Always properly aligned
• If a block or a page contains 2^n bytes,
the n LSBs of its start address will be all zeroes
Examples
• If block size is 4 words,
– Corresponds to 16 = 2^4 bytes
– 4 LSBs of block address will be all zeroes
• If page size is 4KB
– Corresponds to 2^2×2^10 = 2^12 bytes
– 12 LSBs of page address will be all zeroes
– Remaining bits of address form page number
Examples
Page size = 4KB
32-bit address of first byte in page:
XXXXXXXXXXXXXXXXXXXX <12 zeroes>
Any address within the page:
20-bit page number + 12-bit offset
XXXXXXXXXXXXXXXXXXXX YYYYYYYYYYYY
Consequence
• In a 32-bit architecture,
– We identify a block or a page of size 2^n bytes
by the 32 – n MSBs of its address
– Will be called
• Tag
• Page number
Placement policy
• Two extremes
– Each block can only occupy a fixed address in
the faster memory
• Direct mapping (many caches)
– Each page can occupy any address in the
faster memory
• Full associativity (virtual memory)
Direct mapping
• Assume
– Cache has 2^m entries
– Block size is 2^n bytes
– a is the block address
(with its n LSBs removed)
• The block will be placed at cache position
a % 2^m
Consequence
• The tag identifying the cache block will be the
start address of the block with its n + m LSBs
removed
– the original n LSBs because they are known
to be all zeroes
– the next m LSBs because they are equal to
a % 2^m
Consequence
Block start address
→ remove the n LSBs (always zero)
Block address
→ remove the m additional LSBs given by a % 2^m
Tag
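
A minimal C sketch of this split for a direct-mapped cache; the block and cache sizes below are illustrative choices, not values fixed by the slides.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative direct-mapped cache geometry (not fixed by the slides):
       block size = 2^n bytes, 2^m cache entries.                           */
    #define N_OFFSET_BITS 6                       /* 64-byte blocks     */
    #define M_INDEX_BITS  10                      /* 1024 cache entries */

    int main(void) {
        uint32_t addr   = 0x12345678;             /* example 32-bit address        */
        uint32_t offset = addr & ((1u << N_OFFSET_BITS) - 1);          /* n LSBs   */
        uint32_t index  = (addr >> N_OFFSET_BITS) & ((1u << M_INDEX_BITS) - 1);
                                                  /* next m bits, i.e. a % 2^m     */
        uint32_t tag    = addr >> (N_OFFSET_BITS + M_INDEX_BITS);      /* the rest */
        printf("offset = %u, index = %u, tag = 0x%x\n", offset, index, tag);
        return 0;
    }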
A cache whose block size is 8 bytes
Index   Valid   Tag          Contents (two words per 8-byte block)
000     Y/N     Bits 31:6    Word | Word
001     Y/N     Bits 31:6    Word | Word
010     Y/N     Bits 31:6    Word | Word
011     Y/N     Bits 31:6    Word | Word
100     Y/N     Bits 31:6    Word | Word
101     Y/N     Bits 31:6    Word | Word
110     Y/N     Bits 31:6    Word | Word
111     Y/N     Bits 31:6    Word | Word
Fully associative solution
• Used in virtual memory
systems
• Each page can occupy
any free page frame in
main memory
• Use a page table
– Without redundant
first column
Page #   Frame #
0        4
1        7
2        27
3        44
4        5
…        …
Solutions with limited associativity
• A cache of size 2^m with associativity level k lets a
given block occupy any of k possible locations in
the cache
• Implementation looks very much like k caches of
size 2^m/k put together
• All possible cache locations for a block have the
same position a % (2^m/k) in each of the smaller
caches
A set-associative cache with k=2
Index   Valid   Tag         Contents      Valid   Tag         Contents
000     Y/N     Bits 31:5   Block         Y/N     Bits 31:5   Block
001     Y/N     Bits 31:5   Block         Y/N     Bits 31:5   Block
010     Y/N     Bits 31:5   Block         Y/N     Bits 31:5   Block
011     Y/N     Bits 31:5   Block         Y/N     Bits 31:5   Block
100     Y/N     Bits 31:5   Block         Y/N     Bits 31:5   Block
101     Y/N     Bits 31:5   Block         Y/N     Bits 31:5   Block
110     Y/N     Bits 31:5   Block         Y/N     Bits 31:5   Block
111     Y/N     Bits 31:5   Block         Y/N     Bits 31:5   Block
Accessing an entry
• In a cache, use hardware to compute the
possible cache position for the block containing
the data
– a % 2^m for a cache using direct mapping
– a % (2^m/k) for a cache of associativity level k
• Check then if the cache entry is valid using its
valid bit
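
A small C sketch of such a lookup for a k-way set-associative cache; the geometry constants and the structure layout are assumptions made for illustration.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define N_OFFSET_BITS 5             /* 32-byte blocks (assumed)        */
    #define NUM_SETS      8             /* 2^m / k sets (assumed)          */
    #define K_WAYS        2             /* associativity level k (assumed) */

    struct cache_line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[1 << N_OFFSET_BITS];
    };

    static struct cache_line cache[NUM_SETS][K_WAYS];

    /* Returns the matching line, or NULL on a cache miss. */
    struct cache_line *lookup(uint32_t addr) {
        uint32_t block = addr >> N_OFFSET_BITS;   /* drop the n offset bits    */
        uint32_t set   = block % NUM_SETS;        /* a % (2^m / k)             */
        uint32_t tag   = block / NUM_SETS;        /* remaining MSBs            */
        for (int way = 0; way < K_WAYS; way++)    /* check the k possible ways */
            if (cache[set][way].valid && cache[set][way].tag == tag)
                return &cache[set][way];
        return NULL;                              /* miss: fetch from the level below */
    }

    int main(void) {
        uint32_t addr = 0x12345678;               /* arbitrary test address */
        printf("0x%x -> %s\n", addr, lookup(addr) ? "hit" : "miss");  /* miss: cache empty */
        return 0;
    }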
Accessing an entry
• In a VM system, hardware checks the TLB to
find the frame containing a given page number
• TLB entries contain
– A page number (tag)
– A frame number
– A valid bit
– A dirty bit
Accessing an entry
TLB entry: page number (tag), page frame number, valid and dirty bits
• The valid bit indicates if the mapping is valid
• The dirty bit indicates whether we need to save
the page contents when we expel it
Accessing an entry
• If page mapping is not in the TLB, must consult
the page table and update the TLB
– Can be done by hardware or software
Realization
Virtual address: page number 2 + 10-bit offset 897
The page table maps pages 0, 1, 2, 3 to frames 1, 3, 5, 7
Physical address: page frame number 5 + offset 897
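
The same translation as a small C sketch; the 10-bit offset and the frame numbers 1, 3, 5, 7 follow the figure above, everything else is illustrative.

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 10                        /* 10-bit offset, as in the figure */

    /* Page table from the figure: pages 0..3 live in frames 1, 3, 5, 7. */
    static const uint32_t page_table[] = { 1, 3, 5, 7 };

    int main(void) {
        uint32_t vaddr  = (2u << OFFSET_BITS) | 897u;      /* page 2, offset 897  */
        uint32_t page   = vaddr >> OFFSET_BITS;            /* virtual page number */
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
        uint32_t frame  = page_table[page];                /* page frame number   */
        uint32_t paddr  = (frame << OFFSET_BITS) | offset;
        printf("page %u, offset %u -> frame %u, physical address %u\n",
               page, offset, frame, paddr);
        return 0;
    }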
Handling cache misses
• Cache hardware fetches missing block
• Often overwriting an existing entry
• Which one?
– The block that occupies the same location if the
cache uses direct mapping
– One of the blocks that may occupy the same location
if the cache is set-associative
Handling cache misses
• Before expelling a cache entry, we must
– Check its dirty bit
– Save its contents if dirty bit is on.
Handling page faults
• OS fetches missing page
• Often overwriting an existing page
• Which one?
– One that was not recently used
• Selected by page replacement policy
Handling page faults
• Before expelling a page, we must
– Check its dirty bit
– Save its contents if dirty bit is on.
Handling writes (I)
• Two ways to handle writes
– Write through:
• Each write updates both the cache and the
main memory
– Write back:
• Writes are not propagated to the main
memory until the updated word is expelled
from the cache
Handling writes (II)
• Write through: CPU → Cache → RAM (every write goes to both)
• Write back: CPU → Cache, then Cache → RAM later, at expulsion time
Pros and cons
• Write through:
– Ensures that memory is always up to date
• Expelled cache entries can be overwritten
• Write back:
– Faster writes
– Complicates cache expulsion procedure
• Must write back cache entries that have
been modified in the cache
A better write through (I)
• Add a small buffer to speed up write
performance of write-through caches
– At least four words
• Holds modified data until they are written into
main memory
– Cache can proceed as soon as data are
written into the write buffer
A better write through (II)
• Write through: CPU → Cache → RAM
• Better write through: CPU → Cache → Write buffer → RAM
Designing RAM to support caches
• RAM connected to CPU through a "bus"
– Clock rate much slower than CPU clock rate
• Assume that a RAM access takes
– 1 bus clock cycle to send the address
– 15 bus clock cycles to initiate a read
– 1 bus clock cycle to send a word of data
Designing RAM to support caches
• Assume
– Cache block size is 4 words
– One-word bank of DRAM
• Fetching a cache block would take
1 + 4×15 + 4×1 = 65 bus clock cycles
– Transfer rate is 0.25 byte/bus cycle
• Awful!
Designing RAM to support caches
• Could
– Have an interleaved memory organization
– Four one-word banks of DRAM
– A 32-bit bus
(Figure: a 32-bit bus connecting the CPU to four one-word RAM banks,
bank 0 through bank 3)
Designing RAM to support caches
• Can do the 4 accesses in parallel
• Must still transmit the block 32 bits by 32 bits
• Fetching a cache block would take
1 + 15 + 4×1 = 20 bus clock cycles
– Transfer rate is 0.80 byte/bus cycle
• Even better
• Much cheaper than having a 64-bit bus
PERFORMANCE ISSUES
Memory stalls
• Can divide CPU time into
– N_EXEC clock cycles spent executing
instructions
– N_MEM_STALLS cycles spent waiting for memory
accesses
• We have
CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE
Memory stalls
• We assume that
– cache access times can be neglected
– most CPU cycles spent waiting for memory
accesses are caused by cache misses
Global impact
• We have
N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate ×
Cache miss penalty
• and also
N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/Instruction) ×
Cache miss penalty
Example
• Miss rate of instruction cache is 2 percent
• Miss rate of data cache is 5 percent
• In the absence of memory stalls, each instruction
would take 2 cycles
• Miss penalty is 100 cycles
• 40 percent of instructions access the main
memory
• How many cycles are lost due to cache misses?
Solution (I)
• Impact of instruction cache misses
0.02×100 =2 cycles/instruction
• Impact of data cache misses
0.40×0.05×100 =2 cycles/instruction
• Total impact of cache misses
2 + 2 = 4 cycles/instruction
Solution (II)
• Average number of cycles per instruction
2 + 4 = 6 cycles/instruction
• Fraction of time wasted
4 /6 = 67 percent
Average memory access time
• Some authors call it AMAT
T_AVERAGE = T_CACHE + f×T_MISS
where f is the cache miss rate
• Times can be expressed
– In nanoseconds
– In number of cycles
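
A one-function C sketch of the formula, with all times expressed in cycles; the sample numbers are those of the example that follows.

    #include <stdio.h>

    /* T_AVERAGE = T_CACHE + f x T_MISS, all times in cycles. */
    double amat(double t_cache, double miss_rate, double t_miss) {
        return t_cache + miss_rate * t_miss;
    }

    int main(void) {
        /* Numbers from the example below: 96 percent hit rate,
           1-cycle cache access, 100-cycle miss penalty.          */
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.04, 100.0));   /* 5.0 */
        return 0;
    }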
Example
• A cache has a hit rate of 96 percent
• Accessing data
– In the cache requires one cycle
– In the memory requires 100 cycles
• What is the average memory access time?
Solution
• Miss rate = 1 – Hit rate = 0.04
• Applying the formula
T_AVERAGE = 1 + 0.04×100 = 5 cycles
In other words
It's the miss rate, stupid!
Improving cache hit rate
• Two complementary techniques
– Using set-associative caches
• Must check tags of all blocks with the same
index values
– Slower
• Have fewer collisions
– Fewer misses
– Use a cache hierarchy
A cache hierarchy
• Topmost cache
– Optimized for speed, not miss rate
– Rather small
– Uses a small block size
• As we go down the hierarchy
– Cache sizes increase
– Block sizes increase
– Cache associativity level increases
Example
• Cache miss rate per instruction is 3 percent
• In the absence of memory stalls, each instruction
would take one cycle
• Cache miss penalty is 100 ns
• Clock rate is 4 GHz
• How many cycles are lost due to cache misses?
Solution (I)
• Duration of clock cycle
1/(4 GHz) = 0.25×10^-9 s = 0.25 ns
• Cache miss penalty
100ns = 400 cycles
• Total impact of cache misses
0.03×400 = 12 cycles/instruction
Solution (II)
• Average number of cycles per instruction
1 + 12 = 13 cycles/instruction
• Fraction of time wasted
12/13 = 92 percent
A very good case for hardware multithreading
Example (cont'd)
• How much faster would the processor be if we
added an L2 cache that
– Has a 5 ns access time
– Would reduce miss rate to main memory to
one percent?
Solution (I)
• L2 cache access time
5ns = 20 cycles
• Impact of cache misses per instruction
L1 cache misses + L2 cache misses =
0.03×20+0.01×400 = 0.6 + 4.0 =
4.6 cycles/instruction
• Average number of cycles per instruction
1 + 4.6 = 5.6 cycles/instruction
Solution (II)
• Fraction of time wasted
4.6/5.6 = 82 percent
• CPU speedup
13/5.6 ≈ 2.3
Problem
• Redo the second part of the example assuming
that the secondary cache
– Has a 3 ns access time
– Can reduce miss rate to main memory to one
percent?
Solution
• Fraction of time wasted
86 percent
• CPU speedup
1.22
New L2 cache with a lower access time
but a higher miss rate performs much worse
than first L2 cache
Example
• A virtual memory has a page fault rate of 10^-4
faults per memory access
• Accessing data
– In the memory requires 100 ns
– On disk requires 5 ms
• What is the average memory access time?
T_avg = 100 ns + 10^-4 × 5 ms = 600 ns
The cost of a page fault
• Let
– T_m be the main memory access time
– T_d the disk access time
– f the page fault rate
– T_a the average access time of the VM
T_a = (1 – f) T_m + f (T_m + T_d) = T_m + f T_d
Example
• Assume T_m = 50 ns and T_d = 5 ms

f        Mean memory access time
10^-3    50 ns + 5 ms/10^3 = 5,050 ns
10^-4    50 ns + 5 ms/10^4 = 550 ns
10^-5    50 ns + 5 ms/10^5 = 100 ns
10^-6    50 ns + 5 ms/10^6 = 55 ns
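
A short C sketch that recomputes this table from T_a = T_m + f × T_d.

    #include <stdio.h>

    int main(void) {
        const double t_m = 50.0;                 /* main memory access time, ns  */
        const double t_d = 5.0e6;                /* disk access time: 5 ms in ns */
        const double f[] = { 1e-3, 1e-4, 1e-5, 1e-6 };

        for (int i = 0; i < 4; i++) {
            double t_a = t_m + f[i] * t_d;       /* T_a = T_m + f x T_d */
            printf("f = %.0e  ->  T_a = %.0f ns\n", f[i], t_a);
        }
        return 0;
    }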
In other words
It's the page fault rate, stupid!
Locality principle (I)
• A process that would access its pages in a
totally unpredictable fashion would perform very
poorly in a VM system unless all its pages are in
main memory
Locality principle (II)
• Process P accesses randomly a very large
array consisting of n pages
• If m of these n pages are in main memory,
the page fault frequency of the process will be
( n – m )/ n
• Must switch to another algorithm
First problem
• A virtual memory system has
– 32 bit addresses
– 4 KB pages
• What are the sizes of the
– Page number field?
– Offset field?
Solution (I)
• Step 1:
Convert page size to power of 2
4 KB = 2^12 B
• Step 2:
Exponent is length of offset field
Solution (II)
• Step 3:
Size of page number field =
Address size – Offset size
Here 32 – 12 = 20 bits
12 bits for the offset and 20 bits for the page number
MEMORY PROTECTION
Objective
• Unless we have an isolated single-user system,
we must prevent users from
– Accessing
– Deleting
– Modifying
the address spaces of other processes,
including the kernel
Memory protection (I)
• VM ensures that processes cannot access page
frames that are not referenced in their page
table.
• Can refine control by distinguishing among
– Read access
– Write access
– Execute access
• Must also prevent processes from modifying
their own page tables
Dual-mode CPU
• Require a dual-mode CPU
• Two CPU modes
– Privileged mode or executive mode that
allows CPU to execute all instructions
– User mode that allows CPU to execute only
safe unprivileged instructions
• State of CPU is determined by a special mode bit
Switching between states
• User mode will be the default mode for all
programs
– Only the kernel can run in supervisor mode
• Switching from user mode to supervisor mode is
done through an interrupt
– Safe because the jump address is at a
well-defined location in main memory
Memory protection (II)
• Has additional advantages:
– Prevents programs from corrupting address
spaces of other programs
– Prevents programs from crashing the kernel
• Not true for device drivers which are inside
the kernel
• Required part of any multiprogramming system
INTEGRATING CACHES AND VM
The problem
• In a VM system, each byte of memory has two
addresses
– A virtual address
– A physical address
• Should cache tags contain virtual addresses or
physical addresses?
Discussion
• Using virtual addresses
– Directly available
– Bypass the TLB
– Cache entries specific to a given address space
– Must flush caches when the OS selects
another process
• Using physical addresses
– Must access the TLB first
– Cache entries not specific to a given address
space
– Do not have to flush caches when the OS
selects another process
The best solution
• Let the cache use physical addresses
– No need to flush the cache at each context
switch
– TLB access delay is tolerable
VIRTUAL MACHINES
Key idea
• Let different operating systems run at the same
time on a single computer
– Windows, Linux and Mac OS
– A real-time OS and a conventional OS
– A production OS and a new OS being tested
How it is done
• A hypervisor (VM monitor) defines two or more
virtual machines
• Each virtual machine has
– Its own virtual CPU
– Its own virtual physical memory
– Its own virtual disk(s)
Two virtual machines
(Figure: two virtual machines; the user processes and the VM kernels
all run in user mode, only the hypervisor runs in privileged mode)
Translating a block address
(Figure: the VM kernel asks to access block x, y of its virtual disk;
the hypervisor translates the request into an access to block v, w
of the actual disk)
Handling I/Os
• Difficult task because
– Wide variety of devices
– Some devices may be shared among several
VMs
• Printers
• Shared disk partition
– Want to let Linux and Windows access
the same files
Virtual Memory Issues
• Each VM kernel manages its own memory
– Its page tables map program virtual
addresses into pseudo-physical addresses
• It treats these addresses as physical
addresses
The dilemma
(Figure: the VM kernel believes page 735 of user process A is stored
in page frame 435; the hypervisor knows it actually resides in page
frame 993 of the real RAM)
The solution (I)
• Address translation must remain fast!
– Hypervisor lets each VM kernel manage its
own page tables but does not use them
• They contain bogus mappings!
– It maintains instead its own shadow page
tables with the correct mappings
• Used to handle TLB misses
The solution (II)
• To keep its shadow page tables up to date,
hypervisor must track any changes made by the
VM kernels
• Marks the VM kernels' page tables read-only, so
any update traps to the hypervisor
Nastiest Issue
• The whole VM approach assumes that a kernel
executing in user mode will behave exactly like a
kernel executing in privileged mode
• Not true for all architectures!
– Intel x86 Pop flags (POPF) instruction
–…
Solutions
1. Modify the instruction set and eliminate
instructions like POPF
• IBM redesigned the instruction set of their
360 series for the 370 series
2. Mask it through clever software
• Dynamic "binary translation" when direct
execution of code could not work
(VMWare)
CACHE CONSISTENCY
The problem
• Specific to architectures with
– Several processors sharing the same main
memory
– Multicore architectures
• Each core/processor has its own private cache
– A must for performance
• Happens when same data are present in two or
more private caches
An example (I)
(Figure: two CPUs, each with a private cache holding x = 0, above a shared RAM)
An example (II)
(Figure: the first CPU increments x, so its cache now holds x = 1;
the second CPU still assumes x = 0 because its cache was not updated)
An example
(Figure: one CPU sets x to 1 while another resets x to 0;
the remaining caches are left wondering which value of x is current.
Both CPUs must apply the two updates in the same order)
Rules
1. Whenever a process accesses a variable, it
always gets the value stored by the processor
that updated that variable last, provided the
updates are sufficiently separated in time
2. A processor accessing a variable sees all
updates applied to that variable in the
same order
– No compromise is possible here
A realization: Snoopy caches
• All caches are linked to the main memory
through a shared bus
– All caches observe the writes performed by
other caches
• When a cache notices that another cache
performs a write on a memory location that it has
in its cache, it invalidates the corresponding
cache block
An example (I)
(Figure: the first CPU fetches x = 2 from RAM into its cache;
the second CPU's cache is still empty)
An example (II)
(Figure: the second CPU also fetches x, so both caches now hold x = 2)
An example (III)
(Figure: the first CPU resets x to 0 in its cache;
the other cache still holds x = 2)
An example (IV)
(Figure: the first cache performs a write-through of x = 0 to RAM;
the second cache detects the write-through on the bus and
invalidates its copy of x)
An example (V)
(Figure: when the second CPU wants to access x again, its cache
misses and gets the correct value x = 0 from RAM)
A last correctness condition
• Caches cannot reorder their memory updates
– The cache-to-RAM write buffer must be FIFO
• First in, first out
Miscellaneous fallacies
• Segmented address spaces
– Address is segment number + offset in
segment
– Programmers hate them
• Ignoring virtual memory behavior when
accessing large two-dimensional arrays
• Believing that you can virtualize any CPU
architecture
DEPENDABILITY
Reliability and Availability
• Reliability
– Probability R(t) that system will be up at time
t if it was up at time t = 0
• Availability
– Fraction of time the system is up
• Reliability and availability do not measure the
same thing!
MTTF, MTTR and MTBF
• MTTF is mean time to failure
• MTTR is mean time to repair
• 1/MTTF is the failure rate λ
• MTBF, the mean time between failures, is
MTBF = MTTF + MTTR
Reliability
• As a first approximation
R(t) = exp(–t/MTTF)
– Not true if failure rate varies over time
Availability
• Measured by
(MTTF)/(MTTF + MTTR) = MTTF/MTBF
– MTTR is very important
Example
• A server crashes on the average once a month
• When this happens, it takes six hours to reboot it
• What is the server availability?
Solution
• MTBF = 30 days
• MTTR = 6 hours = ¼ day
• MTTF = 29 ¾ days
• Availability is 29.75/30 ≈ 99.2 %
Example
• A disk drive has an MTTF of 20 years.
• What is the probability that the data it contains
will not be lost over a period of five years?
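A quick estimate using the reliability formula above:
R(5 years) = exp(–5/20) = exp(–0.25) ≈ 0.78
• The data survives the five years with a probability of roughly 78 percent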
Example
• A disk farm contains 100 disks whose MTTF is
20 years.
• What is the probability that no data will be lost
over a period of five years?
Solution
• The aggregate failure rate of the disk farm is
100×1/20 = 5 failures per year
• The mean time to failure of the farm is
1/5 year
• We apply the formula
R(t) = exp(–t/MTTF) = exp(–5×5) = exp(–25) ≈ 1.4×10^-11
RAID Arrays
Today’s Motivation
• We use RAID today for
– Increasing disk throughput by allowing parallel
access
– Eliminating the need to make disk backups
• Disks are too big to be backed up in an
efficient fashion
RAID LEVEL 0
• No replication
• Advantages:
– Simple to implement
– No overhead
• Disadvantage:
– If array has n disks failure rate is n times the
failure rate of a single disk
RAID levels 0 and 1
(Figure: a RAID level 0 array with no redundancy and a RAID level 1
array in which each disk has a mirror)
RAID LEVEL 1
• Mirroring:
– Two copies of each disk block
• Advantages:
– Simple to implement
– Fault-tolerant
• Disadvantage:
– Requires twice the disk capacity of normal file
systems
RAID LEVEL 2
• Instead of duplicating the data blocks we use an
error correction code
• Very bad idea because disk drives either work
correctly or do not work at all
– Only possible errors are omission errors
– We need an omission correction code
• A parity bit is enough to correct a single
omission
RAID levels 2 and 3
(Figure: a RAID level 2 array with several check disks and a
RAID level 3 array with a single parity disk)
RAID LEVEL 3
• Requires N+1 disk drives
– N drives contain data (1/N of each data block)
• Block b[k] now partitioned into N fragments
b[k,1], b[k,2], ... b[k,N]
– Parity drive contains exclusive or of these N
fragments
p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]
How parity works?
• Truth table for XOR (same as parity)
A   B   A⊕B
0   0   0
0   1   1
1   0   1
1   1   0
Recovering from a disk failure
• Small RAID level 3 array with data disks D0 and
D1 and parity disk P can tolerate failure of either
D0 or D1
D0   D1   P    D1⊕P=D0   D0⊕P=D1
0    0    0    0         0
0    1    1    0         1
1    0    1    1         0
1    1    0    1         1
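
A small C sketch of this reconstruction: XORing the parity fragment with all surviving data fragments yields the lost fragment. Fragment sizes and contents are made up for the test.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define FRAG_SIZE 8                 /* bytes per fragment (made up) */

    /* Rebuild the fragment of a failed drive by XORing the parity
       fragment with the fragments of all surviving data drives.     */
    void rebuild(uint8_t *lost, const uint8_t *parity,
                 uint8_t survivors[][FRAG_SIZE], int n_survivors) {
        memcpy(lost, parity, FRAG_SIZE);
        for (int d = 0; d < n_survivors; d++)
            for (int i = 0; i < FRAG_SIZE; i++)
                lost[i] ^= survivors[d][i];
    }

    int main(void) {
        uint8_t d0[FRAG_SIZE] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        uint8_t d1[FRAG_SIZE] = { 9, 8, 7, 6, 5, 4, 3, 2 };
        uint8_t p[FRAG_SIZE], recovered[FRAG_SIZE];
        uint8_t survivors[1][FRAG_SIZE];

        for (int i = 0; i < FRAG_SIZE; i++)          /* p = d0 XOR d1 */
            p[i] = d0[i] ^ d1[i];

        memcpy(survivors[0], d1, FRAG_SIZE);         /* pretend d0 failed */
        rebuild(recovered, p, survivors, 1);

        printf("d0 recovered correctly: %s\n",
               memcmp(recovered, d0, FRAG_SIZE) == 0 ? "yes" : "no");
        return 0;
    }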
How RAID level 3 works (I)
• Assume we have N + 1 disks
• Each block is partitioned into N equal chunks
(Figure: a block partitioned into N = 4 chunks)
How RAID level 3 works (II)
• XOR data chunks to compute the parity chunk
chunk ⊕ chunk ⊕ chunk ⊕ chunk = parity
• Each chunk, and the parity chunk, is written into a separate disk
How RAID level 3 works (III)
• Each read/write involves all disks in RAID array
– Cannot do two or more reads/writes in parallel
– Performance of array no better than that of a
single disk
RAID LEVEL 4 (I)
• Requires N+1 disk drives
– N drives contain data
• Individual blocks, not chunks
– Blocks with same disk address form a stripe
(Figure: the blocks at the same address x on each data disk, plus the
corresponding parity block, form a stripe)
RAID LEVEL 4 (II)
• Parity drive contains exclusive or of the
N blocks in stripe
p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Parity block now reflects contents of several
blocks!
• Can now do parallel reads/writes
RAID levels 4 and 5
(Figure: in RAID level 4 the single parity disk is a bottleneck;
RAID level 5 distributes the parity blocks over all the disks)
RAID LEVEL 5
• Single parity drive of RAID level 4 is involved in
every write
– Will limit parallelism
• RAID-5 distributes the parity blocks among the
N+1 drives
– Much better
The small write problem
• Specific to RAID 5
• Happens when we want to update a single block
– Block belongs to a stripe
– How can we compute the new value of the
parity block?

(Figure: a stripe made of blocks b[k], b[k+1], b[k+2], ..., and its
parity block p[k])
First solution
• Read values of N-1 other blocks in stripe
• Recompute
p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Solution requires
– N-1 reads
– 2 writes (new block and new parity block)
Second solution
• Assume we want to update block b[m]
• Read old values of b[m] and parity block p[k]
• Compute
p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]
• Solution requires
– 2 reads (old values of block and parity block)
– 2 writes (new block and new parity block)
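
The second solution as a short C sketch; block size and buffer contents are made up for illustration.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* RAID level 5 small write, second solution:
       new parity = new data XOR old data XOR old parity.
       Costs 2 reads (old data, old parity) and 2 writes (new data, new parity). */
    void update_parity(uint8_t *parity, const uint8_t *old_block,
                       const uint8_t *new_block, size_t block_size) {
        for (size_t i = 0; i < block_size; i++)
            parity[i] ^= old_block[i] ^ new_block[i];
    }

    int main(void) {
        uint8_t old_b[4]  = { 1, 2, 3, 4 };        /* made-up block contents */
        uint8_t new_b[4]  = { 5, 2, 3, 4 };
        uint8_t parity[4] = { 9, 9, 9, 9 };        /* made-up old parity     */
        update_parity(parity, old_b, new_b, sizeof parity);
        printf("new parity[0] = %u\n", parity[0]); /* 9 XOR 1 XOR 5 = 13     */
        return 0;
    }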
RAID level 6 (I)
• Not part of the original proposal
– Two check disks
– Tolerates two disk failures
– More complex updates
RAID level 6 (II)
• Has become more popular as disks are
becoming
– Bigger
– More vulnerable to irrecoverable read errors
• Most frequent cause for RAID level 5 array
failures is
– Irrecoverable read error occurring while
contents of a failed disk are reconstituted
CONNECTING I/O DEVICES
Busses
• Connecting computer subsystems with each
other was traditionally done through busses
• A bus is a shared communication link
connecting multiple devices
• Transmit several bits at a time
– Parallel buses
Examples
• Processor-memory busses
– Connect CPU with memory modules
– Short and high-speed
• I/O busses
– Longer
– Wide range of data bandwidths
– Connect to memory through a processor-memory bus or a backplane bus
Synchronous busses
• Include a clock in the control lines
• Bus protocols expressed as actions to be taken at
each clock pulse
• Have very simple protocols
• Disadvantages
– All bus devices must run at same clock rate
– Due to clock skew issues, cannot be both
fast and long
Asynchronous busses
• Have no clock
• Can accommodate a wide variety of devices
• Have no clock skew issues
• Require a handshaking protocol before any
transmission
– Implemented with extra control lines
Advantages of busses
• Cheap
– One bus can link many devices
• Flexible
– Can add devices
Disadvantages of busses
• Shared devices
– can become bottlenecks
• Hard to run many parallel lines at high clock
speeds
New trend
• Away from parallel shared buses
• Towards serial point-to-point switched
interconnections
– Serial
• One bit at a time
– Point-to-point
• Each line links a specific device to another
specific device
x86 bus organization
• Processor connects to peripherals through two
chips (bridges)
– North Bridge
– South Bridge
x86 bus organization
(Figure: the North Bridge and South Bridge chips connecting the
processor to its peripherals)
North bridge
• Essentially a DMA controller
– Lets disk controller access main memory w/o
any intervention of the CPU
• Connects CPU to
– Main memory
– Optional graphics card
– South Bridge
South Bridge
• Connects North bridge to a wide variety of I/O
busses
Communicating with I/O devices
• Two solutions
– Memory-mapped I/O
– Special I/O instructions
Memory mapped I/O
• A portion of the address space reserved for I/O
operations
– Writes to any of these addresses are
interpreted as I/O commands
– Reading from these addresses gives access
to
• Error bit
• I/O completion bit
• Data being read
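
A C sketch of the idea: device registers show up as memory locations accessed through volatile pointers. The base address, register layout and bit assignments below are entirely hypothetical.

    #include <stdint.h>

    /* Entirely hypothetical device register block, mapped at a made-up
       address; real layouts are defined by the device and the platform. */
    #define DEVICE_BASE  ((uintptr_t)0xFFFF0000u)

    #define STATUS_REG   (*(volatile uint32_t *)(DEVICE_BASE + 0x0))
    #define DATA_REG     (*(volatile uint32_t *)(DEVICE_BASE + 0x4))

    #define STATUS_ERROR 0x1u                    /* error bit (hypothetical)          */
    #define STATUS_DONE  0x2u                    /* I/O completion bit (hypothetical) */

    /* Waits until the device reports completion, then returns the data
       that was read, or -1 if the error bit is set.                      */
    int32_t read_device(void) {
        while ((STATUS_REG & STATUS_DONE) == 0)
            ;                                    /* busy-wait, i.e. polling (see below) */
        if (STATUS_REG & STATUS_ERROR)
            return -1;
        return (int32_t)DATA_REG;
    }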
Memory mapped I/O
• User processes cannot access these addresses
– Only the kernel
• Prevents user processes from accessing the
disk in an uncontrolled fashion
Dedicated I/O instructions
• Privileged instructions that perform the I/O operations
– Cannot be executed by user processes
– Only the kernel can issue them
• Prevents user processes from accessing the
disk in an uncontrolled fashion
Polling
• Simplest way for an I/O device to communicate
with the CPU
• CPU periodically checks the status of pending
I/O operations
– High CPU overhead
I/O completion interrupts
• Notify the CPU that an I/O operation has
completed
• Allows the CPU to do something else while
waiting for the completion of an I/O operation
– Multiprogramming
• I/O completion interrupts are processed by CPU
between instructions
– No internal instruction state to save
Interrupt levels
• See previous chapter
Direct memory access
• DMA
• Lets disk controller access main memory w/o
any intervention of the CPU
DMA and virtual memory
• A single DMA transfer may cross page
boundaries with
– One page being in main memory
– One missing page
Solutions
• Make DMA work with virtual addresses
– Issue is then dealt with by the virtual memory
subsystem
• Break DMA transfers crossing page boundaries
into chains of transfers that do not cross page
boundaries
An Example
(Figure: a DMA transfer that spans two pages is broken into two
DMA transfers, one per page)
DMA and cache hierarchy
• Three approaches for handling temporary
inconsistencies between caches and main
memory
Solutions
1. Routing all DMA accesses through the cache
– Bad solution
2. Have the OS selectively
– Invalidate affected cache entries when
performing a read
– Force an immediate flush of dirty cache
entries when performing a write
3. Have dedicated hardware do the same
Benchmarking I/O
Benchmarks
• Specific benchmarks for
– Transaction processing
• Emphasis on speed and graceful recovery
from failures
– Atomic transactions:
• All or nothing behavior
An important observation
• Very difficult to operate a disk subsystem at a
reasonable fraction of its maximum throughput
– Unless we access sequentially very large
ranges of data
• 512 KB and more
Major fallacies
• Since rated MTTFs of disk drives exceed one
million hours, disk can last more than 100 years
– MTTF expresses the failure rate during the disk's
actual lifetime
• Disk failure rates in the field match the MTTFs
mentioned in the manufacturers' literature
– They are up to ten times higher
Major fallacies
• Neglecting to do end-to-end checks
–…
• Using magnetic tapes to back up disks
– Tape formats can become quickly obsolescent
– Disk bit densities have grown much faster
than tape data densities.
WRITING PARALLEL PROGRAMS
Overview
• Some problems are embarrassingly parallel
– Many computer graphics tasks
– Brute force searches in cryptography or
password guessing
• Much more difficult for other applications
– Communication overhead among sub-tasks
– Amdahl's law
– Balancing the load
Amdahl's Law
• Assume a sequential process takes
– tp seconds to perform operations that could
be performed in parallel
– ts seconds to perform purely sequential
operations
• The maximum speedup will be
(tp + ts )/ts
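
A tiny C sketch of this bound; the 8-second/2-second workload split is a made-up illustration.

    #include <stdio.h>

    /* Maximum speedup when the parallelizable part shrinks to nothing:
       (tp + ts) / ts                                                    */
    double max_speedup(double tp, double ts) {
        return (tp + ts) / ts;
    }

    int main(void) {
        /* Made-up workload: 8 s parallelizable, 2 s purely sequential. */
        printf("maximum speedup = %.1f\n", max_speedup(8.0, 2.0));   /* 5.0 */
        return 0;
    }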
Balancing the load
• Must ensure that workload is equally divided
among all the processors
• Worst case is when one of the processors does
much more work than all others
A last issue
• Humans like to address issues one after the
other
– We have meeting agendas
– We do not like to be interrupted
– We write sequential programs
MULTIPROCESSOR ORGANIZATIONS
Shared memory multiprocessors
(Figure: several processing units, each with its own cache, connected
through an interconnection network to a shared RAM and shared I/O)
Shared memory multiprocessor
• Can offer
– Uniform memory access to all processors
(UMA)
• Easiest to program
– Non-uniform memory access to all processors
(NUMA)
• Can scale up to larger sizes
• Offer faster access to nearby memory
Computer clusters
(Figure: several nodes, each with its own processing unit, cache and
private RAM, connected through an interconnection network)
Computer clusters
• Very easy to assemble
• Can take advantage of high-speed LANs
– Gigabit Ethernet, Myrinet, …
• Data exchanges must be done through
message passing
HARDWARE MULTITHREADING
General idea
• Let the processor switch to another thread of
computation while the current one is stalled
• Motivation:
– Increased cost of cache misses
Implementation
• Entirely controlled by the hardware
– Unlike multiprogramming
• Requires a processor capable of
– Keeping track of the state of each thread
• One set of registers, including the PC, for
each concurrent thread
– Quickly switching among concurrent threads
Approaches
• Fine-grained multithreading:
– Switches between threads for each instruction
– Provides the highest throughput
– Slows down execution of individual threads
Approaches
• Coarse-grained multithreading
– Switches between threads whenever a long
stall is detected
– Easier to implement
– Cannot eliminate all stalls
Approaches
• Simultaneous multi-threading:
– Takes advantage of the possibility of modern
hardware to perform different tasks in parallel
for instructions of different threads
– Best solution
ALPHABET SOUP
Classification
• SISD:
– Single instruction, single data
– Conventional uniprocessor architecture
• MIMD:
– Multiple instructions, multiple data
– Conventional multiprocessor architecture
Classification
• SIMD:
– Single instruction, multiple data
– Perform same operations on a set of similar
data
• Think of adding two vectors
for (i = 0; i < VECSIZE; i++)
    sum[i] = a[i] + b[i];
PERFORMANCE ISSUES
Roofline model
• Takes into account
– Memory bandwidth
– Floating-point performance
• Introduces arithmetic intensity
– Total number of floating point operations in a
program divided by total number of bytes
transferred to main memory
– Measured in FLOPS/byte
Roofline model
• Attainable GFLOPS/s =
Min(Peak Memory BW × Arithmetic Intensity,
Peak Floating-Point Performance)
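
The same formula as a small C helper; the bandwidth and peak figures below are placeholders, not measured values.

    #include <stdio.h>

    /* Attainable GFLOPS/s = Min(Peak Memory BW x Arithmetic Intensity,
                                 Peak Floating-Point Performance)        */
    double attainable_gflops(double peak_bw_gbytes_s, double intensity, double peak_gflops) {
        double bw_bound = peak_bw_gbytes_s * intensity;
        return bw_bound < peak_gflops ? bw_bound : peak_gflops;
    }

    int main(void) {
        /* Placeholder machine: 4 GB/s memory bandwidth, 16 GFLOPS/s peak. */
        for (double ai = 0.5; ai <= 8.0; ai *= 2.0)
            printf("intensity %.1f FLOPS/byte -> %.1f GFLOPS/s\n",
                   ai, attainable_gflops(4.0, ai, 16.0));
        return 0;
    }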
(Figure: attainable GFLOPS/s versus arithmetic intensity; at low
intensity floating-point performance is limited by memory bandwidth,
at high intensity it is capped by the peak floating-point performance)