It’s the end of the class as we know it

Last lecture
Some misc. stuff
An older real processor
Class review/overview.
Misc. Status issues
• Saturday (4/18) @9pm
– Project due.
• Sunday 4/19:
– Review session 4-6pm (DOW 1017)
– Pizza House 7-9pm (RSVP request posted, please respond by
tomorrow)
• Tuesday (4/21):
– Project talk groups
• Sign up if you haven’t yet done so!
– Written report due (via e-mail) @9pm
• Wednesday (4/22)
– Review session 3-4:30pm (1500 EECS)
• Thursday (4/23)
– Exam at 4pm in our classroom. That’s 4pm sharp, not Michigan time!
Office hour changes
• Friday 4/17 moved to 10:30am-12 in my
office (4632 Beyster)
• Tuesday 4/21 office hours moved to
3:15-4:45 in my office
Stuff still to do
• Oral report
– Don’t forget to be there for the whole hour (or
longer if your group is during class time)
– PowerPoint or other slides
• Bring either a portable computer or a USB stick
• Written report
– Due 9pm Tuesday via e-mail.
AMD 64-bit core
Most taken from
http://www.chip-architect.com/
[Figure annotation: bit-interleaved busses running “North-South”]
Integer
Decode/Dispatch
• 3 types of instructions
– Direct path
• RISC-like
– Vector path
• Broken into smaller instructions via microcode.
– Double
• 128-bit instructions which can be broken into two independent
64-bit instructions (called Doubles)
• Others are done via microcode
• Most 128-bit SSE and SSE2 are made into doubles.
RS
• Each cycle an instruction is issued into
one of 3 lanes.
– Each lane has
• 8 RSs
• 1 ALU
• 1 AGU (Address Generation Unit)
– Each RS sees broadcasts from all ALUs,
AGUs, L/S units etc.
Rename
• Break the physical register file into 2 parts
(sort of like P6 scheme with ARF/RoB)
– 72 in-flight instructions are kept in the RoB
• The other structure is the IFFRF: Integer
Future File and Register File
– 16 registers of committed state
– 16 “future registers”
– 8 scratch-pad registers
Future file
• In the P6 scheme we had to look in 3 places for the
data
– The PRF
– The RoB
– The CDB (later)
• Here we look in the FF or the CDB-like-things
later.
– The FF holds the speculative value if it is known.
– At execution complete, instructions check to see if
they were the last thing dispatched that writes to a
given physical register.
• This is done by tagging the FF with the RoB number.
– If they were the last to have that AR as a destination,
they update the FF.
How do we use the FF?
• At dispatch we:
– Check the FF for source operands
– Reserve a spot in the RoB
– Place our tag (RoB number) in the FF
– Mark the FF entry as invalid
• At EX complete we:
– Send RoB number and data to the CDB
– Send data to the RoB
– Update FF if tag matches
• At retire
– update ARF value (from RoB)
• At mispredict
– Copy ARF value into FF.
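The dispatch, EX-complete, retire, and mispredict steps above can be sketched as a toy model. This is a minimal sketch in Python, not the actual AMD design: it assumes 16 architectural registers and a simple FIFO RoB, and every name in it (`FutureFileCore`, `dispatch`, `complete`, `retire`, `flush`) is hypothetical.

```python
from collections import deque


class FutureFileCore:
    """Toy model of the future-file (FF) scheme; all names are hypothetical."""

    def __init__(self, num_regs=16):
        self.arf = [0] * num_regs         # committed (architectural) state
        self.ff_val = [0] * num_regs      # speculative values
        self.ff_tag = [None] * num_regs   # RoB number of last writer; None = valid
        self.rob = deque()                # in-order list of in-flight instructions
        self.next_tag = 0

    def dispatch(self, dest, srcs):
        # Check the FF for source operands: a value if valid, else the producer's tag.
        operands = [("val", self.ff_val[s]) if self.ff_tag[s] is None
                    else ("tag", self.ff_tag[s]) for s in srcs]
        tag = self.next_tag
        self.next_tag += 1
        # Reserve a spot in the RoB.
        self.rob.append({"tag": tag, "dest": dest, "data": None})
        # Place our tag in the FF, which also marks the entry invalid.
        self.ff_tag[dest] = tag
        return tag, operands

    def complete(self, tag, data):
        # Send data to the RoB (the CDB broadcast to the RSs is not modeled).
        for entry in self.rob:
            if entry["tag"] == tag:
                entry["data"] = data
                dest = entry["dest"]
                if self.ff_tag[dest] == tag:   # still the last writer?
                    self.ff_val[dest] = data
                    self.ff_tag[dest] = None   # FF entry becomes valid
                return

    def retire(self):
        # Retire the RoB head if it has finished executing.
        if not self.rob or self.rob[0]["data"] is None:
            return False
        entry = self.rob.popleft()
        self.arf[entry["dest"]] = entry["data"]   # update ARF from the RoB
        return True

    def flush(self):
        # Mispredict: drop speculative work and copy the ARF into the FF.
        self.rob.clear()
        self.ff_val = list(self.arf)
        self.ff_tag = [None] * len(self.arf)
```

Note that when two in-flight instructions write the same register, only the younger one updates the FF at completion; the older result reaches the ARF through the RoB at retire.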
What did the FF buy us?
• P6-like advantages
– No free-list for PRF
– Can just clear the RAT on mis-predict.
• But no need to access the RoB looking for
data
– RoB data only written once (EX complete)
and only read once (Commit)
• Some pain
– Early branch resolution looks hard
ROB: An 8-bit descriptor for 72 entries
Re-Order-Buffer Tag definition (Instruction In Flight Number)
• bit 7: wrap bit
• bits 6..2: re-order buffer index 0..23
• bits 1..0: sub-index 0..2
1) A sub-index 0,1 or 2 which identifies from which of the three lanes the
instruction was dispatched.
2) A value 0..23 that identifies the “cycle” in which the instruction was
dispatched. The “cycle counter” wraps to 0 after reaching 23.
3) A wrap bit. When two instructions have different wrap bits then the cycle
counter has wrapped between the dispatches.
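The three fields can be packed into one byte and compared for age. The sketch below is only an illustration: it assumes the layout bit 7 = wrap, bits 6..2 = cycle index, bits 1..0 = lane (the only split of 8 bits that fits the stated ranges), and it assumes lane 0 is the oldest slot within a cycle. The function names are hypothetical.

```python
LANES = 3     # sub-index 0..2
CYCLES = 24   # cycle counter 0..23


def pack_tag(wrap, cycle, lane):
    """Pack into 8 bits: bit 7 wrap, bits 6..2 cycle index, bits 1..0 lane."""
    assert wrap in (0, 1) and 0 <= cycle < CYCLES and 0 <= lane < LANES
    return (wrap << 7) | (cycle << 2) | lane


def unpack_tag(tag):
    """Return (wrap, cycle, lane) from an 8-bit tag."""
    return (tag >> 7) & 1, (tag >> 2) & 0x1F, tag & 0x3


def is_older(a, b):
    """True if tag `a` was dispatched before tag `b`.

    Assumes both instructions are in flight, so at most one wrap of the
    cycle counter separates them.
    """
    wrap_a, cycle_a, lane_a = unpack_tag(a)
    wrap_b, cycle_b, lane_b = unpack_tag(b)
    if wrap_a == wrap_b:
        return (cycle_a, lane_a) < (cycle_b, lane_b)
    # Different wrap bits: the counter wrapped between the two dispatches,
    # so the tag with the larger cycle index is the older one.
    return cycle_a > cycle_b
```

With 24 cycles of 3 lanes each, the tag names exactly 72 slots, and the wrap bit is what lets the comparison stay correct across the 23-to-0 rollover.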
More on the RoB
• What is basically happening is that we
have three RoBs
– Each one size 24
– We cycle through each one so that none gets
ahead of the others.
– Reduces read/write ports!
– “Banking”
Mispredictions
• It looks like they wait until retirement to
resolve all exceptions.
– Mispredictions are treated as exceptions!
• They just clear everything and have the
retired registers overwrite the speculative
ones in the IFFRF
More details.
• Each x86 instruction can launch both an
ALU and an AGU operation
– Because x86 has lots of memory operations
this makes sense.
• ALUs broadcast result tag one cycle early
– So the RS can launch a waiting instruction toward the ALU
before the result data actually arrives.
Class summary
• Major topics
– ILP in hardware (Out-of-order processors)
• How they work AND why we use them
– Caches and Virtual Memory
– Multi-processor
– ILP in software (Compiler, IA-64)
– Power
• Less major topics
– Memory disambiguation
– Branch prediction
• Direction and target
– Advanced OoO issues
• Superscalar, instruction scheduling, multi-threading, etc.
The big questions
• What is computer architecture?
• What are the metrics of performance?
• What are the techniques we use to
maximize these metrics?
ILP in hardware (1/2)
• ILP definitions
– Hazards vs dependencies
• Data, Name and Control dependencies
– What ILP means and finding it.
• Dynamic Scheduling
– Tomasulo’s (three versions!)
• You can be promised a question on this!
• Branch Prediction
– Local, global, hybrid/correlating
• Tournament and gshare
– BTBs
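As a reminder of how a global-history direction predictor works, here is a minimal gshare sketch. It assumes a table of 2-bit saturating counters indexed by the branch PC XORed with the global history; the table size, the initial counter value, and the class and method names are illustrative choices, not taken from the lecture.

```python
class Gshare:
    """Toy gshare direction predictor (sizes and names are illustrative)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # 2-bit counters, weakly not-taken

    def _index(self, pc):
        # gshare: XOR the PC with the global history to pick a counter.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

A branch that strictly alternates taken/not-taken defeats a single 2-bit counter, but here the two history patterns map to different counters, so the predictor locks on after a short warm-up.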
ILP in hardware (2/2)
• Multiple Issue
– Static
• Static Superscalar
• VLIW
– Dynamic superscalar
• Speculation
– Branch, data
• ILP limit studies
ILP in hardware: Questions
• True or False
1. The original Tomasulo’s algorithm only allows reordering
within basic blocks
2. In P6, if it weren’t for precise interrupts, it would be
okay to retire instructions out-of-order as long as
they had finished executing and a branch isn’t
skipped over.
3. ILP in hardware is limited in scope due to the
“instruction window” which is basically the size of
the RS.
Quick idea: SMT
• One processor, two threads.
Caching (1/2)
• There is a huge amount of stuff associated
with caching. The important stuff
– Locality
• Temporal/Spatial
• 3 C’s model
• Stack distance model
– Nuts-and-bolts
• Replacement policies (LRU, pseudo-LRU)
• Performance (hit rate, Thit, Tmiss, average access time)
• Write back/Write through
• Block size
– Basic improvement
• Multi-level cache
• Critical word first
• Write buffers
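The average access time calculation, and how a second cache level changes it, can be sketched as follows. This assumes the common convention that the miss term is the extra penalty paid on top of the hit time; the function names and the numbers in the usage lines are made up for illustration.

```python
def average_access_time(t_hit, miss_rate, t_miss_penalty):
    """Average access time = Thit + miss_rate * (penalty paid on a miss)."""
    return t_hit + miss_rate * t_miss_penalty


def two_level_access_time(t_l1, miss_l1, t_l2, miss_l2, t_mem):
    """With an L2, the L1 miss penalty is itself the L2's average access time."""
    l2_time = average_access_time(t_l2, miss_l2, t_mem)
    return average_access_time(t_l1, miss_l1, l2_time)


# Illustrative numbers (cycles): 1-cycle L1 with a 10% miss rate, 10-cycle L2
# with a 20% local miss rate, 100-cycle memory.
single = average_access_time(1, 0.10, 100)
multi = two_level_access_time(1, 0.10, 10, 0.20, 100)
```

Under these assumed numbers the L2 cuts the average from about 11 cycles to about 4, which is the motivation for multi-level caches.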
Caching (2/2)
• Non-standard caches
– Hash
– Victim
– Skew
• Misc.
– Virtual addresses and caching
– Impact of prefetching
– Latency hiding with OoO execution
Cache: Questions (1/2)
• Changing __________ has an impact on
compulsory misses.
• A victim cache is more likely to help with
________ than ________ though it can help
both (3 C’s)
• At least _____ bits are required to keep exact
track of LRU in a 5-way associative cache.
Cache: Questions (2/2)
• A ____________ cache has a number of
sets equal to the number of lines in the
cache.
• A fully-associative cache with N lines will
miss an access that has a stack distance
of ________ (state the largest range you
can).
Multi-processor
• Amdahl’s law as it applies to MP.
• Bus-based multi-processor
– Snooping
– MESI
– Bus transaction types (BRL etc.)
• Distributed-shared
– Directory schemes
• Synchronization
– Critical sections
– Spin-locks
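Amdahl’s law applied to multiprocessors can be written as a one-line function. A small sketch, assuming the program splits cleanly into a serial part and a perfectly parallelizable part; the function name and the example fraction are illustrative.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup = 1 / ((1 - p) + p / N) for parallel fraction p on N processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Even with 90% of the work parallel, the serial 10% caps the speedup at 10x.
for n in (2, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

The takeaway for MP design is that adding processors gives diminishing returns once the serial fraction dominates.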
Multi-processor: Question
• Under the MESI protocol what is the
advantage of having a distinct clean and
dirty exclusive state?
Software techniques for ILP (1/2)
• Pipeline scheduling
– Reordering instructions in a basic block to remove
pipe stalls
– Loop unrolling
• Static information passed to processor
– Static branch prediction
– Static dependence information
• Loop issues
– Detecting loop dependencies
– Software pipelining
Software techniques for ILP (2/2)
• Global code scheduling
– Predicated instructions and CMOV
– Memory reference speculation
– Issues with preserving exception behavior
• IA-64 as a case study of hardware support for
software ILP techniques
– Speculative loads
– Advanced loads
– Software pipelining optimizations
Software techniques for ILP: Questions
• What is the most significant disadvantage of
loop unrolling?
• Using CMOV, rewrite the following code
snippet, removing the branch. Don’t change
exception behavior, and assume DIV only
causes an exception if R3=0
BNE R1 R2 skip
R1=R2/R3
skip: nop
Power
• Understand why it’s important
• Power vs. Energy
• How it’s related to the existence of multicore
• Understand voltage scaling issues
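The power vs. energy distinction and the link to multicore can be made concrete with the usual first-order CMOS dynamic-power approximation, P ≈ a·C·V²·f. This is a rough sketch that ignores leakage; the function names, capacitance, voltages, and frequencies are all made-up illustrative values.

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """First-order dynamic power in watts: activity * C * V^2 * f (no leakage)."""
    return activity * c_eff * vdd ** 2 * freq


def energy(power_watts, seconds):
    """Energy in joules for a constant power draw over a time interval."""
    return power_watts * seconds


# One core at full voltage and frequency...
one_fast = dynamic_power(c_eff=1e-9, vdd=1.0, freq=2e9)
# ...versus two cores at half the frequency, assuming the lower frequency
# lets the supply voltage drop to 0.8 V.
two_slow = 2 * dynamic_power(c_eff=1e-9, vdd=0.8, freq=1e9)
ratio = two_slow / one_fast
```

If the workload parallelizes perfectly, the two slower cores deliver the same throughput at roughly 64% of the dynamic power under these assumed values, which is the voltage-scaling argument for multicore.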