Finishing out EECS 470

A few snapshots of the real world
Real processors:
How they differ from your project.
• What we’ve talked about so far isn’t grounded
by the real world in any meaningful way.
– That is, we haven’t really looked at how real
processors do things
• Today we’ll look at two processors
– We’ll start with a 2003 core from AMD
• Lots of details available, close to your project
– Then jump to the latest Intel core.
• Look at a performance issue
AMD 64-bit core
Mostly taken from
http://www.chip-architect.com/
[Figure annotation: bit-interleaved busses running "North-South"]
Integer
Decode/Dispatch
• 3 types of instructions
– Direct path
• RISC-like
– Vector path
• Broken into smaller instructions via microcode.
– Double
• 128-bit instructions that can be broken into two independent
64-bit instructions (called Doubles)
• Others are done via microcode
• Most 128-bit SSE and SSE2 instructions are made into Doubles.
RS
• Each cycle an instruction is issued into one of
3 lanes.
– Each lane has
• 8 RSs
• 1 ALU
• 1 AGU (Address Generation Unit)
– Each RS sees broadcasts from all ALUs, AGUs, L/S
units etc.
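The broadcast-and-wakeup behaviour described above can be sketched in C. This is only an illustration under stated assumptions: the lane and RS counts come from the slide, but the entry layout, the names (rs_entry_t, rs_broadcast, rs_select) and the "first ready entry wins" selection policy are hypothetical and are not taken from the AMD design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_LANES   3   /* from the slide */
#define RS_PER_LANE 8   /* from the slide */

/* Hypothetical RS entry: each of two sources is either ready (value
 * captured) or waiting on the tag of the instruction that produces it. */
typedef struct {
    bool     busy;
    bool     src_ready[2];
    uint8_t  src_tag[2];
    uint64_t src_val[2];
} rs_entry_t;

static rs_entry_t rs[NUM_LANES][RS_PER_LANE];

/* Every RS in every lane snoops each result broadcast and captures the
 * value if it matches a tag the entry is waiting on. */
void rs_broadcast(uint8_t tag, uint64_t value) {
    for (int lane = 0; lane < NUM_LANES; ++lane) {
        for (int i = 0; i < RS_PER_LANE; ++i) {
            rs_entry_t *e = &rs[lane][i];
            if (!e->busy)
                continue;
            for (int s = 0; s < 2; ++s) {
                if (!e->src_ready[s] && e->src_tag[s] == tag) {
                    e->src_val[s]   = value;
                    e->src_ready[s] = true;
                }
            }
        }
    }
}

/* Returns the index of the first entry in the lane with both sources
 * ready, or -1 if none is ready. The selection policy is an assumption. */
int rs_select(int lane) {
    assert(lane >= 0 && lane < NUM_LANES);
    for (int i = 0; i < RS_PER_LANE; ++i) {
        const rs_entry_t *e = &rs[lane][i];
        if (e->busy && e->src_ready[0] && e->src_ready[1])
            return i;
    }
    return -1;
}
```

An entry waiting on tag 42 is not selectable until rs_broadcast(42, value) runs; after that, rs_select for its lane returns its index.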
Rename
• Break the physical register file into 2 parts
(sort of like P6 scheme with ARF/RoB)
– 72 in-flight instructions are kept in the RoB
• The other structure is the IFFRF: Integer Future
File and Register File
– 16 registers of committed state
– 16 “future registers”
– 8 scratch-pad registers
Future file
• In the P6 scheme we had to look 3 places for the
data
– The PRF
– The RoB
– The CDB (later)
• Here we look in the FF or the CDB-like-things later.
– The FF holds the speculative value if it is known.
– At execution complete, an instruction checks whether it was
the last one dispatched that writes a given architectural
register (AR).
• This is done by tagging the FF with the RoB number.
– If it was the last to have that AR as a destination, it
updates the FF.
How does the
• At issue we:
– Check the FF for source operands
– Reserve a spot in the RoB
– Place our tag (RoB number) in the FF
– Mark the FF entry as invalid
• At EX complete we:
– Send RoB number and data to the CDB
– Send data to the RoB
– Update FF if tag matches
• At retire
– update ARF value (from RoB)
• At mispredict
– Copy ARF value into FF.
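The four events listed above (issue, EX complete, retire, mispredict) can be sketched as a small C model. This is a sketch under assumptions, not the actual AMD logic: the 16-register and 72-entry sizes come from the slides, but the struct layouts, the function names and the use of a plain RoB index as the tag are hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_AR   16   /* architectural registers (from the slide) */
#define ROB_SIZE 72   /* in-flight instructions (from the slide)  */

typedef struct { bool valid; uint8_t tag; uint64_t value; } ff_entry_t;
typedef struct { uint8_t dest_ar; bool done; uint64_t value; } rob_entry_t;

static uint64_t    arf[NUM_AR];      /* committed state          */
static ff_entry_t  ff[NUM_AR];       /* speculative future file  */
static rob_entry_t rob[ROB_SIZE];
static unsigned    rob_head, rob_tail, rob_count;

/* Issue: check the FF for a source operand. Returns true with the value
 * if it is known, else false with the producer's tag to wait on. */
bool ff_read(uint8_t ar, uint64_t *value, uint8_t *tag) {
    assert(ar < NUM_AR);
    if (ff[ar].valid) { *value = ff[ar].value; return true; }
    *tag = ff[ar].tag;
    return false;
}

/* Issue: reserve a RoB slot, place its tag in the FF, mark FF invalid. */
uint8_t ff_issue(uint8_t dest_ar) {
    assert(dest_ar < NUM_AR && rob_count < ROB_SIZE);
    uint8_t tag = (uint8_t)rob_tail;
    rob[tag] = (rob_entry_t){ .dest_ar = dest_ar, .done = false, .value = 0 };
    rob_tail = (rob_tail + 1) % ROB_SIZE;
    ++rob_count;
    ff[dest_ar].valid = false;
    ff[dest_ar].tag   = tag;
    return tag;
}

/* EX complete: write the RoB; update the FF only if the tag still matches,
 * i.e. this was the last instruction issued that writes the register. */
void ff_complete(uint8_t tag, uint64_t value) {
    assert(tag < ROB_SIZE);
    rob[tag].value = value;
    rob[tag].done  = true;
    ff_entry_t *e = &ff[rob[tag].dest_ar];
    if (!e->valid && e->tag == tag) { e->value = value; e->valid = true; }
}

/* Retire: copy the oldest completed result from the RoB into the ARF. */
bool ff_retire(void) {
    if (rob_count == 0 || !rob[rob_head].done)
        return false;
    arf[rob[rob_head].dest_ar] = rob[rob_head].value;
    rob_head = (rob_head + 1) % ROB_SIZE;
    --rob_count;
    return true;
}

/* Mispredict: squash everything in flight and copy the ARF into the FF. */
void ff_mispredict(void) {
    for (int ar = 0; ar < NUM_AR; ++ar)
        ff[ar] = (ff_entry_t){ .valid = true, .tag = 0, .value = arf[ar] };
    rob_head = rob_tail = rob_count = 0;
}
```

Calling ff_mispredict() once at start-up also serves as a reset, since it makes every FF entry a valid copy of the committed state. Note that the RoB data is written once (ff_complete) and read once (ff_retire), matching the advantage claimed on the next slide.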
What did the FF buy us?
• P6-like advantages
– No free-list for PRF
– Can just clear the RAT on mis-predict.
• But no need to access the RoB looking for data
– RoB data only written once (EX complete) and
only read once (Commit)
• Some pain
– Early branch resolution looks hard
ROB
• It uses an 8-bit descriptor for 72 entries.
Re-Order-Buffer Tag definition (Instruction In Flight Number)
bit 7: wrap bit
bits 6..2: re-order buffer index 0..23
bits 1..0: sub-index 0..2
1) A sub-index 0,1 or 2 which identifies from which of the three lanes the
instruction was dispatched.
2) A value 0..23 that identifies the "cycle" in which the instruction was
dispatched. The "cycle counter" wraps to 0 after reaching 23.
3) A wrap bit. When two instructions have different wrap bits then the cycle
counter has wrapped between the dispatches.
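The three tag fields above can be packed and compared as in the following C sketch. It assumes the wrap bit sits in bit 7, the cycle index in bits 6..2 and the sub-index in bits 1..0 (1 + 5 + 2 = 8 bits), and that within one dispatch cycle a lower lane number means an older instruction; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ROB_LANES  3    /* sub-index 0..2  */
#define ROB_CYCLES 24   /* cycle index 0..23 */

/* Pack wrap bit, cycle index and lane into the 8-bit tag. */
uint8_t rob_tag_make(unsigned wrap, unsigned cycle, unsigned lane) {
    assert(wrap < 2 && cycle < ROB_CYCLES && lane < ROB_LANES);
    return (uint8_t)((wrap << 7) | (cycle << 2) | lane);
}

unsigned rob_tag_wrap(uint8_t t)  { return t >> 7; }
unsigned rob_tag_cycle(uint8_t t) { return (t >> 2) & 0x1Fu; }
unsigned rob_tag_lane(uint8_t t)  { return t & 0x3u; }

/* Position 0..71 in dispatch order within one pass of the cycle counter. */
unsigned rob_tag_slot(uint8_t t) {
    return rob_tag_cycle(t) * ROB_LANES + rob_tag_lane(t);
}

/* True if a was dispatched before b. Both must be in flight together, so
 * at most one wrap of the cycle counter separates them. */
bool rob_tag_older(uint8_t a, uint8_t b) {
    unsigned pa = rob_tag_slot(a), pb = rob_tag_slot(b);
    if (rob_tag_wrap(a) == rob_tag_wrap(b))
        return pa < pb;   /* same pass: lower slot is older        */
    return pa > pb;       /* counter wrapped between a and b       */
}
```

For example, the tag for (wrap 0, cycle 23, lane 2) is the last slot of one pass, and it compares as older than (wrap 1, cycle 0, lane 0), the first slot of the next pass, even though its slot number is larger.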
More on the RoB
• What is basically happening is that we have
three RoBs
– Each one size 24
– We cycle through each one so that none get
ahead of the other.
– Reduces read/write ports!
Mispredictions
• It looks like they wait until retirement to
resolve all exceptions.
– Mispredictions are treated as exceptions!
• They just clear everything and have the retired
registers overwrite the speculative ones in the
IFFRF
More details.
• Each x86 instruction can launch both an ALU
and an AGU operation
– Because x86 has lots of memory operations this
makes sense.
• ALUs broadcast result tag one cycle early
– So RS can launch data to the ALU before data
arrives.
Intel’s Haswell
• Latest Intel microarchitecture
– 22nm process
– 4-wide OoO processor
– x86
• An evolution, not revolution
– Very similar to architectures from the last 8 years.
http://www.anandtech.com/show/6355/intels-haswell-architecture
Intel Basics
• Converts x86 instructions into microops
– RISC-like instructions
– Even more basic than RISC in some cases
• Loads and Stores generally turn into two instructions
– Address compute and memory access
What’s interesting?
• Seeing how things have changed compared to
previous microarchitectures
• Transactional support
• Power issues
The three recent frontends
Buffer sizes
• 192 RoB entries
• 60 RS
• 72 Loads
• 42 stores
Other key features
• Transactional synchronization
– Execute lock-protected section
– Don’t acquire lock
– If someone else is doing the same thing at the same time
• Undo all memory accesses
• Do again with locks.
• Why?
• New sleep states
– More like handheld devices.
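The transactional-synchronization bullets above (run the critical section without taking the lock, and redo it with the lock if another thread conflicts) can be sketched in C. This is a hedged illustration, not something shown on the slide: it assumes Intel's RTM intrinsics (_xbegin, _xend, _xabort from immintrin.h), which are used only when the compiler defines __RTM__ (for example with -mrtm, on hardware that supports it). The names lock_word, shared_total and elided_add are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#ifdef __RTM__
#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#endif

static atomic_int lock_word;      /* 0 = free, 1 = held */
static long       shared_total;   /* data protected by lock_word */

static void lock_acquire(void) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock_word, &expected, 1))
        expected = 0;             /* spin until we swap 0 -> 1 */
}

static void lock_release(void) {
    atomic_store(&lock_word, 0);
}

void elided_add(long x) {
#ifdef __RTM__
    /* Speculative path: execute the section without acquiring the lock. */
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Reading the lock puts it in the transaction's read set, so a
         * thread that later takes the lock aborts this transaction. */
        if (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0)
            _xabort(0xff);
        shared_total += x;
        _xend();                  /* commit: all writes appear at once */
        return;
    }
    /* Abort: hardware has undone all memory accesses; fall through. */
#endif
    /* Fallback path: do it again with the lock. */
    lock_acquire();
    shared_total += x;
    lock_release();
}
```

When __RTM__ is not defined, the function simply takes the lock every time, so the sketch still builds and behaves correctly on machines without transactional support. One answer to the slide's "Why?" is visible here: threads that do not actually conflict never serialize on the lock.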
Microarchitecture and performance
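/* N and counter are not declared on the slide; these declarations are
   inferred from the disassembly (0x17d78400 = 400,000,000, and counter is
   a 64-bit value loaded and stored on every iteration). */
const unsigned N = 400u * 1000u * 1000u;
volatile unsigned long long counter = 0;

void tightloop() {
    unsigned j;
    for (j = 0; j < N; ++j)
        counter += j;
}
tightloop() runs in 0.68 sec

void foo() { }

void loop_with_extra_call() {
    unsigned j;
    for (j = 0; j < N; ++j) {
        __asm__("call foo");
        counter += j;
    }
}
loop_with_extra_call() runs in 0.60 sec
Why?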
http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly/
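To reproduce a comparison like the one above, a small timing harness can wrap both loops. This is a minimal sketch under assumptions: it replaces the slide's inline `call foo` assembly with an ordinary call to a function marked noinline (a GCC/Clang attribute), takes the iteration count as a parameter, and uses POSIX clock_gettime; time_loop is a hypothetical helper. Measured times depend on the machine and compiler flags, so none are claimed here.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime under strict -std */
#include <assert.h>
#include <stdint.h>
#include <time.h>

static volatile uint64_t counter;

/* noinline plus an empty asm statement keeps the compiler from removing
 * the call, so the loop really executes a call/ret pair per iteration. */
__attribute__((noinline)) static void foo(void) { __asm__ volatile(""); }

void tightloop(uint64_t n) {
    for (uint64_t j = 0; j < n; ++j)
        counter += j;
}

void loop_with_extra_call(uint64_t n) {
    for (uint64_t j = 0; j < n; ++j) {
        foo();
        counter += j;
    }
}

/* Resets counter, runs fn(n), and returns the elapsed wall time in seconds. */
double time_loop(void (*fn)(uint64_t), uint64_t n) {
    struct timespec t0, t1;
    assert(fn != 0);
    counter = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_loop(tightloop, 400000000) and time_loop(loop_with_extra_call, 400000000) from a main function and printing the two results gives the comparison; it is worth checking the generated assembly, since code alignment and optimization level can change which loop is faster.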
0000000000400530 <tightloop>:
400530: xor %eax,%eax
400532: nopw 0x0(%rax,%rax,1)
400538: mov 0x200b01(%rip),%rdx # 601040 <counter>
40053f: add %rax,%rdx
400542: add $0x1,%rax
400546: cmp $0x17d78400,%rax
40054c: mov %rdx,0x200aed(%rip) # 601040 <counter>
400553: jne 400538 <tightloop+0x8>
400555: repz retq
400557: nopw 0x0(%rax,%rax,1)
0000000000400560 <foo>:
400560: repz retq
0000000000400570 <loop_with_extra_call>:
400570: xor %eax,%eax
400572: nopw 0x0(%rax,%rax,1)
400578: callq 400560 <foo>
40057d: mov 0x200abc(%rip),%rdx # 601040 <counter>
400584: add %rax,%rdx
400587: add $0x1,%rax
40058b: cmp $0x17d78400,%rax
400591: mov %rdx,0x200aa8(%rip) # 601040 <counter>
400598: jne 400578 <loop_with_extra_call+0x8>
40059a: repz retq
40059c: nopl 0x0(%rax)