Computer Engineering Fundamentals
Computer Architecture
Fall 2006
R. Venkatesan
EN-4036 737-8900 [email protected]
Course details
 Objective: Review basic computer architecture topics and thus prepare students for ENGR9861 (high-performance computer architecture)
 Course evaluation: final exam (60%), assignments (2 x 20%)
 Classes: Mondays 11 – 12, EN-4033
 Problem sets: 3 (selected problems in the first 2 => submit)
 Tutorials: Thursdays 11 – 12, EN-4033, Wang Guan
 Plagiarism: consult the course information sheet
 Textbook: Hennessy & Patterson, Computer Architecture: A Quantitative Approach
 Syllabus: Chapters 1, 2 & 5 in the textbook
   Computer organization
   Measuring performance, speedup
   Instruction set architecture, MIPS
   Memory organization: cache, main memory, virtual memory
 Course notes: will be made available on the web
Fall 2006
ENGR9859 R. Venkatesan
Computer Architecture
Computer Organization
 Input & Output units: microphone, keyboard, mouse, monitor, speaker, etc.
 CPU: processor (microprocessor) that includes the ALU, registers, internal buses, cache, and controller
   Program Counter: contains the address of the next instruction to be executed
   General-purpose registers: temporary storage (not memory)
 Memory unit: kernel of the operating system in ROM, higher levels of cache, DRAM (main memory), disks
 Network: LAN, WLAN, WAN: Ethernet, Internet, ATM
Ways to improve the performance of a computer
 Use faster materials: silicon, GaAs, InP
 Use faster technology: photochemical lithography
 Employ better architecture within one processor
   Selection of instruction set: RISC/CISC, VLIW
   Cache (levels of cache): higher throughput
   Virtual memory: relocatability, security
   Pipelining: k stages gives a maximum speedup of k
     Superpipelining
     Superscalar (multiple pipelines) with dynamic scheduling
     Branch prediction
 Use multiple processors: emphasis of ENGR9861
   Scalability, level of parallelism
   Shared memory, array processing, multicomputers, MPP
 Employ better software: compilers, etc.
Speedup
 Any (architectural) enhancement will hopefully lead to
better performance, and speedup is a measure of this
improvement.
 Performance improvement should be based on the
total CPU time taken to execute the application, and
not just any of the component times like memory
access time or clock period.
 If the whole processor is replicated, then the fraction
enhanced is 100%, as the whole computation will be
impacted.
 If an enhancement affects only a part of the
computation, then we need to determine the fraction
of the CPU time impacted by the enhancement.
Amdahl’s Law
 The following simple, but important law tells us that we need to
always aim at making enhancements that will affect a large
fraction of the computation, if not the whole computation.
Speedup = (Performance for entire task using the enhancement when possible)
          / (Performance for entire task without using the enhancement)

Speedup = 1 / ((1 - Fraction enhanced) + (Fraction enhanced / Speedup of the enhancement))
CPU (Computation) time
 CPU time is the product of three quantities:
   Number of instructions executed, or Instruction Count (IC): remember this is not the code (program) size
   Average number of clock cycles per instruction (CPI): if CPI varies across instructions, a weighted average is needed
   Clock period (τ)
 CPU time = IC * CPI * τ
 An architectural (or compiler-based) enhancement that is aimed at decreasing one of the above three factors might end up increasing one or both of the other two. It is the product of the three quantities after applying the enhancement that gives us the new CPU time.
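The three-factor product can be checked with a short Python sketch (the function name and the example numbers are illustrative, not from the course):

```python
def cpu_time(ic, cpi, clock_period_s):
    """CPU time = instruction count (IC) * average CPI * clock period (tau)."""
    return ic * cpi * clock_period_s

# Example: 10 million instructions, average CPI of 1.85, 1 GHz clock (tau = 1 ns)
print(cpu_time(ic=10_000_000, cpi=1.85, clock_period_s=1e-9))  # ~0.0185 s
```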
Measuring performance of computers
 CPU clock speed is an indicator, but not sufficient.
 CPU speed, memory speed, other devices, and system configuration give a better idea, but are still not sufficient.
 Metrics such as millions of instructions per second (MIPS), billions of floating-point operations per second (GFLOPS), trillions of operations per second (TOPS), thousands of polygons per second (kpolys), millions of network transactions per second, etc. give only a partial picture – they leave out IC.
 Benchmark programs: real applications such as gcc and TeX; toy benchmarks such as the sieve of Eratosthenes and Puzzle; kernels such as the Livermore loops and Linpack; synthetic benchmarks such as Whetstone, Dhrystone, Dhampstone.
 Benchmark suites such as SPECint95 and SPEC CPU2000 offer collections of the above. Performance is compared with a selected standard (using the geometric mean), and given as a dimensionless number.
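The geometric-mean summary used by SPEC-style suites can be sketched in a few lines of Python; the per-benchmark ratios below are made-up numbers, not actual SPEC results:

```python
import math

def geometric_mean(ratios):
    """SPEC-style summary: geometric mean of per-benchmark performance ratios
    (each ratio compares the test machine against a selected reference machine)."""
    return math.prod(ratios) ** (1.0 / len(ratios))

ratios = [2.0, 8.0, 4.0]          # hypothetical per-benchmark ratios
print(geometric_mean(ratios))     # ~4.0, a dimensionless number
```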
Speedup example
 Three enhancements for different parts of the computation are contemplated, with speedups of 40, 20 and 5, respectively. E1 improves 20%, E2 improves 30% and E3 improves 70% of the computation. Assuming all three cost the same, which is the best choice?
 Speedup due to E1 = 1 / ((1 - 0.2) + 0.2/40) = 1.242
 Speedup due to E2 = 1 / ((1 - 0.3) + 0.3/20) = 1.399
 Speedup due to E3 = 1 / ((1 - 0.7) + 0.7/5) = 2.273
 So, a higher fraction enhanced is more beneficial than a huge speedup for a small fraction.
 So, the frequency of execution of different instructions becomes important – statistics.
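The three enhancements can be checked against Amdahl's Law with a minimal Python sketch (the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, enhancement_speedup):
    """Overall speedup when only part of the computation is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / enhancement_speedup)

for name, frac, sp in [("E1", 0.2, 40), ("E2", 0.3, 20), ("E3", 0.7, 5)]:
    print(name, round(amdahl_speedup(frac, sp), 3))
# E1 1.242   E2 1.399   E3 2.273
```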
RISC/CISC
 Complex instruction set computers have 1000s of
instructions, whereas RISC processors are designed
to efficiently execute about 100 instructions.
 Now, most processors have both RISC and CISC features. However, the architects / designers should remember Amdahl’s law and implement the most frequently executed instructions efficiently in hardware. Some of the rarely executed instructions can even be microcoded.
 Simplicity and uniformity (consistency and
orthogonality) in architectural choices will lead to
hardware that has a short critical path (worst-case
propagation delay), resulting in high performance.
Instruction types
 Arithmetic & Logic (ALU) instructions
   Only register operands
   Register operands plus one immediate operand (within the instruction)
 Memory access instructions: Load & Store
 Transfer of control instructions: branches and jumps that are conditional or unconditional, procedure calls, returns, interrupts, exceptions, traps, etc.
 Special instructions: flag setting, etc.
 Floating point instructions
 Graphics, digital signal processing (DSP) instructions
 Input / Output instructions: memory-mapped I/O?
 Other instructions
Instruction occurrence
 ALU (including ALU immediates): 25-40%
 Loads: 20-30%
 Stores: 10-15%
 Conditional branches: 15-30%
 Other instructions: >2%
 Data processing applications use loads and stores more frequently than ALU instructions; numeric (scientific & engineering) applications use more ALU.
 Graphics processors, network processors, DSPs, and vector coprocessors will have a different mix.
Size for word, instruction & memory access
 Most common word sizes for general purpose CPU
are 32 bits and 64 bits; even now, 8-bit and 16-bit
microcontrollers are common.
 Instruction size is either variable or fixed, the latter being common in high-performance processors; it can be smaller than, equal to, or larger than the word size.
 Most GP CPUs use 64-bit words and 32-bit instructions; very large instruction word (VLIW) processors are emerging.
 Memory access, especially writing, needs to be possible one byte (8 bits) at a time because of the character data type. That is why memory is byte-addressable, even when the word size is larger.
Memory organization
 lsb is bit 0 or msb is bit 0: only a convention.
 Every byte of memory has a unique address, and so each word spans several addresses. For example, in a 64-bit computer, each word spans 8 addresses – which must be 8 consecutive addresses.
 Little endian (the lowest address holds the LSB of a word) or big endian: only a convention.
 Aligned or non-aligned memory access:
   Aligned: gaps in memory (padding) but fast and simple hardware
   Non-aligned: efficient storage but complex and slow hardware
 Most (but not all) modern processors require that the size of all operands in ALU instructions be equal to the word size – again, for fast and simple hardware.
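The two byte-order conventions can be seen directly from Python's integer-to-bytes conversion (a quick illustration; the 32-bit value is arbitrary):

```python
import sys

value = 0x0A0B0C0D  # an arbitrary 32-bit word

# Lowest address (first byte) holds the LSB under little endian:
print(value.to_bytes(4, "little").hex())  # 0d0c0b0a
# ... and the MSB under big endian:
print(value.to_bytes(4, "big").hex())     # 0a0b0c0d

print(sys.byteorder)  # convention used by the machine running this
```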
A typical modern CPU’s instruction set
 MIPS-64 follows a GPR load-store architecture.
 Uses 64-bit words, 32-bit instructions.
 Has 32 64-bit GPRs; R0 is read-only and always holds 0; R31 is used with procedure calls. There are 32 more 32-bit FPRs that can be accessed in pairs as 64-bit entities.
 Instruction formats:
   R-type: 6-bit opcode, 3 x 5-bit register ids, 6-bit function – used by ALU reg-reg instructions
   I-type: 6-bit opcode, 2 x 5-bit register ids, 16-bit displacement/immediate – used by ALU reg-imm., load, store, and conditional branch instructions
   J-type: used only for J and JAL
 Load, store: byte, half word, word, double word
 ALU, FP instructions; BEQ & BNE; J, JR, JAL, JALR
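The I-type layout above (6-bit opcode, two 5-bit register ids, 16-bit immediate) can be illustrated with a small Python decoder; the helper function and the packed example word are my own, not part of the course material:

```python
def decode_itype(instr):
    """Split a 32-bit I-type instruction word into its fields."""
    opcode = (instr >> 26) & 0x3F   # bits 31..26
    rs     = (instr >> 21) & 0x1F   # bits 25..21
    rt     = (instr >> 16) & 0x1F   # bits 20..16
    imm    = instr & 0xFFFF         # bits 15..0 (displacement/immediate)
    return opcode, rs, rt, imm

# Pack some example field values and decode them back
word = (0x23 << 26) | (1 << 21) | (2 << 16) | 30000
print(decode_itype(word))  # (35, 1, 2, 30000)
```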
Sample code in MIPS-64
      DADDI R1, R0, #8000     ; 1000 words in array
here: DADDI R1, R1, #-8       ; word has 8 bytes (decrement; MIPS-64 has no DSUBI)
      LD    R2, 30000(R1)     ; array starts at 30000
      DADD  R2, R2, R2        ; double contents
      SD    R2, 30000(R1)     ; store back in array
      BNEZ  R1, here          ; if not done, repeat
A 1000-word array, starting at memory location 30000, is accessed word-by-word (64-bit words), and the value of each word is doubled.
Memory access: approximately 40% of all instructions executed access memory. Without a cache, memory is accessed 2*1000 times during the execution of the code. If the L1 data cache is at least 64 kB large, then there would be only compulsory misses of the cache during the execution of the code.
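The loop's behaviour can be mirrored in Python to confirm the access count: 1000 iterations, each performing one load and one store. Memory is modeled as a plain dict of byte address to 64-bit word, and the initial array values are arbitrary:

```python
# Word-granular model of the 1000-word array starting at byte address 30000
memory = {30000 + 8 * i: i + 1 for i in range(1000)}  # arbitrary initial values
accesses = 0

r1 = 8000
while r1 != 0:                       # BNEZ R1, here
    r1 -= 8                          # decrement the byte offset
    addr = 30000 + r1
    word = memory[addr]              # LD  (one memory access)
    memory[addr] = word + word       # SD  (another memory access)
    accesses += 2

print(accesses)  # 2000 memory accesses, 2 per word
```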
Compilers and architecture
 Architectures must be developed in such a manner that compilers can be easily built. Architects should be aware of how compilers work, and compiler designers can produce good compilers if they are aware of the details of the architecture.
 Simple compilers perform table look-up. Optimizing
compilers can come up with machine code that is up
to 50 times faster in execution.
 Optimization is done by compilers at various levels:
   Front-end: transform to a common intermediate form
   High-level: loop transformations and procedure in-lining
   Global optimizer: global and local optimizations; register allocation
   Code generator: machine-dependent optimizations
Example – compiler & architecture
 A program execution entails running 10 million instructions, 45% of which are ALU instructions; an optimizing compiler removes a third of them. The processor runs at a 1 GHz clock speed. Instruction execution times are: 1 cc for ALU instructions, 2 cc for loads & stores (25% of all instructions), and 3 cc for conditional branches (30%). Compute the MIPS ratings and CPU times before and after optimization.
 Before optimization:
CPIavg = 0.45*1 + 0.25*2 + 0.3*3 = 1.85
MIPS = 10^-6 / (1.85 * 10^-9) = 540.5
CPU time = 10^-9 * 10^6 * (4.5*1 + 2.5*2 + 3.0*3) = 0.0185 s
 After optimization (8.5 million instructions remain):
CPIavg = (0.3*1 + 0.25*2 + 0.3*3) / 0.85 = 2.0
MIPS = 10^-6 / (2.0 * 10^-9) = 500 (smaller!?!)
CPU time = 10^-9 * 10^6 * (3.0*1 + 2.5*2 + 3.0*3) = 0.017 s = 17 ms
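The before/after figures can be recomputed with a short Python script (the names are mine; instruction counts are expressed in the example's own units). Note that CPIavg after optimization works out to 1.7/0.85 = 2.0:

```python
clock = 1e9                     # 1 GHz
tau = 1.0 / clock               # clock period in seconds

cycles = {"alu": 1, "mem": 2, "branch": 3}   # cc per instruction class

def stats(counts):
    """Return (average CPI, MIPS rating, CPU time) for an instruction mix."""
    ic = sum(counts.values())
    total_cycles = sum(counts[k] * cycles[k] for k in counts)
    cpi = total_cycles / ic
    mips = clock / (cpi * 1e6)
    return cpi, mips, ic * cpi * tau

before = {"alu": 4.5e6, "mem": 2.5e6, "branch": 3.0e6}   # 10 M instructions
after = dict(before, alu=3.0e6)   # a third of the ALU instructions removed

print(stats(before))  # CPI 1.85, ~540.5 MIPS, 0.0185 s
print(stats(after))   # CPI 2.0, 500 MIPS, 0.017 s -- lower MIPS, yet faster run
```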