Computer Architecture
“The architecture of a computer is the
interface between the machine and
the software”
- Andris Padegs
IBM 360/370 Architect
Course Outline
Computer Architecture
Quarter
Autumn 2006-7
Instructor Muhammad Jahangir Ikram
Office:
Room 424
e-mail: [email protected]
Office Hours: Monday and Wednesday, 3:00 – 4:30pm
Course Outline (Contd..)
Description
This course focuses on the principles, practices and issues in
Computer Architecture, while examining computer design
tradeoffs both qualitatively and quantitatively.
The course starts with a quick overview of computer design
fundamentals and instruction set principles, the materials which
the student has already covered in the pre-requisite of this
course.
The following topics are covered in greater detail:
Advanced Pipelining
Instruction-level parallelism and Compiler Support
Memory - hierarchy design
SIMD, VLIW, Superscalar Architectures
Code Optimization and Compiler Issues
Course Outline (Contd..)
Text Book
Hennessy, J. L, and Patterson, D. A.,
Computer Architecture: A Quantitative
Approach, 2nd Edition. Morgan
Kaufmann, 1996.
Course Outline (Contd..)
Lectures
There will be two 75-minute
lectures per week, plus a 50-minute
lecture or a 100-minute lab.
TOTAL SESSIONS = 29
There will be four Labs during weeks
2, 3, 4, 5.
Course Outline (Contd..)
Grading
Quizzes & assignments 17+3%
Laboratory 10% (Atten 3 + Lab Task 3 + HW 4)
Midterm exam 30%
Final exam 40%
Schedule
Fundamentals of Computer Design
Measuring and Reporting Performance
Quantitative Principles of Computer Design
Instruction Set Principles and Examples
Classifying Instruction Set Architectures
Memory Addressing
Operations in the Instruction Set
Encoding an Instruction Set
3-5
2.1 – 2.8
7-14
Single Cycle Computer Study
9
LAB 2: Study of Pipelining
12
The Major Hurdle of Pipelining – Pipeline Hazards
Data Hazards
1.1 – 1.10
LAB 1: MIPS Instruction Format and Instruction Study 6
Pipelining Overview
What Is Pipelining?
1,2
A.1 to A10
Schedule
Control Hazards and Static Branch Prediction
LAB 3: Pipeline Studies and Control Hazards
Scoreboarding
MIDTERM
15
ILP and Dynamic Exploitation
17-19
Static Branch Prediction
Tomasulo’s Dynamic Scheduling
Dynamic Branch Prediction
Superscalar and VLIW architectures
Advanced Pipelining And ILP (Cont’d.)
20-22
Taking Advantage of More ILP with Multiple Issue
P6 Architecture
Advanced Pipelining And ILP (Cont’d.)
23-25
Compiler Support for Exploiting ILP
Hardware Support for Extracting More Parallelism
Putting It All Together: The PowerPC 620, and Itanium
3.1 – 3.5
3.6 – 3.10
4.1, 4.7
Schedule
Memory-Hierarchy Design
The ABCs of Caches
Reducing Cache Misses
Reducing Cache Miss Penalty
Virtual Memory System
Computer I/O
26-29 5.1 – 5.7
30
6.1 - ?
Background
Emergence of the microprocessor in the
late 1970s
Roughly 35% performance growth per year
Important changes in the marketplace:
Virtual elimination of assembly language
programming reduced the need for object code
compatibility
Creation of standardized, vendor-independent
operating systems, such as UNIX and Linux, lowered the
risk of bringing out a new architecture
Development of RISC
These changes led to the development
of a new set of architectures, called
RISC (Reduced Instruction Set Computer)
architectures
RISC uses two performance techniques:
Instruction level parallelism (pipelining)
Use of Cache
Growth in microprocessor performance
Moore’s Law
Technology Scaling
Scaling of Transistors
Feature size has shrunk from 3 microns in
1985 to 0.09 micron.
Reducing the feature size means a quadratic
increase in transistor count and better
performance.
But it also means higher routing delays and poor
performance of long wires.
It also means more power consumption (less
load capacitance)
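The quadratic relationship between feature size and transistor count can be checked with a short sketch. It assumes an idealized model in which the area of one transistor scales with the square of the feature size; the function name `density_gain` is hypothetical and not part of the slides.

```python
def density_gain(old_feature_um: float, new_feature_um: float) -> float:
    """Ideal gain in transistors per unit area when the feature size shrinks.

    Assumption: transistor area scales with the square of the feature
    size, so density scales with the inverse square. Real processes
    deviate from this ideal.
    """
    return (old_feature_um / new_feature_um) ** 2


# Feature sizes quoted on the slide: 3 microns (1985) down to 0.09 micron.
gain = density_gain(3.0, 0.09)
print(f"Ideal density gain: about {gain:.0f}x")
```

Under this assumption, a 33-fold linear shrink gives roughly a 1100-fold increase in transistors per unit area.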
The Itanium Processor
Intel microprocessor die
IC Cost Trends
(Source: IC Knowledge)
Measuring performance
Definition of time:
Response time (elapsed time): the latency to complete
the task, including disk access, input/output, operating system
overhead, etc.
CPU time:
User CPU time: time spent in the program
System CPU time: time spent by the operating system
Unix time command:
User 90.7 s, System 12.9 s, Elapsed 2:39 (159 s), CPU share 65%
CPU share of elapsed time = (90.7 + 12.9) / 159 ≈ 0.65
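The 65% figure is the share of elapsed time that the CPU spent on this task (user plus system time). A minimal sketch of the arithmetic, using the numbers from the slide; the helper name `cpu_fraction` is hypothetical:

```python
def cpu_fraction(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Fraction of elapsed (wall-clock) time spent as CPU time."""
    return (user_s + system_s) / elapsed_s


elapsed = 2 * 60 + 39          # 2:39 expressed in seconds (159 s)
fraction = cpu_fraction(90.7, 12.9, elapsed)
print(f"CPU share of elapsed time: {fraction:.0%}")
```

The remaining 35% or so of the elapsed time went to waiting for I/O or to running other programs.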
What is a Benchmark?
A benchmark is "a standard of measurement or
evaluation" (Webster’s II Dictionary).
A computer benchmark is typically a computer program
that performs a strictly defined set of operations - a
workload - and returns some form of result - a
metric - describing how the tested computer
performed.
Computer benchmark metrics usually measure
speed: how fast was the workload completed; or
throughput: how many workload units per unit time
were completed.
Running the same computer benchmark on multiple
computers allows a comparison to be made.
Source: Standard Performance Evaluation Corporation
Programs to Evaluate Performance
Real Applications
Modified (or scripted) applications
Kernels
Toy benchmarks
Synthetic benchmarks
Programs to evaluate performance
Real Applications
Example: Compilers for C, text-processing
software, etc.
Modified (or scripted) applications
CPU-oriented benchmark; I/O may be
removed to minimize its impact on execution
Programs to evaluate performance
Kernels
To isolate performance of individual features of a
machine.
Toy benchmarks
Produce a result that the user already knows
Synthetic benchmarks
Try to match the average frequency of operations
and operands of a large set of programs
Benchmark Suites
SPEC95, SPEC2000 (12 Integer, 14 FP),
SPEC2006 (12 Integer, 17 FP)
C Compiler, Router, FEM
Desktop (CPU and Graphics Intensive)
Server (File Servers, Web Servers,
Transaction Processing)
Embedded (EEMBC)
34 Kernels
What is SPEC
SPEC is the Standard Performance Evaluation
Corporation. SPEC is a non-profit organization
whose members include computer hardware
vendors, software companies, universities,
research organizations, systems integrators,
publishers and consultants. SPEC's goal is to
establish, maintain and endorse a standardized
set of relevant benchmarks for computer
systems. Although no one set of tests can fully
characterize overall system performance, SPEC
believes that the user community benefits from
objective tests which can serve as a common
reference point.
What does a benchmark measure?
the computer processor (CPU),
the memory architecture, and
the compilers.
SPEC CPU2006 contains two components that
focus on two different types of compute
intensive performance:
The CINT2006 suite measures compute-intensive integer performance, and
The CFP2006 suite measures compute-intensive floating point performance
Source: Standard Performance Evaluation Corporation
Reference Machine
Source: Standard Performance Evaluation Corporation
SPEC uses a historical Sun system, the "Ultra Enterprise
2" which was introduced in 1997, as the reference
machine. The reference machine uses a 296 MHz
UltraSPARC II processor, as did the reference machine
for CPU2000. But the reference machines for the two
suites are not identical: the CPU2006 reference
machine has substantially better caches, and the
CPU2000 reference machine could not have held
enough memory to run CPU2006.
It takes about 12 days to do a rule-conforming run of
the base metrics for CINT2006 and CFP2006 on the
CPU2006 reference machine. SPEC2000 now takes
less than a minute on the latest high-performance machines.
Example Result for SPEC 2000
Source: Standard Performance Evaluation Corporation
SYSTEM | #CPU | Base | Peak
Intel SE440BX-2 (800 MHz Pentium III) | 1 core, 1 chip, 1 core/chip | 340 | 344
Intel D850GB motherboard (1.4 GHz, Pentium 4 processor) | 1 core, 1 chip, 1 core/chip | 502 | 512
Sun Blade 2500 (1.28GHz) | 1 core, 1 chip, 1 core/chip | 604 | 696
Intel D850EMV2 motherboard (2.0A GHz, Pentium 4 processor) | 1 core, 1 chip, 1 core/chip | 756 | 759
PowerEdge 2650 (3.06 GHz Xeon) DELL | 1 core, 1 chip, 1 core/chip (Hyper-Threading Technology disabled) | 1014 | 1056
Precision WorkStation 350 (2.8 GHz P4) DELL | 1 core, 1 chip, 1 core/chip | 1017 | 1061
SGI Altix 3000 (1300MHz, Itanium 2) | 1 core, 1 chip, 1 core/chip | 1019 | --
Example Result for SPEC 2000
Source: Standard Performance Evaluation Corporation
SYSTEM | #CPU | BASE | PEAK
Precision Workstation 690 (Intel® Xeon® processor 5160, 3.0 | 4 cores, 2 chips, 2 cores/chip | 3057 | 3063
PowerEdge 1950 (Intel Xeon processor 5160, 3.00GHz) | 4 cores, 2 chips, 2 cores/chip | 3061 | 3065
Intel(R) DG965WH motherboard( 2.93 GHz, Intel(R) Core(TM) 2 | 2 cores, 1 chip, 2 cores/chip | 3099 | 3109
Intel(R) DG965WH motherboard( 2.93 GHz, Intel(R) Core(TM) 2 | 2 cores, 1 chip, 2 cores/chip | 3106 | 3111
Precision Workstation 390 (Intel Core 2 Extreme processor X6 | 2 cores, 1 chip, 2 cores/chip | 3108 | 3119
Summarizing Performance
Amdahl’s Law
The performance improvement to be
gained from using a faster mode of
execution is limited by the fraction of the
time the faster mode can be used
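Amdahl’s Law:
Law of Diminishing Returns
$$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}$$
$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$
$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$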
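Amdahl's Law says the overall speedup is 1 divided by the sum of the unenhanced fraction and the enhanced fraction scaled down by its own speedup. The sketch below evaluates that expression. The function name `amdahl_speedup` and the sample inputs (40% of the time enhanced, sped up 10 times) are illustrative assumptions, not values from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is accelerated.

    fraction_enhanced: share of the ORIGINAL execution time that can use
                       the faster mode (between 0 and 1).
    speedup_enhanced:  how much faster that share runs in the faster mode.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


# Hypothetical enhancement: usable 40% of the time, 10 times faster.
print(amdahl_speedup(0.4, 10.0))

# Diminishing returns: even a huge speedup of the enhanced part cannot
# push the overall speedup past 1 / (1 - 0.4).
print(amdahl_speedup(0.4, 1e9))
```

The second call illustrates the "diminishing returns" reading: the part that is not enhanced bounds the overall gain, no matter how fast the enhanced part becomes.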
CPU performance Equations
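$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$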
Example:
Frequency of FP operations = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume two design alternatives: decrease the CPI
of FPSQR to 2, or decrease the average CPI of all
FP operations to 2.5.
Compare these two designs using the CPU
performance equations
Example: Solution
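$$\text{CPI}_{\text{original}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}} = (4 \times 25\%) + (1.33 \times 75\%) = 2.0$$
CPI for enhanced FPSQR
$$\text{CPI}_{\text{with new FPSQR}} = \text{CPI}_{\text{original}} - 2\% \times \left( \text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{new FPSQR only}} \right) = 2.0 - 2\% \times (20 - 2) = 1.64$$
CPI for enhanced FP operation
$$\text{CPI}_{\text{new FP}} = (75\% \times 1.33) + (25\% \times 2.5) = 1.625$$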
Example: Solution
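$$\text{Speedup}_{\text{new FP}} = \frac{\text{CPU time}_{\text{original}}}{\text{CPU time}_{\text{new FP}}} = \frac{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{\text{new FP}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new FP}}} = \frac{2.0}{1.625} = 1.23$$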
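The comparison can be reproduced numerically. The sketch below uses the frequencies and CPI values from the example. The helper `average_cpi` and the variable names are hypothetical. Because it does not round intermediate results, the printed values differ slightly from the rounded figures on the slide (2.0, 1.64, 1.625, 1.23).

```python
def average_cpi(mix):
    """Weighted average CPI for a list of (frequency, cpi) pairs.

    The frequencies are fractions of the instruction count and are
    expected to sum to 1.
    """
    return sum(freq * cpi for freq, cpi in mix)


# Original design: 25% FP operations at CPI 4.0, 75% others at CPI 1.33.
cpi_original = average_cpi([(0.25, 4.0), (0.75, 1.33)])

# Alternative 1: only FPSQR (2% of instructions) improves from CPI 20 to 2.
cpi_new_fpsqr = cpi_original - 0.02 * (20 - 2)

# Alternative 2: all FP operations improve to CPI 2.5.
cpi_new_fp = average_cpi([(0.25, 2.5), (0.75, 1.33)])

# Instruction count and clock cycle are unchanged, so the speedup is
# simply the ratio of the CPIs.
speedup_new_fpsqr = cpi_original / cpi_new_fpsqr
speedup_new_fp = cpi_original / cpi_new_fp

print(f"CPI original        : {cpi_original:.3f}")
print(f"CPI with new FPSQR  : {cpi_new_fpsqr:.3f}")
print(f"CPI with new FP     : {cpi_new_fp:.3f}")
print(f"Speedup (new FPSQR) : {speedup_new_fpsqr:.2f}")
print(f"Speedup (new FP)    : {speedup_new_fp:.2f}")
```

Improving all FP operations gives the lower CPI and therefore the slightly better speedup of the two alternatives.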
Another Measure -- MIPS
$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}$$
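MIPS is the instruction count divided by the execution time in seconds times one million. A minimal sketch of that definition follows; the function name `mips` and the sample workload (480 million instructions in 4 seconds) are illustrative assumptions.

```python
def mips(instruction_count: float, execution_time_s: float) -> float:
    """Millions of instructions executed per second."""
    if execution_time_s <= 0.0:
        raise ValueError("execution_time_s must be positive")
    return instruction_count / (execution_time_s * 1e6)


# Hypothetical workload: 480 million instructions completed in 4 seconds.
print(mips(480e6, 4.0))
```

Note that MIPS depends on the instruction count, so a machine that needs more instructions for the same task can report a higher MIPS rating while taking longer to finish.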
Example:
An Embedded Processor
120 MIPS for single processor.
80 MIPS for processor/co-processor
combination (that is how they are measured for the combination)
I = Number of integer instructions
F = Number of floating-point instructions (8M)
Y = Number of integer instructions to emulate one FP instruction (50)
W = Time for choice 1 (4 seconds)
B = Time for choice 2
End of Lecture 1
CINT 2006
Benchmark | Language | Application area
400.perlbench | C | PERL Programming Language
401.bzip2 | C | Compression
403.gcc | C | C Compiler
429.mcf | C | Combinatorial Optimization
445.gobmk | C | Artificial Intelligence: go
456.hmmer | C | Search Gene Sequence
458.sjeng | C | Artificial Intelligence: chess
462.libquantum | C | Physics: Quantum Computing
464.h264ref | C | Video Compression
471.omnetpp | C++ | Discrete Event Simulation
473.astar | C++ | Path-finding Algorithms
483.xalancbmk | C++ | XML Processing
CFP 2006
Benchmark | Language | Application area
410.bwaves | Fortran | Fluid Dynamics
416.gamess | Fortran | Quantum Chemistry
433.milc | C | Physics: Quantum Chromodynamics
434.zeusmp | Fortran | Physics/CFD
435.gromacs | C/Fortran | Biochemistry/Molecular Dynamics
436.cactusADM | C/Fortran | Physics/General Relativity
437.leslie3d | Fortran | Fluid Dynamics
444.namd | C++ | Biology/Molecular Dynamics
447.dealII | C++ | Finite Element Analysis
450.soplex | C++ | Linear Programming, Optimization
453.povray | C++ | Image Ray-tracing
454.calculix | C/Fortran | Structural Mechanics
459.GemsFDTD | Fortran | Computational Electromagnetics
465.tonto | Fortran | Quantum Chemistry
470.lbm | C | Fluid Dynamics
481.wrf | C/Fortran | Weather Prediction
482.sphinx3 | C | Speech recognition