WELCOME TO PARALLEL COMPUTER ARCHITECTURE


Performance
Evaluation
Performance
• Two metrics
– Latency (how long to do X)
• response time
• execution time
– Throughput (how often can it do X)
• Example: car assembly line
– Takes 6 hours to make a car
• (latency is 6 hours)
– A car leaves every 30 minutes
• (throughput is 2 cars per hour)
– Overlap results in Throughput >1/Latency
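The assembly-line numbers can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and not part of the slides.

```python
# Numbers from the car assembly line example; names are illustrative.
latency_hours = 6.0      # time to build one car from start to finish
interval_hours = 0.5     # a finished car leaves every 30 minutes

throughput = 1.0 / interval_hours      # cars per hour with overlapped work
one_at_a_time = 1.0 / latency_hours    # cars per hour if cars were built one by one

print(f"throughput with overlap: {throughput:.2f} cars/hour")
print(f"1/latency:               {one_at_a_time:.2f} cars/hour")
print(f"gain from overlap:       {throughput / one_at_a_time:.0f}x")
```

Because several cars are in progress at once, throughput exceeds 1/latency even though no single car is finished any sooner.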
Metrics
• For desktops and workstations, in most cases → execution time
– Also known as response time
– Reciprocal of performance
• For servers → throughput
– How many jobs can get done in unit time
– Almost always a response time limit (per job) is also imposed
• For embedded processors → execution time
– In many situations a hard or soft real-time deadline is imposed
– Hard deadlines cannot be missed; soft deadlines can be missed in a limited number of cases
• Goal: maximize performance within power budget
Benchmark
• Want to compare two processors by measuring
their performance
– Need some standardized set of programs
– These are called benchmark programs
• Each market sector has a different focus and needs a
different set of benchmarks
– Desktop PC
– Server
– Embedded system
• Two famous industry standard benchmarks
– SPEC: CPU performance
– TPC: OLTP (On-Line Transaction Processing)
performance
Benchmark
• Types of benchmark
– Real applications
• taken from day-to-day applications; may have portability problems; ideal
for the final performance report, but not good for diagnostic purposes: in many
cases overly complex, providing no insight into what’s wrong
– Modified applications
• enhanced portability; possible to focus on particular aspects of the CPU
(e.g. less I/O if the purpose is to measure the CPU)
– Kernels
• frequently used code segments e.g., Livermore loops, Linpack; can
focus on one CPU feature at a time
– Toys
• small code snippets to quickly test your intuition; must not be used for
performance measurement
– Synthetic
• same philosophy as kernels, but not part of any real application; cooked
up to model average application
– Microbenchmark
• Small, specially written programs that isolate a specific aspect of
performance: integer processing, floating point, local memory,
input/output, etc.
Desktop benchmarks
• SPEC CPU is the standard
– Provided by Standard Performance Evaluation
Corporation (SPEC) http://www.spec.org/
• Currently in the 4th generation
– SPEC89, SPEC92, SPEC95, SPEC2000
– Has two types of programs
• integer and floating-point to stress different CPU units
– For desktop or uniprocessor workstation/server
• SPEC2000 has 12 integer and 14 floating-point
programs
– For graphics performance two available benchmarks
• SPECviewperf, SPECapc
SPEC: System Performance
Evaluation Cooperative
The most popular and industry-standard set of CPU
benchmarks.
• SPECmarks, 1989:
– 10 programs yielding a single number (“SPECmarks”).
• SPEC92, 1992:
– SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs).
• SPEC95, 1995:
– SPECint95 (8 integer programs):
• go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
– SPECfp95 (10 floating-point intensive programs):
• tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp,
wave5
– Performance relative to a Sun SuperSPARC I (50 MHz), which is given a score of
SPECint95 = SPECfp95 = 1
• SPEC CPU2000, 1999:
– CINT2000 (12 integer programs) and CFP2000 (14 floating-point intensive programs)
– Performance relative to a Sun Ultra5_10 (300 MHz), which is given a score of
SPECint2000 = SPECfp2000 = 100
SPEC CPU2000 Programs

CINT2000 (Integer)

Benchmark      Language     Description
164.gzip       C            Compression
175.vpr        C            FPGA Circuit Placement and Routing
176.gcc        C            C Programming Language Compiler
181.mcf        C            Combinatorial Optimization
186.crafty     C            Game Playing: Chess
197.parser     C            Word Processing
252.eon        C++          Computer Visualization
253.perlbmk    C            PERL Programming Language
254.gap        C            Group Theory, Interpreter
255.vortex     C            Object-oriented Database
256.bzip2      C            Compression
300.twolf      C            Place and Route Simulator

CFP2000 (Floating Point)

Benchmark      Language     Description
168.wupwise    Fortran 77   Physics / Quantum Chromodynamics
171.swim       Fortran 77   Shallow Water Modeling
172.mgrid      Fortran 77   Multi-grid Solver: 3D Potential Field
173.applu      Fortran 77   Parabolic / Elliptic Partial Differential Equations
177.mesa       C            3-D Graphics Library
178.galgel     Fortran 90   Computational Fluid Dynamics
179.art        C            Image Recognition / Neural Networks
183.equake     C            Seismic Wave Propagation Simulation
187.facerec    Fortran 90   Image Processing: Face Recognition
188.ammp       C            Computational Chemistry
189.lucas      Fortran 90   Number Theory / Primality Testing
191.fma3d      Fortran 90   Finite-element Crash Simulation
200.sixtrack   Fortran 77   High Energy Nuclear Physics Accelerator Design
301.apsi       Fortran 77   Meteorology: Pollutant Distribution

Source: http://www.spec.org/osg/cpu2000/
Example 1 – SPEC
Benchmark
Windows benchmarks for PC
• Three benchmarks are available for PCs running
Windows
– Business Winstone
• A script running Netscape and other Office products;
tries to simulate multiprogramming among these
– CC Winstone
• Runs multiple applications focused on content
creation
– e.g. Photoshop, Premiere, audio-editing programs, Navigator
– Winbench
• Set of kernels to measure CPU performance, video
performance, disk performance etc.
– Check out http://www.etestinglabs.com/benchmarks/
Server benchmarks
• SPECrate
– Throughput-oriented benchmarks
• run a copy of the same SPEC application on each processor and
report average number of jobs finished per unit time
– One big aspect of servers that is missing from SPECrate is I/O
throughput (disk as well as network); I/O is not considered
• SPECSFS
– Tests the performance of NFS using a series of file server requests;
measures CPU, disk and network throughput; also puts response
time limit on each request
• SPECWeb
– Web server benchmark; simulates multiple clients requesting both
static and dynamic pages from a server; also clients can upload data
on the server
Server benchmarks
• Transaction processing (TP)
– Probably the most widely used benchmark in server community
– Measures database access and update throughput of a server
• Simple examples
– airline reservation, bank ATM
• Complex systems
– complicated query processing e.g. online book shops (amazon)
– Provided by the Transaction Processing Performance Council (TPC)
• TPC-A was the first one;
• TPC-C (complex queries)
• TPC-H (unrelated queries), TPC-R (DBMS is optimized based on
past query pattern), TPC-W (business-oriented transactional web
server)
– The measured metric is transactions per second along with response
time limit for individual transactions
– Check out http://www.tpc.org/
Example 2 – TPC
Benchmark
Embedded sector
• Quite difficult to come up with a good benchmark
– Wide range of applications
– Requirements vary widely from one system to another (each system has
its own requirements and its own emphasis)
– Reality
• in many cases an embedded system designer designs his/her
own benchmarks which are either the target applications or
kernels extracted from them
– Today the best known benchmark set is offered by EEMBC
(Embedded Microprocessor Benchmark Consortium)
• EEMBC has five types of applications
– automotive/industrial (16 kernels), consumer (5 multimedia kernels),
networking (3 kernels), office automation (4 graphics and text
processing kernels), telecommunications (6 filtering and DSP
kernels)
– Check out http://www.eembc.org/
Comparing
Performance
• “X is n times faster than Y”
$$n = \frac{\text{Execution time}_Y}{\text{Execution time}_X}$$
• “Throughput of X is n times that of Y”
$$n = \frac{\text{Tasks per unit time}_X}{\text{Tasks per unit time}_Y}$$
The Quantitative
Approach!
• Goal: Improve overall CPU performance
$$n = \frac{\text{Execution time of app A on machine Y}}{\text{Execution time of app A on machine X}}$$
• What can change the CPU performance?
– (the CPU Performance Equation)
• What should we focus on?
– (Amdahl’s Law)
CPU Performance
Equation (1)
• A program executes some number of instructions, the instruction
count (IC)
– Measured in:
instructions/program
• The average instruction takes a number of cycles per
instruction (CPI) to be completed.
– Measured in:
cycles/instruction, CPI
• CPU has a fixed clock cycle time CC = 1/clock rate
– Measured in:
seconds/cycle
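• CPU execution time
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
$$T = IC \times CPI \times CC$$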
CPU Execution Time:
Example
• A Program is running on a specific machine with the
following parameters:
– Total executed instruction count: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction
– CPU clock rate: 200 MHz
• What is the execution time for this program?
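$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
$$\begin{aligned}
\text{CPU time} &= \text{Instruction count} \times \text{CPI} \times \text{Clock cycle} \\
&= 10{,}000{,}000 \times 2.5 \times \frac{1}{\text{clock rate}} \\
&= 10{,}000{,}000 \times 2.5 \times 5 \times 10^{-9} \\
&= 0.125\ \text{seconds}
\end{aligned}$$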
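The same calculation as a minimal Python sketch. The function name `cpu_time` and its parameter names are illustrative choices, not something defined in the slides.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: instruction count x CPI x clock cycle time,
    where clock cycle time = 1 / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Values from the example: 10,000,000 instructions, CPI 2.5, 200 MHz clock.
t = cpu_time(10_000_000, 2.5, 200e6)
print(f"CPU time = {t:.3f} seconds")
```

Passing the clock rate rather than the cycle time avoids converting 200 MHz to 5 ns by hand.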
Factors Affecting CPU Performance
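$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                                     Instruction Count I   CPI   Clock Cycle C
Program                              X                     X     -
Compiler                             X                     X     -
Instruction set architecture (ISA)   X                     X     -
Organization                         -                     X     X
Technology                           -                     -     X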
CPU Performance
Equation (2)
$$\text{CPU time} = \text{CPU clock cycles} \times \text{Clock cycle time}$$
$$\text{CPU time} = \left( \sum_{i=1}^{n} IC_i \times CPI_i \right) \times \text{Clock cycle time}$$
– The sum runs over each kind of instruction i
– CPI_i: how many cycles it takes to execute an instruction of this kind
– IC_i: how many instructions of this kind are there in the program
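A short sketch of the per-class form of the equation. The function name and the instruction mix below are illustrative assumptions, chosen only to exercise the sum.

```python
def cpu_time_by_class(classes, clock_cycle_s):
    """classes: iterable of (IC_i, CPI_i) pairs, one per kind of instruction.
    Returns (sum of IC_i * CPI_i) * clock cycle time, in seconds."""
    total_cycles = sum(ic * cpi for ic, cpi in classes)
    return total_cycles * clock_cycle_s


# Hypothetical mix: 5M one-cycle, 1M two-cycle and 1M three-cycle instructions
# on a 100 MHz clock (cycle time = 1 / 100e6 seconds).
mix = [(5_000_000, 1), (1_000_000, 2), (1_000_000, 3)]
print(f"CPU time = {cpu_time_by_class(mix, 1 / 100e6):.3f} seconds")
```

Dividing the total cycle count by the total instruction count gives the average CPI used in the first form of the equation.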
Performance Comparing
• To compare the performance of two machines (or CPUs) “A”,
“B” running a given specific program:
PerformanceA = 1 / Execution TimeA
PerformanceB = 1 / Execution TimeB
• Machine A is n times faster than machine B means:
$$\text{Speedup} = n = \frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution Time}_B}{\text{Execution Time}_A}$$
• Example: For a given program:
Execution time on machine A: ExecutionA = 1 second
Execution time on machine B: ExecutionB = 10 seconds
PerformanceA / PerformanceB = Execution TimeB / Execution TimeA
= 10 / 1 = 10
Amdahl’s Law
• Quantifies performance gain
• Amdahl’s Law defined…
– The performance improvement to be gained from
using some faster mode of execution is limited by the
amount of time the enhancement is actually used
• Amdahl’s Law defines speedup:
$$\text{Speedup} = \frac{\text{Perf. for entire task using enhancement when possible}}{\text{Perf. for entire task without using enhancement}}$$
or
$$\text{Speedup} = \frac{\text{Execution time for entire task without enhancement}}{\text{Execution time for entire task using enhancement when possible}}$$
An example
• Original processor:
– Frequency of FP operations: 25%
– Average CPI of FP operations: 4.0
– Average CPI of other instructions: 1.33
– Frequency of FP Square Root (FPSQR) instruction: 2%
– CPI of FPSQR: 20
• There are two new design alternatives to consider:
– Design 1: reduce the CPI of FPSQR to 2
– Design 2: reduce the average CPI of all FP operations to 2
• Which one is better for overall CPU performance?
An example continued…
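• First we need to calculate a base for comparison:
$$CPI_{\text{original}} = (4.0 \times 0.25) + (1.33 \times 0.75) \approx 2.0$$
• Next, compute the CPI for the enhanced FPSQR option (Design 1):
$$CPI_{\text{new with FPSQR}} = CPI_{\text{original}} - 0.02 \times (CPI_{\text{old FPSQR}} - CPI_{\text{new FPSQR}}) = 2.0 - 0.02 \times (20.0 - 2) = 1.64$$
• Now compute the CPI for the enhanced FP option (Design 2):
$$CPI_{\text{new with FP}} = (0.75 \times 1.33) + (0.25 \times 2) \approx 1.5$$
– This CPI is lower than that of the first alternative (reducing the
FPSQR CPI to 2), so Design 2 is better
• Speedups over the original design:
$$S_1 = \frac{2.0}{1.64} \approx 1.22 \qquad S_2 = \frac{2.0}{1.5} \approx 1.33$$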
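The comparison can be reproduced in a few lines. This is a sketch with illustrative variable names; it keeps full precision, so the intermediate CPIs differ slightly from the rounded slide values (for example 1.9975 rather than 2.0), while the speedups still round to the same figures.

```python
# Original processor parameters from the example.
freq_fp, cpi_fp = 0.25, 4.0          # all FP operations
freq_other, cpi_other = 0.75, 1.33   # everything else
freq_fpsqr, cpi_fpsqr = 0.02, 20.0   # FP square root

cpi_original = freq_fp * cpi_fp + freq_other * cpi_other

# Design 1: FPSQR CPI drops from 20 to 2.
cpi_design1 = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)

# Design 2: average CPI of all FP operations drops from 4 to 2.
cpi_design2 = freq_other * cpi_other + freq_fp * 2.0

# Instruction count and clock are unchanged, so speedup = old CPI / new CPI.
speedup1 = cpi_original / cpi_design1
speedup2 = cpi_original / cpi_design2
print(f"Design 1: CPI {cpi_design1:.2f}, speedup {speedup1:.2f}")
print(f"Design 2: CPI {cpi_design2:.2f}, speedup {speedup2:.2f}")
```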
Amdahl’s Law and
Speedup
• Speedup tells us how much faster the machine will run
with an enhancement
• 2 things to consider:
– 1st…
• Fraction of the computation time in the original machine that
can use the enhancement
– e.g. if a program executes in 30 seconds and 15 seconds of
exec. uses the enhancement, fraction = ½ (always ≤ 1)
– 2nd…
• Improvement gained by the enhanced execution mode (i.e. how much
faster the enhanced part of the task runs)
– e.g. if the enhanced task takes 3.5 seconds and the original task took 7,
we say the speedup is 2 (always > 1)
Amdahl’s Law
Equations
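$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$
Use the previous equation and solve for speedup:
$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$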
Please, please, please, don’t just try to memorize
these equations and plug numbers into them.
It’s always important to think about the problem too!
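As a sanity check on the equations, here is a minimal sketch of the overall-speedup formula. The function name `amdahl_speedup` and the input checks are illustrative assumptions, not part of the slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the ORIGINAL execution
    time runs `speedup_enhanced` times faster and the rest is unchanged."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# Half of the original time can use an enhancement that is 2x faster.
print(f"overall speedup = {amdahl_speedup(0.5, 2.0):.3f}")
```

Even though the enhanced part runs twice as fast, the overall speedup is only about 1.33, because the other half of the time is untouched.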
Deriving the previous
formula
$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
– Numerator 1: normalized old execution time
– (1 – Fraction_enhanced): 1 - % enhanced (i.e. part of the task will take
the same amount of time as before)
– Fraction_enhanced: % of task that will run faster
– Speedup_enhanced: how much faster it will run (note: should be > 1;
otherwise, performance gets worse)
Pictorial Depiction of
Amdahl’s Law
Enhancement E accelerates fraction F of execution time by a factor of S
Before: Execution time without enhancement E:
| Unaffected, fraction: (1 - F) | Affected fraction: F |
After: Execution time with enhancement E:
| Unaffected, fraction: (1 - F) (unchanged) | F/S |
$$\text{Speedup}(E) = \frac{\text{Execution time without enhancement E}}{\text{Execution time with enhancement E}} = \frac{1}{(1 - F) + F/S}$$
NOTE: as S → ∞, Speedup(E) → 1/(1 - F)
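The limit can be seen numerically. In this sketch F = 0.8 is a hypothetical value picked for illustration, and the helper name is an assumption.

```python
def amdahl_speedup(f, s):
    """Speedup(E) = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - f) + f / s)


f = 0.8                      # hypothetical: 80% of the time is affected
limit = 1.0 / (1.0 - f)      # upper bound as S grows without limit

for s in (2, 10, 100, 1_000_000):
    print(f"S = {s:>9}: speedup = {amdahl_speedup(f, s):.4f}")
print(f"limit 1/(1 - F) = {limit:.4f}")
```

However large S becomes, the unaffected fraction (1 - F) caps the overall speedup.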
Amdahl’s Law example 1
• Make the Common Case Fast
$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
• Fraction_enhanced = 0.1, Speedup_enhanced = 20:
$$\text{Speedup} = \frac{1}{(1 - 0.1) + \frac{0.1}{20}} = 1.105$$
VS
• Fraction_enhanced = 0.9, Speedup_enhanced = 1.2:
$$\text{Speedup} = \frac{1}{(1 - 0.9) + \frac{0.9}{1.2}} = 1.176$$
Important: Principle of locality
Approx. 90% of the time spent in 10% of the code
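A quick check of the two cases, as a sketch with an illustrative helper name:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


rare_but_big = amdahl_speedup(0.1, 20.0)      # 10% of the time, 20x faster
common_but_small = amdahl_speedup(0.9, 1.2)   # 90% of the time, 1.2x faster

print(f"F = 0.1, S = 20  -> {rare_but_big:.3f}")
print(f"F = 0.9, S = 1.2 -> {common_but_small:.3f}")
```

A modest 1.2x improvement to the common case beats a 20x improvement to the rare case.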
What does Amdahl’s
Law tell us?
• Serves as a guide to
– how much an enhancement will improve
performance AND where to spend your resources
• Overall goal:
– Spend your resources where you get the most
improvement!
Performance Enhancement Example 2
• For the RISC machine with the following instruction mix given earlier:
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        .5       23%
Load     20%    5        1.0      45%
Store    10%    3        .3       14%
Branch   20%    2        .4       18%
CPI = 2.2
• If a CPU design enhancement improves the CPI of load instructions
from 5 to 2, what is the resulting performance improvement from
this enhancement:
Fraction enhanced = F = 45% or .45
Unaffected fraction = 100% - 45% = 55% or .55
Factor of enhancement = 5/2 = 2.5
Using Amdahl’s Law:
$$\text{Speedup}(E) = \frac{1}{(1 - F) + F/S} = \frac{1}{.55 + .45/2.5} = 1.37$$
An Alternative Solution Using CPU Equation
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        .5       23%
Load     20%    5        1.0      45%
Store    10%    3        .3       14%
Branch   20%    2        .4       18%
CPI = 2.2
• If a CPU design enhancement improves the CPI of load instructions from 5
to 2, what is the resulting performance improvement from this
enhancement?
CPIold = 2.2
CPInew = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6
$$\text{Speedup}(E) = \frac{\text{Original Execution Time}}{\text{New Execution Time}} = \frac{\text{Instruction count} \times \text{old CPI} \times \text{clock cycle}}{\text{Instruction count} \times \text{new CPI} \times \text{clock cycle}} = \frac{\text{old CPI}}{\text{new CPI}} = \frac{2.2}{1.6} = 1.37$$
Which is the same speedup obtained from Amdahl’s Law in the first solution.
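Both solution routes can be checked side by side. This is a sketch; the dictionary layout and names are illustrative. It keeps F at full precision (1.0/2.2 ≈ .4545 rather than the rounded .45), so both routes give exactly 1.375, which the slides report as 1.37.

```python
# Instruction mix for the RISC machine: op -> (frequency, cycles).
mix = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}


def average_cpi(instruction_mix):
    """Average CPI = sum of frequency_i * cycles_i over all instruction classes."""
    return sum(freq * cycles for freq, cycles in instruction_mix.values())


cpi_old = average_cpi(mix)

# Route 1: Amdahl's Law. F is the share of execution TIME spent in loads.
load_freq, load_cycles = mix["Load"]
f = load_freq * load_cycles / cpi_old
s = 5 / 2
speedup_amdahl = 1.0 / ((1.0 - f) + f / s)

# Route 2: CPU performance equation with the load CPI changed from 5 to 2.
cpi_new = average_cpi({**mix, "Load": (load_freq, 2)})
speedup_cpi = cpi_old / cpi_new

print(f"old CPI = {cpi_old:.2f}, new CPI = {cpi_new:.2f}")
print(f"Amdahl: {speedup_amdahl:.3f}, CPI ratio: {speedup_cpi:.3f}")
```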
Performance Enhancement Example 3
• A program runs in 100 seconds on a machine with multiply
operations responsible for 80 seconds of this time. By how much
must the speed of multiplication be improved to make the program
four times faster?
$$\text{Desired speedup} = 4 = \frac{100}{\text{Execution time with enhancement}}$$
→ Execution time with enhancement = 25 seconds
25 seconds = (100 - 80 seconds) + 80 seconds / n
25 seconds = 20 seconds + 80 seconds / n
→ 5 = 80 seconds / n
→ n = 80/5 = 16
Hence multiplication should be 16 times faster to get a speedup of
4.
Performance Enhancement Example 3 (continued)
• For the previous example with a program running in 100 seconds on
a machine with multiply operations responsible for 80 seconds of
this time. By how much must the speed of multiplication be
improved to make the program five times faster?
$$\text{Desired speedup} = 5 = \frac{100}{\text{Execution time with enhancement}}$$
→ Execution time with enhancement = 20 seconds
20 seconds = (100 - 80 seconds) + 80 seconds / n
20 seconds = 20 seconds + 80 seconds / n
→ 0 = 80 seconds / n
No amount of multiplication speed improvement can achieve this.
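The two multiplication cases follow the same steps, sketched below. The function name and the choice to return `None` for an unreachable target are illustrative assumptions.

```python
def required_part_speedup(total_s, part_s, target_speedup):
    """How many times faster must the `part_s` seconds become so that a task
    taking `total_s` seconds overall runs `target_speedup` times faster?
    Returns None when no finite improvement can reach the target."""
    target_time = total_s / target_speedup
    unaffected = total_s - part_s
    budget = target_time - unaffected   # time left for the enhanced part
    if budget <= 0:
        return None
    return part_s / budget


print(required_part_speedup(100, 80, 4))   # 16.0
print(required_part_speedup(100, 80, 5))   # None
```

The 20 seconds not spent multiplying set a floor on execution time, so an overall speedup of 5 is out of reach.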
Amdahl's Law With Multiple
Enhancements: Example 4
• Three CPU performance enhancements are proposed with the
following speedups and percentage of the code execution time
affected:
Speedup1 = S1 = 10 Percentage1 = F1 = 20%
Speedup2 = S2 = 15 Percentage2 = F2 = 15%
Speedup3 = S3 = 30 Percentage3 = F3 = 10%
• While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code and only one
enhancement can be used at a time.
• What is the resulting overall speedup?
$$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
• Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30]
= 1 / [.55 + .0333]
= 1 / .5833 = 1.71
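The multi-enhancement form can be sketched as follows. The function name and the validation step are illustrative assumptions; the fractions are taken relative to the original execution time and must not overlap.

```python
def amdahl_multi(enhancements):
    """enhancements: list of (F_i, S_i) pairs. Each F_i is a fraction of the
    ORIGINAL execution time, and the enhanced portions do not overlap."""
    total_fraction = sum(f for f, _ in enhancements)
    if total_fraction > 1.0:
        raise ValueError("fractions must sum to at most 1")
    new_time = (1.0 - total_fraction) + sum(f / s for f, s in enhancements)
    return 1.0 / new_time


speedup = amdahl_multi([(0.20, 10), (0.15, 15), (0.10, 30)])
print(f"overall speedup = {speedup:.2f}")
```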
Pictorial Depiction of
Example
Before: Execution time with no enhancements: 1
| Unaffected, fraction: .55 | F1 = .2 (S1 = 10) | F2 = .15 (S2 = 15) | F3 = .1 (S3 = 30) |
After: Execution time with enhancements: .55 + .02 + .01 + .00333 = .5833
| Unaffected, fraction: .55 (unchanged) | F1 / 10 | F2 / 15 | F3 / 30 |
Speedup = 1 / .5833 = 1.71
Note: All fractions refer to original execution time.
Principle of locality
• Programs are not random pieces of code
– Rule of thumb
• 90% of time is spent in 10% of code
– The locality principle also applies to data accesses
• Spatial locality
– data at nearby addresses tend to be accessed close together in time
• Temporal locality
– currently accessed data are likely to be accessed again in the near
future
– Exploit locality in design
• e.g. caches try to exploit temporal locality while
prefetching exploits spatial locality
Exploit Parallelism
• Today’s computer systems
– Parallelism at different levels
• Have more disks to improve I/O throughput
• Have more memory banks to support parallel data
access
• Process multiple instructions in parallel
• Digital circuits are inherently parallel systems
(individual bits get operated on in parallel)
• Have more ALUs to carry out parallel additions
Other ways to measure
performance
• Use MIPS (millions of instructions/second)
$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Exec. Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6}$$
• MIPS is a rate of operations/unit time.
• Performance can be specified as the inverse of
execution time
– so faster machines have a higher MIPS rating
• So, bigger MIPS = faster machine. Right?
Wrong!!!
• 3 significant problems with using MIPS:
– Problem 1:
• MIPS is instruction set dependent.
• (And different computer brands usually have different
instruction sets)
– Problem 2:
• MIPS varies between programs on the same computer
– Problem 3:
• MIPS can vary inversely to performance!
• Let’s look at an example of why MIPS doesn’t work…
A MIPS Example (1)
• Consider the following computer:
Instruction counts (in millions) for each instruction class

Code from:    A    B    C
Compiler 1    5    1    1
Compiler 2    10   1    1

The machine runs at 100 MHz.
Instruction A requires 1 clock cycle, Instruction B requires 2
clock cycles, Instruction C requires 3 clock cycles.
Note important formula!
$$\text{CPI} = \frac{\text{CPU Clock Cycles}}{\text{Instruction Count}} = \frac{\sum_{i=1}^{n} CPI_i \times C_i}{\text{Instruction Count}}$$
where C_i is the number of instructions of class i.
A MIPS Example (2)
$$CPI_1 = \frac{[(5 \times 1) + (1 \times 2) + (1 \times 3)] \times 10^6}{(5 + 1 + 1) \times 10^6} = \frac{10}{7} = 1.43$$
$$MIPS_1 = \frac{100\ \text{MHz}}{1.43} = 69.9$$
$$CPI_2 = \frac{[(10 \times 1) + (1 \times 2) + (1 \times 3)] \times 10^6}{(10 + 1 + 1) \times 10^6} = \frac{15}{12} = 1.25$$
$$MIPS_2 = \frac{100\ \text{MHz}}{1.25} = 80.0$$
So, compiler 2 has a higher
MIPS rating and should be
faster?
A MIPS Example (3)
• Now let’s compare CPU time:
Note important formula!
$$\text{CPU Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$
$$\text{CPU Time}_1 = \frac{7 \times 10^6 \times 1.43}{100 \times 10^6} = 0.10\ \text{seconds}$$
$$\text{CPU Time}_2 = \frac{12 \times 10^6 \times 1.25}{100 \times 10^6} = 0.15\ \text{seconds}$$
Therefore program 1 is faster despite a lower MIPS!
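The whole MIPS example fits in a short sketch. The names below are illustrative. Because the code does not round CPI to 1.43 first, it reports 70.0 MIPS for compiler 1 where the slides show 69.9.

```python
CLOCK_RATE_HZ = 100e6                       # the machine runs at 100 MHz
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the code produced by each compiler.
counts = {
    "Compiler 1": {"A": 5e6, "B": 1e6, "C": 1e6},
    "Compiler 2": {"A": 10e6, "B": 1e6, "C": 1e6},
}


def analyze(class_counts):
    """Return (CPI, MIPS rating, CPU time in seconds) for one program."""
    instructions = sum(class_counts.values())
    cycles = sum(n * CYCLES_PER_CLASS[c] for c, n in class_counts.items())
    cpi = cycles / instructions
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    cpu_time = cycles / CLOCK_RATE_HZ
    return cpi, mips, cpu_time


results = {name: analyze(c) for name, c in counts.items()}
for name, (cpi, mips, cpu_time) in results.items():
    print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips:.1f}, CPU time = {cpu_time:.2f} s")
```

Compiler 2 gets the higher MIPS rating only because it executes more cheap instructions; its code still takes longer to run.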