Amdahl’s Law in the Multicore Era
Mark D. Hill and Michael R. Marty
University of Wisconsin—Madison
January 2010 @ Other UW
IBM’s Dr. Thomas Puzak:
Everyone knows Amdahl’s Law
But quickly forgets it!
© 2010 Multifacet Project
University of Wisconsin-Madison
HPCA 2007 Debate [IEEE Micro 11-12/2007]
Single-Threaded vs. Multithreaded:
Where Should We Focus?
Yale Patt vs. Mark Hill w/ Joel Emer, moderator
Executive Summary
• Develop A Corollary to Amdahl’s Law
– Simple Model of Multicore Hardware
– Complements Amdahl’s software model
– Fixed chip resources for cores
– Core performance improves sub-linearly with resources
• Research Implications
(1) Need Dramatic Increases in Parallelism (No Surprise)
• 99% parallel limits 256 (base) cores to speedup 72
• New Moore’s Law: Double Parallelism Every Two Years?
(2) Many larger chips need increased core performance
(3) HW/SW for asymmetric designs (one/few cores enhanced)
(4) HW/SW for dynamic designs (serial ↔ parallel)
11/6/2015
4
Wisconsin Multifacet Project
Outline
• Multicore Motivation & Research Paper Trends
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
Technology & Moore’s Law
Transistor 1947
Integrated Circuit 1958 (a.k.a. Chip)
Moore’s Law 1965:
# Transistors per Chip doubles every two years (or 18 months)
Architects & Another Moore’s Law
Microprocessor 1971
50M transistors ~2000 
Popular Moore’s Law:
Processor (core) performance doubles every two years
Multicore Chip (a.k.a. Chip Multiprocessors)
Why Multicore?
• Power → simpler structures
• Memory → concurrent accesses to tolerate off-chip latency
• Wires → intra-core wires shorter
• Complexity → divide & conquer
But More cores; NOT faster cores
Will effective chip performance
keep doubling every two years?
Eight 4-way cores 2006
Virtuous Cycle, circa 1950 – 2005 (per Larus)
Increased processor performance → Larger, more feature-full software → Larger development teams → Higher-level languages & abstractions → Slower programs → (back to) Increased processor performance
World-Wide Software Market (per IDC): $212b (2005) → $310b (2010)
Virtuous Cycle, 2005 – ???
[Figure: the same cycle (Larger, more feature-full software; Larger development teams; Higher-level languages & abstractions; Slower programs) with an X over "Increased processor performance"]
GAME OVER: NEXT LEVEL?
Thread Level Parallelism & Multicore Chips
World-Wide Software Market $212b (2005) → ?
How has Architecture Research Prepared?
[Figure: Percent Multiprocessor Papers in ISCA (0 to 100), by year, 1973 to 2007. Annotations: SMP Bulge; Lead up to Multicore; What Next?]
Source: Hill & Rajwar, The Rise & Fall of Multiprocessor Papers in ISCA,
http://www.cs.wisc.edu/~markhill/mp2001.html (3/2001)
How has Architecture Research Prepared? (Reacted?)
[Figure: Percent Multiprocessor Papers in ISCA (0 to 100), by year, 1973 to 2009. Annotation: Multicore Ramp]
Will Architecture Research Overreact?
Source: Hill, 2/2008
What About PL/Compilers (PLDI) Research?
[Figure: Percent Multiprocessor Papers (0% to 100%), by year, 1988 to 2007. Annotations: PLDI Begins; End of Small SMP Bulge?; Lead up to Multicore; Gentle Multicore Ramp; What Next?]
Source: Steve Jackson, 3/2008
What About Systems (SOSP/OSDI) Research?
[Figure: Percent Multiprocessor Papers (0% to 100%), by year, 1967 to 2007; SOSP odd years only in the earlier part of the axis, OSDI even & SOSP odd years in the later part. Annotations: Small SMP Bulge; Lead up to Multicore; NO Multicore Ramp (Yet); What Next?]
Source: Michael Swift, 3/2008
Outline
• Multicore Motivation & Research Paper Trends
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
Recall Amdahl’s Law
• Begins with Simple Software Assumption (Limit Arg.)
– Fraction F of execution time perfectly parallelizable
– No Overhead for
– Scheduling
– Communication
– Synchronization, etc.
– Fraction 1 – F Completely Serial
• Time on 1 core = (1 – F) / 1 + F / 1 = 1
• Time on N cores = (1 – F) / 1 + F / N
Recall Amdahl’s Law [1967]
\[
\text{Amdahl's Speedup} = \frac{1}{\dfrac{1-F}{1} + \dfrac{F}{N}}
\]
• For mainframes, Amdahl expected 1 - F = 35%
– With 4 processors, speedup ≈ 2
– With infinitely many processors, speedup < 3
– Therefore, stay with mainframes with one/few processors
• Amdahl’s Law applied to Minicomputer to PC Eras
• What about the Multicore Era?
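The time expressions and the speedup formula above can be checked numerically. The following is a minimal Python sketch, assuming the no-overhead software model stated on these slides; the function name `amdahl_speedup` is an illustrative choice, not something from the deck.

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Classic Amdahl speedup: parallel fraction f on n processors, no overheads."""
    serial_time = (1.0 - f) / 1.0   # serial part runs on one processor
    parallel_time = f / n           # parallel part is split n ways
    return 1.0 / (serial_time + parallel_time)


# Amdahl's mainframe estimate from the slide: 1 - F = 35%, so F = 0.65.
four_way = amdahl_speedup(0.65, 4)
limit = 1.0 / (1.0 - 0.65)  # as n grows without bound, only the serial term remains

print(f"4 processors: {four_way:.2f}")
print(f"infinitely many processors: {limit:.2f}")
```

The same function also reproduces the executive-summary figure: with F = 0.99 and 256 base cores the speedup comes out near 72.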
Designing Multicore Chips Hard
• Designers must confront single-core design options
– Instruction fetch, wakeup, select
– Execution unit configuration & operand bypass
– Load/queue(s) & data cache
– Checkpoint, log, runahead, commit.
• As well as additional design degrees of freedom
– How many cores? How big each?
– Shared caches: levels? How many banks?
– Memory interface: How many banks?
– On-chip interconnect: bus, switched, ordered?
Want Simple Multicore Hardware Model
To Complement Amdahl’s Simple Software Model
(1) Chip Hardware Roughly Partitioned into
– Multiple Cores (with L1 caches)
– The Rest (L2/L3 cache banks, interconnect, pads, etc.)
– Changing Core Size/Number does NOT change The Rest
(2) Resources for Multiple Cores Bounded
– Bound of N resources per chip for cores
– Due to area, power, cost ($$$), or multiple factors
– Bound = Power? (but our pictures use Area)
Want Simple Multicore Hardware Model, cont.
(3) Micro-architects can improve single-core
performance using more of the bounded resource
• A Simple Base Core
– Consumes 1 Base Core Equivalent (BCE) resources
– Provides performance normalized to 1
• An Enhanced Core (in same process generation)
– Consumes R BCEs
– Performance as a function Perf(R)
• What does function Perf(R) look like?
More on Enhanced Cores
• (Performance Perf(R) consuming R BCEs resources)
• If Perf(R) > R → Always enhance core
• Cost-effectively speeds up both sequential & parallel
• Therefore, Equations Assume Perf(R) < R
• Graphs Assume Perf(R) = Square Root of R
– 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
– Why? Models diminishing returns with “no coefficients”
– Alpha EV4/5/6 [Kumar 11/2005] & Intel’s Pollack’s Law
• How to speed up enhanced core?
– <Insert favorite or TBD micro-architectural ideas here>
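To make the graph assumption concrete, here is a small Python sketch of Perf(R) = square root of R. The name `perf` is illustrative, and the square-root form is only the deck's graphing assumption; the equations accept any Perf(R).

```python
import math


def perf(r: float) -> float:
    """Graph assumption from the slides: performance of an R-BCE core is sqrt(R)."""
    return math.sqrt(r)


for r in (1, 4, 9, 16):
    # Performance per BCE drops as the core grows: diminishing returns.
    print(f"R={r:2d}  Perf(R)={perf(r):.1f}  Perf(R)/R={perf(r) / r:.3f}")
```

Under this assumption Perf(R) < R for every R > 1, which is the case the equations are meant to cover.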
Outline
• Multicore Motivation & Research Paper Trends
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
How Many (Symmetric) Cores per Chip?
• Each Chip Bounded to N BCEs (for all cores)
• Each Core consumes R BCEs
• Assume Symmetric Multicore = All Cores Identical
• Therefore, N/R Cores per Chip: (N/R)*R = N
• For an N = 16 BCE Chip:
[Figure: three chip layouts: Sixteen 1-BCE cores; Four 4-BCE cores; One 16-BCE core]
Performance of Symmetric Multicore Chips
• Serial Fraction 1-F uses 1 core at rate Perf(R)
• Serial time = (1 – F) / Perf(R)
• Parallel Fraction uses N/R cores at rate Perf(R) each
• Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
• Therefore, w.r.t. one base core:
\[
\text{Symmetric Speedup} = \frac{1}{\dfrac{1-F}{\text{Perf}(R)} + \dfrac{F \cdot R}{\text{Perf}(R) \cdot N}}
\]
• Implications?
Enhanced Cores speed Serial & Parallel
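The symmetric speedup can be evaluated directly. Below is a minimal Python sketch, assuming Perf(R) = sqrt(R) as in the deck's graphs; the function names are illustrative and not from the slides.

```python
import math


def perf(r: float) -> float:
    """Assumed core performance model: sqrt of the BCEs spent on the core."""
    return math.sqrt(r)


def symmetric_speedup(f: float, n: float, r: float) -> float:
    """Speedup of a chip with N/R identical R-BCE cores, relative to one base core."""
    serial_time = (1.0 - f) / perf(r)        # one core at rate Perf(R)
    parallel_time = f * r / (perf(r) * n)    # N/R cores at rate Perf(R) each
    return 1.0 / (serial_time + parallel_time)


# Design points quoted on the N = 16 charts in this deck.
print(symmetric_speedup(0.5, 16, 16))           # one 16-BCE core
print(round(symmetric_speedup(0.9, 16, 2), 1))  # eight 2-BCE cores
```

With R = 1 the expression reduces to the classic Amdahl formula on N base cores, since Perf(1) = 1.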
Symmetric Multicore Chip, N = 16 BCEs
[Figure: Symmetric Speedup (0 to 16) vs. R BCEs (1, 2, 4, 8, 16, i.e. 16, 8, 4, 2, 1 cores); curve for F=0.5. Marked point: F=0.5, R=16, Cores=1, Speedup=4]
F=0.5, Opt. Speedup S = 4 = 1/(0.5/4 + 0.5*16/(4*16))
Need to increase parallelism to make multicore optimal!
Symmetric Multicore Chip, N = 16 BCEs
[Figure: Symmetric Speedup (0 to 16) vs. R BCEs (1 to 16); curves for F=0.5 and F=0.9. Marked points: F=0.9, R=2, Cores=8, Speedup=6.7; F=0.5, R=16, Cores=1, Speedup=4]
At F=0.9, Multicore optimal, but speedup limited
Need to obtain even more parallelism!
Symmetric Multicore Chip, N = 16 BCEs
[Figure: Symmetric Speedup (0 to 16) vs. R BCEs (1 to 16); curves for F=0.5, 0.9, 0.975, 0.99, 0.999. Marked point: F→1, R=1, Cores=16, Speedup→16]
F matters: Amdahl’s Law applies to multicore chips
MANY Researchers should target parallelism F first
Need a Third “Moore’s Law?”
• Technologist’s Moore’s Law
– Double Transistors per Chip every 2 years
– Slows or stops: TBD
• Microarchitect’s Moore’s Law
– Double Performance per Core every 2 years
– Slowed or stopped: Early 2000s
• Multicore’s Moore’s Law
– Double Cores per Chip every 2 years
– & Double Parallelism per Workload every 2 years
– & Aided by Architectural Support for Parallelism
– = Double Performance per Chip every 2 years
– Starting now
• Software as Producer, not Consumer, of Performance Gains!
Symmetric Multicore Chip, N = 16 BCEs
[Figure: Symmetric Speedup (0 to 16) vs. R BCEs (1 to 16); curves for F=0.5, 0.9, 0.975, 0.99, 0.999. Recall F=0.9, R=2, Cores=8, Speedup=6.7]
As Moore’s Law enables N to go from 16 to 256 BCEs,
More cores? Enhance cores? Or both?
Symmetric Multicore Chip, N = 256 BCEs
[Figure: Symmetric Speedup (0 to 250) vs. R BCEs (1 to 256); curves for F=0.5, 0.9, 0.975, 0.99, 0.999. Marked points, with the N = 16 values in parentheses:
F→1: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16). MORE CORES!
F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9). MORE CORES & ENHANCE CORES!
F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7). ENHANCE CORES!]
As Moore’s Law increases N, often need enhanced core designs
Some arch. researchers should target single-core performance
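The marked optima on the N = 256 chart can be reproduced with a brute-force sweep over R. This is a hedged Python sketch, assuming Perf(R) = sqrt(R) and integer R, and letting the core count N/R be fractional; the helper `best_symmetric_r` is a hypothetical name introduced for illustration.

```python
import math


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """Symmetric speedup with the assumed Perf(R) = sqrt(R)."""
    p = math.sqrt(r)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))


def best_symmetric_r(f: float, n: int) -> int:
    """Integer R in 1..N that maximizes symmetric speedup (exhaustive search)."""
    return max(range(1, n + 1), key=lambda r: symmetric_speedup(f, n, r))


N = 256
for f in (0.999, 0.99, 0.9):
    r = best_symmetric_r(f, N)
    s = symmetric_speedup(f, N, r)
    print(f"F={f}: best R={r}, cores={N / r:.1f}, speedup={s:.1f}")
```

As F drops, the best R grows: lower parallelism pushes the design toward fewer, larger cores, which is the point of the slide.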
Aside: Cost-Effective Parallel Computing
• Isn’t Speedup(C) < C Inefficient? (C = #cores)
• Much of a Computer’s Cost OUTSIDE Processor
[Wood & Hill, IEEE Computer 2/1995]
• Let Costup(C) = Cost(C)/Cost(1)
• Parallel Computing Cost-Effective:
• Speedup(C) > Costup(C)
• 1995 SGI PowerChallenge w/ 500MB:
• Costup(32) = 8.6
Multicores have even lower Costups!!!
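The cost-effectiveness test is a single comparison. A minimal Python sketch follows; the function name and the two example speedups (10.0 and 8.0) are hypothetical, and only Costup(32) = 8.6 comes from the slide.

```python
def is_cost_effective(speedup: float, costup: float) -> bool:
    """Parallel computing is cost-effective when Speedup(C) > Costup(C)."""
    return speedup > costup


COSTUP_32 = 8.6  # 1995 SGI PowerChallenge w/ 500MB, from the slide

# Hypothetical speedups on 32 processors, far below linear (32x):
print(is_cost_effective(10.0, COSTUP_32))  # clears the 8.6 bar
print(is_cost_effective(8.0, COSTUP_32))   # does not
```

The takeaway is that a speedup well short of C can still pay off, because adding processors raises total system cost by much less than a factor of C.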
How Might Servers/Clients/Embedded Evolve?
• Recall 1970s Watergate
– Secret Source Deep Throat (W. Mark Felt @ FBI)
– Helped Reporters Bob Woodward & Carl Bernstein
– Confirmed, but would not provide information
– Frequently recommended: Follow the Money
• Today I recommend: Follow the Parallelism!
– Where Parallelism Helps Performance
– Where Parallelism Helps Cost-Performance
• Servers can use vast parallelism
• Can clients & embedded?
• If not, computing’s center of gravity → server cloud
Outline
• Multicore Motivation & Research Paper Trends
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
Asymmetric (Heterogeneous) Multicore Chips
• Symmetric Multicore Required All Cores Equal
• Why Not Enhance Some (But Not All) Cores?
• For Amdahl’s Simple Software Assumptions
– One Enhanced Core
– Others are Base Cores
• How?
– <fill in favorite micro-architecture techniques here>
– Model ignores design cost of asymmetric design
• How does this affect our hardware model?
How Many Cores per Asymmetric Chip?
• Each Chip Bounded to N BCEs (for all cores)
• One R-BCE Core leaves N-R BCEs
• Use N-R BCEs for N-R Base Cores
• Therefore, 1 + N - R Cores per Chip
• For an N = 16 BCE Chip:
[Figure: two chip layouts. Asymmetric: One 4-BCE core & Twelve 1-BCE base cores. Symmetric: Four 4-BCE cores]
Performance of Asymmetric Multicore Chips
• Serial Fraction 1-F same, so time = (1 – F) / Perf(R)
• Parallel Fraction F
– One core at rate Perf(R)
– N-R cores at rate 1
– Parallel time = F / (Perf(R) + N - R)
• Therefore, w.r.t. one base core:
\[
\text{Asymmetric Speedup} = \frac{1}{\dfrac{1-F}{\text{Perf}(R)} + \dfrac{F}{\text{Perf}(R) + N - R}}
\]
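A short Python sketch of the asymmetric formula, again assuming Perf(R) = sqrt(R); function names are illustrative. The example uses the N = 16 chip with one 4-BCE core and twelve base cores described earlier, and F = 0.9 is chosen only for illustration.

```python
import math


def perf(r: float) -> float:
    """Assumed core performance model: sqrt of the BCEs spent on the core."""
    return math.sqrt(r)


def asymmetric_speedup(f: float, n: float, r: float) -> float:
    """One R-BCE enhanced core plus N - R base cores, relative to one base core."""
    serial_time = (1.0 - f) / perf(r)          # enhanced core runs the serial part
    parallel_time = f / (perf(r) + (n - r))    # enhanced core and base cores work together
    return 1.0 / (serial_time + parallel_time)


n, r = 16, 4
print("cores:", 1 + n - r)                      # 1 enhanced + 12 base
print("speedup:", asymmetric_speedup(0.9, n, r))
```

For comparison, four symmetric 4-BCE cores at the same F would spend 0.9*4/(2*16) on the parallel part instead of 0.9/14, so the asymmetric chip comes out ahead in this case.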
Asymmetric Multicore Chip, N = 256 BCEs
[Figure: Asymmetric Speedup (0 to 250) vs. R BCEs (1 to 256); curves for F=0.5, 0.9, 0.975, 0.99, 0.999. X-axis core counts: R=1 (256 cores), R=4 (1+252 cores), R=16 (1+240 cores), R=64 (1+192 cores), R=256 (1 core)]
Number of Cores = 1 (Enhanced) + 256 – R (Base)
How do Asymmetric & Symmetric speedups compare?
Recall Symmetric Multicore Chip, N = 256 BCEs
[Figure: Symmetric Speedup (0 to 250) vs. R BCEs (1 to 256); curves for F=0.5, 0.9, 0.975, 0.99, 0.999. Recall F=0.9, R=28, Cores=9, Speedup=26.7]
Asymmetric Multicore Chip, N = 256 BCEs
[Figure: Asymmetric Speedup (0 to 250) vs. R BCEs (1 to 256); curves for F=0.5, 0.9, 0.975, 0.99, 0.999. Marked points, with the symmetric values in parentheses:
F=0.99: R=41 (vs. 3), Cores=216 (vs. 85), Speedup=166 (vs. 80)
F=0.9: R=118 (vs. 28), Cores=139 (vs. 9), Speedup=65.6 (vs. 26.7)]
Asymmetric offers greater speedup potential than Symmetric
In Paper: As Moore’s Law increases N, Asymmetric gets better
Some arch. researchers should target asymmetric multicores
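The symmetric-versus-asymmetric comparison at N = 256 can be recomputed at the design points marked on the charts. This Python sketch assumes Perf(R) = sqrt(R); it evaluates the two formulas at the R values quoted in the deck and does not search for optima.

```python
import math


def symmetric_speedup(f: float, n: float, r: float) -> float:
    """N/R identical R-BCE cores, assumed Perf(R) = sqrt(R)."""
    p = math.sqrt(r)
    return 1.0 / ((1.0 - f) / p + f * r / (p * n))


def asymmetric_speedup(f: float, n: float, r: float) -> float:
    """One R-BCE enhanced core plus N - R base cores, assumed Perf(R) = sqrt(R)."""
    p = math.sqrt(r)
    return 1.0 / ((1.0 - f) / p + f / (p + n - r))


N = 256
# (F, symmetric R, asymmetric R) as marked on the N = 256 charts.
for f, r_sym, r_asym in ((0.99, 3, 41), (0.9, 28, 118)):
    s = symmetric_speedup(f, N, r_sym)
    a = asymmetric_speedup(f, N, r_asym)
    print(f"F={f}: symmetric {s:.1f} (R={r_sym}) vs. asymmetric {a:.1f} (R={r_asym})")
```

In both cases the asymmetric chip roughly doubles the symmetric speedup, because the large core accelerates the serial part while most BCEs remain as base cores for the parallel part.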
Asymmetric Multicore: 3 Software Issues
1. Schedule computation (e.g., when to use bigger core)
2. Manage locality (e.g., sending code or data can sap gains)
3. Synchronize (e.g., asymmetric cores reaching a barrier)
At What Level?
– Application Programmer
– Library Author
– Compiler
– Runtime System
– Operating System
– Hypervisor (Virtual Machine Monitor)
– Hardware
More Leverage (?) / More Info (?)
Outline
• Multicore Motivation & Research Paper Trends
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
Dynamic Multicore Chips, Take 1
• Why NOT Have Your Cake and Eat It Too?
• N Base Cores for Best Parallel Performance
• Harness R Cores Together for Serial Performance
• How? DYNAMICALLY Harness Cores Together
– <insert favorite or TBD techniques here>
[Figure: the chip in parallel mode and in sequential mode]
Dynamic Multicore Chips, Take 2
• Let POWER provide the limit of N BCEs
• While Area is Unconstrained (to first order)
[Figure: the chip in parallel mode and in sequential mode]
How to model these two chips?
• Result: N base cores for parallel; large core for serial
– [Chakraborty, Wells, & Sohi, Wisconsin CS-TR-2007-1607]
– When Simultaneous Active Fraction (SAF) < ½
Performance of Dynamic Multicore Chips
• N Base Cores with R BCEs used Serially
• Serial Fraction 1-F uses R BCEs at rate Perf(R)
• Serial time = (1 – F) / Perf(R)
• Parallel Fraction F uses N base cores at rate 1 each
• Parallel time = F / N
• Therefore, w.r.t. one base core:
\[
\text{Dynamic Speedup} = \frac{1}{\dfrac{1-F}{\text{Perf}(R)} + \dfrac{F}{N}}
\]
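The dynamic formula differs from the asymmetric one only in the parallel term. A minimal Python sketch, assuming Perf(R) = sqrt(R); the function name is illustrative.

```python
import math


def dynamic_speedup(f: float, n: float, r: float) -> float:
    """Dynamic chip: R BCEs fused for the serial part, all N base cores for the parallel part."""
    serial_time = (1.0 - f) / math.sqrt(r)   # assumed Perf(R) = sqrt(R)
    parallel_time = f / n                    # N base cores at rate 1 each
    return 1.0 / (serial_time + parallel_time)


# Design point marked on the N = 256 dynamic chart: F = 0.99, R = 256.
print(round(dynamic_speedup(0.99, 256, 256)))
```

Because the parallel term does not depend on R, a larger R can only help in this model, so the best choice is R = N whenever Perf(R) keeps increasing.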
Recall Asymmetric Multicore Chip, N = 256 BCEs
[Figure: Asymmetric Speedup (0 to 250) vs. R BCEs (1 to 256); curves for F=0.5, 0.9, 0.975, 0.99, 0.999. Recall F=0.99, R=41, Cores=216, Speedup=166]
What happens with a dynamic chip?
Dynamic Multicore Chip, N = 256 BCEs
[Figure: Dynamic Speedup (0 to 250) vs. R BCEs (1 to 256); curves for F=0.5, 0.9, 0.975, 0.99, 0.999. Marked point, with the asymmetric values in parentheses: F=0.99, R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166)]
Dynamic offers greater speedup potential than Asymmetric
Arch. researchers should target dynamically harnessing cores
Dynamic / Asymmetric Multicore: 3 Software Issues
1. Schedule computation (e.g., when to use bigger core)
2. Manage locality (e.g., sending code or data can sap gains)
3. Synchronize (e.g., asymmetric cores reaching a barrier)
At What Level?
– Application Programmer
– Library Author
– Compiler
– Runtime System
– Operating System
– Hypervisor (Virtual Machine Monitor)
– Hardware
More Leverage (?) / More Info (?)
Dynamic Challenges > Asymmetric Ones
Dynamic chips due to power likely
Outline
• Multicore Motivation & Research Paper Trends
• Recall Amdahl’s Law
• A Model of Multicore Hardware
• Symmetric Multicore Chips
• Asymmetric Multicore Chips
• Dynamic Multicore Chips
• Caveats & Wrap Up
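Three Multicore Amdahl’s Laws
In each denominator, the first term is the Sequential Section (1 Enhanced Core) and the second term is the Parallel Section.
Symmetric (parallel section on N/R Enhanced Cores):
\[
\text{Symmetric Speedup} = \frac{1}{\dfrac{1-F}{\text{Perf}(R)} + \dfrac{F \cdot R}{\text{Perf}(R) \cdot N}}
\]
Asymmetric (parallel section on 1 Enhanced & N-R Base Cores):
\[
\text{Asymmetric Speedup} = \frac{1}{\dfrac{1-F}{\text{Perf}(R)} + \dfrac{F}{\text{Perf}(R) + N - R}}
\]
Dynamic (parallel section on N Base Cores):
\[
\text{Dynamic Speedup} = \frac{1}{\dfrac{1-F}{\text{Perf}(R)} + \dfrac{F}{N}}
\]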
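The three laws share the serial term and differ only in the parallel term, so they are easy to tabulate side by side. The following is a hedged Python sketch, assuming Perf(R) = sqrt(R); F = 0.975 and N = 256 are taken from the charted curves, and the function names are illustrative.

```python
import math


def perf(r: float) -> float:
    """Assumed core performance model: sqrt of the BCEs spent on the core."""
    return math.sqrt(r)


def symmetric(f: float, n: float, r: float) -> float:
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))


def asymmetric(f: float, n: float, r: float) -> float:
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))


def dynamic(f: float, n: float, r: float) -> float:
    return 1.0 / ((1.0 - f) / perf(r) + f / n)


F, N = 0.975, 256
print(" R   symmetric  asymmetric  dynamic")
for r in (1, 4, 16, 64, 256):
    print(f"{r:3d}  {symmetric(F, N, r):9.1f}  {asymmetric(F, N, r):10.1f}  {dynamic(F, N, r):7.1f}")
```

Under the square-root assumption the ordering dynamic ≥ asymmetric ≥ symmetric holds for every R between 1 and N: the parallel-term denominators satisfy N ≥ Perf(R) + N - R ≥ Perf(R)*N/R, with equality at R = 1.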
Amdahl’s Software Model Charges
• Serial (parallel) fraction not totally serial (parallel)
– Can extend model to tree algorithms (bounded parallelism)
• Serial/Parallel fraction changes: Weak Scaling [Gustafson]
– Prudent architectures support Strong Scaling
• Synchronization, communication, scheduling effects?
– Can extend for overheads and imbalance
• Software challenges for asymmetric/dynamic worse
– Can extend to model overheads to facilitate
• Future software will be totally parallel (see “my work”)
– I’m skeptical; not even true for MapReduce
Our Hardware Model Charges
• Naïve to bound Cores by one resource (esp. area)
– Can extend for Pareto optimal mix of area, dynamic/static power, complexity, reliability, …
• Naïve to ignore off-chip bandwidth & L2/L3 caching
– Can extend for modeling memory system
• Naïve to use performance = square root of resources
– Can extend as equations can use any function
• We architects can’t scale Perf(R) for very large R
– True, so what should we do about it?
Three-Part Charge
Architects: Build more-effective multicore hardware
• Don’t lament that we can’t do, but do it!
• Play with & trash our models [IEEE Computer, July 2008]
– www.cs.wisc.edu/multifacet/amdahl
Computer Scientists: Implement “3rd Moore’s Law”
• Double Parallelism Every Two Years
• Consider Symmetric, Asymmetric, & Dynamic Chips
Finally, We must all work together
• Keep (cost-) performance gains progressing
• Parallel Programming & Parallel Computers
Dynamic Multicore Chip, N = 1024 BCEs
[Figure: Dynamic Speedup (0 to 1000) vs. R BCEs (1 to 1024); curves for F=0.5, 0.9, 0.975, 0.99, 0.999. Marked point: F→1, R→1024, Cores→1024, Speedup→1024!]
NOT Possible Today
NOT Possible EVER Unless We Dream & Act
Executive Summary
• Develop A Corollary to Amdahl’s Law
– Simple Model of Multicore Hardware
– Complements Amdahl’s software model
– Fixed chip resources for cores
– Core performance improves sub-linearly with resources
• Research Implications
(1) Need Dramatic Increases in Parallelism (No Surprise)
• 99% parallel limits 256 (base) cores to speedup 72
• New Moore’s Law: Double Parallelism Every Two Years?
(2) Many larger chips need increased core performance
(3) HW/SW for asymmetric designs (one/few cores enhanced)
(4) HW/SW for dynamic designs (serial ↔ parallel)
Backup Slides
[Figure: Symmetric Multicore Chip, N = 16 BCEs. Symmetric Speedup (0 to 16) vs. R BCEs (1 to 16); curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
[Figure: Symmetric Multicore Chip, N = 256 BCEs. Symmetric Speedup (0 to 250) vs. R BCEs (1 to 256); curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
[Figure: Symmetric Multicore Chip, N = 1024 BCEs. Symmetric Speedup (0 to 1000) vs. R BCEs (1 to 1024); curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
[Figure: Asymmetric Multicore Chip, N = 16 BCEs. Asymmetric Speedup (0 to 16) vs. R BCEs (1 to 16); curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
[Figure: Asymmetric Multicore Chip, N = 256 BCEs. Asymmetric Speedup (0 to 250) vs. R BCEs (1 to 256); curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
[Figure: Asymmetric Multicore Chip, N = 1024 BCEs. Asymmetric Speedup (0 to 1000) vs. R BCEs (1 to 1024); curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
[Figure: Dynamic Multicore Chip, N = 16 BCEs. Dynamic Speedup (0 to 16) vs. R BCEs (1 to 16); curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
[Figure: Dynamic Multicore Chip, N = 256 BCEs. Dynamic Speedup (0 to 250) vs. R BCEs (1 to 256); curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
[Figure: Dynamic Multicore Chip, N = 1024 BCEs. Dynamic Speedup (0 to 1000) vs. R BCEs (1 to 1024); curves for F=0.5, 0.9, 0.975, 0.99, 0.999]
Software Model Charges 1 of 2
• Serial fraction not totally serial
• Can extend software model to tree algorithms, etc.
• Parallel fraction not totally parallel
• Can extend for varying or bounded parallelism
• Serial/Parallel fraction may change
• Can extend for Weak Scaling [Gustafson, CACM’88]
• Run larger, more parallel problem in constant time
• But prudent architectures support Strong Scaling
Software Model Charges 2 of 2
• Synchronization, communication, scheduling effects?
• Can extend for overheads and imbalance
• Software challenges for asymmetric multicore worse
• Can extend for asymmetric scheduling, etc.
• Software challenges for dynamic multicore greater
• Can extend to model overheads to facilitate
• Future software will be totally parallel (see “my work”)
• I’m skeptical; not even true for MapReduce
Hardware Model Charges 1 of 2
• Naïve to consider total resources for cores fixed
• Can extend hardware model to how core changes
affect The Rest
• Naïve to bound Cores by one resource (esp. area)
• Can extend for Pareto optimal mix of area,
dynamic/static power, complexity, reliability, …
• Naïve to ignore challenges due to off-chip bandwidth
limits & benefits of last-level caching
• Can extend for modeling these
Hardware Model Charges 2 of 2
• Naïve to use performance = square root of resources
• Can extend as equations can use any function
• We architects can’t scale Perf(R) for very large R
• True, not yet.
• We architects can’t dynamically harness very large R
• True, not yet
• So what should computer scientists do about it?