
Panel Session:
Paving the Way for Multicore
Open Systems Architectures
James C. Anderson
MIT Lincoln Laboratory
HPEC08
Wednesday, 24 September 2008
This work was sponsored by the Department of the Air Force under Air
Force Contract #FA8721-05-C-0002. Opinions, interpretations,
conclusions and recommendations are those of the author, and are not
necessarily endorsed by the United States Government.
Reference to any specific commercial product, trade name, trademark
or manufacturer does not constitute or imply endorsement.
000523-jca-1
KAM 7/12/2016
MIT Lincoln Laboratory
Objective & Schedule

• Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures
  – Where are we now?
  – What needs to be done?
• Schedule
  – 1525: Overview
  – 1540: Guest speaker: Mr. Markus Levy
  – 1600: Introduction of the panelists
  – 1605: Previously submitted questions for the panel
  – 1635: Open forum
  – 1655: Conclusions & the way ahead
  – 1700: Closing remarks & adjourn
Paving the Way for Multicore Open Systems Architectures

[Figure: multicore tools; design paths diverged ca. 2005]
But First, A Few Infrastructure Issues

Performance was doubling every 18 months (Moore’s Law), but not anymore.
2000 International Technology Roadmap for Semiconductors (ITRS00)

[Chart: on-chip local clock (GHz) vs. year & process node, 2007 (65nm) through 2016 (22nm). In 2000, ITRS00 predicted a slightly lower improvement rate vs. historical Moore’s Law for the 2008-2014 timeframe.]

• ~3.5X throughput every 3 yrs predicted for multiple independent cores (~same as 4X every 3 yrs for historical Moore’s Law)
  – 1.4X clock speed every 3 yrs for constant power
  – 2.5X transistors/chip every 3 yrs (partially driven by economics) for constant chip size (chip size growth ended ~1998)
2001-2002 International Technology Roadmap for Semiconductors (ITRS01-02)

[Chart: on-chip local clock (GHz) vs. year & process node, 2007 (65nm) through 2016 (22nm). ITRS01-02 predicted a substantially lower improvement rate vs. ITRS00, but higher clock speeds.]

• 2.8X throughput every 3 yrs predicted for multiple independent cores
  – 1.4X clock speed every 3 yrs for constant power (same as ITRS00)
  – 2X transistors/chip every 3 yrs for constant chip size (less than ITRS00)
2003-2006 International Technology Roadmap for Semiconductors (ITRS03-06)

[Chart: on-chip local clock (GHz) vs. year & process node, 2007 (65nm) through 2016 (22nm), annotated "1.4X increase". ITRS03-06 predicted the same improvement rate as ITRS01-02, but even higher clock speeds.]

• 2.8X throughput every 3 yrs predicted for multiple independent cores
  – 1.4X clock speed every 3 yrs for constant power (same as ITRS00-02)
  – 2X transistors/chip every 3 yrs for constant chip size (same as ITRS01-02)
2007 International Technology Roadmap for Semiconductors (ITRS07)

[Chart: on-chip local clock (GHz) vs. year & process node, 2007 (65nm) through 2016 (22nm), annotated "4.3X reduction". ITRS07 predicts lower clock speeds & improvement rate vs. ITRS00-06.]

• ~2.5X throughput every 3 yrs predicted for multiple independent cores
  – 1.23X clock speed every 3 yrs for constant power (less than ITRS00-06)
  – 2X transistors/chip every 3 yrs for constant chip size (same as ITRS01-06)
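The per-3-year throughput figures quoted across these ITRS slides are simply the product of the clock-speed factor and the transistors-per-chip factor. A quick arithmetic check (illustrative Python, not part of the original deck):

```python
def throughput_factor(clock_gain, transistor_gain):
    """Throughput gain per 3 yrs for multiple independent cores:
    clock speedup times transistor (i.e. core count) growth."""
    return clock_gain * transistor_gain

print(round(throughput_factor(1.4, 2.5), 2))   # ITRS00: 3.5X every 3 yrs
print(round(throughput_factor(1.4, 2.0), 2))   # ITRS01-06: 2.8X every 3 yrs
print(round(throughput_factor(1.23, 2.0), 2))  # ITRS07: 2.46X, quoted as ~2.5X
```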
COTS Compute Node (processor, memory & I/O) Performance History & Projections (2Q08)

[Chart: GFLOPS/W (billion 32-bit floating point operations per sec-watt for computing a 1K-point complex FFT) vs. year, 1993-2023, with process geometry shrinking from 1000nm toward 10nm. Data points range from the Intel i860 µP (1µm) to the IBM Cell (90nm).]

Projected improvement rates, although smaller than historical values, are still substantial.
Notional Cost (cumulative) & Schedule for COTS 90nm Cell Broadband Engine

[Chart: cumulative development cost (millions of $) vs. year, 2000-2009. Milestones: IBM, Sony & Toshiba hold architectural discussions; Austin (Texas) design center opens ($400M joint investment in Cell design); technology evaluation systems shipped; Sony exits future Cell development after investing $1.7B.]

Cell Broadband Engine: 205 GFLOPS (peak, 32-bit) @ 100W (est.), ~2 GFLOPS/W
Multicore Open Systems Architecture Example

• LEON3
  – 32-bit SPARC V8 processor developed by Gaisler Research (Aeroflex as of 7/14/08) for the European Space Agency
  – Synthesizable VHDL (GNU General Public License) & documentation downloadable from www.gaisler.com
  – Open source software support (embedded Linux, C/C++ cross-compiler, simulator & symbolic debugger)
• 0.25µm LEON3FT
  – Commercial fault-tolerant implementation of LEON3
  – 75 MFLOPS/W (150 MIPS & 30 MFLOPS @ 150 MHz for 0.4W)
• 90nm quad-core LEON3FT
  – System emulated with a single SRAM-based FPGA
  – 133 MFLOPS/W (4x500 MIPS & 4x100 MFLOPS for 3W)
  – Each core occupies <1mm² including caches
  – MOSIS fabricates 65nm & 90nm die up to 360mm² (IBM process)

How can we improve performance (FLOPS/W), which lags COTS by up to 9 yrs (15X) in this example?
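The "9 yrs (15X)" lag quoted above follows from the power-efficiency figures on this slide and the Cell slide, combined with the ~2.5X-every-3-yrs multicore improvement rate. A sketch of that arithmetic (illustrative Python, not from the deck):

```python
import math

# Figures taken from the slides; the variable names are mine.
cell_gflops_per_w = 2.0        # 90nm Cell Broadband Engine: ~2 GFLOPS/W
leon3ft_gflops_per_w = 0.133   # 90nm quad-core LEON3FT: 133 MFLOPS/W

gap = cell_gflops_per_w / leon3ft_gflops_per_w
print(round(gap))  # ~15X efficiency gap

# At ~2.5X improvement every 3 yrs, a 15X gap is roughly 9 years:
years = 3 * math.log(gap) / math.log(2.5)
print(round(years, 1))  # ~8.9 yrs
```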
Notional Cost (cumulative) & Schedule for 90nm LEON3FT Multicore Processor

[Chart: cumulative development cost (millions of $) vs. year, 2000-2011. Milestones: 0.25µm LEON3FT; 65nm fab access; 90nm quad-core LEON3FT.]

$3M estimated development cost is mostly staff expense, with schedule determined by foundry.
Objective & Schedule

• Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures
  – Where are we now?
  – What needs to be done?
• Schedule
  – 1525: Overview
  – 1540: Guest speaker: Mr. Markus Levy
  – 1600: Introduction of the panelists
  – 1605: Previously submitted questions for the panel
  – 1635: Open forum
  – 1655: Conclusions & the way ahead
  – 1700: Closing remarks & adjourn
Panel Session: Paving the Way for Multicore Open Systems Architectures

Moderator: Dr. James C. Anderson, MIT Lincoln Laboratory

• Prof. Saman Amarasinghe, MIT Computer Science & Artificial Intelligence Laboratory (CSAIL)
• Mr. Markus Levy, The Multicore Association & The Embedded Microprocessor Benchmark Consortium (EEMBC)
• Dr. Steve Muir, Chief Technology Officer, Vanu, Inc.
• Dr. Matthew Reilly, Chief Engineer, SiCortex, Inc.
• Mr. John Rooks, Air Force Research Laboratory (AFRL/RITC), Emerging Computing Technology

Panel members & audience may hold diverse, evolving opinions.
Objective & Schedule

• Objective: Assess the infrastructure (hardware, software & support) that enables use of multicore open systems architectures
  – Where are we now?
  – What needs to be done?
• Schedule
  – 1525: Overview
  – 1540: Guest speaker: Mr. Markus Levy
  – 1600: Introduction of the panelists
  – 1605: Previously submitted questions for the panel
  – 1635: Open forum
  – 1655: Conclusions & the way ahead
  – 1700: Closing remarks & adjourn
Conclusions & The Way Ahead

• Despite the industry slowdown, embedded processors are still improving exponentially (at 2/3 of the historical Moore’s Law rate)
• Although performance improvements in multicore designs (2.5X every 3 yrs) continue to outpace those of uni-processors (2X every 3 yrs), the “performance gap” is less than previously projected
• New tools and methodologies will be needed to maximize the benefits of using multicore open systems architectures
  – Power & packaging issues
  – Cost & availability issues
  – Training & ease-of-use issues
  – Platform independence issues
• Although many challenges remain in reducing the performance gap between highly specialized systems and multicore open systems architectures, the latter will help insulate users from manufacturer-specific issues

Success still depends on the ability of foundries to provide smaller geometries & increasing speed for constant power (driven by large-scale COTS product economics).
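The "2/3 of the historical Moore’s Law rate" figure above can be checked by comparing exponential growth rates: ~2.5X every 3 yrs for multicore today vs. ~4X every 3 yrs historically (illustrative Python, not from the deck):

```python
import math

# Ratio of exponential growth rates over the same 3-yr period:
# log(2.5) / log(4) gives the current rate as a fraction of the historical one.
rate_ratio = math.log(2.5) / math.log(4)
print(round(rate_ratio, 2))  # 0.66, i.e. about 2/3
```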
Backup Slides
COTS ASIC: 90nm IBM Cell Broadband Engine (4Q06)

• 100W (est.) @ 3.2 GHz
• 170 GFLOPS sustained for 32-bit flt pt 1K cmplx FFT (83% of peak)
• 16 Gbyte memory options (~10 FLOPS/byte)
  – COTS Rambus XDR DRAM (Cell is designed to use only this memory): 256 chips, 690W (note: Rambus devices may not be 3D stackable due to 2.7W/chip power consumption)
  – Non-COTS solution: design a bridge-chip ASIC (10W est.) to allow use of 128 DDR2 SDRAM devices (32W): 128 chips in 3D stacks to save space (0.25W/chip); operate many memory chips in parallel; buffer to support Rambus speeds; increased latency vs. Rambus
• 40W budget for external 27 Gbytes/sec simultaneous I&O (using same non-COTS bridge chip to handle I/O with Cell)
• Single non-COTS CN (compute node) using DDR2 SDRAM
  – 170 GFLOPS sustained for 200W (182W est. for CN plus 18W for 91% efficient DC-to-DC converter)
  – 0.85 GFLOPS/W & 56 GFLOPS/L
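The back-of-envelope numbers on this slide are internally consistent; a quick check (illustrative Python, not part of the original deck):

```python
# Figures from the slide; variable names are mine.
sustained_gflops = 170.0
peak_gflops = 205.0
print(round(sustained_gflops / peak_gflops * 100))  # 83 (% of peak)

rambus_power_w = 256 * 2.7    # 256 XDR chips at 2.7W/chip
print(round(rambus_power_w))  # 691 (quoted as 690W)

ddr2_power_w = 128 * 0.25     # 128 DDR2 chips in 3D stacks at 0.25W/chip
print(ddr2_power_w)           # 32.0 (quoted as 32W)

cn_input_w = 182 / 0.91       # 182W through a 91%-efficient DC-to-DC converter
print(round(cn_input_w))      # 200 (W at the input)

print(round(sustained_gflops / cn_input_w, 2))  # 0.85 GFLOPS/W
```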
COTS Compute Node Performance History & Projections (2Q08)

[Chart: GFLOPS/W (billion 32-bit floating point operations per sec-watt for computing a 1K-point complex FFT) vs. year, 1993-2023, showing ~4X improvement every 3 yrs. Data points include: Intel i860 µP (1000nm), Analog Devices SHARC DSP (600nm), Catalina Research Pathfinder-1 ASIC (350nm), Motorola MPC7400 PowerPC RISC with AltiVec (180-220nm), Texas Memory Systems TM-44 Blackbird ASIC (180nm), Xilinx Virtex FPGA (180nm), MPC7410 (180nm), Virtex II (150nm), Freescale MPC7447A (130nm), Clearspeed CS301 ASIC (130nm), Virtex-4 (90nm), MPC7448 (90nm), IBM Cell (90nm), Intel Polaris (65nm); projections extend to the 22nm & 16nm nodes.]

Compute Node includes FFT (fast Fourier transform) processor, memory (10 FLOPS/byte), simultaneous I&O (1.28 bits/sec per FLOPS) & DC-to-DC converter.
World’s Largest Economies: 2000 vs. 2024

[Chart: population & GDP* shares for 2000 (6 billion people total) and 2024 (8 billion total). * Gross domestic product (purchasing power parity). “Europe’s Top 5” are Germany, Great Britain, France, Italy & Spain.]

U.S. population grows by 1/3 & income shrinks from 5X to <4X the world average.
Highest-performance COTS (commercial off-the-shelf) ADCs (analog-to-digital converters), 3Q08

[Chart: effective number of bits (0-20) vs. sampling rate (0.1-10000 million samples/sec), grouped by era: 1986-1990, 1991-1995, 1996-2000, 2001-2005, 2006-2008. Annotations: 2003-2007: ~0.25 bit/yr @ 400 MSPS; 2000-2007: 2X speed in 6.75 yrs (but up to 0.5 bit processing gain is more than offset by loss of 1 effective bit).]

Historic device-level improvement rates may not be sustainable as technical & economic limits are approached.
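The "up to 0.5 bit processing gain" for a 2X speed increase follows from the standard oversampling relation (not stated in the deck itself): averaging an oversampled signal improves SNR by 10·log10(OSR) dB, and each 6.02 dB of SNR is worth one effective bit. A sketch:

```python
import math

def processing_gain_bits(oversampling_ratio):
    """Effective bits gained by oversampling & averaging:
    10*log10(OSR) dB of SNR improvement, at 6.02 dB per bit."""
    return 10 * math.log10(oversampling_ratio) / 6.02

# Doubling the sampling rate buys about half an effective bit,
# which is why a 1-bit ENOB loss more than offsets it (per the slide):
print(round(processing_gain_bits(2), 2))  # 0.5
```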
SFDR (spur-free dynamic range) for Highest-performance COTS ADCs, 3Q08

[Chart: spur-free dynamic range (30-120 dB) vs. sampling rate (0.1-10000 million samples/sec), grouped by era: 1986-1990, 1991-1995, 1996-2000, 2001-2005, 2006-2008.]

SFDR performance often limits the ability to subsequently achieve “processing gain.”
Energy per Effective Quantization Level for Highest-performance COTS ADCs, 3Q08

[Chart: quantization energy (0-350 pJ) vs. sampling rate (0.1-10000 million samples/sec), grouped by era: 1986-1990, 1991-1995, 1996-2000, 2001-2005, 2006-2008.]

Recent power decrease (driven by the mobile devices market) comes from smaller geometry & advanced architectures.
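The metric on this slide is consistent with the common ADC figure of merit (energy per effective conversion step): FOM = P / (2^ENOB · fs). A minimal sketch with hypothetical example numbers (neither the formula's attribution here nor the numbers are from the chart):

```python
def adc_fom_pj(power_w, enob_bits, fs_sps):
    """Energy per effective quantization level, in picojoules:
    power / (number of effective levels * sampling rate)."""
    return power_w / (2 ** enob_bits * fs_sps) * 1e12

# Hypothetical converter: 1 W, 12 effective bits, 100 MSPS.
print(round(adc_fom_pj(1.0, 12, 100e6), 2))  # 2.44 pJ per level
```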
Resolution Improvement Timeline for Highest-performance COTS ADCs, 1986-2008

[Chart: sampling rate (0.1-10000 MSPS) vs. year, 1986-2008. Noted: a non-COTS chip set achieving ENOB=4.6 @ 12 GSPS.]