Tilera TILE Gx Presentation
Tile Processors:
Many-Core for Embedded
and Cloud Computing
Richard Schooler
VP Software Engineering
Tilera Corporation
Exploiting Natural Parallelism
High-performance applications have lots of
parallelism!
– Embedded apps:
• Networking: packets, flows
• Media: streams, images, functional & data parallelism
– Cloud apps:
• Many clients: network sessions
• Data mining: distributed data & computation
Lots of different levels:
– SIMD (fine-grain data parallelism)
– Thread/process (medium-grain task parallelism)
– Distributed system (coarse-grain job parallelism)
HPEC, 15 September 2010
© 2010 Copyright Tilera Corporation. All Rights Reserved.
Everyone is going Manycore, but can the
architecture scale?
The computing world is ready for radical change
[Figure: #Cores (1, 2, 4, 32, n) versus Time (2005 to 2020); labeled points include Sun, IBM Cell, Larrabee, and Intel; annotation "Performance & Performance/W Gap"]
Key “Many-Core” Challenges: The 3 P’s
Performance challenge
– How to scale from 1 to 1000 cores – the number of
cores is the new Megahertz
Power efficiency challenge
– Performance per watt is the new metric – systems are
often constrained by power & cooling
Programming challenge
– How to provide a converged many-core solution in a
standard programming environment
“Problems cannot be solved by the same
level of thinking that created them.”
Current technologies fail to deliver
– Only incremental performance increases
– High power
– Low level of integration
– Increasingly bigger cores
We need new thinking to get
– 10x performance
– 10x performance per watt
– Converged computing
– Standard programming models
Stepping Back: How Did We Get Here?
Moore’s Conundrum:
More devices =>? More performance
Old answers: More complex cores; bigger
caches
– But power-hungry
New answers: More cores
– But do conventional approaches scale?
Diminishing returns!
The Old Challenge: CPU-on-a-chip
20 MIPS CPU
in 1987
Few thousand gates
The Opportunity: Billions of Transistors
[Figure label: Old CPU]
What to do with all those transistors?
Take Inspiration from ASICs
[Figure: ASIC layout with several memory (mem) blocks]
ASICs have high performance and low power
• Custom-routed, short wires
• Lots of ALUs, registers, memories – huge on-chip parallelism
But how to build a programmable chip?
Replace Long Wires with Routed
Interconnect
[Figure: control (Ctrl) block and ALUs]
[IEEE Computer ’97]
From Centralized Clump of CPUs …
[Figure: ALUs clustered around a shared register file (RF) and bypass net]
… To Distributed ALUs, Routed Bypass
Network
Scalar Operand Network (SON) [TPDS 2005]
[Figure: ALUs connected through routers (R)]
From a Large Centralized Cache…
…to a Distributed Shared Cache
[Figure: tiles, each with an ALU, a router (R), and a cache ($)]
[ISCA 1999]
Distributed Everything + Routed
Interconnect => Tiled Multicore
[Figure: grid of tiles, each with an ALU, a router (R), and a cache ($)]
Each tile is a processor, so programmable
Tiled Multicore Captures ASIC Benefits
and is Programmable
Scales to large numbers of cores
Modular – design and verify 1 tile
Power efficient
– Short wires plus locality opts – CV²f
– Chandrakasan effect, more cores at lower freq and voltage – CV²f
[Figure: current bus architecture (processor cores on a shared bus) versus a tiled mesh; Core + Switch (S) = Tile]
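The power argument on this slide rests on the dynamic-power relation P = C·V²·f. The sketch below is only an illustration of the "more cores at lower frequency and voltage" trade-off. The specific scaling factors (two cores, half the frequency, 0.7 of nominal voltage) are assumptions chosen for the example, not figures from the presentation.

```python
def dynamic_power(c, v, f):
    """Dynamic switching power P = C * V^2 * f, in arbitrary normalized units."""
    return c * v * v * f

# Baseline: one core at nominal voltage and frequency (all normalized to 1.0).
single = dynamic_power(c=1.0, v=1.0, f=1.0)
throughput_single = 1 * 1.0  # cores * frequency

# Hypothetical alternative: two cores, each at half the frequency.
# ASSUMPTION: running at half frequency lets the supply drop to 0.7 of nominal.
n_cores, v_scaled, f_scaled = 2, 0.7, 0.5
multi = n_cores * dynamic_power(c=1.0, v=v_scaled, f=f_scaled)
throughput_multi = n_cores * f_scaled  # same aggregate cycles per second

print(f"one fast core:   power {single:.2f}, throughput {throughput_single:.2f}")
print(f"two slow cores:  power {multi:.2f}, throughput {throughput_multi:.2f}")
```

Under these assumed factors the two slower cores deliver the same aggregate cycle rate at roughly half the dynamic power, provided the workload is parallel enough to use both cores.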
Tilera processor portfolio
Demonstrating the scale of many-core
[Figure: product roadmap, 2007 to 2010: TILE64, TILEPro36, and TILEPro64, then Gx16 & Gx36 (2x the performance) and Gx64 & Gx100 (up to 8x performance)]
TILE-Gx100™:
Complete System-on-a-Chip with 100 64-bit cores
1.2GHz – 1.5GHz
32 MBytes total cache
546 Gbps peak mem BW
200 Tbps iMesh BW
80-120 Gbps packet I/O
– 8 ports XAUI / 2 XAUI
– 2 40Gb Interlaken
– 32 ports 1GbE (SGMII)
80 Gbps PCIe I/O
– 3 StreamIO ports (20Gb)
Wire-speed packet eng.
– 120Mpps
MiCA engines:
– 40 Gbps crypto
– compress & decompress
[Figure: block diagram with mPIPE and MiCA engines, DDR3 memory controllers, PCIe 2.0 (8-lane and 4-lane), Interlaken, 10 GbE XAUI and 4x GbE SGMII ports over SerDes, and flexible I/O (UART x2, USB x2, JTAG, I2C, SPI)]
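To put the TILE-Gx100 headline numbers in perspective, the following back-of-the-envelope sketch estimates the average cycle budget per packet at the quoted 120 Mpps. It assumes, purely for illustration, that packets are spread evenly over all 100 cores and that every core does packet work; it ignores any offload to the packet and MiCA engines.

```python
def cycles_per_packet(cores, clock_hz, packets_per_sec):
    """Average cycle budget per packet if load is spread evenly over all cores."""
    per_core_pps = packets_per_sec / cores
    return clock_hz / per_core_pps

# Figures from the slide: 100 cores, 1.2 GHz to 1.5 GHz, 120 Mpps.
# ASSUMPTION: even distribution, all cores handling packets.
low = cycles_per_packet(cores=100, clock_hz=1.2e9, packets_per_sec=120e6)
high = cycles_per_packet(cores=100, clock_hz=1.5e9, packets_per_sec=120e6)

print(f"cycle budget per packet: {low:.0f} to {high:.0f} cycles")
```

A budget on the order of a thousand cycles per packet is why the per-packet software path has to stay short at wire speed.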
TILE-Gx36™:
Scaling to a broad range of applications
36 Processor Cores
866 MHz, 1.2 GHz, 1.5 GHz clock
12 MBytes total cache
40 Gbps total packet I/O
– 4 ports 10GbE (XAUI)
– 16 ports 1GbE (SGMII)
48 Gbps PCIe I/O
– 2 16Gbps Stream IO ports
Wire-speed packet engine
– 60Mpps
MiCA engine:
– 20 Gbps crypto
– Compress & decompress
[Figure: block diagram with mPIPE and MiCA engines, DDR3 memory controllers, PCIe 2.0 (8-lane and 4-lane), 10 GbE XAUI and 4x GbE SGMII ports over SerDes, and flexible I/O (UART x2, USB x2, JTAG, I2C, SPI)]
Full-Featured General Converged Cores
Processor
– Each core is a complete computer
– 3-way VLIW CPU
– SIMD instructions: 32, 16, and 8-bit ops
– Instructions for video (e.g., SAD) and networking
– Protection and interrupts
Memory
– L1 cache and L2 Cache
– Virtual and physical address space
– Instruction and data TLBs
– Cache integrated 2D DMA engine
Runs SMP Linux
Runs off-the-shelf C/C++ programs
Signal processing and general apps
[Figure: one tile. Core: register file, three execution pipelines. Cache: 16K L1-I, I-TLB, 8K L1-D, D-TLB, 64K L2, 2D DMA. Terabit switch.]
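The slide mentions video instructions such as SAD (sum of absolute differences), the inner operation of block-matching motion estimation. The sketch below shows what that operation computes, in plain Python rather than with any Tile instruction; the function names and sample pixel values are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks of 8-bit pixels."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def best_match(target, candidates):
    """Return (index, score) of the candidate block with the lowest SAD."""
    scores = [sad(target, c) for c in candidates]
    index = min(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]

# Hypothetical 4-pixel blocks; real codecs compare larger blocks such as 16x16.
target = [10, 20, 30, 40]
candidates = [
    [0, 0, 0, 0],
    [12, 18, 33, 40],
    [10, 20, 30, 41],
]
print(best_match(target, candidates))
```

A hardware SAD instruction applies the subtract, absolute value, and accumulate steps to several packed 8-bit pixels at once, which is why it appears alongside the SIMD operations.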
Software must complement the hardware
Enable re-use of existing code-bases
Standards-based development environment
– e.g. gcc, C, C++, Java, Linux
– Comprehensive command-line & GUI-based tools
Support multiple OS models
– One OS running SMP
– Multiple virtualized OS’s with protection
– Bare metal or “zero-overhead” with background OS environment
Support range of parallel programming styles
– Threaded programming (pThreads, TBB)
– Run-to-Completion with load-balancing
– Decomposition & Pipelining
– Higher-level frameworks (Erlang, OpenMP, Hadoop etc.)
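As an illustration of the run-to-completion style with load balancing listed above, here is a minimal sketch: workers pull items from one shared queue and each handles an item from start to finish. It uses Python threads only to show the structure (Python's interpreter lock means it does not demonstrate real speedup), and `run_to_completion` and `checksum` are hypothetical names, not part of any Tilera API.

```python
import queue
import threading

def run_to_completion(items, handler, n_workers=4):
    """Process items with workers that each pull from one shared queue.

    A worker finishes one item completely before taking the next, so
    workers that finish early naturally pick up more items.
    """
    work = queue.Queue()
    for index, item in enumerate(items):
        work.put((index, item))

    results = [None] * len(items)

    def worker():
        while True:
            try:
                index, item = work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            results[index] = handler(item)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def checksum(packet):
    """Hypothetical per-packet work: a trivial byte checksum."""
    return sum(packet) % 256

packets = [bytes([i, i + 1, i + 2]) for i in range(8)]
print(run_to_completion(packets, checksum))
```

In the pipelined style, by contrast, each worker would own one stage and pass items to the next stage through its own queue.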
Software Roadmap
Standards & open source integration
– Compiler: gcc, g++ 4.4+
– Linux:
• Kernel: Tile architecture integrated into 2.6.36
• User-space: glibc, broader set of standard
packages
Extended programming and runtime
environments
– Java: porting OpenJDK
– Virtualization: porting KVM
Tile architecture:
The future of many-core computing
Multicore is the way forward
– But we need the right architecture to utilize it
The Tile architecture addresses the challenges
– Scales to 100’s of cores
– Delivers very low power
– Runs your existing code
Standards-based software
– Familiar tools
– Full range of standard programming environments
Thank you!
Questions?
Research Vision to Commercial Product
[Figure: timeline with years 1996, 1997, 2002, 2007, 2010, and 2018. Labels: "A blank slate"; "MIT Raw, 16 cores"; "Tile Processor, 64 cores"; "TILE-Gx100, 100 cores" (with its block diagram); "The opportunity"; "1B transistors in 2007"; "100B transistors"; "The future?"]
Standard tools and programming model
Multicore Development Environment
Standards-based tools
– Standard programming: SMP Linux 2.6; ANSI C/C++; Java, PHP
– Integrated tools: GCC compiler; standard gdb, gprof; Eclipse IDE
– Innovative tools: multicore debug; multicore profile
Standard application stack
– Application layer: open source apps; standard C/C++ libs
– Operating System layer: 64-way SMP Linux; Zero Overhead Linux; bare metal environment
– Hypervisor layer: virtualizes hardware; I/O device drivers
[Figure: layered stack, top to bottom: applications and libraries; operating system kernel and drivers; hypervisor (virtualization and high speed I/O drivers); Tile Processor (Tile, Tile, Tile, Tile, …)]
Standard Software Stack
[Figure: software stack. Row labels: Management Protocols; Infrastructure Apps; Language Support; Compiler, OS, Hypervisor. Entries shown: IPMI 2.0; Transcoding; Network Monitoring; Perl; Commercial Linux Distribution; gcc & g++]
High single core performance
Comparable to Atom & ARM Cortex-A9 cores
Single-Core Single thread CoreMark™ Comparison
[Figure: bar chart of CoreMark score (axis ticks up to 3,500) for Tilera TILEPro64 at 866 MHz, Tilera TILE-Gx36 at 1.25 GHz, ARM Cortex-A9 at 1 GHz, and Intel Atom N270 at 1600 MHz; bar values not recoverable]
- Data for TILEPro, ARM Cortex-A9, Atom N270 is available on the CoreMark website http://coremark.org/home.php
- TILE-Gx and single thread Atom results were measured in Tilera labs
- Single core, single thread result for ARM is calculated based on chip scores
Significant value across multiple markets
Networking
• Classification
• L4-7 Services
• Load Balancing
• Monitoring/QoS
• Security
Multimedia
• Video Conferencing
• Media Streaming
• Transcoding
Wireless
• Base Station
• Media Gateway
• Service Nodes
• Test Equipment
Cloud
• Apache
• Memcached
• Web Applications
• LAMP stack
High Performance
Low Power
Standard Programming
Over 100 customers
Over 40 customers going into production
Tier 1 customers in all target markets
Targeting markets with highly parallel
applications
Web
• Web Serving
• In Memory Cache
• Data Mining
Media delivery
• Transcoding
• Video delivery
• Wireless media
Government
• Lawful interception
• Surveillance
• Other
Common Themes
• Hundreds to thousands of servers running each application
• Thousands of parallel transactions
• All need better performance and power efficiency
The Tile Processor Architecture
Mesh interconnect, power-optimized cores
2 Dimensional mesh network
Core + Switch = Tile
200 Tbps on-chip bandwidth
Scales to large numbers of cores
Modular: Design-and-verify 1 tile
Power efficient:
– Short wires & locality optimize CV²f
– Chandrakasan effect, more cores at lower freq and voltage – CV²f
[Figure: traditional bus/ring architecture versus a 2D mesh of tiles, each a processor core plus switch (S)]
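One way to see why a 2D mesh scales better than the bus/ring layout shown for comparison is to count average hops between nodes. The sketch below does this for 64 nodes. It assumes dimension-ordered (X then Y) routing on the mesh and a bidirectional ring; both are illustrative assumptions, since the slides do not state the routing policy.

```python
from itertools import product

def mesh_hops(src, dst):
    """Hops between two tiles under dimension-ordered (X then Y) routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def ring_hops(src, dst, n):
    """Hops between two nodes on a bidirectional ring of n nodes."""
    d = abs(src - dst)
    return min(d, n - d)

def average(values):
    values = list(values)
    return sum(values) / len(values)

side = 8                      # hypothetical 8x8 grid, 64 tiles
n = side * side
tiles = list(product(range(side), repeat=2))

mesh_avg = average(mesh_hops(a, b) for a in tiles for b in tiles if a != b)
ring_avg = average(ring_hops(a, b, n) for a in range(n) for b in range(n) if a != b)

print(f"8x8 mesh average hops:     {mesh_avg:.2f}")
print(f"64-node ring average hops: {ring_avg:.2f}")
```

The mesh also offers many disjoint paths, so aggregate bandwidth grows with the number of tiles, whereas every transfer on a bus or ring shares the same few links.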
Distributed “everything”
Cache, memory management, connectivity
Big centralized caches don’t scale
– Contention
– Long latency
– High power
Distributed caches have numerous
benefits
– Lower power (less logic lit-up per access)
– Exploit locality (local L1 & L2)
– Can exploit various cache placement
mechanisms to enhance performance
[Figure: distributed caches forming a completely HW coherent cache system with a global 64-bit address space]
Highest compute density
2U form factor
4 hot pluggable modules
8 Tilera TILEPro processors
512 general purpose cores
1.3 trillion operations/sec
Power efficient and eco-friendly server
10,000 cores in an 8-kilowatt rack
35-50 watts max per node
Server power of 400 watts
90%+ efficient power supplies
Shared fans and power supplies
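The server and rack figures quoted on these two slides are mutually consistent, as the short check below shows. It assumes the whole 8 kW rack budget goes to 400 W servers of the 2U, 512-core type described on the previous slide, and that a "node" corresponds to one of the 8 processors in a server; both are readings of the slides rather than stated facts.

```python
# Figures quoted on the slides.
processors_per_server = 8
cores_per_server = 512
server_watts = 400
rack_watts = 8000
node_watts_range = (35, 50)

# Derived values (ASSUMPTION: the rack budget is spent entirely on such servers).
cores_per_processor = cores_per_server // processors_per_server
servers_per_rack = rack_watts // server_watts
cores_per_rack = servers_per_rack * cores_per_server
watts_per_core = rack_watts / cores_per_rack

# ASSUMPTION: one node per processor, so node power sums to server power.
nodes_watts = tuple(w * processors_per_server for w in node_watts_range)

print(f"cores per processor: {cores_per_processor}")
print(f"servers per rack:    {servers_per_rack}")
print(f"cores per rack:      {cores_per_rack}")
print(f"watts per core:      {watts_per_core:.2f}")
print(f"node power per server: {nodes_watts[0]} to {nodes_watts[1]} W")
```

Twenty such servers give 10,240 cores, which matches the "10,000 cores" headline at well under one watt per core.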
Coherent distributed cache system
Globally Shared Physical Address Space
– Each tile has local L1 and L2 caches
– Aggregate of L2 serves as a globally shared L3
– Any cache block can be replicated locally
– Hardware tracks sharers, invalidates stale copies
Dynamic Distributed Cache (DDC™)
– Memory pages distributed across all cores or homed by allocating core
Coherent I/O
– Hardware maintains coherence
– I/O reads/writes coherent with tile caches
– Reads/writes delivered by HW to home cache
– Header/packet delivered directly to tile caches
Distributed cache
– Full Hardware Cache Coherence
– Standard shared memory programming model
[Figure: completely coherent system: tile array surrounded by DDR2 controllers 0 to 3, PCIe 0 and 1, XAUI 0 and 1, GbE 0 and 1, SerDes, and flexible I/O (UART, JTAG, SPI, I2C)]
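The phrases "any cache block can be replicated locally" and "hardware tracks sharers, invalidates stale copies" describe a directory-style, write-invalidate scheme. The toy model below illustrates that general idea for a single cache line. It is a sketch under simplifying assumptions, not Tilera's actual DDC protocol, and the class and method names are hypothetical.

```python
class HomeDirectory:
    """Toy directory for one cache line: the home tile tracks which tiles hold a copy."""

    def __init__(self, value=0):
        self.value = value    # authoritative copy at the home tile
        self.sharers = set()  # tiles currently caching the line
        self.copies = {}      # tile id -> locally replicated value

    def read(self, tile):
        """Tile reads the line; a miss replicates it locally and records the sharer."""
        if tile not in self.sharers:
            self.sharers.add(tile)
            self.copies[tile] = self.value
        return self.copies[tile]

    def write(self, tile, value):
        """Tile writes the line; every other sharer's copy is invalidated."""
        stale = self.sharers - {tile}
        for other in stale:
            del self.copies[other]
        self.sharers = {tile}
        self.value = value
        self.copies[tile] = value
        return stale

line = HomeDirectory(value=7)
line.read(0)                      # tile 0 caches the line
line.read(1)                      # tile 1 replicates it
invalidated = line.write(0, 9)    # tile 0 writes, tile 1's copy becomes stale
print(invalidated, line.read(1))  # tile 1 misses and refetches the new value
```

Because the tracking and invalidation are done in hardware, software sees an ordinary shared-memory model and needs no explicit cache management.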