Intel® Xeon Phi™ Coprocessor Architecture Overview Shuo Li, Mahesh Bhat Financial Services Engineering SSG, Intel.


Intel® Xeon Phi™ Coprocessor
Architecture Overview
Shuo Li, Mahesh Bhat
Financial Services Engineering
SSG, Intel
Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Sandy Bridge and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
• Intel, Core, Xeon, VTune, Cilk, Intel and Intel Sponsors of Tomorrow. and Intel Sponsors of Tomorrow. logo, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright ©2011 Intel Corporation.
• Hyper-Threading Technology: Requires an Intel® HT Technology enabled system, check with your PC manufacturer. Performance will vary depending on the specific hardware and software used. Not available on all Intel® Core™ processors. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading
• Intel® 64 architecture: Requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more information. For more information, visit http://www.intel.com/info/em64t
• Intel® Turbo Boost Technology: Requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost
iXPTC 2013
Agenda
• Intel® Many Integrated Core Architecture
• Intel® Xeon Phi™ Coprocessor Overview
• Core, Vector Processing Unit and Intel® IMCI
• Interconnect and Cache Hierarchy
• Performance
• Summary
Intel Many Integrated Core Architecture
Intel Architecture Multicore and Manycore
More cores. Wider vectors. Co-Processors.
Images do not reflect actual die sizes. Actual production die may differ from images.
Processor                                   Core(s)            Threads
Intel® Xeon® processor 64-bit               1                  2
Intel Xeon processor 5100 series            2                  2
Intel Xeon processor 5500 series            4                  8
Intel Xeon processor 5600 series            6                  12
Intel Xeon processor E5 Product Family      8                  16
Intel Xeon processor code name Ivy Bridge   10                 20
Intel Xeon processor code name Haswell      To be determined   To be determined
Intel® Xeon Phi™ Coprocessor                61                 244
Intel® Xeon Phi™ Coprocessor extends established CPU architecture
and programming concepts to highly parallel applications
Intel® Multicore Architecture
• Foundation of HPC Performance
• Suited for full scope of workloads
• Industry leading performance and performance/watt for serial & parallel workloads
• Focus on fast single core/thread performance with “moderate” number of cores

Intel® Many Integrated Core Architecture
• Performance and performance/watt optimized for highly parallelized compute workloads
• Common software tools with Xeon enabling efficient application readiness and performance tuning
• IA extension to Manycore
• Many cores/threads with wide SIMD
Consistent Tools & Programming Models
[Diagram: a single code base passes through common Compiler, Libraries and Parallel Models and targets both Multicore Intel® Xeon® Processors and the Manycore Intel® Xeon Phi™ Coprocessor.]
Standards Programming Models
Vectorize, Parallelize, & Optimize
Intel® Xeon Phi™ Coprocessor
Overview
Introducing Intel® Xeon Phi™ Coprocessors
Highly-parallel Processing for Unparalleled Discovery
Groundbreaking: differences
Up to 61 IA cores/1.1 GHz/ 244 Threads
Up to 8GB memory with up to 352 GB/s bandwidth
512-bit SIMD instructions
Linux operating system, IP addressable
Standard programming languages and tools
Leading to Groundbreaking results
Over 1 TeraFlop/s double precision peak performance (note 1)
Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server (note 2)
Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server (note 3)
For more information go to http://www.intel.com/performance. Notes 1, 2 & 3: see backup for system configuration details.
Intel® Xeon Phi™ Architecture Overview
• Cores: 61 cores at 1.1 GHz, in-order, support 4 threads
• 512-bit Vector Processing Unit, 32 native registers
• 8 memory controllers, 16-channel GDDR5 MC
• PCIe GEN2
• High-speed bi-directional ring interconnect
• Fully coherent L2 cache
• Reliability features: parity on L1 cache, ECC on memory, CRC on memory IO, CAP on memory
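The headline figures quoted on these slides (over 1 TeraFlop/s double precision, up to 352 GB/s) can be reproduced from the core count, clock and vector width. The sketch below is a back-of-the-envelope calculation, not an Intel formula: it assumes one fused multiply-add (2 flops) per SIMD lane per cycle, and it assumes each of the 16 GDDR5 channels is 4 bytes wide and runs at 5.5 GT/s. Neither the channel width nor the transfer rate is stated in the slides; the function names are illustrative.

```python
def peak_gflops(cores, ghz, simd_bits, element_bits, flops_per_lane=2):
    """Peak GF/s = cores x clock (GHz) x SIMD lanes x flops per lane per cycle.

    flops_per_lane=2 assumes one fused multiply-add per lane per cycle.
    """
    lanes = simd_bits // element_bits
    return cores * ghz * lanes * flops_per_lane


def peak_bandwidth_gbs(channels, bytes_per_transfer, gtransfers_per_s):
    """Peak GB/s = channels x bytes per transfer x transfers per second (GT/s)."""
    return channels * bytes_per_transfer * gtransfers_per_s


# 61 cores at 1.1 GHz with a 512-bit VPU (figures from the slides).
dp_peak = peak_gflops(61, 1.1, 512, 64)   # 8 double precision lanes
sp_peak = peak_gflops(61, 1.1, 512, 32)   # 16 single precision lanes

# Assumed channel parameters: 4-byte channels at 5.5 GT/s.
bandwidth = peak_bandwidth_gbs(16, 4, 5.5)

print(f"DP peak: {dp_peak:.1f} GF/s")
print(f"SP peak: {sp_peak:.1f} GF/s")
print(f"Memory bandwidth: {bandwidth:.1f} GB/s")
```

Under these assumptions the double precision result lands just above 1 TF/s, which is consistent with the "over 1 TeraFlop/s" claim.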
Core Architecture Overview
• 60+ in-order, low power IA cores in a ring interconnect
• Two pipelines
– Scalar Unit based on Pentium® processors
– Dual issue with scalar instructions
– Pipelined one-per-clock scalar throughput
• SIMD Vector Processing Engine
• 4 hardware threads per core
– 4 clock latency, hidden by round-robin scheduling of threads
– Cannot issue back to back inst in same thread
• Coherent 512KB L2 Cache per core
[Diagram: Instruction Decode feeds the Scalar Unit and the Vector Unit, each with its own registers (Scalar Registers, Vector Registers); 32K L1 I-cache and 32K L1 D-cache sit above the 512K L2 Cache, which connects to the Ring.]
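The rule that a thread cannot issue in two consecutive cycles explains why a single thread per core leaves half the issue slots empty. The following toy model illustrates this under simplifying assumptions: one issue slot per cycle, every thread always has an instruction ready, and the core picks the next eligible thread in round-robin order. It is a sketch of the scheduling rule only, not of the real pipeline; the function name is hypothetical.

```python
def issue_slots_used(active_threads, cycles):
    """Count instructions issued over `cycles` clock cycles.

    Assumed model: one issue slot per cycle, round-robin selection,
    and a thread that issued in cycle c is not eligible in cycle c + 1.
    """
    last_issue = [-2] * active_threads  # cycle of each thread's last issue
    next_thread = 0                     # round-robin starting point
    issued = 0
    for cycle in range(cycles):
        for offset in range(active_threads):
            t = (next_thread + offset) % active_threads
            if cycle - last_issue[t] >= 2:  # not back to back
                last_issue[t] = cycle
                issued += 1
                next_thread = (t + 1) % active_threads
                break
    return issued


for n in (1, 2, 4):
    used = issue_slots_used(n, 100)
    print(f"{n} thread(s): {used} of 100 issue slots used")
```

In this model two or more threads per core are enough to fill every issue slot, while one thread can use at most every other cycle.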
Core and Vector Processing Unit
Vector Processing Unit Extends the Scalar IA Core
[Diagram: core pipeline. Labels: four thread instruction pointers (Thread 0 IP to Thread 3 IP); pipeline stages PPF, PF, D0, D1, D2, E, WB; L1 TLB and 32KB L1 instruction cache (instruction cache miss, TLB miss); 16B/cycle (2 IPC); 4 threads in-order; Decoder; Pipe 0 (u-pipe) and Pipe 1 (v-pipe); uCode; VPU RF and VPU (512b SIMD); X87 RF and X87; Scalar RF with ALU 0 and ALU 1; L1 TLB and 32 KB L1 data cache (data cache miss, TLB miss); TLB Miss Handler; L2 TLB; HWP; 512KB L2 Cache; L2 CRI; On-Die Interconnect.]
Vector Processing Unit and Intel® IMCI
• Vector Processing Unit executes Intel® IMCI
– Intel® Initial Many Core Instructions
• 512-bit Vector Execution Engine
– 16 lanes of 32-bit single precision and integer operations
– 8 lanes of 64-bit double precision and integer operations
– 32 512-bit general purpose vector registers for each of the 4 threads
– 8 16-bit mask registers for each of the 4 threads, used for predicated execution
• Read/Write
– One vector length (512 bits) per cycle from/to Vector Registers
– One operand can come from memory at no extra cost
• IEEE 754 Standard Compliance
– 4 rounding modes: to nearest even, toward 0, toward +∞, toward -∞
– Hardware support for SP/DP denormal handling
– Sets status register VXCSR flags but does not raise hardware traps
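The lane counts and the role of the 16-bit mask registers can be illustrated in plain Python. This is an emulation on lists, assuming the usual write-mask behavior in which a lane whose mask bit is 0 keeps the old destination value; `lanes` and `masked_add` are illustrative names, not IMCI instructions.

```python
VECTOR_BITS = 512


def lanes(element_bits):
    """Number of elements that fit in one 512-bit vector register."""
    return VECTOR_BITS // element_bits


def masked_add(dst, a, b, mask):
    """Emulate a predicated vector add.

    Lane i receives a[i] + b[i] when bit i of `mask` is set;
    otherwise it keeps the previous destination value dst[i].
    """
    return [
        x + y if (mask >> i) & 1 else old
        for i, (old, x, y) in enumerate(zip(dst, a, b))
    ]


n = lanes(32)                      # 16 single precision lanes
a = list(range(n))
b = [100] * n
result = masked_add([0] * n, a, b, 0x00FF)  # only the low 8 lanes enabled
print(n, lanes(64))
print(result)
```

A 16-bit mask has exactly one bit per 32-bit lane; for 64-bit elements only the low 8 bits would be meaningful.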
Core extension Vector Processing Unit
[Diagram: VPU pipeline alongside the core pipeline. Core stages: PPF, PF, D0, D1, D2, E, WB. Vector stages: DEC, VC1, VC2, V1-V4, WB. Blocks: VPU RF (3R, 1W), Mask RF, Vector ALUs (16 x 32-bit wide, 8 x 64-bit wide) with Fused Multiply-Add, EMU, LD, ST, Scatter/Gather.]
Examples of Intel® IMCI
• Ternary Operands
– vop zmm1, zmm2, zmm3: zmm1 = zmm2 vop zmm3
– vop zmm1, zmm2, [ptr]: zmm1 = zmm2 vop MEM[ptr]
• Fused operations: Multiply-Add, Multiply-Subtract
– vfmadd132ps zmm1, zmm2, zmm3: zmm1 = zmm1 * zmm3 + zmm2
– vfmadd213ps zmm1, zmm2, zmm3: zmm1 = zmm2 * zmm1 + zmm3
– vfmadd231ps zmm1, zmm2, zmm3: zmm1 = zmm2 * zmm3 + zmm1
– IEEE 754-2008 compliant: a fused operation rounds once, giving 0.5 ulp error rather than the 1 ulp of two separate operations
• Prefetching
– Memory prefetching reduces the likelihood of L1 and L2 cache misses
– Intel® Xeon Phi™ Coprocessor has a hardware prefetcher
– L1 prefetch: vprefetch1 ptr, hint
– L2 prefetch: vprefetch2 ptr, hint
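The accuracy benefit of a fused multiply-add comes from rounding once instead of twice. The sketch below shows the effect with Python doubles rather than the single precision `ps` forms named on the slide. The fused result is emulated by computing the exact value with `fractions.Fraction` and rounding once on conversion to `float`; this is an illustration of the rounding behavior, not of how the hardware computes it.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """a * b + c computed exactly, then rounded once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def multiply_then_add(a, b, c):
    """a * b rounded to a double, then the sum rounded again."""
    return a * b + c


# a * b equals 1 - 2**-54 exactly, which is not representable as a double:
# the separate multiply rounds it to 1.0 and the information is lost.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

fused = fused_multiply_add(a, b, c)
unfused = multiply_then_add(a, b, c)
print("fused:  ", fused)
print("unfused:", unfused)
```

Here the two-step version returns zero while the single-rounding version keeps the small residual, which is the kind of difference the 0.5 ulp versus 1 ulp remark refers to.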
EMU - Extended Math Unit
• Single Precision Transcendental functions
• Minimax quadratic polynomial approximation
• Directly implements 4 Elementary functions
– vrcp23ps v1 {k1}, v0 // Reciprocal
– vrsqrt23ps v1 {k1}, v0 // Reciprocal square root
– vlog223ps v1 {k1}, v0 // Logarithmic
– vexp223ps v1 {k1}, v2 // Exponential
• Benefit other Derived Functions
– pow(x,y), sqrt(), div(), ln()

Function name   Latency   Throughput
exp2()          8         2
log2()          4         1
rcp()           4         1
rsqrt()         4         1
sqrt()          8         2
pow()           16        4
div()           8         2
ln()            8         2
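One way to see why pow(), sqrt(), div() and ln() benefit from the four elementary functions is through the standard identities that express each of them in terms of rcp, rsqrt, log2 and exp2. The sketch below uses those textbook identities with Python's `math` module standing in for the hardware approximations. The slides do not say which sequences the compiler or math library actually emits, so the derivations and all function names here are assumptions for illustration.

```python
import math


# Stand-ins for the four functions the EMU implements directly.
def rcp(x):
    return 1.0 / x


def rsqrt(x):
    return 1.0 / math.sqrt(x)


def log2(x):
    return math.log2(x)


def exp2(x):
    return 2.0 ** x


# Derived functions built only from the four elementary ones (assumed identities).
def derived_div(a, b):
    return a * rcp(b)                 # a / b = a * (1 / b)


def derived_sqrt(x):
    return x * rsqrt(x)               # sqrt(x) = x / sqrt(x), for x > 0


def derived_ln(x):
    return log2(x) * math.log(2.0)    # ln(x) = log2(x) * ln(2)


def derived_pow(x, y):
    return exp2(y * log2(x))          # x**y = 2**(y * log2(x)), for x > 0


print(derived_div(7.0, 2.0), derived_sqrt(9.0))
print(derived_ln(10.0), derived_pow(3.0, 2.5))
```

The identities are also in line with the table: each derived function costs roughly one or two elementary operations plus a multiply.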
Vector Instruction Performance
• VPU contains 16 SP ALUs and 8 DP ALUs
• Most VPU instructions have a latency of 4 cycles and a throughput of 1 per cycle
– Load/Store/Scatter have 7-cycle latency
– Convert/Shuffle have 6-cycle latency
• VPU instructions are issued in the u-pipe
• Certain instructions can also go to the v-pipe
– Vector Mask, Vector Store, Vector Packstore, Vector Prefetch, Scalar
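The difference between latency and throughput matters most when instructions depend on each other. The following idealized model assumes a fully pipelined unit with enough ready work (for example from several hardware threads) to issue every cycle; it ignores the back-to-back issue restriction and all memory effects. The function names and the model itself are illustrative assumptions.

```python
def cycles_independent(n, latency=4, issue_interval=1):
    """Cycles to complete n independent instructions on a pipelined unit.

    The first result appears after `latency` cycles; each further
    instruction completes `issue_interval` cycles after the previous one.
    """
    return latency + (n - 1) * issue_interval


def cycles_dependent_chain(n, latency=4):
    """Cycles for n instructions where each needs the previous result."""
    return n * latency


for n in (1, 16, 64):
    print(n, cycles_independent(n), cycles_dependent_chain(n))
```

With the default 4-cycle latency, a long run of independent vector operations approaches one result per cycle, while a dependent chain is about four times slower.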
Interconnect and Cache Hierarchy
Ring Interconnect Distributed Tag Directories
• Tag Directories (TD) track the cache line in all L2 caches
• Each TD entry holds: TAG, Core Valid Mask, State
• Ring labels: Data, Command, Address, Coherence
[Diagram: cores, each with its own L2 and TD, attached to the ring interconnect on both sides.]
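The idea of a tag directory entry (tag, core valid mask, state) can be sketched as a small data structure. This is a deliberately simplified illustration: the modulo hash that picks the home directory, the two-state update rule and all class and method names are hypothetical, and the real protocol handles many more cases than a fill.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # cache line size


@dataclass
class TagEntry:
    state: str = "I"          # MESI-style state label
    core_valid_mask: int = 0  # bit i set: core i's L2 holds the line


class TagDirectories:
    """Toy model of tag directories distributed across ring stops."""

    def __init__(self, num_directories):
        self.dirs = [dict() for _ in range(num_directories)]

    def home(self, address):
        # Assumed hash: line number modulo the number of directories.
        return (address // LINE_BYTES) % len(self.dirs)

    def record_fill(self, address, core):
        """Note that `core` now holds the line containing `address`."""
        line = address // LINE_BYTES
        entry = self.dirs[self.home(address)].setdefault(line, TagEntry())
        entry.core_valid_mask |= 1 << core
        sharers = bin(entry.core_valid_mask).count("1")
        entry.state = "S" if sharers > 1 else "E"  # simplified rule
        return entry

    def holders(self, address):
        """List the cores whose L2 is recorded as holding the line."""
        line = address // LINE_BYTES
        entry = self.dirs[self.home(address)].get(line)
        if entry is None:
            return []
        mask = entry.core_valid_mask
        return [c for c in range(mask.bit_length()) if (mask >> c) & 1]


td = TagDirectories(64)
td.record_fill(0x1000, core=3)
entry = td.record_fill(0x1000, core=17)
print(td.home(0x1000), td.holders(0x1000), entry.state)
```

The point of the structure is that a miss in one L2 only has to consult the line's home directory to learn which other L2 caches hold it.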
Cache Hierarchy
Parameter       L1              L2
Coherence       MESI            MESI
Size            32KB + 32KB     512KB
Associativity   8-way           8-way
Line Size       64 Bytes        64 Bytes
Banks           8               8
Access Time     2 cycles        23 cycles
Policy          Pseudo LRU      Pseudo LRU
Duty Cycle      1 per clock     1 per clock
Ports           Read or Write   Read or Write
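The size, associativity and line size in the table determine the number of sets in each cache, and the per-core L2 size gives the aggregate L2 capacity. A short calculation, assuming the 32KB L1 figure applies to each of the instruction and data caches separately and using the 61-core count quoted earlier:

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets = capacity / (associativity x line size)."""
    return size_bytes // (ways * line_bytes)


KB = 1024
l1_sets = cache_sets(32 * KB, ways=8, line_bytes=64)
l2_sets = cache_sets(512 * KB, ways=8, line_bytes=64)

cores = 61
total_l2_mb = cores * 512 / 1024  # aggregate L2 across all cores, in MB

print("L1 sets:", l1_sets)
print("L2 sets:", l2_sets)
print("Aggregate L2:", total_l2_mb, "MB")
```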
Power and Performance
Theoretical Maximum
(Intel® Xeon® processor E5-2670 vs. Intel® Xeon Phi™ coprocessor 5110P & SE10P/X)
Higher is better.

Metric                     E5-2670                  5110P                    SE10P/X                Advantage
                           (2x 2.6GHz, 8C, 115W)    (60C, 1.053GHz, 225W)    (61C, 1.1GHz, 300W)
Single Precision (GF/s)    666                      2,022                    2,147                  Up to 3.2x
Double Precision (GF/s)    333                      1,011                    1,074                  Up to 3.2x
Memory Bandwidth (GB/s)    102                      320                      352                    Up to 3.45x
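The Xeon baseline and the "up to" ratios in this comparison can be checked with the same kind of arithmetic. The sketch assumes 8 double precision flops per core per cycle for the E5-2670 (a 4-wide add plus a 4-wide multiply) and 4 memory channels per socket, 8 bytes wide, at 1.6 GT/s. These per-core and per-channel parameters are not given in the slides and are stated here as assumptions; the coprocessor values are taken from the chart.

```python
def xeon_peak_gflops(sockets, cores, ghz, flops_per_cycle):
    """Peak GF/s for a multi-socket server."""
    return sockets * cores * ghz * flops_per_cycle


def xeon_peak_bandwidth(sockets, channels, bytes_per_transfer, gtransfers):
    """Peak GB/s for a multi-socket server."""
    return sockets * channels * bytes_per_transfer * gtransfers


# Assumed: 8 DP (16 SP) flops per core per cycle, 4 channels x 8 B at 1.6 GT/s.
xeon_dp = xeon_peak_gflops(2, 8, 2.6, 8)
xeon_sp = xeon_peak_gflops(2, 8, 2.6, 16)
xeon_bw = xeon_peak_bandwidth(2, 4, 8, 1.6)

# Ratios using the rounded values printed on the chart.
sp_ratio = 2147 / 666
dp_ratio = 1074 / 333
bw_ratio = 352 / 102

print(f"E5-2670 peak: {xeon_sp:.1f} SP GF/s, {xeon_dp:.1f} DP GF/s, {xeon_bw:.1f} GB/s")
print(f"Ratios: SP {sp_ratio:.2f}x, DP {dp_ratio:.2f}x, BW {bw_ratio:.2f}x")
```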
Source: Intel as of October 17, 2012 Configuration Details: Please reference slide speaker notes.
For more information go to http://www.intel.com/performance
Synthetic Benchmark Summary
Higher is better.

Benchmark             E5-2670 Baseline         5110P                    SE10P                    Advantage
                      (2x 2.6GHz, 8C, 115W)    (60C, 1.053GHz, 225W)    (61C, 1.1GHz, 300W)
SGEMM (GF/s)          640                      1,729 (85% Efficient)    1,860 (86% Efficient)    Up to 2.9X
DGEMM (GF/s)          309                      833 (82% Efficient)      883 (82% Efficient)      Up to 2.8X
SMP Linpack (GF/s)    303                      722 (71% Efficient)      803 (75% Efficient)      Up to 2.6X
STREAM Triad (GB/s)   80                       159                      174                      Up to 2.2X

Chart label: ECC On.
Coprocessor results: Benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)
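The "Efficient" percentages on this slide appear to be the measured rate divided by the theoretical peak from the previous slide. The sketch below recomputes them on that assumption. The pairing of each measured value with a specific card is inferred from which pairing reproduces the labels, not stated explicitly in the transcript, and the dictionary layout is illustrative.

```python
# Theoretical peaks in GF/s, taken from the "Theoretical Maximum" chart.
PEAK = {
    "5110P": {"SP": 2022, "DP": 1011},
    "SE10P": {"SP": 2147, "DP": 1074},
}

# Measured GF/s; card assignment is an inference from the efficiency labels.
MEASURED = {
    "SGEMM": ("SP", {"5110P": 1729, "SE10P": 1860}),
    "DGEMM": ("DP", {"5110P": 833, "SE10P": 883}),
    "SMP Linpack": ("DP", {"5110P": 722, "SE10P": 803}),
}


def efficiency(measured, peak):
    """Measured rate as a percentage of theoretical peak."""
    return 100.0 * measured / peak


for bench, (precision, results) in MEASURED.items():
    for card, gflops in results.items():
        pct = efficiency(gflops, PEAK[card][precision])
        print(f"{bench:12s} {card:6s} {pct:5.1f}% of peak")
```

The recomputed values fall within one percentage point of the labels on the slide.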
Intel® Xeon Phi™ Coprocessor vs. Intel® Xeon® Processor
Financial Services Workloads
[Chart: relative performance of the Intel® Xeon Phi™ Coprocessor vs. a 2 socket Intel® Xeon® processor, higher is better, normalized to a 1.0 baseline of a 2 socket Intel® Xeon® processor E5-2687. Bars: 2S Intel® Xeon® Processor (1.00), BlackScholes Compute DP, BlackScholes Compute & BW DP, Monte Carlo Simulation DP, BlackScholes Compute SP, BlackScholes Compute & BW SP, Monte Carlo Simulation SP. Bar values shown: 10.75, 8.92, 7.52, 4.48, 3.94, 3.45.]
Coprocessor results: Benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)
Notes
1. 2 X Intel® Xeon® Processor E5-2670 (2.6GHz, 8C, 115W)
2. Intel® Xeon Phi™ coprocessor SE10 (ECC on) with pre-production SW stack
Higher SP results are due to certain Single Precision
transcendental functions in the Intel® Xeon Phi™
coprocessor which are not present in the Intel®
Xeon® processor
Summary
• Intel® Xeon Phi™ coprocessor provides
Performance and Performance/Watt for highly
parallel HPC with cores/threads, wide-SIMD,
caches, memory BW