Optimizing Matrix Multiply


Multi-core and Multi-threading
1
What I Want to Cover Today
• Why multi-core technology was developed
• What multi-core is, and the problems it must solve
• What today's multi-core processors look like
• New multi-core technologies
2
Today's Processor
• Voltage level: a flashlight (~1 volt)
• Current level: an oven (~250 amps)
• Power level: a light bulb (~100 watts)
• Area: a postage stamp (~1 square inch)
• Performance: GFLOPS
3
What will the future need?
• The need for performance is never-ending
• Complaints from end users today
• Tomorrow's killer application
4
Tomorrow's Killer Application (RMS: Recognition, Mining, Synthesis)
5
What will the future need?
• The need for performance is never-ending
• Next step: how can we get to 1 TFLOPS?
6
Why Multi-core: Wire Delay
[Figure: a 1 Tflop/s, 1 Tbyte sequential machine would have to fit within r = 0.3 mm]
• Consider the 1 Tflop/s sequential machine:
  • Data must travel some distance, r, to get from memory to the CPU.
  • To get 1 data element per cycle, data must make the trip 10^12 times per second; at the speed of light, c = 3x10^8 m/s, this means r < c/10^12 = 0.3 mm. (The arithmetic is written out after this slide.)
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
  • Each word occupies about 3 square Angstroms, the size of a small atom.
• No choice but parallelism
7
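Written out, the arithmetic behind this slide (a back-of-the-envelope bound, assuming one memory round trip per cycle at 10^12 cycles/s):

\[ r < \frac{c}{f} = \frac{3\times10^{8}\ \mathrm{m/s}}{10^{12}\ \mathrm{cycles/s}} = 3\times10^{-4}\ \mathrm{m} = 0.3\ \mathrm{mm} \qquad \frac{(3\times10^{6}\ \mathrm{\AA})^{2}}{10^{12}\ \mathrm{words}} = 9\ \mathrm{\AA}^{2}\ \mathrm{per\ word} \]

That is a cell about 3 Angstroms on a side for each word: atomic scale, which is why the slide concludes there is no choice but parallelism.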
Why Multi-core: Heat Dissipation
Dissipated Power ~ CV^2·f
[Chart: power density (watts/cm^2, log scale from 1 to 1000) versus process technology (1.5 um down to 0.07 um). With increasing frequency, power density rises from hot-plate levels (i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III) toward nuclear-reactor and rocket-nozzle levels (Pentium 4 Willamette and Prescott).]
8
Managing the Heat Load
[Photos: liquid cooling system in Apple G5s; heat sinks in 6XX-series Pentium 4s]
9
Why Multi-core: Leakage Current
Leakage Current: From Minor Nuisance to Chip Killer
Dissipated Power ~ CV^2·f
[Chart: power (W, 0 to 300) across process technology nodes from 250 nm down to 70 nm, split into dynamic power and leakage power; leakage grows from a small fraction to a dominant share of the total.]
10
Why Multi-core: Fabrication Cost
• Moore's 2nd law (Rock's law): the cost of a semiconductor fab doubles roughly every four years
[Image: demo of 0.06-micron CMOS]
11
Technology Trends: Microprocessor Capacity
2X transistors/chip every 1.5 years, called "Moore's Law"
Microprocessors have become smaller, denser, and more powerful. And not just processors: bandwidth, storage, etc.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
12
Moore's Law Still Holds
No exponential is forever, but perhaps we can delay it forever
[Chart: transistors per die, 1960-2010, log scale from 10^0 to 10^11. Two series, memory (1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M, 1G, 2G, 4G) and microprocessors (4004, 8080, 8086, 80286, i386, i486, Pentium, Pentium II, Pentium III, Pentium 4, Itanium), both tracking the exponential. Source: Intel]
13
Means of Increasing Performance
• Increasing clock frequency
  • From 60 MHz to 3,800 MHz in 12 years
  • Has delivered the expected performance increase
• Execution optimization
  • The kernel of which is instruction-level parallelism
14
A brief history of micro-architecture evolution
[Timeline: 4-bit data → 32-bit data → pipeline (in-order) → pipeline (out-of-order) → superscalar → super-pipeline → VLIW, speculation, predication → SMT → 64-bit data → dual core → multi-core → many core (where we are)]
• Two axes:
  • Exploiting parallelism: much of the performance comes from parallelism
    • Bit-level parallelism
    • Instruction-level parallelism (ILP)
    • Thread-level parallelism (TLP)
  • Hiding the memory latency
15
What is Pipelining?
Dave Patterson’s Laundry example: 4 people doing laundry
wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency
[Figure: Gantt chart of four loads A-D flowing through the wash/dry/fold stages in task order, from 6 PM to about 9:30 PM; stage times along the time axis are 30, 40, 40, 40, 40, 20 minutes]
• In this example:
  • Sequential execution takes 4 * 90 min = 6 hours
  • Pipelined execution takes 30 + 4*40 + 20 = 210 min = 3.5 hours
• Bandwidth = loads/hour
  • BW = 4/6 loads/hour without pipelining
  • BW = 4/3.5 loads/hour with pipelining
• Pipelining helps bandwidth but not latency (90 min)
• Bandwidth is limited by the slowest pipeline stage
• Potential speedup = number of pipe stages
(The general formulas appear after this slide.)
16
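The laundry numbers generalize. For n tasks flowing through k stages with stage times t_i (a standard identity, consistent with the slide's figures):

\[ T_{\mathrm{seq}} = n\sum_{i=1}^{k} t_i \qquad T_{\mathrm{pipe}} = \sum_{i=1}^{k} t_i + (n-1)\,t_{\max} \qquad \lim_{n\to\infty}\frac{T_{\mathrm{seq}}}{T_{\mathrm{pipe}}} = \frac{\sum_i t_i}{t_{\max}} \le k \]

With the laundry times this gives 90 + 3*40 = 210 minutes, matching the 3.5 hours above, and an asymptotic speedup of 90/40 = 2.25, bounded by the number of stages.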
VLIW
• Compiler schedules the parallel execution
17
SIMD – SSE, SSE2, SSE3 Support
[Diagram: SIMD register layouts across the instruction-set generations: 4x floats (SSE); 2x doubles and 1x dqword (SSE2/SSE3); packed integers as 16x bytes, 8x words, 4x dwords, 2x qwords (MMX*/SSE2). A code sketch follows this slide.]
* MMX actually used the x87 floating point registers; SSE, SSE2, and SSE3 use the new SSE registers
18
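As a concrete taste of SSE, here is a minimal C++ sketch using the standard <xmmintrin.h> intrinsics; the function name add4 and the multiple-of-4 length assumption are mine, not from the slides:

    #include <xmmintrin.h>  // SSE intrinsics: __m128 holds 4 floats

    // Add two float arrays four lanes at a time (n assumed a multiple of 4).
    void add4(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);             // load 4 floats, unaligned ok
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 adds in one instruction
        }
    }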
Means of Increasing Performance
• Execution optimization
  • More powerful instructions
  • Execution optimization (pipelining, branch prediction, execution of multiple instructions, reordering the instruction stream, etc.)
• The gain from exploiting ILP is diminishing
• The inherent barriers ILP must tackle
  • Control dependence, data dependence…
19
Means of Increasing Performance
What is next?
• Need to feed the processor with TLP
  • Here the problem is essentially the same as parallel programming
• Technologies for TLP (see the sketch after this slide)
  • Simultaneous multi-threading (SMT) -> example: Intel Hyper-Threading
  • Chip multiprocessing (CMP) -> multi-core processor
20
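A quick way to observe SMT and CMP from software (a hedged sketch in standard C++11, nothing slide-specific):

    #include <iostream>
    #include <thread>

    int main() {
        // Logical processors visible to the OS. With Hyper-Threading (SMT)
        // this is typically twice the number of physical cores (CMP).
        std::cout << std::thread::hardware_concurrency()
                  << " hardware threads\n";
    }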
Micro-architecture Trends
[Chart: MIPS versus year, 1980-2010, log scale from 10^1 to 10^6. The era of instruction parallelism (Pentium architecture: superscalar; Pentium Pro architecture: speculative out-of-order; Pentium 4 architecture: trace cache) gives way to the era of thread parallelism (Pentium 4 and Xeon architecture with HT: multi-threaded; then multi-threaded, multi-core).]
Adapted from Johan De Gelas, Quest for More Processing Power, AnandTech, Feb. 8, 2005.
21
Understanding SMT and CMP
Making clear: concurrency vs. parallelism (see the sketch after this slide)
• Concurrency: two or more threads are in progress at the same time
  [Diagram: Thread 1 and Thread 2 interleaved on a single core]
• Parallelism: two or more threads are executing at the same time
  [Diagram: Thread 1 and Thread 2 running simultaneously]
  • Multiple cores needed
22
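A minimal C++11 sketch of the distinction (the work function and loop bound are illustrative): both threads are concurrent by construction; whether they also run in parallel depends on the OS placing them on different cores.

    #include <iostream>
    #include <thread>

    void work(int id) {
        long long sum = 0;
        for (long long i = 0; i < 100000000; ++i) sum += i;  // busy work
        std::cout << "thread " << id << " done, sum=" << sum << "\n";
    }

    int main() {
        std::thread t1(work, 1);  // in progress at the same time: concurrency
        std::thread t2(work, 2);  // on two cores at the same instant: parallelism
        t1.join();
        t2.join();
    }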
Simultaneous Multithreading (SMT)
• Minimal resource replication
• Provides instructions to overlap memory latency
• Separate threads exploit idle resources
[Diagram: two thread contexts (Context1, Context2) sharing one set of functional units, the L1 cache, the L2 cache, and main memory]
23
SMT: simultaneous multithreading
[Diagram: issue slots over processor cycles for superscalar, multithreaded, and SMT designs, with five threads plus unutilized slots. The superscalar leaves many slots empty; the multithreaded design fills each cycle from one thread at a time; SMT mixes instructions from several threads within a cycle.]
24
Go to the era of Multicore
• Concurrency in the form of hardware multithreading has been around for a while.
  • Useful for hiding memory latencies.
  • But only about a 30% performance improvement, and only for particular applications.
• How can we continue to utilize the ever-higher transistor densities predicted by Moore's Law?
• Current view: we can continue performance improvements by packing multiple processing cores onto a single chip, i.e., multicore.
  • multi-core == chip multiprocessing == Tera-scale computing
25
Chip Multiprocessing
• Much larger degree of resource replication
• Two complete processing cores on each chip
• Outer levels of cache and the external interface are shared
• Greatly reduced resource contention compared to SMT
[Diagram: two contexts, each with its own functional units and L1 cache, sharing the L2 cache and main memory]
26
What do we gain from multi-core?
A new target for micro-architecture: high performance/power
27
Multi-Core Processors
• Improved cost/performance ratio
  • Minimal increases in architectural complexity provide significant increases in performance
  • Minimizes performance stalls, with a dramatic increase in overall effective system performance
• Greater EEP (energy-efficient performance) and scalability
  • Cores enable thread-level parallelism
• Multi-core architecture enables a divide-and-conquer strategy to perform more work in a given clock cycle.
28
• What's special about many-cores?
  • Explicit multi-threading is required to speed up single-application performance
  • Core-to-core communication
    • Latency must be reduced
    • Bandwidth must increase
  • Cache size per core will also shrink
29
30
Latency tests on Intel Clovertown
[Chart: core-to-core latency (usec, 0 to 2.5) for core pairs 0-1 through 0-7, at message sizes of 1, 4, 16, 64, and 256 bytes]
31
What is the problem? Where is the innovation?
• How about the core?
  • Equal to the original one or not?
  • A simple core may be a good choice
• How about power control on the chip?
  • Fine-granularity power control
• How about the interconnect between cores and other units?
  • X cores mean X times the memory references
  • Requires higher throughput between cores and caches, within the cache hierarchy, and between the last-level cache and memory
  • Requires lower latencies in those places
  • Four basic kinds of interconnects: buses, crossbars, tiny networks, and rings
    • Each has its own tradeoffs in throughput, latency, resource occupation, and ease of implementation
    • Each may be suitable at a different level
[Diagram: a single core P with its MEM/$ hierarchy, versus multiple cores P sharing a MEM/$ hierarchy]
32
What is the problem? Where is the innovation?
• How about the cache? (NUCA: non-uniform cache architecture)
A NUCA Substrate for Flexible CMP Cache Sharing, Proc. of the 19th Annual International Conference on Supercomputing, June 2005, pp. 31-40
33
Problems of multi-core processors
• A multi-core processor is in fact a parallel system on a chip
  • Hierarchical
  • Distributed
• Speeding up a single application requires explicit multi-threading
• The central software issue posed by multi-core processors is parallel program development, including writing and debugging parallel programs: the software challenge of multi-core
34
What is the problem? Where is the innovation?
• Where are the threads? Perhaps the biggest challenge
• Make programmers write threaded programs
  • The world may be confused.
• Automatic parallelism
  • Mission impossible in general, but it can improve in some cases.
• Provide threaded modules for use
  • How do we control the high-level behavior of our programs?
• Try to ease the programmer's burden
  • Looks good, but how?
35
How to meet the software challenge on multi-core
• Have programmers write parallel programs
  • Inherit and improve OpenMP, MPI, and the like (see the OpenMP sketch after this list)
  • New programming languages such as X10
  • Transactional memory
• Automatic parallelization
  • Very hard; after 20 years of development its generality is still poor
  • Speculative multi-threading
• Implement parallel libraries
  • Intel MKL, ScaLAPACK
  • How to control the high-level behavior of programs?
• Other valuable work
  • Functional languages, dataflow, domain-specific languages
36
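A minimal OpenMP sketch of the first option (compile with -fopenmp; the array and its size are illustrative): the pragma splits the loop across cores and combines the per-thread partial sums.

    #include <cstdio>
    #include <omp.h>

    int main() {
        const int n = 1 << 20;
        static double a[1 << 20];
        for (int i = 0; i < n; ++i) a[i] = 1.0;

        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)  // each thread sums a chunk
        for (int i = 0; i < n; ++i) sum += a[i];

        std::printf("sum = %.0f using up to %d threads\n",
                    sum, omp_get_max_threads());
    }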
• All of the above are still open issues
37
Multicore Products Nowadays
• Lots of dual-core products now:
  • Intel: Pentium D and Pentium Extreme Edition, Core Duo (2), Woodcrest, Montecito
  • IBM PowerPC
  • AMD Opteron / Athlon 64
  • Sun UltraSPARC IV
• Systems with more than two cores are here, with more coming:
  • IBM Cell (asymmetric): dual-core PowerPC plus eight "synergistic processing elements"
  • Sun Niagara: eight cores, four hyper-threaded threads per core
• General-purpose computation on graphics processors (GPGPU)
• Intel expects to produce 16- or even 32-core chips within a decade.
38
Architecture of Dual-Core Chips
[Diagram: Intel Core Duo, two execution cores, each with its own FP unit and L1 cache, sharing an L2 cache and the system bus (667 MHz, 5333 MB/s)]
• Intel Core Duo
  • Two physical cores in a package
  • Each with its own execution resources
  • Each with its own L1 cache: 32K instruction and 32K data
  • Both cores share the L2 cache: 2 MB, 8-way set associative, 64-byte line size, 10-clock-cycle latency, write-back update policy
• AMD Opteron
  • Separate 1 Mbyte L2 caches
  • Improvements for memory affinity and thread affinity
39
Intel Multi-core Plan
40
Intel Multi-core Plan
41
Cell from IBM and Sony
42
Cell from IBM and Sony
43
Niagara from SUN
44
GPU Fundamentals: The Modern Graphics Pipeline
[Diagram: the application (CPU) emits vertices (3D); the GPU's vertex transform processor produces transformed, lit vertices (2D); geometry assembly produces primitives (screen-space triangles); rasterization produces fragments (pre-pixels); the fragment shade processor produces final pixels (color, depth). Graphics state feeds the GPU stages, and video memory (textures) supports render-to-texture.]
• Programmable vertex processor!
• Programmable pixel processor!
45
GPU Fundamentals: The Modern Graphics Pipeline
46
The technologies underway
• Rethink the concurrency and parallelism for
multi-core
• New programming model and programming
languages
• Hardware support (and software) for
multithreading
• Control-driven speculation
• Speculative multithreading
• Data-driven speculation
• Program demultiplexing
• Architectural thread enhancement
• Support for hardware threads
• Lightweight synchronization (monitor/mwait)
47
Rethink the C and P for multi-core
• What we have seen for multi-core
  • More parallelism needs to be exploited
    • Scaling may be more important
  • More heterogeneity needs to be exploited
    • Task mapping may need to be revisited
  • Low latency and high bandwidth between cores on a chip
    • Fine-granularity parallelism may need to be rethought
48
Rethink the C and P for multi-core
• Make full use of multi-core resources
  • More parallelism
  • Hide memory access stalls: the well-known Memory Wall
49
Index construction tests on Clovertown
• Index construction is equal parts compute-intensive and IO-intensive

  Stage                                          Main resource used
  Read document data                             Disk
  Word segmentation (Chinese/Japanese/Korean)    CPU
  Parse documents                                CPU
  Build in-memory index                          CPU, memory
  Write on-disk index                            Disk

• Web page data: 32 GB; resulting index size: 4.5 GB
[Diagram: documents are read in and processed into an in-memory index (index terms with inverted posting lists); the in-memory index is written back to disk as per-batch on-disk indexes; the on-disk indexes are then merged into the final index]
50
Index construction tests on Clovertown
Stages: read document data → word segmentation (Chinese) → parse documents → build in-memory index → write on-disk index → merge indexes
• Some indexing stages are compute-bound, others IO-bound
• Idea: divide the indexing process into multiple pipeline stages, giving a pipelined indexing algorithm that fully utilizes the system's compute resources
• Principles for dividing pipeline stages:
  • Resource independence: each stage uses independent resources
  • Similar durations: the stages take roughly equal time
• Fine-grained pipelining algorithm: parallelize by overlapping the execution of the pipeline stages (see the sketch after this slide)
51
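A minimal C++11 sketch of a two-stage version of this pipeline (the stage bodies are placeholders; the real system has more stages): one thread reads documents while another parses and indexes them, overlapping disk time with CPU time.

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    std::queue<std::string> q;  // unbounded hand-off queue between stages
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void reader() {             // stage 1: IO-bound
        for (int i = 0; i < 100; ++i) {
            std::string doc = "doc-" + std::to_string(i);  // stand-in for a disk read
            { std::lock_guard<std::mutex> lk(m); q.push(doc); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    }

    void indexer() {            // stage 2: CPU-bound
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !q.empty() || done; });
            if (q.empty()) return;  // reader finished and queue drained
            std::string doc = q.front(); q.pop();
            lk.unlock();
            // ... segment words, parse, update the in-memory index ...
        }
    }

    int main() {
        std::thread t1(reader), t2(indexer);
        t1.join(); t2.join();
    }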
Intel Clovertown test environment
  CPU       Genuine Intel 2.66 GHz, quad core, dual processors
  Disk      Ultra320 SCSI disk
  Memory    6 GB
  OS        Red Hat Linux (2.6.5-1.358smp)
  Compiler  gcc version 3.3.3
52
Index construction tests on Clovertown
• Performance gain on a single core
  • The pipeline hides part of the document-reading I/O time: an 8.2% improvement
• Performance gain on multiple cores
  • Parallelized computation: a 53.4% improvement
[Chart: running time (s, 0 to 2500) versus number of cores (up to 4) for the serial and fine-grained pipelined versions]
The tests used 1.5 GB of memory, with the input data and the index on the same disk
53
Rethink the C and P for multi-core
• Processor affinity benefits task mapping (see the affinity sketch after this slide)
  • Parallel FFT computation in NPB gets a 14% performance increase with MPICH
54
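A hedged, Linux-specific sketch of setting processor affinity via glibc's sched_setaffinity (the choice of core 0 is arbitrary); MPI runtimes typically make the same call when binding ranks to cores:

    #include <sched.h>   // sched_setaffinity, cpu_set_t (Linux/glibc)
    #include <cstdio>
    #include <unistd.h>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);  // allow this process to run on core 0 only
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }
        std::printf("pinned pid %d to core 0\n", (int)getpid());
    }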
Rethink the C and P for multi-core
• Exploit dynamic and adaptive out-of-order execution patterns on multi-core and heterogeneous systems
55
The technologies underway
• Rethink the concurrency and parallelism for
multi-core
• New programming model and programming
languages
• Hardware support (and software) for
multithreading
• Control-driven speculation
• Speculative multithreading
• Data-driven speculation
• Program demultiplexing
• Architectural thread enhancement
• Support for hardware threads
• Lightweight synchronization (monitor/mwait)
56
Programming Model and PLs
• Bridge application software to system software and hardware, to better express the parallelism of such heterogeneous systems
• Transactional Memory
• IBM X10
• SUN Fortress
• Other meaningful explorations
  • Functional languages
  • Dataflow
  • Domain-specific languages
57
Transactional memory
a way to ease thread programming
• Thread programming is tedious
58
Transactional memory
• A transaction is a sequence of memory loads and stores that either commits or aborts
• If a transaction commits, all the loads and stores appear to have executed atomically
• If a transaction aborts, none of its stores take effect
• Transaction operations aren't visible until they commit or abort
• A simplified version of traditional ACID database transactions (no durability, for example)
61
Transactional memory example
62
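The example figure is not in the transcript; as a stand-in, here is a hedged sketch using GCC's transactional-memory extension (compile with -fgnu-tm; the two-account transfer is illustrative, not from the slides). The block either commits both updates or neither, with no visible intermediate state:

    #include <thread>

    int balanceA = 100, balanceB = 0;

    // With -fgnu-tm, GCC executes the block as a transaction: other threads
    // never observe balanceA decremented without balanceB incremented.
    void transfer(int amount) {
        __transaction_atomic {
            balanceA -= amount;
            balanceB += amount;
        }
    }

    int main() {
        std::thread t1(transfer, 10), t2(transfer, 20);
        t1.join(); t2.join();
        // invariant: balanceA + balanceB == 100
    }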
Problems in Transactional Memory
63
Solutions for Transactional Memory
64
X10
• Uniform support for multi-core systems and cluster systems
• High productivity
  • The language design emphasizes portability and safety
• Performance
  • Extends the Java virtual machine
  • Provides means for manual performance tuning
• Built on the Java language
  • Inherits the core values of Java: high productivity, portability, maturity, safety
  • Aimed at mainstream Java/C/C++ programmers
X10 Vision: Portable Productive Parallel Programming
• X10 data structures: the X10 language defines the mapping from X10 objects & activities to X10 places
• X10 places: the X10 deployment defines the mapping from virtual X10 places to physical processing elements
• Physical PEs: homogeneous multi-core, heterogeneous accelerators, clusters
[Diagram: example deployment targets. SMP nodes with PEs, L1 caches, a shared L2 cache, and memory; and a Cell-style chip: a PPE (PPU, PXU, L1, L2) plus eight SPEs (each an SPU/SXU with local store (LS) and SMF), connected by an EIB of up to 96 B/cycle, with MIC to dual XDR memory and BIC to FlexIO; 64-bit Power Architecture with VMX.]
66
Overview of X10 (x10.sf.net)
Storage classes:
• Activity-local
• Place-local
• Partitioned global
• Immutable
• Dynamic parallelism with a Partitioned Global Address Space
• Places encapsulate the binding of activities and globally addressable data
• async (P) S --- run statement S asynchronously at place P
• finish S --- execute statement S, and wait for descendant asyncs to terminate
• atomic S --- execute statement S atomically
  • No place-remote accesses permitted in an atomic section
Deadlock safety: any X10 program written with async, atomic, and finish can never deadlock
67
X10 Program Example
[Diagram: Activity A0 (Part 1) runs under the implicit finish of main; it asyncs Activity A1, which asyncs Activity A2. A nested finish then runs Activity A0 (Part 2) and asyncs Activities A3 and A4 (which may raise an IndexOutOfBounds exception); Activity A0 (Part 3) runs after the nested finish completes.]

// X10 pseudo code
main(){ // implicit finish
  Activity A0 (Part 1);
  async {A1; async A2;}
  try {
    finish {
      Activity A0 (Part 2);
      async A3;
      async A4;
    }
  } catch (...) { ... }
  Activity A0 (Part 3);
}
The technologies underway
• Rethink the concurrency and parallelism for
multi-core
• New programming model and programming
languages
• Hardware support (and software) for
multithreading
• Control-driven speculation
• Speculative multithreading
• Data-driven speculation
• Program demultiplexing
• Architectural thread enhancement
• Support for hardware threads
• Lightweight synchronization (monitor/mwait)
69
Speculative multithreading
[Diagram: in the original program execution, regions A, B, and C run in sequence on one thread over time. In speculative parallel thread (SPT) execution, the main thread runs A, spawns a speculative thread that executes C while the main thread runs B, and then commits the speculative results.]
• Speculative threading for memory dependences
• Speculative threading for values, with a pre-computation slice
70
Problems in Speculative multithreading
• Locating the sections of the program that can efficiently be executed in parallel
  • Workload balance
  • Low computational overhead for the pre-computation slice
• Buffering and multi-versioning in the memory hierarchy
  • Buffering keeps the speculative state until the thread is verified and can be committed
  • Multi-versioning allows each variable to have a different value for each of the threads running in parallel
• Detecting data-dependence mis-speculations quickly
71
Summary of current trends
• Toward many core
• Hardware support for multithreading
• Transactional memory
  • It is hard to write fast threaded programs
  • Locks create fundamental problems
  • Transactional memory shields programmers
  • Hardware speeds up transactional memory
• Energy-efficient design
72
Add more axes to the micro-architecture evolution
• Reliability
  • Hardware failures will become more and more frequent as feature sizes continue to shrink. For example, 10 out of 1000 cores might be non-functional, and another 10 might produce incorrect results…
  • Use multi-core to do redundant computation
• And others?
73
Today's Conclusion
• It's an age of cores and threads
• Many challenges remain
  • We will have more opportunities for innovation than we have ever had before
• But
  • Multi-core processor implementation (inherent parallelism) has a significant impact on software applications
  • The full potential is harnessed only by programs that migrate to a threaded software model
  • Efficient use of threads (kernel/system or user threads) is KEY to dramatically increasing effective system performance
74
Future Reading Materials
• Go to Google
• General reading
  • The Landscape of Parallel Computing Research: A View from Berkeley
  • Intel Developer Forum, Spring 2006, Beijing
  • http://www.intel.com/multi-core/docs.htm
• Special interest
  • ISCA05 panel on multi-core topics
  • Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures, ISCA 2000 (27th)
  • The impact of multi-core on math software
  • http://en.wikipedia.org/wiki/Parallel_programming_model
• NUCA
  • C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002.
  • A NUCA Substrate for Flexible CMP Cache Sharing. In Proc. of the 19th Annual International Conference on Supercomputing, June 2005.
  • http://www-128.ibm.com/developerworks/power/cell
75
Thanks!
76