Computer Architecture
Slide Sets
WS 2010/2011
Prof. Dr. Uwe Brinkschulte
Prof. Dr. Klaus Waldschmidt
Part 10
Thread and Task
Level Parallelism
Computer Architecture – Part 10 – page 1 of 31 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt
Hier wird Wissen Wirklichkeit
Basic concepts
Thread:
Threads are lightweight processes, each consisting of a sequence of instructions. All threads of a task share a common (virtual) address space and can communicate via this shared address space.
Task:
Tasks are heavyweight processes. Each task has its own address space. Tasks can only communicate via inter-task communication channels such as shared memory, pipes, message queues or sockets. A task can contain several threads.
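The distinction can be made concrete in a short sketch (Python, my own illustration; the names and the pipe-based child task are not from the slides): threads of one task write directly into shared memory, while a second task must send its result through an inter-task channel.

```python
import subprocess
import sys
import threading

results = []  # lives in the (virtual) address space shared by all threads

def worker(n):
    # Every thread sees the same 'results' list: communication is simply
    # a write into the common address space.
    results.append(n * n)

def thread_demo():
    results.clear()
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

def task_demo():
    # A second task (child process) has its own address space; its result
    # can only arrive over an inter-task channel, here a pipe (stdout).
    child = subprocess.run(
        [sys.executable, "-c", "print(sum(i * i for i in range(4)))"],
        capture_output=True, text=True,
    )
    return int(child.stdout)
```

thread_demo() gathers [0, 1, 4, 9] directly from shared memory; task_demo() computes the same sum (14) in a separate address space and has to ship it back through a pipe.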
Basic concepts
Instruction level parallelism is limited. To further exploit parallel processing,
thread or task level parallelism can be used.
Two major architectures are known:
• Multithreaded processors exploit thread level parallelism
• Chip multiprocessors (multi core processors, many core processors)
exploit task level parallelism
Both concepts are also used in combination
Computer Architecture – Part 10 – page 3 of 31 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt
Hier wird Wissen Wirklichkeit
Basic concepts
In a multithreaded processor, instructions of several threads of the program are candidates for concurrent issuing.
This can be done in a classical scalar pipeline to hide the latencies of memory accesses: instructions from several threads are then processed in the different pipeline stages.
It can also be combined with a superscalar pipeline to raise the level of possible parallelism from the intra-thread level to the inter-thread level.
This is called SMT (Simultaneous Multithreading).
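A toy slot-filling model (my own, with hypothetical numbers) shows why this raises utilization: a single thread's instruction level parallelism often cannot fill a wide issue stage, but the slots it leaves empty can be filled with instructions from further threads.

```python
def issue_slots(width, cycles, threads_ilp):
    """Toy model: 'threads_ilp' lists, per thread, how many instructions
    that thread can issue per cycle (its instruction level parallelism).
    An SMT pipeline of issue width 'width' fills leftover slots with
    instructions from the other threads. Returns slot utilization."""
    used = 0
    for _ in range(cycles):
        slots = width
        for ilp in threads_ilp:
            take = min(slots, ilp)  # this thread fills what it can
            used += take
            slots -= take
    return used / (width * cycles)
```

With one thread of ILP 2 on a 4-wide pipeline, half the slots stay empty (utilization 0.5); adding a second such thread fills the rest (utilization 1.0).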
Basic concepts
Chip multiprocessors combine multiple processor cores on a single
chip.
Therefore these processors are also called multi core processors.
Today's multicore processors integrate 2 - 8 cores on a chip.
For future processors with a strongly growing number of cores (e.g. > 100), the term many-core processors is used.
These cores can execute several tasks in parallel.
Cores can be homogeneous or heterogeneous.
With multithreaded cores, multithreading and chip multiprocessing can be combined.
Multithreaded Architectures
Multithreaded processor:
Supports the execution of multiple threads by hardware
It can store the context information of several threads in separate
register sets and execute instructions of different threads at the same
time in the processor pipeline
Different stages of the processor pipeline can contain instructions from
different threads
This exploits thread level parallelism on the basis of parallelism in time (pipelining)
Multithreaded Architectures
Goal:
Reduction of latencies caused by memory accesses or dependencies.
Such latencies can be bridged by switching to another thread: during the latency, instructions from other threads are fed into the pipeline.
=> processor utilization is raised and the throughput of a workload consisting of multiple threads increases
(while the throughput of a single thread remains the same)
• Explicit multithreaded processors: each thread is a real thread of the application program
• Implicit multithreaded processors: speculative parallel threads are created dynamically out of a sequential program
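The latency-bridging idea can be played through in a small cycle-by-cycle simulation (a toy model of my own; the parameters are arbitrary): one instruction issues per cycle, every load blocks its thread for a few cycles, and a blocked thread's cycles are given to another ready thread.

```python
def simulate(num_threads, instr_per_thread, load_every, latency):
    """Returns (total cycles, pipeline utilization). One instruction may
    issue per cycle; after every 'load_every'-th instruction a thread is
    blocked for 'latency' cycles; thread switching is assumed free."""
    remaining = [instr_per_thread] * num_threads
    ready_at = [0] * num_threads    # first cycle a thread may issue again
    executed = [0] * num_threads
    cycle = 0
    while any(r > 0 for r in remaining):
        for t in range(num_threads):
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                executed[t] += 1
                if executed[t] % load_every == 0:
                    ready_at[t] = cycle + 1 + latency  # memory stall
                break  # at most one instruction per cycle
        cycle += 1
    total = num_threads * instr_per_thread
    return cycle, total / cycle
```

With one thread (8 instructions, a 3-cycle load after every 4th instruction) utilization is 8/11; with two such threads the stalls are partly covered and utilization rises to 16/19, while the throughput of each individual thread does not improve.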
Basic multithreading techniques
[Figure: pipeline occupancy over time (processor cycles), with context switches marked, for
(a) single-threaded processor,
(b) cycle-by-cycle interleaving (fine-grain multithreading): the context is switched each clock cycle,
(c) block interleaving (coarse-grain multithreading): instructions of a thread are executed until an event causes a latency, then the context is switched.]
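The two interleaving disciplines differ only in when the context switch happens. As a sketch (my own encoding: thread ids per cycle, with the coarse-grain latency event simplified to a fixed period):

```python
def cycle_by_cycle(num_threads, cycles):
    # Fine-grain multithreading: a different thread issues every clock
    # cycle, round robin over all thread contexts.
    return [c % num_threads for c in range(cycles)]

def block_interleaving(num_threads, cycles, block):
    # Coarse-grain multithreading: one thread keeps issuing until an
    # event causes a latency (modeled here as every 'block'-th cycle),
    # then the context is switched to the next thread.
    return [(c // block) % num_threads for c in range(cycles)]
```

cycle_by_cycle(4, 8) gives [0, 1, 2, 3, 0, 1, 2, 3]; block_interleaving(2, 8, 4) gives [0, 0, 0, 0, 1, 1, 1, 1].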
Comparing multithreading to superscalar and VLIW

[Figure: issue-slot occupancy over time (processor cycles), with context switches marked, for
a: four times superscalar processor,
b: four times VLIW processor,
c: four times superscalar processor with cycle-by-cycle interleaving,
d: four times VLIW processor with cycle-by-cycle interleaving.]
Classification of block interleaving techniques

[Figure: classification tree of block interleaving techniques. The static techniques divide into explicit switch and implicit switch (switch on load, switch on store, switch on cache miss, ...).]
Simultaneous multithreading (SMT)
A simultaneous multithreaded processor is able to issue instructions of
multiple threads to multiple execution units in a single clock cycle.
This exploits thread level and instruction level parallelism in time and space.

[Figure: SMT pipeline with the stages instruction fetch; instruction decode and rename; instruction window; issue to reservation stations; execution units 1 to 4; retire and write back.]
Comparing SMT to
chip multiprocessing
Simultaneous multithreading (a) and chip multiprocessing (b)
[Figure: issue-slot occupancy over time (processor cycles) for (a) simultaneous multithreading, where the slots of one wide pipeline are filled with instructions of several threads, and (b) chip multiprocessing, where the slots are divided among separate cores.]
Other applications of multithreading
The ability of fast context switching opens up further application fields for multithreading:
• Reduction of energy consumption
Mispredictions in superscalar processors cost energy. Multithreaded processors can execute instructions from other threads instead
• Event handling
Helper threads handle special events (e.g. garbage collection)
• Real-time processing
Allows efficient real-time scheduling policies like LLF or GP
Chip multiprocessing architectures
A Chip-Multiprocessor (CMP) combines several processors on a single chip.
Instead of chip-multiprocessor, today this is also called a Multi-Core-Processor, where a core denotes a single processor on the multi-core processor chip.
Each core can have the complexity of today's microprocessors and holds its own primary cache for instructions and data.
Usually, the cores are organized as memory coupled multiprocessors with a shared address space.
Furthermore, a secondary cache is contained on the chip.
For future multi-core processors containing a large number of cores (>100), the term Many-Core-Processor is used.
Possible multi-core-configurations (1)
[Figure: two configurations. Shared main memory: each processor has a private primary and secondary cache, and all share the global memory. Shared secondary cache: each processor has a private primary cache, and all share one secondary cache in front of the global memory.]
Possible multi-core-configurations (2)
[Figure: shared primary cache: all processors share one primary cache, backed by a secondary cache and the global memory.]
Chip-Multiprocessor / Multi-Core
Simulations show the shared secondary cache architecture to be superior to shared primary cache and shared main memory.
Therefore, mostly a large shared secondary cache is implemented on the processor chip.
Cache coherency protocols known from symmetric multiprocessor architectures (e.g. the MESI protocol) guarantee correct access to the shared memory cells from inside and outside the processor chip.
Today, chip multiprocessing is often combined with simultaneous multithreading: each core is then an SMT core, giving the advantages of both approaches.
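As a sketch of the coherency idea: the MESI protocol keeps every cache line in one of the states Modified, Exclusive, Shared or Invalid. The transition function below is a simplified rendition of the standard protocol (it omits the Exclusive-on-fill case and the data-supply actions), not something defined on these slides.

```python
def mesi_next(state, event):
    """Simplified next-state function for one cache line.
    'local_read'/'local_write' are accesses by the owning core;
    'bus_read'/'bus_write' are snooped accesses by other cores."""
    table = {
        ("I", "local_read"): "S",   # fill; simplified: assume other sharers
        ("I", "local_write"): "M",  # read line for ownership, then modify
        ("S", "local_read"): "S",
        ("S", "local_write"): "M",  # invalidates the other copies
        ("S", "bus_write"): "I",
        ("E", "local_read"): "E",
        ("E", "local_write"): "M",  # silent upgrade, no bus traffic needed
        ("E", "bus_read"): "S",
        ("E", "bus_write"): "I",
        ("M", "local_read"): "M",
        ("M", "local_write"): "M",
        ("M", "bus_read"): "S",     # supply / write back the dirty data
        ("M", "bus_write"): "I",
    }
    return table.get((state, event), state)
```

For example, a write to an Invalid line makes it Modified, and a snooped read of a Modified line forces it back to Shared after the dirty data is supplied.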
An early single chip multiprocessor
proposal: Hydra
[Figure: the Hydra single-chip layout. Four CPUs, each with a primary I-cache, a primary D-cache and its own memory controller, are connected via centralized bus arbitration mechanisms to an on-chip secondary cache, an interface to an off-chip L3 cache SRAM array, a Rambus memory interface to the DRAM main memory, DMA, and an I/O bus interface to the I/O devices.]
Multi-Core examples
IBM Power5
Symmetric multi-core processor with two 64-bit, two-way SMT cores, each having 64 kBytes instruction cache and 32 kBytes data cache.
Both cores share a 1.41 MByte on-chip secondary cache.
The controller for the third level cache is on chip as well.
Four Power5 chips and four L3 cache chips are combined in a multi-chip module.
Multi-Core examples
IBM Power6
Similar to Power5, but with superscalar in-order execution.
Level 1 cache size raised to 64 kBytes for instructions and data on each core.
65 nm process
5 GHz clock frequency
Multi-Core examples
IBM Power7
Released in 2010
4, 6 or 8 cores
Turbo mode deactivates 4 out of 8 cores but gives the remaining 4 cores access to all memory controllers => improves single-core performance
Each core supports 4 times SMT
45 nm process
4 GHz clock frequency
Multi-Core examples
Intel Core 2 Duo (Wolfdale)
2 processor cores of the Intel Core 2 architecture
32 kBytes data and instruction cache for each core
6 MBytes L2 cache, shared by both cores
45 nm process
3 GHz clock frequency

[Figure: die photo with Core 1, Core 2 and the shared L2 cache marked.]
Multi-Core examples
[Figure: microarchitecture of the Intel Core 2 family (a single core). Source: c’t 16/2006]
Multi-Core examples
Intel Core 2 Quad (Yorkfield)
2 Wolfdale dies in a multi-chip module
=> 4 processor cores of the Intel Core 2 architecture
32 kBytes data and instruction cache for each core
6 MBytes L2 cache for each die
45 nm process
3 GHz clock frequency
Heterogeneous multi-cores
While homogeneous multi-core processors are commonly used for general
purpose computing, heterogeneous multi-core processors are seen as a
future trend for embedded systems
An early member of this technology is the IBM Cell processor, containing a Power processor (Power Processor Element, PPE) and 8 dependent processors (Synergistic Processing Elements, SPE).
PPE: based on the Power architecture, two-way SMT, controls the 8 SPEs.
SPE: contains a RISC processor with 128 bit SIMD (multimedia) instructions, a memory flow controller and a bus controller.
Originally designed for the Sony Playstation 3, the Cell processor is now used in various application domains.
Cell Processor Die

[Figure: die photo of the Cell processor.]
Multi-Core discussion: performance
Due to multithreading in PC and server operating systems, two to four cores already increase processor throughput significantly.
Exploiting eight or more cores requires parallel application programs.
Hence, software development is challenged to deliver the necessary number of parallel threads, by either parallelizing compilers or parallel applications.
Experience from multiprocessors shows that a moderate number of parallel threads yields a high performance improvement, but that this does not scale to higher degrees of parallelism.
Beginning with 4 to 8 threads, the performance improvement drops off sharply.
With 8 cores, some cores will be temporarily idle except for very compute-intensive applications.
Furthermore, memory bandwidth can become a bottleneck.
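The drop-off is what Amdahl's law (a standard result, not derived on these slides) predicts: if a fraction p of a program is parallelizable, the speedup on n cores is S(n) = 1 / ((1 - p) + p/n), bounded by 1/(1 - p) no matter how many cores are added.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores when a fraction p of the
    sequential runtime is perfectly parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)
```

For p = 0.9 (a hypothetical, quite well parallelized program), 2 cores give about 1.8x, 4 cores about 3.1x and 8 cores about 4.7x: each doubling of cores buys less, and even 100 cores stay below the 10x bound.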
Multi-Core discussion: hardware
While current multi-core processors use cache-coupled interconnection, future processors might rely on grid structures (network on chip) to improve performance.
Adaptive and reconfigurable MPSoCs (Multi-Processor Systems-on-Chip) will gain importance for embedded systems and general purpose computing.
Reconfigurable cache memories might allow variable connections to different cores.
Available input/output bandwidth is still an open problem for throughput-oriented programs.
Multi-Core discussion: hardware
For data access, transactional memory might be a model for future multi-core processors:
• Similar to database systems, memory access is organized as a transaction that is executed completely or not at all
• Hardware support for checkpointing and rollback is necessary
• As an advantage, concurrent access is simplified (no locks)
Furthermore, fault tolerance and dependability techniques will become more important, as the error probability increases with decreasing transistor dimensions.
On-chip power management will keep the importance it already has today.
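The all-or-nothing semantics of such transactions can be sketched in a few lines (a toy software rendition of my own, with a coarse global version counter for conflict detection; real hardware transactional memory tracks read and write sets per cache line):

```python
class Memory:
    def __init__(self):
        self.cells = {}    # the shared memory cells
        self.version = 0   # bumped on every successful commit

class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.start_version = mem.version  # checkpoint for rollback
        self.buffer = {}                  # private write buffer

    def read(self, cell):
        # Reads see the transaction's own pending writes first.
        return self.buffer.get(cell, self.mem.cells.get(cell, 0))

    def write(self, cell, value):
        self.buffer[cell] = value  # buffered, not yet visible to others

    def commit(self):
        if self.mem.version != self.start_version:
            return False  # conflict: roll back, nothing becomes visible
        self.mem.cells.update(self.buffer)  # all writes appear at once
        self.mem.version += 1
        return True
```

If two transactions race on the same memory, the first commit wins and the second aborts cleanly instead of leaving a half-updated state, and no locks were taken.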
Multi-Core discussion: software
Currently, operating system concepts known from memory coupled multiprocessor systems are used: the operating system scheduler assigns independent processes to the available processors.
In contrast to these concepts, the closer core coupling of multi-core processors leads to a different "computation versus synchronization" ratio, allowing more fine-grained parallelism to be used.
Parallel computing will become the future standard programming model.
Most of the currently existing software is sequential and can thus run only on one core.
Programming languages and tools that exploit the fine-grained parallelism of multi-core processors need to be developed.
Furthermore, software engineering techniques are needed to allow the development of safe parallel programs.
Multi-Core discussion: software
Application development for multi-core processors will become one of the main future markets for computer scientists.
Today's applications have to be developed further with the goal to exploit parallelism, gain performance and increase comfort.
New applications, currently not realizable due to a lack of processor performance, will arise; these are hard to predict.
Possible applications must require high computational performance that is reachable through parallelism.
Such applications might come from speech recognition, image recognition, data mining, learning technologies or hardware synthesis.