Modern Computer Architecture


COMPUTER
ARCHITECTURE
Assoc.Prof. Stasys Maciulevičius
Computer Dept.
AMD road

The company began as a producer of logic chips, then entered the RAM chip business in 1975. That same year, it introduced a reverse-engineered clone of the Intel 8080 microprocessor
In February 1982, AMD became a licensed second-source manufacturer of Intel 8086 and 8088 processors
In 1991, AMD released the Am386, its clone of the Intel 386 processor
AMD's first in-house x86 processor was the K5, launched in 1996
2014
©S.Maciulevičius
2
AMD road

Later came the K6 (1997), Athlon (K7, 1999) and Athlon XP (2001) processors
The Opteron server processor was introduced in 2003, followed by the dual-core Opteron in 2005
After K8 came K10. In 2007, AMD released the first K10 processors: quad-core 3rd generation Opteron processors. This was followed by the Phenom processor for desktops. K10 processors came in dual-core, triple-core, and quad-core versions, with all cores on a single die
In January 2009, AMD released a new processor line, Phenom II, which came in dual-core, triple-core and quad-core variants
AMD K10 architecture

The new K10 architecture was based on the K8 architecture with some enhancements:
• The fetch unit fetches 32 bytes (256 bits) per clock cycle from the L1 instruction cache, double what CPUs based on the K8 architecture could fetch per clock cycle (Intel CPUs based on the Core microarchitecture, such as the Core 2 Duo, also fetch 32 bytes per clock cycle)
• A true 128-bit internal datapath. On previous CPUs based on the K8 microarchitecture the internal datapath was only 64 bits wide. This was a problem for SSE instructions, since the SSE registers, called XMM, are 128 bits long
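As a rough illustration of why the datapath width matters, the sketch below counts how many passes a 128-bit XMM operand needs through a 64-bit and through a 128-bit datapath. The function name passes_needed and the "one pass per datapath width" model are simplifying assumptions made here, not a description of the real K8 or K10 pipelines.

```python
XMM_BITS = 128  # width of an SSE (XMM) register, as stated on the slide


def passes_needed(operand_bits: int, datapath_bits: int) -> int:
    """Number of datapath-wide chunks needed to move one operand.

    Hypothetical helper: it simply rounds the ratio up to a whole number.
    """
    return -(-operand_bits // datapath_bits)  # ceiling division on integers


k8_passes = passes_needed(XMM_BITS, 64)    # K8-style 64-bit datapath
k10_passes = passes_needed(XMM_BITS, 128)  # K10-style 128-bit datapath

print("K8 :", k8_passes, "passes per XMM operand")
print("K10:", k10_passes, "pass per XMM operand")
```

Under this simplified model the wider datapath halves the number of passes per SSE operand.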
Barcelona
AMD’s APU
An accelerated processing unit (APU) is a processing system that includes additional processing capability designed to accelerate one or more types of computations outside of a CPU
• This may include a graphics processing unit (GPU) used for general-purpose computing (GPGPU), a field-programmable gate array (FPGA), or a similar specialized processing system

AMD’s APU
At the most basic level, AMD’s new Accelerated Processing Units combine general-purpose x86 CPU cores with programmable vector processing engines on a single silicon die
• AMD’s APUs also include a variety of critical system elements, including memory controllers, I/O controllers, specialized video decoders, display outputs, and bus interfaces

AMD view on APUs
AMD’s APU
AMD announced the first-generation APUs, Llano for high-performance and Brazos for low-power devices, in January 2011
• The second-generation Trinity for high-performance and Brazos-2 for low-power devices were announced in June 2012
• The third-generation Kaveri for high-performance devices was launched in January 2014, while Kabini and Temash for low-power devices were announced in summer 2013

AMD Fusion

AMD Fusion is the marketing name for a series of APUs by AMD, aimed at providing good performance with low power consumption, and integrating a CPU and a GPU based on a mobile stand-alone GPU
AMD Fusion
The first demonstration of AMD Fusion was at Computex 2010 (Taipei, Taiwan, June 2, 2010)
New AMD core - Bulldozer

Bulldozer is the codename AMD has given to one of the CPU cores based on the AMD family 15h microarchitecture
Bulldozer was designed from scratch rather than developed from earlier processors
AMD introduced a new microarchitecture building block called a module
In terms of hardware complexity and functionality, a module is midway between a dual-core processor (in which each core is fully independent) and a single processor core that has two SMT threads (in which each thread shares most of the hardware resources with the other thread)
AMD Bulldozer core

A module consists of two tightly coupled, "conventional" x86 out-of-order processing engines
The processing engines share the early pipeline stages (e.g. instruction fetch, decode), the FPUs, and the L2 cache
AMD Bulldozer core

Two dedicated integer cores

• each consists of two ALUs and two AGUs, which are capable of a total of 4 independent arithmetic and memory operations per clock per core
• duplicating the integer schedulers and execution pipelines gives each of the two threads dedicated hardware, which significantly increases performance in multithreaded integer applications
• the second integer core increases the Bulldozer module die area by around 12%, which at chip level adds about 5% of total die space
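A back-of-the-envelope sketch of the figures above. The constants come from the slide; the function name peak_int_ops_per_clock, the assumption that every unit is busy on every clock (a peak, not a sustained rate), and the area inference at the end are illustrative only.

```python
ALUS_PER_CORE = 2          # from the slide
AGUS_PER_CORE = 2          # from the slide
INT_CORES_PER_MODULE = 2   # two dedicated integer cores per module


def peak_int_ops_per_clock(modules: int) -> int:
    """Peak arithmetic + memory operations per clock across all integer cores."""
    per_core = ALUS_PER_CORE + AGUS_PER_CORE
    return modules * INT_CORES_PER_MODULE * per_core


for modules in (1, 2, 4):
    print(modules, "module(s):", peak_int_ops_per_clock(modules), "ops/clock peak")

# Area inference (an assumption, not a figure from the slide): if the second
# integer core adds 12% to the module area and that equals 5% of the chip,
# the modules without it occupy about 0.05 / 0.12 of the chip area.
module_share = 0.05 / 0.12
print(f"modules take roughly {module_share:.0%} of the die")
```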
AMD Bulldozer core

Two symmetrical 128-bit FMAC (fused multiply-add capability) floating-point pipelines per module, which can be unified into one large 256-bit-wide unit if one of the integer cores dispatches an AVX instruction, and two symmetrical x87/MMX/SSE-capable FPPs for backward compatibility with software not optimized for SSE2
Multiple modules share an L3 cache as well as an Advanced Dual-Channel Memory Sub-System (IMC, Integrated Memory Controller)
A dual-core Bulldozer processor has a single module, a quad-core processor has two modules and an octa-core processor has four modules
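The module organisation described above can be written down as a small data model. This is a minimal sketch: the class name BulldozerModule, its field names and the helper cores_in_processor are hypothetical, and only the numbers are taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BulldozerModule:
    """One Bulldozer module as described on the slides (names are illustrative)."""
    integer_cores: int = 2       # two dedicated integer cores
    fmac_pipes: int = 2          # two symmetrical FMAC pipelines, shared
    fmac_width_bits: int = 128   # width of each FMAC pipeline

    @property
    def unified_fp_width_bits(self) -> int:
        # Both pipelines ganged together for a 256-bit AVX instruction.
        return self.fmac_pipes * self.fmac_width_bits


def cores_in_processor(modules: int, module: BulldozerModule = BulldozerModule()) -> int:
    """Core count as marketed: every integer core counts as one core."""
    return modules * module.integer_cores


for modules in (1, 2, 4):
    print(modules, "module(s) ->", cores_in_processor(modules), "cores")
print("unified FP width:", BulldozerModule().unified_fp_width_bits, "bits")
```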
AMD Bulldozer core

The first shipments of Bulldozer-based Opteron processors began in September 2011
On 12 October 2011, AMD released the first four FX-series processors of the Bulldozer line (FX-8150, FX-8120, FX-6100, FX-4100)
AMD stated on its blog that “there are some in our community who feel the product performance did not meet their expectations”
AMD said that the remaining FX-series processors would be released at the end of the first quarter of 2012
AMD Piledriver
Improvements in the Piledriver

Improved branch prediction accuracy through a hybrid predictor augmented with a second-level predictor;
128-bit and 256-bit FMA3 instruction extensions (fused multiply-add) and F16C SSE5 instruction extensions (half-precision floating-point conversion);
Optimized schedulers;
Faster division through a modified execution unit;
Larger L1 TLB;
Improved L1 and L2 prefetchers that can work with variable-length patterns, including those on page boundaries;
Improved L2 cache efficiency through more aggressive removal of unused data that the prefetcher algorithms loaded into the cache by mistake.
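To make the two instruction extensions in the list more concrete, the sketch below emulates their numerical effect in plain Python. It does not execute FMA3 or F16C instructions: fused_multiply_add uses exact rational arithmetic to mimic the single rounding of a fused operation, and to_half_and_back uses the standard struct module's binary16 format to mimic a half-precision round trip. Both function names and the sample values are assumptions chosen for illustration.

```python
import struct
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def to_half_and_back(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision (binary16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


a, b, c = 1 + 2**-30, 1 - 2**-30, -1.0

separate = a * b + c                 # product is rounded, then the sum is rounded
fused = fused_multiply_add(a, b, c)  # only the final result is rounded

print("separate multiply and add:", separate)
print("fused multiply-add       :", fused)
print("0.1 stored as half       :", to_half_and_back(0.1))
```

With these inputs the separately rounded product loses the small term entirely, while the fused form keeps it, which is the accuracy benefit usually cited for fused multiply-add.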
New micro-architecture - x86 Steamroller

Steamroller is the third modular x86 architecture from AMD
• promises 15% to 20% higher performance per cycle per watt than the Piledriver microarchitecture released in Trinity,
• comes with a new integrated DDR3-2133 memory controller, plus PCI Express (PCIe) 3.0.

Steamroller-based Kaveri has up to 2 modules (4 integer cores) and 2 third-generation Flex FP floating-point units.
AMD Steamroller

The focus of Steamroller is greater parallelism. Improvements center on:
• independent instruction decoders for each core within a module,
• 25% more maximum-width dispatches per thread,
• better instruction schedulers,
• improved branch predictor,
• larger and smarter caches,
• …
AMD Steamroller

…
• up to 30% fewer instruction cache misses,
• branch misprediction rate reduced by 20%,
• dynamically resizable L2 cache,
• micro-operations queue,
• more internal register resources and an improved memory controller
From APU to HSA

AMD's first mainstream APU combined the CPU and a capable GPU, each with a separate slice of system memory, on the same chip
The Trinity APU added a memory management unit that allowed the GPU to see all of the physical system memory, shared power management, and support for OpenCL C++ and Microsoft C++ AMP
But the basic software model has remained the same; the CPU and GPU can't work together on the same data
From APU to HSA

The next step for HSA, heterogeneous Uniform Memory Access (hUMA), promises to solve this problem with three features:
• the CPU and GPU use the same pointers (addresses) to access the entire memory space to read and write data;
• they are cache coherent, so they can work on data at the same time without issues; and,
• like the CPU, the GPU supports paged virtual memory, which makes it possible to work with larger datasets
The net result is that the CPU and GPU can work together much more efficiently, and it should be easier to write applications that take advantage of both
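The difference between the two software models can be sketched in plain Python, with a bytearray standing in for system memory. This is only an analogy under stated assumptions: copy_model and shared_model are hypothetical names, no GPU or HSA runtime is involved, and the "GPU work" is just a loop that doubles each byte.

```python
def copy_model(host_data: bytearray) -> bytearray:
    """Separate memory slices: the 'GPU' works on its own copy of the data."""
    device_copy = bytearray(host_data)          # copy host -> device slice
    for i in range(len(device_copy)):
        device_copy[i] = (device_copy[i] * 2) % 256
    return bytearray(device_copy)               # copy device -> host slice


def shared_model(host_data: bytearray) -> None:
    """hUMA-like: the 'GPU' uses the same addresses, so nothing is copied."""
    view = memoryview(host_data)                # same underlying buffer
    for i in range(len(view)):
        view[i] = (view[i] * 2) % 256


data = bytearray([1, 2, 3])

result = copy_model(data)   # data is untouched; result is a second buffer
print(list(data), list(result))

shared_model(data)          # data itself is updated in place
print(list(data))
```

In the copy model every exchange costs two transfers and a second buffer; in the shared model both sides see one buffer, which is the property hUMA aims to provide in hardware.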
AMD HSA
Kaveri APU implements the HSA
ARM architecture

ARM is a family of instruction set architectures for computer processors based on a RISC architecture
The primary business of ARM Holdings (a British multinational semiconductor and software design company) is selling IP cores, which licensees use to create microcontrollers and CPUs based on those cores. The original design manufacturer combines the ARM core with other parts to produce a complete CPU
Today, the ARM architecture is licensed for use by many companies, including Apple, Intel, LG, Microsoft, NEC, Nintendo, Nvidia, Sony, Samsung, Sharp, Texas Instruments, Yamaha, and many more
ARM architecture

Processors based on designs licensed from ARM are used in all classes of computing devices, from microcontrollers in embedded systems (including real-time safety systems, smart TVs and all modern smartwatches) up to smartphones, tablets, laptops, servers and supercomputers/HPC
According to ARM Holdings, in 2010 alone, producers of chips based on ARM architectures reported shipments of 6.1 billion ARM-based processors, representing 95% of smartphones, 35% of digital televisions and set-top boxes and 10% of mobile computers
ARM architecture

The ARM architecture is one of the most successful on the planet
The original ARM architecture was heavily influenced by the Berkeley RISC architecture
ARM has a number of RISC features, such as a large register set, fixed-length instructions, and a purely load-store architecture
A modern ARM chip supports several instruction sets (ARM, Thumb, or Thumb-2; this increases the complexity of the instruction decoder)
ARM processor
Here we see a quad-core Cortex processor for a wide range of devices, from mobile devices to servers
2010-2014
©S.Maciulevičius
33
ARM A15 MPCore
Main components of the Cortex-A15 MPCore are:
• floating-point unit, performing operations on single- and double-precision numbers; the NEON extended instruction set is also implemented here, with media and signal processing operations and additional 64-bit and 128-bit registers; SIMD operations are carried out on 8-, 16- and 32-bit integers and 32-bit floating-point numbers;
• integer ALU, which generates 40-bit physical addresses, making it possible to address up to 1 TB of memory (an individual thread uses only a 32-bit address);
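The address-space figures follow directly from the address widths. A short check, assuming byte-addressable memory and binary units (1 TB here meaning 2^40 bytes); the helper name addressable_bytes is illustrative:

```python
GIB = 2 ** 30  # bytes in one gibibyte
TIB = 2 ** 40  # bytes in one tebibyte


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with an address of the given width (byte addressing)."""
    return 2 ** address_bits


physical = addressable_bytes(40)    # 40-bit physical address
per_thread = addressable_bytes(32)  # 32-bit address seen by one thread

print("40-bit physical:", physical // TIB, "TiB")
print("32-bit thread  :", per_thread // GIB, "GiB")
print("ratio          :", physical // per_thread)
```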
ARM A15 MPCore

32 kB data and 32 kB instruction L1 caches on each core, designed for minimum latency and power consumption; they implement data coherency measures to support multi-core environments, as well as error checking and correction (ECC);
SCU (Snoop Control Unit), responsible for managing the interconnect, arbitration, communication, cache-to-cache and system memory transfers, cache coherence and other capabilities of the processor;
ARM A15 MPCore

The 128-bit CoreLink CCI-400 provides AMBA 4 AXI™ Coherency Extensions (ACE) compliant ports for full coherency between multiple Cortex-A15 MPCore processors, making better use of caches and simplifying software development
This is essential for high-bandwidth applications, including gaming, servers and networking, that require clusters of coherent single-core and multicore processors
ARM big.LITTLE

big.LITTLE is a heterogeneous computing architecture developed by ARM Holdings, coupling (relatively) slower, low-power processor cores with (relatively) more powerful and power-hungry ones
Each pair operates as one virtual core, and only one real core is (fully) powered up and running at a time
The 'big' core is used when demand is high, the 'LITTLE' core when demand is low
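The pairing described above can be sketched as a tiny switching policy. This is a simplified illustration, not ARM's actual switcher: the class name VirtualCore, the load scale from 0 to 1 and the threshold of 0.6 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class VirtualCore:
    """One big/LITTLE pair presented to software as a single virtual core."""
    threshold: float = 0.6   # hypothetical switch point, as a fraction of load
    active: str = "LITTLE"   # only one real core of the pair runs at a time

    def update(self, load: float) -> str:
        """Pick which real core runs for the current demand."""
        self.active = "big" if load > self.threshold else "LITTLE"
        return self.active


core = VirtualCore()
trace = [core.update(load) for load in (0.1, 0.9, 0.3)]
print(trace)
```

A real policy would also weigh the cost of migrating state between the two cores; the sketch ignores that and switches purely on the current load.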
Processors for servers

A server is a system (an operating system plus the appropriate hardware) designed to provide a variety of services over the network: data files, computing, email, etc.
Servers have to meet extremely high requirements for reliability, speed and external storage capacity
Although conventional processors can be used in servers, special server processors are also developed and produced
Processors for servers

Server processors are distinguished by:

a larger number of cores,
a larger L3 cache,
support for hyperthreading,
higher reliability,
the ability to work in multi-processor systems,
higher energy consumption and a high price.
Processors for servers

Intel server processors are known as Xeons (recent processors: Xeon E3, E5, E7; for example, Xeon E7-2870: 2400 MHz, 10 cores, $4227, up to 30 MB L3, 130 W, 4-channel DDR3 support)
AMD server processors are known as Opterons (recent processors: series 4300, 6100, 6200, 6300, A1100; for example, Opteron™ 6180 SE: 2500 MHz, 12 cores, $1514, 2x6 MB L3, 140 W, 4-channel DDR3 support)
IBM server processors are known as Power processors (Power 6, Power 7)