The NUMAchine Multiprocessor
ICPP 2000
Westin Harbour Castle, August 24, 2000
Presentation Overview
Architecture
  System Overview
  Key Features
    Fast ring routing
    Hardware Cache Coherence
    Memory Model: Sequential Consistency
Simulation Studies
  Ring performance
  Network Cache performance
  Coherence overhead
Prototype Performance
Hardware Status
Conclusion
University of Toronto
System Architecture
Hierarchical ring network, based on clusters (NUMAchine’s ‘Stations’), which are themselves bus-based SMPs
NUMAchine’s Key Features
Hierarchical rings
Allow for very fast and simple routing
Provide good support for broadcast and multicast
Hardware Cache Coherence
Hierarchical, directory-based, CC-NUMA system
Writeback/Invalidate protocol, designed to use the
broadcast/ordering properties of rings
Sequentially Consistent Memory Model
The most intuitive model for programmers trained on uniprocessors
Simple, low cost, but with good flexibility, scalability and performance
Fast Ring Routing: Filtermasks
Fast ring routing is achieved by the use of Filtermasks (i.e., simple bit-masks) to store cache-line location information (imprecision reduces directory storage requirements)
These Filtermasks are used directly by the routing hardware in the
ring interfaces
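The idea above can be sketched in a few lines: one bit per Station, set when that Station may acquire a copy, and read directly as a multicast routing vector. This is an illustrative sketch only; the class and field names, and the 16-Station size, are assumptions, not the actual hardware encoding.

```python
NUM_STATIONS = 16  # assumed system size, for illustration


class Filtermask:
    """Illustrative NUMAchine-style filtermask: one bit per Station."""

    def __init__(self):
        self.mask = 0  # empty: no Station may have a copy

    def add_station(self, station_id):
        # Setting a bit is cheap and is not eagerly cleared, which is
        # exactly what makes the directory imprecise: a set bit means
        # "may have a copy", not "definitely has a copy".
        self.mask |= 1 << station_id

    def may_have_copy(self, station_id):
        return bool(self.mask & (1 << station_id))

    def targets(self):
        # The ring interface can use the mask directly as a multicast
        # routing vector: deliver to every flagged Station.
        return [s for s in range(NUM_STATIONS) if self.may_have_copy(s)]


fm = Filtermask()
fm.add_station(2)
fm.add_station(5)
assert fm.targets() == [2, 5]
```

Because the mask is a superset of the true sharer set, routing hardware never misses a copy; the cost of imprecision is only the occasional unnecessary delivery.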
Hardware Cache Coherence
Hierarchical, directory-based, writeback/invalidate
Directory entries are stored in the per-Station memory (the ‘home’ location), and cached in the network interfaces (hence the name, Network Cache)
The Network Cache stores both the remotely cached directory information,
as well as the cache lines themselves, and allows the network interface to
perform coherence operations locally (on-Station), avoiding remote accesses
to the home directory
Filtermasks indicate which Stations (i.e., clusters) may potentially have a copy of a cache line (with the fuzziness due to the imprecise nature of the Filtermasks)
Processor Masks are used only within a Station, to indicate which particular caches may contain a copy (with the fuzziness here due to Shared lines that may have been silently ejected)
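Putting the two masks together, a directory entry can be pictured as below. This is a hypothetical sketch: the state names, field widths, and 16-Station/4-CPU sizes are assumptions chosen for illustration, not the real directory layout.

```python
STATIONS = 16         # assumed number of Stations (clusters)
CPUS_PER_STATION = 4  # assumed CPUs per bus-based SMP Station


class DirectoryEntry:
    """Illustrative directory entry: Filtermask plus Processor Mask."""

    def __init__(self):
        self.filtermask = 0     # Stations that *may* hold the line
        self.proc_mask = 0      # on-Station caches that *may* hold it
        self.state = "Invalid"  # illustrative coherence state

    def record_station(self, station):
        self.filtermask |= 1 << station

    def record_processor(self, cpu):
        # Bits are set on a fill but not cleared on a silent ejection
        # of a Shared line, so this mask is also a superset.
        self.proc_mask |= 1 << cpu

    def invalidation_targets(self):
        # Multicast an invalidate to every Station flagged in the
        # Filtermask; some targets may no longer hold the line.
        return [s for s in range(STATIONS) if self.filtermask >> s & 1]


entry = DirectoryEntry()
entry.state = "Shared"
entry.record_station(1)
entry.record_station(3)
entry.record_processor(0)
assert entry.invalidation_targets() == [1, 3]
```

The Network Cache holding such an entry can resolve many coherence actions on-Station, consulting the home directory only when the local information is insufficient.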
Memory Model: Sequential Consistency
The most intuitive model for the normally trained programmer:
increases the usability of the system
Easily supported by NUMAchine’s ring network: the only change
necessary is to force invalidates to pass through a global
‘sequencing point’ on the ring, increasing the average invalidation
latency by 2 ring hops (40 ns with our default 50 MHz rings)
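The 40 ns figure follows directly from the ring clock: at 50 MHz each hop takes one 20 ns clock period, and detouring through the sequencing point adds two hops on average.

```python
# Arithmetic behind the slide's 40 ns figure.
ring_clock_mhz = 50
hop_ns = 1000 / ring_clock_mhz  # one hop per clock: 20 ns
extra_hops = 2                  # average detour via the sequencing point
extra_latency_ns = extra_hops * hop_ns
assert extra_latency_ns == 40.0
```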
Simulation Studies: Ring Performance 1
Use the SPLASH-2 benchmark suite, and a cycle-accurate hardware simulator with full modeling of the coherence protocol
Applications with high communication-to-computation ratios (e.g.
FFT, Radix) show high utilizations, particularly in the Central Ring
(indicating that a faster Central Ring would help)
Simulation Studies: Ring Performance 2
Maximum and average ring interface queue depths indicate the network congestion, which correlates with bursty traffic
Large differences between the maximum and average values indicate large variability in burst size
Simulation Studies: Network Cache
Graphs show a measure of the Network Cache’s effect by looking at the hit rate (i.e., the reduction in remote data and coherence traffic)
By categorizing the hits by the coherence directory state, we also
see where the benefits come from: caching shared data, or
reducing invalidations and coherence traffic
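The categorization described above can be sketched as a simple tally of hits keyed by directory state. The function name, the observation format, and the state names in the example trace are all illustrative assumptions, not the simulator's actual instrumentation.

```python
from collections import Counter


def nc_hit_stats(accesses):
    """Split an access trace into overall hit rate and hits by state.

    accesses: list of (hit: bool, directory_state: str) observations,
    one per Network Cache lookup (an assumed trace format).
    """
    hits_by_state = Counter(state for hit, state in accesses if hit)
    hit_rate = sum(hits_by_state.values()) / len(accesses)
    return hit_rate, hits_by_state


# Illustrative trace, not measured data.
trace = [(True, "Shared"), (True, "Shared"),
         (True, "Dirty"), (False, "Invalid")]
rate, by_state = nc_hit_stats(trace)
assert rate == 0.75
assert by_state["Shared"] == 2
```

Hits in a Shared-like state point to the benefit of caching shared data; hits in ownership states point to avoided invalidation and coherence traffic.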
Simulation Studies: Coherence Overhead
Measure the overhead due to cache coherence by allowing all writes to succeed immediately without checking cache-line state, and comparing against runs with the full cache coherence protocol in place (both using infinite-capacity Network Caches to avoid measurement noise due to capacity effects)
Results indicate that in many cases it is basic data locality and/or
poor parallelizability that are impeding performance, not cache
coherence
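The metric implied by this methodology can be restated as a relative slowdown. The function name and the numbers in the example are illustrative, not measured NUMAchine results.

```python
def coherence_overhead(t_full_cc, t_writes_free):
    """Fractional slowdown of the full-protocol run relative to the
    run in which every write succeeds immediately (coherence off)."""
    return (t_full_cc - t_writes_free) / t_writes_free


# Illustrative: a full-protocol run taking 1.2x the unchecked-write
# run corresponds to 20% coherence overhead.
assert round(coherence_overhead(1.2, 1.0), 2) == 0.2
```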
Prototype Performance
Speedups from the hardware prototype, compared against
estimates from the simulator
Hardware Prototype Status
Fully operational running the custom Tornado OS, with a 32-processor system shown below
Conclusion
4- and 8-way SMPs are fast becoming commodity items
The NUMAchine project has shown that a simple, cost-effective, CC-NUMA
multiprocessor can be built using these SMP building blocks and a simple
ring network, and still achieve good performance and scalability
In the medium-scale range (a few tens to hundreds of processors), rings are
a good choice for a multiprocessor interconnect
We have demonstrated an efficient hardware cache coherence scheme,
which is designed to make use of the natural ordering and broadcast
capabilities of rings
NUMAchine’s architecture efficiently supports a sequentially consistent
memory model, which we feel is essential for increasing the ease of use and
programmability of multiprocessors
Acknowledgments: The NUMAchine Team
Operating Systems
  Prof. Michael Stumm
  Orran Krieger (IBM)
  Ben Gamsa
  Jonathon Appavoo
Hardware
  Prof. Zvonko Vranesic
  Prof. Stephen Brown
  Robin Grindley (SOMA Networks)
  Alex Grbic
  Robert Ho
  Prof. Zeljko Zilic (McGill)
  Steve Caranci (Altera)
  Derek DeVries (OANDA)
  Guy Lemieux
  Kelvin Loveless (GNNettest)
  Paul McHardy
  Mitch Gusat (IBM)
Compilers
  Prof. Tarek Abdelrahman
  Prof. Naraig Manjikian (Queen's)
  Prof. Sinisa Srbljic (Zagreb)
Applications
  Prof. Ken Sevcik