Space Filling Curves

Space Filling Curves
cache efficiency and parallelization of
numerical simulations
7/18/2015
Space Filling Curves – cache efficiency and
parallelization of numerical simulations

Agenda
Motivation
Caching
Numeric without SFC
Numeric using SFC / stack architecture
Parallelization
  Partitioning without SFC
  Partitioning using SFC
  Repartitioning due to adaptivity

Computer architecture
(Diagram: CPU, memory and input/output unit, connected by address, instruction and data lines)

Command execution
Read instruction (bus access)
Interpret instruction
Read operands (bus access)
Calculate / Shift / …
Write results back (bus access)
=> the memory bus is the bottleneck

Cycle times of CPU vs. Memory I
(Figure: cycle times of CPU and main memory in ns (10^-9 s) on a logarithmic axis from 0.1 to 10000, plotted per year from 1972 to 2002. Annotations on the gap between the two curves: factor 2.5 and factor > 200.)

Cycle times of CPU vs. Memory II
CPU and memory cycle times have developed differently
  Main memory access wastes CPU cycles
  Fast memory is available, but too expensive and too small
Solution: keep data in different memories
  Try to keep frequently used data in fast memory
Use of a memory hierarchy
  Big slow memory
  Small fast memory

Memory hierarchy
Level                 Available size    Speed
Registers             ~1 KB             ~0.5 ns
Cache (L1, L2, L3)    16 KB – 4 MB      0.5-25 ns
Main memory           ~1 GB             ~80 ns
Disk memory           ~1 TB             ~5 ms
Archive memory        >> 1 TB           >> 1 s

Caching
Keep a copy of recently used data in fast, easily accessible memory (cache)
  CPU <-> Cache <-> Memory
  Also: Web surfer <-> HTTP proxy <-> Web server
Use the locality properties of programs
  Temporal locality: recently used variables will probably be used again soon
  Spatial locality: memory locations near recently used ones are likely to be used soon

Cache efficiency
Cache hit: the memory access can be served from the cache
Cache miss: the requested data is not in the cache and must be fetched from memory
Cache efficiency:
  Ratio between cache hits and misses
  Aim: >> 95% cache hits
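
The hit ratio can be explored with a toy model. The sketch below is only an illustration under stated assumptions: a fully associative cache with LRU replacement, and hypothetical names (`hit_rate`, `num_lines`, `line_size`) that do not come from the slides. It compares a row-wise and a column-wise sweep over a field stored row by row.

```python
from collections import OrderedDict

def hit_rate(addresses, num_lines, line_size):
    """Fraction of accesses served by a fully associative LRU cache (toy model)."""
    cache = OrderedDict()          # keys: cache-line numbers, ordered by recency
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_size   # spatial locality: neighbours share a line
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / total

n = 64  # field is n x n, stored row by row, one element per address
row_wise = [i * n + j for i in range(n) for j in range(n)]
col_wise = [i * n + j for j in range(n) for i in range(n)]

row_rate = hit_rate(row_wise, num_lines=8, line_size=16)
col_rate = hit_rate(col_wise, num_lines=8, line_size=16)
print(f"row-wise hit rate:    {row_rate:.4f}")
print(f"column-wise hit rate: {col_rate:.4f}")
```

In this model the row-wise sweep misses only on the first element of each cache line, while the column-wise sweep evicts every line before it is touched again.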

Numerical algorithms
Discretization of a PDE leads to a linear equation system (LES) Au = b
The LES is solved by an iterative algorithm such as
  Jacobi
  Gauss-Seidel
Repeated evaluation of a 5-point stencil on the two-dimensional field u
Assume a large field u in main memory
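
A minimal sketch of such a repeated stencil evaluation. The slides do not name the PDE, so this assumes the Laplace equation on a square grid with fixed boundary values; `jacobi_sweep` and the damping parameter `omega` are hypothetical names chosen for illustration.

```python
def jacobi_sweep(u, omega=1.0):
    """One (optionally damped) Jacobi sweep for the Laplace equation on a 2D grid.

    u is a list of rows; boundary values are kept fixed.
    """
    n_rows, n_cols = len(u), len(u[0])
    new = [row[:] for row in u]           # Jacobi reads only old values
    for j in range(1, n_rows - 1):
        for i in range(1, n_cols - 1):
            # 5-point stencil: centre plus the four direct neighbours
            neighbours = u[j - 1][i] + u[j][i - 1] + u[j][i + 1] + u[j + 1][i]
            new[j][i] = (1.0 - omega) * u[j][i] + omega * 0.25 * neighbours
    return new

# 6 x 6 field: boundary fixed at 1.0, interior starts at 0.0
n = 6
u = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0 for i in range(n)]
     for j in range(n)]
for _ in range(200):
    u = jacobi_sweep(u)
error = max(abs(1.0 - v) for row in u for v in row)
print(error)  # the interior approaches the boundary value 1.0
```

Each sweep touches rows j - 1, j and j + 1 of the field, which is why the size of three lines of u matters for cache behaviour.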

5 point stencil
Field u (assume n = 5000)
Calculating û_{5,4}
Needs: u_{5,3}, u_{4,4}, u_{5,4}, u_{6,4}, u_{5,5}
At memory positions: 10005, 15004, 15005, 15006, 20005
Memory needs:
  u: 200 MB >> cache size
  3 lines of u: 120 KB > L1 cache
(Figure: the field u, with axis labels 1, 5, …, n)
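
The numbers on this slide can be reproduced with a short calculation. The sketch assumes a row-by-row layout with 1-based positions and 8 bytes per value (double precision), which is what the stated 200 MB for n = 5000 implies; `position` is a hypothetical helper name.

```python
def position(i, j, n):
    """1-based memory position of u_{i,j} in a row-by-row n x n field."""
    return (j - 1) * n + i

n = 5000
bytes_per_value = 8   # assumption: double precision

stencil = [(5, 3), (4, 4), (5, 4), (6, 4), (5, 5)]
positions = [position(i, j, n) for i, j in stencil]
print(positions)  # [10005, 15004, 15005, 15006, 20005]

field_bytes = n * n * bytes_per_value
three_lines_bytes = 3 * n * bytes_per_value
print(field_bytes / 1e6, "MB")        # 200.0 MB
print(three_lines_bytes / 1e3, "KB")  # 120.0 KB
```

The vertical neighbours are n positions apart, so one stencil evaluation spans roughly two full lines of u in memory.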

Benchmark – 5 point stencil
Tested on Pentium IV Xeon with:
  3D with 1.25·10^8 elements
  512 KBytes of cache (128 bytes per line)
  L2 cache miss rate: 15.00%

FEM using SFC
Now compute element-based (cell-based) values and distribute them onto the nodes of the grid
Read and write only a few top elements of the stacks
  These should be in cache
  They are used several times
(Figure: grid with axis labels 1, 5, …, n and labels 0 to 4)

FEM using SFC II
Elements are stored in caches according to the number of accesses

FEM using SFC III
Cache 4 stores nodes that have been accessed from all surrounding cells
These nodes can be stored on hard disk, as they will not be needed again during the current iteration

FEM using SFC IV
The observed points are now "covered" by other nodes
The distance from the top of the stack is important
How big do the stacks grow?

SFC
By construction, SFCs fill squares (cubes)
  -> Covered areas stay compact
  -> Borders (surfaces) tend to be small
Worst case: (figure)

Results
The number of nodes on the border is small
  -> n: number of nodes in one dimension
  -> #nodes in grid: > n^d
  -> #nodes in border: = O(n^{d-1})
Stacks stay small
Elements are always taken from the top of the stack
  -> The fewer elements lie above an element, the more likely it is to be used soon
  -> Elements near the top of the stack are used several times within a short period
  -> This can be used to implement the stacks efficiently
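
To see how quickly the border share shrinks, the sketch below counts surface nodes of a regular grid with n nodes per dimension. The regular cube is a simplifying assumption (the slides refer to the border of the region covered by the curve), and the helper names are hypothetical.

```python
def total_nodes(n, d):
    """All nodes of a regular d-dimensional grid with n nodes per dimension."""
    return n ** d

def border_nodes(n, d):
    """Nodes on the surface of that grid (n >= 2): all nodes minus the interior."""
    return n ** d - (n - 2) ** d

for n in (10, 100, 1000):
    for d in (2, 3):
        share = border_nodes(n, d) / total_nodes(n, d)
        print(f"n={n:5d} d={d}: border={border_nodes(n, d):8d} share={share:.4f}")
```

The border count grows like n^{d-1} while the total grows like n^d, so the share falls roughly like 1/n.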

Implementation of SFC – stacks
Parts of stack
  Top:
    Used soon
    Stays in cache / registers
  Center:
    Used in near future
    Should be loaded into main memory
  Bottom:
    Will be used in the "far" future
    Can be stored on disk
(Figure: stack divided into top, center and bottom, next to the memory hierarchy: registers, cache (L1, L2, L3), main memory, disk memory, archive memory)
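
The top / center / bottom split can be imitated with a two-tier stack. This is a toy sketch, not the implementation behind the slides: `TieredStack`, `fast_capacity` and the block-wise spill policy are assumptions made for illustration, with `fast` standing for cache or registers and `slow` for main memory or disk.

```python
class TieredStack:
    """LIFO stack whose top part lives in a small 'fast' buffer (toy model)."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = []       # top of the stack
        self.slow = []       # center and bottom of the stack
        self.transfers = 0   # number of block moves between the tiers

    def push(self, item):
        if len(self.fast) == self.fast_capacity:
            # spill the older half of the fast buffer in one block
            half = max(1, self.fast_capacity // 2)
            self.slow.extend(self.fast[:half])
            del self.fast[:half]
            self.transfers += 1
        self.fast.append(item)

    def pop(self):
        if not self.fast:
            if not self.slow:
                raise IndexError("pop from empty stack")
            # refill half a buffer from the slow tier in one block
            half = max(1, self.fast_capacity // 2)
            self.fast = self.slow[-half:]
            del self.slow[-half:]
            self.transfers += 1
        return self.fast.pop()

stack = TieredStack(fast_capacity=8)
for k in range(100):
    stack.push(k)
popped = [stack.pop() for _ in range(100)]
print(popped[:5], stack.transfers)
```

Because only the top is read and written, data moves between the tiers in whole blocks and only when the fast buffer overflows or runs empty.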

Implementation of Peano – stacks
Tested on Pentium IV Xeon with:
  3D with 10^8 elements
  512 KBytes of cache (128 bytes per line)
  L2 cache miss rate: ~0.01%
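
A rough feel for what the two measured miss rates (15% for the plain stencil, ~0.01% with the stacks) mean per access can be had from a simple one-level model: average time = hit time + miss rate × miss penalty. The latencies below are assumptions taken from the memory hierarchy slide, not measurements from the benchmark, and `average_access_time` is a hypothetical name.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Simple one-level model: every access pays the hit time, misses pay extra."""
    return hit_time_ns + miss_rate * miss_penalty_ns

HIT_TIME_NS = 25.0      # assumption: upper end of the cache range (0.5-25 ns)
MISS_PENALTY_NS = 80.0  # assumption: main memory latency (~80 ns)

without_sfc = average_access_time(HIT_TIME_NS, 0.15, MISS_PENALTY_NS)
with_sfc = average_access_time(HIT_TIME_NS, 0.0001, MISS_PENALTY_NS)
print(f"15% misses:   {without_sfc:.3f} ns")
print(f"0.01% misses: {with_sfc:.3f} ns")
```

Under these assumptions the average access drops from about 37 ns to about 25 ns, so almost the entire main memory penalty disappears.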

Parallelization of FEM I
Stored nodes contain the calculated contributions of the neighbouring elements (cells)
The grid can be irregular, i.e. adaptively refined

Parallelization of FEM II
Requirements of partition
  Handle adaptively (not regularly) refined grids
  Load balancing (same number of cells for each process)
  Minimal border size (min. communication)
(Figure: grid divided among process 1, process 2 and process 3)

Partition algorithms
NP-complete => use of heuristics
  Scheduling
  Partitions-Processor
  Recursive spectral bisection
  Recursive coordinate bisection
  Inertial recursive bisection
  Space filling curves

Parallelization using SFC I
The SFC fills the adaptively refined grid
Cutting the one-dimensional SFC into pieces of equal length can be done easily in O(n)
SFCs tend to have small surfaces (as seen before)
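
A minimal sketch of cutting a curve into equal pieces. Two simplifications are assumed here: it uses a Hilbert curve on a regular 8 x 8 grid instead of the Peano curve on an adaptively refined grid, and the names `hilbert_d2xy` and `partition` are hypothetical. The index-to-coordinate mapping is the well-known iterative Hilbert construction.

```python
def hilbert_d2xy(order, d):
    """Map curve index d to cell (x, y) on a 2^order x 2^order Hilbert curve."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate the quadrant if needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def partition(num_cells, num_procs):
    """Cut the index range [0, num_cells) into contiguous, nearly equal pieces."""
    base, extra = divmod(num_cells, num_procs)
    bounds = []
    start = 0
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

order = 3                                   # 8 x 8 grid
cells = [hilbert_d2xy(order, d) for d in range(4 ** order)]
pieces = partition(len(cells), 5)
owner = {}
for proc, (lo, hi) in enumerate(pieces):
    for d in range(lo, hi):
        owner[cells[d]] = proc
for y in range(2 ** order - 1, -1, -1):     # print the grid, top row first
    print(" ".join(str(owner[(x, y)]) for x in range(2 ** order)))
```

The partitioning itself is a single pass over the one-dimensional index range, and each process receives a contiguous stretch of the curve.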

Parallelization using SFC II
(Figure: grid divided among process 1 to process 5)

Parallelization using SFC III
On entering the processing area
  border values must be in the correct stack
How can border values be brought into the correct stack without running along the whole curve?

Parallelization using SFC IV
Finest level at the border
Coarse level for the rest

Repartitioning
During iteration, based on error estimations by using extensions of element values, the algorithm adjusts refinement locally
On-the-fly repartitioning
  Obey the same requirements as partitioning
    Small borders
    Same number of cells for all processes
    Able to handle adaptively refined grids
    Fast
    Using a small amount of memory
    Distributed in parallel
    Minimizing data transfer

Repartition algorithms
Scratch remap algorithms
  Idea:
    New partition of the area
    Intelligent remap onto the old partition to minimize data transfer
  Can change the initial partition completely
    Lots of data transfer needed
    Fast
    Useful results when repartitioning after each adaptive refinement
Diffusion based repartitioning
  Idea:
    Exchanging workload with direct neighbours
  Only appropriate when:
    Refinements are globally distributed
    Only slight refinements preceded the repartitioning
  => Good results, even when repartitioned seldom
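
The diffusion idea can be sketched for processes arranged in a chain, as they are along a space-filling curve. This is an illustration only: it uses a basic first-order diffusion scheme with a hypothetical exchange factor `alpha` and treats the workload as a real number, whereas an actual code would move whole cells.

```python
def diffusion_step(load, alpha=0.25):
    """One diffusion step on a chain of processes (neighbours along the curve)."""
    new = load[:]
    for p in range(len(load) - 1):
        flow = alpha * (load[p] - load[p + 1])   # work moved from p to p + 1
        new[p] -= flow
        new[p + 1] += flow
    return new

load = [100.0, 20.0, 20.0, 20.0, 40.0]   # cells per process after a refinement
for _ in range(200):
    load = diffusion_step(load)
print([round(x, 2) for x in load])
```

Each process talks only to its direct neighbours and the total workload is conserved, but a strong local imbalance needs many steps to spread, which fits the restriction to slight, globally distributed refinements.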

Focus of research
Repartitioning a field which is:
  Partitioned by SFC
  Traversed using the presented stack algorithms
Problems:
  Adjustment is only possible in one dimension
  How to reorganize the stacks
  Almost certainly several problems not yet discovered :-)

Any questions?