Space Filling Curves
cache efficiency and parallelization of
numerical simulations
7/18/2015
Space Filling Curves – cache efficiency and
parallelization of numerical simulations
1
Agenda
Motivation
Caching
Numeric without SFC
Numeric using SFC / stack architecture
Parallelization
Partitioning without SFC
Partitioning using SFC
Repartitioning due to adaptivity
Computer architecture
[Figure: CPU connected to memory and to the input/output unit; the connections are labeled address, instruction, and data]
Command execution
Read instruction
(bus access)
Interpret instruction
Read operands
(bus access)
Calculate / Shift / …
Write results back
(bus access)
=> the memory bus is the bottleneck
Cycle times of CPU vs. Memory I
[Chart: cycle times of CPU vs. main memory, in $10^{-9}$ s on a logarithmic axis from 0.1 to 10000, by year from 1972 to 2002. Series: Memory, CPU. Annotations: factor 2.5, factor > 200.]
Cycle times of CPU vs. Memory II
CPU and memory cycle times have developed at different rates
Main memory access wastes CPU cycles
Fast memory is available, but it is too expensive and too small
Solution: keep data in different memories
Try to keep frequently used data in fast memory
Use a memory hierarchy
[Figure: big slow memory, small fast memory]
Memory hierarchy
Level               Available size   Speed
Registers           ~1KB             ~0.5ns
Cache (L1, L2, L3)  16KB – 4MB       0.5-25ns
Main memory         ~1GB             ~80ns
Disk memory         ~1TB             ~5ms
Archive memory      >> 1TB           >> 1s
Caching
Keep a copy of recently used data in fast-access memory (cache)
CPU <-> Cache <-> Memory
Also: Web surfer <-> HTTP proxy <-> Web server
Exploit the locality properties of programs
Temporal locality: recently used variables are likely to be used again soon
Spatial locality: memory locations near recently used ones are likely to be used soon
Cache efficiency
Cache hit: the memory access can be served from the cache
Cache miss: the requested data is not in the cache and must be fetched from memory
Cache efficiency: ratio between cache hits and misses
Aim: >>95% cache hits
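To make hits and misses concrete, the following is a minimal sketch of a cache simulator. It assumes a fully associative cache with LRU replacement, measures the hit rate as hits divided by all accesses, and uses made-up sizes (64 lines of 16 elements); none of these choices come from the slides. It compares a row-wise and a column-wise traversal of a row-major 256 x 256 array.

```python
from collections import OrderedDict


def hit_rate(addresses, cache_lines=64, line_size=16):
    """Simulate a fully associative LRU cache; return hits / accesses.

    Addresses and line_size are counted in array elements (an assumption
    made to keep the example small).
    """
    cache = OrderedDict()  # cached line numbers, least recently used first
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True             # load the line on a miss
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / total


n = 256  # hypothetical array side length
row_wise = [i * n + j for i in range(n) for j in range(n)]  # stride 1
col_wise = [i * n + j for j in range(n) for i in range(n)]  # stride n

print("row-wise hit rate:   ", hit_rate(row_wise))
print("column-wise hit rate:", hit_rate(col_wise))
```

With these assumed sizes the row-wise traversal hits in 15 of 16 accesses, while the column-wise traversal misses every time, because each line is evicted before its next element is requested.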
Numerical algorithms
Discretizing a PDE leads to a linear equation system (LES) Au = b
The LES is solved by an iterative algorithm such as
Jacobi
Gauss-Seidel
Repeated evaluation of a 5-point stencil on the two-dimensional field u
Assume a large field u in main memory
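As an illustration of what one stencil sweep looks like, here is a minimal sketch of a (damped) Jacobi iteration. It assumes the Laplace equation with zero right-hand side and fixed boundary values as a model problem; the function name `jacobi_sweep` and the damping parameter `omega` are illustrative and not taken from the slides.

```python
def jacobi_sweep(u, omega=1.0):
    """Return a new field after one damped Jacobi sweep (Laplace model problem).

    Every interior value is replaced using the 5-point stencil: the value
    itself and its four direct neighbours. Boundary values stay fixed.
    """
    n = len(u)
    new = [row[:] for row in u]  # Jacobi reads only old values, so copy first
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            neighbours = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
            new[i][j] = (1.0 - omega) * u[i][j] + omega * 0.25 * neighbours
    return new


# Small example: boundary fixed to 1, interior starts at 0.
size = 6
field = [[1.0 if i in (0, size - 1) or j in (0, size - 1) else 0.0
          for j in range(size)] for i in range(size)]
for _ in range(200):
    field = jacobi_sweep(field)
print(field[2][2])  # approaches the boundary value
```

Each sweep reads rows i-1, i, and i+1 of the field, which is why the size of three rows matters for cache behaviour once the field is large.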
5 point stencil
Field u (assume n = 5000)
Calculating $\hat{u}_{5,4}$
Needs: $u_{5,3}, u_{4,4}, u_{5,4}, u_{6,4}, u_{5,5}$
At memory positions: 10005, 15004, 15005, 15006, 20005
Memory needs:
u: 200MB >> cache size
3 lines of u: 120KB > L1 Cache
[Figure: field u with rows and columns labeled 1 … n]
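The numbers on this slide can be reproduced with a short calculation. The sketch below assumes 1-based indices, a layout in which the first index runs fastest, and 8 bytes per value (double precision); these assumptions are inferred from the slide's figures rather than stated on it.

```python
def position(i, j, n):
    """Linear memory position of u[i, j] (1-based, first index runs fastest)."""
    return (j - 1) * n + i


n = 5000
bytes_per_value = 8  # assumption: double precision

stencil = [(5, 3), (4, 4), (5, 4), (6, 4), (5, 5)]
positions = [position(i, j, n) for i, j in stencil]
field_mb = n * n * bytes_per_value / 1e6        # whole field u in MB
three_lines_kb = 3 * n * bytes_per_value / 1e3  # three lines of u in KB

print(positions)       # [10005, 15004, 15005, 15006, 20005]
print(field_mb)        # 200.0
print(three_lines_kb)  # 120.0
```

The neighbours in the second index are a full line (n values) apart in memory, so a single stencil evaluation touches three widely separated regions.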
Benchmark – 5 point stencil
Tested on a Pentium IV Xeon with 512 KBytes of cache (128 bytes per cache line)
3D with $1.25 \cdot 10^8$ elements
L2 cache miss rate: 15.00%
FEM using SFC
Now calculate element (cell) based values and distribute them onto the nodes of the grid
Read and write only the few top elements of the stacks
These should be in cache
They are used several times
[Figure: grid with rows and columns labeled 1 … n, and stacks labeled 0 to 4]
FEM using SFC II
Elements are stored in caches according to the number of accesses
FEM using SFC III
Cache 4 stores nodes which have been accessed from all surrounding cells
These nodes can be stored on hard disk, as they won't be needed again during the current iteration
FEM using SFC IV
Now the watched points are "covered" by other nodes
The distance from the top of the stack is important
How big do the stacks grow?
SFC
By construction, SFCs fill squares (cubes)
Covered areas stay compact
Borders (surface) tend to be small
Worst case:
Results
The number of nodes on the border is small
n: number of nodes in one dimension
#nodes in grid: $n^d$
#nodes in border: $O(n^{d-1})$
Stacks stay small
Elements are always taken from the top of the stack
The fewer elements lie above an element, the more likely it is to be used soon
Elements near the top of the stack are used several times within a short period
This can be used to implement the stacks efficiently
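As a rough reading of the two node counts on this slide (a sketch that ignores constant factors and takes d as the space dimension), the share of border nodes shrinks as the grid gets finer:

```latex
\frac{\#\text{nodes in border}}{\#\text{nodes in grid}}
  = \frac{O\!\left(n^{d-1}\right)}{n^{d}}
  = O\!\left(\frac{1}{n}\right)
```

This is why the stacks, which hold border nodes, stay small relative to the whole grid.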
Implementation of SFC – stacks
Parts of the stack
Top: used soon; stays in cache / registers
Center: used in the near future; should be loaded into main memory
Bottom: will be used in the "far" future; can be stored on disk
[Figure: stack parts (top, center, bottom) mapped onto the memory hierarchy: registers, cache (L1, L2, L3), main memory, disk memory, archive memory]
Implementation of Peano – stacks
Tested on a Pentium IV Xeon with 512 KBytes of cache (128 bytes per cache line)
3D with $10^8$ elements
L2 cache miss rate: ~0.01%
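To get a feeling for what such a miss rate means, the standard average-access-time formula (hit time plus miss rate times miss penalty) can be evaluated. The sketch below is only illustrative: it plugs in the rough latencies from the memory hierarchy slide (25 ns as the upper end of the cache range, ~80 ns for main memory) and compares this miss rate with the 15% measured for the plain 5-point stencil benchmark. The two benchmarks differ in size and algorithm, so this is not a measured speedup.

```python
def average_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns


cache_ns = 25.0   # assumption: upper end of the 0.5-25 ns cache range
memory_ns = 80.0  # assumption: ~80 ns main memory access

without_sfc = average_access_time(cache_ns, 0.15, memory_ns)    # 15% misses
with_stacks = average_access_time(cache_ns, 0.0001, memory_ns)  # ~0.01% misses

print(f"plain stencil: {without_sfc:.3f} ns per access")
print(f"SFC stacks:    {with_stacks:.3f} ns per access")
```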
Parallelization of FEM I
Stored nodes contain the calculated contributions of the neighbouring elements (cells)
The grid can be irregular (adaptively refined)
Parallelization of FEM II
Requirements for the partition
Handle adaptively (not regularly) refined grids
Load balancing (same number of cells for each process)
Minimal border size (minimal communication)
[Figure: grid partitioned among process 1, process 2, process 3]
Partition algorithms
NP-complete => use heuristics
Scheduling
Partitions-Processor
Recursive spectral bisection
Recursive coordinate bisection
Inertial recursive bisection
Space filling curves
Parallelization using SFC I
The SFC fills the adaptively refined grid
Cutting the one-dimensional SFC into pieces of equal length can be done easily in O(n)
SFC partitions tend to have small surfaces (as seen before)
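The cutting step can be sketched as follows. This example uses a Hilbert curve on a regular 16 x 16 grid as a stand-in, since its index-to-coordinate mapping is short to write down; the talk itself works with Peano curves on adaptively refined grids. All function names are illustrative. The cells are cut into 16 contiguous pieces in one pass, and the number of neighbouring cell pairs that end up in different partitions (a proxy for communication) is compared with a plain row-by-row ordering.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2^order x 2^order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the quadrant filled so far
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def partition_along_curve(cells, parts):
    """Cut the curve-ordered cells into contiguous pieces of equal length, O(n)."""
    n = len(cells)
    return {cell: k * parts // n for k, cell in enumerate(cells)}


def cut_edges(owner, side):
    """Count neighbouring cell pairs that belong to different partitions."""
    cut = 0
    for (x, y), p in owner.items():
        if x + 1 < side and owner[(x + 1, y)] != p:
            cut += 1
        if y + 1 < side and owner[(x, y + 1)] != p:
            cut += 1
    return cut


order = 4
side = 1 << order
hilbert_cells = [hilbert_d2xy(order, d) for d in range(side * side)]
row_major_cells = [(x, y) for y in range(side) for x in range(side)]

hilbert_cut = cut_edges(partition_along_curve(hilbert_cells, 16), side)
row_major_cut = cut_edges(partition_along_curve(row_major_cells, 16), side)
print("cut edges along Hilbert curve:", hilbert_cut)
print("cut edges row by row:         ", row_major_cut)
```

On this grid the curve ordering yields compact square blocks and cuts fewer edges than the row-by-row ordering, which yields thin strips.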
Parallelization using SFC II
[Figure: grid partitioned among process 1 to process 5]
Parallelization using SFC III
On entering a process's area, border values must be on the correct stack
How can border values be brought onto the correct stack without running along the whole curve?
Parallelization using SFC IV
Finest level at the border
Coarse level for the rest
Repartitioning
During the iteration, the algorithm adjusts the refinement locally, based on error estimates obtained from extensions of the element values
On-the-fly repartitioning
Obeys the same requirements as partitioning:
Small borders
Same number of cells for all processes
Capable of handling adaptively refined grids
Fast
Uses a small amount of memory
Distributed in parallel
Minimizes data transfer
Repartition algorithms
Scratch remap algorithms
Idea: compute a new partition of the area, then remap it intelligently onto the old partition to minimize data transfer
Can change the initial partition completely
Lots of data transfer needed
Fast
Useful results when repartitioning after each adaptive refinement
Diffusion based repartitioning
Idea: exchange workload with direct neighbours
Only appropriate when:
Refinements are globally distributed
Only slight refinements preceded the repartitioning
=> Good results, even when seldom repartitioned
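The diffusion idea can be illustrated with a minimal sketch. It assumes the processes form a chain (as they do along a one-dimensional curve) and uses a simple first-order scheme in which each pair of direct neighbours exchanges a fixed fraction `alpha` of their load difference; the function name, the value of `alpha`, and the example loads are made up for illustration. Real workloads are whole cells, so an implementation would have to round the flows.

```python
def diffusion_step(loads, alpha=0.25):
    """One diffusion step: direct neighbours exchange alpha * (load difference)."""
    new = list(loads)
    for p in range(len(loads) - 1):
        flow = alpha * (loads[p] - loads[p + 1])  # positive: p gives to p + 1
        new[p] -= flow
        new[p + 1] += flow
    return new


# Hypothetical situation: local refinement has overloaded the first process.
loads = [100.0, 20.0, 20.0, 20.0]
total_before = sum(loads)
for _ in range(100):
    loads = diffusion_step(loads)

print([round(load, 3) for load in loads])
print("total load unchanged:", abs(sum(loads) - total_before) < 1e-9)
```

Work only moves between direct neighbours, so a strongly localized imbalance needs many steps to spread out, which matches the slide's remark that the approach suits globally distributed, slight refinements.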
Focus of research
Repartitioning a field which is:
Partitioned by SFC
Traversed using the presented stack algorithms
Problems:
Adjustment is only possible in one dimension
How to reorganize the stacks
Almost certainly several problems not yet discovered
Any questions?