Transcript www.dtcenter.org
Overview of Nesting in the NMM-B
Tom Black
NMM-B User Tutorial 1 April 2015
● Nest characteristics
● 1-way interactive / MPI task usage
● 2-way interactive / MPI task usage
● Scaling
● The grid
● Motion
● Upscale (2-way) exchange
● User specification of nest-related variables
● Sequence of execution
General Characteristics of NMM-B Nests
● Parent-oriented
● 1-way / 2-way interactive
● Static / moving
● Multiple nests run simultaneously
● Telescoping domains
● Bit restartable (static / moving / 1-way / 2-way)
NEMS Structure
[Figure: NEMS component diagram. Box labels: MAIN; NEMS; EARTH(1:NM); Ensemble Mediator; Atm-Ocn Mediator; Atm; Ocean; Ice; GSM; NMM; FIM; Domain(1:ND), parents and children; Solver; Dyn; Phy; Chem; Wrt.]
All boxes represent ESMF components.
1-Way Integration for Three Generations
All generations integrate concurrently.
[Figure: timelines of the parent, child, and grandchild timesteps (Δt par, Δt child, Δt grandchild). Label: Parent updates child BCs.]
Task Usage for NMM-B 1-Way Nesting
The user distributes all available compute tasks among the various domains and fine-tunes those assignments (along with those of quilt tasks) so that parents and their children proceed in the forecast at virtually the same rate as all domains integrate concurrently. This gives the user the ability to optimize the work load balance.
NOTE: For all nesting, the physical dimensions of a parent’s tasks’ subdomains MUST BE LARGER than the physical dimensions of its children’s tasks’ subdomains.
NMM-B with 1-Way Nesting using 72 Compute Tasks
generation #1: tasks 0-7 (8 tasks)
generation #2: tasks 8-47 (40 tasks)
generation #3: tasks 48-71 (24 tasks)
Preliminary Estimate of 1-Way Compute Task Assignments
There are N compute tasks available.
There are 3 generations with 1 domain, 2 domains, and 2 domains, respectively.
Domain #1: IM1, JM1, DT1 => Work1 = IM1 x JM1
Domain #2: IM2, JM2, DT2 => Work2 = IM2 x JM2 x (DT1 / DT2)
Domain #3: IM3, JM3, DT3 => Work3 = IM3 x JM3 x (DT1 / DT3)
Domain #4: IM4, JM4, DT4 => Work4 = IM4 x JM4 x (DT1 / DT4)
Domain #5: IM5, JM5, DT5 => Work5 = IM5 x JM5 x (DT1 / DT5)
Total Work = TW = Work1 + Work2 + Work3 + Work4 + Work5
Domain #1 compute tasks: (Work1 / TW) x N
Domain #2 compute tasks: (Work2 / TW) x N
Domain #3 compute tasks: (Work3 / TW) x N
Domain #4 compute tasks: (Work4 / TW) x N
Domain #5 compute tasks: (Work5 / TW) x N
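The work-proportional estimate above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide, not part of NMM-B; the function name and the example dimensions and timesteps are hypothetical.

```python
def estimate_one_way_tasks(domains, n_tasks):
    """First-guess compute task share for each domain in a 1-way run.

    domains: list of (im, jm, dt) tuples with domain #1 (the uppermost
             parent) first. Work_k = IM_k x JM_k x (DT1 / DT_k).
    Returns fractional task counts; rounding them to a usable
    INPES x JNPES layout is left to the user.
    """
    dt1 = domains[0][2]
    work = [im * jm * (dt1 / dt) for im, jm, dt in domains]
    total_work = sum(work)
    return [w / total_work * n_tasks for w in work]


# Hypothetical dimensions and timesteps, chosen only for illustration.
example_domains = [(400, 300, 30.0), (300, 300, 10.0), (200, 150, 10.0)]
shares = estimate_one_way_tasks(example_domains, 72)
print(shares)
```

The result is only a starting point; the timers described next indicate how to adjust it.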
Some Key Timers
cpl1_recv_tim: Child wait time to recv BC data. Appears as ‘cpl recv = ‘ in stdout file.
cpl2_wait_tim: Parent wait time for BC send to finish. Appears as ‘cpl wait = ‘ in stdout file.
If child wait time is large => child is too fast relative to parent.
=> Reduce child tasks, increase parent tasks.
If parent wait time is large => parent is too fast relative to child.
=> Reduce parent tasks, increase child tasks.
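The timer rule can be written as a small decision helper. This is a sketch under assumptions: the function name is hypothetical, the two wait times are taken as already read from the stdout file, and the threshold for a "large" wait is a user choice that the tutorial does not specify.

```python
def rebalance_hint(child_recv_wait, parent_send_wait, threshold):
    """Suggest a task adjustment from the two coupler wait times.

    child_recv_wait:  value reported as 'cpl recv' (child waiting for BCs)
    parent_send_wait: value reported as 'cpl wait' (parent waiting on BC send)
    threshold:        what counts as a 'large' wait (an assumption, user-chosen)
    """
    if child_recv_wait > threshold and child_recv_wait >= parent_send_wait:
        return "child too fast: reduce child tasks, increase parent tasks"
    if parent_send_wait > threshold:
        return "parent too fast: reduce parent tasks, increase child tasks"
    return "balanced"


print(rebalance_hint(50.0, 1.0, 10.0))
```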
Current Operational NAM with 1-Way Static Nests
• Parent runs at 12 km to 84 hr
• Four static nests run to 60 hr
  – 4 km CONUS nest (3-to-1)
  – 6 km Alaska nest (2-to-1)
  – 3 km HI & PR nests (4-to-1)
• Single relocatable 1.33 km or 1.5 km FireWeather grandchild run to 36 hr (3-to-1 or 4-to-1)
Relative Compute Resources used by NAM Nests
12 km parent                  10%
6 km Alaska nest               7%
4 km CONUS nest               57%
3 km Hawaii nest               5%
3 km Puerto Rico nest          4%
1.33 km CONUS FireWx nest     17%
2-Way Integration for Three Generations
Only one generation can be active at a given time.
[Figure: timelines of the parent, child, and grandchild timesteps (Δt par, Δt child, Δt grandchild). Labels: Parent updates child BCs; Child updates parent.]
Use 1-Way Task Assignment Strategy in 2-Way Nests?
NO – Too many tasks can sit idle since domains are active in only one generation at a time.
Therefore use a different approach based on the generations of domains.
NMM-B with 1-Way Nesting using 72 Compute Tasks
generation #1: tasks 0-7 (8 tasks)
generation #2: tasks 8-47 (40 tasks)
generation #3: tasks 48-71 (24 tasks)
Only 40 of 72 tasks working in the busiest generation if using this method for 2-way.
Basic Strategy for 2-Way Task Usage by Generations
‣ Generations must wait on each other in 2-way mode.
‣ All domains cannot execute concurrently, so maximize the amount of work that can be done at any given time by assigning ALL compute tasks to the most expensive generation and distributing them among its domains for optimal efficiency.
‣ Then reassign only as many compute tasks to domains in each remaining generation as is beneficial in minimizing the clocktimes of those generations, avoiding subdomains that are too small, with too little computation being done and too costly halo exchanges.
Rules for ‘Generational’ Task Usage
‣ Generations execute sequentially.
‣ All domains in each generation execute concurrently.
‣ ALL compute tasks are assigned to the most expensive generation.
‣ A compute task can be in more than one generation but cannot be on more than one domain per generation.
‣ Each quilt task must still be uniquely assigned to a single domain to retain asynchronous writing of output.
The user is now able to optimize speed in 2-way nesting while never imposing large imbalances. Some tasks might be idle in some generations but all generations are running as fast as possible.
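Two of these rules lend themselves to an automatic check on a proposed layout: no compute task on two domains of the same generation, and one generation using every compute task. The sketch below is illustrative only. The function name and data layout are hypothetical, and the split of tasks between the two domains of generation 2 is invented for the example.

```python
def check_generational_rules(assignment, n_compute):
    """assignment maps generation -> {domain: set of compute task ranks}.

    Returns a list of rule violations (empty if none were found).
    """
    problems = []
    most_tasks = 0
    for gen, domains in assignment.items():
        owner = {}  # task rank -> domain within this generation
        for dom, tasks in domains.items():
            for task in tasks:
                if task in owner:
                    problems.append(
                        f"generation {gen}: task {task} is on domains "
                        f"{owner[task]} and {dom}"
                    )
                owner[task] = dom
        most_tasks = max(most_tasks, len(owner))
    if most_tasks != n_compute:
        problems.append("no generation uses all compute tasks")
    return problems


layout = {
    1: {1: set(range(0, 12))},
    2: {2: set(range(0, 56)), 3: set(range(56, 72))},  # hypothetical split
    3: {4: set(range(12, 54))},
}
print(check_generational_rules(layout, 72))
```

A task such as rank 20 appears in all three generations here, which the rules allow; only sharing a task between domains of one generation is flagged.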
NMM-B with 2-Way Nesting using 72 Compute Tasks
‘Generational’ task usage
generation #1: tasks 0-11 (12 tasks)
generation #2: tasks 0-71 (72 tasks)
generation #3: tasks 12-53 (42 tasks)
All 72 of 72 tasks working in the busiest generation.
Preliminary Estimate of 2-Way Compute Task Assignments
Same setup as the 1-way case.
Assume 2nd generation is the most expensive.
gen #2:
Total Work = TW2 = Work2 + Work3
Distribute tasks in 2nd generation as done for all 1-way domains previously.
Domain #2 compute tasks: (Work2 / TW2) x N
Domain #3 compute tasks: (Work3 / TW2) x N
Assign as many of the N tasks to generations 1 and 3 as possible without slowing down the run.
gen #1:
Domain #1 compute tasks: <= N
gen #3:
Total Work = TW3 = Work4 + Work5
Domain #4 compute tasks: <= (Work4 / TW3) x N
Domain #5 compute tasks: <= (Work5 / TW3) x N
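The 2-way estimate can be sketched the same way as the 1-way one. The function below is a hypothetical illustration: it finds the most expensive generation and returns, for every domain, its work-proportional share of all N tasks. That share is the actual first guess in the most expensive generation and only an upper bound (the "<=" above) in the others. The work values in the example are invented.

```python
def estimate_two_way_tasks(work_by_generation, n_tasks):
    """work_by_generation: one list of per-domain work values per generation.

    Returns (index of the most expensive generation, per-generation lists
    of work-proportional task counts out of n_tasks).
    """
    gen_work = [sum(works) for works in work_by_generation]
    busiest = gen_work.index(max(gen_work))
    limits = [
        [w / total * n_tasks for w in works]
        for works, total in zip(work_by_generation, gen_work)
    ]
    return busiest, limits


# Hypothetical Work1..Work5 for generations with 1, 2, and 2 domains.
work = [[120000.0], [270000.0, 90000.0], [60000.0, 60000.0]]
busiest, limits = estimate_two_way_tasks(work, 72)
print(busiest, limits)
```

How far below the upper bound to go in the cheaper generations is found by experiment, as the scaling slides explain.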
Example of 2-way Task Assignments
‣ You have 128 available tasks: 112 compute tasks and 16 write tasks.
‣ Five domains; 3 generations; 3rd is most expensive.
                   Compute   Write
gen #1   Dom #1:   5x8       1x2
gen #2   Dom #2:   6x6       1x3
         Dom #3:   6x6       1x3
gen #3   Dom #4:   7x8       1x4
         Dom #5:   7x8       1x4
Compute tasks in gen #3: 7x8 + 7x8 = 112
Write tasks over all domains: 2 + 3 + 3 + 4 + 4 = 16
Total: 112 + 16 = 128
Scaling
● Code efficiency drops as a task subdomain’s computation is overwhelmed by the cost of inter-processor communication.
● This happens when subdomain dimensions become too small and there is insufficient work to do compared to time spent in halo exchanges.
Scaling (2)
● Therefore scaling is simply a direct indicator of a code’s computational density.
More expensive computation => Code will scale to a larger number of processors.
Less expensive computation => Code will scale to a smaller number of processors.
Scaling (3)
● When assigning tasks always be sure the subdomains are not too small due to a task count that is too large.
As a general rule of thumb, check that no task subdomain has a dimension of less than ~10 points in I or J; otherwise halo exchange cost will begin to exceed computational cost.
● However, if minimization of clocktime is desired, then extensive experimentation is required after first-guess task assignments are made because optimal counts cannot be predicted.
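The ~10 point rule of thumb is easy to check before submitting a run. The sketch below is hypothetical: it assumes points are split evenly among tasks and uses integer (floor) division for the smallest subdomain, which may differ from how NMM-B actually distributes any remainder. The domain dimensions are those of the scaling test in this tutorial.

```python
def smallest_subdomain(im, jm, inpes, jnpes):
    """Approximate points per task in I and J, assuming an even split."""
    return im // inpes, jm // jnpes


def too_finely_decomposed(im, jm, inpes, jnpes, min_points=10):
    """True if a task subdomain would fall below ~min_points in I or J."""
    i_pts, j_pts = smallest_subdomain(im, jm, inpes, jnpes)
    return i_pts < min_points or j_pts < min_points


# 1080x486 parent on 48x24 tasks; 181x181 nest on 24x32 tasks.
print(too_finely_decomposed(1080, 486, 48, 24))
print(too_finely_decomposed(181, 181, 24, 32))
```

The first layout passes the check; the second leaves subdomains of roughly 7x5 points and is flagged.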
18 KM Parent 1080x486, Outer nest 181x181, Inner nest 361x361
48 hour simulation: parent only
INPES  JNPES  Total tasks  i-points  j-points  elapsed time  speed up (vs. previous row)
    6     24          144       180        20          1370
   12     24          288        90        20           687  1.994
   24     24          576        45        20           388  1.771
   48     24         1152        23        20           242  1.603
   96     24         2304        11        20           171  1.415

48 hour simulation: parent and outer nest
parent                                            | outer nest                                        |
INPES  JNPES  Total tasks  i-points  j-points     | INPES  JNPES  Total tasks  i-points  j-points     | elapsed time
   48     24         1152        23        20     |    16     16          256        11        11     |  756
   48     24         1152        23        20     |    16     24          384        11         7     |  752
   48     24         1152        23        20     |    24     24          576         7         7     |  684
   48     24         1152        23        20     |    24     32          768         7         5     |  822

48 hour simulation: parent, outer nest and inner nest
parent                                            | outer nest                                        | inner nest                                        |
INPES  JNPES  Total tasks  i-points  j-points     | INPES  JNPES  Total tasks  i-points  j-points     | INPES  JNPES  Total tasks  i-points  j-points     | elapsed time
   48     24         1152        23        20     |    24     24          576         7         7     |    48     32         1536         7        11     | 2598
   48     24         1152        23        20     |    24     24          576         7         7     |    48     24         1152         7        15     | 2405
   48     24         1152        23        20     |    32     32         1024        11        11     | 2656
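The speed-up column of the parent-only test can be reproduced from the elapsed times: each value is the elapsed time of the previous row divided by that of the current row, where each row doubles the task count. The short Python check below uses only the numbers from the table; the variable names are illustrative.

```python
# Parent-only test: total tasks -> elapsed time (units as given in the table).
elapsed = {144: 1370, 288: 687, 576: 388, 1152: 242, 2304: 171}

tasks = sorted(elapsed)
# Speed-up gained by each doubling of the task count.
speedups = [round(elapsed[a] / elapsed[b], 3) for a, b in zip(tasks, tasks[1:])]
# Fraction of the ideal 2x gain achieved by each doubling.
efficiency = [round(s / 2.0, 3) for s in speedups]

print(speedups)
print(efficiency)
```

The steadily shrinking gain per doubling is the loss of scaling described above as i-points per task fall from 180 toward 11.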
Parent-Oriented Nests
[Figure: portion of the parent domain (◦ points) with the nest points (x) overlaid. Labels: Nest Task Subdomains; Parent Task Subdomains.]
The southwest H point of the nest domain coincides with a parent H point.
Summary of Parent-Child Gridpoint Relationships
ODD space ratio
[Figure: parent H and V points with child h and v points for an odd ratio.]
Child h points lie on parent H points.
Child v points lie on parent V points.
EVEN space ratio
[Figure: parent H and V points with child h and v points for an even ratio.]
Child h points lie on parent H and V points.
Child v points do not coincide with parent points.
Parent and Child H Gridpoints for 3:1 Ratio
[Figure: a row of parent H points and child h points at a 3:1 ratio.]
Nest point locations: ITS_PARENT_ON_CHILD=-5; I=IDS=1 (SW corner of nest); ITE_PARENT_ON_CHILD=9
Parent point locations: ITS_PARENT=1 (1st point on parent task); I_PARENT_SW=3; ITE_PARENT=5 (last point on parent task)
The figure also marks a ‘gap’ of child points near the last point on the parent task.
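The labelled values on this slide are consistent with a simple index mapping, sketched below in Python. The formulas are inferred from the slide's numbers and are an assumption, not taken from the NMM-B source: a parent H point maps to child index (I_parent - I_PARENT_SW) x ratio + 1, and the parent task's extent on the child grid is assumed to run through the ratio - 1 child points of the gap after its last parent point. The function names are hypothetical.

```python
def parent_to_child_i(i_parent, i_parent_sw, ratio):
    """Child I index coinciding with a parent H point (child SW corner is I=1)."""
    return (i_parent - i_parent_sw) * ratio + 1


def parent_task_extent_on_child(its_parent, ite_parent, i_parent_sw, ratio):
    """Assumed child-index extent of a parent task, including the trailing gap."""
    start = parent_to_child_i(its_parent, i_parent_sw, ratio)
    end = parent_to_child_i(ite_parent, i_parent_sw, ratio) + (ratio - 1)
    return start, end


# Values labelled on the slide: ITS_PARENT=1, ITE_PARENT=5, I_PARENT_SW=3, 3:1.
print(parent_task_extent_on_child(1, 5, 3, 3))
```

With the slide's values this gives the labelled ITS_PARENT_ON_CHILD=-5 and ITE_PARENT_ON_CHILD=9.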
Parent and Nest V Gridpoints for 3:1 Ratio
[Figure: rows of parent H and V points with child h and v points at a 3:1 ratio.]
Nest point locations: ITS_PARENT_ON_CHILD=-5; I=IDS=1; ITE_PARENT_ON_CHILD=9
Parent point locations on V: ITS_PARENT=1; I_PARENT_SW=3; ITE_PARENT=5
Parent point locations on H: ITS_PARENT=1; I_PARENT_SW=3; ITE_PARENT=5
The figure also marks a ‘gap’ of child points.
NMM-B Moving Nests
● 1-way or 2-way interactive.
● Forecast can contain multiple nests.
● Telescoping domains.
Three Types of Data Motion Needed to Satisfy a Nest’s Shift
[Figure: the nest domain before the shift (occupying the pre-move ‘footprint’) and after the shift. Regions labeled: Intra-task Update; Inter-Task Update; Parent updates.]
Shift onto a Corner
[Figure: the nest domain before the shift (occupying the pre-move ‘footprint’) and after the shift. Regions labeled: Intra-task update; Parent updates.]
Simplest Parent Update Over the SW Corner
[Figure: a nest task subdomain at the SW corner of the pre-move footprint.]
Here one parent task updates the entire parent update region of this nest task subdomain.
Four Parent Tasks Update Over the SW Corner
[Figure: a nest task subdomain at the SW corner of the pre-move footprint. Regions labeled: 1st parent task’s update region; 2nd parent task’s 1st update region; 2nd parent task’s 2nd update region; 3rd parent task’s update region; 4th parent task’s update region.]
Child’s Bookkeeping for Relative Motion
The child tasks determine which of their points are updated by each of the three processes.
‣ Intra-task updating is the simplest (a shift in memory).
‣ Inter-task updating is more complex. Child tasks determine which of their subdomain points inside the pre-move footprint will be updated by which other child tasks and vice versa.
‣ Updates from the parent are the most complicated. Child tasks determine which of their subdomain points outside of the pre-move footprint will be updated by which parent tasks.
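The three-way split can be illustrated with a toy classifier. This is a simplified sketch, not the NMM-B bookkeeping: it works in nest index space, assumes a shift of (di, dj) nest points, and assumes a hypothetical `owner_of(i, j)` function giving the child task that holds a given nest index. A point whose new location lay inside the pre-move footprint takes its data from the nest itself (same task: intra-task; other task: inter-task); otherwise it needs the parent.

```python
def classify_updates(my_rank, my_bounds, shift, domain_size, owner_of):
    """Sort a child task's points by the process that updates them after a shift.

    my_bounds:   (its, ite, jts, jte), inclusive nest indices of this task
    shift:       (di, dj), nest shift in nest grid points
    domain_size: (ide, jde), nest domain dimensions (indices start at 1)
    owner_of:    function (i, j) -> rank of the child task holding that index
    """
    its, ite, jts, jte = my_bounds
    di, dj = shift
    ide, jde = domain_size
    result = {"intra": [], "inter": [], "parent": []}
    for j in range(jts, jte + 1):
        for i in range(its, ite + 1):
            # Index this location had before the nest moved.
            i_old, j_old = i + di, j + dj
            if 1 <= i_old <= ide and 1 <= j_old <= jde:
                same = owner_of(i_old, j_old) == my_rank
                kind = "intra" if same else "inter"
            else:
                kind = "parent"
            result[kind].append((i, j))
    return result


# Toy case: an 8x8 nest on 2x2 tasks of 4x4 points, shifted 2 points east.
def owner(i, j):
    return (i - 1) // 4 + 2 * ((j - 1) // 4)

west = classify_updates(0, (1, 4, 1, 4), (2, 0), (8, 8), owner)
east = classify_updates(1, (5, 8, 1, 4), (2, 0), (8, 8), owner)
print({k: len(v) for k, v in west.items()})
print({k: len(v) for k, v in east.items()})
```

In this toy case the western task is filled entirely from the nest, while the eastern task's leading two columns fall outside the pre-move footprint and must come from the parent.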
Parent’s Bookkeeping for Relative Child Motion
The parent tasks perform bookkeeping to determine which nest points are updated by the parent outside of the pre-move footprint.
Due to the complexity involved, both the parent and child tasks perform this bookkeeping from their own perspectives, to serve as checks on each other as well as to eliminate additional communication.
The Parent Stores Its Bookkeeping Results
Child task subdomains and those points on them that are updated by a given parent task change with each shift of the nest. Use arrays of linked lists to deal with this continual change.
Parent array of moving nest update specifications:
Element 1: Moving Child #1
Element 2: Moving Child #2
Element 3: Moving Child #3
[Figure: each array element heads a linked list of the nest tasks to be updated.]
Each link holds parent task update specifications for each relevant task of a moving child following a shift.
The Child Stores Its Bookkeeping Results
There is no need for linked list arrays in storing the bookkeeping results from the child’s perspective since the number of parent tasks providing update data is always between 0 and 4.
=> Allocate a derived datatype array (1:4) and store appropriately.
This assumes the geographical area of parent task subdomains is always larger than that of child task subdomains.
Surface Data
‣ Eight invariant surface fields from NPS cover the uppermost parent domain at each different resolution of all moving nests.
‣ Among these are topography, land/sea mask, soil type, vegetation type, and vegetation fraction.
‣ Each nest task with a parent update region reads the external files to update those variables rather than receiving them from the parent so as not to lose the higher resolution information.
‣ For sfc variables NOT among those eight:
  (a) Generate a search list of I,J increments from near to far.
  (b) If parent update sfc data is from a different surface type then the nest searches for its own nearest point with the same sfc type (e.g. soil T or SST).
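Steps (a) and (b) can be sketched as a near-to-far offset list followed by a nearest-match search. This is an illustration under assumptions, not the NMM-B routine: the function names, the use of squared Euclidean distance for ordering, the search radius, and the toy fields are all hypothetical.

```python
def search_offsets(max_radius):
    """(a) All (di, dj) increments within max_radius, ordered near to far."""
    offsets = [
        (di, dj)
        for di in range(-max_radius, max_radius + 1)
        for dj in range(-max_radius, max_radius + 1)
        if (di, dj) != (0, 0)
    ]
    # Sort by squared distance; ties are broken by the offset itself.
    offsets.sort(key=lambda o: (o[0] ** 2 + o[1] ** 2, o))
    return offsets


def nearest_same_type(field, sfc_type, i, j, wanted_type, offsets):
    """(b) Value at the nearest point to (i, j) whose surface type matches.

    field and sfc_type are lists of rows indexed [j][i] (0-based here).
    Returns None if no matching point lies within the search list.
    """
    nj, ni = len(field), len(field[0])
    for di, dj in offsets:
        ii, jj = i + di, j + dj
        if 0 <= ii < ni and 0 <= jj < nj and sfc_type[jj][ii] == wanted_type:
            return field[jj][ii]
    return None


# Toy 3x3 example: find a water value (e.g. SST) for the central land point.
sfc = [["land", "land", "water"],
       ["land", "land", "land"],
       ["water", "land", "land"]]
sst = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]
offsets = search_offsets(1)
value = nearest_same_type(sst, sfc, 1, 1, "water", offsets)
print(value)
```

Building the offset list once and reusing it for every point keeps the per-point search cheap.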
Upscale (2-way) Data Exchange
As is done for motion, both the child and the parent compute which parent tasks will receive upscale data from which points on which child tasks. This eliminates additional communication and serves as a check.
Upscale Exchange - Child
(1) Is the child at the end of a parent timestep?
(2) If so, determine which points on which parent tasks it will update.
(3) Loop through the appropriate parent tasks.
    - Loop through the specified 2-way variables.
      Generate upscale values using the mean of child values within the stencil region.
    - Send upscale data for all variables to the given parent task.
Generate Upscale Values – Odd Space Ratio
[Figure: stencils of child points around a parent H point (H-pt variables) and around a parent V point (V-pt variables). Average over these stencils.]
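The averaging step can be sketched for the odd-ratio case, where a child point coincides with the parent point being updated. The exact stencil shape is not recoverable from this transcript, so the sketch assumes a square ratio x ratio block of child points centered on the coinciding child point. That assumption, and the function name, are illustrative only.

```python
def upscale_mean(child, i_c, j_c, ratio):
    """Mean of child values in an assumed ratio x ratio stencil.

    child:    list of rows indexed [j][i]
    i_c, j_c: child indices of the point coinciding with the parent point
    ratio:    odd parent-to-child space ratio (e.g. 3)
    """
    if ratio % 2 == 0:
        raise ValueError("this sketch covers odd space ratios only")
    half = ratio // 2
    values = [
        child[j][i]
        for j in range(j_c - half, j_c + half + 1)
        for i in range(i_c - half, i_c + half + 1)
    ]
    return sum(values) / len(values)


child_t = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0]]
print(upscale_mean(child_t, 1, 1, 3))
```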
Generate Upscale Values – Even Space Ratio
[Figure: stencils of child points around a parent H point (H-pt variables) and around a parent V point (V-pt variables). Average over these stencils.]
Upscale Exchange - Parent
(1) Determine which of its points are updated by which child tasks. Save each child task’s specs as a link in a linked list (since we do not know ahead of time how many child tasks will send upscale data after each shift of moving nests).
(2) Loop through the appropriate child tasks.
    - Recv data for all specified 2-way variables.
    - Incorporate data if the current timestep does not immediately follow a restart output time (for bit identical restarts).
    - If the parent’s sfc elevation differs from the child’s then adjust the data using a spline interpolation.
    - Update the parent values applying the user-specified child weight from the configure file.
Specify Update Variables for BC, Motion, and 2-Way Exchange
● Use the nests.txt file which (like solver_state.txt) lists desired variables from the Solver internal state.
KEY for boundary vbls:
  H – mass pt
  V – velocity pt
KEY for moving vbls:
  H – mass pt
  V – velocity pt
  L – land sfc
  W – water sfc
  F – read external file in parent update region
  x – parent must update halo when child moves
KEY for 2-way vbls:
  H – mass pt
  V – velocity pt
Example of ‘nests.txt’ specifications
                 BC   Moving   2-way
### 2-D Integer
'ISLTYP'         -    F        -       'Soil type'
### 2-D Real
'FIS'            -    F        -       'Sfc geopotential (m2 s-2)'
'CMC'            -    Lx       -       'Canopy moisture (m)'
'SST'            -    Wx       -       'Sea surface temperature (K)'
### 3-D Real
'T'              H    H        H       'Sensible temperature (K)'
'U'              V    V        V       'U component of wind (m s-1)'
'STC'            -    Lx       -       'Soil temperature (K)'
High Level Order of Execution
Timestepping loop in subroutine NMM_INTEGRATE
► Children recv BC updates from parents from the end of the current parent timestep.
► Parents recv upscale data from children from the end of the previous parent timestep.
► Domain integrates
► Parents send BC updates to children who are at the beginning of the current parent timestep.
► Children send upscale data to parents who recv it at the beginning of the next parent timestep.
Run Step of the NMM
DO Loop over generations (a single iteration for 1-way interaction)
  DO Loop over all (1-way) or some (2-way) forecast timesteps
    CALL phase 1 Parent-Child Coupler Run ( check 2-way signals )
    CALL phase 2 Parent-Child Coupler Run ( children recv BCs from parents )
    CALL phase 3 Parent-Child Coupler Run ( parents recv upscale from children )
    CALL phase 1 Domain Run ( integrate the forecast one timestep )
    CALL phase 4 Parent-Child Coupler Run ( parents send BCs to children )
    CALL phase 5 Parent-Child Coupler Run ( children send upscale to parents )
    CALL phase 2 Domain Run ( digital filter )
    Advance the Clock
    CALL phase 3 Domain Run ( write history/restart )
  ENDDO Timestep loop
ENDDO Generations loop
Example of erratic nest motions due to weak storm(s) interacting with complex terrain.
Note how the wind field remains coherent as it evolves within the outer and inner nest domains.
Additional Slides
The Composite Object
● A derived datatype to hold assorted variables used throughout the Parent-Child coupler component.
● Allows tasks lying on multiple domains to easily reference such variables generically when they have different values on different domains.
Composite Object – Defined / Allocated
Top of module before CONTAINS:

  ! Per-domain storage for coupler variables
  TYPE COMPOSITE
    INTEGER(kind=KINT),DIMENSION(1:3) :: PARENT_SHIFT
  END TYPE COMPOSITE

  ! Generic module-level pointer used throughout the coupler
  INTEGER(kind=KINT),DIMENSION(:),POINTER :: PARENT_SHIFT

  SUBROUTINE PARENT_CHILD_COUPLER_SETUP
    TYPE(COMPOSITE), DIMENSION(:), POINTER, SAVE :: CPL_COMPOSITE
    ! One composite object per domain
    ALLOCATE(CPL_COMPOSITE(1:NUM_DOMAINS),stat=ISTAT)
  END SUBROUTINE PARENT_CHILD_COUPLER_SETUP
Composite Object - Used
SUBROUTINE CHILDREN_RECV_PARENT_DATA
  ! Point the generic names at this domain's composite object
  CALL POINT_TO_COMPOSITE(MY_DOMAIN_ID)
  CALL MPI_RECV(PARENT_SHIFT, 3, MPI_INTEGER, …….
END SUBROUTINE CHILDREN_RECV_PARENT_DATA

SUBROUTINE POINT_TO_COMPOSITE(MY_DOMAIN_ID)
  TYPE(COMPOSITE), POINTER :: CC
  CC => CPL_COMPOSITE(MY_DOMAIN_ID)
  PARENT_SHIFT => CC%PARENT_SHIFT
END SUBROUTINE POINT_TO_COMPOSITE
1-Way Communication Between a Parent and Child
‣ MPI intercommunicators are very convenient for this.
‣ The lead tasks on both domains have rank 0.
‣ MPI sends/recvs use simple target and sender task ranks.
Example of an Intercommunicator
The global task ranks (unique task assignments to domains):
  Parent – 25, 26, 27
  Child – 52, 53, 54, 55
The intercommunicator task ranks:
  Parent – 0, 1, 2
  Child – 0, 1, 2, 3
Parent and Child Communications w/ Generations
‣ MPI intercommunicators cannot be used because parent and child may share some of the same tasks. MPI does not allow global task ranks to be repeated in intercommunicators.
‣ Therefore we use MPI intracommunicators.
‣ Parent/child task ranks may repeat but will lie in a single non-repeating sequence in the communicator.
Example of an Intracommunicator
The global task ranks (tasks can be in more than 1 generation):
  Parent – 3, 4, 5, 6
  Child – 1, 2, 3, 4, 5, 6, 7
The intracommunicator task ranks (parent first):
  Union – 3, 4, 5, 6, 1, 2, 7 -> 0, 1, 2, 3, 4, 5, 6
  Parent – 0, 1, 2, 3
  Child – 4, 5, 0, 1, 2, 3, 6
More bookkeeping for the Init step.
Variable sources/targets in MPI sends/recvs.
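The rank bookkeeping in this example can be reproduced with a few lines of plain Python (no MPI calls are made). The function name is hypothetical; the ordering rule is the one stated on the slide: the union lists the parent tasks first, followed by the child tasks not already present, and ranks are positions in that union.

```python
def intracomm_ranks(parent_tasks, child_tasks):
    """Map global task ranks to ranks in a parent-first union.

    Returns (union of global ranks, parent ranks, child ranks).
    """
    on_parent = set(parent_tasks)
    # Parent tasks first, then child tasks that are not also parent tasks.
    union = list(parent_tasks) + [t for t in child_tasks if t not in on_parent]
    rank = {global_rank: r for r, global_rank in enumerate(union)}
    return union, [rank[t] for t in parent_tasks], [rank[t] for t in child_tasks]


union, parent_ranks, child_ranks = intracomm_ranks(
    [3, 4, 5, 6], [1, 2, 3, 4, 5, 6, 7]
)
print(union)
print(parent_ranks)
print(child_ranks)
```

The child's ranks are not contiguous, which is why sends and receives need the variable sources and targets noted above.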
B-grid vs. E-grid
[Figure: B-grid and E-grid layouts of H and v points, with the B-grid dx and dy and the E-grid dx and dy marked.]
B-grid is just a rotated E-grid