Running CCSM
Tony Craig
CCSM Software Engineering Group
[email protected]
www.ccsm.ucar.edu
Outline
• General review of CCSM
• Setting up and running a simple case
• Datasets
• Production
• Modifying source code
• Errors
• Tools
• Performance
Review of CCSM
• Five components / Ten models
– Atmosphere(3) : atm, datm, latm
– Ocean(2) : ocn, docn
– Land(2) : lnd, dlnd
– Ice(2+) : ice, ice (prescribed mode), ice (mixed layer ocean mode), dice
– Coupler(1) : cpl
• Communication via MPI between components and coupler only
• Each component runs on multiple processors via MPI, OpenMP, MPI/OpenMP
Component parallelization
• atm : MPI, OpenMP, or MPI/OpenMP
• lnd : MPI, OpenMP, or MPI/OpenMP
• ice : MPI only
• ocn : MPI only
• cpl : OpenMP only
• The data models, datm, docn, dice, dlnd, and latm : serial only, 1 processor
Configurations
• A = datm, dlnd, docn, dice, cpl
• B = atm, lnd, ocn, ice, cpl
• C = datm, dlnd, ocn, dice, cpl
• D = datm, dlnd, docn, ice, cpl
• F = atm, lnd, docn, ice (prescribed mode), cpl
• G = latm, dlnd, ocn, ice, cpl
• H = atm, dlnd, docn, dice, cpl
• I = datm, lnd, docn, dice, cpl
• K = atm, lnd, docn, dice, cpl
• M = latm, dlnd, docn, ice (ml ocn mode), cpl
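The configuration letters listed on this slide can be written down as a lookup table. The following is a minimal sketch in Python, not part of the CCSM scripts; the names `CONFIGURATIONS`, `DATA_MODELS`, and `active_components` are hypothetical and exist only for illustration.

```python
# Component set for each configuration letter, transcribed from the slide.
# Order is (atmosphere, land, ocean, ice, coupler).
CONFIGURATIONS = {
    "A": ("datm", "dlnd", "docn", "dice", "cpl"),
    "B": ("atm", "lnd", "ocn", "ice", "cpl"),
    "C": ("datm", "dlnd", "ocn", "dice", "cpl"),
    "D": ("datm", "dlnd", "docn", "ice", "cpl"),
    "F": ("atm", "lnd", "docn", "ice (prescribed mode)", "cpl"),
    "G": ("latm", "dlnd", "ocn", "ice", "cpl"),
    "H": ("atm", "dlnd", "docn", "dice", "cpl"),
    "I": ("datm", "lnd", "docn", "dice", "cpl"),
    "K": ("atm", "lnd", "docn", "dice", "cpl"),
    "M": ("latm", "dlnd", "docn", "ice (ml ocn mode)", "cpl"),
}

# The slides name these five as the data models.
DATA_MODELS = {"datm", "dlnd", "docn", "dice", "latm"}


def active_components(config):
    """Return the components of a configuration that are neither data models nor the coupler."""
    return [
        name
        for name in CONFIGURATIONS[config]
        if name not in DATA_MODELS and name != "cpl"
    ]
```

For example, `active_components("B")` lists all four active models, while `active_components("A")` is empty because configuration A uses only data models.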
Resolutions
• atm/lnd/datm/dlnd = T42, T31
• ocn/ice/docn/dice = gx1v3, gx3, gx3v4
• latm = T62
• Scientifically validated combinations
– B, T42_gx1v3 = b20.007 control run (test.a1 case)
– B, T31_gx3v4 = paleo control run (test.a2 case)
“Available” configurations
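Configurations (columns): A B C D F G H I K M
Resolutions (rows): T42_gx1v3, T31_gx3, T31_gx3v4, T62_gx1v3, T62_gx3
Legend: * = supported (subject to change); separate markers identify the b20.007 control and the paleo control
[Table: the per-cell markings are not recoverable from this transcript]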
Platforms
• IBM
• SGI
• Compaq*
Review of scripts
• Main script (test.a1.run)
– Sets primary ccsm environment variables
– Calls $model.setup.csh
• Gets input datasets
• Builds components
– Runs model
– Archives
– Harvests
Setting up a simple case
• Use the GUI !!
– The GUI modifies the scripts and creates a new
case for you
– Input $CASE, $CSMROOT, $CSMDATA,
$EXEROOT
– Input resolution
– Input configuration (A-M)
– Sets processor layout based on configuration (first
guess)
– Sets some batch environment variables
– Works well in the NCAR environment, other sites
require post script-generation tuning
Setting up a simple case, without GUI
• Create new case directory under
scripts, copy over test.a1 files
• Rename file test.a1.run to $CASE.run
– Edit $CASE, $CSMROOT, $CSMDATA,
$EXEROOT, $ARCROOT
– Edit batch environment parameters
– Edit $GRID
– Edit $SETUPS
– Edit $NTASKS, $NTHRDS
$NTASKS, $NTHRDS, batch
• $NTASKS is the total number of MPI tasks for each component
• $NTHRDS is the number of OpenMP threads per MPI task
• $NTASKS*$NTHRDS = total number of processors for each component
• Tuning is required to get optimal load balance
• Batch parameters should match the processors used; consistency is important
• task_geometry (loadleveler) is very powerful
Component parallelization
• atm : MPI, OpenMP, or MPI/OpenMP
• lnd : MPI, OpenMP, or MPI/OpenMP
• ice : MPI only, NTHRDS=1
• ocn : MPI only, NTHRDS=1
• cpl : OpenMP only, NTASKS=1
• The data models, datm, docn, dice, dlnd, and latm : serial only, 1 processor, NTASKS=1, NTHRDS=1
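The parallelization limits on this slide amount to a few rules on $NTASKS and $NTHRDS per component. A minimal sketch of such a check in Python follows; `check_layout` is a hypothetical helper for illustration and is not one of the CCSM tools.

```python
# The slides name these five as the serial data models.
DATA_MODELS = {"datm", "dlnd", "docn", "dice", "latm"}


def check_layout(setup, ntasks, nthrds):
    """Return a list of rule violations for one component (empty if none)."""
    problems = []
    if setup in DATA_MODELS and (ntasks != 1 or nthrds != 1):
        problems.append(f"{setup}: data models are serial, need NTASKS=1 and NTHRDS=1")
    if setup in ("ice", "ocn") and nthrds != 1:
        problems.append(f"{setup}: MPI only, need NTHRDS=1")
    if setup == "cpl" and ntasks != 1:
        problems.append(f"{setup}: OpenMP only, need NTASKS=1")
    return problems
```

A layout such as `check_layout("ocn", 40, 1)` passes, while `check_layout("ocn", 40, 2)` reports that the ocean model cannot use more than one thread per task.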
Main script configuration summary
• B case
MODELS ( atm  lnd  ocn  ice  cpl )
SETUPS ( atm  lnd  ocn  ice  cpl )
NTASKS ( 8    2    40   8    1   )
NTHRDS ( 4    4    1    1    4   )
• datm/dlnd/ocn/ice case
MODELS ( atm  lnd  ocn  ice  cpl )
SETUPS ( datm dlnd ocn  ice  cpl )
NTASKS ( 1    1    64   16   1   )
NTHRDS ( 1    1    1    1    4   )
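Since $NTASKS*$NTHRDS gives the processors used by each component, the two layouts on this slide can be totalled with a few lines of arithmetic. This is a sketch in Python under the assumption that the values are read column by column from the slide; `B_CASE`, `DATA_ATM_LND_CASE`, and `processors` are hypothetical names.

```python
# (NTASKS, NTHRDS) per component, keyed by the SETUPS entry on the slide.
B_CASE = {
    "atm": (8, 4),
    "lnd": (2, 4),
    "ocn": (40, 1),
    "ice": (8, 1),
    "cpl": (1, 4),
}

DATA_ATM_LND_CASE = {
    "datm": (1, 1),
    "dlnd": (1, 1),
    "ocn": (64, 1),
    "ice": (16, 1),
    "cpl": (1, 4),
}


def processors(layout):
    """Processors per component: MPI tasks times OpenMP threads per task."""
    return {name: ntasks * nthrds for name, (ntasks, nthrds) in layout.items()}


if __name__ == "__main__":
    for label, layout in (("B", B_CASE), ("datm/dlnd/ocn/ice", DATA_ATM_LND_CASE)):
        counts = processors(layout)
        print(label, counts, "total:", sum(counts.values()))
```

With these numbers the B case uses 92 processors in total and the datm/dlnd/ocn/ice case uses 86; the batch request should be consistent with whichever total applies.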
$RUNTYPE
• Startup - initial startup of model using arbitrary
initialization
– set $CASE, $BASEDATE
• Continue - continuation of case, bit-for-bit
guaranteed, uses model restart files
– set $CASE
• Branch - start new case as a bit-for-bit continuation of
another case, uses model restart files, requires
continuous date
– set $CASE, $REFCASE, $REFDATE
• Hybrid - start new case, not bit-for-bit continuation,
uses model initial files in atm and land, can change
starting date
– set $CASE,$BASEDATE,$REFCASE,$REFDATE
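The variables each $RUNTYPE needs can be summarized as a small table and checked before a job is submitted. A minimal sketch in Python, assuming lowercase run type names; `REQUIRED` and `missing_settings` are hypothetical and not part of the CCSM scripts.

```python
# Variables to set for each run type, as listed on the slide.
# The lowercase spelling of the run type names is an assumption.
REQUIRED = {
    "startup": ["CASE", "BASEDATE"],
    "continue": ["CASE"],
    "branch": ["CASE", "REFCASE", "REFDATE"],
    "hybrid": ["CASE", "BASEDATE", "REFCASE", "REFDATE"],
}


def missing_settings(runtype, settings):
    """Return the required variable names that are unset or empty in `settings`."""
    return [name for name in REQUIRED[runtype] if not settings.get(name)]
```

For instance, a branch run described only by `{"CASE": "mycase"}` would report `REFCASE` and `REFDATE` as missing.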
Coupler namelist
• Stop_option: ndays, nmonths, newmonth, halfyear,
newyear, newdecade
• Stop_n : integer (ndays, nmonths)
• Rest_freq : ndays, monthly, quarterly, halfyear, yearly
• Rest_n : integer (ndays)
• Diag_freq : daily, weekly, biweekly, monthly, quarterly,
yearly, ndays
• Diag_n : integer (ndays)
• info_bcheck : integer
Data Sets
• Types
– Grid files, binary
– Namelist input, ascii
– Initial datasets, binary/netcdf
– Restart datasets, binary
– History datasets, netcdf
– Log files, ascii
• inputdata directory
– This is usually pointed to by $CSMDATA
Data Flow, Input
• Everything is copied to $EXEROOT
• Tools and scripts attempt to automate most of the
“get input files”
• Main script variables include $CSMDATA, $LFSINP,
$LMSINP, $MACINP, $RFSINP, $RMSINP
[Diagram: files are staged into $EXEROOT from scripts/$CASE, $CSMDATA (inputdata), $ARCROOT/restart, and the Mass Store]
Data Flow, Output
• Output files are moved out of $EXEROOT
• Harvesting is a separate process
• Writing of restart files coordinated by the coupler
• Writing of history files is not coordinated between components, monthly average is default
• Main script variables include $LMSOUT, $MACOUT, $RFSOUT
[Diagram: Scripts and $EXEROOT; archiving moves output to $ARCROOT; harvesting copies it to the Mass Store]
Log Files
• Each component produces a log file,
$model.log.$LID
• $LID is a system date stamp
• Date stamps are the same on all log files for a run
• Log files are written into the $EXEROOT/$model
directories during execution
• Log files are copied to $SCRIPTS/logs at the end of a
run
• There are separate stdout and stderr that sometimes
contain output information
Archiving, ccsm_archive
• Means moving model output to a separate
area on a local disk, ccsm_archive
• Local disk area is set by $ARCROOT in the
main script
• Benefits
– Allows separation of running and harvesting
– Mass storage availability does not prevent
continued execution of the model
– Allows users to run in volatile temporary space
– Supports simple harvesting in a clustered
machine environment (like nirvana)
Harvesting, $CASE.har
• Means copying model output to the local mass store
• Separate script in scripts/$CASE, $CASE.har
• Typically submitted in batch, can also be run
interactively
• Submitted by main script after model run, off by
default
• Sources ccsm_joe for important environment
variables
• Harvests all files in $ARCROOT/{atm,lnd,ocn,ice,cpl}
• Verifies accurate copy on mass store before
removing
• Can scp files to remote machines
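The "verify before removing" step can be illustrated with a copy, a checksum comparison, and only then a delete. This is a sketch in Python, not the $CASE.har script: it assumes a plain directory stands in for the mass store, and `sha256` and `harvest_file` are hypothetical helpers.

```python
import hashlib
import shutil
from pathlib import Path


def sha256(path):
    """Checksum a file in chunks so large history files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def harvest_file(source, store_dir):
    """Copy `source` into `store_dir`; remove the original only if the copy matches."""
    source = Path(source)
    target = Path(store_dir) / source.name
    shutil.copy2(source, target)
    if sha256(source) != sha256(target):
        return False  # copy differs: keep the original for another attempt
    source.unlink()
    return True
```

The ordering is the point of the sketch: a failed or partial copy never costs the only copy of the model output.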
Exact Restart
• CCSM can stop and restart exactly
• The coupler controls the frequency of
restart file writes
• Restart files guarantee bit-for-bit
continuity at a checkpoint boundary
• rpointer files are updated in the
scripts/$CASE directory after each run
Restart file management (1)
• ccsm_archive
– In scripts/$CASE
– Called from main script after model run is
complete, commented out by default
– $ARCROOT/restart contains the latest full set of
restart files
– ccsm_archive copies full set of restart datasets
into $ARCROOT/restart after each run
– ccsm_archive then tars up that restart set into the
$ARCROOT/restart.tars directory
– These tar files can be large, regular clean up
required
Restart file management (2)
• ccsm_getrestart
– In scripts/tools
– Called from main script before model run starts,
commented out by default
– Copies the latest set of restart files from
$ARCROOT/restart to the appropriate directories
• To “back up” model run to previous model date
– Assumes both ccsm_archive and ccsm_getrestart have been active in the main script
– Delete all files in $ARCROOT/restart
– Untar an $ARCROOT/restart.tars file into $ARCROOT/restart
– Resubmit
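The procedure for backing a run up to an earlier restart set (clear $ARCROOT/restart, then untar a saved set into it) can be sketched as follows. This Python example is illustrative only: `restore_restart_set` is a hypothetical helper, and it assumes the tar file stores the restart files with paths relative to the restart directory.

```python
import tarfile
from pathlib import Path


def restore_restart_set(arcroot, tar_name):
    """Replace the contents of <arcroot>/restart with one saved restart set."""
    restart_dir = Path(arcroot) / "restart"
    tar_path = Path(arcroot) / "restart.tars" / tar_name

    # Step 1: delete all files currently in the restart directory.
    for item in restart_dir.iterdir():
        if item.is_file():
            item.unlink()

    # Step 2: untar the chosen restart set into the restart directory.
    # Assumption: member paths in the tar file are relative to this directory.
    with tarfile.open(tar_path) as archive:
        archive.extractall(restart_dir)

    return sorted(path.name for path in restart_dir.iterdir())
```

After this step the job is resubmitted; with ccsm_getrestart active, the restored files are the ones the next run picks up.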
Auto-Resubmit
• RESUBMIT file in scripts/$CASE
directory
– contains a single integer
– If the integer is >0, main script resubmits
itself and decrements the integer
• Runaway jobs
– FIRST! set value in RESUBMIT file to 0
– Attempt to kill running jobs
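The RESUBMIT logic is a counter in a file: resubmit while it is positive, and decrement it each time. A minimal sketch in Python, assuming the file holds one integer as described; `should_resubmit` is a hypothetical name and the real main script does this in csh.

```python
from pathlib import Path


def should_resubmit(case_dir):
    """Read RESUBMIT; if the count is > 0, decrement it and return True."""
    resubmit_file = Path(case_dir) / "RESUBMIT"
    count = int(resubmit_file.read_text().strip())
    if count > 0:
        resubmit_file.write_text(f"{count - 1}\n")
        return True
    return False
```

This also shows why a runaway job is stopped by setting the value to 0 first: once the file reads 0, no further resubmission happens even if a running job finishes normally.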
Production
• Modify coupler namelist in cpl.setup.csh, set
run length and restart frequency, turn down
diagnostic frequency, set info_bcheck to 0.
• Run a startup, hybrid, or branch case
$RUNTYPE
• Transition to continue $RUNTYPE
• Turn on archiving, harvesting, and
ccsm_getrestart
• Edit RESUBMIT file to initiate autoresubmission
Monitoring a run
• Monitor the batch jobs using llq, bjobs, qstat
• Verify that runs complete successfully, check
for timing information at the end of a log file
• tail -f $EXEROOT/cpl/cpl.log*
• If runs are not succeeding,
– tail each log file
– grep for ENDRUN in atm and lnd log files
– Check stdout and stderr files for component
messages or system messages
– Look for core files in $EXEROOT/$model
– Look for zero length files in $EXEROOT/$model
– Check email
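Several of the checks for a failing run (ENDRUN in log files, core files, zero length files) can be gathered into one scan of a $EXEROOT/$model directory. The following Python sketch is an illustration, not a CCSM tool: `scan_model_dir` is hypothetical, it assumes core files have names starting with `core`, and it relies on the `$model.log.$LID` naming of log files.

```python
from pathlib import Path


def scan_model_dir(model_dir):
    """Report zero-length files, core files, and log files that mention ENDRUN."""
    report = {"endrun": [], "core": [], "empty": []}
    for path in sorted(Path(model_dir).iterdir()):
        if not path.is_file():
            continue
        if path.stat().st_size == 0:
            report["empty"].append(path.name)
        elif path.name.startswith("core"):  # assumed core file naming
            report["core"].append(path.name)
        elif ".log." in path.name and "ENDRUN" in path.read_text(errors="replace"):
            report["endrun"].append(path.name)
    return report
```

Running it over each component directory gives a quick first pass before reading the individual logs and the stdout and stderr files by hand.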
Modifying source code
• Modifying files in the ccsm models directory is
not recommended
• Create directories under scripts/$CASE
– src.atm, src.lnd, src.ocn, src.ice, src.cpl
– Copy subset of model source code to these
directories and modify it
– Has highest priority with respect to build
• Benefits include
– Release source code remains unmodified and
available
– Allows implementation of case dependent code
modifications
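The statement that the case's src.$model directories have highest priority in the build can be pictured as an ordered search. This is a sketch of that idea in Python, not the actual CCSM build mechanism; `resolve_source` and its arguments are hypothetical.

```python
from pathlib import Path


def resolve_source(filename, case_src_dir, release_src_dir):
    """Prefer the copy in the case's src.$model directory over the release copy."""
    for directory in (case_src_dir, release_src_dir):
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```

Only files copied into the case directory are overridden; everything else still comes from the unmodified release source.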
Multiple Machine Support
• Should run on blackforest, babyblue, and ute
“out of the box”
• “Other” machines include seaborg, nirvana,
eagle, falcon, cheetah
• Supported platforms are indicated in $OS,
$SITE, $MACH, $ARCH environment
variables in the main script
• See also scripts/tools/test.a1.mods.$MACH
for suggested changes to test.a1.run for
“other” machines.
Running on a “New” Machine
• Main script
– Set batch queue commands
– Add new $OS, $SITE, $MACH, $ARCH options
– Set standard CCSM path names, $CSMROOT, …
– Harvester submission issues
– Set data movement variables, $LMSINP, …
• Harvester script
– May require modification
• Tools
– May need to modify ccsm_msread, ccsm_mswrite
• Build
– Modify models/bld/Macros.$OS file
ccsm_joe
• Created by main script
• Updated every time the main script runs
• Case dependent
• Records important ccsm environment variables
• Can be “sourced” by other scripts to inherit ccsm environment variables
Interactive/Batch Issues
• Can run main script interactively
• Typically used to build and pre-stage initial
data
• Uncomment “exit” command in main script to
stop the script before script starts ccsm
execution
• Batch environment highly site dependent
– NQS
– Loadleveler
– LSF
– PBS
Common Errors (1)
• Model won’t build
– Try rebuilding clean
– Remove all obj directories, these are
$OBJROOT/model/obj which is normally
equivalent to $EXEROOT/model/obj
– When rebuilding, make sure $SETBLD is true in
main script
• Model won’t continue due to restart problem
– Determine cause of problem; quota, hardware,
script, zero length files, rpointer problems
– Fix if possible
– Back up to latest “good” restart dataset
– Rerun
Common Errors (2)
• Ice model stops due to mp transport error
– Double ndte in ice.setup.csh ice model namelist
– Back up to latest “good” restart dataset
– Run past previous stop date
– Reset ndte value
• Ocean model non-convergence
– Add about 10% to the number of model timesteps/hour in ocn.setup.csh, DT_COUNT
– Back up to latest “good” restart dataset
– Run past previous stop date
– Reset DT_COUNT
– Non-convergence on first timestep is special case
Tools
• Under scripts/tools
– ccsm_getfile : hierarchical search for file
– ccsm_getinput : hierarchical search for input file
– ccsm_msread : copies a file from local mass store
– ccsm_mswrite : copies a file to local mass store
– ccsm_checkenvs : echo ccsm environment variables, used to create ccsm_joe
– ccsm_getrestart : copies restart files from $ARCROOT/restart to appropriate $EXEROOT and scripts/$CASE directories
Performance
• This is complicated!
• Issues
– Performance of components and system as a
function of resolution and configuration
– Scalability of individual components, scaling
efficiency of individual components
– Task/Thread counts
– Components sharing nodes, overloading nodes
with multiple components, overloading threads,
overloading tasks
– Load balance of coupled system
Component Timings
[Figure: seconds per simulated day (0 to 300) versus number of processors (4, 8, 16, 32, 64) for atm, lnd, ice, and ocn]
CCSM Load Balancing
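Processors
• 40 ocean
• 32 atm
• 16 ice
• 12 land
• 4 cpl
• 104 total
[Figure: per-component timing diagram, timings in seconds per day. Values as extracted: 53.2, 8.6, 6.2, 9.4, 10.0, 5 3, 2, 40.4, 15.0, 3.0, 10.0, 55; their assignment to components is not recoverable from this transcript]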
Component/Hardware layout
• Machine, set of nodes
• Nodes, group of processors that share
memory
• Processors, individual computing elements
• General rules
– Do not oversubscribe processors, place only 1
MPI task or 1 thread on each processor
– Minimize the number of nodes used for a given
component and processor requirement
– Multiple components can share a node as long as
there is no oversubscription of processors
– Test several decompositions, layouts, task/thread
combinations to try to optimize performance
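The first two layout rules reduce to simple arithmetic: count the processors requested, and compare against what the nodes provide. The sketch below is in Python, uses the processor counts from the load balancing slide, and assumes a hypothetical node size of 4 processors; `minimum_nodes` and `is_oversubscribed` are illustrative names only.

```python
import math

# Processors per component, taken from the load balancing slide.
PROCESSORS = {"ocn": 40, "atm": 32, "ice": 16, "lnd": 12, "cpl": 4}


def minimum_nodes(processors, procs_per_node):
    """Fewest nodes that can hold `processors` with one task or thread per processor."""
    return math.ceil(processors / procs_per_node)


def is_oversubscribed(requested, nodes, procs_per_node):
    """True if the components together ask for more processors than the nodes have."""
    return sum(requested.values()) > nodes * procs_per_node


if __name__ == "__main__":
    procs_per_node = 4  # assumption for illustration; depends on the machine
    total = sum(PROCESSORS.values())
    print("total processors:", total)
    print("nodes needed:", minimum_nodes(total, procs_per_node))
```

With 4 processors per node, the 104 processors on that slide need at least 26 nodes; requesting 25 would oversubscribe at least one node.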
Summary
• CCSM is a complicated multi-executable climate
model, expect there to be “spin-up” time
• CCSM is a scientific research code
• There are many possible components,
configurations, platforms, and resolutions; we are
unable to test everything
• Users are responsible for validating their science
• NCAR can help with software/configuration problems,
[email protected]
• Please report bugs, fixes, improvements, and ports to
new hardware, so we can incorporate those changes!
[email protected]