ESM Runscripts

Usage

esm_runscripts [-h] [-d] [-v] [-e EXPID] [-c] [-P] [-j LAST_JOBTYPE]
                  [-t TASK] [-p PID] [-x EXCLUDE] [-o ONLY]
                  [-r RESUME_FROM] [-U] [-i INSPECT]
                  runscript

Arguments

Optional arguments	Description
`-h`, `--help`	Show this help message and exit.
`-d`, `--debug`	Print lots of debugging statements.
`-v`, `--verbose`	Be verbose.
`-e EXPID`, `--expid EXPID`	The experiment ID to use. Default `test`.
`-c`, `--check`	Run in check mode (don’t submit job to supercomputer).
`-P`, `--profile`	Write profiling information (esm-tools).
`-j LAST_JOBTYPE`, `--last_jobtype LAST_JOBTYPE`	Write the jobtype this run was called from (esm-tools internal).
`-t TASK`, `--task TASK`	The task to run. Choose from: `compute`, `post`, `couple`, `tidy`.
`-p PID`, `--pid PID`	The PID of the task to observe.
`-x EXCLUDE`, `--exclude EXCLUDE`	E[x]clude this step.
`-o ONLY`, `--only ONLY`	[o]nly do this step.
`-r RESUME_FROM`, `--resume-from RESUME_FROM`	[r]esume from the specified run/step (e.g. to resume the second run, use `-r 2`).
`-U`, `--update`	[U]pdate the runscript in the experiment folder and associated files.
`--update-filetypes UPDATE_FILETYPES [UPDATE_FILETYPES ...]`	Updates the requested files from external sources in a currently ongoing simulation. For example, if you want to update the binaries and the configs (namelists) in a resubmission of an experiment, you can do this by adding `--update-filetypes bin config` to your `esm_runscripts` command. We strongly advise against using this option unless you really know what you are doing.
`-i INSPECT`, `--inspect INSPECT`	[i]nspect the results of a previous run, for example one prepared with `-c`. This argument needs an additional keyword. Choose among: `overview` (gives you the same little message you see at the beginning of each run), `lastlog` (displays the last log file), `explog` (the overall experiment logfile), `datefile` (the experiment’s date file), `config` (the Python dict that contains all information), `size` (the size of the experiment folder), or a filename or directory name (outputs the content of the file or directory if found in the last `run_` folder).
`--trace`	Enable `TRACE`-level output to stdout.
`--task-log-files`	Enable per-task log files on disk.

Running a Model/Setup

ESM-Runscripts is the ESM-Tools package that allows the user to run the experiments. ESM-Runscripts reads the runscript (either a bash or yaml file), applies the required changes to the namelists and configuration files, submits the runs of the experiment to the compute nodes, and handles and organizes restart, output and log files. The command to run a runscript is:

$ esm_runscripts <runscript.yaml/.run> -e <experiment_ID>

The runscript.yaml/.run should contain all the information regarding the experiment paths and the particular configuration of the experiment (see the yaml:Runscripts section for more information about the syntax of yaml runscripts). The experiment_ID is used to identify the experiment in the scheduler and to name the experiment’s directory (see Experiment Directory Structure). Omitting the argument -e <experiment_ID> will create an experiment with the default experiment ID test.

ESM-Runscripts can run an experiment check when the -c flag is added to the previous command. This check performs all the system operations related to the experiment that would take place in a normal run (it creates the experiment directory and subdirectories, copies the binaries and the necessary restart/forcing files, edits the namelists, …) but stops before submitting the run to the compute nodes. We strongly recommend running a check before submitting an experiment to the compute nodes, as the check output already contains valuable information for judging whether the experiment will work correctly (pay particular attention to the Namelists and the Missing files sections of the check’s output).

Job Phases

ESM-Tools job phases

The following table summarizes the job phases of ESM-Runscripts and gives a brief description. …

Running only part of a job

It’s possible to run only part of a job. This is particularly useful for development work, when you might only want to test a specific phase without having to run a whole simulation.

As an example, let’s say you only want to run the tidy phase of a particular job, which will move things from the particular run folder to the overall experiment tree. In this example, the experiment will be called test001:

esm_runscripts ${PATH_TO_USER_CONFIG} -e test001 -t tidy

Experiment Directory Structure

All the files related to a given experiment are saved in the Experiment Directory. This includes, among others, model binaries, libraries, namelists, configuration files, outputs and restarts. The idea behind this approach is that all the files needed to run an experiment are contained in this folder (the user can always control through the runscript or configuration files whether the large forcing and mesh files also go into this folder), so that the experiment can be reproduced even if, for example, one of the model’s binaries or the original runscript has changed.

The path of the Experiment Directory is composed of the general.base_dir path specified in the runscript (see yaml:Runscripts syntax) followed by the experiment_ID given in the esm_runscripts call:

<general.base_dir>/<experiment_ID>

The main experiment folder (General exp dir) contains the subfolders indicated in the graph and table below. Each of these subfolders contains a folder for each component in the experiment (e.g. for an AWI-CM experiment the outdata folder will contain the subfolders echam, fesom, hdmodel, jsbach, oasis3mct).

The structure of the run folder run_YYYYMMDD-YYYYMMDD (Run dir in the graph) replicates that of the general experiment folder. Run directories are created before each new run and they are useful to debug and restart experiments that have crashed.

Experiment directory structure

Subfolder	Files	Description
analysis	user’s files	Results of user’s “by-hand” analysis can be placed here.
bin	component binaries	Model binaries needed for the experiment.
config	`<experiment_ID>_finished_config.yaml`; namelists; other configuration files	Configuration files for the experiment including namelists and other files specified in the component’s configuration files (`<PATH>/esm_tools/configs/<component>/<component>.yaml`, see File Dictionaries). The file `<experiment_ID>_finished_config.yaml` is located at the base of the `config` folder and contains the whole ESM-Tools variable space for the experiment, resulting from combining the variables of the runscript, setup and component configuration files, and the machine environment file.
couple	coupling related files	Necessary files for model couplings.
forcing	forcing files	Forcing files for the experiment. Only copied here when specified by the user in the runscript or in the configuration files (File Dictionaries).
input	input files	Input files for the experiment. Only copied here when specified by the user in the runscript or in the configuration files (File Dictionaries).
log	`<experiment_ID>_<setup_name>.log`; component log files	Experiment log files. The component specific log files are placed in their respective subfolder. The general log file `<experiment_ID>_<setup_name>.log` reports on the ESM-Runscripts Job Phases and is located at the base of the `log` folder. Log file names and copying instructions should be included in the configuration files of components (File Dictionaries).
mon	user’s files	Monitoring scripts created by the user can be placed here.
outdata	outdata files	Outdata files are placed here. Outdata file names and copying instructions should be included in the configuration files of components (File Dictionaries).
restart	restart files	Restart files are placed here. Restart file names and copying instructions should be included in the configuration files of components (File Dictionaries).
run_YYYYMMDD-YYYYMMDD	run files	Run folder containing all the files for a given run. Folders contained here have the same names as the ones contained in the general experiment folder (`analysis`, `bin`, `config`, etc). Once the run is finished the run files are copied to the general experiment folder.
scripts	`esm_tools` folder containing all namelists and all functions; `<experiment_ID>_compute_YYYYMMDD-YYYYMMDD.run`; `<experiment_ID>_compute_YYYYMMDD-YYYYMMDD_<JobID>.log`; `<experiment_ID>_<setup_name>.date`; original runscript file.log; hostfile_srun	Contains all the scripts needed for the experiment. A subfolder `esm_tools` includes all the config files and namelists of `ESM-Tools` (a copy of the `configs` and `namelists` folders in the `esm_tools` installation folder). It also contains the `.run` files to be submitted to SLURM. The file `<experiment_ID>_compute_YYYYMMDD-YYYYMMDD_<JobID>.log` is the log file for the experiment run. The file `<experiment_ID>_<setup_name>.date` indicates the finishing date of the last run.
unknown		Folder where all the unknown files from `run_YYYYMMDD-YYYYMMDD/work` are copied.
viz	user’s files	Aimed for user’s visualization scripts.
work	component files; output files before being copied to the `outdata` folder; restart files before being copied to the `restart` folder	The `work` folder inside the `run_YYYYMMDD-YYYYMMDD` folder is the main directory where the components are executed. Output and restart files are generated here before being copied to their respective folders.

If a file is to be copied into a directory that already contains a file with the same name, both files are renamed by appending their start and end dates to their names (e.g. fesom.clock_YYYYMMDD-YYYYMMDD).
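The naming pattern can be illustrated with a short sketch. The helper `dated_name` is hypothetical and not part of ESM-Tools; it only shows how a `YYYYMMDD-YYYYMMDD` date range is appended to a file name, and the dates used are made up.

```python
from datetime import date


def dated_name(filename: str, start: date, end: date) -> str:
    """Append a YYYYMMDD-YYYYMMDD suffix to a file name (illustrative helper)."""
    return f"{filename}_{start:%Y%m%d}-{end:%Y%m%d}"


# Hypothetical run covering the year 2000
print(dated_name("fesom.clock", date(2000, 1, 1), date(2000, 12, 31)))
# fesom.clock_20000101-20001231
```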

Note

Having a general folder and several run subfolders means that files are duplicated and, when experiments consist of several runs, the general directory can end up looking very untidy. Run folders were created with the idea that they will be deleted once all files have been transferred to their respective folders in the general experiment directory. The default is not to delete these folders, as they can be useful for debugging or restarting a crashed simulation, but the user can choose to delete them (see Cleanup of run_ directories).

Cleanup of `run_` directories

This plugin allows you to clean up the run_${DATE} folders. To do that you can use the following variables under the general section of your runscript (documentation follows order of code as it is executed):

clean_runs: This is the most important variable for most users. It can take the following values:
- True: removes the run_ directory after each run (overrides every other clean_ option).
- False: does not remove any run_ directory (this is the default if no clean_ variable is defined).
- <int>: giving an integer as a value results in deleting the run_ folders except for the last <int> runs (recommended option as it allows for debugging of crashed simulations).
Note

clean_runs: (bool) is incompatible with clean_this_rundir, and clean_runs: (int) is incompatible with clean_old_rundirs_except (an error will be raised after the end of the first simulation). The clean_runs variable alone will cover most standard user requirements. If finer tuning of the removal of run_ directories is required, you can use the following variables instead of clean_runs.
clean_this_rundir: (bool) Removes the entire run directory (equivalent to clean_runs: (bool)). clean_this_rundir: True overrides every other clean_ option.
clean_old_rundirs_except: (int) Removes the entire run directory except for the last <x> runs (equivalent to clean_runs: (int)).
clean_old_rundirs_keep_every: (int) Removes the entire run directory except every <x>th run. Compatible with clean_old_rundirs_except or clean_runs: (int).
clean_<filetype>_dir: (bool) Erases the run directory for a specific filetype. Compatible with all the other options.
clean_size: (int or float) Erases all files with size greater than clean_size, must be specified in bytes! Compatible with all the other options.
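Because clean_size must be given in bytes, it can help to compute the value explicitly rather than typing it by hand. The sketch below converts a size in GiB to bytes; the helper name `gib_to_bytes` and the 2 GiB threshold are illustrative assumptions, not part of ESM-Tools.

```python
def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB (2**30 bytes each) to whole bytes."""
    return int(gib * 1024**3)


# Value one could write as `clean_size` in the general section
# (the 2 GiB threshold is a made-up example)
print(gib_to_bytes(2))  # 2147483648
```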

Example

To delete all the run_ directories in your experiment include this into your runscript:

general:
        clean_runs: True

To keep the last 2 run_ directories:

general:
        clean_runs: 2

To keep the last 2 run_ directories and every 5th one:

general:
        clean_old_rundirs_except: 2
        clean_old_rundirs_keep_every: 5
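To get a feeling for which run_ directories survive a combination of clean_old_rundirs_except and clean_old_rundirs_keep_every, the following sketch lists the kept runs. It is not the ESM-Tools implementation: the function `runs_to_keep` is hypothetical, and it assumes that runs are numbered from 1 and that "every 5th run" means run numbers divisible by 5.

```python
from typing import List, Optional


def runs_to_keep(n_runs: int, except_last: int,
                 keep_every: Optional[int] = None) -> List[int]:
    """Return the run numbers whose run_ directories would be kept.

    Assumption: runs are numbered 1..n_runs and the periodic rule keeps
    run numbers that are multiples of `keep_every`.
    """
    runs = range(1, n_runs + 1)
    # The most recent `except_last` runs are always kept
    last = {r for r in runs if r > n_runs - except_last}
    # Optionally keep every `keep_every`-th run as well
    periodic = {r for r in runs if keep_every and r % keep_every == 0}
    return sorted(last | periodic)


# After 12 runs with clean_old_rundirs_except: 2 and clean_old_rundirs_keep_every: 5
print(runs_to_keep(12, except_last=2, keep_every=5))  # [5, 10, 11, 12]
```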

Debugging an Experiment

To debug an experiment we recommend checking the following files that you will find, either in the general experiment directory or in the run subdirectory:

The ESM-Tools variable space file config/<experiment_ID>_finished_config.yaml.

The run log file run_YYYYMMDD-YYYYMMDD/<experiment_ID>_compute_YYYYMMDD-YYYYMMDD_<JobID>.log.

For interactive debugging, you may also add the following to the general section of your configuration file. This will enable the pdb Python debugger, and allow you to step through the recipe.

general:
    debug_recipe: True

Configuration Provenance

In addition to the hints summarized in the “Debugging an Experiment” section, you will also find that the finished_config.yaml found in your config directory contains end-of-line comments detailing where a particular setting came from. You can use this to track down what is being set and why. We strongly recommend not changing the configuration files found in your esm-tools source directory unless you know exactly what you are doing. All of the configuration settings can be overridden from the run configuration, which is the preferred location for user changes. For more information see How can I know where a parameter is defined?.

Setting the file movement method for filetypes in the runscript

By default, esm_runscripts copies all files initially into the first run_ folder, and from there to work. After the run, outputs, logs, restarts etc. are copied from work to run_, and then moved from there to the overall experiment folder. We chose this as the default because it is the safest option, leaving the user with everything belonging to the experiment in one folder. It is also the option that consumes the most disk space, so it can make sense to link some files into the experiment rather than copy them.

As an example, to configure esm_runscripts for an echam-experiment to link the forcing and inputs, one can add the following to the runscript yaml file:

echam:
        file_movements:
                forcing:
                        all_directions: "link"
                input:
                        init_to_exp: "link"
                        exp_to_run: "link"
                        run_to_work: "link"
                        work_to_run: "link"

Both ways of setting the entries do the same thing. It is possible, as in the input case, to set the file movement method independently for each of the directions; the setting all_directions is just a shortcut for when the method is identical for all of them.
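The equivalence between the shortcut and the per-direction form can be sketched as follows. This is an illustration only, not ESM-Tools code: the function `expand_movements` is hypothetical, and letting explicit per-direction entries take precedence over all_directions is an assumption.

```python
# Direction names as used in the runscript example above
DIRECTIONS = ("init_to_exp", "exp_to_run", "run_to_work", "work_to_run")


def expand_movements(settings: dict) -> dict:
    """Expand an `all_directions` shortcut into one entry per direction."""
    expanded = {}
    if "all_directions" in settings:
        expanded = {d: settings["all_directions"] for d in DIRECTIONS}
    # Assumption: explicit per-direction entries override the shortcut
    expanded.update({k: v for k, v in settings.items() if k != "all_directions"})
    return expanded


forcing = expand_movements({"all_directions": "link"})
inputs = expand_movements({d: "link" for d in DIRECTIONS})
print(forcing == inputs)  # True
```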

Parallel File Movements

By default, esm_runscripts moves files (inputs, restarts, outputs, etc.) between experiment directories in parallel using Dask workers distributed across the compute nodes. This can significantly speed up the file-handling phases of a simulation, especially on large allocations.

The mode is controlled by parallel_file_movements in the general section of your runscript. To use Dask workers across compute nodes (the default) use:

general:
    parallel_file_movements: "dask"

To use local threads instead (only 1 node with cores working in parallel):

general:
    parallel_file_movements: "threads"

To disable parallel file movements entirely (serial):

general:
    parallel_file_movements: False

Which mode should I use?

Small runs (< 5 nodes): "threads" is usually sufficient and has no cluster initialization overhead.
Large runs (>= 5 nodes): "dask" distributes I/O across all nodes and scales better. The small startup cost is offset by faster file transfers.
Debugging or safety: False runs everything serially.

Note

When "dask" is selected but the Dask cluster is not available (e.g. on a login node or if workers fail to start), esm_runscripts automatically falls back to "threads" with a warning.
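The rule of thumb and the fallback behaviour described here can be summarized in a small sketch. Both functions are hypothetical helpers written for illustration; they are not part of esm_runscripts, and the 5-node threshold is simply the guideline given in this section.

```python
def choose_mode(nnodes: int, debugging: bool = False):
    """Suggest a value for `parallel_file_movements` (rule of thumb only)."""
    if debugging:
        return False  # serial file movements
    return "threads" if nnodes < 5 else "dask"


def effective_mode(requested, cluster_available: bool):
    """Mode actually used once the documented fallback is applied."""
    if requested == "dask" and not cluster_available:
        return "threads"  # fall back when the Dask cluster is unavailable
    return requested


print(choose_mode(2))    # threads
print(choose_mode(16))   # dask
print(effective_mode("dask", cluster_available=False))  # threads
```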

Dask internals

When parallel_file_movements is set to "dask", the following happens automatically:

Cluster startup – Before the main recipe begins, a Dask scheduler and workers are launched via srun (SLURM) on the allocated nodes. The scheduler writes a dask_scheduler.json file into the run’s work directory.
Parallel I/O – During each file-movement phase, esm_runscripts connects to the Dask cluster and submits copy/link/move operations as parallel tasks. If any single file transfer fails, it is retried serially as a fallback.

The cluster is configured through the dask section, which has sensible defaults but can be tuned in your runscript:

Dask configuration variables
Variable	Default	Description
`dask.client_timeout`	`0.05`	Timeout (seconds) when probing the Dask scheduler status.
`dask.workers_timeout`	`5`	Max time (seconds) to wait for workers to become available.
`dask.poll_interval`	`0.5`	How often (seconds) to poll for cluster readiness.
`dask.init_scheduler_cmd`	per batch system	Shell command to start the Dask scheduler (defined in the batch system YAML, e.g. `slurm.yaml`).
`dask.init_workers_cmd`	per batch system	Shell command to start the Dask workers (defined in the batch system YAML, e.g. `slurm.yaml`).
`dask.scheduler_json`	`${general.thisrun_work_dir}/dask_scheduler.json`	Full path to the Dask scheduler JSON file used for client connections.
`dask.actions`	`["parallel_file_movements"]`	List of actions that trigger Dask cluster initialization.

The Dask scheduler and worker launch commands are defined per batch system (e.g. in slurm.yaml) and are not typically changed by users. For SLURM, the default worker count is nnodes * partition_cpn / 4, meaning one Dask worker per four CPU cores (see configs/other_software/batch/slurm.yaml). Workers are distributed cyclically across all allocated nodes using InfiniBand:

slurm.yaml

dask:
    ntasks: "$(( ${computer.nnodes} * ${computer.partition_cpn} / 4 ))"
    ...
    init_workers_cmd: "srun --ntasks=${dask.ntasks} --cpus-per-task=1 --nodes=@nodes@ --distribution=cyclic:cyclic:cyclic dask worker --scheduler-file ${dask.scheduler_json} --nthreads 1 --nworkers 1 --interface ib0"

If you need to change the number of workers, you can either redefine dask.ntasks or provide a custom dask.init_workers_cmd in any of your configuration files or directly in your runscript.
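As a quick check of what the default dask.ntasks expression evaluates to, the arithmetic can be reproduced in Python. The function name `dask_ntasks` and the node and core counts are made-up examples; the only thing taken from the configuration above is the formula, which the shell evaluates with integer arithmetic.

```python
def dask_ntasks(nnodes: int, partition_cpn: int) -> int:
    """Mirror the shell expression $(( nnodes * partition_cpn / 4 )).

    Shell arithmetic is integer-only and evaluates left to right,
    so the division truncates after the multiplication.
    """
    return nnodes * partition_cpn // 4


# Hypothetical allocation: 8 nodes with 128 cores each
print(dask_ntasks(8, 128))  # 256
```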

Running an experiment with a virtual environment

Running jobs can optionally be encapsulated into a virtual environment.

To use a virtual environment run esm_runscripts with the flag --contained-run or set use_venv within the general section of your runscript to True:

general:
    use_venv: True

This shields the run from changes made to the remainder of the ESM-Tools installation, and is strongly recommended for production runs.

Warning

Refrain from using this feature if you have installed ESM-Tools within a conda environment.

If you choose to use a virtual environment, a local installation will be created at the beginning of the first run in a folder named .venv_esmtools at the root of your experiment. That installation contains all the Python libraries used by ESM-Tools and will be used for the experiment. Creating it at the beginning of the experiment adds a small overhead (~2-3 minutes).

For example, for a user miguel with a run with expid test ESM-Tools will be installed here:

/scratch/miguel/test/.venv_esmtools/lib/python3.10/site-packages/esm_tools

instead of:

/albedo/home/miguel/.local/lib/site-packages/esm_tools

The virtual environment installs by default the release branch, pulling it directly from our GitHub repository. You can choose to override this default by specifying another branch, adding to your runscript:

general:
    install_esm_tools_branch: '<your_branch_name>'

Warning

The branch needs to exist on GitHub, as it is cloned from there and not from your local folder. If you made any changes in your local branch, make sure they are pushed before running esm_runscripts with a virtual environment, so that your changes are included in the virtual environment installation.

You may also choose to install esm_tools in editable mode, in which case it will be installed in a folder src/esm_tools/ in the root of your experiment. Any changes made to the code in that folder will influence how ESM-Tools behaves. To create a virtual environment with ESM-Tools installed in editable mode use:

general:
    install_<esm_package>_editable: true/false

Note

When using a virtual environment, config files and namelists will come from the folder .venv_esmtools listed above and not from your user install directory. You should make all changes to the namelists and config files via your user runscript (Changing Namelists). This is recommended in all cases.

Running an experiment with conda

If you submit esm_runscripts from within an active conda environment, that same environment is automatically activated inside the generated job script before the model runs. You don’t need to do anything for this to work.

If you need finer control (e.g. the environment used on the compute nodes should differ from the one used to launch esm_runscripts, or conda is not on the default PATH of the compute nodes), you can specify it explicitly in a conda section of your runscript:

conda:
    env: /path/to/your/conda/env
    root: /path/to/conda  # optional, needed if "conda" is not already on PATH

See conda.env, conda.root, and launched_with_conda in Run-time variables for details.

Warning

Do not combine this feature with the virtual environment feature described above (use_venv); mixing a conda environment with a venv created by ESM-Tools may cause conflicts.

Logging and verbosity

esm_runscripts uses Loguru-based logging with simple flags to control verbosity and file logging. Logs are always written to the main run log (<base_dir>/<expid>/log/<expid>_<model>_<datestamp>_<jobid>.log). For finer granularity, you can also pass --task-log-files to esm_runscripts to write the logs of each task to a separate file. You can use the following esm_runscripts flags to control the logging behavior:

--trace: enable TRACE-level output to stdout. Prints very detailed diagnostics and the parsed command-line config.
-d, --debug: enable DEBUG-level output to stdout (less detailed than --trace) and breakpoints.
-v, --verbose: also enables DEBUG-level output to stdout, without breakpoints.
--task-log-files: enable per-task log files on disk. When enabled, esm_runscripts writes each task’s output to a file in the experiment’s log folder (<base_dir>/<expid>/log/<expid>_<model>_<task>_<datestamp>_<jobid>.log). To reduce the number of files, this option is turned off by default, but the logs are always printed in the run log anyway.

Note

Because the logging starts before the parsing of the yaml files, it is not possible to control the logging behavior from variables defined in the yamls. Only command-line flags can control the logging behavior.
