esm_archiving package

Top-level package for ESM Archiving.

esm_archiving.archive_mistral(tfile, rtfile=None)[source]

Puts the tfile to the tape archive using tape_command

Parameters:
  • tfile (str) – The full path of the file to put to tape

  • rtfile (str) – The filename on the remote tape server. Defaults to None, in which case a replacement is performed to keep as much of the filename the same as possible. Example: /work/ab0246/a270077/experiment.tgz –> /hpss/arch/ab0246/a270077/experiment.tgz

Returns:

Return type:

None
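The default remote name described above can be pictured as a simple prefix swap. The following is a minimal sketch, not the package's implementation: the helper name default_remote_path and the two root arguments are assumptions made for illustration, taken only from the example path pair given for rtfile.

```python
def default_remote_path(tfile, local_root="/work", remote_root="/hpss/arch"):
    """Hypothetical helper: swap the local root for the tape-archive root.

    Everything after the local root is kept unchanged, so the remote
    name stays as close to the local one as possible.
    """
    if not tfile.startswith(local_root + "/"):
        raise ValueError(f"{tfile!r} is not below {local_root!r}")
    return remote_root + tfile[len(local_root):]


print(default_remote_path("/work/ab0246/a270077/experiment.tgz"))
# prints /hpss/arch/ab0246/a270077/experiment.tgz
```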

esm_archiving.check_tar_lists(tar_lists)[source]
esm_archiving.delete_original_data(tfile, force=False)[source]

Erases data which is found in the tar file.

Parameters:
  • tfile (str) – Path to the tarfile whose data should be erased.

  • force (bool) – If False, asks the user to confirm before deleting their files. Otherwise deletes them silently. Defaults to False.

Returns:

Return type:

None

esm_archiving.determine_datestamp_location(files)[source]

Given a list of files, figures out where the datestamp is by checking which part of the names varies.

Parameters:

files (list) – A list (longer than 1!) of files to check

Returns:

A slice object giving the location of the datestamp

Return type:

slice

Raises:

DatestampLocationError – Raised if more than one slice is found where the numbers vary across files, or if the file list is not longer than 1.
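One way to picture the check described above: collect every run of digits in the first file name, then keep only the runs whose text differs between files. This is a sketch under stated assumptions, not the package's code. It assumes all names have the same layout and length, and the names find_varying_number_slice and SketchDatestampError are hypothetical stand-ins.

```python
import re


class SketchDatestampError(Exception):
    """Local stand-in for the package's DatestampLocationError."""


def find_varying_number_slice(files):
    """Return the slice of the single digit run that varies across files."""
    if len(files) < 2:
        raise SketchDatestampError("Need more than one file to compare")
    # Candidate locations: every run of digits in the first name.
    candidates = [slice(m.start(), m.end()) for m in re.finditer(r"\d+", files[0])]
    # A candidate varies if cutting it out of each name gives different text.
    varying = [s for s in candidates if len({name[s] for name in files}) > 1]
    if len(varying) != 1:
        raise SketchDatestampError(f"Expected one varying slice, found {len(varying)}")
    return varying[0]


files = ["exp_echam6_185001.nc", "exp_echam6_185002.nc"]
location = find_varying_number_slice(files)
print(location, files[0][location])
```

The "6" in "echam6" is a digit run too, but it is identical in both names, so only the date run is returned.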

esm_archiving.determine_potential_datestamp_locations(filepattern)[source]

For a filepattern, returns the indices of potential date locations

Parameters:

filepattern (str) – The filepattern to check.

Returns:

A list of slice objects which you can use to cut out dates from the filepattern

Return type:

list

esm_archiving.find_indices_of(char, in_string)[source]

Finds indices of a specific character in a string

Parameters:
  • char (str) – The character to look for

  • in_string (str) – The string to look in

Yields:

int – Each round of the generator gives you the next index for the desired character.
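A generator with this behaviour can be written in a few lines. The sketch below uses the hypothetical name iter_indices_of and is only meant to show what "each round gives the next index" looks like in practice.

```python
def iter_indices_of(char, in_string):
    """Yield each index at which char occurs in in_string, in order."""
    for index, current in enumerate(in_string):
        if current == char:
            yield index


print(list(iter_indices_of("#", "echam6_######.nc")))
# prints [7, 8, 9, 10, 11, 12]
```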

esm_archiving.get_files_for_date_range(filepattern, start_date, stop_date, frequency, date_format='%Y%m%d')[source]

Creates a list of files for specified start/stop dates

Parameters:
  • filepattern (str) – A filepattern to replace dates in

  • start_date (str) – The starting date, in a pandas-friendly date format

  • stop_date (str) – Ending date, pandas friendly. Note that for end dates, you need to add one month to ensure that you get the last step in your list!

  • frequency (str) – Frequency of dates, pandas friendly

  • date_format (str) – How dates should be formatted, defaults to %Y%m%d

Returns:

A list of strings for the filepattern with correct date stamps.

Return type:

list

Example

>>> filepattern =  "LGM_24hourly_PMIP4_echam6_BOT_mm_>>>DATE<<<.nc"
>>> LGM_files = get_files_for_date_range(filepattern, "1890-07", "1891-11", "1M", date_format="%Y%m")
>>> LGM_files == [
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189007.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189008.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189009.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189010.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189011.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189012.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189101.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189102.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189103.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189104.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189105.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189106.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189107.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189108.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189109.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189110.nc",
... ]
True
esm_archiving.get_list_from_filepattern(filepattern)[source]
esm_archiving.group_files(top, filetype)[source]

Generates quasi-regexes for a specific filetype, replacing all numbers with #.

Parameters:
  • top (str) – Where to start looking (this should normally be top of the experiment)

  • filetype (str) – Which files to go through (e.g. outdata, restart, etc…)

Returns:

A dictionary containing keys for each folder found in filetype, and values as lists of files with strings where numbers are replaced by #.

Return type:

dict
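The "quasi-regex" idea is that every digit in a file name becomes #, so files that differ only in their numbers collapse to the same pattern. A minimal sketch, assuming a plain regular-expression substitution (the helper name to_hash_pattern is hypothetical, and whether the package deduplicates patterns this way is an assumption):

```python
import re


def to_hash_pattern(filename):
    """Replace every digit in filename with '#'."""
    return re.sub(r"\d", "#", filename)


names = [
    "LGM_echam6_BOT_mm_189007.nc",
    "LGM_echam6_BOT_mm_189008.nc",
]
# Both names differ only in digits, so they share one pattern.
patterns = sorted({to_hash_pattern(name) for name in names})
print(patterns)
# prints ['LGM_echam#_BOT_mm_######.nc']
```

Note that digits which are not part of a date (the 6 in echam6) are replaced as well, which is why a later step has to restore them.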

esm_archiving.group_indexes(index_list)[source]

Splits indexes into tuples of monotonically ascending values.

Parameters:

index_list (list) – The list of indexes to split up

Returns:

A list of tuples, each holding one group of consecutive ascending indexes.

Return type:

list

Example

>>> indexes = [0, 1, 2, 3, 12, 13, 15, 16]
>>> group_indexes(indexes)
[(0, 1, 2, 3), (12, 13), (15, 16)]
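The grouping in the example can be reproduced with a single pass that starts a new group whenever the next index is not exactly one larger than the previous one. This is an illustrative sketch with a hypothetical name, group_consecutive, and not necessarily how the package does it.

```python
def group_consecutive(index_list):
    """Split a list of indexes into tuples of consecutive values."""
    groups = []
    current = []
    for index in index_list:
        # A gap ends the current run and starts a new one.
        if current and index != current[-1] + 1:
            groups.append(tuple(current))
            current = []
        current.append(index)
    if current:
        groups.append(tuple(current))
    return groups


print(group_consecutive([0, 1, 2, 3, 12, 13, 15, 16]))
# prints [(0, 1, 2, 3), (12, 13), (15, 16)]
```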
esm_archiving.log_tarfile_contents(tfile)[source]

Generates a log of the tarball contents

Parameters:

tfile (str) – The path for the tar file to generate a log for

Returns:

Return type:

None

Warning

Note that for this function to work, you need to have write permission in the directory where the tarball is located. If not, this will probably raise an OSError. I can imagine giving the location of the log path as an argument; but would like to see if that is actually needed before implementing it…

esm_archiving.pack_tarfile(flist, wdir, outname)[source]

Creates a compressed tarball (outname) with all files found in flist.

Parameters:
  • flist (list) – A list of files to include in this tarball

  • wdir (str) – The directory to “change” to when packing up the tar file. This will (essentially) be used in the tar command as the -C option by stripping off the beginning of the flist

  • outname (str) – The output file name

Returns:

The output file name

Return type:

str
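The effect of wdir can be illustrated with the standard library tarfile module: each file is stored under its path relative to wdir, which mirrors what tar -C does. This is a sketch under assumptions, using the hypothetical name pack_relative and a throwaway temporary directory, and is not the package's own code.

```python
import os
import tarfile
import tempfile


def pack_relative(flist, wdir, outname):
    """Write a gzip-compressed tarball with member names relative to wdir."""
    with tarfile.open(outname, "w:gz") as tar:
        for path in flist:
            tar.add(path, arcname=os.path.relpath(path, wdir))
    return outname


# Demonstration on a temporary "experiment" directory.
with tempfile.TemporaryDirectory() as wdir:
    member = os.path.join(wdir, "outdata", "echam", "file_185001.nc")
    os.makedirs(os.path.dirname(member))
    with open(member, "w") as handle:
        handle.write("dummy")
    archive = pack_relative([member], wdir, os.path.join(wdir, "experiment.tgz"))
    with tarfile.open(archive) as tar:
        member_names = tar.getnames()

print(member_names)
```

The member name no longer contains the temporary directory, so the tarball unpacks relative to wherever it is extracted.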

esm_archiving.purify_expid_in(model_files, expid, restore=False)[source]

Puts or restores >>>EXPID<<< marker in filepatterns

Parameters:
  • model_files (dict) – The model files for archiving

  • expid (str) – The experiment ID to purify or restore

  • restore (bool) – Set experiment ID back from the temporary marker

Returns:

Dictionary containing keys for each model, values for file patterns

Return type:

dict
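The put-and-restore behaviour can be sketched as a string replacement in both directions. The function name swap_expid is hypothetical, and the sketch assumes that the values of model_files are lists of pattern strings.

```python
MARKER = ">>>EXPID<<<"


def swap_expid(model_files, expid, restore=False):
    """Replace expid with the marker, or the marker with expid if restore."""
    old, new = (MARKER, expid) if restore else (expid, MARKER)
    return {
        model: [pattern.replace(old, new) for pattern in patterns]
        for model, patterns in model_files.items()
    }


original = {"echam": ["LGM01_echam6_BOT_mm_######.nc"]}
purified = swap_expid(original, "LGM01")
restored = swap_expid(purified, "LGM01", restore=True)
print(purified)
print(restored == original)
```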

esm_archiving.sort_files_to_tarlists(model_files, start_date, end_date, config)[source]
esm_archiving.split_list_due_to_size_limit(in_list, slimit)[source]
esm_archiving.stamp_filepattern(filepattern, force_return=False)[source]

Transforms # in filepatterns to >>>DATE<<< and replaces other numbers back to original

Parameters:
  • filepattern (str) – Filepattern to get date stamps for

  • force_return (bool) – Returns the list of filepatterns even if it is longer than 1.

Returns:

New filepattern, with >>>DATE<<<

Return type:

str

esm_archiving.stamp_files(model_files)[source]

Given a standard file dictionary (keys: model names, values: filepattern), figures out where the date probably is, and replaces the # sequence with a >>>DATE<<< stamp.

Parameters:

model_files (dict) – Dictionary of keys (model names) where values are lists of files for each model.

Returns:

As the input, but replaces the filepatterns with the >>>DATE<<< stamp.

Return type:

dict

esm_archiving.sum_tar_lists(tar_lists)[source]

Sums up the amount of space in the tar lists dictionary

Given tar_lists, which is generally a dictionary consisting of keys (model names) and values (files to be tarred), figures out how much space the raw, uncompressed files would use. Generally the compressed tarball will take up less space.

Parameters:

tar_lists (dict) – Dictionary of file lists to be summed up. Reports every sum as a value for the key of that particular list.

Returns:

Keys are the same as in the input, values are the sums (in bytes) of all files present within the list.

Return type:

dict

esm_archiving.sum_tar_lists_human_readable(tar_lists)[source]

As sum_tar_lists but gives back strings with human-readable sizes.
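Both size functions can be pictured as follows: sum the on-disk size of each list, then format the byte count. This is a rough sketch with hypothetical names (sum_sizes, human_readable). The choice of binary units and one decimal place is an assumption, since the exact output format of the package is not stated here.

```python
import os
import tempfile


def sum_sizes(tar_lists):
    """Return {model: total size in bytes of the files in its list}."""
    return {
        model: sum(os.path.getsize(path) for path in files)
        for model, files in tar_lists.items()
    }


def human_readable(num_bytes):
    """Format a byte count using binary units (assumed format)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Demonstration with two small temporary files (10 and 20 bytes).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, payload in (("a.nc", b"x" * 10), ("b.nc", b"x" * 20)):
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(payload)
        paths.append(path)
    sizes = sum_sizes({"echam": paths})

print(sizes)
print(human_readable(sizes["echam"]))
```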

Subpackages

Submodules

esm_archiving.cli module

After installation, you have a new command in your path:

esm_archive

Passing in the argument --help will show available subcommands:

Usage: esm_archive [OPTIONS] COMMAND [ARGS]...

  Console script for esm_archiving.

Options:
  --version             Show the version and exit.
  --write_local_config  Write a local configuration YAML file in the current
                        working directory
  --write_config        Write a global configuration YAML file in
                        ~/.config/esm_archiving/
  --help                Show this message and exit.

Commands:
  create
  upload

To use the tool, you can first create a tar archive and then use upload to put it onto the tape server.

Creating tarballs

Use esm_archive create to generate tar files from an experiment:

esm_archive create /path/to/top/of/experiment start_date end_date

The arguments start_date and end_date should take the form YYYY-MM-DD. A complete example would be:

esm_archive create /work/ab0246/a270077/from_ba0989/AWICM/LGM_6hours 1850-01-01 1851-01-01

The archiving tool will automatically pack up all files it finds matching these dates in the outdata and restart directories and generate logs at the top of the experiment folder. Note that the final date (1851-01-01 in this example) is not included. During packing, you get a progress bar indicating when the tarball is finished.

Please be aware that there are size limits in place on DKRZ’s tape server. Any tar files larger than 500 Gb will be truncated. For more information, see: https://www.dkrz.de/up/systems/hpss/hpss
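The date window is half-open: the start date is included and the end date is not. A small illustration using only the standard library (the function name in_archive_window is hypothetical and not part of the tool):

```python
from datetime import date


def in_archive_window(stamp, start_date, end_date):
    """True if start_date <= stamp < end_date (all given as YYYY-MM-DD)."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    return start <= date.fromisoformat(stamp) < end


print(in_archive_window("1850-01-01", "1850-01-01", "1851-01-01"))  # True
print(in_archive_window("1850-12-31", "1850-01-01", "1851-01-01"))  # True
print(in_archive_window("1851-01-01", "1850-01-01", "1851-01-01"))  # False
```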

Uploading tarballs

A second command, esm_archive upload, allows you to put tarballs onto the tape server at DKRZ:

esm_archive upload /path/to/top/of/experiment start_date end_date

The signature is the same as for the create subcommand. Note that for this to work, you need to have a properly configured .netrc file in your home directory:

$ cat ~/.netrc
machine tape.dkrz.de login a270077 password OMITTED

This file needs to be readable/writable only for you, e.g. chmod 600. The archiving program will then be able to automatically log into the tape server and upload the tarballs. Again, more information about logging onto the tape server without password authentication can be found here: https://www.dkrz.de/up/help/faq/hpss/how-can-i-use-the-hpss-tape-archive-without-typing-my-password-every-time-e-g-in-scripts-or-jobs
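Before starting a long upload, it can help to check that the .netrc file is private and contains an entry for the tape server. The sketch below uses Python's standard netrc and stat modules. The function name login_for is hypothetical, and the demonstration writes a throwaway file rather than touching your real ~/.netrc.

```python
import netrc
import os
import stat
import tempfile


def login_for(path, machine="tape.dkrz.de"):
    """Return the login stored for machine, after a permission check."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Any group or other permission bit means the file is not private.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} should only be accessible by you (chmod 600)")
    entry = netrc.netrc(path).authenticators(machine)
    if entry is None:
        raise LookupError(f"No entry for {machine} in {path}")
    login, _account, _password = entry
    return login


# Demonstration with a temporary file in the same format as above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "netrc")
    with open(demo_path, "w") as handle:
        handle.write("machine tape.dkrz.de login a270077 password OMITTED\n")
    os.chmod(demo_path, 0o600)
    login = login_for(demo_path)

print(login)
```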

esm_archiving.config module

When run from either the command line or in library mode (note: not as an ESM Plugin), esm_archiving can be configured in how it looks for specific files. The configuration file is called esm_archiving_config, should be written in YAML, and should have the following format:

echam:  # The model name
    archive: # archive separator **required**
        # Frequency specification (how often
        # a datestamp is generated to look for)
        frequency: "1M"
        # Date format specification
        date_format: "%Y%m"

By default, esm_archive looks in the following locations:

  1. Current working directory

  2. Any files in the XDG Standard:

    https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html

If nothing is found, the program reverts to the hard-coded defaults, found in esm_archiving/esm_archiving/config.py
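The search order above can be sketched as a list of candidate paths. This is an illustration only: the function name candidate_config_paths is hypothetical, the esm_archiving subdirectory is inferred from the ~/.config/esm_archiving/ location mentioned in the --write_config help text, and the environment variables and their fallbacks follow the XDG Base Directory specification rather than anything confirmed about this package.

```python
import os
from pathlib import Path

CONFIG_NAME = "esm_archiving_config"


def candidate_config_paths(cwd=None, environ=None):
    """List config locations in search order: cwd first, then XDG dirs."""
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)
    # XDG defaults: ~/.config for the user, /etc/xdg system-wide.
    xdg_home = Path(environ.get("XDG_CONFIG_HOME") or Path.home() / ".config")
    xdg_dirs = (environ.get("XDG_CONFIG_DIRS") or "/etc/xdg").split(":")
    paths = [cwd / CONFIG_NAME, xdg_home / "esm_archiving" / CONFIG_NAME]
    paths += [Path(d) / "esm_archiving" / CONFIG_NAME for d in xdg_dirs if d]
    return paths


demo_env = {"XDG_CONFIG_HOME": "/home/user/.config", "XDG_CONFIG_DIRS": "/etc/xdg"}
for path in candidate_config_paths(cwd="/path/to/experiment", environ=demo_env):
    print(path)
```

A loader would take the first of these paths that exists and fall back to the hard-coded defaults if none does.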

Note

In the future, the program may look for an experiment-specific configuration based on the path it is given during the create or upload step.

Generating a configuration

You can use the command line switches --write_local_config and --write_config to generate configuration files either in the current working directory, or in the global directory for your user account defined by the XDG standard (typically ~/.config/esm_archiving):

$ esm_archive --write_local_config
Writing local (experiment) configuration...

$ esm_archive --write_config
Writing global (user) configuration...
esm_archiving.config.load_config()[source]

Loads the configuration from one of the default configuration directories. If none can be found, returns the hard-coded default configuration.

Returns:

A representation of the configuration used for archiving.

Return type:

dict

esm_archiving.config.write_config_yaml(path=None)[source]

esm_archiving.esm_archiving module

This is the esm_archiving module.

exception esm_archiving.esm_archiving.DatestampLocationError[source]

Bases: Exception

esm_archiving.esm_archiving.archive_mistral(tfile, rtfile=None)[source]

Puts the tfile to the tape archive using tape_command

Parameters:
  • tfile (str) – The full path of the file to put to tape

  • rtfile (str) – The filename on the remote tape server. Defaults to None, in which case a replacement is performed to keep as much of the filename the same as possible. Example: /work/ab0246/a270077/experiment.tgz –> /hpss/arch/ab0246/a270077/experiment.tgz

Returns:

Return type:

None

esm_archiving.esm_archiving.check_tar_lists(tar_lists)[source]
esm_archiving.esm_archiving.delete_original_data(tfile, force=False)[source]

Erases data which is found in the tar file.

Parameters:
  • tfile (str) – Path to the tarfile whose data should be erased.

  • force (bool) – If False, asks the user to confirm before deleting their files. Otherwise deletes them silently. Defaults to False.

Returns:

Return type:

None

esm_archiving.esm_archiving.determine_datestamp_location(files)[source]

Given a list of files, figures out where the datestamp is by checking which part of the names varies.

Parameters:

files (list) – A list (longer than 1!) of files to check

Returns:

A slice object giving the location of the datestamp

Return type:

slice

Raises:

DatestampLocationError – Raised if more than one slice is found where the numbers vary across files, or if the file list is not longer than 1.

esm_archiving.esm_archiving.determine_potential_datestamp_locations(filepattern)[source]

For a filepattern, returns the indices of potential date locations

Parameters:

filepattern (str) – The filepattern to check.

Returns:

A list of slice objects which you can use to cut out dates from the filepattern

Return type:

list

esm_archiving.esm_archiving.find_indices_of(char, in_string)[source]

Finds indices of a specific character in a string

Parameters:
  • char (str) – The character to look for

  • in_string (str) – The string to look in

Yields:

int – Each round of the generator gives you the next index for the desired character.

esm_archiving.esm_archiving.get_files_for_date_range(filepattern, start_date, stop_date, frequency, date_format='%Y%m%d')[source]

Creates a list of files for specified start/stop dates

Parameters:
  • filepattern (str) – A filepattern to replace dates in

  • start_date (str) – The starting date, in a pandas-friendly date format

  • stop_date (str) – Ending date, pandas friendly. Note that for end dates, you need to add one month to ensure that you get the last step in your list!

  • frequency (str) – Frequency of dates, pandas friendly

  • date_format (str) – How dates should be formatted, defaults to %Y%m%d

Returns:

A list of strings for the filepattern with correct date stamps.

Return type:

list

Example

>>> filepattern =  "LGM_24hourly_PMIP4_echam6_BOT_mm_>>>DATE<<<.nc"
>>> LGM_files = get_files_for_date_range(filepattern, "1890-07", "1891-11", "1M", date_format="%Y%m")
>>> LGM_files == [
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189007.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189008.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189009.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189010.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189011.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189012.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189101.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189102.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189103.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189104.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189105.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189106.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189107.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189108.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189109.nc",
... "LGM_24hourly_PMIP4_echam6_BOT_mm_189110.nc",
... ]
True
esm_archiving.esm_archiving.get_list_from_filepattern(filepattern)[source]
esm_archiving.esm_archiving.group_files(top, filetype)[source]

Generates quasi-regexes for a specific filetype, replacing all numbers with #.

Parameters:
  • top (str) – Where to start looking (this should normally be top of the experiment)

  • filetype (str) – Which files to go through (e.g. outdata, restart, etc…)

Returns:

A dictionary containing keys for each folder found in filetype, and values as lists of files with strings where numbers are replaced by #.

Return type:

dict

esm_archiving.esm_archiving.group_indexes(index_list)[source]

Splits indexes into tuples of monotonically ascending values.

Parameters:

index_list (list) – The list of indexes to split up

Returns:

A list of tuples, each holding one group of consecutive ascending indexes.

Return type:

list

Example

>>> indexes = [0, 1, 2, 3, 12, 13, 15, 16]
>>> group_indexes(indexes)
[(0, 1, 2, 3), (12, 13), (15, 16)]
esm_archiving.esm_archiving.log_tarfile_contents(tfile)[source]

Generates a log of the tarball contents

Parameters:

tfile (str) – The path for the tar file to generate a log for

Returns:

Return type:

None

Warning

Note that for this function to work, you need to have write permission in the directory where the tarball is located. If not, this will probably raise an OSError. I can imagine giving the location of the log path as an argument; but would like to see if that is actually needed before implementing it…

esm_archiving.esm_archiving.pack_tarfile(flist, wdir, outname)[source]

Creates a compressed tarball (outname) with all files found in flist.

Parameters:
  • flist (list) – A list of files to include in this tarball

  • wdir (str) – The directory to “change” to when packing up the tar file. This will (essentially) be used in the tar command as the -C option by stripping off the beginning of the flist

  • outname (str) – The output file name

Returns:

The output file name

Return type:

str

esm_archiving.esm_archiving.purify_expid_in(model_files, expid, restore=False)[source]

Puts or restores >>>EXPID<<< marker in filepatterns

Parameters:
  • model_files (dict) – The model files for archiving

  • expid (str) – The experiment ID to purify or restore

  • restore (bool) – Set experiment ID back from the temporary marker

Returns:

Dictionary containing keys for each model, values for file patterns

Return type:

dict

esm_archiving.esm_archiving.query_yes_no(question, default='yes')[source]

Ask a yes/no question via input() and return the user's answer.

“question” is a string that is presented to the user. “default” is the presumed answer if the user just hits <Enter>.

“default” must be “yes” (the default), “no” or None (meaning an answer is required of the user).

The “answer” return value is True for “yes” or False for “no”.

Note: Shamelessly stolen from StackOverflow. It’s not hard to implement, but Paul is lazy…

Parameters:
  • question (str) – The question you’d like to ask the user

  • default (str) – The presumed answer for question. Defaults to “yes”.

Returns:

True if the user said yes, False if the user said no.

Return type:

bool
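The prompt logic described above follows a common pattern, sketched below. This is not the package's code: the name ask_yes_no is hypothetical, and the read argument is an addition made here so the function can be tried without a terminal.

```python
def ask_yes_no(question, default="yes", read=input):
    """Ask until the reply is yes/no; an empty reply selects the default."""
    valid = {"yes": True, "y": True, "no": False, "n": False}
    if default not in ("yes", "no", None):
        raise ValueError(f"invalid default answer: {default!r}")
    prompt = {"yes": " [Y/n] ", "no": " [y/N] ", None: " [y/n] "}[default]
    while True:
        choice = read(question + prompt).strip().lower()
        if choice == "" and default is not None:
            return valid[default]
        if choice in valid:
            return valid[choice]
        print("Please respond with 'yes' or 'no' (or 'y' or 'n').")


# Scripted replies stand in for a user: an invalid reply, then "n".
replies = iter(["maybe", "n"])
answer = ask_yes_no("Delete files?", default=None, read=lambda prompt: next(replies))
print(answer)
# prints the reminder once, then False
```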

esm_archiving.esm_archiving.run_command(command)[source]

Runs command and directly prints output to screen.

Parameters:

command (str) – The command to run, with pipes, redirects, whatever

Returns:

rc – The return code of the subprocess.

Return type:

int
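Running a shell command while echoing its output and returning the exit status can be done with the standard subprocess module. The sketch below, with the hypothetical name run_and_stream, assumes a POSIX shell and is not necessarily how run_command is implemented.

```python
import subprocess


def run_and_stream(command):
    """Run command through the shell, print its output, return its exit code."""
    process = subprocess.Popen(
        command,
        shell=True,  # allows pipes and redirects in the command string
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge error output into the same stream
        text=True,
    )
    for line in process.stdout:
        print(line, end="")
    return process.wait()


rc = run_and_stream("echo hello | tr a-z A-Z")
print("return code:", rc)
```

Because the command string is handed to the shell unchanged, it should never be built from untrusted input.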

esm_archiving.esm_archiving.sort_files_to_tarlists(model_files, start_date, end_date, config)[source]
esm_archiving.esm_archiving.split_list_due_to_size_limit(in_list, slimit)[source]
esm_archiving.esm_archiving.stamp_filepattern(filepattern, force_return=False)[source]

Transforms # in filepatterns to >>>DATE<<< and replaces other numbers back to original

Parameters:
  • filepattern (str) – Filepattern to get date stamps for

  • force_return (bool) – Returns the list of filepatterns even if it is longer than 1.

Returns:

New filepattern, with >>>DATE<<<

Return type:

str

esm_archiving.esm_archiving.stamp_files(model_files)[source]

Given a standard file dictionary (keys: model names, values: filepattern), figures out where the date probably is, and replaces the # sequence with a >>>DATE<<< stamp.

Parameters:

model_files (dict) – Dictionary of keys (model names) where values are lists of files for each model.

Returns:

As the input, but replaces the filepatterns with the >>>DATE<<< stamp.

Return type:

dict

esm_archiving.esm_archiving.sum_tar_lists(tar_lists)[source]

Sums up the amount of space in the tar lists dictionary

Given tar_lists, which is generally a dictionary consisting of keys (model names) and values (files to be tarred), figures out how much space the raw, uncompressed files would use. Generally the compressed tarball will take up less space.

Parameters:

tar_lists (dict) – Dictionary of file lists to be summed up. Reports every sum as a value for the key of that particular list.

Returns:

Keys are the same as in the input, values are the sums (in bytes) of all files present within the list.

Return type:

dict

esm_archiving.esm_archiving.sum_tar_lists_human_readable(tar_lists)[source]

As sum_tar_lists but gives back strings with human-readable sizes.