
Checkpoint SLURM jobs

Introduction

As you have probably noticed, the execution time for jobs on the DCSR clusters is limited to 3 days. For jobs that take more than 3 days and cannot be optimized or divided into smaller jobs, the DCSR clusters provide a checkpoint mechanism. This mechanism saves the state of the application to disk, resubmits the same job, and restores the application from the point at which it was stopped. The checkpoint mechanism is based on CRIU, which uses low-level operating system mechanisms, so in theory it should work for most applications.

How to use it

In order to use it, you need to do two things:

  • Modify your job script
  • Define some environment variables before launching your job

Job modifications

Make the following changes to your job scripts:

  1. Source the script /dcsrsoft/spack/external/ckptslurmjob/scripts/ckpt_methods.sh
  2. Add the setup_ckpt call
  3. Use launch_app to call your application
  4. (optional) Add --error and --output to the SLURM parameters. This creates two separate files for standard output and standard error. If you need to process the output of your application later, you are encouraged to add these parameters; otherwise you will see some errors or warnings from the checkpoint mechanism mixed into your output. If your application writes its results to its own output files, you do not need these options.

The script below summarizes those changes:

#!/bin/sh
#SBATCH --job-name job1
#SBATCH -N 1
#SBATCH --cpus-per-task 4
#SBATCH --partition cpu
#SBATCH --time 02:00:00
#SBATCH --mem=16G
#SBATCH --error job1-%j.error
#SBATCH --output job1-%j.out

source /dcsrsoft/spack/external/ckptslurmjob/scripts/ckpt_methods.sh

setup_ckpt

launch_app $APP


The --time parameter does not limit the duration of the job; instead it is used to trigger the checkpoint. For example, with --time 02:00:00, the job will be checkpointed after 2 hours and rescheduled some minutes later. You can use any value from 1 hour to 3 days; a good value is something in the middle, such as 10 or 12 hours. The checkpoint uses low-level operating system mechanisms, so it should work for most applications; however, there could be errors with some exotic applications. That is why it is a good idea to choose a time limit no longer than 12 hours, as it lets you find out early whether your application is compatible with checkpointing.
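
For example, for a simulation expected to run for about a week, a 12-hour checkpoint interval could be requested with:

#SBATCH --time 12:00:00

so that the job is checkpointed and resubmitted roughly twice a day until it completes.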

Launching the job

Before launching your job, please follow this recipe:

export SBATCH_OPEN_MODE="append"
export SBATCH_SIGNAL=B:USR1@60
sbatch job.sh


SBATCH_SIGNAL=B:USR1@60 means that the checkpoint mechanism has 60 seconds to create the checkpoint. For some memory-hungry applications the checkpoint can take longer; refer to the Performance section below for some indications.

In addition to the output and error files produced by SLURM, the execution of the job will generate:

  1. checkpoint-JOB_ID.log: checkpoint logs
  2. checkpoint-JOB_ID-files: application checkpoint files. Please do not delete this directory until your job has finished, otherwise the job will fail.
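
For example, a job with the (hypothetical) ID 123456 submitted with the script above would leave something like the following in the working directory:

checkpoint-123456.log
checkpoint-123456-files/
job1-123456.error
job1-123456.out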

Job examples:

#!/bin/sh
#SBATCH --job-name job1
#SBATCH -N 1
#SBATCH --cpus-per-task 1
#SBATCH --partition cpu
#SBATCH --time 02:00:00
#SBATCH --mem=16G

source /dcsrsoft/spack/external/ckptslurmjob/scripts/ckpt_methods.sh

setup_ckpt

launch_app ../pi_css5 400000000

Multithreaded application:

#!/bin/sh
#SBATCH --job-name job1
#SBATCH -N 1
#SBATCH --cpus-per-task 4
#SBATCH --partition cpu
#SBATCH --time 02:00:00
#SBATCH --mem=16G

export OMP_NUM_THREADS=4
module load gcc

source /dcsrsoft/spack/external/ckptslurmjob/scripts/ckpt_methods.sh
setup_ckpt

launch_app /home/user1/lu.C.x

TensorFlow:

#!/bin/sh
#SBATCH --job-name job1
#SBATCH -N 1
#SBATCH --cpus-per-task 4
#SBATCH --partition cpu
#SBATCH --time 02:00:00
#SBATCH --mem=16G

export OMP_NUM_THREADS=4
source ../tensorflow_env/bin/activate

source /dcsrsoft/spack/external/ckptslurmjob/scripts/ckpt_methods.sh
setup_ckpt

launch_app python run_tensorflow.py

Samtools:

#!/bin/sh
#SBATCH --job-name job1
#SBATCH -N 1
#SBATCH --cpus-per-task 1
#SBATCH --partition cpu
#SBATCH --time 02:00:00
#SBATCH --mem=16G

module load gcc samtools

source /dcsrsoft/spack/external/ckptslurmjob/scripts/ckpt_methods.sh
setup_ckpt

launch_app samtools sort /users/user1/samtools/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam -o sorted_file.bam

Again, before launching a job you need to follow the same recipe:

export SBATCH_OPEN_MODE="append"
export SBATCH_SIGNAL=B:USR1@60
sbatch job.sh

Performance

Checkpointing can take some time, and the duration is directly proportional to the amount of memory used by the application. Here are some indicative numbers:

Memory size (GB)    Checkpoint time (secs)
4.6                 3.2
7.1                 5.6
9                   7.74
18                  15

In theory, 60 seconds should be enough to checkpoint an application that uses up to ~50 GB of RAM. Feel free to change SBATCH_SIGNAL=B:USR1@60 accordingly. A good rule of thumb is to count 1.5 seconds per GB of memory allocated.
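
As a minimal sketch, assuming a job that allocates around 100 GB of memory, this rule of thumb gives 100 x 1.5 = 150 seconds, so you would export a larger grace period before submitting:

export SBATCH_OPEN_MODE="append"
export SBATCH_SIGNAL=B:USR1@150
sbatch job.sh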

Java applications

In order to checkpoint a Java application, you have to pass two extra parameters when launching the application:

-XX:-UsePerfData

This deactivates the creation of the directory /tmp/hsperfdata_$USER, which would otherwise make the checkpoint restoration fail.

-XX:+UseSerialGC

This enables the serial garbage collector instead of the parallel one. The parallel garbage collector creates one GC thread per computation thread, which makes restoring the checkpoint more difficult due to the large number of threads.
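
Putting the two flags together, a job script for a Java application could look like the sketch below (my-app.jar is a hypothetical placeholder; load or activate a Java runtime as appropriate for your environment):

#!/bin/sh
#SBATCH --job-name job1
#SBATCH -N 1
#SBATCH --cpus-per-task 1
#SBATCH --partition cpu
#SBATCH --time 02:00:00
#SBATCH --mem=16G

# load or activate a Java runtime here if needed

source /dcsrsoft/spack/external/ckptslurmjob/scripts/ckpt_methods.sh
setup_ckpt

launch_app java -XX:-UsePerfData -XX:+UseSerialGC -jar my-app.jar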

Snakemake

In order to use the checkpoint mechanism with snakemake, you need to adapt the SLURM profile used to submit jobs to the cluster. Normally the SLURM profile defines the following options:

  • cluster: slurm-submit.py (This script is used to submit jobs to SLURM)
  • cluster-status: "slurm-status.py" (This script is used to parse job status from SLURM)
  • jobscript: "slurm-jobscript.sh" (Template used for submitting snakemake commands as job scripts)

We need to modify how jobs are submitted to SLURM: the idea is to wrap the snakemake jobscript inside another job script. This enables us to checkpoint all processes launched by snakemake.

The procedure consists of the following steps (they are based on the SLURM plugin provided here: https://github.com/Snakemake-Profiles/slurm).

Create checkpoint script

Please create the following script and call it job-checkpoint.sh:

#!/bin/bash

source /dcsrsoft/spack/external/ckptslurmjob/scripts/ckpt_methods.sh

setup_ckpt

launch_app $1

Make it executable: chmod +x job-checkpoint.sh. This script should be placed in the same directory as the other SLURM profile scripts.

Modify slurm-scripts

We need to modify the sbatch command used. Normally the snakemake jobscript is passed as the only parameter; instead, we pass our job-checkpoint.sh script first and the snakemake jobscript as its argument, as shown below:

def submit_job(jobscript, **sbatch_options):
    """Submit jobscript and return jobid."""
    options = format_sbatch_options(**sbatch_options)
    try:
        # path of our checkpoint script
        jobscript_ckpt = os.path.join(os.path.dirname(__file__),'job-checkpoint.sh')
        # tell sbatch to execute the checkpoint script first and pass
        # the snakemake jobscript as its argument
        cmd = ["sbatch"] + ["--parsable"] + options + [jobscript_ckpt] + [jobscript]
        res = sp.check_output(cmd)
    except sp.CalledProcessError as e:
        raise e
    # Get jobid
    res = res.decode()
    try:
        jobid = re.search(r"(\d+)", res).group(1)
    except Exception as e:
        raise e
    return jobid
  

Ideally, you should also pass extra options to sbatch in order to control the output and error files:

sbatch_options = { "output" : "{rule}_%j.out", "error" : "{rule}_%j.error"}

This is necessary to isolate the errors and warnings raised by the checkpoint mechanism in a separate error file (as explained at the beginning of this page). This only works with the official SLURM profile, as it expands the snakemake wildcards defined in the Snakefile (e.g. rule).

Export necessary variables

You still need to export some variables before launching snakemake:

export SBATCH_OPEN_MODE="append"
export SBATCH_SIGNAL=B:USR1@1800
snakemake --profile slurm-chk/ --verbose 

With this configuration, the checkpoint will start 30 min before the end of the job.

Limitations

  • It does not work for MPI or GPU applications.
  • The application launched should consist of a single command with its arguments. If you need a more complex workflow, wrap the commands inside a script, as in the sketch after this list.
  • If the job fails, the checkpoint cannot be restored in another job, therefore the progress will be lost.
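
As a minimal sketch for the second limitation, assuming a workflow made of two consecutive commands (the tool and file names are hypothetical), you would put them in a wrapper script, for instance my_workflow.sh:

#!/bin/bash
# my_workflow.sh: hypothetical wrapper around two consecutive commands
step_one input.dat > intermediate.dat
step_two intermediate.dat -o result.dat

make it executable with chmod +x my_workflow.sh, and then call it from your job script with:

launch_app ./my_workflow.sh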