Use the SLEAP module on the SWC HPC cluster#

Warning

Some links within this document point to the SWC internal wiki, which is only accessible from within the SWC network. We recommend opening these links in a new tab.

Interpreting code blocks

Shell commands will be shown in code blocks like this (with the $ sign indicating the shell prompt):

$ echo "Hello world!"

Similarly, Python code blocks will appear with the >>> sign indicating the Python interpreter prompt:

>>> print("Hello world!")

The expected outputs of both Shell and Python commands will be shown without any prompt:

Hello world!

Abbreviations#

Acronym   Meaning
-------   --------------------------------------------
SLEAP     Social LEAP Estimates Animal Poses
SWC       Sainsbury Wellcome Centre
HPC       High Performance Computing
GUI       Graphical User Interface
SLURM     Simple Linux Utility for Resource Management

Prerequisites#

Access to the HPC cluster#

Verify that you can access the HPC gateway node (typing your <SWC-PASSWORD> both times when prompted):

$ ssh <SWC-USERNAME>@ssh.swc.ucl.ac.uk
$ ssh hpc-gw1

To learn more about accessing the HPC via SSH, see the relevant how-to guide.

Access to the SLEAP module#

Once you are on the HPC gateway node, SLEAP should be listed among the available modules when you run module avail:

$ module avail
...
SLEAP/2023-03-13
SLEAP/2023-08-01
...
  • SLEAP/2023-03-13 corresponds to SLEAP v.1.2.9

  • SLEAP/2023-08-01 corresponds to SLEAP v.1.3.1

We recommend always using the latest version, which is the one loaded by default when you run module load SLEAP. If you want to load a specific version, you can do so by typing the full module name, including the date e.g. module load SLEAP/2023-03-13.

If a module has been successfully loaded, it will be listed when you run module list, along with other modules it may depend on:

$ module list
Currently Loaded Modulefiles:
 1) cuda/11.8   2) SLEAP/2023-08-01

If you have trouble loading the SLEAP module, see this guide’s Troubleshooting section.

Install SLEAP on your local PC/laptop#

While you can delegate the GPU-intensive work to the HPC cluster, you will need to use the SLEAP GUI for some steps, such as labelling frames. Thus, you also need to install SLEAP on your local PC/laptop.

We recommend following the official SLEAP installation guide. If you already have conda installed, you may skip the mamba installation steps and opt for installing the libmamba-solver for conda:

$ conda install -n base conda-libmamba-solver
$ conda config --set solver libmamba

This will get you the much faster dependency resolution that mamba provides, without having to install mamba itself. From conda version 23.10 onwards (released in November 2023), libmamba is the default solver anyway, so this step is unnecessary.

After that, you can follow the rest of the SLEAP installation guide, substituting conda for mamba in the relevant commands. The command depends on your platform. On Windows and Linux (includes the nvidia channel):

$ conda create -y -n sleap -c conda-forge -c nvidia -c sleap -c anaconda sleap=1.3.1

On macOS:

$ conda create -y -n sleap -c conda-forge -c anaconda -c sleap sleap=1.3.1

You may exchange sleap=1.3.1 for other versions. To be on the safe side, ensure that your local installation version matches (or is at least close to) the one installed in the cluster module.

Mount the SWC filesystem on your local PC/laptop#

The rest of this guide assumes that you have mounted the SWC filesystem on your local PC/laptop. If you have not done so, please follow the relevant instructions on the SWC internal wiki.

We will also assume that the data you are working with are stored in a ceph directory to which you have access. In the rest of this guide, we will use the path /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data, which contains a SLEAP project for test purposes. You should replace this with the path to your own data.

Data storage location matters

The cluster has fast access to data stored on the ceph filesystem, so if your data is stored elsewhere, make sure to transfer it to ceph before running the job. You can use tools such as rsync to copy data from your local machine to ceph via an ssh connection. For example:

$ rsync -avz <LOCAL-DIR> <SWC-USERNAME>@ssh.swc.ucl.ac.uk:/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data

Model training#

This will consist of two parts - preparing a training job (on your local SLEAP installation) and running a training job (on the HPC cluster’s SLEAP module). Some evaluation metrics for the trained models can be viewed via the SLEAP GUI on your local SLEAP installation.

Prepare the training job#

Follow the SLEAP instructions for Creating a Project and Initial Labelling. Ensure that the project file (e.g. labels.v001.slp) is saved in the mounted SWC filesystem (as opposed to your local filesystem).

Next, follow the instructions in Remote Training, i.e. Predict -> Run Training… -> Export Training Job Package….

  • For selecting the right configuration parameters, see Configuring Models and Troubleshooting Workflows

  • Set the Predict On parameter to nothing. Remote training and inference (prediction) are easiest to run separately on the HPC Cluster. Also unselect Visualize Predictions During Training in training settings, if it’s enabled by default.

  • If you are working with a camera view from above or below (as opposed to a side view), set the Rotation Min Angle and Rotation Max Angle to -180 and 180 respectively in the Augmentation section.

  • Make sure to save the exported training job package (e.g. labels.v001.slp.training_job.zip) in the mounted SWC filesystem, for example, in the same directory as the project file.

  • Unzip the training job package. This will create a folder with the same name (minus the .zip extension). This folder contains everything needed to run the training job on the HPC cluster.
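
If you would rather script the unzipping step, the following is a minimal Python sketch using only the standard library. The function name unzip_job_package is hypothetical, and the sketch assumes that the archive holds its files at the top level, so that extracting into a folder named after the package (minus the .zip extension) reproduces the layout described above.

```python
import zipfile
from pathlib import Path


def unzip_job_package(zip_path):
    """Extract a training job package next to itself and return the new folder.

    The target folder has the same name as the package, minus '.zip'.
    """
    zip_path = Path(zip_path)
    if zip_path.suffix != ".zip":
        raise ValueError(f"Expected a .zip file, got: {zip_path.name}")
    target_dir = zip_path.with_suffix("")  # strips only the final '.zip'
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return target_dir
```

For example, calling unzip_job_package on the path to labels.v001.slp.training_job.zip would create a labels.v001.slp.training_job folder in the same directory.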

Run the training job#

Log in to the HPC cluster as described above.

$ ssh <SWC-USERNAME>@ssh.swc.ucl.ac.uk
$ ssh hpc-gw1

Navigate to the training job folder (replace with your own path) and list its contents:

$ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
$ cd labels.v001.slp.training_job
$ ls -1
centered_instance.json
centroid.json
inference-script.sh
jobs.yaml
labels.v001.pkg.slp
labels.v001.slp.predictions.slp
train_slurm.sh
swc-hpc-pose-estimation
train-script.sh

There should be a train-script.sh file created by SLEAP, which already contains the commands to run the training. You can see the contents of the file by running cat train-script.sh:

labels.v001.slp.training_job/train-script.sh#
#!/bin/bash
sleap-train centroid.json labels.v001.pkg.slp
sleap-train centered_instance.json labels.v001.pkg.slp

The precise commands will depend on the model configuration you chose in SLEAP. Here we see two separate training calls, one for the ‘centroid’ and another for the ‘centered_instance’ model. That’s because in this example we have chosen the ‘Top-Down’ configuration, which consists of two neural networks - the first for isolating the animal instances (by finding their centroids) and the second for predicting all the body parts per instance.

Top-Down model configuration

More on ‘Top-Down’ vs ‘Bottom-Up’ models

Although the ‘Top-Down’ configuration was designed with multiple animals in mind, it can also be used for single-animal videos. It makes sense to use it for videos where the animal occupies a relatively small portion of the frame - see Troubleshooting Workflows for more info.

Next you need to create a SLURM batch script, which will schedule the training job on the HPC cluster. Create a new file called train_slurm.sh (you can do this in the terminal with nano/vim or in a text editor of your choice on your local PC/laptop). Here we create the script in the same folder as the training job, but you can save it anywhere you want, or even keep track of it with git.

$ nano train_slurm.sh

An example is provided below, followed by explanations.

train_slurm.sh#
#!/bin/bash

#SBATCH -J slp_train # job name
#SBATCH -p gpu # partition (queue)
#SBATCH -N 1   # number of nodes
#SBATCH --mem 32G # memory pool for all cores
#SBATCH -n 8 # number of cores
#SBATCH -t 0-06:00 # time (D-HH:MM)
#SBATCH --gres gpu:1 # request 1 GPU (of any kind)
#SBATCH -o slurm.%x.%N.%j.out # STDOUT
#SBATCH -e slurm.%x.%N.%j.err # STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define directories for SLEAP project and exported training job
SLP_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=$SLP_DIR/$SLP_JOB_NAME

# Go to the job directory
cd $SLP_JOB_DIR

# Run the training script generated by SLEAP
./train-script.sh

In nano, you can save the file by pressing Ctrl+O and exit by pressing Ctrl+X.

Explanation of the batch script
  • The #SBATCH lines are SLURM directives. They specify the resources needed for the job, such as the number of nodes, CPUs, memory, etc. A primer on the most useful SLURM arguments is provided in this how-to guide. For more information see the SLURM documentation.

  • The # lines are comments. They are not executed by SLURM, but they are useful for explaining the script to your future self and others.

  • The module load SLEAP line loads the latest SLEAP module and any other modules it may depend on.

  • The cd line changes the working directory to the training job folder. This is necessary because the train-script.sh file contains relative paths to the model configuration and the project file.

  • The ./train-script.sh line runs the training job (executes the contained commands).
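
The -o and -e directives in the batch script use SLURM filename patterns: %x is replaced by the job name, %N by the short hostname of the node and %j by the job ID. The sketch below mimics that substitution in Python to show which log file names to expect. The helper expand_slurm_pattern is hypothetical and handles only these three placeholders, not SLURM's full pattern syntax.

```python
def expand_slurm_pattern(pattern, job_name, node, job_id):
    """Expand the %x, %N and %j placeholders of a SLURM filename pattern."""
    replacements = {"%x": job_name, "%N": node, "%j": str(job_id)}
    for placeholder, value in replacements.items():
        pattern = pattern.replace(placeholder, value)
    return pattern


# Values taken from the example job in this guide
print(expand_slurm_pattern("slurm.%x.%N.%j.out", "slp_train", "gpu-sr670-20", 3445652))
# slurm.slp_train.gpu-sr670-20.3445652.out
```

Including the job name in the pattern makes it easier to tell training logs from inference logs when both are written to the same folder.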

Warning

Before submitting the job, ensure that you have permissions to execute both the batch script and the training script generated by SLEAP. You can make these files executable by running in the terminal:

$ chmod +x train-script.sh
$ chmod +x train_slurm.sh

If the scripts are not in your working directory, you will need to specify their full paths:

$ chmod +x /path/to/train-script.sh
$ chmod +x /path/to/train_slurm.sh

Now you can submit the batch script by running the following command (in the same directory as the script):

$ sbatch train_slurm.sh
Submitted batch job 3445652

You may monitor the progress of the job in various ways:

View the status of the queued/running jobs with squeue:

$ squeue --me
JOBID    PARTITION  NAME     USER      ST  TIME   NODES  NODELIST(REASON)
3445652  gpu        slp_train sirmpila  R   23:11  1      gpu-sr670-20

View status of running/completed jobs with sacct:

$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3445652      slp_train        gpu     swc-ac          2  COMPLETED      0:0
3445652.bat+      batch                swc-ac          2  COMPLETED      0:0

Run sacct with some more helpful arguments. For example, you can view jobs from the last 24 hours, displaying the time elapsed and the peak memory usage in KB (MaxRSS):

$ sacct \
  --starttime $(date -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --endtime $(date +%Y-%m-%dT%H:%M:%S) \
  --format=JobID,JobName,Partition,State,Start,Elapsed,MaxRSS

JobID           JobName  Partition      State               Start    Elapsed     MaxRSS
------------ ---------- ---------- ---------- ------------------- ---------- ----------
4043595       slp_infer        gpu     FAILED 2023-10-10T18:14:31   00:00:35
4043595.bat+      batch                FAILED 2023-10-10T18:14:31   00:00:35    271104K
4043603       slp_infer        gpu     FAILED 2023-10-10T18:27:32   00:01:37
4043603.bat+      batch                FAILED 2023-10-10T18:27:32   00:01:37    423476K
4043611       slp_infer        gpu    PENDING             Unknown   00:00:00
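
MaxRSS values such as 423476K are easier to compare with the --mem request once converted to gigabytes. The following sketch does that conversion. The helper maxrss_to_gib is hypothetical, and it assumes that sacct reports sizes with a K, M, G or T suffix based on powers of 1024, or as a plain number of bytes when no suffix is present.

```python
# Assumed unit suffixes (powers of 1024)
UNIT_FACTORS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def maxrss_to_gib(maxrss):
    """Convert a MaxRSS string (e.g. '423476K') to gibibytes."""
    text = maxrss.strip().upper()
    if not text:
        raise ValueError("empty MaxRSS value (no memory usage reported)")
    if text[-1] in UNIT_FACTORS:
        n_bytes = float(text[:-1]) * UNIT_FACTORS[text[-1]]
    else:
        n_bytes = float(text)  # assumed to be a plain number of bytes
    return n_bytes / 1024**3


for value in ("271104K", "423476K"):
    print(f"{value} = {maxrss_to_gib(value):.2f} GiB")
# 271104K = 0.26 GiB
# 423476K = 0.40 GiB
```

Both failed jobs in the example output peaked well below 1 GiB, which suggests that they did not fail for lack of CPU memory.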

View the contents of standard output and error (the job name, node name and job ID will differ in each case):

$ cat slurm.slp_train.gpu-sr670-20.3445652.out
$ cat slurm.slp_train.gpu-sr670-20.3445652.err

Out-of-memory (OOM) errors

If you encounter out-of-memory errors, keep in mind that there are two main sources of memory usage:

  • CPU memory (RAM), specified via the --mem argument in the SLURM batch script. This is the memory used by the Python process running the training job and is shared among all the CPU cores.

  • GPU memory, i.e. the memory of the GPU card(s). Its size depends on the type of GPU card you are allocated via the --gres gpu:1 argument in the SLURM batch script. To increase it, you can request a specific GPU card type with more GPU memory (e.g. --gres gpu:a4500:1). The SWC wiki provides a list of all GPU card types and their specifications.

If requesting more memory doesn’t help, you can try reducing the size of your SLEAP models. You may tweak the model backbone architecture, or adjust Input scaling, Max stride and Batch size. See SLEAP’s documentation and discussion forum for more details.

Evaluate the trained models#

Upon successful completion of the training job, a models folder will have been created in the training job directory. It contains one subfolder per training run (by default prefixed with the date and time of the run).

$ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
$ cd labels.v001.slp.training_job
$ cd models
$ ls -1
230509_141357.centered_instance
230509_141357.centroid

Each subfolder holds the trained model files (e.g. best_model.h5), their configurations (training_config.json) and some evaluation metrics.

$ cd 230509_141357.centered_instance
$ ls -1
best_model.h5
initial_config.json
labels_gt.train.slp
labels_gt.val.slp
labels_pr.train.slp
labels_pr.val.slp
metrics.train.npz
metrics.val.npz
training_config.json
training_log.csv

The SLEAP GUI on your local machine can be used to quickly evaluate the trained models.

  • Select Predict -> Evaluation Metrics for Trained Models…

  • Click on Add Trained Model(s) and select the folder containing the model(s) you want to evaluate.

  • The basic metrics are shown in the table. For a more detailed report (including plots), click View Metrics.

For more detailed evaluation metrics, you can refer to SLEAP’s model evaluation notebook.
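
For a quick look at how training went without opening the GUI, you can also inspect training_log.csv directly. The sketch below finds the epoch with the lowest validation loss. The helper best_epoch is hypothetical, and the column names epoch and val_loss are assumptions about the log format; check the header row of your own file and adjust them if needed.

```python
import csv


def best_epoch(log_path, metric="val_loss"):
    """Return (epoch, value) for the row with the lowest value of `metric`.

    Assumes the CSV has an 'epoch' column and a column named `metric`.
    """
    with open(log_path, newline="") as f:
        rows = [row for row in csv.DictReader(f) if row.get(metric)]
    if not rows:
        raise ValueError(f"No '{metric}' values found in {log_path}")
    best = min(rows, key=lambda row: float(row[metric]))
    return int(float(best["epoch"])), float(best[metric])
```

For example, best_epoch("models/230509_141357.centroid/training_log.csv") would return the epoch number and the corresponding validation loss for the centroid model.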

Model inference#

By inference, we mean using a trained model to predict the labels on new frames/videos. SLEAP provides the sleap-track command line utility for running inference on a single video or a folder of videos.

Below is an example SLURM batch script that contains a sleap-track call.

infer_slurm.sh#
#!/bin/bash

#SBATCH -J slp_infer # job name
#SBATCH -p gpu # partition
#SBATCH -N 1   # number of nodes
#SBATCH --mem 64G # memory pool for all cores
#SBATCH -n 16 # number of cores
#SBATCH -t 0-02:00 # time (D-HH:MM)
#SBATCH --gres gpu:rtx5000:1 # request 1 GPU (of a specific kind)
#SBATCH -o slurm.%x.%N.%j.out # write STDOUT
#SBATCH -e slurm.%x.%N.%j.err # write STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define directories for SLEAP project and exported training job
SLP_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
VIDEO_DIR=$SLP_DIR/videos
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=$SLP_DIR/$SLP_JOB_NAME

# Go to the job directory
cd $SLP_JOB_DIR
# Make a directory to store the predictions
mkdir -p predictions

# Run the inference command
sleap-track $VIDEO_DIR/M708149_EPM_20200317_165049331-converted.mp4 \
    -m $SLP_JOB_DIR/models/231010_164307.centroid/training_config.json \
    -m $SLP_JOB_DIR/models/231010_164307.centered_instance/training_config.json \
    --gpu auto \
    --tracking.tracker simple \
    --tracking.similarity centroid \
    --tracking.post_connect_single_breaks 1 \
    -o predictions/labels.v001.slp.predictions.slp \
    --verbosity json \
    --no-empty-frames

The script is very similar to the training script, with the following differences:

  • The time limit -t is set lower, since inference is normally faster than training. This will however depend on the size of the video and the number of models used.

  • The requested number of cores -n and memory --mem are higher. This will depend on the requirements of the specific job you are running. It’s best practice to try with a scaled-down version of your data first, to get an idea of the resources needed.

  • The requested GPU is of a specific kind (RTX 5000). This will again depend on the requirements of your job, as the different GPU kinds vary in GPU memory size and compute capabilities (see the SWC wiki).

  • The ./train-script.sh line is replaced by the sleap-track command.

  • The \ character is used to split the long sleap-track command into multiple lines for readability. It is not necessary if the command is written on a single line.

Explanation of the sleap-track arguments

Some important command line arguments are explained below. You can view a full list of the available arguments by running sleap-track --help.

  • The first argument is the path to the video file to be processed.

  • The -m option is used to specify the path to the model configuration file(s) to be used for inference. In this example we use the two models that were trained above.

  • The --gpu option is used to specify the GPU to be used for inference. The auto value will automatically select the GPU with the highest percentage of available memory (among the GPUs that are available on the machine/node).

  • The options starting with --tracking specify parameters used for tracking the detected instances (animals) across frames. See SLEAP’s guide on tracking methods for more info.

  • The -o option is used to specify the path to the output file containing the predictions.

  • The above script will predict all the frames in the video. You may select specific frames via the --frames option. For example: --frames 1-50 or --frames 1,3,5,7,9.
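
To run inference on several videos with the same models, one option is to generate one sleap-track call per video. The sketch below only builds and prints the commands, using the -m and -o options explained above; it does not run them. The function name build_track_command, the example video names and the output naming scheme (<video-stem>.predictions.slp) are assumptions made for illustration.

```python
import shlex
from pathlib import PurePosixPath  # cluster paths are POSIX paths


def build_track_command(video, model_configs, out_dir="predictions"):
    """Return the sleap-track argument list for one video."""
    video = PurePosixPath(video)
    command = ["sleap-track", str(video)]
    for config in model_configs:
        command += ["-m", str(config)]  # one -m option per trained model
    output = PurePosixPath(out_dir) / f"{video.stem}.predictions.slp"
    command += ["-o", str(output)]
    return command


models = [
    "models/231010_164307.centroid/training_config.json",
    "models/231010_164307.centered_instance/training_config.json",
]
# Hypothetical video names, replace with your own
for video in ["videos/mouse1.mp4", "videos/mouse2.mp4"]:
    command = build_track_command(video, models)
    print(" ".join(shlex.quote(arg) for arg in command))
```

The printed lines can be pasted into a batch script in place of the single sleap-track call. Further options, such as the --tracking ones, would be appended to the list in the same way.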

You can submit and monitor the inference job in the same way as the training job.

$ sbatch infer_slurm.sh
$ squeue --me

Upon completion, a labels.v001.slp.predictions.slp file will have been created in the predictions subfolder of the job directory (the path given to the -o option).

You can use the SLEAP GUI on your local machine to load and view the predictions: File -> Open Project… -> select the labels.v001.slp.predictions.slp file.

The training-inference cycle#

Now that you have some predictions, you can keep improving your models by repeating the training-inference cycle. The basic steps are:

  • Manually correct some of the predictions: see Prediction-assisted labeling

  • Merge corrected labels into the initial training set: see Merging guide

  • Save the merged training set as labels.v002.slp

  • Export a new training job labels.v002.slp.training_job (you may reuse the training configurations from v001)

  • Repeat the training-inference cycle until satisfied
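
If you script parts of this cycle, it helps to derive the next file name from the current one rather than typing it by hand. The following sketch increments the version tag while keeping its zero padding. The helper next_version is hypothetical and assumes file names follow the labels.vNNN.slp pattern used in this guide.

```python
import re


def next_version(filename):
    """Increment the '.vNNN.' version tag in a SLEAP project file name."""
    match = re.search(r"\.v(\d+)\.", filename)
    if match is None:
        raise ValueError(f"No version tag like '.v001.' in {filename!r}")
    digits = match.group(1)
    bumped = str(int(digits) + 1).zfill(len(digits))  # keep zero padding
    return filename[: match.start(1)] + bumped + filename[match.end(1):]


print(next_version("labels.v001.slp"))  # labels.v002.slp
```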

Troubleshooting#

Problems with the SLEAP module#

In this section, we will describe how to test that the SLEAP module is loaded correctly for you and that it can use the available GPUs.

Log in to the HPC cluster as described above.

Start an interactive job on a GPU node. This step is necessary, because we need to test the module’s access to the GPU.

$ srun -p fast --gres=gpu:1 --pty bash -i

Explain the above command
  • -p fast requests a node from the ‘fast’ partition. This refers to the queue of nodes with a 3-hour time limit. They are meant for short jobs, such as testing.

  • --gres=gpu:1 requests 1 GPU of any kind

  • --pty is short for ‘pseudo-terminal’.

  • The -i stands for ‘interactive’

Taken together, the above command will start an interactive bash terminal session on a node of the ‘fast’ partition, equipped with 1 GPU.

First, let’s verify that you are indeed on a node equipped with a functional GPU, by typing nvidia-smi:

$ nvidia-smi
Wed Sep 27 10:34:35 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:41:00.0 Off |                  N/A |
|  0%   42C    P8    22W / 240W |      1MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Your output should look similar to the above. You will be able to see the GPU name, temperature, memory usage, etc. If you see an error message instead (even though you are on a GPU node), please contact the SWC Scientific Computing team.

Next, load the SLEAP module.

$ module load SLEAP
Loading SLEAP/2023-08-01
  Loading requirement: cuda/11.8

To verify that the module was loaded successfully:

$ module list
Currently Loaded Modulefiles:
 1) cuda/11.8   2) SLEAP/2023-08-01

You can essentially think of the module as a centrally installed conda environment. When it is loaded, you should be using a particular Python executable. You can verify this by running:

$ which python
/ceph/apps/ubuntu-20/packages/SLEAP/2023-08-01/bin/python

Finally, we will verify that the sleap Python package can be imported and can ‘see’ the GPU. We will mostly just follow the relevant SLEAP instructions. First, start a Python interpreter:

$ python

Next, run the following Python commands:

Warning

The import sleap command may take some time to run (more than a minute). This is normal. Subsequent imports should be faster.

>>> import sleap

>>> sleap.versions()
SLEAP: 1.3.1
TensorFlow: 2.8.4
Numpy: 1.21.6
Python: 3.7.12
OS: Linux-5.4.0-109-generic-x86_64-with-debian-bullseye-sid

>>> sleap.system_summary()
GPUs: 1/1 available
  Device: /physical_device:GPU:0
         Available: True
        Initialized: False
     Memory growth: None

>>> import tensorflow as tf

>>> print(tf.config.list_physical_devices('GPU'))
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

>>> tf.constant("Hello world!")
<tf.Tensor: shape=(), dtype=string, numpy=b'Hello world!'>

If all is as expected, you can exit the Python interpreter, and then exit the GPU node:

>>> exit()
$ exit

If you have trouble using the SLEAP module, contact Niko Sirmpilatze of the SWC Neuroinformatics Unit.

To completely exit the HPC cluster, you will need to log out of the SSH session twice:

$ logout
$ logout

See Set up SSH for the SWC HPC cluster for more information.