Use the SLEAP module on the SWC HPC cluster#
Warning
Some links within this document point to the SWC internal wiki, which is only accessible from within the SWC network. We recommend opening these links in a new tab.
Interpreting code blocks
Shell commands will be shown in code blocks like this (with the `$` sign indicating the shell prompt):
$ echo "Hello world!"
Similarly, Python code blocks will appear with the `>>>` sign indicating the Python interpreter prompt:
>>> print("Hello world!")
The expected outputs of both Shell and Python commands will be shown without any prompt:
Hello world!
Abbreviations#
| Abbreviation | Meaning |
| --- | --- |
| SLEAP | Social LEAP Estimates Animal Poses |
| SWC | Sainsbury Wellcome Centre |
| HPC | High Performance Computing |
| SLURM | Simple Linux Utility for Resource Management |
| SSH | Secure Shell |
| GUI | Graphical User Interface |
Prerequisites#
Note on managed Linux desktops
The SWC’s IT team offers managed desktop computers equipped with a Linux image. These machines are already part of SWC’s trusted domain and have direct access to SLURM, the HPC modules, and the SWC filesystem.
If you have access to one of these desktops, you can skip the pre-requisite steps. You may simply open a terminal, type `module load SLEAP`, and start using SLEAP directly, as you would on any local Linux machine. All SLEAP commands should work as expected, including `sleap-label` for launching the GUI.
That said, you may still want to offload GPU-intensive tasks to an HPC node, e.g. because the desktop’s GPU is not powerful enough or because you need to run many jobs in parallel. In that case, the sections below on model training and inference remain relevant to you.
Access to the HPC cluster#
Verify that you can access the HPC gateway node (typing your `<SWC-PASSWORD>` both times when prompted):
$ ssh <SWC-USERNAME>@ssh.swc.ucl.ac.uk
$ ssh hpc-gw1
To learn more about accessing the HPC via SSH, see the relevant how-to guide.
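If you prefer to reach the gateway node in a single step, OpenSSH’s `-J` (ProxyJump) option can route the connection through the SSH server for you. This is a minimal sketch, assuming a reasonably recent OpenSSH client on your local machine; you will still be prompted for your password at each hop:
$ ssh -J <SWC-USERNAME>@ssh.swc.ucl.ac.uk <SWC-USERNAME>@hpc-gw1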
Access to the SLEAP module#
Once you are on the HPC gateway node, SLEAP should be listed among the available modules when you run `module avail`:
$ module avail
...
SLEAP/2023-03-13
SLEAP/2023-08-01
SLEAP/2024-08-14
...
- `SLEAP/2023-03-13` corresponds to SLEAP v.1.2.9
- `SLEAP/2023-08-01` corresponds to SLEAP v.1.3.1
- `SLEAP/2024-08-14` corresponds to SLEAP v.1.3.3
We recommend always using the latest version, which is the one loaded by default when you run `module load SLEAP`. If you want to load a specific version, you can do so by typing the full module name, including the date, e.g. `module load SLEAP/2023-08-01`.
If a module has been successfully loaded, it will be listed when you run `module list`, along with other modules it may depend on:
$ module list
Currently Loaded Modulefiles:
1) cuda/11.8 2) SLEAP/2023-08-01
If you have trouble loading the SLEAP module, see this guide’s Troubleshooting section.
Install SLEAP on your local PC/laptop#
While you can delegate the GPU-intensive work to the HPC cluster, you will need to use the SLEAP GUI for some steps, such as labelling frames. Thus, you also need to install SLEAP on your local PC/laptop.
We recommend following the official SLEAP installation guide. To minimise the risk of issues due to incompatibilities between versions, ensure the version of your local installation of SLEAP matches the one you plan to load in the cluster.
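For reference, the conda-based installation from the official guide looks roughly like the sketch below. Treat it as an illustration rather than the authoritative command: check the installation guide for the exact channels and options for your platform, and pick the version that matches the cluster module (here assumed to be v.1.3.3, i.e. `SLEAP/2024-08-14`):
$ conda create -y -n sleap -c conda-forge -c nvidia -c sleap -c anaconda sleap=1.3.3
$ conda activate sleap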
Mount the SWC filesystem on your local PC/laptop#
The rest of this guide assumes that you have mounted the SWC filesystem on your local PC/laptop. If you have not done so, please follow the relevant instructions on the SWC internal wiki.
We will also assume that the data you are working with are stored in a `ceph` directory to which you have access. In the rest of this guide, we will use the path `/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data`, which contains a SLEAP project for test purposes. You should replace this with the path to your own data.
Data storage location matters
The cluster has fast access to data stored on the `ceph` filesystem, so if your data is stored elsewhere, make sure to transfer it to `ceph` before running the job.
You can use tools such as `rsync` to copy data from your local machine to `ceph` via an SSH connection. For example:
$ rsync -avz <LOCAL-DIR> <SWC-USERNAME>@ssh.swc.ucl.ac.uk:/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
Model training#
This consists of two parts: preparing a training job (using your local SLEAP installation) and running the training job (using the HPC cluster’s SLEAP module). Some evaluation metrics for the trained models can be viewed via the SLEAP GUI on your local SLEAP installation.
Prepare the training job#
1. Follow the SLEAP instructions for Creating a Project and Initial Labelling. Ensure that the project file (e.g. `labels.v001.slp`) is saved in the mounted SWC filesystem (as opposed to your local filesystem).
2. Next, follow the instructions in Remote Training, i.e. Predict -> Run Training… -> Export Training Job Package….
   - For selecting the right configuration parameters, see Configuring Models and Troubleshooting Workflows.
   - Set the Predict On parameter to nothing. Remote training and inference (prediction) are easiest to run separately on the HPC cluster. Also unselect Visualize Predictions During Training in the training settings, if it’s enabled by default.
   - If you are working with a camera view from above or below (as opposed to a side view), set the Rotation Min Angle and Rotation Max Angle to -180 and 180 respectively in the Augmentation section.
3. Make sure to save the exported training job package (e.g. `labels.v001.slp.training_job.zip`) in the mounted SWC filesystem, for example in the same directory as the project file.
4. Unzip the training job package. This will create a folder with the same name (minus the `.zip` extension). This folder contains everything needed to run the training job on the HPC cluster.
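If you prefer to unzip on the HPC side rather than through your local file manager, here is a minimal sketch, assuming `unzip` is available on the gateway node and using the test data path from this guide (replace it with your own):
$ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
$ unzip labels.v001.slp.training_job.zip -d labels.v001.slp.training_job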
Run the training job#
Log in to the HPC cluster as described above.
$ ssh <SWC-USERNAME>@ssh.swc.ucl.ac.uk
$ ssh hpc-gw1
Navigate to the training job folder (replace with your own path) and list its contents:
$ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
$ cd labels.v001.slp.training_job
$ ls -1
centered_instance.json
centroid.json
inference-script.sh
jobs.yaml
labels.v001.pkg.slp
labels.v001.slp.predictions.slp
train_slurm.sh
swc-hpc-pose-estimation
train-script.sh
There should be a `train-script.sh` file created by SLEAP, which already contains the commands to run the training. You can see the contents of the file by running `cat train-script.sh`:
#!/bin/bash
sleap-train centroid.json labels.v001.pkg.slp
sleap-train centered_instance.json labels.v001.pkg.slp
The precise commands will depend on the model configuration you chose in SLEAP. Here we see two separate training calls, one for the ‘centroid’ and another for the ‘centered_instance’ model. That’s because in this example we have chosen the ‘Top-Down’ configuration, which consists of two neural networks - the first for isolating the animal instances (by finding their centroids) and the second for predicting all the body parts per instance.
More on ‘Top-Down’ vs ‘Bottom-Up’ models
Although the ‘Top-Down’ configuration was designed with multiple animals in mind, it can also be used for single-animal videos. It makes sense to use it for videos where the animal occupies a relatively small portion of the frame - see Troubleshooting Workflows for more info.
Next you need to create a SLURM batch script, which will schedule the training job on the HPC cluster. Create a new file called `train_slurm.sh` (you can do this in the terminal with `nano`/`vim`, or in a text editor of your choice on your local PC/laptop). Here we create the script in the same folder as the training job, but you can save it anywhere you want, or even keep track of it with `git`.
$ nano train_slurm.sh
An example is provided below, followed by explanations.
#!/bin/bash

#SBATCH -J slp_train # job name
#SBATCH -p gpu # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH --mem 32G # memory pool for all cores
#SBATCH -n 8 # number of cores
#SBATCH -t 0-06:00 # time (D-HH:MM)
#SBATCH --gres gpu:1 # request 1 GPU (of any kind)
#SBATCH -o slurm.%x.%N.%j.out # STDOUT
#SBATCH -e slurm.%x.%N.%j.err # STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define directories for SLEAP project and exported training job
SLP_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=$SLP_DIR/$SLP_JOB_NAME

# Go to the job directory
cd $SLP_JOB_DIR

# Run the training script generated by SLEAP
./train-script.sh
In `nano`, you can save the file by pressing Ctrl+O and exit by pressing Ctrl+X.
Explanation of the batch script
- The `#SBATCH` lines are SLURM directives. They specify the resources needed for the job, such as the number of nodes, CPUs, memory, etc. A primer on the most useful SLURM arguments is provided in this how-to guide. For more information see the SLURM documentation.
- The `#` lines are comments. They are not executed by SLURM, but they are useful for explaining the script to your future self and others.
- The `module load SLEAP` line loads the latest SLEAP module and any other modules it may depend on.
- The `cd` line changes the working directory to the training job folder. This is necessary because the `train-script.sh` file contains relative paths to the model configuration and the project file.
- The `./train-script.sh` line runs the training job (executes the contained commands).
Warning
Before submitting the job, ensure that you have permissions to execute both the batch script and the training script generated by SLEAP. You can make these files executable by running in the terminal:
$ chmod +x train-script.sh
$ chmod +x train_slurm.sh
If the scripts are not in your working directory, you will need to specify their full paths:
$ chmod +x /path/to/train-script.sh
$ chmod +x /path/to/train_slurm.sh
Now you can submit the batch script by running the following command (in the same directory as the script):
$ sbatch train_slurm.sh
Submitted batch job 3445652
You may monitor the progress of the job in various ways:
View the status of the queued/running jobs with `squeue`:
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3445652 gpu slp_train sirmpila R 23:11 1 gpu-sr670-20
View the status of running/completed jobs with `sacct`:
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3445652 slp_train gpu swc-ac 2 COMPLETED 0:0
3445652.bat+ batch swc-ac 2 COMPLETED 0:0
Run `sacct` with some more helpful arguments. For example, you can view jobs from the last 24 hours, displaying the time elapsed and the peak memory usage in KB (MaxRSS):
$ sacct \
--starttime $(date -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
--endtime $(date +%Y-%m-%dT%H:%M:%S) \
--format=JobID,JobName,Partition,State,Start,Elapsed,MaxRSS
JobID JobName Partition State Start Elapsed MaxRSS
------------ ---------- ---------- ---------- ------------------- ---------- ----------
4043595 slp_infer gpu FAILED 2023-10-10T18:14:31 00:00:35
4043595.bat+ batch FAILED 2023-10-10T18:14:31 00:00:35 271104K
4043603 slp_infer gpu FAILED 2023-10-10T18:27:32 00:01:37
4043603.bat+ batch FAILED 2023-10-10T18:27:32 00:01:37 423476K
4043611 slp_infer gpu PENDING Unknown 00:00:00
View the contents of standard output and error (the node name and job ID will differ in each case):
$ cat slurm.gpu-sr670-20.3445652.out
$ cat slurm.gpu-sr670-20.3445652.err
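You can also follow a job’s log file live as it is being written, or cancel a job that you no longer need, using standard shell and SLURM tools (the file name and job ID below are the ones from this example and will differ in your case):
$ tail -f slurm.gpu-sr670-20.3445652.out
$ scancel 3445652
Press Ctrl+C to stop following the log file.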
Out-of-memory (OOM) errors
If you encounter out-of-memory errors, keep in mind that there are two main sources of memory usage:
- CPU memory (RAM), specified via the `--mem` argument in the SLURM batch script. This is the memory used by the Python process running the training job and is shared among all the CPU cores.
- GPU memory: this is the memory on the GPU card(s) and depends on the GPU card type you requested via the `--gres gpu:1` argument in the SLURM batch script. To increase it, you can request a specific GPU card type with more GPU memory (e.g. `--gres gpu:a4500:1`). The SWC wiki provides a list of all GPU card types and their specifications (see also the sketch after this list).

If requesting more memory doesn’t help, you can try reducing the size of your SLEAP models. You may tweak the model backbone architecture, or play with Input scaling, Max stride and Batch size. See SLEAP’s documentation and discussion forum for more details.
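Before requesting a specific card type, you can also ask SLURM which GPU types the nodes in the `gpu` partition offer. This is a minimal sketch, where the output format string is just one possible choice (`%N` prints the node list, `%P` the partition and `%G` the generic resources, i.e. the GPUs):
$ sinfo -p gpu -o "%20N %10P %G"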
Model evaluation#
Upon successful completion of the training job, a `models` folder will have been created in the training job directory. It contains one subfolder per training run (by default prefixed with the date and time of the run).
$ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
$ cd labels.v001.slp.training_job
$ cd models
$ ls -1
230509_141357.centered_instance
230509_141357.centroid
Each subfolder holds the trained model files (e.g. `best_model.h5`), their configurations (`training_config.json`) and some evaluation metrics.
$ cd 230509_141357.centered_instance
$ ls -1
best_model.h5
initial_config.json
labels_gt.train.slp
labels_gt.val.slp
labels_pr.train.slp
labels_pr.val.slp
metrics.train.npz
metrics.val.npz
training_config.json
training_log.csv
The SLEAP GUI on your local machine can be used to quickly evaluate the trained models.
- Select Predict -> Evaluation Metrics for Trained Models…
- Click on Add Trained Model(s) and select the folder containing the model(s) you want to evaluate.
- You can view the basic metrics in the displayed table, or view a more detailed report (including plots) by clicking View Metrics.
For more detailed evaluation metrics, you can refer to SLEAP’s model evaluation notebook.
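If you prefer to check metrics without opening the GUI, you can also load them with SLEAP’s Python API, as the evaluation notebook does. This is a minimal sketch, run from the training job directory on a machine with SLEAP available (e.g. with the SLEAP module loaded); the `sleap.load_metrics` call and the `dist.avg`/`dist.p90` keys (mean and 90th-percentile localisation error in pixels) are taken from that notebook, so double-check them against the version you are using:
$ python -c "import sleap; m = sleap.load_metrics('models/230509_141357.centered_instance', split='val'); print(m['dist.avg'], m['dist.p90'])"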
Model inference#
By inference, we mean using a trained model to predict the labels on new frames/videos.
SLEAP provides the `sleap-track` command line utility for running inference on a single video or a folder of videos.
Below is an example SLURM batch script that contains a `sleap-track` call (a sketch for looping over a whole folder of videos is given further below).
#!/bin/bash

#SBATCH -J slp_infer # job name
#SBATCH -p gpu # partition
#SBATCH -N 1 # number of nodes
#SBATCH --mem 64G # memory pool for all cores
#SBATCH -n 16 # number of cores
#SBATCH -t 0-02:00 # time (D-HH:MM)
#SBATCH --gres gpu:rtx5000:1 # request 1 GPU (of a specific kind)
#SBATCH -o slurm.%x.%N.%j.out # write STDOUT
#SBATCH -e slurm.%x.%N.%j.err # write STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define directories for SLEAP project and exported training job
SLP_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
VIDEO_DIR=$SLP_DIR/videos
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=$SLP_DIR/$SLP_JOB_NAME

# Go to the job directory
cd $SLP_JOB_DIR
# Make a directory to store the predictions
mkdir -p predictions

# Run the inference command
sleap-track $VIDEO_DIR/M708149_EPM_20200317_165049331-converted.mp4 \
    -m $SLP_JOB_DIR/models/231010_164307.centroid/training_config.json \
    -m $SLP_JOB_DIR/models/231010_164307.centered_instance/training_config.json \
    --gpu auto \
    --tracking.tracker simple \
    --tracking.similarity centroid \
    --tracking.post_connect_single_breaks 1 \
    -o predictions/labels.v001.slp.predictions.slp \
    --verbosity json \
    --no-empty-frames
The script is very similar to the training script, with the following differences:
- The time limit `-t` is set lower, since inference is normally faster than training. This will however depend on the size of the video and the number of models used.
- The requested number of cores `-n` and memory `--mem` are higher. This will depend on the requirements of the specific job you are running. It’s best practice to try with a scaled-down version of your data first, to get an idea of the resources needed.
- The requested GPU is of a specific kind (RTX 5000). This will again depend on the requirements of your job, as the different GPU kinds vary in GPU memory size and compute capabilities (see the SWC wiki).
- The `./train-script.sh` line is replaced by the `sleap-track` command.
- The `\` character is used to split the long `sleap-track` command into multiple lines for readability. It is not necessary if the command is written on a single line.
Explanation of the sleap-track arguments
Some important command line arguments are explained below. You can view a full list of the available arguments by running `sleap-track --help`.
- The first argument is the path to the video file to be processed.
- The `-m` option is used to specify the path to the model configuration file(s) to be used for inference. In this example we use the two models that were trained above.
- The `--gpu` option is used to specify the GPU to be used for inference. The `auto` value will automatically select the GPU with the highest percentage of available memory (of the GPUs that are available on the machine/node).
- The options starting with `--tracking` specify parameters used for tracking the detected instances (animals) across frames. See SLEAP’s guide on tracking methods for more info.
- The `-o` option is used to specify the path to the output file containing the predictions.
- The above script will predict all the frames in the video. You may select specific frames via the `--frames` option. For example: `--frames 1-50` or `--frames 1,3,5,7,9`.
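The batch script above runs inference on a single video. To process a whole folder of videos, one option is to loop over them inside the same batch script. This is a minimal sketch, assuming all videos are .mp4 files in `$VIDEO_DIR` and reusing the variables and model paths defined above (tracking options omitted for brevity):

# Loop over all .mp4 files in the video directory,
# writing one predictions file per video
for video in "$VIDEO_DIR"/*.mp4; do
    sleap-track "$video" \
        -m $SLP_JOB_DIR/models/231010_164307.centroid/training_config.json \
        -m $SLP_JOB_DIR/models/231010_164307.centered_instance/training_config.json \
        --gpu auto \
        -o predictions/"$(basename "$video" .mp4)".predictions.slp
done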
You can submit and monitor the inference job in the same way as the training job.
$ sbatch infer_slurm.sh
$ squeue --me
Upon completion, a `labels.v001.slp.predictions.slp` file will have been created in the job directory. You can use the SLEAP GUI on your local machine to load and view the predictions: File -> Open Project… -> select the `labels.v001.slp.predictions.slp` file.
The training-inference cycle#
Now that you have some predictions, you can keep improving your models by repeating the training-inference cycle. The basic steps are:
1. Manually correct some of the predictions: see Prediction-assisted labeling.
2. Merge the corrected labels into the initial training set: see the Merging guide.
3. Save the merged training set as `labels.v002.slp`.
4. Export a new training job `labels.v002.slp.training_job` (you may reuse the training configurations from `v001`).
5. Repeat the training-inference cycle until satisfied.
Troubleshooting#
Problems with the SLEAP module#
In this section, we will describe how to test that the SLEAP module is loaded correctly for you and that it can use the available GPUs.
Log in to the HPC cluster as described above.
Start an interactive job on a GPU node. This step is necessary because we need to test the module’s access to the GPU.
$ srun -p fast --gres=gpu:1 --pty bash -i
Explain the above command
- `-p fast` requests a node from the ‘fast’ partition. This refers to the queue of nodes with a 3-hour time limit. They are meant for short jobs, such as testing.
- `--gres=gpu:1` requests 1 GPU of any kind.
- `--pty` is short for ‘pseudo-terminal’.
- The `-i` stands for ‘interactive’.
Taken together, the above command will start an interactive bash terminal session on a node of the ‘fast’ partition, equipped with 1 GPU.
First, let’s verify that you are indeed on a node equipped with a functional GPU, by typing `nvidia-smi`:
$ nvidia-smi
Wed Sep 27 10:34:35 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:41:00.0 Off | N/A |
| 0% 42C P8 22W / 240W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Your output should look similar to the above. You will be able to see the GPU name, temperature, memory usage, etc. If you see an error message instead (even though you are on a GPU node), please contact the SWC Scientific Computing team.
Next, load the SLEAP module.
$ module load SLEAP
Loading SLEAP/2024-08-14
Loading requirement: cuda/11.8
To verify that the module was loaded successfully:
$ module list
Currently Loaded Modulefiles:
1) SLEAP/2024-08-14
You can essentially think of the module as a centrally installed conda environment. When it is loaded, you should be using a particular Python executable. You can verify this by running:
$ which python
/ceph/apps/ubuntu-20/packages/SLEAP/2024-08-14/bin/python
Finally, we will verify that the `sleap` Python package can be imported and can ‘see’ the GPU. We will mostly just follow the relevant SLEAP instructions.
First, start a Python interpreter:
$ python
Next, run the following Python commands:
Warning
The `import sleap` command may take some time to run (more than a minute). This is normal. Subsequent imports should be faster.
>>> import sleap
>>> sleap.versions()
SLEAP: 1.3.3
TensorFlow: 2.8.4
Numpy: 1.21.6
Python: 3.7.12
OS: Linux-5.4.0-109-generic-x86_64-with-debian-bullseye-sid
>>> sleap.system_summary()
GPUs: 1/1 available
Device: /physical_device:GPU:0
Available: True
Initialized: False
Memory growth: None
>>> import tensorflow as tf
>>> print(tf.config.list_physical_devices('GPU'))
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> tf.constant("Hello world!")
<tf.Tensor: shape=(), dtype=string, numpy=b'Hello world!'>
If all is as expected, you can exit the Python interpreter, and then exit the GPU node:
>>> exit()
$ exit
If you encounter troubles with using the SLEAP module, contact Niko Sirmpilatze of the SWC Neuroinformatics Unit.
To completely exit the HPC cluster, you will need to type `exit` or `logout` until you are back to the terminal prompt of your local machine. See Set up SSH for the SWC HPC cluster for more information.