Running pose estimation on the SWC HPC system

Adam Tyson, Niko Sirmpilatze & Laura Porta

Contents

  • Introduction to High Performance Computing
  • SWC HPC system
  • Using the job scheduler
  • Running pose estimation on the SWC HPC

Introduction to High Performance Computing (HPC)

  • Lots of meanings
  • Often just a system with many machines (nodes) linked together with some/all of:
    • Lots of CPU cores per node
    • Powerful GPUs
    • Lots of memory per node
    • Fast networking to link nodes
    • Fast data storage
    • Standardised software installation

Why?

  • Run jobs too large for desktop workstations
  • Run many jobs at once
  • Efficiency (cheaper to have central machines running 24/7)
  • In neuroscience, typically used for:
    • Analysing large data (e.g. high memory requirements)
    • Parallelising analysis/modelling (run on many machines at once)

SWC HPC hardware

(Correct at time of writing)

  • Ubuntu 20.04
  • 81 nodes
    • 46 CPU nodes
    • 35 GPU nodes
  • 3000 CPU cores
  • 83 GPUs
  • ~20TB RAM

Logging in

Log into the bastion node (not needed if you are already on the SWC network)

ssh <USERNAME>@ssh.swc.ucl.ac.uk

Log into HPC gateway node

ssh <USERNAME>@hpc-gw1

This node is fine for light work, but not for intensive analyses
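When you are outside the SWC network, the two hops above can be combined into a single command using OpenSSH's jump-host option (-J); this is just a convenience and assumes the bastion allows forwarding:

ssh -J <USERNAME>@ssh.swc.ucl.ac.uk <USERNAME>@hpc-gw1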

Logging in

More details

See our guide at howto.neuroinformatics.dev

File systems

  • /nfs/nhome/live/<USERNAME> or /nfs/ghome/live/<USERNAME>
    • “Home drive” (SWC/GCNU), also at ~/
  • /nfs/winstor/<group> - Old SWC research data storage (read-only soon)
  • /nfs/gatsbystor - GCNU data storage
  • /ceph/<group> - Current research data storage
  • /ceph/scratch - Not backed up, for short-term storage
  • /ceph/apps - HPC applications

Note

Some drives are only mounted (and therefore visible) once you navigate to them

Navigate to the scratch space

cd /ceph/scratch

Create a directory for yourself

mkdir <USERNAME>
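Because drives are mounted on demand, df is a quick way to confirm that a path is mounted and to see how much space is free (a standard Linux command; paths as listed above):

# Show the filesystem and free space for the scratch drive and your home drive
df -h /ceph/scratch ~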

HPC software

All nodes have the same software installed

  • Ubuntu 20.04 LTS
  • General linux utilities

Modules

Preinstalled packages available for use, including:

  • BrainGlobe
  • CUDA
  • Julia
  • Kilosort
  • mamba
  • MATLAB
  • miniconda
  • SLEAP

Using modules

List available modules

module avail

Load a module

module load SLEAP

Unload a module

module unload SLEAP

Load a specific version

module load SLEAP/2023-08-01

List loaded modules

module list
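Two more module subcommands are often useful; both are standard in the module tooling, though the exact output depends on how the module files are written on this cluster:

# Show what loading a module would change (paths, environment variables)
module show SLEAP

# Unload all currently loaded modules
module purge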

SLURM

  • Simple Linux Utility for Resource Management
  • Job scheduler
  • Allocates jobs to nodes
  • Queues jobs if nodes are busy
  • Users must explicitly request resources

SLURM commands

View a summary of the available resources

sinfo
atyson@sgw2:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up   infinite     29  idle~ enc1-node[1,3-14],enc2-node[1-10,12-13],enc3-node[5-8]
cpu*         up   infinite      1  down* enc3-node3
cpu*         up   infinite      2    mix enc1-node2,enc2-node11
cpu*         up   infinite      5   idle enc3-node[1-2,4],gpu-380-[24-25]
gpu          up   infinite      9    mix gpu-350-[01,03-05],gpu-380-[10,13],gpu-sr670-[20-22]
gpu          up   infinite      9   idle gpu-350-02,gpu-380-[11-12,14-18],gpu-sr670-23
medium       up   12:00:00      4  idle~ enc3-node[5-8]
medium       up   12:00:00      1  down* enc3-node3
medium       up   12:00:00      1    mix gpu-380-10
medium       up   12:00:00     10   idle enc3-node[1-2,4],gpu-380-[11-12,14-18]
fast         up    3:00:00      1    mix gpu-380-10
fast         up    3:00:00      9   idle enc1-node16,gpu-380-[11-12,14-18],gpu-erlich01

View currently running jobs (from everyone)

squeue
atyson@sgw2:~$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
4036257       cpu     bash   imansd  R 13-01:10:01      1 enc1-node2
4050946       cpu      zsh apezzott  R 1-01:02:30      1 enc2-node11
3921466       cpu     bash   imansd  R 51-03:05:29      1 gpu-380-13
4037613       gpu     bash  pierreg  R 12-05:55:06      1 gpu-sr670-20
4051306       gpu ddpm-vae   jheald  R      15:49      1 gpu-350-01
4051294       gpu  jupyter    samoh  R    1:40:59      1 gpu-sr670-22
4047787       gpu     bash antonins  R 4-18:59:43      1 gpu-sr670-21
4051063_7       gpu    LRsem apezzott  R 1-00:08:32      1 gpu-350-05
4051063_8       gpu    LRsem apezzott  R 1-00:08:32      1 gpu-380-10
4051305       gpu     bash  kjensen  R      18:33      1 gpu-sr670-20
4051297       gpu     bash   slenzi  R    1:15:39      1 gpu-350-01

More details

See our guide at howto.neuroinformatics.dev

Partitions

Interactive job

Start an interactive job (bash -i) in the cpu partition (-p cpu) in pseudoterminal mode (--pty).

srun -p cpu --pty bash -i

Always start a job (interactive or batch) before doing anything intensive to spare the gateway node.
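srun accepts the same resource flags as the batch scripts shown later, so you can size an interactive job explicitly; the values below are just an illustration:

# 4 cores, 8 GB of RAM, 2-hour limit, on the cpu partition
srun -p cpu -n 4 --mem 8G -t 0-2:00 --pty bash -i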

Run some “analysis”

Clone a test script

cd ~/
git clone https://github.com/neuroinformatics-unit/course-software-skills-hpc

Check out list of available modules

module avail

Load the miniconda module

module load miniconda

Create conda environment

cd course-software-skills-hpc/demo
conda env create -f env.yml

Activate conda environment and run Python script

conda activate slurm_demo
python multiply.py 5 10 --jazzy

Stop interactive job

exit

Batch jobs

Check out batch script:

cd course-software-skills-hpc/demo
cat batch_example.sh
#!/bin/bash

#SBATCH -p gpu # partition (queue)
#SBATCH -N 1   # number of nodes
#SBATCH --mem 2G # memory pool for all cores
#SBATCH -n 2 # number of cores
#SBATCH -t 0-0:10 # time (D-HH:MM)
#SBATCH -o slurm_output.out
#SBATCH -e slurm_error.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=adam.tyson@ucl.ac.uk

module load miniconda
conda activate slurm_demo

for i in {1..5}
do
  echo "Multiplying $i by 10"
  python multiply.py $i 10 --jazzy
done

Run batch job:

sbatch batch_example.sh
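The -o and -e lines in the script determine where standard output and error are written, so once the job has run you can inspect them from the same directory:

cat slurm_output.out
cat slurm_error.err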

Array jobs

Check out array script:

cat array_example.sh
#!/bin/bash

#SBATCH -p gpu # partition (queue)
#SBATCH -N 1   # number of nodes
#SBATCH --mem 2G # memory pool for all cores
#SBATCH -n 2 # number of cores
#SBATCH -t 0-0:10 # time (D-HH:MM)
#SBATCH -o slurm_array_%A-%a.out
#SBATCH -e slurm_array_%A-%a.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=adam.tyson@ucl.ac.uk
#SBATCH --array=0-9%4

# The array runs 10 separate tasks, but no more than 4 at a time (set by the %4 above).
# How the array index ($SLURM_ARRAY_TASK_ID) is used is entirely up to you.

module load miniconda
conda activate slurm_demo

echo "Multiplying $SLURM_ARRAY_TASK_ID by 10"
python multiply.py $SLURM_ARRAY_TASK_ID 10 --jazzy

Run array job:

sbatch array_example.sh
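As the comments in the script note, $SLURM_ARRAY_TASK_ID can be mapped onto anything. A common pattern is to index into a list of input files; the sketch below uses a hypothetical data directory:

# Inside a batch script with, e.g., #SBATCH --array=0-9
FILES=(/path/to/data/*.mp4)                  # hypothetical input files
CURRENT_FILE=${FILES[$SLURM_ARRAY_TASK_ID]}  # task 0 gets the first file, and so on
echo "Processing $CURRENT_FILE"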

Using GPUs

Start an interactive job with one GPU:

srun -p gpu --gres=gpu:1 --pty bash -i

Load TensorFlow & CUDA

module load tensorflow
module load cuda/11.8

Check GPU

python
import tensorflow as tf
tf.config.list_physical_devices('GPU')
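You can also check the allocated GPU from the shell; nvidia-smi ships with the NVIDIA drivers, so it should be available on the GPU nodes:

nvidia-smi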

Useful commands

Cancel a job

scancel <JOBID>

Cancel all your jobs

scancel -u <USERNAME>
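Two other core SLURM commands worth knowing:

# Detailed information about a pending or running job
scontrol show job <JOBID>

# Accounting information (state, resources used) for a specific job
sacct -j <JOBID>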

Example: pose estimation with SLEAP

Modern behavioural analysis

flowchart TB
    classDef emphasis fill:#03A062;

    video -->|compression/re-encoding | video2["compressed video"]
    video2 -->|pose estimation + tracking| tracks["pose tracks"]
    tracks --> |calculations| kinematics
    tracks -->|classifiers| actions["actions / behav syllables"]
    video2 --> |comp vision| actions

    linkStyle 1 stroke:#03A062, color:;
    class video2 emphasis
    class tracks emphasis

Pose estimation

  • “easy” in humans - vast amounts of data
  • “harder” in animals - less data, more variability

Pose estimation software

DeepLabCut: transfer learning

SLEAP: smaller networks

source: sleap.ai

Top-down pose estimation

SLEAP workflow

  • Training and inference are GPU-intensive
  • We can delegate to the HPC cluster’s GPU nodes

Sample data

/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023

  • Mouse videos from Loukia Katsouri
  • SLEAP project with:
    • labeled frames
    • trained models
    • prediction results

Labeling data locally

Exporting a training job package

Training job package contents

Copy the unzipped training package to your scratch space and inspect its contents:

cp -r /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023/labels.v001.slp.training_job /ceph/scratch/<USERNAME>/
cd /ceph/scratch/<USERNAME>/labels.v001.slp.training_job
ls
labels.v001.pkg.slp     # Copy of labeled frames
centroid.json           # Model configuration
centered_instance.json  # Model configuration
train-script.sh         # Bash script to run training
inference-script.sh     # Bash script to run inference
jobs.yaml               # Summary of all jobs

Warning

Make sure all scripts are executable

chmod +x *.sh

What’s in the SLEAP scripts?

Training

cat train-script.sh
#!/bin/bash
sleap-train centroid.json labels.v001.pkg.slp
sleap-train centered_instance.json labels.v001.pkg.slp

Inference

cat inference-script.sh
#!/bin/bash

Get SLURM to run the script

Suitable for debugging (immediate feedback)

  • Start an interactive job with one GPU

    srun -p gpu --gres=gpu:1 --pty bash -i
  • Execute commands one-by-one, e.g.:

    module load SLEAP
    cd /ceph/scratch/<USERNAME>/labels.v001.slp.training_job
    bash train-script.sh
    
    # Stop the session
    exit

Main method for submitting jobs

  • Prepare a batch script, e.g. sleap_train_slurm.sh

  • Submit the job:

    sbatch sleap_train_slurm.sh
  • Monitor job status:

    squeue --me

See example batch scripts

cd ~/course-software-skills-hpc/pose-estimation/slurm-scripts
ls

Warning

Make sure all scripts are executable

chmod +x *.sh

Edit a specific script:

nano sleap_train_slurm.sh

Save with Ctrl+O (followed by Enter), exit with Ctrl+X

Batch script for training

sleap_train_slurm.sh
#!/bin/bash

#SBATCH -J slp_train # job name
#SBATCH -p gpu # partition (queue)
#SBATCH -N 1   # number of nodes
#SBATCH --mem 16G # memory pool for all cores
#SBATCH -n 4 # number of cores
#SBATCH -t 0-06:00 # time (D-HH:MM)
#SBATCH --gres gpu:1 # request 1 GPU (of any kind)
#SBATCH -o slurm.%x.%N.%j.out # STDOUT
#SBATCH -e slurm.%x.%N.%j.err # STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define the directory of the exported training job package
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=/ceph/scratch/$USER/$SLP_JOB_NAME

# Go to the job directory
cd $SLP_JOB_DIR

# Run the training script generated by SLEAP
./train-script.sh

Monitoring the training job

sbatch sleap_train_slurm.sh
  Submitted batch job 4232289

View the status of your queued/running jobs with squeue --me

squeue --me

JOBID    PARTITION    NAME       USER      ST   TIME  NODES   NODELIST(REASON)
4232289  gpu          slp_trai   sirmpila  R    0:20      1   gpu-380-18

View status of running/completed jobs with sacct:

sacct

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4232289       slp_train        gpu     swc-ac          4    RUNNING      0:0
4232289.bat+      batch                swc-ac          4    RUNNING      0:0

Run sacct with some more helpful arguments, e.g. view jobs from the last 24 hours, incl. time elapsed and peak memory usage in KB (MaxRSS):

sacct \
  --starttime $(date -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --endtime $(date +%Y-%m-%dT%H:%M:%S) \
  --format=JobID,JobName,Partition,State,Start,Elapsed,MaxRSS

View the contents of standard output and error (the job name, node name and job ID will differ in each case):

cat slurm.slp_train.gpu-380-18.4232289.out
cat slurm.slp_train.gpu-380-18.4232289.err
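If you want to follow a log while the job is still running, tail -f works well (the file name will match your own job; press Ctrl+C to stop following):

tail -f slurm.slp_train.gpu-380-18.4232289.out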

View trained models

While you wait for the training job to finish, you can copy and inspect the trained models from a previous run:

cp -R /ceph/scratch/sirmpilatzen/labels.v001.slp.training_job/models /ceph/scratch/$USER/labels.v001.slp.training_job/
cd /ceph/scratch/$USER/labels.v001.slp.training_job/models
ls
231130_160757.centroid
231130_160757.centered_instance

What’s in the model directory?

cd 231130_160757.centroid
ls -1
best_model.h5
initial_config.json
labels_gt.train.slp
labels_gt.val.slp
labels_pr.train.slp
labels_pr.val.slp
metrics.train.npz
metrics.val.npz
training_config.json
training_log.csv
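training_log.csv is a plain-text file, so you can take a quick look at the logged training metrics directly from the shell:

head -n 5 training_log.csv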

Evaluate trained models

SLEAP workflow

Batch script for inference

sleap_inference_slurm.sh
#!/bin/bash

#SBATCH -J slp_infer # job name
#SBATCH -p gpu # partition
#SBATCH -N 1   # number of nodes
#SBATCH --mem 64G # memory pool for all cores
#SBATCH -n 32 # number of cores
#SBATCH -t 0-01:00 # time (D-HH:MM)
#SBATCH --gres gpu:rtx5000:1 # request 1 RTX5000 GPU
#SBATCH -o slurm.%x.%N.%j.out # write STDOUT
#SBATCH -e slurm.%x.%N.%j.err # write STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define directories for exported SLEAP job package and videos
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=/ceph/scratch/$USER/$SLP_JOB_NAME
VIDEO_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023/videos
VIDEO1_PREFIX=sub-01_ses-01_task-EPM_time-165049

# Go to the job directory
cd $SLP_JOB_DIR

# Make a directory to store the predictions
mkdir -p predictions

# Run the inference command
sleap-track $VIDEO_DIR/${VIDEO1_PREFIX}_video.mp4 \
    -m $SLP_JOB_DIR/models/231130_160757.centroid/training_config.json \
    -m $SLP_JOB_DIR/models/231130_160757.centered_instance/training_config.json \
    -o $SLP_JOB_DIR/predictions/${VIDEO1_PREFIX}_predictions.slp \
    --gpu auto \
    --no-empty-frames

Run inference job

  1. Edit and save the batch script

nano sleap_inference_slurm.sh

  2. Submit the job

sbatch sleap_inference_slurm.sh

  3. Monitor the job

squeue --me
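Once the job finishes, the output file named in the batch script should appear in the predictions directory:

ls /ceph/scratch/$USER/labels.v001.slp.training_job/predictions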

Run inference as an array job

%%{init: {"theme": "neutral", "fontFamily": "arial", "curve": "linear"} }%%

flowchart LR
    classDef emphasis fill:#03A062;

    script["array script"] -->|"video 1"| GPU1["GPU 1"] --> model1["predictions 1"]
    script -->|"video 2"| GPU2GPU1["GPU 2"] --> model2["predictions 2"]
    script -->|"video n"| GPU3GPU1["GPU n"] --> model3["predictions n"]

    class script emphasis

Batch script for array job

sleap_inference_slurm_array.sh
#!/bin/bash

#SBATCH -J slp_infer # job name
#SBATCH -p gpu # partition
#SBATCH -N 1   # number of nodes
#SBATCH --mem 64G # memory pool for all cores
#SBATCH -n 32 # number of cores
#SBATCH -t 0-01:00 # time (D-HH:MM)
#SBATCH --gres gpu:rtx5000:1 # request 1 RTX5000 GPU
#SBATCH -o slurm.%x.%N.%j.out # write STDOUT
#SBATCH -e slurm.%x.%N.%j.err # write STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com
#SBATCH --array=1-2

# Load the SLEAP module
module load SLEAP

# Define directories for exported SLEAP job package and videos
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=/ceph/scratch/$USER/$SLP_JOB_NAME
VIDEO_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023/videos

VIDEO1_PREFIX=sub-01_ses-01_task-EPM_time-165049
VIDEO2_PREFIX=sub-02_ses-01_task-EPM_time-185651
VIDEOS_PREFIXES=($VIDEO1_PREFIX $VIDEO2_PREFIX)
CURRENT_VIDEO_PREFIX=${VIDEOS_PREFIXES[$SLURM_ARRAY_TASK_ID - 1]}
echo "Current video prefix: $CURRENT_VIDEO_PREFIX"

# Go to the job directory
cd $SLP_JOB_DIR

# Make a directory to store the predictions
mkdir -p predictions

# Run the inference command
sleap-track $VIDEO_DIR/${CURRENT_VIDEO_PREFIX}_video.mp4 \
    -m $SLP_JOB_DIR/models/231130_160757.centroid/training_config.json \
    -m $SLP_JOB_DIR/models/231130_160757.centered_instance/training_config.json \
    -o $SLP_JOB_DIR/predictions/${CURRENT_VIDEO_PREFIX}_array_predictions.slp \
    --gpu auto \
    --no-empty-frames

Further reading