Running pose estimation on the SWC HPC system

Adam Tyson, Niko Sirmpilatze & Igor Tatarnikov

Contents

  • Hardware overview
  • Introduction to High Performance Computing
  • SWC HPC system
  • Using the job scheduler
  • Running pose estimation on the SWC HPC

Hardware overview

  • CPU (Central Processing Unit)
    • General-purpose
    • Split into cores (typically between 4 and 64)
    • Each core can run a separate process
    • Typically higher clock speed than GPU (~3-5GHz)
  • GPU (Graphics Processing Unit)
    • Originally for rendering graphics
    • Thousands of cores
    • Optimised for highly parallel operations (e.g. matrix multiplication)
    • Typically lower clock speed than CPU (~1-2GHz)

Hardware overview

Primary storage:

  • Cache
    • Small, fast memory
    • Stores frequently accessed data
    • Sits directly on the CPU/GPU
    • Typically in the MB range with multiple levels
  • Main memory (RAM/VRAM)
    • Fast storage for data
    • CPU/GPU can access data quickly
    • Lost when machine is powered off
    • Typically 8-512 GB range

Hardware overview

Secondary storage:

  • Drive storage (HDD/SSD)
    • Much slower than RAM
    • SSDs faster than HDDs
    • Typically in the GB-TB range
  • Network storage (e.g. ceph)
    • Shared storage accessible from multiple machines
    • Typically in the TB-PB range
    • High latency compared to local storage

Performance considerations

  • CPU
    • Frequency is important for single-threaded tasks
    • More cores can be better for parallel tasks
    • Sometimes your local machine is faster than the HPC for CPU tasks
  • GPU
    • Great for parallel tasks (e.g. machine learning)
    • Memory is important - make sure your data fits in VRAM
    • Generation matters: a newer generation is typically ~10–20% faster
  • Storage
    • Best if you can keep data in primary memory (Cache/RAM)
    • If data doesn’t fit in memory make sure it’s on fast storage (local)

cd: Navigate Directories

  • cd [directory] – Changes the current working directory to the specified directory.
  • cd .. – Move up one directory level.
  • cd /path/to/directory – Go to a specific path.
  • cd ~ or cd – Go to the home directory.
  • cd / – Go to the root directory.
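
For example (directory names are illustrative):

cd /ceph/scratch   # go to the scratch space
cd ..              # move up one level, to /ceph
cd                 # return to your home directory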

ls: List Directory Contents

  • ls – Lists files and directories in the current working directory.
  • ls -l – Displays detailed information about each file (permissions, owner, size, etc.).
  • ls -a – Shows all files, including hidden files (files starting with a dot).
  • ls -h – Displays sizes in human-readable format (e.g., KB, MB); usually combined with -l.
  • ls -lah – Combines the above options.
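
For example:

ls /ceph/scratch    # list the contents of a directory
ls -lah ~           # detailed listing of your home directory, including hidden files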

mkdir: Make Directory

  • mkdir [directory_name] – Creates a new directory with the specified name.
  • mkdir -p /path/to/dir – Creates nested directories as needed.
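
For example (directory names are illustrative):

mkdir results                    # a single directory
mkdir -p analysis/plots/final    # nested directories in one go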

rmdir and rm: Remove Directories and Files

  • rm [file] – Removes a file.
  • rmdir [directory] – Removes an empty directory.
  • rm -r [directory] – Removes a directory and its contents recursively.
  • rm -f [file] – Force removes a file (no undo, be careful!).
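
For example (file and directory names are illustrative; deletions are permanent):

rm old_notes.txt     # remove a single file
rm -r old_results    # remove a directory and everything in it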

mv and cp: Move, Rename and Copy

  • cp [source] [destination] – Copies a file or directory to the destination.
  • mv [source] [destination] – Moves a file or directory to the destination.
  • mv can also be used to rename a file or directory when the source and destination are in the same directory.
cp file.txt /new/location/
mv file.txt /new/location/
mv old_name.txt new_name.txt

Input and Output

  • echo [text] – Displays text or outputs text to a file.
  • echo $ENV_VAR – Displays the value of an environment variable.
  • touch [filename] – Creates an empty file or updates the timestamp of an existing file.
  • > – Redirects output to a file.
  • >> – Appends output to a file.
echo "Hello, Linux!"
touch file.txt
echo "Hello, Linux!" > file.txt
echo $HOME >> file.txt

watch: Monitor Command Output

  • watch [command] – Repeatedly runs a command at intervals and displays the result.
  • Use watch -n [seconds] [command] to change the interval.
watch -n 0.5 nvidia-smi

man and help: Get Help

  • man [command] – Opens the manual page for a command.
  • help [command] – Provides a short description of built-in commands.
man ls
help cd

|: Pipes!

  • | – Pipes the output of one command as input to another.
  • Useful for chaining commands together to perform complex operations.
ls | grep ".txt"
cat large_log_file.log | grep "ERROR" | less

Introduction to High Performance Computing (HPC)

  • Lots of meanings
  • Often just a system with many machines (nodes) linked together with some/all of:
    • Lots of CPU cores per node
    • Powerful GPUs
    • Lots of memory per node
    • Fast networking to link nodes
    • Fast data storage
    • Standardised software installation

Why?

  • Run jobs too large for desktop workstations
  • Run many jobs at once
  • Efficiency (cheaper to have central machines running 24/7)
  • In neuroscience, typically used for:
    • Analysing large data (e.g. high memory requirements)
    • Parallelising analysis/modelling (run on many machines at once)

SWC HPC hardware

(Correct at time of writing)

  • Ubuntu 20.04
  • 81 nodes
    • 46 CPU nodes
    • 35 GPU nodes
  • 3000 CPU cores
  • 83 GPUs
  • ~20TB RAM

Logging in

Log into bastion node (not necessary within SWC network)

ssh <USERNAME>@ssh.swc.ucl.ac.uk

Log into HPC gateway node

ssh <USERNAME>@hpc-gw1

This node is fine for light work, but not for intensive analyses

More details

See our guide at howto.neuroinformatics.dev

File systems

  • /nfs/nhome/live/<USERNAME> or /nfs/ghome/live/<USERNAME>
    • “Home drive” (SWC/GCNU), also at ~/
  • /nfs/winstor/<group> - Old SWC research data storage
  • /nfs/gatsbystor - GCNU data storage
  • /ceph/<group> - Current research data storage
  • /ceph/scratch - Not backed up, for short-term storage
  • /ceph/apps - HPC applications

Note

You may only be able to “see” a drive if you navigate to it
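
For example (replace <group> with your group's storage name; this assumes drives are mounted automatically when accessed):

ls /ceph            # your group's drive may not be listed yet
cd /ceph/<group>    # navigating to it mounts the drive
ls /ceph            # it should now appear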

Navigate to the scratch space

cd /ceph/scratch

Create a directory for yourself

mkdir <USERNAME>

HPC software

All nodes have the same software installed

  • Ubuntu 20.04 LTS
  • General linux utilities

Modules

Preinstalled packages available for use, including:

  • ANTs
  • BrainGlobe
  • CUDA
  • DeepLabCut
  • FSL
  • Julia
  • Kilosort
  • mamba
  • MATLAB
  • neuron
  • miniconda
  • SLEAP

Using modules

List available modules

module avail

Load a module

module load SLEAP

Unload a module

module unload SLEAP

Load a specific version

module load SLEAP/2024-08-14

List loaded modules

module list

SLURM

  • Simple Linux Utility for Resource Management
  • Job scheduler
  • Allocates jobs to nodes
  • Queues jobs if nodes are busy
  • Users must explicitly request resources
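
As a minimal sketch of requesting resources explicitly (the partition name and resource values here are illustrative; fuller examples follow below):

srun -p cpu -n 2 --mem=4G --pty bash -i    # interactive shell with 2 CPU cores and 4 GB RAM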

SLURM commands

View a summary of the available resources

sinfo
atyson@hpc-gw1:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up 10-00:00:0      1   mix# gpu-380-25
cpu*         up 10-00:00:0     31    mix enc1-node[1-14],enc2-node[1-13],enc3-node[6-8],gpu-380-24
cpu*         up 10-00:00:0      4  alloc enc3-node[1-2,4-5]
gpu          up 10-00:00:0      1   mix# gpu-380-15
gpu          up 10-00:00:0      1  down~ gpu-380-16
gpu          up 10-00:00:0     12    mix gpu-350-[01-05], gpu-380-[11,13-14,17-18],gpu-sr670-[20,22]
a100         up 30-00:00:0      2    mix gpu-sr670-[21,23]
lmem         up 10-00:00:0      1  idle~ gpu-380-12
medium       up   12:00:00      1   mix# gpu-380-15
medium       up   12:00:00      1  down~ gpu-380-16
medium       up   12:00:00      7    mix enc3-node[6-8],gpu-380-[11,14,17-18]
medium       up   12:00:00      4  alloc enc3-node[1-2,4-5]
fast         up    3:00:00      2  idle~ enc1-node16,gpu-erlich01
fast         up    3:00:00      4    mix gpu-380-[11,14,17-18]

View currently running jobs (from everyone)

squeue
atyson@hpc-gw1:~$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
4036257       cpu     bash   imansd  R 13-01:10:01      1 enc1-node2
4050946       cpu      zsh apezzott  R 1-01:02:30      1 enc2-node11
3921466       cpu     bash   imansd  R 51-03:05:29      1 gpu-380-13
4037613       gpu     bash  pierreg  R 12-05:55:06      1 gpu-sr670-20
4051306       gpu ddpm-vae   jheald  R      15:49      1 gpu-350-01
4051294       gpu  jupyter    samoh  R    1:40:59      1 gpu-sr670-22
4047787       gpu     bash antonins  R 4-18:59:43      1 gpu-sr670-21
4051063_7       gpu    LRsem apezzott  R 1-00:08:32      1 gpu-350-05
4051063_8       gpu    LRsem apezzott  R 1-00:08:32      1 gpu-380-10
4051305       gpu     bash  kjensen  R      18:33      1 gpu-sr670-20
4051297       gpu     bash   slenzi  R    1:15:39      1 gpu-350-01

More details

See our guide at howto.neuroinformatics.dev

Partitions

Interactive job

Start an interactive job (bash -i) in the fast partition (-p fast) in pseudoterminal mode (--pty) with one CPU core (-n 1).

srun -p fast -n 1 --pty bash -i

Always start a job (interactive or batch) before doing anything intensive to spare the gateway node.

Run some “analysis”

Clone a test script

cd ~/
git clone https://github.com/neuroinformatics-unit/course-behaviour-hpc

Make the script executable

cd course-behaviour-hpc/demo
chmod +x multiply.sh

Run the script

./multiply.sh 10 5

Stop interactive job

exit

Batch jobs

Check out batch script:

cd course-behaviour-hpc/demo
cat batch_example.sh
#!/bin/bash

#SBATCH -p fast # partition (queue)
#SBATCH -N 1   # number of nodes
#SBATCH --mem 1G # memory pool for all cores
#SBATCH -n 1 # number of cores
#SBATCH -t 0-0:1 # time (D-HH:MM)
#SBATCH -o slurm_output.out
#SBATCH -e slurm_error.err

for i in {1..5}
do
  ./multiply.sh $i 10
done

Run batch job:

sbatch batch_example.sh

Array jobs

Check out array script:

cat array_example.sh
#!/bin/bash

#SBATCH -p fast # partition (queue)
#SBATCH -N 1   # number of nodes
#SBATCH --mem 1G # memory pool for all cores
#SBATCH -n 1 # number of cores
#SBATCH -t 0-0:1 # time (D-HH:MM)
#SBATCH -o slurm_array_%A-%a.out
#SBATCH -e slurm_array_%A-%a.err
#SBATCH --array=0-9%4

# Array job runs 10 separate jobs, but not more than four at a time.
# This is flexible and the array ID ($SLURM_ARRAY_TASK_ID) can be used in any way.

echo "Multiplying $SLURM_ARRAY_TASK_ID by 10"
./multiply.sh $SLURM_ARRAY_TASK_ID 10 

Run array job:

sbatch array_example.sh

Using GPUs

Start an interactive job with one GPU:

srun -p gpu --gres=gpu:1 --pty bash -i

Load TensorFlow & CUDA

module load tensorflow
module load cuda/11.8

Check GPU

python
import tensorflow as tf
tf.config.list_physical_devices('GPU')

Useful commands

Cancel a job

scancel <JOBID>

Cancel all your jobs

scancel -u <USERNAME>

Example: pose estimation with SLEAP

Pose estimation

  • “Easy” in humans – vast amounts of training data available
  • “Harder” in animals – less data, more variability

Pose estimation software

DeepLabCut: transfer learning

SLEAP: smaller networks

source: sleap.ai

Top-down pose estimation

SLEAP workflow

  • Training and inference are GPU-intensive
  • We can delegate to the HPC cluster’s GPU nodes

Sample data

/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023

  • Mouse videos from Loukia Katsouri
  • SLEAP project with:
    • labeled frames
    • trained models
    • prediction results

Labeling data locally

Exporting a training job package

Training job package contents

Copy the unzipped training package to your scratch space and inspect its contents:

cp -r /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023/labels.v001.slp.training_job /ceph/scratch/$USER/
cd /ceph/scratch/$USER/labels.v001.slp.training_job
ls -1
labels.v001.pkg.slp     # Copy of labeled frames
centroid.json           # Model configuration
centered_instance.json  # Model configuration
train-script.sh         # Bash script to run training
inference-script.sh     # Bash script to run inference
jobs.yaml               # Summary of all jobs

Warning

Make sure all scripts are executable

chmod +x *.sh

What’s in the SLEAP scripts?

Training

cat train-script.sh
#!/bin/bash
sleap-train centroid.json labels.v001.pkg.slp
sleap-train centered_instance.json labels.v001.pkg.slp

Inference

cat inference-script.sh
#!/bin/bash

Get SLURM to run the script

Suitable for debugging (immediate feedback)

  • Start an interactive job with one GPU

    srun -p gpu --gres=gpu:1 --pty bash -i
  • Execute commands one-by-one, e.g.:

    module load SLEAP
    cd /ceph/scratch/$USER/labels.v001.slp.training_job
    bash train-script.sh
    
    # Stop the session
    exit

Main method for submitting jobs

  • Prepare a batch script, e.g. sleap_train_slurm.sh

  • Submit the job:

    sbatch sleap_train_slurm.sh
  • Monitor job status:

    squeue --me

See example batch scripts

cd ~/course-behaviour-hpc/pose-estimation/slurm-scripts
ls

Warning

Make sure all scripts are executable

chmod +x *.sh

Edit a specific script:

nano sleap_train_slurm.sh

Save with Ctrl+O (followed by Enter), exit with Ctrl+X

Batch script for training

sleap_train_slurm.sh
#!/bin/bash

#SBATCH -J slp_train # job name
#SBATCH -p gpu # partition (queue)
#SBATCH -N 1   # number of nodes
#SBATCH --mem 16G # memory pool for all cores
#SBATCH -n 4 # number of cores
#SBATCH -t 0-06:00 # time (D-HH:MM)
#SBATCH --gres gpu:1 # request 1 GPU (of any kind)
#SBATCH -o slurm.%x.%N.%j.out # STDOUT
#SBATCH -e slurm.%x.%N.%j.err # STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define the directory of the exported training job package
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=/ceph/scratch/$USER/$SLP_JOB_NAME

# Go to the job directory
cd $SLP_JOB_DIR

# Run the training script generated by SLEAP
./train-script.sh

Monitoring the training job

sbatch sleap_train_slurm.sh
  Submitted batch job 4232289

View the status of your queued/running jobs with squeue --me

squeue --me

JOBID    PARTITION    NAME       USER      ST   TIME  NODES   NODELIST(REASON)
4232289  gpu          slp_trai   sirmpila  R    0:20      1   gpu-380-18

View status of running/completed jobs with sacct:

sacct

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4232289       slp_train        gpu     swc-ac          4    RUNNING      0:0
4232289.bat+      batch                swc-ac          4    RUNNING      0:0

Run sacct with some more helpful arguments, e.g. view jobs from the last 24 hours, incl. time elapsed and peak memory usage in KB (MaxRSS):

sacct \
  --starttime $(date -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --endtime $(date +%Y-%m-%dT%H:%M:%S) \
  --format=JobID,JobName,Partition,State,Start,Elapsed,MaxRSS

View the contents of standard output and error (the job name, node name and job ID will differ in each case):

cat slurm.slp_train.gpu-380-18.4232289.out
cat slurm.slp_train.gpu-380-18.4232289.err
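
To follow the output while the job is still running (the file name will match your own job's name, node and ID, as above):

tail -f slurm.slp_train.gpu-380-18.4232289.out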

View trained models

While you wait for the training job to finish, you can copy and inspect the trained models from a previous run:

cp -R /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023/labels.v001.slp.training_job/models /ceph/scratch/$USER/labels.v001.slp.training_job/
cd /ceph/scratch/$USER/labels.v001.slp.training_job/models
ls
231130_160757.centroid
231130_160757.centered_instance

What’s in the model directory?

cd 231130_160757.centroid
ls -1
best_model.h5
initial_config.json
labels_gt.train.slp
labels_gt.val.slp
labels_pr.train.slp
labels_pr.val.slp
metrics.train.npz
metrics.val.npz
training_config.json
training_log.csv
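
For a quick look at how training progressed (training_log.csv is listed above; the exact columns depend on the SLEAP version):

head -n 5 training_log.csv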

Evaluate trained models

SLEAP workflow

Batch script for inference

sleap_infer_slurm.sh
#!/bin/bash

#SBATCH -J slp_infer # job name
#SBATCH -p gpu # partition
#SBATCH -N 1   # number of nodes
#SBATCH --mem 32G # memory pool for all cores
#SBATCH -n 8 # number of cores
#SBATCH -t 0-01:00 # time (D-HH:MM)
#SBATCH --gres gpu:1 # request 1 GPU
#SBATCH -o slurm.%x.%N.%j.out # write STDOUT
#SBATCH -e slurm.%x.%N.%j.err # write STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define directories for exported SLEAP job package and videos
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=/ceph/scratch/$USER/$SLP_JOB_NAME
VIDEO_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023/videos
VIDEO1_PREFIX=sub-01_ses-01_task-EPM_time-165049

# Go to the job directory
cd $SLP_JOB_DIR

# Make a directory to store the predictions
mkdir -p predictions

# Run the inference command
sleap-track $VIDEO_DIR/${VIDEO1_PREFIX}_video.mp4 \
    -m $SLP_JOB_DIR/models/231130_160757.centroid/training_config.json \
    -m $SLP_JOB_DIR/models/231130_160757.centered_instance/training_config.json \
    -o $SLP_JOB_DIR/predictions/${VIDEO1_PREFIX}_predictions.slp \
    --gpu auto \
    --no-empty-frames

Run inference job

  1. Edit and save the batch script
nano sleap_infer_slurm.sh
  2. Submit the job
sbatch sleap_infer_slurm.sh
  3. Monitor the job
squeue --me

Run inference as an array job

Batch script for array job

sleap_infer_array_slurm.sh
#!/bin/bash

#SBATCH -J slp_infer # job name
#SBATCH -p gpu # partition
#SBATCH -N 1   # number of nodes
#SBATCH --mem 32G # memory pool for all cores
#SBATCH -n 8 # number of cores
#SBATCH -t 0-01:00 # time (D-HH:MM)
#SBATCH --gres gpu:1 # request 1 GPU
#SBATCH -o slurm.%x.%N.%j.out # write STDOUT
#SBATCH -e slurm.%x.%N.%j.err # write STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com
#SBATCH --array=0-1

# Load the SLEAP module
module load SLEAP

# Define directories for exported SLEAP job package and videos
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=/ceph/scratch/$USER/$SLP_JOB_NAME
VIDEO_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023/videos

VIDEO1_PREFIX=sub-01_ses-01_task-EPM_time-165049
VIDEO2_PREFIX=sub-02_ses-01_task-EPM_time-185651
VIDEOS_PREFIXES=($VIDEO1_PREFIX $VIDEO2_PREFIX)
CURRENT_VIDEO_PREFIX=${VIDEOS_PREFIXES[$SLURM_ARRAY_TASK_ID]}
echo "Current video prefix: $CURRENT_VIDEO_PREFIX"

# Go to the job directory
cd $SLP_JOB_DIR

# Make a directory to store the predictions
mkdir -p predictions

# Run the inference command
sleap-track $VIDEO_DIR/${CURRENT_VIDEO_PREFIX}_video.mp4 \
    -m $SLP_JOB_DIR/models/231130_160757.centroid/training_config.json \
    -m $SLP_JOB_DIR/models/231130_160757.centered_instance/training_config.json \
    -o $SLP_JOB_DIR/predictions/${CURRENT_VIDEO_PREFIX}_array_predictions.slp \
    --gpu auto \
    --no-empty-frames

Further reading