/nfs/nhome/live/<USERNAME> or /nfs/ghome/live/<USERNAME>
“Home drive” (SWC/GCNU), also at ~/
/nfs/winstor/<group> - Old SWC research data storage (read-only soon)
/nfs/gatsbystor - GCNU data storage
/ceph/<group> - Current research data storage
/ceph/scratch - Not backed up, for short-term storage
/ceph/apps - HPC applications
Note
You may only be able to “see” a drive once you navigate to it (drives are mounted on demand)
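For example (a sketch; the exact automount behaviour may vary):
ls /ceph            # your group's drive may not be listed yet
cd /ceph/<group>    # navigating to it triggers the mount
ls /ceph            # <group> should now appear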
Navigate to the scratch space
cd /ceph/scratch
Create a directory for yourself
mkdir <USERNAME>
HPC software
All nodes have the same software installed
Ubuntu 20.04 LTS
General Linux utilities
Modules
Preinstalled packages available for use, including:
BrainGlobe
CUDA
Julia
Kilosort
mamba
MATLAB
miniconda
SLEAP
Using modules
List available modules
module avail
Load a module
module load SLEAP
Unload a module
module unload SLEAP
Load a specific version
module load SLEAP/2023-08-01
List loaded modules
module list
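Loading a module typically prepends the software's location to your PATH; a quick sanity check (assuming the miniconda module provides the conda executable):
module load miniconda
which conda   # should now resolve to the module's copy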
SLURM
Simple Linux Utility for Resource Management
Job scheduler
Allocates jobs to nodes
Queues jobs if nodes are busy
Users must explicitly request resources
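For example, an interactive session with explicitly requested resources might look like this (standard SLURM flags; partition names are cluster-specific):
srun -p cpu -n 4 --mem 8G -t 0-01:00 --pty bash -i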
SLURM commands
View a summary of the available resources
sinfo
atyson@sgw2:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up infinite 29 idle~ enc1-node[1,3-14],enc2-node[1-10,12-13],enc3-node[5-8]
cpu* up infinite 1 down* enc3-node3
cpu* up infinite 2 mix enc1-node2,enc2-node11
cpu* up infinite 5 idle enc3-node[1-2,4],gpu-380-[24-25]
gpu up infinite 9 mix gpu-350-[01,03-05],gpu-380-[10,13],gpu-sr670-[20-22]
gpu up infinite 9 idle gpu-350-02,gpu-380-[11-12,14-18],gpu-sr670-23
medium up 12:00:00 4 idle~ enc3-node[5-8]
medium up 12:00:00 1 down* enc3-node3
medium up 12:00:00 1 mix gpu-380-10
medium up 12:00:00 10 idle enc3-node[1-2,4],gpu-380-[11-12,14-18]
fast up 3:00:00 1 mix gpu-380-10
fast up 3:00:00 9 idle enc1-node16,gpu-380-[11-12,14-18],gpu-erlich01
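In the STATE column: idle means no jobs are running on the node, mix means the node is partially allocated, and down* means it is not responding; a ~ suffix marks a node powered down to save energy. To inspect one partition node-by-node (standard sinfo flags):
sinfo -p gpu -N -l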
View currently running jobs (from all users)
squeue
atyson@sgw2:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4036257 cpu bash imansd R 13-01:10:01 1 enc1-node2
4050946 cpu zsh apezzott R 1-01:02:30 1 enc2-node11
3921466 cpu bash imansd R 51-03:05:29 1 gpu-380-13
4037613 gpu bash pierreg R 12-05:55:06 1 gpu-sr670-20
4051306 gpu ddpm-vae jheald R 15:49 1 gpu-350-01
4051294 gpu jupyter samoh R 1:40:59 1 gpu-sr670-22
4047787 gpu bash antonins R 4-18:59:43 1 gpu-sr670-21
4051063_7 gpu LRsem apezzott R 1-00:08:32 1 gpu-350-05
4051063_8 gpu LRsem apezzott R 1-00:08:32 1 gpu-380-10
4051305 gpu bash kjensen R 18:33 1 gpu-sr670-20
4051297 gpu bash slenzi R 1:15:39 1 gpu-350-01
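To show only your own jobs, or to cancel one (<JOBID> is a placeholder for the ID shown by squeue):
squeue -u <USERNAME>   # only your jobs
scancel <JOBID>        # cancel a job by ID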
Batch jobs
Check out the example batch script:
cd course-software-skills-hpc/demo
cat batch_example.sh
#!/bin/bash

#SBATCH -p gpu # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH --mem 2G # memory pool for all cores
#SBATCH -n 2 # number of cores
#SBATCH -t 0-0:10 # time (D-HH:MM)
#SBATCH -o slurm_output.out
#SBATCH -e slurm_error.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=adam.tyson@ucl.ac.uk

module load miniconda
conda activate slurm_demo

for i in {1..5}
do
  echo "Multiplying $i by 10"
  python multiply.py $i 10 --jazzy
done
Run batch job:
sbatch batch_example.sh
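After submission, sbatch prints the job ID, and the job's output and errors are written to the files named by -o and -e in the script:
cat slurm_output.out
cat slurm_error.err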
Array jobs
Check out array script:
cat array_example.sh
#!/bin/bash

#SBATCH -p gpu # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH --mem 2G # memory pool for all cores
#SBATCH -n 2 # number of cores
#SBATCH -t 0-0:10 # time (D-HH:MM)
#SBATCH -o slurm_array_%A-%a.out
#SBATCH -e slurm_array_%A-%a.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=adam.tyson@ucl.ac.uk
#SBATCH --array=0-9%4

# Array job runs 10 separate jobs, but not more than four at a time.
# This is flexible and the array ID ($SLURM_ARRAY_TASK_ID) can be used in any way.

module load miniconda
conda activate slurm_demo

echo "Multiplying $SLURM_ARRAY_TASK_ID by 10"
python multiply.py $SLURM_ARRAY_TASK_ID 10 --jazzy
Run array job:
sbatch array_example.sh
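Each array task writes its own output files: %A in the -o/-e patterns above expands to the parent job ID and %a to the array task ID:
ls slurm_array_*.out slurm_array_*.err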
Using GPUs
Start an interactive job with one GPU:
srun -p gpu --gres=gpu:1 --pty bash -i
Load TensorFlow & CUDA
module load tensorflow
module load cuda/11.8
Check GPU
python
import tensorflow as tf
tf.config.list_physical_devices('GPU')
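You can also check that the GPU is visible from the shell (assuming NVIDIA drivers, as on these nodes):
nvidia-smi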
Contents of the exported SLEAP training job package:
labels.v001.pkg.slp      # Copy of labeled frames
centroid.json            # Model configuration
centered_instance.json   # Model configuration
train-script.sh          # Bash script to run training
inference-script.sh      # Bash script to run inference
jobs.yaml                # Summary of all jobs
Run the training script inside the interactive session:
module load SLEAP
cd /ceph/scratch/<USERNAME>/labels.v001.slp.training_job
bash train-script.sh

# Stop the session
exit
Main method for submitting jobs
Prepare a batch script, e.g. sleap_train_slurm.sh
Submit the job:
sbatch sleap_train_slurm.sh
Monitor job status:
squeue --me
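While the job runs, you can follow its log from the directory where you submitted it; the file name follows the slurm.%x.%N.%j.out pattern used in the training script below (the node name and job ID will differ per job):
tail -f slurm.slp_train.*.out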
See example batch scripts
cd ~/course-software-skills-hpc/pose-estimation/slurm-scripts
ls
Warning
Make sure all scripts are executable
chmod +x *.sh
Edit a specific script:
nano sleap_train_slurm.sh
Save with Ctrl+O (followed by Enter), exit with Ctrl+X
Batch script for training
sleap_train_slurm.sh
#!/bin/bash

#SBATCH -J slp_train # job name
#SBATCH -p gpu # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH --mem 16G # memory pool for all cores
#SBATCH -n 4 # number of cores
#SBATCH -t 0-06:00 # time (D-HH:MM)
#SBATCH --gres gpu:1 # request 1 GPU (of any kind)
#SBATCH -o slurm.%x.%N.%j.out # STDOUT
#SBATCH -e slurm.%x.%N.%j.err # STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define the directory of the exported training job package
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=/ceph/scratch/$USER/$SLP_JOB_NAME

# Go to the job directory
cd $SLP_JOB_DIR

# Run the training script generated by SLEAP
./train-script.sh
Batch script for inference

#!/bin/bash

#SBATCH -J slp_infer # job name
#SBATCH -p gpu # partition
#SBATCH -N 1 # number of nodes
#SBATCH --mem 64G # memory pool for all cores
#SBATCH -n 32 # number of cores
#SBATCH -t 0-01:00 # time (D-HH:MM)
#SBATCH --gres gpu:rtx5000:1 # request 1 RTX5000 GPU
#SBATCH -o slurm.%x.%N.%j.out # write STDOUT
#SBATCH -e slurm.%x.%N.%j.err # write STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com

# Load the SLEAP module
module load SLEAP

# Define directories for exported SLEAP job package and videos
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=/ceph/scratch/$USER/$SLP_JOB_NAME
VIDEO_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023/videos
VIDEO1_PREFIX=sub-01_ses-01_task-EPM_time-165049

# Go to the job directory
cd $SLP_JOB_DIR

# Make a directory to store the predictions
mkdir -p predictions

# Run the inference command
sleap-track $VIDEO_DIR/${VIDEO1_PREFIX}_video.mp4 \
    -m $SLP_JOB_DIR/models/231130_160757.centroid/training_config.json \
    -m $SLP_JOB_DIR/models/231130_160757.centered_instance/training_config.json \
    -o $SLP_JOB_DIR/predictions/${VIDEO1_PREFIX}_predictions.slp \
    --gpu auto \
    --no-empty-frames
Array batch script for inference

#!/bin/bash

#SBATCH -J slp_infer # job name
#SBATCH -p gpu # partition
#SBATCH -N 1 # number of nodes
#SBATCH --mem 64G # memory pool for all cores
#SBATCH -n 32 # number of cores
#SBATCH -t 0-01:00 # time (D-HH:MM)
#SBATCH --gres gpu:rtx5000:1 # request 1 RTX5000 GPU
#SBATCH -o slurm.%x.%N.%j.out # write STDOUT
#SBATCH -e slurm.%x.%N.%j.err # write STDERR
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@domain.com
#SBATCH --array=1-2

# Load the SLEAP module
module load SLEAP

# Define directories for exported SLEAP job package and videos
SLP_JOB_NAME=labels.v001.slp.training_job
SLP_JOB_DIR=/ceph/scratch/$USER/$SLP_JOB_NAME
VIDEO_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data/course-hpc-2023/videos
VIDEO1_PREFIX=sub-01_ses-01_task-EPM_time-165049
VIDEO2_PREFIX=sub-02_ses-01_task-EPM_time-185651
VIDEOS_PREFIXES=($VIDEO1_PREFIX $VIDEO2_PREFIX)
CURRENT_VIDEO_PREFIX=${VIDEOS_PREFIXES[$SLURM_ARRAY_TASK_ID - 1]}
echo "Current video prefix: $CURRENT_VIDEO_PREFIX"

# Go to the job directory
cd $SLP_JOB_DIR

# Make a directory to store the predictions
mkdir -p predictions

# Run the inference command
sleap-track $VIDEO_DIR/${CURRENT_VIDEO_PREFIX}_video.mp4 \
    -m $SLP_JOB_DIR/models/231130_160757.centroid/training_config.json \
    -m $SLP_JOB_DIR/models/231130_160757.centered_instance/training_config.json \
    -o $SLP_JOB_DIR/predictions/${CURRENT_VIDEO_PREFIX}_array_predictions.slp \
    --gpu auto \
    --no-empty-frames
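Note that --array=1-2 yields task IDs 1 and 2, while bash arrays are 0-indexed, hence the [$SLURM_ARRAY_TASK_ID - 1] lookup. Submit it like the other scripts (the file name here is hypothetical, following the training script's naming):
sbatch sleap_infer_array_slurm.sh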