:blogpost: true
:date: December 3, 2025
:author: Pille Wetterauer, Jyoti Bhogal
:location: London, UK
:category: Blog
:language: English
:image: 1

(target-sam2)=
# Exploring automatic ways of extracting a pose estimation skeleton for *C. elegans*

*Segmenting C. elegans using SAM-2 and extracting skeletons.*

Manually defining pose skeletons in each video frame can be very tedious. Automating this process would make movement analysis a lot faster and easier. Therefore, our project at the [OSW 2025](https://neuroinformatics.dev/open-software-summer-school/2025/index.html) hackday aimed to explore ways of doing exactly that.

```{figure} /_static/blog_images/sam2/worm-to-skeleton.png
:align: center
:width: 80%

**How to automatically define a pose estimation skeleton on a worm?** We explored this as part of the OSW hackday.
```

## Why worms?

When studying animal behaviour, [_Caenorhabditis elegans_](https://en.wikipedia.org/wiki/Caenorhabditis_elegans) is not the first model organism most people think of. However, these worms are frequently used as a model for drug discovery, developmental biology, genetics, and other areas. Changes in their behaviour are an important readout, indicating the effectiveness of a drug, a developmental defect, or the contribution of a specific gene.

The main advantage of using _C. elegans_ over mammalian models like mice is their simplicity. This is also true for the OSW hackday project described here: what could be easier to skeletonise than a worm!

## Segmenting worms with SAM-2

To automatically extract pose skeletons, we first need to create segmentation masks, and of course this should be automated as well. There are lots of deep learning algorithms available for segmentation. Here, we try out the [Segment Anything Model 2 (SAM-2)](https://github.com/facebookresearch/sam2). This model can be applied to both images and videos and is designed to segment any object with minimal input from the user.

### Installation

Installing SAM-2 was one of the more challenging steps of this project. The installation instructions recommend using [WSL](https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux) on Windows. However, since I already had Python on my Windows system, I decided to ignore that.

SAM-2 requires:

* Python >= 3.10
* PyTorch and TorchVision
* the SAM-2 package, which can be cloned from GitHub
* [ffmpeg](https://ffmpeg.org/) for video manipulation

Setting up PyTorch with GPU support on a Windows system can be tricky. Fortunately, I had done this before, so the NVIDIA drivers and CUDA were already set up and only the correct PyTorch wheel had to be downloaded. Still, it didn't all work out the first time. After some troubleshooting we found the culprit: `conda` was using `pip` from a different environment. `pip` is the package installer needed to install both PyTorch and SAM-2, and using the `pip` from another environment can lead to version conflicts between packages from different environments. You can easily check which environment your `pip` comes from with the command `which pip` (or `where pip` on Windows), which displays the path of the `pip` executable being used.

`ffmpeg` is a command-line tool for video manipulation that the [example notebook for SAM-2](https://github.com/facebookresearch/sam2/blob/main/notebooks/video_predictor_example.ipynb) recommends for extracting frames. It is readily available on most Linux distributions, but on Windows it has to be installed separately, outside the Python environment. Of course, any other video manipulation tool can be used as well.
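The example notebook suggests using `ffmpeg` to dump the video into numbered JPEG frames. As a rough sketch (the video path, output directory, and quality setting below are placeholders, and the exact flags may differ slightly from what the notebook recommends), this can also be scripted from Python:

```python
import subprocess
from pathlib import Path

video = "path/to/worm_video.mp4"     # placeholder: your input video
frames_dir = Path("path/to/frames")  # placeholder: output folder for the JPEG frames
frames_dir.mkdir(parents=True, exist_ok=True)

# Dump every frame as a high-quality JPEG named 00000.jpg, 00001.jpg, ...
subprocess.run(
    [
        "ffmpeg",
        "-i", video,           # input video
        "-q:v", "2",           # JPEG quality (lower number = better quality)
        "-start_number", "0",  # start numbering the frames at 0
        str(frames_dir / "%05d.jpg"),
    ],
    check=True,
)
```

Running the equivalent command directly in a terminal works just as well.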
In the end everything got installed correctly and I could run the notebook on my laptop, even though I kept getting some non-fatal error messages. The bigger problem, however, was the performance of my laptop. Despite the GPU, the prediction took several hours, so we decided to use Google Colab instead.

### Running the notebook on Google Colab

The SAM-2 repository contains [sample notebooks](https://github.com/facebookresearch/sam2/tree/main/notebooks) for different use cases, including [video segmentation](https://github.com/facebookresearch/sam2/blob/main/notebooks/video_predictor_example.ipynb). The notebooks contain a link to Google Colab and code for setting up the Colab environment. You just need to choose a runtime with a GPU (T4 for a free runtime), connect to it, and mount Google Drive. The data has to be uploaded to Google Drive.

Google Colab is good for trying things out; however, there are usage limits. The connection is terminated after being inactive for a while, and there is a limited number of sessions with GPU usage. Google does not publish these limits, and apparently they vary. There are paid options for longer runtimes and more GPU types.

### SAM-2 workflow for video segmentation

The SAM-2 repository provides an example notebook for segmenting videos, which is a good starting point for exploring this model. The video predictor is initialised with pretrained weights (which need to be downloaded separately when using SAM-2 on a local machine!). The video data has to be saved as single frames; the example notebook uses JPEG files. These images are loaded into a variable called `inference_state` during initialization.

Then the user provides prompts for the objects to be segmented. There can be one or more objects, and the prompts can be either point coordinates or boxes. Each prompt also has a label, showing whether it is positive (i.e. marking the desired object) or negative (marking the background). In this way, the first masks for a given frame are predicted, as shown below. We used two positive and one negative point prompt for each worm.

```{figure} /_static/blog_images/sam2/sam2-workflow.png
:align: center
:width: 70%

**Mask definition on the first frame.** The SAM-2 workflow showing the provided prompts for three worms (positive points shown in green, negative points in red) and the initial mask predictions. A selected worm (highlighted with a red bounding box) is shown as a zoomed-in view in the panels on the right.
```

The resulting masks are visualized, so the user can decide whether to move on or add more prompts for a better result. Due to limited time, the masks here were not refined further.

The next step is to propagate the masks to the whole video. Masks for each frame are predicted, and each object labelled in the first frame is tracked throughout the video. This is the time-consuming step. A good opportunity to grab a cup of coffee (or a piece of pizza) and have a chat with fellow coders.

```{figure} /_static/blog_images/sam2/sam2-propagation.png
:align: center
:width: 100%

**Propagation of masks to subsequent frames.** Once the masks are defined for the first frame, we can propagate the prompts to get the trajectories of the masks across the full video.
```

As a result of the prediction you get masks for every frame in the video. The [notebook](https://github.com/facebookresearch/sam2/blob/main/notebooks/video_predictor_example.ipynb) displays some of them, so you can check the quality.
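To make these steps concrete, here is a minimal sketch of the prompting and propagation loop, roughly following the API used in the example notebook at the time of writing. The checkpoint and config file names, the frame directory, and the point coordinates are placeholders you would adapt to your own setup; check the SAM-2 repository for the exact function names in the release you install.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Placeholders: adapt to the checkpoint/config you downloaded and your frame folder
checkpoint = "checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
video_dir = "path/to/frames"  # folder containing 00000.jpg, 00001.jpg, ...

predictor = build_sam2_video_predictor(model_cfg, checkpoint)

# Load the frames and initialise the tracking state
inference_state = predictor.init_state(video_path=video_dir)

# Prompt one worm on the first frame: two positive points and one negative point
points = np.array([[210, 350], [250, 380], [300, 300]], dtype=np.float32)  # (x, y), made-up coordinates
labels = np.array([1, 1, 0], dtype=np.int32)  # 1 = positive (worm), 0 = negative (background)
predictor.add_new_points_or_box(
    inference_state=inference_state,
    frame_idx=0,  # prompt on the first frame
    obj_id=1,     # our own ID for this worm
    points=points,
    labels=labels,
)

# Propagate the masks through the whole video
video_segments = {}  # frame index -> {object ID: boolean mask}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(inference_state):
    video_segments[frame_idx] = {
        obj_id: (mask_logits[i] > 0.0).cpu().numpy()
        for i, obj_id in enumerate(obj_ids)
    }
```

For several worms, you would call `add_new_points_or_box` once per worm with a different `obj_id` before propagating.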
All the predicted masks are stored in a variable called `video_segments`. Since the masks are NumPy arrays, they cannot be directly exported to a JSON file using Python's standard [`json`](https://docs.python.org/3/library/json.html) module (without first converting them to lists, which would lose `dtype` information). But you can use [`pickle`](https://docs.python.org/3/library/pickle.html) or [`numpy.savez()`](https://numpy.org/doc/stable/reference/generated/numpy.savez.html) to export them for later use.
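As a rough illustration, and assuming `video_segments` maps frame indices to dictionaries of per-object boolean masks (as in the sketch above; file names are placeholders), the export could look like this:

```python
import pickle
import numpy as np

# Option 1: pickle the whole nested dictionary in one go
with open("worm_masks.pkl", "wb") as f:
    pickle.dump(video_segments, f)

# Option 2: flatten the nested dictionary into named arrays for numpy.savez
flat_masks = {
    f"frame{frame_idx}_obj{obj_id}": mask
    for frame_idx, masks in video_segments.items()
    for obj_id, mask in masks.items()
}
np.savez("worm_masks.npz", **flat_masks)
```

Pickle preserves the nested structure directly, while `numpy.savez()` needs the masks flattened into named arrays.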
The [example notebook](https://github.com/facebookresearch/sam2/blob/main/notebooks/video_predictor_example.ipynb) in the SAM-2 repository contains all these steps, with extra code cells for different options, e.g. adding additional prompts, using different kinds of prompts, or segmenting only one object. A more concise notebook, tailored for segmenting _C. elegans_, can be found [here](https://github.com/pwetterauer/WormNotebooks.git).

### Limitations

While the results look reasonably good, there are some limitations. First, even in this simple example the masks cover not only the worm but also some neighbouring pixels. Using more points to prompt the model might solve this issue. However, the point coordinates have to be entered manually, and finding out the coordinates and typing them into an array is not very convenient. There are some SAM plugins for [FIJI](https://fiji.sc) ([SAMJ-IJ](https://github.com/segment-anything-models-java/SAMJ-IJ), which only works for images, not for videos) and [napari](https://napari.org) (e.g. [microSAM](https://computational-cell-analytics.github.io/micro-sam/micro_sam.html) for microscopy images, which supports 2D, 3D and videos) that could make this step of the workflow easier.

Second, on a more crowded plate, where the worms touch each other, the model tends to lose track of single worms. This is, however, a general problem not only for SAM-2 but also for other segmentation models. These kinds of mistakes can be corrected manually, or avoided by using less crowded videos.

In summary, SAM-2 did a good job segmenting the worms in a short time. The resulting masks can now be further analysed, e.g. by skeletonising them and selecting some markers along the skeleton to define a pose estimation skeleton.

## Skeletonisation of the masks

The next step after obtaining segmentation masks is to extract a skeleton from them. For this, we used the [`skeletonize` function from the `skimage` library](https://scikit-image.org/docs/stable/auto_examples/edges/plot_skeleton.html). This function takes a binary mask as input, in the form of a two-dimensional array, and returns a skeletonised version of the image. Skeletonisation is an iterative process: pixels are removed from the boundaries of the object, while preserving its connectivity and overall structure, until only a one-pixel-wide representation of the object remains and no more pixels can be removed without breaking its connectivity.

As an example, let's look at the following image of a worm mask and its skeletonised version:

```{figure} /_static/blog_images/sam2/EGCG5_40_2018_10_19_Mask_masked_and_skeletonised.png
:align: center
:width: 100%

**Masked worm image and its skeletonised version.** The input masked worm image (left) and its one-pixel wide skeletonised version (right).
```

Once we have the skeletonised image, we can define keypoints or nodes along the skeleton to represent the pose of the worm. This can be done by sampling points at regular intervals along the skeleton or by identifying specific features such as bends or junctions in the skeleton. In this exploration, I chose random selection and picked 5 pixels using [NumPy's `choice()` function](https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html#numpy.random.Generator.choice). These keypoints could then be used to create a pose track for the worm, which can be further analysed for movement patterns and behaviours. They could also be used to quickly create annotations for a pose estimation model (as long as the keypoints are consistent across the frames).

```{figure} /_static/blog_images/sam2/EGCG5_40_2018_10_19_Mask_skeleton_and_nodes.png
:align: center
:width: 100%

**Skeletonised worm image with sampled nodes.** Five pixels were randomly sampled along the skeleton to define the nodes (shown in blue).
```

A Python notebook to perform the skeletonisation of the worm masks, extract the nodes from the masks, and visualise the process can be found in this [GitHub repository](https://github.com/jyoti-bhogal/neuroinformatics_osw/tree/main/python_code_skeletonisation_and_node_selection). Below we include the snippet to sample the nodes:

```python
import numpy as np
from PIL import Image
from skimage.morphology import skeletonize

# Open the .tif image
img = Image.open("path/to/the/worm-mask.tif")

# Convert to a boolean numpy array (True = worm pixel)
img_worm = np.array(img).astype(bool)

# Compute skeletonised worm image (1-pixel wide worm mask)
skeleton_worm = skeletonize(img_worm)

# Get indices where skeleton_worm is True
worm_index_true = np.argwhere(skeleton_worm)

# Define nodes as 5 worm pixels randomly sampled (without replacement)
rng = np.random.default_rng(seed=42)  # set a seed for reproducibility
sample_indices = rng.choice(len(worm_index_true), size=5, replace=False)

# Get rows (y-coordinates) and columns (x-coordinates) of the nodes in the skeleton_worm array
sampled_worm_index_true = worm_index_true[sample_indices, :]
```

In conclusion, the combination of SAM-2 for segmentation and `skimage` for skeletonisation provides an effective workflow for extracting pose estimation skeletons for _C. elegans_. This automated approach can significantly speed up the analysis of worm behaviour and facilitate further research in this area.