Big Imaging Data

Bridging Bioimaging and Research Software Engineering

Alessandro Felder, Igor Tatarnikov, Ruaridh Gollifer, Kimberly Meechan

Introduction

Schedule

AM (technical)

  • Intro to Big Imaging Data concepts
    • Bonus content: benchmarks!
  • Handling big imaging data with Python
    • self-paced, collaborative learning

PM (community)

  • A personal perspective
  • What next for careers and community in bioimage analysis?

Find these slides at https://neuroinformatics.dev/slides-big-imaging-data-osw25/.

Acknowledgements

Thank you to HEFTIE textbook authors: David Stansby, Ruaridh Gollifer and Kimberly Meechan!

Bioimaging formats today

  • Data is getting larger
  • Data is not standardised
  • Community efforts to collaborate
    • bioformats
    • more recently, OME-zarr

TIFF

  • Current de facto standard (aside from proprietary)
    • Large stacks: folders of 2D tiffs
  • Not designed for scientific applications
  • bioformats can help convert from proprietary

Limitations

  • Data does not fit into memory.
  • Would be nice to compress data,
    • But need to uncompress whole files to read a few pixels.
    • Uncompressing can be slow

Solution: OME-zarr (spoiler!)

  • community support
    • helps with standardisation
  • uses “chunked storage”
  • uses “pyramidal file format”
    • help with data size

Let’s dig deeper

Chunked storage and pyramidal file formats

Folder of 2D tiffs

  • A form of chunked storage

Folder of 2D tiffs

  • Can access pixels in a few planes without loading whole image into memory

  • Still only 4% of read data actually needed
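How much of the data read is actually useful depends on how the requested region compares to a full plane. The sketch below illustrates that arithmetic with hypothetical sizes (a 200 x 200 region inside 1000 x 1000 planes); these numbers are assumptions for illustration and are not taken from the slides.

```python
def useful_fraction(plane_shape, roi_shape):
    """Fraction of the pixels read from whole 2D planes that lie inside the region of interest.

    The number of planes touched cancels out, so only the in-plane sizes matter.
    """
    plane_pixels = plane_shape[0] * plane_shape[1]
    roi_pixels = roi_shape[0] * roi_shape[1]
    return roi_pixels / plane_pixels


# Hypothetical sizes, chosen only for illustration.
fraction = useful_fraction(plane_shape=(1000, 1000), roi_shape=(200, 200))
print(f"{fraction:.0%} of the pixels read are needed")
```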

Folder of 2D tiffs

  • Even more limited in some situations:

Chunked storage

  • Choose chunk size when we create the overall image “file”
  • Save each chunk into a separate, compressed, file

Chunked storage

  • Note: This also favours parallel reading and writing of files!

Chunked storage

  • Allows reading and decompressing fewer pixels when accessing sub arrays
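To make this concrete, here is a minimal sketch of how one might work out which chunks overlap a requested sub-array. The function name and the example region are hypothetical; real libraries do this bookkeeping internally.

```python
from itertools import product


def chunks_for_region(region, chunk_shape):
    """Return the grid index of every chunk that overlaps a region.

    region: one (start, stop) pair per axis, with stop exclusive.
    chunk_shape: chunk edge length per axis.
    """
    per_axis = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(region, chunk_shape)
    ]
    return list(product(*per_axis))


# Hypothetical request: a small block that straddles one chunk boundary.
region = ((100, 110), (0, 64), (60, 70))
needed = chunks_for_region(region, chunk_shape=(64, 64, 64))
print(f"{len(needed)} chunks to read and decompress: {needed}")
```

Only the chunks listed need to be read and decompressed; every other chunk stays untouched on disk.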

Chunked storage

zarr is an open-source specification for how large N-dimensional arrays should be stored.

Lots of choices

  • where should we store the data?

    • on the cloud or locally?
  • how big should we make the chunks?

  • how should we compress each chunk?

  • Criteria:

    • reading speed!
      • writing speed maybe less important
    • size on disk
  • Luckily, Ruaridh and Kimberly can help!

Benchmarking zarr

  • 3 images: heart (335 MB), dense segmentation, sparse segmentation

    • Heart: HiP-CT scan of a heart from the Human Organ Atlas
    • Dense: segmented neurons from electron microscopy
    • Sparse: A few select segmented neurons from electron microscopy
  • Images used are on zenodo

  • All sized 806 x 629 x 629

  • All 16-bit unsigned integer

Benchmarking zarr

Choice of compression library affects read time more than compression level.

Benchmarking zarr

Choice of compression library affects write time + higher compression levels take longer to write.

Benchmarking zarr

Tensorstore is a lot faster at reading data than zarr-python.

Benchmarking zarr

Large chunks compress worse + increase memory usage

Benchmarking zarr

Larger chunks are faster to read/write overall (+ make fewer files)
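The trade-off between file count and per-chunk memory can be estimated with simple arithmetic. The sketch below uses the benchmark image shape (806 x 629 x 629, 16-bit) and assumes one file per chunk with no sharding; the helper names are hypothetical.

```python
import math


def chunk_grid(image_shape, chunk_shape):
    """Number of chunks along each axis (edge chunks count as whole chunks)."""
    return tuple(math.ceil(n / c) for n, c in zip(image_shape, chunk_shape))


def describe(image_shape, chunk_shape, bytes_per_voxel):
    """Return (number of chunks, uncompressed bytes per full chunk)."""
    n_chunks = math.prod(chunk_grid(image_shape, chunk_shape))
    chunk_bytes = math.prod(chunk_shape) * bytes_per_voxel
    return n_chunks, chunk_bytes


image_shape = (806, 629, 629)  # shape of the benchmark images
for edge in (64, 128):
    n_chunks, chunk_bytes = describe(image_shape, (edge, edge, edge), bytes_per_voxel=2)
    print(f"{edge}^3 chunks: {n_chunks} files, {chunk_bytes / 2**20:.1f} MiB per uncompressed chunk")
```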

Benchmarking zarr

Compression ratio vs write time plot for heart image (left) and dense segmentation (right)

  • Segmentations normally compress a lot more (see compression ratio, y axis much higher values)

  • You may have to use different settings depending on your image.

Benchmarking zarr conclusions

Approximately “optimal” choices*

  • Use tensorstore library for fastest read/write
  • blosc-zstd is a good default compressor
  • Use a high compression level to get the smallest file size (usually means longer write times, but not much effect on read time)
  • Smaller chunks = smaller overall file size + less memory usage
  • Larger chunks = faster read/write times + fewer files
  • It’s a balance! People often use 64x64x64 or 128x128x128

This is different for different data! Worth testing some small samples of your own data with different settings
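A small test of this kind could look like the sketch below. It is a stand-in only: it uses `zlib` from the Python standard library instead of blosc-zstd, and a synthetic byte pattern instead of real voxels, so the numbers it prints say nothing about the benchmarks above. For a real test, swap in a sample of your own data and the library and compressor you plan to use.

```python
import time
import zlib


def make_sample(n_bytes):
    """Build a synthetic, fairly compressible byte pattern (a stand-in for real voxels)."""
    return bytes((i // 64) % 256 for i in range(n_bytes))


def benchmark(data, chunk_bytes, level):
    """Compress data chunk by chunk; return (compression ratio, seconds taken)."""
    start = time.perf_counter()
    compressed_size = 0
    for offset in range(0, len(data), chunk_bytes):
        compressed_size += len(zlib.compress(data[offset:offset + chunk_bytes], level))
    elapsed = time.perf_counter() - start
    return len(data) / compressed_size, elapsed


sample = make_sample(1 << 20)  # 1 MiB of synthetic data
results = {}
for chunk_bytes in (1 << 12, 1 << 16):
    for level in (1, 9):
        ratio, seconds = benchmark(sample, chunk_bytes, level)
        results[(chunk_bytes, level)] = (ratio, seconds)
        print(f"chunk={chunk_bytes:>6} B  level={level}  ratio={ratio:6.1f}  time={seconds:.4f} s")
```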

Pyramidal file formats

Chunks help with reading subsets of pixels, but what if you want to view the image as a whole?

Pyramidal file formats

Idea: multiscale images!
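As a rough sketch of what a pyramid looks like, the code below lists the shape of each resolution level for the benchmark image shape. Downsampling every axis by a factor of 2, rounding up, and stopping once every axis is at most 64 voxels are all assumptions made for illustration.

```python
import math


def pyramid_shapes(shape, factor=2, min_size=64):
    """List the shape of each pyramid level, from full resolution downwards.

    Every axis is divided by `factor` (rounding up) until no axis is longer
    than `min_size`.
    """
    levels = [tuple(shape)]
    while max(levels[-1]) > min_size:
        levels.append(tuple(math.ceil(n / factor) for n in levels[-1]))
    return levels


levels = pyramid_shapes((806, 629, 629))
for index, level_shape in enumerate(levels):
    print(f"level {index}: {level_shape}")
```

A viewer can then load the coarsest level to show the whole image, and fetch finer levels only for the region being zoomed into.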

OME zarr

Even better idea: multiscale images that follow a standard.
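The standard part is largely a matter of agreed metadata. The sketch below builds a minimal "multiscales" block modelled on version 0.4 of the OME-Zarr specification; the helper name, the image name, the voxel size and the uniform downsampling factor are hypothetical, and the output should be checked against the specification (or a validator such as ome-zarr-models) before relying on it.

```python
import json


def multiscales_metadata(name, voxel_size, n_levels, factor=2):
    """Build a minimal OME-Zarr-style 'multiscales' attribute for a 3D (z, y, x) image.

    Assumes every axis is downsampled by `factor` at each level.
    """
    axes = [{"name": axis, "type": "space", "unit": "micrometer"} for axis in "zyx"]
    datasets = [
        {
            "path": str(level),  # one array per resolution level: "0", "1", ...
            "coordinateTransformations": [
                {"type": "scale", "scale": [size * factor**level for size in voxel_size]}
            ],
        }
        for level in range(n_levels)
    ]
    return {"multiscales": [{"version": "0.4", "name": name, "axes": axes, "datasets": datasets}]}


# Hypothetical image name and voxel size.
metadata = multiscales_metadata("example", voxel_size=(1.0, 1.0, 1.0), n_levels=3)
print(json.dumps(metadata, indent=2))
```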

Further topics

  • tools are in flux
    • zarr v2 versus v3
    • which tools are compatible with which?!
  • sharding
    • filesystems don’t like many small files!

Plan

  • handling OME zarr data with Python
  • get your hands dirty
    • gain an intuition

Default adventure

  • convert a tiff stack to OME zarr
  • apply a threshold to it
  • colour thresholded data chunk by chunk

“The colourful mouse bone challenge”

Label a mouse tibia by chunks
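The idea of thresholding and then colouring chunk by chunk can be tried on a toy example first. The sketch below is not the tutorial code: it works on a tiny 2D nested list in pure Python, and the function name, image values and threshold are hypothetical.

```python
def label_by_chunk(image, threshold, chunk):
    """Threshold a 2D image, then label foreground pixels by the chunk they fall in.

    Background pixels stay 0; foreground pixels in chunk number k get label k + 1,
    with chunks numbered row by row.
    """
    n_rows, n_cols = len(image), len(image[0])
    chunks_per_row = -(-n_cols // chunk)  # ceiling division
    labels = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            if image[r][c] > threshold:
                chunk_index = (r // chunk) * chunks_per_row + (c // chunk)
                labels[r][c] = chunk_index + 1
    return labels


# Toy 4 x 4 image split into four 2 x 2 chunks.
image = [
    [0, 9, 0, 9],
    [9, 0, 0, 0],
    [0, 0, 9, 9],
    [9, 0, 0, 9],
]
labels = label_by_chunk(image, threshold=5, chunk=2)
for row in labels:
    print(row)
```

Displaying `labels` with a categorical colour map then shows each chunk's foreground in its own colour.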

Self-paced learning

  • encouraged to create your own adventure
    • work on your own data
    • some time to report back
  • encouraged to get involved with others
    • lots of diverse expertise in the room

Some tips

  • run on small data first
  • think about
    • do you need to update array contents?
    • are you overwriting existing contents?
    • check .zattrs and .zgroup files
    • check folders and subfolders
  • ask for help
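One way to check the metadata files is to walk the store and print what is there. The sketch below assumes a zarr v2 directory store, where metadata lives in JSON files named `.zattrs`, `.zgroup` and `.zarray`; the function name and the temporary demo store are hypothetical.

```python
import json
import tempfile
from pathlib import Path

METADATA_NAMES = {".zattrs", ".zgroup", ".zarray"}


def find_zarr_metadata(root):
    """Map each zarr v2 metadata file under root (by relative path) to its parsed JSON."""
    found = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.name in METADATA_NAMES:
            found[str(path.relative_to(root))] = json.loads(path.read_text())
    return found


# Demo on a throwaway store; point `find_zarr_metadata` at your own .zarr folder instead.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "image.zarr"
    store.mkdir()
    (store / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    (store / ".zattrs").write_text(json.dumps({"multiscales": []}))
    metadata = find_zarr_metadata(store)

for name, content in metadata.items():
    print(name, content)
```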

You may additionally need a reader plugin for your specific image data, e.g. if you have sldy images:

pip install bioio_sldy

A warning

Run on small data first!

Installation

conda create -n big-imaging-data-tutorial python=3.13 -y
conda activate big-imaging-data-tutorial
pip install "matplotlib" "jupyterlab" "numcodecs==0.15.1" "numpy==2.3.2" "zarr==2.18.7" "pydantic-zarr==0.7.0" "ome-zarr-models==0.1.10" "joblib==1.5.1" "tifffile[zarr]<2025.5.21" "bioio" "bioio-tifffile"
pip install "napari[all]" "napari-ome-zarr"

Installation

git clone https://github.com/neuroinformatics-unit/slides-big-imaging-data-osw25.git

or download from https://github.com/neuroinformatics-unit/slides-big-imaging-data-osw25/blob/main/tutorials/ and then run

jupyter lab

from the tutorials/ folder in your terminal.

Ideas

  • Adapt code to visualise/threshold remote data
  • Benchmark-related
    • run benchmarks on your own data
    • vary chunksize/compression level etc.
  • Skeletal Biology
    • segment chunkwise into spongy (<50% bone per chunk) and compact (>50% bone per chunk)
  • Hard: can we get the tutorial to run with zarr v3 and the latest OME-zarr?
  • Own ideas?!
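For the skeletal biology idea, a first attempt could classify each chunk by its bone fraction. The sketch below is a toy 2D version in pure Python; the function name and mask are hypothetical, and treating a chunk with exactly 50% bone as compact is an assumption, since the idea as stated leaves that case open. Chunks with no bone at all are labelled spongy here, so real data would probably need a separate background class.

```python
def classify_chunks(mask, chunk):
    """Classify each square chunk of a 2D bone mask (1 = bone, 0 = not bone).

    Returns a dict mapping chunk grid index (row, col) to "compact" or "spongy".
    Assumption: a chunk with exactly 50% bone counts as compact.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    result = {}
    for r0 in range(0, n_rows, chunk):
        for c0 in range(0, n_cols, chunk):
            values = [
                mask[r][c]
                for r in range(r0, min(r0 + chunk, n_rows))
                for c in range(c0, min(c0 + chunk, n_cols))
            ]
            fraction = sum(values) / len(values)
            result[(r0 // chunk, c0 // chunk)] = "compact" if fraction >= 0.5 else "spongy"
    return result


# Toy 4 x 4 mask split into four 2 x 2 chunks.
mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
classes = classify_chunks(mask, chunk=2)
print(classes)
```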

Lunch break!

What next for community and careers in bioimage analysis?

Introductory context

  • setting the scene from my perspective
  • please disagree

Big imaging data

e.g. the IARPA MICrONS dataset

Big imaging data

This IARPA MICrONS dataset spans a 1.4mm x .87mm x .84 mm volume of cortex in a P87 mouse. The dataset was imaged using two-photon microscopy, microCT, and serial electron microscopy, and then reconstructed using a combination of AI and human proofreading.

  • Gaining insight from large imaging data requires diverse technical expertise
  • Collaboration and technical skills are essential!

Postdoc careers

Madeline Lancaster, a neuroscientist at the University of Cambridge, UK, can relate to that. In July, she received a total of 36 applications for a postdoctoral position in her laboratory, many fewer than the couple of hundred that she originally expected. “I had been nervous that I wouldn’t be able to go through all of the applications,” she says. Those 36 didn’t lead to a single appointment. “I still have not filled the position,” she says. “There seems to be lots of competition for strong candidates.”

Postdoc careers

Those who stayed and landed a coveted faculty position were more likely to have had a highly cited paper, changed their research topic between their PhD and postdoc, or moved abroad after receiving their doctorate.

My career in selected conferences

SSI Collaborations Workshop

  • 2015 (Oxford) (Hackday)
  • 2017 (Leeds)
  • 2018 (Cardiff) (Hackday)
  • 2019 (Loughborough)
  • 2020 (online)
  • 2023 (Manchester)
  • 2024 (Warwick)
  • 2025 (Stirling)

Imaging Conferences

  • NEUBIAS 2018
  • Microscience Microscopy Congress 2019 🧑🏻‍💻
  • CBIAS 2022
  • CBIAS 2023
  • GloBIAS 2024
  • CBIAS 2024 🧑🏻‍🎓
  • CBIAS 2025 - co-organiser

My SSI fellowship

Peter Sieling, CC BY 2.0.

A bridge between Bioimaging and Research Software Engineering

What bridge?

Why build the bridge

  • coordinate advocacy toward university leadership/policy makers/funders
  • best practice knowledge exchange
  • avoid reinventing the career path wheel
  • involve more voices in co-design of careers and community

RTP career paths at UCL

RTP=“(digital) Research Technology Professional”

The job framework consists of several individual job descriptions ranging from:

  • Assistant Research Software Engineer (UCL Grade 6)
  • Research Software Engineer (UCL Grade 7)
  • Senior Research Software Engineer (UCL Grade 8)
  • Principal Research Software Engineer (UCL Grade 9)
  • Head Research Software Engineer (UCL Grade 10)

The framework allows clear definition and development of each role,…

RTP careers at UCL

a mixture of service delivery, research, innovation, and teaching activities according to your own preferences and skills, and appropriate to your level of seniority.

Research Software Developers, Research Infrastructure Developers, Research Data Stewards, and Research Data Scientists – knowing that these are fluid categories, and welcome those who cross the boundaries between these.

Bioimage Analyst careers

We consider that bioimaging involves four different types of expertise.

  • Life Scientists (e.g. Biologists) …
  • Instrumentalists (e.g. Microscopists) …
  • Developers (e.g. Image processing algorithm developers, programmers and computer scientists) …
  • Bioimage analysts are a new type of experts in BioImaging, they select appropriate image processing algorithms and their implementations, and assemble them for conducting practical Bioimage Analysis.

One of the aims of NEUBIAS is to explicitly promote the mutual communication between these four communities of experts and to establish the role of Bioimage Analysts in Life Science

Are Bioimage Analysts a specialist/domain-specific RTP?

GloBIAS survey 2024

From “GloBIAS: strengthening the foundations of BioImage Analysis”; AA Corbat, CG Walther, LR de la Ballina et al, arXiv preprint arXiv:2507.06407, 2025

How?

“Speedblogging” - an “evolved” way of writing up discussion notes from small groups

  • Split up in small groups and discuss
  • Write up notes together

How?

Speedblogging

  • choose question of interest
  • self-organise into groups
  • assign a chair and a scribe
  • chair ensures everyone gets opportunity to contribute
  • scribe takes notes

Speedblogging tips

  • approximately half discussion/half writing
  • work together
    • parallelise writing across sections
    • review each other's writing
    • some research, some writing

See https://www.software.ac.uk/guide/speed-blogging-and-tips-writing-speed-blog-post

Speedblogging template: Five important things

  • Approximately half discussion/half writing
  • Work together
    • Parallelise writing across sections
    • Review each other's writing
    • Some research, some writing
  • Limit the scope

See https://www.software.ac.uk/guide/speed-blogging-and-tips-writing-speed-blog-post

Speedblogging topics