Big Imaging Data

Bridging Bioimaging and Research Software Engineering

Alessandro Felder, Igor Tatarnikov, Ruaridh Gollifer, Kimberly Meechan

Introduction

Research Software Engineer
Core developer of BrainGlobe
UCL BioImage Interest Group co-lead
2025 SSI fellow

Schedule

AM (technical)

Intro to Big Imaging Data concepts
- Bonus content: benchmarks!
Handling big imaging data with Python
- self-paced, collaborative learning

PM (community)

A personal perspective
What next for careers and community in bioimage analysis?

Find these slides at https://neuroinformatics.dev/slides-big-imaging-data-osw25/.

Acknowledgements

Thank you to HEFTIE textbook authors: David Stansby, Ruaridh Gollifer and Kimberly Meechan!

Hands-on materials based on the HEFTIE textbook
Introduction based on slides by Josh Moore, the HEFTIE textbook, and slides by textbook authors.
The textbook has really helped me get up to speed!

Bioimaging formats today

Data is getting larger
Data is not standardised
Community efforts to collaborate
- bioformats
- more recently, OME-zarr

TIFF

Current de facto standard (aside from proprietary formats)
- Large stacks: folders of 2D tiffs
Not designed for scientific applications
bioformats can help convert from proprietary

Limitations

Data does not fit into memory.
Would be nice to compress data,
- But need to uncompress whole files to read a few pixels.
- Uncompressing can be slow

Solution: OME-zarr (spoiler!)

community support¹
- helps with standardisation
uses “chunked storage”
uses “pyramidal file format”
- help with data size

Let’s dig deeper

Chunked storage and pyramidal file formats

Folder of 2D tiffs

A form of chunked storage

Folder of 2D tiffs

Can access pixels in a few planes without loading whole image into memory
Still only 4% of read data actually needed

Folder of 2D tiffs

Even more limited in some situations:

Chunked storage

Choose chunk size when we create the overall image “file”
Save each chunk into a separate, compressed, file
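As a rough illustration of the two steps above, the sketch below splits a toy 2D image into chunks and compresses each chunk on its own. It is not how zarr is implemented: `write_chunks` is a hypothetical helper, a dict stands in for one file per chunk, `zlib` stands in for whichever compressor you choose, and edge handling is simplified. The dot-separated keys only mimic the way zarr v2 names chunk files.

```python
import itertools
import zlib


def write_chunks(data, shape, chunk_shape):
    """Split a flat, row-major 2D byte image into chunks and compress each separately.

    Returns a dict mapping a chunk key such as "0.1" to compressed bytes,
    standing in for one file per chunk on disk.
    """
    rows, cols = shape
    chunk_rows, chunk_cols = chunk_shape
    store = {}
    for r0, c0 in itertools.product(range(0, rows, chunk_rows), range(0, cols, chunk_cols)):
        # Gather the rows of this chunk (in this sketch, edge chunks are simply smaller).
        chunk = b"".join(
            data[r * cols + c0 : r * cols + min(c0 + chunk_cols, cols)]
            for r in range(r0, min(r0 + chunk_rows, rows))
        )
        store[f"{r0 // chunk_rows}.{c0 // chunk_cols}"] = zlib.compress(chunk)
    return store


# A toy 6 x 8 image of 8-bit pixels, stored in 4 x 4 chunks.
shape = (6, 8)
image = bytes(range(48))
store = write_chunks(image, shape, (4, 4))
print(sorted(store))
```

Reading a pixel from the bottom-right corner then only requires decompressing the chunk stored under key `"1.1"`.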

Chunked storage

Note: This also favours parallel reading and writing of files!
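Because every chunk is compressed independently, chunks can be read by separate workers without coordination. A minimal sketch of that idea, assuming a dict of `zlib`-compressed chunks as a stand-in for chunk files (the `store` and `read_chunk` names are made up for illustration):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Four independently compressed chunks, standing in for four chunk files.
store = {f"0.{i}": zlib.compress(bytes([i]) * 1000) for i in range(4)}


def read_chunk(key):
    """Decompress one chunk; no other chunk is touched."""
    return key, zlib.decompress(store[key])


# Each worker thread handles a different chunk.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = dict(pool.map(read_chunk, store))

print({key: len(value) for key, value in chunks.items()})
```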

Chunked storage

Allows reading and decompressing fewer pixels when accessing sub arrays
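To get a feel for the saving, the sketch below counts how many voxels must be read for a sub-array request, given that every chunk overlapping the request is read in full. The sizes are assumptions chosen for illustration (a region of 10 planes by 200 by 200 voxels inside a volume with 1000 x 1000 planes), and the hypothetical `read_cost` helper treats all chunks as full size.

```python
import math


def read_cost(roi_start, roi_stop, chunk_shape):
    """Return (chunks touched, voxels read, voxels needed) for a sub-array request.

    roi_start is inclusive and roi_stop exclusive, one value per axis.
    Every chunk that overlaps the request has to be read in full.
    """
    chunks_touched = 1
    needed = 1
    for start, stop, chunk in zip(roi_start, roi_stop, chunk_shape):
        first = start // chunk
        last = (stop - 1) // chunk
        chunks_touched *= last - first + 1
        needed *= stop - start
    read = chunks_touched * math.prod(chunk_shape)
    return chunks_touched, read, needed


roi_start, roi_stop = (100, 400, 400), (110, 600, 600)

# A folder of 2D tiffs behaves like chunks that are one whole plane each.
planes = read_cost(roi_start, roi_stop, (1, 1000, 1000))
cubes = read_cost(roi_start, roi_stop, (64, 64, 64))

for name, (touched, read, needed) in [("planes", planes), ("64^3 chunks", cubes)]:
    print(f"{name}: {touched} chunks, {needed / read:.1%} of the voxels read are needed")
```

With these assumed sizes, whole planes make 4% of the read data useful, while cubic chunks roughly double that and read less than half as many voxels.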

Chunked storage

zarr is an open-source specification for how large N-dimensional arrays should be stored.

Lots of choices

where should we store the data?
- on the cloud or locally?
how big should we make the chunks?
how should we compress each chunk?
Criteria:
- reading speed!
  - writing speed maybe less important
- size on disk
Luckily, Ruaridh and Kimberly can help!

Benchmarking zarr

Open-source, freely available zarr-benchmarks repository
Python-based, using pytest-benchmark
SOON - a report summarising findings with plots, but still in progress

Benchmarking zarr

3 images: heart (335 MB), dense segmentation, sparse segmentation
- Heart: HiP-CT scan of a heart from the Human Organ Atlas
- Dense: segmented neurons from electron microscopy
- Sparse: A few select segmented neurons from electron microscopy
Images used are on zenodo
All sized 806 x 629 x 629
All 16-bit unsigned integer

Benchmarking zarr

Choice of compression library affects read time more than compression level.

Benchmarking zarr

Choice of compression library affects write time + higher compression levels take longer to write.

Benchmarking zarr

Tensorstore is a lot faster at reading data than zarr-python.

Benchmarking zarr

Large chunks compress worse + increase memory usage

Benchmarking zarr

Larger chunks are faster to read/write overall (+ make fewer files)

Benchmarking zarr

Compression ratio vs write time plot for heart image (left) and dense segmentation (right)

Segmentations normally compress a lot more (see compression ratio, y axis much higher values)
You may have to use different settings depending on your image.
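The difference between image types can be reproduced on toy data. The sketch below is not part of zarr-benchmarks: it uses synthetic bytes and Python's built-in `zlib` as a stand-in for the compressors benchmarked above, so only the trend is meaningful, not the numbers.

```python
import random
import zlib

rng = random.Random(0)
n = 100_000

# Stand-in for a noisy intensity image: neighbouring bytes are unrelated.
noisy = bytes(rng.randrange(256) for _ in range(n))

# Stand-in for a sparse segmentation: mostly background (0) with a few labelled runs.
labels = bytearray(n)
for label in range(1, 6):
    start = rng.randrange(n - 500)
    labels[start:start + 500] = bytes([label]) * 500
labels = bytes(labels)


def compression_ratio(data, level):
    """Uncompressed size divided by compressed size."""
    return len(data) / len(zlib.compress(data, level))


for name, data in [("noisy", noisy), ("labels", labels)]:
    for level in (1, 9):
        print(name, "level", level, "ratio", round(compression_ratio(data, level), 1))
```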

Benchmarking zarr conclusions

Approximately “optimal” choices*

Use tensorstore library for fastest read/write
blosc-zstd is a good default compressor
Use a high compression level to get the smallest file size (usually means longer write times, but not much effect on read time)
Smaller chunks = smaller overall file size + less memory usage
Larger chunks = faster read/write times + fewer files
It’s a balance! People often use 64x64x64 or 128x128x128

This is different for different data! Worth testing some small samples of your own data with different settings
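To make the trade-off concrete, here is the arithmetic for the benchmark image size (806 x 629 x 629, 16-bit) with the two common chunk sizes. `chunk_stats` is a hypothetical helper, and the byte counts are before compression.

```python
import math

image_shape = (806, 629, 629)   # benchmark image size from the slides
bytes_per_voxel = 2             # 16-bit unsigned integer


def chunk_stats(shape, chunk_edge):
    """Number of chunks and uncompressed bytes per full chunk for cubic chunks."""
    n_chunks = math.prod(math.ceil(size / chunk_edge) for size in shape)
    chunk_bytes = chunk_edge ** 3 * bytes_per_voxel
    return n_chunks, chunk_bytes


for edge in (64, 128):
    n_chunks, chunk_bytes = chunk_stats(image_shape, edge)
    print(f"{edge}^3 chunks: {n_chunks} chunks, {chunk_bytes / 2**20:.1f} MiB each before compression")
```

Going from 64 to 128 voxels per side cuts the number of chunks from 1300 to 175, but each chunk held in memory grows from 0.5 MiB to 4 MiB.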

Pyramidal file formats

Chunks help with reading subsets of pixels, but what if you want to view the image as a whole?

Pyramidal file formats

Idea: multiscale images!
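A small sketch of what a multiscale pyramid costs, assuming each level halves every axis (other downsampling factors are possible) and that we stop once the whole level fits within 64 voxels per side. Both choices, and the `pyramid_shapes` helper, are assumptions for illustration.

```python
import math


def pyramid_shapes(shape, min_size=64):
    """Shapes of successive levels, halving each axis until every axis is at most min_size."""
    levels = [tuple(shape)]
    while max(levels[-1]) > min_size:
        levels.append(tuple(math.ceil(size / 2) for size in levels[-1]))
    return levels


levels = pyramid_shapes((806, 629, 629))
base_voxels = math.prod(levels[0])
extra = sum(math.prod(level) for level in levels[1:]) / base_voxels

for level in levels:
    print(level)
print(f"extra voxels for the lower-resolution levels: {extra:.1%}")
```

For a 3D image downsampled this way, all the lower-resolution levels together add roughly one seventh to the voxel count, so a whole-image overview is cheap compared with the full-resolution data.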

OME zarr

Even better idea: multiscale images that follow a standard.

Further topics

tools are in flux
- zarr v2 versus v3
- which tools are compatible with which?!
sharding
- filesystems don’t like many small files!
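Sharding, which arrived with zarr v3, groups several chunks into one file. The sketch below only counts files for the benchmark image size, with an assumed layout of 4 x 4 x 4 chunks per shard; `file_count` is a hypothetical helper and metadata files are ignored.

```python
import math


def file_count(shape, chunk_edge, chunks_per_shard_edge=1):
    """Number of data files when cubic chunks are grouped into cubic shards.

    chunks_per_shard_edge=1 means no sharding: one file per chunk.
    """
    shard_edge = chunk_edge * chunks_per_shard_edge
    return math.prod(math.ceil(size / shard_edge) for size in shape)


shape = (806, 629, 629)
unsharded = file_count(shape, 64)     # one file per 64^3 chunk
sharded = file_count(shape, 64, 4)    # 4 x 4 x 4 chunks per shard
print(unsharded, "files without sharding,", sharded, "files with sharding")
```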

Plan

handling OME zarr data with Python
get your hands dirty
- gain an intuition

Default adventure

convert a tiff stack to OME zarr
apply a threshold to it
colour thresholded data chunk by chunk

“The colourful mouse bone challenge”

The colourful mouse bone challenge

Label a mouse tibia by chunks
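The thresholding and chunk-by-chunk colouring steps can be sketched on a tiny 2D image without any dependencies. This is not the tutorial code: the hypothetical `threshold_and_label_by_chunk` function uses nested lists where the tutorial would use chunked arrays, and the whole image sits in memory here.

```python
def threshold_and_label_by_chunk(image, chunk_size, threshold):
    """Threshold a 2D image and give foreground pixels a label per chunk.

    Background pixels become 0; foreground pixels get 1 + the index of their chunk,
    so each chunk shows up in a different colour in a label viewer.
    """
    rows, cols = len(image), len(image[0])
    chunks_per_row = -(-cols // chunk_size)  # ceiling division
    labels = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, chunk_size):
        for c0 in range(0, cols, chunk_size):
            chunk_id = (r0 // chunk_size) * chunks_per_row + c0 // chunk_size + 1
            # With a chunked array, only this block would need to be read and written.
            for r in range(r0, min(r0 + chunk_size, rows)):
                for c in range(c0, min(c0 + chunk_size, cols)):
                    if image[r][c] > threshold:
                        labels[r][c] = chunk_id
    return labels


image = [
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
labelled = threshold_and_label_by_chunk(image, chunk_size=2, threshold=5)
for row in labelled:
    print(row)
```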

Self-paced learning

encouraged to create your own adventure
- work on your own data
- some time to report back
encouraged to get involved with others
- lots of diverse expertise in the room

Some tips

run on small data first
think about
- do you need to update array contents?
- are you overwriting existing contents?
- check .zattrs and .zgroup files
- check folders and subfolders
ask for help
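One way to check the metadata files and folder layout is to walk the store and list what each metadata file contains. The sketch below assumes a zarr v2 layout (`.zgroup`, `.zattrs` and `.zarray` files holding JSON) and builds a tiny hand-written fake store so that it runs anywhere; `summarise_store` is a hypothetical helper, and you would point it at your own store folder instead.

```python
import json
import os
import tempfile

METADATA_FILES = {".zgroup", ".zattrs", ".zarray"}


def summarise_store(root):
    """Map each zarr v2 metadata file (path relative to root) to its top-level JSON keys."""
    summary = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name in METADATA_FILES:
                path = os.path.join(folder, name)
                with open(path) as handle:
                    keys = sorted(json.load(handle))
                summary[os.path.relpath(path, root)] = keys
    return summary


# Hand-written stand-in for a store: a group holding one array in subfolder "0".
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "0"))
    with open(os.path.join(root, ".zgroup"), "w") as handle:
        json.dump({"zarr_format": 2}, handle)
    with open(os.path.join(root, ".zattrs"), "w") as handle:
        json.dump({"multiscales": []}, handle)
    with open(os.path.join(root, "0", ".zarray"), "w") as handle:
        json.dump({"shape": [4, 4], "chunks": [2, 2]}, handle)
    summary = summarise_store(root)

for path, keys in sorted(summary.items()):
    print(path, keys)
```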

You may additionally need a reader plugin for your specific image data, e.g. if you have sldy images

pip install bioio_sldy

A warning

Run on small data first!

Installation

conda create -n big-imaging-data-tutorial python=3.13 -y

conda activate big-imaging-data-tutorial

pip install "matplotlib" "jupyterlab" "numcodecs==0.15.1" "numpy==2.3.2" "zarr==2.18.7" "pydantic-zarr==0.7.0" "ome-zarr-models==0.1.10" "joblib==1.5.1" "tifffile[zarr]<2025.5.21" "bioio" "bioio-tifffile"

pip install "napari[all]" "napari-ome-zarr"

Installation

git clone https://github.com/neuroinformatics-unit/slides-big-imaging-data-osw25.git

or download from https://github.com/neuroinformatics-unit/slides-big-imaging-data-osw25/blob/main/tutorials/ and then run

jupyter lab

from the tutorials/ folder in your terminal.

Ideas

Adapt code to visualise/threshold remote data
- Remote data available at https://idr.github.io/ome-ngff-samples/
- Parallelise operations over chunks for large data
  - Hard: what if you need info from more than one chunk?
Benchmark-related
- run benchmarks on your own data
- vary chunksize/compression level etc.
Skeletal Biology
- segment chunkwise into spongy (<50% bone per chunk) and compact (>50% bone per chunk)
Hard: can we get the tutorial to run with zarr v3 and the latest OME-zarr?
Own ideas?!
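For the skeletal biology idea, a possible starting point is a function that classifies one thresholded chunk by its bone fraction. This is a sketch with a hypothetical `classify_chunk` name; the slides do not say what to do at exactly 50%, so treating that case as spongy is an assumption.

```python
def classify_chunk(binary_chunk):
    """Classify a thresholded 2D chunk (1 = bone, 0 = background) by its bone fraction."""
    voxels = [value for row in binary_chunk for value in row]
    bone_fraction = sum(voxels) / len(voxels)
    # Assumption: exactly 50% bone counts as spongy.
    return "compact" if bone_fraction > 0.5 else "spongy"


mostly_bone = [[1, 1], [1, 0]]
mostly_empty = [[1, 0], [0, 0]]
print(classify_chunk(mostly_bone), classify_chunk(mostly_empty))
```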

Lunch break!

What next for community and careers in bioimage analysis?

Introductory context

setting the scene from my perspective
please disagree

Big imaging data

e.g. the IARPA MICrONS dataset

Big imaging data

This IARPA MICrONS dataset spans a 1.4mm x .87mm x .84 mm volume of cortex in a P87 mouse. The dataset was imaged using two-photon microscopy, microCT, and serial electron microscopy, and then reconstructed using a combination of AI and human proofreading.

Gaining insight from large imaging data requires diverse technical expertise
Collaboration and technical skills are essential!

Postdoc careers

Madeline Lancaster, a neuroscientist at the University of Cambridge, UK, can relate to that. In July, she received a total of 36 applications for a postdoctoral position in her laboratory, many fewer than the couple of hundred that she originally expected. “I had been nervous that I wouldn’t be able to go through all of the applications,” she says. Those 36 didn’t lead to a single appointment. “I still have not filled the position,” she says. “There seems to be lots of competition for strong candidates.” ¹

Postdoc careers

Those who stayed and landed a coveted faculty position were more likely to have had a highly cited paper, changed their research topic between their PhD and postdoc, or moved abroad after receiving their doctorate. ¹

My career in selected conferences

SSI Collaborations Workshop

2015 (Oxford) (Hackday)
2017 (Leeds)
2018 (Cardiff) (Hackday)
2019 (Loughborough)
2020 (online)
2023 (Manchester)
2024 (Warwick)
2025 (Stirling)

Imaging Conferences

NEUBIAS 2018
Microscience Microscopy Congress 2019 🧑🏻‍💻
CBIAS 2022
CBIAS 2023
GloBIAS 2024
CBIAS 2024 🧑🏻‍🎓
CBIAS 2025 - co-organiser

My SSI fellowship

Peter Sieling, CC BY 2.0.

A bridge between Bioimaging and Research Software Engineering

What bridge?

Why build the bridge

coordinate advocacy toward university leadership/policy makers/funders
best practice knowledge exchange
avoid reinventing the career path wheel
involve more voices in co-design of careers and community

RTP career paths at UCL

RTP=“(digital) Research Technology Professional”

The job framework, consists of several individual job descriptions ranging from:

Assistant Research Software Engineer (UCL Grade 6)
Research Software Engineer (UCL Grade 7)
Senior Research Software Engineer (UCL Grade 8)
Principal Research Software Engineer (UCL Grade 9)
Head Research Software Engineer (UCL Grade 10)

The framework allows clear definition and development of each role,…

RTP careers at UCL

a mixture of service delivery, research, innovation, and teaching activities according to your own preferences and skills, and appropriate to your level of seniority.

Research Software Developers, Research Infrastructure Developers, Research Data Stewards, and Research Data Scientists – knowing that these are fluid categories, and welcome those who cross the boundaries between these.

Bioimage Analyst careers

We consider that bioimaging involves four different types of expertise.

Life Scientists (e.g. Biologists) …

Instrumentalists (e.g. Microscopists) …

Developers (e.g. Image processing algorithm developers, programmers and computer scientists) …

Bioimage analysts are a new type of experts in BioImaging, they select appropriate image processing algorithms and their implementations, and assemble them for conducting practical Bioimage Analysis.

One of the aim of NEUBIAS is to explicitly promote the mutual communication between these four communities of experts and to establish the role of Bioimage Analysts in Life Science

Are Bioimage Analysts a specialist/domain-specific RTP?

GloBIAS survey 2024

From “GloBIAS: strengthening the foundations of BioImage Analysis”; AA Corbat, CG Walther, LR de la Ballina et al, arXiv preprint arXiv:2507.06407, 2025

How?

“Speedblogging” - an “evolved” way of writing up discussion notes from small groups
- Split up in small groups and discuss
- Write up notes together

How?

Speedblogging
- choose question of interest
- self-organise into groups
- assign a chair and a scribe
- chair ensures everyone gets opportunity to contribute
- scribe takes notes

Speedblogging tips

approximately half discussion/half writing
work together
- parallelise writing across sections
- review each other's writing
- some research, some writing

https://www.software.ac.uk/guide/speed-blogging-and-tips-writing-speed-blog-post

Speedblogging template: Five important things

Approximately half discussion/half writing
Work together
- Parallelise writing across sections
- Review each other's writing
- Some research, some writing
Limit the scope

https://www.software.ac.uk/guide/speed-blogging-and-tips-writing-speed-blog-post
