wsistream

Modular online patch streaming from whole-slide images for training pathology foundation models.

wsistream delivers patches directly from WSIs during training -- no pre-extraction to disk, no extra storage for extracted patches. Every component is pluggable: backends, tissue detectors, samplers, filters, transforms, views, and dataset adapters.

Why online patching?

Pathology foundation model (FM) training traditionally pre-extracts millions of patches to disk before training begins. For example, UNI extracted ~100 million patches from 100K slides (Chen et al., 2024), which requires substantial storage and a separate preprocessing phase.

Online patching, proposed for FM pre-training by Kaiko (Aben et al., 2024) and refined in Midnight (Karasikov et al., 2025), eliminates the pre-extraction step by sampling patches on-the-fly from the original slide files. Patches are read, filtered for tissue, augmented, and fed to the model during training.
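To make the idea concrete, here is a minimal sketch of sampling patch locations on the fly with tissue-based rejection. It is not wsistream's implementation: the function name `sample_tissue_patches`, the `min_tissue` threshold, and the toy 0/1 mask standing in for a slide's tissue mask are all assumptions made for illustration.

```python
import random


def sample_tissue_patches(tissue_mask, patch_size, num_patches,
                          min_tissue=0.5, seed=0, max_tries=10_000):
    """Rejection-sample patch origins whose tissue fraction is >= min_tissue.

    tissue_mask is a 2D list of 0/1 values (hypothetical stand-in for a
    thumbnail-level tissue mask). Returns a list of (x, y, tissue_fraction).
    """
    rng = random.Random(seed)
    height, width = len(tissue_mask), len(tissue_mask[0])
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == num_patches:
            break
        # Draw a candidate top-left corner that keeps the patch inside the mask.
        x = rng.randrange(0, width - patch_size + 1)
        y = rng.randrange(0, height - patch_size + 1)
        covered = sum(
            tissue_mask[row][col]
            for row in range(y, y + patch_size)
            for col in range(x, x + patch_size)
        )
        fraction = covered / (patch_size * patch_size)
        # Reject candidates that are mostly background.
        if fraction >= min_tissue:
            accepted.append((x, y, fraction))
    return accepted


# Toy 16x16 "slide": tissue occupies the right half only.
toy_mask = [[1 if col >= 8 else 0 for col in range(16)] for _ in range(16)]
patches = sample_tissue_patches(toy_mask, patch_size=4, num_patches=5)
```

Nothing is written to disk here: each accepted coordinate would be read from the slide, augmented, and handed to the model immediately.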

See Online Patching for a detailed discussion.

What's included

Two WSI backends: OpenSlide (C-based) and TiffSlide (pure Python, cloud-compatible)
Four tissue detectors: Otsu, HSV, CLAM, and a combined detector
Three samplers: random (with rejection sampling), grid, and multi-magnification
Per-tile quality filtering: HSV pixel-based patch acceptance (Midnight-style)
Augmentations: HED stain augmentation, random flip/rotate, resize, normalize, and an albumentations wrapper
Multi-view outputs: multi-view augmentation (SimCLR/BYOL/MoCo-style), DINO-style multi-crop, and same-location magnification views
Pool-based slide interleaving: multiple slides open simultaneously with round-robin patch delivery
Infinite streaming: cycle=True for step-based training without epochs
Dataset adapters: TCGA barcode parsing with automatic metadata extraction
Pipeline statistics: tissue fractions, magnification counts, error tracking -- ready for logging (e.g., Weights & Biases)
PyTorch compatible: built-in WsiStreamDataset (IterableDataset), MonitoredLoader for throughput tracking, and DDP slide partitioning
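
Pool-based interleaving and infinite streaming can be pictured with the sketch below. It is an illustration under stated assumptions, not the library's code: `interleave_slides` is a hypothetical helper, and the exact semantics it gives to `pool_size`, `patches_per_slide`, and `cycle` are guesses based on the feature descriptions above.

```python
from collections import deque
from itertools import islice


def interleave_slides(slide_ids, pool_size, patches_per_slide, cycle=False):
    """Yield (slide_id, patch_index) pairs round-robin over a pool of open slides."""
    slide_ids = list(slide_ids)
    pending = deque(slide_ids)   # slides not yet opened
    pool = deque()               # open slides as [slide_id, patches_emitted]
    while True:
        # Top up the pool so that up to pool_size slides are open at once.
        while len(pool) < pool_size and pending:
            pool.append([pending.popleft(), 0])
        if not pool:
            if cycle and slide_ids:
                # Assumed cycle behaviour: restart from the full slide list.
                pending.extend(slide_ids)
                continue
            return
        entry = pool.popleft()
        yield entry[0], entry[1]
        entry[1] += 1
        # A slide leaves the pool once its patch budget is spent.
        if entry[1] < patches_per_slide:
            pool.append(entry)


one_pass = list(interleave_slides(["A", "B", "C"], pool_size=2, patches_per_slide=2))
# With cycle=True the generator never ends, so take a fixed number of items.
first_eight = list(islice(
    interleave_slides(["A", "B", "C"], pool_size=2, patches_per_slide=2, cycle=True), 8))
```

Interleaving several open slides means consecutive patches come from different slides, which avoids feeding the model long runs of patches from a single case.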

Quick example

from wsistream.pipeline import PatchPipeline
from wsistream.backends import OpenSlideBackend
from wsistream.tissue import OtsuTissueDetector
from wsistream.filters import HSVPatchFilter
from wsistream.sampling import RandomSampler
from wsistream.transforms import (
    ComposeTransforms, HEDColorAugmentation, RandomFlipRotate, ResizeTransform,
)

pipeline = PatchPipeline(
    slide_paths="/path/to/slides",  # directory or list of files
    backend=OpenSlideBackend(),
    tissue_detector=OtsuTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, seed=42),
    patch_filter=HSVPatchFilter(),
    transforms=ComposeTransforms(transforms=[
        HEDColorAugmentation(sigma=0.05),
        RandomFlipRotate(),
        ResizeTransform(target_size=224),
    ]),
    pool_size=1,
    patches_per_slide=100,
)

for result in pipeline:
    image = result.image              # numpy array, (224, 224, 3), uint8
    coord = result.coordinate         # PatchCoordinate (x, y, level, mpp, ...)
    tissue = result.tissue_fraction   # float in [0, 1]
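
For step-based training, the results from a loop like the one above are typically grouped into batches for a fixed number of steps. The sketch below shows one way to do that with plain Python. `FakeResult` and `fake_stream` are hypothetical stand-ins for the pipeline and its results (only the `image` and `tissue_fraction` fields shown above are mimicked), and `batched_steps` is not part of wsistream.

```python
from dataclasses import dataclass
from itertools import count, islice


@dataclass
class FakeResult:
    """Hypothetical stand-in for a pipeline result; only the fields used below."""
    image: list
    tissue_fraction: float


def fake_stream():
    """Endless stream of fake results, standing in for an infinite pipeline."""
    for i in count():
        yield FakeResult(image=[i], tissue_fraction=(i % 10) / 10)


def batched_steps(stream, batch_size, max_steps, min_tissue=0.0):
    """Yield up to max_steps batches, skipping results below min_tissue."""
    kept = (r for r in stream if r.tissue_fraction >= min_tissue)
    for _ in range(max_steps):
        batch = list(islice(kept, batch_size))
        if not batch:
            return  # the stream ended early
        yield batch


batches = list(batched_steps(fake_stream(), batch_size=4, max_steps=3, min_tissue=0.5))
```

In a real training loop each batch would be stacked into an array or tensor before the forward pass; that step is omitted here to keep the sketch dependency-free.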

Next steps

Getting Started -- installation and first pipeline
Online Patching -- the core concept
Architecture -- how the pipeline works