Midnight (Kaiko)
Karasikov et al., "Training state-of-the-art pathology foundation models with orders of magnitude less data", 2025. arXiv:2504.05186
Aben et al., "Towards Large-Scale Training of Pathology Foundation Models", 2024. arXiv:2404.15217
What they do
Midnight uses online patching — tiles are sampled uniformly at random from arbitrary positions of the WSIs during training, with no pre-extraction to disk. The online patching system was introduced in the earlier Kaiko-FM paper (Aben et al., 2024); Midnight adds per-tile HSV filtering (from Virchow) and HED color augmentation on top.
- Tile size: 256×256 pixels
- Magnifications: 0.25, 0.5, 1.0, 2.0 µm/px (≈40×, 20×, 10×, 5×). The paper does not state how magnifications are selected; we assume uniform.
- Foreground mask: the online patching pipeline uses a U-Net-based foreground segmentation model at thumbnail scale, trained on annotations provided by the Netherlands Cancer Institute. This model is not open-sourced.
- Foreground threshold: 40% — a candidate tile must have ≥40% overlap with the foreground mask
- Per-tile HSV filter: adopted from Virchow (Vorontsov et al., 2024). A tile is accepted only if ≥60% of its pixels have hue in [90, 180], saturation in [8, 255], and value in [103, 255].
- HED augmentation: color augmentations in the HED space (Tellez et al., 2019). The paper does not state the sigma value. Tellez et al. define "light" as σ=0.05 and "strong" as σ=0.2.
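- Normalization: mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5), which maps pixel values from [0, 1] to [−1, 1]. This is stated for Kaiko-FM (Aben et al., 2024) and not explicitly restated for Midnight. It is applied in the DINOv2 data transforms, not the patching pipeline.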
- Training: DINOv2 with a KDE regularizer replacing KoLeo, ViT-g/14 (1.1B params), initialized from ImageNet-pretrained DINOv2 checkpoints. 1M iterations on 32× H100 GPUs, effective batch size 768.
- DINOv2 multi-crop: tiles are fed as 256×256 to DINOv2. The paper states local and global crop output sizes of 98px and 224px for standard training (scaled to 168px and 392px for high-resolution post-training). Scale ranges, number of crops, and whether standard DINOv2 color augmentations (color jitter, Gaussian blur, solarize) are used alongside HED are not specified in the paper.
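The per-tile HSV rule in the list above can be illustrated with a minimal standard-library sketch. It is not the Virchow, Midnight, or wsistream implementation: the helper names `pixel_passes` and `tile_accepted` are hypothetical, and the 0-180 hue and 0-255 saturation/value scaling (OpenCV's 8-bit convention) is an assumption, since the convention is not stated here. A real pipeline would vectorize this over the whole tile.

```python
import colorsys

# Ranges quoted in the list above. The 0-180 hue and 0-255 saturation/value
# scaling (OpenCV's 8-bit convention) is an assumption of this sketch.
HUE_RANGE = (90, 180)
SAT_RANGE = (8, 255)
VAL_RANGE = (103, 255)
MIN_PIXEL_FRACTION = 0.6


def pixel_passes(rgb):
    """Return True if one 8-bit RGB pixel falls inside all three HSV ranges."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)  # each component in [0, 1]
    h, s, v = h * 180.0, s * 255.0, v * 255.0  # rescale to the assumed convention
    return (
        HUE_RANGE[0] <= h <= HUE_RANGE[1]
        and SAT_RANGE[0] <= s <= SAT_RANGE[1]
        and VAL_RANGE[0] <= v <= VAL_RANGE[1]
    )


def tile_accepted(pixels, min_fraction=MIN_PIXEL_FRACTION):
    """Accept a tile when at least `min_fraction` of its pixels pass."""
    pixels = list(pixels)
    passing = sum(pixel_passes(p) for p in pixels)
    return passing / len(pixels) >= min_fraction


stained = (200, 120, 180)     # pink-purple, H&E-like
background = (255, 255, 255)  # white glass: zero saturation

print(tile_accepted([stained] * 6 + [background] * 4))  # 60% stained pixels
print(tile_accepted([stained] * 5 + [background] * 5))  # 50% stained pixels
```

Under these assumptions the first toy tile sits exactly on the 60% boundary and is accepted, while the second is rejected.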
wsistream approximation
The filtering thresholds match the paper exactly: tissue_threshold=0.4 implements the 40% foreground-overlap requirement, and HSVPatchFilter implements the per-tile ≥60% HSV acceptance rule. The foreground mask itself is only approximated: we use HSVTissueDetector as a coarse thumbnail-stage heuristic, whereas the paper uses a U-Net-based model that is not publicly available.
The paper samples tiles at random rather than on a grid. MultiMagnificationSampler reproduces this by drawing random patches online across the four target MPPs. Because wsistream selects the closest existing pyramid level per slide, the match is exact only when those MPPs are present in the slide pyramid.
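The pyramid-level caveat can be made concrete with a small sketch. The helper `closest_level` and both example pyramids are hypothetical, and "closest" is taken to mean the smallest absolute MPP difference, which is an assumption rather than a documented wsistream rule.

```python
def closest_level(level_mpps, target_mpp):
    """Return (level index, level MPP, exact?) for the level nearest to target_mpp.

    "Nearest" is the smallest absolute MPP difference, an assumption of this
    sketch rather than a documented wsistream rule.
    """
    index = min(range(len(level_mpps)), key=lambda i: abs(level_mpps[i] - target_mpp))
    mpp = level_mpps[index]
    return index, mpp, abs(mpp - target_mpp) < 1e-6


TARGET_MPPS = [0.25, 0.5, 1.0, 2.0]

# Hypothetical slide scanned at 0.25 um/px with 2x steps: every target is present.
full_pyramid = [0.25, 0.5, 1.0, 2.0, 4.0]
# Hypothetical slide scanned at 0.5 um/px with 4x steps: only two targets are present.
sparse_pyramid = [0.5, 2.0, 8.0]

for name, pyramid in [("full", full_pyramid), ("sparse", sparse_pyramid)]:
    for target in TARGET_MPPS:
        level, mpp, exact = closest_level(pyramid, target)
        print(f"{name}: target {target} -> level {level} ({mpp} um/px), exact={exact}")
```

For the sparse pyramid, the 0.25 and 1.0 µm/px targets both fall back to the 0.5 µm/px level, so tiles requested at those resolutions would not match the paper's.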
from wsistream.pipeline import PatchPipeline
from wsistream.backends import OpenSlideBackend
from wsistream.tissue import HSVTissueDetector
from wsistream.filters import HSVPatchFilter
from wsistream.sampling import MultiMagnificationSampler
from wsistream.transforms import ComposeTransforms, HEDColorAugmentation

pipeline = PatchPipeline(
    slide_paths=slide_paths,
    backend=OpenSlideBackend(),
    tissue_detector=HSVTissueDetector(
        hue_range=(90, 180),
        sat_range=(8, 255),
        val_range=(103, 255),
    ),
    patch_filter=HSVPatchFilter(
        hue_range=(90, 180),
        sat_range=(8, 255),
        val_range=(103, 255),
        min_pixel_fraction=0.6,  # >=60% of pixels must pass HSV check
    ),
    sampler=MultiMagnificationSampler(
        target_mpps=[0.25, 0.5, 1.0, 2.0],  # ~40x, ~20x, ~10x, ~5x
        patch_size=256,
        num_patches=-1,  # infinite random with replacement
        tissue_threshold=0.4,  # 40% foreground required per patch
    ),
    transforms=ComposeTransforms(transforms=[
        HEDColorAugmentation(sigma=0.05),  # sigma not stated; 0.05 = Tellez "light"
    ]),
    slide_sampling="random",
    pool_size=8,
    patches_per_slide=100,
    cycle=True,
)
num_patches=-1 means infinite streaming
The sampler generates random coordinates indefinitely (with replacement). This matches Midnight's online patching: each training step draws fresh random crops. The pipeline's patches_per_slide controls how many patches are drawn before rotating to the next slide.
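As a conceptual illustration of infinite sampling with replacement under a foreground threshold, the sketch below uses a plain Python generator. It is not wsistream's implementation: `tissue_fraction`, `random_patches`, and the toy mask are hypothetical, overlap is approximated by counting thumbnail-mask cells touched by the tile, and slide rotation via patches_per_slide is left out.

```python
import itertools
import random


def tissue_fraction(mask, x, y, patch_size, downsample):
    """Fraction of thumbnail-mask cells under a level-0 tile that are foreground."""
    x0, x1 = x // downsample, (x + patch_size - 1) // downsample
    y0, y1 = y // downsample, (y + patch_size - 1) // downsample
    cells = [mask[row][col] for row in range(y0, y1 + 1) for col in range(x0, x1 + 1)]
    return sum(cells) / len(cells)


def random_patches(mask, downsample, patch_size=256, tissue_threshold=0.4, seed=None):
    """Yield random (x, y) tile origins forever, with replacement.

    Candidates whose foreground overlap is below `tissue_threshold` are rejected.
    """
    rng = random.Random(seed)
    width = len(mask[0]) * downsample
    height = len(mask) * downsample
    while True:
        x = rng.randrange(0, width - patch_size + 1)
        y = rng.randrange(0, height - patch_size + 1)
        if tissue_fraction(mask, x, y, patch_size, downsample) >= tissue_threshold:
            yield x, y


# Toy 8x8 thumbnail mask (1 = tissue) at 64x downsample: tissue fills the left half.
mask = [[1] * 4 + [0] * 4 for _ in range(8)]

# The generator never terminates on its own; the consumer decides when to stop.
batch = list(itertools.islice(random_patches(mask, downsample=64, seed=0), 100))
print(len(batch), batch[:3])
```

Because the generator has no end, the training loop (here `itertools.islice`) sets the number of draws. A mask with no tissue would make this sketch loop forever, so a real sampler needs a guard for empty masks.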
With multi-crop views
To replicate DINOv2's internal multi-crop within wsistream, move HED augmentation to shared_transforms and add views. The paper confirms local crops are 98px and global crops are 224px for standard training. Scale ranges and crop counts are not stated in the paper — DINOv2 default scales and counts are used below. Whether standard DINOv2 color augmentations (color jitter, Gaussian blur) apply alongside HED is also not specified.
from wsistream.views import ViewConfig, RandomResizedCrop

pipeline = PatchPipeline(
    slide_paths=slide_paths,
    backend=OpenSlideBackend(),
    tissue_detector=HSVTissueDetector(...),
    patch_filter=HSVPatchFilter(...),
    sampler=MultiMagnificationSampler(
        target_mpps=[0.25, 0.5, 1.0, 2.0],
        patch_size=256,
        num_patches=-1,
        tissue_threshold=0.4,
    ),
    shared_transforms=HEDColorAugmentation(sigma=0.05),  # applied once to the 256x256 tile
    views=[
        ViewConfig(
            name="global",
            crop=RandomResizedCrop(size=224, scale=(0.32, 1.0)),
            count=2,  # global_0, global_1 (DINOv2 default: 2 global crops)
        ),
        ViewConfig(
            name="local",
            crop=RandomResizedCrop(size=98, scale=(0.05, 0.32)),  # 98 = 7 × 14, a multiple of the ViT-g/14 patch size
            count=8,  # local_0 … local_7 (DINOv2 default: 8 local crops)
        ),
    ],
    slide_sampling="random",
    pool_size=8,
    patches_per_slide=100,
    cycle=True,
)
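To see what these crop settings imply for a 256×256 tile, the arithmetic can be worked through in a few lines. This is a sketch under simplifying assumptions: crops are treated as square (any aspect-ratio jitter is ignored), and `source_side` and `token_count` are hypothetical helpers, not part of wsistream or DINOv2.

```python
import math

TILE = 256   # input tile side in pixels
PATCH = 14   # ViT-g/14 patch side in pixels

views = {
    "global": {"size": 224, "scale": (0.32, 1.0)},
    "local": {"size": 98, "scale": (0.05, 0.32)},
}


def source_side(area_fraction, tile=TILE):
    """Side of a square crop covering `area_fraction` of the tile area."""
    return tile * math.sqrt(area_fraction)


def token_count(size, patch=PATCH):
    """Patch tokens for a square crop; `size` must be a multiple of `patch`."""
    if size % patch:
        raise ValueError(f"{size} is not a multiple of {patch}")
    return (size // patch) ** 2


for name, cfg in views.items():
    lo, hi = (source_side(s) for s in cfg["scale"])
    print(f"{name}: ~{lo:.0f}-{hi:.0f} px of the tile -> {cfg['size']} px, "
          f"{token_count(cfg['size'])} tokens")
```

With the DINOv2 default scales, global crops are cut from regions of roughly 145 to 256 px per side and local crops from roughly 57 to 145 px, so the smallest crops are upsampled to reach 224 px and 98 px. All four crop sizes quoted from the paper (98, 224, 168, 392) are multiples of 14.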
Per-crop augmentations
The paper does not state whether standard DINOv2 photometric augmentations (Gaussian blur, grayscale, solarization) are applied alongside HED. To add them with view-asymmetric probabilities matching DINOv2 defaults, see the DINOv2-style multi-crop example.
Deviations from paper
| Step | Paper | wsistream | Match |
|---|---|---|---|
| Foreground mask | U-Net segmentation model (trained on NKI annotations, not open-sourced) | HSVTissueDetector | Approximate: heuristic substitute for a learned model |
| Foreground threshold | 40% | tissue_threshold=0.4 | Exact |
| Per-tile HSV filter | ≥60% pixels in hue [90, 180], sat [8, 255], val [103, 255] | HSVPatchFilter(min_pixel_fraction=0.6) | Exact |
| Magnifications | 0.25, 0.5, 1.0, 2.0 µm/px | target_mpps=[0.25, 0.5, 1.0, 2.0] | Exact when the slide pyramid contains these MPPs; otherwise the nearest level is used |
| Magnification weights | Not specified in paper | Uniform (default) | Unknown |
| Sampling strategy | Online random from arbitrary positions | num_patches=-1 (infinite random with replacement) | Exact |
| Tile size | 256×256 | patch_size=256 | Exact |
| HED augmentation | HED perturbation, sigma not stated | HEDColorAugmentation(sigma=0.05) | Approximate: paper does not specify sigma |
| DINOv2 crop output sizes | local 98px, global 224px (paper-stated for standard training) | size=98 (local), size=224 (global) | Exact |
| DINOv2 scale ranges | Not specified in paper | DINOv2 defaults (0.32–1.0 global, 0.05–0.32 local) | Unverified |
| DINOv2 crop counts | Not specified in paper | 2 global + 8 local (DINOv2 default) | Unverified |
| Per-crop color augmentations | Not specified; unknown if used alongside HED | Not included in wsistream config | Unverified |
| Normalization | mean/std = 0.5 | Not part of wsistream; applied in training code | Exact for Kaiko-FM (Aben et al.); not explicitly stated for Midnight |
Earlier Kaiko-FM pipeline (Aben et al., 2024)
The Kaiko-FM paper introduced the online patching technique that Midnight builds upon. The patching core is identical: same U-Net foreground mask, same 40% threshold, same 256×256 tiles, same four magnification levels, same mean/std=0.5 normalization. The differences are all on the filtering and augmentation side. Kaiko-FM has no per-tile HSV filter and no HED augmentation; its only augmentations are the standard ones built into DINO/DINOv2 (color jitter, Gaussian blur, solarization, horizontal flip). The training data and objective also differ: Kaiko-FM trained on all 29k TCGA slides (FFPE and flash-frozen) with DINO or DINOv2, whereas Midnight uses only the ~12k FFPE subset (plus optionally 80k proprietary NKI slides) with DINOv2 + KDE.
To reproduce the Kaiko-FM pipeline with wsistream, use the same configuration as above but drop patch_filter and transforms. The same foreground-mask deviation applies.