Virchow

Zimmermann, Vorontsov et al., "Virchow2: Scaling Self-Supervised Mixed Magnification Models in Pathology", 2024. arXiv:2408.00738

Vorontsov et al., "A foundation model for clinical-grade computational pathology and rare cancers detection", Nature Medicine, 2024. DOI: 10.1038/s41591-024-03141-0

What they do

Virchow2 extends the original Virchow with domain-specific modifications to DINOv2: extended-context translation (ECT) replacing standard crop-and-resize, KDE regularization replacing KoLeo, vertical flips, and optionally removing solarization. Tissue detection also changed from a simple HSV filter (Virchow 1) to a trained fully-convolutional network (Virchow 2).

Tile size: 392×392 pixel source regions (non-overlapping). ECT extracts global tiles (224×224) and local tiles (98×98) from these regions, using aspect ratio range (0.95, 1.05) and base scale range (0.9, 1.1) — minimal resizing to avoid distorting cell morphology.
Magnifications: tiles drawn from 5×, 10×, 20×, 40× with online balancing (approximately 20%, 40%, 20%, 20% for 40×, 20×, 10×, 5× respectively)
Tissue detection: a trained fully-convolutional network with Otsu thresholding post-processing at thresholds (0.4, 0.5). This is not open-sourced.
Tissue threshold: 65% — tiles must contain at least 65% tissue by area
Augmentations: ECT (extended-context translation), horizontal and vertical flips, grayscale, color jitter. Solarization included for Virchow2, removed for Virchow2G. No stain-specific augmentation.
Architecture: ViT-H/14 with 4 registers (Virchow2), ViT-G/14 with 8 registers (Virchow2G, 1.9B params)
Training: DINOv2 with KDE replacing KoLeo, 512× V100 GPUs, batch size 4096 (Virchow2) / 3072 (Virchow2G), 2B tiles total
Data: 3.1M WSIs (H&E + IHC) from 225K patients, including 15% external consultation slides

wsistream approximation

Virchow2 uses a trained FCN for tissue detection that is not publicly available. We substitute HSVTissueDetector as a heuristic, using the same HSV ranges that Virchow 1 used and that were also adopted for the Virchow 2 ablation experiments. Virchow2's ECT augmentation reads from 392×392 source regions — to approximate this, we extract 392×392 tiles and let the training framework handle the cropping.

from wsistream.pipeline import PatchPipeline
from wsistream.backends import OpenSlideBackend
from wsistream.tissue import HSVTissueDetector
from wsistream.sampling import RandomSampler

pipeline = PatchPipeline(
    slide_paths=slide_paths,
    backend=OpenSlideBackend(),
    tissue_detector=HSVTissueDetector(
        hue_range=(90, 180),
        sat_range=(8, 255),
        val_range=(103, 255),
    ),
    sampler=RandomSampler(
        patch_size=392,  # 392x392 source regions for ECT
        num_patches=-1,
        target_mpp=0.5,  # 20x magnification (primary resolution)
        tissue_threshold=0.65,  # 65% tissue required
    ),
    slide_sampling="random",
    pool_size=8,
    patches_per_slide=100,
    cycle=True,
)

Deviations from paper

Step	Paper	wsistream	Match
Tissue detection	Trained FCN + Otsu post-processing (not open-sourced)	`HSVTissueDetector` with Virchow 1 HSV ranges	Approximate — heuristic substitute for a learned model
Tissue threshold	65%	`tissue_threshold=0.65`	Exact
Tile size	392×392 source regions	`patch_size=392`	Exact
Magnifications	Mixed 5×/10×/20×/40× with online balancing	Single `target_mpp=0.5` (20×)	Approximate — wsistream samples at one magnification; paper balances across four
ECT augmentation	Extended-context translation (crops from 392 region, minimal resize)	Not in wsistream — handled by training code	N/A
Batch construction	4096 total batch, balanced across metadata	Standard DataLoader batching	Different — no metadata-based balancing in wsistream
Extraction	Offline, all non-overlapping tiles	Online random sampling (with replacement)	Different strategy

Original Virchow pipeline (Vorontsov et al., 2024)

Virchow 1 shares the same HSV ranges but differs from Virchow 2 in several ways: tiles are 224×224 (not 392), tissue detection uses a simple HSV per-pixel filter at a fixed 16× downsample (not a trained FCN), the tissue threshold is 25% (not 65%), training is at a single magnification of 20× (0.5 µm/px), and batch construction selects 1 WSI per GPU with 256 tiles per WSI. Training used standard DINOv2 (no ECT, no KDE, no vertical flips) on 1.5M H&E-only WSIs from MSKCC with a ViT-H/14 (632M params).

To approximate the Virchow 1 pipeline, change the sampler to patch_size=224, target_mpp=0.5, tissue_threshold=0.25 and optionally set pool_size=1, patches_per_slide=256 to match the 1-WSI-per-GPU batch construction.