Skip to content

Dataset Adapters

Dataset adapters extract metadata from slide paths. This metadata flows through the pipeline into PatchResult.slide_metadata and into pipeline.stats_dict() for logging.

TCGAAdapter

Parses TCGA barcodes from filenames. TCGA slide filenames encode patient ID, tissue source site, sample type (tumor vs. normal), and more.

Example filename: TCGA-3L-AA1B-01Z-00-DX1.8923A151-A690-40B7-9E5A-FCBEDFC2394F.svs

from wsistream.datasets import TCGAAdapter

adapter = TCGAAdapter()
meta = adapter.parse_metadata("/data/TCGA-BRCA/TCGA-3L-AA1B-01Z-00-DX1.svs")

meta.dataset_name   # "TCGA"
meta.patient_id     # "TCGA-3L-AA1B"
meta.cancer_type    # "TCGA-BRCA" (inferred from parent directory)
meta.sample_type    # "Primary Solid Tumor" (sample code 01)
meta.tissue_type    # "3L" (tissue source site code)
meta.extra          # {"tissue_source_site": "3L", "vial": "Z", "portion": "00",
                    #  "slide_id": "DX1", "slide_section": "DX", "is_frozen": False,
                    #  "sample_code": "01", "barcode": "TCGA-3L-AA1B-01Z-00-DX1"}

The cancer type is inferred from the parent directory name (e.g., TCGA-BRCA/). You can override it:

adapter = TCGAAdapter(cancer_type="LUAD")

The adapter also distinguishes diagnostic slides (DX) from frozen sections (TS, BS, MS) via the is_frozen field in extra.

Downloading TCGA slides

wsistream includes helpers to download TCGA slides directly from the GDC Data Portal. This is useful when setting up a new machine or VM.

Query available slides

from wsistream.datasets import query_tcga_slides

# See what's available for two cancer types
manifest = query_tcga_slides(
    cancer_types=["TCGA-BRCA", "TCGA-LUAD"],
    slide_type="diagnostic",       # "diagnostic" (FFPE/DX), "frozen" (TS/BS/MS), or "all"
    max_per_cancer_type=10,        # stratified cap per cancer type (None = all)
    seed=42,                       # reproducible subsampling
)
# Found 20 slides (18.3 GB):
#   TCGA-BRCA: 10 slides (10.2 GB)
#   TCGA-LUAD: 10 slides (8.1 GB)

query_tcga_slides returns a list of file records (dicts) without downloading anything. Each record contains file_id, filename, file_size, cancer_type, md5sum, and state.

Parameter Type Default Description
cancer_types str, list, or None None TCGA project IDs (e.g., "TCGA-BRCA"). None = all projects.
slide_type str "diagnostic" "diagnostic" (FFPE), "frozen" (tissue sections), or "all".
max_per_cancer_type int or None None Stratified cap. None = return all matching slides.
seed int or None 42 Random seed for reproducible subsampling.

Download slides

from wsistream.datasets import download_tcga_slides

paths = download_tcga_slides(
    manifest,
    output_dir="/data/tcga",       # saves as /data/tcga/TCGA-BRCA/file.svs
    organize_by="cancer_type",     # or "flat" for all files in one directory
    skip_existing=True,            # skip already-downloaded files
    max_workers=4,                 # parallel download threads
)
Parameter Type Default Description
manifest list[dict] required File records returned by query_tcga_slides.
output_dir str or Path required Root directory to save slides into.
organize_by str "cancer_type" "cancer_type" creates subdirectories; "flat" puts everything in output_dir/.
skip_existing bool True Skip files that already exist with matching size.
max_workers int 4 Number of parallel download threads.

Downloads run in parallel via a thread pool with a tqdm progress bar. For very large-scale downloads (thousands of slides), consider exporting a manifest and using the GDC Data Transfer Tool -- see below.

Export a GDC manifest

For downloading many slides, the GDC Data Transfer Tool (gdc-client) is faster than HTTPS because it supports parallel connections. Export a manifest and use gdc-client:

from wsistream.datasets import save_manifest

save_manifest(manifest, "my_manifest.tsv")

Then from the command line:

gdc-client download -m my_manifest.tsv -d /data/tcga

End-to-end example

Set up a fresh VM with 10 diagnostic slides per cancer type from BRCA and LUAD, then start training:

from wsistream.datasets import query_tcga_slides, download_tcga_slides, TCGAAdapter
from wsistream.pipeline import PatchPipeline
from wsistream.backends import OpenSlideBackend
from wsistream.tissue import OtsuTissueDetector
from wsistream.sampling import RandomSampler

# Step 1: Download slides
manifest = query_tcga_slides(
    cancer_types=["TCGA-BRCA", "TCGA-LUAD"],
    slide_type="diagnostic",
    max_per_cancer_type=10,
)
download_tcga_slides(manifest, output_dir="/data/tcga")

# Step 2: Stream patches
pipeline = PatchPipeline(
    slide_paths="/data/tcga",
    backend=OpenSlideBackend(),
    tissue_detector=OtsuTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
    dataset_adapter=TCGAAdapter(),
    pool_size=4,
    patches_per_slide=100,
    cycle=True,
)

for result in pipeline:
    print(result.image.shape, result.slide_metadata.cancer_type)

What gets logged

When a DatasetAdapter is configured, pipeline.stats_dict() includes dataset-specific counts:

{
    "pipeline/cancer_type/TCGA-BRCA": 150,
    "pipeline/cancer_type/TCGA-LUAD": 120,
    "pipeline/sample_type/primary_solid_tumor": 250,
    "pipeline/sample_type/solid_tissue_normal": 20,
    ...
}

These are ready for logging to Weights & Biases or similar tools.

Writing your own

from pathlib import Path
from wsistream.datasets.base import DatasetAdapter
from wsistream.types import SlideMetadata

class CamelyonAdapter(DatasetAdapter):
    def parse_metadata(self, slide_path: str) -> SlideMetadata:
        filename = Path(slide_path).stem
        is_tumor = "tumor" in filename.lower()
        return SlideMetadata(
            slide_path=slide_path,
            dataset_name="Camelyon16",
            sample_type="tumor" if is_tumor else "normal",
        )