Dataset Adapters
Dataset adapters extract metadata from slide paths. This metadata flows through the pipeline into PatchResult.slide_metadata and into pipeline.stats_dict() for logging.
TCGAAdapter
Parses TCGA barcodes from filenames. TCGA slide filenames encode patient ID, tissue source site, sample type (tumor vs. normal), and more.
Example filename: TCGA-3L-AA1B-01Z-00-DX1.8923A151-A690-40B7-9E5A-FCBEDFC2394F.svs
from wsistream.datasets import TCGAAdapter
adapter = TCGAAdapter()
meta = adapter.parse_metadata("/data/TCGA-BRCA/TCGA-3L-AA1B-01Z-00-DX1.svs")
meta.dataset_name # "TCGA"
meta.patient_id # "TCGA-3L-AA1B"
meta.cancer_type # "TCGA-BRCA" (inferred from parent directory)
meta.sample_type # "Primary Solid Tumor" (sample code 01)
meta.tissue_type # "3L" (tissue source site code)
meta.extra # {"tissue_source_site": "3L", "vial": "Z", "portion": "00",
# "slide_id": "DX1", "slide_section": "DX", "is_frozen": False,
# "sample_code": "01", "barcode": "TCGA-3L-AA1B-01Z-00-DX1"}
The cancer type is inferred from the parent directory name (e.g., TCGA-BRCA/). You can override it:
The adapter also distinguishes diagnostic slides (DX) from frozen sections (TS, BS, MS) via the is_frozen field in extra.
Downloading TCGA slides
wsistream includes helpers to download TCGA slides directly from the GDC Data Portal. This is useful when setting up a new machine or VM.
Query available slides
from wsistream.datasets import query_tcga_slides
# See what's available for two cancer types
manifest = query_tcga_slides(
cancer_types=["TCGA-BRCA", "TCGA-LUAD"],
slide_type="diagnostic", # "diagnostic" (FFPE/DX), "frozen" (TS/BS/MS), or "all"
max_per_cancer_type=10, # stratified cap per cancer type (None = all)
seed=42, # reproducible subsampling
)
# Found 20 slides (18.3 GB):
# TCGA-BRCA: 10 slides (10.2 GB)
# TCGA-LUAD: 10 slides (8.1 GB)
query_tcga_slides returns a list of file records (dicts) without downloading anything. Each record contains file_id, filename, file_size, cancer_type, md5sum, and state.
| Parameter | Type | Default | Description |
|---|---|---|---|
cancer_types |
str, list, or None | None |
TCGA project IDs (e.g., "TCGA-BRCA"). None = all projects. |
slide_type |
str | "diagnostic" |
"diagnostic" (FFPE), "frozen" (tissue sections), or "all". |
max_per_cancer_type |
int or None | None |
Stratified cap. None = return all matching slides. |
seed |
int or None | 42 |
Random seed for reproducible subsampling. |
Download slides
from wsistream.datasets import download_tcga_slides
paths = download_tcga_slides(
manifest,
output_dir="/data/tcga", # saves as /data/tcga/TCGA-BRCA/file.svs
organize_by="cancer_type", # or "flat" for all files in one directory
skip_existing=True, # skip already-downloaded files
max_workers=4, # parallel download threads
)
| Parameter | Type | Default | Description |
|---|---|---|---|
manifest |
list[dict] | required | File records returned by query_tcga_slides. |
output_dir |
str or Path | required | Root directory to save slides into. |
organize_by |
str | "cancer_type" |
"cancer_type" creates subdirectories; "flat" puts everything in output_dir/. |
skip_existing |
bool | True |
Skip files that already exist with matching size. |
max_workers |
int | 4 |
Number of parallel download threads. |
Downloads run in parallel via a thread pool with a tqdm progress bar. For very large-scale downloads (thousands of slides), consider exporting a manifest and using the GDC Data Transfer Tool -- see below.
Export a GDC manifest
For downloading many slides, the GDC Data Transfer Tool (gdc-client) is faster than HTTPS because it supports parallel connections. Export a manifest and use gdc-client:
Then from the command line:
End-to-end example
Set up a fresh VM with 10 diagnostic slides per cancer type from BRCA and LUAD, then start training:
from wsistream.datasets import query_tcga_slides, download_tcga_slides, TCGAAdapter
from wsistream.pipeline import PatchPipeline
from wsistream.backends import OpenSlideBackend
from wsistream.tissue import OtsuTissueDetector
from wsistream.sampling import RandomSampler
# Step 1: Download slides
manifest = query_tcga_slides(
cancer_types=["TCGA-BRCA", "TCGA-LUAD"],
slide_type="diagnostic",
max_per_cancer_type=10,
)
download_tcga_slides(manifest, output_dir="/data/tcga")
# Step 2: Stream patches
pipeline = PatchPipeline(
slide_paths="/data/tcga",
backend=OpenSlideBackend(),
tissue_detector=OtsuTissueDetector(),
sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
dataset_adapter=TCGAAdapter(),
pool_size=4,
patches_per_slide=100,
cycle=True,
)
for result in pipeline:
print(result.image.shape, result.slide_metadata.cancer_type)
What gets logged
When a DatasetAdapter is configured, pipeline.stats_dict() includes dataset-specific counts:
{
"pipeline/cancer_type/TCGA-BRCA": 150,
"pipeline/cancer_type/TCGA-LUAD": 120,
"pipeline/sample_type/primary_solid_tumor": 250,
"pipeline/sample_type/solid_tissue_normal": 20,
...
}
These are ready for logging to Weights & Biases or similar tools.
Writing your own
from pathlib import Path
from wsistream.datasets.base import DatasetAdapter
from wsistream.types import SlideMetadata
class CamelyonAdapter(DatasetAdapter):
def parse_metadata(self, slide_path: str) -> SlideMetadata:
filename = Path(slide_path).stem
is_tumor = "tumor" in filename.lower()
return SlideMetadata(
slide_path=slide_path,
dataset_name="Camelyon16",
sample_type="tumor" if is_tumor else "normal",
)