Download on Hugging Face. The dataset is described in the paper "A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning".
FOMO60K
is a subset of FOMO300K that includes 60,529 MRI scans collected from 13,900 MRI sessions across 11,187 subjects, aggregated from 16 publicly available datasets. In contrast to FOMO300K, all scans in FOMO60K were affinely co-registered within each session to the image with the highest spatial resolution. Additionally, each scan was either skull-stripped or defaced (details provided below). Table 3 summarizes the source datasets, including the number of subjects, sessions, and scans, as well as the MRI sequence types, applied preprocessing steps, and dataset licenses.
The preprocessing pipeline for FOMO60K consisted of three main stages: (1) reorienting images to RAS orientation (as performed in FOMO300K), (2) affine co-registration, and (3) skull stripping. First, all scans were reoriented to RAS and affinely co-registered using the mri_coreg command from FreeSurfer 7.4.1, with default parameters. Within each MRI session, scans were aligned to the image with the highest spatial resolution in order to preserve maximal anatomical detail.
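The "align to the image with the highest spatial resolution" rule in step (2) can be sketched as picking the scan with the smallest voxel volume. This is a minimal illustration, not the FOMO code: the scan names and spacings below are made up, and in practice the spacings would come from each NIfTI header (e.g. nibabel's img.header.get_zooms()).

```python
def pick_reference(scans: dict[str, tuple[float, float, float]]) -> str:
    """Return the scan whose voxel volume (product of spacings, mm^3) is smallest,
    i.e. the highest-resolution scan, used as the within-session registration target."""
    def voxel_volume(spacing):
        sx, sy, sz = spacing
        return sx * sy * sz
    return min(scans, key=lambda name: voxel_volume(scans[name]))

# Illustrative session: a 1 mm isotropic T1w beats thick-slice T2w/FLAIR
session = {
    "sub-01_ses-01_T1w.nii.gz":   (1.0, 1.0, 1.0),
    "sub-01_ses-01_T2w.nii.gz":   (0.9, 0.9, 3.0),
    "sub-01_ses-01_FLAIR.nii.gz": (1.0, 1.0, 5.0),
}
print(pick_reference(session))  # -> sub-01_ses-01_T1w.nii.gz
```

The other scans in the session would then be registered to this reference, e.g. via FreeSurfer's mri_coreg as described above.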
FOMO300K
Some scans are skull-stripped, some are not.
After downloading and unzipping to the original file structure, the script below:
- Picks one scan per dataset (PT001, PT002, etc.)
- Loads and plots a middle slice
- Lets you visually inspect which datasets have skulls, artifacts, etc.
"""
FOMO300K Skull Strip Visual Inspector
--------------------------------------
Samples one scan per dataset (PTxxx_DatasetName) from mapping.tsv,
loads a middle axial slice, and plots a grid so you can visually
identify which datasets have skulls vs. are skull-stripped/defaced.
Usage:
python visualize_skull_check.py \
--fomo_root /path/to/FOMO300K \
--mapping /path/to/FOMO300K/mapping.tsv \
--out_dir ./skull_check_plots
# Optional: only plot specific PT datasets
python visualize_skull_check.py \
--fomo_root /path/to/FOMO300K \
--mapping /path/to/FOMO300K/mapping.tsv \
--out_dir ./skull_check_plots \
--datasets PT001 PT002 PT005
"""
import argparse
import os
import math
import random
from pathlib import Path
from collections import defaultdict
import numpy as np
import pandas as pd
import nibabel as nib
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
# ── helpers ──────────────────────────────────────────────────────────────────
def load_middle_slice(nii_path: Path):
    """Return the middle axial slice of a 3-D NIfTI (or the first volume of a 4-D one)."""
    img = nib.load(str(nii_path))
    # Reorient to RAS so the axial plane is along the last axis
    img_ras = nib.as_closest_canonical(img)
    data = img_ras.get_fdata(dtype=np.float32)
    # Handle 4-D volumes (take the first volume)
    if data.ndim == 4:
        data = data[..., 0]
    mid = data.shape[2] // 2
    return data[:, :, mid]
def norm(slc: np.ndarray) -> np.ndarray:
    """Normalise a slice to [0, 1] for display, using the 2nd/98th percentiles
    of the positive voxels (background-robust)."""
    pos = slc[slc > 0]
    if pos.size == 0:
        return np.zeros_like(slc)
    p2, p98 = np.percentile(pos, [2, 98])
    rng = p98 - p2
    if rng == 0:
        return np.zeros_like(slc)
    return (np.clip(slc, p2, p98) - p2) / rng
# ── main ─────────────────────────────────────────────────────────────────────
def main():
    parser = argparse.ArgumentParser(description="Visual skull-strip checker for FOMO300K")
    parser.add_argument("--fomo_root", required=True, help="Root directory of FOMO300K")
    parser.add_argument("--mapping", required=True, help="Path to mapping.tsv")
    parser.add_argument("--out_dir", default="./skull_check_plots", help="Output directory for plots")
    parser.add_argument("--datasets", nargs="*", default=None,
                        help="Subset of dataset prefixes to inspect, e.g. PT001 PT002. Default: all.")
    parser.add_argument("--seed", type=int, default=42, help="Random seed for scan sampling")
    parser.add_argument("--cols", type=int, default=6, help="Number of columns in the output grid")
    args = parser.parse_args()

    random.seed(args.seed)
    fomo_root = Path(args.fomo_root)
    out_dir = Path(args.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # ── load mapping ──────────────────────────────────────────────────────────
    print(f"Loading mapping from: {args.mapping}")
    df = pd.read_csv(args.mapping, sep="\t", dtype=str)
    # Normalise column names (strip whitespace)
    df.columns = df.columns.str.strip()
    required = {"dataset", "new_path"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"mapping.tsv is missing columns: {missing}. Found: {list(df.columns)}")
    # ── filter to requested datasets ──────────────────────────────────────────
    all_datasets = sorted(df["dataset"].unique())
    print(f"Found {len(all_datasets)} datasets in mapping.tsv")
    if args.datasets:
        # Allow matching by prefix (e.g. "PT001") or full name
        keep = []
        for ds in all_datasets:
            for pat in args.datasets:
                if ds == pat or ds.startswith(pat):
                    keep.append(ds)
                    break
        datasets = sorted(set(keep))
        print(f"Filtered to {len(datasets)} datasets: {datasets}")
    else:
        datasets = all_datasets
    if not datasets:
        raise ValueError("No datasets matched. Check --datasets argument.")
    # ── sample one scan per dataset ───────────────────────────────────────────
    samples = {}  # dataset_name -> (nii_path, new_path_str)
    for ds in datasets:
        sub = df[df["dataset"] == ds]
        # Prefer T1w / MPRAGE scans for clearest skull visibility
        t1_mask = sub["new_path"].str.contains("T1w|t1w|T1|mprage|MPRAGE", na=False)
        sub_t1 = sub[t1_mask]
        pool = sub_t1 if len(sub_t1) > 0 else sub
        row = pool.sample(1, random_state=args.seed).iloc[0]
        rel_path = row["new_path"]  # e.g. sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz
        nii_path = fomo_root / ds / rel_path
        samples[ds] = (nii_path, rel_path)
    # ── load slices ───────────────────────────────────────────────────────────
    print(f"\nLoading {len(samples)} scans …")
    slices = {}    # ds -> np array or None
    statuses = {}  # ds -> "ok" | "missing" | "error: ..."
    for ds, (nii_path, rel) in samples.items():
        if not nii_path.exists():
            print(f"  [MISSING] {ds}: {nii_path}")
            statuses[ds] = "missing"
            slices[ds] = None
            continue
        try:
            slc = load_middle_slice(nii_path)
            slices[ds] = norm(slc)
            statuses[ds] = "ok"
            print(f"  [OK] {ds}: {nii_path.name} shape={slc.shape}")
        except Exception as e:
            print(f"  [ERROR] {ds}: {e}")
            statuses[ds] = f"error: {e}"
            slices[ds] = None
    # ── plot grid ─────────────────────────────────────────────────────────────
    n = len(datasets)
    ncols = args.cols
    nrows = math.ceil(n / ncols)
    fig = plt.figure(figsize=(ncols * 3.2, nrows * 3.5), facecolor="#0d0d0d")
    fig.suptitle(
        "FOMO300K · Middle Axial Slice per Dataset\n"
        "(Visual check: with skull = skull visible, defaced = face region blanked, skull-stripped = brain only)",
        color="white", fontsize=11, y=0.995, va="top",
    )
    gs = gridspec.GridSpec(nrows, ncols, figure=fig, hspace=0.45, wspace=0.15)
    for idx, ds in enumerate(datasets):
        ax = fig.add_subplot(gs[idx // ncols, idx % ncols])
        ax.set_facecolor("#0d0d0d")
        # Derive a short two-line label, e.g. "PT001\nClevelandCCF"
        parts = ds.split("_", 1)
        label = f"{parts[0]}\n{parts[1]}" if len(parts) == 2 else ds
        slc = slices[ds]
        if slc is not None:
            ax.imshow(np.rot90(slc), cmap="gray", vmin=0, vmax=1, aspect="equal",
                      interpolation="nearest")
            ax.set_title(label, color="white", fontsize=6.5, pad=2, wrap=True)
        else:
            ax.text(0.5, 0.5, statuses[ds], color="red", fontsize=6,
                    ha="center", va="center", transform=ax.transAxes, wrap=True)
            ax.set_title(label, color="#888", fontsize=6.5, pad=2)
        ax.axis("off")

    # Hide empty cells
    for idx in range(n, nrows * ncols):
        fig.add_subplot(gs[idx // ncols, idx % ncols]).set_visible(False)
    out_path = out_dir / "skull_check_all_datasets.png"
    fig.savefig(str(out_path), dpi=130, bbox_inches="tight",
                facecolor=fig.get_facecolor())
    plt.close(fig)
    print(f"\nSaved grid plot → {out_path}")

    # ── also save a per-scan summary TSV ──────────────────────────────────────
    rows = []
    for ds, (nii_path, rel) in samples.items():
        rows.append({
            "dataset": ds,
            "sampled_scan": rel,
            "full_path": str(nii_path),
            "status": statuses[ds],
        })
    summary_df = pd.DataFrame(rows)
    summary_path = out_dir / "sampled_scans.tsv"
    summary_df.to_csv(str(summary_path), sep="\t", index=False)
    print(f"Saved sample summary → {summary_path}")
    print("\nDone. Open the PNG to visually identify skull-stripped datasets.")
if __name__ == "__main__":
    main()

The following are skull-stripped:
PT009 BraTS-GEN
PT015 MSD_BrainTumor
PT023 Infant_Development_Brain
PT025 MGH_Wild
PT030 OpenNeuro/ds00022
PT030 OpenNeuro/ds001110
PT030 OpenNeuro/ds001235
PT030 OpenNeuro/ds001339
PT030 OpenNeuro/ds001534
PT030 OpenNeuro/ds001551
PT030 OpenNeuro/ds001832
PT030 OpenNeuro/ds001882
PT030 OpenNeuro/ds002011
PT030 OpenNeuro/ds002076
PT030 OpenNeuro/ds002672
PT030 OpenNeuro/ds002675
PT030 OpenNeuro/ds002748
PT030 OpenNeuro/ds002995
PT030 OpenNeuro/ds003007
PT030 OpenNeuro/ds003340
PT030 OpenNeuro/ds003367
PT030 OpenNeuro/ds003511
PT030 OpenNeuro/ds003716
PT030 OpenNeuro/ds003777
PT030 OpenNeuro/ds003835
PT030 OpenNeuro/ds003972
PT030 OpenNeuro/ds004054
PT030 OpenNeuro/ds004187
PT030 OpenNeuro/ds004286
PT030 OpenNeuro/ds004312
PT030 OpenNeuro/ds004553
PT030 OpenNeuro/ds004564
PT030 OpenNeuro/ds004648
PT030 OpenNeuro/ds004666
PT030 OpenNeuro/ds004692
PT030 OpenNeuro/ds004710
PT030 OpenNeuro/ds004993
PT030 OpenNeuro/ds006188

The following do not have full skulls:
PT026 MICA_MICs
PT030 OpenNeuro/ds000228
PT030 OpenNeuro/ds000229
PT030 OpenNeuro/ds001168
PT030 OpenNeuro/ds002606

PT007 ATAG contains files like sub-04_ses-01_run-1_T2starw.nii.gz that upon inspection look like this:

Check also, before including them in the training data:
PT030 OpenNeuro/ds001912
PT002 Nigerian_Clinical
PT030 OpenNeuro/ds002367
PT030 OpenNeuro/ds003466
PT030 OpenNeuro/ds003763
PT030 OpenNeuro/ds003798
PT030 OpenNeuro/ds003836
PT030 OpenNeuro/ds003949
PT030 OpenNeuro/ds003967
PT030 OpenNeuro/ds003990
PT030 OpenNeuro/ds004798
PT030 OpenNeuro/ds004889
PT030 OpenNeuro/ds005205
PT030 OpenNeuro/ds005075
PT030 OpenNeuro/ds005138
PT030 OpenNeuro/ds005576
Some of the heads are a bit rotated. For example, PT030 OpenNeuro/ds001984, PT030 OpenNeuro/ds002006, PT030 OpenNeuro/ds002155, PT030 OpenNeuro/ds002711, PT030 OpenNeuro/ds002715, …
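A rough way to flag rotated heads without eyeballing every scan is to measure the obliquity of the affine: the angle between each image axis and its closest world axis. This is an assumption-laden sketch, not part of the FOMO pipeline; it uses only the 3×3 direction part of the affine, which with nibabel would be img.affine[:3, :3] (passed here as nested lists).

```python
import math

def max_obliquity_deg(affine3x3) -> float:
    """Largest angle (degrees) between an image axis (column of the affine's
    3x3 direction part) and its closest world axis. 0.0 means axis-aligned."""
    worst = 0.0
    for col in range(3):
        v = [affine3x3[row][col] for row in range(3)]
        length = math.sqrt(sum(x * x for x in v))
        # cosine of the angle to the closest world axis = largest |component| / length
        cos_best = max(abs(x) for x in v) / length
        worst = max(worst, math.degrees(math.acos(min(1.0, cos_best))))
    return worst

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(max_obliquity_deg(identity))  # 0.0 for an axis-aligned scan
```

Any threshold applied to this (say, flagging scans above 10-20 degrees for review) would be an arbitrary choice, not a FOMO value.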