"TMED-1 dataset (aka TMED-156-52)"

Our complete TMED-1 dataset release (dated 2021-07-30) contains:

Fully labeled set of studies from 260 total unique patients
- All patients have an aortic stenosis (AS) diagnostic label (one of none, mild/moderate, or severe; for more see our severity diagnosis label primer)
- All images have a view label (one of PLAX/PSAX/other; for more, see our view label primer)
- We partition these patients into several different "splits" of 156 training / 52 validation / 52 test studies
Partially labeled set of studies from 172 total unique patients
- These studies have AS diagnosis labels, but no view labels
Unlabeled set of studies from 2341 total unique patients
- No labels are available for any studies in this dataset

This original dataset is referred to in our MLHC '21 paper as the TMED-156-52 dataset, because models are trained on data from 156 labeled patients and each heldout set contains data from 52 patients.

We also make available a smaller version of this dataset: TMED-1 small (aka TMED-18-18).


Summary Table

Summary statistics of our released TMED-1 (TMED-156-52) dataset

Dataset	Num. Patients	Num. Images
fully labeled set	260 (156 train / 52 valid. / 52 test)	27788
partially labeled set	172	19219
unlabeled set	2341	271474

Differences from Reported Experiments in MLHC '21 Paper

Note: This public release differs slightly from the datasets used in our MLHC '21 paper.

To improve the quality of our public release, we removed several studies that were originally included in the partially labeled and unlabeled sets in our [MLHC 2021 manuscript](publications.html). We removed these studies because, despite our best early efforts, they were later found to come from the same patient (though not necessarily the same imaging study) as some data in our labeled set. In this public release, all studies are guaranteed to be from distinct patients, which should simplify analysis and minimize confusion. Brief summary of the changes:

- No changes were made to the paper's labeled set
- 130 studies were removed from the paper's unlabeled set
- 2 studies were removed from the paper's partially labeled set (treated as unlabeled in all SSL experiments)

Image preprocessing

Every image in this dataset is a TTE image stored at 64x64 pixel resolution in PNG format.

For each included study, we included all available images after filtering by aspect ratio to discard non-2D images (see App. C of our MLHC paper for details). From each cineloop file, we chose exactly one image to analyze. Clinical collaborators suggested that any single frame could be used, so we took the first frame of each cineloop. The resulting data contains both color and grayscale images at various resolutions. We converted each image to grayscale, padded it along its shorter axis to achieve a square aspect ratio, and resized it to 64x64 pixels.
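The three steps above (grayscale conversion, padding to square, resizing) can be sketched in plain Python as follows. This is an illustrative sketch, not our release pipeline: the luma weights, the zero-valued centered padding, and the nearest-neighbor resampling are all assumptions, since the text above does not specify them, and the function names are hypothetical. Images are represented as nested lists of pixels so the example needs no imaging library.

```python
def to_grayscale(image):
    """Convert rows of (R, G, B) pixels, or plain intensities, to rows of ints."""
    def gray(pixel):
        if isinstance(pixel, (tuple, list)):
            r, g, b = pixel[:3]
            # Assumption: ITU-R BT.601 luma weights; the source names no formula.
            return round(0.299 * r + 0.587 * g + 0.114 * b)
        return int(pixel)
    return [[gray(p) for p in row] for row in image]


def pad_to_square(image, fill=0):
    """Pad the shorter axis so height == width.

    Assumption: content is centered and padding pixels are black (0).
    """
    height, width = len(image), len(image[0])
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2
    padded = [[fill] * side for _ in range(side)]
    for y, row in enumerate(image):
        padded[top + y][left:left + width] = row
    return padded


def resize_nearest(image, size=64):
    """Resize a square image to size x size by nearest-neighbor sampling."""
    side = len(image)
    return [
        [image[y * side // size][x * side // size] for x in range(size)]
        for y in range(size)
    ]


def preprocess(image, size=64):
    """Grayscale -> pad to square -> resize, in the order described above."""
    return resize_nearest(pad_to_square(to_grayscale(image)), size)


# Usage: a 64-row by 128-column all-white color image.
wide_white = [[(255, 255, 255)] * 128 for _ in range(64)]
result = preprocess(wide_white)
print(len(result), len(result[0]))
```

In practice an imaging library would handle decoding and interpolation; the nested-list version only makes the geometry of the padding and resizing explicit.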

Dataset Format

The dataset is delivered as a shared folder on box.com to users who successfully Apply For Access.

The top-level directory contains:

- labels, stored in comma-separated-value (CSV) plain-text files
- images, stored within folders as 64x64 pixel grayscale PNG files

Labels and other metadata

Labels and assignments of the labeled set to different train/validation/test splits are stored in the following CSV files in the top-level directory.

- labels_per_image.csv
- TMED-156-52_fold0.csv
- TMED-156-52_fold1.csv
- TMED-156-52_fold2.csv
- TMED-156-52_fold3.csv

Note: some additional CSV files define the Small Version of our dataset: TMED-18-18.

Each CSV file has a row for each image file in the dataset, providing the relevant labels.

The specs for each type of CSV file are below:

Spec for labels_per_image.csv

CSV file with one row per image. Columns include:

- query_key: Filename of specific image, as a string. Example: "829_0.png". See below for explanation of PatientID_ImageID.png naming convention.
- view_label: View label, as a string. Options: {"PLAX", "PSAX AoV", "Other"}
- diagnosis_label: Diagnostic severity label, as a string. Options: {"no_as", "mild/moderate_as", "severe_as"}
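As a minimal sketch of how this file might be consumed, the snippet below reads rows with Python's standard `csv` module and collects one diagnosis label per patient. The sample rows, their IDs, and their labels are made up for illustration, and the helper name `diagnosis_by_patient` is hypothetical. The sketch assumes the patient ID is the part of `query_key` before the underscore and that all images from one patient share the same diagnosis label.

```python
import csv
import io

# Hypothetical rows for illustration only; IDs and labels are made up.
SAMPLE_CSV = """query_key,view_label,diagnosis_label
1_0.png,PLAX,no_as
1_1.png,PSAX AoV,no_as
2_0.png,Other,severe_as
"""


def diagnosis_by_patient(csv_file):
    """Map each patient ID (as a string) to its diagnosis label.

    Raises ValueError if two images of one patient disagree, which would
    violate the assumption that labels are assigned per patient.
    """
    labels = {}
    for row in csv.DictReader(csv_file):
        patient_id = row["query_key"].split("_")[0]
        label = row["diagnosis_label"]
        previous = labels.setdefault(patient_id, label)
        if previous != label:
            raise ValueError(
                f"patient {patient_id} has labels {previous!r} and {label!r}"
            )
    return labels


labels = diagnosis_by_patient(io.StringIO(SAMPLE_CSV))
print(labels)
```

To run this on the released file, replace the `io.StringIO` object with a file opened via `open("labels_per_image.csv", newline="")`. Columns are looked up by name, so extra columns in the file would not break the sketch.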

Spec for TMED-156-52_foldX.csv

CSV file with one row per image. Integer X denotes the specific train/valid/test split and takes a value in {0, 1, 2, 3} (these splits correspond exactly to the 4 splits used in our paper's experiments). Columns include the following:

- query_key: Filename of specific image, as a string. Example: "829_0.png". See below for explanation of PatientID_ImageID.png naming convention.
- split: String that indicates which standard data split for semi-supervised learning this image belongs to within fold X. Options: {"train", "val", "test", "Unlabeled"}.
- view_label: View label, as defined above.
- diagnosis_label: Diagnostic severity label, as defined above.
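One way to sanity-check a fold file is to count unique patients and images per split, as in the sketch below. The sample rows are made up for illustration, and the helper name `summarize_splits` is hypothetical; the sketch assumes the patient ID is the part of `query_key` before the underscore. On a real fold file, the patient counts for "train", "val", and "test" would be expected to match the 156 / 52 / 52 partition described earlier.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; IDs and labels are made up.
SAMPLE_FOLD_CSV = """query_key,split,view_label,diagnosis_label
1_0.png,train,PLAX,no_as
1_1.png,train,Other,no_as
2_0.png,val,PLAX,severe_as
3_0.png,test,PSAX AoV,mild/moderate_as
3_1.png,test,Other,mild/moderate_as
"""


def summarize_splits(csv_file):
    """Return {split: (num_unique_patients, num_images)} for one fold file."""
    patients = defaultdict(set)
    image_counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        split = row["split"]
        patients[split].add(row["query_key"].split("_")[0])
        image_counts[split] += 1
    return {s: (len(patients[s]), image_counts[s]) for s in patients}


summary = summarize_splits(io.StringIO(SAMPLE_FOLD_CSV))
for split, (num_patients, num_images) in summary.items():
    print(f"{split}: {num_patients} patients, {num_images} images")
```

Grouping by patient rather than by image matters here because the splits are defined over patients, so every image of a given patient should land in the same split.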

Images

Images are stored within 3 different folders.

Each folder of images (labeled, partially_labeled, and unlabeled) contains many PNG image files.

The naming convention of these files is <PatientID>_<ImageID>.png.

Patient IDs are unique random identifiers. Each patient contributes exactly one study (one session of echocardiogram imagery captured on one day).

Image IDs are integer counters, starting at 0, that indicate the order of image capture. For example:

- labeled/829_0.png
- labeled/829_1.png
- labeled/829_2.png
...
- labeled/829_169.png
- labeled/1913_0.png
- labeled/1913_1.png
- labeled/1913_2.png
...
- labeled/1913_132.png
...
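The naming convention can be parsed with a short helper, sketched below. The function name `parse_image_path` is hypothetical, and the regular expression assumes that both IDs are plain decimal integers, as in the examples above. Sorting on the parsed integers keeps images in capture order, whereas a plain string sort would place `829_10.png` before `829_2.png`.

```python
import re
from pathlib import PurePosixPath

# Assumption: both IDs are decimal integers, as in the listed examples.
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.png$")


def parse_image_path(path):
    """Return (patient_id, image_id) as ints for '<PatientID>_<ImageID>.png'."""
    name = PurePosixPath(path).name
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected image filename: {name!r}")
    return int(match.group(1)), int(match.group(2))


paths = [
    "labeled/829_10.png",
    "labeled/829_2.png",
    "labeled/1913_0.png",
    "labeled/829_0.png",
]

# Sort numerically by patient, then by capture order within each patient.
ordered = sorted(paths, key=parse_image_path)
print(ordered)
```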

Example code

See our Data Loading and Visualization Demo Notebook, which shows how to load the data, visualize some example images, and display the corresponding labels.