"TMED-2 dataset"


Our complete TMED-2 dataset release (dated 2022-07-12) contains three components:

  • view_and_diagnosis_labeled_set : 599 studies from 577 unique patients (some patients have multiple studies on distinct days).

    • All patients have an aortic stenosis (AS) diagnostic label (none, early AS, or significant AS; for more see our severity diagnosis label primer)
    • Some images from each study have view label annotations (one of PLAX/PSAX/A2C/A4C/other, for more see our view label primer)
    • We partition these by patient into different "splits" of 360 training / 119 validation / 120 test studies.
  • view_labeled_set : 705 studies from 703 unique patients

    • These studies have view labels, but no AS diagnosis labels
  • unlabeled_set : 5486 studies from 5287 patients

    • No labels are available for any studies in this set

This TMED-2 dataset is referred to in some of our manuscripts as the DEV479 dataset, because models are trained on a development set of 479 studies (360 for training and 119 for validation). The held-out test set contains 120 studies.

Summary Table

Summary statistics of our released TMED-2 dataset

Dataset                  Num. Patients  Num. Studies  Num. Labeled Images  Num. Unlabeled Images
fully labeled set                  577           599                17270                  26596
partially labeled set              703           705                 7694                  37576
unlabeled set                     5287          5486                    0                 353500

Image preprocessing

Every image in this dataset is a 2D TTE image stored at 112x112 pixel resolution in PNG format.

In TMED-2, we used metadata available in the raw DICOM files to ensure that only the 2D TTE images from each study are included (filtering out Doppler images, M-mode images, and color flow images). Note that this preprocessing is more aggressive than in TMED-1 (where we did some filtering by aspect ratio, which may not have discarded all Doppler, M-mode, or color flow images).
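
As a quick sanity check on a downloaded file, the stated resolution can be read directly from the PNG header without any imaging library. The sketch below is not part of the TMED-2 release: the function names are hypothetical, and it relies only on the standard PNG layout (an 8-byte signature followed by an IHDR chunk holding width, height, bit depth, and color type). Whether the released files use PNG color type 0 for grayscale is an assumption, so the size check reports the color type but does not require it.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data):
    """Return (width, height, bit_depth, color_type) from the first 26 bytes of a PNG."""
    if len(data) < 26 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    if data[12:16] != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    # IHDR payload starts at byte 16: two big-endian uint32 values, then two bytes.
    return struct.unpack(">IIBB", data[16:26])


def has_expected_size(data, size=(112, 112)):
    """True if the PNG header reports the 112x112 resolution described above."""
    width, height, _bit_depth, _color_type = read_png_header(data)
    return (width, height) == size


def check_file(path):
    """Hypothetical helper: check one image file on disk."""
    with open(path, "rb") as f:
        return has_expected_size(f.read(26))
```

In the PNG format, color type 0 denotes grayscale, so the fourth value returned by `read_png_header` can be inspected if that property matters for your pipeline.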

Dataset Format

The dataset is delivered as a shared folder on box.com to users who successfully Apply For Access.

The top-level directory contains:

  • labels, stored in comma-separated-value (CSV) plain-text files
  • images, stored within folders as 112x112 pixel grayscale PNG files

Labels and other metadata

Labels and assignments of the labeled set to different train/validation/test splits are stored in the following CSV files in the top-level directory.

- labels_per_image.csv
- TMED2_fold0_labeledpart.csv
- TMED2_fold1_labeledpart.csv
- TMED2_fold2_labeledpart.csv
- TMED2_fold0_unlabeledpart.csv
- TMED2_fold1_unlabeledpart.csv
- TMED2_fold2_unlabeledpart.csv
- TMED2_train_unlabeled.csv

Each CSV file has a row for each image file in the dataset, providing the relevant labels.

The specs for each type of CSV file are below:

Spec for labels_per_image.csv

CSV file with one row per image. Columns include:

- query_key
Filename of the specific image, as a string. Example: "2977s1_0.png". See below for an explanation of the [PatientID]s[StudyID]_[ImageID].png naming convention.
- view_label
View label, as a string. Options: {"PLAX", "PSAX", "A4C", "A2C", "A4CorA2CorOther"}
- diagnosis_label
Diagnostic severity label, as a string. Options: {"no_as", "mild_as", "mildtomoderate_AS", "moderate_AS", "severe_AS", "Not_Provided"}
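
A minimal sketch of reading this file with Python's standard csv module and tallying the labels is shown here. It assumes the file has a header row whose column names match the spec above exactly; the function name `count_labels` is hypothetical and not part of the release.

```python
import csv
from collections import Counter


def count_labels(csv_lines):
    """Tally view and diagnosis labels.

    csv_lines is any iterable of CSV lines, such as an open file.
    Assumes a header row with 'view_label' and 'diagnosis_label' columns.
    """
    view_counts = Counter()
    diagnosis_counts = Counter()
    for row in csv.DictReader(csv_lines):
        view_counts[row["view_label"]] += 1
        diagnosis_counts[row["diagnosis_label"]] += 1
    return view_counts, diagnosis_counts


def count_labels_in_file(path="labels_per_image.csv"):
    """Convenience wrapper for a file on disk (the path is an assumption)."""
    with open(path, newline="") as f:
        return count_labels(f)
```

Comparing the tallies against the summary statistics is one way to confirm that a download is complete.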

Spec for TMED2_foldX_labeledpart.csv

CSV file with one row per image. Integer X identifies the specific train/validation/test split and takes a value in {0, 1, 2} (these splits correspond exactly to the 3 splits used in our paper's experiments). Columns include the following:

- query_key
Filename of the specific image, as a string. Example: "2977s1_0.png". See below for an explanation of the [PatientID]s[StudyID]_[ImageID].png naming convention.
- view_classifier_split
String that indicates which standard data split for image-level view classifier this image belongs to within fold X. Options: {"train", "val", "test"}.
- diagnosis_classifier_split
String that indicates which standard data split for image-level diagnosis classifier this image belongs to within fold X. Options: {"train", "val", "test", "not_used"}.
- view_label
View label, as defined above.
- diagnosis_label
Diagnostic severity label, as defined above.
- SourceFolder
Folder in which the image is located.
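
The columns above can be combined to pick out one split of a fold and locate its image files. The following is a sketch under stated assumptions: the CSV has a header row with the column names listed above, and SourceFolder holds a path relative to the dataset's top-level directory (this is not confirmed by the spec). The helper names `select_split` and `image_path` are hypothetical.

```python
import csv
from pathlib import PurePosixPath

SPLIT_COLUMNS = {
    "view": "view_classifier_split",
    "diagnosis": "diagnosis_classifier_split",
}


def select_split(csv_lines, split, task="view"):
    """Return the rows assigned to `split` ('train', 'val', or 'test') for a task.

    csv_lines is any iterable of CSV lines, such as an open
    TMED2_foldX_labeledpart.csv file.
    """
    column = SPLIT_COLUMNS[task]
    return [row for row in csv.DictReader(csv_lines) if row[column] == split]


def image_path(row, root):
    """Build an image path, ASSUMING SourceFolder is relative to the dataset root."""
    return str(PurePosixPath(root) / row["SourceFolder"] / row["query_key"])
```

Rows whose diagnosis_classifier_split is "not_used" are skipped automatically when selecting "train", "val", or "test" for the diagnosis task.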

Spec for TMED2_foldX_unlabeledpart.csv

CSV file with one row per image. Integer X identifies the specific train/validation/test split and takes a value in {0, 1, 2} (these splits correspond exactly to the 3 splits used in our paper's experiments). Columns include the following:

- query_key
Filename of the specific image, as a string. Example: "2977s1_0.png". See below for an explanation of the [PatientID]s[StudyID]_[ImageID].png naming convention.
- SourceFolder
Folder in which the image is located.

Images

Images are stored within a hierarchy of folders representing the sets that comprise TMED-2:

  • view_and_diagnosis_labeled_set/labeled/
  • view_and_diagnosis_labeled_set/unlabeled/
  • view_labeled_set/labeled/
  • view_labeled_set/unlabeled/
  • unlabeled_set/unlabeled/

Each set's labeled/ subfolder contains only images with view labels. Each set's unlabeled/ subfolder contains only images without any view labels.

The individual image files are stored as 112x112 pixel grayscale PNG files within the appropriate folder.

For example, the fully-labeled set looks like:

- view_and_diagnosis_labeled_set/labeled/2977s1_0.png
- view_and_diagnosis_labeled_set/labeled/2977s1_1.png
- view_and_diagnosis_labeled_set/labeled/2977s1_2.png
...
- view_and_diagnosis_labeled_set/labeled/2977s1_19.png
- view_and_diagnosis_labeled_set/labeled/1907s2_0.png
- view_and_diagnosis_labeled_set/labeled/1907s2_1.png
- view_and_diagnosis_labeled_set/labeled/1907s2_2.png
...
- view_and_diagnosis_labeled_set/labeled/1907s2_24.png
...

The naming convention of these files is [PatientID]s[StudyID]_[ImageID].png.

  • PatientIDs are unique random identifiers (consistent across the whole dataset)
  • Each StudyID (counting up from 1) indicates one session of echocardiogram imagery captured on one day.
  • Each ImageID (counting up from 0) distinguishes the images within a subset. ImageIDs are not unique across folders, not even between the labeled/ and unlabeled/ image sets, so please use the full path if you need a unique identifier.
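
The naming convention above can be parsed with a regular expression, which is useful for grouping images by study or for keeping all of a patient's images in the same split. This is a minimal sketch: the function names are hypothetical, and the pattern assumes PatientIDs consist only of digits, as in the examples shown (adjust it if your copy of the data contains other characters).

```python
import re
from collections import defaultdict

# [PatientID]s[StudyID]_[ImageID].png, assuming numeric PatientIDs.
FILENAME_PATTERN = re.compile(r"(?P<patient>\d+)s(?P<study>\d+)_(?P<image>\d+)\.png")


def parse_query_key(name):
    """Split a filename into (patient_id, study_id, image_id).

    The patient ID is kept as a string; study and image IDs become integers.
    """
    match = FILENAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    return match["patient"], int(match["study"]), int(match["image"])


def group_by_study(names):
    """Map (patient_id, study_id) to the list of filenames in that study."""
    groups = defaultdict(list)
    for name in names:
        patient, study, _image = parse_query_key(name)
        groups[(patient, study)].append(name)
    return dict(groups)
```

Because the same filename may appear in more than one folder, pass full paths through your own bookkeeping and apply `parse_query_key` only to the final path component.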

Example code

See the TMED-2 Data Loading and Visualization Demo that loads the data, visualizes it and displays the corresponding labels.