"TMED-2 dataset" | Tufts Medical Echocardiogram Dataset (TMED)

Our complete TMED-2 dataset release (dated 2022-07-12) contains three components:

view_and_diagnosis_labeled_set : 599 studies from 577 unique patients (some patients have multiple studies on distinct days).
- All patients have an aortic stenosis (AS) diagnostic label (none, early AS, or significant AS; for more see our severity diagnosis label primer)
- Some images from each study have view label annotations (one of PLAX/PSAX/A2C/A4C/other; for more see our view label primer)
- We partition these by patient into different "splits" of 360 training / 119 validation / 120 test studies.
view_labeled_set : 705 studies from 703 unique patients
- These studies have view labels, but no AS diagnosis labels
unlabeled_set : 5486 studies from 5287 patients
- No labels are available for any studies in this set

This TMED-2 dataset is referred to in some of our manuscripts as the DEV479 dataset, because models are trained on a development set of 479 studies (360 for training and 119 for validation). The held-out test set contains 120 studies.


Summary Table

Summary statistics of our released TMED-2 dataset

Dataset	Num. Patients	Num. Studies	Num. Labeled Images	Num. Unlabeled Images
fully labeled set	577	599	17270	26596
partially labeled set	703	705	7694	37576
unlabeled set	5287	5486	0	353500

Image preprocessing

Every image in this dataset is a 2D TTE image stored at 112x112 pixel resolution in PNG format.

In TMED-2, we used metadata available in the raw DICOM files to ensure that only the 2D TTE images from each study are included (filtering out Doppler images, M-mode images, and color-flow images). Note that this preprocessing is more aggressive than in TMED-1, where we did some filtering by aspect ratio that may not have discarded all Doppler, M-mode, or color-flow images.

Dataset Format

The dataset is delivered as a shared folder on box.com to users who successfully Apply For Access.

The top-level directory contains:

- labels, stored in comma-separated-value (CSV) plain-text files
- images, stored within folders as 112x112 pixel grayscale PNG files

Labels and other metadata

Labels and assignments of the labeled set to different train/validation/test splits are stored in the following CSV files in the top-level directory.

- labels_per_image.csv
- TMED2_fold0_labeledpart.csv
- TMED2_fold1_labeledpart.csv
- TMED2_fold2_labeledpart.csv
- TMED2_fold0_unlabeledpart.csv
- TMED2_fold1_unlabeledpart.csv
- TMED2_fold2_unlabeledpart.csv
- TMED2_train_unlabeled.csv

Each CSV file has a row for each image file in the dataset, providing the relevant labels.

The specs for each type of CSV file are below:

Spec for labels_per_image.csv

CSV file with one row per image. Columns include:

- query_key: Filename of specific image, as a string. Example: "2977s1_0.png". See below for explanation of PatientIDStudyID_ImageID.png naming convention.
- view_label: View label, as a string. Options: {"PLAX", "PSAX", "A4C", "A2C", "A4CorA2CorOther"}
- diagnosis_label: Diagnostic severity label, as a string. Options: {"no_as", "mild_as", "mildtomoderate_AS", "moderate_AS", "severe_AS", "Not_Provided"}
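
As a minimal sketch of how the columns above might be read, the snippet below tallies label values using only the Python standard library. The helper name `count_labels` and the three sample rows are made up for illustration; only the column names come from the spec.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, column):
    """Count how often each value appears in `column` of an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for labels_per_image.csv.
# These rows are fabricated for illustration, not taken from the dataset.
sample_text = (
    "query_key,view_label,diagnosis_label\n"
    "2977s1_0.png,PLAX,no_as\n"
    "2977s1_1.png,PSAX,no_as\n"
    "1907s2_0.png,A4C,Not_Provided\n"
)

view_counts = count_labels(io.StringIO(sample_text), "view_label")
diagnosis_counts = count_labels(io.StringIO(sample_text), "diagnosis_label")
print(view_counts)
print(diagnosis_counts)
```

To run the same count on the real file, pass an open file handle instead, for example `open("labels_per_image.csv", newline="")`, assuming the file sits in the current directory.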

Spec for TMED2_foldX_labeledpart.csv

CSV file with one row per image. Integer X denotes the specific train/valid/test split, and could take values in {0, 1, 2} (these splits correspond exactly to the 3 splits used in our paper's experiments). Columns include the following:

- query_key: Filename of specific image, as a string. Example: "2977s1_0.png". See below for explanation of PatientIDStudyID_ImageID.png naming convention.
- view_classifier_split: String that indicates which standard data split for image-level view classifier this image belongs to within fold X. Options: {"train", "val", "test"}.
- diagnosis_classifier_split: String that indicates which standard data split for image-level diagnosis classifier this image belongs to within fold X. Options: {"train", "val", "test", "not_used"}.
- view_label: View label, as defined above.
- diagnosis_label: Diagnostic severity label, as defined above.
- SourceFolder: Folder in which the image is located
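
The two split columns can assign the same image to different roles, so it may help to filter on each one separately. The sketch below shows one way to do that with the standard library. The helper `rows_in_split`, the sample rows, the ID "1234s1", and the exact form of the SourceFolder values are all assumptions made for illustration.

```python
import csv
import io


def rows_in_split(csv_file, split_column, split_name):
    """Return the rows of an open fold CSV whose `split_column` equals `split_name`."""
    return [row for row in csv.DictReader(csv_file) if row[split_column] == split_name]


# Fabricated rows using the column names from the spec above.
# The SourceFolder values are a guess at how folders are written in the real file.
sample_text = (
    "query_key,view_classifier_split,diagnosis_classifier_split,"
    "view_label,diagnosis_label,SourceFolder\n"
    "2977s1_0.png,train,train,PLAX,no_as,view_and_diagnosis_labeled_set/labeled\n"
    "2977s1_1.png,train,train,PSAX,no_as,view_and_diagnosis_labeled_set/labeled\n"
    "1234s1_0.png,train,not_used,A4C,Not_Provided,view_labeled_set/labeled\n"
)

# Images available for training the view classifier in this fold.
view_train = rows_in_split(io.StringIO(sample_text), "view_classifier_split", "train")

# Images available for training the diagnosis classifier in this fold.
diagnosis_train = rows_in_split(
    io.StringIO(sample_text), "diagnosis_classifier_split", "train"
)

print(len(view_train), len(diagnosis_train))
```

In this made-up sample, the third image counts toward view training but is marked "not_used" for diagnosis, which is why the two lists differ in length.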

Spec for TMED2_foldX_unlabeledpart.csv

- query_key: Filename of specific image, as a string. Example: "2977s1_0.png". See below for explanation of PatientIDStudyID_ImageID.png naming convention.
- SourceFolder: Folder in which the image is located
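
Because each row carries both a folder and a filename, the two can be joined into a single relative path. The sketch below assumes that SourceFolder holds a folder path relative to the top-level directory (for example "view_and_diagnosis_labeled_set/unlabeled"); that format is an assumption, so check it against the actual CSV contents. The helper name `image_relpath` is hypothetical.

```python
from pathlib import PurePosixPath


def image_relpath(source_folder, query_key):
    """Join a SourceFolder value and a query_key into one relative path."""
    return PurePosixPath(source_folder) / query_key


# Assumed SourceFolder values; the real CSV may write folders differently.
path_a = image_relpath("view_and_diagnosis_labeled_set/unlabeled", "2977s1_0.png")
path_b = image_relpath("view_and_diagnosis_labeled_set/labeled", "2977s1_0.png")

# The same filename under two folders yields two distinct paths.
print(path_a)
print(path_b)
print(path_a != path_b)
```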

Images

Images are stored within a hierarchy of folders representing the sets that comprise TMED-2:

view_and_diagnosis_labeled_set/labeled/
view_and_diagnosis_labeled_set/unlabeled/
view_labeled_set/labeled/
view_labeled_set/unlabeled/
unlabeled_set/unlabeled/

Each set's labeled/ subfolder contains only images with view labels. Each set's unlabeled/ subfolder contains only images without any view labels.

The individual image files are stored as 112x112 pixel grayscale PNG files within the appropriate folder.

For example, the fully-labeled set looks like:

- view_and_diagnosis_labeled_set/labeled/2977s1_0.png
- view_and_diagnosis_labeled_set/labeled/2977s1_1.png
- view_and_diagnosis_labeled_set/labeled/2977s1_2.png
...
- view_and_diagnosis_labeled_set/labeled/2977s1_19.png
- view_and_diagnosis_labeled_set/labeled/1907s2_0.png
- view_and_diagnosis_labeled_set/labeled/1907s2_1.png
- view_and_diagnosis_labeled_set/labeled/1907s2_2.png
...
- view_and_diagnosis_labeled_set/labeled/1907s2_24.png
...

The naming convention of these files is [PatientID]s[StudyID]_[ImageID].png.

- PatientIDs are unique random identifiers (consistent across the whole dataset).
- Each StudyID (counting up from 1) indicates one session of echocardiogram imagery captured on one day.
- Each ImageID (counting up from 0) distinguishes the images within a subset. An ImageID is not unique even across the labeled/ and unlabeled/ image sets, so please use the full path if you need a unique identifier.
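
The naming convention can be split into its parts with a regular expression, which is handy for grouping images by patient or study. This is a sketch only: it assumes PatientIDs consist solely of digits, as in the examples shown here, and the names `ImageName` and `parse_query_key` are hypothetical.

```python
import re
from typing import NamedTuple


class ImageName(NamedTuple):
    patient_id: str
    study_id: int
    image_id: int


# [PatientID]s[StudyID]_[ImageID].png
# Assumption: PatientID is all digits, matching examples such as "2977s1_0.png".
_NAME_PATTERN = re.compile(r"(\d+)s(\d+)_(\d+)\.png")


def parse_query_key(query_key):
    """Split a filename like '2977s1_0.png' into patient, study, and image IDs."""
    match = _NAME_PATTERN.fullmatch(query_key)
    if match is None:
        raise ValueError(f"unexpected filename format: {query_key!r}")
    patient, study, image = match.groups()
    return ImageName(patient, int(study), int(image))


parsed = parse_query_key("1907s2_24.png")
print(parsed.patient_id, parsed.study_id, parsed.image_id)
```

Keeping `patient_id` as a string avoids losing any leading zeros, should the real identifiers contain them.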

Example code

See the TMED-2 Data Loading and Visualization Demo that loads the data, visualizes it and displays the corresponding labels.