Radiomics in Breast Cancer – Part 1: Exploring the CBIS-DDSM Dataset


1. Introduction

This article marks the first entry in a blog series presenting the main projects from my PhD research on radiomics and breast cancer imaging. The purpose is to disseminate my work in an accessible format while promoting open science.

Over the coming posts, I will outline the progression of my research:

dataset exploration → preprocessing → radiomics feature extraction → feature selection → ML benchmarking → interpretability.

This first post focuses on dataset exploration, specifically the Curated Breast Imaging Subset of DDSM (CBIS-DDSM), widely regarded as a benchmark dataset for breast cancer imaging research.

2. Background on CBIS-DDSM

The Digital Database for Screening Mammography (DDSM), developed in the 1990s, was among the first large, publicly available collections of digitized mammograms. Its original structure, with images stored in a non-standard compressed format and only coarse lesion outlines, posed limitations for contemporary machine learning applications.

The CBIS-DDSM, released by the Cancer Imaging Archive (TCIA), is a curated and standardized subset with the following data:

1,566 patients
2,620 mammography images
Lesion annotations: masses and calcifications
Two standard views per breast: CC and MLO
Pathology labels: Malignant, Benign, and Benign without Callback

Figure: Dataset overview diagram (patients → images (CC/MLO) → lesions (benign/malignant))

3. Objectives of the Dataset Exploration

The goal was to systematically assess:

Metadata – patient age, lesion type, pathology, image view.
Image characteristics – resolution, contrast, file size.
Class distribution – balance between benign and malignant cases and between lesion types.

This step was essential to design a robust preprocessing and analysis pipeline.
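As a minimal sketch of this kind of metadata tally, the snippet below counts values per field over a handful of made-up rows. The field names (`lesion_type`, `view`, `side`, `pathology`) and the rows themselves are illustrative assumptions, not the actual CBIS-DDSM files or column names.

```python
from collections import Counter

# Hypothetical rows mimicking a case-description table.
# Field names and values are assumptions for illustration only.
cases = [
    {"patient_id": "P_0001", "lesion_type": "mass", "view": "CC", "side": "LEFT", "pathology": "MALIGNANT"},
    {"patient_id": "P_0001", "lesion_type": "mass", "view": "MLO", "side": "LEFT", "pathology": "MALIGNANT"},
    {"patient_id": "P_0002", "lesion_type": "calcification", "view": "CC", "side": "RIGHT", "pathology": "BENIGN"},
    {"patient_id": "P_0002", "lesion_type": "calcification", "view": "MLO", "side": "RIGHT", "pathology": "BENIGN"},
    {"patient_id": "P_0003", "lesion_type": "calcification", "view": "CC", "side": "LEFT", "pathology": "BENIGN_WITHOUT_CALLBACK"},
    {"patient_id": "P_0004", "lesion_type": "mass", "view": "MLO", "side": "RIGHT", "pathology": "BENIGN"},
]


def distribution(rows, field):
    """Return {value: (count, share)} for one metadata field."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {value: (n, n / total) for value, n in counts.most_common()}


# Print one small frequency table per metadata field.
for field in ("lesion_type", "pathology", "view", "side"):
    print(field)
    for value, (n, share) in distribution(cases, field).items():
        print(f"  {value:<25} {n:>3}  {share:.0%}")
```

The same function can be pointed at any categorical field, which keeps the exploration consistent across variables.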

4. Findings

Through systematic exploration of CBIS-DDSM, several critical insights emerged, each with direct implications for radiomics analysis and machine learning model development.

Lesion Types and Distribution

The dataset includes two primary lesion types: masses and calcifications. Masses are larger, localized abnormalities, while calcifications are tiny deposits of calcium that may indicate malignancy. Understanding this distribution is essential because each lesion type may require different preprocessing and feature extraction approaches.

Figure 1: Lesion Type Distribution Bar Chart

The chart highlights that calcifications represent the majority of annotated lesions, indicating that models may naturally favor the majority lesion type (calcifications) unless strategies are implemented to balance the contribution of masses.

Pathology Labels: Benign vs Malignant

Malignant lesions are substantially underrepresented compared to benign ones. This imbalance is critical because it can bias machine learning models toward overpredicting benign outcomes if not properly addressed.

Figure 2: Class Distribution Bar Chart

The chart shows the imbalance: benign cases (including both “benign” and “benign without callback”) significantly outnumber malignant ones. This imbalance arises after merging the two benign categories into a single class, a step taken to simplify the analysis. However, it also means that models may be biased toward predicting benign outcomes. For this reason, evaluation metrics such as ROC-AUC and sensitivity are more appropriate than accuracy, since they better capture model performance on the underrepresented malignant cases.
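To make the argument about accuracy concrete, here is a small sketch with made-up labels. The 7:3 ratio and the label spellings (`BENIGN_WITHOUT_CALLBACK`, `MALIGNANT`) are assumptions for illustration, not the real class proportions. It merges the benign categories and scores a trivial "always benign" predictor.

```python
# Hypothetical raw labels; the real dataset is far larger and its
# proportions differ from this toy 7:3 split.
raw_labels = ["BENIGN"] * 5 + ["BENIGN_WITHOUT_CALLBACK"] * 2 + ["MALIGNANT"] * 3


def to_binary(label):
    """Map a pathology label to 1 (malignant) or 0 (any benign category)."""
    return 1 if label == "MALIGNANT" else 0


def accuracy(y_true, y_pred):
    """Share of cases where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    """True positive rate: share of malignant cases that were detected."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)


def roc_auc(y_true, scores):
    """Chance that a random malignant case outscores a random benign one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [to_binary(label) for label in raw_labels]

# A trivial "model" that always predicts benign with a constant score.
always_benign = [0] * len(y_true)
constant_scores = [0.0] * len(y_true)

print(accuracy(y_true, always_benign))     # 0.7
print(sensitivity(y_true, always_benign))  # 0.0
print(roc_auc(y_true, constant_scores))    # 0.5
```

The baseline reaches 70% accuracy while detecting no malignant case at all, which sensitivity and ROC-AUC expose immediately.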

Sample Mammograms

Examining actual images is critical to understand variability in imaging quality, lesion size, and annotation precision. Sample images also help communicate the nature of the dataset to readers who are less familiar with medical imaging.

Figure 3: Random sample of the mammograms

Distribution Across Metadata Variables

Beyond lesion type and pathology labels, it is critical to examine how cases are distributed across key imaging metadata variables, including mammography view (CC vs MLO), laterality (left vs right breast), and lesion type (mass vs calcification). These variables reflect both the technical aspects of image acquisition and the biological characteristics of the breast. An unbalanced representation across them can introduce hidden biases into machine learning models, which may reduce their clinical applicability.

Figure 4: Class Distribution in different key metadata variables

To examine this, Figure 4 presents three complementary charts that summarize how pathology labels and lesion types are distributed across these variables.

(Left) Pathology Distribution by Mammography View (CC vs MLO)

The first chart compares benign and malignant cases across the two standard mammography projections: craniocaudal (CC) and mediolateral oblique (MLO). While both views are routinely acquired in screening, the dataset shows a mild but notable imbalance: malignant cases are not equally represented in CC and MLO views.
This finding is significant for two reasons. First, models may become inadvertently sensitive to projection-dependent features rather than lesion-specific characteristics, leading to overfitting on technical differences. Second, when evaluating algorithm performance, results may vary depending on whether the test set contains a higher proportion of CC or MLO images. Explicitly reporting view distribution is therefore essential for transparency and reproducibility in radiomics-based studies.
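One way to keep the view distribution comparable between training and test sets is to stratify the split on the combination of view and label. The sketch below is a simplified illustration on synthetic records; the function name `stratified_split` and the record fields are hypothetical, and it does not keep images from the same patient together, which a real pipeline would likely also need.

```python
import random
from collections import defaultdict


def stratified_split(rows, keys, test_fraction=0.2, seed=0):
    """Split rows so each stratum (combination of `keys`) sends ~test_fraction to the test set."""
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[k] for k in keys)].append(row)

    rng = random.Random(seed)  # seeded for reproducibility
    train, test = [], []
    for _, members in sorted(strata.items()):
        members = members[:]  # copy before shuffling
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Synthetic image-level records: label 1 = malignant, 0 = benign.
rows = [
    {"id": i, "view": view, "label": label}
    for i, (view, label) in enumerate(
        [("CC", 0)] * 30 + [("MLO", 0)] * 40 + [("CC", 1)] * 10 + [("MLO", 1)] * 20
    )
]

train, test = stratified_split(rows, keys=("view", "label"))
print(len(train), len(test))
```

Because every view-by-label stratum is sampled at the same rate, the test set mirrors the overall CC/MLO mix instead of depending on a lucky shuffle.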

(Center) Lesion Type Distribution by View

The center chart investigates how lesion type (mass vs calcification) is distributed across CC and MLO projections. This combined perspective is particularly relevant because it highlights subgroups that may be underrepresented in the dataset. For example, while benign masses are well represented in both CC and MLO views, certain subcategories, such as malignant calcifications in the CC view, are comparatively rare.
This observation has critical implications. Models trained on such data may underperform in detecting rare but clinically important subgroups, not because the pathology is intrinsically more difficult to classify, but because of limited training samples. Furthermore, reporting global performance metrics without subgroup analysis could mask these deficiencies. Explicitly documenting subgroup imbalance encourages a more responsible interpretation of model results and highlights the need for either data augmentation or specialized evaluation strategies for minority subgroups.
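The masking effect described above can be illustrated with a toy subgroup analysis. All numbers below are invented for the example and do not come from CBIS-DDSM or from any trained model; `sensitivity_by_subgroup` is a hypothetical helper.

```python
from collections import defaultdict

# Invented malignant cases as (lesion_type, view, y_true, y_pred).
cases = (
    [("mass", "CC", 1, 1)] * 18 + [("mass", "CC", 1, 0)] * 2
    + [("mass", "MLO", 1, 1)] * 17 + [("mass", "MLO", 1, 0)] * 3
    + [("calcification", "CC", 1, 1)] * 1 + [("calcification", "CC", 1, 0)] * 3
    + [("calcification", "MLO", 1, 1)] * 5 + [("calcification", "MLO", 1, 0)] * 1
)


def sensitivity_by_subgroup(cases):
    """Sensitivity per (lesion_type, view), computed over malignant cases only."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lesion, view, y_true, y_pred in cases:
        if y_true == 1:
            totals[(lesion, view)] += 1
            hits[(lesion, view)] += y_pred
    return {group: hits[group] / totals[group] for group in totals}


# Global sensitivity pools every malignant case together.
overall = sum(pred for _, _, true, pred in cases if true == 1) / len(cases)
per_group = sensitivity_by_subgroup(cases)

print(f"overall: {overall:.2f}")
for group, value in sorted(per_group.items()):
    print(group, f"{value:.2f}")
```

In this toy setup the overall sensitivity is 0.82, yet the small calcification/CC subgroup sits at 0.25, a gap that a single global number would hide.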

(Right) Pathology Distribution by Breast Side (Left vs Right)

The right chart examines how benign and malignant cases are distributed across left and right breasts. As expected, the dataset appears relatively balanced with respect to laterality, given that mammography protocols acquire both breasts in each exam. However, the malignant class remains underrepresented on both sides.
Although laterality is not inherently expected to influence the biological likelihood of disease, it is worth noting that subtle technical differences (e.g., positioning, compression, or radiographer practice) could vary between sides. A balanced distribution minimizes the risk that models inadvertently learn from such laterality-related artifacts. Nevertheless, the overarching problem of class imbalance persists across both sides, reinforcing the importance of prioritizing evaluation metrics such as ROC-AUC, sensitivity, and specificity over raw accuracy.

5. Challenges Identified

The findings above lead to two main challenges, which must be considered when designing ML pipelines or radiomics feature extraction protocols:

Class Imbalance
- Evidence: Figure 2 illustrates a predominance of benign lesions over malignant ones.
- Implication: Standard accuracy metrics are insufficient. Models must be evaluated with metrics sensitive to class imbalance (e.g., ROC-AUC, F1-score, sensitivity). Techniques such as resampling or class weighting may be necessary.

Lesion Type Variation
- Evidence: Figure 1 shows uneven distribution of masses versus calcifications.
- Implication: Feature extraction and ML models may require tailored approaches for each lesion type. For example, texture-based radiomics features may perform differently on masses compared to calcifications.
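As one possible form of class weighting, the sketch below computes inverse-frequency weights, the same heuristic scikit-learn applies for `class_weight="balanced"`. The 70/30 label split is a made-up example, not the dataset's actual ratio, and this is not necessarily the weighting used later in the series.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Made-up label distribution for illustration.
labels = ["benign"] * 70 + ["malignant"] * 30
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

The minority class receives the larger weight, so each misclassified malignant case costs more during training than a misclassified benign one.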

6. Relevance for Radiomics and Machine Learning

The exploration of CBIS-DDSM is not merely a preliminary step; it establishes the foundation for the entire radiomics and machine learning workflow. Each insight gained informs subsequent decisions and ensures that the models and features extracted are both robust and clinically meaningful.

Class Imbalance Awareness
- The observed predominance of benign lesions (Figure 2) directly impacts model training. Without addressing this imbalance, ML models are likely to bias toward the majority class, producing inflated accuracy but poor detection of malignant lesions.
- This insight informed the decision to incorporate class weighting, resampling techniques, and sensitive evaluation metrics (ROC-AUC, F1-score, sensitivity), ensuring that the model’s predictive performance reflects clinical relevance rather than statistical bias.

Lesion Type Considerations
- Figure 1 demonstrates the uneven distribution of masses versus calcifications. Each lesion type presents distinct textural and morphological characteristics.
- Consequently, the feature extraction process (radiomics) must account for these differences. Certain features, such as texture or shape descriptors, may be more informative for one lesion type than another. This consideration guides both feature selection and model interpretability, ensuring that extracted radiomics features correspond to meaningful clinical phenomena.

Implications for Feature Extraction and ML Model Design
- A thorough understanding of these dataset characteristics allows for tailored preprocessing pipelines, informed feature selection, and appropriate model evaluation strategies.
- Without this exploration, radiomics features could be biased, misrepresentative, or noisy, leading to suboptimal ML performance and reduced clinical interpretability.

In summary, the exploration stage bridges the gap between raw clinical data and quantitative, analyzable features. It ensures that all subsequent steps — from radiomics extraction to model training and interpretation — are grounded in a well-characterized, reliable dataset, enhancing both scientific rigor and clinical applicability.

7. Conclusion

The exploration of CBIS-DDSM underscores the critical importance of systematic dataset characterization in radiomics and machine learning research. Key lessons include:

Dataset richness and limitations: CBIS-DDSM offers a valuable resource with thousands of annotated mammograms, yet presents challenges such as class imbalance, lesion variability, and image heterogeneity.
Impact on downstream analysis: Each observed feature of the dataset informs preprocessing, feature extraction, model design, and evaluation. Ignoring these factors can compromise both predictive performance and clinical relevance.
Foundation for reproducible research: By carefully documenting dataset characteristics and exploration steps, other researchers can reproduce the pipeline and validate findings, in alignment with open science principles.

Next Steps in the Series

This first post establishes a comprehensive understanding of the data that underpins all subsequent research. In Part 2, I will detail how mammograms are preprocessed for radiomics analysis, including steps for cleaning, normalizing, and preparing images for feature extraction.