Anshika

Building a Breast Cancer Prediction App with Machine Learning and Streamlit

Machine learning models are becoming useful tools for early disease detection. In this tutorial, I'll walk you through building a breast cancer prediction system with logistic regression and Streamlit, using the Wisconsin Breast Cancer dataset.

What We'll Build

By the end of this tutorial, you'll have:

  • A fully trained logistic regression model for cancer prediction
  • An interactive Streamlit web application
  • Comprehensive exploratory data analysis
  • A complete GitHub repository ready for deployment

Live Demo: Streamlit app

GitHub Repository: House Price Prediction

Understanding the Dataset

The Wisconsin Breast Cancer dataset contains 569 samples with 30 features each, computed from digitized images of breast mass fine needle aspirates. Each sample is classified as either:

  • Benign (B): Non-cancerous tumor
  • Malignant (M): Cancerous tumor
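For a quick look at these numbers, scikit-learn ships a copy of this dataset. The sketch below assumes that bundled copy rather than the CSV used later in this post (the one with `id` and `diagnosis` columns). Note that scikit-learn encodes the target as 0 = malignant and 1 = benign, the reverse of the mapping used later in this post.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the article's CSV.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

print(X.shape)  # (569, 30)

# scikit-learn's encoding: 0 = malignant, 1 = benign
class_counts = y.map({0: "malignant", 1: "benign"}).value_counts()
print(class_counts)
print((class_counts / len(y)).round(3))
```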

🔧 Setting Up the Environment

First, let's set up our development environment:

# Create virtual environment
python -m venv breast_cancer_env
source breast_cancer_env/bin/activate  # On Windows: breast_cancer_env\Scripts\activate

# Install required packages
pip install pandas numpy scikit-learn matplotlib seaborn streamlit plotly joblib

Exploratory Data Analysis

The first step in any machine learning project is understanding your data. Here's what we discovered:

Key Insights:

  • Dataset Balance: ~63% benign, ~37% malignant cases
  • Feature Correlations: Strong correlations between mean, SE, and worst values of the same measurements
  • Distinguishing Features: concave_points_worst, perimeter_worst, and concave_points_mean show the highest correlation with malignancy
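One way to check which features track malignancy most closely is to correlate each feature with the label. This is a minimal sketch, assuming scikit-learn's bundled copy of the dataset, where the columns are named slightly differently (for example `worst concave points` instead of `concave_points_worst`):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
features = data.data

# Flip scikit-learn's encoding so that 1 = malignant, matching this post.
is_malignant = (data.target == 0).astype(int)

# Pearson correlation of every feature with the malignant label, highest first.
corr_with_malignancy = features.corrwith(is_malignant).sort_values(ascending=False)
print(corr_with_malignancy.head(5))
```

Pearson correlation only captures linear association, so treat the ranking as a first pass rather than a feature-selection rule.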

Visualization Highlights:

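import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Assumption: the dataset CSV is saved locally as data.csv (hypothetical path)
df = pd.read_csv('data.csv')

# Target variable distribution
df['diagnosis'].value_counts().plot(kind='bar')
plt.title('Distribution of Diagnosis')
plt.show()

# Correlation matrix (numeric columns only; 'diagnosis' is still a string here)
sns.heatmap(df.corr(numeric_only=True), annot=False, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()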

Building the Machine Learning Model

Data Preprocessing

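from sklearn.preprocessing import StandardScaler

# Convert diagnosis to binary (0 = benign, 1 = malignant)
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})

# Separate features and target
X = df.drop(['diagnosis', 'id'], axis=1)
y = df['diagnosis']

# Feature scaling (zero mean, unit variance per feature)
# Note: fitting the scaler on all rows lets test-set statistics leak into training
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)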
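The preprocessing block above fits the scaler on the full dataset before any train/test split, so the test rows influence the scaling. One common alternative, which is not what this post originally did, is to split first and wrap the scaler and classifier in a scikit-learn `Pipeline`. The sketch assumes scikit-learn's bundled copy of the dataset, and `max_iter=1000` and `stratify=y` are my choices, not the post's:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled dataset stands in for the article's CSV.
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # recode so that 1 = malignant, as in this post

# Split first; stratify keeps the benign/malignant ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# The pipeline fits the scaler on X_train only, then reuses it for X_test.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

On a dataset this small the difference in accuracy is usually slight, but the pipeline also guarantees that the same scaling is applied at prediction time.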

Model Training

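from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split the data (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)

# Train logistic regression
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = lr.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')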
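The setup step installs joblib, but the post does not show where it is used. A likely use, and it is only an assumption here, is saving the trained model so the Streamlit app can load it without retraining. The sketch below persists the scaler and classifier together as one pipeline so the app applies the same scaling; the file name is hypothetical:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled dataset stands in for the article's CSV.
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Hypothetical file name; a temp directory keeps the example side-effect free.
model_path = Path(tempfile.mkdtemp()) / "breast_cancer_model.joblib"
joblib.dump(model, model_path)

# Later, e.g. at app startup: reload and predict without retraining.
restored = joblib.load(model_path)
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Only load joblib files you created yourself, since the format is pickle-based and can execute code on load.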

Model Performance

On the held-out 30% test split, our logistic regression model achieved:

  • Accuracy: 98%
  • Precision: High precision for both classes
  • Recall: Excellent recall for malignant cases
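Per-class precision and recall can be read off a confusion matrix and a classification report. This is a minimal sketch, assuming scikit-learn's bundled copy of the dataset and a scaler-plus-classifier pipeline rather than the exact variables used in this post, so its numbers may differ from the ones reported here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # recode so that 1 = malignant, as in this post

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, target_names=["benign", "malignant"]))
```

For a screening tool, the bottom-left cell (malignant cases predicted as benign) is the one to watch most closely, which is why recall on the malignant class matters more than overall accuracy.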

Medical Disclaimer & Ethics

Important: This application is for educational purposes only. Key considerations:

  • Always consult qualified healthcare professionals
  • AI should augment, not replace, medical expertise
  • Consider bias in training data
  • Ensure patient data privacy and security
  • Retrain and validate the model regularly

Deployment Options

Local Development

streamlit run app.py

Streamlit Cloud

  1. Push code to GitHub
  2. Connect repository to Streamlit Cloud
  3. Deploy with one click

Docker Deployment

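FROM python:3.9-slim
WORKDIR /app
# Install dependencies first so this layer is cached between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Streamlit's default port
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]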

Future Enhancements

Model Improvements

  • Ensemble Methods: Random Forest, XGBoost
  • Deep Learning: Neural networks for complex patterns
  • Feature Engineering: Automated feature selection
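Before switching to a more complex model, it helps to compare candidates under the same cross-validation scheme. The sketch below does that for logistic regression and a random forest. It assumes scikit-learn's bundled copy of the dataset, and the hyperparameters are arbitrary starting points rather than tuned values:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# cv=5 uses stratified 5-fold splits for classifiers.
mean_accuracy = {
    name: cross_val_score(estimator, X, y, cv=5, scoring="accuracy").mean()
    for name, estimator in candidates.items()
}
for name, score in mean_accuracy.items():
    print(f"{name}: {score:.3f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the comparison is free of the leakage that scaling the full dataset up front would introduce.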

Application Features

  • Multi-language Support: Reach global healthcare providers
  • API Integration: Connect with hospital systems
  • Mobile App: Native iOS/Android applications
  • Real-time Monitoring: Track model performance

Advanced Analytics

  • Explainable AI: SHAP values for feature importance
  • Uncertainty Quantification: Confidence intervals
  • Bias Detection: Fairness across demographic groups
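As a small example of uncertainty quantification, a percentile bootstrap over the test set gives a rough confidence interval for accuracy. This is a sketch under stated assumptions (scikit-learn's bundled dataset, 2,000 resamples, a fixed seed), and it only captures variability from the test sample, not from retraining the model:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Boolean array: True where the prediction for a test sample is correct.
correct = model.predict(X_test) == y_test

# Resample the test set with replacement and recompute accuracy each time.
rng = np.random.default_rng(0)
boot_acc = [
    correct[rng.integers(0, len(correct), size=len(correct))].mean()
    for _ in range(2000)
]
ci_low, ci_high = np.percentile(boot_acc, [2.5, 97.5])
print(f"Accuracy: {correct.mean():.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

With only 171 test samples the interval is fairly wide, which is a useful reminder that a single accuracy figure on this dataset carries real uncertainty.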

Key Takeaways

  1. Data Quality Matters: Clean, well-preprocessed data is crucial
  2. Model Simplicity: Logistic regression can be highly effective
  3. User Experience: Medical applications need intuitive interfaces
  4. Validation is Critical: Rigorous testing ensures reliability
  5. Ethical Considerations: Always prioritize patient safety

Technical Stack Summary

  • Data Science: pandas, numpy, scikit-learn
  • Visualization: matplotlib, seaborn, plotly
  • Web Framework: Streamlit
  • Deployment: Streamlit Cloud, Docker
  • Version Control: Git, GitHub

Conclusion

Building this breast cancer prediction system taught me the importance of combining technical excellence with ethical responsibility. Machine learning in healthcare requires not just accurate models, but also thoughtful user experience design and careful consideration of real-world implications.

The project demonstrates how modern tools like Streamlit can democratize AI deployment, making sophisticated machine learning models accessible to healthcare professionals without extensive technical backgrounds.

Remember: the goal isn't to replace medical professionals, but to give them tools that can help save lives through earlier detection and better diagnostic accuracy.


Have you built similar healthcare ML applications? What challenges did you face? Share your experiences in the comments below!


If you found this helpful, please give it a ❤️ and consider following for more AI and machine learning content!
