Anshika

Posted on Jul 7

Building a Breast Cancer Prediction App with Machine Learning and Streamlit

Medical AI is revolutionizing healthcare, and machine learning models are becoming powerful tools for early disease detection. In this comprehensive tutorial, I'll walk you through building a complete breast cancer prediction system using the Wisconsin Breast Cancer dataset.

What We'll Build

By the end of this tutorial, you'll have:

A fully trained logistic regression model for cancer prediction
An interactive Streamlit web application
Comprehensive exploratory data analysis
A complete GitHub repository ready for deployment

Live Demo: Streamlit app

GitHub Repository: House Price Prediction

Understanding the Dataset

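The Wisconsin Breast Cancer (Diagnostic) dataset contains 569 samples with 30 numeric features each, computed from digitized images of fine needle aspirates (FNA) of breast masses. The features are the mean, standard error (SE), and worst value of ten cell-nucleus measurements such as radius, perimeter, and concavity. Each sample is classified as either: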

Benign (B): Non-cancerous tumor
Malignant (M): Cancerous tumor
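
To check these numbers yourself, scikit-learn ships a bundled copy of the same dataset. The following is a minimal sketch, assuming scikit-learn is installed. Note that this copy encodes the target as 0 = malignant and 1 = benign (the opposite of the B/M mapping used later in this tutorial) and uses space-separated feature names such as `worst perimeter`.

```python
from sklearn.datasets import load_breast_cancer

# Bundled copy of the Wisconsin Diagnostic Breast Cancer data.
data = load_breast_cancer()
X, y = data.data, data.target

n_samples, n_features = X.shape

# In scikit-learn's copy: target 0 = malignant, 1 = benign.
n_malignant = int((y == 0).sum())
n_benign = int((y == 1).sum())

print(f"samples: {n_samples}, features: {n_features}")
print(f"benign: {n_benign} ({n_benign / n_samples:.0%})")
print(f"malignant: {n_malignant} ({n_malignant / n_samples:.0%})")
```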

Setting Up the Environment

First, let's set up our development environment:

# Create virtual environment
python -m venv breast_cancer_env
source breast_cancer_env/bin/activate  # On Windows: breast_cancer_env\Scripts\activate

# Install required packages
pip install pandas numpy scikit-learn matplotlib seaborn streamlit plotly joblib

Exploratory Data Analysis

The first step in any machine learning project is understanding your data. Here's what we discovered:

Key Insights:

Dataset Balance: ~63% benign, ~37% malignant cases
Feature Correlations: Strong correlations between mean, SE, and worst values of the same measurements
Distinguishing Features: concave_points_worst, perimeter_worst, and concave_points_mean show the highest correlation with malignancy
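
The feature ranking can be reproduced by correlating every feature with the malignancy label. This is a sketch under one assumption: it uses scikit-learn's bundled copy of the dataset, where the columns are named `worst concave points`, `worst perimeter`, and so on instead of `concave_points_worst`, and where the target has to be flipped so that 1 means malignant.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns a pandas DataFrame with a 'target' column.
frame = load_breast_cancer(as_frame=True).frame

# scikit-learn encodes benign as 1; flip it so that 1 = malignant,
# matching the B -> 0, M -> 1 mapping used in this tutorial.
malignant = 1 - frame["target"]

# Pearson correlation of every feature with the malignancy label,
# sorted from most to least positively correlated.
correlations = (
    frame.drop(columns="target")
    .corrwith(malignant)
    .sort_values(ascending=False)
)

top_features = correlations.head(3)
print(top_features)
```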

Visualization Highlights:

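import matplotlib.pyplot as plt
import seaborn as sns

# Target variable distribution (df is the DataFrame loaded from the dataset CSV)
df['diagnosis'].value_counts().plot(kind='bar')
plt.title('Distribution of Diagnosis')
plt.show()

# Correlation matrix (numeric columns only; 'diagnosis' is still a string at this point)
sns.heatmap(df.corr(numeric_only=True), annot=False, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()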

Building the Machine Learning Model

Data Preprocessing

from sklearn.preprocessing import StandardScaler

# Convert diagnosis to binary (0 = benign, 1 = malignant)
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})

# Separate features and target
X = df.drop(['diagnosis', 'id'], axis=1)
y = df['diagnosis']

# Feature scaling (zero mean, unit variance per feature)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Model Training

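from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split the data (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)

# Train logistic regression
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = lr.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')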
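
One caveat about the preprocessing step: the scaler is fitted on the full dataset before the split, so the test rows influence the scaling statistics. A common alternative is to wrap the scaler and the model in a pipeline, so the scaler is fitted on the training rows only. The sketch below assumes scikit-learn's bundled copy of the dataset stands in for the CSV; the `stratify` argument and the `max_iter` value are illustrative choices, not part of the original code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # scikit-learn uses 1 = benign; flip so 1 = malignant

# Split first, on the raw (unscaled) features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# The pipeline fits the scaler on X_train only, then reuses those
# statistics to transform X_test at prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.2f}")
```

A side benefit is that the fitted pipeline is a single object, so it can be saved and loaded as one unit.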

Model Performance

On the held-out test set (30% of the data), the logistic regression model achieved:

Accuracy: 98%
Precision: High precision for both classes
Recall: Excellent recall for malignant cases
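
Precision and recall become concrete with a confusion matrix and a per-class report. This is a sketch, again assuming scikit-learn's bundled copy of the dataset. The exact numbers depend on the split and the library version, so none are quoted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # flip so 0 = benign, 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes (order: benign, malignant).
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(classification_report(y_test, y_pred, target_names=["benign", "malignant"]))

# Recall on the malignant class: the share of true malignant cases the model catches.
malignant_recall = recall_score(y_test, y_pred, pos_label=1)
print(f"Malignant recall: {malignant_recall:.2f}")
```

For a screening-style tool, malignant recall is usually the number to watch, because a missed malignant case costs more than a false alarm.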

Medical Disclaimer & Ethics

Important: This application is for educational purposes only. Key considerations:

Always consult qualified healthcare professionals
AI should augment, not replace, medical expertise
Consider bias in training data
Ensure patient data privacy and security
Retrain and validate the model regularly

Deployment Options

Local Development

streamlit run app.py
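
The command above expects an `app.py` in the current directory. As a hypothetical minimal sketch (not the app behind the live demo), such a file could look like the following. It assumes scikit-learn's bundled copy of the dataset, retrains the model at startup instead of loading a saved one, and exposes only three input fields while holding the other features at their dataset means. The names `INPUT_FEATURES`, `train_model`, and `predict_malignant_probability` are made up for illustration.

```python
# app.py -- hypothetical minimal sketch, not the app behind the live demo
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: only these three features are exposed as inputs.
INPUT_FEATURES = ["worst concave points", "worst perimeter", "mean concave points"]


def train_model():
    """Fit scaler + logistic regression on the bundled dataset (1 = malignant)."""
    data = load_breast_cancer()
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(data.data, 1 - data.target)
    return model, list(data.feature_names), data.data.mean(axis=0)


def predict_malignant_probability(model, feature_names, baseline, user_values):
    """Overwrite the baseline row with the user's values and score it."""
    row = baseline.copy()
    for name, value in user_values.items():
        row[feature_names.index(name)] = value
    return float(model.predict_proba(row.reshape(1, -1))[0, 1])


def main():
    import streamlit as st  # imported here so the helpers work without Streamlit

    st.title("Breast Cancer Prediction (educational demo)")
    model, feature_names, baseline = train_model()

    # One numeric input per exposed feature, pre-filled with the dataset mean.
    user_values = {
        name: st.number_input(name, value=float(baseline[feature_names.index(name)]))
        for name in INPUT_FEATURES
    }
    if st.button("Predict"):
        p = predict_malignant_probability(model, feature_names, baseline, user_values)
        st.write(f"Estimated probability of malignancy: {p:.1%}")


if __name__ == "__main__":
    main()
```

Keeping the prediction logic in plain functions, separate from the Streamlit calls, makes it possible to unit-test the model code without starting the web app.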

Streamlit Cloud

Push code to GitHub
Connect repository to Streamlit Cloud
Deploy with one click

Docker Deployment

FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

Future Enhancements

Model Improvements

Ensemble Methods: Random Forest, XGBoost
Deep Learning: Neural networks for complex patterns
Feature Engineering: Automated feature selection
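
As a starting point for the ensemble idea, the sketch below compares the logistic regression baseline with a random forest using 5-fold cross-validation. It assumes scikit-learn's bundled copy of the dataset; the hyperparameters are untuned, illustrative values, and XGBoost is left out because it is a separate package.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # flip so 1 = malignant

candidates = {
    # Logistic regression needs scaled inputs; the forest does not.
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Same stratified folds for both models so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```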

Application Features

Multi-language Support: Reach global healthcare providers
API Integration: Connect with hospital systems
Mobile App: Native iOS/Android applications
Real-time Monitoring: Track model performance

Advanced Analytics

Explainable AI: SHAP values for feature importance
Uncertainty Quantification: Confidence intervals
Bias Detection: Fairness across demographic groups
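
For the confidence-interval idea, one simple option is a percentile bootstrap over the test set. The sketch below assumes scikit-learn's bundled copy of the dataset and 2,000 resamples (an arbitrary choice). It quantifies uncertainty in the reported test accuracy, not in individual predictions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # flip so 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 1.0 where the prediction is right, 0.0 where it is wrong.
correct = (model.predict(X_test) == y_test).astype(float)
accuracy = correct.mean()

# Resample the hit/miss indicators with replacement and recompute accuracy.
rng = np.random.default_rng(42)
n = len(correct)
boot_acc = np.array([
    correct[rng.integers(0, n, size=n)].mean() for _ in range(2000)
])

# The 2.5th and 97.5th percentiles bound a 95% interval.
lower, upper = np.percentile(boot_acc, [2.5, 97.5])
print(f"Accuracy: {accuracy:.3f} (95% bootstrap CI: {lower:.3f} to {upper:.3f})")
```

With only 171 test rows the interval is fairly wide, which is a useful reminder not to over-read a single accuracy figure.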

Key Takeaways

Data Quality Matters: Clean, well-preprocessed data is crucial
Model Simplicity: Logistic regression can be highly effective
User Experience: Medical applications need intuitive interfaces
Validation is Critical: Rigorous testing ensures reliability
Ethical Considerations: Always prioritize patient safety

Technical Stack Summary

Data Science: pandas, numpy, scikit-learn
Visualization: matplotlib, seaborn, plotly
Web Framework: Streamlit
Deployment: Streamlit Cloud, Docker
Version Control: Git, GitHub

Conclusion

Building this breast cancer prediction system taught me the importance of combining technical excellence with ethical responsibility. Machine learning in healthcare requires not just accurate models, but also thoughtful user experience design and careful consideration of real-world implications.

The project demonstrates how modern tools like Streamlit can democratize AI deployment, making sophisticated machine learning models accessible to healthcare professionals without extensive technical backgrounds.

Remember: the goal isn't to replace medical professionals, but to provide them with powerful tools that can help save lives through early detection and improved diagnosis accuracy.

Have you built similar healthcare ML applications? What challenges did you face? Share your experiences in the comments below!

If you found this helpful, please give it a ❤️ and consider following for more AI and machine learning content!

DEV Community