# How I Built an AI System to Reduce Healthcare No-Shows Using Flask, Random Forest & SimPy
*A walkthrough of my final year project, from problem statement to working simulation*
## The Problem I Wanted to Solve
Anyone who has visited a clinic knows the frustration — long wait times, overbooked doctors, and yet somehow, empty slots because patients didn't show up.
No-shows are one of the biggest inefficiencies in healthcare. Clinics lose revenue, doctors waste time, and patients who actually need that slot can't get it.
I wanted to build something that tackles this with a data-driven approach. The result: an AI-Based Healthcare Appointment Scheduling Optimization System — my final year project built with Python, Flask, scikit-learn, and SimPy.
Here's how I built it, what I learned, and what I'd do differently.
## What the System Does
At its core, the system does three things:
- Predicts which patients are likely to miss their appointment (no-show prediction)
- Uses that prediction to assign slots smartly (priority-based scheduling)
- Simulates a full clinic day to prove the approach actually works (SimPy simulation)
There are two portals:
- A Patient Portal where patients register, book appointments, and see their no-show risk
- An Admin Dashboard where clinic staff manage doctors, generate slots, and run simulations
## Tech Stack
| Layer | Technology |
|---|---|
| Backend | Python 3.11, Flask 3.0 |
| Database | SQLite + SQLAlchemy ORM |
| Machine Learning | scikit-learn (Random Forest) |
| Simulation | SimPy (Discrete-Event) |
| Frontend | Bootstrap 5, Chart.js |
| Data | pandas, numpy |
## Part 1: The No-Show Predictor
This is the heart of the project.
I trained a Random Forest Classifier to predict the probability that a patient will miss their appointment. The model outputs a score between 0 and 1, which I then bucket into three risk levels:
- LOW — probability < 40%
- MEDIUM — probability between 40–70%
- HIGH — probability ≥ 70%
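The bucketing step is simple to sketch. `risk_tier` is an illustrative name, not necessarily what the project calls it:

```python
def risk_tier(prob: float) -> str:
    """Map a no-show probability (0-1) to the three risk tiers above."""
    if prob >= 0.70:
        return "HIGH"
    elif prob >= 0.40:
        return "MEDIUM"
    return "LOW"
```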
Features used:
- previous_no_shows (how many times they've missed before)
- days_until_appointment (further away = higher risk)
- appointment_hour (early morning slots have higher no-show rates)
- day_of_week (Mondays and Fridays are worse)
- age
- gender
- reminder_sent (did they get a reminder?)
- distance_km (how far they live from the clinic)
Model config:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    class_weight='balanced',  # important — no-shows are a minority class
)
```
I used class_weight='balanced' because no-shows are naturally less common than shows. Without this, the model would just learn to predict "will show up" for everyone and get high accuracy while being useless.
Training data:
I generated 1,200 synthetic patient records using a custom generate_data.py script. Obviously, real hospital data would be better — but for a final year project, synthetic data with realistic distributions works well enough to demonstrate the concept.
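The post doesn't show the generator itself, but a minimal sketch of what a script like `generate_data.py` might do looks like this. The column names follow the feature list above; the distributions and coefficients are my assumptions, chosen only so the label correlates with the features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1200  # matches the 1,200 synthetic records mentioned above

df = pd.DataFrame({
    "previous_no_shows":      rng.poisson(0.8, n),
    "days_until_appointment": rng.integers(0, 31, n),
    "appointment_hour":       rng.integers(8, 18, n),
    "day_of_week":            rng.integers(0, 7, n),
    "age":                    rng.integers(18, 85, n),
    "gender":                 rng.integers(0, 2, n),
    "reminder_sent":          rng.integers(0, 2, n),
    "distance_km":            rng.uniform(0.5, 40, n).round(1),
})

# Give the label realistic structure: more past no-shows, longer lead
# times, and longer distances raise risk; reminders lower it.
logit = (0.6 * df["previous_no_shows"]
         + 0.04 * df["days_until_appointment"]
         + 0.03 * df["distance_km"]
         - 0.8 * df["reminder_sent"]
         - 2.0)
df["no_show"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
```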
## Part 2: Priority-Based Slot Allocation
Once I have the no-show probability, I use it to compute a priority score for each booking request:
Score = 0.5 × (urgency / 5) + 0.3 × (wait_days / 30) + 0.2 × (1 - no_show_prob)
Breaking this down:
- Urgency (50% weight) — a patient with a critical condition gets priority
- Wait time (30% weight) — patients waiting longer get bumped up
- Reliability (20% weight) — lower no-show probability = more trustworthy booking
The system then assigns the highest-priority patient to the best available slot.
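As a sketch, the scoring formula translates directly into a small function. `priority_score` is an illustrative name, and capping `wait_days` at 30 is my assumption so that term stays in [0, 1]:

```python
def priority_score(urgency: int, wait_days: int, no_show_prob: float) -> float:
    """Weighted score from the formula above; higher means scheduled first.

    urgency is on a 1-5 scale; wait_days is capped at 30 days.
    """
    return (0.5 * (urgency / 5)
            + 0.3 * (min(wait_days, 30) / 30)
            + 0.2 * (1 - no_show_prob))
```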
### Overbooking Strategy
This is where it gets interesting. Based on the risk tier:
- HIGH risk (≥70%): The slot stays open after booking — another patient can fill it if needed
- MEDIUM risk (40–70%): Booked normally, but a reminder flag is set
- LOW risk (<40%): Normal booking, slot is closed
This is a simplified version of how airlines overbook flights — except here, we're trying to ensure sick people actually get seen, not maximize revenue.
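A minimal sketch of how those tiers could map onto scheduler actions. The function and flag names are illustrative, not taken from the project:

```python
def slot_policy(no_show_prob: float) -> dict:
    """Decide how a slot is handled based on the booking's risk tier."""
    if no_show_prob >= 0.70:
        # HIGH: slot stays open after booking so another patient can fill it
        return {"tier": "HIGH", "keep_slot_open": True, "send_reminder": False}
    if no_show_prob >= 0.40:
        # MEDIUM: booked normally, but flagged for a reminder
        return {"tier": "MEDIUM", "keep_slot_open": False, "send_reminder": True}
    # LOW: normal booking, slot is closed
    return {"tier": "LOW", "keep_slot_open": False, "send_reminder": False}
```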
## Part 3: SimPy Simulation
The ML model tells us who is likely to no-show. But does the overall strategy actually improve clinic efficiency? That's where SimPy comes in.
SimPy is a Python library for discrete-event simulation. I used it to simulate an entire 8-hour clinic day.
What the simulation models:
- Patients arriving at scheduled times
- Doctors processing appointments (with variable duration)
- No-shows happening at a defined rate
- Queue buildup and wait times
Comparing baseline vs. optimized:
| Metric | Baseline | AI-Optimized |
|---|---|---|
| No-show rate | 25% | ~10% effective |
| Avg wait time | Higher | Lower |
| Doctor utilization | Lower | Higher |
| Patients seen | Fewer | More |
The simulation confirms that the overbooking + priority strategy meaningfully improves throughput and reduces wasted slots.
## Project Structure
```
healthcare_scheduler/
├── app.py                     # Main Flask application
├── config.py
├── seed_db.py                 # Populates DB with sample data
├── RUN_PROJECT.bat            # One-click Windows launcher
│
├── ai_modules/
│   ├── no_show_predictor.py   # Random Forest model
│   ├── scheduler.py           # Priority slot allocator
│   └── simulation.py          # SimPy simulation
│
├── models/                    # SQLAlchemy DB models
├── routes/                    # Flask API endpoints
├── templates/                 # HTML templates
└── data/
    └── generate_data.py       # Synthetic dataset generator
```
## How to Run It Locally (Windows)
**Option 1:** Just double-click `RUN_PROJECT.bat`. It handles everything automatically.

**Option 2:** Manual setup:

```bash
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python data\generate_data.py            # generate the synthetic dataset
python ai_modules\no_show_predictor.py  # train the Random Forest model
python seed_db.py                       # populate the database
python app.py                           # start the Flask server
```
Then open `http://127.0.0.1:5000` in your browser.
Demo credentials:
- Patient: `ravi@mail.com` / `pass123`
- Admin: `http://127.0.0.1:5000/admin/`
---
## What I Learned
1. The ML pipeline is the easy part.
Training the model took a few hours. Getting Flask, SQLAlchemy, and the ML model to work together cleanly took much longer. Integration is where real projects live.
2. Synthetic data has real limits.
My model performs well on my synthetic test set. Whether it would hold up on real patient data is a completely different question. Real-world class imbalance, missing values, and biases would make this much harder.
3. SimPy is underrated.
Most developers have never heard of discrete-event simulation. But for modeling anything with queues, arrivals, and service times — clinics, call centers, manufacturing lines — SimPy is incredibly powerful and worth learning.
4. `class_weight='balanced'` matters.
Before I added this, my model had 85% accuracy but was nearly useless — it just predicted "will show up" every time. Balanced class weights fixed this. Always check your class distribution before celebrating accuracy scores.
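A tiny demonstration of the trap, with made-up numbers (15% no-shows here, purely for illustration): a model that always predicts "will show up" scores 85% accuracy while catching zero no-shows.

```python
import numpy as np

# 1,000 appointments where only 15% are no-shows (class 1)
y_true = np.array([1] * 150 + [0] * 850)

# A "model" that always predicts "will show up" (class 0)
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()         # 0.85, looks respectable
recall_no_show = y_pred[y_true == 1].mean()  # 0.0, catches no no-shows at all

print(accuracy, recall_no_show)
```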
---
## What I'd Improve With More Time
- **Real dataset** — The [Kaggle Healthcare No-Show dataset](https://www.kaggle.com/joniarroba/noshowappointments) has 110,000 real records. Training on that would make the model actually meaningful.
- **Cross-validation & hyperparameter tuning** — I used defaults mostly. GridSearchCV would squeeze more performance out of the model.
- **Better features** — Weather on appointment day, insurance type, appointment type (follow-up vs. new patient) are all predictive in research literature.
- **Deploy it** — Currently Windows-only. Dockerizing it and deploying to Render or Railway would make it actually accessible.
- **Send real reminders** — Right now the "reminder_sent" flag is manual. Integrating Twilio or email would make the overbooking strategy actually work end-to-end.
---
## GitHub
The full source code is here: **https://github.com/ManishKumar981/-healthcare-scheduler**
If you found this useful, a ⭐ on the repo goes a long way!
---
*Thanks for reading. If you have questions about the ML approach, the SimPy simulation, or the Flask architecture — drop them in the comments. Happy to discuss.*