<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: nandhu_sauce</title>
    <description>The latest articles on DEV Community by nandhu_sauce (@saucynandhu).</description>
    <link>https://dev.to/saucynandhu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895947%2Ffa9a7b6a-3df6-485d-8e6f-762749df6d68.jpeg</url>
      <title>DEV Community: nandhu_sauce</title>
      <link>https://dev.to/saucynandhu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saucynandhu"/>
    <language>en</language>
    <item>
      <title>I Built a Pneumonia Detection AI on My MacBook — Here's Exactly How It Works</title>
      <dc:creator>nandhu_sauce</dc:creator>
      <pubDate>Fri, 24 Apr 2026 11:37:31 +0000</pubDate>
      <link>https://dev.to/saucynandhu/i-built-a-pneumonia-detection-ai-on-my-macbook-heres-exactly-how-it-works-47nn</link>
      <guid>https://dev.to/saucynandhu/i-built-a-pneumonia-detection-ai-on-my-macbook-heres-exactly-how-it-works-47nn</guid>
      <description>&lt;p&gt;I just finished building a deep learning system that identifies pneumonia from chest X-rays with 96% accuracy and an AUC-ROC of 0.99. I ran the entire training process on my MacBook Pro M5 using the GPU acceleration provided by Apple's Metal Performance Shaders (MPS). This project wasn't about complex math; it was about using transfer learning to turn a consumer laptop into a medical diagnostic tool. Understanding how to handle messy data and verify what an AI is actually "seeing" is more important than having a massive server room.&lt;/p&gt;

&lt;h2&gt;WHAT I ACTUALLY BUILT&lt;/h2&gt;

&lt;p&gt;At its core, I built a program that acts as a second pair of eyes for doctors. You feed it a digital chest X-ray, and in less than a second it tells you whether it sees signs of pneumonia or a normal, healthy lung. It doesn't just guess; it attaches a confidence percentage to every decision. The goal is to catch cases that are subtle to the human eye and reduce diagnostic errors.&lt;/p&gt;

&lt;h2&gt;THE DATASET PROBLEM NOBODY MENTIONS&lt;/h2&gt;

&lt;p&gt;When I first looked at the data, I found a massive problem: class imbalance. The training set had 1,341 "Normal" images but a whopping 3,875 "Pneumonia" images. If I had trained the model as-is, it would have quickly learned that "guessing pneumonia every time" results in 74% accuracy without actually learning a single thing about lungs. This is a common trap in AI where a high accuracy score hides a completely useless model.&lt;/p&gt;

&lt;p&gt;To fix this, I used two specific techniques to level the playing field. First, I implemented a &lt;code&gt;WeightedRandomSampler&lt;/code&gt;, which you can think of as "photocopying rare examples." It ensures that during training, the model sees the "Normal" images more frequently so it doesn't forget what a healthy lung looks like. Second, I added class weights to the loss function—essentially a "bigger penalty for missing rare cases." If the model misclassifies a "Normal" lung as pneumonia, the error signal it receives is mathematically amplified, forcing it to pay closer attention to those specific patterns.&lt;/p&gt;
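&lt;p&gt;Here is a minimal PyTorch sketch of both fixes. The label list is reconstructed from the class counts above, and the exact variable names are illustrative rather than my project's actual code:&lt;/p&gt;

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Illustrative label list: 0 = NORMAL (1,341 images), 1 = PNEUMONIA (3,875 images)
labels = [0] * 1341 + [1] * 3875

# Inverse-frequency weight per class: the rarer class gets the larger weight
class_counts = torch.bincount(torch.tensor(labels))   # tensor([1341, 3875])
class_weights = 1.0 / class_counts.float()

# One weight per sample -- this is the "photocopying rare examples" step
sample_weights = class_weights[torch.tensor(labels)]  # shape: (5216,)

# Pass this sampler to the training DataLoader to oversample NORMAL images
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)

# And the "bigger penalty for missing rare cases" step: weight the loss itself
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```

Either technique alone helps; using both, as described above, rebalances what the model sees and how hard it is punished for mistakes on the minority class.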

&lt;h2&gt;WHY I DIDN'T TRAIN FROM SCRATCH&lt;/h2&gt;

&lt;p&gt;I didn't start with a blank slate. Instead, I used a technique called transfer learning with an architecture known as ResNet-18. Think of it like this: a radiologist who already knows what edges, textures, and shapes look like doesn't need to relearn basic vision from scratch. They already have the "visual foundation" from years of looking at the world; they just need to learn what sick lungs look like specifically. ResNet-18 comes pre-trained on millions of everyday images (like dogs, cars, and trees), so it already understands how to detect lines and textures.&lt;/p&gt;

&lt;p&gt;I used a two-phase training strategy to refine this pre-existing knowledge. In Phase 1 (Epochs 1-5), I froze most of the model and only trained the very last layer. This allowed the model to get a "feel" for the new medical data without overwriting its basic visual skills. In Phase 2 (Epochs 6-20), I unfroze everything and used a tiny learning rate to fine-tune the entire network. At the start of Phase 2, I saw a temporary spike in the loss curve. This is normal and expected—it's the mathematical equivalent of the model being slightly "confused" as it starts adjusting its deep-seated visual patterns to the nuances of X-ray tissue.&lt;/p&gt;

&lt;h2&gt;CAN YOU TRUST IT? GRAD-CAM EXPLAINABILITY&lt;/h2&gt;

&lt;p&gt;A 96% accurate AI is useless if it's a "black box" you can't verify. In medical AI, you have to know &lt;em&gt;why&lt;/em&gt; a decision was made. I implemented Grad-CAM (Gradient-weighted Class Activation Mapping), a technique that highlights which regions of the image the model relied on when it made its decision. Without this check, a model could hit 99% accuracy simply by learning that images carrying a certain hospital's patient ID label in the corner are usually the pneumonia cases.&lt;/p&gt;

&lt;p&gt;When I ran Grad-CAM on my results, the heatmaps were revealing. For pneumonia cases, the "heat" (red and yellow zones) was concentrated directly on the lung tissue where opacities usually appear. For normal cases, the focus was much more diffuse across the entire chest cavity. This gave me the confidence that the model was actually learning medical features, not just memorizing background noise or image artifacts.&lt;/p&gt;

&lt;h2&gt;THE RESULTS&lt;/h2&gt;

&lt;p&gt;After 20 epochs and about 45 minutes of training on my MacBook, here is how the system performed on the 624 images in the test set:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Test Accuracy&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUC-ROC&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NORMAL F1&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PNEUMONIA F1&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False Negatives&lt;/td&gt;
&lt;td&gt;15/390&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False Positives&lt;/td&gt;
&lt;td&gt;13/234&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let's break these down into plain English. &lt;strong&gt;Test Accuracy&lt;/strong&gt; means the model was right 96 times out of 100 overall. &lt;strong&gt;AUC-ROC&lt;/strong&gt; measures how well the model can distinguish between the two classes across different confidence levels; 0.99 is nearly perfect separation. &lt;strong&gt;F1 Score&lt;/strong&gt; is a balanced average of precision (not flagging healthy people as sick) and recall (not missing sick people). &lt;/p&gt;

&lt;p&gt;Most importantly, we have to look at the &lt;strong&gt;15 missed pneumonia cases&lt;/strong&gt; (False Negatives). In a clinical context, missing a sick patient is far worse than accidentally flagging a healthy one. This is why "Recall" matters more than "Accuracy" in medical AI. While 15 misses out of 390 is low, it highlights that this system is a diagnostic assistant, not a replacement for a human doctor who would catch those edge cases.&lt;/p&gt;
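&lt;p&gt;You can re-derive these scores from the confusion matrix implied by the table, which is a good sanity check on any reported metric:&lt;/p&gt;

```python
# Confusion matrix from the table: 390 pneumonia images with 15 misses,
# 234 normal images with 13 false alarms
tp, fn = 390 - 15, 15   # pneumonia caught / missed
tn, fp = 234 - 13, 13   # normal correctly cleared / wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of flagged cases, how many were really sick
recall = tp / (tp + fn)      # of sick patients, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} pneumonia_f1={f1:.3f}")
# accuracy ~0.955, recall ~0.962, pneumonia F1 ~0.964 -- matching the table
```

Notice that recall (0.96) is the number a clinician would care about most, for exactly the reason given above: it is the fraction of sick patients the system did not miss.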

&lt;h2&gt;WHAT I LEARNED&lt;/h2&gt;

&lt;p&gt;This project taught me a few genuine technical lessons that go beyond the usual tutorials:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;The Validation Set Trap&lt;/strong&gt;: The original dataset only had 16 images in the validation folder. This made the validation accuracy bounce around wildly and become meaningless during training. You need a representative validation set to know if your model is actually improving.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Watch the Loss, Not the Accuracy&lt;/strong&gt;: Accuracy is a "lagging indicator." The loss curve tells you the "quality" of the model's learning. If the loss is still going down but accuracy is flat, you're still making progress.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Grad-CAM is Mandatory&lt;/strong&gt;: For medical AI, explainability isn't a "nice to have." It's the difference between a useful tool and a legal liability. If you can't see the heatmaps, you shouldn't trust the predictions.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Apple Silicon is Ready&lt;/strong&gt;: Training this on MPS (Metal Performance Shaders) was surprisingly fast. For this size of workload, you don't need a dedicated Linux server with a massive GPU; a modern MacBook Pro handles it in under an hour.&lt;/li&gt;
&lt;/ol&gt;
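&lt;p&gt;On that last point, targeting MPS in PyTorch is a one-time device check; the rest of the training code is identical to what you would write for CUDA. A minimal sketch (the tiny &lt;code&gt;Linear&lt;/code&gt; model is just a placeholder):&lt;/p&gt;

```python
import torch

# Prefer Apple's Metal backend when available, then CUDA, then CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# Models and tensors move to the chosen backend the same way as on CUDA
model = torch.nn.Linear(10, 2).to(device)
```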

&lt;h2&gt;WHAT'S NEXT&lt;/h2&gt;

&lt;p&gt;I'm not finished with this system yet. My next steps are specific:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fix the Validation Set&lt;/strong&gt;: I'm going to move about 500 images from the training set into the validation set to get more reliable feedback during training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try DenseNet-121&lt;/strong&gt;: This architecture is the current gold standard in chest X-ray research papers because of how it handles feature reuse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build a Web UI&lt;/strong&gt;: I want to use Streamlit to create a simple drag-and-drop interface so anyone can test the model without looking at code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kaggle Notebook&lt;/strong&gt;: I've already published a self-contained version of this project as a Kaggle notebook for the community to play with.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;CLOSING&lt;/h2&gt;

&lt;p&gt;This project demonstrated that you don't need a supercomputer to build high-performing medical AI. By using transfer learning and being smart about how you handle imbalanced data, you can achieve professional-grade results on consumer hardware. It's a testament to how accessible deep learning has become, provided you focus on the data and the "why" behind the predictions.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: This model is for educational purposes only and is not intended for clinical use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.kaggle.com/code/sreenandhunair/pneumonia-detection-with-deep-learning-96-acc" rel="noopener noreferrer"&gt;View the full notebook on Kaggle&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/saucynandhu/medical-image-diagnosis" rel="noopener noreferrer"&gt;View the code on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
