Over the past few years, we’ve been building a mobile-first AI skin analysis system used by more than 1,000,000 users worldwide (excluding the USA and Canada). Unlike most research setups, this system operates on real-world smartphone images — not clinical data, but noisy, user-generated photos taken in uncontrolled conditions.
To date, we’ve processed millions of images, with a curated subset of a few hundred thousand used for training. A fixed validation set of ~27,000 real-world images has been used to track performance consistently across model versions.
This article isn’t about building a model from scratch. It’s about what actually works when you try to improve one in production — over years, not weeks.
1. A Fixed Validation Set Is More Valuable Than a Bigger One
One of the most important decisions we made was also one of the least exciting.
We stopped updating our validation dataset.
Every model version was evaluated on the same ~27k real-world images. No rebalancing, no cleaning, no improvements.
This made progress slower — but more honest.
When metrics improved, we knew it wasn’t because the test data got easier. It was because the model actually got better.
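The article doesn’t show how the set was kept frozen, but a minimal sketch of one way to enforce it is to pin a fingerprint of the validation image IDs and check it before every evaluation. The IDs and the `validation_set_fingerprint` helper below are hypothetical, not from the production system:

```python
import hashlib
import json

def validation_set_fingerprint(image_ids):
    """Hash the sorted list of validation image IDs so any accidental
    change to the set is caught before metrics are compared."""
    payload = json.dumps(sorted(image_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Illustrative IDs; the real set is ~27k production images.
val_ids = ["img_0001", "img_0002", "img_0003"]

# Pin the fingerprint once; every later evaluation asserts against it,
# so a "better" metric can never come from a quietly changed test set.
PINNED = validation_set_fingerprint(val_ids)
```

Because the IDs are sorted before hashing, the fingerprint is order-independent — only adding, removing, or renaming images changes it.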
2. More Data Stops Helping Faster Than You Think
We assumed that scaling data would continuously improve performance.
It didn’t.
Once we reached millions of images, the marginal gain from additional data dropped significantly.
The real improvement came from filtering:
- removing low-quality images
- reducing redundancy
- increasing representation of rare but important cases
In practice, a curated subset of a few hundred thousand images was more useful than the full dataset.
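A toy version of such a curation pass might look like the following. The sample schema, quality scores, and the relaxed threshold for rare cases are all illustrative assumptions, not the production logic:

```python
def curate(samples, min_quality=0.5):
    """Toy curation pass over the three filters above: drop exact
    duplicates, drop low-quality samples, and keep rare-but-important
    cases even at a lower quality bar (illustrative thresholds)."""
    seen_hashes = set()
    kept = []
    for s in samples:
        if s["hash"] in seen_hashes:
            continue  # redundancy: skip duplicate images
        # Rare cases get a relaxed quality threshold to preserve coverage.
        threshold = min_quality * 0.6 if s.get("rare") else min_quality
        if s["quality"] < threshold:
            continue  # low-quality image
        seen_hashes.add(s["hash"])
        kept.append(s)
    return kept
```

The point of the sketch is the ordering: deduplicate first, then filter on quality, while carving out an exception for rare cases so curation doesn’t erase exactly the examples that matter most.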
3. Garbage In, Garbage Out — So We Filter Before Inference
One thing we underestimated early on was how much of the input wouldn’t even be valid.
Not just low-quality images — completely irrelevant ones.
Users upload everything: blurred frames, partial shots, or images that don’t contain any useful signal at all. In practice, around 30–40% of raw user uploads had to be filtered out before reaching the model.
Instead of trying to make the model robust to everything, we introduced a preprocessing pipeline.
On-device, we run a lightweight object detector (initially YOLO-based, later replaced with a more optimized version) to localize regions of interest and automatically crop the relevant area. This helps standardize inputs without requiring perfect user behavior.
On the backend, we apply an additional relevance check. If an image doesn’t appear to contain skin, we don’t process it further and instead prompt the user to retake the photo.
For borderline cases, we attempt basic enhancement — denoising, sharpening, contrast adjustments. If the image becomes usable, it proceeds through the pipeline. If not, it is discarded.
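Put together, the gate described above can be sketched as a short decision function. The callables `detect_roi`, `looks_like_skin`, `enhance`, and `quality` are hypothetical stand-ins for the detector, the backend relevance check, and the enhancement step — a sketch of the control flow, not the real pipeline:

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    accepted: bool
    reason: str

def gate(image, detect_roi, looks_like_skin, enhance, quality,
         min_quality=0.5):
    """Pre-inference gate: crop to a region of interest, reject
    irrelevant images, and try enhancement once on borderline cases."""
    roi = detect_roi(image)
    if roi is None:
        return GateResult(False, "no region of interest")
    if not looks_like_skin(roi):
        return GateResult(False, "ask user to retake")
    if quality(roi) < min_quality:
        roi = enhance(roi)  # denoise / sharpen / adjust contrast
        if quality(roi) < min_quality:
            return GateResult(False, "unusable after enhancement")
    return GateResult(True, "ok")
```

Only images that pass every check reach the model, which is what lets the model itself stay unchanged while reliability improves.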
This step alone significantly improved the overall system reliability — not by changing the model, but by improving the input.
4. Real-World Images Break Simplified Assumptions
Most models are trained on clean, well-centered images.
Real users don’t behave like datasets.
Photos can include multiple objects, poor framing, inconsistent lighting, or irrelevant content. Treating the entire image as a single input often leads to unstable behavior.
Moving toward detection-based approaches — where the model focuses on specific regions — significantly improved real-world performance.
Not because it improved benchmarks immediately, but because it aligned the system with reality.
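The structural shift is simple: instead of one prediction per image, the system makes one prediction per detected region. A minimal sketch, with `detector` and `classifier` as hypothetical callables and a "box" reduced to a `(start, end)` span for illustration:

```python
def predict_regions(image, detector, classifier):
    """Detection-based inference: classify each detected region
    rather than the whole frame, so multiple objects, poor framing,
    or irrelevant background don't destabilize the prediction."""
    results = []
    for start, end in detector(image):
        crop = image[start:end]  # slicing stands in for a real 2-D crop
        results.append(((start, end), classifier(crop)))
    return results
```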
5. Optimizing One Metric Can Hurt the Product
Early versions of the model prioritized sensitivity.
This reduced missed cases — but increased false positives.
From a metrics perspective, this looked like progress.
From a product perspective, it created friction.
Over time, improving precision became just as important. The goal shifted from “detect everything” to “provide useful and trustworthy outputs.”
The key lesson:
Model quality is not defined by a single metric — but by how metrics interact.
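One concrete way to act on that lesson is to choose the operating threshold jointly rather than maximizing a single metric — for example, requiring a minimum precision and then taking the most sensitive threshold that satisfies it. This search procedure is a generic sketch, not the article’s method:

```python
def pick_threshold(scores_labels, min_precision=0.8):
    """Among thresholds that meet a minimum precision, pick the one
    with the highest sensitivity (recall).  scores_labels is a list
    of (score, label) pairs with label 1 = positive."""
    best = None  # (threshold, recall, precision)
    for t in sorted({s for s, _ in scores_labels}):
        tp = sum(1 for s, y in scores_labels if s >= t and y == 1)
        fp = sum(1 for s, y in scores_labels if s >= t and y == 0)
        fn = sum(1 for s, y in scores_labels if s < t and y == 1)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if precision >= min_precision and (best is None or recall > best[1]):
            best = (t, recall, precision)
    return best
```

Lowering `min_precision` recovers the original “detect everything” behavior; raising it trades missed cases for trust — the interaction the lesson is about.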
6. Better Models Don’t Always Win
We experimented with multiple architectures over time.
Some were more advanced. Some performed better in controlled settings.
But the biggest gains didn’t come from model upgrades.
They came from:
- better data selection
- more consistent labeling
- stable evaluation
In several cases, a simpler model trained on better data outperformed a more complex one trained on everything.
*Diversity of real-world data*
What Actually Makes a Production Model Better
Looking back, the improvements came from a combination of decisions that are often overlooked:
- keeping evaluation consistent
- focusing on data quality instead of volume
- aligning the model with real-world inputs
- balancing metrics instead of maximizing one
None of these are particularly novel.
But together, they made the system significantly more reliable.
Final Thought
If you’re building models on real-world user data — especially from mobile devices — your biggest challenge isn’t training.
It’s making sure your improvements are real.
One Open Question
One thing we’re still actively thinking about is where the optimal balance actually lies.
Should a system prioritize detecting as much as possible?
Or should it prioritize being trusted by the user?
In our experience, those two goals are not always aligned.
Full Breakdown & Demo
We published a full breakdown of the dataset, validation setup, and model evolution (with charts and metrics) here:
👉 https://skinive.com/skinive-accuracy2026/
If you want to see how this works in practice in a real skin analysis app, there’s also a demo available here:
👉 https://skinive.com/get-skinive/