I gave the same 6,497 wines to two models and asked them different questions

#showdev #python #machinelearning #datascience

Update (July 2026): the metrics below are from v1. Auditing this same pipeline for a university assessment, I found 1,177 exact duplicate rows leaking across my train/test split - the honest numbers are lower, and the story of finding them is at the end of this post.

Most ML tutorials stop at a notebook with a green R² cell and a shrug. I wanted to go one step further: take two models I'd actually trained, and turn them into something you can poke at — a typed API and a little web app anyone can open.

So I built sommelier-api: one dataset, two questions, two surfaces.

Two lenses on the same wine

The UCI Wine Quality dataset (Cortez et al., 2009) has 6,497 wines (1,599 red and 4,898 white), each with 11 physicochemical measurements (acidity, residual sugar, sulphates, alcohol…) and a quality score from 0 to 10 assigned by human tasters.

You can ask that data two different questions:

How good is this wine? — a regression problem. Predict the score.
Is this wine good? — a classification problem. High (≥6) or low (<6)?
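
The two questions differ only in the target derived from the taster score. A minimal sketch, assuming the threshold of 6 stated above; the function names are illustrative and not taken from the repo:

```python
def regression_target(quality: int) -> float:
    """Regression lens: the target is the taster score itself."""
    return float(quality)


def classification_target(quality: int, threshold: int = 6) -> int:
    """Classification lens: 1 for high (>= threshold), 0 for low."""
    return 1 if quality >= threshold else 0


# Made-up scores for illustration, not rows from the dataset.
scores = [3, 5, 6, 7, 9]
targets = [(regression_target(q), classification_target(q)) for q in scores]
```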

Same features, two lenses. I trained one model for each:

a RandomForestRegressor for the score, and
a tuned DecisionTreeClassifier for the grade.

The modelling (and one honest number)

Both models share the exact same 12 inputs — the 11 chemical readings plus an engineered wine_type flag (red=1, white=0) so a single model can see both colours.
Because both are tree-based, there's no feature scaling at inference — one of the small things that makes serving them clean.
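
As a rough sketch of what a shared feature builder could look like: the column names below are the ones used in the UCI CSVs, but the ordering and the function signature are assumptions, not the repo's actual FEATURE_ORDER or build_features().

```python
from typing import Dict, List

# Column names as they appear in the UCI CSVs. The order is an assumption;
# what matters is that training and serving use the same one.
CHEMICAL_FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
FEATURE_ORDER = CHEMICAL_FEATURES + ["wine_type"]


def build_features(wine: Dict[str, float], colour: str) -> List[float]:
    """Return the 12 model inputs in a fixed order (red=1, white=0)."""
    if colour not in ("red", "white"):
        raise ValueError(f"unknown colour: {colour!r}")
    row = dict(wine, wine_type=1.0 if colour == "red" else 0.0)
    # No scaling step: tree-based models split on raw thresholds.
    return [float(row[name]) for name in FEATURE_ORDER]
```

Keeping the order in one constant means a reordered CSV or a renamed form field fails loudly with a KeyError instead of silently feeding the wrong column to the model.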

[Diagram: the whole path, from two CSVs to two saved models.]

Here's the part most posts skip: the regressor's R² is about 0.50. It explains roughly half the variance in the scores. That's not a bug to hide — it's the nature of the problem. Wine quality is a subjective human judgement; there's a real ceiling on how well chemistry alone predicts a tasting panel. The classifier does better on its easier yes/no question — ROC-AUC ≈ 0.81 — but the honest framing matters more than a vanity metric. (It's also where the project gets its name: it can bottle about half the lab; the other half is human.)
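
For readers who want the definition behind that number, R² is one minus the ratio of the squared prediction error to the squared spread of the scores around their mean. A small stdlib sketch with made-up values (the function name is mine, not the repo's):

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # spread around the mean
    return 1.0 - ss_res / ss_tot


# Made-up scores for illustration.
actual = [5, 6, 7]
perfect = r_squared(actual, [5, 6, 7])       # every score predicted exactly
always_mean = r_squared(actual, [6, 6, 6])   # always guessing the average
```

A model that always guesses the average score lands at 0 and a perfect one at 1, so an R² of about 0.50 means the squared error is roughly half of what the always-guess-the-mean baseline would produce.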

What the models do agree on is what matters most: alcohol and volatile acidity dominate both — high alcohol and low volatile acidity track with better wine.

Turning models into a service

The interesting engineering isn't the .fit() call — it's everything around it. The repo is built around a framework-agnostic core that knows nothing about web frameworks:

ml/
  features.py   # build_features() + FEATURE_ORDER — the single source of truth
  train.py      # deterministic re-train from the raw CSVs -> joblib artifacts
  predict.py    # load_artifacts(), predict_score(), predict_grade()
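
To make "knows nothing about web frameworks" concrete, here is a hedged sketch of what such a core could look like. The Artifacts container, the explicit artifacts argument, and the "label"/"probability" keys are assumptions for illustration; the repo's real signatures may differ.

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass
class Artifacts:
    """The two fitted models; in the repo these come from joblib files."""
    regressor: Any   # anything with .predict(rows)
    classifier: Any  # anything with .predict_proba(rows)


def predict_score(artifacts: Artifacts, features: Sequence[float]) -> float:
    return float(artifacts.regressor.predict([list(features)])[0])


def predict_grade(artifacts: Artifacts, features: Sequence[float]) -> dict:
    # Assumes column 1 of predict_proba is the "high" class.
    p_high = float(artifacts.classifier.predict_proba([list(features)])[0][1])
    return {"label": "high" if p_high >= 0.5 else "low", "probability": p_high}


def predict_both(artifacts: Artifacts, features: Sequence[float]) -> dict:
    """Single entry point for adapters; no FastAPI or Streamlit imports here."""
    return {
        "score": predict_score(artifacts, features),
        "grade": predict_grade(artifacts, features),
    }
```

Because the core only needs objects with predict and predict_proba, it can be unit-tested with tiny stand-in models and no web server.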

[Diagram: how a prediction flows through two adapters and one core.]

Everything else is a thin adapter over ml/:

FastAPI exposes the models over a typed REST API, deployed on Render with interactive Swagger docs at /docs you can actually poke — fill in a wine, hit Execute, watch the prediction come back. Pydantic validates every input (and returns a clean 422 when your wine has negative alcohol). GET /model/info returns the real metrics straight from the training run — no hard-coded numbers. It's on the free tier, so the first call after a quiet spell takes ~50s to wake the service — an honest tradeoff for $0 hosting, and exactly why the Streamlit app doesn't depend on it (below).

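@app.post("/predict")
def predict_endpoint(wine: WineFeatures):
    # FastAPI validates the request body against WineFeatures before this runs.
    both = predict_both(wine.to_features())
    # Wrap the core's plain dict in typed response models.
    return PredictResponse(score=both["score"], grade=GradeResponse(**both["grade"]))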

Streamlit is the friendly face: drag some sliders, hit Taste it, watch a gauge and a grade badge update. It runs the models in-process by default (so the public demo never depends on a sleeping backend), but it can flip to calling the live API — and if that API is cold, it falls back to local automatically and tells you so.
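
The fallback described above can be sketched in a few lines. This is an illustration under assumptions, not the app's code: remote and local are hypothetical callables standing in for "call the live API" and "run the in-process models".

```python
from typing import Callable, Sequence, Tuple

Predictor = Callable[[Sequence[float]], dict]


def predict_with_fallback(
    features: Sequence[float],
    remote: Predictor,
    local: Predictor,
) -> Tuple[dict, str]:
    """Try the live API first; if it is unreachable, use the local models.

    Returns the prediction plus the source that produced it, so the UI
    can tell the user which path answered.
    """
    try:
        return remote(features), "api"
    except OSError:
        # TimeoutError, ConnectionError and urllib's URLError all derive
        # from OSError, so a cold or unreachable backend ends up here.
        return local(features), "local"
```

Returning the source alongside the result is what lets the interface say "the API was cold, so this came from the local model" rather than failing silently.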

One discipline ties it together: the training scikit-learn version is pinned, the artifacts are committed, and that same version is surfaced at /model/info. The joblib I trained on my laptop is bit-for-bit the joblib that serves in production. No "works-on-my-machine" drift.
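
One way to enforce that discipline at startup is to compare the version recorded at training time with the one installed where the model is served. A sketch, assuming the training version is stored somewhere alongside the artifacts; both helper names are hypothetical:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def assert_same_version(trained_with: str, running: Optional[str]) -> None:
    """Fail fast if the serving environment drifted from the training one."""
    if running != trained_with:
        raise RuntimeError(
            f"artifacts were trained with {trained_with}, "
            f"but the runtime has {running}"
        )
```

In a service this could run once at import time, for example as assert_same_version(recorded_version, installed_version("scikit-learn")), where recorded_version is whatever the training run saved.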

Try it

🍷 App: sommelier-api.streamlit.app — paste a wine's chemistry, get both verdicts.
📜 API docs: live Swagger — the same models over REST.
💻 Code: github.com/lfariabr/sommelier-api

Update: the audit that ate my metrics

A few weeks after publishing this, I reused the same dataset for my Master's classification assessment - this time with a stricter data-quality audit. The audit found something this pipeline had missed: 1,177 of the 6,497 rows are exact duplicates.

Why does that matter? With a random 80/20 split, identical wines land on both sides of the boundary. The model gets graded on rows it has already seen during training - part of the exam leaked in advance. Every number I published above was quietly inflated by it.
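
The leak is easy to reproduce on toy data. The sketch below uses made-up rows and plain Python rather than the real pipeline; with a DataFrame, the equivalent of the dedup step is pandas' drop_duplicates().

```python
import random


def split(rows, test_fraction=0.2, seed=42):
    """Shuffle and cut into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leaked_rows(train, test):
    """Count test rows that also appear verbatim in the training set."""
    seen = set(train)
    return sum(1 for row in test if row in seen)


# Toy data: 100 distinct rows, each present twice (not the wine data).
unique = [(i, i * 0.5) for i in range(100)]
rows = unique + unique

train, test = split(rows)
print("leaked before dedup:", leaked_rows(train, test))

deduped = list(dict.fromkeys(rows))  # drop exact duplicates, keep order
train, test = split(deduped)
print("leaked after dedup:", leaked_rows(train, test))
```

Once every row is unique before the split, no test row can have an identical twin in the training set, so the second count is zero by construction.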

So sommelier-api v0.1.0 deduplicates before the split (5,320 unique wines) and retrains. The honest numbers:

Metric	v1 (published above)	v0.1.0 (deduplicated)
Regression R²	0.50	0.41
Regression RMSE	0.61	0.66
Classification ROC-AUC	0.81	0.79

The models did not get worse. The evaluation got corrected: the earlier one was grading the models on rows they had already seen in training. Ironically the regression, the lens I was already calling "half human", took the bigger hit.
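
To put "the bigger hit" in numbers, here is the relative change for each row of the table, computed from the published figures only:

```python
# (v1, deduplicated) pairs copied from the table above.
metrics = {
    "Regression R2": (0.50, 0.41),
    "Regression RMSE": (0.61, 0.66),
    "Classification ROC-AUC": (0.81, 0.79),
}


def relative_change(before: float, after: float) -> float:
    return (after - before) / before


for name, (before, after) in metrics.items():
    print(f"{name}: {relative_change(before, after):+.1%}")

# Row counts quoted in the post: total rows minus duplicates.
unique_wines = 6497 - 1177
```

That works out to roughly 18% off the R², about 8% more RMSE, and about 2.5% off the ROC-AUC, with 5,320 unique wines left after deduplication.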

While I was in there, I also swapped the grade model for the one my assessment actually approved: the same tree structure with class_weight="balanced". It accepts more false alarms in exchange for catching 73% of genuinely low wines instead of 59%, because in a screening problem the miss is the expensive error. The API now reports sensitivity, specificity and the full confusion matrix at /model/info, straight from the retrained artifacts.
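
For reference, both rates fall straight out of the four confusion-matrix cells. The sketch below treats the class being screened for as "positive" (here that would be low wines, which is my assumption about the labelling) and uses made-up counts, not the model's real matrix:

```python
def screening_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Rates for a screening problem, given the four confusion-matrix cells."""
    return {
        # Share of true positives caught; 1 - sensitivity is the miss rate.
        "sensitivity": tp / (tp + fn),
        # Share of true negatives correctly left alone; 1 - specificity
        # is the false-alarm rate.
        "specificity": tn / (tn + fp),
    }


# Made-up counts for illustration only.
example = screening_metrics(tp=30, fn=10, fp=15, tn=45)
```

Reweighting classes typically moves these two numbers in opposite directions, which is the trade described above: fewer misses, more false alarms.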

Lesson worth the R² it cost me: check for duplicates before you split. It is one line of pandas, and it is the difference between a metric and a vanity metric.

What's next — and what's not worth it

v1 was about the full path from notebook to deployed service, not a perfect model. From here there's a real roadmap — but a good roadmap also says no. Here's how I'd weigh the obvious next steps:

Next step	Worth it?	Why
Log predictions to a DB (SQLite → Postgres)	✅ soon	Cheapest high-value add — a usage log gives analytics, drift monitoring, and a real-world dataset. SQLite is plenty to start.
More / better data (other wine datasets & APIs)	✅ highest leverage	The R² ceiling here is data-bound, not model-bound. More wines and richer features (price, region, vintage) beat any fancier algorithm.
SHAP explanations	✅ yes	Let the app say why a wine scored low — turns the black box into a teaching tool for a few lines of code.
Gradient boosting + probability calibration	✅ quick win	XGBoost/LightGBM usually edge out a random forest on tabular data; calibration makes a "73%" actually mean 73%.
Rate limiting (e.g. slowapi)	⚠️ once it has traffic	A public API needs it eventually to curb abuse and protect the free tier — but premature on day one.
Redis	⚠️ pairs with the above	Earns its keep only behind rate-limit counters or a cache shared across instances. Overkill for a single free-tier instance today.
Deep learning	⚠️ for learning, not accuracy	On ~6,500 rows of tabular data, trees almost always beat neural nets. A great DL exercise — not a way to move the metric.
Auth + freemium (5 free, then sign in)	⚠️ only if productizing	Adds friction to a demo whose whole point is "try it instantly". Makes sense only if this becomes a real product.
More engineered features	⚠️ limited upside	The 11 chemical inputs are largely tapped out; interaction terms are cheap to try but won't break the data ceiling.
Email (Resend)	❌ not yet	No natural trigger — no accounts, no reports to send. A tool looking for a problem until a feature needs one.
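
As an idea of how small the first row of that table could start, here is a sketch of a prediction log using only Python's built-in sqlite3 module. The table name, columns, and function names are hypothetical; nothing like this exists in the repo yet.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_log(path: str = "predictions.db") -> sqlite3.Connection:
    """Open (or create) the log database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS predictions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            created_at TEXT NOT NULL,
            features TEXT NOT NULL,
            score REAL NOT NULL,
            grade TEXT NOT NULL
        )
        """
    )
    return conn


def log_prediction(conn: sqlite3.Connection, features: dict, score: float, grade: str) -> None:
    """Append one prediction; the inputs are stored as a JSON string."""
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(
            "INSERT INTO predictions (created_at, features, score, grade) "
            "VALUES (?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), json.dumps(features), score, grade),
        )
```

Storing the raw inputs next to the outputs is what would make the log usable later for drift checks or as extra training data.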

The thread through that table: more/better data and explainability beat fancier infrastructure. The point of v1 wasn't the perfect model — it was the full path from a notebook to a deployed, typed, tested service. The half you can bottle.

References & code

Dataset — P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47(4), 547–553 (2009). UCI Wine Quality.
Code — github.com/lfariabr/sommelier-api: the full serving layer, re-implemented from the public CSVs.
The two models originate in my Master of Software Engineering (AI) coursework (MLN601 — regression + classification).