Youssef Bellouz

Posted on Dec 23, 2025

Data Science Meets Wine: Predicting Wine Quality & Recommending Wines Using Machine Learning

#programming #ai #python #opensource

Introduction

With the rapid growth of data-driven platforms like Vivino, wine quality assessment is no longer the exclusive domain of expert sommeliers or traditional rule-based systems. Millions of users now generate real-world data that reflects authentic consumer preferences.

In this project, we aim to answer a key business question:
Can we predict wine quality and recommend similar wines using only physicochemical properties?

To achieve this, we built a complete Data Science pipeline that:

Cleans and explores wine data
Analyzes relationships between features
Trains Machine Learning models
Explains predictions using feature importance
Recommends similar wines using cosine similarity

This project follows a scientific approach, from hypothesis formulation to evaluation and interpretation.

Dataset Overview

We use the Wine Quality Dataset (UCI Machine Learning Repository), which contains two datasets:

Red Wine
White Wine

Each dataset includes 11 physicochemical properties (such as acidity, residual sugar, pH, and alcohol) and a quality score assigned by experts.

Target variable:

quality (integer score on a 0 to 10 scale; in the UCI data the observed scores range from 3 to 9)

Data Collection & Cleaning

Cleaning Strategy

Before modeling, data quality is critical. The following steps were applied:

Removed duplicate rows
Converted all values to numeric types
Dropped missing values
Removed implausible values (e.g., negative alcohol, unrealistic pH)
Removed outliers using IQR (Interquartile Range)
Applied log1p transformation on residual sugar to reduce skewness

This ensures the model learns from reliable and realistic data only.
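The cleaning steps above can be sketched as a single pandas function. This is a minimal illustration rather than the project's actual code: the function name `clean_wine`, the pH bounds, and the 1.5 IQR factor are assumptions, and the column names follow the headers of the UCI CSV files.

```python
import numpy as np
import pandas as pd


def clean_wine(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    """Apply the cleaning steps listed above to one wine table."""
    out = df.drop_duplicates()
    # Coerce every column to numeric; unparseable cells become NaN and are dropped.
    out = out.apply(pd.to_numeric, errors="coerce").dropna()
    # Illustrative sanity bounds (assumed, not taken from the original project).
    out = out[(out["alcohol"] > 0) & out["pH"].between(2.5, 4.5)]
    # Keep rows whose features all lie within [Q1 - k*IQR, Q3 + k*IQR].
    features = out.drop(columns="quality")
    q1, q3 = features.quantile(0.25), features.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    inside = ((features >= lower) & (features <= upper)).all(axis=1)
    out = out[inside].copy()
    # log1p compresses the long right tail of residual sugar.
    out["residual sugar"] = np.log1p(out["residual sugar"])
    return out.reset_index(drop=True)
```

With the UCI files, which are semicolon-separated, this would be called as `clean_wine(pd.read_csv("winequality-red.csv", sep=";"))`.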

Data Exploration

Feature Distributions

We explored distributions of all physicochemical features to:

Detect skewed variables
Identify outliers
Understand natural value ranges

This step guided later preprocessing decisions.
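One way to detect skewed variables numerically, alongside the histograms, is to compute the sample skewness of each column. The helper below is a hypothetical sketch; the name `skewed_features` and the threshold of 1.0 are assumptions chosen for illustration.

```python
import pandas as pd


def skewed_features(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return the features whose absolute sample skewness exceeds the threshold."""
    skew = df.skew(numeric_only=True)
    return skew[skew.abs() > threshold].sort_values(ascending=False)
```

A feature flagged here, such as residual sugar, is a natural candidate for a log1p transformation.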

Correlation Analysis

A correlation heatmap revealed important relationships:

Alcohol is positively correlated with quality
Volatile acidity is negatively correlated with quality
Several features show weak or no linear correlation with quality, which motivates using a model that can capture non-linear relationships
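The numbers behind a heatmap like this can be obtained directly from pandas. The sketch below ranks features by the strength of their Pearson correlation with quality; the function name `quality_correlations` is an assumption for illustration.

```python
import pandas as pd


def quality_correlations(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Order by absolute value so strong negative correlations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

Pearson correlation only measures linear association, so a low value here does not rule out a useful non-linear relationship.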

Modeling Strategy

Feature Preparation

The dataset is split into:

X → physicochemical features
y → quality score

Separate models are trained for:

Red wine
White wine

Each dataset is split into training (80%) and testing (20%) sets.
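A minimal sketch of this preparation step with scikit-learn is shown below. The helper name `split_wine` and the seed value are assumptions; the function would be called once for the red table and once for the white table.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_wine(df: pd.DataFrame, target: str = "quality", seed: int = 42):
    """Separate features from the target and hold out 20% for testing."""
    X = df.drop(columns=target)
    y = df[target]
    # Returns X_train, X_test, y_train, y_test.
    return train_test_split(X, y, test_size=0.2, random_state=seed)
```

Passing `stratify=y` to `train_test_split` would preserve the class proportions in both sets, but it fails when a quality score has only one sample, so it is left out of this sketch.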

Machine Learning Model

We use a Random Forest Classifier, chosen because:

It handles non-linear relationships well
It is robust to noise
It provides feature importance for explainability

RandomForestClassifier(n_estimators=100, random_state=42)
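For context, here is how the constructor above would typically be used to fit a model. The wrapper name `train_quality_model` is hypothetical; the hyperparameters are the ones stated in the post.

```python
from sklearn.ensemble import RandomForestClassifier


def train_quality_model(X_train, y_train, seed: int = 42) -> RandomForestClassifier:
    """Fit a random forest with the hyperparameters given above."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model
```

Fixing `random_state` makes the bootstrap samples and feature subsets reproducible between runs.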

Model Evaluation

Accuracy & Classification Report

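Each model is evaluated on its held-out 20% test set using accuracy and a per-class classification report. Both models achieve strong accuracy, with reasonably balanced performance across quality classes.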

The confusion matrix shows:

High accuracy for medium-quality wines
Most misclassifications fall between neighboring quality scores, which is expected given the subjectivity of expert ratings

This suggests that the model learned meaningful patterns from the data.
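The evaluation described above could be computed as follows. This is a sketch, not the project's code: the function name `evaluate` is an assumption, and the "within one" score is an extra metric added here to quantify how many errors land on a neighboring quality score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def evaluate(model, X_test, y_test) -> dict:
    """Accuracy, per-class report, confusion matrix, and within-one accuracy."""
    y_true = np.asarray(y_test)
    y_pred = np.asarray(model.predict(X_test))
    labels = sorted(set(y_true) | set(y_pred))
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "report": classification_report(y_true, y_pred, labels=labels, zero_division=0),
        "confusion": confusion_matrix(y_true, y_pred, labels=labels),
        # Share of predictions that are at most one quality point off.
        "within_one": float(np.mean(np.abs(y_true - y_pred) <= 1)),
    }
```

Because the quality classes are imbalanced, the per-class recall in the report is usually more informative than the overall accuracy alone.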

Feature Importance (Model Explainability)

One of the most valuable insights comes from feature importance analysis.

Top predictors include:

Alcohol
Sulphates
Volatile acidity
Density

This aligns with real-world wine chemistry, which supports the model’s credibility.
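A ranking like the one above can be read from a fitted random forest through its `feature_importances_` attribute. The helper name `feature_importance_table` is an assumption for illustration.

```python
import pandas as pd


def feature_importance_table(model, feature_names) -> pd.Series:
    """Impurity-based importances of a fitted tree ensemble, largest first."""
    importances = pd.Series(model.feature_importances_, index=list(feature_names))
    return importances.sort_values(ascending=False)
```

These impurity-based scores sum to 1 and can be biased when features are correlated; `sklearn.inspection.permutation_importance` on the test set is a common cross-check.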

Wine Recommendation System

Beyond prediction, we implemented a recommendation system.

How it Works:

Predict the quality of a new wine sample
Compute cosine similarity between the new wine and existing wines
Recommend the top N most similar wines

This transforms the project from a pure ML model into a business-ready feature.
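The similarity step can be sketched with scikit-learn as shown below. The function name `recommend_similar` is hypothetical, and standardizing the features before computing cosine similarity is an assumption of this sketch: without it, features with large raw values such as total sulfur dioxide would dominate the score.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler


def recommend_similar(new_wine: pd.DataFrame, catalog: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Return the top_n catalog wines most similar to a one-row new_wine."""
    # Scale with statistics from the catalog so both sides share one space.
    scaler = StandardScaler().fit(catalog)
    catalog_scaled = scaler.transform(catalog)
    new_scaled = scaler.transform(new_wine[catalog.columns])
    scores = cosine_similarity(new_scaled, catalog_scaled)[0]
    # Indices of the highest similarity scores, best first.
    top = np.argsort(scores)[::-1][:top_n]
    return catalog.iloc[top].assign(similarity=scores[top])
```

Here `catalog` holds only the feature columns. The predicted quality of the new sample comes from the trained classifier, for example `model.predict(new_wine)`, and can be shown next to the recommendations.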

Business Impact

This system can be used to:

1. Recommend wines to users based on the similarity of their physicochemical profiles
2. Assist sellers in pricing and positioning wines
3. Help customers discover high-quality wines beyond price bias
4. Modernize rule-based recommendation systems

👉 Because the model scores wines from their chemistry alone, with no price input, it can surface high-quality wines regardless of how they are priced.

Conclusion

In this project, we successfully:

Built a full Data Science pipeline
Cleaned and explored real-world wine data
Trained interpretable Machine Learning models
Predicted wine quality accurately
Implemented a similarity-based recommendation system

This demonstrates how data science can enhance customer experience, improve decision-making, and drive business value in digital marketplaces like Vivino.

Future Improvements

Convert quality into categorical classes (Low / Medium / High)
Add hyperparameter tuning (e.g., GridSearchCV)
Deploy the model as an API
Build a web interface for real-time recommendations
Integrate user preference data
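As an example of the first improvement, quality scores could be grouped into three classes with `pd.cut`. The cut points used here (4 and below is Low, 5 to 6 is Medium, 7 and above is High) are an assumption for illustration and would need to be chosen with the class balance in mind.

```python
import pandas as pd


def bin_quality(quality: pd.Series) -> pd.Series:
    """Map integer quality scores to Low / Medium / High classes."""
    # Intervals are right-inclusive: (-1, 4], (4, 6], (6, 10].
    return pd.cut(quality, bins=[-1, 4, 6, 10], labels=["Low", "Medium", "High"])
```

Training on these coarser classes reduces the number of rare labels the classifier has to separate.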

Final Note

This project reflects a production-ready mindset:

Clean code
Modular pipeline
Explainable ML
Business-oriented thinking

Data science doesn’t replace taste — it enhances it.
