DEV Community

Youssef Bellouz

Data Science Meets Wine: Predicting Wine Quality & Recommending Wines Using Machine Learning

Introduction

With the rapid growth of data-driven platforms like Vivino, wine quality assessment is no longer driven only by expert sommeliers or traditional rule-based systems. Millions of users now generate real-world data reflecting authentic consumer preferences.

In this project, we aim to answer a key business question:
Can we predict wine quality and recommend similar wines using only physicochemical properties?

To achieve this, we built a complete Data Science pipeline that:

  • Cleans and explores wine data
  • Analyzes relationships between features
  • Trains Machine Learning models
  • Explains predictions using feature importance
  • Recommends similar wines using cosine similarity

This project follows a scientific approach, from hypothesis formulation to evaluation and interpretation.

Dataset Overview

We use the Wine Quality Dataset (UCI Machine Learning Repository), which contains two datasets:

  • Red Wine
  • White Wine

Each dataset includes physicochemical properties such as acidity, sugar, pH, alcohol, and a quality score assigned by experts.

Target variable:

  • quality (integer score between 0 and 10)
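
As a minimal sketch of the loading step, assuming the UCI files are named winequality-red.csv and winequality-white.csv and are semicolon-delimited: the helper name `load_wine` and the tiny in-memory sample below are illustrative stand-ins, not the project's actual code or data.

```python
import io

import pandas as pd


def load_wine(source) -> pd.DataFrame:
    """Read one wine-quality file from a path or a file-like object."""
    # The UCI CSV files use ';' rather than ',' as the field separator.
    return pd.read_csv(source, sep=";")


# Illustrative in-memory sample standing in for winequality-red.csv.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"alcohol";"quality"\n'
    "7.4;0.70;9.4;5\n"
    "7.8;0.88;9.8;5\n"
    "11.2;0.28;9.8;6\n"
)

red = load_wine(sample)
print(red.shape)
print(red["quality"].value_counts().sort_index())
```

With the real files, the same helper would be called once per file, for example `load_wine("winequality-red.csv")`.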

Data Collection & Cleaning

Cleaning Strategy

Before modeling, data quality is critical. The following steps were applied:

  • Removed duplicate rows
  • Converted all values to numeric types
  • Dropped missing values
  • Removed non-logical values (e.g., negative alcohol, unrealistic pH)
  • Removed outliers using IQR (Interquartile Range)
  • Applied log1p transformation on residual sugar to reduce skewness

This ensures the model learns from reliable and realistic data only.
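
The cleaning steps above can be sketched roughly as follows. The function name `clean_wine`, the sanity bounds on alcohol and pH, and the 1.5 × IQR multiplier are assumptions chosen for illustration; the article does not state the exact thresholds used.

```python
import numpy as np
import pandas as pd


def clean_wine(df: pd.DataFrame, target: str = "quality", k: float = 1.5) -> pd.DataFrame:
    """Apply the cleaning steps in order and return a new frame."""
    out = df.drop_duplicates()
    # Coerce everything to numeric; unparseable cells become NaN and are dropped.
    out = out.apply(pd.to_numeric, errors="coerce").dropna()
    # Assumed sanity bounds: alcohol must be positive, pH in a plausible range.
    out = out[(out["alcohol"] > 0) & out["pH"].between(2.5, 4.5)]
    # IQR rule on the feature columns only (the target is left alone).
    features = out.columns.drop(target)
    q1 = out[features].quantile(0.25)
    q3 = out[features].quantile(0.75)
    iqr = q3 - q1
    inside = ((out[features] >= q1 - k * iqr) & (out[features] <= q3 + k * iqr)).all(axis=1)
    out = out[inside].copy()
    # log1p compresses the long right tail of residual sugar.
    out["residual sugar"] = np.log1p(out["residual sugar"])
    return out.reset_index(drop=True)


# Illustrative rows: one duplicate, one bad string, one negative alcohol,
# one unrealistic pH, and one extreme residual sugar value.
raw = pd.DataFrame({
    "alcohol": [9.4, 9.4, 9.8, 10.0, 10.2, -1.0, 10.5, "bad", 9.9, 10.1],
    "pH": [3.3, 3.3, 3.2, 3.4, 3.3, 3.3, 3.1, 3.2, 9.0, 3.35],
    "residual sugar": [1.9, 1.9, 2.6, 2.3, 1.8, 2.0, 2.1, 2.2, 2.0, 60.0],
    "quality": [5, 5, 5, 6, 5, 6, 7, 5, 6, 6],
})

cleaned = clean_wine(raw)
print(len(raw), "->", len(cleaned), "rows")
```

Note that the IQR rule can remove a large share of rows on small or heavily skewed data, so the multiplier is worth checking against how many samples survive.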

Data Exploration

Feature Distributions

We explored distributions of all physicochemical features to:

  • Detect skewed variables
  • Identify outliers
  • Understand natural value ranges

This step guided later preprocessing decisions.
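
One way to flag skewed variables is to compute sample skewness per column. The sketch below uses synthetic data and an assumed cutoff of 1.0; neither comes from the project itself.

```python
import numpy as np
import pandas as pd


def skewed_features(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return the columns whose absolute sample skewness exceeds the threshold."""
    skew = df.skew(numeric_only=True)
    return skew[skew.abs() > threshold].sort_values(ascending=False)


# Synthetic stand-in: a long-tailed column and a roughly symmetric one.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "residual sugar": rng.lognormal(mean=0.8, sigma=1.0, size=500),
    "pH": rng.normal(loc=3.3, scale=0.15, size=500),
})

skewed = skewed_features(demo)
print(skewed)

# Skewness before and after log1p on the long-tailed column.
print(demo["residual sugar"].skew(), np.log1p(demo["residual sugar"]).skew())
```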

Correlation Analysis

A correlation heatmap revealed important relationships:

  • Alcohol has a positive correlation with quality
  • Volatile acidity negatively impacts wine quality
  • Some features show weak or no linear correlation with quality, which motivates trying non-linear models
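
A hedged sketch of how the correlation with quality can be ranked. The data here is synthetic and deliberately constructed so that alcohol pushes quality up and volatile acidity pushes it down; it illustrates the computation only and does not reproduce the project's results.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): quality depends on alcohol (+) and volatile acidity (-).
rng = np.random.default_rng(42)
n = 300
alcohol = rng.normal(10.5, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.15, n)
ph = rng.normal(3.3, 0.15, n)  # unrelated to quality by construction
noise = rng.normal(0, 0.5, n)
quality = np.clip(
    np.round(5.5 + 0.6 * (alcohol - 10.5) - 3.0 * (volatile_acidity - 0.5) + noise),
    3,
    8,
)

df = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "pH": ph,
    "quality": quality,
})

# Pearson correlation of every feature with the target, sorted low to high.
corr_with_quality = df.corr()["quality"].drop("quality").sort_values()
print(corr_with_quality)
```

The full matrix from `df.corr()` is what a heatmap would display; the single sorted column is often easier to read when the question is only about the target.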

Modeling Strategy

Feature Preparation

  • The dataset is split into:

    • X → physicochemical features
    • y → quality score
  • Separate models are trained for:

    • Red wine
    • White wine
  • Data is split into training (80%) and testing (20%)
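
The split described above might look like this with scikit-learn. The helper name `split_features_target`, the seed, and the demo frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_target(df: pd.DataFrame, target: str = "quality",
                          test_size: float = 0.2, seed: int = 42):
    """Separate X and y, then hold out a test set."""
    X = df.drop(columns=[target])  # physicochemical features
    y = df[target]                 # quality score
    return train_test_split(X, y, test_size=test_size, random_state=seed)


# Illustrative frame with 100 rows; real data would be the cleaned red or white set.
demo = pd.DataFrame({
    "alcohol": np.linspace(9.0, 13.0, 100),
    "pH": np.linspace(3.0, 3.6, 100),
    "quality": np.tile([5, 6, 5, 6, 7], 20),
})

X_train, X_test, y_train, y_test = split_features_target(demo)
print(X_train.shape, X_test.shape)
```

Running the helper once per wine type keeps the red and white models fully separate. Passing `stratify=y` to `train_test_split` is an option when class proportions should be preserved, provided every class has at least two samples.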

Machine Learning Model

We use a Random Forest Classifier, chosen because:

  • It handles non-linear relationships well
  • It is robust to noise
  • It provides feature importance for explainability

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
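
For context, here is a minimal sketch of fitting that classifier end to end. The data comes from scikit-learn's `make_classification` as a stand-in for the wine features (11 columns, 3 classes are assumptions), so the numbers it produces say nothing about the real datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 11 physicochemical features and a 3-level target.
X, y = make_classification(
    n_samples=400,
    n_features=11,
    n_informative=5,
    n_classes=3,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same hyperparameters as quoted in the article.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(predictions[:10])
```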

Model Evaluation

Accuracy & Classification Report

The models achieve strong accuracy on both datasets, with balanced performance across quality classes.

The confusion matrix shows:

  • High accuracy for medium-quality wines
  • Misclassifications mostly between neighboring quality scores, which is expected given how subjective expert ratings are

This suggests that the model learned meaningful patterns from the data.
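
The evaluation can be sketched with scikit-learn's metrics. The label arrays below are hand-written for illustration, not model output from the project; the "adjacent share" measure is an added convenience for checking how many errors are off by exactly one quality point.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written illustrative labels (assumption), not real predictions.
y_true = np.array([5, 5, 5, 6, 6, 6, 6, 7, 7, 4])
y_pred = np.array([5, 5, 6, 6, 6, 5, 6, 6, 7, 6])
labels = [4, 5, 6, 7]

accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Share of errors that land on a neighboring quality score.
errors = y_true != y_pred
adjacent_share = np.mean(np.abs(y_true - y_pred)[errors] == 1)

print("accuracy:", accuracy)
print(cm)
print("adjacent share of errors:", adjacent_share)
# zero_division=0 avoids warnings for classes that are never predicted.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

Per-class precision and recall in the report are worth reading alongside accuracy, since rare quality scores can be missed entirely while overall accuracy stays high.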

Feature Importance (Model Explainability)

One of the most valuable insights comes from feature importance analysis.

Top predictors include:

  • Alcohol
  • Sulphates
  • Volatile acidity
  • Density

This is consistent with what is known about wine chemistry, which lends the model credibility.
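
A minimal sketch of extracting feature importances from a fitted Random Forest. The data is synthetic, with the label driven by alcohol and volatile acidity by construction, so the ranking it prints is an artifact of that setup rather than a finding.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features (assumption); only two of them carry signal.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, n),
    "sulphates": rng.normal(0.65, 0.15, n),
    "volatile acidity": rng.normal(0.5, 0.15, n),
    "density": rng.normal(0.996, 0.002, n),
})
signal = (X["alcohol"] - 10.5) / 1.0 - (X["volatile acidity"] - 0.5) / 0.15
y = (signal + rng.normal(0, 0.3, n) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, one per column, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be inflated for high-cardinality features; `sklearn.inspection.permutation_importance` on the test set is a common cross-check.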

Wine Recommendation System

Beyond prediction, we implemented a recommendation system.

How It Works:

  1. Predict the quality of a new wine sample
  2. Compute cosine similarity between the new wine and existing wines
  3. Recommend the top N most similar wines

This transforms the project from a pure ML model into a business-ready feature.
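
Steps 2 and 3 can be sketched as below. The function name `recommend_similar`, the three-column catalog, and the standardization step are assumptions; the article does not say whether features were scaled, but without scaling the columns with the largest raw magnitudes tend to dominate cosine similarity.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler


def recommend_similar(catalog: np.ndarray, new_wine: np.ndarray, top_n: int = 3):
    """Return indices and scores of the top_n catalog wines closest to new_wine."""
    # Standardize with catalog statistics so every feature contributes comparably.
    scaler = StandardScaler().fit(catalog)
    catalog_scaled = scaler.transform(catalog)
    query_scaled = scaler.transform(new_wine.reshape(1, -1))
    sims = cosine_similarity(query_scaled, catalog_scaled)[0]
    order = np.argsort(sims)[::-1][:top_n]  # highest similarity first
    return order, sims[order]


# Illustrative catalog with columns: alcohol, volatile acidity, pH.
catalog = np.array([
    [9.4, 0.70, 3.51],
    [9.8, 0.88, 3.20],
    [12.5, 0.30, 3.30],
    [10.0, 0.60, 3.40],
    [12.8, 0.28, 3.25],
])
new_wine = np.array([12.5, 0.30, 3.30])

indices, scores = recommend_similar(catalog, new_wine, top_n=3)
print(indices, scores)
```

In a full pipeline the predicted quality from step 1 could be shown next to each recommendation, or used to filter the catalog before the similarity search.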

Business Impact

This system can be used to:

  • Recommend wines to users based on taste similarity
  • Assist sellers in pricing and positioning wines
  • Help customers discover high-quality wines beyond price bias
  • Modernize rule-based recommendation systems

👉 Cheaper wines can still be high-quality: the model scores a wine from its physicochemical properties alone, without looking at price.

Conclusion

In this project, we successfully:

  • Built a full Data Science pipeline
  • Cleaned and explored real-world wine data
  • Trained interpretable Machine Learning models
  • Predicted wine quality accurately
  • Implemented a similarity-based recommendation system

This demonstrates how data science can enhance customer experience, improve decision-making, and drive business value in digital marketplaces like Vivino.

Future Improvements

  • Convert quality into categorical classes (Low / Medium / High)
  • Add hyperparameter tuning (e.g., scikit-learn's GridSearchCV)
  • Deploy the model as an API
  • Build a web interface for real-time recommendations
  • Integrate user preference data
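
As an illustration of the first item, quality scores could be binned with `pandas.cut`. The cut points used here (up to 4 is Low, 5 to 6 is Medium, 7 and above is High) are an assumption, not thresholds taken from the project.

```python
import pandas as pd

# Illustrative quality scores.
quality = pd.Series([3, 5, 6, 7, 8], name="quality")

# Right-inclusive bins: (-1, 4] -> Low, (4, 6] -> Medium, (6, 10] -> High.
quality_class = pd.cut(
    quality,
    bins=[-1, 4, 6, 10],
    labels=["Low", "Medium", "High"],
)
print(quality_class.tolist())
```

Fewer, broader classes usually mean more samples per class, which can help with the rare scores at both ends of the scale.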

Final Note

This project reflects a production-ready mindset:

  • Clean code
  • Modular pipeline
  • Explainable ML
  • Business-oriented thinking

Data science doesn’t replace taste — it enhances it.
