Introduction
With the rapid growth of data-driven platforms like Vivino, wine quality assessment is no longer driven only by expert sommeliers or traditional rules-based systems. Millions of users now generate real-world data reflecting authentic consumer preferences.
In this project, we aim to answer a key business question:
Can we predict wine quality and recommend similar wines using only physicochemical properties?
To achieve this, we built a complete Data Science pipeline that:
- Cleans and explores wine data
- Analyzes relationships between features
- Trains Machine Learning models
- Explains predictions using feature importance
- Recommends similar wines using cosine similarity
This project follows a scientific approach, from hypothesis formulation to evaluation and interpretation.
Dataset Overview
We use the Wine Quality Dataset (UCI Machine Learning Repository), which contains two datasets:
- Red Wine
- White Wine
Each dataset includes physicochemical properties such as acidity, sugar, pH, alcohol, and a quality score assigned by experts.
Target variable:
- quality (integer score between 0 and 10)
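As a rough illustration of the data layout, the sketch below parses a two-row excerpt written in the same semicolon-separated format as the UCI `winequality-red.csv` file. The in-memory `sample_csv` buffer is a stand-in so the snippet runs offline; with the real files you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Two-row excerpt in the same layout as winequality-red.csv.
# The UCI files are semicolon-separated, so sep=";" is required.
sample_csv = io.StringIO(
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";'
    '"alcohol";"quality"\n'
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
)

red = pd.read_csv(sample_csv, sep=";")

# Everything except the target column is a physicochemical feature.
feature_cols = [col for col in red.columns if col != "quality"]
print(len(feature_cols), "features; target dtype:", red["quality"].dtype)
```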
Data Collection & Cleaning
Cleaning Strategy
Before modeling, data quality is critical. The following steps were applied:
- Removed duplicate rows
- Converted all values to numeric types
- Dropped missing values
- Removed implausible values (e.g., negative alcohol, unrealistic pH)
- Removed outliers using the IQR (interquartile range) rule
- Applied a log1p transformation to residual sugar to reduce skewness
This ensures the model learns from reliable and realistic data only.
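The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_wine`, the pH bounds (2.5 to 4.5), the 1.5 × IQR fence, and the choice to apply the IQR filter to features only are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_wine(df: pd.DataFrame, iqr_factor: float = 1.5) -> pd.DataFrame:
    """Apply the cleaning steps in order (hypothetical helper)."""
    df = df.drop_duplicates()
    # Coerce every column to numeric; unparseable entries become NaN and are dropped.
    df = df.apply(pd.to_numeric, errors="coerce").dropna()
    # Drop physically implausible rows (bounds are illustrative assumptions).
    df = df[(df["alcohol"] > 0) & df["pH"].between(2.5, 4.5)]
    # Keep only rows whose features all lie inside the IQR fences.
    features = df.drop(columns="quality")
    q1, q3 = features.quantile(0.25), features.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    inside = ((features >= lower) & (features <= upper)).all(axis=1)
    df = df[inside].copy()
    # log1p compresses the long right tail of residual sugar.
    df["residual sugar"] = np.log1p(df["residual sugar"])
    return df.reset_index(drop=True)


# Tiny made-up sample: one duplicate, one negative alcohol value,
# one non-numeric entry, and one extreme residual sugar value.
raw = pd.DataFrame({
    "alcohol": [9.4, 9.4, 10.0, -1.0, 10.5, "bad", 9.8, 10.2],
    "pH": [3.3, 3.3, 3.2, 3.4, 3.1, 3.3, 3.4, 3.2],
    "residual sugar": [1.9, 1.9, 2.1, 2.0, 2.3, 2.2, 2.0, 60.0],
    "quality": [5, 5, 6, 5, 6, 5, 7, 6],
})
cleaned = clean_wine(raw)
print(cleaned)
```

Note that IQR filtering on a dataset this small is only for demonstration; on the real data the fences would be computed over thousands of rows.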
Data Exploration
We explored distributions of all physicochemical features to:
- Detect skewed variables
- Identify outliers
- Understand natural value ranges
This step guided later preprocessing decisions.
Correlation Analysis
A correlation heatmap revealed important relationships:
- Alcohol has a positive correlation with quality
- Volatile acidity is negatively correlated with quality
- Several features show weak or no linear correlation with quality, which motivates models that can capture non-linear relationships
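To make the correlation step concrete, here is a small sketch that computes Pearson correlations with the target. The data is synthetic and deliberately built so that quality rises with alcohol and falls with volatile acidity; it mirrors the pattern described above but is not the real dataset, and the coefficients in the formula are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in features (means and spreads are illustrative).
alcohol = rng.normal(10.5, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.15, n)
ph = rng.normal(3.3, 0.15, n)

# Synthetic quality: rises with alcohol, falls with volatile acidity,
# and does not depend on pH at all.
quality = (
    5
    + 0.6 * (alcohol - 10.5)
    - 3.0 * (volatile_acidity - 0.5)
    + rng.normal(0, 0.5, n)
)

df = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "pH": ph,
    "quality": quality,
})

corr = df.corr()  # Pearson correlation matrix (pandas default)
quality_corr = corr["quality"].drop("quality").sort_values()
print(quality_corr)
```

The full `corr` matrix is what a heatmap would visualize, for example by passing it to `seaborn.heatmap`.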
Modeling Strategy
Feature Preparation
The dataset is split into:
- X → physicochemical features
- y → quality score
Separate models are trained for:
- Red wine
- White wine
Each dataset is split into training (80%) and testing (20%) sets.
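A minimal sketch of the feature/target separation and the 80/20 split, using scikit-learn's `train_test_split`. The `wine` frame below is random stand-in data, and `stratify=y` is an assumption added here to keep class proportions similar in both splits; the original pipeline may not use it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in frame; in practice this would be the cleaned red or white dataset.
wine = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, 200),
    "volatile acidity": rng.normal(0.5, 0.15, 200),
    "sulphates": rng.normal(0.65, 0.15, 200),
    "quality": rng.integers(5, 8, 200),  # scores 5, 6, or 7
})

X = wine.drop(columns="quality")  # physicochemical features
y = wine["quality"]               # quality score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Running the same steps once per wine type yields the two separate train/test sets.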
Machine Learning Model
We use a Random Forest Classifier, chosen because:
- It handles non-linear relationships well
- It is robust to noise
- It provides feature importance for explainability
RandomForestClassifier(n_estimators=100, random_state=42)
Model Evaluation
Accuracy & Classification Report
The models achieve strong accuracy on both datasets, with balanced performance across quality classes.
The confusion matrix shows:
- High accuracy for medium-quality wines
- Most misclassifications fall between neighboring quality scores (expected, given how subjective the ratings are)
This suggests that the model learned meaningful patterns from the data.
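The evaluation described above might look like the following sketch. It trains the same `RandomForestClassifier` configuration on a synthetic three-class problem generated with `make_classification`, which stands in for the wine data, so the printed numbers say nothing about the real models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 11 features (like the wine data) and 3 classes.
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted

print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))
print(cm)
```

Off-diagonal cells next to the diagonal in `cm` are where confusion between neighboring classes would show up.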
Feature Importance (Model Explainability)
One of the most valuable insights comes from feature importance analysis.
Top predictors include:
- Alcohol
- Sulphates
- Volatile acidity
- Density
This is consistent with what is known about wine chemistry, which lends the model credibility.
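For illustration, feature importances can be read from a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which the label depends on alcohol and volatile acidity but not on a pure-noise column, so the ranking it prints is a property of this toy setup rather than a result from the wine models.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600

X = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, n),
    "volatile acidity": rng.normal(0.5, 0.15, n),
    "noise": rng.normal(0.0, 1.0, n),  # unrelated to the label by construction
})

# Synthetic label: a score driven by the two real features, cut into 3 classes.
score = (X["alcohol"] - 10.5) - 5.0 * (X["volatile acidity"] - 0.5)
y = np.digitize(score, [-0.5, 0.5])  # 0 = low, 1 = medium, 2 = high

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased toward high-cardinality features, so permutation importance is a common cross-check.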
Wine Recommendation System
Beyond prediction, we implemented a recommendation system.
How it Works:
- Predict the quality of a new wine sample
- Compute cosine similarity between the new wine and existing wines
- Recommend the top N most similar wines
This transforms the project from a pure ML model into a business-ready feature.
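The similarity steps above can be sketched with plain NumPy. The function name `recommend_similar`, the tiny catalog, and the decision to standardize features before comparing them are assumptions for this example; standardizing is used here so that no single large-valued feature dominates the cosine similarity. The quality-prediction step is omitted.

```python
import numpy as np


def recommend_similar(new_wine, catalog, top_n=3):
    """Return (indices, scores) of the top_n catalog rows most similar to new_wine."""
    # Standardize with the catalog's statistics so features share a scale.
    mean = catalog.mean(axis=0)
    std = catalog.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    cat_z = (catalog - mean) / std
    new_z = (new_wine - mean) / std

    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(cat_z, axis=1) * np.linalg.norm(new_z)
    sims = (cat_z @ new_z) / np.maximum(norms, 1e-12)

    order = np.argsort(sims)[::-1][:top_n]  # highest similarity first
    return order, sims[order]


# Made-up catalog; columns: alcohol, volatile acidity, sulphates.
catalog = np.array([
    [9.0, 0.70, 0.50],
    [12.5, 0.30, 0.80],
    [10.0, 0.60, 0.55],
    [12.8, 0.28, 0.85],
    [9.5, 0.65, 0.52],
])
new_wine = np.array([12.6, 0.31, 0.79])

indices, scores = recommend_similar(new_wine, catalog, top_n=2)
print(indices, scores)
```

For larger catalogs, `sklearn.metrics.pairwise.cosine_similarity` offers the same computation in vectorized form.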
Business Impact
This system can be used to:
- Recommend wines to users based on taste similarity
- Assist sellers in pricing and positioning wines
- Help customers discover high-quality wines beyond price bias
- Modernize rule-based recommendation systems
👉 Cheaper wines can still be high-quality, and a model that scores wines on chemistry alone, without looking at price, can help surface them.
Conclusion
In this project, we successfully:
- Built a full Data Science pipeline
- Cleaned and explored real-world wine data
- Trained interpretable Machine Learning models
- Predicted wine quality accurately
- Implemented a similarity-based recommendation system
This demonstrates how data science can enhance customer experience, improve decision-making, and drive business value in digital marketplaces like Vivino.
Future Improvements
- Convert quality into categorical classes (Low / Medium / High)
- Add hyperparameter tuning (GridSearch)
- Deploy the model as an API
- Build a web interface for real-time recommendations
- Integrate user preference data
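As one possible starting point for the first improvement, quality scores could be binned with `pd.cut`. The cut points below (0 to 4 as Low, 5 to 6 as Medium, 7 to 10 as High) are an assumption chosen for illustration; the right thresholds would depend on the class balance in the real data.

```python
import pandas as pd

quality = pd.Series([3, 4, 5, 6, 7, 8])

# Bin edges are illustrative: [0, 4] -> Low, (4, 6] -> Medium, (6, 10] -> High.
quality_class = pd.cut(
    quality,
    bins=[0, 4, 6, 10],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)
print(quality_class.tolist())
```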
Final Note
This project reflects a production-ready mindset:
- Clean code
- Modular pipeline
- Explainable ML
- Business-oriented thinking
Data science doesn’t replace taste — it enhances it.





