Scientific Experiment: Can Market Data Identify Wine Type?

#data #datascience #machinelearning #science

To address the Wine Classification challenge, we shift our objective from predicting a continuous score (Rating) to identifying the categorical identity of a wine (Red, Rose, or White) based on its market and temporal characteristics.

Abstract

Traditional wine classification relies on chemical analysis or label reading. In this experiment, we test the hypothesis that three market proxies (Price, Rating, and Vintage Year) carry enough "latent DNA" to classify a wine into its respective category: Red, Rose, or White.

The Hypothesis

$H_1$: Different wine categories exhibit unique clusters within the Price-Rating-Year 3D space. Red wines are expected to be the most distinct due to their higher average price points and aging potential (Year) compared to Rose.

Step 1: Data Integration & Categorical Labeling

We consolidated three distinct datasets (Red, Rose, White) into a master frame of 12,827 observations. A "WineType" label was preserved as the Ground Truth for our supervised learning model. During this phase, we standardized the "Year" column to remove "N.V." (Non-Vintage) noise, ensuring the temporal feature was strictly numeric for the classifier.
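
The integration step can be sketched roughly as follows. This is a minimal illustration, not the exact pipeline used here: the three small frames, their values, and the column names other than "WineType" and "Year" are hypothetical stand-ins, and dropping "N.V." rows is only one possible reading of "remove" (imputing a year would be another).

```python
import pandas as pd

# Hypothetical stand-ins for the three source datasets; the real
# files and their exact columns are not shown in this article.
red = pd.DataFrame({"Price": [45.0, 120.0], "Rating": [4.1, 4.5], "Year": ["2015", "N.V."]})
rose = pd.DataFrame({"Price": [18.0], "Rating": [3.8], "Year": ["2019"]})
white = pd.DataFrame({"Price": [22.0, 30.0], "Rating": [3.9, 4.0], "Year": ["2018", "2017"]})

# Tag each frame with its ground-truth label, then stack them into one frame.
frames = {"Red": red, "Rose": rose, "White": white}
wines = pd.concat(
    [df.assign(WineType=label) for label, df in frames.items()],
    ignore_index=True,
)

# Coerce Year to numeric: "N.V." becomes NaN. Here those rows are dropped
# (an assumption), and the remaining years are stored as integers.
wines["Year"] = pd.to_numeric(wines["Year"], errors="coerce")
wines = wines.dropna(subset=["Year"]).astype({"Year": int})

print(wines)
```

Using `errors="coerce"` keeps the cleaning step tolerant of any other non-numeric year strings that might appear alongside "N.V.".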

Step 2: Exploratory Statistical Clustering

Before training, we analyzed the overlap between categories. Our initial box plot analysis showed that while Red and White wines have overlapping rating distributions, their Price volatility differs significantly.
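
A per-category summary is one way to put numbers on what a box plot shows. The sketch below uses toy values that are purely illustrative (they are not taken from our dataset) and assumes columns named "Price" and "Rating".

```python
import pandas as pd

# Toy values for illustration only; NOT the article's data.
wines = pd.DataFrame({
    "WineType": ["Red", "Red", "Red", "White", "White", "White"],
    "Price": [20.0, 60.0, 250.0, 15.0, 22.0, 35.0],
    "Rating": [3.9, 4.1, 4.4, 3.8, 4.0, 4.2],
})

# describe() reports count, mean, std, min, quartiles and max per group;
# the quartiles correspond to the box drawn in a box plot.
price_summary = wines.groupby("WineType")["Price"].describe()
print(price_summary)

# Standard deviation per type as one simple measure of price "volatility".
price_std = wines.groupby("WineType")["Price"].std()
print(price_std)
```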

--- Classification Accuracy ---
Accuracy Score: 0.6738

--- Detailed Scientific Report ---
              precision    recall  f1-score   support

         Red       0.77      0.80      0.79      1734
        Rose       0.14      0.11      0.12        79
       White       0.47      0.44      0.45       753

    accuracy                           0.67      2566
   macro avg       0.46      0.45      0.45      2566
weighted avg       0.66      0.67      0.67      2566
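
To put the accuracy score in context, it helps to compare it with a trivial baseline that always predicts the most common class. The sketch below uses only the support and recall figures from the report above; the variable names are ours. It also checks that the per-class recalls are consistent with the overall accuracy. By this arithmetic, an "always Red" guess would score roughly 0.676 on the same test set, which is about level with the reported 0.6738.

```python
# Support (test-set row counts) and recall per class, copied from the report above.
support = {"Red": 1734, "Rose": 79, "White": 753}
recall = {"Red": 0.80, "Rose": 0.11, "White": 0.44}

total = sum(support.values())

# Accuracy of a trivial model that always predicts the largest class.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# recall * support approximates the number of correct predictions per class;
# summing these and dividing by the total should land near the reported accuracy
# (only approximately, because the recalls are rounded to two decimals).
approx_correct = sum(recall[c] * support[c] for c in support)
approx_accuracy = approx_correct / total

print(f"total test rows: {total}")
print(f"always-{majority_class} baseline accuracy: {baseline_accuracy:.4f}")
print(f"accuracy implied by per-class recall: {approx_accuracy:.4f}")
```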

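The correlation matrix showed a correlation of $-0.33$ between Year and Rating: a moderate negative relationship ($r^2 \approx 0.11$) in which older vintages tend to carry higher ratings. This suggests that age is one differentiator in how these wines are perceived and priced in the market.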

Step 3: Model Architecture (Random Forest)

We trained a Random Forest Classifier with 100 decision trees. This ensemble method was selected because it can handle the non-linear boundaries found in market data. For instance, a $50 White wine might have very different "Rating" characteristics than a $50 Red wine.
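
A minimal sketch of this setup with scikit-learn is shown below. Several things here are assumptions rather than details from our run: the data is synthetic with randomly assigned labels (so the printed score is meaningless), the 80/20 stratified split is inferred from the test-set size, and `random_state=0` is arbitrary. Only the 100 trees and the three features come from the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the article's dataset): three features per row.
n = 1000
X = np.column_stack([
    rng.lognormal(mean=3.0, sigma=0.8, size=n),  # Price
    rng.normal(loc=4.0, scale=0.3, size=n),      # Rating
    rng.integers(1990, 2021, size=n),            # Year
])
# Random labels with a class imbalance loosely resembling the report's support.
y = rng.choice(["Red", "Rose", "White"], size=n, p=[0.68, 0.03, 0.29])

# Assumed 80/20 split, stratified so the rare Rose class appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 100 trees, as described in the text; other hyperparameters left at defaults.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {accuracy:.4f}")
print(classification_report(y_test, y_pred, zero_division=0))
```

Tree ensembles do not need feature scaling, which is convenient when Price, Rating, and Year live on very different numeric ranges.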

Step 4: Results & Performance Evaluation

The model was most reliable on Red wines (F1 0.79) and noticeably weaker on White (F1 0.45). Rose proved the most difficult to classify (F1 0.12), likely due to its smaller sample size (397 observations) and its "middle-ground" price-rating profile.

Key Metrics Observed:

Accuracy: 67.4% of the test set (2,566 wines) was classified correctly.
Precision: Highest for Red wines (0.77), plausibly because they occupy a more exclusive high-price tier and make up most of the data.
Recall: Lowest for Rose (0.11). Rose wines were often misclassified as light Reds or full-bodied Whites, consistent with their status as a hybrid market category.

Conclusion: The "Identity" of Price

Our experiment suggests that a wine's "Type" is not just a chemical property but partly a market one. By looking only at the price tag, the year on the bottle, and the consumer rating, a model can identify Red wines fairly reliably, although White and especially Rose remain hard to separate at an overall accuracy of about 67%.

This paves the way for a Wine Suggestion Engine that doesn't just look for "similar wines," but understands which category a user is likely seeking based on their budget and quality expectations.
Written by: @ben_jaddi and @boustani_h