DEV Community

Elghalia

Analyzing the Relationship Between Wine Price, Quality, and Origin Using Data Science

Introduction
Data Collection and Merging
Data Cleaning and Preprocessing
Exploratory Data Analysis

Predictive Modeling

Conclusion

Introduction

Wine is more than just a beverage — it is a global market with diverse varieties, prices, and qualities. For millions of consumers, choosing the right wine can be overwhelming: should they pay more for a reputed brand, or are there hidden gems that offer great quality at an affordable price?

In the digital age, online wine marketplaces like Vivino provide access to extensive wine data, including ratings, reviews, prices, and origins. Leveraging this data with data science techniques can help both businesses and consumers make smarter decisions.

In this project, we aim to explore the wine market using a dataset of over 12,000 wines across red, white, and rose varieties. Specifically, we focus on the relationship between price and quality, identify value-for-money wines, and analyze how features like wine type, country, and vintage influence ratings.

Through data cleaning, exploratory analysis, and predictive modeling, this project demonstrates how data science can transform raw wine data into actionable insights for businesses, helping them recommend wines effectively and optimize pricing strategies.

Data Collection and Merging

To get a complete view of the wine market, we combined three separate datasets for red, white, and rose wines into a single, unified dataset. Each entry includes essential information about the wine, such as its name, country of origin, region, winery, rating, number of reviews, price, vintage year, and type, providing a rich foundation for analysis and insights.
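As a rough sketch of this merging step, assuming each source file has already been loaded into a pandas DataFrame: the tiny frames below and the names `red`, `white`, `rose`, and `wines` are illustrative placeholders, not the project's actual files or values.

```python
import pandas as pd

# Tiny made-up frames standing in for the three source datasets (hypothetical values).
red = pd.DataFrame({"Name": ["Red A", "Red B"], "Country": ["France", "Italy"],
                    "Rating": [4.1, 3.8], "Price": [25.0, 12.5], "Year": [2016, 2018]})
white = pd.DataFrame({"Name": ["White A"], "Country": ["Spain"],
                      "Rating": [3.9], "Price": [10.0], "Year": [2019]})
rose = pd.DataFrame({"Name": ["Rose A"], "Country": ["France"],
                     "Rating": [3.7], "Price": [9.0], "Year": [2020]})

# Tag each frame with its wine type, then stack everything into one table.
frames = {"red": red, "white": white, "rose": rose}
wines = pd.concat(
    [frame.assign(Type=wine_type) for wine_type, frame in frames.items()],
    ignore_index=True,  # rebuild a clean 0..n-1 index after stacking
)

print(wines["Type"].value_counts())
```

Adding the `Type` column before concatenating keeps the origin of each row once the three tables are stacked.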

Data Cleaning and Preprocessing

Before diving into the analysis, we carefully prepared the dataset to ensure its quality. Numeric columns like Rating, Price, and Year were correctly formatted, while categorical columns such as Type and Country were standardized. Missing or invalid values were addressed, duplicates were removed, and outliers—especially in the Price column, where a few luxury wines created a skewed distribution—were carefully handled to ensure a clean and reliable dataset for exploration.
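The cleaning steps above could look roughly like this in pandas. The sample rows are invented, and capping prices at the 99th percentile is only one possible way to handle outliers; the article does not say which treatment was actually used.

```python
import pandas as pd

# Invented raw rows with typical problems: a duplicate, text in numeric columns,
# inconsistent casing and whitespace, and one extreme price.
raw = pd.DataFrame({
    "Name": ["Wine A", "Wine A", "Wine B", "Wine C", "Wine D"],
    "Country": [" france", " france", "Italy ", "SPAIN", "Italy"],
    "Type": ["Red", "Red", "white", "ROSE", "red"],
    "Rating": ["4.1", "4.1", "3.8", "n/a", "4.4"],
    "Price": ["25.0", "25.0", "12.5", "9.0", "900"],
    "Year": ["2016", "2016", "2018", "N.V.", "2015"],
})

clean = raw.copy()

# Convert numeric columns; values that cannot be parsed become NaN.
for col in ["Rating", "Price", "Year"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

# Standardize categorical columns.
clean["Country"] = clean["Country"].str.strip().str.title()
clean["Type"] = clean["Type"].str.strip().str.lower()

# Drop rows with missing key values, then exact duplicates.
clean = clean.dropna(subset=["Rating", "Price", "Year"]).drop_duplicates()
clean["Year"] = clean["Year"].astype(int)

# Assumed outlier treatment: cap prices at the 99th percentile in a separate column.
price_cap = clean["Price"].quantile(0.99)
clean["PriceCapped"] = clean["Price"].clip(upper=price_cap)

clean = clean.reset_index(drop=True)
print(clean)
```

Keeping the capped price in its own column leaves the original values available for the price distribution analysis.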

Exploratory Data Analysis

Distribution of Ratings

The rating distribution tells us how wines are rated overall.

The analysis shows that most wines receive relatively high ratings, with a median of 3.9, indicating overall good quality. Very few wines are poorly rated, as reflected by the minimum rating of 2.5. Overall, ratings are concentrated between 3.7 and 4.1, demonstrating a consistent level of high quality with low variability across the dataset.
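Summary figures like these can be read directly from the rating column. A minimal sketch, using a made-up `ratings` series rather than the article's data:

```python
import pandas as pd

# Made-up ratings for illustration only; not values from the article's dataset.
ratings = pd.Series([3.2, 3.6, 3.7, 3.8, 3.9, 3.9, 4.0, 4.1, 4.2, 4.5], name="Rating")

median_rating = ratings.median()
min_rating = ratings.min()
q1, q3 = ratings.quantile([0.25, 0.75])  # bounds of the interquartile range

print(f"median={median_rating:.2f}, min={min_rating:.2f}")
print(f"interquartile range: {q1:.2f} to {q3:.2f}")
```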

Distribution of Prices

Because price is right-skewed, it helps to plot it on a log scale as well.

The majority of wines are affordable, with a median price of approximately $16. A small number of very expensive wines create a right-skewed distribution, stretching the upper end of the price range while most wines remain in the accessible price segment.
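To see why a log scale helps, one can compare the skewness of prices before and after a log transform. This sketch uses invented prices and the hypothetical names `prices` and `log_prices`:

```python
import numpy as np
import pandas as pd

# Invented prices: mostly affordable bottles plus two expensive ones.
prices = pd.Series([8.0, 10.0, 12.0, 15.0, 16.0, 18.0, 22.0, 30.0, 120.0, 450.0],
                   name="Price")

log_prices = np.log10(prices)  # compresses the long right tail

print("median price:", prices.median())
print("skewness of raw prices:", round(prices.skew(), 2))
print("skewness of log10 prices:", round(log_prices.skew(), 2))
```

A lower skewness after the transform means a histogram of the logged prices spreads the affordable segment out instead of squeezing it into the first few bins.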

Price vs Rating

Are expensive wines really higher rated?
Do cheap wines also have high ratings?

A correlation coefficient quantifies this relationship.

Examining the relationship between price and rating revealed a moderate correlation of 0.44, suggesting that while higher-priced wines tend to have slightly higher ratings, price alone does not guarantee quality.
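A coefficient like this can be computed with pandas. The sketch below uses invented values and shows both Pearson and Spearman correlation; the article does not state which one was used, so treat the choice as an assumption.

```python
import pandas as pd

# Invented price/rating pairs for illustration only.
wines = pd.DataFrame({
    "Price": [8.0, 12.0, 16.0, 20.0, 35.0, 60.0, 150.0],
    "Rating": [3.6, 3.9, 3.7, 4.0, 3.9, 4.2, 4.4],
})

# Pearson measures linear association; Spearman uses ranks and is less
# sensitive to a few very expensive bottles.
pearson = wines["Price"].corr(wines["Rating"], method="pearson")
spearman = wines["Price"].corr(wines["Rating"], method="spearman")

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```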

Average Rating by Wine Type

By comparing red, white, and rose wines, we can identify which type tends to receive higher ratings. The analysis shows that red wines have slightly higher average ratings than both white and rose varieties, suggesting a small but consistent preference for reds among consumers.
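A comparison like this boils down to a group-by on the type column. A minimal sketch with contrived data (the numbers are placeholders, not the article's results):

```python
import pandas as pd

# Contrived sample; the real dataset has thousands of wines per type.
wines = pd.DataFrame({
    "Type": ["red", "red", "red", "white", "white", "rose", "rose"],
    "Rating": [4.0, 3.9, 4.1, 3.8, 3.9, 3.7, 3.8],
})

rating_by_type = (
    wines.groupby("Type")["Rating"]
    .agg(avg_rating="mean", n_wines="count")  # keep the count next to the mean
    .sort_values("avg_rating", ascending=False)
)

print(rating_by_type)
```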

Top Countries by Average Rating

The graph shows Moldova, Lebanon, and Croatia at the top, meaning that wines from these countries have the highest average ratings in our dataset.
Before trusting this result, we must check how many wines come from each of these countries.

Why? Because the reliability of an average depends on how many wines it is based on:

- If a country has fewer than 20 wines, its average is unreliable.
- If it has 200 wines, its average is much more reliable.

This is called sample size bias.

Although Moldova, Lebanon, and Croatia show high average ratings, their limited sample sizes suggest that these results should be interpreted with caution.

To correct this, we restricted the ranking to countries with enough wines and recomputed the averages.

Although Moldova, Lebanon, and Croatia display the highest average ratings, each country is represented by fewer than 20 wines in the dataset. As a result, these averages are likely influenced by small-sample bias and should be interpreted with caution. To ensure reliable conclusions, we focused our analysis on countries with a sufficient number of observations.
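One way to apply such a filter is to compute the count and the mean together, then keep only countries above a minimum count. In this sketch the data is invented and `MIN_WINES` is set to 3 so the toy example has something to show; on the real dataset a threshold of 20 would match the rule of thumb above.

```python
import pandas as pd

# Invented sample: one country appears only once but has a very high rating.
wines = pd.DataFrame({
    "Country": ["France"] * 4 + ["Italy"] * 3 + ["Moldova"],
    "Rating": [3.9, 4.0, 3.8, 4.1, 3.7, 3.9, 3.8, 4.5],
})

MIN_WINES = 3  # assumption for the toy data; 20 on the full dataset

by_country = wines.groupby("Country")["Rating"].agg(n_wines="count", avg_rating="mean")

# Keep only countries with enough wines, then rank by average rating.
reliable = (
    by_country[by_country["n_wines"] >= MIN_WINES]
    .sort_values("avg_rating", ascending=False)
)

print(by_country.sort_values("avg_rating", ascending=False))
print(reliable)
```

In the unfiltered table the single-wine country ranks first; after filtering it disappears from the ranking.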

Ratings Over Time

Next, we analyze how the average rating varies with vintage year.

This graph illustrates the evolution of the average wine rating over time by showing how ratings vary across different vintage years. It helps to evaluate whether the vintage has a significant influence on perceived quality. If an upward trend is observed, it suggests that newer wines tend to receive higher ratings, possibly due to improvements in production techniques or changing consumer preferences. Conversely, fluctuations or stable patterns indicate that vintage alone is not sufficient to determine quality. Overall, this analysis highlights the role of the production year as a secondary factor in wine evaluation, complementing other important variables such as price and origin.
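A simple way to check for such a trend is to average ratings per vintage and fit a straight line through the yearly means. The sketch below uses invented data, and the linear fit is an illustrative choice rather than the method used in the project.

```python
import numpy as np
import pandas as pd

# Invented sample with two wines per vintage.
wines = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018],
    "Rating": [3.8, 4.0, 3.9, 3.9, 3.8, 4.0, 3.9, 4.1],
})

yearly_avg = wines.groupby("Year")["Rating"].mean()

# Center the years before fitting to keep the least-squares problem well conditioned.
years = yearly_avg.index.to_numpy(dtype=float)
slope, intercept = np.polyfit(years - years.mean(), yearly_avg.to_numpy(), deg=1)

print(yearly_avg)
print(f"average change in rating per year: {slope:+.3f}")
```

A slope close to zero would support the reading that vintage alone says little about ratings.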

Best Value for Money Wines

This part identifies the best value-for-money wines in the dataset by selecting bottles that combine high quality with affordable prices.

More specifically, it filters the data to keep only wines that have a rating of at least 4.3 (highly appreciated by users) and a price of 20 or less (considered relatively inexpensive). It then sorts these wines by rating in descending order and displays the top 10 highest-rated affordable wines.
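The filter described above can be sketched as follows. The thresholds come from the text; the sample wines and the names `MIN_RATING`, `MAX_PRICE`, and `best_value` are illustrative.

```python
import pandas as pd

# Invented sample wines.
wines = pd.DataFrame({
    "Name": ["Wine A", "Wine B", "Wine C", "Wine D", "Wine E"],
    "Rating": [4.5, 4.3, 4.6, 4.1, 4.4],
    "Price": [18.0, 20.0, 45.0, 9.0, 12.0],
})

MIN_RATING = 4.3  # highly rated
MAX_PRICE = 20    # relatively inexpensive

best_value = (
    wines[(wines["Rating"] >= MIN_RATING) & (wines["Price"] <= MAX_PRICE)]
    .sort_values("Rating", ascending=False)
    .head(10)  # top 10 highest-rated affordable wines
)

print(best_value)
```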

This analysis highlights wines that offer an excellent quality–price ratio, which can be used to support recommendations, help users discover hidden gems, and guide business strategies focused on promoting high-value products.

Summary of EDA

The analysis reveals that most wines have ratings concentrated around 4, indicating generally good quality. Wine prices show a right-skewed distribution, with the majority being affordable and a few luxury wines driving up the high end. While there is a moderate correlation between price and rating, a higher price does not always guarantee better quality. Red wines tend to slightly outperform white and rose varieties, and certain countries consistently produce higher-rated wines. Vintage year appears to have only a minor effect on ratings. Finally, several value-for-money wines stand out, providing actionable insights for recommendations and business strategy.

Predictive Modeling

Target and Features

For our predictive modeling, the target variable is the wine Rating, which represents the score given by users. The features used to predict this rating include Price, Vintage Year, Country of Origin, and Type of wine (red, white, or rose). These variables capture both the economic and qualitative aspects of the wine, allowing the model to learn patterns that influence user preferences and perceived quality.

Train and Test

To prepare the data for modeling, we first split it into training and testing sets using an 80/20 ratio, ensuring the model can be evaluated on unseen data. The numerical features (Price and Year) were standardized using a scaler, while categorical features (Country and Type) were encoded with one-hot encoding to convert them into a machine-readable format. These preprocessing steps were combined into a ColumnTransformer pipeline, which ensures that all data is properly transformed before being fed into the predictive model.
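A minimal sketch of this preprocessing with scikit-learn is shown below. The data is synthetic (generated with a fixed seed), and `random_state=42` and `handle_unknown="ignore"` are assumptions, since the article does not give these settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned wine table.
rng = np.random.default_rng(0)
n = 200
wines = pd.DataFrame({
    "Price": rng.lognormal(mean=3.0, sigma=0.8, size=n),
    "Year": rng.integers(2005, 2021, size=n),
    "Country": rng.choice(["France", "Italy", "Spain"], size=n),
    "Type": rng.choice(["red", "white", "rose"], size=n),
})
wines["Rating"] = (3.2 + 0.2 * np.log(wines["Price"]) + rng.normal(0, 0.15, n)).clip(1, 5)

X = wines[["Price", "Year", "Country", "Type"]]
y = wines["Rating"]

# 80/20 split so the model can later be evaluated on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Price", "Year"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country", "Type"]),
])

# Fit the transformations on the training rows only, then reuse them on the test rows.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)

print(X_train_t.shape, X_test_t.shape)
```

Fitting the scaler and encoder on the training set only avoids leaking information from the test set into the preprocessing.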

The Random Forest model

The Random Forest model achieved a Mean Absolute Error (MAE) of 0.161, meaning that predicted ratings deviate from actual ratings by about 0.16 points on average. The R² score of 0.511 shows that the model explains about 51% of the variance in wine ratings. The model therefore captures a substantial share of the factors influencing ratings, particularly price and vintage year, but roughly half of the variance remains unexplained, likely due to other qualitative factors such as taste preferences, winery reputation, or unobserved characteristics. Overall, the model provides reasonable predictive performance for guiding value-based wine recommendations and pricing strategies.
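For readers who want to reproduce this kind of evaluation, here is a self-contained sketch on synthetic data. The hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions, and the printed scores describe the synthetic data only, not the article's results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned wine table.
rng = np.random.default_rng(0)
n = 300
wines = pd.DataFrame({
    "Price": rng.lognormal(mean=3.0, sigma=0.8, size=n),
    "Year": rng.integers(2005, 2021, size=n),
    "Country": rng.choice(["France", "Italy", "Spain"], size=n),
    "Type": rng.choice(["red", "white", "rose"], size=n),
})
wines["Rating"] = (3.2 + 0.2 * np.log(wines["Price"]) + rng.normal(0, 0.15, n)).clip(1, 5)

X = wines[["Price", "Year", "Country", "Type"]]
y = wines["Rating"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing and model chained in one pipeline.
model = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), ["Price", "Year"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country", "Type"]),
    ])),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=42)),
])
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)  # average absolute error in rating points
r2 = r2_score(y_test, predictions)              # share of variance explained on the test set

print(f"MAE: {mae:.3f}, R²: {r2:.3f}")
```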

Accurate Rating Predictions: The model can estimate a wine’s rating based on its price, country, type, and vintage year. This is useful for recommending wines that users are likely to enjoy, even when a wine has few existing reviews.

Price-Quality Relationship: While price does influence the rating to some extent, the model demonstrates that other factors—such as country, type, and year—also play a significant role in predicting quality.

Business Impact: Vivino can leverage this model to highlight “value-for-money” wines, identify underrated wines, and optimize strategic pricing to better guide consumers.

To make this report even more insightful, we can examine which variables have the greatest influence on wine ratings. By analyzing feature importance from the trained Random Forest model, we can identify the key drivers behind the predictions and better understand how factors like price, country, type, and vintage year contribute to perceived wine quality.

The feature importances can be displayed as a horizontal bar plot, with feature names (Price, Year, Country_…, etc.) on the y-axis and importance values (ranging from 0 to 1) on the x-axis. The Price feature visually dominates the chart. A bar plot keeps the result clear and easy to read without needing a table.

Price is the dominant factor, accounting for approximately 78% of the model's total feature importance. Note that this is a share of the importance score the Random Forest assigns to its features, not a share of the variation in ratings. As expected, higher-priced wines tend to receive slightly higher ratings, though this is not guaranteed. Vintage year is the second most important factor, contributing around 7.6% of the importance. Both older and more recent vintages can subtly influence perceived quality. In contrast, the impact of country of origin and wine type (red, white, or rose) is relatively minor, each contributing roughly 1% individually. This confirms that price and year are the primary drivers of the model's predictions, while categorical features such as country and type can still help fine-tune recommendations.

From a business perspective, Vivino can leverage these insights to highlight wines that offer the best value for money, enabling the platform to recommend wines based on perceived quality rather than relying solely on raw ratings.

Conclusion

This project successfully explored the relationship between wine price, quality, and origin using data science techniques. By combining data cleaning, exploratory analysis, and predictive modeling, we were able to identify the key factors that influence wine ratings.

The Random Forest model demonstrated that price and vintage year are the strongest predictors of wine quality, while country of origin and type play a smaller but meaningful role. The model’s performance, with an MAE of 0.16 and R² of 0.51, shows it can reasonably predict wine ratings, providing valuable guidance for wine recommendations.

From a business perspective, these insights empower Vivino to highlight value-for-money wines, identify underrated wines, and optimize pricing strategies, ultimately enhancing customer satisfaction.

Overall, this project illustrates how leveraging large wine datasets and machine learning can transform raw data into actionable insights, supporting smarter decisions for both businesses and consumers. Future work could incorporate additional features, such as tasting notes, winery reputation, or user reviews, to further improve the predictive performance and recommendation quality.
