DEV Community: Elghalia

Analyzing the Relationship Between Wine Price, Quality, and Origin Using Data Science

Elghalia — Tue, 27 Jan 2026 15:28:57 +0000

Introduction
Data Collection and Merging
Data Cleaning and Preprocessing
Exploratory Data Analysis

Distribution of Ratings
Distribution of Prices
Price vs Rating
Average Rating by Wine Type
Top Countries by Average Rating
Ratings Over Time
Best Value for Money Wines

Predictive Modeling

Target and Features
Train and Test
The Random Forest model

Conclusion

Introduction

Wine is more than just a beverage — it is a global market with diverse varieties, prices, and qualities. For millions of consumers, choosing the right wine can be overwhelming: should they pay more for a reputed brand, or are there hidden gems that offer great quality at an affordable price?

In the digital age, online wine marketplaces like Vivino provide access to extensive wine data, including ratings, reviews, prices, and origins. Leveraging this data with data science techniques can help both businesses and consumers make smarter decisions.

In this project, we aim to explore the wine market using a dataset of over 12,000 wines across red, white, and rose varieties. Specifically, we focus on the relationship between price and quality, identify value-for-money wines, and analyze how features like wine type, country, and vintage influence ratings.

Through data cleaning, exploratory analysis, and predictive modeling, this project demonstrates how data science can transform raw wine data into actionable insights for businesses, helping them recommend wines effectively and optimize pricing strategies.

Data Collection and Merging

To get a complete view of the wine market, we combined three separate datasets for red, white, and rose wines into a single, unified dataset. Each entry includes essential information about the wine, such as its name, country of origin, region, winery, rating, number of reviews, price, vintage year, and type, providing a rich foundation for analysis and insights.
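A minimal sketch of the merge step, assuming each wine type lives in its own table and that a Type column is added while stacking them. The tiny frames, the column names, and the `merge_wines` helper below are made-up stand-ins for illustration, not the actual source files.

```python
import pandas as pd

# Made-up stand-ins for the three per-type datasets (column names are assumptions).
red = pd.DataFrame({"Name": ["Red A", "Red B"], "Country": ["France", "Italy"],
                    "Rating": [4.1, 3.8], "Price": [25.0, 12.5], "Year": ["2016", "2018"]})
white = pd.DataFrame({"Name": ["White A"], "Country": ["Spain"],
                      "Rating": [3.9], "Price": [10.0], "Year": ["2019"]})
rose = pd.DataFrame({"Name": ["Rose A"], "Country": ["France"],
                     "Rating": [3.7], "Price": [9.0], "Year": ["N.V."]})


def merge_wines(frames_by_type):
    """Stack per-type frames into one frame, tagging each row with its wine type."""
    tagged = [df.assign(Type=wine_type) for wine_type, df in frames_by_type.items()]
    return pd.concat(tagged, ignore_index=True)


wines = merge_wines({"red": red, "white": white, "rose": rose})
print(wines[["Name", "Type"]])
```

Tagging the type before concatenating keeps the origin of each row recoverable after the three tables become one.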

Data Cleaning and Preprocessing

Before diving into the analysis, we carefully prepared the dataset to ensure its quality. Numeric columns like Rating, Price, and Year were correctly formatted, while categorical columns such as Type and Country were standardized. Missing or invalid values were addressed, duplicates were removed, and outliers—especially in the Price column, where a few luxury wines created a skewed distribution—were carefully handled to ensure a clean and reliable dataset for exploration.
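The cleaning steps above could look roughly like this. It is a sketch under assumptions: the column names are hypothetical, and capping prices at a high quantile is only one possible way to handle the luxury outliers, since the exact treatment is not specified here.

```python
import pandas as pd

# Made-up raw rows with typical problems: strings, a duplicate, bad values, an outlier.
raw = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "D"],
    "Country": [" france", " france", "Italy ", "SPAIN", "Italy"],
    "Type": ["Red", "Red", "white", "ROSE", "red"],
    "Rating": ["4.1", "4.1", "3.8", "bad", "4.5"],
    "Price": ["12.5", "12.5", "9.0", "15.0", "3000"],
    "Year": ["2016", "2016", "N.V.", "2019", "2010"],
})


def clean_wines(df, price_quantile=0.99):
    out = df.copy()
    # Numeric columns: anything unparseable becomes NaN.
    for col in ["Rating", "Price", "Year"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Categorical columns: trim whitespace and normalize case.
    out["Type"] = out["Type"].str.strip().str.lower()
    out["Country"] = out["Country"].str.strip().str.title()
    # Drop rows with missing values, then exact duplicates.
    out = out.dropna(subset=["Rating", "Price", "Year"]).drop_duplicates()
    # Assumption: cap extreme prices at a high quantile instead of deleting them.
    cap = out["Price"].quantile(price_quantile)
    out["Price"] = out["Price"].clip(upper=cap)
    return out.reset_index(drop=True)


clean = clean_wines(raw)
print(clean)
```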

Exploratory Data Analysis

Distribution of Ratings

This tells us how wines are rated overall.

The analysis shows that most wines receive relatively high ratings, with a median of 3.9, indicating overall good quality. Very few wines are poorly rated, as reflected by the minimum rating of 2.5. Overall, ratings are concentrated between 3.7 and 4.1, demonstrating a consistent level of high quality with low variability across the dataset.

Distribution of Prices

Because price is right-skewed, it helps to plot the distribution on a log scale as well.

The majority of wines are affordable, with a median price of approximately $16. A small number of very expensive wines create a right-skewed distribution, stretching the upper end of the price range while most wines remain in the accessible price segment.
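To illustrate why a log scale helps, here is a small sketch on made-up prices (not the real dataset). It compares the skewness before and after a log transform and counts wines in log-spaced bins, which is what a log-scale histogram displays.

```python
import numpy as np
import pandas as pd

# Made-up prices: mostly affordable bottles plus a couple of luxury outliers.
prices = pd.Series([6, 8, 10, 12, 14, 16, 18, 22, 30, 45, 120, 900], dtype=float)

print("median:", prices.median())
print("skew (raw):", round(prices.skew(), 2))
print("skew (log10):", round(np.log10(prices).skew(), 2))

# Log-spaced bin edges from 1 to 1000, i.e. equal-width bins on a log axis.
edges = np.logspace(0, 3, num=7)
counts, _ = np.histogram(prices, bins=edges)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:8.1f} to {hi:8.1f}: {c}")
```

On a linear axis the two most expensive bottles would stretch the plot and squash everything else into the first bar; the log-spaced bins spread the affordable segment out.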

Price vs Rating

Are expensive wines really higher rated?
Do cheap wines also have high ratings?

To quantify the relationship, we also compute the correlation coefficient.

Examining the relationship between price and rating revealed a moderate correlation of 0.44, suggesting that while higher-priced wines tend to have slightly higher ratings, price alone does not guarantee quality.
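A correlation like the one above can be computed in one line with pandas. The sketch below uses made-up values, so it will not reproduce 0.44. It also shows a rank-based (Spearman) variant, computed here as the Pearson correlation of ranks, which is less sensitive to the skew in price; that variant is an addition for illustration, not something reported above.

```python
import pandas as pd

# Made-up price/rating pairs for illustration only.
wines = pd.DataFrame({
    "Price": [8.0, 12.0, 15.0, 20.0, 35.0, 60.0, 150.0],
    "Rating": [3.6, 3.9, 3.7, 4.0, 3.9, 4.2, 4.4],
})

# Pearson correlation: strength of the linear relationship.
pearson = wines["Price"].corr(wines["Rating"])

# Spearman correlation: Pearson correlation of the ranks (monotonic relationship).
spearman = wines["Price"].rank().corr(wines["Rating"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```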

Average Rating by Wine Type

By comparing red, white, and rose wines, we can identify which type tends to receive higher ratings. The analysis shows that red wines have slightly higher average ratings than both white and rose varieties, suggesting a small but consistent preference for reds among consumers.
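The comparison by type boils down to a group-by. A sketch on made-up ratings, assuming columns named Type and Rating:

```python
import pandas as pd

# Made-up ratings for illustration only.
wines = pd.DataFrame({
    "Type": ["red", "red", "red", "white", "white", "rose"],
    "Rating": [4.0, 3.9, 4.1, 3.8, 3.9, 3.7],
})

# Mean rating per type, with the number of wines behind each mean.
by_type = (
    wines.groupby("Type")["Rating"]
    .agg(mean_rating="mean", n_wines="count")
    .sort_values("mean_rating", ascending=False)
)
print(by_type)
```

Keeping the count next to the mean makes it easy to see whether a small difference between types rests on enough wines to be meaningful.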

Top Countries by Average Rating

The graph shows Moldova, Lebanon, and Croatia at the top, meaning that wines from these countries have the highest average ratings in our dataset.
Before trusting this result, we must check how many wines come from each of these countries.

Why?

Because:

- If a country has fewer than 20 wines, its average is unreliable
- If it has 200 wines, its average is much more reliable

This is called sample size bias: an average computed from a handful of wines can swing widely with a single rating.

To correct this, we recompute the ranking using only countries with enough wines.

Although Moldova, Lebanon, and Croatia display the highest average ratings, each country is represented by fewer than 20 wines in the dataset. As a result, these averages are likely influenced by small-sample bias and should be interpreted with caution. To ensure reliable conclusions, we focused our analysis on countries with a sufficient number of observations.
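The sample-size check described above can be sketched as follows. The data is made up, `top_countries` is a hypothetical helper, and the threshold of 20 wines mirrors the rule of thumb mentioned in the text.

```python
import pandas as pd

# Made-up data: one country with very few wines and a high average.
wines = pd.DataFrame({
    "Country": ["Moldova"] * 2 + ["France"] * 25 + ["Italy"] * 30,
    "Rating": [4.5] * 2 + [3.9] * 25 + [4.0] * 30,
})


def top_countries(df, min_wines=20, n=10):
    """Rank countries by mean rating, keeping only those with enough wines."""
    stats = df.groupby("Country")["Rating"].agg(mean_rating="mean", n_wines="count")
    reliable = stats[stats["n_wines"] >= min_wines]
    return reliable.sort_values("mean_rating", ascending=False).head(n)


naive_top = wines.groupby("Country")["Rating"].mean().idxmax()
ranking = top_countries(wines)
print("Naive winner:", naive_top)
print(ranking)
```

The naive ranking crowns the country with only two wines, while the filtered ranking only compares countries whose averages rest on at least 20 observations.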

Ratings Over Time

Next, we analyze how the average rating changes with vintage year.

This graph illustrates the evolution of the average wine rating over time by showing how ratings vary across different vintage years. It helps to evaluate whether the vintage has a significant influence on perceived quality. If an upward trend is observed, it suggests that newer wines tend to receive higher ratings, possibly due to improvements in production techniques or changing consumer preferences. Conversely, fluctuations or stable patterns indicate that vintage alone is not sufficient to determine quality. Overall, this analysis highlights the role of the production year as a secondary factor in wine evaluation, complementing other important variables such as price and origin.
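The series behind such a graph is a per-year average. A sketch on made-up values, assuming a numeric Year column:

```python
import pandas as pd

# Made-up vintages and ratings for illustration only.
wines = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2017, 2017, 2017],
    "Rating": [4.0, 3.8, 3.9, 3.7, 3.9, 3.8],
})

# Average rating per vintage, in chronological order, with the wine count per year.
by_year = (
    wines.groupby("Year")["Rating"]
    .agg(mean_rating="mean", n_wines="count")
    .sort_index()
)
print(by_year)
```

As with countries, vintages represented by only a few wines produce noisy averages, so the count column is worth checking before reading a trend into the line.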

Best Value for Money Wines

This part identifies the best value-for-money wines in the dataset by selecting bottles that combine high quality with affordable prices.

More specifically, it filters the data to keep only wines that have a rating of at least 4.3 (highly appreciated by users) and a price of 20 or less (considered relatively inexpensive). It then sorts these wines by rating in descending order and displays the top 10 highest-rated affordable wines.
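The filter described above can be written as a small function. The thresholds (rating of at least 4.3, price of 20 or less, top 10) come from the text; the data and the `best_value` name are made up for illustration.

```python
import pandas as pd

# Made-up wines for illustration only.
wines = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Rating": [4.5, 4.3, 4.6, 4.2, 4.4],
    "Price": [18.0, 20.0, 55.0, 9.0, 12.0],
})


def best_value(df, min_rating=4.3, max_price=20.0, n=10):
    """Keep highly rated, affordable wines and return the n best by rating."""
    mask = (df["Rating"] >= min_rating) & (df["Price"] <= max_price)
    return df[mask].sort_values("Rating", ascending=False).head(n)


picks = best_value(wines)
print(picks)
```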

This analysis highlights wines that offer an excellent quality–price ratio, which can be used to support recommendations, help users discover hidden gems, and guide business strategies focused on promoting high-value products.

Summary of EDA:

The analysis reveals that most wines have ratings concentrated around 4, indicating generally good quality. Wine prices show a right-skewed distribution, with the majority being affordable and a few luxury wines driving up the high end. While there is a moderate correlation between price and rating, a higher price does not always guarantee better quality. Red wines tend to slightly outperform white and rose varieties, and certain countries consistently produce higher-rated wines. Vintage year appears to have only a minor effect on ratings. Finally, several value-for-money wines stand out, providing actionable insights for recommendations and business strategy.

Predictive Modeling

Target and Features

For our predictive modeling, the target variable is the wine Rating, which represents the score given by users. The features used to predict this rating include Price, Vintage Year, Country of Origin, and Type of wine (red, white, or rose). These variables capture both the economic and qualitative aspects of the wine, allowing the model to learn patterns that influence user preferences and perceived quality.

Train and Test

To prepare the data for modeling, we first split it into training and testing sets using an 80/20 ratio, ensuring the model can be evaluated on unseen data. The numerical features (Price and Year) were standardized using a scaler, while categorical features (Country and Type) were encoded with one-hot encoding to convert them into a machine-readable format. These preprocessing steps were combined into a ColumnTransformer pipeline, which ensures that all data is properly transformed before being fed into the predictive model.
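A sketch of this preprocessing with scikit-learn, using a made-up ten-row table. The column names and `random_state` are assumptions; `handle_unknown="ignore"` is also an assumption, added so that a country seen only in the test set does not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data for illustration only.
wines = pd.DataFrame({
    "Price": [8.0, 12.0, 15.0, 20.0, 35.0, 60.0, 9.0, 14.0, 25.0, 150.0],
    "Year": [2018, 2017, 2019, 2016, 2015, 2012, 2020, 2018, 2016, 2010],
    "Country": ["France", "Italy", "Spain", "France", "Italy",
                "France", "Spain", "Italy", "France", "Italy"],
    "Type": ["red", "white", "rose", "red", "red",
             "red", "white", "rose", "white", "red"],
    "Rating": [3.6, 3.8, 3.7, 4.0, 4.1, 4.3, 3.5, 3.7, 3.9, 4.5],
})

X = wines[["Price", "Year", "Country", "Type"]]
y = wines["Rating"]

# 80/20 split; the test rows stay unseen until evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Price", "Year"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country", "Type"]),
])

# Fit on the training rows only, then apply the same transform to the test rows.
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Fitting the scaler and encoder on the training set alone avoids leaking information from the test set into the model.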

The Random Forest model

The Random Forest model achieved a Mean Absolute Error (MAE) of 0.161, indicating that on average, the predicted wine ratings deviate by approximately 0.16 points from the actual ratings on the rating scale. The R² score of 0.511 shows that the model explains about 51% of the variance in wine ratings. This suggests that while the model captures a significant portion of the factors influencing ratings—particularly price and vintage year—there remains some variability that is not accounted for, likely due to other qualitative factors such as taste preferences, winery reputation, or unobserved characteristics. Overall, the model provides a reasonable predictive performance for guiding value-based wine recommendations and pricing strategies.
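For reference, here is a sketch of how such a model can be trained and scored. It runs on synthetic data generated inside the example, so its MAE and R² will differ from the 0.161 and 0.511 reported above; the hyperparameters and step names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: rating rises with log-price plus noise (illustrative only).
rng = np.random.default_rng(0)
n = 300
wines = pd.DataFrame({
    "Price": rng.lognormal(mean=3.0, sigma=0.8, size=n),
    "Year": rng.integers(2000, 2021, size=n),
    "Country": rng.choice(["France", "Italy", "Spain"], size=n),
    "Type": rng.choice(["red", "white", "rose"], size=n),
})
noise = rng.normal(0, 0.1, size=n)
wines["Rating"] = (3.0 + 0.6 * np.log10(wines["Price"]) + noise).clip(1, 5)

X = wines[["Price", "Year", "Country", "Type"]]
y = wines["Rating"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocessing and model in one pipeline, fitted on the training set only.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["Price", "Year"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country", "Type"]),
    ])),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)  # average absolute error in rating points
r2 = r2_score(y_test, pred)              # share of rating variance explained
print(f"MAE: {mae:.3f}  R2: {r2:.3f}")
```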

Accurate Rating Predictions: The model can estimate a wine’s rating based on its price, country, type, and vintage year. This is useful for recommending wines that users are likely to enjoy, even when a wine has few existing reviews.

Price-Quality Relationship: While price does influence the rating to some extent, the model demonstrates that other factors—such as country, type, and year—also play a significant role in predicting quality.

Business Impact: Vivino can leverage this model to highlight “value-for-money” wines, identify underrated wines, and optimize strategic pricing to better guide consumers.

To make this report even more insightful, we can examine which variables have the greatest influence on wine ratings. By analyzing feature importance from the trained Random Forest model, we can identify the key drivers behind the predictions and better understand how factors like price, country, type, and vintage year contribute to perceived wine quality.

The feature importances are displayed as a horizontal bar plot, with feature names (Price, Year, Country_…, etc.) on the y-axis and importance values (ranging from 0 to 1) on the x-axis. The Price feature visually dominates the chart, and the plot is easier to read than a table of values.

Price is the dominant factor, accounting for approximately 78% of the model's total feature importance. Note that this is a share of the importance the Random Forest assigns to its features, not a share of the variation in ratings (the model as a whole explains about 51% of that variation). As expected, higher-priced wines tend to receive slightly higher ratings, though this is not always guaranteed. Vintage year is the second most important factor, contributing around 7.6% to the rating prediction. Both older and more recent vintages can subtly influence perceived quality. In contrast, the impact of country of origin and wine type (red, white, or rose) is relatively minor, with individual encoded features contributing roughly 1% each.

This suggests that price and year are the primary drivers of the model's predictions, while categorical features such as country and type can still help fine-tune recommendations. From a business perspective, Vivino can leverage these insights to highlight wines that offer the best value for money, enabling the platform to recommend wines based on predicted quality rather than relying solely on raw ratings.

Conclusion

This project successfully explored the relationship between wine price, quality, and origin using data science techniques. By combining data cleaning, exploratory analysis, and predictive modeling, we were able to identify the key factors that influence wine ratings.

The Random Forest model demonstrated that price and vintage year are the strongest predictors of wine quality, while country of origin and type play a smaller but meaningful role. The model’s performance, with an MAE of 0.16 and R² of 0.51, shows it can reasonably predict wine ratings, providing valuable guidance for wine recommendations.

From a business perspective, these insights empower Vivino to highlight value-for-money wines, identify underrated wines, and optimize pricing strategies, ultimately enhancing customer satisfaction.

Overall, this project illustrates how leveraging large wine datasets and machine learning can transform raw data into actionable insights, supporting smarter decisions for both businesses and consumers. Future work could incorporate additional features, such as tasting notes, winery reputation, or user reviews, to further improve the predictive performance and recommendation quality.

My Mobapp Studio

Elghalia — Fri, 23 Jan 2026 15:56:59 +0000

Introduction

Most Popular Paid Apps in the Family Category

Bar chart visualization
Interpretation

Introduction

Mobile applications are now a central part of digital life, and the Family category on Google Play plays a key role in education, entertainment, and parental support. While most market analyses focus on free applications, paid apps follow a different economic logic, emphasizing perceived value, niche targeting, and premium positioning. This report presents a data-driven analysis of paid applications, with a particular focus on the Family category, to identify the most popular paid Family apps, the dominant genres in terms of installations, the distribution of installs across categories, price variations between categories, and the most expensive apps in each segment. Using visual analytics, the objective is to transform raw Google Play Store data into actionable insights that support strategic decision-making. The dataset includes app name, category, genre, number of installs, price, and app type (free or paid), and was prepared through standard data cleaning steps such as removing missing values, converting installs and prices into numerical formats, and filtering paid applications where relevant.
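The cleaning steps mentioned above might look like the following sketch. It assumes that installs are stored as strings such as "100,000+" and prices as strings such as "$2.99"; both formats, the column names, and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Made-up raw rows for illustration only.
apps = pd.DataFrame({
    "App": ["Kids ABC", "Math Fun", "Chat Now", None],
    "Category": ["FAMILY", "FAMILY", "COMMUNICATION", "TOOLS"],
    "Installs": ["100,000+", "5,000+", "1,000,000+", "10+"],
    "Price": ["$2.99", "$4.99", "0", "$0.99"],
    "Type": ["Paid", "Paid", "Free", "Paid"],
})


def clean_apps(df):
    out = df.dropna().copy()
    # "100,000+" -> 100000
    installs = out["Installs"].str.replace(r"[,+]", "", regex=True)
    out["Installs"] = pd.to_numeric(installs, errors="coerce")
    # "$2.99" -> 2.99
    price = out["Price"].str.replace("$", "", regex=False)
    out["Price"] = pd.to_numeric(price, errors="coerce")
    # Drop rows whose installs or price could not be parsed.
    return out.dropna(subset=["Installs", "Price"])


clean = clean_apps(apps)
paid = clean[clean["Type"] == "Paid"]
print(paid)
```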

Most Popular Paid Apps in the Family Category

We begin by focusing exclusively on paid apps within the Family category and ranking them by the number of installations.
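A sketch of this ranking on made-up rows, assuming a cleaned table with numeric installs; `top_paid_family` is a hypothetical helper name.

```python
import pandas as pd

# Made-up cleaned rows for illustration only.
apps = pd.DataFrame({
    "App": ["Kids ABC", "Math Fun", "Story Time", "Chat Pro", "Free Puzzles"],
    "Category": ["FAMILY", "FAMILY", "FAMILY", "COMMUNICATION", "FAMILY"],
    "Type": ["Paid", "Paid", "Paid", "Paid", "Free"],
    "Installs": [100000, 5000, 50000, 1000000, 5000000],
})


def top_paid_family(df, n=10):
    """Return the n most installed paid apps in the Family category."""
    mask = (df["Category"] == "FAMILY") & (df["Type"] == "Paid")
    return df[mask].nlargest(n, "Installs")[["App", "Installs"]]


top = top_paid_family(apps, n=2)
print(top)
```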

Bar chart 1

Interpretation 1:

The distribution of installations among paid Family applications exhibits a strong right-skew, indicating a highly concentrated market structure. A limited number of apps capture a disproportionate share of total installs, while the majority remain in a low-download long tail. Notably, the presence of a paywall does not prevent certain applications from reaching several hundred thousand installations, suggesting that demand is driven by perceived value factors such as educational utility, safety, and ad-free usage. This pattern highlights the importance of differentiation and value signaling in paid app adoption within the Family category.

Most Popular Genres in Paid Family Apps

Next, we analyze how installations are distributed across genres within paid Family apps.

Pie chart 2

Genre distribution by number of installs (Paid Family apps)

Interpretation 2:

The distribution of installations across genres within paid Family applications is highly uneven, with educational and learning-oriented genres capturing the majority of total installs. Entertainment-focused genres are present but account for a comparatively smaller share, indicating that functional utility and educational value are stronger drivers of willingness to pay than pure entertainment. Due to the high cardinality of genre categories, the initial visualization suffered from readability limitations; consequently, low-frequency genres were aggregated into an “Others” category to improve interpretability while preserving the overall distribution. The resulting analysis reveals a clear dominance of a limited number of genres, a long-tail structure with marginal contribution from niche genres, and a concentrated demand that suggests strategic opportunities primarily lie within high-demand, education-centric segments.
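The aggregation of low-frequency genres into an “Others” slice can be sketched as below. The totals are made up and `collapse_small` is a hypothetical helper; the same idea applies to any pie chart with too many slices.

```python
import pandas as pd

# Made-up install totals per genre for illustration only.
installs_by_genre = pd.Series(
    {"Education": 900, "Puzzle": 400, "Casual": 50, "Music": 30, "Trivia": 20}
)


def collapse_small(totals, top_n=5, other_label="Others"):
    """Keep the top_n largest entries and sum the rest into one slice."""
    ranked = totals.sort_values(ascending=False)
    head, tail = ranked.iloc[:top_n], ranked.iloc[top_n:]
    if tail.empty:
        return head
    return pd.concat([head, pd.Series({other_label: tail.sum()})])


shares = collapse_small(installs_by_genre, top_n=2)
print(shares)
```

Because the tail is summed rather than dropped, the slices still add up to the full total and the relative shares of the leading genres are unchanged.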

Number of Installations per Category

To put the Family category into context, we compute the total number of installations per category across all paid apps.
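Totals per category come from a group-by and a sum. A sketch on made-up rows, assuming numeric installs:

```python
import pandas as pd

# Made-up rows for illustration only.
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "FAMILY", "FAMILY", "TOOLS"],
    "Installs": [1000000, 500000, 100000, 50000, 10000],
})

# Total installs per category, largest first.
installs_per_category = (
    apps.groupby("Category")["Installs"].sum().sort_values(ascending=False)
)
print(installs_per_category)
```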

Number of installations per category:

Category
GAME 31,544,024,415.0
COMMUNICATION 24,152,276,251.0
SOCIAL 12,513,867,902.0
PRODUCTIVITY 12,463,091,369.0
TOOLS 11,452,771,915.0
FAMILY 10,041,692,505.0
PHOTOGRAPHY 9,721,247,655.0
TRAVEL_AND_LOCAL 6,361,887,146.0
VIDEO_PLAYERS 6,222,002,720.0
NEWS_AND_MAGAZINES 5,393,217,760.0
SHOPPING 2,573,348,785.0
ENTERTAINMENT 2,455,660,000.0
PERSONALIZATION 2,074,494,782.0
BOOKS_AND_REFERENCE 1,916,469,576.0
SPORTS 1,528,574,498.0
HEALTH_AND_FITNESS 1,361,022,512.0
BUSINESS 863,664,865.0
FINANCE 770,348,734.0
MAPS_AND_NAVIGATION 724,281,890.0
LIFESTYLE 534,823,539.0
EDUCATION 533,952,000.0
WEATHER 426,100,520.0
FOOD_AND_DRINK 257,898,751.0
DATING 206,536,107.0
HOUSE_AND_HOME 125,212,461.0
ART_AND_DESIGN 124,338,100.0
LIBRARIES_AND_DEMO 62,995,910.0
COMICS 56,086,150.0
AUTO_AND_VEHICLES 53,130,211.0
MEDICAL 42,204,177.0
PARENTING 31,521,110.0
BEAUTY 27,197,050.0
EVENTS 15,973,161.0
Name: Installs, dtype: object

Interpretation 3:

Installations are highly concentrated in a few categories, such as Games, Communication, and Social, while most categories remain niche. This long-tail distribution shows that high downloads do not automatically signal opportunity due to intense competition. Paid apps perform best in select segments, with the Family category standing out as a strong monetization opportunity.

Distribution of Installations per Category

To better visualize this distribution, we plot a pie chart.

Pie chart 3

Distribution of installs per category

Interpretation 4:

A small number of categories account for the majority of the market, while most other categories occupy niche segments, forming a long-tail distribution. This highlights that high download volumes do not automatically indicate attractive opportunities, as competition is intense in dominant categories.

Data Visualization Principle:

A chart must be readable before it can be informative. The initial pie chart, which included all categories, was cluttered and difficult to interpret. To enhance clarity, we focused on the top categories and aggregated the remaining ones into an “Others” group, enabling a clear comparison of market dominance while preserving the overall distribution.

Insight:

The market is concentrated in a few dominant categories, where high download volumes do not necessarily translate into strong monetization. Greater opportunities exist in mid-sized categories with lower competition, where targeted apps can achieve both adoption and revenue.

Mean Price per Category

Price is a strong signal of perceived value. We calculate and visualize the average price of paid apps per category.
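The average price per category can be computed as follows. The rows are made up; free apps are filtered out first so that zero prices do not drag the averages down.

```python
import pandas as pd

# Made-up rows for illustration only.
apps = pd.DataFrame({
    "Category": ["MEDICAL", "MEDICAL", "FAMILY", "FAMILY", "FAMILY"],
    "Type": ["Paid", "Paid", "Paid", "Paid", "Free"],
    "Price": [20.0, 10.0, 3.0, 5.0, 0.0],
})

# Restrict to paid apps, then average the price within each category.
paid = apps[apps["Type"] == "Paid"]
mean_price = paid.groupby("Category")["Price"].mean().sort_values(ascending=False)
print(mean_price)
```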

Bar chart 4

Mean price per category

Interpretation 5:

Average prices are highest in Business, Medical, and Productivity apps, while Family apps strike a balance between affordability and premium positioning, reflecting audience- and use-case–driven pricing strategies.

Most Expensive Apps per Category

Finally, we identify the most expensive paid app in each category.
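One way to pick the most expensive app in each category is to look up the row with the maximum price per group. A sketch with made-up apps:

```python
import pandas as pd

# Made-up paid apps for illustration only.
paid = pd.DataFrame({
    "App": ["Med Ref", "Med Lite", "Kids ABC", "Math Fun"],
    "Category": ["MEDICAL", "MEDICAL", "FAMILY", "FAMILY"],
    "Price": [79.99, 9.99, 2.99, 4.99],
})

# idxmax returns the row label of the highest price within each category.
priciest = paid.loc[paid.groupby("Category")["Price"].idxmax()]
priciest = priciest.sort_values("Price", ascending=False)
print(priciest)
```

If several apps tie for the highest price in a category, `idxmax` keeps only the first one it encounters.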

Table of the most expensive paid apps per category

Interpretation 6:

The presence of extremely high-priced apps highlights that not all categories target general consumers. Instead, these apps often serve professional, enterprise, or specialized markets. Categories such as Medical, Business, and Finance exhibit significantly higher price ceilings, reflecting B2B or expert-focused use cases rather than mass-market consumer applications.

Conclusion

The Family category demonstrates strong monetization potential, with paid apps capable of achieving significant installation volumes. Pricing strategies vary considerably across categories, reflecting differences in audience and use case. While the market is dominated by a few high-volume categories, niche segments present targeted opportunities.

Recommendation for My MobApp Studio: Focus on developing Family- or Education-oriented apps, using a paid or freemium model, and targeting specific genres that deliver clear value to users.

Next steps / future experiments:

- Perform NLP analysis on user reviews to identify unmet needs and sentiment trends.
- Conduct time-series analysis of installs to uncover adoption patterns over time.
- Analyze free-to-paid conversion rates to optimize monetization strategies.
