DEV Community: Mohamed ID-ABDELLAH

Data Science Meets Wine: Building a Model to Predict Quality Accurately

Mohamed ID-ABDELLAH — Mon, 08 Dec 2025 22:22:43 +0000

Introduction
Data Collecting & Cleaning
- Basic Structure
- The Target
- Relationship Checks
Data Exploration
- Correlations Heatmap
- Rating Distribution
- Does expensive wine mean better quality?
- Which countries dominate wine quality?
- Which brand consistently performs well?
- Does wine quality improve or decline over the years?
Building Machine Learning Model
- Preparing the data
- Our first prediction
Communication

Introduction

Expert evaluations and sommelier reviews have historically guided perceptions of wine quality, but large user-generated platforms like Vivino now offer extensive data that reflects actual consumer tastes at scale. The dataset used here, which includes about 13,800 wines, is well suited to examining trends in wine quality because it captures several characteristics, including rating, price, country, and winery. This study analyzes these variables and models wine quality using data-driven methods. It starts with exploratory analysis and visualizations to find patterns and connections in the dataset. Machine-learning models are then built to predict wine quality and to determine which characteristics most strongly influence higher ratings, combining predictive modeling with statistical understanding.

Data Collecting and Cleaning

The data used in this study comes from a GitHub repository. It contains wine data from the well-known Vivino website and covers more than 13,800 wines. Each record follows this pattern:
Winery, Region, Country, Rating, NumberOfRatings, Price, Year
As for cleaning, this stage took little time because the data was already clean and contained no missing values. However, in the sparkling wine category, most of the values in the Year column were “N.V.” At first, I thought it meant “no value” and treated it as missing. Later, I learned that it actually stands for Non-Vintage, meaning the wine is a blend of wines from different years rather than a wine from a single year.

Since Year is an important value in our study, I replaced each “N.V.” with the average year for that wine type and converted the column to a numeric data type.
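The replacement step could look roughly like the sketch below. The sample rows are made up, the column names Type and Year follow the column table in this article, and rounding the average to a whole year is an assumption, since the original script is not shown.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the article.
wines = pd.DataFrame({
    "Type": ["sparkling", "sparkling", "sparkling", "red", "red"],
    "Year": ["2010", "N.V.", "2014", "2016", "2018"],
})

# Convert to numbers; "N.V." (and any other non-numeric text) becomes NaN.
wines["Year"] = pd.to_numeric(wines["Year"], errors="coerce")

# Fill each gap with the mean year of the same wine type.
# Rounding to a whole year is an assumption made for this sketch.
type_mean = wines.groupby("Type")["Year"].transform("mean")
wines["Year"] = wines["Year"].fillna(type_mean).round().astype(int)

print(wines)
```

Using `transform("mean")` keeps the result aligned with the original rows, so the fill works per wine type without an explicit loop.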

Also, during the cleaning process, I removed the year from the wine names to maintain a consistent format.
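One way to strip the year is a small regular expression. This is only a sketch: it assumes the year appears as a trailing four-digit number at the end of the name, and the sample names and the helper `strip_year` are illustrative rather than taken from the original script.

```python
import re


def strip_year(name: str) -> str:
    """Drop a trailing four-digit vintage such as ' 2011' from a wine name."""
    return re.sub(r"\s+(?:19|20)\d{2}$", "", name)


# Hypothetical wine names used only to demonstrate the helper.
names = ["Pomerol 2011", "Lirac 2017", "Brut Réserve"]
cleaned = [strip_year(n) for n in names]
print(cleaned)
```

Names without a trailing year, or with a number in the middle of the name, are left unchanged.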

Data Exploration

Basic Structure

The following image shows 20 rows of the data to illustrate its structure and format. The dataset contains 13,834 rows and 9 columns. Just to note, this snapshot was taken before the cleaning process, but as mentioned earlier, there were no major changes.

The data types are clear from the image, and the explanation of each column is as follows:

Column	Explanation
Name	The name of the wine
Type	Type of wine (red/white/rose/sparkling)
Winery	The producer that made the wine
Region / Country / Rating / NbrOfRating / Price	Self-explanatory
Year	The wine's vintage (its birthday :)

The Target

The Rating value is the main value in our study and our objective. From this point on, everything we do will focus on identifying the relationships between the other features and this rating.

The highest rating in our dataset is 4.9, the lowest is 2.2, and the average is 3.87.

Next, we define the main quality categories based on the rating value:

Low wine quality: rating below 3.5
Medium wine quality: rating from 3.5 to 4.5
High wine quality: rating above 4.5

These categories are what we will train the model on, so that it can predict the quality of an unseen wine from its characteristics.
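As a rough illustration, the rating-to-category mapping can be written as a small function. It assumes the boundary values 3.5 and 4.5 both count as medium; the function name `classify_quality` is hypothetical.

```python
import numpy as np
import pandas as pd


def classify_quality(ratings: pd.Series) -> pd.Series:
    """Map numeric ratings to low / medium / high labels."""
    # Assumption: exactly 3.5 and exactly 4.5 are treated as medium.
    conditions = [ratings < 3.5, ratings > 4.5]
    labels = np.select(conditions, ["low", "high"], default="medium")
    return pd.Series(labels, index=ratings.index, name="Quality_Classification")


# The lowest, average, and highest ratings quoted above, plus both boundaries.
sample = pd.Series([2.2, 3.5, 3.87, 4.5, 4.9])
print(classify_quality(sample).tolist())
```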

Relationship Checks

The image below shows the relationships between all the numerical values. For example, there is a weak-to-moderate relationship between price and rating, and a weak negative relationship between year and rating. We will look at these relationships in more detail later on.
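A correlation table like this can be produced with pandas. The sketch below uses a few invented rows, so the numbers it prints are not the ones from the real dataset; only the column names come from the article.

```python
import pandas as pd

# Invented rows for illustration; real values come from the Vivino dataset.
wines = pd.DataFrame({
    "Rating": [3.6, 3.9, 4.2, 4.5, 3.8],
    "Price": [8.0, 15.0, 40.0, 120.0, 12.0],
    "Year": [2018, 2016, 2012, 2005, 2017],
})

# Pairwise Pearson correlation between every pair of numeric columns.
corr = wines.corr(method="pearson")
print(corr.round(2))
```

The resulting matrix is symmetric with ones on the diagonal, and it is the kind of table a heatmap function such as `seaborn.heatmap` takes as input.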

Data Visualization

Correlations Heatmap

This section is a visual illustration of the previous one, “relationship checks”, and it graphically shows the relationships between the numerical values in our dataset.

Rating Distribution

As we can see in the chart below, most of the ratings fall between 3.5 and 4, which suggests that most wines are of medium quality. Only a small portion of the wines are low-quality or high-quality.

Does expensive wine mean better quality?

A common assumption in most markets, for any type of product, is that a low price signals lower quality, while a high price signals higher quality.

So, does the same idea apply to the Vivino wine market? Does expensive wine mean better quality?

This is what the following chart will answer.

Here we see no clear linear relationship: many low- and medium-priced wines have high ratings. So the answer is no. Cheaper wines are often well rated, and a higher price does not necessarily mean a higher rating or better quality.

Which countries dominate wine quality?

In other product markets as well, certain countries stand out for producing high-quality goods. For example, everyone knows that German-made cars are renowned for their exceptional quality. The same idea applies to many other types of products.

In this section, we will identify the countries that lead in terms of wine quality.

Initially, the data we have includes 33 countries, which are as follows:

'France', 'Italy', 'Austria', 'New Zealand', 'Chile', 'Australia', 'South Africa', 'Spain', 'United States', 'Portugal', 'Hungary', 'Brazil', 'Argentina', 'Romania', 'Germany', 'Greece', 'Mexico', 'Moldova', 'Switzerland', 'Slovenia', 'Israel', 'Georgia', 'Lebanon', 'Uruguay', 'Turkey', 'Croatia', 'China', 'Slovakia', 'Bulgaria', 'Canada', 'Luxembourg', 'United Kingdom', 'Czech Republic'

This chart shows the average wine ratings for each country individually. As observed, the top ten countries in terms of quality, in order, are: Moldova, Lebanon, Croatia, Czech Republic, United Kingdom, Georgia, France, United States, Italy, and Germany.

This suggests that the country of origin plays an important role in the quality of the product.
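The per-country ranking can be computed with a groupby, as in the sketch below. The rows are invented, and reporting the number of wines next to each average is an addition made for this example, not something shown in the original chart.

```python
import pandas as pd

# Invented rows; the real dataset has 33 countries.
wines = pd.DataFrame({
    "Country": ["France", "France", "Italy", "Italy", "Moldova"],
    "Rating": [4.0, 3.8, 3.9, 3.7, 4.1],
})

by_country = (
    wines.groupby("Country")["Rating"]
    .agg(mean_rating="mean", n_wines="count")  # average plus sample size
    .sort_values("mean_rating", ascending=False)
)
top = by_country.head(10)
print(top)
```

Keeping the count alongside the mean makes it easy to see when a country's average rests on only a handful of wines.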

Which brand consistently performs well?

After realizing that quality varies from country to country, does the same apply to the producer itself? In other words, does the brand also play a role in determining wine quality?

To find out, we refer to the chart below, which includes 30 wineries out of a total of 3,000 producers.

As we can see, the 30 producers shown in the chart perform well, all having good average ratings, which indicates better product quality. However, as mentioned before, this chart includes only 30 out of 3,000 producers, and naturally, we cannot display them all. Behind the scenes, I noticed that the lowest-rated producer has an average rating of about 2.9.

This means that almost all producers produce wines of medium to high quality, with only a few producing lower-quality wines. We can also conclude that producers do play a role in determining the product’s quality—whether it is good or poor.

Does wine quality improve or decline over the years?

Now we will explore the relationship between the years and wine quality—does wine improve with age or decline?

The chart below illustrates this relationship.

As we can see, the chart shows fluctuations in the average rating over the years.

Between 1960 and 1990, there is a clear and steady improvement in wine quality. Between 1990 and 2000, quality fluctuates with an overall downward trend. After 2000, quality rises again, followed by a noticeable decline in the most recent years.

What can be concluded from this is that wine quality indeed changes over the years.

Building Machine Learning Model

In this section, we will train a Machine Learning model that will allow us to predict the quality of new wines that we haven’t seen in our dataset before.

We will use the Random Forest Classifier algorithm, which falls under the category of Supervised Learning.

Preparing the data

The first step in the preparation process was adding a new column called Quality_Classification, which is based on the Rating value, as mentioned earlier, with values classified as low, medium, or high.

Next, I replaced each Country and Winery value with the average rating for that value. Then I transformed the Year column into wine age. After that, I applied np.log1p to the Price and NbrOfRating columns because the gap between their minimum and maximum values is very large.

In the final step, I removed the unwanted columns.
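The preparation steps above can be sketched as follows. The sample rows, the output column names, and `REFERENCE_YEAR` are all assumptions made for illustration; the article does not say which year wine age is measured from.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2025  # assumption: the year that age is measured from

# Invented rows; column names follow the article's table.
wines = pd.DataFrame({
    "Country": ["France", "France", "Italy"],
    "Winery": ["A", "B", "B"],
    "Year": [2015, 2018, 2020],
    "Price": [10.0, 250.0, 35.0],
    "NbrOfRating": [100, 25000, 640],
    "Rating": [3.8, 4.6, 4.0],
})

features = pd.DataFrame({
    # Replace each category value with the mean rating of its group.
    "CountryMeanRating": wines.groupby("Country")["Rating"].transform("mean"),
    "WineryMeanRating": wines.groupby("Winery")["Rating"].transform("mean"),
    # Turn the vintage year into an age in years.
    "Age": REFERENCE_YEAR - wines["Year"],
    # log1p compresses columns whose values span several orders of magnitude.
    "LogPrice": np.log1p(wines["Price"]),
    "LogNbrOfRating": np.log1p(wines["NbrOfRating"]),
})
print(features.round(3))
```

One design point to keep in mind: averages computed over all rows include each wine's own rating, so a common precaution is to compute them on the training split only and then apply them to the validation rows.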

Our first prediction

As you can see in the image below, this is a snippet of the script performing our first prediction of the quality of a wine that was not present in our dataset.

What we did, in simple terms, was first separate the data into X and Y.

X represents the numerical values, while Y represents the wine classification that corresponds to each row of numerical features.

After that, we split X and Y into two parts:

The first part (X_train, Y_train) is used to train the model.
The second part (X_validation, Y_validation) is used to test how accurate the model is in its predictions.

You’ll notice that after we trained the model using X_train and Y_train (using the fit method), we then made predictions using the X_validation values. The prediction result is the first line below, which contains an array of classifications.

To evaluate the model’s accuracy, we used the accuracy_score function from the sklearn library, comparing the prediction result with Y_validation.

The accuracy score is close to 1, which suggests the model learned well from the data it was given and can produce realistic predictions when provided with new wine data.
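Since the original script appears only as an image, here is a minimal sketch of the same workflow on synthetic data. The feature names, the 80/20 split, the number of trees, and the random seeds are all assumptions; only the use of RandomForestClassifier, fit, predict, and accuracy_score comes from the article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic features standing in for the prepared wine data.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "WineryMeanRating": rng.uniform(3.0, 4.9, n),
    "LogPrice": rng.uniform(1.5, 6.0, n),
    "Age": rng.integers(1, 30, n),
})
# Synthetic labels derived from one feature so there is a signal to learn.
Y = pd.cut(
    X["WineryMeanRating"], bins=[0, 3.5, 4.5, 5], labels=["low", "medium", "high"]
).astype(str)

# Hold out 20% of the rows for validation (the split ratio is an assumption).
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, Y_train)

predictions = model.predict(X_validation)  # array of class labels
accuracy = accuracy_score(Y_validation, predictions)
print(predictions[:5], accuracy)
```

Because the labels here are built directly from one of the features, the score on this toy data says nothing about the real dataset; the sketch only shows the order of the steps.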

Communication

In this study, we analyzed a dataset of over 13,800 wines from Vivino to understand the factors that influence wine quality. Our main focus was the Rating value, which serves as the target variable for predicting wine quality. Based on the ratings, we classified wines into low, medium, and high quality, forming the foundation for our Machine Learning model.

Through exploratory analysis, we observed that:

Price does not have a clear linear relationship with quality; cheaper wines can often be highly rated, and expensive wines do not guarantee superior quality.
Country and producer (winery) play a significant role in wine quality. Some countries and top producers consistently produce higher-quality wines.
Wine age influences quality, but the trend is not strictly linear; quality fluctuates over decades, showing periods of improvement and decline.

We preprocessed the data carefully:

Added a Quality_Classification column based on Rating.
Converted Country and Winery to average ratings.
Transformed Year into wine age.
Scaled Price and NbrOfRating using np.log1p.
Removed irrelevant columns to focus on predictive features.

Finally, we trained a Random Forest Classifier, a supervised learning algorithm, to predict the quality of new wines based on their characteristics. This model allows us to anticipate wine quality even for wines not present in our dataset, providing actionable insights for producers, retailers, and wine enthusiasts.

The results demonstrate that while price alone is not a reliable indicator, a combination of country, winery, age, and other features can effectively predict wine quality, highlighting the value of data-driven decision-making in the wine industry.

Inside the App Store: Insights from 10,000 Google Play Applications

Mohamed ID-ABDELLAH — Thu, 13 Nov 2025 22:17:21 +0000

Introduction

In this project, completed as part of a training course on the Qwasar Silicon Valley platform, we analyze a dataset of more than ten thousand applications from the Google Play Store for the years 2017 and 2018. Our goal is to interpret this data and draw conclusions that give a deeper understanding of the app market, while answering practical questions such as: What is the size of the market, in terms of the number of downloads and the total revenue of paid applications? And how are the categories distributed, and what are their percentages?

We will follow a clear scientific methodology that includes data cleaning, visual and statistical exploratory analysis, and calculating indicators such as average prices, as well as ranking applications by popularity and price. In the end, we will draw conclusions and propose future steps to continue the research.

Technical note: the data is in CSV format, and it will be handled using the pandas library in Python.

Dataset Overview

The database we will be working with, as mentioned earlier, contains more than ten thousand applications from the Google Play Store, diverse in their categories and types.

Its exact shape is (10840, 13), meaning it includes 10,840 data rows with 13 columns.

The columns and their descriptions are as follows:

Column Name	The Data It Contains
App	Application Name
Category	The main category under which the application is classified.
Rating	Application rating in the Google Play Store
Reviews	Number of reviews the app has
Size	App size
Installs	Number of downloads
Type	App type — whether it is paid or free
Price	Zero if it is free; otherwise, the price
Content Rating	The age group or audience targeted by the application
Genres	The sub-categories under which the application falls
Last Updated	The last date on which the application was updated
Current Version	The current version of the application
Android Version	The minimum Android OS version required for the application to run

And to give you a clearer picture, here is an image of the first twenty rows of data from the database.

Like most real-world datasets, this one contains missing values, which we will address in the following section.

Data Cleaning

The first thing I noticed after opening the CSV file was a single row that contained only 12 values, while the dataset has 13 columns. When I inspected it, I found that the missing value in that row was Category. I deleted the row immediately because an application has little analytical value if its category is unknown.

Dropping the row is reasonable because, when analyzing app data, some columns are essential, such as Category, Rating, and Installs, while others are less important, such as Current Version and Last Updated.
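A row with too few fields can be located before loading the file into pandas. The sketch below uses Python's csv module on a tiny three-column stand-in for the real 13-column file; the app names and category values are made up.

```python
import csv
import io

# Hypothetical three-column sample standing in for the real 13-column file.
raw = io.StringIO(
    "App,Category,Rating\n"
    "Photo Editor,ART_AND_DESIGN,4.1\n"
    "Broken App,1.9\n"
    "Coloring Book,FAMILY,3.9\n"
)

reader = csv.reader(raw)
header = next(reader)

good_rows, bad_rows = [], []
# Line numbers start at 2 because line 1 is the header.
for line_number, row in enumerate(reader, start=2):
    target = good_rows if len(row) == len(header) else bad_rows
    target.append((line_number, row))

print(bad_rows)
```

Here the second data row is one field short, so it lands in `bad_rows` together with its line number in the file.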

I also standardized the data: for example, I converted numerical columns to a numeric data type and made some further adjustments to improve formatting and readability.

Some values in other columns were also missing and had to be filled in. For essential analytical columns such as Rating, Reviews, Size, and Installs, I replaced each missing value with the mean of that column within the app's category.
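Both steps can be sketched in pandas as shown below. The sample rows are invented, and the raw Installs format with commas and a trailing plus sign is an assumption about how the CSV stores that column.

```python
import pandas as pd

# Invented rows; the "1,000+" style for Installs is an assumed raw format.
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Installs": ["1,000+", "5,000+", "10,000+", "100+", "500+"],
    "Rating": [4.0, None, 4.4, 3.5, None],
})

# Strip the decoration and convert the text to numbers.
apps["Installs"] = pd.to_numeric(
    apps["Installs"].str.replace(r"[,+]", "", regex=True), errors="coerce"
)

# Fill each missing rating with the mean rating of the app's own category.
category_mean = apps.groupby("Category")["Rating"].transform("mean")
apps["Rating"] = apps["Rating"].fillna(category_mean)

print(apps)
```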

Finally, here is an image of what the data looks like after cleaning and refinement.

Exploratory Data Analysis

Visualizing Distributions

From the histograms above, we observe that most applications have a rating between 4 and 5, and only a few have low ratings. We also notice that most apps are small in size and have fewer than 250 million downloads.

We further observe that the prices of paid applications range between one dollar and twenty dollars.

Correlations

To understand the relationships between the variables, the table below uses the Pearson correlation coefficient, which measures the linear relationship between two variables. The closer the result is to 1, the stronger the positive relationship, meaning one variable increases as the other increases. A result below zero indicates a negative relationship, which grows stronger as it approaches -1. A result of zero means there is no linear relationship between the two variables.
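To make the coefficient concrete, here is a small from-scratch version of the calculation. The helper name `pearson` and the sample lists are illustrative; in practice pandas computes the same quantity for every pair of columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))        # rises together
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))        # one rises, one falls
print(pearson([1, 2, 3, 4, 5], [2, 1, 3, 1, 2]))  # no linear trend
```

The three calls illustrate the three cases described above: a perfectly positive relationship, a perfectly negative one, and no linear relationship.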

Interpretations & Insights

Categories Distribution

The application database we have is divided into several categories, with a total of 33 unique categories.

To understand the distribution of these categories, we will use the pie chart below.

As we can see, the categories dominating the dataset are Family, Game, and Tools, with the Family category taking the lead at 18%.

It is also worth noting that the categories are sorted in descending order, from the most frequent to the least. To ensure clear visualization and avoid clutter in the chart, I grouped the smaller, less frequent categories into a single group labeled Others.
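Grouping the small slices can be done before plotting, as in this sketch. The category counts are invented and the 10% cut-off (`THRESHOLD`) is a hypothetical value, since the article does not state which threshold was used.

```python
import pandas as pd

# Invented category column; the real dataset has 33 categories.
categories = pd.Series(
    ["FAMILY"] * 5 + ["GAME"] * 3 + ["TOOLS"] * 2 + ["WEATHER", "COMICS"]
)

# Share of apps per category, already sorted from most to least frequent.
shares = categories.value_counts(normalize=True)

THRESHOLD = 0.10  # hypothetical cut-off for a slice to be shown on its own
major = shares[shares >= THRESHOLD]
minor_total = shares[shares < THRESHOLD].sum()

# Append one combined slice for everything below the cut-off.
pie_data = pd.concat([major, pd.Series({"Others": minor_total})])
print(pie_data.round(3))
```

The resulting series still sums to 1, so it can be passed directly to a pie chart.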

Downloads percentages per category

As the chart above shows, the categories with the highest number of downloads are Game, Communication, and Productivity, with the Game category taking the lead at 18%, followed by Communication at 10% of the total downloads.

It can also be observed that although the Game category is only second in number of apps, as mentioned in the previous section, it still ranks first in total downloads.
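The comparison between app counts and download share can be reproduced with two short aggregations. The rows below are invented, so the percentages differ from the chart; only the column names Category and Installs come from the dataset description.

```python
import pandas as pd

# Invented rows: few GAME apps with many installs, many FAMILY apps with few.
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "COMMUNICATION", "FAMILY", "FAMILY", "FAMILY"],
    "Installs": [500_000, 300_000, 150_000, 20_000, 20_000, 10_000],
})

# Total installs per category, expressed as a percentage of all installs.
installs_by_category = apps.groupby("Category")["Installs"].sum()
download_share = (
    installs_by_category / installs_by_category.sum() * 100
).sort_values(ascending=False)

# Number of apps per category, for comparison.
app_counts = apps["Category"].value_counts()

print(download_share)
print(app_counts)
```

In this toy data, FAMILY has the most apps while GAME has the largest share of downloads, which mirrors the pattern described above.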

Mean rating per category

What can be inferred from this chart is that all categories, without exception, have a high average rating.

Mean price per category

What can be concluded from this chart is that the two categories with the highest average prices are Finance and Lifestyle, with the Finance category leading at an average of $8 per application.

Most popular paid apps in family category

Returning to the Family category, which leads in number of applications, the chart above shows its most downloaded paid apps.

Conclusion

If there’s one thing to remember, it’s that the real story is often hidden in the data. The world of data analytics is vast, and the amount of insight that can be extracted from just ten thousand applications is already enormous, let alone from a truly massive dataset.
