Table of contents
- Introduction
- Data Collecting and Cleaning
- Data Exploration
- Building Machine Learning Model
- Communication
Introduction
Expert evaluations and sommelier reviews have historically guided perceptions of wine quality, but large user-generated platforms like Vivino now offer extensive data that reflects actual consumer tastes at scale. The dataset used here, which includes about 13,800 wines, is well suited to examining trends in wine quality because it captures several characteristics, including ratings, price, country, and winery. This study models wine quality and analyzes these variables using data-driven methods. It starts with exploratory analysis and visualizations to find patterns and connections in the dataset. Machine-learning models are then built to predict wine quality and to determine which characteristics most strongly influence higher ratings, combining predictive modeling with statistical understanding.
Data Collecting and Cleaning
The data used in this study was found in a GitHub repository. It comes from the well-known Vivino website and covers more than 13,800 wines. Each record follows this pattern:
Winery, Region, Country, Rating, NumberOfRatings, Price, Year
Cleaning took little time because the data was already clean and contained no missing values. However, in the Sparkling wine category, most of the values in the Year column were “N.V.” At first, I read this as "not available" and treated it as a missing value. Later, I discovered that it actually means Non-Vintage: the wine is a blend of wines from different years rather than a wine from a single year.
Since the year is an important value in our study, I replaced each N.V. entry with the average year for that wine type and converted the column to a numeric data type.
Also, during the cleaning process, I removed the year from the wine names to maintain a consistent format.
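The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the original script: the sample data, the rounding of the filled-in year, and the regular expression used to strip the year from the name are all assumptions.

```python
import pandas as pd

# Tiny made-up sample; column names follow the article, the rows are illustrative.
wines = pd.DataFrame({
    "Name": ["Chateau Example 2015", "Brut Reserve N.V.", "Cuvee Demo 2017", "Rosso Sample 2019"],
    "Type": ["red", "sparkling", "sparkling", "red"],
    "Year": ["2015", "N.V.", "2017", "2019"],
})

# Convert to numbers; "N.V." (and anything else non-numeric) becomes NaN.
wines["Year"] = pd.to_numeric(wines["Year"], errors="coerce")

# Fill each missing year with the mean year of wines of the same type.
type_mean_year = wines.groupby("Type")["Year"].transform("mean")
# Assumption: the filled-in value is rounded to a whole year.
wines["Year"] = wines["Year"].fillna(type_mean_year).round().astype(int)

# Assumption: the year (or "N.V.") appears at the end of the wine name.
wines["Name"] = wines["Name"].str.replace(r"\s+(?:\d{4}|N\.V\.)$", "", regex=True)

print(wines)
```

Using `errors="coerce"` handles the N.V. entries and the type conversion in a single step, so the fill only has to deal with ordinary missing values.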
Data Exploration
Basic Structure
The following image shows 20 rows of the data to illustrate its structure and format. The dataset contains 13,834 rows and 9 columns. Just to note, this snapshot was taken before the cleaning process, but as mentioned earlier, there were no major changes.
The data types are clear from the image, and the explanation of each column is as follows:
| Column | Explanation |
|---|---|
| Name | The name of the wine |
| Type | Type of wine (red/white/rose/sparkling) |
| Winery | The producer that made the wine |
| Region / Country / Rating / NbrOfRating / Price | Self-explanatory |
| Year | The wine's vintage year (its birthday :) ) |
The Target
The Rating value is the target of our study. From this point on, everything we do will focus on identifying the relationships between the other features and this rating.
The highest rating in our dataset is 4.9, the lowest is 2.2, and the average is 3.87.
Now we will define the main quality categories of the wine based on its rating value. The classifications are as follows:
- Low wine quality: rating less than 3.5
- Medium wine quality: rating between 3.6 and 4.5
- High wine quality: rating higher than 4.5
These categories are what we will train the AI model on so that it can predict the quality of a wine we haven’t seen before based on its characteristics.
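As a rough sketch, the three categories could be expressed as a small function. The function name `classify_quality` is hypothetical. Note that a rating of exactly 3.5 falls into none of the bands listed above; the sketch assumes it counts as low, which may differ from the original script.

```python
def classify_quality(rating: float) -> str:
    """Map a Vivino rating to a quality class using the article's thresholds."""
    if rating > 4.5:
        return "high"
    if rating >= 3.6:
        return "medium"
    # Assumption: 3.5 is not covered by any listed band, so it is treated as low.
    return "low"


for r in (2.2, 3.5, 3.87, 4.5, 4.9):
    print(r, classify_quality(r))
```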
Relationship Checks
The image below shows the relationships between all the numerical values. For example, we notice a weak-to-moderate relationship between price and rating, as well as a weak negative relationship between year and rating. We will go into more detail about these relationships later on.
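A relationship check of this kind is typically a pairwise correlation matrix. The sketch below assumes pandas and uses made-up rows with the article's column names, so the printed numbers are not the study's results.

```python
import pandas as pd

# Illustrative rows only; the real dataset has about 13,800 wines.
wines = pd.DataFrame({
    "Rating": [3.4, 3.8, 3.9, 4.2, 4.6],
    "NbrOfRating": [120, 950, 300, 2100, 80],
    "Price": [6.5, 12.0, 18.9, 45.0, 210.0],
    "Year": [2019, 2018, 2017, 2015, 2005],
})

# Pearson correlation between every pair of numerical columns.
corr = wines[["Rating", "NbrOfRating", "Price", "Year"]].corr()
print(corr.round(2))
```

Each cell ranges from -1 to 1; values near 0 indicate a weak linear relationship, and the sign gives its direction.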
Data Visualization
Correlations Heatmap
This section is a visual illustration of the previous one, “relationship checks”, and it graphically shows the relationships between the numerical values in our dataset.
Rating Distribution
As we can see in the chart below, most of the ratings fall between 3.5 and 4, which suggests that most wines are of medium quality. Only a small portion of the wines are low-quality or high-quality.
Does expensive wine mean better quality?
It is commonly assumed, for almost any type of product, that a low price indicates lower quality, while a high price signals higher quality.
So, does the same idea apply to the Vivino wine market? Does expensive wine mean better quality?
This is what the following chart will answer.
Here we see no clear linear relationship: many low- and medium-priced wines have high ratings. So the answer to our question is no. Cheaper wines are often well rated, and a higher price does not necessarily indicate a higher rating or better quality.
Which country dominates wine quality?
In other product markets, certain countries stand out for producing high-quality goods. German-made cars, for example, are widely regarded as exceptionally well built. The same idea applies to many other types of products.
In this section, we will identify the countries that lead in terms of wine quality.
To begin with, the data includes 33 countries, which are as follows:
'France', 'Italy', 'Austria', 'New Zealand', 'Chile', 'Australia', 'South Africa', 'Spain', 'United States', 'Portugal', 'Hungary', 'Brazil', 'Argentina', 'Romania', 'Germany', 'Greece', 'Mexico', 'Moldova', 'Switzerland', 'Slovenia', 'Israel', 'Georgia', 'Lebanon', 'Uruguay', 'Turkey', 'Croatia', 'China', 'Slovakia', 'Bulgaria', 'Canada', 'Luxembourg', 'United Kingdom', 'Czech Republic'
This chart shows the average wine rating for each country. The top ten countries by average rating, in order, are: Moldova, Lebanon, Croatia, Czech Republic, United Kingdom, Georgia, France, United States, Italy, and Germany.
This suggests that the country of origin plays a role in the quality of the wine.
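A ranking like this one can be produced with a group-by on the country column. The following is a minimal sketch on made-up rows, assuming pandas; the `n_wines` column is an addition for illustration and is not mentioned in the article.

```python
import pandas as pd

# Illustrative rows only; ratings here are invented.
wines = pd.DataFrame({
    "Country": ["France", "France", "Italy", "Italy", "Moldova", "Spain"],
    "Rating": [4.0, 4.2, 3.9, 3.7, 4.3, 3.6],
})

# Average rating per country, plus how many wines each average is based on.
by_country = (
    wines.groupby("Country")["Rating"]
    .agg(mean_rating="mean", n_wines="count")
    .sort_values("mean_rating", ascending=False)
)

print(by_country.head(10))
```

Keeping the wine count next to the average makes it easy to see when a country's position rests on only a handful of wines.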
Which brand consistently performs well?
After realizing that quality varies from country to country, does the same apply to the producer itself? In other words, does the brand also play a role in determining wine quality?
To find out, we refer to the chart below, which includes 30 wineries out of a total of 3,000 producers.
As we can see, the 30 producers shown in the chart perform well, all having good average ratings, which indicates better product quality. However, as mentioned before, this chart includes only 30 out of 3,000 producers, and naturally, we cannot display them all. Behind the scenes, I noticed that the lowest-rated producer has an average rating of about 2.9.
This means that almost all producers produce wines of medium to high quality, with only a few producing lower-quality wines. We can also conclude that producers do play a role in determining the product’s quality—whether it is good or poor.
Does wine quality improve or decline over the years?
Now we will explore the relationship between the years and wine quality—does wine improve with age or decline?
The chart below illustrates this relationship.
As we can see, the chart shows fluctuations in the average rating over the years.
Between 1960 and 1990, there is a clear and steady improvement in wine quality. Between 1990 and 2000, quality fluctuates, with rises and falls and an overall downward drift. After 2000, quality rises again before declining noticeably in the most recent years.
What can be concluded from this is that wine quality indeed changes over the years.
Building Machine Learning Model
In this section, we will train a Machine Learning model that will allow us to predict the quality of new wines that we haven’t seen in our dataset before.
We will use the Random Forest Classifier algorithm, which falls under the category of Supervised Learning.
Preparing the data
The first step in the preparation process was adding a new column called Quality_Classification, which is based on the Rating value, as mentioned earlier, with values classified as low, medium, or high.
Next, I replaced each value in the Country and Winery columns with the average rating of the wines that share that value. Then I transformed the Year column into wine age. After that, I log-transformed the Price and NbrOfRating columns using np.log1p because the difference between their minimum and maximum values is very large.
In the final step, I removed the unwanted columns.
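The preparation steps could look roughly like the sketch below. It is an illustration under stated assumptions, not the original script: the rows are made up, `REFERENCE_YEAR` is a hypothetical choice for computing age, and the set of dropped columns is a guess. The Quality_Classification column is left out here to keep the example focused on the feature transformations.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; column names follow the article.
wines = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Country": ["France", "France", "Italy", "Italy"],
    "Winery": ["W1", "W2", "W3", "W3"],
    "Rating": [4.0, 4.2, 3.4, 3.8],
    "NbrOfRating": [100, 2500, 40, 900],
    "Price": [15.0, 120.0, 7.5, 30.0],
    "Year": [2016, 2010, 2019, 2018],
})

REFERENCE_YEAR = 2020  # hypothetical; the article does not say which year was used

# Replace each category with the mean rating of the wines in that category.
for col in ("Country", "Winery"):
    wines[col] = wines.groupby(col)["Rating"].transform("mean")

# Turn the vintage year into an age in years.
wines["Age"] = REFERENCE_YEAR - wines["Year"]

# log1p compresses the long right tail of prices and rating counts.
for col in ("Price", "NbrOfRating"):
    wines[col] = np.log1p(wines[col])

# Assumption: these are the columns that are no longer needed as features.
features = wines.drop(columns=["Name", "Rating", "Year"])
print(features)
```

In a stricter setup, the per-country and per-winery averages would be computed from the training rows only, since they are derived from the same Rating that defines the target classes.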
Our first prediction
The image below shows the part of the script that performs our first prediction of the quality of a wine that was not present in our dataset.
What we did, in simple terms, was first separate the data into X and Y.
X holds the numerical features, while Y holds the quality classification that corresponds to each row of features.
After that, we split X and Y into two parts:
- The first part (X_train, Y_train) is used to train the model.
- The second part (X_validation, Y_validation) is used to test how accurate the model is in its predictions.
You’ll notice that after we trained the model using X_train and Y_train (using the fit method), we then made predictions using the X_validation values. The prediction result is the first line below, which contains an array of classifications.
To evaluate the model’s accuracy, we used the accuracy_score function from sklearn.metrics, comparing the prediction result with Y_validation.
The accuracy score comes out close to 1, meaning nearly all validation wines were classified correctly. This suggests the model learned well from the data it was given and can produce realistic predictions when provided with new wine data.
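The split, fit, predict, and score sequence described above can be sketched as follows. The features and labels here are synthetic stand-ins generated for the demo, and the 80/20 split, `n_estimators=100`, and `random_state=0` are assumptions; the article does not state the settings that were actually used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features (not the real Vivino data).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Country": rng.uniform(3.4, 4.3, n),
    "Winery": rng.uniform(2.9, 4.8, n),
    "NbrOfRating": rng.uniform(3.0, 9.0, n),
    "Price": rng.uniform(1.5, 6.0, n),
    "Age": rng.integers(0, 30, n),
})
# Synthetic labels derived from one feature, only so the demo has something to learn.
Y = pd.cut(X["Winery"], bins=[0, 3.55, 4.5, 5], labels=["low", "medium", "high"]).astype(str)

# Hold out 20% of the rows for validation (assumed split size).
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, Y_train)

# Predict the class of wines the model has not seen during training.
predictions = model.predict(X_validation)
accuracy = accuracy_score(Y_validation, predictions)

print(predictions[:10])
print(f"accuracy: {accuracy:.3f}")
```

Fixing `random_state` in both the split and the model makes the run reproducible, which helps when comparing accuracy across different preprocessing choices.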
Communication
In this study, we analyzed a dataset of over 13,800 wines from Vivino to understand the factors that influence wine quality. Our main focus was the Rating value, which serves as the target variable for predicting wine quality. Based on the ratings, we classified wines into low, medium, and high quality, forming the foundation for our Machine Learning model.
Through exploratory analysis, we observed that:
- Price does not have a clear linear relationship with quality; cheaper wines can often be highly rated, and expensive wines do not guarantee superior quality.
- Country and producer (winery) play a significant role in wine quality. Some countries and top producers consistently produce higher-quality wines.
- Wine age influences quality, but the trend is not strictly linear; quality fluctuates over decades, showing periods of improvement and decline.
We preprocessed the data carefully:
- Added a Quality_Classification column based on Rating.
- Converted Country and Winery to average ratings.
- Transformed Year into wine age.
- Scaled Price and NbrOfRating using np.log1p.
- Removed irrelevant columns to focus on predictive features.
Finally, we trained a Random Forest Classifier, a supervised learning algorithm, to predict the quality of new wines based on their characteristics. This model allows us to anticipate wine quality even for wines not present in our dataset, providing actionable insights for producers, retailers, and wine enthusiasts.
The results demonstrate that while price alone is not a reliable indicator, a combination of country, winery, age, and other features can effectively predict wine quality, highlighting the value of data-driven decision-making in the wine industry.










