<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohamed ID-ABDELLAH</title>
    <description>The latest articles on DEV Community by Mohamed ID-ABDELLAH (@idabdellah).</description>
    <link>https://dev.to/idabdellah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1157904%2F32040bfd-e903-4c92-a466-f77a64763078.jpeg</url>
      <title>DEV Community: Mohamed ID-ABDELLAH</title>
      <link>https://dev.to/idabdellah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/idabdellah"/>
    <language>en</language>
    <item>
      <title>Data Science Meets Wine: Building a Model to Predict Quality Accurately</title>
      <dc:creator>Mohamed ID-ABDELLAH</dc:creator>
      <pubDate>Mon, 08 Dec 2025 22:22:43 +0000</pubDate>
      <link>https://dev.to/idabdellah/data-science-meets-wine-building-a-model-to-predict-quality-accurately-13oc</link>
      <guid>https://dev.to/idabdellah/data-science-meets-wine-building-a-model-to-predict-quality-accurately-13oc</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Data Collecting and Cleaning&lt;/li&gt;

&lt;li&gt;

Data Exploration

&lt;ul&gt;
&lt;li&gt;Basic Structure&lt;/li&gt;
&lt;li&gt;The Target&lt;/li&gt;
&lt;li&gt;Relationship Checks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Data Visualization

&lt;ul&gt;
&lt;li&gt;Correlations Heatmap&lt;/li&gt;
&lt;li&gt;Rating Distribution&lt;/li&gt;
&lt;li&gt;Do expensive wines mean better quality&lt;/li&gt;
&lt;li&gt;Which countries dominate wine quality&lt;/li&gt;
&lt;li&gt;Which brand consistently performs well&lt;/li&gt;
&lt;li&gt;Does wine quality improve or decline over the years&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Building a Machine Learning Model

&lt;ul&gt;
&lt;li&gt;Preparing the data&lt;/li&gt;
&lt;li&gt;Our first prediction&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Communication&lt;/li&gt;

&lt;/ul&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Expert evaluations and sommelier reviews have historically guided perceptions of wine quality, but large user-generated platforms like Vivino now offer extensive data that reflects actual consumer tastes at scale. The dataset used here, which includes more than 13,800 wines, is well suited to examining trends in wine quality because it captures several characteristics, including ratings, price, country, and winery. This study analyzes these variables and models wine quality using data-driven methods. It starts with exploratory analysis and visualizations to find patterns and connections in the dataset. Machine-learning models are then built to predict wine quality and to determine which characteristics most strongly influence higher ratings, combining predictive modeling with statistical understanding.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Collecting and Cleaning
&lt;/h1&gt;

&lt;p&gt;The dataset used in this study comes from a GitHub repository. It contains wine data from the well-known Vivino website and covers more than 13,800 wines. Each record follows this pattern:&lt;br&gt;
&lt;code&gt;Winery, Region, Country, Rating, NumberOfRatings, Price, Year&lt;/code&gt;&lt;br&gt;
As for cleaning, this stage took little time because the data was already clean and contained no missing values. However, in the Sparkling wine category, most of the values in the year column were “N.V.” At first, I assumed it meant “not available” and treated it as a missing value. Later, I discovered that it stands for Non-Vintage, which means the wine is a blend of wines from different years rather than a wine from a single year.&lt;/p&gt;

&lt;p&gt;Since the year is an important feature in our study, I replaced each N.V. entry with the average year for that specific wine type and converted the column to a numeric data type.&lt;/p&gt;
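&lt;p&gt;The replacement step described above can be sketched in pandas as follows. This is a minimal sketch rather than the original script: the sample rows are made up, the &lt;code&gt;Type&lt;/code&gt; and &lt;code&gt;Year&lt;/code&gt; column names follow the column table later in the article, and rounding the mean to a whole year is an assumption.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has about 13,800 wines.
wines = pd.DataFrame({
    "Type": ["sparkling", "sparkling", "sparkling", "red", "red"],
    "Year": ["2010", "N.V.", "2014", "2016", "2018"],
})

# "N.V." (Non-Vintage) is not a number, so coerce it to NaN first.
wines["Year"] = pd.to_numeric(wines["Year"], errors="coerce")

# Fill each missing year with the mean year of the same wine type.
type_mean = wines.groupby("Type")["Year"].transform("mean")
# Rounding to a whole year is an assumption of this sketch.
wines["Year"] = wines["Year"].fillna(type_mean).round().astype(int)

print(wines)
```

&lt;p&gt;Using &lt;code&gt;transform&lt;/code&gt; keeps the group means aligned with the original rows, so the fill can be done in a single step without a loop.&lt;/p&gt;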

&lt;p&gt;Also, during the cleaning process, I removed the year from the wine names to maintain a consistent format.&lt;/p&gt;
&lt;h1&gt;
  
  
  Data Exploration
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Basic Structure
&lt;/h2&gt;

&lt;p&gt;The following image shows 20 rows of the data to illustrate its structure and format. The dataset contains 13,834 rows and 9 columns. Just to note, this snapshot was taken before the cleaning process, but as mentioned earlier, there were no major changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jumxfhd7mkcb99hqsxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jumxfhd7mkcb99hqsxu.png" alt=" " width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data types are clear from the image, and the explanation of each column is as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Explanation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Name&lt;/td&gt;
&lt;td&gt;The name of the wine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type&lt;/td&gt;
&lt;td&gt;Type of wine (red/white/rose/sparkling)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Winery&lt;/td&gt;
&lt;td&gt;The producer that made the wine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Region / Country / Rating / NbrOfRating / Price&lt;/td&gt;
&lt;td&gt;Self-explanatory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Year&lt;/td&gt;
&lt;td&gt;Vintage year, the wine’s birthday :)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  The Target
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Rating&lt;/strong&gt; column is the target of our study. From this point on, everything we do focuses on identifying the relationships between the other features and this rating.&lt;br&gt;&lt;br&gt;
The highest rating in our dataset is &lt;strong&gt;4.9&lt;/strong&gt;, the lowest is &lt;strong&gt;2.2&lt;/strong&gt;, and the average is &lt;strong&gt;3.87&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now we will define the main quality categories of the wine based on its rating value. The classifications are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low wine quality&lt;/strong&gt;: rating less than &lt;strong&gt;3.5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium wine quality&lt;/strong&gt;: rating between &lt;strong&gt;3.6&lt;/strong&gt; and &lt;strong&gt;4.5&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High wine quality&lt;/strong&gt;: rating higher than &lt;strong&gt;4.5&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These categories are what we will train the AI model on so that it can predict the quality of a wine we haven’t seen before based on its characteristics.&lt;/p&gt;
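&lt;p&gt;One way to derive these categories is &lt;code&gt;pandas.cut&lt;/code&gt;. The sketch below is illustrative only. It assumes ratings are recorded with one decimal place, and because the list above does not say where a rating of exactly 3.5 belongs, it assumes 3.5 counts as medium. The sample ratings are made up.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical ratings covering every class and both boundaries.
df = pd.DataFrame({"Rating": [2.2, 3.4, 3.5, 3.9, 4.5, 4.6, 4.9]})

# With one-decimal ratings, edges halfway between two ratings avoid
# ambiguity: 3.45 separates 3.4 from 3.5, and 4.55 separates 4.5 from 4.6.
# Treating exactly 3.5 as "medium" is an assumption of this sketch.
bins = [float("-inf"), 3.45, 4.55, float("inf")]
labels = ["low", "medium", "high"]

df["Quality_Classification"] = pd.cut(df["Rating"], bins=bins, labels=labels)
print(df)
```

&lt;p&gt;The result is a categorical column, which can be used directly as the target when training a classifier.&lt;/p&gt;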
&lt;h2&gt;
  
  
  Relationship Checks
&lt;/h2&gt;

&lt;p&gt;The image below shows the correlations between all the numerical columns. For example, we notice a weak-to-moderate relationship between price and rating, as well as a weak negative relationship between year and rating. We will go into more detail about these relationships later on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F579dr660gcevthha5sz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F579dr660gcevthha5sz1.png" alt=" " width="515" height="105"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Data Visualization
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Correlations Heatmap
&lt;/h2&gt;

&lt;p&gt;This section visualizes the results of the previous one, &lt;em&gt;“Relationship Checks”&lt;/em&gt;, as a heatmap of the relationships between the numerical columns in our dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3tp94b0rezoyb9hdnhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3tp94b0rezoyb9hdnhs.png" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Rating Distribution
&lt;/h2&gt;

&lt;p&gt;As we can see in the chart below, most of the ratings fall between &lt;strong&gt;3.5 and 4&lt;/strong&gt;, which suggests that most wines are of medium quality. Only a small portion of the wines are low-quality or high-quality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zg1v1y28lmkmlxfls2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zg1v1y28lmkmlxfls2u.png" alt=" " width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Do expensive wines mean better quality
&lt;/h2&gt;

&lt;p&gt;In most markets, and for almost any type of product, the general assumption is that a low price indicates lower quality, while a high price is taken as a sign of higher quality.&lt;/p&gt;

&lt;p&gt;So, does the same idea apply to the Vivino wine market? Does expensive wine mean better quality?&lt;/p&gt;

&lt;p&gt;This is what the following chart will answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuuqx9z5bcvvid55sir25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuuqx9z5bcvvid55sir25.png" alt=" " width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we notice that there is no clear linear relationship, as most of the low- or medium-priced wines have high ratings. This answers our question with a &lt;em&gt;no&lt;/em&gt;, meaning that cheaper products are often well-rated, and a higher price does not necessarily indicate a higher rating or better quality.&lt;/p&gt;
&lt;h2&gt;
  
  
  Which countries dominate wine quality
&lt;/h2&gt;

&lt;p&gt;In many product markets, certain countries stand out for producing high-quality goods. German-made cars, for example, are widely regarded as being of exceptional quality. The same idea applies to many other types of products.&lt;/p&gt;

&lt;p&gt;In this section, we will identify the countries that lead in terms of wine quality.&lt;/p&gt;

&lt;p&gt;The dataset includes &lt;strong&gt;33 countries&lt;/strong&gt;, which are as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'France', 'Italy', 'Austria', 'New Zealand', 'Chile', 'Australia', 'South Africa', 'Spain', 'United States', 'Portugal', 'Hungary', 'Brazil', 'Argentina', 'Romania', 'Germany', 'Greece', 'Mexico', 'Moldova', 'Switzerland', 'Slovenia', 'Israel', 'Georgia', 'Lebanon', 'Uruguay', 'Turkey', 'Croatia', 'China', 'Slovakia', 'Bulgaria', 'Canada', 'Luxembourg', 'United Kingdom', 'Czech Republic'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxxl0418pdyrhmvsqdvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxxl0418pdyrhmvsqdvg.png" alt=" " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This chart shows the average wine ratings for each country individually. As observed, the top ten countries in terms of quality, in order, are: &lt;strong&gt;Moldova, Lebanon, Croatia, Czech Republic, United Kingdom, Georgia, France, United States, Italy, and Germany&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8ly1h3frjwfrtdgz9e2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8ly1h3frjwfrtdgz9e2.png" alt=" " width="250" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This suggests that the country of origin plays a role in the quality of the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which brand consistently performs well
&lt;/h2&gt;

&lt;p&gt;After realizing that quality varies from country to country, does the same apply to the producer itself? In other words, does the brand also play a role in determining wine quality?&lt;/p&gt;

&lt;p&gt;To find out, we refer to the chart below, which includes &lt;strong&gt;30 wineries&lt;/strong&gt; out of a total of 3,000 producers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8gq38cyuoh16pzcb79r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8gq38cyuoh16pzcb79r.png" alt=" " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the 30 producers shown in the chart perform well, all having good average ratings, which indicates better product quality. However, as mentioned before, this chart includes only 30 out of 3,000 producers, and naturally, we cannot display them all. Behind the scenes, I noticed that the lowest-rated producer has an average rating of about &lt;strong&gt;2.9&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means that almost all producers produce wines of medium to high quality, with only a few producing lower-quality wines. We can also conclude that producers do play a role in determining the product’s quality—whether it is good or poor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does wine quality improve or decline over the years
&lt;/h2&gt;

&lt;p&gt;Now we will explore the relationship between vintage year and wine quality: does wine improve with age or decline?&lt;/p&gt;

&lt;p&gt;The chart below illustrates this relationship.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuxi1oz1a1s843ilwp27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuxi1oz1a1s843ilwp27.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the chart shows fluctuations in the average rating over the years.&lt;/p&gt;

&lt;p&gt;Between &lt;strong&gt;1960 and 1990&lt;/strong&gt;, there is a clear and steady improvement in wine quality. Between &lt;strong&gt;1990 and 2000&lt;/strong&gt;, quality trends downward, with rises and falls along the way. From &lt;strong&gt;2000 to 2020&lt;/strong&gt;, quality rises again and then declines noticeably toward the most recent years.&lt;/p&gt;

&lt;p&gt;What we can conclude is that average wine quality does vary from one vintage year to another.&lt;/p&gt;

&lt;h1&gt;
  
  
  Building a Machine Learning Model
&lt;/h1&gt;

&lt;p&gt;In this section, we will train a &lt;strong&gt;Machine Learning model&lt;/strong&gt; that will allow us to predict the quality of new wines that we haven’t seen in our dataset before.&lt;/p&gt;

&lt;p&gt;We will use the &lt;strong&gt;Random Forest Classifier&lt;/strong&gt; algorithm, which falls under the category of &lt;strong&gt;Supervised Learning&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing the data
&lt;/h2&gt;

&lt;p&gt;The first step in the preparation process was adding a new column called &lt;code&gt;Quality_Classification&lt;/code&gt;, which is based on the &lt;code&gt;Rating&lt;/code&gt; value, as mentioned earlier, with values classified as &lt;strong&gt;low&lt;/strong&gt;, &lt;strong&gt;medium&lt;/strong&gt;, or &lt;strong&gt;high&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Next, I replaced each value in the &lt;code&gt;Country&lt;/code&gt; and &lt;code&gt;Winery&lt;/code&gt; columns with the &lt;strong&gt;average rating&lt;/strong&gt; of that country or winery. Then I transformed the &lt;code&gt;Year&lt;/code&gt; column into &lt;strong&gt;wine age&lt;/strong&gt;. After that, I applied &lt;code&gt;np.log1p&lt;/code&gt; to the &lt;code&gt;Price&lt;/code&gt; and &lt;code&gt;NbrOfRating&lt;/code&gt; columns because the gap between their minimum and maximum values is very large.&lt;/p&gt;

&lt;p&gt;In the final step, I removed the unwanted columns.&lt;/p&gt;
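&lt;p&gt;The preparation steps above could look roughly like this in pandas. It is a sketch under stated assumptions, not the original script: the rows are made up, &lt;code&gt;REFERENCE_YEAR&lt;/code&gt; and the &lt;code&gt;Age&lt;/code&gt; column name are hypothetical, and the group means are computed over all rows as the text describes. In a stricter setup they would be computed on the training split only, so that ratings from validation rows do not leak into the features.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the column names from the article.
df = pd.DataFrame({
    "Country": ["France", "France", "Italy", "Italy"],
    "Winery": ["A", "A", "B", "C"],
    "Rating": [4.0, 4.4, 3.6, 3.8],
    "Price": [10.0, 250.0, 8.0, 40.0],
    "NbrOfRating": [100, 25000, 30, 900],
    "Year": [2015, 2005, 2018, 2012],
})

REFERENCE_YEAR = 2020  # assumed; the article does not state the year used

# Replace each country and winery with the mean rating of its group.
for col in ["Country", "Winery"]:
    df[col] = df.groupby(col)["Rating"].transform("mean")

# Turn the vintage year into an age in years ("Age" is a hypothetical name).
df["Age"] = REFERENCE_YEAR - df["Year"]

# log1p(x) = log(1 + x) compresses wide ranges and is defined at zero.
for col in ["Price", "NbrOfRating"]:
    df[col] = np.log1p(df[col])

# Drop the column that is no longer needed.
df = df.drop(columns=["Year"])
print(df.round(3))
```

&lt;p&gt;After the log transform, a price of 250 and a price of 10 end up about three units apart instead of 240, which keeps a few very expensive wines from dominating the feature.&lt;/p&gt;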

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65jwxxlzwjay802q6omk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65jwxxlzwjay802q6omk.png" alt=" " width="625" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Our first prediction
&lt;/h2&gt;

&lt;p&gt;The image below shows a snippet of the script that performs our first prediction of the quality of wines the model was not trained on.&lt;/p&gt;

&lt;p&gt;In simple terms, we first separated the data into &lt;strong&gt;X&lt;/strong&gt; and &lt;strong&gt;Y&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;X&lt;/strong&gt; holds the numerical features, while &lt;strong&gt;Y&lt;/strong&gt; holds the quality classification that corresponds to each row of features.&lt;/p&gt;

&lt;p&gt;After that, we split X and Y into two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first part (&lt;strong&gt;X_train&lt;/strong&gt;, &lt;strong&gt;Y_train&lt;/strong&gt;) is used to train the model.&lt;/li&gt;
&lt;li&gt;The second part (&lt;strong&gt;X_validation&lt;/strong&gt;, &lt;strong&gt;Y_validation&lt;/strong&gt;) is used to test how accurate the model is in its predictions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After training the model on &lt;code&gt;X_train&lt;/code&gt; and &lt;code&gt;Y_train&lt;/code&gt; with the &lt;code&gt;fit&lt;/code&gt; method, we made predictions from the &lt;strong&gt;X_validation&lt;/strong&gt; values. The prediction result is the first line of output in the image below, which contains an array of classifications.&lt;/p&gt;

&lt;p&gt;To evaluate the model’s accuracy, we used the &lt;code&gt;accuracy_score&lt;/code&gt; function from the sklearn library, comparing the prediction result with &lt;strong&gt;Y_validation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The accuracy score is close to 1, which suggests that the model learned well from the data it was given and can produce realistic predictions when provided with new wine data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzo45tr7tewfwsxvbtg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzo45tr7tewfwsxvbtg7.png" alt=" " width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
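&lt;p&gt;For readers who cannot view the screenshot, the workflow described above can be sketched with scikit-learn as follows. This is not the original script. The features and labels are synthetic stand-ins for the prepared Vivino data, and the 80/20 split, the number of trees, and the random seeds are assumptions.&lt;/p&gt;

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features (not the Vivino data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))

# Hypothetical labels derived from the first feature: codes 0, 1, 2.
codes = np.digitize(X[:, 0], bins=[-0.5, 0.5])
Y = np.array(["low", "medium", "high"])[codes]

# Hold out 20% of the rows for validation (assumed split size).
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Train the classifier on the training part only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, Y_train)

# Predict the held-out rows and compare with their true labels.
predictions = model.predict(X_validation)
accuracy = accuracy_score(Y_validation, predictions)
print(predictions[:5], accuracy)
```

&lt;p&gt;Fixing &lt;code&gt;random_state&lt;/code&gt; in both the split and the model makes the run reproducible, which helps when comparing accuracy across changes to the features.&lt;/p&gt;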

&lt;h1&gt;
  
  
  Communication
&lt;/h1&gt;

&lt;p&gt;In this study, we analyzed a dataset of over &lt;strong&gt;13,800 wines&lt;/strong&gt; from Vivino to understand the factors that influence wine quality. Our main focus was the &lt;strong&gt;Rating&lt;/strong&gt; value, which serves as the target variable for predicting wine quality. Based on the ratings, we classified wines into &lt;strong&gt;low, medium, and high quality&lt;/strong&gt;, forming the foundation for our Machine Learning model.&lt;/p&gt;

&lt;p&gt;Through exploratory analysis, we observed that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price&lt;/strong&gt; does not have a clear linear relationship with quality; cheaper wines can often be highly rated, and expensive wines do not guarantee superior quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Country and producer (winery)&lt;/strong&gt; play a significant role in wine quality. Some countries and top producers consistently produce higher-quality wines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wine age&lt;/strong&gt; influences quality, but the trend is not strictly linear; quality fluctuates over decades, showing periods of improvement and decline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We preprocessed the data carefully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added a &lt;code&gt;Quality_Classification&lt;/code&gt; column based on &lt;code&gt;Rating&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Converted &lt;code&gt;Country&lt;/code&gt; and &lt;code&gt;Winery&lt;/code&gt; to average ratings.&lt;/li&gt;
&lt;li&gt;Transformed &lt;code&gt;Year&lt;/code&gt; into wine age.&lt;/li&gt;
&lt;li&gt;Scaled &lt;code&gt;Price&lt;/code&gt; and &lt;code&gt;NbrOfRating&lt;/code&gt; using &lt;code&gt;np.log1p&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Removed irrelevant columns to focus on predictive features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, we trained a &lt;strong&gt;Random Forest Classifier&lt;/strong&gt;, a supervised learning algorithm, to predict the quality of new wines based on their characteristics. This model allows us to anticipate wine quality even for wines not present in our dataset, providing actionable insights for producers, retailers, and wine enthusiasts.&lt;/p&gt;

&lt;p&gt;The results suggest that while price alone is not a reliable indicator, a combination of &lt;strong&gt;country, winery, age, and other features&lt;/strong&gt; can effectively predict wine quality, highlighting the value of data-driven decision-making in the wine industry.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Inside the App Store: Insights from 10,000 Google Play Applications</title>
      <dc:creator>Mohamed ID-ABDELLAH</dc:creator>
      <pubDate>Thu, 13 Nov 2025 22:17:21 +0000</pubDate>
      <link>https://dev.to/idabdellah/inside-the-app-store-insights-from-10000-google-play-applications-4k05</link>
      <guid>https://dev.to/idabdellah/inside-the-app-store-insights-from-10000-google-play-applications-4k05</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this project, carried out as part of the training course on the &lt;em&gt;Qwasar Silicon Valley&lt;/em&gt; platform, we analyze a dataset containing more than ten thousand applications from the Google Play Store for the years 2017 and 2018. Our goal is to interpret this data and draw conclusions that provide a deeper understanding of the app market, while addressing practical questions such as: &lt;em&gt;What is the size of the market?&lt;/em&gt; (in terms of the number of downloads and the total revenue of paid applications), and &lt;em&gt;what is the distribution of categories and their percentages?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We will follow a clear scientific methodology that includes data cleaning, visual and statistical exploratory analysis, and calculating indicators such as average prices, as well as ranking applications by popularity and price. In the end, we will draw conclusions and propose future steps to continue the research.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Technical note: the data is in CSV format, and it will be handled using the pandas library in Python.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset Overview
&lt;/h2&gt;

&lt;p&gt;The dataset we will be working with, as mentioned earlier, contains more than ten thousand applications from the Google Play Store across a variety of categories and types.&lt;br&gt;&lt;br&gt;
Its exact shape is &lt;em&gt;(10840, 13)&lt;/em&gt;, meaning it includes 10,840 data rows with 13 columns.&lt;br&gt;&lt;br&gt;
The columns and their descriptions are as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column Name&lt;/th&gt;
&lt;th&gt;The Data It Contains&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;App&lt;/td&gt;
&lt;td&gt;Application Name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Category&lt;/td&gt;
&lt;td&gt;The main category under which the application is classified.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rating&lt;/td&gt;
&lt;td&gt;Application rating in the Google Play Store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviews&lt;/td&gt;
&lt;td&gt;Number of reviews the app has&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;App size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Installs&lt;/td&gt;
&lt;td&gt;Number of downloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type&lt;/td&gt;
&lt;td&gt;App type — whether it is paid or free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;Zero if it is free; otherwise, the price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content Rating&lt;/td&gt;
&lt;td&gt;The age group or audience targeted by the application&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Genres&lt;/td&gt;
&lt;td&gt;The sub-categories under which the application falls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Last Updated&lt;/td&gt;
&lt;td&gt;The last date on which the application was updated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Current Version&lt;/td&gt;
&lt;td&gt;The current version of the application&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Android Version&lt;/td&gt;
&lt;td&gt;The minimum Android OS version required for the application to run&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To give a clearer picture, here is an image of the first twenty rows of the dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy528nvp2dzxpk6dhqtzi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy528nvp2dzxpk6dhqtzi.png" alt="Data Eyeball Before Cleaning" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like most real-world datasets, this one contains missing values, which we address in the following section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Cleaning
&lt;/h2&gt;

&lt;p&gt;The first thing I noticed after opening the CSV data file is that there was a single row that contained only 12 values, while the dataset has 13 columns. When I inspected it, I found that the missing column in that row was &lt;em&gt;Category&lt;/em&gt;. I deleted it immediately because the application has no analytical value if its category is unknown.&lt;/p&gt;

&lt;p&gt;Dropping the row is reasonable because, when analyzing and interpreting app data, some columns are essential, such as &lt;em&gt;Category, Rating,&lt;/em&gt; and &lt;em&gt;Installs&lt;/em&gt;, while others are less important, like &lt;em&gt;Current Version&lt;/em&gt; and &lt;em&gt;Last Updated&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I also standardized the data. For example, I converted the numerical columns to a numeric data type and applied some additional adjustments to improve formatting and readability.&lt;/p&gt;

&lt;p&gt;Some values in different columns were also missing and had to be filled in. For essential analytical columns like &lt;em&gt;Rating, Reviews, Size,&lt;/em&gt; and &lt;em&gt;Installs&lt;/em&gt;, I replaced each missing value with the mean of that column within the category the app belongs to.&lt;/p&gt;
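&lt;p&gt;Filling a missing value with the mean of its category can be done with a grouped &lt;code&gt;transform&lt;/code&gt; in pandas. The following is a minimal sketch with made-up rows, shown for the &lt;em&gt;Rating&lt;/em&gt; column only. The same pattern would apply to the other numeric columns.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical rows with two missing ratings.
apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating": [4.0, np.nan, 4.4, 3.8, np.nan],
})

# Mean rating per category, aligned with the original rows.
category_mean = apps.groupby("Category")["Rating"].transform("mean")

# Only the missing entries are replaced; existing ratings stay untouched.
apps["Rating"] = apps["Rating"].fillna(category_mean)
print(apps)
```

&lt;p&gt;Filling within the category keeps, for example, a missing game rating from being pulled toward the average of unrelated categories.&lt;/p&gt;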

&lt;p&gt;Finally, here is an image of what the data looks like after cleaning and refinement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nvajf0ivcklbzc580ow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nvajf0ivcklbzc580ow.png" alt="Data Eyeball After Cleaning" width="800" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Visualizing Distributions
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F697pin143pvfv4tavbkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F697pin143pvfv4tavbkp.png" alt="Histograms to visualize data distribution" width="626" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the histograms above, we observe that most applications have a rating between 4 and 5, and only a few have low ratings. We also notice that most apps are small in size and have 250 million downloads or fewer.&lt;br&gt;&lt;br&gt;
We further observe that the prices of paid applications range between one dollar and twenty dollars.&lt;/p&gt;

&lt;h2&gt;
  
  
  Correlations
&lt;/h2&gt;

&lt;p&gt;To understand the relationships in the data, the table below uses the &lt;em&gt;Pearson correlation&lt;/em&gt; coefficient to measure the linear relationship between each pair of variables. The coefficient ranges from -1 to 1. The closer the result is to 1, the stronger the positive relationship, meaning one variable increases as the other increases. If the result is below zero, the relationship is negative, meaning one variable decreases as the other increases. If the result is zero, there is no linear relationship between the two variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvfr3mgvtbkfke0ajlp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvfr3mgvtbkfke0ajlp0.png" alt="Pearson Correlations Table" width="434" height="157"&gt;&lt;/a&gt;&lt;/p&gt;
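&lt;p&gt;A table like the one above can be produced with &lt;code&gt;DataFrame.corr&lt;/code&gt;. The sketch below uses made-up values and also recomputes one coefficient by hand, as the covariance of the two columns divided by the product of their standard deviations, to show what the number represents.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical values; column names follow the dataset overview.
df = pd.DataFrame({
    "Reviews": [10, 200, 3000, 45000, 600000],
    "Installs": [1000, 10000, 100000, 1000000, 10000000],
    "Price": [4.99, 0.0, 1.99, 0.0, 0.0],
})

# pandas computes the pairwise Pearson matrix directly.
corr = df.corr(method="pearson")

# The same coefficient by hand for one pair of columns:
# sum of products of deviations over the root of the summed squares.
x = df["Reviews"] - df["Reviews"].mean()
y = df["Installs"] - df["Installs"].mean()
r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(corr.round(3))
print(round(r, 3))
```

&lt;p&gt;The diagonal of the matrix is always 1, since every column is perfectly correlated with itself.&lt;/p&gt;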

&lt;h2&gt;
  
  
  Interpretations &amp;amp; Insights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Categories Distribution
&lt;/h3&gt;

&lt;p&gt;The applications in our dataset fall into 33 unique categories.&lt;br&gt;&lt;br&gt;
The pie chart below shows how the applications are distributed across these categories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvogq9x6h1xtrimvtbxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvogq9x6h1xtrimvtbxb.png" alt="Pie Chart to visualize Categories distribution" width="524" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the categories dominating the dataset are &lt;em&gt;Family, Game,&lt;/em&gt; and &lt;em&gt;Tools&lt;/em&gt;, with the Family category taking the lead at 18%.&lt;br&gt;&lt;br&gt;
It is also worth noting that the categories are sorted in descending order, from the most frequent to the least. To ensure clear visualization and avoid clutter in the chart, I grouped the smaller, less frequent categories into a single group labeled &lt;em&gt;Others&lt;/em&gt;.&lt;/p&gt;
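
&lt;p&gt;The grouping step can be sketched roughly as follows, using only the Python standard library. The helper names &lt;code&gt;group_small_categories&lt;/code&gt; and &lt;code&gt;to_percentages&lt;/code&gt;, the &lt;code&gt;top_n&lt;/code&gt; cutoff, and the sample list are assumptions made for this example. The cutoff actually used for the chart is not stated here.&lt;/p&gt;

```python
from collections import Counter


def group_small_categories(categories, top_n=3, other_label="Others"):
    """Count categories, keep the top_n most frequent, merge the rest."""
    counts = Counter(categories)
    ranked = counts.most_common()          # sorted from most to least frequent
    grouped = dict(ranked[:top_n])
    remainder = sum(count for _, count in ranked[top_n:])
    if remainder:
        grouped[other_label] = remainder
    return grouped


def to_percentages(grouped):
    """Convert raw counts to percentage shares of the total."""
    total = sum(grouped.values())
    return {name: round(100 * count / total, 1) for name, count in grouped.items()}


# Hypothetical sample, not the real dataset.
sample = ["Family"] * 5 + ["Game"] * 4 + ["Tools"] * 3 + ["Medical"] * 2 + ["Finance"]
shares = to_percentages(group_small_categories(sample))
print(shares)
```

&lt;p&gt;The resulting dictionary keeps the descending order of the named categories and places the merged group last, which is the shape a pie chart needs.&lt;/p&gt;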

&lt;h3&gt;
  
  
  Downloads percentages per category
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5en019z3ieuyvuleyazk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5en019z3ieuyvuleyazk.png" alt="Pie chart for downloads percentages per category" width="503" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the chart above shows, the categories with the most downloads are &lt;em&gt;Game, Communication,&lt;/em&gt; and &lt;em&gt;Productivity&lt;/em&gt;. Game leads with 18% of total downloads, followed by Communication at 10%.&lt;/p&gt;

&lt;p&gt;Note that although the Game category ranks only second by number of applications, as shown in the previous section, it still ranks first in total downloads.&lt;/p&gt;
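
&lt;p&gt;To make the difference between app count and download share concrete, here is a small standard-library sketch. The function name &lt;code&gt;download_share&lt;/code&gt; and the rows are hypothetical and chosen only to show the effect.&lt;/p&gt;

```python
from collections import defaultdict


def download_share(apps):
    """Return each category's percentage of total downloads, largest first."""
    totals = defaultdict(int)
    for category, downloads in apps:
        totals[category] += downloads
    grand_total = sum(totals.values())
    shares = {cat: 100 * n / grand_total for cat, n in totals.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


# Hypothetical (category, downloads) rows, for illustration only.
apps = [
    ("Family", 1_000),
    ("Family", 2_000),
    ("Family", 1_000),
    ("Game", 10_000),
    ("Game", 6_000),
]
print(download_share(apps))
```

&lt;p&gt;In this made-up sample, Family has more apps than Game, yet Game accounts for the larger share of downloads, which mirrors the pattern described above.&lt;/p&gt;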

&lt;h3&gt;
  
  
  Mean rating per category
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuw7v0btpy9y33qqr0gu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuw7v0btpy9y33qqr0gu.png" alt="Bar chart for Mean rating per category" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This chart shows that every category, without exception, has a high average rating, which is consistent with the rating histogram, where most applications score between 4 and 5.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mean price per category
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfxe6i1rn8tfg4ng9qzs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfxe6i1rn8tfg4ng9qzs.png" alt="Bar chart for mean price per category" width="628" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This chart shows that the two categories with the highest average prices are &lt;em&gt;Finance&lt;/em&gt; and &lt;em&gt;Lifestyle&lt;/em&gt;, with Finance leading at an average of $8 per application.&lt;/p&gt;
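
&lt;p&gt;A per-category average like this one can be computed along the following lines. This is only a sketch with the standard library: the function name &lt;code&gt;mean_price_by_category&lt;/code&gt; and the rows are hypothetical, and including free apps (price 0) in the average is an assumption of this example.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean


def mean_price_by_category(apps):
    """Average the price of the apps in each category, highest average first."""
    prices = defaultdict(list)
    for category, price in apps:
        prices[category].append(price)
    averages = {cat: mean(values) for cat, values in prices.items()}
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))


# Hypothetical (category, price in USD) rows; free apps are included as 0.0.
apps = [
    ("Finance", 12.0),
    ("Finance", 4.0),
    ("Lifestyle", 6.0),
    ("Lifestyle", 0.0),
    ("Game", 0.0),
    ("Game", 2.0),
]
print(mean_price_by_category(apps))
```

&lt;p&gt;Whether free apps are counted changes the averages considerably, so it is worth deciding on that before comparing categories.&lt;/p&gt;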

&lt;h3&gt;
  
  
  Most popular paid apps in family category
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wtx6lvpyfikem25qri3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wtx6lvpyfikem25qri3.png" alt="Bar chart for most popular paid apps in family category" width="627" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Returning to Family, the leading category by number of applications, the chart above shows the most downloaded paid apps in this category.&lt;/p&gt;
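
&lt;p&gt;A ranking of this kind boils down to a filter followed by a sort. The sketch below shows one way to do it in plain Python. The function name &lt;code&gt;top_paid_apps&lt;/code&gt;, the record fields, and the app names are all hypothetical.&lt;/p&gt;

```python
def top_paid_apps(apps, category, limit=3):
    """Return names of the most downloaded paid apps in one category."""
    paid = [
        app for app in apps
        if app["category"] == category and app["price"] != 0
    ]
    # Most downloaded first.
    paid.sort(key=lambda app: app["downloads"], reverse=True)
    return [app["name"] for app in paid[:limit]]


# Hypothetical records, for illustration only.
apps = [
    {"name": "App A", "category": "Family", "price": 2.99, "downloads": 500},
    {"name": "App B", "category": "Family", "price": 0.0, "downloads": 9000},
    {"name": "App C", "category": "Family", "price": 4.99, "downloads": 1200},
    {"name": "App D", "category": "Game", "price": 1.99, "downloads": 7000},
]
print(top_paid_apps(apps, "Family"))
```

&lt;p&gt;Here the free app and the app from another category are filtered out before sorting, so only paid Family apps are ranked.&lt;/p&gt;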

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If there’s one thing to remember, it’s that the real story is often hidden in the data. Data analytics is a vast field, and the number of insights, explanations, and analyses that can be extracted from just ten thousand applications is already enormous, let alone from a much larger dataset.&lt;/p&gt;

</description>
      <category>data</category>
      <category>database</category>
      <category>analytics</category>
      <category>python</category>
    </item>
  </channel>
</rss>
