Introduction
In this project, part of the training course on the Qwasar Silicon Valley platform, we analyze a dataset of more than ten thousand applications from the Google Play Store for the years 2017 and 2018. Our goal is to interpret this data and draw conclusions that give a deeper understanding of the app market, while answering practical questions such as: What is the size of the market, in terms of the number of downloads and the total revenue of paid applications? How are the categories distributed, and what share does each one hold?
We will follow a clear scientific methodology that includes data cleaning, visual and statistical exploratory analysis, and calculating indicators such as average prices, as well as ranking applications by popularity and price. In the end, we will draw conclusions and propose future steps to continue the research.
Technical note: the data is in CSV format, and it will be handled using the pandas library in Python.
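As a rough illustration of that note, here is a minimal sketch of loading CSV data with pandas. The two sample rows, the app names, and the file name `googleplaystore.csv` are assumptions made for the example, not values taken from the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "App,Category,Rating,Reviews,Size,Installs,Type,Price\n"
    'Sketch Pad,Tools,4.1,159,19M,"10,000+",Free,0\n'
    'Pocket Ledger,Finance,4.5,87,2.5M,"1,000+",Paid,$4.99\n'
)

# With the real file this would be something like
# pd.read_csv("googleplaystore.csv"), where the file name is an assumption.
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # text-like columns stay as strings until they are converted
```

Checking `df.shape` and `df.dtypes` right after loading is a quick way to see which columns still need conversion before any analysis.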
Dataset Overview
The dataset contains more than ten thousand applications from the Google Play Store, diverse in their categories and types.
Its exact shape is (10840, 13): 10,840 rows and 13 columns.
The columns and their descriptions are as follows:
| Column Name | The Data It Contains |
|---|---|
| App | Application Name |
| Category | The main category under which the application is classified. |
| Rating | Application rating in the Google Play Store |
| Reviews | Number of reviews the app has |
| Size | App size |
| Installs | Number of downloads |
| Type | App type — whether it is paid or free |
| Price | Zero if it is free; otherwise, the price |
| Content Rating | The age group or audience targeted by the application |
| Genres | The sub-categories under which the application falls |
| Last Updated | The last date on which the application was updated |
| Current Version | The current version of the application |
| Android Version | The minimum Android OS version required for the application to run |
And to give you a clearer picture, here is an image of the first twenty rows of data from the database.
Naturally, no dataset is free of missing values, and this is what we will address in the following section.
Data Cleaning
The first thing I noticed after opening the CSV file was a single row that contained only 12 values, while the dataset has 13 columns. When I inspected it, I found that the missing value in that row was Category. I deleted the row because an application has little analytical value if its category is unknown.
Dropping it is justified because the analysis depends on key columns such as Category, Rating, and Installs, while other columns, like Current Version and Last Updated, matter less.
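One possible way to find such a row is to count the fields on each line before building the DataFrame. This is only a sketch on a made-up three-column sample; the app names and values are hypothetical, and I am not claiming this is exactly how the row was found in the project.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the third line is missing its Category value.
raw = io.StringIO(
    "App,Category,Rating\n"
    "Sketch Pad,Tools,4.1\n"
    "Broken Row,3.9\n"  # only 2 fields instead of 3
    "Word Quest,Game,4.4\n"
)

reader = csv.reader(raw)
header = next(reader)
rows = list(reader)

# File line numbers (the header is line 1) of rows with the wrong field count.
bad_lines = [i for i, row in enumerate(rows, start=2) if len(row) != len(header)]

# Keep only the well-formed rows; every value is still a string at this point.
good_rows = [row for row in rows if len(row) == len(header)]
df = pd.DataFrame(good_rows, columns=header)

print(bad_lines)
print(df)
```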
I also standardized the data: for example, I converted the numerical columns to numeric data types and applied some additional adjustments to improve formatting and readability.
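The conversion could look roughly like the sketch below. The raw formats shown ("10,000+", "$4.99", "19M", "Varies with device") are assumptions about how the values are stored, and the sketch only handles sizes given in megabytes.

```python
import pandas as pd

# Hypothetical raw values; the formats are assumptions for illustration.
df = pd.DataFrame({
    "Reviews": ["159", "87", "1203"],
    "Installs": ["10,000+", "1,000+", "5,000,000+"],
    "Price": ["0", "$4.99", "0"],
    "Size": ["19M", "2.5M", "Varies with device"],
})

# Plain digit strings convert directly; anything unparsable becomes NaN.
df["Reviews"] = pd.to_numeric(df["Reviews"], errors="coerce")

# Remove thousands separators and the trailing "+" before converting.
df["Installs"] = pd.to_numeric(
    df["Installs"].str.replace(r"[,+]", "", regex=True), errors="coerce"
)

# Remove the currency symbol before converting.
df["Price"] = pd.to_numeric(
    df["Price"].str.replace("$", "", regex=False), errors="coerce"
)

# Keep the number in front of "M"; values in any other format become NaN.
size_mb = df["Size"].str.extract(r"^([\d.]+)M$")[0]
df["Size"] = pd.to_numeric(size_mb, errors="coerce")

print(df.dtypes)
```

Using `errors="coerce"` turns values that cannot be parsed into missing values, which can then be handled together with the other gaps in the data.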
I also noticed missing values in several columns, which had to be handled. For the essential analytical columns (Rating, Reviews, Size, and Installs), I replaced each missing value with the mean of that column within the category the app belongs to.
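Filling a gap with the category mean can be sketched with `groupby` and `transform`. The small table below is made up, and the same pattern would be repeated for each of the columns listed above.

```python
import pandas as pd

# Hypothetical sample with two missing ratings.
df = pd.DataFrame({
    "Category": ["Game", "Game", "Game", "Tools", "Tools"],
    "Rating": [4.0, None, 5.0, 3.0, None],
})

# Mean rating of each row's own category, aligned with the original rows.
category_mean = df.groupby("Category")["Rating"].transform("mean")

# Only the missing entries are replaced; existing ratings are left untouched.
df["Rating"] = df["Rating"].fillna(category_mean)

print(df)
```

`transform("mean")` returns one value per original row, so the result lines up with the column being filled and no merge is needed.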
Finally, here is an image of what the data looks like after cleaning and refinement.
Exploratory Data Analysis
Visualizing Distributions
From the histograms above, we observe that most applications have a rating between 4 and 5, and only a few have low ratings. We also notice that most apps are small in size and have 250 million downloads or fewer.
We further observe that the prices of paid applications range between one dollar and twenty dollars.
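The counts behind a histogram can also be checked numerically. Here is a small sketch that bins a handful of made-up ratings into one-point intervals; the values and the bin edges are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ratings, not taken from the dataset.
ratings = pd.Series([4.5, 4.1, 3.8, 4.7, 2.9, 4.3, 4.9, 1.5], name="Rating")

# Bins (0, 1], (1, 2], ..., (4, 5], similar to what a histogram would draw.
bins = [0, 1, 2, 3, 4, 5]
counts = pd.cut(ratings, bins=bins).value_counts().sort_index()

# Fraction of apps falling in each bin.
share = counts / counts.sum()

print(counts)
print(share)
```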
Correlations
To understand the relationships between the numeric variables, the table below uses the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between two variables. The result lies between -1 and 1. The closer it is to 1, the stronger the positive relationship, meaning one variable increases as the other increases. The closer it is to -1, the stronger the negative relationship, meaning one variable decreases as the other increases. A result near zero indicates no linear relationship between the two variables.
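With pandas, a correlation table of this kind can be produced with `DataFrame.corr`. The numbers below are made up so that one pair is perfectly positively correlated and another perfectly negatively correlated; they are not results from the dataset.

```python
import pandas as pd

# Hypothetical values chosen to show both extremes of the coefficient.
df = pd.DataFrame({
    "Reviews": [10, 20, 30, 40, 50],
    "Installs": [100, 200, 300, 400, 500],  # rises together with Reviews
    "Price": [5.0, 4.0, 3.0, 2.0, 1.0],     # falls as Reviews rises
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")

print(corr)
```

In this toy table, Reviews and Installs move together exactly, while Price moves in the opposite direction, so the coefficients sit at the two ends of the range.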
Interpretations & Insights
Categories Distribution
The applications in the dataset fall into 33 unique categories.
To understand the distribution of these categories, we will use the pie chart below.
As we can see, the categories dominating the dataset are Family, Game, and Tools, with the Family category taking the lead at 18%.
It is also worth noting that the categories are sorted in descending order, from the most frequent to the least. To ensure clear visualization and avoid clutter in the chart, I grouped the smaller, less frequent categories into a single group labeled Others.
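The grouping into Others can be sketched as follows. The sample categories and the 10% cutoff are assumptions; the cutoff actually used for the chart is not stated here.

```python
import pandas as pd

# Hypothetical category column with two rare categories.
categories = pd.Series(
    ["Family"] * 5 + ["Game"] * 3 + ["Tools"] * 2 + ["Weather", "Comics"],
    name="Category",
)

# Share of each category, sorted from most to least frequent.
share = categories.value_counts(normalize=True)

# Assumed cutoff: categories below 10% are merged into "Others".
threshold = 0.10
major = share[share >= threshold]
others = share[share < threshold].sum()

# Data ready for a pie chart: the large categories first, then "Others".
pie_data = pd.concat([major, pd.Series({"Others": others})])

print(pie_data)
```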
Download percentages per category
As the chart above shows, the categories with the highest number of downloads are Game, Communication, and Productivity, with the Game category taking the lead at 18%, followed by Communication at 10% of the total downloads.
It can also be observed that although the Game category ranks only second by number of applications, as mentioned in the previous section, it still ranks first in total downloads.
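The difference between ranking by number of apps and ranking by downloads can be illustrated on a toy table. The figures below are invented to reproduce the pattern, not the real percentages.

```python
import pandas as pd

# Hypothetical sample: Family has the most apps, Game the most installs.
df = pd.DataFrame({
    "Category": ["Game", "Game", "Communication", "Family", "Family", "Family"],
    "Installs": [500, 400, 500, 100, 50, 50],
})

# Total installs per category as a percentage of all installs.
installs = df.groupby("Category")["Installs"].sum()
share_pct = (installs / installs.sum() * 100).sort_values(ascending=False)

# Number of apps per category, for comparison.
app_counts = df["Category"].value_counts()

print(share_pct)
print(app_counts)
```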
Mean rating per category
This chart shows that every category, without exception, has a high average rating.
Mean price per category
This chart shows that the two categories with the highest average prices are Finance and Lifestyle, with Finance leading at an average of $8 per application.
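Per-category averages like these can be computed with a single `groupby`. The sketch below uses made-up numbers and averages over all apps in each category, free ones included; whether free apps should be counted in the mean price is an assumption on my part.

```python
import pandas as pd

# Hypothetical sample; ratings and prices are invented for illustration.
df = pd.DataFrame({
    "Category": ["Finance", "Finance", "Lifestyle", "Lifestyle", "Game"],
    "Rating": [4.0, 4.4, 4.2, 4.6, 4.5],
    "Price": [12.0, 4.0, 6.0, 0.0, 0.0],
})

# Mean rating and mean price per category, most expensive category first.
summary = (
    df.groupby("Category")[["Rating", "Price"]]
    .mean()
    .sort_values("Price", ascending=False)
)

print(summary)
```

To average over paid apps only, the rows could be filtered first, for example with `df[df["Price"] > 0]`, before grouping.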
Most popular paid apps in the Family category
Returning to the Family category, which leads in number of applications, the chart above shows the most downloaded paid apps within it.
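A ranking like this one can be sketched by filtering on category and type, then taking the rows with the largest install counts. The app names, the labels "Family" and "Paid", and the figures are all hypothetical.

```python
import pandas as pd

# Hypothetical sample; names and install counts are invented.
df = pd.DataFrame({
    "App": ["Brick Builder", "Story Time", "Puzzle Farm", "Chat Box", "Math Kids"],
    "Category": ["Family", "Family", "Family", "Communication", "Family"],
    "Type": ["Paid", "Free", "Paid", "Paid", "Paid"],
    "Installs": [1_000_000, 5_000_000, 100_000, 500_000, 10_000_000],
})

# Keep paid apps in the Family category only.
mask = (df["Category"] == "Family") & (df["Type"] == "Paid")

# The two most installed apps among them, largest first.
top_paid_family = df[mask].nlargest(2, "Installs")[["App", "Installs"]]

print(top_paid_family)
```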
Conclusion
If there’s one thing to remember, it’s that the real story is often hidden in the data. The world of data analytics is vast, and the number of insights that can be extracted from just ten thousand applications is already enormous, let alone from a truly massive dataset.








