Analyzing Steam Games 2025: Genres, Players and User Ratings

#python #datascience #pandas #jupyter

As part of my Master in Data Science & AI at Evolve, I worked on a data analysis project using a real Steam games dataset from 2025.

The goal was to practice the full data analysis workflow: understanding the dataset, cleaning it, transforming variables, creating visualizations and explaining the results in a way that is easy to understand.

The main question I wanted to answer was:

Which Steam genres have the highest average estimated users per game, and which ones have the highest average playtime?

Later, I expanded the analysis with two additional questions:

Are there clear differences between free and paid games?
Is there any relationship between positive ratings, estimated users and average playtime?

Dataset

The dataset used in this project was the Steam Games dataset 2025, downloaded from Kaggle.

It contains around 95,000 games and 47 columns, including information such as:

game name;
release date;
genres;
price;
estimated owners;
average playtime;
positive and negative reviews.

One important detail is that the original CSV file is not included in the GitHub repository because it is too large. Instead, the repository explains where the file should be placed in order to reproduce the project.

Data cleaning and transformation

Before analyzing the data, I had to transform several columns.

For example, the estimated_owners column does not contain an exact number of users. It contains ranges such as 100000 - 200000. To work with this variable, I used the midpoint of each range as an approximation.
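As a minimal sketch of the midpoint idea, assuming the ranges always follow the "low - high" pattern shown above: the sample rows are invented for illustration, and estimated_users is a hypothetical name for the new column, not necessarily the one used in the project.

```python
import pandas as pd

# Toy rows in the "low - high" format; the real values come from the Kaggle CSV.
df = pd.DataFrame(
    {"estimated_owners": ["100000 - 200000", "0 - 20000", "20000 - 50000"]}
)

# Split each range into its two bounds and convert them to numbers.
bounds = df["estimated_owners"].str.split("-", expand=True)
low = pd.to_numeric(bounds[0].str.strip())
high = pd.to_numeric(bounds[1].str.strip())

# The midpoint of the range is used as the approximate number of users.
df["estimated_users"] = (low + high) / 2
```

The midpoint is only an approximation: the true value can be anywhere inside the range, so wide ranges carry more uncertainty than narrow ones.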

I also converted average_playtime_forever from minutes to hours, separated games with multiple genres, translated the main genre names into Spanish for the final report, and created summary tables for the analysis.
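The playtime conversion and the genre separation could look roughly like this. It is a sketch under assumptions: the genres column name, the comma separator and the new column names (avg_playtime_hours, genre) are hypothetical, and the two sample games are invented.

```python
import pandas as pd

# Toy data; "genres" and its comma separator are assumptions about the CSV layout.
df = pd.DataFrame(
    {
        "name": ["Game A", "Game B"],
        "genres": ["Action,RPG", "Casual"],
        "average_playtime_forever": [120, 45],  # minutes
    }
)

# Minutes to hours.
df["avg_playtime_hours"] = df["average_playtime_forever"] / 60

# Turn the genre string into a list, then create one row per (game, genre) pair.
df["genre"] = df["genres"].str.split(",")
by_genre = df.explode("genre").reset_index(drop=True)
by_genre["genre"] = by_genre["genre"].str.strip()
```

After the explode step, a game with two genres appears in two rows, which is what makes per-genre grouping possible later.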

One of the most important decisions was not to use total estimated users by genre as the main metric. A single game can belong to several genres, so summing users by genre can produce inflated numbers. Instead, I focused on average estimated users per game, which gave a more realistic comparison between genres.
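A small toy example shows why the totals get inflated. The table below is invented and assumes one row per (game, genre) pair, with hypothetical column names.

```python
import pandas as pd

# "Game A" has two genres, so it appears twice in the exploded table.
by_genre = pd.DataFrame(
    {
        "name": ["Game A", "Game A", "Game B"],
        "genre": ["Action", "RPG", "Action"],
        "estimated_users": [150000.0, 150000.0, 10000.0],
    }
)

# Users counted once per game.
unique_total = by_genre.drop_duplicates("name")["estimated_users"].sum()

# Users summed per genre and then added up: Game A is counted twice.
summed_over_genres = by_genre.groupby("genre")["estimated_users"].sum().sum()

# Average users per game within each genre, the metric used in this analysis.
mean_per_game = by_genre.groupby("genre")["estimated_users"].mean()
```

Here the per-genre totals add up to 310,000 users even though the two games together only have 160,000, while the per-genre average stays on the scale of a single game.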

Main findings

Among genres with at least 1,000 games, the highest average estimated users per game belonged to:

Massively Multiplayer
Free To Play
Action
RPG
Strategy

This showed that popularity is not only about how many games a genre has, but also about how many users an average game in that genre can attract.
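A ranking like this can be sketched with a grouped summary plus a minimum-size filter. The helper rank_genres, its parameters and the column names are hypothetical, and the toy data uses a threshold of 2 games instead of 1,000 so the filter is visible.

```python
import pandas as pd


def rank_genres(by_genre, value_col, min_games=1000, top=5):
    """Rank genres by the mean of value_col, keeping only genres with enough games."""
    summary = by_genre.groupby("genre").agg(
        games=("name", "nunique"),
        mean_value=(value_col, "mean"),
    )
    large_enough = summary[summary["games"] >= min_games]
    return large_enough.sort_values("mean_value", ascending=False).head(top)


# Invented exploded table: one row per (game, genre) pair.
by_genre = pd.DataFrame(
    {
        "name": ["A", "B", "A", "C", "D"],
        "genre": ["Action", "Action", "RPG", "RPG", "Indie"],
        "estimated_users": [100.0, 300.0, 100.0, 500.0, 900.0],
    }
)

ranking = rank_genres(by_genre, "estimated_users", min_games=2)
```

In the toy data, Indie has the highest average but only one game, so it is dropped before ranking. The same helper could be called with a playtime column to build the second ranking.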

When analyzing average playtime, the ranking changed. The genres with the highest average playtime were:

Simulation
Massively Multiplayer
Casual
Adventure
Action

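This was one of the most interesting parts of the project. The genres with the most estimated users per game were not always the ones with the highest average playtime: Free To Play, RPG and Strategy appear only in the first ranking, while Simulation, Casual and Adventure appear only in the second. In other words, popularity and playtime are related, but they are not the same thing.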

Free games vs paid games

I also compared free and paid games.

Free games had a higher average number of estimated users. This makes sense because they have no economic barrier to entry, so more users can try them.

However, paid games had higher average playtime. My interpretation is that users who pay for a game may be more likely to spend more time playing it.

Interestingly, the average positive rating was very similar between free and paid games.
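The comparison itself can be sketched as a grouped mean over a free/paid label. This assumes a game-level table (one row per game, not the exploded one) and a price column where 0 means free. All column names and values below are invented for illustration and are not the project's results.

```python
import pandas as pd

# Toy game-level table with hypothetical column names.
games = pd.DataFrame(
    {
        "price": [0.0, 0.0, 9.99, 19.99],
        "estimated_users": [150000.0, 35000.0, 10000.0, 35000.0],
        "avg_playtime_hours": [1.0, 3.0, 6.0, 10.0],
        "positive_pct": [80.0, 70.0, 75.0, 77.0],
    }
)

# Label each game as free or paid based on its price.
games["pricing"] = games["price"].eq(0).map({True: "free", False: "paid"})

# Average of each metric within the two groups.
comparison = games.groupby("pricing")[
    ["estimated_users", "avg_playtime_hours", "positive_pct"]
].mean()
```

Using the game-level table here avoids counting multi-genre games more than once in either group.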

Ratings and popularity

In the last part of the project, I analyzed whether positive ratings were strongly related to estimated users or average playtime.

To make this analysis more reliable, I kept only games with at least 50 reviews. This was important because games with very few reviews can have extreme percentages that are not representative.

The result was that positive ratings had a very weak relationship with both estimated users and playtime. A game can be popular without having exceptionally high ratings, and a highly rated game does not necessarily have a massive number of users.
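One way to set up this check is sketched below. It assumes the review counts live in columns named positive and negative, defines the positive rating as the share of positive reviews, and uses the default Pearson correlation in pandas. The post does not state which correlation measure was used, and the sample values are invented.

```python
import pandas as pd

# Toy game-level table with hypothetical column names.
games = pd.DataFrame(
    {
        "positive": [45, 900, 3, 120, 60],
        "negative": [5, 100, 0, 80, 40],
        "estimated_users": [10000.0, 150000.0, 10000.0, 35000.0, 75000.0],
        "avg_playtime_hours": [2.0, 12.0, 0.5, 30.0, 4.0],
    }
)

# Positive rating as a percentage of all reviews.
games["total_reviews"] = games["positive"] + games["negative"]
games["positive_pct"] = 100 * games["positive"] / games["total_reviews"]

# Keep only games with enough reviews for the percentage to be meaningful.
MIN_REVIEWS = 50
rated = games[games["total_reviews"] >= MIN_REVIEWS]

# Pairwise correlation matrix between rating, users and playtime.
corr = rated[["positive_pct", "estimated_users", "avg_playtime_hours"]].corr()
```

In the toy data, the game with 3 reviews has a 100% positive rating, which is exactly the kind of extreme percentage the review threshold is meant to remove.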

Technical notes

Some methodological decisions were important for keeping the analysis consistent:

estimated_owners was converted from ranges into approximate numeric values using the midpoint of each range.
Genres were exploded so that a game with several genres is counted once in each of them.
Genres with fewer than 1,000 games were filtered out in the main genre comparison to avoid unstable rankings.
Games with fewer than 50 reviews were excluded from the ratings analysis to reduce noise.
Correlation was interpreted carefully, since it does not imply causation.

What I learned

One of the main things I learned is that in data analysis, getting a result is not enough. The result also needs to make sense.

At one point, the total estimated users by genre produced very large numbers. Instead of accepting them directly, I reviewed the metric and realized that using average users per game was a better approach for the question I was trying to answer.

I also practiced:

cleaning and transforming real-world data;
creating new variables;
working with grouped summaries;
building visualizations;
explaining limitations;
preparing a project to be shared publicly.

I also used AI support during the project, mainly to help structure the work, review code, improve explanations and detect possible interpretation issues. The analysis decisions and final interpretation were reviewed step by step as part of the learning process.