DEV Community


My first project

jpakele ・4 min read

The Problem at Hand

Imagine a world where Microsoft wanted to get into the movie-making business. They make PC and video games, industries separate from theater and film. How would they even know where to start? What is the smartest first move? Enter: data science.

For the project, my goal in this fictional (and not especially realistic) scenario is to advise the executives at Microsoft on what a course of action might be. What's profitable? What's reasonable? What is the actionable information that they need?

The Data Set

There were 4 sets of data that came into use in this project, though not all of the data from each was used (as I'll talk more about later).

  • Box Office Mojo
  • IMDB
  • Rotten Tomatoes

These data sets contain itemized information about published movies. Each data set contains a different combination of attributes about a specific movie (earnings, release year, title, affiliated actors/actresses, etc.).

The Problem WITH the Data

All of these data sets came from separate sources. Each source had its own way of organizing things, and not all of the data matched between them. For instance, the movie ID columns in IMDB and Rotten Tomatoes (RT) are neither named the same way nor organized the same way.

Cleaning the Data

To gain a better understanding of the data as a whole, I felt it was necessary to eliminate as much unnecessary data as possible. Any information that wasn't directly about a movie's genres, its overall/average rating, or its monetization was eliminated altogether.

There were a lot of reasons for keeping more data (actors, writers, run time, release date, etc.), but in the context of having to convey this information to a corporate executive, I felt that the two things someone in that position would care about most were how well a movie could be received and how much money it could make them.
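The trimming step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code from the project itself. The column names in KEEP are hypothetical, since each source labels its fields differently, and the sample record is made up.

```python
# Hypothetical column names; the real data sets label these fields differently.
KEEP = (
    "title",
    "release_date",
    "genres",
    "average_rating",
    "num_votes",
    "domestic_gross",
)


def trim_record(record, keep=KEEP):
    """Return a copy of record that contains only the columns listed in keep."""
    return {key: record[key] for key in keep if key in record}


# A made-up record with two columns that are not needed for the analysis.
raw = {
    "title": "Example Movie",
    "release_date": "2017-12-15",
    "genres": "Action,Adventure,Sci-Fi",
    "average_rating": 7.1,
    "num_votes": 25000,
    "domestic_gross": 12_500_000,
    "runtime_minutes": 120,
    "writers": "A. Writer",
}
trimmed = trim_record(raw)
```

Keeping an explicit allow-list of columns, rather than deleting unwanted ones one at a time, makes it easy to see at a glance what the rest of the analysis depends on.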

Methodology and Results

Methodology 1

At this point, I had a strong and focused set of goals for approaching the problem at hand. From here, I took the data sets that were still left, renamed similar columns, and combined them into one consolidated data set. Combining can be a bit tricky, especially with titles that look different to Python but are the same movie (such as "Star Wars the Last Jedi" vs "Star Wars: The Last Jedi"). Likewise, movies that had the same name but were in fact different from each other would show up as repeated rows with identical data. To solve this particular issue, I merged the data sets on two columns: "title" and "release date".

From there, I sorted the list from highest domestic gross to lowest and took 3 samples of the data: counts of each movie's genre(s) for the top 100 movies, the top 1,000 movies, and every movie whose domestic gross was at least 10,000,000 USD.
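Here is one way the two-column merge could look. It is a sketch under assumptions rather than the project's actual code: it uses plain lists of dictionaries instead of a dataframe library, normalize_title is a hypothetical helper, the column names are assumed, and the numeric values are placeholders.

```python
import re


def normalize_title(title):
    """Lowercase, drop punctuation, and collapse whitespace so near-identical titles match."""
    lowered = title.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(no_punct.split())


def merge_on_title_and_date(left_rows, right_rows):
    """Inner-join two lists of dicts on (normalized title, release date)."""
    index = {}
    for row in right_rows:
        key = (normalize_title(row["title"]), row["release_date"])
        index[key] = row

    merged = []
    for row in left_rows:
        key = (normalize_title(row["title"]), row["release_date"])
        if key in index:
            # On shared column names, the values from the left row win.
            merged.append({**index[key], **row})
    return merged


# Placeholder values, for illustration only.
gross_rows = [
    {"title": "Star Wars the Last Jedi", "release_date": "2017-12-15", "domestic_gross": 300},
    {"title": "Same Name", "release_date": "1990-01-01", "domestic_gross": 10},
    {"title": "Same Name", "release_date": "2015-01-01", "domestic_gross": 20},
]
rating_rows = [
    {"title": "Star Wars: The Last Jedi", "release_date": "2017-12-15", "average_rating": 7.0},
    {"title": "Same Name", "release_date": "2015-01-01", "average_rating": 6.0},
]
merged = merge_on_title_and_date(gross_rows, rating_rows)
```

Because the release date is part of the key, the two different movies called "Same Name" stay separate, while the two spellings of the Star Wars title collapse into one row.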
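The three samples could be produced along these lines. This is a standard-library sketch, not the original code: the column names genres and domestic_gross are assumptions, the movies are made up, and the demo uses a top-2 sample in place of the top 100 and top 1,000.

```python
from collections import Counter


def genre_counts(movies):
    """Count how often each genre combination appears in the given movies."""
    return Counter(movie["genres"] for movie in movies)


def gross_samples(movies, top_sizes=(100, 1000), min_gross=10_000_000):
    """Rank movies by domestic gross, then count genres for each sample."""
    ranked = sorted(movies, key=lambda m: m["domestic_gross"], reverse=True)
    samples = {f"top_{n}": genre_counts(ranked[:n]) for n in top_sizes}
    samples[f"gross_at_least_{min_gross}"] = genre_counts(
        m for m in ranked if m["domestic_gross"] >= min_gross
    )
    return samples


# Made-up movies, for illustration only.
movies = [
    {"title": "A", "genres": "Action,Adventure,Sci-Fi", "domestic_gross": 50_000_000},
    {"title": "B", "genres": "Drama", "domestic_gross": 5_000_000},
    {"title": "C", "genres": "Action,Adventure,Sci-Fi", "domestic_gross": 30_000_000},
    {"title": "D", "genres": "Comedy,Drama", "domestic_gross": 12_000_000},
]
demo = gross_samples(movies, top_sizes=(2,))
most_common_top = demo["top_2"].most_common(1)
```

Each genre combination is counted as a single label here, which matches results phrased as combinations such as 'Action, Adventure, & Sci-Fi'. Counting individual genres instead would mean splitting the string first.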

The results of this were:

  • 'Action, Animation, & Comedy' sees the most box office success, both as one of the most common genre combinations and as one of the highest earners.
  • 'Action, Adventure, & Sci-Fi' is the second most common of the highest earners.
  • 'Action, Adventure, & Sci-Fi' is the single highest earner genre combination.

Methodology 2

As in the first methodology, I sorted the same list, this time from the most highly rated to the lowest. However, from the start I removed all movies that had not received at least 10,000 votes. I reasoned that anything with fewer votes had failed to become popular enough to be relevant and should be treated as a non-example.

Then I took my 3 samples: a value count of each movie's genre(s) for the top 100 from this list, the top 1,000 from this list, and all movies that received an average rating of 7.0/10 or higher.
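A rough sketch of the vote filter and the rating-based samples, again with the standard library only. The column names num_votes and average_rating are assumptions, the records are made up, and the demo shrinks the top-100 and top-1,000 samples to a top-2 sample.

```python
from collections import Counter


def rating_samples(movies, min_votes=10_000, top_sizes=(100, 1000), min_rating=7.0):
    """Drop low-vote movies, rank the rest by rating, then count genres per sample."""
    popular = [m for m in movies if m["num_votes"] >= min_votes]
    ranked = sorted(popular, key=lambda m: m["average_rating"], reverse=True)
    samples = {
        f"top_{n}": Counter(m["genres"] for m in ranked[:n]) for n in top_sizes
    }
    samples["rated_at_least"] = Counter(
        m["genres"] for m in ranked if m["average_rating"] >= min_rating
    )
    return samples


# Made-up records, for illustration only.
rated_movies = [
    {"genres": "Drama", "average_rating": 8.2, "num_votes": 40_000},
    {"genres": "Drama", "average_rating": 9.5, "num_votes": 50},  # too few votes
    {"genres": "Comedy,Drama", "average_rating": 7.4, "num_votes": 15_000},
    {"genres": "Horror", "average_rating": 5.1, "num_votes": 20_000},
]
rating_demo = rating_samples(rated_movies, top_sizes=(2,))
```

Filtering on votes before ranking matters: without it, the 9.5-rated record with only 50 votes would sit at the top of the list.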

The results were:

  • The data shows that purely 'Drama' movies have the highest ratings and are also the most common genre type.
  • The second most common genre is 'Comedy & Drama' in combination.

Methodology 3

This was significantly different, as I now needed to find a correlation between these two sets of findings. By creating a scatter plot, I discerned what seemed to be an upward trend between the total domestic gross and the average rating of a movie. But it was hard to pin down why this was true. A higher-rated movie meant higher earnings, but was there any data that looked similar to this?

Yes! Production budget vs. domestic gross! It seems that the higher a movie's budget was, the higher it tended to rate.
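An upward trend in a scatter plot can be backed with a number by computing the Pearson correlation coefficient for the same two columns. The sketch below is one way to do that by hand and is not taken from the project. The budget and gross figures are made-up placeholders in arbitrary units.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / math.sqrt(spread_x * spread_y)


# Placeholder values, for illustration only.
budgets = [10, 20, 30, 40]
grosses = [12, 25, 31, 48]
r = pearson_r(budgets, grosses)
```

A value of r close to +1 corresponds to the kind of upward trend described above, a value near 0 to no linear trend, and a value close to -1 to a downward trend. The same function could be applied to rating vs. domestic gross to compare the two relationships.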

Actionable Info

The gist of my recommendations was that 'Drama' and 'Drama, Comedy' movies are a safe way to establish a foundation of well-received movies and earn a kind of "brand trust". Use these to test the waters with smaller budgets, increasing the budget based on how well each preceding movie did. Eventually, the goal is to make an 'Action, Adventure, Sci-Fi' movie with a large budget to garner a larger return on investment.

I also recommended that, since Microsoft and Xbox own very large, well-loved intellectual properties in the form of video games and video game franchises, they could save effort on story crafting by making a video game into a movie. They wouldn't have to go through the hassle of purchasing rights to anything new and could capitalize on the nostalgia that these games instill in audiences.
