Movie Crawler: Scraping Information on 100,000+ Movies

Movie data records audiences' preferences and their attitudes towards certain topics. Gathering movie information from relevant websites, such as IMDb and Rotten Tomatoes, supports data analysis and data mining in the film industry. Generally speaking, the scraped data can be used in scenarios such as:

· Analyzing the features of the target audience
· Obtaining public opinions to predict the coming trends
· Supporting targeted advertising

There is much more that can be done with movie data, depending on your needs. To help you with data gathering, this article shows how to scrape information from the IMDb Horror movie list, including the director, the cast, and other important details.

In this case, I’ll show you how to scrape information on the 134,555 Horror movies listed on IMDb, using the link:

https://www.imdb.com/search/title/?genres=horror&start=51&explore=title_type,genres&ref_=adv_nxt
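The link above carries the paging position in its query string. As a minimal sketch, the Python below shows how that URL can be taken apart and rebuilt for another page. The helper name page_url is hypothetical, and the idea that start=51 marks the second page of 50 titles is an assumption inferred from the URL, not something IMDb documents here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

LIST_URL = (
    "https://www.imdb.com/search/title/"
    "?genres=horror&start=51&explore=title_type,genres&ref_=adv_nxt"
)


def page_url(url: str, start: int) -> str:
    """Return `url` with its `start` query parameter set to `start`."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)      # e.g. {"genres": ["horror"], "start": ["51"], ...}
    query["start"] = [str(start)]      # overwrite the paging offset only
    # safe="," keeps "title_type,genres" readable instead of percent-encoding the comma
    new_query = urlencode(query, doseq=True, safe=",")
    return urlunsplit(parts._replace(query=new_query))


if __name__ == "__main__":
    # Assumption: each page lists 50 titles, so offsets go 1, 51, 101, ...
    for offset in (1, 51, 101):
        print(page_url(LIST_URL, offset))
```

In Octoparse you do not need to build these URLs yourself, because the pagination loop in Step 2 clicks through the pages for you. The sketch only makes visible what changes from one page to the next.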

The goal of this web scraper is to find the films on the Horror movie list and collect each film's director, cast, and other important details.

Before getting started, please download Octoparse V7 to your computer so you can follow along. It’s also highly recommended to learn the basic logic of using Octoparse first.

Let’s get started

Step 1: Open the target website in the Octoparse built-in browser.

Simply click “+task” under the Advanced Mode.

Then, paste the URL into the box and click the “Save URL” button.

Step 2: Click to build a task to scrape the movie information.
After the URL is opened in the Octoparse built-in browser, we can build the pagination and a loop item to get the data.

Simply click the “next>>” element in the built-in browser and then click “Loop click selected element” on the Action Tips.

We can see the pagination has been built in the workflow.

If you want Octoparse to recognize the selected element more precisely, you can revise the XPath. As the picture below shows, the XPath that Octoparse generated is //DIV[@class='nav']/DIV[2]/A[2]. It is better to change it to //a[contains(text(), "Next »")], which matches the link by its text rather than by its position on the page.
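To see why the text-based XPath is the safer choice, here is a small Python sketch using only the standard library. The two HTML fragments are hypothetical stand-ins for the navigation bar (the real IMDb markup may differ), and because xml.etree.ElementTree does not support the XPath contains() function, the text match is written as a plain loop that does the same job.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: on the first page there is no "Previous" link,
# so "Next" is the first <a> instead of the second.
FIRST_PAGE = """
<body><div class="nav">
  <div>page info</div>
  <div><a href="?start=51">Next »</a></div>
</div></body>
"""
LATER_PAGE = """
<body><div class="nav">
  <div>page info</div>
  <div><a href="?start=1">« Previous</a><a href="?start=101">Next »</a></div>
</div></body>
"""


def by_position(root):
    """Positional lookup, like //DIV[@class='nav']/DIV[2]/A[2]."""
    return root.find(".//div[@class='nav']/div[2]/a[2]")


def by_text(root):
    """Text lookup, mimicking //a[contains(text(), "Next »")]."""
    for link in root.iter("a"):
        if "Next »" in (link.text or ""):
            return link
    return None


if __name__ == "__main__":
    for name, html in (("first", FIRST_PAGE), ("later", LATER_PAGE)):
        root = ET.fromstring(html)
        print(name, "position:", by_position(root) is not None,
              "text:", by_text(root) is not None)
```

In this sketch the positional lookup misses the link on the first page, while the text lookup finds it on both pages. That is the kind of breakage the revised XPath is meant to avoid.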

In this case, we need to scrape the data from the movie list, which means we can directly create a loop item to extract the data.

Select one of the “blocks” in the browser, and Octoparse will detect all the data fields in the block you selected.

Then, select “Select all sub-elements”.

All the needed data is now selected by Octoparse and highlighted in red. Click “Select All” to continue.

Finally, we select “Extract data in the loop”.

Now both the pagination and the loop item are set up in Octoparse. The workflow of the task is shown on the left, and the extracted data is displayed on the right.
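For readers who think in code, the workflow built so far is roughly an outer loop over pages wrapped around an inner loop over list items. The Python below is only a sketch of that logic, not Octoparse's implementation. The crawl function, the fetch callback, and the page dictionaries with "items" and "next" keys are all hypothetical names chosen for illustration.

```python
from typing import Callable, Iterator, Optional


def crawl(start_url: str,
          fetch: Callable[[str], dict],
          max_pages: Optional[int] = None) -> Iterator[object]:
    """Yield every item from every page, following "next" links.

    `fetch(url)` is assumed to return {"items": [...], "next": url_or_None}.
    """
    url: Optional[str] = start_url
    visited = 0
    while url is not None and (max_pages is None or visited < max_pages):
        page = fetch(url)
        for item in page["items"]:   # inner loop: the "loop item"
            yield item
        url = page.get("next")       # outer loop: click "Next" if it exists
        visited += 1


if __name__ == "__main__":
    # A tiny in-memory "site" so the sketch runs without network access.
    site = {
        "page1": {"items": ["movie A", "movie B"], "next": "page2"},
        "page2": {"items": ["movie C"], "next": None},
    }
    for movie in crawl("page1", site.__getitem__):
        print(movie)
```

The loop stops on its own when a page has no "next" link, which corresponds to Octoparse ending the pagination when the Next button can no longer be found.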

Step 3: Clean the data in Octoparse.

Before extracting the data, it is worth cleaning it up to improve the final result. Simply click to delete the unwanted fields and rename the ones you need.
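The same clean-up (drop unwanted fields, rename the rest) can also be done after export. Here is a minimal Python sketch, assuming each extracted row is a dictionary. The raw names Field1, Field2, and Field3 and the sample values are hypothetical placeholders, so substitute whatever field names your own task shows.

```python
# Hypothetical mapping from auto-generated field names to readable ones.
# Any field not listed here is treated as unwanted and dropped.
RENAME = {"Field1": "title", "Field2": "director", "Field3": "cast"}


def clean_row(row: dict, rename: dict = RENAME) -> dict:
    """Keep only the fields in `rename`, under their new names.

    Runs of whitespace (including line breaks) are collapsed to one space.
    """
    return {
        new: " ".join(str(row[old]).split())
        for old, new in rename.items()
        if old in row
    }


if __name__ == "__main__":
    raw = {"Field1": "  Example Title \n", "Field2": "Jane Doe", "Field9": "ad text"}
    print(clean_row(raw))
```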

Step 4: Extract data
Simply click “Extract data” to get the data locally.

Because local extraction relies on your own computer's resources, such as CPU and internet speed, it runs slower than Octoparse cloud extraction.

After creating the scraper, all you need to do is wait for the data: more than 100,000 lines of movie data in about 2 hours.
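As a rough sanity check on that figure, the short Python calculation below estimates the pace the task has to keep. It assumes the full list of 134,555 titles is scraped and that each page holds 50 titles (inferred from start=51 in the URL). Both are assumptions, so treat the result as a ballpark only.

```python
import math

TOTAL_TITLES = 134_555       # size of the Horror list quoted in this article
TITLES_PER_PAGE = 50         # assumption, inferred from "start=51" in the URL
RUN_SECONDS = 2 * 60 * 60    # "about 2 hours"

pages = math.ceil(TOTAL_TITLES / TITLES_PER_PAGE)
seconds_per_page = RUN_SECONDS / pages
rows_per_second = TOTAL_TITLES / RUN_SECONDS

if __name__ == "__main__":
    print(f"pages to visit:   {pages}")
    print(f"seconds per page: {seconds_per_page:.2f}")
    print(f"rows per second:  {rows_per_second:.1f}")
```

Under these assumptions the task visits roughly 2,700 pages and has a little under 3 seconds for each one, so a slow connection will stretch the total run time noticeably.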

With the steps above, I believe anyone, including those with no programming background, can easily build a movie crawler with Octoparse V7 and get more than 100,000 lines of movie information. However, that's not the easiest way. Using Octoparse V8 could be much easier.

All in all, with data scraping, we can obtain movie data online without any legal issue.

Apart from the data, what matters more is the skill you have learned, which is extremely useful for market research, keeping yourself up to date, and many other things.