IMDb remains the go-to database for movie and TV information. With over 10 million titles and hundreds of millions of ratings, it's a goldmine for data projects — recommendation engines, market analysis, sentiment tracking, and more.
In this guide, I'll show you how to scrape IMDb effectively in 2026 with working Python code.
IMDb's Structure in 2026
IMDb is still primarily server-rendered HTML, which makes it easier to scrape than heavily JavaScript-dependent sites. However, they've added more dynamic loading and anti-bot protections over the years.
Key pages you'll want to scrape:
- Title pages (`/title/tt1234567/`) — movie/show details, ratings, cast
- Search results (`/find/`) — finding titles by name
- Charts (`/chart/`) — top rated, most popular
- Reviews (`/title/tt1234567/reviews`) — user reviews and ratings
- Name pages (`/name/nm1234567/`) — actor/director filmography
Basic Scraping with BeautifulSoup
Let's start with extracting movie details from a title page:
import requests
from bs4 import BeautifulSoup
import json
def scrape_imdb_title(title_id: str) -> dict:
    """Scrape core metadata for one IMDb title page.

    Reads the JSON-LD block IMDb embeds for SEO instead of scraping
    fragile HTML selectors.

    Args:
        title_id: IMDb title identifier, e.g. "tt0111161".

    Returns:
        Dict with title, type, year, rating, rating_count, genres,
        description, directors, actors, duration and content_rating,
        or {} when no JSON-LD block is present (e.g. a blocked page).

    Raises:
        requests.HTTPError: on a non-2xx response.
    """
    url = f"https://www.imdb.com/title/{title_id}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    # Always set a timeout: requests has none by default and can hang forever.
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # IMDb embeds structured data as JSON-LD
    json_ld = soup.find("script", {"type": "application/ld+json"})
    if not json_ld:
        return {}
    data = json.loads(json_ld.string)

    def _as_list(value):
        # JSON-LD fields like "director", "actor" and "genre" may be a
        # single object/string or a list — normalize all to lists.
        if value is None:
            return []
        return value if isinstance(value, list) else [value]

    # `or {}` / `or ""` also guard against keys present with a None value,
    # which would crash the chained .get() / slicing below.
    aggregate = data.get("aggregateRating") or {}
    return {
        "title": data.get("name"),
        "type": data.get("@type"),
        "year": (data.get("datePublished") or "")[:4],
        "rating": aggregate.get("ratingValue"),
        "rating_count": aggregate.get("ratingCount"),
        "genres": _as_list(data.get("genre")),
        "description": data.get("description"),
        "directors": [
            d["name"] for d in _as_list(data.get("director"))
            if isinstance(d, dict) and "name" in d
        ],
        "actors": [
            a["name"] for a in _as_list(data.get("actor"))[:5]
            if isinstance(a, dict) and "name" in a
        ],
        "duration": data.get("duration"),
        "content_rating": data.get("contentRating"),
    }
# Example: pull the metadata for The Shawshank Redemption.
movie = scrape_imdb_title("tt0111161")
pretty = json.dumps(movie, indent=2)
print(pretty)
The JSON-LD approach is robust because IMDb uses it for SEO — they're unlikely to remove it.
Scraping IMDb Search Results
To find movies by keyword or title:
def search_imdb(query: str, max_results: int = 10) -> list:
    """Search IMDb titles by name via the /find/ page.

    Args:
        query: Search text, e.g. "Inception".
        max_results: Maximum number of result dicts to return.

    Returns:
        List of dicts with title_id, title, year and url keys.

    Raises:
        requests.HTTPError: on a non-2xx response.
    """
    url = "https://www.imdb.com/find/"
    params = {
        "q": query,
        "s": "tt",  # Search titles only
        "exact": "true",
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(url, params=params, headers=headers, timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    results = []
    # IMDb's search results use a specific section structure
    for item in soup.select(".ipc-metadata-list-summary-item")[:max_results]:
        # Attribute values containing "/" must be quoted in CSS selectors;
        # the unquoted form a[href*=/title/] raises SelectorSyntaxError.
        link = item.select_one('a[href*="/title/"]')
        if not link:
            continue
        href = link["href"]
        title_id = href.split("/title/")[1].split("/")[0]
        year_el = item.select_one(".ipc-metadata-list-summary-item__li")
        results.append({
            "title_id": title_id,
            "title": link.get_text(strip=True),
            "year": year_el.get_text(strip=True) if year_el else None,
            "url": f"https://www.imdb.com/title/{title_id}/",
        })
    return results
# Example: print the search hits for "Inception".
results = search_imdb("Inception")
for r in results:
    # Use single quotes for the keys: nested double quotes inside an
    # f-string are a SyntaxError before Python 3.12.
    print(f"{r['title']} ({r['year']}) — {r['url']}")
Extracting Ratings and Reviews
User reviews are valuable for sentiment analysis:
import time
import random
def scrape_imdb_reviews(title_id: str, max_reviews: int = 25) -> list:
    """Scrape user reviews from a title's /reviews page.

    Args:
        title_id: IMDb title identifier, e.g. "tt0111161".
        max_reviews: Maximum number of reviews to return.

    Returns:
        List of dicts with review_title, content, rating, date and
        helpful keys; each field is None when missing from the page.

    Raises:
        requests.HTTPError: on a non-2xx response.
    """
    url = f"https://www.imdb.com/title/{title_id}/reviews"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    reviews = []
    for review in soup.select(".review-container")[:max_reviews]:
        title_el = review.select_one(".title")
        content_el = review.select_one(".text")
        rating_el = review.select_one(".rating-other-user-rating span")
        date_el = review.select_one(".review-date")
        helpful_el = review.select_one(".actions")
        # Guard the int() cast — the rating span can hold non-numeric text.
        rating_text = rating_el.get_text(strip=True) if rating_el else ""
        reviews.append({
            "review_title": title_el.get_text(strip=True) if title_el else None,
            "content": content_el.get_text(strip=True) if content_el else None,
            "rating": int(rating_text) if rating_text.isdigit() else None,
            "date": date_el.get_text(strip=True) if date_el else None,
            "helpful": helpful_el.get_text(strip=True) if helpful_el else None,
        })
    return reviews
# Example: show the first three reviews of The Shawshank Redemption.
reviews = scrape_imdb_reviews("tt0111161")
for r in reviews[:3]:
    # Single-quoted keys: nested double quotes inside an f-string are a
    # SyntaxError before Python 3.12.
    print(f"★{r['rating']}/10 — {r['review_title']}")
    # `or ''` guards against a review with no extracted body text.
    print(f" {(r['content'] or '')[:100]}...")
    print()
Scraping Top Charts
IMDb's charts (Top 250, Most Popular) are great for building recommendation datasets:
def scrape_top_250() -> list:
    """Scrape IMDb's Top 250 chart from its JSON-LD structured data.

    Returns:
        List of dicts with rank, title, url, rating, rating_count and
        description keys; [] when the JSON-LD block is absent.

    Raises:
        requests.HTTPError: on a non-2xx response.
    """
    url = "https://www.imdb.com/chart/top/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract from JSON-LD structured data
    json_ld = soup.find("script", {"type": "application/ld+json"})
    if not json_ld:
        return []
    data = json.loads(json_ld.string)
    movies = []
    for item in data.get("itemListElement", []):
        # `or {}` also covers keys present with an explicit None value.
        movie = item.get("item") or {}
        aggregate = movie.get("aggregateRating") or {}
        movies.append({
            "rank": item.get("position"),
            "title": movie.get("name"),
            "url": movie.get("url"),
            "rating": aggregate.get("ratingValue"),
            "rating_count": aggregate.get("ratingCount"),
            "description": movie.get("description"),
        })
    return movies
# Example: print the current top ten.
top_movies = scrape_top_250()
for m in top_movies[:10]:
    # Single-quoted keys: nested double quotes inside an f-string are a
    # SyntaxError before Python 3.12.
    print(f"#{m['rank']} {m['title']} — ★{m['rating']}")
Handling Anti-Bot Protection
IMDb has gotten stricter about automated access. Here's what you need:
Proxy Rotation
Scraping IMDb at scale without proxies will get your IP blocked. I recommend ScrapeOps — they provide a proxy API specifically optimized for web scraping. You just route your requests through their endpoint:
def get_with_scrapeops(url: str, api_key: str) -> str:
    """Fetch *url* through the ScrapeOps proxy API and return the HTML.

    Args:
        url: Target page to fetch.
        api_key: Your ScrapeOps API key.

    Returns:
        The response body as text.

    Raises:
        requests.HTTPError: on a non-2xx response from the proxy.
    """
    response = requests.get(
        "https://proxy.scrapeops.io/v1/",
        params={
            "api_key": api_key,
            "url": url,
            "render_js": "false",
        },
        # Proxy round-trips are slower than direct fetches — allow headroom,
        # but never wait forever (requests has no default timeout).
        timeout=60,
    )
    # Fail loudly instead of silently returning an error page's body.
    response.raise_for_status()
    return response.text
# Example: fetch a title page through the proxy endpoint.
html = get_with_scrapeops(
    url="https://www.imdb.com/title/tt0111161/",
    api_key="YOUR_SCRAPEOPS_KEY",
)
For higher volume scraping, ThorData residential proxies give you a rotating pool of real residential IPs:
# Define headers here too — this snippet runs standalone, and `headers`
# was previously only a local variable inside the functions above.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
# Route both schemes through the same rotating residential endpoint.
proxies = {
    "http": "http://user:pass@proxy.thordata.com:9090",
    "https": "http://user:pass@proxy.thordata.com:9090",
}
response = requests.get(
    "https://www.imdb.com/chart/top/",
    headers=headers,
    proxies=proxies,
    timeout=30,  # proxied requests are slower; still never wait forever
)
Rate Limiting
Always add delays between requests:
import time
import random
def polite_request(url: str, headers: dict, *, timeout: float = 15) -> requests.Response:
    """GET *url* after a random 2-5 second pause.

    The jittered delay keeps request spacing irregular so the scraper
    doesn't hammer the server at a fixed cadence.

    Args:
        url: Page to fetch.
        headers: Request headers (User-Agent etc.).
        timeout: Per-request timeout in seconds (requests has no default,
            so an unset timeout can hang forever).

    Returns:
        The requests.Response object.
    """
    time.sleep(random.uniform(2, 5))  # Random delay
    return requests.get(url, headers=headers, timeout=timeout)
Building a Complete Movie Dataset
Here's how to combine everything into a dataset builder:
import csv
import time
import random
def build_movie_dataset(title_ids: list, output_file: str):
    """Scrape each title in *title_ids* and write the results to a CSV.

    Failures on individual titles are printed and skipped so one bad page
    doesn't abort the whole run.

    Args:
        title_ids: IMDb title ids, e.g. ["tt0111161", ...].
        output_file: Path of the CSV file to create.
    """
    fieldnames = [
        "title_id", "title", "year", "rating", "rating_count",
        "genres", "directors", "actors", "duration", "content_rating",
    ]
    # encoding="utf-8" keeps non-ASCII titles intact on every platform.
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        # scrape_imdb_title also returns "type" and "description", which are
        # not in fieldnames; without extrasaction="ignore", DictWriter
        # raises ValueError on the very first row.
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for title_id in title_ids:
            try:
                movie = scrape_imdb_title(title_id)
                if movie:
                    movie["title_id"] = title_id
                    # Flatten list fields so they fit a single CSV cell.
                    movie["genres"] = ", ".join(movie.get("genres", []))
                    movie["directors"] = ", ".join(movie.get("directors", []))
                    movie["actors"] = ", ".join(movie.get("actors", []))
                    writer.writerow(movie)
                    # Single-quoted key: nested double quotes inside an
                    # f-string are a SyntaxError before Python 3.12.
                    print(f"✓ {movie.get('title', title_id)}")
                time.sleep(random.uniform(2, 4))
            except Exception as e:
                print(f"✗ {title_id}: {e}")
                continue
# Build a dataset from the first 50 entries of the Top 250.
top_movies = scrape_top_250()
title_ids = []
for m in top_movies[:50]:
    # Turn ".../title/tt0111161/" back into the bare "tt0111161" id.
    title_ids.append(m["url"].split("/title/")[1].rstrip("/"))
build_movie_dataset(title_ids, "imdb_top50.csv")
The Easy Way: Pre-Built IMDb Scraper
Maintaining an IMDb scraper means keeping up with their HTML changes. If you need reliable data extraction without the maintenance, check out the IMDb Scraper on Apify. It handles all the parsing, proxy rotation, and anti-bot detection automatically.
from apify_client import ApifyClient

# Run the hosted scraper and stream its dataset back as clean JSON items.
client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("cryptosignals/imdb-scraper").call(
    run_input={
        "searchTerms": ["inception", "interstellar"],
        "maxItems": 100,
        "includeReviews": True,
    }
)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Single-quoted keys: nested double quotes inside an f-string are a
    # SyntaxError before Python 3.12.
    print(f"{item['title']} ({item['year']}) — ★{item['rating']}")
It returns clean JSON with all the fields you need — no HTML parsing required.
Legal Considerations
A few things to keep in mind:
- Respect robots.txt — IMDb's robots.txt restricts certain paths. Check it before scraping.
- Rate limit your requests — don't hammer their servers. 1 request per 2-5 seconds is reasonable.
- Don't redistribute copyrighted content — movie descriptions and reviews have copyright protections.
- Use data for analysis, not replication — building a competing movie database from scraped data could be problematic.
- IMDb has an official API — for commercial use, consider their data licensing options.
Conclusion
IMDb is one of the more scraper-friendly sites thanks to its server-rendered HTML and rich JSON-LD data. The BeautifulSoup examples above should handle most use cases.
For production scraping, use ScrapeOps or ThorData proxies to avoid IP blocks. And if you want to skip the code entirely, the IMDb Scraper on Apify does everything out of the box.
Happy scraping!
Top comments (0)