How to Scrape IMDB Movie Data in 2026: Ratings, Cast, Reviews, and Box Office

IMDB holds data on over 10 million titles — movies, TV shows, shorts, video games — making it the most comprehensive entertainment database on the web. If you're building a movie recommendation engine, analyzing box office trends, or researching the film industry, IMDB data is often the starting point.

But IMDB doesn't offer a public API. It used to have one, but shut it down years ago, and the alternatives range from limited to expensive. In this guide, I'll cover the practical ways to get IMDB data in 2026 — from official datasets to web scraping.

IMDB's Official Data Sources

Before you start scraping, know what's available officially. IMDB provides two legitimate data channels:

1. IMDB Non-Commercial Datasets

IMDB publishes daily TSV (tab-separated) dumps at datasets.imdbws.com. These are free for non-commercial use and contain:

  • title.basics.tsv.gz — Title ID, type, name, year, runtime, genres
  • title.ratings.tsv.gz — Average rating and vote count per title
  • title.crew.tsv.gz — Directors and writers per title
  • title.principals.tsv.gz — Top-billed cast and crew
  • name.basics.tsv.gz — Person names, birth/death years, known-for titles

Here's how to download and work with them:

import pandas as pd
import requests
import gzip
import io

def load_imdb_dataset(filename):
    """Download and load an IMDB dataset into a DataFrame."""
    url = f"https://datasets.imdbws.com/{filename}"
    print(f"Downloading {filename}...")

    response = requests.get(url, stream=True)
    response.raise_for_status()

    # Decompress and read TSV
    with gzip.open(io.BytesIO(response.content), 'rt', encoding='utf-8') as f:
        df = pd.read_csv(f, sep='\t', na_values='\\N', low_memory=False)

    print(f"Loaded {len(df):,} rows")
    return df

# Load titles and ratings
titles = load_imdb_dataset("title.basics.tsv.gz")
ratings = load_imdb_dataset("title.ratings.tsv.gz")

# Merge them
movies = titles[titles['titleType'] == 'movie'].merge(ratings, on='tconst', how='inner')

# Top rated movies with significant votes
top_movies = movies[movies['numVotes'] >= 50000].nlargest(20, 'averageRating')
for _, movie in top_movies.iterrows():
    print(f"{movie['primaryTitle']} ({movie['startYear']}) - {movie['averageRating']}/10 ({movie['numVotes']:,.0f} votes)")

These datasets are solid for bulk analysis — ratings, genres, cast relationships. But they're missing a lot: plot summaries, reviews, box office numbers, poster images, and detailed cast info.

2. IMDB API (Commercial)

Amazon offers a commercial IMDB API through AWS Data Exchange. It's comprehensive but priced for enterprise customers (we're talking thousands per month). Unless you have a significant budget, this isn't practical for most projects.

Scraping IMDB Movie Pages

For the data that's not in the official datasets, scraping individual movie pages is the go-to approach. IMDB's pages are mostly server-rendered HTML, which makes them relatively straightforward to parse.

import requests
from bs4 import BeautifulSoup
import json
import time

def scrape_movie(imdb_id):
    """Scrape detailed movie data from an IMDB title page."""
    url = f"https://www.imdb.com/title/{imdb_id}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # IMDB embeds structured data as JSON-LD
    script_tag = soup.find('script', type='application/ld+json')
    if script_tag:
        structured_data = json.loads(script_tag.string)
    else:
        structured_data = {}

    movie = {
        "imdb_id": imdb_id,
        "title": structured_data.get("name", ""),
        "description": structured_data.get("description", ""),
        "rating": structured_data.get("aggregateRating", {}).get("ratingValue"),
        "vote_count": structured_data.get("aggregateRating", {}).get("ratingCount"),
        "genres": structured_data.get("genre", []),
        "director": extract_people(structured_data.get("director", [])),
        "actors": extract_people(structured_data.get("actor", [])),
        "duration": structured_data.get("duration", ""),
        "content_rating": structured_data.get("contentRating", ""),
        "poster_url": structured_data.get("image", ""),
        "date_published": structured_data.get("datePublished", "")
    }

    return movie

def extract_people(data):
    """Extract names from IMDB's JSON-LD person entries."""
    if isinstance(data, dict):
        return [data.get("name", "")]
    elif isinstance(data, list):
        return [p.get("name", "") for p in data if isinstance(p, dict)]
    return []

# Example: Scrape The Shawshank Redemption
movie = scrape_movie("tt0111161")
for key, value in movie.items():
    print(f"{key}: {value}")

The JSON-LD approach is the real trick here. Instead of relying on brittle CSS selectors, you read the structured data IMDB embeds in a `<script type="application/ld+json">` tag following the Schema.org format. This is much more stable than DOM parsing because it's a standardized format that IMDB maintains for SEO purposes.
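One detail worth knowing about that payload: the `duration` field is an ISO 8601 duration string (e.g. `PT2H22M`), not a number of minutes. A small helper can normalize it. This is a sketch that only handles the hour and minute components:

```python
import re

def parse_iso_duration(duration):
    """Convert an ISO 8601 duration like 'PT2H22M' into total minutes."""
    if not duration:
        return None
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?$', duration)
    if not match:
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

# parse_iso_duration("PT2H22M") -> 142
```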

Extracting Box Office Data

Box office data lives on a separate section of each movie page. It's not in the JSON-LD, so you'll need to parse the HTML:

import re

def scrape_box_office(imdb_id):
    """Scrape box office information from IMDB."""
    url = f"https://www.imdb.com/title/{imdb_id}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }

    response = requests.get(url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    box_office = {}

    # Box office details live in a dedicated section identified by data-testid
    box_office_section = soup.find('div', {'data-testid': 'title-boxoffice-section'})
    if box_office_section:
        items = box_office_section.find_all('li', {'data-testid': re.compile('title-boxoffice-')})
        for item in items:
            label_el = item.find('span', class_='ipc-metadata-list-item__label')
            value_el = item.find('span', class_='ipc-metadata-list-item__list-content-item')
            if label_el and value_el:
                label = label_el.text.strip()
                value = value_el.text.strip()
                box_office[label] = value

    return box_office
box_office = scrape_box_office("tt0111161")
for metric, value in box_office.items():
    print(f"{metric}: {value}")
# Budget: $25,000,000 (estimated)
# Gross US & Canada: $58,500,000
# Opening weekend US & Canada: $727,327
# Gross worldwide: $73,300,000

Note that box office data is only available for theatrical releases. Streaming-only titles won't have this section.
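The values also come back as display strings like `$25,000,000 (estimated)`. If you want numbers for analysis, a small normalizer helps. A sketch, with the caveat that non-US releases may report amounts in other currencies, which this deliberately ignores:

```python
import re

def parse_money(value):
    """Extract the numeric amount from a string like '$25,000,000 (estimated)'."""
    match = re.search(r'\d[\d,]*', value)
    if not match:
        return None
    return int(match.group().replace(',', ''))

# parse_money("$25,000,000 (estimated)") -> 25000000
```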

Scraping User Reviews at Scale

IMDB reviews are paginated and require handling the load-more pattern. Here's how to collect them:

def scrape_reviews(imdb_id, max_reviews=100):
    """Scrape user reviews from IMDB."""
    url = f"https://www.imdb.com/title/{imdb_id}/reviews"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }

    reviews = []
    params = {}

    while len(reviews) < max_reviews:
        response = requests.get(url, headers=headers, params=params)
        soup = BeautifulSoup(response.text, 'html.parser')

        review_containers = soup.find_all('div', class_='review-container')
        if not review_containers:
            break

        for container in review_containers:
            review = {}

            # Rating (out of 10)
            rating_el = container.find('span', class_='rating-other-user-rating')
            if rating_el:
                review['rating'] = rating_el.find('span').text.strip()

            # Title
            title_el = container.find('a', class_='title')
            if title_el:
                review['title'] = title_el.text.strip()

            # Review text
            content_el = container.find('div', class_='text')
            if content_el:
                review['text'] = content_el.text.strip()

            # Date
            date_el = container.find('span', class_='review-date')
            if date_el:
                review['date'] = date_el.text.strip()

            # Helpfulness
            helpful_el = container.find('div', class_='actions')
            if helpful_el:
                review['helpful'] = helpful_el.text.strip()

            reviews.append(review)

        # Check for pagination key
        load_more = soup.find('div', class_='load-more-data')
        if load_more and load_more.get('data-key'):
            params['paginationKey'] = load_more['data-key']
        else:
            break

        time.sleep(1)  # Be respectful

    return reviews[:max_reviews]

reviews = scrape_reviews("tt0111161", max_reviews=50)
print(f"Collected {len(reviews)} reviews")
for r in reviews[:3]:
    print(f"  [{r.get('rating', 'N/A')}/10] {r.get('title', 'No title')}")

Building a Movie Dataset Pipeline

In practice, you'll want to combine the official datasets with scraped data. Here's a pattern that works well:

import time

def build_movie_dataset(genre='Action', min_votes=10000, min_year=2020, limit=200):
    """
    Build a rich movie dataset by combining IMDB dumps with scraped data.

    Step 1: Filter candidates from the official dataset (fast, no scraping)
    Step 2: Enrich selected movies by scraping individual pages (slow, throttled)
    """

    # Step 1: Load and filter from official data
    print("Loading official IMDB datasets...")
    titles = load_imdb_dataset("title.basics.tsv.gz")
    ratings = load_imdb_dataset("title.ratings.tsv.gz")

    movies = titles.merge(ratings, on='tconst')
    filtered = movies[
        (movies['titleType'] == 'movie') &
        (movies['genres'].str.contains(genre, na=False)) &
        (movies['numVotes'] >= min_votes) &
        (movies['startYear'].astype(float) >= min_year)
    ].nlargest(limit, 'numVotes')

    print(f"Found {len(filtered)} candidate movies")

    # Step 2: Enrich with scraped data
    enriched = []
    for idx, row in filtered.iterrows():
        imdb_id = row['tconst']
        print(f"Scraping {row['primaryTitle']} ({imdb_id})...")

        try:
            movie_data = scrape_movie(imdb_id)
            box_office = scrape_box_office(imdb_id)
            movie_data['box_office'] = box_office
            movie_data['official_rating'] = row['averageRating']
            movie_data['official_votes'] = row['numVotes']
            enriched.append(movie_data)
        except Exception as e:
            print(f"  Error: {e}")

        # Respect rate limits — 1 request per second
        time.sleep(1.5)

    return enriched

# Build dataset
dataset = build_movie_dataset(genre='Sci-Fi', min_year=2023, limit=50)
print(f"\nBuilt dataset with {len(dataset)} enriched movies")

This two-step approach is efficient: use the official bulk data for filtering and basic info, then only scrape the pages you actually need.
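Once the enrichment loop finishes, persist the result so you don't re-scrape on every run. A minimal sketch using JSON (the `movies.json` filename is arbitrary):

```python
import json

def save_dataset(dataset, path="movies.json"):
    """Write the enriched movie list to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dataset, f, ensure_ascii=False, indent=2)

def load_dataset(path="movies.json"):
    """Read a previously saved movie list back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

JSON preserves the nested box office dictionaries as-is; flatten to CSV only if you need spreadsheet-friendly output.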

Handling Anti-Scraping Measures

IMDB isn't as aggressive as some sites, but you'll still hit issues at scale:

  • Request throttling: Keep requests to 1-2 per second. Faster than that and you'll start getting 503 errors.
  • User-Agent rotation: While a single realistic User-Agent works for small jobs, rotate through a pool for larger ones.
  • Session management: IMDB may require cookies for some pages. Use requests.Session() to maintain cookies across requests.

def create_scraping_session():
    """Create a requests session with proper headers for IMDB."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive"
    })

    # Hit the homepage first to get cookies
    session.get("https://www.imdb.com/")

    return session

session = create_scraping_session()
# Use session.get() instead of requests.get() for subsequent requests
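When you do hit a 503 despite throttling, backing off and retrying usually recovers. Here's a sketch of exponential backoff around a session GET; the retry counts and delays are reasonable defaults I've picked, not anything IMDB documents:

```python
import time

def fetch_with_retry(session, url, max_retries=4, base_delay=2.0):
    """GET a URL through the session, retrying with exponential backoff on 429/503."""
    for attempt in range(max_retries):
        response = session.get(url, timeout=30)
        if response.status_code not in (429, 503):
            response.raise_for_status()
            return response
        # Back off: 2s, 4s, 8s, 16s
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")
```

Use it in place of bare session.get() calls inside the scraping functions.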

Using Pre-built IMDB Scrapers

If maintaining your own scraper isn't appealing (and honestly, it shouldn't be for most use cases), pre-built solutions save significant time.

The IMDB Scraper on Apify is one option that handles all the edge cases — pagination, anti-bot measures, structured output — and gives you clean JSON for any IMDB title. You provide movie URLs or search queries, and it returns ratings, cast, reviews, box office data, and more in a structured format. It's particularly useful if you need data on hundreds or thousands of movies without managing infrastructure.

For one-off research projects, the official datasets combined with targeted scraping might be enough. For ongoing data pipelines, a managed scraper saves you from playing whack-a-mole with layout changes.

Ethical Considerations and Legal Notes

A few things to keep in mind:

  • IMDB's terms of service prohibit scraping for commercial purposes without a license. The official datasets are explicitly for non-commercial use.
  • Copyright: IMDB's editorial content (reviews, summaries written by staff) is copyrighted. User-submitted reviews have different considerations.
  • Amazon owns IMDB, and they have the legal resources to enforce their terms. If you're building a commercial product, look into the official AWS IMDB API.
  • Rate limit respectfully: Even for non-commercial scraping, hammering their servers isn't appropriate. One request per second is a reasonable default.
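
On the rate-limiting point, checking robots.txt before a crawl is a cheap courtesy, and Python's standard library handles the parsing. A sketch; the rules below are a made-up example, and whether a given IMDB path is allowed depends on their live robots.txt:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt rules, given the robots.txt body as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt body for illustration
rules = "User-agent: *\nDisallow: /search/\n"
is_allowed(rules, "https://www.imdb.com/title/tt0111161/")  # True
is_allowed(rules, "https://www.imdb.com/search/title/")     # False
```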

Conclusion

Scraping IMDB in 2026 is a solved problem with multiple valid approaches. For bulk analysis, start with the official TSV datasets — they're free, comprehensive for basic data, and don't require any scraping. When you need richer data (reviews, box office, full cast details), targeted scraping of individual pages using the JSON-LD structured data is the most reliable method.

For production pipelines, consider a managed scraper like the IMDB Scraper on Apify to avoid the maintenance burden. Whatever approach you choose, combine the official datasets with scraped data rather than scraping everything — it's faster, more reliable, and more respectful of IMDB's infrastructure.

The entertainment data space is rich for analysis. From tracking box office trends to building recommendation systems, IMDB data is the foundation. Now you have the tools to collect it.


Building something with IMDB data? I'd love to hear about it in the comments. Whether it's a recommendation engine, a data viz project, or market research — share what you're working on.
