agenthustler

Posted on Mar 20

Scraping IMDb in 2026: Movie Ratings, Cast & Top 250 Without API Keys

#webscraping #python #tutorial #movies

The Hidden Goldmine in IMDb Pages

Every IMDb page hides a treasure in plain sight. Buried in the HTML of every movie, TV show, and episode page is a <script type="application/ld+json"> tag containing rich, structured metadata in JSON format. No API key required. No rate-limited endpoints. No parsing fragile HTML tables.

This is the same structured data that Google, Bing, and other search engines use to populate those rich movie cards you see in search results. And it's freely accessible to anyone who fetches the page.

Let me show you how to use it.

What's Inside the JSON-LD?

Here's what IMDb embeds in every title page as schema.org structured data:

Title and year — exact release year and original title
IMDb rating — the aggregate score and total vote count
Genre — array of genre classifications
Duration — in ISO 8601 format (e.g., PT2H28M for 2 hours 28 minutes)
Description — plot summary
Director and creators — with links to their IMDb profiles
Cast — top-billed actors with profile links
Content rating — PG, PG-13, R, etc.
Date published — original release date
Keywords — plot keywords for the title

That's a remarkable amount of structured data, and it's consistent across millions of titles.

Method 1: Quick Extraction with curl + jq

For one-off lookups, you don't even need Python. curl and a bit of text processing will do:

# Fetch Inception's page and extract the JSON-LD
curl -s "https://www.imdb.com/title/tt1375666/"   -H "User-Agent: Mozilla/5.0 (compatible)"   -H "Accept-Language: en-US" |   grep -o '<script type="application/ld+json">[^<]*</script>' |   sed 's/<script type="application\/ld+json">//;s/<\/script>//' |   python3 -m json.tool

This outputs clean, formatted JSON with all the metadata. Pipe it through jq for specific fields:

# Just get the rating
... | jq '.aggregateRating.ratingValue'
# Output: 8.8

# Get the cast names
... | jq '.actor[].name'
# Output: "Leonardo DiCaprio"
#         "Joseph Gordon-Levitt"
#         ...

Method 2: Python Extraction at Scale

For scraping multiple titles, Python with httpx and selectolax (a fast HTML parser) is the way to go:

import httpx
import json
from selectolax.parser import HTMLParser

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

def scrape_imdb_title(title_id: str) -> dict:
    """Extract structured data from an IMDb title page."""
    url = f"https://www.imdb.com/title/{title_id}/"

    with httpx.Client() as client:
        resp = client.get(url, headers=HEADERS, follow_redirects=True)
        resp.raise_for_status()

    tree = HTMLParser(resp.text)
    ld_script = tree.css_first('script[type="application/ld+json"]')

    if not ld_script:
        return {"error": "No JSON-LD found", "title_id": title_id}

    data = json.loads(ld_script.text())

    # Normalize the output
    return {
        "id": title_id,
        "title": data.get("name"),
        "year": data.get("datePublished", "")[:4],
        "rating": data.get("aggregateRating", {}).get("ratingValue"),
        "ratingCount": data.get("aggregateRating", {}).get("ratingCount"),
        "genre": data.get("genre", []),
        "duration": data.get("duration"),
        "description": data.get("description"),
        "contentRating": data.get("contentRating"),
        "director": [
            {"name": d.get("name"), "url": d.get("url")}
            for d in (data.get("director", []) if isinstance(data.get("director"), list) else [data.get("director", {})])
        ],
        "cast": [
            {"name": a.get("name"), "url": a.get("url")}
            for a in data.get("actor", [])
        ]
    }

# Example usage
movie = scrape_imdb_title("tt1375666")
print(f"{movie['title']} ({movie['year']}) - {movie['rating']}/10 ({movie['ratingCount']} votes)")
print(f"Genre: {', '.join(movie['genre'])}")
print(f"Director: {movie['director'][0]['name']}")
print(f"Cast: {', '.join(a['name'] for a in movie['cast'][:5])}")

Output:

Inception (2010) - 8.8/10 (2400000 votes)
Genre: Action, Adventure, Sci-Fi
Director: Christopher Nolan
Cast: Leonardo DiCaprio, Joseph Gordon-Levitt, Elliot Page, Tom Hardy, Ken Watanabe

Scraping the Top 250

IMDb's Top 250 page lists the highest-rated movies. Here's how to scrape all of them:

import time

def get_top_250_ids() -> list[str]:
    """Extract title IDs from the IMDb Top 250 page."""
    url = "https://www.imdb.com/chart/top/"

    with httpx.Client() as client:
        resp = client.get(url, headers=HEADERS, follow_redirects=True)

    tree = HTMLParser(resp.text)
    ld_script = tree.css_first('script[type="application/ld+json"]')

    if not ld_script:
        return []

    data = json.loads(ld_script.text())
    items = data.get("itemListElement", [])

    ids = []
    for item in items:
        url = item.get("item", {}).get("url", "")
        # Extract tt1234567 from the URL
        if "/title/" in url:
            tid = url.split("/title/")[1].strip("/")
            ids.append(tid)

    return ids

def scrape_top_250():
    """Scrape all Top 250 movies with rate limiting."""
    ids = get_top_250_ids()
    print(f"Found {len(ids)} titles in Top 250")

    results = []
    for i, tid in enumerate(ids):
        movie = scrape_imdb_title(tid)
        results.append(movie)
        print(f"[{i+1}/{len(ids)}] {movie.get('title')} - {movie.get('rating')}")
        time.sleep(1)  # Be polite

    return results

Pro Tips for Reliable Scraping

1. Rotate User-Agents — IMDb will eventually block a single User-Agent string making thousands of requests. Keep a list and rotate:

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

headers = {**HEADERS, "User-Agent": random.choice(USER_AGENTS)}

2. Add Delays — Be respectful. 1-2 seconds between requests is sufficient and keeps you under the radar.

3. Handle Missing Data — Not all titles have all fields. The JSON-LD for a TV episode looks different from a feature film. Always use .get() with defaults.

4. Cache Results — IMDb ratings don't change by the minute. Cache title data for at least 24 hours to avoid unnecessary requests.

The Easy Way: Use Our Apify Actor

Don't want to maintain scraping code? We built IMDb Scraper as an Apify actor that handles all of this:

JSON-LD extraction with fallback parsing
Automatic rate limiting and retry logic
Top 250 Movies and TV Shows charts
Search functionality
Clean JSON/CSV/Excel export
Scheduled runs

{
  "mode": "top250movies"
}

One click, full dataset. No infrastructure to manage.

What Can You Build?

Movie recommendation engine — use genre, cast, and rating data for collaborative filtering
"What to watch" tool — cross-reference IMDb ratings with streaming availability
Film study datasets — analyze trends in genres, ratings, or cast networks over decades
Content calendar — track upcoming releases and their early ratings
Price comparison — find where to watch top-rated content for the best price

Wrapping Up

IMDb's JSON-LD structured data makes movie scraping remarkably clean and reliable. Unlike traditional HTML scraping that breaks with every redesign, JSON-LD follows the schema.org standard and rarely changes format.

Whether you use the Python code above or our IMDb Scraper on Apify, you now have everything you need to build with movie data.

What are you building? Share your projects in the comments — I'm always curious to see what people do with entertainment data.

DEV Community