IMDB holds data on over 10 million titles — movies, TV shows, shorts, video games — making it the most comprehensive entertainment database on the web. If you're building a movie recommendation engine, analyzing box office trends, or researching the film industry, IMDB data is often the starting point.
But IMDB doesn't offer a free public API. They had one years ago, but shut it down, and the alternatives range from limited to expensive. In this guide, I'll cover the practical ways to get IMDB data in 2026 — from official datasets to web scraping.
IMDB's Official Data Sources
Before you start scraping, know what's available officially. IMDB provides two legitimate data channels:
1. IMDB Non-Commercial Datasets
IMDB publishes daily TSV (tab-separated) dumps at datasets.imdbws.com. These are free for non-commercial use and contain:
- `title.basics.tsv.gz` — Title ID, type, name, year, runtime, genres
- `title.ratings.tsv.gz` — Average rating and vote count per title
- `title.crew.tsv.gz` — Directors and writers per title
- `title.principals.tsv.gz` — Top-billed cast and crew
- `name.basics.tsv.gz` — Person names, birth/death years, known-for titles
Here's how to download and work with them:
```python
import pandas as pd
import requests
import gzip
import io

def load_imdb_dataset(filename):
    """Download and load an IMDB dataset into a DataFrame."""
    url = f"https://datasets.imdbws.com/{filename}"
    print(f"Downloading {filename}...")
    response = requests.get(url, stream=True)
    response.raise_for_status()
    # Decompress and read TSV ('\N' is IMDB's null marker)
    with gzip.open(io.BytesIO(response.content), 'rt', encoding='utf-8') as f:
        df = pd.read_csv(f, sep='\t', na_values='\\N', low_memory=False)
    print(f"Loaded {len(df):,} rows")
    return df

# Load titles and ratings
titles = load_imdb_dataset("title.basics.tsv.gz")
ratings = load_imdb_dataset("title.ratings.tsv.gz")

# Merge them
movies = titles[titles['titleType'] == 'movie'].merge(ratings, on='tconst', how='inner')

# Top rated movies with significant votes
top_movies = movies[movies['numVotes'] >= 50000].nlargest(20, 'averageRating')
for _, movie in top_movies.iterrows():
    print(f"{movie['primaryTitle']} ({movie['startYear']}) - {movie['averageRating']}/10 ({movie['numVotes']:,.0f} votes)")
```
These datasets are solid for bulk analysis — ratings, genres, cast relationships. But they're missing a lot: plot summaries, reviews, box office numbers, poster images, and detailed cast info.
2. IMDB API (Commercial)
Amazon offers a commercial IMDB API through AWS Data Exchange. It's comprehensive but priced for enterprise customers (we're talking thousands per month). Unless you have a significant budget, this isn't practical for most projects.
Scraping IMDB Movie Pages
For the data that's not in the official datasets, scraping individual movie pages is the go-to approach. IMDB's pages are mostly server-rendered HTML, which makes them relatively straightforward to parse.
```python
import requests
from bs4 import BeautifulSoup
import json
import time

def scrape_movie(imdb_id):
    """Scrape detailed movie data from an IMDB title page."""
    url = f"https://www.imdb.com/title/{imdb_id}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # IMDB embeds structured data as JSON-LD
    script_tag = soup.find('script', type='application/ld+json')
    if script_tag:
        structured_data = json.loads(script_tag.string)
    else:
        structured_data = {}

    movie = {
        "imdb_id": imdb_id,
        "title": structured_data.get("name", ""),
        "description": structured_data.get("description", ""),
        "rating": structured_data.get("aggregateRating", {}).get("ratingValue"),
        "vote_count": structured_data.get("aggregateRating", {}).get("ratingCount"),
        "genres": structured_data.get("genre", []),
        "director": extract_people(structured_data.get("director", [])),
        "actors": extract_people(structured_data.get("actor", [])),
        "duration": structured_data.get("duration", ""),
        "content_rating": structured_data.get("contentRating", ""),
        "poster_url": structured_data.get("image", ""),
        "date_published": structured_data.get("datePublished", "")
    }
    return movie

def extract_people(data):
    """Extract names from IMDB's JSON-LD person entries."""
    if isinstance(data, dict):
        return [data.get("name", "")]
    elif isinstance(data, list):
        return [p.get("name", "") for p in data if isinstance(p, dict)]
    return []

# Example: Scrape The Shawshank Redemption
movie = scrape_movie("tt0111161")
for key, value in movie.items():
    print(f"{key}: {value}")
```
The JSON-LD approach is the real trick here. Instead of relying on brittle CSS selectors, you can read the structured data IMDB embeds in a `<script type="application/ld+json">` tag following the Schema.org format. This is much more stable than DOM parsing because it's a standardized format that IMDB maintains for SEO purposes.
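To make that concrete, here's an abridged, hypothetical sketch of the kind of JSON-LD payload a title page embeds, parsed offline. The field names follow Schema.org conventions; real payloads carry many more fields, and the values below are invented for illustration:

```python
import json

# Abridged, made-up example of a title page's JSON-LD payload.
sample_jsonld = """
{
  "@context": "https://schema.org",
  "@type": "Movie",
  "name": "Example Movie",
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": 8.7, "ratingCount": 1200000},
  "genre": ["Drama"],
  "director": [{"@type": "Person", "name": "Jane Doe"}]
}
"""

data = json.loads(sample_jsonld)
print(data["name"])                            # Example Movie
print(data["aggregateRating"]["ratingValue"])  # 8.7
print([p["name"] for p in data["director"]])   # ['Jane Doe']
```

Once you see the shape, the `.get()` chains in `scrape_movie` above are just defensive navigation of this structure.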
Extracting Box Office Data
Box office data lives on a separate section of each movie page. It's not in the JSON-LD, so you'll need to parse the HTML:
```python
import re
import requests
from bs4 import BeautifulSoup

def scrape_box_office(imdb_id):
    """Scrape box office information from IMDB."""
    url = f"https://www.imdb.com/title/{imdb_id}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    box_office = {}

    # Look for the box office section
    box_office_section = soup.find('div', {'data-testid': 'title-boxoffice-section'})
    if box_office_section:
        items = box_office_section.find_all('li', {'data-testid': re.compile('title-boxoffice-')})
        for item in items:
            label_el = item.find('span', class_='ipc-metadata-list-item__label')
            value_el = item.find('span', class_='ipc-metadata-list-item__list-content-item')
            if label_el and value_el:
                label = label_el.text.strip()
                value = value_el.text.strip()
                box_office[label] = value
    return box_office

box_office = scrape_box_office("tt0111161")
for metric, value in box_office.items():
    print(f"{metric}: {value}")
# Budget: $25,000,000 (estimated)
# Gross US & Canada: $58,500,000
# Opening weekend US & Canada: $727,327
# Gross worldwide: $73,300,000
```
Note that box office data is only available for theatrical releases. Streaming-only titles won't have this section.
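One practical wrinkle: the values come back as display strings like "$25,000,000 (estimated)". If you want numbers for analysis, a small helper can normalize them. This is a sketch (the `parse_money` name is mine, not IMDB's), and it only handles plain dollar amounts; IMDB lists some figures in other currencies:

```python
import re

def parse_money(value):
    """Convert an IMDB box office display string to an integer dollar amount.

    Returns None if no dollar figure is found. Assumption: only plain
    $-denominated amounts are handled; other currencies fall through.
    """
    match = re.search(r'\$([\d,]+)', value)
    if not match:
        return None
    return int(match.group(1).replace(',', ''))

print(parse_money("$25,000,000 (estimated)"))  # 25000000
print(parse_money("$727,327"))                 # 727327
print(parse_money("N/A"))                      # None
```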
Scraping User Reviews at Scale
IMDB reviews are paginated and require handling the load-more pattern. Here's how to collect them:
```python
def scrape_reviews(imdb_id, max_reviews=100):
    """Scrape user reviews from IMDB."""
    url = f"https://www.imdb.com/title/{imdb_id}/reviews"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }
    reviews = []
    params = {}
    while len(reviews) < max_reviews:
        response = requests.get(url, headers=headers, params=params)
        soup = BeautifulSoup(response.text, 'html.parser')
        review_containers = soup.find_all('div', class_='review-container')
        if not review_containers:
            break
        for container in review_containers:
            review = {}
            # Rating (out of 10)
            rating_el = container.find('span', class_='rating-other-user-rating')
            if rating_el:
                review['rating'] = rating_el.find('span').text.strip()
            # Title
            title_el = container.find('a', class_='title')
            if title_el:
                review['title'] = title_el.text.strip()
            # Review text
            content_el = container.find('div', class_='text')
            if content_el:
                review['text'] = content_el.text.strip()
            # Date
            date_el = container.find('span', class_='review-date')
            if date_el:
                review['date'] = date_el.text.strip()
            # Helpfulness
            helpful_el = container.find('div', class_='actions')
            if helpful_el:
                review['helpful'] = helpful_el.text.strip()
            reviews.append(review)
        # Check for pagination key
        load_more = soup.find('div', class_='load-more-data')
        if load_more and load_more.get('data-key'):
            params['paginationKey'] = load_more['data-key']
        else:
            break
        time.sleep(1)  # Be respectful
    return reviews[:max_reviews]

reviews = scrape_reviews("tt0111161", max_reviews=50)
print(f"Collected {len(reviews)} reviews")
for r in reviews[:3]:
    print(f"  [{r.get('rating', 'N/A')}/10] {r.get('title', 'No title')}")
```
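Once collected, you'll usually want the reviews on disk. Here's a minimal sketch using the standard library's `csv.DictWriter`, shown with sample rows standing in for real scraper output; since the scraper only sets fields it finds, `restval` fills in the blanks:

```python
import csv

# Sample rows standing in for the output of the reviews scraper above.
reviews = [
    {"rating": "10", "title": "A masterpiece", "text": "Loved it.", "date": "12 March 2024"},
    {"title": "No rating given", "text": "Still great."},  # fields can be missing
]

fieldnames = ["rating", "title", "text", "date", "helpful"]
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(reviews)

print(f"Wrote {len(reviews)} reviews to reviews.csv")
```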
Building a Movie Dataset Pipeline
In practice, you'll want to combine the official datasets with scraped data. Here's a pattern that works well:
```python
import time

def build_movie_dataset(genre='Action', min_votes=10000, min_year=2020, limit=200):
    """
    Build a rich movie dataset by combining IMDB dumps with scraped data.

    Step 1: Filter candidates from the official dataset (fast, no scraping)
    Step 2: Enrich selected movies by scraping individual pages (slow, throttled)
    """
    # Step 1: Load and filter from official data
    print("Loading official IMDB datasets...")
    titles = load_imdb_dataset("title.basics.tsv.gz")
    ratings = load_imdb_dataset("title.ratings.tsv.gz")
    movies = titles.merge(ratings, on='tconst')
    filtered = movies[
        (movies['titleType'] == 'movie') &
        (movies['genres'].str.contains(genre, na=False)) &
        (movies['numVotes'] >= min_votes) &
        (movies['startYear'].astype(float) >= min_year)
    ].nlargest(limit, 'numVotes')
    print(f"Found {len(filtered)} candidate movies")

    # Step 2: Enrich with scraped data
    enriched = []
    for _, row in filtered.iterrows():
        imdb_id = row['tconst']
        print(f"Scraping {row['primaryTitle']} ({imdb_id})...")
        try:
            movie_data = scrape_movie(imdb_id)
            movie_data['box_office'] = scrape_box_office(imdb_id)
            movie_data['official_rating'] = row['averageRating']
            movie_data['official_votes'] = row['numVotes']
            enriched.append(movie_data)
        except Exception as e:
            print(f"  Error: {e}")
        # Respect rate limits — roughly 1 request per second
        time.sleep(1.5)
    return enriched

# Build dataset
dataset = build_movie_dataset(genre='Sci-Fi', min_year=2023, limit=50)
print(f"\nBuilt dataset with {len(dataset)} enriched movies")
```
This two-step approach is efficient: use the official bulk data for filtering and basic info, then only scrape the pages you actually need.
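For persisting the enriched records, JSON Lines works well: one object per line, so you can append as you scrape and resume a partial run without rewriting the whole file. A minimal sketch, with sample records standing in for real pipeline output:

```python
import json

# Sample records standing in for the output of the dataset pipeline above.
dataset = [
    {"imdb_id": "tt0000001", "title": "Example One", "official_rating": 7.5},
    {"imdb_id": "tt0000002", "title": "Example Two", "official_rating": 8.1},
]

# Write one JSON object per line (JSONL)
with open("movies.jsonl", "w", encoding="utf-8") as f:
    for movie in dataset:
        f.write(json.dumps(movie, ensure_ascii=False) + "\n")

# Read it back to verify the round trip
with open("movies.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(f"Round-tripped {len(loaded)} records")
```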
Handling Anti-Scraping Measures
IMDB isn't as aggressive as some sites, but you'll still hit issues at scale:
- Request throttling: Keep requests to 1-2 per second. Faster than that and you'll start getting 503 errors.
- User-Agent rotation: While a single realistic User-Agent works for small jobs, rotate through a pool for larger ones.
- Session management: IMDB may require cookies for some pages. Use `requests.Session()` to maintain cookies across requests.
```python
def create_scraping_session():
    """Create a requests session with proper headers for IMDB."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive"
    })
    # Hit the homepage first to get cookies
    session.get("https://www.imdb.com/")
    return session

session = create_scraping_session()
# Use session.get() instead of requests.get() for subsequent requests
```
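If you do start seeing 429 or 503 responses despite throttling, retrying with exponential backoff usually gets you unstuck. Here's a sketch; `get_with_retries` is my own helper name, and the status codes and delays are assumptions to tune against what you actually observe:

```python
import time

def get_with_retries(session, url, max_retries=3, base_delay=2.0):
    """GET a URL, backing off exponentially on 429/503 responses.

    `session` is anything with a requests-style .get() method.
    Assumption: only 429 and 503 indicate throttling worth retrying.
    """
    for attempt in range(max_retries + 1):
        response = session.get(url)
        if response.status_code not in (429, 503):
            return response
        if attempt < max_retries:
            delay = base_delay * (2 ** attempt)  # 2s, 4s, 8s, ...
            time.sleep(delay)
    return response  # still throttled after all retries
```

Drop this in wherever the examples above call `session.get()` directly, and a transient throttle becomes a delay instead of a crash.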
Using Pre-built IMDB Scrapers
If maintaining your own scraper isn't appealing (and honestly, it shouldn't be for most use cases), pre-built solutions save significant time.
The IMDB Scraper on Apify is one option that handles all the edge cases — pagination, anti-bot measures, structured output — and gives you clean JSON for any IMDB title. You provide movie URLs or search queries, and it returns ratings, cast, reviews, box office data, and more in a structured format. It's particularly useful if you need data on hundreds or thousands of movies without managing infrastructure.
For one-off research projects, the official datasets combined with targeted scraping might be enough. For ongoing data pipelines, a managed scraper saves you from playing whack-a-mole with layout changes.
Ethical Considerations and Legal Notes
A few things to keep in mind:
- IMDB's terms of service prohibit scraping for commercial purposes without a license. The official datasets are explicitly for non-commercial use.
- Copyright: IMDB's editorial content (reviews, summaries written by staff) is copyrighted. User-submitted reviews have different considerations.
- Amazon owns IMDB, and they have the legal resources to enforce their terms. If you're building a commercial product, look into the official AWS IMDB API.
- Rate limit respectfully: Even for non-commercial scraping, hammering their servers isn't appropriate. One request per second is a reasonable default.
Conclusion
Scraping IMDB in 2026 is a solved problem with multiple valid approaches. For bulk analysis, start with the official TSV datasets — they're free, comprehensive for basic data, and don't require any scraping. When you need richer data (reviews, box office, full cast details), targeted scraping of individual pages using the JSON-LD structured data is the most reliable method.
For production pipelines, consider a managed scraper like the IMDB Scraper on Apify to avoid the maintenance burden. Whatever approach you choose, combine the official datasets with scraped data rather than scraping everything — it's faster, more reliable, and more respectful of IMDB's infrastructure.
The entertainment data space is rich for analysis. From tracking box office trends to building recommendation systems, IMDB data is the foundation. Now you have the tools to collect it.
Building something with IMDB data? I'd love to hear about it in the comments. Whether it's a recommendation engine, a data viz project, or market research — share what you're working on.