Netflix is one of the most data-rich entertainment platforms on the planet. While its internal recommendation engine and viewing statistics remain locked behind authentication, a surprising amount of Netflix data is publicly accessible without logging in. This data includes title catalogs, genre classifications, new release sections, and regional availability information that can be cross-referenced with external databases like IMDB.
In this guide, we'll explore exactly what Netflix data is publicly available, how it's structured on the web, and how to build reliable scrapers using both custom code and Apify's cloud scraping infrastructure to extract it at scale.
What Netflix Data Is Publicly Accessible?
Before writing any code, it's essential to understand what Netflix exposes without authentication. Many developers assume everything requires a logged-in session, but that's not the case.
Public Title Pages
Every Netflix title has a public-facing page at netflix.com/title/{id}. These pages are accessible without login and contain:
- Title name and original title (for international content)
- Synopsis/description — the short and long descriptions
- Cast and crew — actors, directors, writers
- Genre tags — Netflix's internal genre classification
- Maturity rating — TV-MA, PG-13, etc.
- Release year — when the content was originally released
- Type indicator — whether it's a movie, series, or documentary
- Thumbnail/poster images — high-resolution artwork URLs
Genre Browsing Pages
Netflix has a well-known system of genre codes. URLs like netflix.com/browse/genre/{code} expose category-level browsing. While some require authentication for full content, the genre structure itself is discoverable through public metadata and third-party databases that catalog Netflix's genre IDs.
Popular genre codes include:
| Genre Code | Category |
|---|---|
| 6839 | Action & Adventure |
| 33264 | Asian TV Shows |
| 1365 | Action Comedies |
| 7424 | Anime |
| 8933 | Classic Movies |
| 5763 | Dramas |
| 11881 | Thrillers |
| 2595 | Horror |
| 31574 | Reality TV |
Netflix reportedly uses tens of thousands of these micro-genre codes internally, many of which map to publicly accessible browse pages.
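Turning codes from the table above into browse URLs is a one-liner. A minimal sketch (the code-to-category map is just a sample; whether a given page renders without login can vary by region):

```python
# Sample Netflix genre codes from the table above
GENRE_CODES = {
    6839: "Action & Adventure",
    7424: "Anime",
    8933: "Classic Movies",
    5763: "Dramas",
    2595: "Horror",
}

def genre_browse_url(code: int) -> str:
    """Build the public browse URL for a Netflix genre code."""
    return f"https://www.netflix.com/browse/genre/{code}"

# Build a category -> URL lookup for crawling
urls = {name: genre_browse_url(code) for code, name in GENRE_CODES.items()}
# e.g. urls["Anime"] -> "https://www.netflix.com/browse/genre/7424"
```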
New Releases and Trending Sections
Netflix's media center (media.netflix.com) publishes press releases about new content additions. The "What's New" sections provide structured data about upcoming releases, including premiere dates, title descriptions, and regional launch schedules.
IMDB Cross-Referenced Data
While this data doesn't come from Netflix itself, Netflix titles can be matched to their IMDB entries (by name and year, or via third-party mappings that include IMDB IDs). This lets you enrich Netflix catalog data with:
- IMDB ratings and vote counts
- Full cast filmographies
- Box office data
- Awards history
- User reviews and sentiment data
Regional Availability Detection
Netflix's catalog varies dramatically by region. By requesting public title pages from different geographic endpoints, or by using services like uNoGS (unofficial Netflix Online Global Search) that aggregate this information, you can detect which titles are available in which countries. This is public data, derived from what Netflix serves to visitors in each region.
Understanding Netflix's Page Structure
Netflix renders most of its content dynamically using React. This means traditional HTTP request-based scraping will only get you the initial HTML shell. The actual content data is loaded through:
- Server-side rendered metadata — embedded in <script> tags as JSON-LD structured data
- Falcor API responses — Netflix uses Falcor (its open-source data fetching framework) to load content data
- Open Graph meta tags — title, description, and image data in <meta> tags
For public pages, the most reliable extraction targets are the JSON-LD structured data and Open Graph tags, as these are rendered server-side for SEO purposes.
Setting Up Your Scraping Environment
Python Setup
# requirements.txt
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
apify-client==1.8.1
Install dependencies:
pip install requests beautifulsoup4 lxml apify-client
Node.js Setup
npm init -y
npm install axios cheerio crawlee apify-client
Building a Netflix Title Scraper
Python Implementation
import requests
from bs4 import BeautifulSoup
import json
import re
import time
import random


class NetflixPublicScraper:
    """Scraper for publicly accessible Netflix title data."""

    BASE_URL = "https://www.netflix.com/title"

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/120.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml",
        })

    @staticmethod
    def _names(value):
        """Normalize a schema.org person property (dict or list) to names."""
        if isinstance(value, dict):
            value = [value]
        return [
            item.get("name")
            for item in (value or [])
            if isinstance(item, dict)
        ]

    def scrape_title(self, title_id: str) -> dict:
        """Extract public metadata from a Netflix title page."""
        url = f"{self.BASE_URL}/{title_id}"
        try:
            response = self.session.get(url, timeout=15)
            response.raise_for_status()
        except requests.RequestException as e:
            return {"error": str(e), "title_id": title_id}

        soup = BeautifulSoup(response.text, "lxml")
        data = {"title_id": title_id, "url": url}

        # Extract Open Graph metadata
        og_tags = {
            "title": "og:title",
            "description": "og:description",
            "image": "og:image",
            "type": "og:type",
            "site_name": "og:site_name",
        }
        for key, property_name in og_tags.items():
            tag = soup.find("meta", property=property_name)
            data[key] = tag["content"] if tag else None

        # Extract JSON-LD structured data
        json_ld_scripts = soup.find_all("script", type="application/ld+json")
        for script in json_ld_scripts:
            try:
                ld_data = json.loads(script.string)
            except (json.JSONDecodeError, TypeError):
                continue
            if not isinstance(ld_data, dict):
                continue
            data["structured_data"] = ld_data
            data["genre"] = ld_data.get("genre", [])
            # Schema.org pages use either "actor" or the legacy "actors" key
            data["actors"] = self._names(
                ld_data.get("actors") or ld_data.get("actor")
            )
            data["director"] = self._names(ld_data.get("director"))
            data["content_rating"] = ld_data.get("contentRating")
            data["date_created"] = ld_data.get("dateCreated")

        # Extract additional metadata from page scripts
        for script in soup.find_all("script"):
            if script.string and "reactContext" in script.string:
                context_match = re.search(
                    r'reactContext\s*=\s*({.+?});', script.string
                )
                if context_match:
                    try:
                        context = json.loads(context_match.group(1))
                        data["country"] = (
                            context.get("models", {})
                            .get("geoInfo", {})
                            .get("data", {})
                            .get("country")
                        )
                    except json.JSONDecodeError:
                        pass
        return data

    def scrape_multiple(self, title_ids: list, delay_range=(1, 3)) -> list:
        """Scrape multiple titles with polite delays."""
        results = []
        for i, title_id in enumerate(title_ids):
            print(f"Scraping {i + 1}/{len(title_ids)}: {title_id}")
            results.append(self.scrape_title(title_id))
            if i < len(title_ids) - 1:
                time.sleep(random.uniform(*delay_range))
        return results


# Usage example
if __name__ == "__main__":
    scraper = NetflixPublicScraper()
    sample_ids = [
        "80100172",  # Stranger Things
        "80057281",  # Narcos
        "70143836",  # Breaking Bad
    ]
    results = scraper.scrape_multiple(sample_ids)
    for result in results:
        print(f"\nTitle: {result.get('title', 'N/A')}")
        print(f"Description: {(result.get('description') or '')[:100]}...")
        print(f"Genres: {result.get('genre', [])}")
        print(f"Rating: {result.get('content_rating', 'N/A')}")
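Once `scrape_multiple` has returned its list of dicts, you will usually want to persist them. Here's a small standalone helper (hypothetical, not part of the scraper above) that writes results to JSON and flattens list-valued fields for CSV export:

```python
import csv
import json

def save_results(results: list, json_path: str, csv_path: str) -> list:
    """Write scraper results to JSON and a flattened CSV; return CSV rows."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    # Flatten list-valued fields (genre, actors) into pipe-joined strings
    fields = ["title_id", "title", "content_rating", "genre", "actors"]
    rows = []
    for result in results:
        row = {}
        for field in fields:
            value = result.get(field)
            if isinstance(value, list):
                value = "|".join(str(v) for v in value)
            row[field] = value
        rows.append(row)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Pipe-joining keeps multi-valued fields in a single CSV cell without fighting comma escaping; pick whatever delimiter suits your downstream tooling.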
Node.js Implementation
const axios = require('axios');
const cheerio = require('cheerio');

class NetflixPublicScraper {
  constructor() {
    this.baseUrl = 'https://www.netflix.com/title';
    this.headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        + 'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
      'Accept-Language': 'en-US,en;q=0.9',
      'Accept': 'text/html,application/xhtml+xml',
    };
  }

  async scrapeTitle(titleId) {
    const url = `${this.baseUrl}/${titleId}`;
    try {
      const { data: html } = await axios.get(url, {
        headers: this.headers,
        timeout: 15000,
      });
      const $ = cheerio.load(html);
      const result = { titleId, url };

      // Open Graph metadata
      result.title = $('meta[property="og:title"]').attr('content');
      result.description = $('meta[property="og:description"]').attr('content');
      result.image = $('meta[property="og:image"]').attr('content');

      // JSON-LD structured data
      $('script[type="application/ld+json"]').each((_, el) => {
        try {
          const ldData = JSON.parse($(el).html());
          if (ldData && typeof ldData === 'object') {
            result.structuredData = ldData;
            result.genre = ldData.genre || [];
            result.contentRating = ldData.contentRating;
            result.dateCreated = ldData.dateCreated;
            // Schema.org pages use either "actor" or the legacy "actors" key
            result.actors = (ldData.actors || ldData.actor || [])
              .filter((a) => a.name)
              .map((a) => a.name);
            result.director = (ldData.director || [])
              .filter((d) => d.name)
              .map((d) => d.name);
          }
        } catch (e) {
          // Skip malformed JSON-LD
        }
      });

      return result;
    } catch (error) {
      return { error: error.message, titleId };
    }
  }

  async scrapeMultiple(titleIds, delayMs = 2000) {
    const results = [];
    for (let i = 0; i < titleIds.length; i++) {
      console.log(`Scraping ${i + 1}/${titleIds.length}: ${titleIds[i]}`);
      const result = await this.scrapeTitle(titleIds[i]);
      results.push(result);
      if (i < titleIds.length - 1) {
        await new Promise((r) => setTimeout(r, delayMs));
      }
    }
    return results;
  }
}

// Usage
(async () => {
  const scraper = new NetflixPublicScraper();
  const results = await scraper.scrapeMultiple([
    '80100172', // Stranger Things
    '80057281', // Narcos
    '70143836', // Breaking Bad
  ]);
  results.forEach((r) => {
    console.log(`\nTitle: ${r.title || 'N/A'}`);
    console.log(`Genres: ${(r.genre || []).join(', ')}`);
    console.log(`Rating: ${r.contentRating || 'N/A'}`);
  });
})();
Regional Availability Detection
One of the most valuable datasets you can build is a regional availability map. Here's how to detect which Netflix titles are available in different countries:
import requests

class RegionalAvailabilityChecker:
    """Check Netflix title availability across regions
    using public uNoGS API data."""

    def __init__(self):
        self.session = requests.Session()

    def check_availability(self, title_name: str) -> dict:
        """Check which countries a title is available in."""
        search_url = "https://unogs.com/api/search"
        params = {"query": title_name, "limit": 5}
        try:
            response = self.session.get(
                search_url, params=params, timeout=10
            )
            response.raise_for_status()
            data = response.json()
        except (requests.RequestException, ValueError) as e:
            return {"error": str(e)}

        results = []
        for item in data.get("results", []):
            results.append({
                "title": item.get("title"),
                "netflix_id": item.get("nfid"),
                "countries": item.get("country_list", []),
                "country_count": item.get("country_count", 0),
                "imdb_id": item.get("imdbid"),
                "rating": item.get("imdbrating"),
                "year": item.get("year"),
            })
        return {"query": title_name, "results": results}
Enriching Data with IMDB Cross-References
Once you have Netflix title IDs, you can cross-reference them with IMDB for richer data:
import json

import requests
from bs4 import BeautifulSoup

def enrich_with_imdb(netflix_data: dict, imdb_id: str) -> dict:
    """Add IMDB data to Netflix scraping results."""
    imdb_url = f"https://www.imdb.com/title/{imdb_id}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)"
    }
    try:
        response = requests.get(imdb_url, headers=headers, timeout=15)
        response.raise_for_status()
    except requests.RequestException as e:
        netflix_data["imdb_error"] = str(e)
        return netflix_data

    soup = BeautifulSoup(response.text, "lxml")

    # Extract IMDB JSON-LD
    ld_script = soup.find("script", type="application/ld+json")
    if ld_script:
        try:
            imdb_data = json.loads(ld_script.string)
        except (json.JSONDecodeError, TypeError):
            return netflix_data
        rating = imdb_data.get("aggregateRating", {})
        netflix_data["imdb_rating"] = rating.get("ratingValue")
        netflix_data["imdb_votes"] = rating.get("ratingCount")
        netflix_data["imdb_keywords"] = (
            imdb_data.get("keywords") or ""
        ).split(",")
        netflix_data["imdb_description"] = imdb_data.get("description")
    return netflix_data
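The raw JSON-LD values usually need light cleanup before analysis: the `keywords` split leaves stray whitespace and empty strings, and `ratingValue` can arrive as a string. A small normalization helper (the function name and field names mirror the enrichment step above, but the helper itself is an assumption):

```python
def normalize_imdb_fields(record: dict) -> dict:
    """Clean up IMDB fields merged in by enrich_with_imdb."""
    # Strip whitespace and drop empty keyword strings
    keywords = record.get("imdb_keywords") or []
    record["imdb_keywords"] = [
        k.strip() for k in keywords if k and k.strip()
    ]
    # Coerce string ratings ("8.5") to floats; None out anything unparsable
    rating = record.get("imdb_rating")
    if isinstance(rating, str):
        try:
            record["imdb_rating"] = float(rating)
        except ValueError:
            record["imdb_rating"] = None
    return record
```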
Scaling with Apify
Local scraping works for small datasets, but for catalog-scale extraction — tens of thousands of titles across multiple regions — you need cloud infrastructure. Apify provides exactly this.
Why Use Apify for Netflix Data?
- Proxy rotation — Netflix actively blocks datacenter IPs. Apify's residential proxy pool handles this automatically.
- Browser rendering — React-rendered content requires headless browsers. Apify's Crawlee framework manages browser instances efficiently.
- Scheduling — Track catalog changes daily or weekly with built-in scheduling.
- Storage — Results go directly to Apify datasets, exportable as JSON, CSV, or Excel.
- Scalability — Run hundreds of concurrent browser instances without managing infrastructure.
Using an Apify Netflix Actor
from apify_client import ApifyClient

# Initialize the Apify client
client = ApifyClient("YOUR_APIFY_TOKEN")

# Configure the Netflix scraping actor
run_input = {
    "searchTerms": [
        "stranger things",
        "squid game",
        "wednesday",
        "the witcher",
    ],
    "maxResults": 100,
    "includeGenres": True,
    "includeRegionalData": True,
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}

# Run the actor (the ID here is a placeholder; use the actor you've
# chosen from the Apify Store, and check its input schema)
run = client.actor("netflix-catalog-scraper").call(run_input=run_input)

# Fetch results from the dataset
dataset_items = client.dataset(run["defaultDatasetId"]).list_items().items

print(f"Scraped {len(dataset_items)} Netflix titles")
for item in dataset_items[:5]:
    print(f"  {item['title']} ({item.get('year', 'N/A')})")
    print(f"  Genres: {', '.join(item.get('genres', []))}")
    print(f"  Available in: {item.get('country_count', '?')} countries")
    print()
Node.js Apify Integration
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
  token: 'YOUR_APIFY_TOKEN',
});

async function scrapeNetflixCatalog() {
  const run = await client.actor('netflix-catalog-scraper').call({
    searchTerms: ['action movies', 'sci-fi series'],
    maxResults: 200,
    includeGenres: true,
    includeRegionalData: true,
  });

  const { items } = await client
    .dataset(run.defaultDatasetId)
    .listItems();

  console.log(`Found ${items.length} titles`);

  // Export to CSV via the dataset items endpoint
  const csvUrl = 'https://api.apify.com/v2/datasets/'
    + `${run.defaultDatasetId}/items?format=csv`;
  console.log(`Download CSV: ${csvUrl}`);

  return items;
}

scrapeNetflixCatalog();
Handling Anti-Scraping Measures
Netflix employs several anti-bot measures. Here's how to handle them ethically:
Rate Limiting
Always implement respectful delays between requests:
import time
import random

def polite_delay(min_seconds=2, max_seconds=5):
    """Add a random delay to avoid overwhelming the server."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
Rotating User Agents
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
]

def get_random_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
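Beyond fixed delays and header rotation, transient blocks (HTTP 429 or 403) are best handled with exponential backoff. A sketch of a retry wrapper around `requests`; the status codes, retry count, and delay schedule are assumptions to tune for your workload:

```python
import time
import random
import requests

def backoff_delay(attempt: int, base_delay: float = 2.0) -> float:
    """Exponential backoff with up to 1s of random jitter: ~2s, 4s, 8s..."""
    return base_delay * (2 ** attempt) + random.uniform(0, 1)

def fetch_with_backoff(url, headers=None, max_retries=4):
    """GET a URL, retrying on statuses that suggest throttling or blocks."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=15)
            if response.status_code not in (403, 429, 500, 502, 503):
                return response
        except requests.RequestException:
            pass  # network error: fall through to the backoff sleep
        time.sleep(backoff_delay(attempt))
    return None  # give up after max_retries attempts
```

Returning `None` after exhausting retries lets the caller decide whether to skip the title or queue it for a later pass.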
Practical Use Cases for Netflix Public Data
1. Content Research and Analysis
Track what genres Netflix is investing in, which regions get exclusive content first, and how the catalog composition changes over time.
2. Competitive Intelligence
Media companies use catalog data to understand Netflix's content strategy — what types of originals they're producing, which licensed content they're acquiring, and regional content gaps.
3. Entertainment Apps and Recommendation Engines
Build third-party recommendation tools that help users discover content across streaming platforms by aggregating catalog data from Netflix and competitors.
4. Academic Research
Researchers study content diversity, regional representation, and the economics of streaming through publicly available catalog data.
5. Journalism and Reporting
Entertainment journalists track catalog additions, removals, and regional differences for reporting on the streaming industry.
Legal and Ethical Considerations
When scraping Netflix's public data:
- Respect robots.txt — Always check and follow Netflix's robots.txt directives
- Rate limit your requests — Never overwhelm servers with rapid-fire requests
- Only access public data — Never attempt to bypass login walls or access authenticated endpoints
- Don't redistribute copyrighted content — Metadata is fair game for analysis, but actual video content is protected
- Check terms of service — Review Netflix's ToS regarding automated data collection
- Use data responsibly — Aggregated insights are generally acceptable; individual user data never is
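The robots.txt check from the first bullet is easy to automate with Python's standard library. A sketch: feed it the robots.txt text you've already fetched plus the URL you intend to crawl (the sample rules below are illustrative, not Netflix's actual robots.txt):

```python
from urllib.robotparser import RobotFileParser

def is_path_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules parsed from raw text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content for illustration
sample_robots = """User-agent: *
Disallow: /private/
Allow: /
"""
```

Run this check once per host before a crawl, and cache the parsed rules rather than re-fetching robots.txt for every request.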
Conclusion
Netflix's public data footprint is larger than most people realize. Between title pages, genre structures, regional availability data, and IMDB cross-references, you can build comprehensive entertainment datasets without ever touching authenticated endpoints.
For small-scale projects, the Python and Node.js scrapers in this article will get you started. For production-scale data pipelines processing thousands of titles across dozens of regions, Apify's cloud infrastructure handles the heavy lifting of proxy rotation, browser management, and scheduling.
The key is to start with the publicly accessible data, respect rate limits, and build incrementally. Whether you're building a recommendation engine, doing content research, or tracking industry trends, the data is out there — you just need the right tools to collect it.