DEV Community

agenthustler

Netflix Public Data Scraping: Extract Titles, Genres and Streaming Availability

Netflix is one of the most data-rich entertainment platforms on the planet. While its internal recommendation engine and viewing statistics remain locked behind authentication, a surprising amount of Netflix data is publicly accessible without logging in. This data includes title catalogs, genre classifications, new release sections, and regional availability information that can be cross-referenced with external databases like IMDB.

In this guide, we'll explore exactly what Netflix data is publicly available, how it's structured on the web, and how to build reliable scrapers using both custom code and Apify's cloud scraping infrastructure to extract it at scale.

What Netflix Data Is Publicly Accessible?

Before writing any code, it's essential to understand what Netflix exposes without authentication. Many developers assume everything requires a logged-in session, but that's not the case.

Public Title Pages

Every Netflix title has a public-facing page at netflix.com/title/{id}. These pages are accessible without login and contain:

  • Title name and original title (for international content)
  • Synopsis/description — the short and long descriptions
  • Cast and crew — actors, directors, writers
  • Genre tags — Netflix's internal genre classification
  • Maturity rating — TV-MA, PG-13, etc.
  • Release year — when the content was originally released
  • Type indicator — whether it's a movie, series, or documentary
  • Thumbnail/poster images — high-resolution artwork URLs

Genre Browsing Pages

Netflix has a well-known system of genre codes. URLs like netflix.com/browse/genre/{code} expose category-level browsing. While some of these pages require authentication to show their full contents, the genre structure itself is discoverable through public metadata and third-party databases that catalog Netflix's genre IDs.

Popular genre codes include:

| Genre Code | Category           |
| ---------- | ------------------ |
| 6839       | Action & Adventure |
| 33264      | Asian TV Shows     |
| 1365       | Action Comedies    |
| 7424       | Anime              |
| 8933       | Classic Movies     |
| 5763       | Dramas             |
| 11881      | Thrillers          |
| 2595       | Horror             |
| 31574      | Reality TV         |

Netflix reportedly uses more than 27,000 micro-genre codes internally, many of which map to publicly accessible browse pages.
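These codes can be turned into crawl targets directly. A minimal sketch, using a small subset of the codes from the table above (the mapping is illustrative, not exhaustive):

```python
# A few genre codes from the table above (illustrative subset)
GENRE_CODES = {
    6839: "Action & Adventure",
    7424: "Anime",
    5763: "Dramas",
    2595: "Horror",
}

def genre_browse_url(code: int) -> str:
    """Build the public browse URL for a Netflix genre code."""
    return f"https://www.netflix.com/browse/genre/{code}"

# One crawl target per genre, keyed by category name
crawl_targets = {
    name: genre_browse_url(code) for code, name in GENRE_CODES.items()
}
```

Feeding these URLs into the scrapers below is how you go from a flat list of codes to a category-by-category catalog crawl.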

New Releases and Trending Sections

Netflix's media center (media.netflix.com) publishes press releases about new content additions. The "What's New" sections provide structured data about upcoming releases, including premiere dates, title descriptions, and regional launch schedules.

IMDB Cross-Referenced Data

While not directly from Netflix, IMDB maintains comprehensive linkage data between Netflix titles and their IMDB entries. This lets you enrich Netflix catalog data with:

  • IMDB ratings and vote counts
  • Full cast filmographies
  • Box office data
  • Awards history
  • User reviews and sentiment data

Regional Availability Detection

Netflix's catalog varies dramatically by region. By examining public-facing pages from different geographic endpoints and using services like uNoGS (unofficial Netflix online Global Search), you can detect which titles are available in which countries. This is public data, compiled by observing Netflix's catalog from vantage points in different regions.

Understanding Netflix's Page Structure

Netflix renders most of its content dynamically using React. This means traditional HTTP request-based scraping will only get you the initial HTML shell. The actual content data is loaded through:

  1. Server-side rendered metadata — embedded in <script> tags as JSON-LD structured data
  2. Falcor API responses — Netflix uses Falcor (their open-source data fetching framework) to load content data
  3. Open Graph meta tags — title, description, and image data in <meta> tags

For public pages, the most reliable extraction targets are the JSON-LD structured data and Open Graph tags, as these are rendered server-side for SEO purposes.
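To see what those two extraction targets look like in practice, here is a self-contained sketch that pulls an Open Graph tag and a JSON-LD payload out of a simplified HTML snippet. The markup below is an illustration of the general shape, not Netflix's actual page source:

```python
import json
import re

# Simplified stand-in for a server-rendered title page
SAMPLE_HTML = '''
<html><head>
<meta property="og:title" content="Stranger Things | Netflix Official Site">
<script type="application/ld+json">
{"@type": "TVSeries", "name": "Stranger Things", "contentRating": "TV-14"}
</script>
</head><body></body></html>
'''

# Open Graph data lives in plain <meta> attributes
og_title = re.search(
    r'property="og:title" content="([^"]+)"', SAMPLE_HTML
).group(1)

# JSON-LD is a JSON document embedded in a <script> tag
ld_raw = re.search(
    r'<script type="application/ld\+json">(.*?)</script>',
    SAMPLE_HTML, re.S
).group(1)
ld_data = json.loads(ld_raw)
```

The full scrapers below do the same thing with BeautifulSoup and cheerio, which are far more robust than regexes against real-world markup.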

Setting Up Your Scraping Environment

Python Setup

# requirements.txt
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
apify-client==1.8.1

Install dependencies:

pip install requests beautifulsoup4 lxml apify-client

Node.js Setup

npm init -y
npm install axios cheerio crawlee apify-client

Building a Netflix Title Scraper

Python Implementation

import requests
from bs4 import BeautifulSoup
import json
import re
import time
import random

class NetflixPublicScraper:
    """Scraper for publicly accessible Netflix title data."""

    BASE_URL = "https://www.netflix.com/title"

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/120.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml"
        })

    def scrape_title(self, title_id: str) -> dict:
        """Extract public metadata from a Netflix title page."""
        url = f"{self.BASE_URL}/{title_id}"

        try:
            response = self.session.get(url, timeout=15)
            response.raise_for_status()
        except requests.RequestException as e:
            return {"error": str(e), "title_id": title_id}

        soup = BeautifulSoup(response.text, "lxml")
        data = {"title_id": title_id, "url": url}

        # Extract Open Graph metadata
        og_tags = {
            "title": "og:title",
            "description": "og:description",
            "image": "og:image",
            "type": "og:type",
            "site_name": "og:site_name"
        }

        for key, property_name in og_tags.items():
            tag = soup.find("meta", property=property_name)
            data[key] = tag.get("content") if tag else None

        # Extract JSON-LD structured data
        json_ld_scripts = soup.find_all(
            "script", type="application/ld+json"
        )

        for script in json_ld_scripts:
            try:
                ld_data = json.loads(script.string)
                if isinstance(ld_data, dict):
                    data["structured_data"] = ld_data
                    data["genre"] = ld_data.get("genre", [])
                    data["actors"] = [
                        a.get("name") 
                        for a in ld_data.get("actors", [])
                        if isinstance(a, dict)
                    ]
                    data["director"] = [
                        d.get("name") 
                        for d in ld_data.get("director", [])
                        if isinstance(d, dict)
                    ]
                    data["content_rating"] = ld_data.get(
                        "contentRating"
                    )
                    data["date_created"] = ld_data.get(
                        "dateCreated"
                    )
            except (json.JSONDecodeError, TypeError):
                continue

        # Extract additional metadata from page scripts
        scripts = soup.find_all("script")
        for script in scripts:
            if script.string and "reactContext" in script.string:
                context_match = re.search(
                    r'reactContext\s*=\s*({.+?});',
                    script.string
                )
                if context_match:
                    try:
                        context = json.loads(context_match.group(1))
                        data["country"] = (
                            context.get("models", {})
                            .get("geoInfo", {})
                            .get("data", {})
                            .get("country")
                        )
                    except json.JSONDecodeError:
                        pass

        return data

    def scrape_multiple(
        self, title_ids: list, delay_range=(1, 3)
    ) -> list:
        """Scrape multiple titles with polite delays."""
        results = []
        for i, title_id in enumerate(title_ids):
            print(f"Scraping {i+1}/{len(title_ids)}: {title_id}")
            result = self.scrape_title(title_id)
            results.append(result)

            if i < len(title_ids) - 1:
                delay = random.uniform(*delay_range)
                time.sleep(delay)

        return results


# Usage example
if __name__ == "__main__":
    scraper = NetflixPublicScraper()

    sample_ids = [
        "80100172",  # Stranger Things
        "80057281",  # Narcos
        "70143836",  # Breaking Bad
    ]

    results = scraper.scrape_multiple(sample_ids)

    for result in results:
        print(f"\nTitle: {result.get('title', 'N/A')}")
        description = result.get("description") or "N/A"
        print(f"Description: {description[:100]}...")
        print(f"Genres: {result.get('genre', [])}")
        print(f"Rating: {result.get('content_rating', 'N/A')}")

Node.js Implementation

const axios = require('axios');
const cheerio = require('cheerio');

class NetflixPublicScraper {
  constructor() {
    this.baseUrl = 'https://www.netflix.com/title';
    this.headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        + 'AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
      'Accept-Language': 'en-US,en;q=0.9',
      'Accept': 'text/html,application/xhtml+xml',
    };
  }

  async scrapeTitle(titleId) {
    const url = `${this.baseUrl}/${titleId}`;

    try {
      const { data: html } = await axios.get(url, {
        headers: this.headers,
        timeout: 15000,
      });

      const $ = cheerio.load(html);
      const result = { titleId, url };

      // Open Graph metadata
      result.title = $('meta[property="og:title"]').attr('content');
      result.description = $('meta[property="og:description"]')
        .attr('content');
      result.image = $('meta[property="og:image"]').attr('content');

      // JSON-LD structured data
      $('script[type="application/ld+json"]').each((_, el) => {
        try {
          const ldData = JSON.parse($(el).html());
          if (ldData && typeof ldData === 'object') {
            result.structuredData = ldData;
            result.genre = ldData.genre || [];
            result.contentRating = ldData.contentRating;
            result.dateCreated = ldData.dateCreated;
            result.actors = (ldData.actors || [])
              .filter(a => a.name)
              .map(a => a.name);
            result.director = (ldData.director || [])
              .filter(d => d.name)
              .map(d => d.name);
          }
        } catch (e) {
          // Skip malformed JSON-LD
        }
      });

      return result;
    } catch (error) {
      return { error: error.message, titleId };
    }
  }

  async scrapeMultiple(titleIds, delayMs = 2000) {
    const results = [];
    for (let i = 0; i < titleIds.length; i++) {
      console.log(
        `Scraping ${i + 1}/${titleIds.length}: ${titleIds[i]}`
      );
      const result = await this.scrapeTitle(titleIds[i]);
      results.push(result);

      if (i < titleIds.length - 1) {
        await new Promise(r => setTimeout(r, delayMs));
      }
    }
    return results;
  }
}

// Usage
(async () => {
  const scraper = new NetflixPublicScraper();
  const results = await scraper.scrapeMultiple([
    '80100172', // Stranger Things
    '80057281', // Narcos
    '70143836', // Breaking Bad
  ]);

  results.forEach(r => {
    console.log(`\nTitle: ${r.title || 'N/A'}`);
    console.log(`Genres: ${(r.genre || []).join(', ')}`);
    console.log(`Rating: ${r.contentRating || 'N/A'}`);
  });
})();

Regional Availability Detection

One of the most valuable datasets you can build is a regional availability map. Here's how to detect which Netflix titles are available in different countries:

import requests

class RegionalAvailabilityChecker:
    """Check Netflix title availability across regions 
    using public uNoGS API data."""

    def __init__(self):
        self.session = requests.Session()

    def check_availability(self, title_name: str) -> dict:
        """Check which countries a title is available in."""

        # NOTE: uNoGS access has changed over time (it now runs through
        # RapidAPI); this endpoint is illustrative and may require a key
        search_url = "https://unogs.com/api/search"
        params = {
            "query": title_name,
            "limit": 5
        }

        try:
            response = self.session.get(
                search_url, params=params, timeout=10
            )
            data = response.json()

            results = []
            for item in data.get("results", []):
                results.append({
                    "title": item.get("title"),
                    "netflix_id": item.get("nfid"),
                    "countries": item.get("country_list", []),
                    "country_count": item.get("country_count", 0),
                    "imdb_id": item.get("imdbid"),
                    "rating": item.get("imdbrating"),
                    "year": item.get("year")
                })

            return {"query": title_name, "results": results}

        except requests.RequestException as e:
            return {"error": str(e)}

Enriching Data with IMDB Cross-References

Once you have Netflix title IDs, you can cross-reference them with IMDB for richer data:

import json

import requests
from bs4 import BeautifulSoup


def enrich_with_imdb(netflix_data: dict, imdb_id: str) -> dict:
    """Add IMDB data to Netflix scraping results."""

    imdb_url = f"https://www.imdb.com/title/{imdb_id}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)"
    }

    response = requests.get(
        imdb_url, headers=headers, timeout=15
    )
    soup = BeautifulSoup(response.text, "lxml")

    # Extract IMDB JSON-LD
    ld_script = soup.find("script", type="application/ld+json")
    if ld_script:
        imdb_data = json.loads(ld_script.string)

        netflix_data["imdb_rating"] = (
            imdb_data
            .get("aggregateRating", {})
            .get("ratingValue")
        )
        netflix_data["imdb_votes"] = (
            imdb_data
            .get("aggregateRating", {})
            .get("ratingCount")
        )
        netflix_data["imdb_keywords"] = imdb_data.get(
            "keywords", ""
        ).split(",")
        netflix_data["imdb_description"] = imdb_data.get(
            "description"
        )

    return netflix_data

Scaling with Apify

Local scraping works for small datasets, but for catalog-scale extraction — tens of thousands of titles across multiple regions — you need cloud infrastructure. Apify provides exactly this.

Why Use Apify for Netflix Data?

  1. Proxy rotation — Netflix actively blocks datacenter IPs. Apify's residential proxy pool handles this automatically.
  2. Browser rendering — React-rendered content requires headless browsers. Apify's Crawlee framework manages browser instances efficiently.
  3. Scheduling — Track catalog changes daily or weekly with built-in scheduling.
  4. Storage — Results go directly to Apify datasets, exportable as JSON, CSV, or Excel.
  5. Scalability — Run hundreds of concurrent browser instances without managing infrastructure.

Using an Apify Netflix Actor

from apify_client import ApifyClient

# Initialize the Apify client
client = ApifyClient("YOUR_APIFY_TOKEN")

# Configure the Netflix scraping actor
run_input = {
    "searchTerms": [
        "stranger things",
        "squid game",
        "wednesday",
        "the witcher"
    ],
    "maxResults": 100,
    "includeGenres": True,
    "includeRegionalData": True,
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"]
    }
}

# Run the actor (replace "netflix-catalog-scraper" with the full ID of
# a Netflix scraper actor from the Apify Store, e.g. "username/actor-name")
run = client.actor("netflix-catalog-scraper").call(
    run_input=run_input
)

# Fetch results from the dataset
dataset_items = client.dataset(
    run["defaultDatasetId"]
).list_items().items

print(f"Scraped {len(dataset_items)} Netflix titles")

for item in dataset_items[:5]:
    print(f"  {item.get('title', 'N/A')} ({item.get('year', 'N/A')})")
    print(f"  Genres: {', '.join(item.get('genres', []))}")
    print(f"  Available in: {item.get('country_count', '?')} countries")
    print()

Node.js Apify Integration

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
  token: 'YOUR_APIFY_TOKEN',
});

async function scrapeNetflixCatalog() {
  // Replace 'netflix-catalog-scraper' with the full ID of a Netflix
  // scraper actor from the Apify Store (e.g. 'username/actor-name')
  const run = await client.actor('netflix-catalog-scraper')
    .call({
      searchTerms: ['action movies', 'sci-fi series'],
      maxResults: 200,
      includeGenres: true,
      includeRegionalData: true,
    });

  const { items } = await client
    .dataset(run.defaultDatasetId)
    .listItems();

  console.log(`Found ${items.length} titles`);

  // Export to CSV
  const csvUrl = `https://api.apify.com/v2/datasets/`
    + `${run.defaultDatasetId}/items?format=csv`;
  console.log(`Download CSV: ${csvUrl}`);

  return items;
}

scrapeNetflixCatalog();

Handling Anti-Scraping Measures

Netflix employs several anti-bot measures. Here's how to handle them ethically:

Rate Limiting

Always implement respectful delays between requests:

import time
import random

def polite_delay(min_seconds=2, max_seconds=5):
    """Add a random delay to avoid overwhelming the server."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
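Beyond fixed delays, transient blocks (HTTP 429s, timeouts) are best handled by retrying with exponential backoff. A minimal sketch, where the `fetch` callable stands in for any request function from the scrapers above:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=2.0):
    """Call `fetch`, retrying on failure with exponential backoff
    plus random jitter between attempts."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error
            # Double the wait each attempt, plus jitter to avoid
            # synchronized retry bursts
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Wrapping `session.get(...)` in this helper means a temporarily blocked request gets retried instead of being recorded as an error.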

Rotating User Agents

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) "
    "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
]

def get_random_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

Practical Use Cases for Netflix Public Data

1. Content Research and Analysis

Track what genres Netflix is investing in, which regions get exclusive content first, and how the catalog composition changes over time.

2. Competitive Intelligence

Media companies use catalog data to understand Netflix's content strategy — what types of originals they're producing, which licensed content they're acquiring, and regional content gaps.

3. Entertainment Apps and Recommendation Engines

Build third-party recommendation tools that help users discover content across streaming platforms by aggregating catalog data from Netflix and competitors.

4. Academic Research

Researchers study content diversity, regional representation, and the economics of streaming through publicly available catalog data.

5. Journalism and Reporting

Entertainment journalists track catalog additions, removals, and regional differences for reporting on the streaming industry.

Legal and Ethical Considerations

When scraping Netflix's public data:

  • Respect robots.txt — Always check and follow Netflix's robots.txt directives
  • Rate limit your requests — Never overwhelm servers with rapid-fire requests
  • Only access public data — Never attempt to bypass login walls or access authenticated endpoints
  • Don't redistribute copyrighted content — Metadata is fair game for analysis, but actual video content is protected
  • Check terms of service — Review Netflix's ToS regarding automated data collection
  • Use data responsibly — Aggregated insights are generally acceptable; individual user data never is
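The first point, checking robots.txt, can be automated with Python's standard library. A minimal sketch, demonstrated here against a sample policy rather than Netflix's live robots.txt (which you should fetch and check yourself before crawling):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# parse() accepts an iterable of lines; in production you would call
# rp.set_url("https://www.netflix.com/robots.txt") and rp.read() instead
rp.parse([
    "User-agent: *",
    "Disallow: /login",
])

def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the policy permits fetching the URL."""
    return rp.can_fetch(user_agent, url)
```

Calling `is_allowed(...)` before each request is a cheap way to keep a crawler inside the site's published rules.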

Conclusion

Netflix's public data footprint is larger than most people realize. Between title pages, genre structures, regional availability data, and IMDB cross-references, you can build comprehensive entertainment datasets without ever touching authenticated endpoints.

For small-scale projects, the Python and Node.js scrapers in this article will get you started. For production-scale data pipelines processing thousands of titles across dozens of regions, Apify's cloud infrastructure handles the heavy lifting of proxy rotation, browser management, and scheduling.

The key is to start with the publicly accessible data, respect rate limits, and build incrementally. Whether you're building a recommendation engine, doing content research, or tracking industry trends, the data is out there — you just need the right tools to collect it.
