Nir Berko

Building a Production-Grade Scraper with Playwright, Chromium, Kubernetes, and AWS

Introduction

Recently, I worked on scraping an off-market real estate platform that required authentication, session handling, and dynamic rendering.

This was not a simple HTTP scraper.

The site is a single-page application, protected by login, session cookies, and active blocking mechanisms. To make it work reliably in production, I had to combine headless browser automation, Kubernetes, AWS-managed infrastructure, and a deliberate proxy strategy.

This post walks through the architecture, the key technical decisions, and the lessons learned for anyone building scrapers that need to survive real-world constraints.


High-Level Architecture

The system was split into two microservices running inside Kubernetes on AWS:

  1. Scraper Service (Node.js / NestJS)

    Responsible for:

    • Orchestrating login
    • Managing session state
    • Fetching listing and detail data
    • Parsing and normalizing results
  2. Chromium Service

    A dedicated headless Chromium instance running as a separate pod using the browserless/chrome image.

The scraper service connects to Chromium over CDP via WebSocket, exposed through an internal Kubernetes Service. This separation kept browser concerns isolated and made the system easier to reason about and evolve.


Why a Headless Browser Was Required

A headless browser was unavoidable for two reasons:

  • The platform requires authentication and session cookies
  • The site is a SPA that relies heavily on client-side rendering

Pure HTTP scraping could not reliably establish or maintain a valid session. Playwright was used to perform login, manage cookies and local storage, and act as the source of truth for the authentication state.
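To make the login flow concrete, here is a minimal sketch, assuming hypothetical selectors, URL, and environment variable names (the real platform's form fields and credentials differ):

import { BrowserContext } from 'playwright';

// Hypothetical login URL and selectors, for illustration only.
const LOGIN_URL = 'https://platform.example.com/login';

export async function login(context: BrowserContext): Promise<void> {
  const page = await context.newPage();
  await page.goto(LOGIN_URL, { waitUntil: 'domcontentloaded' });

  // Credentials arrive as environment variables (see the Secrets Manager section below).
  await page.fill('input[name="email"]', process.env.PLATFORM_EMAIL!);
  await page.fill('input[name="password"]', process.env.PLATFORM_PASSWORD!);
  await page.click('button[type="submit"]');

  // Wait for the SPA to finish the post-login navigation.
  await page.waitForURL('**/dashboard', { timeout: 30_000 });

  // The browser context now holds the session cookies and local storage
  // that act as the source of truth for authentication state.
  await page.close();
}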


Why Playwright

Playwright turned out to be a strong fit for this use case:

  • Reliable handling of modern SPAs
  • Clean APIs for waiting on navigation and DOM readiness
  • Stable session and cookie management
  • Native support for connecting to a remote Chromium instance over CDP

Most importantly, it significantly reduced flakiness around login flows and dynamic rendering.


Connecting to Remote Chromium in Kubernetes

Chromium ran behind an internal Kubernetes Service.

The scraper connected using:

const browser = await chromium.connectOverCDP('ws://<chromium-service>:<port>');

The host and port were injected via environment variables, which allowed:

  • Clean separation between dev, staging, and production
  • No hard-coded endpoints
  • Easy switching between local and remote browser execution

This setup was stable and did not require reconnect logic or special handling.
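As a minimal sketch of that connection helper, assuming CHROMIUM_HOST and CHROMIUM_PORT as the injected variable names (the real names are illustrative):

import { chromium, Browser } from 'playwright';

// Host and port are injected via environment variables, so the same code
// runs against a local Chromium in dev and the in-cluster Service in prod.
const host = process.env.CHROMIUM_HOST ?? 'localhost';
const port = process.env.CHROMIUM_PORT ?? '3000';

export async function connectBrowser(): Promise<Browser> {
  // browserless/chrome exposes a CDP WebSocket endpoint on its service port.
  return chromium.connectOverCDP(`ws://${host}:${port}`);
}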


AWS Infrastructure Choices

Although this system was Kubernetes-first, AWS services played a key role in making it production-ready.

Amazon EKS

The entire system ran on Amazon EKS, which provided:

  • A managed Kubernetes control plane
  • Predictable networking and service discovery
  • Clean separation between scraper and browser workloads

EKS made it straightforward to run a browser-heavy workload while still benefiting from Kubernetes primitives.

AWS Secrets Manager

All sensitive configurations, including:

  • Platform credentials
  • Proxy credentials
  • Environment-specific secrets

were stored in AWS Secrets Manager and injected into pods via environment variables. This avoided hardcoding secrets into images or manifests and enabled clean separation between environments.
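As a rough sketch of how those injected values are consumed inside the NestJS service (the key names here are assumptions, not the project's actual secrets):

import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

// Secrets Manager values arrive as plain environment variables, so the
// service reads them through Nest's ConfigService like any other setting.
@Injectable()
export class ScraperConfig {
  constructor(private readonly config: ConfigService) {}

  get platformEmail(): string {
    const value = this.config.get<string>('PLATFORM_EMAIL');
    if (!value) throw new Error('PLATFORM_EMAIL is not set');
    return value;
  }

  get proxyUrl(): string | undefined {
    // Proxy credentials are optional in environments that do not use a proxy.
    return this.config.get<string>('PROXY_URL');
  }
}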

IAM Roles for Service Accounts (IRSA)

Pods accessed AWS resources using IAM roles attached to Kubernetes service accounts. This eliminated static AWS credentials and followed least-privilege principles.

Amazon CloudWatch

Logs from the scraper service were shipped to CloudWatch. This was critical for:

  • Debugging silent failures
  • Identifying where blocking occurred (login vs data fetch)
  • Understanding retry behavior over time

Scraping systems fail quietly. Centralized logging was essential.
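In spirit, the logging was structured JSON on stdout, tagged with a stage so login failures could be told apart from data-fetch failures. A minimal sketch, assuming pino as the logger, illustrative field names, and a cluster log agent forwarding stdout to CloudWatch:

import pino from 'pino';

// JSON logs on stdout are picked up by the cluster's log forwarder and
// land in CloudWatch; the stage/attempt fields are illustrative.
const logger = pino({ base: { service: 'scraper' } });

logger.info({ stage: 'login', attempt: 1 }, 'login succeeded');
logger.warn({ stage: 'detail-fetch', attempt: 3, status: 403 }, 'blocked, will retry');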


The Reality of Scraping Blocked Websites

In production scraping, parsing data is rarely the hard part.

Getting access is.

Modern platforms actively block automated traffic using:

  • Rate limiting
  • IP reputation checks
  • Blocking cloud datacenter IP ranges
  • Flagging abnormal login or navigation patterns

Without mitigation, even a well-written scraper may work once and then silently stop.

This is where proxies become unavoidable.


Proxy Strategy: Necessary but Expensive

For authenticated and high-value platforms, proxies are not about scale. They are about survival.

Without proxies:

  • Login attempts get blocked quickly
  • Sessions are invalidated
  • Requests return partial or empty responses

Residential or ISP-grade proxies improve:

  • IP reputation
  • Session stability
  • Retry success rates

The tradeoff is cost and operational complexity.


Use Proxies Only Where You Must

One key lesson was to treat proxies as an expensive resource, not the default path.

In this setup:

  • Proxies were mainly used for browser-based actions, especially login
  • Once a valid session was established, most data fetching was done via direct HTTP requests using session cookies

This approach:

  • Reduced proxy traffic and cost
  • Improved speed
  • Lowered exposure to unnecessary blocking

A Key Optimization: Browser for Login, HTTP for Data

After login, cookies were extracted from the browser context and reused for direct HTTP requests from Node.js.
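A minimal sketch of that handoff, assuming a placeholder listing endpoint and Node's built-in fetch (the real API paths and HTTP client differ):

import { BrowserContext } from 'playwright';

// Turn the browser's session cookies into a Cookie header for plain
// HTTP requests. The endpoint URL below is a placeholder.
export async function fetchListings(context: BrowserContext): Promise<unknown> {
  const cookies = await context.cookies();
  const cookieHeader = cookies.map((c) => `${c.name}=${c.value}`).join('; ');

  const res = await fetch('https://platform.example.com/api/listings?page=1', {
    headers: {
      cookie: cookieHeader,
      accept: 'application/json',
      // A realistic user agent helps the request blend in with browser traffic.
      'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    },
  });

  if (!res.ok) {
    throw new Error(`Listing fetch failed with status ${res.status}`);
  }
  return res.json();
}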

This allowed:

  • Fetching JSON listing data without rendering pages
  • Fetching HTML detail pages directly
  • Full control over headers, retries, and backoff logic

Benefits:

  • Much faster than rendering each page in Chromium
  • More stable than page-by-page navigation
  • Lower proxy and browser costs
  • Easier retry handling

The browser became an authentication tool, not a data-fetching bottleneck.


Single-Tenant Design (By Choice)

The scraper was intentionally designed as single-tenant:

  • One browser instance
  • One context and page
  • Shared login state

This optimizes for speed and simplicity.

If scaling to multi-tenancy, the next steps would be:

  • Introducing a job queue (for example, SQS)
  • Explicit concurrency limits
  • Separate browser contexts per job
  • Tighter resource control per pod

For this use case, simplicity was the right tradeoff.
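If that path were taken, per-job isolation could look roughly like this sketch: a fixed-size worker pool where each job gets its own browser context. Nothing here is part of the current implementation, and queue integration (for example, SQS polling) is omitted.

import { Browser } from 'playwright';

const MAX_CONCURRENT = 3;

// Hypothetical multi-tenant step: each job runs in its own browser context,
// and a fixed-size pool of workers drains the job list.
export async function processJobs(browser: Browser, urls: string[]): Promise<void> {
  const queue = [...urls];

  const worker = async () => {
    for (let url = queue.shift(); url !== undefined; url = queue.shift()) {
      const context = await browser.newContext();
      try {
        const page = await context.newPage();
        await page.goto(url);
        // ...per-tenant login and extraction would go here...
      } finally {
        await context.close();
      }
    }
  };

  await Promise.all(Array.from({ length: MAX_CONCURRENT }, worker));
}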


Kubernetes Resource Considerations

Chromium ran reliably without explicit CPU or memory tuning.

For higher load, best practices would include:

  • Explicit CPU and memory requests and limits
  • Increasing /dev/shm using emptyDir
  • Monitoring browser memory growth over time

These become critical as concurrency increases.


Scraping Is an Arms Race

There is no “set it and forget it” scraper.

Sites change:

  • Login flows
  • Required headers
  • JavaScript behavior
  • Bot detection rules

A scraper that works today can break tomorrow without any code changes.

Scraping should be treated as a system, not a script:

  • Expect failures
  • Build retries (a minimal backoff sketch follows this list)
  • Monitor success rates
  • Adapt proxy strategy over time
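As one concrete piece of that, a retry-with-backoff helper might look like this sketch; the attempt count and delays are illustrative starting points, not tuned values:

// Generic retry with exponential backoff and jitter.
export async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 500;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}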

Final Thoughts

Scraping modern web applications is less about HTML parsing and more about system design.

Using a headless browser only where it is truly required, combined with direct HTTP requests wherever possible, provides the best balance between reliability, speed, and cost.

If scraping is a core dependency, the real question is not can you scrape the data, but whether the data is valuable enough to justify the ongoing operational complexity.

That question should be answered early.

Top comments (1)

OnlineProxy

The most reliable EKS setup is two services: a stateless scraper and a remote browserless/chrome behind a ClusterIP. Split browser and scraper when you need isolation, independent scaling, or multi-tenancy; keep a single pod with local Chromium when volume is low, bursts are short, or you’re keeping ops light. The “browser-for-login, HTTP-for-data” play works great (persist storageState, lift cookies/headers, and fetch with undici/got), but it faceplants when tokens are device/IP-bound, fingerprint-tied, or data only rides in-browser GraphQL with shifty headers. Bake in graceful failover with feature flags, route-level fallbacks, circuit breakers, canary accounts, schema/DOM-change detectors, and auto-disable to a DLQ so you don’t cook your accounts.