Nir Berko

Building a Production-Grade Scraper with Playwright, Chromium, Kubernetes, and AWS

Introduction

Recently, I worked on scraping an off-market real estate platform that required authentication, session handling, and dynamic rendering.

This was not a simple HTTP scraper.

The site is a single-page application, protected by login, session cookies, and active blocking mechanisms. To make it work reliably in production, I had to combine headless browser automation, Kubernetes, AWS-managed infrastructure, and a deliberate proxy strategy.

This post walks through the architecture, the key technical decisions, and the lessons learned for anyone building scrapers that need to survive real-world constraints.


High-Level Architecture

The system was split into two microservices running inside Kubernetes on AWS:

  1. Scraper Service (Node.js / NestJS)

    Responsible for:

    • Orchestrating login
    • Managing session state
    • Fetching listing and detail data
    • Parsing and normalizing results
  2. Chromium Service

    A dedicated headless Chromium instance running as a separate pod using the browserless/chrome image.

The scraper service connects to Chromium over CDP via WebSocket, exposed through an internal Kubernetes Service. This separation kept browser concerns isolated and made the system easier to reason about and evolve.


Why a Headless Browser Was Required

A headless browser was unavoidable for two reasons:

  • The platform requires authentication and session cookies
  • The site is a SPA that relies heavily on client-side rendering

Pure HTTP scraping could not reliably establish or maintain a valid session. Playwright was used to perform login, manage cookies and local storage, and act as the source of truth for the authentication state.
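To make the login flow concrete, here is a minimal sketch, assuming hypothetical selectors, URL, and environment variable names (the real platform's form fields and credentials differ):

import { BrowserContext } from 'playwright';

// Hypothetical login URL and selectors, for illustration only.
const LOGIN_URL = 'https://platform.example.com/login';

export async function login(context: BrowserContext): Promise<void> {
  const page = await context.newPage();
  await page.goto(LOGIN_URL, { waitUntil: 'domcontentloaded' });

  // Credentials arrive as environment variables (see the Secrets Manager section below).
  await page.fill('input[name="email"]', process.env.PLATFORM_EMAIL!);
  await page.fill('input[name="password"]', process.env.PLATFORM_PASSWORD!);
  await page.click('button[type="submit"]');

  // Wait for the SPA to finish the post-login navigation.
  await page.waitForURL('**/dashboard', { timeout: 30_000 });

  // The browser context now holds the session cookies and local storage
  // that act as the source of truth for authentication state.
  await page.close();
}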


Why Playwright

Playwright turned out to be a strong fit for this use case:

  • Reliable handling of modern SPAs
  • Clean APIs for waiting on navigation and DOM readiness
  • Stable session and cookie management
  • Native support for connecting to a remote Chromium instance over CDP

Most importantly, it significantly reduced flakiness around login flows and dynamic rendering.


Connecting to Remote Chromium in Kubernetes

Chromium ran behind an internal Kubernetes Service.

The scraper connected using:

const browser = await chromium.connectOverCDP('ws://<chromium-service>:<port>');

The host and port were injected via environment variables, which allowed:

  • Clean separation between dev, staging, and production
  • No hard-coded endpoints
  • Easy switching between local and remote browser execution

This setup was stable and did not require reconnect logic or special handling.
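As a minimal sketch of that connection helper, assuming CHROMIUM_HOST and CHROMIUM_PORT as the injected variable names (the real names are illustrative):

import { chromium, Browser } from 'playwright';

// Host and port are injected via environment variables, so the same code
// runs against a local Chromium in dev and the in-cluster Service in prod.
const host = process.env.CHROMIUM_HOST ?? 'localhost';
const port = process.env.CHROMIUM_PORT ?? '3000';

export async function connectBrowser(): Promise<Browser> {
  // browserless/chrome exposes a CDP WebSocket endpoint on its service port.
  return chromium.connectOverCDP(`ws://${host}:${port}`);
}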


AWS Infrastructure Choices

Although this system was Kubernetes-first, AWS services played a key role in making it production-ready.

Amazon EKS

The entire system ran on Amazon EKS, which provided:

  • A managed Kubernetes control plane
  • Predictable networking and service discovery
  • Clean separation between scraper and browser workloads

EKS made it straightforward to run a browser-heavy workload while still benefiting from Kubernetes primitives.

AWS Secrets Manager

All sensitive configurations, including:

  • Platform credentials
  • Proxy credentials
  • Environment-specific secrets

were stored in AWS Secrets Manager and injected into pods via environment variables. This avoided hardcoding secrets into images or manifests and enabled clean separation between environments.
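As a rough sketch of how those injected values are consumed inside the NestJS service (the key names here are assumptions, not the project's actual secrets):

import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

// Secrets Manager values arrive as plain environment variables, so the
// service reads them through Nest's ConfigService like any other setting.
@Injectable()
export class ScraperConfig {
  constructor(private readonly config: ConfigService) {}

  get platformEmail(): string {
    const value = this.config.get<string>('PLATFORM_EMAIL');
    if (!value) throw new Error('PLATFORM_EMAIL is not set');
    return value;
  }

  get proxyUrl(): string | undefined {
    // Proxy credentials are optional in environments that do not use a proxy.
    return this.config.get<string>('PROXY_URL');
  }
}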

IAM Roles for Service Accounts (IRSA)

Pods accessed AWS resources using IAM roles attached to Kubernetes service accounts. This eliminated static AWS credentials and followed least-privilege principles.

Amazon CloudWatch

Logs from the scraper service were shipped to CloudWatch. This was critical for:

  • Debugging silent failures
  • Identifying where blocking occurred (login vs data fetch)
  • Understanding retry behavior over time

Scraping systems fail quietly. Centralized logging was essential.
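In spirit, the logging was structured JSON on stdout, tagged with a stage so login failures could be told apart from data-fetch failures. A minimal sketch, assuming pino as the logger, illustrative field names, and a cluster log agent forwarding stdout to CloudWatch:

import pino from 'pino';

// JSON logs on stdout are picked up by the cluster's log forwarder and
// land in CloudWatch; the stage/attempt fields are illustrative.
const logger = pino({ base: { service: 'scraper' } });

logger.info({ stage: 'login', attempt: 1 }, 'login succeeded');
logger.warn({ stage: 'detail-fetch', attempt: 3, status: 403 }, 'blocked, will retry');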


The Reality of Scraping Blocked Websites

In production scraping, parsing data is rarely the hard part.

Getting access is.

Modern platforms actively block automated traffic using:

  • Rate limiting
  • IP reputation checks
  • Blocking cloud datacenter IP ranges
  • Flagging abnormal login or navigation patterns

Without mitigation, even a well-written scraper may work once and then silently stop.

This is where proxies become unavoidable.


Proxy Strategy: Necessary but Expensive

For authenticated and high-value platforms, proxies are not about scale. They are about survival.

Without proxies:

  • Login attempts get blocked quickly
  • Sessions are invalidated
  • Requests return partial or empty responses

Residential or ISP-grade proxies improve:

  • IP reputation
  • Session stability
  • Retry success rates

The tradeoff is cost and operational complexity.


Use Proxies Only Where You Must

One key lesson was to treat proxies as an expensive resource, not the default path.

In this setup:

  • Proxies were mainly used for browser-based actions, especially login
  • Once a valid session was established, most data fetching was done via direct HTTP requests using session cookies

This approach:

  • Reduced proxy traffic and cost
  • Improved speed
  • Lowered exposure to unnecessary blocking

A Key Optimization: Browser for Login, HTTP for Data

After login, cookies were extracted from the browser context and reused for direct HTTP requests from Node.js.
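A minimal sketch of that handoff, assuming a placeholder listing endpoint and Node's built-in fetch (the real API paths and HTTP client differ):

import { BrowserContext } from 'playwright';

// Turn the browser's session cookies into a Cookie header for plain
// HTTP requests. The endpoint URL below is a placeholder.
export async function fetchListings(context: BrowserContext): Promise<unknown> {
  const cookies = await context.cookies();
  const cookieHeader = cookies.map((c) => `${c.name}=${c.value}`).join('; ');

  const res = await fetch('https://platform.example.com/api/listings?page=1', {
    headers: {
      cookie: cookieHeader,
      accept: 'application/json',
      // A realistic user agent helps the request blend in with browser traffic.
      'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    },
  });

  if (!res.ok) {
    throw new Error(`Listing fetch failed with status ${res.status}`);
  }
  return res.json();
}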

This allowed:

  • Fetching JSON listing data without rendering pages
  • Fetching HTML detail pages directly
  • Full control over headers, retries, and backoff logic

Benefits:

  • Much faster than rendering each page in Chromium
  • More stable than page-by-page navigation
  • Lower proxy and browser costs
  • Easier retry handling

The browser became an authentication tool, not a data-fetching bottleneck.


Single-Tenant Design (By Choice)

The scraper was intentionally designed as single-tenant:

  • One browser instance
  • One context and page
  • Shared login state

This optimizes for speed and simplicity.

If scaling to multi-tenancy, the next steps would be:

  • Introducing a job queue (for example, SQS)
  • Explicit concurrency limits
  • Separate browser contexts per job
  • Tighter resource control per pod

For this use case, simplicity was the right tradeoff.
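If that path were taken, per-job isolation could look roughly like this sketch: a fixed-size worker pool where each job gets its own browser context. Nothing here is part of the current implementation, and queue integration (for example, SQS polling) is omitted.

import { Browser } from 'playwright';

const MAX_CONCURRENT = 3;

// Hypothetical multi-tenant step: each job runs in its own browser context,
// and a fixed-size pool of workers drains the job list.
export async function processJobs(browser: Browser, urls: string[]): Promise<void> {
  const queue = [...urls];

  const worker = async () => {
    for (let url = queue.shift(); url !== undefined; url = queue.shift()) {
      const context = await browser.newContext();
      try {
        const page = await context.newPage();
        await page.goto(url);
        // ...per-tenant login and extraction would go here...
      } finally {
        await context.close();
      }
    }
  };

  await Promise.all(Array.from({ length: MAX_CONCURRENT }, worker));
}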


Kubernetes Resource Considerations

Chromium ran reliably without explicit CPU or memory tuning.

For higher load, best practices would include:

  • Explicit CPU and memory requests and limits
  • Increasing /dev/shm using emptyDir
  • Monitoring browser memory growth over time

These become critical as concurrency increases.


Scraping Is an Arms Race

There is no “set it and forget it” scraper.

Sites change:

  • Login flows
  • Required headers
  • JavaScript behavior
  • Bot detection rules

A scraper that works today can break tomorrow without any code changes.

Scraping should be treated as a system, not a script:

  • Expect failures
  • Build retries (a minimal backoff sketch follows this list)
  • Monitor success rates
  • Adapt proxy strategy over time
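As one concrete piece of that, a retry-with-backoff helper might look like this sketch; the attempt count and delays are illustrative starting points, not tuned values:

// Generic retry with exponential backoff and jitter.
export async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 1_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 500;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}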

Final Thoughts

Scraping modern web applications is less about HTML parsing and more about system design.

Using a headless browser only where it is truly required, combined with direct HTTP requests wherever possible, provides the best balance between reliability, speed, and cost.

If scraping is a core dependency, the real question is not can you scrape the data, but whether the data is valuable enough to justify the ongoing operational complexity.

That question should be answered early.

Top comments (1)

OnlineProxy

The most reliable EKS setup is two services: a stateless scraper and a remote browserless/chrome behind a ClusterIP. Split browser and scraper when you need isolation, independent scaling, or multi-tenancy; keep a single pod with local Chromium when volume is low, bursts are short, or you’re keeping ops light. The “browser-for-login, HTTP-for-data” play works great (persist storageState, lift cookies/headers, and fetch with undici/got), but it faceplants when tokens are device/IP-bound, fingerprint-tied, or data only rides in-browser GraphQL with shifty headers. Bake in graceful failover with feature flags, route-level fallbacks, circuit breakers, canary accounts, schema/DOM-change detectors, and auto-disable to a DLQ so you don’t cook your accounts.