Introduction
Recently, I worked on scraping an off-market real estate platform that required authentication, session handling, and dynamic rendering.
This was not a simple HTTP scraper.
The site is a single-page application, protected by login, session cookies, and active blocking mechanisms. To make it work reliably in production, I had to combine headless browser automation, Kubernetes, AWS-managed infrastructure, and a deliberate proxy strategy.
This post walks through the architecture, the key technical decisions, and the lessons learned for anyone building scrapers that need to survive real-world constraints.
High-Level Architecture
The system was split into two microservices running inside Kubernetes on AWS:
- Scraper Service (Node.js / NestJS)
  Responsible for:
  - Orchestrating login
  - Managing session state
  - Fetching listing and detail data
  - Parsing and normalizing results
- Chromium Service
  A dedicated headless Chromium instance running as a separate pod using the browserless/chrome image.
The scraper service connects to Chromium over CDP via WebSocket, exposed through an internal Kubernetes Service. This separation kept browser concerns isolated and made the system easier to reason about and evolve.
Why a Headless Browser Was Required
A headless browser was unavoidable for two reasons:
- The platform requires authentication and session cookies
- The site is a SPA that relies heavily on client-side rendering
Pure HTTP scraping could not reliably establish or maintain a valid session. Playwright was used to perform login, manage cookies and local storage, and act as the source of truth for the authentication state.
Why Playwright
Playwright turned out to be a strong fit for this use case:
- Reliable handling of modern SPAs
- Clean APIs for waiting on navigation and DOM readiness
- Stable session and cookie management
- Native support for connecting to a remote Chromium instance over CDP
Most importantly, it significantly reduced flakiness around login flows and dynamic rendering.
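As a rough sketch of what that login flow looked like: the URL, selectors, and environment variable below are placeholders rather than the platform's real values, and the browser is launched locally here for brevity, while production connected to the remote Chromium described in the next section.

```ts
import { chromium } from "playwright";

// Rough sketch of the login flow. LOGIN_URL and the selectors are
// placeholders, not the platform's real values. The browser is launched
// locally here for brevity; production connected to a remote Chromium.
async function login(username: string, password: string) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto(process.env.LOGIN_URL!);
  await page.fill('input[name="email"]', username);
  await page.fill('input[name="password"]', password);
  await page.click('button[type="submit"]');

  // Let the SPA finish its post-login navigation and XHR calls.
  await page.waitForLoadState("networkidle");

  // Capture cookies and local storage so the session can be reused later.
  const storageState = await context.storageState();
  await browser.close();
  return storageState;
}
```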
Connecting to Remote Chromium in Kubernetes
Chromium ran behind an internal Kubernetes Service.
The scraper connected using:
chromium.connectOverCDP("ws://<chromium-service>:<port>")
The host and port were injected via environment variables, which allowed:
- Clean separation between dev, staging, and production
- No hard-coded endpoints
- Easy switching between local and remote browser execution
This setup was stable and did not require reconnect logic or special handling.
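A minimal sketch of that connection, assuming the endpoint is assembled from environment variables (the variable names here are illustrative, not the project's exact ones):

```ts
import { chromium, Browser } from "playwright";

// The Chromium endpoint is injected via environment variables; the
// variable names here are illustrative, not the project's exact ones.
async function connectToChromium(): Promise<Browser> {
  const host = process.env.CHROMIUM_HOST ?? "localhost";
  const port = process.env.CHROMIUM_PORT ?? "3000";
  return chromium.connectOverCDP(`ws://${host}:${port}`);
}
```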
AWS Infrastructure Choices
Although this system was Kubernetes-first, AWS services played a key role in making it production-ready.
Amazon EKS
The entire system ran on Amazon EKS, which provided:
- A managed Kubernetes control plane
- Predictable networking and service discovery
- Clean separation between scraper and browser workloads
EKS made it straightforward to run a browser-heavy workload while still benefiting from Kubernetes primitives.
AWS Secrets Manager
All sensitive configuration, including:
- Platform credentials
- Proxy credentials
- Environment-specific secrets
was stored in AWS Secrets Manager and injected into pods via environment variables. This avoided hardcoding secrets into images or manifests and enabled clean separation between environments.
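Inside the scraper, those injected values are then read from the environment at startup. A minimal sketch of a config loader that fails fast when something is missing (the variable names are assumptions, not the project's real ones):

```ts
// Minimal config loader. The variable names are assumptions for
// illustration; the real values are injected from AWS Secrets Manager.
interface ScraperConfig {
  platformUser: string;
  platformPassword: string;
  proxyUrl?: string;
}

function loadConfig(): ScraperConfig {
  const required = (key: string): string => {
    const value = process.env[key];
    if (!value) throw new Error(`Missing required environment variable: ${key}`);
    return value;
  };

  return {
    platformUser: required("PLATFORM_USER"),
    platformPassword: required("PLATFORM_PASSWORD"),
    proxyUrl: process.env.PROXY_URL, // optional: only set where proxies are needed
  };
}
```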
IAM Roles for Service Accounts (IRSA)
Pods accessed AWS resources using IAM roles attached to Kubernetes service accounts. This eliminated static AWS credentials and followed least-privilege principles.
Amazon CloudWatch
Logs from the scraper service were shipped to CloudWatch. This was critical for:
- Debugging silent failures
- Identifying where blocking occurred (login vs data fetch)
- Understanding retry behavior over time
Scraping systems fail quietly. Centralized logging was essential.
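Since the logs end up in CloudWatch, tagging each entry with the stage it came from makes the "login vs data fetch" distinction easy to query. A small sketch using the NestJS Logger (the context names and messages are illustrative):

```ts
import { Logger } from "@nestjs/common";

// Per-stage loggers so CloudWatch queries can filter on the context,
// e.g. to separate login failures from data-fetch failures.
const loginLogger = new Logger("LoginFlow");
const fetchLogger = new Logger("DataFetch");

loginLogger.log("Login succeeded, session cookies stored");
fetchLogger.warn("Listing request returned 403, scheduling retry");
```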
The Reality of Scraping Blocked Websites
In production scraping, parsing data is rarely the hard part.
Getting access is.
Modern platforms actively block automated traffic using:
- Rate limiting
- IP reputation checks
- Blocking cloud datacenter IP ranges
- Flagging abnormal login or navigation patterns
Without mitigation, even a well-written scraper may work once and then silently stop.
This is where proxies become unavoidable.
Proxy Strategy: Necessary but Expensive
For authenticated and high-value platforms, proxies are not about scale. They are about survival.
Without proxies:
- Login attempts get blocked quickly
- Sessions are invalidated
- Requests return partial or empty responses
Residential or ISP-grade proxies improve:
- IP reputation
- Session stability
- Retry success rates
The tradeoff is cost and operational complexity.
Use Proxies Only Where You Must
One key lesson was to treat proxies as an expensive resource, not the default path.
In this setup:
- Proxies were mainly used for browser-based actions, especially login
- Once a valid session was established, most data fetching was done via direct HTTP requests using session cookies
This approach:
- Reduced proxy traffic and cost
- Improved speed
- Lowered exposure to unnecessary blocking
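A sketch of that split, assuming the proxy credentials arrive via environment variables. Note that when connecting to a remote browserless/chrome instance, the proxy would be configured on the browser pod rather than at launch time, so this local-launch form is only illustrative:

```ts
import { chromium } from "playwright";

// Sketch: only the login step goes through the (expensive) proxy.
// Environment variable names are illustrative. With a remote
// browserless/chrome instance, the proxy would be configured on the
// browser pod instead of at launch time.
async function launchLoginBrowser() {
  return chromium.launch({
    proxy: {
      server: process.env.PROXY_SERVER!, // e.g. "http://proxy-host:8000"
      username: process.env.PROXY_USERNAME,
      password: process.env.PROXY_PASSWORD,
    },
  });
}

// Post-login data fetching uses direct HTTP requests (no proxy),
// which is what keeps proxy traffic and cost low.
```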
A Key Optimization: Browser for Login, HTTP for Data
After login, cookies were extracted from the browser context and reused for direct HTTP requests from Node.js.
This allowed:
- Fetching JSON listing data without rendering pages
- Fetching HTML detail pages directly
- Full control over headers, retries, and backoff logic
Benefits:
- Much faster than rendering each page in Chromium
- More stable than page-by-page navigation
- Lower proxy and browser costs
- Easier retry handling
The browser became an authentication tool, not a data-fetching bottleneck.
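A condensed sketch of that handoff, assuming Node 18+ (for the global fetch); the URL parameter, header set, and retry counts are illustrative:

```ts
import type { BrowserContext } from "playwright";

// Turn the browser's session cookies into a Cookie header for plain HTTP requests.
async function cookieHeader(context: BrowserContext): Promise<string> {
  const cookies = await context.cookies();
  return cookies.map((c) => `${c.name}=${c.value}`).join("; ");
}

// Fetch JSON with simple exponential backoff. The URL and retry counts
// are illustrative; assumes Node 18+ so the global fetch is available.
async function fetchListings(cookie: string, url: string, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, {
      headers: {
        cookie,
        accept: "application/json",
      },
    });
    if (res.ok) return res.json();

    if (attempt < maxRetries) {
      // Back off before retrying: 1s, 2s, 4s, ...
      await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt));
    }
  }
  throw new Error(`Failed to fetch listings after ${maxRetries + 1} attempts`);
}
```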
Single-Tenant Design (By Choice)
The scraper was intentionally designed as single-tenant:
- One browser instance
- One context and page
- Shared login state
This design optimized for speed and simplicity.
If scaling to multi-tenancy, the next steps would be:
- Introducing a job queue (for example, SQS)
- Explicit concurrency limits
- Separate browser contexts per job
- Tighter resource control per pod
For this use case, simplicity was the right tradeoff.
Kubernetes Resource Considerations
Chromium ran reliably without explicit CPU or memory tuning.
For higher load, best practices would include:
- Explicit CPU and memory requests and limits
- Increasing /dev/shm using an emptyDir volume
- Monitoring browser memory growth over time
These become critical as concurrency increases.
Scraping Is an Arms Race
There is no “set it and forget it” scraper.
Sites change:
- Login flows
- Required headers
- JavaScript behavior
- Bot detection rules
A scraper that works today can break tomorrow without any code changes.
Scraping should be treated as a system, not a script:
- Expect failures
- Build retries
- Monitor success rates
- Adapt proxy strategy over time
Final Thoughts
Scraping modern web applications is less about HTML parsing and more about system design.
Using a headless browser only where it is truly required, combined with direct HTTP requests wherever possible, provides the best balance between reliability, speed, and cost.
If scraping is a core dependency, the real question is not whether you can scrape the data, but whether the data is valuable enough to justify the ongoing operational complexity.
That question should be answered early.
Top comments (1)
The most reliable EKS setup is two services: a stateless scraper and a remote browserless/chrome behind a ClusterIP. Split browser and scraper when you need isolation, independent scaling, or multi-tenancy; keep a single pod with local Chromium when volume is low, bursts are short, or you're keeping ops light. The “browser-for-login, HTTP-for-data” play works great (persist storageState, lift cookies/headers, and fetch with undici/got), but it faceplants when tokens are device/IP-bound, fingerprint-tied, or the data only rides in-browser GraphQL with shifty headers. Bake in graceful failover with feature flags, route-level fallbacks, circuit breakers, canary accounts, schema/DOM-change detectors, and auto-disable to a DLQ so you don’t cook your accounts.