Running Headless Chrome at Scale: Production Lessons from Millions of Renders
Anyone who has crawled websites at scale knows that a simple HTML fetch is not enough. Too many websites client-side render their content, which means the HTML you get from a plain HTTP request is an empty shell. To get the actual content, you need a browser. In practice, that means headless Chrome.
Running one instance of headless Chrome is straightforward. Running it at scale - processing millions or tens of millions of pages on a regular basis - is a different problem entirely.
At EdgeComet, we run hundreds of Chrome instances across our servers. We've spent years fine-tuning and managing Chrome page rendering at scale. This article covers managing Chrome itself; parsing and storing the extracted data are out of scope. The focus here is on the hard part: keeping Chrome alive, responsive, and fast when you push it far beyond what it was designed for.
Do You Even Need Chrome?
If you can avoid headless Chrome, avoid it. Each instance consumes 500+ MB of RAM and turns a 1-2 second HTTP fetch into a 5-10 second rendering job. At scale, that difference is the difference between one server and ten.
Before sending a URL to Chrome, fetch the page with a plain HTTP client. Check the text-to-HTML ratio. Look for <div id="root"> or <div id="app"> with no content inside. If the raw HTML already has the content you need, skip Chrome entirely.
Build an adaptive pipeline: HTTP-first, headless Chrome as a fallback. A large portion of the web still serves server-side HTML. Routing those pages around Chrome saves real money and throughput.
Choosing Your Automation Tool
Most teams pick between Puppeteer, Playwright, or a language-specific library. Go is a popular choice for building crawlers and scrapers that operate at scale - its concurrency model and low memory footprint per goroutine make it well suited for managing dozens of Chrome instances in parallel. We chose Go for EdgeComet and use chromedp, which gives us direct access to Chrome DevTools Protocol. That said, the principles in this article apply regardless of which library you use to talk to Chrome.
Anti-Bot Protections
Cloudflare, Imperva, DataDome, Fastly - they all detect headless Chrome, and they get better at it every year. These systems operate at the application layer: browser fingerprinting, JavaScript challenges, behavioral analysis. Launching Chrome in headless mode does not get past them.
Detection works on multiple levels. The simplest check: navigator.webdriver is set to true in automation-controlled browsers. Beyond that - canvas fingerprinting, plugin enumeration, behavioral analysis (rapid requests, no mouse movements, uniform timing). A real user does not behave like a for-loop.
For moderate protection levels, stealth plugins help. The puppeteer-extra-plugin-stealth package overrides the most common detection vectors. Adding --disable-blink-features=AutomationControlled to your Chrome flags hides the automation flag. Rotating user agents and residential proxies add another layer.
For aggressive protections (Cloudflare's managed challenge, Imperva Advanced), you have two realistic options. Tools like camoufox (a patched Firefox build) or Scrapling with its stealth fetcher are designed specifically for this. Or you go to specialized scraping providers who maintain the cat-and-mouse game as their core business. If your task is to scrape a couple million pages behind heavy protection, buying an API from a provider is cheaper, faster, and more reliable than building your own evasion stack.
If you need to do this at scale on your own - good luck with that journey. It's a full-time job.
Tabs vs Windows
Headless Chrome, like a regular browser, has windows and tabs. The obvious approach: spawn one browser, open multiple tabs, process pages in parallel. Less memory, faster startup.
It does not work reliably at scale. One tab with a JavaScript memory leak or an infinite loop affects every other tab in the same browser process. One bad page takes down all your active renders.
One-tab-per-browser is the right model. Each Chrome instance gets its own process tree, its own memory space, its own DOM.
At EdgeComet, each Chrome instance handles exactly one tab at a time. We run a pool of instances (typically 15-25 per server) and acquire/release them from a FIFO queue.
In Go, a buffered channel is a natural fit for this. Each Chrome instance gets an integer ID. At startup, all IDs are pushed into the channel. Acquiring an instance blocks until one is available; releasing pushes the ID back:
```go
type ChromePool struct {
	instances []*ChromeInstance
	queue     chan int // FIFO queue of available instance IDs
}

// Initialization: create instances and push IDs into the queue
pool.instances = make([]*ChromeInstance, poolSize)
pool.queue = make(chan int, poolSize)
for i := 0; i < poolSize; i++ {
	instance, err := NewChromeInstance(i, config)
	if err != nil {
		return nil, err
	}
	pool.instances[i] = instance
	pool.queue <- i
}

// Acquire: blocks until an instance is available
func (p *ChromePool) AcquireChrome() (*ChromeInstance, error) {
	select {
	case <-p.ctx.Done():
		return nil, ErrPoolShutdown
	case instanceID := <-p.queue:
		instance := p.instances[instanceID]
		// ... health checks, restart policies ...
		return instance, nil
	}
}

// Release: return instance to the queue
func (p *ChromePool) ReleaseChrome(instance *ChromeInstance) {
	instance.IncrementRequests()
	p.queue <- instance.ID
}
```
When a render request comes in, we take an instance from the pool and first navigate it to a warm-up URL - a simple HTML page like example.com. We wait for that page to load and confirm the tab is alive and responsive. Only after that do we navigate to the actual target URL.
This guarantees that the Chrome instance is not stalled before we commit a real render task to it. If the warm-up page fails to load, we kill that instance and grab another one from the pool. After the render completes, the tab is closed and the instance returns to the queue. No state leaks between renders. No bad page poisoning the next ten.
The overhead of running separate browser processes is real - over 200 MB extra per instance compared to tabs in a shared browser. But the reliability gain is worth every megabyte.
Chrome Reliability
Chrome and Chromium are mature projects with a huge codebase whose lineage runs back through KHTML and KDE more than 25 years. With that user base and the companies backing its development, you would expect it to be rock-solid for any use case.
It is not. Chrome was designed for interactive desktop browsing, not for processing thousands of pages sequentially in a headless loop. Memory accumulates. Handles leak. Internal caches grow. After a few hundred pages, performance degrades. After a few thousand, it stalls or crashes.
A stalled or dead Chrome instance must be a normal flow in your application, not an exception. Chrome dying is expected behavior.
What we do at EdgeComet:
Restart policies. We restart each Chrome instance every 100 renders or every 60 minutes, whichever comes first. The restart happens between renders, not during one. The instance finishes its current page, gets pulled from the pool, killed, and a fresh instance takes its place.
Health checks. Before assigning a page to a Chrome instance, ping it. Call browser.getVersion() (or equivalent in your library) with a short timeout. If Chrome doesn't respond within 5 seconds, it's dead. Don't try to recover it - kill the process and start a new one.
Both checks happen during pool acquire, before any work is assigned:
```go
// ShouldRestart determines if the instance needs to be restarted based on policies
func (ci *ChromeInstance) ShouldRestart(config *Config) bool {
	if int(atomic.LoadInt32(&ci.requestsDone)) >= config.RestartAfterCount {
		return true
	}
	if ci.Age() >= config.RestartAfterTime {
		return true
	}
	return false
}

// IsAlive checks if the Chrome instance is still responsive
func (ci *ChromeInstance) IsAlive() bool {
	if ci.status == ChromeStatusDead {
		return false
	}
	// Try to get browser version as a health check
	ctx, cancel := context.WithTimeout(ci.ctx, 5*time.Second)
	defer cancel()
	err := chromedp.Run(ctx, chromedp.ActionFunc(func(ctx context.Context) error {
		_, _, _, _, _, err := browser.GetVersion().Do(ctx)
		return err
	}))
	return err == nil
}
```
Our defaults: RestartAfterCount: 100, RestartAfterTime: 60 * time.Minute. Both IsAlive() and ShouldRestart() run during AcquireChrome() - if either triggers, the instance is killed and recreated before the caller gets it.
Graceful degradation. Stalled instances happen. A page might trigger an infinite JavaScript loop or a Chrome bug that hangs the process. Set hard timeouts on every render operation and handle timeouts as a normal flow, not a crash.
Handling JavaScript: The Page Readiness Problem
This is the hardest part of the entire process: there is no single event that tells you "this page is done rendering."
With client-side rendering, you load a shell, JavaScript fetches data from APIs, components mount, more data loads, more components render. There is no reliable "done" signal.
Chrome exposes lifecycle events through the DevTools Protocol. The two most useful ones:
networkIdle - fires when there have been zero network connections for 500 milliseconds. This is the safer choice for unknown pages. It waits until all API calls, lazy-loaded resources, and async operations settle.
networkAlmostIdle - fires when there are two or fewer connections active for 500 milliseconds. Faster than networkIdle, but riskier. Some pages maintain persistent connections (WebSockets, long-polling, analytics heartbeats) that prevent networkIdle from ever firing. networkAlmostIdle handles those cases.
If you know your target sites, test both and pick the one that works. If you're crawling a random set of thousands of sites, go with networkIdle.
Here is how we implement this. The key design decision: the timeout is soft. If the lifecycle event never fires, we mark the page as timed out but still extract whatever HTML is in the DOM. A partial render is better than no render.
```go
// navigateAndWait navigates to URL and waits for the specified lifecycle event.
// Uses soft timeout - if wait exceeds timeout, it continues with HTML extraction.
func (ci *ChromeInstance) navigateAndWait(ctx context.Context, req *RenderRequest, metrics *PageMetrics) error {
	// page.Navigate returns the frame ID, loader ID, and any navigation error text
	frameID, loaderID, _, err := page.Navigate(req.URL).Do(ctx)
	if err != nil {
		return err
	}
	err = waitForEvent(ctx, req.WaitFor, string(frameID), string(loaderID), req.Timeout, metrics)
	if errors.Is(err, ErrWaitTimeout) {
		metrics.TimedOut = true // Mark but don't fail
	} else if err != nil {
		return err
	}
	// Extra wait after the event fires to catch deferred JS execution
	if req.ExtraWait > 0 && !metrics.TimedOut {
		time.Sleep(req.ExtraWait)
	}
	return nil
}

// waitForEvent listens for Chrome lifecycle events, matching on frameID and loaderID
// to track the correct navigation (not a previous or redirected page).
func waitForEvent(ctx context.Context, eventName, frameID, loaderID string,
	timeout time.Duration, metrics *PageMetrics) error {
	ch := make(chan struct{})
	listenerCtx, cancel := context.WithCancel(ctx)
	defer cancel()
	chromedp.ListenTarget(listenerCtx, func(ev interface{}) {
		if e, ok := ev.(*page.EventLifecycleEvent); ok {
			if string(e.FrameID) == frameID && string(e.LoaderID) == loaderID {
				if e.Name == eventName {
					cancel() // stop listening before closing to avoid a double close
					close(ch)
				}
			}
		}
	})
	select {
	case <-ch:
		return nil
	case <-time.After(timeout):
		return ErrWaitTimeout
	}
}
```
One more problem: even after the network settles, some JavaScript executes on timers. A framework might defer rendering by 100ms, or a component might animate into view. We add a configurable extra wait after the lifecycle event fires - usually 1-2 seconds - to catch these late executions.
Set a hard timeout on top of everything. 10-15 seconds is a reasonable ceiling. If a page hasn't finished by then, grab whatever HTML is in the DOM and move on. A partial render is better than a stuck worker.
Chrome Flags and Resource Blocking
Tuning Chrome's launch flags gives you a 30-50% performance improvement for free.
The essential flags for any headless crawling setup:
- `--headless` - no GUI
- `--no-sandbox` - disables Chrome's process sandbox (see warning below)
- `--disable-dev-shm-usage` - prevents crashes when shared memory is limited
- `--disable-gpu` - no GPU in server environments
- `--disable-extensions` - no extensions needed
- `--disable-background-networking` - stops background update checks and telemetry
- `--no-first-run` - skips the first-run wizard
- `--mute-audio` - no audio processing
- `--disable-sync` - no account sync
- `--disable-translate` - no translation popups
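For reference, the same flag set expressed as a slice you could hand to your launcher or allocator (the remote debugging port is an illustrative assumption - libraries like chromedp manage their own connection setup):

```go
package main

// headlessFlags returns the flag set above as launcher arguments.
// The debugging port is an assumption for a hand-rolled launcher;
// adjust to however your automation library attaches to Chrome.
func headlessFlags() []string {
	return []string{
		"--headless",
		"--no-sandbox",
		"--disable-dev-shm-usage",
		"--disable-gpu",
		"--disable-extensions",
		"--disable-background-networking",
		"--no-first-run",
		"--mute-audio",
		"--disable-sync",
		"--disable-translate",
		"--remote-debugging-port=9222", // so a CDP client can attach
	}
}
```

A hand-rolled launch might then look like `exec.Command("chromium", append(headlessFlags(), "about:blank")...)`.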
A word on --no-sandbox. Chrome's security works in two layers. The first layer is V8 and Blink - the browser engine restricts what web JavaScript can do (no filesystem access, no arbitrary syscalls). The second layer is the OS-level sandbox (user namespaces, seccomp-bpf) - it restricts what the renderer process itself can do at the operating system level. --no-sandbox disables the second layer.
Regular JavaScript on a page still cannot access your filesystem - that is enforced by V8. But if an attacker exploits a vulnerability in Chrome's renderer (memory corruption, V8 bug), the OS-level sandbox would normally contain the damage. Without it, the exploit gets the full OS permissions of the Chrome process. In practice, nobody is burning a Chrome zero-day on a random scraper - these exploits are rare and valuable.
The more realistic threats when running without sandbox are resource exhaustion, crypto miners running inside your Chrome instances, or pages making outbound requests from your IP. Still, if you run Chrome on bare metal without containers, be aware that you have no OS-level isolation between Chrome and your machine.
The reason --no-sandbox is so common: Chrome's OS-level sandbox relies on Linux kernel features that are often unavailable inside containers. Without this flag, Chrome will not start. This is acceptable when the container itself acts as the isolation boundary.
Resource blocking is the other big win. If you're extracting text content, you don't need images, fonts, CSS, or video. Intercept requests and abort anything that isn't the document, XHR, or fetch. This cuts page load time by 20% or more and saves significant bandwidth.
Go further: block known third-party scripts by URL pattern. Google Analytics, Facebook SDK, Hotjar, ad networks - none of these contribute to the content you're after, and they add seconds to page load. We maintain a blocklist of 30+ URL patterns for common third-party services that gets applied to every render:
```go
var globalBlockedPatterns = []string{
	"*google-analytics.com*", "*googletagmanager.com*",
	"*googleadservices.com*", "*googlesyndication.com*",
	"*doubleclick.net*", "*facebook.com*",
	"*hotjar.com*", "*clarity.ms*",
	"*twitter.com*", "*youtube.com*",
	"*ampproject.org*", "*gstatic.com*",
	"*typekit.net*", "*static.cloudflareinsights.com*",
	// ... and more
}
```
The blocking works through Chrome's Fetch domain. We intercept every network request before it leaves the browser, check it against compiled patterns and blocked resource types, and abort matches:
```go
// Inside the fetch event handler - runs for every network request
blockedByURL := blocklist.IsBlocked(event.Request.URL)
blockedByResourceType := blocklist.IsResourceTypeBlocked(string(event.ResourceType))
if blockedByURL || blockedByResourceType {
	fetch.FailRequest(event.RequestID, network.ErrorReasonAborted).Do(ctx)
} else {
	fetch.ContinueRequest(event.RequestID).Do(ctx)
}
```
Users can add custom patterns per request on top of the global list, and block entire resource types (Image, Font, Media) when they only need text content.
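A minimal version of the `IsBlocked`/`IsResourceTypeBlocked` pair could look like this, assuming the simple `*substring*` pattern shape shown above (full glob syntax would need a real matcher):

```go
package main

import "strings"

// Blocklist matches URLs against patterns of the form "*substring*"
// and checks resource types against a fixed set. This is a sketch,
// not the production implementation.
type Blocklist struct {
	substrings []string
	types      map[string]bool
}

func NewBlocklist(patterns, blockedTypes []string) *Blocklist {
	bl := &Blocklist{types: map[string]bool{}}
	for _, p := range patterns {
		// "*example.com*" becomes a plain substring match on "example.com"
		bl.substrings = append(bl.substrings, strings.Trim(p, "*"))
	}
	for _, t := range blockedTypes {
		bl.types[t] = true
	}
	return bl
}

func (bl *Blocklist) IsBlocked(url string) bool {
	for _, s := range bl.substrings {
		if strings.Contains(url, s) {
			return true
		}
	}
	return false
}

func (bl *Blocklist) IsResourceTypeBlocked(resourceType string) bool {
	return bl.types[resourceType]
}
```

Precomputing the trimmed substrings once at startup keeps the per-request check to a handful of `strings.Contains` calls.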
System Requirements
Chrome is resource-hungry, but you do not need a server with 40 CPU cores and 1 TB of RAM. The sweet spot: 8 to 16 virtual cores with 16 to 32 GB of RAM. Each server handles 15 to 25 Chrome instances depending on page complexity.
A simple formula for pool sizing: take your available RAM, subtract 2 GB for the OS and your application, and divide by 500 MB per Chrome instance. On a 16 GB server, that gives you about 28 instances in theory, but in practice we cap it at 20-25 to leave headroom for traffic spikes.
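The formula translates directly into code; the 2 GB reserve, 500 MB per-instance budget, and the cap are the numbers above:

```go
package main

// poolSize applies the sizing formula: reserve RAM for the OS and the
// application, divide the rest by a per-instance budget, then cap the
// result to leave headroom for traffic spikes.
func poolSize(totalRAMMB, maxInstances int) int {
	const reserveMB = 2048    // OS + application overhead
	const perInstanceMB = 500 // typical headless Chrome footprint
	n := (totalRAMMB - reserveMB) / perInstanceMB
	if n < 1 {
		return 1
	}
	if n > maxInstances {
		return maxInstances
	}
	return n
}
```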
Dedicated servers vs cloud VMs. Dedicated always wins for Chrome workloads. Chrome needs direct access to CPU and memory without the overhead of hypervisor scheduling. A dedicated server with a Ryzen 5 3600 (6 cores, 12 threads @ 3.6 GHz) and 64 GB RAM costs about $50/month. For the same money on a cloud provider, you get about half the resources with noisy neighbors on the physical server.
Scaling horizontally, not vertically. Many smaller servers beat one big server. Chrome doesn't scale linearly on a single machine - contention on shared resources (CPU cache, memory bus, I/O) creates diminishing returns past 20-25 instances. Three servers with 20 instances each outperform one server with 60 instances.
Architecture
The crawling architecture follows a tiered approach:
Layer 1: HTTP fetch. Plain HTTP request with a smart client. Check the response for client-side rendering signals. If the page has real content, you're done. This handles 30-50% of the web depending on your target list.
Layer 2: Headless Chrome render. Pages that need JavaScript execution go through your Chrome pool. Standard flags, resource blocking, page readiness detection, timeout handling.
Layer 3: Stealth render. Pages behind anti-bot protections get routed through stealth-configured browsers with proxy rotation. This is the most expensive tier - use it only when the first two fail.
The Chrome pool itself needs proper lifecycle management. Instances live in a queue. Workers acquire an instance, use it, and release it back. Track render counts and uptime per instance to enforce restart policies. Monitor active instances, queue depth, and failure rates.
If you're running across multiple servers, you need coordination. We use Redis for service registry - each server publishes its capacity and current load with a heartbeat every few seconds. The gateway routes requests to servers with available capacity. This prevents any single server from getting overwhelmed while others sit idle.
EdgeComet Open Source Engine
If you are building your own crawling or rendering infrastructure based on headless Chrome, feel free to use our open-source render engine as a reference or as a starting point for your own implementation.
The engine implements Chrome pool management, lifecycle event tracking, restart policies, resource blocking with URL pattern matching, and health checks. The core render implementation is in renderer.go. It has been battle-tested processing millions of pages every day. The codebase is written in Go, but the architecture and Chrome management patterns translate directly to any language. If you find it useful, give it a star - it helps other developers discover the project.
Conclusion
There is no universal solution here. The implementation depends entirely on your task. Crawling a million pages from one website is a fundamentally different problem than crawling one page from a million websites. The concurrency model, timeout strategy, restart policies, and even the Chrome flags you choose will vary based on what you're actually doing.
Whatever the task is, invest in logging and tracing early. Chrome will misbehave - that's a given. Detailed logs of lifecycle events, network activity, timeout triggers, and instance health make the difference between spending ten minutes on a rendering issue and spending a day. When a specific website breaks your pipeline, you need to see exactly what Chrome did, what events fired, and where it stalled. Without that visibility, you're debugging blind.
Automated tests are equally important. Start with basic scenarios: does a fully server-side page render correctly? Does a heavy SPA with lazy loading return complete content? Does your pipeline recover when Chrome stalls? Then, as production reveals issues with specific websites, add targeted tests for those cases. This matters more than people expect - fixing a rendering issue for one type of website can break functionality for another. A change to your timeout logic that fixes infinite-scroll pages might cause early termination on sites with slow API responses. Regression tests on real-world URLs catch these tradeoffs before your users do.
Running headless Chrome at scale is not a solved problem you configure once and forget. It requires ongoing tuning, monitoring, and adapting. But with the right architecture - adaptive routing, isolated instances, aggressive restart policies, and proper observability - it becomes manageable.