Three memory-leak patterns in long-running scrapers (and how I caught them after 968 Trustpilot runs)

#webscraping #python #ai #apify

Memory leaks in scrapers rarely crash the run. Instead, they quietly push you to raise the Apify memory limit from 1 GB to 2 GB to 4 GB, doubling the per-run cost at each step, and only get spotted weeks later on a compute-unit invoice.

After 968 Trustpilot runs (~80–300 review pages each, ~150k page hits cumulative), I started sampling RSS every 1,000 pages. The growth pattern told a different story than the logs. Below are the three patterns that account for ~90% of the leaks I have seen across my 32 published Apify actors.

1. The unbounded asyncio queue

The most common pattern. A producer coroutine fetches URLs faster than the consumer parses them, so the in-memory queue grows linearly with runtime.

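# leaks at high concurrency
import asyncio

queue = asyncio.Queue()  # no maxsize → unbounded

async def producer():
    async for url in source:      # source: any async iterable of URLs
        await queue.put(url)      # never blocks on an unbounded queue

async def consumer():
    while True:
        url = await queue.get()
        await process(url)        # process: the slow fetch-and-parse step
        queue.task_done()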

If process() is slower than source (which is true for most JS-rendered sites), the queue accumulates. On a Trustpilot run that fetched a company with 12,000 reviews, the queue held ~9,500 URLs at peak — about 380 MB of bytestrings.

Fix:

queue = asyncio.Queue(maxsize=200)  # producer blocks at 200

A bounded queue forces the producer to wait. Memory stays flat; throughput drops by less than 5% because the consumer never sat idle anyway — the bottleneck was the network, not the queue.
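A minimal, self-contained sketch of the backpressure effect, under stated assumptions: the URLs are placeholders, `asyncio.sleep(0)` stands in for a slower `process(url)`, and the `run` helper and its parameters are hypothetical names used only for this illustration. It records the peak queue depth with and without `maxsize`.

```python
import asyncio


async def run(maxsize: int, n_urls: int = 500) -> int:
    """Return the peak queue depth seen while draining n_urls items."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    peak = 0

    async def producer() -> None:
        nonlocal peak
        for i in range(n_urls):
            # With maxsize > 0 this suspends whenever the queue is full.
            await queue.put(f"https://example.com/page/{i}")
            peak = max(peak, queue.qsize())

    async def consumer() -> None:
        while True:
            await queue.get()
            await asyncio.sleep(0)  # stand-in for a slower process(url)
            queue.task_done()

    worker = asyncio.create_task(consumer())
    await producer()
    await queue.join()  # wait until every queued URL has been processed
    worker.cancel()
    return peak


unbounded_peak = asyncio.run(run(maxsize=0))   # maxsize=0 means "no limit"
bounded_peak = asyncio.run(run(maxsize=20))
print(f"unbounded peak: {unbounded_peak}, bounded peak: {bounded_peak}")
```

In the unbounded case the producer never yields, so the whole backlog sits in memory before the consumer gets a turn. With `maxsize=20` the depth cannot exceed 20, whatever the input size.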

2. Per-URL dynamic `re` patterns defeating Python's regex cache

CPython has a built-in regex cache at re._cache (default 512 entries). The naive assumption is that re.search(pattern, text) is cached and cheap on the second call.

It is, as long as the same pattern string comes back. But the moment you build the pattern dynamically per URL, every distinct string is a cache miss and a new compiled object:

# leaks at scale
import re

def extract_review_id(html: str, slug: str):
    pat = rf"review-{slug}-(\d+)"          # new pattern string per slug
    m = re.search(pat, html)               # cache miss → fresh compile
    return m.group(1) if m else None

Each unique slug (company name) puts a new entry in re._cache. The cap is 512, so it does not "leak" forever: once the cache is full, older entries are evicted (Python 3.7+ drops the oldest entry; earlier versions cleared the whole dict). But every new slug still pays for a full compile, and up to 512 compiled patterns stay resident while they churn. Across a long run with thousands of distinct slugs, RSS would climb 3–5 MB per 1,000 pages.
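The cache behaviour can be observed without touching private attributes. This is a sketch that relies on CPython's `re` module returning the same compiled object on a cache hit; the `review-slug-N` patterns are made-up placeholders, and 600 is simply "more than 512".

```python
import re

re.purge()  # start from an empty cache so the demo is repeatable

static = r"review-([a-z0-9-]+)-(\d+)"
first = re.compile(static)
# Same pattern string again: served from the cache, same object comes back.
cache_hit = re.compile(static) is first

dynamic_first = re.compile(r"review-slug-0-(\d+)")
# 600 further distinct slugs, more than the 512-entry cache can hold.
for i in range(1, 601):
    re.search(rf"review-slug-{i}-(\d+)", "")
# The early entry has been evicted, so the same string is compiled anew.
evicted = re.compile(r"review-slug-0-(\d+)") is not dynamic_first

print(f"cache hit on repeated string: {cache_hit}")
print(f"early dynamic pattern evicted: {evicted}")
```

A run with more than 512 distinct slugs therefore gets no benefit from the cache for the slugs it revisits later.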

Fix: lift the dynamic part out and use a single static pattern with a group capture:

import re

REVIEW_ID = re.compile(r"review-([a-z0-9-]+)-(\d+)")  # compiled once, at import time

def extract_review_id(html: str, slug: str):
    for m in REVIEW_ID.finditer(html):
        if m.group(1) == slug:
            return m.group(2)
    return None

One compiled pattern, one cache slot. RSS curve flattens.
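Before swapping the function in, it is worth checking that the static pattern returns the same IDs as the dynamic one. A small sketch follows; the HTML snippet and slugs are invented for illustration, and the dynamic variant adds `re.escape`, which the original snippet did not use.

```python
import re

REVIEW_ID = re.compile(r"review-([a-z0-9-]+)-(\d+)")


def extract_static(html: str, slug: str):
    """Single module-level pattern; the slug is compared in Python."""
    for m in REVIEW_ID.finditer(html):
        if m.group(1) == slug:
            return m.group(2)
    return None


def extract_dynamic(html: str, slug: str):
    """Per-slug pattern, kept only as the reference implementation."""
    m = re.search(rf"review-{re.escape(slug)}-(\d+)", html)
    return m.group(1) if m else None


# Hypothetical markup, not real Trustpilot HTML.
html = (
    '<div id="review-acme-corp-12345"></div>'
    '<div id="review-other-shop-777"></div>'
)

for slug in ("acme-corp", "other-shop", "missing-co"):
    print(slug, extract_static(html, slug), extract_dynamic(html, slug))
```

The two agree on these inputs. The greedy `[a-z0-9-]+` group backtracks until the final `-(\d+)` can match, which is what lets slugs that contain hyphens work.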

3. BeautifulSoup soup retention in long-lived lists

This one is sneaky. The code looks correct:

results = []
for url in urls:
    html = await fetch(url)
    soup = BeautifulSoup(html, "lxml")
    results.append({
        "title": soup.title.string,
        "body": soup.select_one("article").get_text(),
    })

The dictionaries look small: a few hundred bytes of text each. But soup.title.string is not a plain Python string. It is a NavigableString, a str subclass that keeps a reference to its parent tag and, through it, to the whole soup. (get_text() does return a plain str, so the body field is not the culprit.) As long as those NavigableString objects live in results, the entire parse tree (often 200 KB–2 MB per page) stays in memory.

After 2,500 review pages, my Trustpilot worker had ~3 GB RSS, almost all of it old soup trees being kept alive by .string references in results.

Fix: coerce to plain str at the point of extraction:

title = soup.title.string if soup.title else None
article = soup.select_one("article")
results.append({
    # str() drops the NavigableString's link back to the tree
    "title": str(title) if title is not None else None,
    # get_text() already returns a plain str
    "body": article.get_text() if article else "",
})
# soup is rebound on the next iteration → the old tree can be freed

str(...) copies the text into a fresh plain Python string with no back-reference. The old soup becomes garbage as soon as the loop rebinds the name; because the tree is full of parent/child reference cycles, it is the cyclic garbage collector that reclaims it. RSS on the same 2,500-page run dropped from 3 GB to 410 MB.
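The retention mechanism can be reproduced with the standard library alone. The sketch below does not use BeautifulSoup: `Tree` and `LinkedStr` are hypothetical stand-ins for a parse tree and for a str subclass that carries a parent reference, and a weak reference shows whether the tree survives.

```python
import gc
import weakref


class Tree:
    """Stand-in for a parsed document holding a large payload."""

    def __init__(self) -> None:
        self.payload = bytearray(1_000_000)
        self.children = []


class LinkedStr(str):
    """Stand-in for a tree-aware string: a str subclass that knows its parent."""

    parent = None


def extract(copy: bool):
    tree = Tree()
    node = LinkedStr("Great service")
    node.parent = tree            # back-reference, as in a parse tree
    tree.children.append(node)
    tree_ref = weakref.ref(tree)  # lets us observe the tree without keeping it alive
    kept = str(node) if copy else node
    return kept, tree_ref         # the local names tree and node end here


kept_linked, linked_ref = extract(copy=False)
kept_plain, plain_ref = extract(copy=True)
gc.collect()  # tree and node form a reference cycle

print("tree alive when the linked string is kept:", linked_ref() is not None)
print("tree freed when a plain str copy is kept:", plain_ref() is None)
print("copy is an exact str:", type(kept_plain) is str)
```

Keeping the subclass instance pins the megabyte payload. Keeping the `str(...)` copy does not.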

Detection: RSS sampling every 1,000 pages

I added a 12-line probe to every scraper:

import os, psutil, time
_proc = psutil.Process(os.getpid())
_samples = []

def rss_tick(page_count: int):
    if page_count % 1000 == 0:
        rss_mb = _proc.memory_info().rss / 1024 / 1024
        _samples.append((page_count, rss_mb, time.time()))
        if len(_samples) >= 3:
            first, last = _samples[-3], _samples[-1]
            growth_per_1k = (last[1] - first[1]) / ((last[0] - first[0]) / 1000)
            if growth_per_1k > 50:  # >50 MB per 1,000 pages
                print(f"LEAK ALERT: +{growth_per_1k:.1f} MB/1k pages")

The threshold of 50 MB per 1,000 pages is conservative — anything above 20 MB on a steady-state run is worth investigating. The output gets piped to Apify's dataset, so I can grep across runs.
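To make the slope arithmetic concrete, here is the same calculation pulled into a pure function and run on made-up samples. The `growth_per_1k` helper, its `window` parameter and the RSS figures are all hypothetical and are not measurements from the runs above.

```python
def growth_per_1k(samples, window=3):
    """RSS growth in MB per 1,000 pages across the last `window` samples.

    Each sample is a (page_count, rss_mb) pair.
    """
    first, last = samples[-window], samples[-1]
    pages = last[0] - first[0]
    return (last[1] - first[1]) / (pages / 1000)


# Invented readings: RSS in MB at 1,000, 2,000 and 3,000 pages.
leaky = [(1000, 400.0), (2000, 460.0), (3000, 530.0)]
steady = [(1000, 400.0), (2000, 405.0), (3000, 412.0)]

leaky_slope = growth_per_1k(leaky)    # (530 - 400) MB over 2,000 pages
steady_slope = growth_per_1k(steady)  # (412 - 400) MB over 2,000 pages

for name, slope in (("leaky", leaky_slope), ("steady", steady_slope)):
    verdict = "LEAK ALERT" if slope > 50 else "ok"
    print(f"{name}: +{slope:.1f} MB/1k pages -> {verdict}")
```

Measuring across three samples rather than two smooths out a single noisy reading, at the price of reacting one sample later.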

The cost angle nobody mentions

Memory leaks rarely crash a scraper. What they do is force you to bump the actor's Memory configuration:

1 GB → 2 GB: doubles compute-unit consumption per second
2 GB → 4 GB: quadruples it vs the 1 GB baseline

Apify bills compute units in proportion to memory (one CU is 1 GB of memory for one hour), so a 4 GB run costs ~4× a properly tuned 1 GB run for the same wall-clock time, whatever the per-CU price on your plan. Across 968 Trustpilot runs that would have been an extra ~$120/year for nothing: pure operational waste because nobody profiled RSS.
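The multiplier follows directly from how compute units are counted. A back-of-the-envelope sketch, assuming one compute unit equals 1 GB of memory for one hour; the 600-second run length is a hypothetical figure, and no price is assumed because the ratio does not depend on it.

```python
def compute_units(memory_gb: float, run_seconds: float) -> float:
    """Compute units for one run, assuming 1 CU = 1 GB of memory for 1 hour."""
    return memory_gb * run_seconds / 3600


run_seconds = 600  # hypothetical wall-clock time, identical for both configs

baseline_cu = compute_units(1, run_seconds)  # properly tuned 1 GB actor
leaky_cu = compute_units(4, run_seconds)     # same run with memory bumped to 4 GB
ratio = leaky_cu / baseline_cu

print(f"1 GB: {baseline_cu:.3f} CU, 4 GB: {leaky_cu:.3f} CU, ratio: {ratio:.1f}x")
```

Multiplying either figure by your plan's per-CU price and your yearly run count gives the absolute cost.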

The 3 patterns above cover ~90% of leaks I have hit in production. Add the RSS probe to every long-running scraper, set the leak threshold at 50 MB/1k pages, and you will catch the next one in the first dev cycle instead of the next billing cycle.

More production scraping notes: t.me/scraping_ai. Originally published at blog.spinov.online.

Top comments (2)

foxck016077 • May 18

The RSS-probe-every-1000-pages framing is the move. Most leak posts give you the symptom and the fix, not the detection cadence — without "sample at fixed page intervals" you end up just staring at the Apify memory graph hoping the shape tells you something.

One question and one note from another Apify actor builder (smaller surface, only one published actor versus your 32):

Did you find the leak threshold of 50 MB / 1k pages held across very different page-weight workloads, or did you end up tuning it per-actor? I keep wondering whether a single threshold generalizes once you mix lightweight JSON endpoints with JS-rendered DOM pages in the same long-running job.
The re.compile bypass one bit me on a different shape — I had a per-feature router building dynamic header-match patterns per request and the bypass was invisible until I sampled tracemalloc. Putting re.compile in module scope or caching it under a frozen tuple key was the only thing that brought RSS back to flat. Cheap learning, expensive to spot.