Memory leaks in scrapers do not crash the run. They quietly force you to raise the Apify memory limit from 1 GB to 2 GB to 4 GB, doubling the per-run cost at each step, and only get spotted weeks later on a compute-unit invoice.
After 968 Trustpilot runs (~80–300 review pages each, ~150k page hits cumulative), I started sampling RSS every 1,000 pages. The growth pattern told a different story than the logs. Below are the three patterns that account for ~90% of the leaks I have seen across my 32 published Apify actors.
1. The unbounded asyncio queue
The most common pattern. A producer coroutine fetches URLs faster than the consumer parses them, so the in-memory queue grows linearly with runtime.
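# leaks at high concurrency
import asyncio

queue = asyncio.Queue()  # no maxsize, so the queue is unbounded

async def producer():
    async for url in source:  # source: any async iterable of URLs
        await queue.put(url)  # never blocks on an unbounded queue

async def consumer():
    while True:
        url = await queue.get()
        await process(url)  # process: the slow fetch-and-parse step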
If process() is slower than source (which is true for most JS-rendered sites), the queue accumulates. On a Trustpilot run that fetched a company with 12,000 reviews, the queue held ~9,500 URLs at peak — about 380 MB of bytestrings.
Fix:
queue = asyncio.Queue(maxsize=200) # producer blocks at 200
A bounded queue forces the producer to wait. Memory stays flat; throughput drops by less than 5% because the consumer never sat idle anyway — the bottleneck was the network, not the queue.
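To make the backpressure effect concrete, here is a minimal sketch that pushes the same workload through an unbounded and a bounded queue and records the peak depth. The `run_pipeline` helper, the example.com URLs, and the `asyncio.sleep(0)` stand-in for parsing are all assumptions for illustration, not code from the scrapers described here.

```python
import asyncio


async def run_pipeline(n_urls: int, maxsize: int) -> int:
    """Push n_urls through a queue and return the peak queue depth seen."""
    # maxsize=0 means "unbounded" for asyncio.Queue
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    peak = 0

    async def producer() -> None:
        nonlocal peak
        for i in range(n_urls):
            # On a bounded queue this await suspends while the queue is full.
            await queue.put(f"https://example.com/page/{i}")
            peak = max(peak, queue.qsize())
        await queue.put(None)  # sentinel: tells the consumer to stop

    async def consumer() -> None:
        while True:
            url = await queue.get()
            if url is None:
                break
            await asyncio.sleep(0)  # stand-in for slow parsing work

    await asyncio.gather(producer(), consumer())
    return peak


unbounded_peak = asyncio.run(run_pipeline(1000, maxsize=0))
bounded_peak = asyncio.run(run_pipeline(1000, maxsize=200))
print(f"unbounded peak: {unbounded_peak}, bounded peak: {bounded_peak}")
```

The bounded run can never hold more than `maxsize` items, because `put()` suspends the producer until the consumer frees a slot. The unbounded run has no such brake.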
2. Per-URL re.compile bypassing Python's regex cache
CPython has a built-in regex cache at re._cache (default 512 entries). The naive assumption is that re.search(pattern, text) is cached and cheap on the second call.
It is, as long as the exact pattern string has been seen before. But when you build the pattern dynamically per URL, every new pattern string is a cache miss and a fresh compile:
# leaks at scale
import re

def extract_review_id(html: str, slug: str):
    pat = rf"review-{slug}-(\d+)"  # a new pattern string for every slug
    m = re.search(pat, html)  # unseen string: cache miss, fresh compile
    return m.group(1) if m else None
Each unique slug (company name) puts a new entry in re._cache. The cap is 512, so the cache itself cannot grow forever, but every new slug costs a fresh compile, and once the cache is full each insert also evicts an older pattern (older CPython releases cleared the whole cache at that point; newer ones drop the oldest entry). The constant churn of allocating and freeing compiled patterns is the likely driver of the growth I measured: across a long run with thousands of distinct slugs, RSS climbed 3–5 MB per 1,000 pages.
Fix: lift the dynamic part out and use a single static pattern with a group capture:
import re

REVIEW_ID = re.compile(r"review-([a-z0-9-]+)-(\d+)")  # compiled once at import

def extract_review_id(html: str, slug: str):
    for m in REVIEW_ID.finditer(html):
        if m.group(1) == slug:  # filter by slug in Python, not in the pattern
            return m.group(2)
    return None
One compiled pattern, one cache slot. RSS curve flattens.
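As a quick check that the static pattern still picks out the right review, here is a small usage sketch. The HTML snippet and the `review-<slug>-<id>` attribute format are made up for illustration; real Trustpilot markup may look different.

```python
import re

# One pattern for every slug: group 1 is the slug, group 2 the numeric id.
REVIEW_ID = re.compile(r"review-([a-z0-9-]+)-(\d+)")


def extract_review_id(html: str, slug: str):
    """Return the id of the first review whose slug matches, else None."""
    for m in REVIEW_ID.finditer(html):
        if m.group(1) == slug:
            return m.group(2)
    return None


# Hypothetical markup, only to exercise the function.
sample_html = (
    '<div id="review-acme-corp-1042"></div>'
    '<div id="review-other-co-77"></div>'
)

acme_id = extract_review_id(sample_html, "acme-corp")
other_id = extract_review_id(sample_html, "other-co")
missing_id = extract_review_id(sample_html, "no-such-company")
print(acme_id, other_id, missing_id)
```

Because the slug is compared in Python instead of being interpolated into the pattern, it also no longer needs `re.escape` to be safe against regex metacharacters.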
3. BeautifulSoup soup retention in long-lived lists
This one is sneaky. The code looks correct:
from bs4 import BeautifulSoup

results = []

async def scrape(urls):
    for url in urls:
        html = await fetch(url)  # fetch: your HTTP helper, returns page HTML
        soup = BeautifulSoup(html, "lxml")
        results.append({
            "title": soup.title.string,
            "body": soup.select_one("article").get_text(),
        })
The dictionaries look small: a few hundred bytes of text each. But soup.title.string is not a plain Python string. It is a NavigableString, a str subclass that keeps a reference to its parent tag and, through it, to the whole soup. (get_text(), by contrast, already returns a plain str, so the "body" value is not the culprit.) As long as a NavigableString lives in results, the entire parse tree (often 200 KB–2 MB per page) stays in memory.
After 2,500 review pages, my Trustpilot worker had ~3 GB RSS — almost all of it old soup trees being kept alive by .string references in results.
Fix: coerce to plain str at the point of extraction:
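title = soup.title.string if soup.title else None
article = soup.select_one("article")  # look it up once, not twice
results.append({
    "title": str(title) if title is not None else None,  # plain str, no link to the tree
    "body": article.get_text() if article else "",  # get_text() returns a plain str
})
# soup is rebound on the next iteration → entire tree freed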
str(...) copies the text of the NavigableString into a plain Python string with no back-reference. Once results holds only plain strings, the soup becomes garbage as soon as the loop rebinds it on the next iteration. RSS on the same 2,500-page run dropped from 3 GB to 410 MB.
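The retention mechanism can be reproduced without BeautifulSoup at all. The sketch below is a stand-in model, not bs4 code: `Tree` and `LinkedStr` are hypothetical classes that mimic a parse tree and a string subclass with a parent pointer, and a weak reference is used to check whether the tree is still alive after extraction.

```python
import gc
import weakref


class LinkedStr(str):
    """A str subclass that remembers its parent, similar in spirit to NavigableString."""
    parent = None


class Tree:
    """Stand-in for a parsed document (hypothetical, not a bs4 class)."""

    def __init__(self, text: str):
        self.payload = bytearray(1_000_000)  # pretend this is a large parse tree
        self.title = LinkedStr(text)
        self.title.parent = self  # back-reference from the string to the tree


def tree_survives(extract):
    """Keep only what extract() returns; report whether the tree is still alive."""
    tree = Tree("Acme reviews")
    probe = weakref.ref(tree)  # does not keep the tree alive by itself
    kept = extract(tree)
    del tree
    gc.collect()  # tree and title reference each other, so collect the cycle
    return kept, probe() is not None


kept_proxy, alive_with_proxy = tree_survives(lambda t: t.title)
kept_plain, alive_with_plain = tree_survives(lambda t: str(t.title))
print(alive_with_proxy, alive_with_plain)
```

Keeping the subclass instance pins the whole tree through its `parent` attribute, while `str(...)` returns an exact `str` that carries no such attribute, so the tree can be collected.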
Detection: RSS sampling every 1,000 pages
I added a 12-line probe to every scraper:
import os, psutil, time
_proc = psutil.Process(os.getpid())
_samples = []
def rss_tick(page_count: int):
    if page_count % 1000 == 0:
        rss_mb = _proc.memory_info().rss / 1024 / 1024
        _samples.append((page_count, rss_mb, time.time()))
        if len(_samples) >= 3:
            first, last = _samples[-3], _samples[-1]
            growth_per_1k = (last[1] - first[1]) / ((last[0] - first[0]) / 1000)
            if growth_per_1k > 50:  # >50 MB per 1,000 pages
                print(f"LEAK ALERT: +{growth_per_1k:.1f} MB/1k pages")
The threshold of 50 MB per 1,000 pages is conservative — anything above 20 MB on a steady-state run is worth investigating. The output gets piped to Apify's dataset, so I can grep across runs.
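To see how the slope calculation behaves, here is the same arithmetic pulled out into a pure function and fed three made-up samples. The RSS values are invented for illustration, and `growth_per_1k` as a standalone helper is an assumption, not part of the probe itself.

```python
def growth_per_1k(samples):
    """Slope in MB per 1,000 pages between the third-from-last and last sample.

    samples: list of (page_count, rss_mb) tuples, oldest first.
    """
    first, last = samples[-3], samples[-1]
    pages_in_thousands = (last[0] - first[0]) / 1000
    return (last[1] - first[1]) / pages_in_thousands


# Invented readings: 400 MB at 1k pages, 460 MB at 2k, 530 MB at 3k.
samples = [(1000, 400.0), (2000, 460.0), (3000, 530.0)]
rate = growth_per_1k(samples)  # (530 - 400) / 2 = 65.0 MB per 1,000 pages
print(f"{rate:.1f} MB/1k pages, alert: {rate > 50}")
```

Comparing against the third-from-last sample smooths the slope over a 2,000-page window, so a single noisy reading is less likely to trigger the alert.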
The cost angle nobody mentions
Memory leaks rarely crash a scraper. What they do is force you to bump the actor's Memory configuration:
- 1 GB → 2 GB: doubles compute-unit consumption per second
- 2 GB → 4 GB: quadruples it vs the 1 GB baseline
On Apify pricing, a 4 GB run at $0.0004/CU-second costs ~4× a properly-tuned 1 GB run for the same wall-clock time. Across 968 Trustpilot runs that would have been an extra ~$120/year for nothing — pure operational waste because nobody profiled RSS.
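The 4× figure follows directly from how compute units are counted. The sketch below assumes Apify's definition of one compute unit as 1 GB of memory held for one hour; the `USD_PER_CU` price and the 10-minute run length are placeholders, so substitute the rate from your own plan.

```python
import math

USD_PER_CU = 0.30  # placeholder price per compute unit, not an official figure


def compute_units(memory_gb: float, seconds: float) -> float:
    """Compute units for one run, assuming 1 CU = 1 GB of memory for 1 hour."""
    return memory_gb * seconds / 3600


def run_cost(memory_gb: float, seconds: float, usd_per_cu: float = USD_PER_CU) -> float:
    """Cost of one run in USD at the given price per compute unit."""
    return compute_units(memory_gb, seconds) * usd_per_cu


# Same 10-minute wall-clock run at two memory settings.
cost_1gb = run_cost(1, 600)
cost_4gb = run_cost(4, 600)
ratio = cost_4gb / cost_1gb
print(f"1 GB: ${cost_1gb:.4f}  4 GB: ${cost_4gb:.4f}  ratio: {ratio:.1f}x")
```

The price cancels out of the ratio, which is why the multiplier holds regardless of plan: for the same wall-clock time, cost scales linearly with the memory setting.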
The 3 patterns above cover ~90% of leaks I have hit in production. Add the RSS probe to every long-running scraper, set the leak threshold at 50 MB/1k pages, and you will catch the next one in the first dev cycle instead of the next billing cycle.
More production scraping notes: t.me/scraping_ai. Originally published at blog.spinov.online.