BAOFUFAN

Crawling 100K Pages with asyncio: 3 Crashes and What I Learned About the Event Loop

Here’s what happened. Last week, my boss dropped a requirement: “Scrape data from these 100,000 product pages, I need it by tonight.” I thought, easy – just a loop with requests. Then I did the math: at 0.5 seconds per request, serial execution would take 13.8 hours. I nearly broke down.

So I pulled out asyncio, planning to squeeze the time down to minutes with concurrent coroutines. But then my scraper crashed three times: first, the target server blacklisted my IP; second, memory exploded to 8GB and the OOM Killer took the process down; third, half the requests had quietly come back as exceptions and I hadn't even noticed. After hitting these walls, I finally started to grasp asyncio. This article is my battle-scarred experience.

The Event Loop: Why It’s So Fast

Many people think asyncio is multi-threaded, but it actually runs in a single thread. The core is the Event Loop, which acts like a constantly spinning scheduler: when one coroutine is waiting on a network response, the loop immediately switches to another coroutine that is ready to run, so the thread never sits idle waiting on a single slow request. This makes it perfect for I/O-bound work: web scraping, API calls, database queries.

Let’s feel it with a simple example:

import asyncio
import time

async def fetch(url):
    # Simulate network I/O without blocking the thread
    await asyncio.sleep(0.5)
    return f"{url} done"

async def main():
    start = time.time()
    # 10 tasks run concurrently: total time is ~0.5s, not 5s
    results = await asyncio.gather(
        *[fetch(f"https://page/{i}") for i in range(10)]
    )
    print(f"耗时: {time.time() - start:.2f}s")
    print(results[:3])

asyncio.run(main())

If you replaced the await asyncio.sleep(0.5) above with a synchronous time.sleep(0.5), the 10 iterations would take 5 seconds. asyncio.gather submits all 10 coroutines to the event loop at once, so they wait concurrently instead of one after another; the total time is set by the slowest task, not by the sum of all of them — that's the power of async concurrency.
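
For reference, here is what the synchronous version of that loop looks like — a minimal sketch, just to make the 5-second figure concrete:

import time

def fetch_sync(url):
    time.sleep(0.5)            # blocks the whole thread for 0.5s
    return f"{url} done"

start = time.time()
results = [fetch_sync(f"https://page/{i}") for i in range(10)]
print(f"Elapsed: {time.time() - start:.2f}s")   # ~5.00s: 10 x 0.5s, one after another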

Building a Scraper That Can Handle 100K Requests

The code above only uses asyncio.sleep. For real work you need to make HTTP requests. I used aiohttp, the asyncio ecosystem’s equivalent of requests. But blindly spawning 100K concurrent coroutines is essentially a DDoS — the target server will ban you instantly. That was my first crash.

That’s why you must use a Semaphore to limit concurrency. Usually 200-500 is a safe range:

import asyncio
import aiohttp

SEM_LIMIT = 200

async def fetch(session, url, sem):
    async with sem:  # limit how many coroutines run at the same time
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                text = await resp.text()
                return url, len(text)
        except Exception as e:
            return url, str(e)

async def crawl(urls):
    sem = asyncio.Semaphore(SEM_LIMIT)
    # Reuse one TCP connection pool to avoid repeated handshakes
    connector = aiohttp.TCPConnector(limit=SEM_LIMIT)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, url, sem) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

# Run it
urls = [f"https://httpbin.org/delay/1?page={i}" for i in range(1000)]
results = asyncio.run(crawl(urls))
print(f"完成 {len(results)} 个请求")

A few key points here:

  • asyncio.Semaphore acts like a counted lock — it only lets N coroutines in at once; the rest queue up outside, so an unbounded number of connections is never created.
  • TCPConnector(limit=SEM_LIMIT) limits the underlying connection pool size, working together with the semaphore to avoid port exhaustion.
  • return_exceptions=True is extremely important (more on that later).
  • aiohttp.ClientTimeout sets a total timeout of 10 seconds, preventing a single slow request from hanging the whole batch.

The Three Pitfalls That Really Made Me Suffer

Pitfall 1: One Exception in gather Blows Up Everything

Initially I didn’t add return_exceptions=True. After a few minutes of crawling, a single 500 error knocked the whole thing down. The default behavior of asyncio.gather is: as soon as any child coroutine raises, the exception propagates straight out of the await — the results of everything else are lost, and when the program then bails out the remaining tasks are cancelled along with the event loop. When you have 100K requests, a few timeouts or 5xx errors are completely normal, but the whole batch would be wasted. The fix: return_exceptions=True makes gather hand exceptions back as results instead of raising, so the other tasks keep running and you can filter out and retry the failures afterwards.
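
A minimal, self-contained sketch of that filtering pattern — might_fail is just a hypothetical stand-in for a real request:

import asyncio

async def might_fail(i):
    if i % 3 == 0:
        raise ValueError(f"boom on {i}")   # stand-in for a timeout or 5xx
    return i

async def main():
    tasks = [might_fail(i) for i in range(10)]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    print(f"success: {len(ok)}, failed: {len(failed)}")
    # the failed ones can be logged and re-queued for a retry pass instead of killing the batch

asyncio.run(main())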

Pitfall 2: Accidentally Blocking Inside a Coroutine

While parsing HTML I used BeautifulSoup — CPU-intensive, but not a killer. The real disaster was that in an early version I casually sprinkled time.sleep(0.1) to introduce “anti-scraping delays”. time.sleep is a synchronous block — it brings the entire event loop to a halt; all coroutines are frozen for those 0.1 seconds. The correct way is await asyncio.sleep(0.1). Any operation that could make the thread wait (file I/O, synchronous requests, heavy CPU work) must either be offloaded to a thread pool (loop.run_in_executor) or replaced with an async library. Otherwise, your concurrency collapses into serial execution because of one blocking task.
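
Here is a minimal sketch of that offloading pattern; parse_html is a hypothetical stand-in for the real BeautifulSoup call:

import asyncio

def parse_html(html):
    # blocking, CPU-heavy work (BeautifulSoup parsing, etc.) lives here
    return len(html)

async def handle_page(html):
    loop = asyncio.get_running_loop()
    # run the blocking function in the default thread pool so the event loop
    # stays free to schedule other coroutines in the meantime
    length = await loop.run_in_executor(None, parse_html, html)
    await asyncio.sleep(0.1)   # the polite anti-scraping delay, done the async way
    return length

print(asyncio.run(handle_page("<html>hello</html>")))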

Pitfall 3: Connection Pool Exhaustion and DNS Resolution Storms

The default TCPConnector caps connections at 100. If you set the Semaphore to 500, 400 coroutines will sit stuck waiting for a free connection. Worse, every new connection needs a DNS lookup, and aiohttp’s default resolver goes through the synchronous socket.getaddrinfo on a thread pool — under a lookup storm those threads quietly become the bottleneck. The solution is to install aiodns and pass an async resolver to the connector (e.g., aiohttp.TCPConnector(limit=..., resolver=aiohttp.AsyncResolver())) so that DNS resolution is truly asynchronous and never drags the whole crawl down.
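
Putting that together — a minimal sketch of the fixed-up connector inside a crawl()-shaped shell (assumes pip install aiodns; the ttl_dns_cache knob is an extra I’m adding, not part of the original code):

import asyncio
import aiohttp

SEM_LIMIT = 500

async def crawl(urls):
    connector = aiohttp.TCPConnector(
        limit=SEM_LIMIT,                   # keep the pool as large as the Semaphore
        resolver=aiohttp.AsyncResolver(),  # aiodns-backed, fully asynchronous DNS lookups
        ttl_dns_cache=300,                 # cache resolved host names for 5 minutes
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        # same Semaphore + fetch + gather logic as in the earlier crawl()
        ...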
