Last Friday at 5 PM, right when I was about to close my laptop and sneak out, the alert channel exploded: the online data collection service's timeout rate had spiked to 40%, and all downstream reports were blank. I checked the logs and found that the crawler processing thousands of URLs was still using the old synchronous requests library, fetching them one by one. Each request averaged 1.2 seconds, so one full round took nearly 20 minutes, while the business requirement demanded completion within 5 minutes. Only one thought crossed my mind: rewrite it with asyncio for concurrency and deploy before leaving.
That decision led me into three major pitfalls, and I almost wrecked the service. Now I'm sharing the hard-learned lessons, hoping to save you those three hours.
Why asyncio Is the Right Play for IO‑Bound Tasks
The core of asyncio is the event loop plus coroutines. Think of the event loop as a constantly polling scheduler, and each coroutine as a task that can voluntarily pause and hand back control. When a coroutine is waiting for a network response (IO), the event loop immediately switches to another ready coroutine, keeping the CPU from spinning idle. The biggest difference from traditional multithreading: asyncio is cooperative scheduling within a single thread, avoiding thread‑switching overhead and GIL lock contention. It especially shines in network‑request‑heavy scenarios.
The common pattern we use: define coroutine functions with async def and await asynchronous IO operations inside them, then use asyncio.gather() to hand multiple coroutines to the event loop at once. The total duration depends on the slowest task, not the sum of all tasks.
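Here's that pattern in its smallest form. The asyncio.sleep call stands in for real network IO; nothing here is specific to my crawler:

import asyncio

async def job(n):
    # await hands control back to the event loop while this task "waits on IO"
    await asyncio.sleep(1)  # stand-in for a 1-second network call
    return n

async def main():
    # ten concurrent tasks finish in about 1 second total, not 10
    results = await asyncio.gather(*(job(i) for i in range(10)))
    print(results)

asyncio.run(main())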
But there’s a gap between “understanding the principle” and “writing correct code” — one that’s filled with casualties.
Code in Practice: From “Sync Trap” to “Async Delight”
Pitfall 1: Using a Synchronous Blocking Call Inside a Coroutine
At first, I wrote a naive concurrent crawler that looked something like this:
import asyncio
import requests  # synchronous library, don't use it inside coroutines!

async def fetch(url):
    # Anti-pattern: calling the synchronous requests library inside a coroutine
    resp = requests.get(url, timeout=5)  # this call blocks the entire thread!
    return resp.status_code

async def main():
    urls = ["https://httpbin.org/delay/1"] * 10
    tasks = [fetch(url) for url in urls]
    results = await asyncio.gather(*tasks)
    print(results)

asyncio.run(main())
When you run this, you’ll notice all requests are still sequential — the effect is exactly the same as synchronous code. The reason is simple: requests.get() is a synchronous blocking call. While waiting for the network, it never yields control back to the event loop, so only one coroutine runs at a time. The event loop is effectively useless.
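By the way, a quick way to catch this class of bug before it hits production is asyncio's debug mode, which logs a warning whenever a single coroutine step holds the event loop longer than loop.slow_callback_duration (0.1 seconds by default). A minimal sketch:

import asyncio
import time

async def blocking():
    time.sleep(1)  # synchronous sleep: blocks the whole event loop

# debug=True makes asyncio log a warning that this coroutine step
# monopolized the loop for about a second
asyncio.run(blocking(), debug=True)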
The correct approach: switch to an async HTTP client, like aiohttp or httpx.AsyncClient.
import asyncio
import aiohttp

async def fetch(session, url):
    # aiohttp's async request: await hands control back to the event loop
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
        return await resp.text()

async def main():
    urls = ["https://httpbin.org/delay/1"] * 10
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print(f"Completed {len(results)} requests")

asyncio.run(main())
This code truly leverages the event loop’s concurrency. For 10 requests each with a 1‑second delay, the total time is just over 1 second instead of 10 seconds. My crawler job went from 20 minutes to under 2 minutes.
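If you'd rather use httpx, the other async client I mentioned, the same pattern looks like this. This is a minimal sketch; the shared AsyncClient plays the role of aiohttp's ClientSession:

import asyncio
import httpx

async def fetch(client, url):
    resp = await client.get(url, timeout=5)  # await yields to the event loop
    return resp.text

async def main():
    urls = ["https://httpbin.org/delay/1"] * 10
    # one shared client reuses connections, like aiohttp's ClientSession
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch(client, u) for u in urls))
        print(f"Completed {len(results)} requests")

asyncio.run(main())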
Pitfall 2: One Exception Takes Down the Whole gather() Batch
When the number of URLs grew to several hundred, occasionally a few requests would time out or DNS resolution would fail. I noticed that if any single coroutine raised an exception, gather() propagated it immediately; the other coroutines kept running in the background, but their results were lost, so the entire batch was effectively wiped out. That's exactly what happened during my first production deployment: one tiny domain failed to resolve, the whole run aborted, and the downstream went blank again.
The fix is to use gather(..., return_exceptions=True), which returns exceptions as result objects instead of breaking the flow.
import asyncio
import aiohttp

async def fetch_with_sem(sem, session, url):
    async with sem:  # cap concurrency so we don't exhaust file descriptors
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                return url, await resp.text()
        except Exception as e:
            return url, e  # return the exception object so the caller can inspect it

async def main():
    urls = [...]  # several hundred URLs
    sem = asyncio.Semaphore(50)  # limit concurrency to avoid system or server-side limits
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_sem(sem, session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)  # the key!
        for url, content in results:
            if isinstance(content, Exception):
                print(f"{url} failed: {content}")
            else:
                process(content)  # hand the payload to downstream processing

asyncio.run(main())
Adding the Semaphore, plus a retry queue for the URLs whose exceptions we caught, finally made the service stable.
Pitfalls & Cautions: These Are the Real Killers
1. Never call time.sleep() inside a coroutine
time.sleep() puts the entire thread to sleep, completely stalling the event loop. Always use await asyncio.sleep() instead.
2. Beware of unlimited concurrency overwhelming file descriptors
Even though asyncio handles thousands of tasks easily, spawning 5,000 concurrent connections at once can exhaust your system’s file descriptor limit or accidentally trigger the target server’s rate limiting. Use asyncio.Semaphore or connection‑pool limits to constrain concurrency.
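With aiohttp you can also enforce the cap at the connection-pool level instead of (or in addition to) a Semaphore. A sketch; the limit of 50 is an arbitrary choice, tune it for your target:

import asyncio
import aiohttp

async def main():
    # TCPConnector(limit=50) caps simultaneous connections for every
    # request made through this session
    connector = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=connector) as session:
        ...  # same fetch/gather logic as before

asyncio.run(main())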
3. Don’t forget to back off and retry
Transient network issues are normal. Without a retry mechanism, some failures become permanent data gaps. Combine return_exceptions=True with exponential backoff retries for robust production‑grade code.
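A minimal backoff helper along those lines; the retry count and base delay are illustrative, tune them for your workload:

import asyncio
import aiohttp

async def fetch_with_retry(session, url, retries=3, base_delay=1.0):
    for attempt in range(retries):
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                resp.raise_for_status()
                return await resp.text()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: let return_exceptions=True catch it upstream
            # exponential backoff: 1s, 2s, 4s, ...
            await asyncio.sleep(base_delay * 2 ** attempt)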
Rewriting a synchronous IO-bound service with asyncio is one of the most satisfying optimizations you can make, but these pitfalls can easily turn it into a nightmare if you're not careful. I lost three hours and very nearly lost a stable production service on a Friday. I hope this post saves you from the same fate.