BAOFUFAN
asyncio Pitfalls: 3 Hours Debugging a Concurrent Crawler

Last week my boss threw me a task: pull data from 50 third‑party APIs and build an aggregated report. I thought it was a piece of cake — just write a loop with Requests and be done. But when I ran it, I was dumbfounded: the whole thing was synchronously blocking, and cycling through all 50 endpoints took almost 80 seconds. That’s when I naturally reached for asyncio, Python’s silver bullet for IO‑bound concurrency. I jumped in eagerly, only to spend the next three hours glued to my screen hunting down one weird behavior after another.

I thought I understood asyncio — I'd only scratched the surface

The event loop: a single‑threaded time‑management wizard

At the heart of asyncio sits an event loop. It juggles all coroutines inside a single thread. When a coroutine is waiting on something slow — network, disk — it doesn’t block the thread. Instead, it yields control back to the event loop, which then wakes up the next ready coroutine.

Define coroutines with async def and voluntarily hand over the execution with await:

import asyncio

async def fetch_api(url: str) -> str:
    print(f"Requesting {url}")
    await asyncio.sleep(1)        # simulates network IO; in real code use aiohttp
    return f"data from {url}"

Real concurrency: gather and create_task

Throw all 50 tasks together and run them concurrently with asyncio.gather. The total time depends on the slowest one, not the sum of all requests:

async def main():
    urls = [f"https://api.example.com/item/{i}" for i in range(50)]
    tasks = [fetch_api(url) for url in urls]
    results = await asyncio.gather(*tasks)
    print(f"Fetched {len(results)} records")

asyncio.run(main())

Just like that, I went from 80 seconds to under 2 seconds. I nearly slapped the desk in excitement — but that’s exactly when the real traps started lining up.

Full comparison: sync vs async — how big is the gap?

You can literally copy and run the two snippets below. Trust me, you’ll want to see the difference yourself.

Synchronous version (painfully slow)

import time
import requests

def fetch_sync(url: str) -> int:
    resp = requests.get(url, timeout=5)
    return resp.status_code

def main():
    urls = ["https://httpbin.org/delay/1"] * 10  # 10 slow endpoints
    start = time.perf_counter()
    results = [fetch_sync(url) for url in urls]
    elapsed = time.perf_counter() - start
    print(f"Sync elapsed: {elapsed:.2f}s, results: {len(results)}")

if __name__ == "__main__":
    main()

Async version (the right way)

import asyncio
import time
import aiohttp

async def fetch_async(session: aiohttp.ClientSession, url: str) -> int:
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
        return resp.status

async def main():
    urls = ["https://httpbin.org/delay/1"] * 10
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_async(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start
    print(f"Async elapsed: {elapsed:.2f}s, results: {len(results)}")

if __name__ == "__main__":
    asyncio.run(main())

The synchronous version runs 10 endpoints in about 12 seconds. The async one finishes in just over 1 second. The difference is impossible to miss.

The traps I fell into — each one more subtle than the last

1. Forgetting await turns coroutines into zombies

results = [fetch_async(session, url) for url in urls]  # only creates coroutine objects — nothing ever runs!

Without await or asyncio.gather to wrap them, those coroutines are never scheduled. The script finishes almost instantly, your “results” list is full of coroutine objects, and Python warns you with a RuntimeWarning: coroutine 'fetch_async' was never awaited. The fix is simple: always wrap them with asyncio.gather or schedule them with asyncio.create_task.
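A minimal sketch of both fixes. Here asyncio.sleep stands in for the real network call, and fake_fetch is a hypothetical stand-in for fetch_async, so the snippet runs anywhere:

```python
import asyncio

async def fake_fetch(url: str) -> str:
    # stand-in for a real aiohttp request
    await asyncio.sleep(0.01)
    return f"data from {url}"

async def main() -> list[str]:
    urls = [f"https://api.example.com/item/{i}" for i in range(5)]

    # Fix 1: gather awaits every coroutine and collects results in order
    results = await asyncio.gather(*(fake_fetch(u) for u in urls))

    # Fix 2: create_task schedules each coroutine immediately;
    # awaiting the Task objects later collects the results
    tasks = [asyncio.create_task(fake_fetch(u)) for u in urls]
    results_2 = [await t for t in tasks]

    assert results == results_2
    return results

print(asyncio.run(main()))
```

The practical difference: gather is a one-shot "run these and wait", while create_task lets work start in the background before you decide where to await it.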

2. Calling time.sleep inside a coroutine freezes the entire loop

import time

async def buggy_fetch(url):
    time.sleep(1)          # synchronous sleep blocks the thread — event loop frozen!
    return "data"

time.sleep is a synchronous blocking call. It seizes the only thread and the event loop cannot switch to anything else. You must use await asyncio.sleep(n) or offload synchronous work with loop.run_in_executor.
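Here is a rough sketch of the executor route, with time.sleep playing the part of a blocking legacy call you can't rewrite. Passing None as the executor uses asyncio's default thread pool:

```python
import asyncio
import time

def blocking_io() -> str:
    # a synchronous call we can't make async (illustrative)
    time.sleep(0.1)
    return "data"

async def main() -> list[str]:
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # run_in_executor hands the blocking call to a worker thread,
    # so the event loop stays free and the five calls overlap
    results = await asyncio.gather(
        *(loop.run_in_executor(None, blocking_io) for _ in range(5))
    )
    elapsed = time.perf_counter() - start
    print(f"elapsed: {elapsed:.2f}s")  # ~0.1s, not ~0.5s
    return results

asyncio.run(main())
```

On Python 3.9+, await asyncio.to_thread(blocking_io) is a shorter spelling of the same idea.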

3. No concurrency limit got me blocked by the target API

50 coroutines bombarded the server at the same time, immediately earning a flood of HTTP 429 responses. The cure is a Semaphore:

sem = asyncio.Semaphore(10)   # at most 10 concurrent requests

async def rate_limited_fetch(session, url):
    async with sem:
        return await fetch_async(session, url)
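Wired into the whole crawler, the limiter looks roughly like this. To keep the sketch self-contained, fake_fetch simulates the request with asyncio.sleep and tracks a high-water mark so you can verify the cap actually holds:

```python
import asyncio

MAX_CONCURRENT = 10
peak = 0      # highest number of requests in flight at once
active = 0

async def fake_fetch(url: str) -> str:
    global peak, active
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)   # simulated network IO
    active -= 1
    return f"data from {url}"

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def limited(url: str) -> str:
        async with sem:          # at most MAX_CONCURRENT inside at once
            return await fake_fetch(url)

    urls = [f"https://api.example.com/item/{i}" for i in range(50)]
    return await asyncio.gather(*(limited(u) for u in urls))

results = asyncio.run(main())
print(f"fetched {len(results)}, peak concurrency: {peak}")
```

Note the Semaphore is created inside main(), i.e. inside the running loop; creating asyncio primitives at module import time can bind them to the wrong loop on older Python versions.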

4. asyncio.run() crashes on Windows

On Windows, the default ProactorEventLoop can raise RuntimeError in some scenarios. The classic symptom is "RuntimeError: Event loop is closed" at interpreter shutdown, for example when aiohttp connections are still being torn down after asyncio.run() has already closed the loop.
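One common workaround is switching to the selector-based loop before starting anything async. A sketch, guarded so the same script still runs on other platforms (WindowsSelectorEventLoopPolicy only exists on Windows builds of Python):

```python
import asyncio
import sys

# The selector loop avoids the Proactor teardown errors; apply it
# only on Windows, since the policy class doesn't exist elsewhere.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

async def main() -> str:
    await asyncio.sleep(0.01)
    return "ok"

print(asyncio.run(main()))
```

Caveat: the selector loop doesn't support subprocesses on Windows, so only reach for this when your program doesn't spawn any.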
