From 20 Minutes to 90 Seconds: Refactoring a 100-Endpoint Crawler with asyncio

#python #异步编程 #爬虫 #性能优化

Last Thursday afternoon, my boss tossed me a “small request”: shrink the data collection cycle of our competitor monitoring system from 4 hours to under 15 minutes. I cracked open the legacy codebase—a 200-line loop of requests.get, serially crawling over 100 endpoints, taking a full 20 minutes per run. My immediate thought: “We’re toast.” But then it hit me—this is exactly where asyncio thrives. After refactoring and deploying, a job that took 1200 seconds dropped to 90 seconds, and CPU usage actually went down. This post walks through the whole journey and the key technical details.

Why the Synchronous Crawler Couldn’t Keep Up

The core logic in the legacy code was absurdly simple:

import time
import requests

def fetch_all(urls):
    results = []
    for url in urls:
        resp = requests.get(url, timeout=10)
        results.append(resp.json())
    return results

urls = [f"https://api.example.com/item/{i}" for i in range(100)]
start = time.time()
data = fetch_all(urls)
print(f"耗时: {time.time() - start:.1f}s")

This code took over 200 seconds for 100 endpoints, 99% of which was waiting on network I/O while the CPU sipped tea. Every requests.get is a blocking call—you can’t fire the next request until the current one fully completes. It’s like queuing in a single lane on a highway. No matter how fat your pipe or how powerful your hardware, the serialized waits eat you alive.

Concurrent Crawling with asyncio: the Core Idea in One Sentence

The core principle of asyncio is: run an event loop in a single thread, and whenever I/O wait occurs, yield control to other tasks. It’s not multi-threading—there’s no GIL contention. It doesn’t spawn new processes—memory overhead is tiny. For I/O-bound tasks like hundreds of network requests, asyncio gives you the best concurrency bang for your buck.

Here’s the refactored version:

import asyncio
import time
import aiohttp

async def fetch_one(session, url):
    try:
        # aiohttp 的 session.get 返回一个协程，await 交出控制权
        async with session.get(url, timeout=10) as resp:
            return await resp.json()
    except Exception as e:
        return {"error": str(e)}

async def fetch_all(urls):
    # TCPConnector 控制连接池大小，避免对端被压垮
    connector = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_one(session, url) for url in urls]
        # asyncio.gather 并发执行所有任务，总耗时 ≈ 最慢的那个请求
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

async def main():
    urls = [f"https://api.example.com/item/{i}" for i in range(100)]
    start = time.time()
    data = await fetch_all(urls)
    print(f"耗时: {time.time() - start:.1f}s, 获取 {len(data)} 条")

if __name__ == "__main__":
    asyncio.run(main())

A few crucial points:

async def + await: define a coroutine function. When you hit await, the current task suspends and the event loop immediately schedules the next ready task. The expression after await must be an Awaitable (coroutine, Task, or Future)—the compiler enforces this, which is far safer than forgetting to release a lock in multi‑threading.
asyncio.gather: launches all coroutines concurrently and waits for every one to finish. The total time approximates the slowest single request, not the sum. If your 100 endpoints each respond in 2 seconds, synchronous code would need 200 seconds; asynchronous code needs just over 2 seconds—real-world bottlenecks like bandwidth and connection pools still apply, but the improvement is still exponential.
aiohttp.TCPConnector(limit=50): caps the number of concurrent connections to avoid bulldozing the target server. The default is 100; tweak it according to anti-scraping policies.
asyncio.run(main()): the standard entry point since Python 3.7. It automatically creates the event loop, runs the coroutine, and cleans up resources. Don’t manually call loop = asyncio.get_event_loop()—that’s a pitfall from older versions.

With the above version, the same 100 endpoints finished in about 18 seconds—a more than 10× improvement. And the whole thing is single‑threaded, with memory usage and debugging complexity orders of magnitude lower than multithreading.

Lessons from the Trenches: Three Hours That Almost Did Me In

Trap 1: Using time.sleep inside an async function

While testing, I casually added a time.sleep(1) to simulate latency. The entire event loop froze solid, and all coroutines stuck in place. That’s because time.sleep blocks the OS thread, never giving the event loop a chance to reschedule. The correct approach is await asyncio.sleep(1), which yields control back to the loop and resumes after the delay. Similarly, never call the requests library or any synchronous I/O inside a coroutine—use the async counterparts (aiohttp, aiofiles, aiomysql, etc.). A single blocking call destroys all concurrency.

Trap 2: Forgetting await turns your coroutine into a ghost

async def demo():
    asyncio.sleep(1)         # 返回一个协程对象，但没执行！
    await asyncio.sleep(1)   # 真正等待 1 秒

The first line merely creates a coroutine object—it’s never scheduled or awaited. This usually leads to logic bugs and a runtime RuntimeWarning: coroutine was never awaited. The solution is simple: whenever you call an async function, you must prepend await, or use asyncio.create_task() to schedule it.