BAOFUFAN


I Rewrote Our Scraper with asyncio. My CTO Thought I Added Servers.

Here's the story. Last week, the product manager dropped a requirement: we needed to fetch real-time market data from 200 sources, refresh every 10 seconds, and keep latency under 2 seconds. I looked at the existing code — synchronous requests fetching one by one. A full round took 14 seconds, CPU usage was below 5%, but all the time was wasted waiting on network I/O. The classic Python solution: asyncio. I spent an afternoon rewriting the core of the scraper. After deployment, QPS jumped from 20 to 500. The CTO stared at the monitoring dashboard for five minutes, then turned to ask if I had secretly added more machines. I said, "Nope, just changed a few dozen lines of Python."

Here, I'll share the core techniques, full code, and two deep pitfalls I fell into.


Your Python Spends Most of Its Time Waiting

Let's look at a typical scenario: fetching data from 10 URLs synchronously. The code might look like this:

import time
import requests

def fetch_sync(url: str) -> str:
    print(f"[{time.strftime('%X')}] 请求 {url}")
    resp = requests.get(url, timeout=5)
    return resp.text[:50]  # keep the first 50 chars, just for illustration

def main_sync():
    urls = [f"https://httpbin.org/delay/1?t={i}" for i in range(10)]
    start = time.perf_counter()
    results = [fetch_sync(url) for url in urls]
    elapsed = time.perf_counter() - start
    print(f"同步总耗时: {elapsed:.2f}s,数据条数: {len(results)}")

if __name__ == "__main__":
    main_sync()

This code takes over 12 seconds because each request blocks until it completes, so the 10 one-second requests run strictly one after another. The CPU is basically asleep the whole time: blocking I/O doesn't consume CPU cycles, but the thread is stuck waiting anyway.

Switch to asyncio: The Slowest Request Decides the Total Time

The core idea of asyncio: single-threaded + event loop. When a coroutine starts waiting for a network response, it yields control back to the event loop, which immediately schedules the next coroutine. The total time is roughly the duration of the slowest request, not the sum of all requests.

Here’s the full comparison code (you can run it):

import asyncio
import time
import httpx   # httpx provides native async support

# --- async version ---
async def fetch_async(client: httpx.AsyncClient, url: str) -> str:
    print(f"[{time.strftime('%X')}] 请求 {url}")
    resp = await client.get(url, timeout=5)
    return resp.text[:50]

async def main_async():
    urls = [f"https://httpbin.org/delay/1?t={i}" for i in range(10)]
    start = time.perf_counter()
    async with httpx.AsyncClient() as client:
        tasks = [fetch_async(client, url) for url in urls]
        results = await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start
    print(f"异步总耗时: {elapsed:.2f}s,数据条数: {len(results)}")

if __name__ == "__main__":
    asyncio.run(main_async())

The result: the async version finishes in about 1.1 seconds, barely more than a single request's latency. That's the power of asyncio.gather: it hands all the coroutines to the event loop at once, so their network waits overlap instead of adding up.
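
One caveat the happy path hides: by default, asyncio.gather raises the first exception any task throws, and the other results never reach you. If some of your 200 sources are flaky, pass return_exceptions=True so failures come back as values instead. A minimal sketch (the httpbin status URLs are just placeholders):

import asyncio
import httpx

async def fetch_one(client: httpx.AsyncClient, url: str) -> int:
    resp = await client.get(url, timeout=5)
    resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return resp.status_code

async def main():
    urls = [
        "https://httpbin.org/status/200",
        "https://httpbin.org/status/500",  # this one will fail
        "https://httpbin.org/status/200",
    ]
    async with httpx.AsyncClient() as client:
        # return_exceptions=True: failures land in the results list
        # instead of aborting the whole batch.
        results = await asyncio.gather(
            *(fetch_one(client, u) for u in urls),
            return_exceptions=True,
        )
    for url, result in zip(urls, results):
        status = f"failed: {result!r}" if isinstance(result, Exception) else result
        print(f"{url} -> {status}")

asyncio.run(main())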

Advanced Tip: Limit Concurrency to Avoid Overwhelming the Server

In reality, you can't run with unlimited concurrency: upstream APIs enforce rate limits, and your own machine only has so many ephemeral ports. This is where asyncio.Semaphore comes in, capping the number of in-flight requests, say at 20:

import asyncio
import httpx

CONCURRENCY = 20

async def fetch_with_limit(sem: asyncio.Semaphore, client: httpx.AsyncClient, url: str):
    async with sem:  # once 20 coroutines hold permits, the rest wait here
        resp = await client.get(url, timeout=5)
        return resp.status_code

async def main_limited():
    urls = [f"https://httpbin.org/delay/1?t={i}" for i in range(100)]
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        tasks = [fetch_with_limit(sem, client, url) for url in urls]
        results = await asyncio.gather(*tasks)
    print(f"完成 {len(results)} 个请求")

asyncio.run(main_limited())

The semaphore's mechanics are simple: each async with sem tries to acquire a permit. If all 20 permits are already held, the new coroutine is suspended until another one finishes and releases its permit. This is asyncio's native flow control, and it's far more elegant than juggling a thread pool.
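
For the curious, async with sem is just sugar for an explicit acquire/release pair. Here's a rough equivalent, a sketch only: asyncio.sleep(1) stands in for the real HTTP call, and the URLs are dummies:

import asyncio

async def fetch_with_limit_explicit(sem: asyncio.Semaphore, url: str) -> str:
    await sem.acquire()          # suspends here once all 20 permits are taken
    try:
        await asyncio.sleep(1)   # stand-in for the real client.get(url)
        return url
    finally:
        sem.release()            # hand the permit to the next waiting coroutine

async def main():
    sem = asyncio.Semaphore(20)
    urls = [f"https://example.com/{i}" for i in range(100)]
    results = await asyncio.gather(*(fetch_with_limit_explicit(sem, u) for u in urls))
    print(f"Finished {len(results)} simulated requests")

asyncio.run(main())

The try/finally matters: if the request raises, the permit still gets released, which is exactly what the context manager guarantees for you.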

Lessons Learned: Two Traps That Each Cost Me an Hour

Trap 1: Calling synchronous requests.get inside a coroutine

When I first started migrating, I got lazy and wrote requests.get(url) directly inside an async def. The result: that one call blocked the entire event loop, and my "concurrency" degraded to serial execution. It took me a while to land on the rule: never call a synchronous blocking function inside a coroutine. Either use an async-native library (httpx, aiohttp) or use loop.run_in_executor() to offload the blocking call to a thread pool:

# Temporary workaround: offload the sync call to a thread pool
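# The original snippet cuts off above. What follows is a hedged sketch of the
# workaround, not verbatim production code: fetch_blocking is a hypothetical
# stand-in for whatever legacy synchronous function you can't rewrite yet.
import asyncio
import requests

def fetch_blocking(url: str) -> str:
    # The legacy blocking call; invoked directly in a coroutine,
    # it would stall the whole event loop.
    resp = requests.get(url, timeout=5)
    return resp.text[:50]

async def fetch_via_executor(url: str) -> str:
    loop = asyncio.get_running_loop()
    # None = asyncio's default ThreadPoolExecutor; the blocking call runs in a
    # worker thread while the event loop keeps scheduling other coroutines.
    return await loop.run_in_executor(None, fetch_blocking, url)

async def main():
    urls = [f"https://httpbin.org/delay/1?t={i}" for i in range(10)]
    results = await asyncio.gather(*(fetch_via_executor(u) for u in urls))
    print(f"Done: {len(results)} results")

asyncio.run(main())

One caveat: the default executor has a bounded number of worker threads, so this buys back correctness rather than the full concurrency of a native async client. Treat it as a bridge until the blocking code is ported to httpx or aiohttp.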
