Here’s what happened. Last week, my boss dropped a requirement: “Scrape data from these 100,000 product pages, I need it by tonight.” I thought, easy – just a loop with requests. Then I did the math: at 0.5 seconds per request, serial execution would take 13.8 hours. I nearly broke down.
So I pulled out asyncio, planning to squeeze the time down to minutes with concurrent coroutines. My first attempt crashed three times: first, the target server blacklisted my IP; second, memory ballooned to 8GB and the OOM Killer took the process down; third, I discovered half the requests had silently returned exceptions without my noticing. After hitting these walls, I finally started to grasp asyncio. This article is my battle-scarred experience.
The Event Loop: Why It’s So Fast
Many people think asyncio is multi-threaded, but it actually runs in a single thread. The core is the Event Loop, which acts like a constantly spinning scheduler — when one coroutine is waiting for a network response, the loop immediately switches to another ready coroutine, keeping the CPU almost never idle. This makes it perfect for I/O-bound tasks: web scraping, API calls, database queries.
Let’s feel it with a simple example:
```python
import asyncio
import time

async def fetch(url):
    # Simulate network I/O without blocking the thread
    await asyncio.sleep(0.5)
    return f"{url} done"

async def main():
    start = time.time()
    # 10 tasks run concurrently: total time is about 0.5s, not 5s
    results = await asyncio.gather(
        *[fetch(f"https://page/{i}") for i in range(10)]
    )
    print(f"Elapsed: {time.time() - start:.2f}s")
    print(results[:3])

asyncio.run(main())
```
If you replaced the `await asyncio.sleep(0.5)` above with a synchronous `time.sleep(0.5)`, 10 iterations would take 5 seconds. `asyncio.gather` submits all 10 coroutines to the event loop at once, so their waits overlap and the total time is determined only by the slowest task, roughly 0.5 seconds here. That's the power of async concurrency.
Building a Scraper That Can Handle 100K Requests
The code above only uses asyncio.sleep. For real work you need to make HTTP requests. I used aiohttp, the asyncio ecosystem’s equivalent of requests. But blindly spawning 100K concurrent coroutines is essentially a DDoS — the target server will ban you instantly. That was my first crash.
That’s why you must use a Semaphore to limit concurrency. Usually 200-500 is a safe range:
```python
import asyncio
import aiohttp

SEM_LIMIT = 200

async def fetch(session, url, sem):
    async with sem:  # cap how many coroutines make requests at once
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                text = await resp.text()
                return url, len(text)
        except Exception as e:
            return url, str(e)

async def crawl(urls):
    sem = asyncio.Semaphore(SEM_LIMIT)
    # Reuse one TCP connection pool to avoid repeated handshakes
    connector = aiohttp.TCPConnector(limit=SEM_LIMIT)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, url, sem) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

# Run it
urls = [f"https://httpbin.org/delay/1?page={i}" for i in range(1000)]
results = asyncio.run(crawl(urls))
print(f"Completed {len(results)} requests")
```
A few key points here:

- `asyncio.Semaphore` acts like a lock — it only allows N coroutines to enter at once; the rest queue up outside, preventing unlimited connections from being created.
- `TCPConnector(limit=SEM_LIMIT)` caps the underlying connection pool size, working together with the semaphore to avoid port exhaustion.
- `return_exceptions=True` is extremely important (more on that later).
- `aiohttp.ClientTimeout` sets a total timeout of 10 seconds, preventing a single slow request from hanging the whole batch.
The Three Pitfalls That Really Made Me Suffer
Pitfall 1: One Exception in gather Blows Up Everything
Initially I didn't add `return_exceptions=True`. A few minutes into the crawl, a single 500 error knocked the whole thing down. The default behavior of `asyncio.gather` is: if any child coroutine raises an exception, it immediately propagates that exception (and the remaining tasks are effectively abandoned). With 100K requests, a few timeouts or 5xx errors are completely normal — but the whole batch would be wasted. The fix: `return_exceptions=True` makes `gather` return exceptions as ordinary items in the result list instead of raising, so the rest of the tasks keep running. Afterward you can filter out and handle the failed ones.
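As a minimal sketch of that post-processing step (the `fake_fetch` coroutine here is a made-up stand-in for a real request), this is how the result list from `gather(..., return_exceptions=True)` can be split into successes and failures:

```python
import asyncio

async def fake_fetch(i):
    # Hypothetical stand-in for a request: every third "page" fails
    if i % 3 == 0:
        raise RuntimeError(f"simulated 500 on page {i}")
    return f"page {i} ok"

async def main():
    results = await asyncio.gather(
        *(fake_fetch(i) for i in range(6)), return_exceptions=True
    )
    # With return_exceptions=True, exceptions come back as ordinary list items
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    print(f"{len(ok)} succeeded, {len(failed)} failed")
    return ok, failed

ok, failed = asyncio.run(main())
```

The failed list can then be fed back into a retry pass instead of throwing the whole batch away.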
Pitfall 2: Accidentally Blocking Inside a Coroutine
While parsing HTML I used BeautifulSoup — CPU-intensive, but not a killer. The real disaster was that in an early version I casually sprinkled time.sleep(0.1) to introduce “anti-scraping delays”. time.sleep is a synchronous block — it brings the entire event loop to a halt; all coroutines are frozen for those 0.1 seconds. The correct way is await asyncio.sleep(0.1). Any operation that could make the thread wait (file I/O, synchronous requests, heavy CPU work) must either be offloaded to a thread pool (loop.run_in_executor) or replaced with an async library. Otherwise, your concurrency collapses into serial execution because of one blocking task.
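Here is a small sketch of the thread-pool escape hatch (`parse_html` is a made-up stand-in for BeautifulSoup-style blocking work): offloading it via `loop.run_in_executor` lets five blocking calls overlap instead of freezing the event loop one after another.

```python
import asyncio
import time

def parse_html(html):
    # Made-up stand-in for blocking work (e.g. BeautifulSoup parsing)
    time.sleep(0.2)  # synchronous sleep: called directly, this would freeze the loop
    return len(html)

async def handle(html):
    loop = asyncio.get_running_loop()
    # Offload the blocking call to the default thread pool, freeing the loop
    return await loop.run_in_executor(None, parse_html, html)

async def main():
    start = time.monotonic()
    results = await asyncio.gather(*(handle("<html>" * 100) for _ in range(5)))
    elapsed = time.monotonic() - start
    print(f"5 blocking parses finished in {elapsed:.2f}s")
    return results, elapsed

results, elapsed = asyncio.run(main())
```

Because the five `time.sleep(0.2)` calls run in separate threads, the total is roughly 0.2 seconds rather than 1 second — the same trick applies to any synchronous library you can't easily replace.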
Pitfall 3: Connection Pool Exhaustion and DNS Resolution Storms
The default TCPConnector caps connections at 100. When you set the Semaphore to 500, 400 coroutines will simply sit waiting for a connection. Worse, every new connection needs a DNS lookup, and aiohttp's default resolver runs the blocking socket.getaddrinfo in a thread pool — under a storm of new connections, those threads become a bottleneck. The solution is to install aiodns and pass an async resolver to the connector (e.g., aiohttp.TCPConnector(limit=..., resolver=aiohttp.AsyncResolver())) so DNS resolution is truly async and doesn't drag the event loop down.