## TL;DR

Still calling the Gemini API in a `for` loop? Here are 4 async patterns that take 100 requests from 100 seconds to about 5 seconds — plus the Batch API, which cuts costs by 50%. Based on the official cookbook notebook.
## The Problem

The old `google-generativeai` SDK has no `.aio` namespace. If you've been building on that, you've been stuck with sync calls. The new `google-genai` SDK fixes this:
```bash
pip install -U google-genai
```
Check you're on the new one:

```python
from google import genai

print(genai.__version__)  # should be 1.x.x+
```
## Pattern 1 — Basic Async

The async namespace lives at `client.aio`. That's the only thing you really need to remember.
```python
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_KEY")

async def generate(prompt: str) -> str:
    response = await client.aio.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
    )
    return response.text

async def main():
    prompts = ["Q1...", "Q2...", "Q3..."]
    tasks = [generate(p) for p in prompts]
    for coro in asyncio.as_completed(tasks):
        result = await coro
        print(result[:60])

asyncio.run(main())
```
`asyncio.as_completed()` streams results as they complete — great for UIs that want to show fast responses first. Use `asyncio.gather()` when you need all results in order, as in the sketch below.
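For reference, the ordered variant is just as short (a sketch reusing `generate()` from above):

```python
async def main_ordered(prompts: list[str]) -> list[str]:
    # gather preserves input order: results[i] answers prompts[i]
    return await asyncio.gather(*(generate(p) for p in prompts))
```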
## Pattern 2 — Semaphore for Rate Limits

Free-tier Gemini 2.0 Flash allows 15 RPM. Fire off 100 tasks with `gather` and you'll hit `429 RESOURCE_EXHAUSTED` immediately.

`asyncio.Semaphore(N)` caps concurrent in-flight requests:
```python
sem = asyncio.Semaphore(10)

async def safe_generate(prompt: str) -> str:
    async with sem:
        response = await client.aio.models.generate_content(
            model="gemini-2.0-flash",
            contents=prompt,
        )
    return response.text

async def batch_process(prompts):
    tasks = [safe_generate(p) for p in prompts]
    return await asyncio.gather(*tasks)
```
Rule of thumb for the semaphore value: `RPM / 60 × avg_response_seconds × 0.7`, rounded down and kept at least 1. That's the steady-state number of in-flight requests with a ~30% safety margin.
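As a quick sketch of that arithmetic (the 0.7 safety factor comes from the rule above; the tier numbers are illustrative assumptions):

```python
import math

def semaphore_size(rpm: int, avg_response_s: float, safety: float = 0.7) -> int:
    # steady-state in-flight requests = requests/second × seconds per request,
    # scaled by a safety margin and never below 1
    return max(1, math.floor(rpm / 60 * avg_response_s * safety))

semaphore_size(15, 4)     # free tier, ~4s responses   -> 1
semaphore_size(1000, 5)   # a paid tier, ~5s responses -> 58
```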
### With retry on 429
```python
from google.genai import errors

async def safe_generate_retry(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            async with sem:
                response = await client.aio.models.generate_content(
                    model="gemini-2.0-flash",
                    contents=prompt,
                )
            return response.text
        except errors.APIError as e:
            # the new google-genai SDK raises errors.APIError (e.code == 429)
            # rather than google.api_core's ResourceExhausted
            if e.code != 429:
                raise
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise RuntimeError("max retries exceeded")
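One optional hardening step (my addition, not from the cookbook): add jitter so parallel workers that hit 429 together don't all retry at the same instant.

```python
import random

async def backoff(attempt: int) -> None:
    # exponential backoff plus up to 1s of random jitter to desynchronize retries
    await asyncio.sleep(2 ** attempt + random.uniform(0, 1))
```

Swap it in for the bare `asyncio.sleep(2 ** attempt)` above.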
## Pattern 3 — aiohttp for Parallel Image Processing

If you're analyzing 50+ images with Gemini vision, the downloads themselves become the bottleneck. Combine `aiohttp` with `client.aio`:
```python
import aiohttp
from google.genai import types

async def download_and_analyze(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        image_bytes = await resp.read()
    response = await client.aio.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            "Describe this image in one sentence.",
            # the new SDK wraps inline bytes in a Part, not a raw dict
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        ],
    )
    return response.text

async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [download_and_analyze(session, u) for u in urls]
        return await asyncio.gather(*tasks)
```
Network I/O and API wait time overlap. Big win for batch media pipelines.
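To verify the overlap on your own data, here's a minimal timing harness (a sketch; `main()` is the function defined above, and you supply the URL list):

```python
import time

async def timed(urls: list[str]):
    start = time.perf_counter()
    results = await main(urls)
    print(f"{len(urls)} images in {time.perf_counter() - start:.1f}s")
    return results
```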
## Pattern 4 — GenAI Processors

Google's official high-level library, with operator-based composition:

```bash
pip install genai-processors
```
```python
# Sequential (+)
pipeline = input_proc + gemini_model + output_proc

# Parallel fan-out (//)
parallel = processor_a // processor_b
```
The `//` operator routes the same input to both processors at the same tick. Ideal for real-time Live agents handling audio + text + function calls together.
Official: google-gemini/genai-processors · Blog post
## The Hidden Winner: Batch API

If you don't need real-time responses, the Batch API is the obvious choice:
| Metric | Sync | Async Parallel | Batch API |
|---|---|---|---|
| Total time (N req) | T × N | ~T | 1–15 min (SLA 24h) |
| Cost | Standard | Standard | 50% off |
| Max volume | Any | Within RPM | 10k per batch |
| Real-time | Yes | Yes | No |
JSONL format:
{"custom_id":"req-1","request":{"model":"gemini-2.0-flash","contents":[{"parts":[{"text":"Translate: Hello"}]}]}}
{"custom_id":"req-2","request":{"model":"gemini-2.0-flash","contents":[{"parts":[{"text":"Translate: World"}]}]}}
```python
uploaded = client.files.upload(file="batch_input.jsonl")

batch_job = client.batches.create(
    model="gemini-2.0-flash",
    src=uploaded.name,
)
```
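Batch jobs complete asynchronously on Google's side, so you poll for the result. A sketch, assuming the `JOB_STATE_*` values from the batch docs:

```python
import time

done_states = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}

# batch turnaround is minutes to hours, so a slow poll is plenty
while batch_job.state.name not in done_states:
    time.sleep(30)
    batch_job = client.batches.get(name=batch_job.name)

print(batch_job.state.name)
```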
Official docs: https://ai.google.dev/gemini-api/docs/batch-api
## Which Pattern Should I Use?

| Scenario | Pattern |
|---|---|
| Single call, testing | Sync `client.models.generate_content` |
| 5–50 immediate responses | Pattern 1 |
| 50–500 immediate responses | Pattern 2 |
| Image/file batch analysis | Pattern 3 |
| Audio + text + function calls | Pattern 4 |
| 10k+ overnight jobs | Batch API |
## Gotchas

- TPM limits exist alongside RPM. Long prompts in parallel can hit TPM before RPM.
- Event loops: `asyncio.run()` in scripts, direct `await` in Jupyter, `nest_asyncio` for nested cases.
- Parallel != cheaper. Same total tokens, same bill. Want cost savings? Use the Batch API.
- SDK: `google-generativeai` (old) has no `.aio`. Always use `google-genai`.
## Conclusion

Async is the default for any Gemini workload above ~5 requests. Start with Pattern 2 (Semaphore) — it handles most cases. Drop into Pattern 3 for media, Pattern 4 for complex pipelines, and the Batch API for cost-sensitive volume.

Got a Gemini script with `for prompt in prompts:` in it right now? Change it today. The throughput delta is enormous.