정상록
Gemini API: From 100s to 5s with 4 Async Patterns (google-genai SDK)

TL;DR

Still calling Gemini API in a for loop? Here are 4 async patterns that take 100 requests from 100 seconds to about 5 seconds — plus the Batch API that cuts costs by 50%. Based on the official cookbook notebook.

The Problem

The old google-generativeai SDK has no .aio namespace. If you've been building on that, you've been stuck with sync calls. The new google-genai SDK fixes this.

pip install -U google-genai

Check you're on the new one:

from google import genai
print(genai.__version__)  # should be 1.x.x+

Pattern 1 — Basic Async

The async namespace lives at client.aio. That's the only thing you really need to remember.

import asyncio
from google import genai

client = genai.Client(api_key="YOUR_KEY")

async def generate(prompt: str) -> str:
    response = await client.aio.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
    )
    return response.text

async def main():
    prompts = ["Q1...", "Q2...", "Q3..."]
    tasks = [generate(p) for p in prompts]
    for coro in asyncio.as_completed(tasks):
        result = await coro
        print(result[:60])

asyncio.run(main())

asyncio.as_completed() streams results as they complete — great for UIs that want to show fast responses first. Use asyncio.gather() when you need all results in order.
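The ordering difference is easy to see in a self-contained sketch where asyncio.sleep stands in for the API call (no Gemini key needed):

```python
import asyncio

async def generate(prompt: str, latency: float) -> str:
    await asyncio.sleep(latency)  # stand-in for the Gemini call
    return prompt

async def main():
    jobs = [("slow", 0.09), ("fast", 0.01), ("medium", 0.05)]
    # as_completed yields in completion order: fastest response first
    streamed = [await coro for coro in asyncio.as_completed(
        [generate(p, lat) for p, lat in jobs])]
    # gather preserves submission order, regardless of completion time
    ordered = await asyncio.gather(*(generate(p, lat) for p, lat in jobs))
    return streamed, ordered

streamed, ordered = asyncio.run(main())
print(streamed)  # ['fast', 'medium', 'slow']
print(ordered)   # ['slow', 'fast', 'medium']
```

Swap the sleeps for real `client.aio.models.generate_content` calls and the ordering behavior is identical.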

Pattern 2 — Semaphore for Rate Limits

Free tier Gemini 2.0 Flash = 15 RPM. Fire off 100 tasks with gather and you'll hit 429 Resource Exhausted immediately.

asyncio.Semaphore(N) caps concurrent in-flight requests:

sem = asyncio.Semaphore(10)

async def safe_generate(prompt: str) -> str:
    async with sem:
        response = await client.aio.models.generate_content(
            model="gemini-2.0-flash",
            contents=prompt,
        )
        return response.text

async def batch_process(prompts):
    tasks = [safe_generate(p) for p in prompts]
    return await asyncio.gather(*tasks)

Rule of thumb for the Semaphore value (Little's law — in-flight requests = request rate × response time — plus a safety margin): ceil(RPM / 60 × avg_response_seconds × 0.7).
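As a sketch, that rule of thumb in code (the function name and the 0.7 safety factor are my own choices, not anything from the SDK):

```python
import math

def semaphore_size(rpm: int, avg_response_seconds: float, safety: float = 0.7) -> int:
    # Little's law: steady-state in-flight requests = request rate x service time.
    # RPM / 60 converts the quota to requests per second; the safety factor
    # leaves headroom so bursts don't trip the rate limiter.
    return max(1, math.ceil(rpm / 60 * avg_response_seconds * safety))

print(semaphore_size(15, 4))    # free-tier Flash, ~4s responses: 1 (effectively serial)
print(semaphore_size(2000, 4))  # a paid-tier quota: 94
```

On the free tier the math lands at 1, which is why retry-with-backoff (below) matters more than raw concurrency there.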

With retry on 429

from google.genai import errors  # google-genai's own errors, not google.api_core

async def safe_generate_retry(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            async with sem:
                response = await client.aio.models.generate_content(
                    model="gemini-2.0-flash",
                    contents=prompt,
                )
                return response.text
        except errors.APIError as e:
            if e.code != 429:  # only retry rate-limit (Resource Exhausted) errors
                raise
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise RuntimeError("max retries exceeded")

Pattern 3 — aiohttp for Parallel Image Processing

If you're analyzing 50+ images with Gemini Vision, the download itself becomes the bottleneck. Combine aiohttp with client.aio:

import aiohttp
from google.genai import types

async def download_and_analyze(session, url):
    async with session.get(url) as resp:
        image_bytes = await resp.read()

    response = await client.aio.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            "Describe this image in one sentence.",
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        ],
    )
    return response.text

async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [download_and_analyze(session, u) for u in urls]
        return await asyncio.gather(*tasks)

Network I/O and API wait time overlap. Big win for batch media pipelines.
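A toy timing sketch (sleeps stand in for the download and the model call) shows the overlap: ten two-stage tasks finish in roughly the time of one.

```python
import asyncio
import time

async def download_and_analyze(i: int) -> int:
    await asyncio.sleep(0.05)  # stage 1: image download (network I/O)
    await asyncio.sleep(0.05)  # stage 2: Gemini call (API wait)
    return i

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(download_and_analyze(i) for i in range(10)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
# Sequentially this would take 10 x 0.1s = 1s; concurrently it's ~0.1s.
print(f"{elapsed:.2f}s for {len(results)} tasks")
```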

Pattern 4 — GenAI Processors

Google's official high-level library. Operator-based composition:

pip install genai-processors
# Sequential (+)
pipeline = input_proc + gemini_model + output_proc

# Parallel fan-out (//)
parallel = processor_a // processor_b

The // operator routes the same input to both processors at the same tick. Ideal for Real-Time Live agents handling audio + text + function calls together.

Official: google-gemini/genai-processors · Blog post

The Hidden Winner: Batch API

If you don't need real-time responses, Batch API is the obvious choice:

| Metric | Sync | Async Parallel | Batch API |
|---|---|---|---|
| Total time (N req) | T × N | ~T | 1–15 min (SLA 24h) |
| Cost | Standard | Standard | 50% off |
| Max volume | Any | Within RPM | 10k per batch |
| Real-time | Yes | Yes | No |

JSONL format:

{"custom_id":"req-1","request":{"model":"gemini-2.0-flash","contents":[{"parts":[{"text":"Translate: Hello"}]}]}}
{"custom_id":"req-2","request":{"model":"gemini-2.0-flash","contents":[{"parts":[{"text":"Translate: World"}]}]}}
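Generating that file in Python is a one-liner per prompt — this sketch just mirrors the JSONL rows above:

```python
import json

prompts = ["Translate: Hello", "Translate: World"]

# Write one request object per line, matching the JSONL rows shown above
with open("batch_input.jsonl", "w") as f:
    for i, text in enumerate(prompts, start=1):
        row = {
            "custom_id": f"req-{i}",
            "request": {
                "model": "gemini-2.0-flash",
                "contents": [{"parts": [{"text": text}]}],
            },
        }
        f.write(json.dumps(row) + "\n")

lines = open("batch_input.jsonl").read().splitlines()
print(len(lines))  # 2
```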
uploaded = client.files.upload(file="batch_input.jsonl")
batch_job = client.batches.create(
    model="gemini-2.0-flash",
    src=uploaded.name,
)

Official docs: https://ai.google.dev/gemini-api/docs/batch-api

Which Pattern Should I Use?

| Scenario | Pattern |
|---|---|
| Single call, testing | Sync client.models.generate_content |
| 5–50 immediate responses | Pattern 1 |
| 50–500 immediate responses | Pattern 2 |
| Image/file batch analysis | Pattern 3 |
| Audio + text + function calls | Pattern 4 |
| 10k+ overnight jobs | Batch API |

Gotchas

  1. TPM limits exist alongside RPM. Long prompts in parallel can hit TPM before RPM.
  2. Event loops: asyncio.run() in scripts, direct await in Jupyter, nest_asyncio for nested cases.
  3. Parallel != cheaper. Same total tokens = same bill. Want cost savings? Use Batch API.
  4. SDK: google-generativeai (old) has no .aio. Always use google-genai.
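For gotcha 2, a minimal sketch of the script case (in Jupyter you would `await main()` directly instead):

```python
import asyncio

async def fetch(i: int) -> int:
    await asyncio.sleep(0)  # stand-in for an awaited Gemini call
    return i * 2

async def main() -> list:
    return await asyncio.gather(*(fetch(i) for i in range(3)))

# In a plain script, asyncio.run creates and closes the event loop for you.
# Calling it inside an already-running loop (e.g. Jupyter) raises RuntimeError;
# that is the case nest_asyncio.apply() exists for.
results = asyncio.run(main())
print(results)  # [0, 2, 4]
```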

Conclusion

Async is the default for any Gemini workload above ~5 requests. Start with Pattern 2 (Semaphore) — it handles most cases. Drop into Pattern 3 for media, Pattern 4 for complex pipelines, Batch API for cost-sensitive volume.

Got a Gemini script with for prompt in prompts: in it right now? Change it today. The throughput delta is enormous.
