Paige Bailey for Google AI

Benchmarking on a Budget: Running massive evals for 50% less with the Gemini Batch API ⚡️

Hey developers! 👋

Running evaluations on LLMs can be a bit of a headache. You hit rate limits, you stare at loading bars, and -- probably worst of all -- you burn through your API budget faster than a GPU on a training run. But what if I told you there’s a way to run thousands of prompts asynchronously, at 50% of the cost, without blocking your main thread?

Enter the Gemini Batch API.

Today, we are going to take a classic coding benchmark from Hugging Face -- the OpenAI HumanEval dataset -- and run it through Google’s small, lightweight gemini-2.5-flash-lite model using the Batch API. Then, we’re going to evaluate the code it generates and visualize the results.

Grab your coffee (or tea 🍵), and let’s dive in!


The stack

Here's what we'll be building with today:

  • Google Gen AI SDK: To talk to Gemini.
  • Hugging Face Datasets: To get our evals.
  • Pandas and Seaborn: To make the data look pretty.
  • Python exec: To (carefully!) run the generated code.

Step 1: Preparing the Data

First things first, we need our prompts. We are using the openai_humaneval dataset, which contains 164 coding problems.

The Batch API loves JSONL (JSON Lines) files. Each line is a separate request. We need to iterate through the dataset and format it so Gemini understands that we want it to write Python code.

import json
from datasets import load_dataset

# Load the test split
ds = load_dataset("openai/openai_humaneval", split="test")

jsonl_filename = "humaneval_input.jsonl"
print(f"Generating {jsonl_filename}...")

with open(jsonl_filename, "w") as f:
    for item in ds:
        # Sanitize the ID
        custom_id = item["task_id"].replace("/", "_")

        # Prompt Engineering: Be specific!
        prompt_text = (
            f"Complete the following Python function. "
            f"Provide ONLY the code, no explanation.\n\n{item['prompt']}"
        )

        # Construct the request object
        entry = {
            "custom_id": custom_id,
            "method": "generateContent",
            "request": {
                "contents": [{"role": "user", "parts": [{"text": prompt_text}]}],
                "generation_config": {"temperature": 1.0}
            }
        }
        f.write(json.dumps(entry) + "\n")

Pro Tip: Notice custom_id? That’s your best friend. Since batch jobs are asynchronous, results might not come back in the same order you sent them. The ID helps you map the answer back to the question.

Step 2: Upload the data and kick off the batch job

Now that we have our humaneval_input.jsonl, we need to upload it to the Gemini API and tell Gemini to get to work. We are using gemini-2.5-flash-lite here because it is fast, efficient, and perfect for high-volume tasks like this -- but you could use any supported Gemini model instead.

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# 1. Upload the file
print("Uploading file to Gemini API...")
uploaded_file = client.files.upload(
    file=jsonl_filename,
    config={'mime_type': 'application/jsonl'}
)

# 2. Kick off the Batch Job
print("Submitting batch job...")
batch_job = client.batches.create(
    model="gemini-2.5-flash-lite",
    src=uploaded_file.name,
    config=types.CreateBatchJobConfig(display_name="humaneval_batch_job")
)

print(f"Batch Job Created: {batch_job.name}")
print(f"Current Status: {batch_job.state}")

And now... we wait. ⏳

Batch jobs aren't instant (that's the trade-off for the discount), but for an evaluation pipeline that's perfect. Go stretch, grab a snack, or check Twitter. In my runs, HumanEval jobs have typically come back in about 10 minutes, and most batch jobs finish within a few hours.
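
If you'd rather script the wait than eyeball it, a simple polling loop does the trick. Here's a minimal sketch -- the exact set of terminal state names is my assumption, so double-check against the Batch API docs:

import time

# Poll until the job reaches a terminal state (state names below are assumed)
while True:
    batch_job = client.batches.get(name=batch_job.name)
    state = batch_job.state.name
    print(f"Current state: {state}")
    if state in ("JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
        break
    time.sleep(60)  # check once a minute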

Step 3: Downloading the results

Once the job hits JOB_STATE_SUCCEEDED, the results are ready to come home.

# Check status (you'd likely loop this in production)
job = client.batches.get(name=batch_job.name)

if job.state.name == 'JOB_STATE_SUCCEEDED':
    remote_filename = job.dest.file_name
    print(f"Downloading results from: {remote_filename}")

    # Save the output locally
    content_bytes = client.files.download(file=remote_filename)

    with open("results.jsonl", "wb") as f:
        f.write(content_bytes)

    print("✅ Results saved locally!")

Step 4: Evaluations

This is where the magic happens. We have the code Gemini wrote; now we need to see if it actually works. We’re going to loop through our results, extract the Python code (removing those pesky markdown backticks), and run it against the unit tests provided in the HumanEval dataset.
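
Before the loop below, we need to load results.jsonl into a results_map dictionary keyed by our custom_id (the eval code keeps a placeholder comment for this step). Here's a minimal sketch; the response fields I'm indexing into are an assumption about the output format, so adjust them to match what your results file actually contains:

import json
import re

results_map = {}
with open("results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        cid = record.get("custom_id", "")
        # Dig the text out of the first candidate (assumed response shape)
        try:
            text = record["response"]["candidates"][0]["content"]["parts"][0]["text"]
        except (KeyError, IndexError, TypeError):
            text = ""
        # Strip markdown fences like ```python ... ``` if the model added them
        match = re.search(r"```(?:python)?\n?(.*?)```", text, re.DOTALL)
        results_map[cid] = (match.group(1) if match else text).strip()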

⚠️ Warning: We are using exec() here. In a production app, running untrusted code is a huge security no-no. But for a local sandbox evaluation, we live on the edge! We’ll wrap it in a signal timeout so infinite loops don’t freeze our machine.

import signal

# Safety timeout handler
class TimeoutException(Exception): pass
def timeout_handler(signum, frame): raise TimeoutException()
signal.signal(signal.SIGALRM, timeout_handler)

results_map = {}
# ... (Load results into results_map dictionary) ...

passed = 0
total = len(ds)

print("Starting Evaluation with Timeouts...\n")

for item in ds:
    cid = item["task_id"].replace("/", "_")
    generated_code = results_map.get(cid, "")

    # Combine prompt + generated code + test case
    test_script = f"{item['prompt']}\n{generated_code}\n\n{item['test']}\ncheck({item['entry_point']})"

    try:
        signal.alarm(2) # 2-second timeout per problem
        exec_globals = {}
        exec(test_script, exec_globals)
        signal.alarm(0) # Disable alarm
        passed += 1
        print(f"{item['task_id']}: Passed")
    except Exception as e:
        signal.alarm(0)
        print(f"{item['task_id']}: FAILED ({type(e).__name__})")

Step 5: The results!

So, how did gemini-2.5-flash-lite do? Let's visualize it using Seaborn.
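
The df used in the plot below is just one row per task with a status label. I'm assuming you append those labels while running the eval loop -- the records list and the dummy rows here are mine, purely to show the shape:

import pandas as pd

# Hypothetical records collected during the eval loop, e.g.
#   records.append({"task_id": item["task_id"], "status": "Passed"})
# with status in {"Passed", "Failed", "Error", "No Code"}.
records = [
    {"task_id": "HumanEval_0", "status": "Passed"},
    {"task_id": "HumanEval_1", "status": "Failed"},
    {"task_id": "HumanEval_2", "status": "Error"},
]
df = pd.DataFrame(records)

# Quick sanity check before plotting
print(df["status"].value_counts())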

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# ... (Create DataFrame from results) ...

plt.figure(figsize=(10, 6))
sns.set_theme(style="whitegrid")
colors = {"Passed": "#2ca02c", "Failed": "#d62728", "Error": "#ff7f0e", "No Code": "#7f7f7f"}

ax = sns.countplot(y="status", data=df, hue="status", palette=colors, order=df['status'].value_counts().index)
plt.title("HumanEval Evaluation Results", fontsize=16)
plt.show()

🥁 Drumroll please...

(Bar chart: HumanEval evaluation results by status.)

  • Passed: 142 tasks (86.59%)
  • Failed: 8 tasks
  • Errors: 9 tasks

86.59% Pass Rate! 🤯

That is incredibly impressive for a "Flash-Lite" model. It handled complex algorithmic logic, string manipulation, and math problems, passing the vast majority of them.

Final thoughts

The Gemini Batch API is a game-changer for workflows like this.

  1. Cost: We saved 50% on tokens.
  2. Scale: We didn't have to manage async loops or retry logic.
  3. Performance: The Gemini 2.5 Flash-Lite model punched way above its weight class.

If you have large datasets, extensive prompting jobs, or nightly evaluations, definitely give the Batch API a spin. And if you'd like to see the full code, check out this Colab notebook.

Happy building! ✨

