Joske Vermeulen

Posted on May 20 • Edited on May 21

I Upgraded a Production AI Agent to Gemini 3.5 Flash 12 Hours After Google I/O - Here's What I Found

#devchallenge #googleiochallenge

This is a submission for the Google I/O Writing Challenge

The Setup

I'm running an experiment called The $100 AI Startup Race. 7 AI coding agents each get $100 and 12 weeks to autonomously build real startups. No human coding. The agents run on cron jobs, commit to GitHub, deploy to Vercel, and try to generate revenue.

One of those agents is Gemini. It's been running on Gemini CLI with a combo of 2.5 Pro (premium sessions) and 2.5 Flash (cheap sessions) since April 20. I tried 3.1 Pro during the test runs before the race, but it was unreliable - frequent "model not available" errors made it unusable for autonomous cron-based sessions. So I stuck with 2.5. After 4 weeks and 1,259 commits, Gemini is in last place. Stuck in bug loops. Writing code that crashes, filing help requests about database tables it could create itself, and burning sessions on infrastructure it already has.

Then Google I/O happened.

What Google Dropped (May 19)

Gemini 3.5 Flash. The headline numbers:

76.2% Terminal-Bench 2.1 (agentic coding) - beats 3.1 Pro's 70.3%
83.6% MCP Atlas (multi-step workflows) - highest of any model
289 tokens/sec output - 4x faster than Claude Opus 4.7 or GPT-5.5
$1.50 / $9 per 1M tokens - cheaper than 3.1 Pro

A Flash-tier model outperforming the previous Pro model. That's never happened before.

And one more thing: Gemini CLI is being retired on June 18, 2026. Replaced by Antigravity CLI (agy).

I had to upgrade. The model my agent runs on is two generations behind, and the tool it uses is being retired in 4 weeks.

Installing Antigravity CLI on a Headless VPS

My race agents run on a VPS (Ubuntu, no GUI). Here's how the install went:

curl -fsSL https://antigravity.google/cli/install.sh | bash

Binary lands at /root/.local/bin/agy. Add to PATH:

export PATH="/root/.local/bin:$PATH"
agy --version  # 1.0.0
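
That export only lasts for the current shell. Cron typically starts jobs with a minimal PATH and does not read interactive shell profiles, so the binary may not be found from a cron job. A minimal sketch of making the PATH entry persistent for login shells, assuming a POSIX shell; `ensure_line` is a hypothetical helper, not something agy ships:

```shell
# Append a line to a file only if it is not already present (idempotent).
# ensure_line is a hypothetical helper name, not part of agy.
ensure_line() {
  line=$1
  file=$2
  touch "$file"
  # -x matches whole lines, -F treats the line as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (commented out so sourcing this file has no side effects):
# ensure_line 'export PATH="$HOME/.local/bin:$PATH"' "$HOME/.profile"
```

For cron itself, the simpler options are to call the binary by its absolute path (/root/.local/bin/agy) or to put a `PATH=/root/.local/bin:/usr/bin:/bin` line at the top of the crontab.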

The Auth Challenge

First run needs OAuth. On a headless server, agy detects the SSH session and prints an auth URL:

Authentication required. Please visit the URL to log in:
  https://accounts.google.com/o/oauth2/auth?...

Waiting for authentication (timeout 30s)...

You have 30 seconds to open that URL in your browser and complete the Google login. Tight, but it works. Token gets stored and all future calls are authenticated.

Discovery #1: No Model Selection Flag

Here's what surprised me. The old Gemini CLI had -m gemini-2.5-pro to pick your model. Antigravity CLI has... nothing:

Usage of agy:
  --dangerously-skip-permissions  Auto-approve all tool permission requests
  --print                         Run a single prompt non-interactively
  --print-timeout                 Timeout for print mode (default 5m0s)
  --sandbox                       Run in a sandbox

No --model. No env var. No config file. I tried settings.json, GEMINI.md directives, and environment variables. None of them changed the model.

As far as I can tell, agy auto-selects Gemini 3.5 Flash based on your subscription tier and quota. Server-side routing, no client control. For my use case (autonomous agent on cron), this actually simplifies things - one command, best available model.

Discovery #2: Unified Quota Across Models

On my Mac (same Google account, AI Pro $20/month), I can see the quota dashboard:

Gemini 3.5 Flash (High)      - Refreshes in 4h 42m
Gemini 3.5 Flash (Medium)    - Refreshes in 4h 42m
Gemini 3.1 Pro (High)        - Refreshes in 4h 42m
Gemini 3.1 Pro (Low)         - Refreshes in 4h 42m
Claude Sonnet 4.6 (Thinking) - Refreshes in 4h 58m
Claude Opus 4.6 (Thinking)   - Refreshes in 4h 58m
GPT-OSS 120B (Medium)        - Refreshes in 4h 58m

Two things jumped out:

Gemini Flash and Pro share the same quota pool. When I used 3.5 Flash, the 3.1 Pro timer dropped at the same time. They're not independent buckets - it's one "Gemini compute" pool that both models draw from.
Multi-model access - Antigravity bundles Claude, GPT-OSS, and Gemini models in one $20/month subscription. Google is positioning this as a model-agnostic platform, not just a Gemini wrapper.

The 5-hour refresh cycle and shared pool mean you need to be strategic about which models you use and when.

The First Real Test

I set up a minimal bug-fix test in the race-gemini directory:

echo 'Fix the bug in math.js. Run npm test to verify.' | \
  agy --print --print-timeout 3m --dangerously-skip-permissions

Result:

I have successfully fixed the bug in math.js and verified it using npm test.

### Summary of Changes
1. Identified the Target File
2. Fixed the Bug: Updated the add function to use addition (+) instead of subtraction (-)
3. Verified the Fix: npm test passes with output: PASS

It found the file, read it, identified the bug, fixed it, ran the tests, and confirmed. Clean execution. No help requests filed. No infinite loops.
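
The fixture itself isn't shown above. For anyone who wants to repeat the test, here is a sketch of what a comparable one could look like. Only math.js, the add function, and the PASS output come from the result above; test.js, the package.json contents, and the `make_fixture` name are assumptions for illustration:

```shell
# Build a throwaway fixture with one deliberate bug.
# Everything except math.js / add() / "PASS" is an assumption.
make_fixture() {
  dir=$1
  mkdir -p "$dir"

  # The planted bug: add() subtracts instead of adding.
  cat > "$dir/math.js" <<'EOF'
function add(a, b) {
  return a - b;
}
module.exports = { add };
EOF

  # A tiny test using Node's built-in assert; prints PASS on success.
  cat > "$dir/test.js" <<'EOF'
const assert = require('assert');
const { add } = require('./math');
assert.strictEqual(add(2, 3), 5);
console.log('PASS');
EOF

  # "npm test" runs the "test" script defined here.
  cat > "$dir/package.json" <<'EOF'
{
  "name": "agy-smoke-test",
  "version": "1.0.0",
  "private": true,
  "scripts": { "test": "node test.js" }
}
EOF
}
```

Running `make_fixture /tmp/agy-smoke && cd /tmp/agy-smoke` and then the agy command above would reproduce the shape of the test: npm test fails on the assertion until add uses +, then prints PASS.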

The Migration

Old setup:

# Premium sessions (2x/day)
echo "$PROMPT" | gemini --yolo -m gemini-2.5-pro

# Cheap sessions (6x/day)
echo "$PROMPT" | gemini --yolo -m gemini-2.5-flash

New setup:

# All sessions (8x/day, single tier)
echo "$PROMPT" | agy --print --print-timeout 10m --dangerously-skip-permissions
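
With every session going through one command, a small wrapper can keep cron runs from overlapping and keep each session's output. This is a sketch under assumptions, not the race's actual script: `run_session`, `AGENT_CMD`, `LOCK_FILE`, and `LOG_DIR` are hypothetical names, and it relies on `flock` (util-linux) and `timeout` (coreutils) being installed, which is normally the case on Ubuntu. The agy flags are the ones shown above.

```shell
#!/usr/bin/env bash
# Hypothetical cron wrapper: one session at a time, output kept per run.
# AGENT_CMD, LOCK_FILE and LOG_DIR are illustrative names, not agy settings.
AGENT_CMD=${AGENT_CMD:-"agy --print --print-timeout 10m --dangerously-skip-permissions"}
LOCK_FILE=${LOCK_FILE:-/tmp/race-gemini.lock}
LOG_DIR=${LOG_DIR:-$HOME/race-logs}

run_session() {
  prompt=$1
  mkdir -p "$LOG_DIR"
  log="$LOG_DIR/session-$(date +%Y%m%d-%H%M%S).log"
  (
    # Skip this run if the previous session still holds the lock.
    flock -n 9 || { echo "previous session still running" >> "$log"; exit 75; }
    # Hard stop slightly above the CLI's own 10m print timeout.
    # $AGENT_CMD is left unquoted on purpose so it splits into words.
    printf '%s\n' "$prompt" | timeout 11m $AGENT_CMD >> "$log" 2>&1
  ) 9> "$LOCK_FILE"
}
```

A script that ends with `run_session "$PROMPT"` could then be scheduled with a crontab entry such as `0 */3 * * * /root/bin/run-session.sh` (path hypothetical), which works out to 8 runs per day.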

I also merged the two backlogs (BACKLOG-PREMIUM.md + BACKLOG-CHEAP.md) into a single BACKLOG.md - same approach as our Kimi agent, which uses one model and one task list. The agent decides what to prioritize each session.
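
If you'd rather do the file-level merge by hand, one possible approach is to concatenate the old files under headings so nothing is lost and the agent can re-prioritize from a single list. `merge_backlogs` is a hypothetical helper and the heading format is an assumption:

```shell
# Hypothetical one-off merge: each old backlog is copied under its own
# heading into a single output file. Missing inputs are skipped.
merge_backlogs() {
  out=$1
  shift
  : > "$out"                      # start from an empty output file
  for src in "$@"; do
    [ -f "$src" ] || continue
    printf '## From %s\n\n' "$(basename "$src")" >> "$out"
    cat "$src" >> "$out"
    printf '\n' >> "$out"
  done
}
```

Usage would be `merge_backlogs BACKLOG.md BACKLOG-PREMIUM.md BACKLOG-CHEAP.md`, run once from the agent's repository.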

First task in the new backlog: "Merge old backlogs, audit the live site, identify the #1 blocker to first revenue."

What I'm Watching For

The Gemini agent's problem was never lack of capability - it's the most prolific committer in the race (1,259 commits). The problem was operational awareness:

Writing code with bugs it doesn't notice
Filing help requests for things it could solve itself
Building features without checking if they deploy correctly

Gemini 3.5 Flash's MCP Atlas score (83.6% - highest of any model) suggests it's well suited to the kind of multi-step, tool-using, autonomous work the race requires. The faster output (289 tokens/sec) should mean more iterations per session. The better coding benchmarks should mean fewer self-inflicted bugs.

But benchmarks don't test "can you notice your site is returning 500 errors." That's what I'm watching for.
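
That kind of check can also live outside the model. A minimal sketch of an external health probe, assuming curl is installed; `classify_status` and `site_status` are hypothetical helper names, and the URL is whatever the agent deployed:

```shell
# Map an HTTP status code to a coarse health verdict.
classify_status() {
  case $1 in
    2??|3??) echo ok ;;
    5??)     echo server-error ;;
    000)     echo unreachable ;;   # curl prints 000 when no response arrived
    *)       echo check ;;
  esac
}

# Fetch only the status code; the URL is supplied by the caller.
site_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}
```

Something like `classify_status "$(site_status https://example.com)"` could run before each session, with the verdict prepended to the prompt so the agent starts from the site's actual state.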

Verdict So Far

What works:

Install is clean (one curl command)
Auth works on headless servers (agy prints a URL, you complete the login in a browser)
3.5 Flash is genuinely fast - responses feel instant
--dangerously-skip-permissions works for autonomous use
In my one test, the model identified and fixed the bug in a single pass

What's missing:

No --model flag (can't choose between 3.5 Flash, 3.1 Pro, Claude, etc.)
No way to see remaining quota from CLI
Shared quota across Flash and Pro models could be a problem at scale
30-second auth timeout is tight for headless setups

The big question: Will a better model fix an agent that's been stuck for 4 weeks? Or is the problem deeper than model quality?

First results should come in within 48 hours. I'll update this post.

Follow the race live at aimadetools.com/race - 7 agents, $100 each, 12 weeks, real startups.

Update (May 21): Quota wall + Tripled limits
