Stephen Sebastian

Posted on May 22

I Spent $0.37 Testing Google’s Antigravity 2.0 Agent API — Here’s Every Bug You’ll Hit (and How to Fix Them)

#devchallenge #googleiochallenge #ai #discuss

Google I/O Writing Challenge Submission

TL;DR: In my test run, Antigravity 2.0 cut a 90‑min dependency audit to 14 min for about $0.044 in token cost. It caught a critical CVE I’d missed for months — and hallucinated an unreleased package version. Here’s the production checklist, working code, and every bug I hit so you don’t have to.

I tested Google’s Agent API on a real 14‑service workflow to see whether agentic tooling could actually reduce repetitive developer work — and the results were both promising and messy. This article isn’t about theory. It’s about what happened when I threw a new tool at a boring, everyday problem and measured everything that followed.

Antigravity is Google’s preview managed-agent runtime, announced at I/O 2026.

How the Tokens Actually Move

Before diving into code, it helps to see how tokens flow between agents, the sandbox, and the outside world. The flow determines where your cost and latency actually come from.

User Task 
   │
   ▼
Scanner Agent (17.3K tokens) 
   │ writes report.json
   ▼
Security Agent (13K tokens) 
   │ enriches with CVEs
   ▼
Changelog Agent (1.5K tokens) 
   │ generates markdown
   ▼
PR Agent (3K tokens) ──→ GitHub PRs

All four agents run inside the same managed sandbox, which persists across calls so you don’t re‑ingest everything. Token tally from one clean run:

Scanner: 5,600 input, 11,740 output (17,340 tokens)
Security: 2,100 input, 10,902 output (13,002 tokens)
Changelog: 350 input, 1,200 output (1,550 tokens)
PR Agent: 400 input, 2,600 output (3,000 tokens)

Total: 34,892 tokens across four stages. The environment itself costs nothing extra—you just pay for tokens at the model’s per‑token rate. For this preview, I used gemini‑3.5‑flash‑preview at $0.0005/1K input and $0.0015/1K output. The sandbox is bundled in, which changes the economics completely compared to spinning up your own container.
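To make the arithmetic behind the headline cost explicit, here is a small sketch that recomputes it from the tally above. The token counts and rates are the ones quoted in this article; the variable and function names are only illustrative.

```python
# Per-stage token counts copied from the tally above: (input, output).
STAGES = {
    "scanner": (5_600, 11_740),
    "security": (2_100, 10_902),
    "changelog": (350, 1_200),
    "pr_agent": (400, 2_600),
}

# Preview rates quoted in this article, in dollars per 1K tokens.
INPUT_RATE_PER_1K = 0.0005
OUTPUT_RATE_PER_1K = 0.0015


def stage_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one stage at the rates above."""
    return (
        (input_tokens / 1000) * INPUT_RATE_PER_1K
        + (output_tokens / 1000) * OUTPUT_RATE_PER_1K
    )


total_tokens = sum(i + o for i, o in STAGES.values())
total_cost = sum(stage_cost(i, o) for i, o in STAGES.values())

print(f"{total_tokens:,} tokens, ${total_cost:.3f}")
```

Run as written, this prints 34,892 tokens and $0.044. Output tokens dominate: they make up roughly three quarters of the volume and are billed at three times the input rate.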

Why this matters: Because all agents share the same environment, the Security Agent doesn’t need to re‑scan the repo. It just reads the JSON that’s already sitting there. That’s where a lot of cost gets saved.

Real Numbers: Time and Money

I compared three ways of doing the exact same 14‑service audit: doing it manually (me, a human), using Antigravity’s Managed Agents, and running the same prompts on a cheap cloud VM with a normal Gemini API call and custom orchestration.

Metric	Manual	Antigravity Agents	Cloud VM + LLM
Scan time	25 min	4 min	8 min
CVE check	20 min	3.5 min	6 min
Changelog	15 min	2 min	1 min
PR creation	50 min (5 services)	4 min	5 min
Total wall‑clock	90 min	14 min	20 min
Cost	$0 (but my time)	$0.044 (tokens)	$0.92 (VM + API)
Setup effort	None	A couple hours	A week of DevOps

Cost Comparison Per Audit Run

Manual: $0 (but 90 min @ $60/hr = $90 value)
Antigravity: $0.044 (14 min human oversight)
Cloud VM: $0.92 (20 min + week of DevOps setup)

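Winner: Antigravity saves 76 min per run. That is $89.96 in labor value if you count only the token cost, or about $75.96 once the 14 min of oversight is also priced at $60/hr.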
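The savings figure depends on whether the 14 minutes of oversight are counted as labor. A quick sketch, using only the numbers from the comparison above, shows both readings side by side:

```python
HOURLY_RATE = 60.0        # dollars per hour, the rate used in the comparison above

manual_minutes = 90
agent_minutes = 14        # human oversight during the Antigravity run
agent_token_cost = 0.044  # dollars per clean run

minutes_saved = manual_minutes - agent_minutes
manual_value = manual_minutes / 60 * HOURLY_RATE

# Reading 1: ignore oversight time and subtract only the token cost.
saving_tokens_only = manual_value - agent_token_cost

# Reading 2: also price the oversight minutes at the same hourly rate.
oversight_value = agent_minutes / 60 * HOURLY_RATE
saving_with_oversight = manual_value - oversight_value - agent_token_cost

print(f"Minutes saved: {minutes_saved}")
print(f"Saving, tokens only: ${saving_tokens_only:.2f}")
print(f"Saving, with oversight priced in: ${saving_with_oversight:.2f}")
```

Either way the token cost is a rounding error next to the labor value, which is the point of the comparison.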

The cloud VM alternative was a $0.04/hr e2‑micro instance running a Python script that called the Gemini API with the same prompts, plus a GitHub CLI container. The token cost was higher because it had to re‑read the repo for each stage instead of reusing state in a sandbox.

Spot-check: Of the first five dependencies I checked against the registries, four matched immediately, and one needed a correction that the verifier caught.

🛠️ Try This Yourself: Minimal Two‑Agent Example

Below is a short Python script you can use as a starting point. It creates a Scanner Agent that audits a single directory and a Verifier Agent that cross-checks the version of every package found against the public registry.

import os

from google import genai

# Read the key from the environment instead of hard-coding it
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Create a managed sandbox
interaction = client.interactions.create(
    agent="antigravity-preview-05-2026",
    config={
        "tools": ["code_execution", "web_browsing", "file_management"],
        "sandbox": "isolated_linux"
    }
)

# Note: The API structure shown is simplified for clarity.
# Check documentation for the latest SDK reference.

# Scanner Agent: generate dependency report
scan_task = """Scan /workspace for package.json and requirements.txt.
Output a JSON list of {package, current_version, latest_version} to /workspace/deps.json."""

client.interactions.send_message(interaction_id=interaction.id, message=scan_task)

# Verifier Agent: cross-check against registry
verify_task = """Read /workspace/deps.json.
For each package, curl the public registry and confirm the latest version.
Output a corrected /workspace/verified_deps.json."""

client.interactions.send_message(interaction_id=interaction.id, message=verify_task)

print("Done. Check verified_deps.json in the sandbox.")

Setup Snapshot

Cost: ~$0.02 for a test run
Time: 3 minutes to set up, 2 minutes to execute
What you’ll learn: Whether your dependencies are genuinely up‑to‑date and whether the agent hallucinates any version numbers.

Production Readiness Checklist: 6 Things You Must Handle

1. Endless Reasoning Loops

Problem: The agent’s stop condition is “the model decides it’s done.” Tell it to “check all package files recursively” and it may happily circle through node_modules forever.
Solution: Build a wrapper that counts tool calls and force‑stops after 20.
Why it matters: Without a ceiling, a single rogue prompt can burn dollars in minutes.
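A minimal sketch of such a ceiling, assuming your client code can observe each tool call as it happens. How you get those events depends on what the API exposes, so `tool_events` here is just a stand-in iterable, and `ToolCallGuard` and `run_with_guard` are hypothetical names rather than SDK features.

```python
import itertools


class ToolCallLimitExceeded(RuntimeError):
    """Raised when an agent run exceeds its tool-call ceiling."""


class ToolCallGuard:
    """Counts tool calls and aborts once a fixed ceiling is passed."""

    def __init__(self, max_calls: int = 20) -> None:
        self.max_calls = max_calls
        self.calls = 0

    def record(self, tool_name: str) -> None:
        self.calls += 1
        if self.calls > self.max_calls:
            raise ToolCallLimitExceeded(
                f"{self.calls} tool calls (limit {self.max_calls}); "
                f"last tool: {tool_name}"
            )


def run_with_guard(tool_events, guard: ToolCallGuard) -> list:
    """Consume tool-call names one by one, stopping at the ceiling."""
    handled = []
    for name in tool_events:
        guard.record(name)
        handled.append(name)
    return handled


# Simulate an agent stuck listing directories forever.
guard = ToolCallGuard(max_calls=20)
try:
    run_with_guard(itertools.repeat("list_dir"), guard)
except ToolCallLimitExceeded as exc:
    print(f"Stopped: {exc}")
```

The guard raises on the 21st call, so the first 20 go through and the runaway loop ends there instead of running until the budget is gone.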

2. Sandbox Filesystem Consistency

Problem: After writing a large JSON, the next read sometimes returns a stale version—duplicate entries, missing data.
Solution: Explicitly run sync via the shell tool before every read.
Why it matters: Stale state corrupts downstream agents and erodes trust in the entire pipeline.
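Alongside the explicit sync, a downstream stage can defend itself by validating what it reads. The sketch below is one way to do that, not something the API provides: it assumes the report is a JSON list of objects with a `package` field and that the reader knows how many entries to expect. Both are assumptions, as are the helper names.

```python
import json
import time
from pathlib import Path


class StaleReportError(RuntimeError):
    """Raised when a report still looks stale after all retries."""


def looks_consistent(entries: list, expected_count: int, key: str = "package") -> bool:
    """True when the entry count matches and no key value repeats."""
    names = [entry[key] for entry in entries]
    return len(names) == expected_count and len(set(names)) == len(names)


def read_report(path: Path, expected_count: int, retries: int = 3, delay: float = 0.5) -> list:
    """Read a JSON list, retrying when it is unparseable or inconsistent."""
    for attempt in range(retries):
        try:
            entries = json.loads(path.read_text())
        except json.JSONDecodeError:
            entries = None  # a partially written file counts as stale
        if entries is not None and looks_consistent(entries, expected_count):
            return entries
        if attempt < retries - 1:
            time.sleep(delay)
    raise StaleReportError(f"{path} failed validation after {retries} attempts")
```

A duplicate or missing entry then fails loudly at the read, rather than flowing silently into the next agent’s prompt.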

3. Cost Unpredictability

Problem: The $0.37 figure was my entire weekend of testing; a clean run costs about four cents ($0.044). But one agent got stuck in a recursive retry loop parsing a malformed package.json and ate $0.89 in seconds.
Solution: At the time of testing, I couldn’t find a native cost‑cap feature in the API. Wrap your API calls with a token budget tracker that raises an exception if limits are exceeded.
Why it matters: Production pipelines need spend guarantees, not gambling.
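Here is a minimal sketch of that kind of budget tracker. It assumes you can read input and output token counts after each call, however your client reports usage; the `TokenBudget` class is hypothetical, and the default rates are the preview rates quoted earlier in this article.

```python
class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend passes the configured ceiling."""


class TokenBudget:
    """Tracks cumulative dollar spend from per-call token counts."""

    def __init__(
        self,
        max_dollars: float,
        input_rate_per_1k: float = 0.0005,
        output_rate_per_1k: float = 0.0015,
    ) -> None:
        self.max_dollars = max_dollars
        self.input_rate_per_1k = input_rate_per_1k
        self.output_rate_per_1k = output_rate_per_1k
        self.spent = 0.0

    def charge(self, input_tokens: int, output_tokens: int) -> float:
        """Add one call's cost; raise if the total passes the ceiling."""
        cost = (
            (input_tokens / 1000) * self.input_rate_per_1k
            + (output_tokens / 1000) * self.output_rate_per_1k
        )
        self.spent += cost
        if self.spent > self.max_dollars:
            raise BudgetExceeded(
                f"spent ${self.spent:.4f}, ceiling ${self.max_dollars:.4f}"
            )
        return self.spent


# A clean run of the first two stages stays under a 5-cent ceiling.
budget = TokenBudget(max_dollars=0.05)
budget.charge(5_600, 11_740)   # Scanner
budget.charge(2_100, 10_902)   # Security
print(f"Spent so far: ${budget.spent:.4f}")
```

Call `charge` after every response and let the exception abort the pipeline. Note that the check runs after the call, so a single oversized response can still overshoot the ceiling by its own cost.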

4. Hallucinations in Dependency Versioning

Problem: Gemini 3.5 Flash confidently reported express@5.0.0 (unreleased) and mis‑identified a Go module’s minor bump as a breaking change.
Solution: Always run a verifier agent that hits the actual package registry via curl. I caught 3 out of 4 hallucinations this way.
Why it matters: A hallucinated CVE or version can lead to unnecessary rollbacks or missed patches.
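If you prefer a deterministic check over a second agent, the same verification can be sketched in plain Python. The field names below follow the `deps.json` shape from the scanner prompt above. The npm lookup uses the public registry’s `latest` endpoint and only covers npm packages; other ecosystems, and possibly scoped packages, would need their own fetchers. The function names are illustrative.

```python
import json
import urllib.request
from urllib.parse import quote

NPM_LATEST_URL = "https://registry.npmjs.org/{name}/latest"


def fetch_npm_latest(name: str) -> str:
    """Return the version currently tagged `latest` for an npm package."""
    url = NPM_LATEST_URL.format(name=quote(name, safe="@"))
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)["version"]


def verify(deps: list, fetch_latest=fetch_npm_latest) -> tuple:
    """Split agent-reported entries into confirmed and mismatched lists."""
    confirmed, mismatched = [], []
    for dep in deps:
        actual = fetch_latest(dep["package"])
        if actual == dep["latest_version"]:
            confirmed.append(dep)
        else:
            # Keep the agent's claim and record what the registry says.
            mismatched.append({**dep, "registry_version": actual})
    return confirmed, mismatched
```

Passing the fetcher as a parameter keeps the comparison logic testable offline: in a unit test you can hand `verify` a dictionary lookup instead of a network call.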

5. Credential Scoping Is All‑or‑Nothing

Problem: Secrets are scoped to the entire interaction, not per‑agent. My Scanner Agent technically had the same GITHUB_TOKEN as the PR Agent, violating least‑privilege.
Solution: Until Google supports per‑agent secrets, use separate interactions for read‑only and write‑enabled stages.
Why it matters: A compromised scanner shouldn’t be able to open PRs.

6. Debugging Opacity

Problem: No streaming log of agent tool calls. You either stare at a blank terminal or wait 8 minutes for the web dashboard replay.
Solution: After every stage, call client.interactions.get() and assert state == "COMPLETED" and output is non‑empty.
Why it matters: Silent failures waste time and token budgets.
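The post-stage check can be wrapped in a small helper. This sketch assumes the object returned by client.interactions.get() exposes `state` and `output` attributes, as described above. Those names may not match the current SDK, so the helper is written against any object with those two attributes, and `check_stage` itself is a hypothetical name.

```python
from types import SimpleNamespace


class StageFailed(RuntimeError):
    """Raised when a pipeline stage did not finish with usable output."""


def check_stage(name: str, interaction) -> str:
    """Return the stage output, or raise if the stage failed silently."""
    state = getattr(interaction, "state", None)
    output = getattr(interaction, "output", None)
    if state != "COMPLETED":
        raise StageFailed(f"{name}: state is {state!r}, expected 'COMPLETED'")
    if not output:
        raise StageFailed(f"{name}: completed but produced no output")
    return output


# Stand-in object for illustration; in practice this would come from the client.
finished = SimpleNamespace(state="COMPLETED", output="wrote /workspace/deps.json")
print(check_stage("scanner", finished))
```

Calling this between stages turns a silent failure into an immediate exception that names the stage, which is cheaper than discovering it in the dashboard replay.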

Why This Matters Beyond Dependency Audits

Google didn’t demo 93 agents to sell you a dependency auditor. They signaled a platform bet: that the unit of compute is shifting from “a single model call” to “a managed runtime where agents persist, schedule themselves, and collaborate.”

This is Google’s answer to LangChain and AutoGen — hosted, vertically integrated with TPU‑optimized models, and bundled into Gemini’s pricing. If OpenAI releases comparable managed agents soon, the pricing war will be brutal, and developers will benefit.

My dependency audit is a trivial example. What becomes possible when you can spin up a verifier agent for every PR, or a security‑scanning agent that watches your monorepo continuously and opens fixes before you even wake up? That’s the real headline from I/O 2026, and it’s a shift that will matter long after the smart glasses stop making news.

What I’m Building Next

1. Automated Security Monitoring (Try This Version)

Here’s the daily CVE scanner I’m deploying next week. It runs the Security Agent every morning, diffs the report against yesterday’s, and opens PRs only for new critical findings.

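import datetime
import json
import subprocess
from pathlib import Path

today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)

# Run the Security Agent (same sandbox, same report path); stop if it fails
subprocess.run(
    ["python", "security_agent.py", f"--date={today.isoformat()}"],
    check=True,
)


def load_vulnerabilities(day, missing_ok=False):
    """Return the vulnerability list from one day's report."""
    path = Path(f"report_{day.isoformat()}.json")
    if missing_ok and not path.exists():
        return []  # e.g. the first run, when there is no report from yesterday
    with path.open() as f:
        return json.load(f).get("vulnerabilities", [])


old_vulns = load_vulnerabilities(yesterday, missing_ok=True)
new_vulns = load_vulnerabilities(today)

# Diff by CVE id so an edited description doesn't re-open a known finding
known_ids = {cve["id"] for cve in old_vulns}
new_critical = [
    cve for cve in new_vulns
    if cve["severity"] == "CRITICAL" and cve["id"] not in known_ids
]

# Open PRs for new critical CVEs
# (gh pr create expects a pushed branch that already contains the fix)
for cve in new_critical:
    subprocess.run([
        "gh", "pr", "create",
        "--title", f"CRITICAL: {cve['package']} {cve['id']}",
        "--body", f"Automated PR. CVE details:\n{cve['description']}\n\nFixed in {cve['fixed_version']}",
        # Teams can be requested as reviewers, not assignees; placeholder slug
        "--reviewer", "your-org/security-team",
    ], check=True)
    print(f"Opened PR for {cve['id']}")

print(f"Checked {len(new_vulns)} CVEs, {len(new_critical)} new critical.")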

Cost: ~$0.15/day for 14 services
Setup: 10 minutes to add your own report paths
Expected result: newly published critical CVEs surfaced within a day instead of weeks.

2. Cost‑Aware Orchestration Wrapper

A thin Python wrapper that enforces token budgets per agent and sends Slack alerts when a pipeline approaches the ceiling. Essential for production peace of mind.

The Bottom Line

The real lesson here is not that agents are perfect, but that they are already useful enough to change how developers approach repetitive operational work. If you adopt them, treat them like fast assistants that still need verification, guardrails, and cost awareness. That combination is what makes agentic workflows usable today, not just impressive in a demo.

📝 AI Transparency & Methodology

I used AI to help scaffold boilerplate, clean up formatting, and refine presentation. The experiments, measurements, workflow observations, and conclusions in this article come from my own hands-on testing on a real 14-service setup.

If this helped, share what you’re building with agentic tools. I’d love to see real experiments.
