Joske Vermeulen

Posted on May 21

The Model Worked. The Cron Job Almost Killed My AI Agent.

#devchallenge #googleiochallenge #antigravity #devops

This is a submission for the Google I/O Writing Challenge

Gemini 3.5 Flash was not the hard part.

It fixed bugs the old setup had failed to solve for weeks. The model quality was transformational (see Part 1 and Part 2).

The hard part was making it survive cron.

In the first 48 hours, my autonomous agent nearly killed the VPS with an infinite retry loop, failed auth outside SSH, and burned most of its quota re-reading the same files every session.

All three bugs took hours to diagnose. All three fixes were tiny.

Context

I run The $100 AI Startup Race: seven AI agents building startups autonomously on a VPS via cron jobs. After I upgraded the Gemini agent to Antigravity CLI (agy) with Gemini 3.5 Flash, the model worked great. But making it run unattended on a headless server? That's where the real engineering happened.

Bug 1: The Infinite Retry Loop

The symptom

I SSH into the VPS and find it unresponsive. Load average through the roof. The cron log shows 300+ entries from the last 2 minutes, all empty.

What happened

Expected: quota exhaustion returns a non-zero exit code.
Actual: exit code 0 + empty output.

When agy hits its quota limit, it doesn't error out. It returns successfully with an empty response. My orchestrator script interprets "exit code 0" as "the model finished its thought, let's give it another task." So it immediately fires another prompt. Which returns empty. Which triggers another. 300 times in 2 minutes.

=== Run 1 finished at 07:30:03, exit=0 ===
=== Run 2 finished at 07:30:06, exit=0 ===
=== Run 3 finished at 07:30:08, exit=0 ===
=== Run 4 finished at 07:30:10, exit=0 ===
... (296 more)

Each "run" takes 2-3 seconds. No output, no error, no indication that quota is exhausted. Just silence. A human would have seen the empty response and stopped. The orchestrator, launched by cron with nobody watching, saw exit code 0 and kept going.

The fix

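# Circuit breaker: MAX_EMPTY consecutive empty responses = quota exhausted
EMPTY_COUNT=0
MAX_EMPTY=3

# Inside the run loop, after each run, check output length
if [[ ${#OUTPUT} -lt 20 ]]; then
    # Plain assignment: ((EMPTY_COUNT++)) returns status 1 when the old value is 0,
    # which aborts the script under set -e
    EMPTY_COUNT=$((EMPTY_COUNT + 1))
    if [[ $EMPTY_COUNT -ge $MAX_EMPTY ]]; then
        echo "=== $MAX_EMPTY consecutive empty responses (quota exhausted?), stopping session ==="
        break
    fi
else
    EMPTY_COUNT=0
fi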

Three empty responses in a row → stop the session. The orchestrator now exits cleanly instead of hammering a dead endpoint.

The lesson

Every autonomous system needs a circuit breaker. AI tools are designed for interactive use. They assume a human will notice when something's wrong. When there's no human, you need explicit failure detection.
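
One way to make that failure detection explicit is to classify every run by exit code and output length together, rather than by exit code alone. The following is a minimal sketch: the function name `classify_run` and the variable `MIN_OUTPUT_CHARS` are illustrative names of mine, and the 20-character threshold is simply the one used in the fix above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one agent run by exit code AND output size.
# MIN_OUTPUT_CHARS mirrors the 20-character threshold from the circuit breaker.
MIN_OUTPUT_CHARS=20

classify_run() {
    local exit_code=$1 output=$2
    if [ "$exit_code" -ne 0 ]; then
        echo "error"    # the tool itself reported a failure
    elif [ "${#output}" -lt "$MIN_OUTPUT_CHARS" ]; then
        echo "empty"    # exit 0 but almost no output: possibly quota exhaustion
    else
        echo "ok"
    fi
}

# Intended use inside the run loop (not executed here):
#   OUTPUT=$(echo "$PROMPT" | agy --print)
#   RESULT=$(classify_run "$?" "$OUTPUT")
```

Keeping "error" and "empty" as separate outcomes also makes it easy to log quota-style failures apart from ordinary crashes.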

Bug 2: The Auth That Only Works in SSH

The symptom

Expected: same user + same token file = works everywhere.
Actual: auth backend changes based on an environment variable.

I test agy via SSH. Works perfectly. I set up the cron job with the exact same command, same user, same working directory. Fails with "Authentication required."

The token file exists. It has a valid refresh token. The binary can read it (verified with strace). But it won't use it.

The investigation

# Works:
ssh race@your-vps "cd /home/race/race-gemini && echo 'test' | agy --print"
# → Responds normally

# Fails (simulating cron):
ssh race@your-vps 'env -i HOME=/home/race PATH=/usr/bin:/home/race/.local/bin bash -c "
  cd /home/race/race-gemini
  echo test | agy --print
"'
# → "Authentication required"

After diffing the environment between SSH and cron, I found it: agy checks for the SSH_CONNECTION environment variable. If it's set, it uses file-based auth (reads the token from ~/.gemini/antigravity-cli/antigravity-oauth-token). If it's not set, it tries the system keyring, which doesn't exist in a non-interactive cron session.
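
The diff itself needs nothing beyond standard tools. Here is a sketch, assuming the output of `env` has been captured to one file in each context; the file paths and the function name `vars_only_in_first` are placeholders of mine.

```shell
#!/usr/bin/env bash
# Print the names of variables present in the first env dump but not the second.
# Each file is expected to hold the output of `env`, one NAME=value per line
# (multi-line values would confuse this simple parser).
vars_only_in_first() {
    local names_b
    names_b=$(mktemp) || return 1
    cut -d= -f1 "$2" | sort -u > "$names_b"
    cut -d= -f1 "$1" | sort -u | comm -23 - "$names_b"
    rm -f "$names_b"
}

# Capture step, run once in each context:
#   in the SSH session:             env > /tmp/env.ssh
#   from a temporary crontab entry: * * * * * env > /tmp/env.cron
# Then compare:
#   vars_only_in_first /tmp/env.ssh /tmp/env.cron
```

Any variable that shows up only on the SSH side is a candidate for whatever the tool is keying its behavior on.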

The fix

export SSH_CONNECTION="127.0.0.1 0 127.0.0.1 22"

One fake environment variable. I don't love this fix. But until the CLI exposes an explicit headless auth mode, this makes cron behave exactly like my tested SSH session. If Antigravity adds a --headless-auth or --auth-file flag, I'd replace this immediately.

The lesson

AI CLI tools are built for developers at their desk. Headless/cron environments are second-class citizens. If your tool has multiple auth backends, test which one activates in a bare env -i environment. That's what cron sees.
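
A small wrapper makes that bare-environment test repeatable. This is a sketch under one assumption: that cron provides little more than HOME, LOGNAME, SHELL, and a short PATH. Defaults differ between cron implementations, so check yours. The function name `run_like_cron` is mine.

```shell
#!/usr/bin/env bash
# Run a command string in a stripped-down environment that approximates cron.
# Assumption: the variable list below resembles what your cron daemon sets.
run_like_cron() {
    env -i \
        HOME="$HOME" \
        LOGNAME="${LOGNAME:-$(id -un)}" \
        SHELL=/bin/sh \
        PATH=/usr/bin:/bin \
        /bin/sh -c "$1"
}

# Example: confirm that SSH_CONNECTION is not visible to the command.
run_like_cron 'echo "SSH_CONNECTION=${SSH_CONNECTION:-<unset>}"'
# prints: SSH_CONNECTION=<unset>
```

To test the agent itself, pass the same command the cron job runs, and extend PATH inside the function so it matches the one your crontab actually uses.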

Bug 3: The Context Tax

The symptom

Expected: each session starts productive work quickly.
Actual: context reload eats 60% of the session.

Session 1 runs for 8 minutes before hitting quota. Of those 8 minutes, 5 are spent reading the codebase: IDENTITY.md, PROGRESS.md, BACKLOG.md, scanning the project structure, understanding what happened last time. Only 3 minutes of actual coding.

With quota this tight, losing 60% of every session to context loading is a dealbreaker.

The discovery

agy has a --continue flag that resumes the previous conversation. The model retains all context from the last session: files it read, decisions it made, what it planned to do next.

The fix

# Flags shared by every session
AGY_ARGS=(--print --print-timeout 25m --dangerously-skip-permissions)

# First session of the day: fresh start, full context load.
# All subsequent sessions: resume the previous conversation.
if [[ "$SESSION_TYPE" != "first" ]]; then
    AGY_ARGS+=(--continue)
fi

echo "$PROMPT" | agy "${AGY_ARGS[@]}"

The result

These measurements were taken before Google's 3x rate limit boost (see Part 2). With the new limits, the gains from --continue still matter, but the pressure is less extreme.

	Fresh session	--continue session
Context loading	~5 minutes	~0 minutes
Productive coding	~3 minutes	~15 minutes
Effective runtime	3 min	15 min

Roughly 5x more productive time per session (about 15 minutes instead of about 3) by skipping the context reload. The model remembers what it fixed, what's next, and which files it has already read.

The lesson

Context is expensive, both in tokens and in quota. If your AI tool supports conversation persistence, use it.

I don't use --continue forever. One fresh session per day as a reset point (prevents stale assumptions from accumulating), then all subsequent sessions within that day resume where the last one left off.
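
The once-a-day reset can be decided with a date stamp on disk. The sketch below is one possible way to derive SESSION_TYPE, not necessarily how my script does it; `STATE_DIR` and the stamp file name are illustrative, and none of this is an agy feature.

```shell
#!/usr/bin/env bash
# Decide whether this is the first session of the day.
# STATE_DIR and the stamp file name are hypothetical choices.
STATE_DIR="${STATE_DIR:-$HOME/.agent-state}"

session_type() {
    local stamp="$STATE_DIR/last-fresh-session" today
    today=$(date +%F)               # e.g. 2025-05-21
    mkdir -p "$STATE_DIR"
    if [ -f "$stamp" ] && [ "$(cat "$stamp")" = "$today" ]; then
        echo "continue"             # a fresh session already ran today
    else
        echo "$today" > "$stamp"    # record today's fresh start
        echo "first"
    fi
}

# Intended use: SESSION_TYPE=$(session_type)
```

One caveat with this design: the stamp is written before the fresh session runs, so if that session fails, later sessions will still try to resume. Writing the stamp only after a successful fresh session is a reasonable alternative.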

What's Missing: The Infrastructure Layer

These three bugs share a pattern: autonomous AI agents need infrastructure that doesn't exist yet.

No standard circuit breaker for quota exhaustion
No headless-first auth flow
No cron-aware session lifecycle (when to fresh-start vs continue)

Web apps have process managers. Queues have retry policies. APIs expose rate-limit headers. Background jobs have dead-letter queues. Autonomous AI agents have bash scripts.

Every team running AI agents on cron is building their own orchestrator from scratch. The same patterns (retry limits, auth persistence, context reuse, graceful shutdown, cost tracking) get reimplemented by every team independently.

We're in the "build your own orchestrator" era. The models are ready for autonomous work. The infrastructure around them isn't.

The Orchestrator Pattern

Here's the minimal structure that works for me after a week of iteration:

Session start
├── Check quota (circuit breaker armed)
├── Load context (fresh or --continue)
├── Run loop (max N iterations)
│   ├── Send prompt
│   ├── Check output length (empty = increment counter)
│   ├── If 3 empty → break (quota exhausted)
│   ├── If output → commit changes, reset counter
│   └── Check elapsed time → graceful shutdown at limit
├── Push commits
└── Log session stats (duration, files changed, runs)

It's ~50 lines of bash. It handles the three failure modes above. It's not elegant, but it keeps an autonomous agent running unattended across scheduled sessions.
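
The loop at the center of that tree can be sketched as follows. This is a simplified illustration rather than my actual script: `run_session` and the `on_success` hook are hypothetical names, the agent invocation is passed in as a command, and the commit and push steps are left as a placeholder.

```shell
#!/usr/bin/env bash
# Sketch of the session loop.
# Usage: run_session MAX_RUNS MAX_SECONDS CMD [ARGS...]
# CMD stands in for the real agent invocation.
MAX_EMPTY=3
MIN_OUTPUT_CHARS=20

on_success() { :; }   # hypothetical hook: commit changes here in a real script

run_session() {
    local max_runs=$1 max_seconds=$2
    shift 2
    local start runs=0 empty=0 output status reason="max runs reached"
    start=$(date +%s)

    while [ "$runs" -lt "$max_runs" ]; do
        # Graceful shutdown once the time budget is used up
        if [ $(( $(date +%s) - start )) -ge "$max_seconds" ]; then
            reason="time limit reached"
            break
        fi

        # Capture output and exit status without tripping set -e
        if output=$("$@"); then status=0; else status=$?; fi
        runs=$((runs + 1))

        if [ "$status" -ne 0 ]; then
            reason="command failed with exit $status"
            break
        fi

        if [ "${#output}" -lt "$MIN_OUTPUT_CHARS" ]; then
            empty=$((empty + 1))
            if [ "$empty" -ge "$MAX_EMPTY" ]; then
                reason="$MAX_EMPTY consecutive empty responses"
                break
            fi
        else
            empty=0
            on_success
        fi
    done

    echo "session end: runs=$runs duration=$(( $(date +%s) - start ))s reason=\"$reason\""
}
```

To wire in the agent, wrap the invocation in a function, for example `ask_agent() { echo "$PROMPT" | agy --print --print-timeout 25m --dangerously-skip-permissions; }`, and call `run_session 20 1500 ask_agent`. The limits in that call are arbitrary example values.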

Takeaway

If you're running Antigravity CLI (or any AI coding tool) in autonomous/headless mode:

Add a circuit breaker. Empty responses are silent failures, not completions.
Test auth under cron's environment. In my case, faking SSH_CONNECTION forced file-based auth.
Use --continue between sessions. Context loading eats your quota alive.
Set --print-timeout higher than default. Complex agentic tasks need more than 5 minutes to think.

My Cron-Safe Agent Checklist

[ ] Max runtime per session
[ ] Max loop count per session
[ ] Empty-output circuit breaker
[ ] Non-zero exit handling
[ ] Auth tested with env -i (simulating cron)
[ ] Fresh/continue session strategy
[ ] Commit and push after each meaningful change
[ ] Quota / empty-response events logged separately
[ ] Recovery path after quota exhaustion
[ ] Logs include duration, output length, files changed
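
For the recovery-path item, one option is a cooldown marker: when the circuit breaker trips, record the time, and have later cron invocations exit early until a waiting period has passed. This is only a sketch of that idea. The file location, the one-hour default, and both function names are assumptions of mine, and the right waiting period depends on how your quota actually resets.

```shell
#!/usr/bin/env bash
# Hypothetical cooldown marker for quota exhaustion.
# COOLDOWN_FILE and COOLDOWN_SECONDS are illustrative defaults.
COOLDOWN_FILE="${COOLDOWN_FILE:-$HOME/.agent-state/quota-cooldown}"
COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-3600}"

# Call this when the circuit breaker trips.
mark_quota_exhausted() {
    mkdir -p "$(dirname "$COOLDOWN_FILE")"
    date +%s > "$COOLDOWN_FILE"
}

# Succeeds (status 0) while the cooldown is still active.
in_cooldown() {
    [ -f "$COOLDOWN_FILE" ] || return 1
    local since now
    since=$(cat "$COOLDOWN_FILE")
    now=$(date +%s)
    [ $(( now - since )) -lt "$COOLDOWN_SECONDS" ]
}

# Intended use at the top of the cron script:
#   if in_cooldown; then echo "quota cooldown active, skipping"; exit 0; fi
```

Logging the skip as its own event keeps quota pauses distinguishable from real failures in the session logs.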

AI agents don't just need better models. They need boring production infrastructure.

Gemini 3.5 Flash made the agent smart enough to work.

Bash made it stable enough to survive.
