DEV Community: Onah Sunday.

Google Antigravity 2.0 Is the I/O 2026 Announcement You Should Actually Care About

Onah Sunday. — Fri, 22 May 2026 14:16:13 +0000

This is a submission for the Google I/O Writing Challenge

Everyone is talking about Gemini 3.5 Flash being four times faster than frontier models. Fine. That's real and important.

But if you're a developer who actually ships things — not just watches keynotes — the most structurally significant thing Google announced at I/O 2026 wasn't a model. It wasn't a pricing tier. It wasn't even the intelligent eyewear.

It was Antigravity 2.0.

And most people are going to miss why it matters.

First, a quick rewind

Google launched the original Antigravity in November 2025 alongside Gemini 3. At the time, it was positioned as an AI-native IDE — a Cursor competitor. A heavily modified fork of VS Code, with an agent sidebar, tab completions, and inline commands. An editor view for hands-on coding, plus a Manager surface for spawning and observing multiple agents working asynchronously.

It was interesting. It wasn't a paradigm shift.

Version 2.0 is different. At I/O 2026, Google moved its developer tooling away from IDE-centric assistance and toward multi-agent workflow management as the primary abstraction.

The IDE is now the least interesting part of Antigravity. What they actually shipped is a platform.

What they actually shipped

Here's everything that landed at I/O, in one place:

Antigravity 2.0 desktop app — A new standalone desktop application that enables a full "agent-optimized" user experience. It introduces dynamic subagents for parallelized workflows, scheduled tasks for background automation, and new integrations with Google AI Studio, Android, and Firebase.

Antigravity CLI — Built in Go, snappier and more responsive than Gemini CLI. It fully replaces Gemini CLI, preserving the most critical features: Agent Skills, Hooks, Subagents, and Extensions — now rebranded as Antigravity plugins. Crucially, it shares the same agent harness as the desktop app, meaning all future improvements apply across both surfaces automatically.

Antigravity SDK — Provides programmatic access to the same agent harness that powers Google's own products. Developers can define custom agent behaviors and host them on their own infrastructure.

Managed Agents in the Gemini API — With a single API call, developers can spin up an agent that reasons, uses tools, and executes code in an isolated Linux environment. Three capabilities define this: the agent harness itself, persistent isolated environments where each interaction creates a resumable environment with files and state intact, and custom agent definitions using Markdown skill files.

Enterprise layer — Google Cloud customers can now use Antigravity 2.0 and Antigravity CLI with their Gemini Enterprise Agent Platform project — all agent inference runs via Agent Platform models within a secure cloud boundary, inheriting Google Cloud's data privacy protections.

That's five distinct product surfaces, all sharing the same underlying agent harness. That's not a product update. That's a platform launch.

The part that actually changes how you build

The feature I keep coming back to is Managed Agents. Here's why.

Right now, if you want to add an AI agent to your product — something that reasons, uses tools, and executes code — you have to stand up all the infrastructure yourself. Memory management, tool invocation, isolated execution environments, state persistence across turns. It's a lot of plumbing before you write a line of actual product code.

With Managed Agents, you can spin up an agent that reasons, uses tools, and executes code in an isolated Linux environment with a single API call.

And critically: each interaction creates an environment that can be resumed in follow-up calls with all files and state intact, enabling seamless multi-turn sessions without reinitializing context.

That's persistent, stateful, isolated agent execution — as a managed service. The infrastructure problem is Google's. You just call the API.

If that sounds familiar, it's because it's the same bet Anthropic is making with Claude's computer use and tool use APIs, and the same direction Hermes Agent (Nous Research's open-source agent) took with its persistent gateway model. The difference is Google can embed this directly into the development tools you already use every day.

The Gemini CLI deprecation is the real signal

Gemini CLI will no longer accept new installations for GitHub organizations after June 18, 2026.

This is worth pausing on. Gemini CLI had a massive developer user base. Sunsetting it — even with a migration path — is not a casual decision. It signals that Google is serious about Antigravity as a platform, not a side experiment.

Google is now unifying its developer-facing coding strategy into a single harness across multiple surfaces — editor, terminal, SDK, managed cloud execution. One agent runtime. Multiple entry points. Skills you build once are portable across all of them.

That portability is the core of the bet. Write a Markdown skill file once. Use it in the desktop app, the CLI, the SDK, and Managed Agents in the API. That's how you build a moat around a developer platform — not by having the best model, but by making your ecosystem the easiest place to build agents that work everywhere.

Where it falls short (for now)

I want to be honest about the gaps, because the keynote glossed over them.

Enterprise governance is still coming. Full integration with A2A and Agent Platform governance and security are coming soon — which means enterprise teams at large companies probably can't use this in production yet. Competitors like JetBrains AI Enterprise already ship SOC 2 Type II certification and on-premises deployment options. That gap affects whether security, compliance, and IT teams can approve Antigravity at scale.

The CLI migration isn't frictionless. There won't be 1:1 feature parity right out of the gate. If your team has built workflows and scripts on top of Gemini CLI, you'll need to audit what breaks.

Latency at scale is unproven. The demos showed fast individual agent runs. Multi-agent parallelization at production scale, with real data and real error rates, is a different story. We don't have benchmarks for that yet.

What to try this week

If you want to actually kick the tires, here's where to start:

1. Download Antigravity 2.0 desktop app

Antigravity 2.0 is a GUI app — download it from the official page:

https://antigravity.google/download

Available free for macOS, Windows, and Linux. Sign in with your Google account.

2. Install the Antigravity CLI

The CLI is separate from the desktop app and works in your terminal. These are the real install commands:

# macOS / Linux / WSL
curl -fsSL https://antigravity.google/cli/install.sh | bash

# Windows PowerShell
irm https://antigravity.google/cli/install.ps1 | iex

Note if you're on Windows: I tried the WSL route first and hit a 404 on the wrong URL. The correct path is /cli/install.sh — not /install.sh. Make sure you use the full URL above.

Migrate from Gemini CLI once installed:

antigravity migrate --from gemini-cli

3. Write a custom skill file

Skills are just Markdown — same concept as Hermes Agent's SKILL.md and Claude Code's custom instructions. The syntax is intentionally familiar:

# Skill: pr-reviewer

## Purpose
Review pull requests for security issues, performance regressions,
and style violations.

## Rules
- Flag any hardcoded credentials immediately
- Check for N+1 query patterns in database calls
- Verify error handling on all async operations

Drop it in ~/.gemini/antigravity/skills/ and it's available across all Antigravity surfaces — desktop, CLI, and Managed Agents in the API.

4. Spin up a Managed Agent via the Gemini API

const response = await fetch(
  "https://generativelanguage.googleapis.com/v1beta/agents",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.GEMINI_API_KEY}`
    },
    body: JSON.stringify({
      model: "gemini-3.5-flash",
      system_instruction: "You are a code review agent.",
      tools: ["code_execution", "file_system"],
      persistent_environment: true
    })
  }
);

const agent = await response.json();
// agent.session_id can be resumed across API calls with full file state intact

The bigger picture

Here's the framing I keep returning to: the AI coding tools that win in 2026 won't be the ones with the best autocomplete. They'll be the ones that make agent orchestration the natural next step after writing a function.

Cursor made AI assistance feel native to the editor. Antigravity 2.0 is trying to make agent orchestration feel native to the entire development workflow — editor, terminal, CI/CD, cloud.

Whether Google pulls it off depends on execution. The platform is coherent. The skill portability story is real. The Managed Agents API removes genuine infrastructure friction. But the governance gaps mean enterprises are watching, not deploying. And the Gemini CLI migration will cause friction before it resolves.

The announcement I'd watch most closely isn't the model benchmark or the pricing tier. It's whether Antigravity skill files become as ubiquitous as .gitignore or Dockerfile — shared, versioned, and composable. If that happens, Google wins the agent tooling layer. If it doesn't, Antigravity is a very good IDE.

The keynote made the former seem possible. Only the next six months will tell us which it actually is.

DevBrief — AI Standup Writer Powered by Hermes Agent (Vercel + Render)

Onah Sunday. — Wed, 20 May 2026 04:27:47 +0000

This is a submission for the Hermes Agent Challenge

Live app: https://devbrief-tau.vercel.app

Repo: github.com/sundayonah/devbrief

Hermes health: devbrief-hermes.onrender.com/health

Setup guide: How I connected Hermes to Next.js

What I Built

DevBrief turns GitHub activity into human-readable standups, PR changelogs, or work logs. Any visitor can sign in with GitHub OAuth, pick a repo, set a time range and branch, filter PRs and authors, choose a tone (casual / formal / concise), and hit Generate.

The Next.js app does not call OpenRouter directly. It fetches commits, PRs, and issues from GitHub, then calls Hermes Agent’s OpenAI-compatible API (POST /v1/chat/completions) on a long-running gateway. Hermes runs the agent loop (skills, tools, server-side model config) and returns the final brief.

Architecture:

Users → DevBrief (Vercel)
            ↓  POST /api/summary → lib/hermes.ts
      Hermes gateway (Docker on Render)
            ↓  model: openrouter/owl-alpha
      OpenRouter (API key only on Hermes — not on Vercel)

Demo

Try it:

Open devbrief-tau.vercel.app
Connect GitHub → select a repo → choose output mode and tone → Generate
Confirm Hermes is up: https://devbrief-hermes.onrender.com/health

Code

Repository: github.com/sundayonah/devbrief

Piece	Location
Hermes client	`lib/hermes.ts` → `HERMES_ENDPOINT` + `POST /v1/chat/completions`
Summary API	`app/api/summary/route.ts`
Standup skill	`hermes-skills/standup-writer.md`
Hermes Docker image	`docker/hermes/Dockerfile`
Production model config	`docker/hermes/config.yaml` (`openrouter/owl-alpha`, `max_tokens: 2048`)
Deploy guide	`docs/DEPLOYMENT.md`

My Tech Stack

Layer	Technology
Frontend	Next.js 14 (App Router), TypeScript, Tailwind CSS
Auth & GitHub	NextAuth.js, Octokit, GitHub OAuth
AI agent	Hermes Agent — `hermes gateway run` + API server
Model	`openrouter/owl-alpha` via OpenRouter (on Hermes host only)
App hosting	Vercel
Agent hosting	Docker on Render

How I Used Hermes Agent

Hermes is not a chatbot wrapper here — the gateway is the brain for every generation.

Hermes capability	How DevBrief uses it
API server	`API_SERVER_ENABLED=true`; Next.js calls `/v1/chat/completions` server-side (no browser CORS)
Gateway	`hermes gateway run` in Docker on Render — not inside Vercel serverless
Skills	`standup-writer.md` copied to `/root/.hermes/skills/` in the image
Server model config	`docker/hermes/config.yaml` sets `model.default: openrouter/owl-alpha` (request `model` field is not what drives inference)
OpenRouter	`OPENROUTER_API_KEY` on Render only — not in Vercel env
Auth	`API_SERVER_KEY` on Render ↔ `HERMES_API_KEY` on Vercel
Cron / messaging	`hermes schedule` documented as a next step in the UI; Slack/Telegram delivery disabled in current deploy

Request flow:

POST /api/summary
  → GitHub API (user OAuth token)
  → generateBrief() in lib/hermes.ts
  → Hermes POST /v1/chat/completions
  → standup / PR changelog / work log
  → UI (copy, edit, history)

The long setup story (WSL PATH, duplicate .env keys, 127.0.0.1 vs localhost, OpenRouter 402, baking config.yaml for Render) is in the tutorial post.

What I learned

Start the gateway, hit /health, then /v1/chat/completions before wiring the app.
Hermes reads ~/.hermes/config.yaml for the real model — env vars and JSON model alone were not enough on Render until we shipped docker/hermes/config.yaml.
Split hosting: serverless Next.js + long-running Hermes elsewhere is the right pattern for this challenge.

Thanks for reading — try the live demo and leave a comment if you hit snags with Hermes on Render or Vercel.

How I Connected Hermes Agent to My Next.js App (And Why It's Not Just Another Chatbot Wrapper)

Onah Sunday. — Tue, 19 May 2026 04:34:44 +0000

This is a submission for the Hermes Agent Challenge

Live app: https://devbrief-tau.vercel.app

Repo: github.com/sundayonah/devbrief

I was skeptical.

Every week there's a new "autonomous AI agent" that turns out to be a thin wrapper around a chat API with a fancy UI on top. So when I heard about Hermes Agent — Nous Research's open-source agent that "grows with you" — I filed it under probably hype and moved on.

Then I actually used it. And I ended up rebuilding an entire side project around it.

This is a practical guide to setting up Hermes Agent locally and connecting it to a real Next.js app. I'll walk through exactly what I did to power DevBrief — a tool that reads your GitHub activity and writes standups, PR changelogs, or work logs — using Hermes as the brain. I'll also include the bugs I hit (wrong hermes binary, empty API keys, IPv6 localhost, OpenRouter credit reservation) so you don't have to rediscover them.

What Makes Hermes Different

Before the setup steps, let me explain why this matters — because the architecture is genuinely different from what you're probably used to.

Most "AI-powered" apps work like this:

Your app → LLM API → response → done

Every request is stateless. The model has no memory of the last call. You're just sending text and getting text back.

Hermes works like this:

Your app → Hermes Agent → skills + memory + tools → response
                ↓
         can learn from the interaction over time

Hermes is a persistent agent that runs on your machine (or server). It has:

Skills — Markdown files that teach it how to handle specific tasks.
Memory — Cross-session context (Hermes can use this to calibrate over time).
Tools — Terminal, files, web search, and more on the agent side (not in your Next.js app).
An OpenAI-compatible API — So connecting from a backend is straightforward.
Cron scheduling — Natural language scheduling for recurring jobs (optional; I wired this as a next step).

For DevBrief, the important part is: my Next.js app does not call OpenRouter directly. It calls Hermes, which already has the model, tools, and skills configured. That's the difference between a wrapper and an agent-backed product.

What DevBrief Actually Does

DevBrief isn't only a standup generator. After you connect GitHub (OAuth), you can:

Pick a repo, time range, branch, and PR filters
Choose output mode: standup, PR changelog, or work log
Pick a tone: casual, formal, or concise

The Next.js route POST /api/summary fetches GitHub activity, then calls lib/hermes.ts → Hermes on port 8642. If Hermes is down, you still get a fallback template so the UI isn't broken.

Try it: devbrief-tau.vercel.app — any GitHub user can sign in with OAuth and use their own repos.

Step 1 — Install Hermes (and use the right binary)

One command. Works on Linux, macOS, and WSL2 on Windows:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

The installer handles Python 3.11, Node.js, and dependencies. On WSL, the Nous Hermes CLI usually lands at ~/.local/bin/hermes.

Verify:

~/.local/bin/hermes --version
which hermes

Windows gotcha: If you also have Rust/Cargo installed, which hermes might point at ~/.cargo/bin/hermes — that's the IBC relayer, not Nous Hermes. Use ~/.local/bin/hermes explicitly, or put ~/.local/bin before ~/.cargo/bin in your PATH.

Test the agent CLI before touching your app:

~/.local/bin/hermes chat -q "Say hello in one word"

Step 2 — Pick your model provider

Hermes is model-agnostic: OpenRouter, Anthropic, local Ollama, and more.

~/.local/bin/hermes model

I used OpenRouter. Set your key in ~/.hermes/.env (Hermes reads this file; DevBrief does not):

OPENROUTER_API_KEY=sk-or-v1-your-key-here

Pick a model in the wizard. For development I used a free/cheap route (openrouter/owl-alpha) after hitting billing quirks with Sonnet (more in Troubleshooting). For production quality, something like anthropic/claude-sonnet-4.6 on OpenRouter works — but watch max_tokens reservation (Step 2b).

Duplicate key trap: Run grep -n OPENROUTER_API_KEY ~/.hermes/.env. You must have only one non-empty line. If a second empty OPENROUTER_API_KEY= appears at the bottom of the file, python-dotenv uses the last value (empty) and every Hermes call fails with HTTP 400 while curl still works from your shell export.

Step 2b — Optional: cap `max_tokens` for OpenRouter

Hermes may request the model's full output budget (e.g. 64000 tokens). OpenRouter pre-reserves credits for that ceiling. With a small balance you can get:

HTTP 402: You requested up to 64000 tokens, but can only afford 2661.

Fix in ~/.hermes/config.yaml:

model:
  max_tokens: 2048

Or add credits at openrouter.ai/settings/credits.

Step 3 — Enable the API server and start the gateway

This is the step most tutorials skip. DevBrief talks to Hermes over the OpenAI-compatible API server, which runs inside the gateway — not hermes serve (outdated in some docs).

In ~/.hermes/.env:

API_SERVER_ENABLED=true

# Optional for local dev (gateway may accept all requests without a key)
# API_SERVER_KEY=change-me-local-dev

Start the gateway (keep this terminal open):

~/.local/bin/hermes gateway run

You might see warnings about allowlists or missing API keys — that's normal for local dev. Confirm the API is up:

curl http://127.0.0.1:8642/health

Expected:

{"status": "ok", "platform": "hermes-agent"}

Test chat completions (omit Authorization if you didn't set API_SERVER_KEY):

curl http://127.0.0.1:8642/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hermes-agent",
    "messages": [{"role": "user", "content": "Hello! Are you running?"}],
    "max_tokens": 20
  }'

Use 127.0.0.1, not localhost, on Windows. Node often resolves localhost to IPv6 ::1. I got ECONNREFUSED ::1:8642 until I set HERMES_ENDPOINT=http://127.0.0.1:8642.

WSL + Windows: Run hermes gateway run in WSL and pnpm dev on Windows. WSL2 forwards 127.0.0.1:8642 to the gateway when it's running.

CORS: DevBrief calls Hermes from the Next.js server (/api/summary), not from the browser. You do not need API_SERVER_CORS_ORIGINS unless your frontend calls Hermes directly.

Step 4 — Write a skill

Instead of stuffing a giant system prompt into every API call, you can add a skill file — Markdown that teaches Hermes how to handle standups.

DevBrief ships hermes-skills/standup-writer.md (abbreviated):

# Skill: standup-writer

## Purpose
Given raw GitHub activity (commits, PRs, issues), generate a clean
daily standup summary in three sections: Yesterday, Today, Blockers.

## Style Rules
- Keep each bullet under 12 words
- Use plain English, no jargon
- Infer "Today" from open PRs and unresolved issues

## Output Format
**Yesterday**
- ...

**Today**
- ...

**Blockers**
- None / [describe blocker]

Install it:

cp hermes-skills/standup-writer.md ~/.hermes/skills/

In the app, standup mode uses a system hint plus prompts (the API doesn't pass a separate skill field). PR changelog and work log modes use tailored user prompts in buildPrompt(). The skill file still helps Hermes when you're in standup mode.

Step 5 — Connect it to your Next.js app

Hermes exposes POST /v1/chat/completions. DevBrief's lib/hermes.ts calls it from the server with a system + user message, optional bearer auth, and a long timeout (generations can take 30–60 seconds because the agent runs tools).

Simplified version of the real integration:

lib/hermes.ts

export async function generateBrief(input: {
  activity: GitHubActivity;
  tone: "casual" | "formal" | "concise";
  outputMode: "standup" | "pr_changelog" | "work_log";
}) {
  const base = (process.env.HERMES_ENDPOINT || "http://127.0.0.1:8642")
    .replace("localhost", "127.0.0.1");

  const headers: Record<string, string> = {
    "Content-Type": "application/json",
  };
  if (process.env.HERMES_API_KEY) {
    headers.Authorization = `Bearer ${process.env.HERMES_API_KEY}`;
  }

  const res = await fetch(`${base}/v1/chat/completions`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      model: process.env.HERMES_MODEL_NAME || "hermes-agent",
      messages: [
        {
          role: "system",
          content:
            "You are DevBrief. Turn structured GitHub activity into human-readable summaries.",
        },
        { role: "user", content: buildPrompt(input) },
      ],
      stream: false,
    }),
    signal: AbortSignal.timeout(300_000),
  });

  if (!res.ok) throw new Error(`Hermes ${res.status}`);

  const data = await res.json();
  return data.choices[0].message.content;
}

.env.local (Next.js — not the same file as ~/.hermes/.env):

HERMES_ENDPOINT=http://127.0.0.1:8642

# Only if you set API_SERVER_KEY in ~/.hermes/.env
# HERMES_API_KEY=change-me-local-dev

OPENROUTER_API_KEY in .env.local does nothing for DevBrief; only Hermes uses it via ~/.hermes/.env.

Run the app:

pnpm dev

Click Generate. In the terminal you want POST /api/summary 200 without Hermes Agent not reachable, using fallback generator. The UI should show prose (e.g. a PR changelog with Summary / Changes), not only raw - PR #17 [closed]: title bullets.

Step 6 — Automated scheduling (roadmap)

Hermes can schedule work in plain English and deliver via Telegram/Slack through the gateway. I documented this as a next step in DevBrief's README:

hermes schedule "Every weekday at 8:30am, POST to http://localhost:3000/api/summary ..."

To wire Telegram:

hermes gateway setup

Troubleshooting (what actually bit me)

Symptom	Cause	Fix
`hermes: command not found` or wrong behavior	Wrong `hermes` on PATH (IBC relayer)	Use `~/.local/bin/hermes`
`curl` to OpenRouter works, `hermes chat` HTTP 400	Duplicate empty `OPENROUTER_API_KEY` at bottom of `~/.hermes/.env`	Keep one key line only
`hermes chat` HTTP 402 on Sonnet	OpenRouter reserves credits for huge `max_tokens`	`model.max_tokens: 2048` or add credits
DevBrief `ECONNREFUSED ::1:8642`	IPv6 localhost / gateway not running	`127.0.0.1`, run `hermes gateway run`
Bullet-list output only	Fallback path — Hermes unreachable	Fix gateway + endpoint; check server logs
Request takes ~40s	Normal — full agent + tools	Expected for first successful run

What I learned building DevBrief

The skill + prompt split is useful. Standup format lives in standup-writer.md and in prompt builders; app code stays about GitHub data and UX.

The fallback matters. When Hermes wasn't running, I still tested the UI. generateFallbackBrief() produces a minimal template until the gateway is up.

The OpenAI-compatible API is a real advantage. One fetch to /v1/chat/completions — no custom Hermes SDK in Next.js.

Memory and cron are powerful — and optional. Hermes supports memory and scheduling; I focused the submission on the working path: GitHub → Next.js API → Hermes gateway → formatted brief.

Honest latency. Agent-backed generation is slower than a single LLM call. Worth it for quality; show a loading state in the UI.

Running it in production

DevBrief and Hermes deploy as two services. Hermes is a long-lived gateway; it cannot run inside Vercel’s short-lived serverless functions. On serverless-only Next.js hosts, Hermes must run somewhere else — we use Render (Docker) for Hermes and Vercel for the Next.js app.

Users → DevBrief (Vercel)
            ↓  POST /v1/chat/completions
      Hermes (Render Docker)
            ↓
      OpenRouter (API key only on Hermes)

1. Hermes on Render

The repo ships docker/hermes/Dockerfile (Ubuntu + official Hermes installer + bundled standup-writer skill).

Push the repo to GitHub (include docker/, hermes-skills/, render.yaml).
In Render: New → Web Service → Docker, connect the repo.
Dockerfile path: docker/hermes/Dockerfile · Context: repository root.
Health check path: /health.
Environment variables on the Render service:

Variable	Value
`OPENROUTER_API_KEY`	Your OpenRouter key (Hermes only — not on Vercel)
`API_SERVER_KEY`	Strong secret; same value as DevBrief `HERMES_API_KEY`
`HERMES_MODEL`	e.g. `openrouter/owl-alpha` (optional)

API_SERVER_ENABLED and API_SERVER_HOST are already set in the Dockerfile. The first deploy can take 15–20+ minutes while install.sh runs inside the image.

Verify:

curl https://YOUR-SERVICE.onrender.com/health
# {"status":"ok","platform":"hermes-agent"}

OpenRouter tip: If you see HTTP 402, lower model.max_tokens in Hermes config (e.g. 2048) — Hermes may request a huge default budget and OpenRouter pre-reserves credits.

2. DevBrief on Vercel

Import the repo in Vercel (Next.js).
Environment variables (Production):

Variable	Example
`GITHUB_ID` / `GITHUB_SECRET`	GitHub OAuth app
`NEXTAUTH_SECRET`	`openssl rand -base64 32`
`NEXTAUTH_URL`	`https://your-app.vercel.app`
`HERMES_ENDPOINT`	`https://YOUR-SERVICE.onrender.com` (HTTPS, no `:8642`)
`HERMES_API_KEY`	Same as Render `API_SERVER_KEY`

In your GitHub OAuth app, set Authorization callback URL to:

https://your-app.vercel.app/api/auth/callback/github

/api/summary can run 30–60+ seconds while Hermes generates; this repo sets maxDuration = 60 — you need a Vercel plan that allows ≥60s (typically Pro).
History on Vercel is stored under /tmp (writable but ephemeral across cold starts). For durable history, add a database later.

Do not put OPENROUTER_API_KEY on Vercel for the normal flow — only the Hermes container needs it.

3. Smoke test

curl https://YOUR-SERVICE.onrender.com/health
Open the Vercel app → Connect GitHub → pick a repo → Generate
You want real prose from Hermes, not the fallback bullet template.

Repo: github.com/sundayonah/devbrief · Full checklist: docs/DEPLOYMENT.md.

Optional: local Docker before Render

export OPENROUTER_API_KEY=sk-or-v1-...
export API_SERVER_KEY=your-dev-secret
docker compose -f docker-compose.hermes.yml up --build
curl http://127.0.0.1:8642/health

The full picture

GitHub API
    ↓
Next.js API route (/api/summary)
    ↓
lib/hermes.ts → POST /v1/chat/completions @ 127.0.0.1:8642
    ↓
Hermes gateway (model via OpenRouter, tools, skills)
    ↓
Standup / PR changelog / work log
    ↓
Next.js UI → copy, history

(Optional later)
Hermes cron → POST /api/summary → Telegram via gateway

Resources

Hermes Agent docs: hermes-agent.nousresearch.com/docs
API server guide: hermes-agent.nousresearch.com/docs/user-guide/features/api-server
Hermes GitHub: github.com/NousResearch/hermes-agent
Challenge: Hermes Agent Challenge

If you're building with Hermes: start the gateway, hit /health, then hit /v1/chat/completions with curl before writing application code. On Windows, use 127.0.0.1 and keep hermes gateway run alive while you develop.

Drop a comment if you hit any snags — happy to help.

Gemma 4: The Comprehensive Developer's Guide to Google's Most Capable Open Model Family

Onah Sunday. — Thu, 07 May 2026 23:27:19 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Local AI has been having a serious moment — and Gemma 4 might be the release that makes it impossible to ignore. Google's latest open model family doesn't just inch forward; it makes a genuine leap: native multimodal input, a 256K context window, reasoning modes, and models that range from running on a Raspberry Pi to powering enterprise deployments.

But "most capable open model" means nothing if you don't know which model to pick, how to access it, or what it actually unlocks for your project. This guide covers all of that.

What Is Gemma 4?

Gemma 4 is Google's fourth generation of open-weight language models, built on the same research that powers the Gemini family. "Open-weight" means you can download the model weights and run them yourself — on your laptop, a Raspberry Pi, a cloud GPU, or a phone.

What makes Gemma 4 different from its predecessors:

Native multimodal support — images, video, and audio input baked into the architecture (not bolted on)
128K–256K context window — enough to process entire codebases or long documents in one shot
Advanced reasoning — purpose-built for multi-step planning and deep logic
Apache 2.0 license — commercially permissive, no restrictions on building products with it
Function calling + structured JSON output — production-ready for agentic workflows

The Three Model Variants (And How to Choose)

This is where most guides fall short. Gemma 4 isn't one model — it's a family of three distinct architectures, each designed for a different context. Picking the right one matters.

1. Edge Models: E2B and E4B (2B and 4B effective parameters)

Best for: Mobile apps, IoT, browser-side inference, edge devices, Raspberry Pi, offline use

These are built for environments where compute is constrained. The E2B model is small enough to run on high-end smartphones and even a Raspberry Pi 5. Both models support images and audio natively — which is remarkable at this size.

When to use them:

You need the model to run locally with no cloud dependency
You're building something for mobile or embedded hardware
Latency is critical and you can't afford a round-trip to a server
You want a free, offline AI with no credit card required

Limitations: Smaller capacity means less complex reasoning and less knowledge breadth. These are not the models for tasks that require deep multi-step analysis.

2. Gemma 4 31B Dense

Best for: High-quality text and multimodal tasks, local inference on a powerful workstation, fine-tuning experiments

This is the workhorse. The 31B Dense model ranks #3 on the Arena AI text leaderboard among open models — ahead of many models many times its size. It's the model you'd use when you need serious capability but still want local control.

On hardware: loaded in 4-bit quantization (QLoRA), the 31B model fits in roughly 18–20GB of VRAM — achievable on a modern consumer GPU like an RTX 4090, or serverless cloud GPUs.

When to use it:

Complex reasoning, detailed document analysis, code generation
Fine-tuning on a custom dataset (it's what the Google AI team used for their pet breed classifier)
Tasks where you need the best output quality and have the GPU headroom

3. Gemma 4 26B Mixture of Experts (MoE)

Best for: High-throughput production workloads, efficiency-focused deployments, advanced reasoning

This is the architecturally clever one. MoE (Mixture of Experts) means the model has 26 billion parameters total, but only activates 3.8 billion of them per inference pass. You get near-31B quality at a fraction of the compute cost.

It ranks #6 on the Arena AI leaderboard among open models — outperforming models 20x its size.

When to use it:

High-throughput serving where you need fast response times at scale
You're running many parallel requests and cost/efficiency matters
You need strong reasoning without paying for the full 31B compute on every token

Trade-off: MoE models are slightly more complex to deploy and fine-tune than dense models, and not all inference runtimes support them equally well yet.

Quick Comparison Table

Model	Params (Active)	Context	Multimodal	Best Use Case
E2B	2B	128K	Image, audio	Edge, mobile, offline
E4B	4B	128K	Image, audio	Edge with more capacity
31B Dense	31B	256K	Image	Quality-first tasks, fine-tuning
26B MoE	3.8B active	256K	Image	High-throughput production

How to Access Gemma 4 (Free Options First)

Option 1: Google AI Studio (Free, Easiest)

The fastest way to start is via the Gemini API on Google AI Studio. No credit card required for the free tier. You get API access to Gemma 4 models immediately.

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemma-4-31b-it")

response = model.generate_content("Explain how Mixture of Experts works in plain English.")
print(response.text)

Option 2: OpenRouter (Free Tier — No Credit Card)

OpenRouter offers the 31B model on a free tier. Useful if you want OpenAI-compatible API calls:

import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="google/gemma-4-31b-it:free",
    messages=[{"role": "user", "content": "What are the advantages of open-weight models?"}]
)
print(response.choices[0].message.content)

Option 3: Run Locally via Ollama (No Cloud at All)

For true local inference with zero data leaving your machine:

# Install Ollama: https://ollama.com
ollama pull gemma4:4b
ollama run gemma4:4b

Or use it programmatically:

import ollama

response = ollama.chat(
    model="gemma4:4b",
    messages=[{"role": "user", "content": "Summarize the key differences between MoE and dense models."}]
)
print(response["message"]["content"])

Option 4: Hugging Face / Kaggle

Download model weights directly from Hugging Face or Kaggle. Requires accepting Google's model license (quick process). Useful for fine-tuning workflows.

Multimodal in Practice

One of Gemma 4's biggest leaps is genuine multimodal support. Here's how to use it with an image via the Gemini API:

import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemma-4-31b-it")

image = PIL.Image.open("my_image.jpg")

response = model.generate_content([
    image,
    "Describe what you see in this image and identify any text present."
])
print(response.text)

The image must come before the text prompt — this is a documented convention for the Gemma 4 architecture and affects output quality.

The 128K–256K Context Window: What It Actually Unlocks

Most models cap out at 8K or 32K tokens. Gemma 4's context window changes what's possible:

Before (with a typical 8K model):

You chunk a large codebase into pieces
Ask questions about each chunk separately
Lose cross-file context and relationships

With Gemma 4's 256K context (31B):

Load an entire repository at once
Ask "what does the authentication flow look like end-to-end?" and get a coherent answer
Analyze a full research paper, legal document, or meeting transcript in a single pass

This is especially powerful for RAG (retrieval-augmented generation) systems, code review tools, and document analysis pipelines.

Fine-Tuning: Is It Worth It?

Yes — and it's more accessible than you might think.

Google's own team fine-tuned Gemma 4 31B for pet breed classification using QLoRA on Cloud Run with serverless NVIDIA RTX 6000 Pro GPUs. Key results:

Baseline accuracy (no fine-tuning): 89%
After fine-tuning on ~4,000 images: ~93% — approaching state-of-the-art for the Oxford-IIIT Pet dataset

The approach: 4-bit quantization (QLoRA) brings the 31B model's VRAM footprint down from ~62GB to ~18–20GB, making it tractable on a single high-end GPU.

Quick QLoRA config for Gemma 4:

from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules="all-linear",  # Required for Gemma 4 — covers both LM and vision tower
    task_type="CAUSAL_LM",
)

Note: For Gemma 4, always use target_modules="all-linear" rather than targeting specific layer names. The architecture uses a custom Gemma4ClippableLinear wrapper, and specifying individual layer names bypasses it, causing unstable training.

What This Means for Developers

Open models at this capability level change the economics of building AI applications:

Privacy-first applications become viable. You can process sensitive documents, medical records, or private communications locally — with no data ever leaving your infrastructure.

Latency-critical use cases open up. Edge models that run on-device eliminate the round-trip to a cloud API. For real-time transcription, instant image analysis, or offline AI assistants, this is a genuine unlock.

Fine-tuning without massive infrastructure. QLoRA on a single consumer GPU or a serverless GPU instance makes domain-specific models accessible to indie developers and small teams — not just companies with ML infrastructure budgets.

Agentic workflows get a lot more capable. Native function calling, structured JSON output, and a 256K context window make Gemma 4 a serious option for building AI agents that reason over large amounts of context and take real actions.

What This Means for Developers in Africa

There's something worth saying that most Gemma 4 guides won't mention: for developers in regions like Nigeria and across Africa, open-weight models aren't just a technical curiosity — they're genuinely transformative.

Cloud AI APIs come with real barriers here. Dollar-denominated pricing hits harder when you're earning in naira. Latency from distant data centers is a constant frustration. Payment methods that "just work" in the US often don't. And data sovereignty matters — sending sensitive local data to foreign servers is a compliance and trust problem many African startups quietly struggle with.

Gemma 4 changes that equation. A model powerful enough to run locally, with no API costs, no cloud dependency, and no data leaving your machine, levels the playing field in a way that felt impossible two years ago. The E2B model running on a Raspberry Pi or a mid-range Android phone isn't a toy — it's a pathway to building AI-powered products for local markets at local economics.

The next wave of AI applications built for African languages, local businesses, and underserved communities doesn't have to wait for foreign cloud providers to care. With Gemma 4, developers here can build it themselves, on their own terms.

Getting Started Checklist

Experiment first → Google AI Studio free tier, no setup required
Pick your model → Edge tasks? E2B/E4B. Quality tasks? 31B Dense. Scale? 26B MoE
Go local → Ollama for zero-configuration local inference
Fine-tune → Hugging Face + QLoRA + target_modules="all-linear" for Gemma 4

The code for the Google AI team's full fine-tuning pipeline is available on GitHub at GoogleCloudPlatform/devrel-demos — a great starting point for your own experiments.

Wrapping Up

Gemma 4 isn't just a better version of Gemma 3 — it's a genuinely different tier of open model. The combination of multimodal input, long context, reasoning capabilities, and a commercially permissive license puts it in a category that didn't really exist for open-weight models until now.

The most exciting part isn't the benchmarks — it's the use cases that become possible when capable AI runs locally, privately, and cheaply. What will you build with it?