DEV Community: Harsha.B.M

From Prompts to Action: What Gemini 3.5 Flash and the Agentic Stack Mean for Developers

Harsha.B.M — Sun, 24 May 2026 07:33:31 +0000

This is a submission for the Google I/O Writing Challenge

There's a phrase Google kept repeating throughout the I/O 2026 keynotes: "from prompts to action."

At first, it sounds like marketing. But after sitting with the full set of announcements — Gemini 3.5 Flash, Managed Agents, Antigravity 2.0, WebMCP — I think it's actually a precise description of where we are right now as developers. And it's worth unpacking seriously, because the implications for how we build software are bigger than any single model release.

The Headline: Gemini 3.5 Flash Beats Last Year's Pro

Let's start with the model itself, because the benchmark story is genuinely interesting.

Gemini 3.5 Flash outperforms Gemini 3.1 Pro across almost all benchmarks — including challenging agentic benchmarks like Terminal-Bench 2.1 (76.2%) and MCP Atlas (83.6%) — while running four times faster than comparable frontier models. It's available today via the Gemini API, AI Studio, and Android Studio.

This matters for a specific reason: historically, you traded speed for intelligence. Flash was fast and cheap; Pro was smart but slow. That trade-off shaped how we architected agentic systems — you'd use Flash for quick tool calls and route harder reasoning to Pro.

3.5 Flash collapses that boundary. A model at Flash speed that thinks like a Pro model changes the economics and architecture of every agent loop you're building.

Pricing sits at $1.50 input / $9.00 output per million tokens, with a 1M token context window. Dynamic thinking is on by default.

The Real Story: Google Shipped a Vertical Stack

Here's what I think most post-event coverage is underweighting: Google didn't just ship a model. They shipped a production pipeline.

Lay it out end to end:

Gemini 3.5 Flash — the fast, frontier-grade model powering every layer
Managed Agents in the Gemini API — a single API call that spins up an isolated Linux sandbox, where an agent can reason, use tools, execute code, manage files, and browse the web, with persistent state across calls
Antigravity 2.0 — a standalone desktop app for orchestrating agents, with parallel subagent execution, scheduled background tasks, and integrations across AI Studio, Android, and Firebase
Antigravity CLI + SDK — command-line and programmatic access to the same agent harness
WebMCP — a proposed open web standard that lets you expose JavaScript functions and HTML forms as structured tools to browser-based agents
Modern Web Guidance — curated, expert-vetted skills that guide AI coding tools across common use cases, defined in simple markdown files like AGENTS.md and SKILL.md

This is not a model + plugin. It's a full vertical from model inference to production deployment, with Google owning Chrome, Android, Play, and the web standards process at the edges. That's a meaningfully different competitive posture.

What Managed Agents Actually Unlocks

The feature I keep coming back to is Managed Agents, and I think it deserves a closer look.

Previously, building a stateful agent workflow meant managing your own execution environment: provisioning compute, handling context across turns, wiring up tools, and keeping state between calls. A lot of the complexity in agentic systems wasn't AI logic — it was infrastructure plumbing.

Managed Agents changes this. One API call provisions an isolated cloud Linux environment. The agent has tools, can execute code, browse, manage files. Subsequent API calls resume the same session with all state intact — no reinitializing context on every turn. Google describes it as multi-turn agentic workflows that just work.

For developers who've spent time building agent infrastructure from scratch, this is the kind of abstraction that genuinely saves weeks.

One Honest Caveat on Developer Experience

I want to flag something that the official announcements gloss over.

If you're migrating from gemini-3-flash-preview to gemini-3.5-flash, there's a silent breaking change: the default thinking_level is now medium, not high. A straight copy-paste port will produce different outputs without any obvious error.

Also worth knowing: if you're running agent workflows through GitHub Copilot, each Flash call meters at 14x premium requests. For serious agentic work, the direct API path through the Antigravity SDK or Vertex AI is dramatically cheaper — roughly 37x cheaper at scale.

These are the kinds of details that matter when you're building in production, and I wish they were more prominent in the launch documentation.

The Bigger Shift Worth Paying Attention To

Here's what I think I/O 2026 signals at the macro level.

We spent the last two years asking "how smart is the model?" That question is becoming less useful. 3.5 Flash beating 3.1 Pro on agentic benchmarks while running faster is partly a story about model capability — but it's mostly a story about optimization for a specific use case: multi-step, tool-heavy, real-world agent loops.

The new question developers need to be asking is: what is the execution surface?

Google's answer is clear: the execution surface is the agent harness, and they want it to be Antigravity — running in their cloud, on their desktop app, through their API, deployed to Android through their studio. AppFunctions on Android lets apps expose capabilities directly to intelligent agents. WebMCP brings the same primitive to the browser.

This is Google saying: the next layer of developer platform isn't a runtime or a framework. It's an agent execution environment. And they're racing to own it end-to-end.

Whether that's exciting or concerning probably depends on your appetite for platform consolidation. But either way, it's the most coherent platform story I've seen from Google in years.

What I'm Watching Next

A few things I'll be paying close attention to in the weeks ahead:

Gemini 3.5 Pro is confirmed in development and expected to roll out next month (June 2026). If it extends the 3.5 Flash pattern — frontier reasoning at improved speed — that's a significant shift in the model tier structure.

WebMCP adoption will be the real test of whether Google can make agent-native web a standard rather than a proprietary feature. Open standards only work when other browsers and toolchains adopt them.

Managed Agents in production — I want to see real developer reports on latency, reliability, and cost at scale before recommending it for production workloads. The abstraction is elegant; the question is whether the infrastructure behind it delivers.

Final Take

Google I/O 2026 wasn't a "look how smart our model is" event. It was a platform architecture announcement dressed up as a model launch.

The Gemini 3.5 Flash numbers are real and impressive. But the more important thing Google shipped is a complete vertical stack for agent development — from a fast, frontier-grade model to managed execution environments to desktop tooling to web standards. That's infrastructure, not just AI.

For developers, the immediate practical wins are clear: faster and cheaper inference for agentic workflows, and a significantly lower infrastructure burden if you're building stateful agents. The longer arc — whether Google's agentic platform becomes the dominant execution layer for the next generation of applications — is a bigger question, and one that's going to be answered by what gets built on it.

That's the part I find most worth watching.

Have you tried Gemini 3.5 Flash or Managed Agents yet? I'd love to hear what you're building in the comments.

Which Gemma 4 Model Should You Actually Use? A Developer’s Honest Guide

Harsha.B.M — Sun, 24 May 2026 07:13:34 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Which Gemma 4 Model Should You Actually Use? A Developer's Honest Guide

When Google DeepMind dropped Gemma 4 on April 2, 2026, the community response was immediate — 207,000 Ollama pulls in 48 hours, front page of Hacker News, and a same-day Ollama update to support all four variants. The hype was real. But so was the confusion.

Four models. Three naming conventions. Two architectures. One question every developer is quietly Googling:

Which one do I actually run?

This is that answer — practical, specific, with no benchmark-pasting.

First, Decode the Names

The naming is the first thing that trips people up. Let's fix that.

Model	What the name means	Architecture
E2B	Effective 2 Billion parameters	Dense + Per-Layer Embeddings
E4B	Effective 4 Billion parameters	Dense + Per-Layer Embeddings
26B A4B	26B total, 4B Active per token	Mixture of Experts (MoE)
31B	31 Billion parameters, all of them	Dense

The E in E2B and E4B stands for effective — not just a raw parameter count. These models use Per-Layer Embeddings (PLE), an architectural trick that lets them punch above their weight on constrained hardware. The A in 26B A4B stands for active — only 4 billion of those 26 billion parameters fire for any given token. That's the magic of Mixture of Experts.

If the names still feel weird, read them like this:

E2B: "tiny but smart for its size"
E4B: "the everyday laptop model"
26B A4B: "26B quality, 4B speed" ← the sleeper pick
31B: "no compromises"

The Hardware Reality

Before picking a model, be honest about your machine:

E2B — ~2–3 GB storage, runs on phones, Raspberry Pi, and anything with a CPU. If you're deploying to edge devices or need zero-latency local inference on minimal hardware, this is it. Don't use it for complex reasoning — it'll disappoint.

E4B — ~9.6 GB download via Ollama. This is the default ollama pull gemma4 variant for a reason. Runs comfortably on a 16 GB MacBook (M1 or later). Fast enough for interactive use. Good enough for most real tasks. If you're not sure which to pick, this is your answer.

26B A4B — The one most people overlook. You need around 24 GB of RAM (or a 24 GB GPU like an RTX 3090 or 4090). But what you get is near-31B quality at roughly E4B inference speed, because MoE only activates 3.8B parameters per token. Apple Silicon Mac with 32 GB unified memory? This is your best model.

31B Dense — 20 GB minimum RAM/VRAM, 24 GB recommended. Every single one of those 31 billion parameters fires for every token. No shortcuts. It currently sits at #3 among all open models globally on the Arena AI leaderboard. If you have a 4090 or an M2 Ultra, run this.

The Setup (Ollama, 5 Minutes)

Ollama is the fastest path from zero to running. Make sure you have Ollama 0.22 or newer — earlier versions don't handle Gemma 4 properly.

# Check your version
ollama --version

# Pull the model that matches your hardware
ollama pull gemma4:e2b    # phones, Pi, CPU-only machines
ollama pull gemma4        # E4B — 16 GB laptops (default)
ollama pull gemma4:26b    # 24 GB RAM — MoE, best quality/speed
ollama pull gemma4:31b    # 24 GB+ VRAM — maximum quality

# Run it
ollama run gemma4

One Critical Fix You Need to Make

Ollama's default context window for Gemma 4 is set to 4K tokens — but the actual models support 128K (E2B/E4B) and 256K (26B/31B). That default silently cripples long-context work. Fix it:

# Create a Modelfile with the right context
cat << 'EOF' > Modelfile
FROM gemma4
PARAMETER num_ctx 32768
EOF

# Build a custom named model
ollama create gemma4-32k -f Modelfile
ollama run gemma4-32k

For LM Studio users: search for Gemma 4 GGUF builds and use Q4_K_M quantization — it's the sweet spot between quality and RAM usage. Q5 if you have headroom to spare.

What Gemma 4 Actually Gets Right

Multimodal is native, not bolted on

Every Gemma 4 model handles text and images in a single model call — no separate vision pipeline, no switching endpoints. The E2B and E4B models go further and support audio input natively (up to 30 seconds), and the 26B/31B models handle video up to 60 seconds at 1fps. This isn't a demo feature. It's built into the base architecture.

128K context is usable in practice

A lot of models claim long context and then quietly degrade in quality past a few thousand tokens. Gemma 4 uses a hybrid attention mechanism — interleaving local sliding window attention with full global attention — specifically designed to maintain coherence at long range. For RAG pipelines, codebase analysis, or long-document work, this matters.

The license is actually open

Apache 2.0. Not Google's previous custom Gemma license. You can use it commercially, modify it, fine-tune it, and deploy it in products — no restrictions, no royalties. For developers building on top of a local model, this changes the calculus entirely.

The Decision Tree

Stop overthinking it. Use this:

What hardware do you have?
│
├─ Phone / Raspberry Pi / CPU-only → E2B
│
├─ 16 GB laptop (Mac, Windows, Linux) → E4B (ollama pull gemma4)
│
├─ 32 GB Apple Silicon or RTX 3090/4090 → 26B A4B ← don't skip this one
│
└─ 64 GB+ Mac or RTX 4090 and you need maximum quality → 31B Dense

What are you building?
│
├─ Mobile / edge app → E2B or E4B
│
├─ Local dev tool, coding assistant, RAG → E4B or 26B A4B
│
├─ Long-context document analysis, codebase reasoning → 26B or 31B (+ increase num_ctx)
│
└─ Fine-tuning for a specific domain → Start with 26B A4B

What This Actually Means

Here's the thing worth sitting with for a moment.

The 31B Dense model — the one that ranks third among all open models on Earth — runs on a consumer GPU. A single RTX 4090, the kind of card a serious gamer or developer might already own, is sufficient. No cluster. No cloud bill. No API rate limits. No data leaving your machine.

Two years ago, a model this capable required either a research institution's compute budget or a cloud provider's infrastructure. Today you pull it with one terminal command and it runs on hardware you might already own.

The E4B model — the second-smallest in the family — handles image input, supports 128K context, reasons in 140+ languages, and fits in 16 GB of RAM. That's a family phone or a mid-range MacBook.

Developers who internalize this shift will build very differently from those who don't. When inference is local and free, the calculus around what's worth building changes. Offline-first AI features stop being a niche edge case and start being a design choice. Privacy-sensitive applications that couldn't viably use cloud AI now have a real path.

That's what Gemma 4 is: not just a better model, but a different kind of constraint on what's possible.

Quick Reference

	E2B	E4B	26B A4B	31B Dense
Best for	Edge, mobile	Everyday dev	Quality + speed	Max quality
RAM needed	4 GB	8–16 GB	24 GB	20–24 GB+
Context	128K	128K	256K	256K
Multimodal	Text + Image + Audio	Text + Image + Audio	Text + Image + Video	Text + Image + Video
Ollama tag	`gemma4:e2b`	`gemma4` (default)	`gemma4:26b`	`gemma4:31b`
License	Apache 2.0	Apache 2.0	Apache 2.0	Apache 2.0

Pick the model that matches your hardware. Fix the num_ctx default. Build something real.

That's it.

GemmaLens — AI-Powered Repository Understanding Engine Built with Gemma 4 31B

Harsha.B.M — Sun, 24 May 2026 06:48:08 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

GemmaLens — an AI-powered repository understanding and architectural memory engine that lets you drop in any GitHub URL and walk away actually understanding the codebase.

Most developers know the pain: you inherit a repo, join a new team, or revisit old code — and you're stuck archaeologically digging through folders, grepping for entry points, manually tracing imports. GemmaLens eliminates that cold-start problem entirely.

Here's what happens when you paste a GitHub URL:

Real cloning — the backend clones the repo via GitPython (no scraping, no fake data)
Deep scanning — detects languages (20+), frameworks, package managers, and parses actual dependencies from package.json, requirements.txt, Cargo.toml, pom.xml, and composer.json
Architecture graph — traces Python and JS/TS imports across the codebase, builds a NetworkX directed graph, and renders it live with React Flow
Gemma AI summary — the full context (file tree, imports, key file contents, dependency list) is passed to Gemma 4, which produces a grounded, non-hallucinated architectural summary
GemmaChat — ask anything: "explain the authentication flow", "what services depend on the database?", "where is the API defined?" — all answers are grounded in the real scanned context
Documentation generation — one click produces full markdown docs from the actual analysis
LensContext export — a structured JSON file you can paste into Claude, Cursor, ChatGPT, or any AI tool to give it instant repo awareness

Tech stack:

Frontend: Next.js + Tailwind CSS + React Flow (@xyflow/react) + react-markdown
Backend: FastAPI + GitPython + NetworkX + httpx
AI: Gemma 4 31B Dense via OpenRouter

Demo

🚧 Deployment in progress — video walkthrough coming shortly.

Example flow on a real repo (https://github.com/fastapi/fastapi):

Overview tab: Gemma's summary correctly identifies the ASGI framework, Starlette dependency, Pydantic integration, and test structure — all from scanned files
Architecture Graph: nodes for fastapi/, tests/, docs_src/ with real import edges between modules
GemmaChat: asking "how does dependency injection work here?" returns an accurate answer citing actual files
Export: a JSON blob ready to paste into any AI assistant for instant context

📁 GitHub Repository: github.com/HarshaBM-25/gemmalens

Code

Full source on GitHub: https://github.com/HarshaBM-25/gemmalens

Project structure:

gemmalens/
├── backend/
│   ├── app/
│   │   ├── main.py               # FastAPI app + CORS
│   │   ├── models/               # Pydantic request models
│   │   ├── routes/
│   │   │   ├── analyze.py        # Clone + scan endpoint
│   │   │   ├── chat.py           # GemmaChat endpoint
│   │   │   ├── docs.py           # Documentation generation
│   │   │   └── export.py         # LensContext JSON export
│   │   └── services/
│   │       ├── analyzer.py       # GitPython, file scanning, NetworkX graph
│   │       └── gemma.py          # OpenRouter / Gemma 4 integration
│   └── requirements.txt
└── frontend/
    ├── app/
    │   ├── page.tsx              # Homepage with repo input
    │   └── analyze/[repoId]/
    │       └── page.tsx          # 5-tab analysis dashboard
    └── components/
        └── GraphView.tsx         # React Flow architecture graph

To run locally:

# Backend
cd backend
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export OPENROUTER_API_KEY=your_key_here
uvicorn app.main:app --reload --port 8000

# Frontend
cd frontend
npm install
npm run dev
# Open http://localhost:3000

How I Used Gemma 4

I chose Gemma 4 31B Dense via OpenRouter, and the reasoning was specific — not arbitrary.

Why 31B Dense over E2B or E4B?

The core challenge in repository understanding isn't code generation — it's coherent reasoning across a large, heterogeneous context. A single repository analysis call sends Gemma:

The full file tree (up to 200 files)
Detected languages, frameworks, and all dependencies
Module-to-module import relationships
Key file contents (README, entrypoints, config files) — up to ~20,000 tokens of real code

Smaller models struggle to maintain coherence when the context mixes directory trees, JSON dependency lists, Python imports, and raw source code simultaneously. The 31B Dense architecture handles this without losing the thread — it correctly identifies which file is the entrypoint, how modules relate, and what design patterns are in use.

The E2B and E4B MoE variants, while efficient, trade depth for speed in ways that matter here. When a user asks "explain how authentication is implemented", the answer needs to correctly synthesize information from potentially 5–8 different files. That requires the full model capacity of 31B Dense.

Where Gemma 4 runs in GemmaLens

Gemma 4 powers four distinct features, each with a purpose-built system prompt:

1. Repository Summarization
After the backend scans the repo, Gemma receives the full context and produces a structured architectural summary — identifying purpose, key design decisions, module relationships, and entry points. The system prompt explicitly instructs it to ground every claim in the provided data and flag anything it cannot verify.

2. GemmaChat Q&A
Every chat message is sent with the full repository context prepended. Gemma answers questions like "where is the database connection configured?" by reasoning over the real file tree and import map — not hallucinating. The system prompt holds it accountable: "if it's not in the data, say so."

3. Documentation Generation
Gemma generates a full markdown README — Overview, Architecture, Directory Structure, Dependencies, Getting Started, Key Modules — from the real scanned data. No template filling; actual synthesis.

4. Architecture Explanation
When users ask about specific modules in GemmaChat, Gemma explains the module's role, its dependents, and its dependencies using the real NetworkX graph data passed as context.

The key design decision

I deliberately did not pre-summarize or compress the context before sending it to Gemma. The raw file tree, raw dependency list, raw import relationships — all of it goes in. This is where Gemma 4 31B Dense earns its place: it handles the noise, finds the signal, and produces answers that are genuinely useful rather than generically plausible.

The result is an AI that actually knows the repository — not one that performs knowing it.

Built for the Gemma 4 Challenge. No mock data, no fake dashboards, no hardcoded outputs — everything you see comes from real repository analysis.