Thy Alpha

Posted on May 11 • Edited on May 23

Gemma 4 in the Browser: Why Zero-Backend AI Apps Are the Future (And How to Build One)

#devchallenge #gemmachallenge #gemma

This is a submission for the Gemma 4 Challenge: Write about Gemma 4

Blog Post

I built an AI interview coaching tool that runs entirely in the browser. No server. No database. No Docker. No monthly hosting bill.

Just an HTML file, Tailwind CSS from a CDN, and Google's Gemma 4 model via a free API. And the result is genuinely good — it conducts realistic 20-minute mock interviews, evaluates answers against rubrics, and generates detailed study plans.

Here's why this architecture works in 2026 and didn't work even a year ago.

The Zero-Backend Architecture

┌──────────────────────────────┐
│       User's Browser         │
│                              │
│  index.html (single file)    │
│  ├── Tailwind CSS (CDN)      │
│  ├── Speech Recognition API  │
│  ├── SpeechSynthesis API     │
│  ├── localStorage            │
│  └── JavaScript (fetch)      │
└──────────────┬───────────────┘
               │
          HTTPS (direct)
               │
               ▼
┌──────────────────────────────┐
│   Google AI Studio (free)    │
│   or OpenRouter (free)       │
│   or NVIDIA NIM (free)       │
│   or HuggingFace (free)      │
│                              │
│   Gemma 4 31B Dense          │
│   128K context window        │
│   Thinking tokens (CoT)      │
└──────────────────────────────┘

Total cost to run: $0. Total cost to host: $0 (GitHub Pages). Total infrastructure to maintain: none.
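The direct browser-to-API hop in the diagram can be sketched roughly as below. This is a minimal sketch, assuming the Google AI Studio route through the Gemini API `generateContent` endpoint. The helper name `buildGenerateRequest` and the model id `gemma-4-31b-it` are placeholders for illustration, not taken from the app, so check your provider's model list for the real id.

```javascript
// Sketch only: builds the arguments for a direct browser-to-API fetch() call.
// ASSUMPTION: this model id is a placeholder, not a confirmed identifier.
const ASSUMED_MODEL_ID = "gemma-4-31b-it";

function buildGenerateRequest(apiKey, history, model = ASSUMED_MODEL_ID) {
  const url =
    "https://generativelanguage.googleapis.com/v1beta/models/" +
    encodeURIComponent(model) +
    ":generateContent";
  return {
    url,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-goog-api-key": apiKey, // the user's own key, never a shared one
      },
      // history is the whole session so far:
      // [{ role: "user" | "model", parts: [{ text: "..." }] }, ...]
      body: JSON.stringify({ contents: history }),
    },
  };
}

// Usage in the browser:
// const { url, options } = buildGenerateRequest(key, history);
// const data = await (await fetch(url, options)).json();
```

Keeping the request builder separate from `fetch` makes it easy to swap the URL and headers for another provider without touching the rest of the page.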

Why This Wasn't Possible Before Gemma 4

1. 128K Context Window = The Database Is the Prompt

My interview coach processes the candidate's full resume and target job description at session start. That's already ~2,000 tokens. Then it conducts a 15-round interview where it needs to remember what you said in Q2 when evaluating Q12. Then it generates a comprehensive final report referencing the entire session.

Resume + JD:       ~2,000 tokens
System prompt:     ~1,000 tokens
Per Q&A round:     ~500 tokens × 15 rounds = 7,500 tokens
Thinking tokens:   ~400 tokens × 15 rounds = 6,000 tokens
Final report:      ~2,000 tokens
────────────────────────────────────
Total:             ~18,500 tokens ← fits easily in 128K

With older models (4K-8K context), this would require a backend to manage conversation windows, summarization pipelines, and retrieval logic. With Gemma 4's 128K window, the entire session history fits in a single API call.

Zero backend needed. The context window is the database.
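The budget above is simple enough to check in a few lines. The sketch below uses the article's rough estimates as inputs; the function name and the choice to treat "128K" as 128,000 tokens are assumptions made for the example.

```javascript
// Token budget for one session, using the rough per-item estimates from the text.
const CONTEXT_WINDOW = 128_000; // ASSUMPTION: "128K" taken as 128,000 for a conservative check

function estimateSessionTokens(b) {
  // Each round costs the Q&A exchange plus the model's thinking tokens.
  return b.resumeAndJd + b.systemPrompt + (b.perRound + b.thinkingPerRound) * b.rounds + b.finalReport;
}

const total = estimateSessionTokens({
  resumeAndJd: 2000,
  systemPrompt: 1000,
  perRound: 500,
  thinkingPerRound: 400,
  rounds: 15,
  finalReport: 2000,
});

const shareOfWindow = total / CONTEXT_WINDOW;
console.log(total, (shareOfWindow * 100).toFixed(1) + "%"); // prints: 18500 14.5%
```

Even a session several times longer would stay well inside the window, which is why no summarization step is needed.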

2. Thinking Tokens: Gemma 4's Hidden Superpower

This is the feature that changed everything for my app. Most coverage of Gemma 4 focuses on context length and parameter count. But the most impactful feature for evaluation tasks is native thinking tokens.

When you call Gemma 4's API, the response includes parts flagged with thought: true — the model's internal reasoning before it commits to output:

{
  "candidates": [{
    "content": {
      "parts": [
        {
          "text": "Let me analyze this answer against STAR criteria...\n- Situation: Mentioned 'a project' but no specifics\n- Task: 'Had to deliver on time' — vague, no ownership\n- Action: MISSING — said what the team did, not what THEY did\n- Result: 'On time' — no metrics, no business impact\n- Pattern: 3rd answer where they avoid specific numbers\n- Calibration: For Senior Engineer, interviewers expect impact metrics\n- Strategy: Acknowledge topic, push hard for specifics",
          "thought": true
        },
        {
          "text": "Good topic choice — leading a team through pressure is exactly what they ask at senior level. But your answer needs more teeth...\n\n1. **Specific numbers**: How big was the team? What was at stake?\n2. **Your decisions**: What trade-offs did YOU make?\n3. **Measurable result**: 'On time' isn't enough — did you save cost? Improve a metric?\n\nI've noticed across your answers that you tend to skip quantification. At the senior level, that's the difference between 'hire' and 'no hire'."
        }
      ]
    }
  }],
  "usageMetadata": {
    "thoughtsTokenCount": 478
  }
}

478 tokens of structured reasoning (checking STAR criteria, noticing patterns across earlier answers, calibrating to experience level), all before writing feedback.
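On the client, a response shaped like the one above has to be split into the private reasoning and the text shown to the user. Here is a minimal sketch of that step; `splitThoughts` is a hypothetical helper name, and it assumes the response has exactly the shape shown in the JSON example.

```javascript
// Separates thought parts from answer parts in a response shaped like the example above.
function splitThoughts(response) {
  const parts = response?.candidates?.[0]?.content?.parts ?? [];
  // Collect the text of every part whose thought flag matches `wanted`.
  const collect = (wanted) =>
    parts
      .filter((p) => typeof p.text === "string" && Boolean(p.thought) === wanted)
      .map((p) => p.text)
      .join("\n");
  return {
    reasoning: collect(true), // keep private, or show in a collapsible debug panel
    answer: collect(false),   // the feedback the candidate actually sees
    thoughtTokens: response?.usageMetadata?.thoughtsTokenCount ?? 0,
  };
}
```

Returning empty strings and a zero count for a missing or malformed response means the UI can render something sensible without extra null checks.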

This is not chain-of-thought prompting. I didn't ask the model to "think step by step." The reasoning is built into the model, so it happens automatically. Native reasoning is not unique to Gemma 4, but getting it from an open model on a free tier is what makes it practical here.

Why Thinking Tokens Matter for Evaluation Tasks

Evaluation is harder than generation. Asking interview questions is easy. Evaluating whether an answer is good — considering structure, specificity, depth, relevance, and experience level simultaneously — requires holding multiple criteria in working memory.

Without thinking tokens (typical LLM):

"Good answer! Try adding more details next time."

With thinking tokens (Gemma 4):

"You picked a relevant example, but your answer lacks three things senior interviewers look for: specific numbers (team size, timeline), your personal decisions (not the team's), and measurable impact. I've noticed you've avoided quantification in 3 of your last 4 answers — let's fix that pattern now."

The thinking tokens are like a private scratchpad. The model systematically works through evaluation criteria before responding. The result feels like feedback from an experienced interviewer, not a chatbot.

Three Patterns Where Thinking Tokens Excel

Multi-criteria evaluation: When scoring against a rubric (STAR method, technical accuracy, communication), thinking tokens let the model address each criterion separately before synthesizing.
Cross-answer pattern recognition: With 128K context and thinking tokens, the model notices "This is the third answer without specific metrics" and adjusts its coaching strategy.
Calibrated difficulty: The model reasons about whether to make the next question harder or easier based on performance trajectory, not just the last answer.

3. Free Tier + Open Source = No Business Model Required

Gemma 4 on Google AI Studio doesn't require a credit card. OpenRouter, NVIDIA NIM, and HuggingFace all offer free inference. This means:

Users bring their own free API key
No payment integration needed
No server-side usage tracking or quota accounting needed (each user's key carries its own limits)
No terms of service needed (users have their own provider agreements)

The entire business model question disappears. It's just... a free tool.
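The bring-your-own-key flow needs very little code. Below is a minimal sketch, assuming the key is kept in `localStorage`; the helper names and the `apiKey:` prefix are hypothetical. The storage object is passed in so the same functions can run against a stand-in outside the browser.

```javascript
// ASSUMPTION: hypothetical storage key prefix, one entry per provider.
const KEY_PREFIX = "apiKey:";

function saveApiKey(storage, provider, key) {
  const trimmed = key.trim();
  if (!trimmed) throw new Error("API key is empty");
  storage.setItem(KEY_PREFIX + provider, trimmed);
}

function loadApiKey(storage, provider) {
  // Web Storage returns null when the item was never set.
  return storage.getItem(KEY_PREFIX + provider);
}

// Usage in the browser:
// saveApiKey(window.localStorage, "google", keyInput.value);
// const key = loadApiKey(window.localStorage, "google");
```

One trade-off to keep in mind: anything in `localStorage` is readable by every script running on the same origin, so this only makes sense when the page loads no untrusted code.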

4. 31B Dense + 26B MoE = Two Models for Two Purposes

Variant	Active Params	Best For
31B Dense	31B	Deep reasoning — STAR analysis, comprehensive reports, system design evaluation
26B MoE	~4B	Fast conversational flow — rapid-fire behavioral questions, warm-up rounds

My app lets users choose. This maps model architecture to coaching pedagogy: deep evaluation needs full parameter engagement; conversational flow benefits from MoE speed.
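The table's split can also be expressed as a default suggestion that the user is free to override. This is a sketch under assumptions: both model ids and the task names are placeholders invented for the example, not identifiers from the app or any provider.

```javascript
// ASSUMPTION: placeholder model ids; check your provider's model list.
const MODELS = {
  dense: "gemma-4-31b-it",
  moe: "gemma-4-26b-a4b-it",
};

// Tasks the table lists under "deep reasoning" (hypothetical task names).
const DEEP_TASKS = new Set(["star-analysis", "final-report", "system-design"]);

function suggestModel(task, userChoice = null) {
  if (userChoice) return userChoice; // an explicit user choice always wins
  return DEEP_TASKS.has(task) ? MODELS.dense : MODELS.moe;
}
```

For example, `suggestModel("final-report")` picks the dense variant, while a warm-up round falls through to the MoE variant.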

What I Actually Built

6 practice modes: Behavioral (STAR), Technical, System Design, Assessment, Certification, Case Study
Resume + JD awareness: Paste both — questions tailored to the role's requirements
Voice mode: Speak answers, hear feedback — browser Speech API, zero cost
Image analysis: Upload coding screenshots or architecture diagrams
Real-time scoring: Mid-session scorecards on 5 dimensions + final report with 7-day study plan
Report download: Save results as text file
Session timer + session history in localStorage
4 providers: Automatic fallback if one is rate-limited
Dark mode + full mobile responsiveness
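The provider fallback in the list above could look something like this. It is a sketch, not the app's actual code: it assumes each provider is wrapped in an object with a `name` and an async `send(request)` that resolves to `{ status, data }`, which is a shape invented for the example.

```javascript
// Tries providers in order and moves on when one is rate-limited or fails.
// ASSUMPTION: provider.send(request) resolves to { status, data } (hypothetical wrapper).
async function callWithFallback(providers, request) {
  const errors = [];
  for (const provider of providers) {
    try {
      const { status, data } = await provider.send(request);
      if (status === 429) {
        errors.push(`${provider.name}: rate limited`);
        continue; // try the next provider
      }
      if (status >= 400) {
        errors.push(`${provider.name}: HTTP ${status}`);
        continue;
      }
      return { provider: provider.name, data };
    } catch (err) {
      // Network errors and the like also trigger fallback.
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error("All providers failed: " + errors.join("; "));
}
```

Collecting the per-provider errors makes the final failure message useful to a user who has to work out which key or quota is the problem.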

The Single-File Decision

The most important architecture decision: make the app a single HTML file.

<!-- The entire application -->
<script src="https://cdn.tailwindcss.com"></script>
<!-- ... 1367 lines of HTML + CSS + JS ... -->

This means:

Fork and customize: git clone → edit → push → your own version
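Run locally: Double-click the file. It needs a network connection only for the Tailwind CDN script and the API calls
Small supply chain: Zero npm packages; the Tailwind CDN script is the only third-party code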
Instant deploy: Drag to any hosting

Why This Pattern Works (And When It Doesn't)

"Single-file apps don't scale!"

True for most products. Not for AI-first tools where:

The AI model handles all business logic (Gemma 4's thinking tokens)
The user provides their own API key (no shared auth)
State is session-scoped (no database needed — the 128K context IS the state)
The browser provides remaining APIs (speech, file system, localStorage)

This pattern works specifically because Gemma 4's context window replaces a database, thinking tokens replace evaluation logic, and free API tiers replace server infrastructure.
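"The context is the state" comes down to keeping one growing array of turns and resending it on every call. Here is a minimal sketch, assuming the `role`/`parts` turn shape used by the Gemini API; the helper names are hypothetical, and the 4-characters-per-token figure is a crude rule of thumb, not Gemma's tokenizer.

```javascript
// Returns a new history with one more turn; the old array is left untouched.
function appendTurn(history, role, text) {
  if (role !== "user" && role !== "model") {
    throw new Error("role must be 'user' or 'model'");
  }
  return [...history, { role, parts: [{ text }] }];
}

// ASSUMPTION: ~4 characters per token, only good enough for a rough sanity check.
function roughTokenCount(history) {
  const chars = history.reduce(
    (sum, turn) => sum + turn.parts.reduce((n, part) => n + part.text.length, 0),
    0
  );
  return Math.ceil(chars / 4);
}

// Usage:
// let history = [];
// history = appendTurn(history, "user", "Here is my resume and the job description...");
// history = appendTurn(history, "model", "Thanks. First question: ...");
// if (roughTokenCount(history) > 100000) { /* warn before the window fills up */ }
```

Because the array is plain JSON, the same value can be written to `localStorage` with `JSON.stringify` to back the session history feature.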

The Future: AI Apps Without Infrastructure

I think we'll see an explosion of zero-backend AI tools:

Single HTML file
+ Gemma 4 (128K context + thinking tokens)
+ Browser APIs (Speech, Canvas, File System)
+ Static hosting (GitHub Pages)
= Full-featured app with no backend

The bottleneck was always model quality, context length, and cost. For tools like this one, Gemma 4 eases all three at once.