Ifeanyi Chima

How to use Google Gemma 4 model

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I Built a Celebrity AI Chat App with Google Gemma 4 26B-A4B: Here's Everything You Need to Know

I've always been fascinated by the idea of talking to your favorite celebrity, Elon Musk for instance, but for now we'll have to settle for a chatbot that actually feels like them. So when Google dropped Gemma 4 in April 2026, I saw my chance. I built a web app with a chat UI where you pick a celebrity, and Gemma 4 slips into their persona and holds a full conversation with you. The results were genuinely impressive and sometimes hilarious.

But before I walk you through how I built it, let me give you a proper introduction to the models that made it possible.

1. Understanding the Google Gemma 4 Model Family

Gemma 4 is Google DeepMind's latest generation of open models, released on April 2, 2026. It comes in four distinct sizes, each targeting a different use case, from running on your phone to powering a server-grade workstation. All four models are multimodal (they can process text and images), licensed under Apache 2.0, and support over 140 languages.

Quick Comparison Table

| Model | Architecture | Context | Best For |
| --- | --- | --- | --- |
| E2B | Efficient Dense | 128K | Mobile / Edge |
| E4B | Efficient Dense | 128K | On-device assistants |
| 26B A4B | Mixture of Experts | 256K | High-throughput APIs |
| 31B | Dense | 256K | Maximum quality |

2. How to Get an API Key

Before you can start building anything with Gemma 4, you need API access. The good news? You have a couple of options and one of them is completely free to start.

Option A: Google AI Studio (Direct from Google)

Google provides direct API access to Gemma 4 through Google AI Studio. Here's how to get your key:

  1. Go to aistudio.google.com
  2. Sign in with your Google account
  3. Click "Get API key" in the left sidebar
  4. Click "Create API key"
  5. Copy your key and store it somewhere safe (you won't be able to see it again)

That's it. Google AI Studio gives you a generous free tier to experiment with before you need to think about billing.

Never hardcode your API key in your frontend code. Always store it in an environment variable (.env file) and access it from your backend.

Add .env to your .gitignore before your first commit, not after.
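
For example, here's a rough sketch of a tiny Node script that reads the key from a .env file and makes one sanity-check call through the Gemini API endpoint (which also serves the open Gemma models). The environment variable name is up to you, and the exact Gemma 4 model ID below is my assumption; check the model list in AI Studio for the real name.

// .env  (already in .gitignore, right?)
// GOOGLE_AI_STUDIO_API_KEY=your-key-here

// test-key.js (server-side only; the browser never sees the key)
import "dotenv/config"; // loads .env into process.env

const key = process.env.GOOGLE_AI_STUDIO_API_KEY;
if (!key) throw new Error("Missing GOOGLE_AI_STUDIO_API_KEY in .env");

// NOTE: the model ID here is a guess for illustration
const url = `https://generativelanguage.googleapis.com/v1beta/models/gemma-4-26b-a4b-it:generateContent?key=${key}`;

const res = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    contents: [{ parts: [{ text: "Say hello in one sentence." }] }],
  }),
});

const data = await res.json();
console.log(data.candidates?.[0]?.content?.parts?.[0]?.text ?? data);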


Option B: OpenRouter

OpenRouter is a unified API gateway that gives you access to hundreds of models including all Gemma 4 variants through a single, OpenAI-compatible endpoint. That means you can switch models with one line of code change, get automatic fallbacks if a provider goes down, and even access the free tier of Gemma 4 31B without spending a cent to start.

To get your OpenRouter API key:

  1. Go to openrouter.ai
  2. Click "Sign In" and create an account
  3. Once logged in, click the "Get API Key" button on your dashboard
  4. Click "New Key", give it a name (e.g. celeb-chat-dev), and hit Create
  5. Copy and save your key; it starts with sk-or-...

You can optionally add credits to your account for paid usage, but to start experimenting, the free-tier models (including google/gemma-4-26b-a4b-it:free) are more than enough.


3. How to Use Google Gemma 4 Through OpenRouter

Alright, this is where the fun begins. Let me walk you through exactly how I wired up Gemma 4 via OpenRouter to power my celebrity chat app, from the first API call all the way to a working chat UI.

The Base URL and Model IDs

OpenRouter exposes Gemma 4 through a single endpoint that's fully compatible with the OpenAI Chat API format. If you've ever called GPT-4 or any other model via OpenAI's SDK, this will feel immediately familiar; you're literally changing one URL and one model string.

The base URL is:

https://openrouter.ai/api/v1/chat/completions

And the Gemma 4 model IDs on OpenRouter are:

| Model | Model ID | Free Tier? |
| --- | --- | --- |
| Gemma 4 E2B | google/gemma-4-e2b-it | |
| Gemma 4 E4B | google/gemma-4-e4b-it | |
| Gemma 4 26B A4B | google/gemma-4-26b-a4b-it | ✅ (:free suffix) |
| Gemma 4 31B | google/gemma-4-31b-it | ✅ (:free suffix) |

To use the free tier of the 26B model, for example, use google/gemma-4-26b-a4b-it:free as the model string.


Switching Models Without Rewriting Your Code

This is the quiet superpower of going through OpenRouter. Because the API schema is identical across all models, switching from the free 31B to the 26B MoE (which is faster and cheaper in production) is literally one line:

// During development
model: "google/gemma-4-31b-it:free"

// In production (faster, cheaper MoE)
model: "google/gemma-4-26b-a4b-it:free"

No refactoring. No new SDK. Just change the string. This is why I recommend OpenRouter for any serious project: you can prototype on free-tier models and graduate to the right production model without touching your integration code.
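
One way to lean into that even more is to pull the model string out into an environment variable, so switching models becomes a config change rather than a code change. A tiny sketch (the GEMMA_MODEL variable name is just my own convention, not anything OpenRouter requires):

// .env
// GEMMA_MODEL=google/gemma-4-31b-it:free      <- during development
// GEMMA_MODEL=google/gemma-4-26b-a4b-it:free  <- in production

// Fall back to the free 26B MoE if nothing is configured
const MODEL = process.env.GEMMA_MODEL ?? "google/gemma-4-26b-a4b-it:free";

// ...and later, in the request body:
// model: MODEL,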


Your First API Call

Let's start simple with a raw fetch call in JavaScript, no SDKs needed:

const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "google/gemma-4-26b-a4b-it:free",
    messages: [
      { role: "user", content: "What's the capital of Nigeria?" }
    ],
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);

That's it. Three things to note:

  • Your API key goes in the Authorization header as a Bearer token
  • The response shape mirrors OpenAI's: your answer lives at data.choices[0].message.content
  • You're billed (or rate-limited on the free tier) per token, so max_tokens is worth setting explicitly (see the SDK sketch below)
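
Since the endpoint is OpenAI-compatible, the same call also works through the official OpenAI SDK if you'd rather not use raw fetch. Here's a sketch with max_tokens set (256 is an arbitrary cap I picked; tune it to your use case):

import OpenAI from "openai";

// Same endpoint, same key, just going through the OpenAI SDK
const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const completion = await client.chat.completions.create({
  model: "google/gemma-4-26b-a4b-it:free",
  messages: [{ role: "user", content: "What's the capital of Nigeria?" }],
  max_tokens: 256, // cap the response so a chatty model can't run up your usage
});

console.log(completion.choices[0].message.content);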

While testing, I kept getting this error:

google/gemma-4-26b-a4b-it:free is temporarily rate-limited upstream.

Here's what's going on: when you use the free tier on OpenRouter (google/gemma-4-26b-a4b-it:free), you're sharing one pool of rate limits with every other developer in the world using that same free model. When too many people hit it at once, the upstream provider throttles it and you get that 429 error. But you already have a Google AI Studio API key sitting in your .env.local doing nothing right now. OpenRouter lets you connect your own Google AI Studio key so that your requests go through your personal quota instead of the shared free pool.

You have two options:

Option 1 — Add your Google AI Studio key to OpenRouter

Option 2 — Add retry logic to your API route (sketched below)

I went with Option 1, adding my Google AI Studio key to OpenRouter's integrations:

Go to: openrouter.ai/settings/integrations

  • Find Google AI Studio
  • Paste your Google AI Studio API key there

This gives you your own dedicated rate limit instead of sharing the free pool with everyone.
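
And if you'd rather go with Option 2 (or combine both), here's a minimal sketch of what the retry logic could look like in your API route: retry only on 429s, with a short exponential backoff. The three attempts and the 1s/2s/4s delays are arbitrary numbers I picked for illustration.

async function callGemmaWithRetry(body, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${process.env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    });

    // Anything other than a rate limit: hand the JSON back immediately
    if (res.status !== 429) return res.json();

    // Rate-limited: wait 1s, 2s, 4s... then try again
    await new Promise((r) => setTimeout(r, 1000 * 2 ** i));
  }
  throw new Error("Still rate-limited after retries, try again later.");
}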


The System Prompt: The Secret Weapon for My Project

Here's where Gemma 4 genuinely impressed me. Gemma 4 officially supports the system role natively, which is a big deal for an open model since earlier generations required hacky workarounds. However, in my actual project I used a slightly different pattern that I found worked even better for persona-heavy apps: the user/assistant injection pattern.

Instead of passing the persona as a system message, I injected it as the first user message and followed it with a primed assistant response where the model explicitly "agrees" to become the character before the real conversation even starts:

const formattedMessages = [
  {
    role: "user",
    content: systemPrompt, // persona instructions injected here
  },
  {
    role: "assistant",
    // the model "confirms" the persona before the conversation begins
    content: `Understood. I am ${personName}. I will only speak about my own life, work, and experiences. Ask me anything about who I am.`,
  },
  // then the real conversation history follows
  ...messages,
];

The subtle advantage of this approach over a plain system message: the model has explicitly agreed to the role as part of the conversation context. That in-context confirmation produces noticeably stronger character lock-in across long conversations. The model is far less likely to slip out of persona when it has already "committed" to it in an early assistant turn.

The results were honestly uncanny. The model held the persona across multiple turns, matched the celebrity's tone and vocabulary, and even pushed back on out-of-character questions in a way that felt natural. The 256K context window meant I could carry a long conversation without the model ever "forgetting" who it was supposed to be.
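
To make the whole flow concrete, here's roughly how the persona injection and the OpenRouter call fit together in one API route. This is a simplified sketch of the pattern rather than my app's code verbatim, written as a Next.js App Router handler since that's where .env.local lives; adapt it to whatever backend you're using.

// app/api/chat/route.js
export async function POST(req) {
  const { personName, systemPrompt, messages } = await req.json();

  // Persona injection: the first "user" turn carries the instructions,
  // and a primed "assistant" turn commits the model to the character
  const formattedMessages = [
    { role: "user", content: systemPrompt },
    {
      role: "assistant",
      content: `Understood. I am ${personName}. I will only speak about my own life, work, and experiences. Ask me anything about who I am.`,
    },
    ...messages,
  ];

  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "google/gemma-4-26b-a4b-it:free",
      messages: formattedMessages,
    }),
  });

  const data = await res.json();
  return Response.json({ reply: data.choices[0].message.content });
}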


Enabling Reasoning Mode

One of Gemma 4's standout features is its configurable thinking mode: you can ask the model to reason through a problem before giving its final answer. OpenRouter exposes this through a reasoning parameter:

body: JSON.stringify({
  model: "google/gemma-4-26b-a4b-it",
  messages: messages,
  reasoning: { enabled: true }, // turn on chain-of-thought
}),

When enabled, the response includes a reasoning_details array showing the model's internal thought process before the final answer. I didn't use this for the celebrity chat (you don't want Morgan Freeman visibly deliberating), but it's incredibly useful for any app where you want to show why the model said something: tutoring apps, code explainers, or decision-support tools, for example.
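
If you do turn it on, here's a rough sketch of how you might read both the final answer and the reasoning out of the response. I'm going off the response shape described above; log the raw JSON once to confirm exactly where reasoning_details lands for your model.

const data = await response.json();
const message = data.choices[0].message;

// The final answer lives in the usual place
console.log("Answer:", message.content);

// The intermediate reasoning, present when reasoning is enabled
for (const step of message.reasoning_details ?? []) {
  console.log("Reasoning step:", step);
}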

One tip from the OpenRouter docs: if you're building an agentic tool or integrating with something like Claude Code or Cline, turn reasoning off. It adds latency and those tools are optimized for fast back-and-forth, not deep deliberation.


Thanks for reading!
