Ramya Sri M

Posted on May 25

From Understanding Gemma 4 🧠 to Building SpeakUp 🎙️ — An AI English Coach 🤖

#devchallenge #gemmachallenge #gemma #ai

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Why I Started This

Millions of people want to learn English, but for many families a private tutor is too expensive. So I decided to build one using AI: not just a simple chatbot, but a real English coach for everyone, from a 7-year-old child 👧 to a 60-year-old adult 👴. I call it SpeakUp, it is powered by Gemma 4, and I am still actively building it. But before I show you SpeakUp, I want to explain Gemma 4 in a simple way. By the end of this article, it will be clear what Gemma 4 is, how it works, and how I am using it to build SpeakUp.

What is Gemma 4?🤖

Gemma 4 is an open-source AI model built by Google DeepMind, released under the Apache 2.0 license — free to use, modify, and build with.

There are two ways to use it:

💻 Run locally via Ollama
=> No API key needed, and no internet needed once the model is downloaded
☁️ Access via the Google AI Studio API
=> Internet and API key needed

Running it locally means Gemma 4 runs entirely on your own device — no data leaves your machine, no subscription needed. Accessing it via Google AI Studio API means Gemma 4 runs on Google's servers and your app communicates with it over the internet.

Gemma 4 is well-suited for:

📝Text generation
💻Coding assistance
🧠Reasoning
🖼️Image understanding
🔧Function calling and building agentic applications

The Four Gemma 4 Models📦

Gemma 4 has four models — E2B, E4B, 26B A4B, and 31B.

E2B — 📱Effective 2 Billion

The "E" stands for Effective. E2B is the smallest model in the Gemma 4 family. It has 5.1B total parameters but runs with the speed and memory of a much smaller 2.3B model. This is possible because of a technique called Per-Layer Embeddings (PLE): instead of loading everything at once, each layer of the model gets only the small piece of embedding information it needs, and those embeddings are cheap lookups rather than heavy computation. This makes the model light, fast, and well suited to phones and low-memory devices.

E4B — 💻Effective 4 Billion

E4B works the same way as E2B using PLE. It has 8B total parameters but runs with the speed and memory of a 4.5B model. It is smarter than E2B and better suited for complex reasoning, while still running on everyday laptops.

26B A4B — Mixture of Experts

The "A" stands for Active. This model has 26B total parameters but activates only around 4B of them for each token it processes. A router sends each token to a few specialist sub-networks (the "experts"), so the model is faster and cheaper to run while still drawing on everything stored across all 26B parameters.
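To make the idea of "active" parameters concrete, here is a minimal toy sketch of top-k expert routing. It is not Gemma 4's actual router: the eight experts, the router scores, and k=2 are all made-up values chosen for illustration.

```python
import math

def top_k_experts(router_scores, k):
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]

def moe_forward(x, experts, router_scores, k=2):
    """Run only the top-k experts and mix their outputs by softmax weight."""
    chosen = top_k_experts(router_scores, k)
    # Softmax over the scores of the chosen experts only.
    exps = [math.exp(router_scores[i]) for i in chosen]
    total = sum(exps)
    output = sum((e / total) * experts[i](x) for e, i in zip(exps, chosen))
    return output, chosen

# Eight toy "experts": each one simply scales its input (illustrative only).
experts = [lambda x, scale=s: x * scale for s in range(1, 9)]
# Made-up router scores for a single token.
router_scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

output, chosen = moe_forward(10.0, experts, router_scores, k=2)
print("experts that ran:", chosen)
print("experts skipped:", len(experts) - len(chosen))
```

Only two of the eight experts do any work for this token, which is the same reason a mixture-of-experts model can hold many parameters while computing with only a fraction of them.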

31B Dense

The 31B Dense model uses all 30.7 billion parameters for every token it generates, which gives it the most thorough and accurate reasoning in the family.

Multimodal Capabilities

All four models handle text and image inputs. E2B and E4B also support audio clips of up to 30 seconds 🎵. All of them generate text output only and support over 140 languages 🌍.

Understanding the Building Blocks🔬

Parameters — 🧠Intelligence Capacity

Parameters are the learned knowledge connections inside an AI. More parameters usually means more intelligence, better reasoning, and better understanding.

Layers — 🏢Depth of Reasoning
Layers are the stages of thinking inside an AI. Each layer processes the input and passes a richer understanding to the next layer.
More layers = deeper thinking = better understanding of complex meaning.

Embeddings — The Language Translation Layer🔤
For E2B, the 5.1B total parameters split into two groups: 2.3B effective (reasoning) parameters plus roughly 2.8B embedding parameters.

Reasoning parameters (2.3B) — the actual thinking 🧠
Embedding parameters (remaining) — the translation layer

The 2.3B reasoning parameters are the actual “thinking” part of the AI. These parameters help the model understand questions, reason, and generate answers.
The remaining parameters are embedding parameters. Their job is to convert words and tokens into a mathematical form that the AI can understand internally.

So even though E2B has 5.1B total parameters, only 2.3B are mainly used for reasoning, which is why it behaves like a smaller and more efficient model.
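The split described above is simple arithmetic, sketched here using only the two E2B figures quoted in this article (5.1B total, 2.3B effective). The variable names are my own.

```python
# Figures for E2B as quoted in this article, in billions of parameters.
total_params_b = 5.1
effective_params_b = 2.3

# Whatever is not counted as "effective" is attributed to embeddings.
embedding_params_b = round(total_params_b - effective_params_b, 1)
embedding_share = embedding_params_b / total_params_b

print(f"{embedding_params_b}B embedding parameters, "
      f"{embedding_share:.0%} of the total")
```

So a bit more than half of E2B's parameters sit in embeddings, which is why the "effective" size is so much smaller than the total.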

Embeddings convert words, sentences, and meaning into numbers that the AI understands internally. For example, the word "cat" becomes a series of numbers inside the model. Embeddings are the language translation layer for the AI brain.
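As a rough illustration of "words become numbers", here is a toy embedding table. The three-number vectors are invented for this example; a real model learns vectors with hundreds or thousands of dimensions.

```python
import math

# Toy embedding table: every token maps to a short vector of made-up numbers.
EMBEDDINGS = {
    "cat": [0.8, 0.1, 0.3],
    "dog": [0.7, 0.2, 0.3],
    "car": [0.1, 0.9, 0.5],
}

def embed(token):
    """Look up the vector for a token (the 'translation' step)."""
    return EMBEDDINGS[token]

def cosine_similarity(a, b):
    """Return how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(embed("cat"), embed("dog"))
cat_car = cosine_similarity(embed("cat"), embed("car"))
print(f"cat vs dog: {cat_dog:.2f}")
print(f"cat vs car: {cat_car:.2f}")
```

In this toy table "cat" and "dog" get similar numbers and "car" does not, which mirrors how learned embeddings place related meanings close together.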

Tokens and Vocabulary — How AI Reads Text 📖
Vocabulary is the set of tokens the AI knows, and vocabulary size is how many tokens are in that set. But tokens are not the same as words.
Tokens can be:

Full words: "cat" = 1 token
Parts of words: "unbelievable" = 3 tokens (un / believ / able), though the exact split depends on the tokenizer
Symbols and punctuation: "!" = 1 token
Code snippets and special characters
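The list above can be sketched with a tiny greedy tokenizer. This is only a toy: the five-entry vocabulary is hypothetical, and Gemma 4's real tokenizer is far larger and may split the same words differently.

```python
# Hypothetical mini-vocabulary, chosen only for this illustration.
TOY_VOCAB = {"un", "believ", "able", "cat", "!"}

def tokenize(text, vocab=TOY_VOCAB):
    """Split text by always taking the longest piece found in the vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: emit the single character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("cat"))
print(tokenize("unbelievable"))
print(tokenize("cat!"))
```

With this vocabulary, a short common word stays whole while a longer word is broken into three pieces, which is why token counts and word counts differ.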

Context Length — 💾Total Memory Size
Context length means how much information the AI can remember at once during a conversation. It is measured in tokens.

128K tokens = approximately 90,000 words = a full novel 📗
256K tokens = approximately 180,000 words = two full novels 📗📗
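The word estimates above follow from a rule of thumb of about 0.7 English words per token. That ratio is an assumption inferred from the figures in this article; the real ratio varies with the language and the text.

```python
# Assumption: roughly 7 English words for every 10 tokens.
WORDS_PER_10_TOKENS = 7

def approx_words(tokens):
    """Estimate how many English words fit in a given number of tokens."""
    return tokens * WORDS_PER_10_TOKENS // 10

for tokens in (128_000, 256_000):
    print(f"{tokens:,} tokens ~ {approx_words(tokens):,} words")
```

Both results land close to the rounded figures of 90,000 and 180,000 words used above.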

Sliding Window — Local Attention Memory 🪟
Instead of having every token look at every other token (which gets very slow for long inputs), most layers let each token look only at a fixed window of nearby tokens. This is called sliding window attention.
Gemma 4 uses a hybrid approach — local sliding window attention for speed⚡, and global attention every few layers to capture the big picture. The final layer is always global — so Gemma 4 always ends with a complete view before generating its reply.
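Here is a minimal sketch of the hybrid idea: a mask that decides which earlier tokens each token may look at, plus a layer schedule that mixes local and global layers and always ends on a global one. The window size, the layer count, and the "every third layer" spacing are illustrative assumptions, not Gemma 4's real configuration.

```python
def causal_mask(seq_len, window=None):
    """Build mask[i][j], True when token i may attend to token j.

    With window=None every earlier token is visible (global attention).
    With a window, only the most recent `window` tokens are visible.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # a token never looks at future tokens
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask

def layer_pattern(num_layers, global_every):
    """Label each layer local or global; the final layer is always global."""
    kinds = []
    for layer in range(1, num_layers + 1):
        is_global = layer % global_every == 0 or layer == num_layers
        kinds.append("global" if is_global else "local")
    return kinds

print(layer_pattern(num_layers=7, global_every=3))
print(causal_mask(5, window=2)[4])  # what the last token sees in a local layer
print(causal_mask(5)[4])            # what the last token sees in a global layer
```

In a local layer the last token sees only its two most recent positions, while in a global layer it sees the whole sequence, so the local layers stay cheap and the global ones supply the big picture.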

What is SpeakUp?🎙️
SpeakUp is an AI-powered English learning web app that runs in Chrome. The backend is built with Python (FastAPI), and the frontend is plain HTML, CSS, and JavaScript. Gemma 4 powers every feature.

Two modes:

👤 Adult mode: precise grammar rules, professional tone, detailed explanations
👶 Kids mode: one toggle transforms the entire app with larger text, simpler words, an encouraging tone, and a slower voice
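One way a single toggle could drive both modes is a small settings table on the backend. This is a hypothetical sketch, not SpeakUp's actual code: the class name, the field names, the prompt wording, and the numeric values are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModeSettings:
    system_prompt: str   # instruction sent to the model (hypothetical wording)
    font_scale: float    # multiplier for the text size in the UI
    speech_rate: float   # 1.0 means normal text-to-speech speed

# Hypothetical values, chosen only to illustrate the two modes.
MODES = {
    "adult": ModeSettings(
        system_prompt="Explain grammar rules precisely, in a professional tone.",
        font_scale=1.0,
        speech_rate=1.0,
    ),
    "kids": ModeSettings(
        system_prompt="Use simple words and an encouraging tone.",
        font_scale=1.4,
        speech_rate=0.8,
    ),
}

def settings_for(kids_mode):
    """Return the settings selected by the single Kids mode toggle."""
    return MODES["kids" if kids_mode else "adult"]

print(settings_for(kids_mode=True))
```

Keeping everything that differs between the modes in one place means the toggle only has to flip one boolean.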

Features:

📚 Grammar Lessons A to Z
✅ Grammar Correction
🎧 Listening Practice
📖 Reading Practice
✍️ Writing Practice
🤝 Speaking Partner
🔁 Shadowing
🗣️ Pronunciation Check
🎯 Quiz with Celebrations
🧩 Word Scramble

Challenge — The RAM Problem⚠️

I downloaded the E4B model successfully. Everything looked fine. But the moment I tried to run it:

model requires more system memory (9.8 GiB) than is available (4.9 GiB)

My laptop has 8 GB of RAM in total. After Windows took its share, only 4.9 GiB was free, and E4B needs 9.8 GiB, so it would not load.
The solution was Google AI Studio. I switched to the 26B A4B model running on Google's servers via the free API.
The offline option is still there. Any user with 10+ GB free RAM can switch to local Ollama by changing one line in main.py.
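A one-line switch like this could look roughly as follows. This is a sketch under assumptions, not the real main.py: the BACKEND constant, the function names, and any model name passed in are hypothetical. The two URLs are the standard Ollama generate endpoint and the Gemini API generateContent endpoint that Google AI Studio keys work with. The code only builds the HTTP request and does not send it.

```python
import json
import urllib.request

BACKEND = "ai_studio"  # hypothetical switch: change to "ollama" to run locally

OLLAMA_URL = "http://localhost:11434/api/generate"
AI_STUDIO_URL = ("https://generativelanguage.googleapis.com/v1beta/"
                 "models/{model}:generateContent")

def choose_backend(required_gib, available_gib):
    """Suggest local Ollama only when the model fits in free memory."""
    return "ollama" if available_gib >= required_gib else "ai_studio"

def build_request(backend, prompt, model, api_key=""):
    """Build (but do not send) the HTTP request for the chosen backend."""
    headers = {"Content-Type": "application/json"}
    if backend == "ollama":
        url = OLLAMA_URL
        payload = {"model": model, "prompt": prompt, "stream": False}
    elif backend == "ai_studio":
        url = AI_STUDIO_URL.format(model=model)
        payload = {"contents": [{"parts": [{"text": prompt}]}]}
        headers["x-goog-api-key"] = api_key
    else:
        raise ValueError(f"unknown backend: {backend}")
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")

# The numbers from the error message: 9.8 GiB needed, 4.9 GiB free.
print(choose_backend(required_gib=9.8, available_gib=4.9))
```

With the figures from the error message, the check points to the hosted API, which matches the choice I ended up making.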

SpeakUp is still in progress, but it already shows how AI can make English learning more accessible and practical for everyone🌍.
