Kaushikcoderpy

Posted on May 20

GemmaForge: I Built a 7-Pipeline AI Content Engine Using Every Gemma 4 Model — Here's How I Solved the Echo Problem

#devchallenge #gemmachallenge #gemma #contentwriting

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

GemmaForge is a full-stack AI content engine that automates the entire technical content lifecycle — from competitive intelligence to multi-platform distribution — using purpose-routed Gemma 4 models.

The core idea: different tasks need different-sized brains. Instead of throwing one model at everything, GemmaForge routes each stage of the pipeline to the optimal Gemma model:

Gemma E2B strips marketing noise (fast, cheap, aggressive)
Gemma 26B handles complex reasoning — SEO gap analysis, content planning, and multimodal vision
Gemma 31B Dense generates production-ready long-form articles

The result is a system with 7 atomic AI tools, a full distribution engine (11 platforms, 5 quality gates, 3 search engines), and multimodal vision capabilities — all orchestrated through a single FastAPI server with a glassmorphic web UI.

The Problem I Solved

Engineers who ship great code shouldn't spend 3 hours writing and distributing blog posts. GemmaForge takes a raw topic and autonomously:

Analyzes competitor content and identifies semantic gaps
Detects rising engineering trends from structured data
Generates a content blueprint that fills those gaps
Writes a production-ready article (HTML or Markdown)
Optionally distributes to 11 platforms concurrently

And with the multimodal tools, you can upload a whiteboard sketch, architecture diagram, or infographic — and GemmaForge will extract the content, plan the narrative, and write the full article end-to-end.

Demo

Live Repository: https://github.com/Kaushikcoderpy/GemmaForge

Code

Kaushikcoderpy / GemmaForge

Fully asynchronous pipeline using Gemma 4 26B MoE & 2B to transform technical seeds into self-refined articles. Features SERP gap analysis, human-style injection, and a concurrent fan-out to 10+ platforms with instant Google/Bing indexing. Built for high-throughput, local-first dev publishing.

How I Used Gemma 4

This is the section I'm most excited about, because GemmaForge doesn't just "use" Gemma — it was built around understanding how each model size thinks differently. Let me walk through the entire architecture.

The Model Routing Philosophy

Most AI projects use a single model for everything. That's like using a sledgehammer to hang a picture frame. Each Gemma 4 variant has a distinct personality:

Model	Params	What It's Good At	What It's Bad At
Gemma E2B	2B	Fast filtering, signal extraction, noise removal	Complex reasoning, nuanced analysis
Gemma 26B	26B	Comparative analysis, structural planning, vision	Long-form coherent prose generation
Gemma 31B Dense	31B	Long-form article writing, creative narrative flow	Overkill for simple extraction tasks

GemmaForge exploits these differences through dynamic model discovery. At runtime, it queries the Google API to find which models are available and falls back gracefully:

async def _get_target_model(session, requested_name, fallback_keywords):
    """Dynamically checks Google API for available models 
    and returns the best match."""
    api_key = _get_api_key()
    models_url = f"https://generativelanguage.googleapis.com/v1beta/models?key={api_key}"

    async with session.get(models_url) as resp:
        data = await resp.json()

    available_models = [m['name'].replace('models/', '') 
                        for m in data.get('models', [])]

    # Try the exact requested model first
    if requested_name in available_models:
        return requested_name

    # Fall back through keywords by priority
    for keyword in fallback_keywords:
        for model in available_models:
            if keyword in model.lower():
                return model

    raise ModelNotAvailableError(
        f"No suitable model found for {requested_name!r}"
    )

This means GemmaForge doesn't hard-crash just because one specific model is unavailable: it degrades to the next best match, and raises ModelNotAvailableError only when nothing matches at all.
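The fallback order is easier to see with the HTTP call removed. Here is a minimal sketch of the same two-step logic (exact match first, then keywords in priority order) as a pure function. The name `pick_model` and the sample model listing are assumptions for illustration, not code from the repository.

```python
from typing import Sequence


class ModelNotAvailableError(RuntimeError):
    """Raised when neither the requested model nor any fallback is available."""


def pick_model(requested: str, fallback_keywords: Sequence[str],
               available: Sequence[str]) -> str:
    """Return the requested model if present, else the first keyword match."""
    if requested in available:
        return requested
    # Keywords are tried in order, so earlier keywords win over later ones.
    for keyword in fallback_keywords:
        for model in available:
            if keyword in model.lower():
                return model
    raise ModelNotAvailableError(
        f"No model matching {requested!r} or {list(fallback_keywords)!r}"
    )


# Hypothetical API listing in which the 31B model is missing.
available = ["gemma-4-e2b-it", "gemma-4-26b-a4b-it"]
print(pick_model("gemma-4-31b-it", ["31b", "26b"], available))
```

One caveat of plain substring matching: a short keyword such as "2b" also matches any name containing "12b", so keyword order and specificity matter.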

Pipeline Stage 1: Compress Competitor Fluff → Gemma E2B

Why E2B: This is pure signal extraction. We're not reasoning — we're aggressively filtering marketing language and retaining only technical substance. The 2B model handles this at blazing speed without wasting compute.

@retry(wait=wait_exponential(multiplier=1, min=2, max=10), 
       stop=stop_after_attempt(5))
async def compress_competitor_fluff_2b(raw_text: str) -> str:
    """Strip marketing noise. Retain only version numbers, 
    library names, code logic, and architectural constraints."""

    prompt = f"""TASK: Extract raw technical parameters, 
    version numbers, and architecture logic from input.
    FORMAT ENFORCEMENT: Output ONLY a strict Markdown 
    bulleted list (-).
    ZERO introductory text. ZERO conversational filler.
    INPUT:
    {raw_text}
    OUTPUT: A high-density technical summary."""

    # Routes to gemma-4-e2b-it
    target_model = await _get_target_model(
        session, "gemma-4-e2b-it", ["2b", "27b"]
    )

Real example: Input is a 4,000-word competitor blog post filled with "FastAPI has quickly become a popular choice..." — output is a tight 200-word bulleted list of version numbers, port bindings, and architectural decisions.
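The `@retry` decorator on this stage comes from Tenacity. For readers who haven't used it, the idea behind it can be sketched in plain asyncio. This is an illustration of retry with exponential backoff, not Tenacity's exact wait schedule, and `retry_with_backoff` and its parameters are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    attempts: int = 5,
    min_wait: float = 2.0,
    max_wait: float = 10.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry `call` up to `attempts` times, doubling the wait each time."""
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            # 1s, 2s, 4s, ... clamped into [min_wait, max_wait].
            wait = min(max_wait, max(min_wait, 2.0 ** (attempt - 1)))
            await sleep(wait)
    raise RuntimeError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the schedule can be tested without actually waiting.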

Pipeline Stage 2: Analyse Trends → Gemma 26B (as 4B replacement)

Why this model: This task requires pattern recognition in structured JSON data (Google Trends output). We need the model to identify engineering-relevant signals while discarding consumer trends like "AI Image Generator." The 26B model provides the analytical depth needed to make those judgment calls.

async def analyse_trends_4b(raw_trends_json: str) -> str:
    """Identify 5 emerging technical signals from trends data.
    Discard general consumer topics."""

    prompt = f"""TASK: Extract 5 technical growth signals 
    from the JSON data.
    FORMAT: Return exactly 5 Markdown headers (##). 
    Under each, provide a bulleted list of extracted 
    entities and growth metrics.
    ZERO conversational text.
    INPUT:
    {raw_trends_json}"""

What it produces:

## Inference-Time Compute Scaling
- o1-preview model: Breakout growth
- Shift to System 2 thinking via inference scaling laws

## Diffusion Transformers (DiT)
- Sora video generator: +450%
- Convergence of ViT and Diffusion for spatiotemporal data
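Because the prompt demands a strict shape, the output can be checked mechanically before it moves to the next stage. The following is a sketch of such a check, assuming the report uses `##` headers with `-` bullets as shown above. `split_signals` and `is_valid_trend_report` are hypothetical helpers, not part of GemmaForge.

```python
import re


def split_signals(markdown: str) -> dict[str, list[str]]:
    """Map each '## ' header to the bullet lines that follow it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        header = re.match(r"^##\s+(.*\S)\s*$", line)
        if header:
            current = header.group(1)
            sections[current] = []
        elif current is not None and line.lstrip().startswith("- "):
            sections[current].append(line.lstrip()[2:].strip())
    return sections


def is_valid_trend_report(markdown: str, expected: int = 5) -> bool:
    """True when the report has exactly `expected` sections, none of them empty."""
    sections = split_signals(markdown)
    return len(sections) == expected and all(sections.values())


sample = (
    "## Inference-Time Compute Scaling\n"
    "- o1-preview model: Breakout growth\n"
    "\n"
    "## Diffusion Transformers (DiT)\n"
    "- Sora video generator: +450%\n"
)
print(split_signals(sample))
```

A report that fails the check (for example, one that opens with conversational filler and has too few sections) could be sent back for regeneration instead of contaminating the planning stage.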

Pipeline Stage 3: SEO Gap Report → Gemma 26B

Why 26B: This is the most intellectually demanding text task in the pipeline. The model must perform differential analysis — reading your draft, reading the top SERP competitors, and precisely identifying what technical concepts you're missing. This requires genuine comparative reasoning that smaller models simply cannot do.

But before Gemma even touches this, we run a local ONNX-based semantic analysis using Nomic Embed v1.5 to prioritize which competitors to analyze:

import numpy as np


class NomicEmbedder:
    """Local ONNX embedding model — zero external API dependency."""
    _instance = None

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        import onnxruntime as ort
        from transformers import AutoTokenizer
        from huggingface_hub import hf_hub_download

        self.model_id = "Xenova/nomic-embed-text-v1"
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)

        model_path = hf_hub_download(
            repo_id=self.model_id,
            filename="onnx/model_quantized.onnx"
        )
        self.session = ort.InferenceSession(
            model_path, providers=['CPUExecutionProvider']
        )

    def embed(self, texts):
        encoded = self.tokenizer(
            texts, padding=True, truncation=True, 
            max_length=512, return_tensors="np"
        )
        outputs = self.session.run(None, {
            "input_ids": encoded["input_ids"].astype(np.int64),
            "attention_mask": encoded["attention_mask"].astype(np.int64),
            "token_type_ids": encoded["token_type_ids"].astype(np.int64)
        })
        # Mean Pooling + L2 Normalization
        token_embeddings = outputs[0]
        mask = np.expand_dims(encoded["attention_mask"], -1).astype(float)
        pooled = np.sum(token_embeddings * mask, axis=1) / \
                 np.clip(mask.sum(axis=1), 1e-9, None)
        norms = np.linalg.norm(pooled, axis=1, keepdims=True)
        return (pooled / norms).tolist()

We compute cosine similarity between your content and the top 5 SERP results, then feed only the highest-gap competitors to Gemma 26B for analysis. This keeps the context window focused and the output actionable.
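The ranking step can be sketched in a few lines. This example assumes "gap" means 1 minus cosine similarity, so the least similar competitors rank first; that definition, the function names, and the toy two-dimensional vectors are all assumptions for illustration.

```python
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; for unit-length vectors this equals the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def highest_gap(draft_vec: Sequence[float], competitor_vecs: dict,
                top_k: int = 2) -> list:
    """Return the `top_k` competitor keys that are least similar to the draft."""
    gaps = {key: 1.0 - cosine(draft_vec, vec)
            for key, vec in competitor_vecs.items()}
    return sorted(gaps, key=gaps.get, reverse=True)[:top_k]


# Toy vectors standing in for real embeddings.
draft = [1.0, 0.0]
competitors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(highest_gap(draft, competitors))
```

Since `embed` already L2-normalizes its output, the norm computation here is redundant for real embeddings; it is kept so the sketch also works on unnormalized vectors.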

Pipeline Stage 4: Plan Content → Gemma 26B

Why 26B: Blueprint generation is a synthesis task — merging gap data, trend signals, and style constraints into a structured meta-prompt. The 26B model excels at this because it can hold all three inputs in context simultaneously and produce a coherent architectural plan.

async def plan_the_content_26b(gap_report, trends_summary, human_style):
    """Synthesize gap data + trends into a structured 
    content blueprint."""

    prompt = f"""TASK: Generate a structural outline 
    synthesizing gap data and trends.
    FORMAT: Output ONLY valid Markdown headers (##) 
    and bullet points (-).
    GAP REPORT: {gap_report}
    TRENDS: {trends_summary}
    STYLE: {human_style}"""

Pipeline Stage 5: Write Content → Gemma 31B Dense

Why 31B Dense: This is where the magic (and the pain) happens. Long-form article generation requires the model to maintain coherent narrative flow across thousands of tokens, weave in technical details naturally, and produce prose that doesn't read like a robot wrote it.

But 31B Dense also gave me the hardest problem I've ever faced with an LLM. More on that below.

async def write_the_content_31b(content_plan, output_format="html"):
    """Generate production-ready content from a blueprint.
    Uses Output Anchoring and response prefilling 
    to prevent the Echo problem."""

    target_model = await _get_target_model(
        session, "gemma-4-31b-it", ["31b", "27b", "26b"]
    )

    # RESPONSE PREFILLING: The key anti-echo technique
    anchor = "#" if is_md else "<h1>"
    payload = {
        "contents": [
            {"role": "user", "parts": [{"text": final_prompt}]},
            {"role": "model", "parts": [{"text": anchor}]}  # ← THIS
        ],
        "generationConfig": {
            "temperature": 0.7,
            "maxOutputTokens": 8192,
            "topP": 0.95
        },
        "systemInstruction": {
            "parts": [{"text": system_instruction}]
        }
    }

Notice that second content entry — {"role": "model", "parts": [{"text": anchor}]}. That's response prefilling, and it was the breakthrough that made the entire project work. I'll explain why in the "What I Learned" section.

Pipeline Stages 6 & 7: Multimodal Vision → Gemma 26B

Why 26B for vision: Gemma 4's 26B model has native multimodal vision support baked into the same model that handles text. There's no separate "vision" model — you just pass inlineData alongside your text prompt. This is incredibly elegant.

Tool 6: Generate Alt Text

async def generate_alt_text_26b(image_base64, mime_type="image/jpeg"):
    """Analyze an image and generate SEO-optimized alt text."""

    prompt = "TASK: Generate concise, descriptive alt text. Output ONLY the alt text."

    payload = {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inlineData": {
                    "mimeType": mime_type, 
                    "data": image_base64
                }}
            ]
        }],
        "systemInstruction": {
            "parts": [{
                "text": "You are an automated SEO utility. "
                        "Output ONLY the final alt text string. "
                        "No drafts, no steps, no explanations."
            }]
        }
    }

Tool 7: Image → Article Pipeline

This is the most ambitious tool. It chains three Gemma calls:

Gemma 26B Vision extracts structural content from an uploaded image (sketch, diagram, plan)
Gemma 26B plans the narrative structure from the extracted data
Gemma 31B writes the full article

# In server.py — the chained pipeline
@app.post("/tools/image-to-article")
async def api_image_to_article(image: UploadFile = File(...), 
                                prompt: str = Form(""),
                                output_format: str = Form("html")):
    contents = await image.read()
    image_base64 = base64.b64encode(contents).decode("utf-8")
    mime_type = image.content_type or "image/jpeg"

    # Step 1: Extract (26B Vision)
    extracted = await extract_image_content_26b(
        image_base64, mime_type, prompt
    )

    # Step 2: Plan (26B Text)
    plan = await plan_the_content_26b(
        "N/A - Image based generation", 
        extracted, 
        "technical, code-heavy"
    )

    # Step 3: Write (31B Dense)
    article = await write_the_content_31b(plan, output_format)

    return {
        "extracted_content": extracted,
        "plan": plan,
        "result": article,
        "format": output_format
    }

Upload a whiteboard photo → get a complete technical article. That's the power of chaining Gemma models.

The BYOK Architecture: Running Gemma in the Browser

One design decision that sets GemmaForge apart: every atomic tool can run entirely in the browser without touching the backend. We call it BYOK — Bring Your Own Key.

Users paste their Gemini API key into a settings panel, and the frontend JavaScript calls the Gemma models directly:

// ai_engine.js — Browser-side model routing
async function getTargetModel(sizeKeyword) {
    if (sizeKeyword === '2b') return "gemma-4-e2b-it";
    if (sizeKeyword === '26b') return "gemma-4-26b-a4b-it";
    if (sizeKeyword === '31b') return "gemma-4-31b-it";
    return "gemma-4-e2b-it"; 
}

// Vision calls use inlineData directly from the browser
async function callGeminiVision(modelName, prompt, base64Data, 
                                 mimeType, generationConfig, 
                                 systemInstruction) {
    const payload = {
        contents: [{
            role: "user",
            parts: [
                { text: prompt },
                { inlineData: { mimeType, data: base64Data } }
            ]
        }],
        generationConfig
    };

    if (systemInstruction) {
        payload.systemInstruction = { 
            parts: [{ text: systemInstruction }] 
        };
    }

    const response = await fetch(
        `https://generativelanguage.googleapis.com/v1beta/` +
        `models/${modelName}:generateContent?key=${apiKey}`,
        { method: "POST", body: JSON.stringify(payload) }
    );

    const data = await response.json();
    return data.candidates[0].content.parts[0].text;
}

This means hackathon judges can test every tool immediately without setting up a Python environment.

What I Learned: The "Instruction vs. Content" Paradox

This is the most important section of this post, because it documents a fundamental challenge that anyone building complex LLM pipelines will face.

The Problem: AI Echo

When I first connected Gemma 31B to the content pipeline, something bizarre happened. Instead of writing an article, the model would echo its own instructions back at me.

I'd send a detailed content blueprint with sections like:

ROLE: Senior Electrical Engineer
OBJECTIVE: Write a deep-dive on EVSE deployment
1. HARDWARE: Contrast L2 vs DCFC architectures
2. GRID: Detail DLM and Peak Shaving

And the model would output:

# OBJECTIVE: Write a Deep-Dive on EVSE Deployment

## Part 1: Hardware
The role of a Senior Electrical Engineer is to...
[proceeds to regurgitate the blueprint verbatim]

The model couldn't distinguish between "instructions about content" and "content itself."

What I Tried (And Why It Failed)

1. Lowering Temperature → Made the echo more deterministic. The model would consistently echo the exact same instructions, every time.

2. Raising Temperature → Broke the echo loop, but the model started hallucinating technical facts and deviating wildly from the blueprint.

3. Stronger System Prompts → Yelling "DO NOT WRITE OUTLINES" inside the system prompt failed completely. Negative constraints are notoriously weak when the context window is flooded with structural examples that look like outlines.

4. Regex Cleaning → Brittle and unscalable. Headers like [PART 1] mutate unpredictably across generations, making regex matching a game of whack-a-mole.

5. Chain-of-Thought Suppression → Disabling reasoning made the model write faster, but it lost the ability to synthesize the SEO gap data intelligently.

What Actually Worked: Architectural Isolation

I abandoned "prompt tweaking" entirely and moved to structural solutions:

Fix 1: Source Data Reframing

Instead of labeling the blueprint as instructions, I wrapped it as inert data:

[SOURCE DATA — DO NOT REPRODUCE]
{content_plan}
[END SOURCE DATA]

Using only the facts above, write a complete article.

This separates "what to do" from "what to know" in the prompt: the blueprint reads as reference material, and the only instruction is the single line that follows it.
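As a sketch, the wrapper is a small template function. `wrap_as_source_data` is a hypothetical name; the marker strings and the closing instruction are copied from the template above.

```python
# \u2014 is the dash character used in the opening marker above.
SOURCE_OPEN = "[SOURCE DATA \u2014 DO NOT REPRODUCE]"
SOURCE_CLOSE = "[END SOURCE DATA]"


def wrap_as_source_data(content_plan: str) -> str:
    """Present the blueprint as delimited reference data, instruction last."""
    return "\n".join([
        SOURCE_OPEN,
        content_plan.strip(),
        SOURCE_CLOSE,
        "",
        "Using only the facts above, write a complete article.",
    ])


prompt = wrap_as_source_data("1. HARDWARE: Contrast L2 vs DCFC architectures")
print(prompt)
```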

Fix 2: Response Prefilling (The Breakthrough)

This was the single most impactful technique. By placing the opening token of the response in the model's own turn, we put the model into a "continuation" state before it generates anything:

payload = {
    "contents": [
        {"role": "user", "parts": [{"text": prompt}]},
        {"role": "model", "parts": [{"text": "<h1>"}]}  # PREFILL
    ]
}

The model sees that it has "already started writing" with <h1> and continues from there. It never enters the instruction-echo phase because it's already past the point where echoing would occur.
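One practical detail with prefilling is that the anchor has to be re-attached to whatever comes back. The sketch below assumes the API returns only the text generated after the prefilled turn; that behavior and the helper names `build_prefilled_payload` and `stitch` are assumptions, so check them against a real response before relying on this.

```python
def build_prefilled_payload(prompt: str, anchor: str) -> dict:
    """Two-turn payload: the user prompt, then a model turn holding the anchor."""
    return {
        "contents": [
            {"role": "user", "parts": [{"text": prompt}]},
            {"role": "model", "parts": [{"text": anchor}]},
        ]
    }


def stitch(anchor: str, continuation: str) -> str:
    """Re-attach the anchor unless the model repeated it on its own."""
    trimmed = continuation.lstrip()
    if trimmed.startswith(anchor):
        return trimmed
    return anchor + continuation


payload = build_prefilled_payload("Write the article.", "<h1>")
print(stitch("<h1>", "EVSE Deployment</h1><p>...</p>"))
```

The duplicate check is only unambiguous for anchors like `<h1>`. With a Markdown anchor of `#`, a continuation that starts with `#` could be either a repeated anchor or the second character of a `##` heading.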

Fix 3: Echo Detection + Rescue Generation

Even with prefilling, edge cases exist. So I built a mathematical echo detector:

def _looks_like_echo(result, source):
    """Detect when the model returns the blueprint itself."""
    result_norm = _normalize_for_echo_check(result)
    source_norm = _normalize_for_echo_check(source)

    # Check if result starts with the first line of the source
    first_source_line = next(
        (l.strip() for l in source.splitlines() if l.strip()), ""
    )
    if first_source_line and result.strip().startswith(first_source_line):
        return True

    # Check for blueprint marker contamination
    markers = ["complete, full-length technical article", 
               "minimum 1000 words", "no meta-commentary"]
    marker_hits = sum(1 for m in markers if m in result_norm)
    if marker_hits >= 3:
        return True

    # Check for line-level echoing: at least 35% of the long source lines
    # (and never fewer than 4) appear in the output
    source_lines = [l.strip() for l in source.splitlines() if len(l.strip()) > 18]
    if source_lines:
        echoed = sum(1 for l in source_lines 
                     if _normalize_for_echo_check(l) in result_norm)
        if echoed >= max(4, int(len(source_lines) * 0.35)):
            return True

    return False

If an echo is detected, the system intercepts it and triggers a Rescue Prompt — a high-temperature, heavily compressed re-generation that bypasses the contaminated context.
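The detector relies on a normalization helper that isn't shown above, and the rescue step is only described in prose. Here is a minimal sketch of both under stated assumptions: the normalization rules, the "keep only bullet lines" compression, the temperature values, and the function names are all hypothetical stand-ins, not the project's implementation.

```python
import re
from typing import Callable


def normalize_for_echo_check(text: str) -> str:
    """Lowercase, drop HTML tags and Markdown markers, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text.lower())   # strip HTML tags
    text = re.sub(r"[#*_`>\[\]]", " ", text)        # strip Markdown markers
    return re.sub(r"\s+", " ", text).strip()


def generate_with_rescue(
    generate: Callable[[str, float], str],
    is_echo: Callable[[str, str], bool],
    plan: str,
    rescue_temperature: float = 1.0,
) -> str:
    """Generate once; if the output echoes the plan, retry on a compressed plan."""
    article = generate(plan, 0.7)
    if not is_echo(article, plan):
        return article
    # Rescue: keep only bullet facts so fewer instruction-like lines remain.
    compressed = "\n".join(
        line for line in plan.splitlines() if line.lstrip().startswith("-")
    )
    return generate(compressed or plan, rescue_temperature)
```

Passing `generate` and `is_echo` in as callables keeps the control flow testable without a live model; in the real pipeline they would wrap the Gemma call and `_looks_like_echo`.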

The Distribution Engine (Bonus Architecture)

While not directly Gemma-related, the distribution engine showcases how GemmaForge uses AI throughout the entire lifecycle:

Content Ingestion → 5 Quality Gates (parallel) → AI Asset Generation → 11-Platform Broadcast → Search Engine Indexing

Quality Gates run in parallel via asyncio.gather():

PageSpeed Insights (Performance, Accessibility, SEO, Best Practices)
Axe-Core (WCAG accessibility violations)
W3C HTML Validator
Liveness Check (HTTP 200 verification)
Broken Image Scanner
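A sketch of the parallel gate pattern, using two stand-in gates in place of the real checks. The gate bodies, the result shape, and `run_quality_gates` are assumptions; the point is that `asyncio.gather(..., return_exceptions=True)` lets one crashing gate be reported as a failure instead of cancelling the whole run.

```python
import asyncio


async def liveness_check(url: str) -> dict:
    await asyncio.sleep(0)  # stand-in for an HTTP request
    return {"gate": "liveness_check", "passed": True}


async def html_validator(url: str) -> dict:
    await asyncio.sleep(0)  # stand-in for a validator call that fails
    raise TimeoutError("validator timed out")


async def run_quality_gates(url: str, gates) -> list[dict]:
    """Run all gates concurrently; a crashing gate becomes a failed result."""
    results = await asyncio.gather(
        *(gate(url) for gate in gates), return_exceptions=True
    )
    report = []
    # gather() preserves input order, so results line up with gates.
    for gate, result in zip(gates, results):
        if isinstance(result, BaseException):
            report.append(
                {"gate": gate.__name__, "passed": False, "error": str(result)}
            )
        else:
            report.append(result)
    return report


report = asyncio.run(
    run_quality_gates("https://example.com", [liveness_check, html_validator])
)
print(report)
```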

AI Asset Generation uses Gemma to generate:

Dynamic engagement hooks (not generic — crafted from article content)
Platform-optimized tag sets
SERP-analyzed search queries

Broadcast fires concurrently to: Dev.to, Hashnode, LinkedIn, Discord, Bluesky, Mastodon, Telegram, Nostr, Paragraph, Tumblr — plus Google/Bing/Yandex indexing.

All with idempotent state management — if a platform fails, the retry logic resumes from the exact point of failure without re-running expensive AI calls.
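The resume behavior can be sketched as a small state file that records which platforms have already succeeded. This is an illustration of the pattern (atomic JSON write guarded by an `asyncio.Lock`), not the project's code: `BroadcastState`, `broadcast`, and the `publish` callable are hypothetical names.

```python
import asyncio
import json
import os
import tempfile


class BroadcastState:
    """Tracks which platforms already succeeded, persisted as JSON."""

    def __init__(self, path: str):
        self.path = path
        self._lock = asyncio.Lock()
        self.done: set[str] = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.done = set(json.load(fh))

    async def mark_done(self, platform: str) -> None:
        async with self._lock:
            self.done.add(platform)
            # Write to a temp file, then rename, so a crash mid-write
            # never leaves a half-written state file behind.
            directory = os.path.dirname(os.path.abspath(self.path))
            fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(sorted(self.done), fh)
            os.replace(tmp, self.path)


async def broadcast(state: BroadcastState, platforms, publish) -> list:
    """Publish only to platforms not yet marked done; failures stay pending."""
    async def one(platform):
        await publish(platform)
        await state.mark_done(platform)  # only reached on success

    pending = [p for p in platforms if p not in state.done]
    await asyncio.gather(*(one(p) for p in pending), return_exceptions=True)
    return pending
```

Running `broadcast` a second time against the same state file skips every platform that already succeeded, so only the failed ones are retried and the generated assets do not need to be recomputed.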

Tech Stack

Layer	Technology
Backend	FastAPI, Uvicorn, aiohttp, asyncio
AI Models	Gemma 4 (E2B, 26B, 31B Dense) via Google GenAI API
Vision	Gemma 26B native multimodal (inlineData)
Embeddings	ONNX Runtime + Nomic Embed v1.5 (local, INT8 quantized)
Frontend	Vanilla HTML/CSS/JS, Inter + JetBrains Mono
Streaming	Server-Sent Events (SSE)
Retry Logic	Tenacity with exponential backoff
HTML Parsing	Selectolax (Lexbor) + Trafilatura
Quality	Playwright + Axe-Core, PSI API
State	Atomic JSON persistence with asyncio.Lock

Final Thoughts

Building GemmaForge taught me that the real challenge with LLMs isn't getting them to generate text — it's getting them to generate the right text consistently at scale.

The Echo problem alone took me weeks to solve, and the solution wasn't better prompts — it was better architecture. Response prefilling, source data reframing, and mathematical echo detection are techniques that translate directly to any production LLM system.

Gemma 4's model family made this possible in a way that no single model could. The ability to route tasks by cognitive complexity — E2B for speed, 26B for reasoning and vision, 31B for long-form generation — is a paradigm I'll carry into every future AI project.

If you're building with Gemma, don't treat all tasks equally. Match the model to the task.

Built solo by Kaushik · Source Code · Powered by Gemma 4
