Before You Fine-Tune Gemma 4, Let a Bigger Gemma Teach Your Smaller One


This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I built a local vision project with Gemma 4 where a small model runs on an edge device and a bigger model runs on a stronger local machine. The small model is fast and private. The bigger model is slower, but better at careful reasoning.

That setup taught me something useful:

Fine-tuning should not be the first thing you reach for.

Before collecting a dataset, launching a training job, or changing weights, try this:

Use a larger Gemma 4 model as a teacher to improve how you prompt and route a smaller Gemma 4 model.

This post walks through the pattern I used: prompt upskilling, escalation, and knowing when fine-tuning is actually worth it.


The Problem: Small Models Are Fast, But Sometimes Too Confident

Small local models are exciting because they make edge AI feel practical. You can run inference close to the sensor, avoid sending every input over the network, and keep latency low.

But when I tested Gemma 4 E2B on webcam frames, I ran into a familiar issue: the model often gave a confident answer even when the scene deserved a second look.

For example, a simple edge loop might ask:

```
Describe this webcam frame.
Return:
- what you see
- whether anything safety-relevant is happening
- confidence from 0.0 to 1.0
```

The small model can do this quickly. But self-reported confidence is not a perfect reliability signal. A model can say CONFIDENCE: 1.0 and still miss context, ambiguity, or safety relevance.
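For reference, pulling that self-reported number out of the answer is a one-liner. This regex helper is my own sketch, not something the Gemma or Ollama APIs provide:

```python
import re

def parse_confidence(answer: str) -> float:
    # Read the trailing "CONFIDENCE: <number>" line; treat a missing
    # confidence as 0.0 so the escalation policy errs toward review.
    match = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", answer)
    return float(match.group(1)) if match else 0.0
```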

That does not mean the small model is useless. It means the system around the model matters.


The Pattern: Student, Teacher, and Escalation

The architecture I used has two roles:

  • Student model: Gemma 4 E2B on the edge device
  • Teacher model: a larger Gemma 4 model on a Mac Mini

The student handles routine inputs locally. The teacher helps in two ways:

  1. It reviews harder or safety-relevant cases.
  2. It helps write a better system prompt for the student.

In other words, the bigger model is not just a fallback. It is also a coach.
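Concretely, the two roles are just two Ollama model tags. These match the tags used in the code later in this post; adjust them for your own hardware:

```python
# The two roles in this setup, as Ollama model tags.
STUDENT_MODEL = "gemma4:e2b"   # fast, runs on the edge device
TEACHER_MODEL = "gemma4:26b"   # slower, runs on the Mac Mini
```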


Step 1: Make the Small Model's Job Very Specific

The first improvement is not training. It is task clarity.

Instead of giving the edge model a generic instruction like:

```
Describe the image.
```

I give it a narrow role:

```
You are an edge vision assistant running on a local device.
Describe people, objects, and safety-relevant activity in the webcam frame.
Prefer concise factual observations.
End with CONFIDENCE: <number from 0.0 to 1.0>.
```

This matters because small models benefit from a tight frame. A good prompt reduces the number of decisions the model has to invent on its own.

But writing that prompt by hand is only the start.


Step 2: Ask the Bigger Model to Generate Better Prompts

The teacher model can produce several candidate system prompts for the student.

Here is the idea:

```python
import json

import ollama

def extract_json(text: str) -> str:
    # Minimal helper: strip optional markdown fences so json.loads
    # sees a bare JSON array.
    text = text.strip()
    if text.startswith("```"):
        text = text.split("```")[1].removeprefix("json").strip()
    return text

def generate_candidate_skills(n: int = 4) -> list[str]:
    prompt = f"""
    Write {n} system prompts for a small edge vision model.

    Task:
    - identify people and objects in webcam frames
    - call out safety-relevant activity
    - stay concise
    - end with CONFIDENCE: <0.0 to 1.0>

    Return a JSON array of strings.
    """

    # The teacher model drafts the candidate prompts.
    response = ollama.chat(
        model="gemma4:26b",
        messages=[{"role": "user", "content": prompt}],
    )

    return json.loads(extract_json(response["message"]["content"]))
```

The larger model is better at writing instructions that anticipate failure modes: ambiguous scenes, safety language, object focus, and concise formatting.

That gives you a few candidate prompts. The next step is to score them.


Step 3: Score Prompts Against Real Examples

Prompt upskilling only works if you test the prompts.

I used a tiny evaluation set with examples like:

```python
EVAL_CASES = [
    {
        "user": "A person is holding a lighter with a visible flame.",
        "ideal_keywords": ["person", "flame", "safety"],
    },
    {
        "user": "A laptop and coffee mug are on a desk.",
        "ideal_keywords": ["laptop", "mug", "no safety"],
    },
]
```

Then each candidate prompt is tested with the smaller model:

```python
def score_skill(skill: str) -> float:
    # Fraction of ideal keywords the student's answers contain.
    hits, total = 0, 0

    for case in EVAL_CASES:
        response = ollama.chat(
            model="gemma4:e2b",
            messages=[
                {"role": "system", "content": skill},
                {"role": "user", "content": case["user"]},
            ],
        )

        answer = response["message"]["content"].lower()

        for keyword in case["ideal_keywords"]:
            total += 1
            if keyword in answer:
                hits += 1

    return hits / total
```

This is not a perfect benchmark, but it is incredibly useful. You are no longer choosing a prompt by vibes. You are choosing the prompt that performs best on examples that look like your actual task.

The winning prompt gets saved as skill.txt, and the edge device loads it at startup.
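Gluing Steps 2 and 3 together takes only a few lines. Here is a minimal sketch, assuming the two functions above:

```python
def upskill() -> str:
    # Teacher drafts candidates, the student is scored on each,
    # and the winner is persisted for the edge device to load.
    candidates = generate_candidate_skills()
    best = max(candidates, key=score_skill)
    with open("skill.txt", "w") as f:
        f.write(best)
    return best
```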


Step 4: Do Not Trust Confidence Alone

My first version escalated only when confidence was below a threshold:

```python
if confidence < ESCALATE_THRESHOLD:
    escalate_to_mac(frame)
```

That sounds reasonable until the model is confidently wrong or confidently incomplete.

The better policy uses multiple signals:

```python
# Example policy knobs (values here are illustrative; tune per deployment).
ESCALATE_THRESHOLD = 0.6
SAFETY_KEYWORDS = ["flame", "fire", "smoke", "fall"]
AUDIT_EVERY_N_FRAMES = 50

def escalation_reason(answer: str, confidence: float, frame_count: int) -> str | None:
    # Signal 1: the student admits uncertainty.
    if confidence < ESCALATE_THRESHOLD:
        return f"low confidence ({confidence:.2f})"

    # Signal 2: the answer mentions anything safety-relevant.
    for keyword in SAFETY_KEYWORDS:
        if keyword in answer.lower():
            return f"safety keyword: {keyword}"

    # Signal 3: periodically audit even confident answers.
    if frame_count % AUDIT_EVERY_N_FRAMES == 0:
        return "periodic audit"

    return None
```

This changed how I think about local AI. The question is not just β€œwhich model is best?” The better question is:

What policy decides when a small model is enough?

For edge systems, that policy is part of the product.
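To make that concrete, here is roughly how the pieces wire together in the edge loop. describe_frame is a hypothetical stand-in for the student call, and parse_confidence is the helper sketched earlier:

```python
def run_edge_loop(frames) -> None:
    # Illustrative glue; describe_frame is a hypothetical stand-in for
    # "send the frame to the student model with the saved system prompt".
    with open("skill.txt") as f:
        skill = f.read()

    for frame_count, frame in enumerate(frames, start=1):
        answer = describe_frame(frame, system_prompt=skill)
        confidence = parse_confidence(answer)
        reason = escalation_reason(answer, confidence, frame_count)
        if reason is not None:
            print(f"escalating: {reason}")
            escalate_to_mac(frame)
```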


When Should You Actually Fine-Tune?

Prompt upskilling is cheap and fast, but it does not replace fine-tuning.

I would start with prompt upskilling when:

  • You are still exploring the task.
  • You have fewer than 100 labeled examples.
  • The model mostly knows the domain but needs better instructions.
  • You need a quick improvement without training infrastructure.

I would consider fine-tuning when:

  • You have a real dataset.
  • You need consistent formatting across many edge cases.
  • The model lacks domain-specific vocabulary.
  • Prompting and routing are no longer enough.

Fine-tuning is powerful, but it is not free. It adds data work, training time, evaluation work, and deployment complexity. Prompt upskilling gives you a strong baseline before you pay that cost.


Why Gemma 4 Was a Good Fit

Gemma 4 was useful here because the model family gives developers room to design systems, not just prompts.

The small model can run close to the data source, which is ideal for privacy and responsiveness. The larger model can sit nearby on stronger local hardware and handle harder reasoning. That creates a practical local workflow:

```
edge device -> quick local answer -> escalation policy -> stronger local review
```
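The escalate_to_mac call from earlier is that "stronger local review" step. Here is a minimal sketch under the same Ollama setup; the hostname and review prompt wording are illustrative:

```python
import ollama

# Client pointed at the Ollama server on the Mac Mini (hostname illustrative).
teacher = ollama.Client(host="http://mac-mini.local:11434")

def escalate_to_mac(frame: bytes) -> str:
    # Ask the larger Gemma 4 model for a careful second look at the frame.
    response = teacher.chat(
        model="gemma4:26b",
        messages=[{
            "role": "user",
            "content": "Review this frame carefully. Is anything "
                       "safety-relevant happening? Answer concisely.",
            "images": [frame],
        }],
    )
    return response["message"]["content"]
```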

That pattern is useful beyond webcam demos. It applies to:

  • home and small-office monitoring
  • workshop safety
  • accessibility tools
  • retail or front-desk awareness
  • local-first AI prototypes

The important part is that not every input needs the same amount of intelligence. Gemma 4 lets you design for that.


The Takeaway

The biggest lesson I learned is that model orchestration can matter as much as model size.

A small model with a good prompt, clear task boundaries, and a smart escalation policy can be much more useful than a small model running alone. A larger model can improve the system without handling every request: it can review difficult cases, generate better prompts, and help you discover where the smaller model fails.

So before you fine-tune Gemma 4, try this:

  1. Give the small model a narrow job.
  2. Ask a larger Gemma 4 model to generate candidate prompts.
  3. Score those prompts on realistic examples.
  4. Add an escalation policy that does not rely on confidence alone.
  5. Fine-tune only after you know prompting and routing are not enough.

That is the practical path I would recommend to anyone building local AI with Gemma 4.

Full project code: github.com/Prerak1520/gemmaedge-hub
