<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: prerak patel</title>
    <description>The latest articles on DEV Community by prerak patel (@prerak_patel_).</description>
    <link>https://dev.to/prerak_patel_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3918703%2F03babfbb-b708-497a-a5e3-d25a1bef3c3c.jpg</url>
      <title>DEV Community: prerak patel</title>
      <link>https://dev.to/prerak_patel_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prerak_patel_"/>
    <language>en</language>
    <item>
      <title>Before You Fine-Tune Gemma 4, Let a Bigger Gemma Teach Your Smaller One</title>
      <dc:creator>prerak patel</dc:creator>
      <pubDate>Wed, 13 May 2026 16:13:01 +0000</pubDate>
      <link>https://dev.to/prerak_patel_/before-you-fine-tune-gemma-4-let-a-bigger-gemma-teach-your-smaller-one-5a0d</link>
      <guid>https://dev.to/prerak_patel_/before-you-fine-tune-gemma-4-let-a-bigger-gemma-teach-your-smaller-one-5a0d</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I built a local vision project with Gemma 4 where a small model runs on an edge device and a bigger model runs on a stronger local machine. The small model is fast and private. The bigger model is slower, but better at careful reasoning.&lt;/p&gt;

&lt;p&gt;That setup taught me something useful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning should not be the first thing you reach for.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before collecting a dataset, launching a training job, or changing weights, try this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use a larger Gemma 4 model as a teacher to improve how you prompt and route a smaller Gemma 4 model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post walks through the pattern I used: prompt upskilling, escalation, and knowing when fine-tuning is actually worth it.&lt;/p&gt;




&lt;h2&gt;
  The Problem: Small Models Are Fast, But Sometimes Too Confident
&lt;/h2&gt;

&lt;p&gt;Small local models are exciting because they make edge AI feel practical. You can run inference close to the sensor, avoid sending every input over the network, and keep latency low.&lt;/p&gt;

&lt;p&gt;But when I tested Gemma 4 E2B on webcam frames, I ran into a familiar issue: the model often gave a confident answer even when the scene deserved a second look.&lt;/p&gt;

&lt;p&gt;For example, a simple edge loop might ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Describe this webcam frame.
Return:
- what you see
- whether anything safety-relevant is happening
- confidence from 0.0 to 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The small model can do this quickly. But self-reported confidence is not a perfect reliability signal. A model can say &lt;code&gt;CONFIDENCE: 1.0&lt;/code&gt; and still miss context, ambiguity, or safety relevance.&lt;/p&gt;

&lt;p&gt;That does not mean the small model is useless. It means the system around the model matters.&lt;/p&gt;
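&lt;p&gt;Part of that system is plain parsing: before any policy can act on the confidence, the loop has to pull the trailing &lt;code&gt;CONFIDENCE:&lt;/code&gt; line out of free-form text. A minimal sketch of that step (the regex and the cautious fallback are my choices, not necessarily the exact parser in the repo):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def parse_confidence(answer: str) -&amp;gt; float:
    # Look for a trailing "CONFIDENCE: 0.87"-style line in the answer.
    match = re.search(r"CONFIDENCE:\s*([01](?:\.\d+)?)", answer)
    # If the model ignored the format, treat the frame as uncertain
    # so it gets escalated rather than silently trusted.
    return float(match.group(1)) if match else 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;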




&lt;h2&gt;
  The Pattern: Student, Teacher, and Escalation
&lt;/h2&gt;

&lt;p&gt;The architecture I used has two roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Student model&lt;/strong&gt;: Gemma 4 E2B on the edge device&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teacher model&lt;/strong&gt;: a larger Gemma 4 model on a Mac Mini&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The student handles routine inputs locally. The teacher helps in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It reviews harder or safety-relevant cases.&lt;/li&gt;
&lt;li&gt;It helps write a better system prompt for the student.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In other words, the bigger model is not just a fallback. It is also a coach.&lt;/p&gt;




&lt;h2&gt;
  Step 1: Make the Small Model's Job Very Specific
&lt;/h2&gt;

&lt;p&gt;The first improvement is not training. It is task clarity.&lt;/p&gt;

&lt;p&gt;Instead of giving the edge model a generic instruction like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Describe the image.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I give it a narrow role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an edge vision assistant running on a local device.
Describe people, objects, and safety-relevant activity in the webcam frame.
Prefer concise factual observations.
End with CONFIDENCE: &amp;lt;number from 0.0 to 1.0&amp;gt;.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because small models benefit from a tight frame. A good prompt reduces the number of decisions the model has to invent on its own.&lt;/p&gt;

&lt;p&gt;But writing that prompt by hand is only the start.&lt;/p&gt;




&lt;h2&gt;
  Step 2: Ask the Bigger Model to Generate Better Prompts
&lt;/h2&gt;

&lt;p&gt;The teacher model can produce several candidate system prompts for the student.&lt;/p&gt;

&lt;p&gt;Here is the idea. The &lt;code&gt;extract_json&lt;/code&gt; helper in this sketch is a minimal stand-in for whatever cleanup you do before parsing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_candidate_skills&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Write &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; system prompts for a small edge vision model.

    Task:
    - identify people and objects in webcam frames
    - call out safety-relevant activity
    - stay concise
    - end with CONFIDENCE: &amp;lt;0.0 to 1.0&amp;gt;

    Return a JSON array of strings.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:26b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The larger model is better at writing instructions that anticipate failure modes: ambiguous scenes, safety language, object focus, and concise formatting.&lt;/p&gt;

&lt;p&gt;That gives you a few candidate prompts. The next step is to score them.&lt;/p&gt;




&lt;h2&gt;
  Step 3: Score Prompts Against Real Examples
&lt;/h2&gt;

&lt;p&gt;Prompt upskilling only works if you test the prompts.&lt;/p&gt;

&lt;p&gt;I used a tiny evaluation set with examples like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;EVAL_CASES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A person is holding a lighter with a visible flame.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ideal_keywords&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flame&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;safety&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A laptop and coffee mug are on a desk.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ideal_keywords&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laptop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no safety&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then each candidate prompt is tested with the smaller model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_skill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;EVAL_CASES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:e2b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ideal_keywords&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keyword check is not a perfect benchmark, but it changes how you decide. You are no longer choosing a prompt by vibes. You are choosing the prompt that performs best on examples that look like your actual task.&lt;/p&gt;

&lt;p&gt;The winning prompt gets saved as &lt;code&gt;skill.txt&lt;/code&gt;, and the edge device loads it at startup.&lt;/p&gt;
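&lt;p&gt;Concretely, the selection step is a few lines on top of &lt;code&gt;score_skill&lt;/code&gt;. Only the &lt;code&gt;skill.txt&lt;/code&gt; name comes from the project; the wiring here is a sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

# On the Mac Mini: generate candidates, keep the best-scoring one.
candidates = generate_candidate_skills(n=4)
best = max(candidates, key=score_skill)
Path("skill.txt").write_text(best)

# On the edge device, at startup: load the prompt the teacher picked.
SYSTEM_PROMPT = Path("skill.txt").read_text()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;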




&lt;h2&gt;
  Step 4: Do Not Trust Confidence Alone
&lt;/h2&gt;

&lt;p&gt;My first version escalated only when confidence was below a threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;ESCALATE_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;escalate_to_mac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That sounds reasonable until the model is confidently wrong or confidently incomplete.&lt;/p&gt;

&lt;p&gt;The better policy uses multiple signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;escalation_reason&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;ESCALATE_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low confidence (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SAFETY_KEYWORDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;safety keyword: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;frame_count&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;AUDIT_EVERY_N_FRAMES&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;periodic audit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
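&lt;p&gt;In the frame loop, the whole policy then collapses to one call, reusing &lt;code&gt;escalate_to_mac&lt;/code&gt; from the first snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;reason = escalation_reason(answer, confidence, frame_count)
if reason is not None:
    # Log why this frame left the device; useful when tuning the policy.
    print(f"escalating frame {frame_count}: {reason}")
    escalate_to_mac(frame)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;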



&lt;p&gt;This changed how I think about local AI. The question is not just “which model is best?” The better question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What policy decides when a small model is enough?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For edge systems, that policy is part of the product.&lt;/p&gt;




&lt;h2&gt;
  When Should You Actually Fine-Tune?
&lt;/h2&gt;

&lt;p&gt;Prompt upskilling is cheap and fast, but it does not replace fine-tuning.&lt;/p&gt;

&lt;p&gt;I would start with prompt upskilling when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are still exploring the task.&lt;/li&gt;
&lt;li&gt;You have fewer than 100 labeled examples.&lt;/li&gt;
&lt;li&gt;The model mostly knows the domain but needs better instructions.&lt;/li&gt;
&lt;li&gt;You need a quick improvement without training infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would consider fine-tuning when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a real dataset.&lt;/li&gt;
&lt;li&gt;You need consistent formatting across many edge cases.&lt;/li&gt;
&lt;li&gt;The model lacks domain-specific vocabulary.&lt;/li&gt;
&lt;li&gt;Prompting and routing are no longer enough.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fine-tuning is powerful, but it is not free. It adds data work, training time, evaluation work, and deployment complexity. Prompt upskilling gives you a strong baseline before you pay that cost.&lt;/p&gt;




&lt;h2&gt;
  Why Gemma 4 Was a Good Fit
&lt;/h2&gt;

&lt;p&gt;Gemma 4 was useful here because the model family gives developers room to design systems, not just prompts.&lt;/p&gt;

&lt;p&gt;The small model can run close to the data source, which is ideal for privacy and responsiveness. The larger model can sit nearby on stronger local hardware and handle harder reasoning. That creates a practical local workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;edge device -&amp;gt; quick local answer -&amp;gt; escalation policy -&amp;gt; stronger local review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
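&lt;p&gt;Put together as code, that workflow is a short loop. This sketch reuses &lt;code&gt;parse_confidence&lt;/code&gt;, &lt;code&gt;escalation_reason&lt;/code&gt;, and &lt;code&gt;escalate_to_mac&lt;/code&gt; from earlier; the OpenCV capture and the two-second cadence are assumptions, not the repo's exact settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import cv2      # pip install opencv-python
import ollama

def run_edge_loop(system_prompt: str) -&amp;gt; None:
    cam = cv2.VideoCapture(0)
    frame_count = 0
    while True:
        ok, frame = cam.read()
        if not ok:
            continue
        cv2.imwrite("frame.jpg", frame)  # save so the model can read it
        response = ollama.chat(
            model="gemma4:e2b",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user",
                 "content": "Describe this webcam frame.",
                 "images": ["frame.jpg"]},
            ],
        )
        answer = response["message"]["content"]
        confidence = parse_confidence(answer)
        if escalation_reason(answer, confidence, frame_count) is not None:
            escalate_to_mac(frame)  # stronger local review on the Mac Mini
        frame_count += 1
        time.sleep(2.0)  # assumed cadence; tune for your device
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;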



&lt;p&gt;That pattern is useful beyond webcam demos. It applies to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;home and small-office monitoring&lt;/li&gt;
&lt;li&gt;workshop safety&lt;/li&gt;
&lt;li&gt;accessibility tools&lt;/li&gt;
&lt;li&gt;retail or front-desk awareness&lt;/li&gt;
&lt;li&gt;local-first AI prototypes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important part is that not every input needs the same amount of intelligence. Gemma 4 lets you design for that.&lt;/p&gt;




&lt;h2&gt;
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;The biggest lesson I learned is that model orchestration can matter as much as model size.&lt;/p&gt;

&lt;p&gt;A small model with a good prompt, clear task boundaries, and a smart escalation policy can be much more useful than a small model running alone. A larger model can improve the system without handling every request: it can review difficult cases, generate better prompts, and help you discover where the smaller model fails.&lt;/p&gt;

&lt;p&gt;So before you fine-tune Gemma 4, try this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Give the small model a narrow job.&lt;/li&gt;
&lt;li&gt;Ask a larger Gemma 4 model to generate candidate prompts.&lt;/li&gt;
&lt;li&gt;Score those prompts on realistic examples.&lt;/li&gt;
&lt;li&gt;Add an escalation policy that does not rely on confidence alone.&lt;/li&gt;
&lt;li&gt;Fine-tune only after you know prompting and routing are not enough.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is the practical path I would recommend to anyone building local AI with Gemma 4.&lt;/p&gt;

&lt;p&gt;Full project code: &lt;a href="https://github.com/Prerak1520/gemmaedge-hub" rel="noopener noreferrer"&gt;github.com/Prerak1520/gemmaedge-hub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Local AI Vision System That Knows When to Ask a Bigger Gemma 4 Model for Help</title>
      <dc:creator>prerak patel</dc:creator>
      <pubDate>Wed, 13 May 2026 14:40:34 +0000</pubDate>
      <link>https://dev.to/prerak_patel_/i-built-a-local-ai-vision-system-that-knows-when-to-ask-a-bigger-gemma-4-model-for-help-1lgj</link>
      <guid>https://dev.to/prerak_patel_/i-built-a-local-ai-vision-system-that-knows-when-to-ask-a-bigger-gemma-4-model-for-help-1lgj</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;GemmaEdge Hub&lt;/strong&gt;, a two-device local AI vision system that keeps routine webcam analysis on an edge device and escalates harder cases to a stronger local machine.&lt;/p&gt;

&lt;p&gt;The edge device runs &lt;strong&gt;Gemma 4 E2B&lt;/strong&gt; locally for fast, private inference. When a frame is uncertain, safety-relevant, or due for a periodic audit, it sends that frame to a Mac Mini for deeper analysis. The Mac Mini also hosts a live dashboard showing the edge answer, escalated answer, confidence values, latency, and recent frames.&lt;/p&gt;

&lt;p&gt;The core idea is simple: use the small model for the common path, and only spend bigger-model compute when the situation deserves it.&lt;/p&gt;

&lt;p&gt;This architecture is useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Home or small-office monitoring, where ordinary frames stay local but possible smoke, fire, injury, or unusual activity gets reviewed.&lt;/li&gt;
&lt;li&gt;Workshop and lab safety, where an edge device can watch for risky visual cues near equipment without sending every frame across the network.&lt;/li&gt;
&lt;li&gt;Accessibility assistance, where quick local scene descriptions can be escalated when a scene is ambiguous or safety-related.&lt;/li&gt;
&lt;li&gt;Retail or front-desk awareness, where routine activity can be summarized locally and unusual situations can be logged for review.&lt;/li&gt;
&lt;li&gt;Edge AI prototyping, because the project makes it easy to experiment with model routing, escalation policies, and prompt-based upskilling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  Demo
&lt;/h2&gt;

&lt;p&gt;The live demo runs across two Macs on the same local network:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The MacBook Air captures webcam frames.&lt;/li&gt;
&lt;li&gt;Gemma 4 E2B gives a fast local answer with a confidence score.&lt;/li&gt;
&lt;li&gt;Routine frames stay on the edge device.&lt;/li&gt;
&lt;li&gt;Uncertain, safety-relevant, or audited frames are escalated.&lt;/li&gt;
&lt;li&gt;The Mac Mini analyzes the escalated frame and updates the dashboard in real time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Dashboard during the demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9e1xj14atji8vr8thxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9e1xj14atji8vr8thxx.png" alt="Terminal output from the GemmaEdge Hub edge device showing webcam frames being captured, analyzed locally with Gemma 4 E2B, and selectively escalated to the Mac Mini. The logs show local confidence scores, periodic audits, safety keyword escalation for a visible flame, and stronger model responses returned from the Mac server." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfvnwdfvnrtlksf800qv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfvnwdfvnrtlksf800qv.png" alt="Terminal output from the Mac Mini server running the GemmaEdge Hub FastAPI dashboard. The logs show the server starting successfully on localhost:8000, repeated dashboard status requests, and an incoming escalated vision request from the edge device." width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cv9ty7q04oqh3ic3lm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1cv9ty7q04oqh3ic3lm1.png" alt="GemmaEdge Hub web dashboard in a browser showing the escalation log. A table displays recent webcam frames, edge model answers, edge confidence, Mac Mini model answers, Mac confidence, and latency for each escalated request. One highlighted row shows a safety-related flame detection escalated for deeper analysis." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  Code
&lt;/h2&gt;

&lt;p&gt;Repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Prerak1520/gemmaedge-hub" rel="noopener noreferrer"&gt;https://github.com/Prerak1520/gemmaedge-hub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Main files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;air/sensor.py&lt;/code&gt;: webcam capture, local inference, and escalation decisions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;air/client.py&lt;/code&gt;: HTTP client for sending escalations to the Mac Mini&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mac/server.py&lt;/code&gt;: FastAPI server, stronger-model inference, and live dashboard&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mac/upskill_train.py&lt;/code&gt;: teacher-student prompt optimization&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shared/protocol.py&lt;/code&gt;: shared request/response schema (sketched below)&lt;/li&gt;
&lt;/ul&gt;
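&lt;p&gt;To give a flavor of the escalation path, here is a compressed sketch of the &lt;code&gt;shared/protocol.py&lt;/code&gt; schema and the &lt;code&gt;mac/server.py&lt;/code&gt; endpoint. The field names and the &lt;code&gt;/escalate&lt;/code&gt; route are illustrative, not copied from the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import base64
import time

import ollama
from fastapi import FastAPI
from pydantic import BaseModel

# shared/protocol.py (sketch): the schema both devices agree on.
class EscalationRequest(BaseModel):
    frame_b64: str          # JPEG frame, base64-encoded
    edge_answer: str        # what Gemma 4 E2B said locally
    edge_confidence: float  # self-reported, 0.0 to 1.0
    reason: str             # why the edge device escalated

class EscalationResponse(BaseModel):
    answer: str             # the stronger model's review
    latency_ms: float

# mac/server.py (sketch): FastAPI re-runs the frame through the
# larger local Gemma 4 model and reports how long it took.
app = FastAPI()

@app.post("/escalate", response_model=EscalationResponse)
def escalate(req: EscalationRequest) -&amp;gt; EscalationResponse:
    start = time.monotonic()
    response = ollama.chat(
        model="gemma4:26b",
        messages=[{
            "role": "user",
            "content": f"Review this escalated frame. The edge model said: {req.edge_answer}",
            "images": [base64.b64decode(req.frame_b64)],
        }],
    )
    return EscalationResponse(
        answer=response["message"]["content"],
        latency_ms=(time.monotonic() - start) * 1000,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;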

&lt;h2&gt;
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;I chose &lt;strong&gt;Gemma 4 E2B&lt;/strong&gt; for the edge device because it is small enough to run quickly on local hardware while keeping routine camera frames private. That made it the right fit for an edge-first vision workflow.&lt;/p&gt;

&lt;p&gt;Gemma 4 powers the main loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The edge model describes each webcam frame.&lt;/li&gt;
&lt;li&gt;The system extracts a confidence signal.&lt;/li&gt;
&lt;li&gt;Escalation logic decides whether the local answer is enough.&lt;/li&gt;
&lt;li&gt;Safety keywords and periodic audits catch overconfident answers.&lt;/li&gt;
&lt;li&gt;A stronger local Gemma model can review harder cases on the Mac Mini.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One important lesson was that self-reported confidence alone is not enough. During testing, the small model often returned high confidence even when the answer still deserved review. I updated the system so escalation considers low confidence, safety-relevant keywords, and periodic audits of overconfident answers.&lt;/p&gt;

&lt;p&gt;I also added a teacher-student upskilling step. The Mac Mini generates and scores improved system prompts for the smaller edge model, then the winning prompt is copied back to the edge device as &lt;code&gt;skill.txt&lt;/code&gt;. This improves the edge model's behavior without fine-tuning weights.&lt;/p&gt;

&lt;h2&gt;
  Why This Fits the Build Criteria
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Intentional and effective use of Gemma 4&lt;/strong&gt;: Gemma 4 is central to the system. E2B handles fast local inference where privacy and responsiveness matter most, while escalation gives harder cases more reasoning power.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical implementation and code quality&lt;/strong&gt;: The project includes separate edge and server modules, shared Pydantic protocol models, FastAPI escalation, configurable audit behavior, safer dashboard rendering, and clear setup docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creativity and originality&lt;/strong&gt;: Instead of building a single-model demo, this treats local AI like a small distributed system with routing, auditing, and teacher-student prompt improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usability and user experience&lt;/strong&gt;: The dashboard makes the system understandable in real time by showing local answers, escalated answers, confidence, latency, and recent frames.&lt;/p&gt;

&lt;h2&gt;
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The biggest design lesson was that model orchestration matters as much as model choice. A small local model is great for privacy and responsiveness, but it needs a good policy for knowing when to ask for help. A larger local model is powerful, but it is too slow and expensive to run on every frame.&lt;/p&gt;

&lt;p&gt;GemmaEdge Hub combines both: private edge inference by default, stronger local reasoning when needed, and a dashboard that makes the escalation path visible.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
  </channel>
</rss>
