<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ayush Shekhar</title>
    <description>The latest articles on DEV Community by Ayush Shekhar (@ayushh0110).</description>
    <link>https://dev.to/ayushh0110</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898782%2Fad3df0e7-5f5e-45e7-93c4-495fd4566407.jpeg</url>
      <title>DEV Community: Ayush Shekhar</title>
      <link>https://dev.to/ayushh0110</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ayushh0110"/>
    <language>en</language>
    <item>
      <title>I Built a Privacy-First Alternative to Microsoft Recall — Using All 3 Gemma 4 Modalities</title>
      <dc:creator>Ayush Shekhar</dc:creator>
      <pubDate>Sat, 23 May 2026 10:43:57 +0000</pubDate>
      <link>https://dev.to/ayushh0110/i-built-a-privacy-first-alternative-to-microsoft-recall-using-all-3-gemma-4-modalities-26bb</link>
      <guid>https://dev.to/ayushh0110/i-built-a-privacy-first-alternative-to-microsoft-recall-using-all-3-gemma-4-modalities-26bb</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;After Microsoft's Recall got torn apart for storing screenshots in plaintext with telemetry phoning home, I thought — the idea is genuinely useful. Knowing what you were doing 3 days ago, finding that one message someone sent you, remembering what you were working on before lunch. The execution was just terrible.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;ScreenMind&lt;/strong&gt; — an open-source screen activity journal that runs entirely on your machine. It captures your screen, analyzes every screenshot with Gemma 4's vision, and builds a searchable, chat-able AI memory of your digital life.&lt;/p&gt;

&lt;p&gt;The difference from Recall: nothing ever leaves your computer. No cloud. No telemetry. No "trust us with your screenshots." Everything — capture, analysis, search, chat — runs locally on a single GPU.&lt;/p&gt;

&lt;p&gt;But it's way more than a Recall clone. Here's what it actually does:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📸 Smart Capture&lt;/strong&gt; — Doesn't blindly screenshot every 30 seconds. Uses perceptual hashing to detect when your screen actually changes. Cursor blinks and clock updates get ignored. Real content changes get captured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 Gemma 4 Vision Analysis&lt;/strong&gt; — Every screenshot goes through Gemma 4 with OCR context. It figures out what app you're using, what you're doing, categorizes the activity, detects your mood, and writes a detailed scene description. Not just "user is in Chrome" — more like "user is reading a pull request review on GitHub for the auth-middleware refactor."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📐 Spatial Layout Detection&lt;/strong&gt; — OCR boxes get classified into screen regions (sidebar, chat area, toolbar, profile panel) using coordinate-based parsing. Text gets organized by section so when you search or chat, you get structured context — not a wall of raw OCR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔍 Hybrid Search&lt;/strong&gt; — Semantic search (MiniLM embeddings + cosine similarity) combined with FTS5 keyword search. Ask "debugging the auth module" and it finds screenshots by meaning, not just exact word matches. Results show OCR text highlighted directly on the screenshot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💬 Chat With Your Screen History&lt;/strong&gt; — This is the feature people love most. Ask "what should I reply to that Discord message?" and it pulls up the relevant screenshot, reads the organized text, and answers. Ask "did I get any email from Zerodha?" and it finds your inbox screenshot and tells you. It's RAG over your actual life, not documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎙️ Voice Memos&lt;/strong&gt; — Hold Ctrl+Shift+V, speak, release. Gemma 4's native audio encoder transcribes it. A screenshot is captured alongside so you have visual context with every memo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎤 Meeting Transcription&lt;/strong&gt; — Auto-detects when you're in Zoom, Teams, Discord, or Meet. Records audio, transcribes in 15-second chunks using Gemma's audio encoder, then runs map-reduce summarization for long meetings. Outputs structured summaries with topics, decisions, and action items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🤖 Agent Platform&lt;/strong&gt; — This is the part I'm most proud of. You can build custom automations by writing a markdown file in plain English:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Daily Focus Report&lt;/span&gt;
&lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;every 6h&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;timeline, apps, mood&lt;/span&gt;
&lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local, obsidian&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

Analyze my screen activity and generate a focus report:
&lt;span class="p"&gt;-&lt;/span&gt; How many hours of deep work vs shallow work?
&lt;span class="p"&gt;-&lt;/span&gt; What were my main distractions?
&lt;span class="p"&gt;-&lt;/span&gt; Give me a focus score out of 10.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Drop it in a folder. It runs automatically. Gemma processes your prompt with injected screen data. No code needed. For developers who want more control, there's a full Python SDK with state persistence and GPU-safe LLM access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔌 MCP Server&lt;/strong&gt; — Exposes your screen history to Claude Desktop, Cursor, and VS Code via the Model Context Protocol. It ships 8 tools, including search, recent activity, time-range queries, daily summaries, meeting transcripts, and instant capture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔐 Privacy&lt;/strong&gt; — Auto-redacts credit cards, SSNs, API keys, and passwords from captured text before storage. Optional AES encryption at rest. Dashboard PIN lock. App blocklist. Incognito mode.&lt;/p&gt;
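
&lt;p&gt;A simplified sketch of the redaction idea is below. The two patterns, the &lt;code&gt;redact&lt;/code&gt; helper, and the placeholder labels are illustrative assumptions, not the exact rules in the repo; real redaction also needs API key and password patterns, plus validation such as a Luhn check for card numbers.&lt;/p&gt;

```python
import re

# Illustrative patterns only; real redaction needs more cases and validation.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    """Replace SSN-like and card-like digit runs with placeholders."""
    text = SSN_RE.sub("[REDACTED_SSN]", text)
    return CARD_RE.sub("[REDACTED_CARD]", text)


print(redact("card 4111 1111 1111 1111 ssn 123-45-6789"))
```

&lt;p&gt;Running redaction before anything is written to disk means the stored OCR text never contains the raw values.&lt;/p&gt;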

&lt;p&gt;&lt;strong&gt;📊 Analytics&lt;/strong&gt; — Category breakdown, top apps, hourly heatmap, meeting stats. See where your time actually goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⏪ Day Rewind&lt;/strong&gt; — Timelapse playback of your entire day with play/pause/scrub/speed controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔗 Integrations&lt;/strong&gt; — Obsidian vault sync, Notion database export, webhooks (Slack, Discord, IFTTT) with HMAC signing and auto-retry.&lt;/p&gt;
&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/CxkkBT_EvPw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;
&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ayushh0110" rel="noopener noreferrer"&gt;
        ayushh0110
      &lt;/a&gt; / &lt;a href="https://github.com/ayushh0110/ScreenMind" rel="noopener noreferrer"&gt;
        ScreenMind
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      🧠 AI-powered screen memory — captures, analyzes, and lets you search/chat your screen history. Powered by Gemma 4 E2B. 100% local, 100% private.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;br&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/853212f5e82c3a9cb6462c61e4761aa9e91f9b57c8d39e6375ae61ea2c8bd8ff/68747470733a2f2f696d672e736869656c64732e696f2f62616467652ff09fa7a05f53637265656e4d696e642d596f75725f41495f4d656d6f72792d3842354346363f7374796c653d666f722d7468652d6261646765266c6162656c436f6c6f723d306130653161"&gt;&lt;img src="https://camo.githubusercontent.com/853212f5e82c3a9cb6462c61e4761aa9e91f9b57c8d39e6375ae61ea2c8bd8ff/68747470733a2f2f696d672e736869656c64732e696f2f62616467652ff09fa7a05f53637265656e4d696e642d596f75725f41495f4d656d6f72792d3842354346363f7374796c653d666f722d7468652d6261646765266c6162656c436f6c6f723d306130653161" alt="ScreenMind" height="40"&gt;&lt;/a&gt;

&lt;p&gt;&lt;strong&gt;Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;100% local. 100% private. Zero cloud dependencies.&lt;/strong&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href="https://python.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/b43cbee196e104f1912e1e1f08745aac72ee904fe95aa463d7b246cc2ccfe691/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31302b2d3337373641423f7374796c653d666c61742d737175617265266c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465" alt="Python 3.10+"&gt;&lt;/a&gt;
&lt;a href="https://ai.google.dev/gemma" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/844da5300027994562f93c3bb8374b9e8f4dead8dcbc4e86665b35e8ecc81aac/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f47656d6d615f342d4532425f566973696f6e2b417564696f2d3842354346363f7374796c653d666c61742d737175617265266c6f676f3d676f6f676c65266c6f676f436f6c6f723d7768697465" alt="Gemma 4 E2B"&gt;&lt;/a&gt;
&lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/93311f1629279cbfa441a20d30148fe3a19e03c32d9dafddda1d652cf3e2d189/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6c616d612e6370702d4c6f63616c5f496e666572656e63652d3333333f7374796c653d666c61742d737175617265" alt="llama.cpp"&gt;&lt;/a&gt;
&lt;a href="https://github.com/ayushh0110/ScreenMind/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e5a871941420f5f2b2a1031c619420263e0f19160cc870adc79a65940bd828f4/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d3130423938313f7374796c653d666c61742d737175617265" alt="License MIT"&gt;&lt;/a&gt;
&lt;a href="https://github.com/ayushh0110/ScreenMind/MCP_SETUP.md" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/1d0888d7bd62299ed228e7e2f4813cab76614ea70bde8f0f4efbb69744bd97b0/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4d43502d436c617564655f2537435f437572736f725f2537435f5653436f64652d4635394530423f7374796c653d666c61742d737175617265" alt="MCP Ready"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/ayushh0110/ScreenMind/docs/screenshots/agents.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fayushh0110%2FScreenMind%2FHEAD%2Fdocs%2Fscreenshots%2Fagents.png" alt="Timeline — AI-analyzed screen activity feed"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agents&lt;/th&gt;
&lt;th&gt;Chat with your memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/ayushh0110/ScreenMind/docs/screenshots/timeline.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fayushh0110%2FScreenMind%2FHEAD%2Fdocs%2Fscreenshots%2Ftimeline.png" alt="Agents"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/ayushh0110/ScreenMind/docs/screenshots/chat.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fayushh0110%2FScreenMind%2FHEAD%2Fdocs%2Fscreenshots%2Fchat.png" alt="Chat"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Microsoft showed the world wants screen-aware AI with Recall.&lt;/strong&gt; But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative — every screenshot analyzed, every insight generated, every search result — all computed locally using Gemma 4's multimodal capabilities.&lt;/p&gt;
&lt;p&gt;It's not just a screen recorder. It's an &lt;strong&gt;AI memory&lt;/strong&gt; you can talk to, search through, and build automations on top of.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;✨ Features&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;🧠 Core Intelligence&lt;/h3&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;📸 Smart Capture&lt;/strong&gt; — Content-change detection, not a fixed timer. Captures when your screen &lt;em&gt;actually&lt;/em&gt; changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔬 Gemma 4 Vision Analysis&lt;/strong&gt; — Every screenshot analyzed: app detection, activity categorization, mood…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ayushh0110/ScreenMind" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;I chose the &lt;strong&gt;Gemma 4 family&lt;/strong&gt; — and it's not a preference, it's an architectural requirement. E2B is the default for 4GB GPUs, E4B for users with more headroom. Let me explain why no other model family works here.&lt;/p&gt;

&lt;p&gt;ScreenMind runs continuously in the background. It needs to analyze a screenshot every 30-40 seconds, transcribe voice memos on demand, power a chat interface, and run agent prompts — all on a single consumer GPU. These constraints eliminate everything else:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Constraint&lt;/th&gt;
&lt;th&gt;What it eliminates&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Must run continuously in background on 4GB VRAM&lt;/td&gt;
&lt;td&gt;Rules out 12B+ models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Must understand screenshots natively&lt;/td&gt;
&lt;td&gt;Rules out text-only models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Must transcribe audio natively&lt;/td&gt;
&lt;td&gt;Rules out models without audio encoder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Must stay 100% local&lt;/td&gt;
&lt;td&gt;Rules out cloud APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Must be fast enough for 30-40s capture cycle&lt;/td&gt;
&lt;td&gt;Rules out slower models; E2B takes 12-76s per screenshot depending on mode&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On a 4GB GPU, Gemma 4 E2B is the only model I found that checks all five boxes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All three modalities in one product:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision&lt;/strong&gt; — Every screenshot gets sent to Gemma 4 with OCR text as context. The prompt asks for structured JSON: app name, activity category, summary, detailed context, mood, confidence, scene description, and layout regions. I built three analysis modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Fast (~12s)&lt;/em&gt; — uses a no-thinking prefill trick (pre-fill &lt;code&gt;&amp;lt;think&amp;gt;\n&amp;lt;/think&amp;gt;&lt;/code&gt; in the assistant message to skip reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Balanced (~40s)&lt;/em&gt; — natural thinking, analysis only&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Accurate (~76s)&lt;/em&gt; — thinking + spatial layout detection in one call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Audio&lt;/strong&gt; — Gemma 4 E2B has a native audio encoder. I use it for voice memo transcription and meeting transcription. No Whisper, no separate ASR model. One model handles everything. For meetings, audio is chunked into 15-second segments, each segment is transcribed by Gemma, and Gemma then summarizes the transcripts map-reduce style: partial summaries first, then one final summary.&lt;/p&gt;
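
&lt;p&gt;Roughly, the meeting flow looks like the sketch below. The &lt;code&gt;summarize&lt;/code&gt; callable stands in for a Gemma call, and the function names and the group size of 4 are assumptions made for this example rather than the project's actual code.&lt;/p&gt;

```python
from typing import Callable, List, Tuple


def chunk_bounds(duration_s: float, chunk_s: float = 15.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) windows of chunk_s seconds."""
    bounds = []
    start = 0.0
    while duration_s > start:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def map_reduce_summary(
    transcripts: List[str],
    summarize: Callable[[str], str],  # hypothetical stand-in for a Gemma call
    group: int = 4,
) -> str:
    """Map: summarize groups of chunk transcripts. Reduce: summarize the joined partials."""
    partials = [
        summarize(" ".join(transcripts[i:i + group]))
        for i in range(0, len(transcripts), group)
    ]
    return summarize(" ".join(partials))


print(chunk_bounds(40))
```

&lt;p&gt;Summarizing in groups keeps every prompt short, which matters when a long meeting would not fit in a single context window.&lt;/p&gt;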

&lt;p&gt;&lt;strong&gt;Reasoning&lt;/strong&gt; — Daily summaries use &lt;code&gt;think=True&lt;/code&gt; for deep reasoning over a day's activities. Chat uses Gemma to answer questions grounded in screen context. Agents feed screen data into Gemma prompts for custom analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance engineering around a single GPU:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since there's only one GPU slot, I built a priority system. Chat cancels in-flight analysis instantly (closes the HTTP client → llama-server frees the slot in &amp;lt;1s). The cancelled analysis gets re-queued at the front, not the back. Users never wait for background work to finish.&lt;/p&gt;
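
&lt;p&gt;The scheduling rule can be sketched with a double-ended queue. This is a minimal model of the behaviour described above; the &lt;code&gt;GpuScheduler&lt;/code&gt; class and its method names are hypothetical, and the real cancellation happens by closing the HTTP client rather than by flipping a field.&lt;/p&gt;

```python
from collections import deque
from typing import Deque, Optional


class GpuScheduler:
    """Single-slot scheduler: chat preempts analysis; preempted work returns to the front."""

    def __init__(self) -> None:
        self.pending: Deque[str] = deque()   # background analysis jobs, oldest first
        self.running: Optional[str] = None   # job currently holding the GPU slot

    def submit_analysis(self, job: str) -> None:
        self.pending.append(job)

    def start_next(self) -> Optional[str]:
        """Give the free slot to the oldest pending job, if any."""
        if self.running is None and self.pending:
            self.running = self.pending.popleft()
        return self.running

    def chat_request(self) -> None:
        """Cancel in-flight analysis and re-queue it at the front, not the back."""
        if self.running is not None:
            self.pending.appendleft(self.running)
            self.running = None


sched = GpuScheduler()
sched.submit_analysis("shot-1")
sched.submit_analysis("shot-2")
sched.start_next()      # shot-1 takes the slot
sched.chat_request()    # chat arrives: shot-1 is cancelled and re-queued first
print(list(sched.pending))
```

&lt;p&gt;Re-queueing at the front keeps screenshots in capture order even when chat interrupts often.&lt;/p&gt;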

&lt;p&gt;I also built a per-app pHash cache with three tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identical screens (diff ≤3): skip everything, copy from cache — 0ms&lt;/li&gt;
&lt;li&gt;Minor changes (diff 4-9): re-run OCR only, reuse Gemma analysis — 3-10s
&lt;/li&gt;
&lt;li&gt;Full change (diff 10+): run the complete pipeline — 12-76s&lt;/li&gt;
&lt;/ul&gt;
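
&lt;p&gt;The tier decision itself is tiny. The sketch below assumes "diff" means the Hamming distance between two integer-encoded perceptual hashes, which is the usual way pHashes are compared; the function names and tier labels are made up for illustration.&lt;/p&gt;

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def cache_tier(prev_hash: int, new_hash: int) -> str:
    """Map a pHash distance to one of the three tiers listed above."""
    diff = hamming(prev_hash, new_hash)
    if diff >= 10:
        return "full_pipeline"   # run OCR and Gemma again
    if diff >= 4:
        return "ocr_only"        # refresh OCR, reuse the cached Gemma analysis
    return "cache_hit"           # copy everything from the cache


print(cache_tier(0b0000, 0b0111))  # 3 differing bits
```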

&lt;p&gt;This cuts Gemma inference calls by 30-50% during typical usage. Combined with the three analysis modes, ScreenMind runs comfortably on my GTX 1650 with 4GB VRAM as a daily driver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The multi-model pipeline:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Screenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5
                                     ↑
                              OCR text fed as context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three models and a full-text index working together, with Gemma 4 as the brain. OCR extracts what's written. Gemma understands what you're doing. MiniLM enables semantic search. SQLite's FTS5 index handles instant keyword lookup. Each component does what it's best at.&lt;/p&gt;
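
&lt;p&gt;One common way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion. The sketch below shows that technique with plain Python lists; it is an illustration of the general idea, not necessarily how ScreenMind combines its two result sets, and the helper names and the constant &lt;code&gt;k=60&lt;/code&gt; are assumptions.&lt;/p&gt;

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_ranking(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Document ids ordered by similarity to the query embedding, best first."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


def rrf_merge(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


semantic = semantic_ranking([1.0, 0.0], {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.0, 1.0]})
keyword = ["b", "d"]  # e.g. ids returned by a keyword query
print(rrf_merge([semantic, keyword]))
```

&lt;p&gt;Rank fusion avoids having to put cosine scores and keyword scores on the same scale.&lt;/p&gt;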

&lt;p&gt;I've been using this daily for two weeks. The chat feature is genuinely addictive — being able to ask "what was I working on before lunch?" or "what did that email say?" and getting an actual answer from your own screen history changes how you think about your computer.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>From Heuristics to Fine-Tuning: Teaching a Model to Use Tools</title>
      <dc:creator>Ayush Shekhar</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:28:09 +0000</pubDate>
      <link>https://dev.to/ayushh0110/from-heuristics-to-fine-tuning-teaching-a-model-to-use-tools-1c9g</link>
      <guid>https://dev.to/ayushh0110/from-heuristics-to-fine-tuning-teaching-a-model-to-use-tools-1c9g</guid>
      <description>&lt;p&gt;&lt;em&gt;How I replaced 200 lines of regex with a fine-tuned 7B model — and why it was worth it.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I built an &lt;a href="https://github.com/ayushh0110/autonomous-agent" rel="noopener noreferrer"&gt;autonomous AI agent&lt;/a&gt; with 9 tools: web search, calculator, weather, Wikipedia, translation, and more. The first question every request must answer is deceptively simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Which tool should I use?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My first solution was a heuristic classifier — a function called &lt;code&gt;classify_query()&lt;/code&gt; that uses regex patterns to detect intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 200+ lines of patterns like this:
&lt;/span&gt;&lt;span class="n"&gt;_SEARCH_INDICATORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(latest|current|news|today|recent|who won|score|price|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stock|update|happening|trending|release|launched)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;_KNOWLEDGE_INDICATORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(explain|what is|how does|define|difference between|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;why do|concept of|overview|meaning of|works)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It worked. About 75% of the time.&lt;/p&gt;

&lt;p&gt;The remaining 25% was a graveyard of edge cases: "say hello in Japanese" (needs &lt;code&gt;translate&lt;/code&gt;, matched nothing), "what's 15% of 2850" (needs &lt;code&gt;calculator&lt;/code&gt;, matched &lt;code&gt;what's&lt;/code&gt; → routed to search), "compare React vs Vue" (needs autonomous executor, matched &lt;code&gt;compare&lt;/code&gt; → routed to direct answer).&lt;/p&gt;

&lt;p&gt;Every fix introduced new regressions. &lt;strong&gt;Regex-based routing doesn't scale.&lt;/strong&gt;&lt;/p&gt;
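
&lt;p&gt;A quick way to see the failure mode is to run the two patterns shown earlier against a few queries and list every route that fires. The &lt;code&gt;matched_routes&lt;/code&gt; helper is written just for this demonstration and the second query is a made-up example; anything other than exactly one match means the router has to guess.&lt;/p&gt;

```python
import re

# The two patterns from the snippet above, reproduced as shown.
_SEARCH_INDICATORS = re.compile(
    r"\b(latest|current|news|today|recent|who won|score|price|"
    r"stock|update|happening|trending|release|launched)\b", re.IGNORECASE
)
_KNOWLEDGE_INDICATORS = re.compile(
    r"\b(explain|what is|how does|define|difference between|"
    r"why do|concept of|overview|meaning of|works)\b", re.IGNORECASE
)


def matched_routes(query: str) -> list:
    """Return every route whose pattern fires for this query."""
    routes = []
    if _SEARCH_INDICATORS.search(query):
        routes.append("search")
    if _KNOWLEDGE_INDICATORS.search(query):
        routes.append("knowledge")
    return routes


print(matched_routes("say hello in Japanese"))             # no pattern fires
print(matched_routes("what is the latest React release"))  # both patterns fire
```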




&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;What if the model itself could learn the routing? Not a giant foundation model — a small, fast 7B model fine-tuned specifically for this task. The hypothesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A QLoRA-adapted 7B model trained on 1K high-quality tool-call traces should outperform hand-crafted regex, with comparable latency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This became &lt;a href="https://github.com/ayushh0110/toolforge" rel="noopener noreferrer"&gt;&lt;strong&gt;ToolForge&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Generating Training Data (The Hard Part)
&lt;/h2&gt;

&lt;p&gt;I had 9 tools but no labeled dataset. Creating one manually would take weeks. Instead, I used &lt;strong&gt;teacher distillation&lt;/strong&gt; — using a stronger model (Gemini 2.5 Flash) to generate high-quality training examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Distillation Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User queries (generated) → Gemini 2.5 Flash → Structured tool-call traces → Filtered dataset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trick was &lt;strong&gt;diversity&lt;/strong&gt;. I needed queries covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-tool requests ("What's the weather in Tokyo?")&lt;/li&gt;
&lt;li&gt;Multi-tool chains ("What's the weather in Tokyo and convert 25°C to Fahrenheit?")&lt;/li&gt;
&lt;li&gt;No-tool queries ("Explain recursion")&lt;/li&gt;
&lt;li&gt;Ambiguous queries ("Tell me about Python" — search or direct answer?)&lt;/li&gt;
&lt;li&gt;Edge cases ("sqrt of 44567" — calculator, not search)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built a &lt;code&gt;ClientPool&lt;/code&gt; that rotates across 6 free-tier Gemini API keys to avoid rate limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ClientPool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Round-robin pool of (key, model) slots for maximum throughput.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Pick the slot that has rested the longest
&lt;/span&gt;        &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_slots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_used&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_min_gap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_min_gap&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
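        &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# mark the slot as used so the next call picks a different one&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;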
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After filtering for quality (valid JSON, correct schema, no hallucinated tools), I had &lt;strong&gt;1,173 clean examples&lt;/strong&gt; — enough for fine-tuning.&lt;/p&gt;
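
&lt;p&gt;The three checks can be sketched as a single function. The tool labels are the ones from the dataset distribution table in this post; the trace shape (a JSON object with a &lt;code&gt;tool&lt;/code&gt; string and an &lt;code&gt;arguments&lt;/code&gt; object) and the name &lt;code&gt;parse_valid_trace&lt;/code&gt; are assumptions for illustration, since the real schema is not shown here.&lt;/p&gt;

```python
import json
from typing import Optional

# Labels from the dataset distribution table; the trace schema below is assumed.
ALLOWED_TOOLS = {
    "web_search", "calculator", "weather", "translate", "wikipedia",
    "no_tool", "dictionary", "datetime", "unit_converter",
}


def parse_valid_trace(raw: str) -> Optional[dict]:
    """Return the parsed trace if it passes all three checks, else None."""
    try:
        trace = json.loads(raw)                      # 1. valid JSON
    except json.JSONDecodeError:
        return None
    if not isinstance(trace, dict):
        return None
    if not isinstance(trace.get("tool"), str):       # 2. assumed schema: "tool" and "arguments"
        return None
    if not isinstance(trace.get("arguments"), dict):
        return None
    if trace["tool"] not in ALLOWED_TOOLS:           # 3. no hallucinated tools
        return None
    return trace


samples = [
    '{"tool": "calculator", "arguments": {"expression": "15 / 100 * 2850"}}',
    '{"tool": "email_sender", "arguments": {}}',
    'not json at all',
]
clean = [t for t in map(parse_valid_trace, samples) if t is not None]
print(len(clean))
```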

&lt;h3&gt;
  
  
  Dataset Distribution
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;web_search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;287&lt;/td&gt;
&lt;td&gt;24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;calculator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;156&lt;/td&gt;
&lt;td&gt;13%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;weather&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;143&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;translate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;132&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;wikipedia&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;no_tool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;119&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dictionary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;datetime&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unit_converter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The distribution is intentionally skewed toward &lt;code&gt;web_search&lt;/code&gt; — mirroring real-world query patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Training with QLoRA
&lt;/h2&gt;

&lt;p&gt;I trained on a Kaggle T4 GPU (free tier). The key insight: &lt;strong&gt;you don't need an A100 for fine-tuning.&lt;/strong&gt; QLoRA with 4-bit NF4 quantization fits a 7B model in ~6GB VRAM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# T4 has no native bf16; torch.float16 is the usual choice on T4&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_use_double_quant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Double quantization saves ~0.4GB
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;# LoRA rank
&lt;/span&gt;    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Scaling factor (alpha/r = 2)
&lt;/span&gt;    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
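
&lt;p&gt;The memory figures in the comments above can be sanity-checked with a little arithmetic. The sketch below is a rough estimate, not a measurement: it assumes a nominal 7e9 parameters, all of them quantized, and the block sizes described in the QLoRA paper (64 weights per block, 256 first-level constants per second-level block). The function name is hypothetical. It ignores activations, LoRA weights, and optimizer state, which is why a real training run needs more than the weights alone.&lt;/p&gt;

```python
def quantized_weight_gb(n_params: float, double_quant: bool) -> float:
    """Estimate storage for 4-bit weights plus their quantization constants."""
    block = 64          # weights sharing one scaling constant (assumed default)
    weight_bits = 4.0   # NF4 stores each weight in 4 bits
    if double_quant:
        # 8-bit constants, plus one fp32 constant per 256 first-level constants
        const_bits = 8 / block + 32 / (block * 256)
    else:
        # one fp32 constant per block of 64 weights
        const_bits = 32 / block
    return n_params * (weight_bits + const_bits) / 8 / 1e9


n_params = 7e9  # nominal "7B"; the real count differs per model
plain = quantized_weight_gb(n_params, double_quant=False)
dq = quantized_weight_gb(n_params, double_quant=True)
print(f"without double quant: {plain:.2f} GB")
print(f"with double quant:    {dq:.2f} GB")
print(f"saved:                {plain - dq:.2f} GB")
```

&lt;p&gt;Under these assumptions the estimate comes out at about 3.94 GB without and 3.61 GB with double quantization, a saving of roughly 0.33 GB, which is in line with the ~0.4GB noted in the config comment.&lt;/p&gt;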



&lt;p&gt;Why these choices?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;r=64&lt;/strong&gt;: A higher rank means more trainable parameters and more capacity to learn tool-routing patterns. I tested only r=16 and r=64; r=16 underfit, and r=64 scored about 3 points higher on both base models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All attention + MLP layers&lt;/strong&gt;: Tool routing requires understanding query intent (attention) AND mapping it to structured output (MLP). Targeting only attention heads wasn't enough.&lt;/li&gt;
&lt;li&gt;
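&lt;strong&gt;alpha=128 (2×r)&lt;/strong&gt;: LoRA scales the adapter update by alpha/r, so this sets the scale to 2, a common convention that keeps the update large enough to matter without destabilizing training.&lt;/li&gt;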
&lt;/ul&gt;
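
&lt;p&gt;To make "higher rank = more parameters" concrete, here is a rough count of trainable LoRA parameters for the seven target modules listed above. The layer shapes are my assumptions for Qwen2.5-7B (hidden size 3584, MLP size 18944, key/value projection width 512, 28 decoder layers); check the model's &lt;code&gt;config.json&lt;/code&gt; before relying on them. The helper name is hypothetical.&lt;/p&gt;

```python
# Assumed Qwen2.5-7B shapes; verify against the model's config.json.
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 512
NUM_LAYERS = 28

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to every adapted linear layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


for rank in (16, 64):
    millions = lora_trainable_params(rank) / 1e6
    print(f"r={rank}: {millions:.1f}M trainable parameters")
```

&lt;p&gt;With these shapes, r=16 gives about 40.4M trainable parameters and r=64 about 161.5M. The count grows linearly with rank, so r=64 has four times the adapter capacity of r=16.&lt;/p&gt;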




&lt;h2&gt;
  
  
  Step 3: The Ablation Study
&lt;/h2&gt;

&lt;p&gt;This is where the project goes from "I fine-tuned a model" to "I systematically evaluated design choices." I ran 4 experiments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Base Model&lt;/th&gt;
&lt;th&gt;LoRA Rank&lt;/th&gt;
&lt;th&gt;LR&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Mistral-7B-Instruct-v0.3&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;td&gt;78.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Mistral-7B-Instruct-v0.3&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;td&gt;81.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Qwen2.5-7B-Instruct&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;td&gt;83.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5-7B-Instruct&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;64&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2e-4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All tracked on &lt;a href="https://wandb.ai/shekharayush56-cognizant/toolforge" rel="noopener noreferrer"&gt;Weights &amp;amp; Biases&lt;/a&gt;.&lt;/p&gt;
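
&lt;p&gt;The effect of each factor can be read straight off the table. A small sketch that computes the per-model rank gain and the per-rank model gain from the four reported results (the variable names are mine):&lt;/p&gt;

```python
# Results copied from the ablation table: (base model, LoRA rank) -> accuracy in percent.
results = {
    ("Mistral-7B-Instruct-v0.3", 16): 78.4,
    ("Mistral-7B-Instruct-v0.3", 64): 81.7,
    ("Qwen2.5-7B-Instruct", 16): 83.1,
    ("Qwen2.5-7B-Instruct", 64): 86.2,
}

MISTRAL = "Mistral-7B-Instruct-v0.3"
QWEN = "Qwen2.5-7B-Instruct"

# Gain from raising the rank from 16 to 64, holding the base model fixed.
rank_gain = {m: round(results[(m, 64)] - results[(m, 16)], 1) for m in (MISTRAL, QWEN)}

# Gain from switching Mistral to Qwen, holding the rank fixed.
model_gain = {r: round(results[(QWEN, r)] - results[(MISTRAL, r)], 1) for r in (16, 64)}

print("rank 16 to 64:", rank_gain)
print("Mistral to Qwen:", model_gain)
```

&lt;p&gt;The rank gain is 3.3 points for Mistral and 3.1 for Qwen; the model gain is 4.7 points at r=16 and 4.5 at r=64. The two effects are close to additive across these four runs.&lt;/p&gt;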

&lt;h3&gt;
  
  
  Key Findings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Qwen &amp;gt; Mistral for tool routing (+4.5 to +4.7 points)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Qwen2.5-7B-Instruct has stronger structured output capabilities out of the box. Its chat template naturally handles tool-call JSON, while Mistral required more prompt engineering to produce valid output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. r=64 &amp;gt; r=16 for both models (+3.1 to +3.3 points)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The routing task isn't trivial — the model needs to learn mappings between natural language patterns and 9 discrete tool categories plus argument extraction. r=16 underfits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Eval loss converges by epoch 2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All runs showed minimal improvement after epoch 2, with some showing slight overfitting in epoch 3. &lt;code&gt;load_best_model_at_end=True&lt;/code&gt; was essential.&lt;/p&gt;
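
&lt;p&gt;For reference, a minimal sketch of trainer settings that keep the best checkpoint by eval loss. It assumes Hugging Face &lt;code&gt;TrainingArguments&lt;/code&gt; from a recent &lt;code&gt;transformers&lt;/code&gt; release; the output path is hypothetical, and everything beyond the 3 epochs, the 2e-4 learning rate, and &lt;code&gt;load_best_model_at_end=True&lt;/code&gt; is an assumption rather than the exact configuration used here.&lt;/p&gt;

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="toolforge-qlora",        # hypothetical output path
    num_train_epochs=3,
    learning_rate=2e-4,
    eval_strategy="epoch",               # called evaluation_strategy in older releases
    save_strategy="epoch",               # must match eval_strategy for best-model loading
    load_best_model_at_end=True,         # restore the best checkpoint after training
    metric_for_best_model="eval_loss",
    greater_is_better=False,             # lower eval loss is better
)
```

&lt;p&gt;With evaluation and saving both set to run every epoch, the trainer can restore the epoch-2 checkpoint if epoch 3 overfits.&lt;/p&gt;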




&lt;h2&gt;
  
  
  Step 4: Integration
&lt;/h2&gt;

&lt;p&gt;The integration into the autonomous agent was designed as a &lt;strong&gt;feature flag&lt;/strong&gt;, with zero behavior change in production unless explicitly enabled:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In executor.py
&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# stays None when ToolForge is disabled or fails
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_toolforge_available&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toolforge_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;router_source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolforge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;router_source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heuristic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# heuristic fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;toolforge_classify()&lt;/code&gt; function:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loads the LoRA adapter lazily on first query&lt;/li&gt;
&lt;li&gt;Runs inference with greedy decoding (deterministic routing)&lt;/li&gt;
&lt;li&gt;Parses the model's tool-call output&lt;/li&gt;
&lt;li&gt;Maps specific tools to the agent's decision types (&lt;code&gt;web_search&lt;/code&gt; → &lt;code&gt;needs_search&lt;/code&gt;, no tool → &lt;code&gt;direct_answer&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Returns &lt;code&gt;None&lt;/code&gt; on any failure → heuristic takes over&lt;/li&gt;
&lt;/ol&gt;
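
&lt;p&gt;Steps 3 to 5 above can be sketched as a small pure function. This is an illustration under assumptions, not the project's code: it assumes the model emits a JSON object with a &lt;code&gt;tool&lt;/code&gt; key (with &lt;code&gt;null&lt;/code&gt; meaning no tool), the name &lt;code&gt;map_tool_call&lt;/code&gt; is hypothetical, and only the two mappings named above are included.&lt;/p&gt;

```python
import json
from typing import Optional

# Only the mapping named in the article; the agent's other tools are omitted.
TOOL_TO_DECISION = {"web_search": "needs_search"}


def map_tool_call(raw_output: str) -> Optional[str]:
    """Map raw model output to a decision type, or None to trigger the heuristic."""
    try:
        tool = json.loads(raw_output)["tool"]  # assumed output shape
    except (ValueError, KeyError, TypeError):
        return None  # malformed output: let the heuristic take over
    if tool is None:
        return "direct_answer"  # the model chose no tool
    if not isinstance(tool, str):
        return None
    return TOOL_TO_DECISION.get(tool)  # unknown tool names also yield None


print(map_tool_call('{"tool": "web_search", "arguments": {"query": "React vs Vue"}}'))
print(map_tool_call('{"tool": null}'))
print(map_tool_call("not json"))
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for every failure mode keeps the caller simple: it only has to check for &lt;code&gt;None&lt;/code&gt; before falling back to the heuristic.&lt;/p&gt;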

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production (HF Spaces, CPU)&lt;/strong&gt;: heuristic runs as before&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU-enabled environments&lt;/strong&gt;: ToolForge model handles routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The code is always visible&lt;/strong&gt;: interviewers can see the integration pattern&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Heuristic (Regex)&lt;/th&gt;
&lt;th&gt;ToolForge (QLoRA)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall Accuracy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~75%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Approach&lt;/td&gt;
&lt;td&gt;200 lines of regex&lt;/td&gt;
&lt;td&gt;Fine-tuned Qwen2.5-7B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;0ms (regex)&lt;/td&gt;
&lt;td&gt;~200ms (GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles edge cases&lt;/td&gt;
&lt;td&gt;❌ Constant regressions&lt;/td&gt;
&lt;td&gt;✅ Learned from data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance cost&lt;/td&gt;
&lt;td&gt;High (new regex per bug)&lt;/td&gt;
&lt;td&gt;Low (retrain on new data)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gain of roughly 11 percentage points (about 15% relative to the ~75% baseline) isn't just a number. It means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Say hello in Japanese" → correctly routes to &lt;code&gt;translate&lt;/code&gt; (was: missed entirely)&lt;/li&gt;
&lt;li&gt;"sqrt(44567)" → correctly routes to &lt;code&gt;calculator&lt;/code&gt; (was: matched "what" → search)&lt;/li&gt;
&lt;li&gt;"Compare React vs Vue for 2026" → correctly routes to &lt;code&gt;autonomous_task&lt;/code&gt; (was: partial match → direct answer)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More data&lt;/strong&gt;: 1.1K examples is enough for a proof of concept, but I expect 5K+ would push accuracy higher, possibly above 90%, though I haven't tested that. The distillation pipeline can scale; I just ran out of free API quota.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Argument extraction evaluation&lt;/strong&gt;: I evaluated tool &lt;em&gt;selection&lt;/em&gt; accuracy but didn't formally measure argument &lt;em&gt;extraction&lt;/em&gt; quality (e.g., did the model extract "Tokyo" from "weather in Tokyo?"). The traces show it works, but a proper F1 metric would be stronger.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GGUF quantization for CPU inference&lt;/strong&gt;: The current serving path requires a GPU. Converting to GGUF and using llama.cpp should enable CPU inference, at an estimated ~1-2s latency, which would make it viable for production on free-tier hosting.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
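
&lt;p&gt;For the argument-extraction metric in item 2, one possible definition is a micro-averaged F1 over (argument name, value) pairs with exact matching. The sketch below is an assumption about how such a metric could look, not something I ran in this project; the function name is hypothetical and the example pairs are made up for illustration.&lt;/p&gt;

```python
def argument_f1(pairs):
    """Micro F1 over (name, value) items; pairs is a list of (gold, predicted) dicts."""
    tp = fp = fn = 0
    for gold, pred in pairs:
        gold_items = set(gold.items())
        pred_items = set(pred.items())
        tp += len(gold_items.intersection(pred_items))  # exact name and value match
        fp += len(pred_items - gold_items)              # predicted but wrong or extra
        fn += len(gold_items - pred_items)              # expected but missing or wrong
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative examples only, not real evaluation data.
examples = [
    ({"location": "Tokyo"}, {"location": "Tokyo"}),             # correct extraction
    ({"expression": "sqrt(44567)"}, {"expression": "44567"}),   # wrong value
]
print(argument_f1(examples))
```

&lt;p&gt;Exact matching is strict: a prediction of "tokyo" would count as wrong against "Tokyo". Normalizing case and whitespace before comparing would be a reasonable refinement.&lt;/p&gt;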




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;This project isn't about fine-tuning. Fine-tuning is a technique — anyone can run &lt;code&gt;SFTTrainer&lt;/code&gt;. The story is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;I built an agent&lt;/strong&gt; with hand-crafted routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I measured where it failed&lt;/strong&gt; (~75% accuracy, constant regex regressions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I generated training data&lt;/strong&gt; using teacher distillation from my own pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I trained and compared models&lt;/strong&gt; with systematic ablation studies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I proved it works&lt;/strong&gt; with quantitative evaluation (86.2% accuracy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I integrated it&lt;/strong&gt; as a production-ready feature flag&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's not a tutorial project. That's the ML engineering loop — identify problem → collect data → train → evaluate → deploy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ToolForge repo&lt;/strong&gt;: &lt;a href="https://github.com/ayushh0110/toolforge" rel="noopener noreferrer"&gt;github.com/ayushh0110/toolforge&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous Agent&lt;/strong&gt;: &lt;a href="https://github.com/ayushh0110/autonomous-agent" rel="noopener noreferrer"&gt;github.com/ayushh0110/autonomous-agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;W&amp;amp;B Dashboard&lt;/strong&gt;: &lt;a href="https://wandb.ai/shekharayush56-cognizant/toolforge" rel="noopener noreferrer"&gt;wandb.ai/shekharayush56-cognizant/toolforge&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live Agent Demo&lt;/strong&gt;: &lt;a href="https://autonomous-agent-one.vercel.app" rel="noopener noreferrer"&gt;autonomous-agent-one.vercel.app&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/ayushh0110" rel="noopener noreferrer"&gt;Ayush Shekhar&lt;/a&gt;. If you're working on tool-use fine-tuning, I'd love to hear what approach you're taking — reach out on &lt;a href="https://linkedin.com/in/ayush-shekhar" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
