<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AaryaP</title>
    <description>The latest articles on DEV Community by AaryaP (@aarya_prakash_1328e1617f6).</description>
    <link>https://dev.to/aarya_prakash_1328e1617f6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3986326%2F298a581f-8e65-4af1-960a-4d80b80a9541.png</url>
      <title>DEV Community: AaryaP</title>
      <link>https://dev.to/aarya_prakash_1328e1617f6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aarya_prakash_1328e1617f6"/>
    <language>en</language>
    <item>
      <title>Build Small Hackathon - Quillwright</title>
      <dc:creator>AaryaP</dc:creator>
      <pubDate>Mon, 15 Jun 2026 22:59:04 +0000</pubDate>
      <link>https://dev.to/aarya_prakash_1328e1617f6/build-small-hackathon-quillwright-573f</link>
      <guid>https://dev.to/aarya_prakash_1328e1617f6/build-small-hackathon-quillwright-573f</guid>
      <description>&lt;p&gt;&lt;em&gt;How Quillwright turns a photo and a voice note into a tradesperson's estimate, with an orchestra of small models, on your own machine, and not a single number invented by an LLM.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;The job nobody wants&lt;/h2&gt;

&lt;p&gt;Every tradesperson does the same unpaid hour after the real work is done: writing up the estimate. Parts, quantities, labor, a defensible total. Quillwright is an on-device, human-supervised agent that writes that draft from a &lt;strong&gt;field capture&lt;/strong&gt; (a job photo plus a spoken note) and hands back an itemized, editable estimate.&lt;/p&gt;

&lt;p&gt;The constraints we set ourselves were the interesting part: &lt;strong&gt;small models&lt;/strong&gt; (≤32B), &lt;strong&gt;no third-party AI APIs&lt;/strong&gt;, and a hard rule that &lt;strong&gt;no customer-facing number ever comes from a language model.&lt;/strong&gt; Those three constraints shaped every decision below.&lt;/p&gt;

&lt;h2&gt;An orchestra, not a soloist&lt;/h2&gt;

&lt;p&gt;There's no single model doing the work. Each role in the pipeline resolves to a small, purpose-fit model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perception&lt;/strong&gt;: MiniCPM-V (OpenBMB) reads the job photo into observations ("RUN CAPACITOR", a nameplate model number).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Brain&lt;/strong&gt;: NVIDIA Nemotron-3-Nano drives a narrow tool-calling loop: which items, what quantities, when it's done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt;: Cohere Transcribe turns the voice note into text on-device.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual&lt;/strong&gt;: Cohere Aya translates the customer-facing copy into Spanish, French, or Mandarin. It translates descriptions only, never the numbers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt;: a small embedder powers semantic recall of similar past jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The brain's tool surface is deliberately tiny: essentially &lt;em&gt;add a priced item&lt;/em&gt; and &lt;em&gt;finish&lt;/em&gt;. That narrowness is &lt;strong&gt;why a 4B model is reliable&lt;/strong&gt; here: it does routing and judgment, not arithmetic.&lt;/p&gt;
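&lt;p&gt;A minimal sketch of what such a two-tool surface could look like. Every name here (&lt;code&gt;Draft&lt;/code&gt;, &lt;code&gt;add_item&lt;/code&gt;, &lt;code&gt;finish&lt;/code&gt;, &lt;code&gt;dispatch&lt;/code&gt;, the catalog keys and prices) is a hypothetical stand-in, not code from the Quillwright repo. The point is only that the model passes a catalog key and a quantity as tool arguments, while the price is read from the catalog.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from decimal import Decimal

# Hypothetical catalog; real keys and prices would come from the trade catalog.
CATALOG = {
    "run_capacitor_45_5": Decimal("38.50"),
    "labor_hour": Decimal("95.00"),
}

@dataclass
class Draft:
    items: list = field(default_factory=list)
    done: bool = False

def add_item(draft, sku, quantity):
    """Tool 1: the model names a SKU and a quantity; the price is looked up."""
    if sku not in CATALOG:
        return {"ok": False, "error": "unknown sku: " + sku}
    draft.items.append({"sku": sku, "quantity": quantity, "unit_price": CATALOG[sku]})
    return {"ok": True}

def finish(draft):
    """Tool 2: the model signals it has nothing more to add."""
    draft.done = True
    return {"ok": True}

TOOLS = {"add_item": add_item, "finish": finish}

def dispatch(draft, call):
    """Route one model tool call; anything outside the surface is rejected."""
    handler = TOOLS.get(call["name"])
    if handler is None:
        return {"ok": False, "error": "unknown tool"}
    return handler(draft, **call.get("arguments", {}))
```

&lt;p&gt;With a surface this small, a malformed or unexpected call has nowhere to go: it is rejected by &lt;code&gt;dispatch&lt;/code&gt; instead of reaching the estimate.&lt;/p&gt;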

&lt;h2&gt;Facts-from-Tools: the rule that runs through everything&lt;/h2&gt;

&lt;p&gt;The correctness rule is simple to state and ruthless to enforce: &lt;strong&gt;any number that reaches the customer (price, quantity, tax, total) comes from a tool (a catalog lookup, a deterministic &lt;code&gt;compute&lt;/code&gt;) or from a human edit. Never from the model's free generation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It holds in the obvious places (the brain calls &lt;code&gt;lookup_price&lt;/code&gt;, not "I think this costs $40") and the non-obvious ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Edits&lt;/strong&gt; re-run through a server-authoritative recalc. The browser never computes its own total.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translation&lt;/strong&gt; changes words, not digits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Capture&lt;/strong&gt; (reading a supplier quote) produces &lt;em&gt;Proposed Line Items&lt;/em&gt;: the document is the source, but a price only becomes customer-facing once a human confirms it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The refinement chat&lt;/strong&gt; keeps a sanitized history: when you reopen an estimate and keep editing, the model sees &lt;em&gt;what you asked&lt;/em&gt; ("make it 2 hours") but takes the &lt;em&gt;numbers&lt;/em&gt; from the current line items, so a stale dollar figure can never leak back in. Even the conversation's own compaction is done in code, not by asking a model to summarize.&lt;/li&gt;
&lt;/ul&gt;
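&lt;p&gt;The server-authoritative recalc in the first bullet can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the &lt;code&gt;recalc&lt;/code&gt; name, the line-item field names, the flat tax rate, and the half-up rounding are all hypothetical choices.&lt;/p&gt;

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def recalc(line_items, tax_rate):
    """Recompute every derived number from the current line items.

    Client-supplied subtotals or totals are never read: only quantity and
    unit_price (itself from the catalog or a human edit) are inputs.
    """
    subtotal = sum(
        (Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
         for item in line_items),
        Decimal("0"),
    )
    tax = (subtotal * Decimal(str(tax_rate))).quantize(CENT, rounding=ROUND_HALF_UP)
    subtotal = subtotal.quantize(CENT, rounding=ROUND_HALF_UP)
    return {"subtotal": subtotal, "tax": tax, "total": subtotal + tax}

# A human edit changes a quantity; the server recomputes, the browser only displays.
items = [
    {"sku": "labor_hour", "quantity": 2, "unit_price": "95.00"},
    {"sku": "run_capacitor", "quantity": 1, "unit_price": "38.50"},
]
result = recalc(items, "0.08")
```

&lt;p&gt;Using &lt;code&gt;Decimal&lt;/code&gt; instead of floats keeps cents exact, which matters when the total is the one number the customer will check.&lt;/p&gt;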

&lt;h2&gt;The eval story (the part I'd tell another builder)&lt;/h2&gt;

&lt;p&gt;Here's the moment that changed how we built this. We ran the agent by hand on a handful of jobs and it looked &lt;strong&gt;perfect&lt;/strong&gt;. Then we wrote an eval set and scored it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item F1: 0.367.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95fufempqk30ahu9ytl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95fufempqk30ahu9ytl9.png" alt="Agent Brain item F1" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Manual testing had been lying to us: we'd unconsciously fed it the cases it already handled. The eval set didn't. Two fixes, both measured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fuzzy catalog lookup&lt;/strong&gt;: "refrigerant" should find &lt;code&gt;refrigerant_r410a&lt;/code&gt;. F1 jumped to &lt;strong&gt;0.880&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt tuning&lt;/strong&gt; of the brain's tool-calling took F1 to &lt;strong&gt;0.967&lt;/strong&gt;, with quantity accuracy going from 0.40 to 1.00.&lt;/li&gt;
&lt;/ol&gt;
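&lt;p&gt;One way a fuzzy catalog lookup like the first fix could work, sketched with only the standard library. The catalog keys other than &lt;code&gt;refrigerant_r410a&lt;/code&gt;, the &lt;code&gt;fuzzy_lookup&lt;/code&gt; name, and the 0.6 cutoff are assumptions for illustration; the post does not say which matching technique Quillwright actually uses.&lt;/p&gt;

```python
import difflib

# Hypothetical catalog keys for illustration.
CATALOG_KEYS = ["refrigerant_r410a", "run_capacitor_45_5", "contactor_2_pole", "labor_hour"]

def normalize(text):
    """Lowercase and convert spaces and hyphens to the catalog's underscores."""
    return text.strip().lower().replace(" ", "_").replace("-", "_")

def fuzzy_lookup(query, keys=CATALOG_KEYS, cutoff=0.6):
    """Return the best-matching catalog key, or None if nothing is close."""
    q = normalize(query)
    if q in keys:
        return q
    # A query that is a whole-token prefix of exactly one key is unambiguous.
    prefixed = [k for k in keys if k.startswith(q + "_")]
    if len(prefixed) == 1:
        return prefixed[0]
    # Otherwise fall back to similarity ranking from the standard library.
    close = difflib.get_close_matches(q, keys, n=1, cutoff=cutoff)
    return close[0] if close else None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for a weak match matters as much as finding the strong ones: a wrong item with a real price is still a wrong estimate.&lt;/p&gt;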

&lt;p&gt;The lesson isn't "we got a good number." It's that the good number only existed because we were willing to be told a bad one first.&lt;/p&gt;
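&lt;p&gt;For readers who want to build the same kind of check, here is a rough sketch of an item-level F1 scorer. It assumes each eval case lists the expected catalog keys and the keys the agent produced, and that per-job scores are macro-averaged; the post does not state the exact averaging, so treat that and the field names as assumptions.&lt;/p&gt;

```python
def item_f1(predicted, expected):
    """F1 over sets of catalog keys for a single job."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        return 1.0
    true_pos = len(pred.intersection(gold))
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

def mean_item_f1(cases):
    """Macro-average: score each job, then average across jobs."""
    scores = [item_f1(c["predicted"], c["expected"]) for c in cases]
    return sum(scores) / len(scores)
```

&lt;p&gt;Scoring on sets of items penalizes both missed parts (recall) and invented ones (precision), which is why a demo that looks fine by eye can still score poorly.&lt;/p&gt;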

&lt;h2&gt;Memory that gets smarter, measured the same way&lt;/h2&gt;

&lt;p&gt;Quillwright recalls similar past jobs to inform a new estimate. The first version used keyword matching. We measured &lt;strong&gt;recall@1 = 0.750&lt;/strong&gt;. Swapping in a small embedder for a semantic re-rank moved it to &lt;strong&gt;0.875&lt;/strong&gt;, with one honest remaining miss we left in, because a benchmark with no failures is a benchmark you don't trust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6w19wawskp98whmpmdh4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6w19wawskp98whmpmdh4.png" alt="Episodic recall" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
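&lt;p&gt;The recall@1 measurement can be sketched like this. The keyword ranker is a toy baseline, and the job notes, field names, and function names are made up for illustration; a semantic re-rank would plug in as a different &lt;code&gt;rank&lt;/code&gt; function scored by the same metric.&lt;/p&gt;

```python
def keyword_rank(query, past_jobs):
    """Rank past jobs by shared words with the query (toy keyword baseline)."""
    q_words = set(query.lower().split())

    def overlap(job):
        return len(q_words.intersection(job["note"].lower().split()))

    return sorted(past_jobs, key=overlap, reverse=True)

def recall_at_1(cases, past_jobs, rank):
    """Fraction of queries whose top-ranked past job is the expected one."""
    hits = sum(
        1 for case in cases
        if rank(case["query"], past_jobs)[0]["id"] == case["expected_id"]
    )
    return hits / len(cases)

# Hypothetical data: the second query shares surface words with the wrong job.
past_jobs = [
    {"id": "j1", "note": "replaced run capacitor on condenser"},
    {"id": "j2", "note": "recharged refrigerant on rooftop unit"},
]
cases = [
    {"query": "condenser capacitor swap", "expected_id": "j1"},
    {"query": "coolant leak on condenser", "expected_id": "j2"},
]
score = recall_at_1(cases, past_jobs, keyword_rank)
```

&lt;p&gt;On this toy data the keyword baseline gets the first query right and the second wrong, because "coolant" never appears in the refrigerant job's note. That is the kind of miss an embedding-based re-rank is meant to recover.&lt;/p&gt;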

&lt;h2&gt;Fine-tuning a small vision model on receipts, and on the real domain&lt;/h2&gt;

&lt;p&gt;The 🎯 artifact is a MiniCPM-V LoRA fine-tune. On the public &lt;strong&gt;CORD&lt;/strong&gt; receipt benchmark, the tune lifted item F1 from &lt;strong&gt;0.588 to 0.681&lt;/strong&gt; (+0.09). But CORD is receipts, not trade invoices, so we also generated a grounded-synthetic set of trade invoices (built from a real 381-entry trade catalog) and fine-tuned on that. In-distribution, item F1 went from &lt;strong&gt;0.703 to 0.933&lt;/strong&gt; (+0.23), with price accuracy hitting 1.00.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffulow5q439ci2cmzwxqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffulow5q439ci2cmzwxqn.png" alt="MiniCPM-V fine-tune" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The +0.23 is the honest headline: a small model, fine-tuned on the actual domain, closes most of the gap to a clean read. The +0.09 on CORD is the conservative one: it's a harder, out-of-domain benchmark, and we report it anyway.&lt;/p&gt;

&lt;h2&gt;Artifacts&lt;/h2&gt;

&lt;p&gt;Both LoRA adapters are on the Hub, and every number above is reproducible from the eval scripts in the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🎯 &lt;a href="https://huggingface.co/Aarya2004/minicpmv-trade-lora" rel="noopener noreferrer"&gt;&lt;code&gt;Aarya2004/minicpmv-trade-lora&lt;/code&gt;&lt;/a&gt;: the in-domain trade-invoice tune (0.703 → 0.933).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/Aarya2004/minicpmv-cord-lora" rel="noopener noreferrer"&gt;&lt;code&gt;Aarya2004/minicpmv-cord-lora&lt;/code&gt;&lt;/a&gt;: the conservative CORD baseline (0.588 → 0.681).&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent Brain item F1&lt;/td&gt;
&lt;td&gt;0.367&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Episodic recall@1&lt;/td&gt;
&lt;td&gt;0.750&lt;/td&gt;
&lt;td&gt;0.875&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniCPM-V item F1 (trade, in-domain)&lt;/td&gt;
&lt;td&gt;0.703&lt;/td&gt;
&lt;td&gt;0.933&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniCPM-V item F1 (CORD, OOD)&lt;/td&gt;
&lt;td&gt;0.588&lt;/td&gt;
&lt;td&gt;0.681&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;"On your own machine", and the honesty around it&lt;/h2&gt;

&lt;p&gt;The hero claim is &lt;em&gt;no cloud&lt;/em&gt;. The honest version of that claim has two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Private Stack&lt;/strong&gt; is open small models with no third-party AI APIs. Locally, those models genuinely run on the dev machine via Ollama / llama.cpp, and we filmed an &lt;strong&gt;Airplane-Mode Proof&lt;/strong&gt;: Wi-Fi off, a real forge completing.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;hosted demo Space&lt;/strong&gt; runs the &lt;strong&gt;Best Stack&lt;/strong&gt; on &lt;strong&gt;Modal&lt;/strong&gt; GPUs: a Nemotron-3-Nano 30B brain, Nemotron-Omni for vision and audio, and Aya-Expanse for multilingual. It's the same agent loop and the same Facts-from-Tools guarantees as the local run, just with more headroom. The Modal apps scale to zero when idle, so the Space can fall back to a lightweight CPU mode (and says so on the page) when the models aren't wired. The local Private Stack and the hosted Best Stack are the same family at two tiers: flip one env var and the brain moves from a 4B on a laptop to a 30B on a GPU without touching the agent code.&lt;/li&gt;
&lt;/ul&gt;
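&lt;p&gt;A sketch of how a single environment variable could switch tiers without touching the agent code. The variable name &lt;code&gt;QUILLWRIGHT_STACK&lt;/code&gt;, the model identifier strings, and the &lt;code&gt;resolve_model&lt;/code&gt; helper are hypothetical; the post only says that one env var moves the brain between tiers.&lt;/p&gt;

```python
import os

# Hypothetical role-to-model tables; identifiers are illustrative placeholders.
STACKS = {
    "private": {"brain": "nemotron-3-nano-4b", "perception": "minicpm-v"},
    "best": {"brain": "nemotron-3-nano-30b", "perception": "nemotron-omni-30b"},
}

def resolve_model(role, env=None):
    """Pick the model for a role from the stack named by one env var."""
    env = os.environ if env is None else env
    stack = env.get("QUILLWRIGHT_STACK", "private")  # hypothetical variable name
    if stack not in STACKS:
        raise ValueError("unknown stack: " + stack)
    return STACKS[stack][role]
```

&lt;p&gt;Because the agent asks for a role ("brain", "perception") and never for a specific model, the loop and its tools stay identical across both tiers.&lt;/p&gt;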

&lt;p&gt;Same agent, same tools, same Facts-from-Tools guarantee. Only the models behind each role change:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;🔒 Private Stack (local)&lt;/th&gt;
&lt;th&gt;⚡ Best Stack (Modal)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Brain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nemotron-3-Nano 4B (NVIDIA)&lt;/td&gt;
&lt;td&gt;Nemotron-3-Nano 30B (NVIDIA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perception&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MiniCPM-V (OpenBMB)&lt;/td&gt;
&lt;td&gt;Nemotron-Omni 30B (NVIDIA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cohere Transcribe (on-device)&lt;/td&gt;
&lt;td&gt;Nemotron-Omni 30B (NVIDIA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multilingual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aya (Cohere)&lt;/td&gt;
&lt;td&gt;Aya-Expanse 8B (Cohere)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;on-device (sentence-transformers)&lt;/td&gt;
&lt;td&gt;&lt;em&gt;same on-device path&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extraction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;no local path&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Parse extractor (fine-tuned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runs offline?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Yes, Airplane-Mode Proof&lt;/td&gt;
&lt;td&gt;❌ No, hosted GPU endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost / GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0, your hardware&lt;/td&gt;
&lt;td&gt;scales to zero when idle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We hold the same line everywhere a feature could over-claim. The "Finalize &amp;amp; Send" feature really texts or emails the estimate &lt;strong&gt;on the local path&lt;/strong&gt; with your own provider credentials; on the public Space it drafts only and tells you nothing was transmitted. The same goes for the phone call and the phone-capture QR: both are real on the tunneled local machine, and framed honestly on the public Space.&lt;/p&gt;

&lt;h2&gt;Three ways in&lt;/h2&gt;

&lt;p&gt;Once the core was solid, the capture surface grew. Each path lands in the &lt;em&gt;same&lt;/em&gt; pipeline and the &lt;em&gt;same&lt;/em&gt; Facts-from-Tools guarantees:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Workspace&lt;/strong&gt;: type/paste a note, add a photo, watch the Digital Apprentice stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call a phone number&lt;/strong&gt;: describe the job out loud; it transcribes the call, forges a &lt;strong&gt;draft&lt;/strong&gt; estimate, reads the total back, and texts you the PDF. A human approves later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scan a QR&lt;/strong&gt;: capture a photo and voice note on your phone; the desktop forges it live on screen.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;What I'd carry to the next project&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write the eval before you trust the demo.&lt;/strong&gt; 0.367 was the most useful number in the whole build.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep the model's job small.&lt;/strong&gt; The brain is reliable because it never touches arithmetic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make the honesty structural, not aspirational.&lt;/strong&gt; "The model never emits a number" is a code path, not a promise, and it's the same code path on every capture surface.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Quillwright: tell it about the job; it drafts the estimate.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>smallmodels</category>
      <category>agents</category>
      <category>evals</category>
      <category>finetuning</category>
    </item>
  </channel>
</rss>
