<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vasyl</title>
    <description>The latest articles on DEV Community by Vasyl (@mrviduus).</description>
    <link>https://dev.to/mrviduus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F333461%2Fea3cc6b2-e942-4848-8606-30c345279779.jpg</url>
      <title>DEV Community: Vasyl</title>
      <link>https://dev.to/mrviduus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mrviduus"/>
    <language>en</language>
    <item>
      <title>An AI Feature Has No "Tests Pass" Moment. So I Write the Eval First.</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Tue, 23 Jun 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/mrviduus/an-ai-feature-has-no-tests-pass-moment-so-i-write-the-eval-first-1f7p</link>
      <guid>https://dev.to/mrviduus/an-ai-feature-has-no-tests-pass-moment-so-i-write-the-eval-first-1f7p</guid>
      <description>&lt;p&gt;I was building an "Ask This Book" feature: readers can ask questions about a book while they're reading it.&lt;/p&gt;

&lt;p&gt;One requirement sounded simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A reader on chapter 3 must never receive spoilers from chapter 30.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My first instinct was the same as everyone else's: tell the model not to spoil future chapters. Something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Please don't reveal information from chapters the reader hasn't reached yet."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And honestly, it mostly worked.&lt;/p&gt;

&lt;p&gt;The problem is that "mostly" is useless. A user only needs one spoiler.&lt;/p&gt;

&lt;p&gt;That was the moment I realized the feature had no definition of done.&lt;/p&gt;

&lt;p&gt;With normal software, something pushes back. The compiler complains. The tests fail. The types don't line up.&lt;/p&gt;

&lt;p&gt;With an LLM feature, none of that happens. The output looks plausible by default — fluent, confident, well formatted — even when it's wrong.&lt;/p&gt;

&lt;p&gt;So "it looked right in the demo" quietly becomes the finish line.&lt;/p&gt;

&lt;p&gt;That's exactly why I write the eval before I write the feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Eval Is the Specification
&lt;/h2&gt;

&lt;p&gt;Most teams treat evals as QA. Build the feature, ship something that works, add evals later.&lt;/p&gt;

&lt;p&gt;I increasingly think that's backwards. For AI systems, the eval is often the only concrete definition of success.&lt;/p&gt;

&lt;p&gt;The moment I wrote the spoiler eval, I had to define failure: spoiler leakage must be zero. Not low. Not acceptable. Zero.&lt;/p&gt;

&lt;p&gt;And that requirement immediately exposed a problem. No prompt can guarantee zero.&lt;/p&gt;

&lt;p&gt;Prompts are probabilistic. Users can phrase questions differently. Models can interpret instructions differently. Future model updates can behave differently. You cannot get a hard guarantee from a soft instruction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Eval Changed the Architecture
&lt;/h2&gt;

&lt;p&gt;Once the eval demanded zero spoilers, the solution stopped being a prompt problem. It became a retrieval problem.&lt;/p&gt;

&lt;p&gt;Instead of telling the model not to reveal future chapters, I prevented future chapters from entering the context at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;chapter_ord&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;maxChapterOrd&lt;/span&gt;&lt;span class="nv"&gt;`&lt;/span&gt;&lt;span class="se"&gt;``&lt;/span&gt;&lt;span class="nv"&gt;

Anything beyond the reader's progress never enters the retrieval set. The model can't leak information it never saw.

And the eval that checks it is just as blunt — a retrieved chunk past the reader's progress is a leak:



&lt;/span&gt;&lt;span class="se"&gt;``&lt;/span&gt;&lt;span class="nv"&gt;`&lt;/span&gt;&lt;span class="n"&gt;csharp&lt;/span&gt;
&lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;One&lt;/span&gt; &lt;span class="n"&gt;retrieved&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="n"&gt;past&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="s1"&gt;'s progress = one spoiler leak.
public static int LeakCount(IEnumerable&amp;lt;RetrievedChunk&amp;gt; retrieved, int gateChapterOrd) =&amp;gt;
    retrieved.Count(c =&amp;gt; c.ChapterOrd &amp;gt; gateChapterOrd);
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Across the adversarial test cases, that number has to be zero. That's the moment the idea really clicked for me: the eval didn't test the design. It produced the design.&lt;/p&gt;

&lt;p&gt;A measurable failure condition forced a better architecture than I would have built if I had started with prompt engineering.&lt;/p&gt;
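
&lt;p&gt;To make "zero across the adversarial cases" concrete, here is a minimal sketch of how the leak count could be summed over a case list. &lt;code&gt;SpoilerCase&lt;/code&gt;, &lt;code&gt;TotalLeaks&lt;/code&gt;, the shape of &lt;code&gt;RetrievedChunk&lt;/code&gt;, and the &lt;code&gt;retrieve&lt;/code&gt; delegate are assumptions made for illustration, not the real TextStack types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for illustration; not TextStack's actual types.
public sealed record RetrievedChunk(int ChapterOrd, string Text);
public sealed record SpoilerCase(string Question, int GateChapterOrd);

public static class SpoilerEval
{
    // Same rule as LeakCount in the post: a chunk past the reader's progress is a leak.
    public static int LeakCount(IEnumerable&amp;lt;RetrievedChunk&amp;gt; retrieved, int gateChapterOrd) =&amp;gt;
        retrieved.Count(c =&amp;gt; c.ChapterOrd &amp;gt; gateChapterOrd);

    // Sums leaks over every case; the eval passes only when the total is exactly 0.
    public static int TotalLeaks(
        IEnumerable&amp;lt;SpoilerCase&amp;gt; cases,
        Func&amp;lt;SpoilerCase, IEnumerable&amp;lt;RetrievedChunk&amp;gt;&amp;gt; retrieve) =&amp;gt;
        cases.Sum(c =&amp;gt; LeakCount(retrieve(c), c.GateChapterOrd));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a test, &lt;code&gt;retrieve&lt;/code&gt; would call the real retrieval path for each adversarial question, and the assertion would be that &lt;code&gt;TotalLeaks&lt;/code&gt; equals 0.&lt;/p&gt;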

&lt;h2&gt;
  
  
  The Same Thing Happened to Retrieval Quality
&lt;/h2&gt;

&lt;p&gt;The spoiler requirement wasn't the only eval. I also defined two other targets before building the feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval must surface the correct passage near the top of the results.&lt;/li&gt;
&lt;li&gt;Answers must remain grounded in the passages they cite.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because those requirements were measurable, every change received a verdict instead of an opinion.&lt;/p&gt;

&lt;p&gt;A single semantic search wasn't clearing the bar. So I ended up combining two retrieval approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vector search for semantic similarity&lt;/li&gt;
&lt;li&gt;full-text search for exact names, phrases, and quotations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results are fused using Reciprocal Rank Fusion — less mysterious than it sounds. Each chunk scores the sum of 1/(k+rank) across the lists it appears in, so anything ranked highly by both retrievers floats to the top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ranked highly by both vector AND lexical -&amp;gt; floats to the top.&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;+=&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// i is 0-based; RRF rank is 1-based&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
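
&lt;p&gt;For context, the one-line update above sits inside a loop over each ranked list. A minimal, self-contained sketch of the whole fusion follows, assuming chunk ids are strings. The &lt;code&gt;RankFusion.Fuse&lt;/code&gt; name, the default &lt;code&gt;k = 60&lt;/code&gt; (a commonly used value), and the ordinal tie-break are assumptions, not details from the real code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

public static class RankFusion
{
    // Reciprocal Rank Fusion over any number of ranked lists (best result first).
    public static List&amp;lt;string&amp;gt; Fuse(IEnumerable&amp;lt;IReadOnlyList&amp;lt;string&amp;gt;&amp;gt; rankedLists, int k = 60)
    {
        var scores = new Dictionary&amp;lt;string, double&amp;gt;();
        foreach (var list in rankedLists)
        {
            for (var i = 0; i &amp;lt; list.Count; i++)
            {
                var item = list[i];
                scores.TryGetValue(item, out var current);  // stays 0.0 for a new item
                scores[item] = current + 1.0 / (k + i + 1); // i is 0-based; RRF rank is 1-based
            }
        }

        return scores
            .OrderByDescending(kv =&amp;gt; kv.Value)
            .ThenBy(kv =&amp;gt; kv.Key, StringComparer.Ordinal)   // deterministic tie-break
            .Select(kv =&amp;gt; kv.Key)
            .ToList();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because only ranks enter the formula, the vector and full-text scores never have to be normalised against each other.&lt;/p&gt;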



&lt;p&gt;I didn't choose hybrid retrieval because it's fashionable. I chose it because it moved the number. The eval said the system wasn't good enough. The architecture changed until it was.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Note on the Stack
&lt;/h2&gt;

&lt;p&gt;None of this is a no-dependencies flex. The judge that scores grounding is a custom evaluator on Microsoft.Extensions.AI.Evaluation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;sealed&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RubricEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rubric&lt;/span&gt; &lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IEvaluator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I lean on the Microsoft stack on purpose. What I keep hand-rolled is the part that decides quality — the retrieval, the fusion, the spoiler gate. The line I draw isn't "no libraries." It's no agent framework hiding the parts that determine whether the thing actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eval-First Development
&lt;/h2&gt;

&lt;p&gt;Traditional software development gives us confidence almost for free. Compilers. Type systems. Unit tests. Integration tests.&lt;/p&gt;

&lt;p&gt;AI systems don't. The difficult part isn't implementing the feature. The difficult part is defining what "correct" means.&lt;/p&gt;

&lt;p&gt;That's why I increasingly think of eval-first development as the AI equivalent of TDD. With traditional software, tests verify the implementation. With AI systems, evals often define the implementation.&lt;/p&gt;

&lt;p&gt;Build the feature first and the eval later, and the eval can only grade what you've already built. Build the eval first and it starts shaping the system itself.&lt;/p&gt;

&lt;p&gt;It defines done. It tells you when you've regressed. And sometimes it forces a better architecture than the one you originally had in mind.&lt;/p&gt;

&lt;p&gt;Otherwise you're not shipping a feature. You're shipping a guess that happened to demo well.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to go deeper on evals? I've written a separate, more hands-on series on building production AI on .NET: &lt;a href="https://vasyl.blog/2026/06/10/what-are-ai-evals/" rel="noopener noreferrer"&gt;what evals actually are&lt;/a&gt;, &lt;a href="https://vasyl.blog/2026/06/10/error-analysis-for-evals/" rel="noopener noreferrer"&gt;error analysis&lt;/a&gt;, &lt;a href="https://vasyl.blog/2026/06/10/golden-datasets-that-dont-lie/" rel="noopener noreferrer"&gt;golden datasets&lt;/a&gt;, &lt;a href="https://vasyl.blog/2026/06/10/llm-as-judge-done-right/" rel="noopener noreferrer"&gt;LLM-as-judge&lt;/a&gt;, and &lt;a href="https://vasyl.blog/2026/06/10/evals-in-ci-and-production/" rel="noopener noreferrer"&gt;evals in CI and production&lt;/a&gt;. This post was originally published on &lt;a href="https://vasyl.blog/2026/06/17/evals-before-rag/" rel="noopener noreferrer"&gt;my blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dotnet</category>
      <category>csharp</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>AI Evals, Part 5: From a Number to a Gate: Evals in CI and Production</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Wed, 17 Jun 2026 17:43:25 +0000</pubDate>
      <link>https://dev.to/mrviduus/ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production-1j33</link>
      <guid>https://dev.to/mrviduus/ai-evals-part-5-from-a-number-to-a-gate-evals-in-ci-and-production-1j33</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 5, the finale, of a series on building production AI on .NET. We've built the pieces — &lt;a href="https://vasyl.blog/what-are-ai-evals/" rel="noopener noreferrer"&gt;what evals are&lt;/a&gt;, &lt;a href="https://vasyl.blog/error-analysis-for-evals/" rel="noopener noreferrer"&gt;error analysis&lt;/a&gt;, &lt;a href="https://vasyl.blog/golden-datasets-that-dont-lie/" rel="noopener noreferrer"&gt;golden datasets&lt;/a&gt;, and a &lt;a href="https://vasyl.blog/llm-as-judge-done-right/" rel="noopener noreferrer"&gt;trustworthy judge&lt;/a&gt;. Now we make them earn their keep.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By now you can produce a defensible quality score for an AI feature. But a score you only &lt;em&gt;look at&lt;/em&gt; is a vanity metric. The entire point of all that work is to make quality something your engineering process &lt;strong&gt;acts on automatically&lt;/strong&gt; — the same way a failing unit test stops a bad commit. That means two homes for your evals: a &lt;strong&gt;gate&lt;/strong&gt; before you ship, and &lt;strong&gt;monitoring&lt;/strong&gt; after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Home 1: CI — a safety net against regressions
&lt;/h2&gt;

&lt;p&gt;Because TextStack's judge is a custom &lt;code&gt;IEvaluator&lt;/code&gt; on Microsoft.Extensions.AI.Evaluation, an eval run is just a &lt;code&gt;dotnet test&lt;/code&gt; run. The MEAI evaluator emits each rubric axis plus an overall score as numeric metrics, and a quality &lt;em&gt;floor&lt;/em&gt; is expressed as a Pass/Fail interpretation on that overall score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In the evaluator: the overall metric is interpreted Pass/Fail against a floor.&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;overallFloor&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;overall&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Interpretation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;EvaluationMetricInterpretation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;RatingFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mean&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mean&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$"floor &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="m"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (mean &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="m"&gt;0.00&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That catches &lt;em&gt;gross&lt;/em&gt; breakage — "something is badly wrong." But the more valuable gate is &lt;strong&gt;relative&lt;/strong&gt;: store a baseline score per feature, and fail the build when a change drops quality by more than a threshold versus that baseline. That turns "did this prompt change help?" into a red/green answer and makes improving a prompt a tight loop — change, run, compare, keep or revert. It's the AI equivalent of TDD.&lt;/p&gt;

&lt;p&gt;Honest status from our codebase: the floor and on-demand runs exist today; the automatic &lt;em&gt;baseline-versus-regression&lt;/em&gt; gate is the next step. I'm flagging that deliberately, because plenty of "we do eval-driven development" claims are really "we have a number nobody gates on." The hard 80% — the measuring instrument — is built; wiring the ratchet is the lighter remaining 20%.&lt;/p&gt;
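
&lt;p&gt;A minimal sketch of what the relative gate and its ratchet could look like, assuming one stored baseline mean per feature. &lt;code&gt;RegressionGate&lt;/code&gt;, its method names, and every threshold are hypothetical; this is not the TextStack implementation, which the paragraph above describes as not yet wired:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

public static class RegressionGate
{
    // Fails when the current mean is below the hard floor, or when it has dropped
    // more than maxDrop below the stored baseline. Thresholds are illustrative.
    public static bool Passes(double currentMean, double baselineMean, double floor, double maxDrop)
    {
        var aboveFloor = currentMean &amp;gt;= floor;
        var withinBaseline = baselineMean - currentMean &amp;lt;= maxDrop;
        return aboveFloor &amp;amp;&amp;amp; withinBaseline;
    }

    // Ratchet: the baseline only moves up, so an accepted improvement becomes the new bar.
    public static double NextBaseline(double currentMean, double baselineMean) =&amp;gt;
        Math.Max(currentMean, baselineMean);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since judge scores are noisy, &lt;code&gt;maxDrop&lt;/code&gt; would need to sit above the run-to-run variance, otherwise the gate flaps on changes that did nothing.&lt;/p&gt;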

&lt;h2&gt;
  
  
  The constraint CI forces: evals cost money
&lt;/h2&gt;

&lt;p&gt;Every eval case is a real generation &lt;strong&gt;plus&lt;/strong&gt; a real judge call. Running the full suite on every commit is slow and expensive, so evals have to be deliberate. TextStack's are &lt;strong&gt;opt-in&lt;/strong&gt;: tagged so default CI skips them, and they self-skip when the provider isn't configured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;… dotnet &lt;span class="nb"&gt;test &lt;/span&gt;tests/TextStack.AiEvals &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="nv"&gt;Category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Eval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default CI stays green and free; the expensive truth runs on purpose. The pragmatic pattern: a small, cheap subset on pull requests for a fast signal, and the full suite nightly or pre-release. Treat eval spend like any cloud cost — budget it, don't let it run unbounded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Home 2: Production — monitoring and guardrails
&lt;/h2&gt;

&lt;p&gt;A curated golden set, however good, is a snapshot of inputs you &lt;em&gt;imagined&lt;/em&gt;. Production sends inputs you didn't. So the offline gate is only half the system; the other half runs against live traffic.&lt;/p&gt;

&lt;p&gt;This is where evals and observability become one thing. Every AI call in TextStack is tagged with its feature and recorded — cost, latency, tokens, errors — and runs persist to an &lt;code&gt;eval_runs&lt;/code&gt; table surfaced on an internal &lt;strong&gt;&lt;code&gt;/ai-quality&lt;/code&gt;&lt;/strong&gt; dashboard (Traces and Evals tabs), with an admin "Run evals" button to trigger the suite on demand. Because the judge is the &lt;em&gt;same&lt;/em&gt; component offline and online, you can sample real outputs per feature and score them with the identical rubric. Two modes fall out of that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Background monitoring&lt;/strong&gt; — sample a slice of live outputs, judge them, and watch the score over time to catch drift before users complain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails&lt;/strong&gt; — for high-stakes outputs, judge &lt;em&gt;in the critical path&lt;/em&gt; and block, retry, or fall back when a result fails. (Use sparingly: it adds a judge call's worth of latency and cost to the request.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The flywheel
&lt;/h2&gt;

&lt;p&gt;Put the two homes together and you get a loop that compounds. Production surfaces a new failure mode → you do error analysis on it → it becomes a new golden case → your gate now defends against it → quality climbs → cleaner output produces cleaner traffic. Each turn makes the next regression harder to ship. That continuous-improvement flywheel — not any single dashboard — is the real product of an eval system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pitfalls
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A number nobody gates on&lt;/strong&gt; — if a bad score can't fail a build or page someone, it's decoration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A fixed floor mistaken for a regression gate&lt;/strong&gt; — a floor catches breakage, not a 2%-worse change. You want both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evals on every commit&lt;/strong&gt; — the bill and the wait will kill the habit; subset on PRs, full suite nightly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline-only&lt;/strong&gt; — you'll ship regressions from inputs your golden set never imagined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails everywhere&lt;/strong&gt; — judging in the critical path is powerful but costs latency; reserve it for outputs that matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Online scores you never read&lt;/strong&gt; — monitoring you don't look at is just a more expensive log.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The series, in one line each
&lt;/h2&gt;

&lt;p&gt;That's the whole discipline, start to finish:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evals are the test suite for non-deterministic code&lt;/strong&gt; — graded judgement over a representative sample.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error analysis comes first&lt;/strong&gt; — read your failures and name them; the taxonomy decides what to measure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The golden set is the ruler&lt;/strong&gt; — representative, leak-free, fresh, and run through the &lt;em&gt;real&lt;/em&gt; prompt and gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The judge is a model too&lt;/strong&gt; — defensive, dedicated, routed, and validated against humans with Cohen's κ.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A score must become a gate&lt;/strong&gt; — CI to catch regressions before ship, monitoring to catch drift after.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of it requires Python or a heavyweight platform. On .NET it's an &lt;code&gt;ILlmService&lt;/code&gt; seam, a golden dataset in JSON, a custom &lt;code&gt;IEvaluator&lt;/code&gt; on Microsoft.Extensions.AI.Evaluation, and an opt-in test category — built on a real product, in production. Done right, evals turn &lt;em&gt;"I think this AI feature is fine"&lt;/em&gt; into &lt;em&gt;"I can prove it, and I'll know the moment it stops being true."&lt;/em&gt; That's the difference between shipping AI and gambling with it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt;, or read the code at &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;github.com/mrviduus/textstack&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evals</category>
      <category>llm</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>AI Evals, Part 4: LLM-as-Judge, Done Right</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Wed, 17 Jun 2026 17:28:22 +0000</pubDate>
      <link>https://dev.to/mrviduus/ai-evals-part-4-llm-as-judge-done-right-31eg</link>
      <guid>https://dev.to/mrviduus/ai-evals-part-4-llm-as-judge-done-right-31eg</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 4 of a series on building production AI on .NET. We've covered &lt;a href="https://vasyl.blog/what-are-ai-evals/" rel="noopener noreferrer"&gt;what evals are&lt;/a&gt;, &lt;a href="https://vasyl.blog/error-analysis-for-evals/" rel="noopener noreferrer"&gt;error analysis&lt;/a&gt;, and &lt;a href="https://vasyl.blog/golden-datasets-that-dont-lie/" rel="noopener noreferrer"&gt;golden datasets&lt;/a&gt;. Now: how do you turn a paragraph into a number you can trust?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You have a golden dataset and your feature's real output for each case. Now you need a score. But you can't &lt;code&gt;assert ==&lt;/code&gt; two paragraphs — there's no single right answer, and exact-match comparison is meaningless for prose. String-similarity metrics (BLEU, ROUGE) don't help either; they reward overlapping words, not correct meaning.&lt;/p&gt;

&lt;p&gt;The pragmatic answer the field has converged on is &lt;strong&gt;LLM-as-judge&lt;/strong&gt;: use a second, capable model to read the reference and the actual output and score it against a rubric. It's powerful, it scales, and — handled carelessly — it will hand you confident, biased numbers that feel rigorous and aren't. This post is about doing it right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The basic shape
&lt;/h2&gt;

&lt;p&gt;A judge takes the rubric and an &lt;em&gt;evidence&lt;/em&gt; block (the inputs, the reference answer, and the model's actual output), and returns a structured verdict. In TextStack the judge is one feature-agnostic component built on &lt;a href="https://learn.microsoft.com/dotnet/ai/conceptual/evaluation-libraries" rel="noopener noreferrer"&gt;Microsoft.Extensions.AI.Evaluation&lt;/a&gt; — Microsoft's official .NET evaluation library — implemented as a custom &lt;code&gt;IEvaluator&lt;/code&gt;. The core is a single judge call asking for strict JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
    &lt;span class="s"&gt;"You are a strict, fair evaluator of an AI feature's output. "&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;
    &lt;span class="s"&gt;"Score each of three dimensions on an integer scale 1-5 (5 = excellent, 1 = poor):\n"&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;
    &lt;span class="s"&gt;$"- d1 = &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dim1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;\n- d2 = &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dim2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;\n- d3 = &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dim3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;\n"&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;
    &lt;span class="s"&gt;"Return ONLY strict JSON: {\"d1\": int, \"d2\": int, \"d3\": int, \"rationale\": \"...\"}"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rubric is a &lt;strong&gt;parameter, not a hardcode&lt;/strong&gt; — three named axes passed in per feature. That's what lets one judge score Explain, Translate, distractors, and book metadata, each on the dimensions its own error analysis surfaced (Explain → accuracy / conciseness / usefulness; Translate → accuracy / fluency / register; and so on). One judge, many rubrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things that separate a toy judge from a production one
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Parse defensively.&lt;/strong&gt; Judges wrap their JSON in prose or code fences no matter how firmly you forbid it. Don't trust the whole string; take everything from the first &lt;code&gt;{&lt;/code&gt; to the last &lt;code&gt;}&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IndexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'{'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;LastIndexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'}'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;JudgeScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"unparseable: no JSON object"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fail to a number, not an exception.&lt;/strong&gt; An unparseable or failed judge call returns a zero score with the reason attached, which drags the run's mean &lt;em&gt;down&lt;/em&gt; instead of crashing it. A judge that silently throws is worse than one that scores zero — the zero is a visible signal you can investigate.&lt;/p&gt;
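
&lt;p&gt;Put together, the two rules above could look like the following sketch. It assumes &lt;code&gt;JudgeScore&lt;/code&gt; is a record with three integer scores and a rationale (inferred from the four-argument constructor call above) and uses &lt;code&gt;System.Text.Json&lt;/code&gt;. The &lt;code&gt;JudgeParser&lt;/code&gt; class and its field names are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical record matching the four-argument constructor used in the post.
public sealed record JudgeScore(int D1, int D2, int D3, string Rationale);

public static class JudgeParser
{
    public static JudgeScore Parse(string raw)
    {
        // Keep only the span from the first '{' to the last '}'.
        var start = raw.IndexOf('{');
        var end = raw.LastIndexOf('}');
        if (start &amp;lt; 0 || end &amp;lt;= start)
            return new JudgeScore(0, 0, 0, "unparseable: no JSON object");

        try
        {
            using var doc = JsonDocument.Parse(raw.Substring(start, end - start + 1));
            var root = doc.RootElement;
            var rationale =
                root.TryGetProperty("rationale", out var r) &amp;amp;&amp;amp; r.ValueKind == JsonValueKind.String
                    ? r.GetString() ?? ""
                    : "";
            return new JudgeScore(
                root.GetProperty("d1").GetInt32(),
                root.GetProperty("d2").GetInt32(),
                root.GetProperty("d3").GetInt32(),
                rationale);
        }
        catch (Exception ex) when (ex is JsonException or KeyNotFoundException
                                       or InvalidOperationException or FormatException)
        {
            // Malformed JSON, a missing key, or a non-integer score becomes a visible zero.
            return new JudgeScore(0, 0, 0, "unparseable: " + ex.Message);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The rationale string doubles as the failure reason, so a zero in the run report says why it is a zero.&lt;/p&gt;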

&lt;p&gt;&lt;strong&gt;Use a dedicated, stronger judge — and route it like everything else.&lt;/strong&gt; The model that &lt;em&gt;judges&lt;/em&gt; should be more capable than the models that &lt;em&gt;generate&lt;/em&gt;. TextStack generates features on small, cheap models but judges with a &lt;code&gt;gpt-4.1&lt;/code&gt;-class model. And the judge call carries the same &lt;code&gt;eval.judge&lt;/code&gt; feature tag and flows through the same gateway as production traffic, so it's traced and cost-accounted like any other call. Evaluating is itself an AI feature; treat it like one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The biases that quietly wreck your judge
&lt;/h2&gt;

&lt;p&gt;This is the part that separates people who &lt;em&gt;use&lt;/em&gt; an LLM judge from people who can &lt;em&gt;trust&lt;/em&gt; one. A judge is a language model, and it brings model-shaped biases to grading. Ignore them and your scores are precise and wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Position bias.&lt;/strong&gt; In pairwise comparisons ("is A or B better?"), judges favour whichever answer appears first (sometimes second) regardless of content. &lt;em&gt;Mitigation:&lt;/em&gt; run each comparison both ways and average, or randomise order and watch the swap rate.&lt;/p&gt;
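
&lt;p&gt;One way to sketch the "run it both ways" mitigation. The &lt;code&gt;judge&lt;/code&gt; delegate stands in for a real pairwise judge call, and &lt;code&gt;Winner&lt;/code&gt; and &lt;code&gt;PairwiseJudge&lt;/code&gt; are hypothetical names. This variant treats a verdict that flips with the order as a tie, which is the categorical counterpart of averaging two numeric scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;

public enum Winner { First, Second, Tie }

public static class PairwiseJudge
{
    // judge(x, y) reports which of the two answers it prefers, by position shown.
    // The result is expressed in terms of (a, b): First means a wins, Second means b wins.
    public static Winner CompareBothWays(string a, string b, Func&amp;lt;string, string, Winner&amp;gt; judge)
    {
        var forward = judge(a, b);         // a shown first
        var backward = Flip(judge(b, a));  // b shown first, mapped back to (a, b) terms
        // The same verdict in both orders is a real preference; a flipped verdict is a tie.
        return forward == backward ? forward : Winner.Tie;
    }

    private static Winner Flip(Winner w) =&amp;gt; w switch
    {
        Winner.First =&amp;gt; Winner.Second,
        Winner.Second =&amp;gt; Winner.First,
        _ =&amp;gt; Winner.Tie,
    };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Counting how often &lt;code&gt;forward&lt;/code&gt; and &lt;code&gt;backward&lt;/code&gt; disagree gives the swap rate mentioned above, at the cost of two judge calls per comparison.&lt;/p&gt;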

&lt;p&gt;&lt;strong&gt;Verbosity bias.&lt;/strong&gt; Judges reliably prefer longer, more elaborate answers even when the extra words add nothing — actively harmful for a feature like Explain whose rubric &lt;em&gt;demands&lt;/em&gt; conciseness. &lt;em&gt;Mitigation:&lt;/em&gt; name length explicitly in the rubric and watch for score creeping up with token count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-preference bias.&lt;/strong&gt; A judge scores text from its own model family higher. I'll be concrete about where TextStack sits here: features generated on a local model (distractors, book metadata) are judged cross-family by OpenAI — good, that's independent. But Explain and Translate are generated &lt;em&gt;and&lt;/em&gt; judged within the OpenAI family (different sizes — &lt;code&gt;gpt-4.1-nano&lt;/code&gt; to generate, &lt;code&gt;gpt-4.1&lt;/code&gt; to judge — but the same lineage), so some self-preference is still in play. The honest read: the absolute number is treated as soft; the &lt;em&gt;deltas between runs&lt;/em&gt; are what we trust. A fully independent second judge is on the roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sycophancy and scale compression.&lt;/strong&gt; Judges drift toward agreeable, middling scores, clustering around 3–4 on a 1–5 scale and flattening your signal. &lt;em&gt;Mitigation:&lt;/em&gt; anchor each dimension with a concrete description (not just a one-word label), always give the judge the reference answer as a yardstick, and consider a coarser scale if the judge can't use the full range reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your judge needs its own eval
&lt;/h2&gt;

&lt;p&gt;Here's the step almost everyone skips: &lt;strong&gt;validate the judge against humans.&lt;/strong&gt; You wouldn't ship a feature on an unvalidated model, and a judge &lt;em&gt;is&lt;/em&gt; a model — so prove it agrees with human judgement before you trust its scores.&lt;/p&gt;

&lt;p&gt;Hand-label a sample of outputs yourself, then measure agreement between you and the judge. The right metric is &lt;strong&gt;inter-rater agreement&lt;/strong&gt;: Cohen's κ (kappa), which corrects for the agreement you'd get by chance. Raw percent-agreement is the wrong one, because it flatters you when scores cluster. As a rule of thumb, a judge at κ ≥ 0.6 against human labels is usable; near zero means it's rolling dice and your whole pipeline is theatre. Re-check it whenever you change the judge model or the rubric.&lt;/p&gt;
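
&lt;p&gt;For reference, κ is (p&lt;sub&gt;o&lt;/sub&gt; − p&lt;sub&gt;e&lt;/sub&gt;) / (1 − p&lt;sub&gt;e&lt;/sub&gt;), where p&lt;sub&gt;o&lt;/sub&gt; is the observed agreement and p&lt;sub&gt;e&lt;/sub&gt; is the agreement expected by chance. A minimal sketch of the unweighted form, which treats labels as unordered categories; the &lt;code&gt;Agreement&lt;/code&gt; class name is hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

public static class Agreement
{
    // Cohen's kappa for two raters who labelled the same items, in the same order.
    public static double CohensKappa(IReadOnlyList&amp;lt;int&amp;gt; human, IReadOnlyList&amp;lt;int&amp;gt; judge)
    {
        if (human.Count != judge.Count || human.Count == 0)
            throw new ArgumentException("Both raters must label the same non-empty set of items.");

        double n = human.Count;
        // po: observed fraction of items on which the two raters agree.
        var po = human.Zip(judge, (h, j) =&amp;gt; h == j ? 1 : 0).Sum() / n;
        // pe: agreement expected by chance, from each rater's own label frequencies.
        var pe = human.Concat(judge).Distinct()
            .Sum(label =&amp;gt; (human.Count(h =&amp;gt; h == label) / n) * (judge.Count(j =&amp;gt; j == label) / n));

        // If both raters only ever use one identical label, pe is 1 and kappa is undefined.
        return pe == 1.0 ? double.NaN : (po - pe) / (1 - pe);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With human labels 1, 1, 0, 0 and judge labels 1, 1, 0, 1, raw agreement is 0.75, chance agreement is 0.5, and κ comes out at 0.5: noticeably less flattering than the raw figure.&lt;/p&gt;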

&lt;p&gt;There's a design subtlety worth applying here: treat the &lt;em&gt;judge prompt itself&lt;/em&gt; as something you iterate on against a labelled split. Tune the judge prompt on one slice of human-labelled cases, validate κ on a held-out slice — exactly the train/test discipline from the last post, applied one level up. The judge is software; it deserves the same rigour as the feature it grades.&lt;/p&gt;

&lt;p&gt;This closes a loop people miss. The golden set evaluates the feature; a human-labelled slice evaluates the judge. Skip the second and you've just moved your trust problem one level up and hidden it from yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pitfalls
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trusting an unvalidated judge&lt;/strong&gt; — measure κ against human labels or it's theatre.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same model generating and judging&lt;/strong&gt; — self-preference inflates the score; prefer a different (ideally cross-family) judge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A weak judge model&lt;/strong&gt; — the judge should be &lt;em&gt;more&lt;/em&gt; capable than the generator, not the same one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring position/verbosity bias&lt;/strong&gt; — randomise order, penalise padding, anchor the rubric.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-word rubric axes&lt;/strong&gt; — "accuracy" alone means different things to the model each run; describe it concretely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throwing on a bad verdict&lt;/strong&gt; — score it zero and surface it; don't let one parse failure kill the run.&lt;/li&gt;
&lt;/ul&gt;
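
&lt;p&gt;The last pitfall is the easiest to show in code. This is a sketch only: it assumes the judge replies with a JSON object such as &lt;code&gt;{"score": 4}&lt;/code&gt;, and the &lt;code&gt;Verdict&lt;/code&gt; and &lt;code&gt;VerdictParser&lt;/code&gt; names are hypothetical. Anything malformed becomes a zero with the reason attached, so the run continues and the failure stays visible.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Text.Json;

// Hypothetical types; the JSON shape {"score": n} is an assumption.
public record Verdict(int Score, string? Error);

public static class VerdictParser
{
    // Returns a zero score with a reason instead of throwing,
    // so one malformed judge reply cannot abort the whole eval run.
    public static Verdict Parse(string raw)
    {
        try
        {
            using var doc = JsonDocument.Parse(raw);
            var root = doc.RootElement;

            if (root.ValueKind == JsonValueKind.Object
                &amp;amp;&amp;amp; root.TryGetProperty("score", out var s)
                &amp;amp;&amp;amp; s.ValueKind == JsonValueKind.Number
                &amp;amp;&amp;amp; s.TryGetInt32(out var score)
                &amp;amp;&amp;amp; score is &amp;gt;= 1 and &amp;lt;= 5)
            {
                return new Verdict(score, null);
            }

            return new Verdict(0, "missing or out-of-range score");
        }
        catch (JsonException ex)
        {
            // Not JSON at all (for example, the judge answered in prose).
            return new Verdict(0, $"unparseable verdict: {ex.Message}");
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Counting how many verdicts carry a non-null &lt;code&gt;Error&lt;/code&gt; gives a parse-failure rate to report alongside the score, which is one way to "surface it".&lt;/p&gt;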

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;LLM-as-judge is the only practical way to score prose at scale, but a judge is a model with a model's biases — so build it like production code (defensive parsing, a dedicated stronger model, routed and traced) and validate it like a model (human labels, Cohen's κ, a tuned-and-tested judge prompt). Do that and your scores mean something. Skip it and you've automated the production of confident nonsense.&lt;/p&gt;

&lt;p&gt;Next, and last in the series: &lt;strong&gt;from a number to a gate&lt;/strong&gt; — wiring evals into CI and online monitoring so quality regressions turn the build red, on Microsoft.Extensions.AI.Evaluation, without bankrupting your pipeline.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt;, or read the code at &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;github.com/mrviduus/textstack&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evals</category>
      <category>llm</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>AI Evals, Part 3: Golden Datasets That Don't Lie</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Tue, 16 Jun 2026 21:28:24 +0000</pubDate>
      <link>https://dev.to/mrviduus/ai-evals-part-3-golden-datasets-that-dont-lie-3fog</link>
      <guid>https://dev.to/mrviduus/ai-evals-part-3-golden-datasets-that-dont-lie-3fog</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 3 of a series on building production AI on .NET. &lt;a href="https://vasyl.blog/what-are-ai-evals/" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; was the overview; &lt;a href="https://vasyl.blog/error-analysis-for-evals/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; was error analysis. Now we turn the failure taxonomy you built into something you can measure against — without quietly fooling yourself.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A golden dataset is a set of representative inputs, each paired with a reference answer a knowledgeable human would accept. It's the ruler you hold every model output against. And it is, in my experience, the single most important and most neglected asset in an eval pipeline — because a sloppy ruler doesn't announce itself. Your scores still come out green. They're just measuring the wrong thing.&lt;/p&gt;

&lt;p&gt;This post is about building a golden set that tells the truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it looks like in practice
&lt;/h2&gt;

&lt;p&gt;In TextStack, each AI feature has ~30 hand-curated cases stored as plain JSON, loaded at runtime into a typed record that mirrors exactly what the production endpoint receives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;record&lt;/span&gt; &lt;span class="nc"&gt;ExplainGolden&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;Genre&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;TargetLang&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;ExpectedExplanation&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plain JSON on disk, deserialised case-insensitively. No database, no platform lock-in — the dataset is a checked-in artifact you can diff in code review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;goldens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GoldenData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ExplainGolden&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"explain.json"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
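
&lt;p&gt;The post doesn't show &lt;code&gt;GoldenData.Load&lt;/code&gt; itself. One plausible shape, assuming &lt;code&gt;System.Text.Json&lt;/code&gt; and a &lt;code&gt;Goldens&lt;/code&gt; folder copied next to the test binary (both the folder name and the &lt;code&gt;Parse&lt;/code&gt; helper are assumptions), might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Sketch only: the real loader may differ.
public static class GoldenData
{
    // Lets "word" in the JSON bind to the record's Word parameter.
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNameCaseInsensitive = true
    };

    // Parses a JSON array of cases into typed records.
    public static IReadOnlyList&amp;lt;T&amp;gt; Parse&amp;lt;T&amp;gt;(string json) =&amp;gt;
        JsonSerializer.Deserialize&amp;lt;List&amp;lt;T&amp;gt;&amp;gt;(json, Options)
            ?? throw new InvalidDataException("Golden file did not contain a JSON array.");

    // Reads the file from an assumed "Goldens" folder beside the test binary.
    public static IReadOnlyList&amp;lt;T&amp;gt; Load&amp;lt;T&amp;gt;(string fileName) =&amp;gt;
        Parse&amp;lt;T&amp;gt;(File.ReadAllText(Path.Combine(AppContext.BaseDirectory, "Goldens", fileName)));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Splitting &lt;code&gt;Parse&lt;/code&gt; from &lt;code&gt;Load&lt;/code&gt; keeps the deserialisation testable without touching the disk.&lt;/p&gt;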



&lt;p&gt;The format is the easy part. The honesty is in four properties of the &lt;em&gt;content&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Representativeness — mirror reality, not the demo
&lt;/h2&gt;

&lt;p&gt;Your set should reflect the real distribution of inputs your feature meets in production, including the hard, weird, and adversarial cases. This is where Part 2 pays off: the failure taxonomy tells you which kinds of input break things, so you deliberately stock the set with them.&lt;/p&gt;

&lt;p&gt;The opposite — a set of only easy, happy-path cases — is the most common way an eval lies. The model aces them, your average climbs, and meanwhile the inputs that actually matter never get measured. Stratify on purpose: domains, lengths, languages, edge cases. For TextStack's Explain set that means technical passages &lt;em&gt;and&lt;/em&gt; casual prose, common words &lt;em&gt;and&lt;/em&gt; rare ones, several target languages — not thirty variations of the same easy lookup.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Reference quality — the ceiling you measure against
&lt;/h2&gt;

&lt;p&gt;The reference answer defines what "good" means for that case, so a lazy reference caps the meaning of your whole score. If the reference for explaining &lt;em&gt;idempotent&lt;/em&gt; is a paraphrased dictionary entry, your judge will happily reward dictionary entries — the exact failure mode you were trying to eliminate.&lt;/p&gt;

&lt;p&gt;References should be written or vetted by someone who understands the domain. For Explain, that means genuinely good in-context explanations: what the word means &lt;em&gt;here&lt;/em&gt;, in &lt;em&gt;this&lt;/em&gt; sentence, the way you'd want it explained to you. The reference is the bar; set it where you actually want the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Leakage — keep a real train/test split
&lt;/h2&gt;

&lt;p&gt;Here's the subtle statistical sin. If you tune your prompt against the same cases you score against, you're overfitting to the test, and your number is fiction — you've optimised for those thirty examples, not for the feature. It's the prompt-engineering version of training on your test set.&lt;/p&gt;

&lt;p&gt;Keep a slice you never look at while iterating. Tune on one part; report on the held-out part. This feels heavy for thirty cases, but the discipline is what keeps the score meaningful as you iterate. The split is just as real for prompts as it is for model weights.&lt;/p&gt;
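
&lt;p&gt;One way to keep the split honest is to make it deterministic, so a case can never wander between the tuning slice and the held-out slice. The sketch below is an assumption rather than TextStack's code: &lt;code&gt;GoldenSplit&lt;/code&gt;, the 30% default, and the idea of keying on a case's own text are all illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for assigning golden cases to a held-out slice.
public static class GoldenSplit
{
    // Uses SHA-256 so the assignment is stable across runs and machines;
    // string.GetHashCode() is randomised per process and would reshuffle the split.
    public static bool IsHeldOut(string caseKey, int heldOutPercent = 30)
    {
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(caseKey));

        // Map the first four bytes to a bucket in 0..99.
        return BitConverter.ToUInt32(hash, 0) % 100 &amp;lt; heldOutPercent;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For Explain, the key could be the word and sentence joined together. With only thirty cases the held-out slice will be approximately, not exactly, 30%, so it's worth checking the resulting sizes once.&lt;/p&gt;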

&lt;h2&gt;
  
  
  4. Size and freshness — a floor, and a living asset
&lt;/h2&gt;

&lt;p&gt;Thirty cases is a deliberate &lt;em&gt;floor&lt;/em&gt;, not a target: enough to catch gross regressions cheaply, small enough to run often and to keep every reference high quality. (It's statistically thin for detecting small changes — that's the next post's problem.) More important than size is that the set is &lt;strong&gt;alive&lt;/strong&gt;: every new failure mode you find in production should earn a new case. A golden set that never changes slowly stops resembling reality, and a stale ruler is a lying ruler.&lt;/p&gt;

&lt;p&gt;When you genuinely lack real examples — a brand-new feature with no traffic — you can bootstrap with &lt;em&gt;synthetic&lt;/em&gt; cases (have a strong model generate realistic inputs across your taxonomy's dimensions). It's a legitimate starting point, but treat it as scaffolding: replace synthetic cases with real ones as traffic arrives, because real users are more creative than any generator.&lt;/p&gt;

&lt;h2&gt;
  
  
  The silent killer: dataset drift from production
&lt;/h2&gt;

&lt;p&gt;Now the trap that quietly invalidates an otherwise perfect golden set, and the one I'd most want a reviewer to check for.&lt;/p&gt;

&lt;p&gt;You write your feature's prompt in the API endpoint. You write the eval, and — naturally — you write the prompt &lt;em&gt;again&lt;/em&gt; in the test. Two copies. Someone tweaks the production prompt for a hotfix and doesn't touch the test copy. From that moment your eval measures a prompt &lt;strong&gt;that no longer exists in production&lt;/strong&gt;. The score stays green; the product changed underneath it. Nobody notices, because the test reports with total confidence.&lt;/p&gt;

&lt;p&gt;The fix is structural, not disciplinary: extract the prompt into one builder that &lt;em&gt;both&lt;/em&gt; production and the eval call. There is no second copy to drift.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Built once, called by BOTH the endpoint and the eval — they cannot disagree.&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExplainPrompt&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="nf"&gt;BuildSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;targetLang&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="nf"&gt;BuildUserPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The eval's case-to-request mapping wires that shared builder straight in, and crucially the request goes through the &lt;strong&gt;same model gateway&lt;/strong&gt; production uses, selected by the feature's tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;LlmRequest&lt;/span&gt; &lt;span class="nf"&gt;ToRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ExplainGolden&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SystemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ExplainPrompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;BuildSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Genre&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TargetLang&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;LlmMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExplainPrompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;BuildUserPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sentence&lt;/span&gt;&lt;span class="p"&gt;))],&lt;/span&gt;
    &lt;span class="n"&gt;MaxOutputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;FeatureTag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"explain"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// same routing, same model, same path as prod&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you remember one thing from this post: &lt;strong&gt;an eval that runs a copy of the prompt is worse than no eval, because it manufactures false confidence.&lt;/strong&gt; Same prompt, same gateway, same path — or you're measuring a ghost.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pitfalls
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A happy-path-only set&lt;/strong&gt; — the score rises while the product falls. Stock it from your failure taxonomy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weak reference answers&lt;/strong&gt; — they cap your score's meaning and can reward the very failure you're chasing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Train/test leakage&lt;/strong&gt; — tuning and scoring on the same cases overfits to fiction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A frozen set&lt;/strong&gt; — inputs drift; a dataset that never grows slowly measures a product that no longer exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic-forever&lt;/strong&gt; — fine to bootstrap, dangerous to rely on; real traffic is weirder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A duplicated prompt&lt;/strong&gt; — the drift trap. One shared builder, through the real gateway.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;A golden dataset is not a formality you generate once and forget. It's a carefully curated, honestly-split, continuously-refreshed ruler — and it has to run the &lt;em&gt;real&lt;/em&gt; prompt through the &lt;em&gt;real&lt;/em&gt; path or it measures nothing. Get the dataset right and every downstream number means something. Get it wrong and you've built an instrument that lies to you in green.&lt;/p&gt;

&lt;p&gt;Next in the series: &lt;strong&gt;LLM-as-judge, done right&lt;/strong&gt; — how to turn a paragraph into a trustworthy number, the biases that wreck judges, and why your judge needs its own eval.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt;, or read the code at &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;github.com/mrviduus/textstack&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evals</category>
      <category>llm</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>AI Evals, Part 2: Error Analysis, the Unglamorous Superpower Behind Good Evals</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Fri, 12 Jun 2026 22:46:23 +0000</pubDate>
      <link>https://dev.to/mrviduus/ai-evals-part-2-error-analysis-the-unglamorous-superpower-behind-good-evals-4k2h</link>
      <guid>https://dev.to/mrviduus/ai-evals-part-2-error-analysis-the-unglamorous-superpower-behind-good-evals-4k2h</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of a series on building production AI on .NET. &lt;a href="https://vasyl.blog/what-are-ai-evals/" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; covered what evals are and the Analyze → Measure → Improve lifecycle. This post is about the step everyone wants to skip: &lt;strong&gt;Analyze&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When a team decides to "take evals seriously," the first thing they usually do is wrong. They open a dashboard tool, wire up a generic "correctness" score, and watch a number. It feels productive. It produces a chart. And it tells them almost nothing, because they skipped the step that decides &lt;em&gt;what the chart should even measure.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That step is &lt;strong&gt;error analysis&lt;/strong&gt;: reading your AI's actual outputs and naming, precisely, the ways they go wrong. It's unglamorous — no library, no dashboard, just you and a few dozen real examples. It is also, by a wide margin, the highest-leverage thing you will do in evals: error analysis is where the signal comes from. Everything downstream is just operationalising what you find here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you can't skip straight to metrics
&lt;/h2&gt;

&lt;p&gt;There's a gap between you and your running system that's easy to underestimate. Thousands of inputs flow through your AI feature daily, in shapes you never anticipated, and you have no realistic way to &lt;em&gt;see&lt;/em&gt; them at scale. Call it the &lt;strong&gt;comprehension gap&lt;/strong&gt; — the distance between the developer and a true understanding of what the data and the model are actually doing.&lt;/p&gt;

&lt;p&gt;Metrics don't bridge that gap; they presuppose it's already bridged. To measure "conciseness" you must first have &lt;em&gt;noticed&lt;/em&gt; that verbosity is a failure mode worth caring about. If you pick your metrics before you've read your data, you're measuring your assumptions, not your product. The classic result: a dashboard glowing green while users quietly churn over a problem your metrics were never designed to catch.&lt;/p&gt;

&lt;p&gt;Error analysis is how you cross the gap. You trade scale for truth: you can't read everything, so you read a &lt;em&gt;sample&lt;/em&gt;, carefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  How error analysis actually works
&lt;/h2&gt;

&lt;p&gt;It's a three-move loop, and the moves are deliberately low-tech.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Get a starting dataset and read it.&lt;/strong&gt; Pull a sample of real (or realistic) outputs — 50 to 100 is plenty to start. Not the happy-path demo cases; the real distribution, including the weird inputs. Then actually read them. Slowly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Open-code the failures.&lt;/strong&gt; For each output that's wrong, write a short, free-text note describing &lt;em&gt;what specifically is wrong&lt;/em&gt; — in your own words, no fixed categories yet. "Explained the word using a dictionary definition instead of the meaning it has in this sentence." "Translation is correct but the tone is far too formal for a casual chat." "The quiz distractor is so obviously wrong it gives the answer away." This is open coding: you're labelling reality, not forcing it into boxes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cluster the notes into a taxonomy.&lt;/strong&gt; Once you have 40–50 notes, patterns emerge. Group them. Those groups are your &lt;strong&gt;failure taxonomy&lt;/strong&gt; — a ranked list of &lt;em&gt;how your feature fails&lt;/em&gt;, with rough frequencies. Now you know what to fix first (the common, severe modes) and, crucially, &lt;em&gt;what your metrics should measure.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's the whole secret. The taxonomy is the output, and it's worth more than any single score, because every later step — the rubric, the golden set, the judge — is downstream of it.&lt;/p&gt;
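
&lt;p&gt;The clustering itself is human work, but once each note has a category, the ranking is a few lines. This sketch assumes a hypothetical &lt;code&gt;FailureNote&lt;/code&gt; record and ranks by frequency only; severity still needs a human call.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System.Collections.Generic;
using System.Linq;

// Hypothetical shape: one open-coded note, plus the category assigned while clustering.
public record FailureNote(string OutputId, string Note, string Category);

public static class Taxonomy
{
    // Groups notes by category and returns them most common first,
    // with each category's share of all recorded failures.
    public static List&amp;lt;(string Category, int Count, double Share)&amp;gt; Rank(
        IReadOnlyCollection&amp;lt;FailureNote&amp;gt; notes) =&amp;gt;
        notes.GroupBy(n =&amp;gt; n.Category)
             .Select(g =&amp;gt; (Category: g.Key, Count: g.Count(), Share: (double)g.Count() / notes.Count))
             .OrderByDescending(x =&amp;gt; x.Count)
             .ToList();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the free-text &lt;code&gt;Note&lt;/code&gt; alongside the category matters: the category tells you what to measure, and the note tells you why.&lt;/p&gt;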

&lt;h2&gt;
  
  
  A mindset note: be a detective, not a judge (yet)
&lt;/h2&gt;

&lt;p&gt;The hard part of error analysis isn't mechanical, it's psychological. You will be tempted to immediately assign a 1–5 score, or to jump to "the fix is to add a line to the prompt." Resist both. Scoring too early collapses rich information ("it's a 2") into a number that hides &lt;em&gt;why&lt;/em&gt;. Fixing too early means you patch the first failure you see instead of the most common one.&lt;/p&gt;

&lt;p&gt;Stay descriptive for as long as you can. Your only job in this phase is to understand and categorise. Judgement and repair come later.&lt;/p&gt;

&lt;p&gt;A second trap is doing it alone. When two people label the same outputs, they disagree — and the disagreements are gold, because they reveal that "good" isn't actually defined yet. A short alignment session to resolve them sharpens your definition of quality before you bake it into a rubric. (Solo founders can approximate this by labelling, sleeping on it, and re-labelling cold.)&lt;/p&gt;

&lt;h2&gt;
  
  
  How error analysis shaped TextStack's evals
&lt;/h2&gt;

&lt;p&gt;This isn't abstract for us. TextStack has seven AI surfaces, and every rubric we score against came directly out of reading failures, not out of a generic template.&lt;/p&gt;

&lt;p&gt;Take &lt;strong&gt;Explain&lt;/strong&gt; (tap a word, get a short in-context explanation). Reading real outputs surfaced a recurring failure: the model would produce a competent &lt;em&gt;dictionary&lt;/em&gt; definition while ignoring the sentence the reader was actually looking at — useless for someone trying to understand &lt;em&gt;this&lt;/em&gt; passage. That single observation is why the Explain rubric scores &lt;strong&gt;accuracy in context&lt;/strong&gt; and &lt;strong&gt;usefulness to a learner&lt;/strong&gt; as distinct axes, and explicitly penalises dictionary boilerplate under &lt;strong&gt;conciseness&lt;/strong&gt;. The rubric is a direct transcription of the taxonomy.&lt;/p&gt;

&lt;p&gt;Other surfaces produced different taxonomies, and therefore different axes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Translate&lt;/strong&gt; kept failing on &lt;em&gt;register&lt;/em&gt; — accurate but wrong formality — so register became its own scored dimension alongside accuracy and fluency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vocabulary distractors&lt;/strong&gt; (wrong answers in a quiz) failed by being &lt;em&gt;implausible&lt;/em&gt; (too obviously wrong) or &lt;em&gt;too similar&lt;/em&gt; to the right answer, so the rubric scores plausibility, distinctness, and difficulty.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We didn't invent those dimensions in a meeting. We read outputs until the dimensions were obvious. And because every AI call is traced and viewable on an internal &lt;code&gt;/ai-quality&lt;/code&gt; page, error analysis isn't a one-time exercise — new production failures keep feeding new categories back into the taxonomy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pitfalls
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scoring before describing.&lt;/strong&gt; A number erases the &lt;em&gt;why&lt;/em&gt;. Open-code in words first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vague categories.&lt;/strong&gt; "Bad output" isn't a category; "ignored the sentence context" is. Specific enough to act on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too small a sample, or only the easy cases.&lt;/strong&gt; If you only read successes, you'll conclude everything is fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixing during analysis.&lt;/strong&gt; Note the failure, move on. Triage &lt;em&gt;after&lt;/em&gt; you can see the whole picture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Labelling solo with no calibration.&lt;/strong&gt; Disagreement is information; surface it before it hardens into a bad rubric.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Doing it once.&lt;/strong&gt; Inputs drift. The taxonomy is a living document, refreshed from real traffic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Error analysis is the part of evals with no tooling, no dashboard, and the highest payoff — and that's exactly why it gets skipped. Read your failures, name them in plain language, and cluster them into a taxonomy. That taxonomy tells you what to fix and what to measure. Skip it and you'll build a beautiful measurement system pointed at the wrong target.&lt;/p&gt;

&lt;p&gt;Next in the series: &lt;strong&gt;golden datasets that don't lie&lt;/strong&gt; — turning your taxonomy into a curated set of cases you can score against, without quietly fooling yourself.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt;, or read the code at &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;github.com/mrviduus/textstack&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evals</category>
      <category>llm</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>AI Evals, Explained: How We Actually Know Our AI Is Any Good</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Wed, 10 Jun 2026 15:10:15 +0000</pubDate>
      <link>https://dev.to/mrviduus/ai-evals-explained-how-we-actually-know-our-ai-is-any-good-23hj</link>
      <guid>https://dev.to/mrviduus/ai-evals-explained-how-we-actually-know-our-ai-is-any-good-23hj</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 1 of a series on building production AI on .NET — drawn from &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;TextStack&lt;/a&gt;, a reader with seven shipping AI features.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can build an AI feature in an afternoon. Wiring up an API call and a prompt is genuinely easy now. The hard part — the part that separates a demo from a product — is answering one deceptively simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it any good? And did my last change make it better or worse?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For normal code, that question has a normal answer: a test suite. &lt;code&gt;Add(2, 2)&lt;/code&gt; should return &lt;code&gt;4&lt;/code&gt;; if it doesn't, the build goes red. But an AI feature doesn't return &lt;code&gt;4&lt;/code&gt;. Ask it to explain a word and it returns a &lt;em&gt;paragraph&lt;/em&gt; — a slightly different paragraph every single time, and "correct" is a whole range of good answers, not one. You cannot write &lt;code&gt;Assert.Equal&lt;/code&gt; against prose. The thing software engineering relies on most — a fast, automatic signal that something broke — is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evals are how you get that signal back.&lt;/strong&gt; This post is a plain-English introduction to what they are and how we actually run them in production. No hype, no notebooks — just the mental model and a real implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  So what &lt;em&gt;is&lt;/em&gt; an eval?
&lt;/h2&gt;

&lt;p&gt;Strip away the jargon and an eval is just &lt;strong&gt;a systematic way to measure the quality of an AI output.&lt;/strong&gt; Where a unit test gives you pass/fail by exact match, an eval gives you a &lt;em&gt;graded judgement&lt;/em&gt; over a representative sample of inputs. Instead of "is this exactly right?" it asks "across 30 realistic cases, how good is this, on the axes I care about?"&lt;/p&gt;

&lt;p&gt;That measurement gets used in three different places, and it helps to keep them separate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;As monitoring&lt;/strong&gt; — you score a sample of real traffic over time, to catch quality silently drifting downward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;As a guardrail&lt;/strong&gt; — you score an output &lt;em&gt;before&lt;/em&gt; the user sees it, and block or retry if it fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;As a ruler for improvement&lt;/strong&gt; — you score before and after a change, so "did this prompt edit help?" finally has an answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams want the third one first and never build it. That's the gap this series is about.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lifecycle: Analyze → Measure → Improve
&lt;/h2&gt;

&lt;p&gt;The most useful framing I've found is to treat evaluation as a loop of &lt;strong&gt;Analyze, Measure, Improve.&lt;/strong&gt; It's worth internalising because it stops you from doing the steps in the wrong order — which is the single most common mistake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Analyze — look at your failures before you measure anything.&lt;/strong&gt;&lt;br&gt;
The instinct is to jump straight to a metrics dashboard. Resist it. The highest-leverage activity in all of evals is boring: take 50–100 real outputs, read them, and label &lt;em&gt;how&lt;/em&gt; each one is wrong. Not a score — a category. "Restated the dictionary definition instead of using the sentence's context." "Translation was accurate but too formal." You cluster these into a &lt;em&gt;failure taxonomy&lt;/em&gt;, and that's what tells you which dimensions are even worth measuring. Skip this and you'll confidently measure the wrong things while users churn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Measure — turn those failure modes into a repeatable number.&lt;/strong&gt;&lt;br&gt;
This is where the golden dataset and the LLM judge come in (the next two posts go deep on each). In short: you assemble a set of representative inputs with reference answers, run your feature over them, and have a second, stronger model &lt;em&gt;score&lt;/em&gt; each output against a rubric built from your taxonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Improve — change something, re-run, and trust the delta.&lt;/strong&gt;&lt;br&gt;
Now you can edit a prompt, swap a model, or restructure a pipeline, run the eval, and &lt;em&gt;see&lt;/em&gt; whether quality moved. When you wire that comparison into CI, a quality regression turns the build red — the same safety net you have for ordinary code, finally extended to the non-deterministic part.&lt;/p&gt;

&lt;p&gt;It's a flywheel: production traffic reveals new failure modes → you analyze them → they become new measured cases → improvements get gated → better output produces cleaner traffic. Round and round.&lt;/p&gt;
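
&lt;p&gt;A minimal sketch of the "trust the delta" step as a gate, written in Python for brevity rather than taken from TextStack's C# pipeline. The &lt;code&gt;passes_gate&lt;/code&gt; helper, the sample scores, and the 0.2 tolerance are all hypothetical:&lt;/p&gt;

```python
from statistics import mean

def passes_gate(baseline_scores, candidate_scores, max_drop=0.2):
    """True when the candidate's mean judge score fell by at most max_drop."""
    drop = mean(baseline_scores) - mean(candidate_scores)
    return max_drop >= drop

baseline = [4, 5, 4, 4]    # hypothetical 1-5 judge scores before the change
candidate = [4, 4, 4, 4]   # hypothetical scores after the change
print("gate passed:", passes_gate(baseline, candidate))
```

&lt;p&gt;Here the mean falls from 4.25 to 4.0, so the gate fails. A CI job could turn that boolean into a non-zero exit code; the tolerance should be wider than the run-to-run noise of the suite.&lt;/p&gt;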
&lt;h2&gt;
  
  
  How we run evals at TextStack
&lt;/h2&gt;

&lt;p&gt;Theory is cheap, so here's the concrete version. TextStack is an ASP.NET Core reading app with seven AI surfaces, including Explain (a word in context), Translate, vocabulary quiz distractors, book metadata, and an audio podcast. One rule sits above all of them:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every AI feature ships with its own eval suite from day one. Eval is part of the pull request, not a follow-up.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Concretely, for each feature there's:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A golden dataset.&lt;/strong&gt; ~30 hand-curated cases per feature, stored as plain JSON, each pairing a realistic input with a reference answer a human would accept.&lt;/p&gt;
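
&lt;p&gt;For illustration only, one such case might look like the JSON below, loaded here with a small Python check. The field names (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;reference&lt;/code&gt;) and the example sentence are assumptions, not TextStack's actual schema:&lt;/p&gt;

```python
import json

# Hypothetical golden case for the Explain feature; field names are
# illustrative, not the real schema.
GOLDEN_JSON = """
[
  {
    "id": "explain-001",
    "input": {"word": "bank", "sentence": "We had a picnic on the bank of the river."},
    "reference": "Here, 'bank' means the sloping land along the edge of a river."
  }
]
"""

REQUIRED_KEYS = {"id", "input", "reference"}

def load_cases(raw):
    """Parse the JSON and fail loudly if any case lacks a required field."""
    cases = json.loads(raw)
    for case in cases:
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"case {case.get('id', '?')} is missing {sorted(missing)}")
    return cases

cases = load_cases(GOLDEN_JSON)
print(len(cases), cases[0]["input"]["word"])
```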

&lt;p&gt;&lt;strong&gt;Generation through the real path.&lt;/strong&gt; The eval runs each case through the &lt;em&gt;same&lt;/em&gt; code production uses — the same prompt, the same model gateway — so the test can never quietly drift away from what users actually get. (That drift is a classic, silent way to make an eval lie; more on it in the golden-dataset post.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A dedicated judge.&lt;/strong&gt; A second, stronger model (we use a &lt;code&gt;gpt-4.1&lt;/code&gt;-class model, deliberately separate from the small, cheap models that &lt;em&gt;generate&lt;/em&gt; the features) scores each output 1–5 on a short, feature-specific rubric — for Explain that's &lt;em&gt;accuracy / conciseness / usefulness&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The judge runs on &lt;strong&gt;&lt;a href="https://learn.microsoft.com/dotnet/ai/conceptual/evaluation-libraries" rel="noopener noreferrer"&gt;Microsoft.Extensions.AI.Evaluation&lt;/a&gt;&lt;/strong&gt; — Microsoft's official, open-source evaluation library for .NET. This is a deliberate choice: most of the eval ecosystem assumes you're in Python (Braintrust, Phoenix, LangSmith), but a .NET shop doesn't have to leave the platform to do this properly. Our judge is implemented as a custom &lt;code&gt;IEvaluator&lt;/code&gt;, so it slots into the same harness as Microsoft's built-in evaluators and runs as an ordinary &lt;code&gt;dotnet test&lt;/code&gt;. The whole pipeline is plain C# — no Python bridge, no LangChain. The library is young and moving fast, which also makes it one of the more approachable corners of the .NET AI stack to &lt;em&gt;contribute back&lt;/em&gt; to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The rubric passed to a custom IEvaluator on Microsoft.Extensions.AI.Evaluation.&lt;/span&gt;
&lt;span class="c1"&gt;// One judge, many features: the rubric is a parameter, not hardcoded.&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;explain&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Rubric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"accuracy: matches the meaning the word carries in THIS sentence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"conciseness: 2-3 sentences, no dictionary boilerplate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"usefulness: would a learner find it genuinely helpful"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// In a single file, type declarations must follow top-level statements.&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;sealed&lt;/span&gt; &lt;span class="k"&gt;record&lt;/span&gt; &lt;span class="nc"&gt;Rubric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Dim1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Dim2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;Dim3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Persistence and a dashboard.&lt;/strong&gt; Every run is stored, and an internal &lt;code&gt;/ai-quality&lt;/code&gt; page shows scores and traces per feature, so quality is something we can actually watch over time — not a number that scrolls past in a CI log.&lt;/p&gt;

&lt;p&gt;The honest status: we can run the full suite on demand and gate individual features against a quality floor; turning that into an automatic "fail the build if we regress more than X% versus last week" ratchet is the next step. The measuring instrument is built — and building the instrument is the hard 80%.&lt;/p&gt;

&lt;h2&gt;
  
  
  The traps (so you don't learn them the expensive way)
&lt;/h2&gt;

&lt;p&gt;A quick preview of what the rest of the series unpacks, because these are where eval setups quietly break:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metrics before error analysis&lt;/strong&gt; — you measure what was easy to imagine, not what actually fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An easy golden set&lt;/strong&gt; — the score goes up while the product goes down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A judge you never validated&lt;/strong&gt; — an LLM grading prose is itself a model; if it doesn't agree with human judgement, your whole pipeline is theatre.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judge bias&lt;/strong&gt; — judges quietly prefer longer answers, the first option shown, and text from their own model family.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shipping on noise&lt;/strong&gt; — with 30 cases, a 0.1 bump in the average is probably random, not progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those is a post of its own.&lt;/p&gt;
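
&lt;p&gt;To put a rough number on the last trap, here is a back-of-the-envelope check in Python. The per-case standard deviation of 0.8 is an assumed value for a 1–5 rubric, not something measured on our suite, and the two runs are treated as independent for simplicity:&lt;/p&gt;

```python
import math

n_cases = 30        # golden-set size from the trap above
assumed_sd = 0.8    # hypothetical per-case score spread on the 1-5 scale

# Standard error of one run's mean score.
se_mean = assumed_sd / math.sqrt(n_cases)

# Standard error of the difference between two independent runs.
se_diff = math.sqrt(2) * se_mean

print(f"SE of one mean:       {se_mean:.3f}")
print(f"SE of a difference:   {se_diff:.3f}")
print(f"0.1 bump in SE units: {0.1 / se_diff:.2f}")
```

&lt;p&gt;Under that assumption a 0.1 bump is about half a standard error, well inside the noise. Comparing scores case by case on the same golden set (a paired comparison) usually tightens this, but the order of magnitude is the point.&lt;/p&gt;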

&lt;h2&gt;
  
  
  Where this is going
&lt;/h2&gt;

&lt;p&gt;Evals are not a dashboard you bolt on at the end. They're the discipline that lets you change an AI product without flying blind — look at your failures, measure them honestly, and gate on the result. Done right, they turn &lt;em&gt;"I think this feature is fine"&lt;/em&gt; into &lt;em&gt;"I can prove it, and I'll know the moment it stops being true."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The full series:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;This post&lt;/strong&gt; — what evals are and how we run them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error analysis&lt;/strong&gt; — the unglamorous superpower, and how to build a failure taxonomy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden datasets that don't lie&lt;/strong&gt; — curation, leakage, and the drift trap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge, done right&lt;/strong&gt; — rubrics, a dedicated judge, and the biases that wreck it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;From a number to a gate&lt;/strong&gt; — evals in CI and online monitoring.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET, not a notebook. Try it at &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt;, or read the code at &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;github.com/mrviduus/textstack&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evals</category>
      <category>llm</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>I put Ollama on a 4 GB mobile GPU and got 2.5× — here's the VRAM math</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Wed, 13 May 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/mrviduus/i-put-ollama-on-a-4-gb-mobile-gpu-and-got-25-heres-the-vram-math-3mhk</link>
      <guid>https://dev.to/mrviduus/i-put-ollama-on-a-4-gb-mobile-gpu-and-got-25-heres-the-vram-math-3mhk</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📎 Companion piece to my earlier post: &lt;a href="https://dev.to/mrviduus/i-shipped-local-llm-features-two-months-ago-production-never-ran-them-once-41g7"&gt;I shipped local LLM features two months ago — production never ran them once&lt;/a&gt;. Same &lt;code&gt;gemma4:e2b&lt;/code&gt;, same box — this one is the &lt;strong&gt;GPU offload follow-up&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔬 TL;DR
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;2.5× faster, 10°C cooler — on a 4 GB laptop GPU that "shouldn't" fit the model.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;CPU only&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;GPU hybrid&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tokens / sec&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;39&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-call latency&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~5.5 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2.0 s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU temp under burst&lt;/td&gt;
&lt;td&gt;hot&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−10 °C&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layers on GPU&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;35 / 36&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same prompt. Same model. Same hardware. The only thing that changed was whether Ollama was allowed to touch the card.&lt;/p&gt;

&lt;p&gt;Honest take: I was hoping for more. The math later in this post explains exactly why &lt;strong&gt;2.5× is the ceiling&lt;/strong&gt; on 4 GB of VRAM with Gemma 4, and what it would take to push higher.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ Setup
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gemma4:e2b&lt;/code&gt; (2 B effective params, ~7.2 GB on disk)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AMD Ryzen 5 4600H, 6 cores / 12 threads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA GTX 1650 Ti Mobile, &lt;strong&gt;4 GB VRAM&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS / runtime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ubuntu + Docker, Ollama 0.23.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distractor + hint + explanation generator from my reader app — fixed across runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output budget&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~60 tokens per call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;num_gpu=0&lt;/code&gt; → CPU only · &lt;code&gt;num_gpu=999&lt;/code&gt; → let Ollama auto-split&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Warm-up&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One throwaway call per mode before the timed samples&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both modes ran &lt;strong&gt;after warm-up&lt;/strong&gt;, so the numbers reflect steady-state inference, not first-load cost. Each &lt;code&gt;/api/generate&lt;/code&gt; response came back as NDJSON, so I pulled &lt;code&gt;eval_count&lt;/code&gt;, &lt;code&gt;eval_duration&lt;/code&gt;, and &lt;code&gt;total_duration&lt;/code&gt; straight from the engine — no external timing noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 Why I picked E2B
&lt;/h2&gt;

&lt;p&gt;Gemma 4 ships in three flavours — the small E2B/E4B family, a 31B Dense model, and a 26B MoE. The model that runs in this benchmark is the smallest of those, and that wasn't accidental.&lt;/p&gt;

&lt;p&gt;The work is a fire-and-forget enrichment step inside a vocabulary-save flow — distractors plus a hint plus a short explanation, all generated in one call. It has to feel synchronous on a save action, and it has to run on the same commodity laptop as the rest of the app. Anything bigger is the wrong tool.&lt;/p&gt;

&lt;p&gt;The 31B Dense doesn't fit. The 26B MoE would, but its VRAM patterns on a 4 GB card are punishing. E4B is the obvious step up in quality from E2B, but its size pushes total memory past the point where Ollama has to keep more of the model on CPU, so the same job runs slower, and latency is what a save action is most sensitive to. E2B at Q4 lands the quality where I need it for distractor generation while leaving headroom for the KV cache and everything else.&lt;/p&gt;

&lt;p&gt;The framing that matters here isn't "the biggest model I could fit" but "the smallest model that gave me the output I needed." On constrained hardware, that distinction is the whole game — and it's what made the GPU experiment below worth running at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;CPU only&lt;/th&gt;
&lt;th&gt;GPU hybrid (35/36 layers on GPU)&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg output tokens / call&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;55&lt;/td&gt;
&lt;td&gt;~same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Avg eval latency&lt;/strong&gt; (token gen only)&lt;/td&gt;
&lt;td&gt;3,506 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,411 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.49× faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Avg total latency&lt;/strong&gt; (prompt + gen)&lt;/td&gt;
&lt;td&gt;5,390 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,174 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.48× faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens / sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;39&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.29× faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;ollama ps&lt;/code&gt; during the GPU run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME          SIZE      PROCESSOR        CONTEXT   UNTIL
gemma4:e2b    7.8 GB    74%/26% CPU/GPU  4096      Forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;nvidia-smi&lt;/code&gt; during a generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NVIDIA GTX 1650 Ti, used 1998 MiB, free 1909 MiB, util 32 %
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;&lt;code&gt;ollama ps&lt;/code&gt; lies to you.&lt;/strong&gt;&lt;br&gt;
That "74%/26% CPU/GPU" string is a memory split, &lt;strong&gt;not a layer split&lt;/strong&gt;. The Ollama server logs are the only place that tells you which layers actually moved. Mine showed &lt;code&gt;offloaded 35/36 layers to GPU&lt;/code&gt;. Almost the whole transformer — minus one layer that matters a lot. More on that in a second.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🧠 Why 2.5× and not 10×
&lt;/h2&gt;

&lt;p&gt;The load log counts 36 offloadable layers, and Ollama put &lt;strong&gt;35 of them on the GPU&lt;/strong&gt;. The lone holdout is the &lt;strong&gt;output projection layer&lt;/strong&gt;, the one that maps the final hidden state back into Gemma's vocabulary.&lt;/p&gt;

&lt;p&gt;Gemma 4's vocab is enormous (~256k tokens). That output layer is dense, fat, and would happily swallow what's left of the 4 GB after the rest of the stack moves over. So Ollama leaves it on CPU.&lt;/p&gt;

&lt;p&gt;The consequence is brutal in the steady state:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Every single generated token has to round-trip through the CPU at the end.&lt;/strong&gt; GPU is fast for the 35 layers it owns, then the pipeline stalls on the one layer the GPU couldn't take. Average across thousands of tokens and the CPU side becomes the floor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the whole story of 2.5× instead of 10×. Hybrid inference is gated by the slower of the two devices, and on this card the slower device is doing real work on every token.&lt;/p&gt;
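
&lt;p&gt;One way to see the shape of that ceiling is an Amdahl's-law estimate, sketched here in Python. The fractions and the 10× device speedup are hypothetical inputs chosen to illustrate the curve; none of them were measured on this machine:&lt;/p&gt;

```python
def overall_speedup(accelerated_fraction, device_speedup):
    """Amdahl's law: the share left on the slow device bounds the total gain."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / device_speedup)

# Hypothetical: the GPU runs its share of the per-token work 10x faster.
for fraction in (0.50, 0.67, 0.90, 1.00):
    gain = overall_speedup(fraction, 10.0)
    print(f"{fraction:.0%} of per-token time moved to GPU: {gain:.2f}x")
```

&lt;p&gt;With these assumed inputs, an overall gain of about 2.5× corresponds to roughly a third of the original per-token time still being spent off the GPU, which is the kind of floor described above.&lt;/p&gt;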

&lt;p&gt;The takeaway worth bolding: &lt;strong&gt;if you only ever look at &lt;code&gt;ollama ps&lt;/code&gt;, you'll get the wrong picture of what your setup is doing.&lt;/strong&gt; The server load logs are the source of truth for which layers went where.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 What 2.5× actually buys you
&lt;/h2&gt;

&lt;p&gt;In the app, a single save — distractors + hint + short explanation, ~60 output tokens — used to take &lt;strong&gt;5.5 s&lt;/strong&gt;. Now it's &lt;strong&gt;just over 2 s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That moves the action from the &lt;em&gt;"is this hanging?"&lt;/em&gt; zone into the &lt;em&gt;"yeah, it's working"&lt;/em&gt; zone. That's the threshold that actually matters for a save action.&lt;/p&gt;

&lt;p&gt;Five saves in a row:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before:&lt;/strong&gt; ~30 seconds of full-tilt CPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After:&lt;/strong&gt; ~10 seconds, work split between CPU and GPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bonus:&lt;/strong&gt; peak CPU temperature during that burst dropped &lt;strong&gt;~10 °C&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On a thin laptop in a small room, that last number is the difference between a fan you hear and a fan you don't.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 What would push it higher
&lt;/h2&gt;

&lt;p&gt;Three options, in order of how willing I am to do them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Smaller quant on just the output layer.&lt;/strong&gt; If that layer fit in the remaining ~1.9 GB, the whole model would run on GPU and you'd see the 10× numbers other writeups quote. The cost is real quality loss on the output distribution — worth measuring on your own prompt set rather than assuming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A bigger GPU.&lt;/strong&gt; A 16 GB card holds the whole thing with room to spare. The point of this exercise was specifically &lt;em&gt;"what does a commodity laptop GPU do"&lt;/em&gt;, so a $500 desktop card isn't really in scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swap engines.&lt;/strong&gt; llama.cpp direct, vLLM, etc. Two seconds is already inside budget for the action this model powers. Optimising past "fast enough" is how you end up with three benchmarks and zero users.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🛠️ Reproducing this
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Pull the model&lt;/span&gt;
ollama pull gemma4:e2b

&lt;span class="c"&gt;# 2. Force CPU only&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "gemma4:e2b",
  "prompt": "Give me 5 distractors for the word \"warehouse\".",
  "stream": false,
  "options": { "num_gpu": 0 }
}'&lt;/span&gt; | jq &lt;span class="s1"&gt;'{tokens: .eval_count, eval_ms: (.eval_duration/1e6), total_ms: (.total_duration/1e6)}'&lt;/span&gt;

&lt;span class="c"&gt;# 3. Let Ollama use the GPU&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:11434/api/generate &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "gemma4:e2b",
  "prompt": "Give me 5 distractors for the word \"warehouse\".",
  "stream": false,
  "options": { "num_gpu": 999 }
}'&lt;/span&gt; | jq &lt;span class="s1"&gt;'{tokens: .eval_count, eval_ms: (.eval_duration/1e6), total_ms: (.total_duration/1e6)}'&lt;/span&gt;

&lt;span class="c"&gt;# 4. Check what actually landed where&lt;/span&gt;
docker logs ollama 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"offloaded|layers"&lt;/span&gt;
nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;name,memory.used,memory.free,utilization.gpu &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run each curl a handful of times to flush warm-up effects, then average &lt;code&gt;eval_ms&lt;/code&gt; and &lt;code&gt;total_ms&lt;/code&gt;. The interesting number is the &lt;strong&gt;ratio&lt;/strong&gt;, not the absolute timings — they'll vary with your CPU.&lt;/p&gt;
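
&lt;p&gt;If you'd rather compute tokens per second than eyeball milliseconds, a small Python sketch follows. Ollama reports &lt;code&gt;eval_duration&lt;/code&gt; and &lt;code&gt;total_duration&lt;/code&gt; in nanoseconds. The two response bodies are illustrative, built from the averages in the results table rather than copied from a real call:&lt;/p&gt;

```python
import json

# Illustrative response bodies built from the averages in the results table;
# a real /api/generate response carries more fields.
cpu_raw = '{"eval_count": 60, "eval_duration": 3506000000, "total_duration": 5390000000}'
gpu_raw = '{"eval_count": 55, "eval_duration": 1411000000, "total_duration": 2174000000}'

def summarize(raw):
    """Convert nanosecond durations into ms and tokens per second."""
    r = json.loads(raw)
    eval_s = r["eval_duration"] / 1e9
    return {
        "tokens": r["eval_count"],
        "eval_ms": r["eval_duration"] / 1e6,
        "total_ms": r["total_duration"] / 1e6,
        "tok_per_s": r["eval_count"] / eval_s,
    }

cpu, gpu = summarize(cpu_raw), summarize(gpu_raw)
print(f"CPU: {cpu['tok_per_s']:.1f} tok/s, GPU: {gpu['tok_per_s']:.1f} tok/s")
print(f"total latency ratio: {cpu['total_ms'] / gpu['total_ms']:.2f}x")
```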




&lt;h2&gt;
  
  
  ✅ Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4 GB VRAM is enough to be useful&lt;/strong&gt;, even on a model that "should" need more. Just don't expect 10×.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid inference is gated by the slower device.&lt;/strong&gt; If one critical layer stays on CPU, that's your floor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust the load logs, not &lt;code&gt;ollama ps&lt;/code&gt;.&lt;/strong&gt; The pretty CPU/GPU percentage is a memory split, not a layer count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.5× is the difference between a UX that feels broken and one that doesn't.&lt;/strong&gt; That's enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop optimising once you're inside budget.&lt;/strong&gt; "Fast enough" beats "fastest" every time.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;📖 Full write-up with all the load-log spelunking on my blog: &lt;a href="https://vasyl.blog/2026/05/12/i-put-ollama-on-a-4-gb-mobile-gpu-and-got-2-5x-heres-the-vram-math/" rel="noopener noreferrer"&gt;vasyl.blog — I put Ollama on a 4 GB mobile GPU and got 2.5×&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐ The reader app this powers is open-source (AGPL-3.0): &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;github.com/mrviduus/textstack&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with &lt;code&gt;gemma4:e2b&lt;/code&gt; for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge&lt;/a&gt;. If you're entering too, drop a link in the comments — happy to read yours.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ollama</category>
    </item>
    <item>
      <title>I shipped local LLM features two months ago. Production never ran them once.</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Tue, 12 May 2026 11:23:14 +0000</pubDate>
      <link>https://dev.to/mrviduus/i-shipped-local-llm-features-two-months-ago-production-never-ran-them-once-41g7</link>
      <guid>https://dev.to/mrviduus/i-shipped-local-llm-features-two-months-ago-production-never-ran-them-once-41g7</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Two months ago I shipped local-LLM features in &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;TextStack&lt;/a&gt; — an open-source reader for developers who want to finish dense English technical books in their native language. Yesterday I noticed something strange about the production server's RAM. 3 GB used out of 30. The model that runs all those features should be ~13 GB resident.&lt;/p&gt;

&lt;p&gt;I SSH'd in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;ollama ollama list
NAME    ID    SIZE    MODIFIED
&lt;span class="err"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing. The Ollama container had been running for 60+ days without a single model pulled. Every distractor call had fired, hit the fallback path, and returned random vocabulary words. I never noticed because the failure mode is silent — the user sees distractors, just not LLM-generated ones.&lt;/p&gt;

&lt;p&gt;This is the post-mortem of that, plus the &lt;strong&gt;two model swaps&lt;/strong&gt; that finally got the features working: &lt;code&gt;qwen3:8b → gemma4:e4b&lt;/code&gt; on day one to bring local inference up at all, then &lt;code&gt;e4b → e2b&lt;/code&gt; once production load showed e4b couldn't keep up on CPU. &lt;strong&gt;Six production bugs surfaced along the way.&lt;/strong&gt; The article ends with a real 63,000-request load test on the e2b deploy: 100% success, p95 = 20.5 ms, total OpenAI cost = $0.002.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;&lt;strong&gt;TextStack&lt;/strong&gt;&lt;/a&gt; is an open-source (&lt;a href="https://github.com/mrviduus/textstack/blob/main/LICENSE" rel="noopener noreferrer"&gt;AGPL-3.0&lt;/a&gt;) reader for developers who keep abandoning English technical books like &lt;em&gt;Designing Data-Intensive Applications&lt;/em&gt;. Tap any term → context-aware translation that knows the book's domain ("attention" in an ML chapter gets &lt;em&gt;увага (механізм у нейромережах)&lt;/em&gt;, not the everyday meaning). Words you save feed a capped weekly SRS queue.&lt;/p&gt;

&lt;p&gt;Local &lt;strong&gt;Gemma 4 e2b&lt;/strong&gt; generates the multiple-choice distractors, hints, native-language explanations, and book metadata enrichment — four jobs that previously needed paid OpenAI calls per user. OpenAI &lt;code&gt;gpt-5-mini&lt;/code&gt; stays for translation (multilingual quality matters) and for in-reader live explanations (latency-sensitive). Everything else runs on a single-CPU 30 GB-RAM VPS, no GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;🌐 &lt;strong&gt;Live:&lt;/strong&gt; &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt; — sample chapters open without signup. Tap any word in &lt;em&gt;Designing Data-Intensive Applications&lt;/em&gt;, then check the vocabulary review.&lt;/p&gt;

&lt;p&gt;🎬 &lt;strong&gt;37-second walkthrough — read → save word → MCQ with Gemma-generated distractors → answer feedback:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf51hhn85wyb4ge1io7q.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf51hhn85wyb4ge1io7q.gif" alt="TextStack vocabulary review demo: tap-translations in DDIA, save word to vocabulary, MCQ card with 4 Gemma-generated distractors, red/green answer feedback" width="720" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📸 &lt;strong&gt;Single MCQ card — "___ the data from these external systems..." with 4 Gemma-generated distractors (battle / bringing / storm / courage):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw82vsnspx9gt45n9dz05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw82vsnspx9gt45n9dz05.png" alt="Vocabulary multiple-choice card with cloze sentence from DDIA and 4 Gemma-generated distractor options" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note for judges:&lt;/strong&gt; Sample chapters are unauthenticated; the vocabulary review needs a free account because progress and SRS state are per-user. Use any throwaway email — there's no email verification gate on read.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;📦 &lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;github.com/mrviduus/textstack&lt;/a&gt; — AGPL-3.0, 200+ merged PRs, deployed at &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;&lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;Star the repo on GitHub&lt;/a&gt;&lt;/strong&gt; — every star tells me one more developer wants to finish DDIA without giving up&lt;/p&gt;

&lt;p&gt;📐 &lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend: ASP.NET Core 10 (clean architecture: Domain / Application / Infrastructure / Api / Worker)&lt;/li&gt;
&lt;li&gt;Database: PostgreSQL 16 with FTS for in-book search&lt;/li&gt;
&lt;li&gt;Frontend: React 19 + Vite, React Native 0.83 (Expo) for mobile&lt;/li&gt;
&lt;li&gt;LLM: Ollama running &lt;code&gt;gemma4:e2b&lt;/code&gt; for local jobs, OpenAI &lt;code&gt;gpt-5-mini&lt;/code&gt; for translation&lt;/li&gt;
&lt;li&gt;Deployment: docker-compose, Cloudflare Tunnel, single VPS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔧 &lt;strong&gt;Key commits behind the story:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/mrviduus/textstack/pull/232" rel="noopener noreferrer"&gt;PR #232&lt;/a&gt; — original swap &lt;code&gt;qwen3:8b&lt;/code&gt; → &lt;code&gt;gemma4:e4b&lt;/code&gt;, image pin, memory bump&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/mrviduus/textstack/commit/3999944" rel="noopener noreferrer"&gt;&lt;code&gt;3999944&lt;/code&gt;&lt;/a&gt; — worker &lt;code&gt;Connection refused&lt;/code&gt; fix + the real timeout bump (30s → 90s after measurement)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/mrviduus/textstack/commit/966b398" rel="noopener noreferrer"&gt;&lt;code&gt;966b398&lt;/code&gt;&lt;/a&gt; — the second model swap, &lt;code&gt;e4b&lt;/code&gt; → &lt;code&gt;e2b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/mrviduus/textstack/commit/c6db540" rel="noopener noreferrer"&gt;&lt;code&gt;c6db540&lt;/code&gt;&lt;/a&gt; — 63,000-request load test + full LoadSurge report&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full PR/commit history for the swap arc lives in &lt;a href="https://github.com/mrviduus/textstack/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;&lt;code&gt;CHANGELOG.md&lt;/code&gt; under &lt;code&gt;[Unreleased]&lt;/code&gt;&lt;/a&gt;. The Gemma-using code lives in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/mrviduus/textstack/blob/main/backend/src/Vocabulary/TextStack.Vocabulary/DistractorGenerator.cs" rel="noopener noreferrer"&gt;&lt;code&gt;backend/src/Vocabulary/TextStack.Vocabulary/DistractorGenerator.cs&lt;/code&gt;&lt;/a&gt; — prompt template, parser, fallback cascade&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/mrviduus/textstack/blob/main/backend/src/Worker/Services/BookMetadataGenerator.cs" rel="noopener noreferrer"&gt;&lt;code&gt;backend/src/Worker/Services/BookMetadataGenerator.cs&lt;/code&gt;&lt;/a&gt; — fire-and-forget metadata enrichment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The model selection went through two rounds.&lt;/strong&gt; Gemma 4 ships in four sizes. The first time I built a trade-off table, I picked the wrong one — for understandable reasons. The second time I had production data and picked correctly. Both decisions live in the same article.&lt;/p&gt;

&lt;p&gt;Here's the matrix at the time of the first pick (E4B, day-one swap):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Disk&lt;/th&gt;
&lt;th&gt;RAM resident&lt;/th&gt;
&lt;th&gt;Fits on my VPS?&lt;/th&gt;
&lt;th&gt;First-pick reasoning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;E2B&lt;/strong&gt; (2B effective)&lt;/td&gt;
&lt;td&gt;7.2 GB&lt;/td&gt;
&lt;td&gt;~5 GiB&lt;/td&gt;
&lt;td&gt;✅ trivially&lt;/td&gt;
&lt;td&gt;"Too small for nuanced technical-vocab distractors" — &lt;em&gt;I'd find out this was wrong&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;E4B&lt;/strong&gt; (4B effective)&lt;/td&gt;
&lt;td&gt;9.6 GB&lt;/td&gt;
&lt;td&gt;13 GiB&lt;/td&gt;
&lt;td&gt;✅ with cgroup bump 4G → 12G&lt;/td&gt;
&lt;td&gt;"Sweet spot — strong enough on quality, fits the VPS" — &lt;em&gt;picked first&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;31B Dense&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~18 GB&lt;/td&gt;
&lt;td&gt;~24 GiB&lt;/td&gt;
&lt;td&gt;⚠️ tight, no headroom for Postgres + .NET&lt;/td&gt;
&lt;td&gt;"Overkill, no room for the rest of the stack"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;26B MoE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~15 GB&lt;/td&gt;
&lt;td&gt;~20 GiB&lt;/td&gt;
&lt;td&gt;⚠️ same constraint&lt;/td&gt;
&lt;td&gt;"MoE doesn't help short prompts here"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 31B and 26B MoE models would need either a GPU box or a much bigger VPS, neither of which fits an open-source project that has to remain deployable on a $20/month consumer host. So the real choice was between E2B and E4B. I went with E4B. I was wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Gemma 4 unlocked vs the cloud alternative.&lt;/strong&gt; Pre-swap, every distractor generation was a ~5¢ OpenAI call per word saved per user. With ~50 saved words per active reader per book, that's $2.50/book/user — fine for me running the only instance, fatal the moment someone else self-hosts it. Local Gemma 4 makes the marginal cost per distractor ~0 (just CPU on a box already running). Same for hints, explanations, and book metadata enrichment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local inference changed the economics of the feature completely.&lt;/strong&gt; That's the real reason the swap mattered — not the model quality, the cost shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  What surfaced when I actually flipped it on
&lt;/h2&gt;

&lt;p&gt;The bug story isn't decoration — it's how I learned what each Gemma 4 quirk does in production. &lt;strong&gt;Six lessons.&lt;/strong&gt; The first four came from getting e4b to run at all. The last two came from staring at the production stats after it was "running".&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 1: floating image tags lie
&lt;/h3&gt;

&lt;p&gt;Original &lt;code&gt;docker-compose.yml&lt;/code&gt; had:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/ollama&lt;/span&gt;   &lt;span class="c1"&gt;# no version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker had pulled &lt;code&gt;latest&lt;/code&gt; two months earlier and cached it. At that moment &lt;code&gt;latest&lt;/code&gt; was 0.22.x, built before Gemma 4 was released, so the binary didn't recognize the model family. From the host's perspective, the "local Ollama" IS the latest version: &lt;code&gt;docker image ls&lt;/code&gt; shows the cached SHA, not whether upstream has moved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- image: ollama/ollama
&lt;/span&gt;&lt;span class="gi"&gt;+ image: ollama/ollama:0.23.1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pull succeeded after pinning. 9.6 GB on disk for e4b.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 2: cgroup limits were a guess from the qwen3 era
&lt;/h3&gt;

&lt;p&gt;The container memory cap (4 GB) had been sized for &lt;code&gt;qwen3:8b&lt;/code&gt; and never re-evaluated. Gemma 4 e4b weights need 9.8 GiB. Inference returned &lt;code&gt;model requires more system memory (9.8 GiB) than is available&lt;/code&gt; until I bumped the limit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;  deploy:
    resources:
      limits:
&lt;span class="gd"&gt;-       memory: 4G
&lt;/span&gt;&lt;span class="gi"&gt;+       memory: 12G
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lesson: every model swap should also re-evaluate the container resource block. Picked-once-and-forgotten limits are a category of silent drift.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 3: cold load and warm latency both blew past my API timeout
&lt;/h3&gt;

&lt;p&gt;The first inference call hung for ~60 s before the first token. Ollama's default &lt;code&gt;keep_alive&lt;/code&gt; is 5 minutes; after that the model unloads and the next cold call burns 60 s again. Fix: set &lt;code&gt;OLLAMA_KEEP_ALIVE=-1&lt;/code&gt; and raise the API timeout from 10 s to 30 s.&lt;/p&gt;

&lt;p&gt;I shipped it. Then watched production: &lt;strong&gt;2 distractor generations out of 13 saved words succeeded.&lt;/strong&gt; The model was resident the entire time. Every miss was a wall-clock timeout. E4B on CPU just takes more than 30 seconds for many prompts.&lt;/p&gt;

&lt;p&gt;So 30s wasn't enough either:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- "TimeoutSeconds": 30
&lt;/span&gt;&lt;span class="gi"&gt;+ "TimeoutSeconds": 90
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Success rate climbed to ~100%. &lt;strong&gt;For CPU-only Gemma 4 on a 6-core consumer VPS, your timeout has to absorb 60–90 s tail latency, not 10 s.&lt;/strong&gt; That gap between toy-benchmark numbers and production reality is where most local-LLM ship-and-forget bugs live.&lt;/p&gt;
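
&lt;p&gt;A minimal sketch of that budget-plus-measurement pattern, assuming the LLM call accepts a &lt;code&gt;CancellationToken&lt;/code&gt;. &lt;code&gt;LlmTimeout.RunAsync&lt;/code&gt; is a hypothetical helper, not code from the TextStack repo; it enforces the wall-clock budget and returns the measured latency so the timeout can be sized from real numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;#nullable enable
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the TextStack codebase.
public static class LlmTimeout
{
    // Runs one LLM call under a wall-clock budget and reports how long it took.
    // Returns (null, elapsed) when the budget expires before the call completes.
    public static async Task&amp;lt;(T? Result, TimeSpan Elapsed)&amp;gt; RunAsync&amp;lt;T&amp;gt;(
        Func&amp;lt;CancellationToken, Task&amp;lt;T&amp;gt;&amp;gt; call, TimeSpan budget) where T : class
    {
        using var cts = new CancellationTokenSource(budget);
        var stopwatch = Stopwatch.StartNew();
        try
        {
            var result = await call(cts.Token);
            return (result, stopwatch.Elapsed);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            // Budget exhausted: the caller should log the miss and take the fallback path.
            return (null, stopwatch.Elapsed);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Recording &lt;code&gt;Elapsed&lt;/code&gt; for every call, including the ones that time out, is what turns "30 s feels generous" into a measured tail.&lt;/p&gt;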

&lt;h3&gt;
  
  
  Lesson 4: the parser silently dropped half my output
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;DistractorGenerator&lt;/code&gt;'s prompt asks for 5 wrong-answer words. Smoke test for &lt;code&gt;linearizability&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;consistency, atomicity, serialization, concurrency, visibility
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five single-word distractors. Clean. Then I tried &lt;code&gt;eventual consistency&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;strong consistency, read-after-write, data loss, causality, serialization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now look at the parser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
    &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsLetter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Equals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;originalWord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringComparison&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OrdinalIgnoreCase&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;      &lt;span class="c1"&gt;// ← drops "strong consistency", "data loss"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The filter rejects every entry that contains a space, so "strong consistency" and "data loss" are gone. Whenever fewer than three candidates survive, the &lt;code&gt;distractors.Count &amp;gt;= 3&lt;/code&gt; requirement fails, the call returns &lt;code&gt;null&lt;/code&gt;, and the fire-and-forget path falls back to the hardcoded random-word picker.&lt;/p&gt;
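
&lt;p&gt;One way to make that kind of drop visible is to keep the rejected candidates instead of discarding them. The sketch below is an assumption, not the repo's parser: &lt;code&gt;DistractorParsing.Partition&lt;/code&gt; is a hypothetical name, and it applies the same predicate as the filter above but returns both halves so the caller can log what was thrown away before falling back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the TextStack codebase.
public static class DistractorParsing
{
    // Splits comma-separated model output and partitions the candidates
    // into the ones the filter keeps and the ones it rejects.
    public static (List&amp;lt;string&amp;gt; Kept, List&amp;lt;string&amp;gt; Rejected) Partition(
        string raw, string originalWord)
    {
        var candidates = raw
            .Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
            .ToList();

        // Same conditions as the production filter shown above.
        bool IsValid(string w) =&amp;gt;
            w.Length &amp;gt; 1
            &amp;amp;&amp;amp; w.Length &amp;lt; 50
            &amp;amp;&amp;amp; w.Any(char.IsLetter)
            &amp;amp;&amp;amp; !w.Equals(originalWord, StringComparison.OrdinalIgnoreCase)
            &amp;amp;&amp;amp; !w.Contains(' ');

        var kept = candidates.Where(IsValid).ToList();
        var rejected = candidates.Where(w =&amp;gt; !IsValid(w)).ToList();
        return (kept, rejected);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For the &lt;code&gt;eventual consistency&lt;/code&gt; output above, &lt;code&gt;Rejected&lt;/code&gt; holds "strong consistency" and "data loss". A single log line with that list would have pointed straight at the space filter.&lt;/p&gt;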

&lt;p&gt;The filter had been there since the original implementation. &lt;strong&gt;qwen3 outputs single words by default, so the constraint stayed hidden. Gemma 4 prefers phrasal answers&lt;/strong&gt;, which makes output shape the parsing surface most sensitive to a change of model family. The fix was a single line in the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- SINGLE WORD ONLY — no spaces, no multi-word phrases
  (use "linearizability" not "strong consistency"). Hyphens are fine.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After all four fixes, a real production save of &lt;code&gt;warehouse&lt;/code&gt; returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"storeroom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"depot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"facility"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"silo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"loft"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five domain-adjacent single-word distractors, exactly the shape the prompt asks for. That's the moment local Gemma 4 was finally doing real work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 5: the worker had been silently failing for two months
&lt;/h3&gt;

&lt;p&gt;While collecting production stats for &lt;em&gt;this article&lt;/em&gt;, I grepped the worker logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker compose logs worker | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"Connection refused"&lt;/span&gt;
... lots of lines ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;docker-compose.yml&lt;/code&gt; had set &lt;code&gt;Ollama__BaseUrl&lt;/code&gt; on the &lt;code&gt;api&lt;/code&gt; service but &lt;strong&gt;not on the &lt;code&gt;worker&lt;/code&gt; service&lt;/strong&gt;. The worker fell back to the default, &lt;code&gt;localhost:11434&lt;/code&gt;, which inside the worker container points at nothing, and every &lt;code&gt;BookMetadataGenerator&lt;/code&gt; call failed with &lt;code&gt;Connection refused&lt;/code&gt; without surfacing anywhere. Every user-uploaded book ended up with &lt;code&gt;genre = NULL&lt;/code&gt;, which in turn meant the domain-aware translation prompt had no genre to condition on.&lt;/p&gt;

&lt;p&gt;This was a &lt;em&gt;second&lt;/em&gt; silent fallback, completely orthogonal to the original one. Same shape, different surface. Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;  worker:
    environment:
&lt;span class="gi"&gt;+     Ollama__BaseUrl: http://ollama:11434
+     Ollama__Model: gemma4:e2b
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a one-shot &lt;code&gt;MetadataBackfillWorker&lt;/code&gt; (a small &lt;code&gt;BackgroundService&lt;/code&gt; that runs on worker startup) to heal the ~10 user-uploaded books with &lt;code&gt;genre = NULL&lt;/code&gt;, idempotently.&lt;/p&gt;

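&lt;p&gt;&lt;strong&gt;The pattern is the lesson.&lt;/strong&gt; Anywhere you distribute environment via a compose file, ask: which services &lt;em&gt;actually need this variable&lt;/em&gt;, and is it set on each of them? A &lt;code&gt;.env&lt;/code&gt; file only feeds variable substitution in the compose file itself; nothing reaches a container unless that service's own &lt;code&gt;environment&lt;/code&gt; or &lt;code&gt;env_file&lt;/code&gt; block puts it there.&lt;/p&gt;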
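
&lt;p&gt;A complementary guard on the .NET side is to refuse to start when the variable is missing, rather than defaulting to &lt;code&gt;localhost&lt;/code&gt;. This is a sketch under assumptions: &lt;code&gt;OllamaConfig.RequireBaseUrl&lt;/code&gt; is a hypothetical helper, and it relies on the standard .NET mapping of the &lt;code&gt;Ollama__BaseUrl&lt;/code&gt; environment variable to the &lt;code&gt;Ollama:BaseUrl&lt;/code&gt; configuration key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using Microsoft.Extensions.Configuration;

// Hypothetical helper for illustration; not part of the TextStack codebase.
public static class OllamaConfig
{
    // Reads Ollama:BaseUrl and throws at startup if it is absent,
    // so a missing compose entry fails loudly instead of silently.
    public static Uri RequireBaseUrl(IConfiguration config)
    {
        var value = config["Ollama:BaseUrl"];
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                "Ollama:BaseUrl is not configured; set Ollama__BaseUrl in this service's environment block.");
        }
        return new Uri(value, UriKind.Absolute);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Called once during host startup in each service, a check like this turns two months of quiet &lt;code&gt;Connection refused&lt;/code&gt; into a crash on the first deploy.&lt;/p&gt;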

&lt;h3&gt;
  
  
  Lesson 6: turn off thinking mode for structured outputs
&lt;/h3&gt;

&lt;p&gt;Thinking-capable models served through Ollama (Gemma 4 included) default to a chain-of-thought "thinking" pass before the final answer. For freeform reasoning that's a quality win. For my use case, a 5-element list of single words, the thinking pass is pure overhead. Every request was generating 50–200 tokens of internal reasoning the parser then threw away.&lt;/p&gt;

&lt;p&gt;In the Ollama call options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- options: { "temperature": 0.7 }
&lt;/span&gt;&lt;span class="gi"&gt;+ options: { "temperature": 0.7, "think": false }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Roughly halved the per-request token output. Roughly halved end-to-end latency. The quality of the distractors did not drop in my testing — for "give me 5 plausible wrong-answer words for &lt;code&gt;warehouse&lt;/code&gt;", chain-of-thought wasn't doing anything load-bearing.&lt;/p&gt;

&lt;p&gt;If you're using Ollama for structured outputs, this is the single biggest perf knob most people don't know about.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second swap: e4b → e2b
&lt;/h2&gt;

&lt;p&gt;After all six lessons above, distractor calls were succeeding at ~100%. But end-to-end save latency was still tail-heavy. Looking at the numbers honestly: most calls landed in the 30–60 s range, and the 90 s timeout was absorbing what should have been a comfortable fit.&lt;/p&gt;

&lt;p&gt;Two things were happening at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;E4B's 13 GiB resident set was competing for RAM with Postgres + .NET&lt;/strong&gt; on a 30 GiB box. Not OOM-level, but the working set wasn't always in cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Even with &lt;code&gt;think=false&lt;/code&gt;, e4b is genuinely slow on a 6-core CPU.&lt;/strong&gt; I'd been benchmarking on a warm cache and short prompts; longer prompts (explanations, multi-sentence hints) routinely hit 60 s+.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I swapped to &lt;strong&gt;e2b&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;e4b (after all fixes)&lt;/th&gt;
&lt;th&gt;e2b (current prod)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;9.6 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.2 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM resident with &lt;code&gt;KEEP_ALIVE=-1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;13 GiB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.7 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference speed on same CPU&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2–3× faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality on single-word distractor task&lt;/td&gt;
&lt;td&gt;reference&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;comparable&lt;/strong&gt; for short structured outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first-pick reasoning ("E2B's quality is too weak for technical vocabulary") had been based on a &lt;em&gt;quality&lt;/em&gt; benchmark. The real production constraint turned out to be &lt;em&gt;latency&lt;/em&gt;. &lt;strong&gt;For short structured outputs — distractor lists, single-line hints — e2b is fast enough that quality differences disappear into the prompt template&lt;/strong&gt;. The prompt was doing more work than I'd given it credit for.&lt;/p&gt;

&lt;p&gt;For longer freeform outputs (the 2–3 sentence native-language explanation), e2b is measurably less polished. Acceptable for the use case (it's a study aid, not a translation). If a future task demands better explanation quality, the path is a fine-tune of e2b on TextStack's domain corpus, not jumping back to e4b. Same hardware envelope, better domain fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers (real, post-e2b)
&lt;/h2&gt;

&lt;p&gt;The numbers below are measured on the production server: AMD Ryzen 5 4600H, 6 cores / 12 threads, 30 GiB RAM, no GPU. Same box that serves traffic to &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disk (&lt;code&gt;gemma4:e2b&lt;/code&gt;)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.2 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;RAM resident&lt;/strong&gt; with &lt;code&gt;KEEP_ALIVE=-1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;7.7 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Cold load&lt;/strong&gt; (container restart)&lt;/td&gt;
&lt;td&gt;~10 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distractor cost per word&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~0¢ (CPU on existing box)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Equivalent OpenAI cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~5¢ per word at gpt-5-mini rates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Load test: 63,000 requests, 100% success, $0.002
&lt;/h3&gt;

&lt;p&gt;After the e2b swap I stress-tested the production deploy with &lt;a href="https://github.com/mrviduus/textstack/tree/main/tests/TextStack.LoadTests" rel="noopener noreferrer"&gt;LoadSurge&lt;/a&gt;. Three scenarios — &lt;code&gt;GET /health&lt;/code&gt;, &lt;code&gt;POST /translate&lt;/code&gt;, &lt;code&gt;POST /explain&lt;/code&gt; — at 30–50 virtual users for 30–60 seconds each. Headlines:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total requests&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;63,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Success rate&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (0 failures)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worst-case p95 latency&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;20.5 ms&lt;/strong&gt; (smoke; translate and explain were lower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sustained RPS at 50 VU&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;500&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI cost during the run&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$0.002&lt;/strong&gt; (10 cache-prewarm calls; zero during the stress phase)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak temperature on the host&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;42 °C&lt;/strong&gt; (throttle threshold 95 °C)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The interesting part isn't the throughput — 500 RPS on a $20 box is real but not surprising for cached HTTP. The interesting part is that the expensive path disappeared entirely behind the cache. Translate and Explain are keyed by &lt;code&gt;(input, target_language, genre, sentence)&lt;/code&gt;; on a hot cache the LLM never enters the request lifecycle.&lt;/p&gt;
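
&lt;p&gt;The shape of such a cache can be sketched in a few lines. Everything here is an assumption for illustration: &lt;code&gt;ExplainKey&lt;/code&gt; and &lt;code&gt;ExplainCache&lt;/code&gt; are hypothetical names, the store is in-memory only, and the real implementation may persist entries elsewhere. The point is the composite key and the guarantee that concurrent misses for the same key share one LLM call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;#nullable enable
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical key mirroring the (input, target_language, genre, sentence) tuple.
public readonly record struct ExplainKey(
    string Input, string TargetLanguage, string? Genre, string Sentence);

// Hypothetical in-memory cache for illustration; not the TextStack implementation.
public sealed class ExplainCache
{
    private readonly ConcurrentDictionary&amp;lt;ExplainKey, Lazy&amp;lt;Task&amp;lt;string&amp;gt;&amp;gt;&amp;gt; _entries = new();

    // Returns the cached task for the key, or starts the generator for it.
    // Lazy ensures the generator runs at most once per stored entry.
    public Task&amp;lt;string&amp;gt; GetOrCreateAsync(
        ExplainKey key, Func&amp;lt;ExplainKey, Task&amp;lt;string&amp;gt;&amp;gt; generate) =&amp;gt;
        _entries.GetOrAdd(key, k =&amp;gt; new Lazy&amp;lt;Task&amp;lt;string&amp;gt;&amp;gt;(() =&amp;gt; generate(k))).Value;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One caveat of this simple version: a failed task stays cached, so production code would need to evict faulted entries.&lt;/p&gt;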

&lt;p&gt;The auth-gated &lt;code&gt;POST /me/vocabulary/words&lt;/code&gt; path that triggers actual Gemma 4 distractor generation wasn't covered by this run — that's the next test, with test-auth tokens and a bounded-concurrency queue in front of Ollama. The full per-scenario breakdown is in &lt;a href="https://github.com/mrviduus/textstack/blob/main/docs/loadtest/run-20260511-103451/REPORT.md" rel="noopener noreferrer"&gt;&lt;code&gt;docs/loadtest/run-20260511-103451/REPORT.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where OpenAI stays
&lt;/h2&gt;

&lt;p&gt;The split after both swaps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vocabulary distractors&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Local Gemma 4 e2b&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tolerable quality, fire-and-forget, no per-user cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Word hints&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Local Gemma 4 e2b&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native-language explanations&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Local Gemma 4 e2b&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same; acceptable on long-form quality given the use case&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Book metadata enrichment&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Local Gemma 4 e2b&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Translation (18+ langs, incl. Ukrainian)&lt;/td&gt;
&lt;td&gt;OpenAI gpt-5-mini&lt;/td&gt;
&lt;td&gt;Small-model multilingual translation is still a weak spot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;In-reader term explanation (live)&lt;/td&gt;
&lt;td&gt;OpenAI gpt-5-mini&lt;/td&gt;
&lt;td&gt;&amp;lt;1 s latency requirement during reading&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Local LLMs aren't a wholesale cloud replacement. &lt;strong&gt;They're a tool for tasks where quality is tolerant, latency is amortizable, privacy matters, or per-user cost matters.&lt;/strong&gt; When any of those breaks down — multilingual translation, latency-sensitive UI — cloud still wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons (for anyone shipping local LLMs)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Silent fallback is the worst kind of bug.&lt;/strong&gt; Distractor generation had been failing in production for 60+ days and I had no signal: the fallback was a hardcoded random-word picker, indistinguishable to the user. &lt;strong&gt;And it happened twice in the same system, on two different surfaces&lt;/strong&gt; (Ollama-not-installed, then Worker-can't-reach-Ollama). Next time: emit &lt;code&gt;llm.success&lt;/code&gt; and &lt;code&gt;llm.fallback&lt;/code&gt; counters per service, alert when fallbacks exceed 5% of calls, and never make fallbacks bit-for-bit indistinguishable from the primary path.&lt;/p&gt;
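
&lt;p&gt;A minimal sketch of those two counters using the built-in &lt;code&gt;System.Diagnostics.Metrics&lt;/code&gt; API. The counter names come from the paragraph above; the &lt;code&gt;LlmMetrics&lt;/code&gt; class, the meter name, and the tag keys are assumptions for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;#nullable enable
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical helper for illustration; not part of the TextStack codebase.
public static class LlmMetrics
{
    // The meter name is an assumption; pick whatever your exporter subscribes to.
    private static readonly Meter LlmMeter = new("TextStack.Llm");
    private static readonly Counter&amp;lt;long&amp;gt; Success = LlmMeter.CreateCounter&amp;lt;long&amp;gt;("llm.success");
    private static readonly Counter&amp;lt;long&amp;gt; Fallback = LlmMeter.CreateCounter&amp;lt;long&amp;gt;("llm.fallback");

    // Record exactly one outcome per LLM-backed operation, tagged by service.
    public static void RecordSuccess(string service) =&amp;gt;
        Success.Add(1, new KeyValuePair&amp;lt;string, object?&amp;gt;("service", service));

    // The reason tag separates timeouts, parse failures, and connection errors.
    public static void RecordFallback(string service, string reason) =&amp;gt;
        Fallback.Add(1,
            new KeyValuePair&amp;lt;string, object?&amp;gt;("service", service),
            new KeyValuePair&amp;lt;string, object?&amp;gt;("reason", reason));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With both counters exported, the alert is a ratio of &lt;code&gt;llm.fallback&lt;/code&gt; to total calls per service, and either failure described here would have tripped it on day one.&lt;/p&gt;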

&lt;p&gt;&lt;strong&gt;Floating image tags lie.&lt;/strong&gt; Pin Ollama, pin Postgres, pin everything. &lt;code&gt;latest&lt;/code&gt; freezes the day Docker pulls it; two months later it's lagging upstream and you have no signal until a new model breaks it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defend at parse, always, even if your model behaved on the first try.&lt;/strong&gt; Given the same prompt, qwen3 returns single words and Gemma 4 returns phrases. The parser's pre-existing &lt;code&gt;!w.Contains(' ')&lt;/code&gt; filter was correct in spirit but hidden from the model. Once the constraint was also stated in the prompt, it became explicit and Gemma satisfied it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bench with real prompts on real hardware.&lt;/strong&gt; I tested e4b's quality on warm-cache short prompts and concluded it was the right pick. Real production tail latency on longer prompts was 3× what the smoke test suggested, and that's what forced the e2b downgrade. Toy benchmarks hide both model-family quirks (parsing) and hardware-bound failure modes (CPU latency).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn off thinking mode for structured outputs.&lt;/strong&gt; &lt;code&gt;think: false&lt;/code&gt; is the single biggest perf knob on Ollama for short structured tasks. Most documentation doesn't surface it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distribute env vars deliberately across services.&lt;/strong&gt; Docker-compose service blocks don't inherit from each other. Whichever service actually needs a variable — list it explicitly in &lt;em&gt;that service's&lt;/em&gt; env block. The day you add a new service, audit every variable.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;The interesting part wasn't that the model failed. It was how long the system kept pretending it hadn't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fine-tune Gemma 4 e2b on TextStack's distractor task.&lt;/strong&gt; I now have a real production corpus building (a few hundred (term, distractor-list) pairs per week post-fix). The corpus that existed before the fix is gone — every distractor it produced came from the hardcoded fallback, not the model. The dataset starts fresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a bounded-concurrency queue in front of Ollama for the write path.&lt;/strong&gt; From the load test recommendations: a &lt;code&gt;Channels&lt;/code&gt;-based worker with &lt;code&gt;MaxConcurrency = 2&lt;/code&gt; plus a per-&lt;code&gt;(word, language)&lt;/code&gt; shared cache. Mirrors the translate/explain caches that just held 500 RPS with zero LLM cost.&lt;/p&gt;
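
&lt;p&gt;A rough sketch of what that queue could look like with &lt;code&gt;System.Threading.Channels&lt;/code&gt;. This is not the planned implementation: &lt;code&gt;BoundedLlmQueue&lt;/code&gt;, the capacity of 100, and the job signature are all assumptions, and the shared cache is left out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;using System;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical queue for illustration; not part of the TextStack codebase.
public sealed class BoundedLlmQueue
{
    // Capacity of 100 is an arbitrary example value.
    private readonly Channel&amp;lt;Func&amp;lt;CancellationToken, Task&amp;gt;&amp;gt; _jobs =
        Channel.CreateBounded&amp;lt;Func&amp;lt;CancellationToken, Task&amp;gt;&amp;gt;(
            new BoundedChannelOptions(100) { FullMode = BoundedChannelFullMode.Wait });

    // Producers wait here when the queue is full instead of piling work onto Ollama.
    public ValueTask EnqueueAsync(Func&amp;lt;CancellationToken, Task&amp;gt; job, CancellationToken ct = default) =&amp;gt;
        _jobs.Writer.WriteAsync(job, ct);

    // Starts maxConcurrency consumer loops; at most that many jobs run at once.
    public Task RunAsync(int maxConcurrency, CancellationToken ct) =&amp;gt;
        Task.WhenAll(Enumerable.Range(0, maxConcurrency).Select(_ =&amp;gt; ConsumeAsync(ct)));

    private async Task ConsumeAsync(CancellationToken ct)
    {
        await foreach (var job in _jobs.Reader.ReadAllAsync(ct))
        {
            try
            {
                await job(ct);
            }
            catch (Exception) when (!ct.IsCancellationRequested)
            {
                // A failed job must not end the consumer loop; real code would log here.
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;RunAsync(2, stoppingToken)&lt;/code&gt; from a &lt;code&gt;BackgroundService&lt;/code&gt; would give the &lt;code&gt;MaxConcurrency = 2&lt;/code&gt; behavior: CPU-bound inference never sees more than two requests in flight, however many saves arrive.&lt;/p&gt;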

&lt;p&gt;&lt;strong&gt;Run a second load test against the auth-gated write path.&lt;/strong&gt; The 63k-request test only measured cached reads. Distractor generation is the actual bottleneck, and it sits behind authentication. Need test-auth tokens and 10–20 VU to bound it.&lt;/p&gt;

&lt;p&gt;The full TextStack codebase is AGPL-3.0 at &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;github.com/mrviduus/textstack&lt;/a&gt;. If you've shipped local-LLM features in production, &lt;strong&gt;run &lt;code&gt;ollama list&lt;/code&gt; on your server, then &lt;code&gt;docker compose logs worker | grep -i refused&lt;/code&gt;&lt;/strong&gt;. One of those might surprise you. Mine surprised me twice in the same codebase — same shape, different surface, two months apart. That's the part of operating local LLMs that nobody writes about, and the part that takes the longest to learn.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you found this useful, the strongest signal is a star on the &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;repo&lt;/a&gt;. Every star tells me the next person abandoning DDIA mid-way might find this tool — and that's the whole point.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Open-source licenses 101: which one to actually pick</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Thu, 07 May 2026 17:24:43 +0000</pubDate>
      <link>https://dev.to/mrviduus/open-source-licenses-101-which-one-to-actually-pick-232f</link>
      <guid>https://dev.to/mrviduus/open-source-licenses-101-which-one-to-actually-pick-232f</guid>
      <description>&lt;p&gt;Sooner or later, every developer runs into The License Question. You shipped something to GitHub, GitHub asked you to pick a license, and you scrolled the dropdown — MIT, Apache, GPL, AGPL, BUSL, MPL, ISC, Unlicense, "Other" — and picked whatever sounded least scary. That's how I did it. That's also how I ended up rewriting my LICENSE file three weeks later.&lt;/p&gt;

&lt;p&gt;Licenses are a dark forest for devs. We don't read legal docs, nothing in our day-to-day teaches us when each one matters, and most online advice is either a wall of legalese or someone's religious argument. Here's the version I wish someone had given me: a tour of the five licenses you'll actually meet, the mistakes that bite, and what changing my license did to my project's discoverability in the real world.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a license actually does
&lt;/h2&gt;

&lt;p&gt;By default, your code is "all rights reserved." That sounds like the default-est thing possible — but it means &lt;em&gt;no one&lt;/em&gt; can legally copy, fork, run, or modify your code without your written permission. Sticking your project on a public GitHub repo doesn't change that. A license is the contract you write with the world that relaxes the default.&lt;/p&gt;

&lt;p&gt;The question you're answering when you pick one: &lt;em&gt;how much can people do with this, and what do you get back?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The five you'll actually meet
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MIT.&lt;/strong&gt; "Use my code. Just keep my name in the file. Don't sue me." Three paragraphs long. Maximum adoption, zero protection. Most of the JavaScript ecosystem runs on MIT, and most of those projects don't have a monetization plan, which is exactly why it works for them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache 2.0.&lt;/strong&gt; Like MIT, but explicitly grants patent rights from contributors to users. That sounds boring until you realize half the tech world is built on patented stuff and silently assumes nobody will sue. Apache is the grown-up version of MIT — same vibe, fewer landmines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPL-3.0.&lt;/strong&gt; "Modify and distribute my code? Your modifications are also GPL." This is &lt;em&gt;copyleft&lt;/em&gt;. It infects everything downstream, which is why corporate lawyers hate it and Linux thrives on it (the kernel is GPL-2). Companies can't quietly fold GPL code into their proprietary stack — the license would force the whole stack open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AGPL-3.0.&lt;/strong&gt; GPL with a single, brutal addition: §13. If you modify the code and run it as a network service — a SaaS, a hosted dashboard, anything users hit over the network — you have to publish your modifications. This closes the loophole that GPL leaves open, where a company can fork, modify privately, and host the modified version. AGPL says: nope, your fork has to be public the moment users touch it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BUSL-1.1.&lt;/strong&gt; Not actually open source by the OSI's definition; it's "source-available." You can read the code, fork it, and run it for yourself, but you can't sell it as a hosted commercial service competing with the original author. After at most four years each release auto-converts to a real open-source license (often Apache). MariaDB wrote it for MaxScale, and Sentry and CockroachDB have both shipped under it. It's a defensive license aimed at the "AWS forks our project and undercuts us on hosting" scenario.&lt;/p&gt;

&lt;p&gt;(There's also MPL-2.0 — file-level copyleft, used by Firefox. A reasonable middle ground if MIT feels too loose and AGPL too aggressive. Not your most-likely first encounter, so I'm leaving it as a footnote.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistakes I see all the time
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Picking MIT for a thing you might monetize.&lt;/strong&gt; The most expensive mistake. MIT lets a competitor fork your work, polish it, host it, and out-market you — with zero recourse. Fine for a library nobody wants to commercialize. Bad for a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copying BUSL because Sentry uses BUSL.&lt;/strong&gt; Different threat models. Sentry has hyperscaler-competition risk; you have nobody-knows-you-exist risk. BUSL solves a problem you don't have, while costing you contributor goodwill, awesome-list eligibility, and brand clarity. I learned this one personally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slapping GPL or AGPL on a library.&lt;/strong&gt; Copyleft on a library is contagious — anything that links to it inherits your license. Devs see it and walk away because they can't safely use your code in their proprietary or differently-licensed project. Libraries should almost always be MIT or Apache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No license at all.&lt;/strong&gt; The silent killer. "All rights reserved" is the default, so a public repo with no LICENSE file is technically a public repo nobody can legally use. You're sending the message: &lt;em&gt;here's my code, but also nobody can touch it.&lt;/em&gt; If you want adoption, ship a license.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Picking the most "open" license to look generous.&lt;/strong&gt; MIT looks generous. It's also the easiest license to regret. The right question isn't "how open should I look" — it's &lt;em&gt;"what business model do I want to keep available?"&lt;/em&gt; Be honest with yourself before you optimize for image.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changing the license actually changed
&lt;/h2&gt;

&lt;p&gt;I shipped &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;TextStack&lt;/a&gt; — a reading tool I'm building solo — under BUSL-1.1. My reasoning was the same one MariaDB and Sentry articulated: &lt;em&gt;protect against AWS-style cloning before it happens.&lt;/em&gt; Sounded smart. Felt smart. Wasn't.&lt;/p&gt;

&lt;p&gt;The first sign was awesome-selfhosted. I went to add my project to the most-trafficked self-hosted directory on GitHub, opened the contributing guide, and saw a rule I hadn't expected: OSI-approved licenses only. BUSL doesn't qualify. The same pattern showed up across every awesome-* list I checked — awesome-react-native, awesome-dotnet-applications, awesome-llm-apps. Most either explicitly require an OSI-approved license or implicitly do. The world of curated, high-traffic developer discovery is gated by the OSI definition, and BUSL sits on the wrong side of the gate.&lt;/p&gt;

&lt;p&gt;Then the second-order effects started showing up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On GitHub Topics&lt;/strong&gt;, the license filter is how a lot of devs browse for tools: &lt;code&gt;license:agpl-3.0&lt;/code&gt; has its own discovery surface, while &lt;code&gt;license:other&lt;/code&gt; is essentially invisible. Switching from BUSL to AGPL moved my repo out of the invisible bucket and into the discoverable one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the README itself&lt;/strong&gt;, the license badge is the first thing a potential contributor reads. "BUSL-1.1" makes most devs hesitate — &lt;em&gt;what is this, can I actually contribute?&lt;/em&gt; "AGPL-3.0" is recognized instantly. For a portfolio project where you want stars, forks, contributors, and word-of-mouth, that hesitation is the whole game.&lt;/p&gt;

&lt;p&gt;And here's the kicker: AGPL didn't even cost me the protection I was after. The §13 network-copyleft clause makes most cloud-cloning impractical — the moment a competitor publishes a hosted fork, their differentiator is public. I kept the defensive moat; I shed the friction. On top of that, AGPL leaves dual-licensing on the table — the same playbook that funds Plausible, PostHog, and Cal.com (AGPL for the community, paid commercial license for clients who can't comply with §13). With BUSL, that revenue path was already pre-closed; BUSL &lt;em&gt;is&lt;/em&gt; the commercial-restricted license, there's nothing to upgrade away from.&lt;/p&gt;

&lt;p&gt;The lesson, if you're building a portfolio project, is uncomfortable: &lt;strong&gt;license choice is a discoverability decision, not just a legal one.&lt;/strong&gt; Awesome lists, GitHub Topics, contributor pipelines: all gated by the OSI definition. Pick the license that opens doors, not the one that closes them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to actually decide
&lt;/h2&gt;

&lt;p&gt;Three questions, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Library or product?&lt;/strong&gt; Library → Apache 2.0. Product → keep reading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Will you monetize someday?&lt;/strong&gt; Yes → AGPL-3.0. No → MIT or Apache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are your future customers mostly enterprises with strict no-AGPL policies?&lt;/strong&gt; Some big companies (Google, famously) ban AGPL internally. If your TAM is enterprise, lean Apache.&lt;/li&gt;
&lt;/ol&gt;
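
&lt;p&gt;If it helps, the three questions can be read as a tiny TypeScript function. This is only a sketch: &lt;code&gt;pickLicense&lt;/code&gt; and the &lt;code&gt;Project&lt;/code&gt; fields are names made up for illustration, and letting question 3 override the AGPL answer is an assumption about how the questions combine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch of the three questions; all names are illustrative.
type License = 'MIT or Apache-2.0' | 'Apache-2.0' | 'AGPL-3.0';

interface Project {
  isLibrary: boolean;            // question 1
  mightMonetize: boolean;        // question 2
  enterpriseCustomers: boolean;  // question 3: buyers with no-AGPL policies
}

function pickLicense(p: Project): License {
  if (p.isLibrary) return 'Apache-2.0';
  if (!p.mightMonetize) return 'MIT or Apache-2.0';
  // Assumption: the enterprise check overrides the AGPL answer.
  return p.enterpriseCustomers ? 'Apache-2.0' : 'AGPL-3.0';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;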

&lt;p&gt;For most solo-dev side projects: &lt;strong&gt;AGPL-3.0&lt;/strong&gt;. It's real open source, qualifies for awesome-list submissions, attracts contributors, and keeps the dual-licensing door open if you ever decide to monetize. That's the honest default.&lt;/p&gt;

&lt;p&gt;I picked BUSL-1.1 first, switched to AGPL-3.0 two weeks later, and watched the discovery dynamics flip in the same week. The shorter version of this whole post: pick AGPL, save yourself the relicensing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://vasyl.blog/2026/05/06/open-source-licenses-101/" rel="noopener noreferrer"&gt;vasyl.blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>beginners</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>I Quit Designing Data-Intensive Applications (DDIA) Three Times. Here's What I Built on the Fourth Try.</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Wed, 22 Apr 2026 05:14:01 +0000</pubDate>
      <link>https://dev.to/mrviduus/i-quit-designing-data-intensive-applications-ddia-three-times-heres-what-i-build-on-the-fourth-5bom</link>
      <guid>https://dev.to/mrviduus/i-quit-designing-data-intensive-applications-ddia-three-times-heres-what-i-build-on-the-fourth-5bom</guid>
      <description>&lt;p&gt;In 2023 I bought DDIA on Kindle. Opened the replication chapter. Quit after 40 pages and didn't open it for six months.&lt;/p&gt;

&lt;p&gt;In 2024 I bought it again, because the book is clearly worth finishing. Got to page 80. Closed it.&lt;/p&gt;

&lt;p&gt;In 2025 I tried a third time with ChatGPT open in another tab to explain the hard terms. It got easier. But every lookup was the same loop — alt-tab, paste the sentence, wait, come back, find my place. After three chapters I wasn't really reading the book anymore. I was reading my own habit of switching tabs.&lt;/p&gt;

&lt;p&gt;The book still sits in my Kindle library, marked unfinished. If you have a book like that on your shelf, this post is for you. I finally figured out why I kept quitting, and built a tool that fixes it for me. Maybe it fixes it for you too.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually breaking
&lt;/h2&gt;

&lt;p&gt;When I quit for the third time, I sat down and tried to be honest about what was stopping me.&lt;/p&gt;

&lt;p&gt;It wasn't that the book was too hard. I understood most of what was on the page. The problem was the rest — the unfamiliar terms.&lt;/p&gt;

&lt;p&gt;Every unknown term forced a decision between two bad options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option one: stop and look it up.&lt;/strong&gt; Alt-tab, paste the sentence, wait, come back, find my place. Flow broken. The next paragraph is harder to hold in my head.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option two: skip it and hope context saves me.&lt;/strong&gt; Sometimes it does. But after a dozen skips in a chapter, the quality of my reading drops noticeably. And each "I'll figure it out later" turns into debt.&lt;/p&gt;

&lt;p&gt;The exhaustion wasn't coming from reading. It was coming from the constant small decisions.&lt;/p&gt;

&lt;p&gt;There was a third problem too. Even when I did look something up, a week later I'd forgotten it. ChatGPT doesn't remember you asked. Anki remembers, but making cards by hand is its own pile of friction. I was learning words in order to forget them. And reading books in order to quit them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I got wrong about AI and reading
&lt;/h2&gt;

&lt;p&gt;When ChatGPT arrived, a lot of people thought long books were dead. Why read 600 pages of DDIA when you can ask and get a summary in a minute?&lt;/p&gt;

&lt;p&gt;I believed that for about a year.&lt;/p&gt;

&lt;p&gt;Then I sat in a 2025 interview being asked about replication strategies in distributed systems, and realized I couldn't explain the difference between synchronous and asynchronous replication past surface-level buzzwords. I'd read dozens of summaries, listened to podcasts, watched YouTube breakdowns. I knew things on the surface. I didn't understand any of them deeply.&lt;/p&gt;

&lt;p&gt;For staying current, summaries are fine. For real understanding, nothing replaces sitting with a book that someone spent years structuring. Those are exactly the books I kept quitting around page 40.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;In January 2026 I started building what became TextStack — a reader where I could read technical books without the tab switching.&lt;/p&gt;

&lt;p&gt;The idea is simple. Tap a word you don't know. An explanation appears inline — not a dictionary entry, but a short concept explanation from Claude that takes into account what the book is about and what the sentence is doing. For everyday words, a short translation. For technical terms like RLHF, attention mechanism, or eventual consistency — two or three sentences on what it is and why it matters, with links to related ideas and common confusions.&lt;/p&gt;

&lt;p&gt;The word goes into a personal dictionary automatically. But not the way LingQ does it, where your review queue grows to hundreds of items and you quit the app. I built a filter — only words from roughly the top 15,000 English words by frequency, or technical terms, enter spaced repetition. The rest are saved as reference. The weekly review queue is capped, so it never spirals.&lt;/p&gt;
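
&lt;p&gt;Roughly, the filter works like the sketch below. It is not TextStack's actual code: &lt;code&gt;SavedWord&lt;/code&gt;, &lt;code&gt;WordList&lt;/code&gt;, &lt;code&gt;planWeek&lt;/code&gt;, and the cap of 50 are assumptions made up for illustration, and &lt;code&gt;topWords&lt;/code&gt; stands in for a list of roughly the 15,000 most frequent English words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch; every name here and the cap value are assumptions.
interface SavedWord {
  text: string;
  isTechnicalTerm: boolean;
}

// Anything with a has() method works, for example a Set of strings.
interface WordList {
  has(word: string): boolean;
}

const WEEKLY_CAP = 50; // assumed; the post only says the queue is capped

// A word enters spaced repetition if it is a technical term or a common word.
function entersReview(word: SavedWord, topWords: WordList): boolean {
  return word.isTechnicalTerm || topWords.has(word.text.toLowerCase());
}

// Split saved words into this week's review queue and reference-only entries.
function planWeek(words: SavedWord[], topWords: WordList) {
  const eligible = words.filter(function (w) { return entersReview(w, topWords); });
  const reference = words.filter(function (w) { return !entersReview(w, topWords); });
  // Eligible words beyond the cap simply wait for a later week.
  return { queue: eligible.slice(0, WEEKLY_CAP), reference: reference };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The cap is applied after the filter, so a heavy reading week makes the backlog longer without making any single review session longer.&lt;/p&gt;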

&lt;p&gt;Over three and a half months I put together a working version on .NET 10, React, and React Native. PostgreSQL, Claude API for explanations, Edge TTS for audio, offline PWA. It ingests EPUB, PDF, and FB2. The catalog started wide, but I'm pruning it hard — I'm realizing focus matters more than I thought.&lt;/p&gt;

&lt;p&gt;It lives at &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt; — full pitch at the end of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I got wrong for three months
&lt;/h2&gt;

&lt;p&gt;For the first three months I was building for an abstract "non-native English speaker who wants to read books." Nobody needs that.&lt;/p&gt;

&lt;p&gt;In April I looked at it honestly and asked who I'd actually built it for. The answer was: a developer trying to read AI engineering books. Because that's what I'd been trying to read for two years. Chip Huyen's &lt;em&gt;AI Engineering&lt;/em&gt;. &lt;em&gt;Hands-On Large Language Models&lt;/em&gt;. &lt;em&gt;Designing Machine Learning Systems&lt;/em&gt;. &lt;em&gt;Building Agentic AI Systems&lt;/em&gt;. &lt;em&gt;Prompt Engineering for LLMs&lt;/em&gt;. I bought all of them. I finished none.&lt;/p&gt;

&lt;p&gt;When I looked at other developers' reading lists online, I saw I wasn't alone. A lot of developers are trying to move into AI engineering right now. We're all reading the same books, and a lot of us aren't finishing them.&lt;/p&gt;

&lt;p&gt;This isn't a generic "non-native English" problem. It's a specific problem for a specific group going through a specific career transition.&lt;/p&gt;

&lt;p&gt;So I'm pivoting. Not "a reader for everyone." A reader for developers learning AI engineering. A narrow niche where I'm already the user.&lt;/p&gt;

&lt;h2&gt;
  
  
  The next six months
&lt;/h2&gt;

&lt;p&gt;Four things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Rebuild the product around the AI angle.&lt;/strong&gt; Trim the catalog to 15–20 AI engineering books. Rewrite the homepage. Shift the framing from translation to explanation. Improve the prompts for technical terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Actually start reading.&lt;/strong&gt; &lt;em&gt;Hands-On LLMs&lt;/em&gt; in May. &lt;em&gt;AI Engineering&lt;/em&gt; in June and July. &lt;em&gt;Building Agentic AI Systems&lt;/em&gt; in August. Not as a task — as something I want. I want to work as an AI engineer in two years, and the only way there is through these books. I'll read them inside TextStack, because if it doesn't work for me, it won't work for anyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Write about the process.&lt;/strong&gt; This is the first post. If you want to follow along, the blog has RSS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Find the first paying customer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'll say it openly:&lt;/strong&gt; if in six months there's one stranger paying for TextStack, I'll consider this project a success regardless of the other numbers. The first dollar from someone you don't know is a threshold most solo devs never cross. Crossing it is a big part of the work of leaving employment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Live at &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt;, where you can open a sample chapter of &lt;em&gt;The Pragmatic Programmer&lt;/em&gt; or &lt;em&gt;Hands-On LLMs&lt;/em&gt; without signing up.&lt;/p&gt;

&lt;p&gt;If you're in a similar spot — non-native dev, bought the AI engineering books, didn't finish them — send me a note. Twitter: &lt;a href="https://x.com/Rexetdeus" rel="noopener noreferrer"&gt;@Rexetdeus&lt;/a&gt;. Email on the site. I'll give you early access and listen to what works and what doesn't. In exchange I need honest feedback.&lt;/p&gt;

&lt;p&gt;If it's not your thing, thanks for reading this far. If someone you know is stuck on Chapter 3 of &lt;em&gt;AI Engineering&lt;/em&gt;, maybe forward them this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  P.S.
&lt;/h2&gt;

&lt;p&gt;One more thing. This problem — quitting hard books at page 40 — isn't really about English and isn't really about AI. It's that reading tools are stuck in the early 2010s while content has gotten much denser.&lt;/p&gt;

&lt;p&gt;Kindle Word Wise is from 2014, and it still shows single-word definitions that can't handle &lt;em&gt;eventual consistency&lt;/em&gt; or &lt;em&gt;attention mechanism&lt;/em&gt;. LingQ has been showing translations and adding words to SRS for close to two decades, and the core experience hasn't really changed. Readlang was a clever browser extension in 2013; development stopped when the founder went to Duolingo.&lt;/p&gt;

&lt;p&gt;Modern books need different tools. Not dictionaries — explanations. Not infinite queues — capped ones. Not one experience for everyone — context-aware understanding.&lt;/p&gt;

&lt;p&gt;That's the opening I'm walking into. I'll let you know in six months how it went.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;First post in a series about building TextStack as an AI engineering books reader. Star the repo if you want to follow along: &lt;a href="https://github.com/mrviduus/textstack" rel="noopener noreferrer"&gt;github.com/mrviduus/textstack&lt;/a&gt; · &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;textstack.app&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How We Made Our React SPA Visible to Google Without Rewriting Everything</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Sat, 17 Jan 2026 23:31:51 +0000</pubDate>
      <link>https://dev.to/mrviduus/how-we-made-our-react-spa-visible-to-google-without-rewriting-everything-1916</link>
      <guid>https://dev.to/mrviduus/how-we-made-our-react-spa-visible-to-google-without-rewriting-everything-1916</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; We needed Google to index 500+ book pages on our SPA. Instead of migrating to Next.js or building a complex SSR solution, we added dynamic rendering with Prerender in 3 files. Here's exactly how.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Google Can't See Your Beautiful SPA
&lt;/h2&gt;

&lt;p&gt;We built &lt;a href="https://textstack.app" rel="noopener noreferrer"&gt;TextStack&lt;/a&gt; — a free online library with a Kindle-like reader. React frontend, ASP.NET Core API, PostgreSQL. Classic stack, works great.&lt;/p&gt;

&lt;p&gt;One problem: &lt;strong&gt;Google saw nothing.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- What users see --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;As I Lay Dying by William Faulkner | TextStack&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;As I Lay Dying&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;After a woman in rural Mississippi dies...&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;&amp;lt;!-- 98 chapters, rich metadata, Schema.org markup --&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;&amp;lt;!-- What Googlebot saw --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Free Online Library | TextStack&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;&amp;lt;!-- Empty. Nothing. Void. --&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We had 500+ books with beautiful SEO metadata, Schema.org structured data, Open Graph tags — all generated client-side. Googlebot executes JavaScript, but it's inconsistent and slow. Our pages weren't getting indexed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Options We Considered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Server-Side Rendering (Next.js/Remix)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Industry standard, great DX, built-in optimizations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Complete frontend rewrite. Our React app was ~50 components, custom reader with offline sync, complex state management. Estimated time: 3-4 weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Static Site Generation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Fastest possible page loads, works everywhere&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; We tried this. Built a Next.js SSG version. It worked... until we opened it in the browser. The reader was broken. Styles were wrong. We'd essentially need to maintain two frontends.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Dynamic Rendering (Prerender)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Zero changes to existing React app. Add a service, configure nginx, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Additional infrastructure, slight latency for first bot request.&lt;/p&gt;

&lt;p&gt;We chose &lt;strong&gt;Option 3&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Dynamic Rendering?
&lt;/h2&gt;

&lt;p&gt;Dynamic rendering means serving different content based on who's asking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Regular User (Chrome, Safari, Firefox):
  User → nginx → SPA (index.html + JS) → JS renders in browser

Search Bot (Googlebot, Bingbot):
  Bot → nginx → Prerender → Headless Chrome renders page → HTML response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Google &lt;a href="https://developers.google.com/search/docs/crawling-indexing/javascript/dynamic-rendering" rel="noopener noreferrer"&gt;documents this approach&lt;/a&gt; as a workaround rather than a long-term solution, and doesn't consider it cloaking as long as bots and users get similar content.&lt;/p&gt;
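
&lt;p&gt;The routing decision itself is small enough to sketch in TypeScript. This is an illustration of the idea, not what we run (the real thing is nginx config): &lt;code&gt;chooseRoute&lt;/code&gt;, &lt;code&gt;BOT_PATTERN&lt;/code&gt;, and &lt;code&gt;STATIC_FILE&lt;/code&gt; are made-up names, and the bot list is shortened.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch of the dynamic rendering decision; names are illustrative.
const BOT_PATTERN = /googlebot|bingbot|yandex|facebookexternalhit|twitterbot/i;
const STATIC_FILE = /\.(js|css|png|jpg|svg|woff2)$/i;

type Route = 'spa' | 'prerender';

function chooseRoute(userAgent: string, path: string): Route {
  // Assets are served as files no matter who asks.
  if (STATIC_FILE.test(path)) return 'spa';
  // Known crawlers get rendered HTML; everyone else gets the SPA shell.
  return BOT_PATTERN.test(userAgent) ? 'prerender' : 'spa';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Matching on a User-Agent substring is easy to spoof, which is fine here: a spoofed request only gets the same content as rendered HTML.&lt;/p&gt;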

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Here's our setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                         nginx                                │
│  ┌─────────────────────────────────────────────────────────┐│
│  │ Check User-Agent                                        ││
│  │ Is it Googlebot/Bingbot/etc?                           ││
│  └──────────────┬────────────────────┬────────────────────┘│
│                 │ YES               │ NO                    │
│                 ▼                   ▼                       │
│  ┌──────────────────┐    ┌──────────────────┐              │
│  │ Prerender        │    │ Static Files /   │              │
│  │ (Headless Chrome)│    │ Vite Dev Server  │              │
│  └────────┬─────────┘    └──────────────────┘              │
│           │                                                 │
│           ▼                                                 │
│  ┌──────────────────┐                                      │
│  │ Fetch &amp;amp; Render   │                                      │
│  │ React App        │──────► API                           │
│  └──────────────────┘                                      │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Add Prerender Service
&lt;/h3&gt;

&lt;p&gt;We used &lt;code&gt;tvanro/prerender-alpine&lt;/code&gt;, a lightweight Docker image with headless Chrome:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prerender&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tvanro/prerender-alpine:7.2.0&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;books_prerender&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MEMORY_CACHE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;      &lt;span class="c1"&gt;# Enable in-memory cache&lt;/span&gt;
      &lt;span class="na"&gt;CACHE_MAXSIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;   &lt;span class="c1"&gt;# Cache up to 500 pages&lt;/span&gt;
      &lt;span class="na"&gt;CACHE_TTL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;      &lt;span class="c1"&gt;# 1 hour cache&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3030:3000"&lt;/span&gt;
    &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1G&lt;/span&gt;       &lt;span class="c1"&gt;# Chrome is hungry&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure nginx Bot Detection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bot detection map&lt;/span&gt;
&lt;span class="k"&gt;map&lt;/span&gt; &lt;span class="nv"&gt;$http_user_agent&lt;/span&gt; &lt;span class="nv"&gt;$prerender_ua&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;default&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;"~*googlebot"&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;"~*bingbot"&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;"~*yandex"&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;"~*facebookexternalhit"&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;"~*twitterbot"&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;"~*linkedinbot"&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;"~*slackbot"&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;"~*whatsapp"&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;"~*applebot"&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;# Add more as needed&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Route Bots to Prerender
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;textstack.app&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;# Internal prerender location&lt;/span&gt;
    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/prerender-internal/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;internal&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://prerender:3000/&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_connect_timeout&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_read_timeout&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;# Check if bot&lt;/span&gt;
        &lt;span class="kn"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;$prerender&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;if&lt;/span&gt; &lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$prerender_ua&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;$prerender&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Don't prerender static files&lt;/span&gt;
        &lt;span class="kn"&gt;if&lt;/span&gt; &lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$uri&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;.(js|css|png|jpg|svg|woff2)&lt;/span&gt;$&lt;span class="s"&gt;")&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;$prerender&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Route bots to prerender&lt;/span&gt;
        &lt;span class="kn"&gt;if&lt;/span&gt; &lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$prerender&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;rewrite&lt;/span&gt; &lt;span class="s"&gt;^(.*)&lt;/span&gt;$ &lt;span class="n"&gt;/prerender-internal/http://&lt;/span&gt;&lt;span class="nv"&gt;$host$1&lt;/span&gt; &lt;span class="s"&gt;last&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Normal users get SPA&lt;/span&gt;
        &lt;span class="kn"&gt;try_files&lt;/span&gt; &lt;span class="nv"&gt;$uri&lt;/span&gt; &lt;span class="nv"&gt;$uri&lt;/span&gt;&lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="n"&gt;/index.html&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
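
&lt;p&gt;To see what the Prerender container actually receives, it helps to trace the &lt;code&gt;rewrite&lt;/code&gt; and &lt;code&gt;proxy_pass&lt;/code&gt; pair by hand. A small sketch, assuming nginx hands the rewritten URI to the upstream unchanged; &lt;code&gt;prerenderRequestUrl&lt;/code&gt; is a made-up helper, not part of nginx or Prerender.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical helper that mirrors the rewrite + proxy_pass pair above.
const PRERENDER_ORIGIN = 'http://prerender:3000'; // from the proxy_pass line
const INTERNAL_PREFIX = '/prerender-internal/';   // the internal location

function prerenderRequestUrl(host: string, uri: string): string {
  // rewrite ^(.*)$ /prerender-internal/http://$host$1 last;
  const rewritten = INTERNAL_PREFIX + 'http://' + host + uri;
  // proxy_pass replaces the matched location prefix with "/"
  const upstreamPath = '/' + rewritten.slice(INTERNAL_PREFIX.length);
  return PRERENDER_ORIGIN + upstreamPath;
}

console.log(prerenderRequestUrl('textstack.app', '/books/as-i-lay-dying'));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The page URL ends up as the path of the request to Prerender, which is the form the service expects: it loads that URL in headless Chrome and returns the rendered HTML.&lt;/p&gt;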



&lt;h2&gt;
  
  
  The Challenge: API Calls Inside Prerender
&lt;/h2&gt;

&lt;p&gt;Here's where it got interesting. After setting everything up, our book detail pages showed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;Error&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;Failed to fetch&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Our SPA makes API calls to fetch book data. The API URL was configured as &lt;code&gt;http://localhost:8080&lt;/code&gt;. Inside the Prerender container, &lt;code&gt;localhost&lt;/code&gt; is... the container itself. Not our API.&lt;/p&gt;

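&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Vite's dev server proxy. The SPA calls a relative &lt;code&gt;/api&lt;/code&gt; path, and the dev server forwards it to the API container by its Docker service name.&lt;br&gt;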
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// vite.config.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineConfig&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://api:8080&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Docker service name&lt;/span&gt;
        &lt;span class="na"&gt;changeOrigin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;rewrite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;api/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;// Allow prerender to access via Docker network&lt;/span&gt;
    &lt;span class="na"&gt;allowedHosts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;web&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we changed our API base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;API_BASE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:8080&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;// After&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;API_BASE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;// Relative, works everywhere&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when Prerender's Chrome loads our SPA:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;JS executes and calls &lt;code&gt;/api/en/books/some-book&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Vite proxies to &lt;code&gt;http://api:8080/en/books/some-book&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;API returns data&lt;/li&gt;
&lt;li&gt;React renders the page&lt;/li&gt;
&lt;li&gt;Prerender captures the HTML&lt;/li&gt;
&lt;/ol&gt;
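&lt;p&gt;The path rewrite in step 2 can be sketched in isolation. This is a minimal illustration rather than part of the real config: &lt;code&gt;proxiedUrl&lt;/code&gt; and &lt;code&gt;PROXY_TARGET&lt;/code&gt; are hypothetical names, and the function simply applies the same regex as the &lt;code&gt;rewrite&lt;/code&gt; option shown earlier.&lt;/p&gt;

```typescript
// Hypothetical stand-in for the proxy target configured in vite.config.ts
const PROXY_TARGET = 'http://api:8080';

// Applies the same rewrite as the proxy rule: strip the leading /api,
// then forward the remaining path to the Docker service.
function proxiedUrl(path: string): string {
  const rewritten = path.replace(/^\/api/, '');
  return PROXY_TARGET + rewritten;
}

console.log(proxiedUrl('/api/en/books/some-book'));
// prints "http://api:8080/en/books/some-book"
```

&lt;p&gt;Because the browser only ever sees the relative &lt;code&gt;/api&lt;/code&gt; path, the same bundle works whether it runs on a developer machine or inside the Prerender container.&lt;/p&gt;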

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before (what Googlebot saw):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Free Online Library | TextStack&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;As I Lay Dying by William Faulkner | TextStack&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"description"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"After a woman in rural Mississippi dies,
her husband and five children begin an arduous journey..."&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"application/ld+json"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@context&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://schema.org&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Book&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;As I Lay Dying&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;author&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Person&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;William Faulkner&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;description&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;inLanguage&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;As I Lay Dying&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"book-detail__author"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;William Faulkner&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;ul&amp;gt;&lt;/span&gt;&lt;span class="c"&gt;&amp;lt;!-- 98 chapters with links --&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/ul&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First request (cold): ~3-5 seconds (Chrome needs to render)&lt;/li&gt;
&lt;li&gt;Cached requests: ~50ms&lt;/li&gt;
&lt;li&gt;Cache hit rate: ~95% (bots recrawl the same pages)&lt;/li&gt;
&lt;/ul&gt;
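&lt;p&gt;To put those numbers together, here is a back-of-the-envelope estimate of the average response time a bot sees. It assumes a 4-second cold render (the midpoint of the range above), and the names are hypothetical. Treat it as rough arithmetic, not a measurement.&lt;/p&gt;

```typescript
// Figures from the list above; the 4000 ms cold render is an assumed midpoint of 3-5 s.
const HIT_RATE_PERCENT = 95;
const CACHED_MS = 50;
const COLD_MS = 4000;

// Weighted average of cached and cold responses, using integer percentages
// to keep the arithmetic exact.
function averageLatencyMs(hitRatePercent: number, cachedMs: number, coldMs: number): number {
  const missRatePercent = 100 - hitRatePercent;
  return (hitRatePercent * cachedMs + missRatePercent * coldMs) / 100;
}

console.log(averageLatencyMs(HIT_RATE_PERCENT, CACHED_MS, COLD_MS));
// prints 247.5
```

&lt;p&gt;Under these assumptions, the rare cold renders account for 200 of the 247.5 ms average, so the hit rate matters far more than shaving time off cached responses.&lt;/p&gt;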

&lt;h2&gt;
  
  
  Quick Test
&lt;/h2&gt;

&lt;p&gt;Want to see what Googlebot sees on your site?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Your site as a regular user&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"https://yoursite.com/page"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;title&amp;gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Your site as Googlebot&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="s2"&gt;"Googlebot"&lt;/span&gt; &lt;span class="s2"&gt;"https://yoursite.com/page"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;title&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the second command returns the generic shell title (or nothing at all) instead of a page-specific one, crawlers aren't seeing your rendered content, and you have an SEO problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Files Changed
&lt;/h2&gt;

&lt;p&gt;The entire implementation touched &lt;strong&gt;5 files&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Lines&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;docker-compose.yml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;+20&lt;/td&gt;
&lt;td&gt;Add prerender service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nginx.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;+85&lt;/td&gt;
&lt;td&gt;Bot detection &amp;amp; routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vite.config.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;+10&lt;/td&gt;
&lt;td&gt;API proxy for prerender&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;docker-compose.prod.yml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;+18&lt;/td&gt;
&lt;td&gt;Production prerender&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nginx-prod.conf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;+118&lt;/td&gt;
&lt;td&gt;Production bot routing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No React components changed. No business logic touched. The SPA remains exactly as it was.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should You Use This?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes, if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have an existing SPA that works well&lt;/li&gt;
&lt;li&gt;You need SEO but can't justify a rewrite&lt;/li&gt;
&lt;li&gt;Your content doesn't change every second&lt;/li&gt;
&lt;li&gt;You're comfortable with Docker/nginx&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No, if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're starting a new project (just use Next.js)&lt;/li&gt;
&lt;li&gt;You need real-time SEO updates&lt;/li&gt;
&lt;li&gt;Your pages are highly personalized&lt;/li&gt;
&lt;li&gt;You can't add infrastructure&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Have questions about implementing this for your SPA? Drop a comment below or open an issue on GitHub!&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#react&lt;/code&gt; &lt;code&gt;#seo&lt;/code&gt; &lt;code&gt;#docker&lt;/code&gt; &lt;code&gt;#nginx&lt;/code&gt; &lt;code&gt;#webdev&lt;/code&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>react</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to Expose Your Local Server to the Internet (Without Port Forwarding)</title>
      <dc:creator>Vasyl</dc:creator>
      <pubDate>Fri, 02 Jan 2026 17:37:18 +0000</pubDate>
      <link>https://dev.to/mrviduus/how-to-expose-your-local-server-to-the-internet-without-port-forwarding-3f3h</link>
      <guid>https://dev.to/mrviduus/how-to-expose-your-local-server-to-the-internet-without-port-forwarding-3f3h</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Use Cloudflare Tunnel to make your home server accessible from anywhere. Free, secure, no router configuration needed.&lt;/p&gt;




&lt;p&gt;I recently deployed a web app running on my laptop to a real domain. No cloud hosting, no VPS, no monthly bills. Just my laptop, a domain, and Cloudflare Tunnel.&lt;/p&gt;

&lt;p&gt;Here's exactly how I did it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I wanted to host my side project on my own hardware. Sounds simple, right?&lt;/p&gt;

&lt;p&gt;But my setup had issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No access to the main router's admin panel&lt;/li&gt;
&lt;li&gt;Dynamic IP address&lt;/li&gt;
&lt;li&gt;Port forwarding wasn't an option&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional solutions like opening ports 80/443 weren't going to work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Cloudflare Tunnel
&lt;/h2&gt;

&lt;p&gt;Cloudflare Tunnel (formerly Argo Tunnel) creates an &lt;strong&gt;outbound connection&lt;/strong&gt; from your server to Cloudflare's edge. No incoming ports needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet → Cloudflare (SSL) → Tunnel → Your laptop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's free and handles SSL automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A domain name (any registrar works)&lt;/li&gt;
&lt;li&gt;A Cloudflare account (free tier is fine)&lt;/li&gt;
&lt;li&gt;A Linux/Mac/Windows machine running your app&lt;/li&gt;
&lt;li&gt;~10 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Add Your Domain to Cloudflare
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://dash.cloudflare.com" rel="noopener noreferrer"&gt;dash.cloudflare.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add a site&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Enter your domain (e.g., &lt;code&gt;myapp.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Free&lt;/strong&gt; plan&lt;/li&gt;
&lt;li&gt;Cloudflare will show you new nameservers&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 2: Update Nameservers
&lt;/h2&gt;

&lt;p&gt;Go to your domain registrar and replace the nameservers with Cloudflare's.&lt;/p&gt;

&lt;p&gt;For example, if you're using Porkbun:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to Domain Management → Your domain&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Nameservers&lt;/strong&gt; → &lt;strong&gt;Edit&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Replace them with the Cloudflare nameservers shown in your dashboard (yours will likely differ from these):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   carter.ns.cloudflare.com
   vita.ns.cloudflare.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Save and wait 5-30 minutes for propagation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 3: Create a Tunnel
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;In Cloudflare, go to &lt;strong&gt;Zero Trust&lt;/strong&gt; (left sidebar)&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Networks&lt;/strong&gt; → &lt;strong&gt;Tunnels&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Create a tunnel&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Name it something like &lt;code&gt;my-server&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You'll get an installation command with a token&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 4: Install Cloudflared
&lt;/h2&gt;

&lt;p&gt;On your server, install the &lt;code&gt;cloudflared&lt;/code&gt; daemon:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ubuntu/Debian:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; cloudflared.deb https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; cloudflared.deb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;macOS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;cloudflared
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows:&lt;/strong&gt;&lt;br&gt;
Download from &lt;a href="https://github.com/cloudflare/cloudflared/releases" rel="noopener noreferrer"&gt;GitHub releases&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 5: Connect the Tunnel
&lt;/h2&gt;

&lt;p&gt;Run the command Cloudflare gave you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;cloudflared service &lt;span class="nb"&gt;install&lt;/span&gt; &amp;lt;YOUR_TOKEN&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs cloudflared as a system service that starts automatically on boot.&lt;/p&gt;

&lt;p&gt;On Linux with systemd, check that it's running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status cloudflared
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Add a Public Hostname
&lt;/h2&gt;

&lt;p&gt;Back in Cloudflare Dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your tunnel → &lt;strong&gt;Configure&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Public Hostname&lt;/strong&gt; → &lt;strong&gt;Add a public hostname&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Fill in:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subdomain:&lt;/strong&gt; leave empty (or use &lt;code&gt;www&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain:&lt;/strong&gt; select your domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Type:&lt;/strong&gt; &lt;code&gt;HTTP&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;code&gt;localhost:80&lt;/code&gt; (or whatever port your app runs on)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you get a "DNS record already exists" error, first delete the existing A, AAAA, or CNAME record for that hostname in &lt;strong&gt;DNS&lt;/strong&gt; → &lt;strong&gt;Records&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Test It
&lt;/h2&gt;

&lt;p&gt;Open your domain in a browser. It should load your local app over HTTPS!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Or test from command line&lt;/span&gt;
curl https://myapp.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running Multiple Services
&lt;/h2&gt;

&lt;p&gt;You can route different domains or paths to different local services:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;myapp.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;localhost:3000&lt;/code&gt; (frontend)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;api.myapp.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;localhost:8080&lt;/code&gt; (API)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Just add multiple public hostnames in the tunnel configuration.&lt;/p&gt;
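&lt;p&gt;Conceptually, the tunnel keeps a hostname-to-service table and picks a local service for each request. The sketch below models that lookup with the hostnames from the table above. &lt;code&gt;routes&lt;/code&gt; and &lt;code&gt;resolveService&lt;/code&gt; are hypothetical names, and the fallback value is an assumption for illustration, not something you configure in the dashboard steps here.&lt;/p&gt;

```typescript
// Hypothetical routing table mirroring the two public hostnames above
const routes: { hostname: string; service: string }[] = [
  { hostname: 'myapp.com', service: 'http://localhost:3000' },     // frontend
  { hostname: 'api.myapp.com', service: 'http://localhost:8080' }, // API
];

// Assumed catch-all for hostnames that have no public hostname entry
const FALLBACK = 'http_status:404';

// Returns the local service for an incoming hostname, or the fallback.
function resolveService(hostname: string): string {
  for (const route of routes) {
    if (route.hostname === hostname) {
      return route.service;
    }
  }
  return FALLBACK;
}

console.log(resolveService('api.myapp.com')); // prints "http://localhost:8080"
console.log(resolveService('other.example')); // prints "http_status:404"
```

&lt;p&gt;Each row you add under public hostnames corresponds to one more entry in a table like this.&lt;/p&gt;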

&lt;h2&gt;
  
  
  My Setup
&lt;/h2&gt;

&lt;p&gt;Here's what I'm running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;textstack.app     → localhost:80 → nginx → React frontend
textstack.app/api → localhost:80 → nginx → Docker API (port 8080)
textstack.dev     → localhost:80 → nginx → Same app, different site
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All from a laptop sitting in my living room.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Tips
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Firewall:&lt;/strong&gt; Only allow necessary ports locally
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 22/tcp   &lt;span class="c"&gt;# SSH&lt;/span&gt;
   &lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 80/tcp   &lt;span class="c"&gt;# Optional: LAN access only; the tunnel reaches the app over loopback&lt;/span&gt;
   &lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't expose admin panels&lt;/strong&gt; to the internet. Keep them on &lt;code&gt;localhost&lt;/code&gt; only.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloudflare adds SSL automatically&lt;/strong&gt;, so there's no need to set up Let's Encrypt yourself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Access Policies&lt;/strong&gt; (in Zero Trust) to require authentication for sensitive routes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Site not loading?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check tunnel is running&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status cloudflared

&lt;span class="c"&gt;# Check your app is running&lt;/span&gt;
curl localhost:80

&lt;span class="c"&gt;# Check DNS propagation&lt;/span&gt;
dig myapp.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;"Address Not Found" on mobile?&lt;/strong&gt;&lt;br&gt;
DNS might not have propagated to your carrier yet. Try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Toggle airplane mode on/off&lt;/li&gt;
&lt;li&gt;Use a different DNS resolver (e.g., 1.1.1.1)&lt;/li&gt;
&lt;li&gt;Wait 15-30 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;502 Bad Gateway?&lt;/strong&gt;&lt;br&gt;
Your local app isn't responding. Check it's running on the correct port.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Cloudflare Tunnel: &lt;strong&gt;Free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Domain: ~$10/year&lt;/li&gt;
&lt;li&gt;Hosting: &lt;strong&gt;$0&lt;/strong&gt; (your own hardware)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When NOT to Use This
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;High-traffic production apps (your home internet has limits)&lt;/li&gt;
&lt;li&gt;Apps requiring 99.99% uptime (your laptop can crash)&lt;/li&gt;
&lt;li&gt;Sensitive data without proper security measures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For serious production workloads, use proper cloud hosting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Cloudflare Tunnel is perfect for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Side projects&lt;/li&gt;
&lt;li&gt;Development/staging environments&lt;/li&gt;
&lt;li&gt;Self-hosted apps&lt;/li&gt;
&lt;li&gt;Home automation dashboards&lt;/li&gt;
&lt;li&gt;Personal APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It took me about 15 minutes to go from "app running locally" to "app accessible worldwide with HTTPS."&lt;/p&gt;

&lt;p&gt;No cloud bills. No DevOps complexity. Just your code, running on your hardware, accessible to the world.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Have questions?&lt;/strong&gt; Drop a comment below or find me on &lt;a href="https://x.com/rexetdeus" rel="noopener noreferrer"&gt;Twitter/X&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building something cool with this setup?&lt;/strong&gt; I'd love to hear about it!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: cloudflare, self-hosting, devops, tutorial, web-development&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>networking</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
