DEV Community: Vasyl

Your RAG Eval Isn't Flaky. Your Retrieval Is Non-Deterministic.

Vasyl — Tue, 14 Jul 2026 14:48:00 +0000

Same query.
Same documents.
Same model.
And the RAG eval can still hand back a different Recall@8.

Not because the model is flaky. Because of an ORDER BY clause.

I didn't find this by watching a metric wobble. I found it reading the retrieval code, and realized the score would drift run to run even if the model never changed.

This came out of a habit I've adopted recently: I write the eval before the feature. Reviewing the retrieval pipeline behind my "Ask this Book" feature, I saw it: the retrieval layer wasn't deterministic.

Order isn't presentation. It's part of the input.

My RAG implementation is intentionally simple: plain PostgreSQL and .NET. Two retrieval strategies over the same table:

semantic search using pgvector
lexical search using PostgreSQL full-text search

The results are merged with Reciprocal Rank Fusion (RRF).

Here's the important part: RRF doesn't care about the retrieval scores. It only cares about rank.

If one retriever returns

A
B
C

instead of

B
A
C

RRF produces different fused scores. Different fused scores mean a different Top-K. Different Top-K means different Recall@K.

In RRF, order isn't a display detail. Order is data.
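
To make that concrete, here is a minimal RRF sketch in C#. The `Fuse` helper, the chunk ids, and `k = 60` (a commonly used constant) are illustrative assumptions, not TextStack's actual code. The two lexical lists contain the same rows; only A and B swap places, as they could if they tied on score:

```csharp
// Minimal RRF sketch. Fuse and the chunk ids are hypothetical, for illustration only.
using System;
using System.Collections.Generic;
using System.Linq;

string[] semantic = ["X", "A", "B"];

// Same lexical rows both times; only the order of A and B differs.
Console.WriteLine(Fuse(semantic, ["A", "B", "Y"])[0]); // A
Console.WriteLine(Fuse(semantic, ["B", "A", "Y"])[0]); // B

static List<string> Fuse(IReadOnlyList<string> first, IReadOnlyList<string> second, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var list in new[] { first, second })
        for (int i = 0; i < list.Count; i++)
            // i is 0-based, RRF rank is 1-based.
            scores[list[i]] = scores.GetValueOrDefault(list[i]) + 1.0 / (k + i + 1);

    return scores
        .OrderByDescending(kv => kv.Value)
        .ThenBy(kv => kv.Key, StringComparer.Ordinal) // keep fused ties deterministic too
        .Select(kv => kv.Key)
        .ToList();
}
```

With a Top-1 cutoff, the first run retrieves A and the second retrieves B, so any recall metric computed on that cutoff can differ between runs.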

The bug

My lexical query ended like this:

ORDER BY score DESC

Looks perfectly reasonable. Except ts_rank_cd produces ties surprisingly often. Multiple chunks can have exactly the same score.

And SQL only guarantees the ordering you explicitly request. If multiple rows compare equal, PostgreSQL is free to return them in any order.

Nothing changed. Same database. Same query. Same model. Only the order of equally-ranked rows. Yet that's enough for RRF to assign different ranks, producing different fused scores and a different evaluation result.

The semantic retrieval had the same issue. Distance ties are much rarer than lexical ties, but "rare" isn't good enough for an evaluation pipeline.

The fix

The fix was almost embarrassingly small.

Before:

ORDER BY score DESC

After:

ORDER BY score DESC, id

A deterministic tie-breaker on both retrieval queries. Now equal-scoring rows always appear in the same order, RRF receives the same input every run, and the Top-K stays identical.
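
The same rule applies to any ranking step done in application code. A small sketch, assuming a hypothetical `Row` record and `Rank` helper: LINQ's sort is stable, so without a tie-breaker tied rows keep whatever order they arrived in; adding `ThenBy` on a unique id removes that dependency.

```csharp
// Hypothetical Row and Rank, for illustration only.
using System;
using System.Linq;

Row[] runOne = [new(7, 0.5), new(3, 0.5), new(9, 0.8)];
Row[] runTwo = [new(3, 0.5), new(9, 0.8), new(7, 0.5)]; // same rows, different arrival order

Console.WriteLine(string.Join(",", Rank(runOne))); // 9,3,7
Console.WriteLine(string.Join(",", Rank(runTwo))); // 9,3,7

static long[] Rank(Row[] rows) =>
    rows.OrderByDescending(r => r.Score)
        .ThenBy(r => r.Id) // unique id as the tie-breaker
        .Select(r => r.Id)
        .ToArray();

record Row(long Id, double Score);
```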

Notice what didn't happen. The retrieval didn't become better. It became reproducible.

Why this matters

We spend a lot of effort making the model deterministic during evaluation: temperature 0, fixed datasets, golden answers, reproducible prompts.

But it's easy to assume everything underneath the model is already deterministic. Often it isn't. Retrieval. Ranking. Sampling. Data loading. Any non-deterministic stage in the pipeline can quietly invalidate your eval.

A fluctuating eval isn't just annoying. It's dangerous. Eventually you stop trusting the number, even when it's pointing at a real problem.

The lesson I took away

Before debugging the model, debug determinism. An evaluation can only be as deterministic as the pipeline feeding it. Same query. Same rows. Same order. Only then can you trust what your eval is telling you.

I build TextStack, an open-source reader for technical books, in .NET. This is from the retrieval layer behind its "Ask this Book" feature. Code on github.com/mrviduus/textstack.

AI Wrote a Thread-Safe Counter. The CPU Made It 5x Slower.

Vasyl — Tue, 07 Jul 2026 13:00:00 +0000

Meet the cache line.

I asked an AI assistant for a simple thing: per-thread counters. Four threads, each incrementing its own slot in an array. No shared variables. No locks needed. The code it wrote was clean and correct:

long[] counters = new long[threadCount];
// thread t does: counters[t]++

Every thread writes only to its own element. There is no race here. Any code review would pass it. Every test would pass too.

Then I measured it against a version that does exactly the same work, and the correct code lost by 5x.

The numbers

This is from my machine (Apple Silicon, .NET 10, 4 threads, 200 million increments per thread):

Round 1:  adjacent (one cache line):  464 ms   padded (line per thread):  85 ms   ratio: 5.4x
Round 2:  adjacent (one cache line):  397 ms   padded (line per thread):  85 ms   ratio: 4.7x

Same loop. Same number of increments. Same "each thread touches only its own counter". The only difference between the two versions is where the counters live in memory.

What the hardware actually does

A CPU never reads one byte from memory. It moves data in fixed blocks called cache lines — 64 bytes on x86, 128 bytes on Apple Silicon. Ask for one long and the whole block it lives in travels into the core's cache. Think of a cook whose ingredients are in a basement fridge: going downstairs is expensive, so you never carry one carrot — you carry the whole crate.

(Don't take my word for the 128: run sysctl hw.cachelinesize on an M-series Mac.)

Usually this works for you. Array elements sit side by side, so scanning an array is fast: touch one 8-byte element and, on a 128-byte line, the next fifteen arrive in the same crate for free.

But with multiple cores there is a rule: to write into a cache line, a core must own it exclusively. The moment core 1 writes, every other core's copy of that line is declared stale. And the unit of ownership is not your variable. It is the whole line.

Now look at my four counters. Four long values, 8 bytes each, side by side — 32 bytes. They all fit in one cache line. Thread 1 increments its counter and takes ownership of the line. A nanosecond later thread 2 increments its own, different counter — and has to rip the same line back. The line ping-pongs between cores on every single write. Four threads that share nothing in the code are fighting over one crate in the hardware.

This is called false sharing. False, because no data is actually shared. The fight is real anyway.

Four counters in one cache line vs. one counter per line — same code, 5x difference.

The demo fix is one attribute

Give every thread its own cache line. Pad each counter so the next one starts in a different line:

[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedCounter
{
    [FieldOffset(64)] public long Value;
}

That is the entire difference between 464 ms and 85 ms. We pay a little memory — 128 bytes per counter instead of 8 — and get back the parallelism we thought we already had. (I use 128, not 64, for two reasons: Apple Silicon lines are 128 bytes, and on x86 the adjacent-line prefetcher likes to drag neighboring lines along.)
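
If you want to check that the attribute does what it claims, a quick sketch like the one below prints the struct size and the distance between two neighbouring counters. It uses `Unsafe.SizeOf` and `Unsafe.ByteOffset` from `System.Runtime.CompilerServices`; the four-element array is just an example.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

var counters = new PaddedCounter[4];

// Size of one element, padding included.
Console.WriteLine(Unsafe.SizeOf<PaddedCounter>()); // 128
// Distance in bytes between the Value fields of two neighbouring elements.
Console.WriteLine(Unsafe.ByteOffset(ref counters[0].Value, ref counters[1].Value)); // 128

[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedCounter
{
    [FieldOffset(64)] public long Value;
}
```

The array itself is not guaranteed to start on a cache-line boundary, but two addresses 128 bytes apart can never fall into the same 128-byte line, which is all the padding needs to achieve.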

Run it yourself

The full demo is about 70 lines, no project file needed — with the .NET 10 SDK you can run a single file directly:

dotnet run -c Release FalseSharingDemo.cs

// FalseSharingDemo.cs — .NET 10 file-based app
using System.Diagnostics;
using System.Runtime.InteropServices;

const long Iterations = 200_000_000;
int threadCount = Math.Min(Environment.ProcessorCount, 4);

for (int round = 0; round < 2; round++)
{
    var slow = MeasureAdjacent(threadCount);
    var fast = MeasurePadded(threadCount);
    Console.WriteLine($"adjacent: {slow.TotalMilliseconds:F0} ms   padded: {fast.TotalMilliseconds:F0} ms   ratio: {slow / fast:F1}x");
}

static TimeSpan MeasureAdjacent(int threadCount)
{
    long[] counters = new long[threadCount];
    return RunThreads(threadCount, t =>
    {
        for (long i = 0; i < Iterations; i++)
            counters[t]++;
        return counters[t];
    });
}

static TimeSpan MeasurePadded(int threadCount)
{
    var counters = new PaddedCounter[threadCount];
    return RunThreads(threadCount, t =>
    {
        for (long i = 0; i < Iterations; i++)
            counters[t].Value++;
        return counters[t].Value;
    });
}

static TimeSpan RunThreads(int threadCount, Func<int, long> body)
{
    long sink = 0;
    var threads = new Thread[threadCount];
    var sw = Stopwatch.StartNew();
    for (int t = 0; t < threadCount; t++)
    {
        int id = t;
        threads[t] = new Thread(() => Interlocked.Add(ref sink, body(id)));
        threads[t].Start();
    }
    foreach (var th in threads) th.Join();
    sw.Stop();
    GC.KeepAlive(sink);
    return sw.Elapsed;
}

[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedCounter
{
    [FieldOffset(64)] public long Value;
}

Where this hides in real code

You will not write four counters in a loop at work. But you will write, or an AI will write for you:

Cache statistics. Almost every in-memory cache keeps hits, misses, evictions. The natural implementation is fields next to each other, or a long[] with one slot per shard, updated with Interlocked.Increment from every thread. Fields next to each other means one cache line. This is literally the demo above, running in your production service.

Sharded counters. The cruel version: you sharded a counter specifically to make it parallel, put the shards in one array — and they still share lines. You did the architecture work and the hardware quietly undid it.

LRU metadata. A compact long[] lastAccessTicks per cache slot means every cache read becomes a write into a hot shared array. A read-heavy cache that is slow because of writes is a fun thing to debug.

And sometimes the problem ships inside the library. ConcurrentDictionary — the base of most homemade .NET caches — internally keeps a counter per lock stripe in a plain array (_countPerLock in the source). Under very hot multi-threaded writes those neighbors can end up sharing lines. I have not benchmarked the real-world impact — but the layout is right there in the source, and now you know what to look for.
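
For the sharded-counter case, one possible shape is sketched below. `ShardedCounter` is a hypothetical name, and picking the shard from the managed thread id is an assumption chosen for brevity; the point is only that each shard lives in its own padded slot.

```csharp
// Hypothetical ShardedCounter: one padded slot per shard.
using System;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

var hits = new ShardedCounter(Environment.ProcessorCount);
Parallel.For(0, 1_000_000, _ => hits.Increment());
Console.WriteLine(hits.Sum()); // 1000000

sealed class ShardedCounter(int shardCount)
{
    private readonly PaddedCounter[] _shards = new PaddedCounter[shardCount];

    public void Increment()
    {
        // Two threads may map to the same shard; Interlocked keeps that correct.
        int shard = Environment.CurrentManagedThreadId % _shards.Length;
        Interlocked.Increment(ref _shards[shard].Value);
    }

    // A snapshot: increments that land while this loop runs may be missed.
    public long Sum()
    {
        long total = 0;
        for (int i = 0; i < _shards.Length; i++)
            total += Interlocked.Read(ref _shards[i].Value);
        return total;
    }
}

[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedCounter
{
    [FieldOffset(64)] public long Value;
}
```

Whether the padding pays off depends on how hot the writes are, so it is worth timing this against a plain `long[]` version on your own workload.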

When you should NOT care

Honesty section. False sharing hurts when the writes are hot — millions of updates per second from several threads. If your cache updates its stats a thousand times per second, you will never notice, and padding everything "just in case" is cargo cult. The rule is the same as always: measure first. The demo above is the measurement; adapt it to your data layout and see if your ratio is 1.0x or 5x.

The actual point

The AI-generated code was not wrong. It compiled, it was race-free, it passed every test I could write for its correctness. An entire code review process could bless it. The 5x was invisible at every layer we normally check.

That is what changed with AI-assisted coding, and it is why hardware fundamentals became more valuable, not less. The model will happily generate a thread-safe counter, a sharded cache, an LRU eviction policy — and none of its correctness guarantees say anything about cache lines. Correctness and mechanical sympathy are different layers. Tests catch the first. Only understanding catches the second.

You do not need to memorize cache sizes. You need to know the crate exists. Keep data that is used together close. Keep data that is written by different threads apart. That one rule, read in both directions, is most of "cache-aware" programming.

The tools write the code now. Knowing why it is slow is still our job.

Originally published on vasyl.blog.

An AI Feature Has No "Tests Pass" Moment. So I Write the Eval First.

Vasyl — Tue, 23 Jun 2026 12:00:00 +0000

I was building an "Ask This Book" feature: readers can ask questions about a book while they're reading it.

One requirement sounded simple:

A reader on chapter 3 must never receive spoilers from chapter 30.

My first instinct was the same as everyone else's: tell the model not to spoil future chapters. Something like:

"Please don't reveal information from chapters the reader hasn't reached yet."

And honestly, it mostly worked.

The problem is that "mostly" is useless. A user only needs one spoiler.

That was the moment I realized the feature had no definition of done.

With normal software, something pushes back. The compiler complains. The tests fail. The types don't line up.

With an LLM feature, none of that happens. The output looks plausible by default — fluent, confident, well formatted — even when it's wrong.

So "it looked right in the demo" quietly becomes the finish line.

That's exactly why I write the eval before I write the feature.

The Eval Is the Specification

Most teams treat evals as QA. Build the feature, ship something that works, add evals later.

I increasingly think that's backwards. For AI systems, the eval is often the only concrete definition of success.

The moment I wrote the spoiler eval, I had to define failure: spoiler leakage must be zero. Not low. Not acceptable. Zero.

And that requirement immediately exposed a problem. No prompt can guarantee zero.

Prompts are probabilistic. Users can phrase questions differently. Models can interpret instructions differently. Future model updates can behave differently. You cannot get a hard guarantee from a soft instruction.

The Eval Changed the Architecture

Once the eval demanded zero spoilers, the solution stopped being a prompt problem. It became a retrieval problem.

Instead of telling the model not to reveal future chapters, I prevented future chapters from entering the context at all:

WHERE chapter_ord <= @maxChapterOrd

Anything beyond the reader's progress never enters the retrieval set. The model can't leak information it never saw.

And the eval that checks it is just as blunt. A retrieved chunk past the reader's progress is a leak:

// One retrieved chunk past the reader's progress = one spoiler leak.
public static int LeakCount(IEnumerable<RetrievedChunk> retrieved, int gateChapterOrd) =>
    retrieved.Count(c => c.ChapterOrd > gateChapterOrd);

Across the adversarial test cases, that number has to be zero. That's the moment the idea really clicked for me: the eval didn't test the design. It produced the design.
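
A rough sketch of how such a check could be driven end to end. Everything except `LeakCount` is assumed here: the shape of `RetrievedChunk`, the in-memory `Retrieve` stand-in for the SQL gate, and the three gate positions are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical corpus: one chunk per chapter, chapters 1..30.
var corpus = Enumerable.Range(1, 30)
    .Select(ch => new RetrievedChunk(ch, $"chapter {ch} text"))
    .ToList();

// Gates worth probing: the first chapter, an early one, and the last one before the finale.
foreach (int gate in new[] { 1, 3, 29 })
{
    var retrieved = Retrieve(corpus, gate);
    int leaks = LeakCount(retrieved, gate);
    Console.WriteLine($"gate {gate}: {leaks} leaks");
    if (leaks != 0) throw new InvalidOperationException($"spoiler leak at gate {gate}");
}

// Stand-in for the real retrieval: applies the same condition as the WHERE clause.
static List<RetrievedChunk> Retrieve(IEnumerable<RetrievedChunk> corpus, int maxChapterOrd) =>
    corpus.Where(c => c.ChapterOrd <= maxChapterOrd).ToList();

// One retrieved chunk past the reader's progress = one spoiler leak.
static int LeakCount(IEnumerable<RetrievedChunk> retrieved, int gateChapterOrd) =>
    retrieved.Count(c => c.ChapterOrd > gateChapterOrd);

record RetrievedChunk(int ChapterOrd, string Text);
```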

A measurable failure condition forced a better architecture than I would have built if I had started with prompt engineering.

The Same Thing Happened to Retrieval Quality

The spoiler requirement wasn't the only eval. I also defined two other targets before building the feature:

Retrieval must surface the correct passage near the top of the results.
Answers must remain grounded in the passages they cite.

Because those requirements were measurable, every change received a verdict instead of an opinion.

A single semantic search wasn't clearing the bar. So I ended up combining two retrieval approaches:

vector search for semantic similarity
full-text search for exact names, phrases, and quotations

The results are fused using Reciprocal Rank Fusion — less mysterious than it sounds. Each chunk scores the sum of 1/(k+rank) across the lists it appears in, so anything ranked highly by both retrievers floats to the top:

// ranked highly by both vector AND lexical -> floats to the top.
scores[item] += 1.0 / (k + i + 1); // i is 0-based; RRF rank is 1-based

I didn't choose hybrid retrieval because it's fashionable. I chose it because it moved the number. The eval said the system wasn't good enough. The architecture changed until it was.

A Note on the Stack

None of this is a no-dependencies flex. The judge that scores grounding is a custom evaluator on Microsoft.Extensions.AI.Evaluation:

public sealed class RubricEvaluator(string id, Rubric rubric) : IEvaluator

I lean on the Microsoft stack on purpose. What I keep hand-rolled is the part that decides quality — the retrieval, the fusion, the spoiler gate. The line I draw isn't "no libraries." It's no agent framework hiding the parts that determine whether the thing actually works.

Eval-First Development

Traditional software development gives us confidence almost for free. Compilers. Type systems. Unit tests. Integration tests.

AI systems don't. The difficult part isn't implementing the feature. The difficult part is defining what "correct" means.

That's why I increasingly think of eval-first development as the AI equivalent of TDD. With traditional software, tests verify the implementation. With AI systems, evals often define the implementation.

Build the feature first and the eval later, and the eval can only grade what you've already built. Build the eval first and it starts shaping the system itself.

It defines done. It tells you when you've regressed. And sometimes it forces a better architecture than the one you originally had in mind.

Otherwise you're not shipping a feature. You're shipping a guess that happened to demo well.

Want to go deeper on evals? I've written a separate, more hands-on series on building production AI on .NET: what evals actually are, error analysis, golden datasets, LLM-as-judge, and evals in CI and production. This post was originally published on my blog.

AI Evals, Part 5: From a Number to a Gate: Evals in CI and Production

Vasyl — Wed, 17 Jun 2026 17:43:25 +0000

Part 5, the finale, of a series on building production AI on .NET. We've built the pieces — what evals are, error analysis, golden datasets, and a trustworthy judge. Now we make them earn their keep.

By now you can produce a defensible quality score for an AI feature. But a score you only look at is a vanity metric. The entire point of all that work is to make quality something your engineering process acts on automatically — the same way a failing unit test stops a bad commit. That means two homes for your evals: a gate before you ship, and monitoring after.

Home 1: CI — a safety net against regressions

Because TextStack's judge is a custom IEvaluator on Microsoft.Extensions.AI.Evaluation, an eval is just a dotnet test. The MEAI evaluator emits the rubric's axes plus an overall as numeric metrics, and a quality floor is expressed as a Pass/Fail interpretation on the overall:

// In the evaluator: the overall metric is interpreted Pass/Fail against a floor.
if (overallFloor is { } floor)
    overall.Interpretation = new EvaluationMetricInterpretation(
        RatingFor(score.Mean),
        failed: score.Mean < floor,
        reason: $"floor {floor:0.0} (mean {score.Mean:0.00})");

That catches gross breakage — "something is badly wrong." But the more valuable gate is relative: store a baseline score per feature, and fail the build when a change drops quality by more than a threshold versus that baseline. That turns "did this prompt change help?" into a red/green answer and makes improving a prompt a tight loop — change, run, compare, keep or revert. It's the AI equivalent of TDD.
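
The relative gate can be very small. The sketch below is one possible shape, not TextStack's implementation (which, as noted below, is not built yet); `RegressionGate`, its parameters, and the example scores are all hypothetical.

```csharp
// Hypothetical regression gate: compare the current mean against a stored baseline.
using System;

Console.WriteLine(RegressionGate.Check(baseline: 4.20, current: 4.15, maxDrop: 0.10)); // True
Console.WriteLine(RegressionGate.Check(baseline: 4.20, current: 4.00, maxDrop: 0.10)); // False

static class RegressionGate
{
    // Passes when the current mean has not fallen more than maxDrop below the baseline.
    public static bool Check(double baseline, double current, double maxDrop) =>
        baseline - current <= maxDrop;
}
```

A ratchet would also raise the stored baseline whenever a run beats it. The threshold should sit above the run-to-run noise of the judge, otherwise the gate fails on noise rather than on real regressions.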

Honest status from our codebase: the floor and on-demand runs exist today; the automatic baseline-versus-regression gate is the next step. I'm flagging that deliberately, because plenty of "we do eval-driven development" claims are really "we have a number nobody gates on." The hard 80% — the measuring instrument — is built; wiring the ratchet is the lighter remaining 20%.

The constraint CI forces: evals cost money

Every eval case is a real generation plus a real judge call. Running the full suite on every commit is slow and expensive, so evals have to be deliberate. TextStack's are opt-in: tagged so default CI skips them, and they self-skip when the provider isn't configured.

OPENAI_API_KEY=… dotnet test tests/TextStack.AiEvals --filter Category=Eval

Default CI stays green and free; the expensive truth runs on purpose. The pragmatic pattern: a small, cheap subset on pull requests for a fast signal, and the full suite nightly or pre-release. Treat eval spend like any cloud cost — budget it, don't let it run unbounded.

Home 2: Production — monitoring and guardrails

A curated golden set, however good, is a snapshot of inputs you imagined. Production sends inputs you didn't. So the offline gate is only half the system; the other half runs against live traffic.

This is where evals and observability become one thing. Every AI call in TextStack is tagged with its feature and recorded — cost, latency, tokens, errors — and runs persist to an eval_runs table surfaced on an internal /ai-quality dashboard (Traces and Evals tabs), with an admin "Run evals" button to trigger the suite on demand. Because the judge is the same component offline and online, you can sample real outputs per feature and score them with the identical rubric. Two modes fall out of that:

Background monitoring — sample a slice of live outputs, judge them, and watch the score over time to catch drift before users complain.
Guardrails — for high-stakes outputs, judge in the critical path and block, retry, or fall back when a result fails. (Use sparingly: it adds a judge call's worth of latency and cost to the request.)

The flywheel

Put the two homes together and you get a loop that compounds. Production surfaces a new failure mode → you do error analysis on it → it becomes a new golden case → your gate now defends against it → quality climbs → cleaner output produces cleaner traffic. Each turn makes the next regression harder to ship. That continuous-improvement flywheel — not any single dashboard — is the real product of an eval system.

The pitfalls

A number nobody gates on — if a bad score can't fail a build or page someone, it's decoration.
A fixed floor mistaken for a regression gate — a floor catches breakage, not a 2%-worse change. You want both.
Evals on every commit — the bill and the wait will kill the habit; subset on PRs, full suite nightly.
Offline-only — you'll ship regressions from inputs your golden set never imagined.
Guardrails everywhere — judging in the critical path is powerful but costs latency; reserve it for outputs that matter.
Online scores you never read — monitoring you don't look at is just a more expensive log.

The series, in one line each

That's the whole discipline, start to finish:

Evals are the test suite for non-deterministic code — graded judgement over a representative sample.
Error analysis comes first — read your failures and name them; the taxonomy decides what to measure.
The golden set is the ruler — representative, leak-free, fresh, and run through the real prompt and gateway.
The judge is a model too — defensive, dedicated, routed, and validated against humans with Cohen's κ.
A score must become a gate — CI to catch regressions before ship, monitoring to catch drift after.

None of it requires Python or a heavyweight platform. On .NET it's an ILlmService seam, a golden dataset in JSON, a custom IEvaluator on Microsoft.Extensions.AI.Evaluation, and an opt-in test category — built on a real product, in production. Done right, evals turn "I think this AI feature is fine" into "I can prove it, and I'll know the moment it stops being true." That's the difference between shipping AI and gambling with it.

TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.

AI Evals, Part 4: LLM-as-Judge, Done Right

Vasyl — Wed, 17 Jun 2026 17:28:22 +0000

Part 4 of a series on building production AI on .NET. We've covered what evals are, error analysis, and golden datasets. Now: how do you turn a paragraph into a number you can trust?

You have a golden dataset and your feature's real output for each case. Now you need a score. But you can't assert == two paragraphs — there's no single right answer, and exact-match comparison is meaningless for prose. String-similarity metrics (BLEU, ROUGE) don't help either; they reward overlapping words, not correct meaning.

The pragmatic answer the field has converged on is LLM-as-judge: use a second, capable model to read the reference and the actual output and score it against a rubric. It's powerful, it scales, and — handled carelessly — it will hand you confident, biased numbers that feel rigorous and aren't. This post is about doing it right.

The basic shape

A judge takes the rubric and an evidence block (the inputs, the reference answer, and the model's actual output), and returns a structured verdict. In TextStack the judge is one feature-agnostic component built on Microsoft.Extensions.AI.Evaluation — Microsoft's official .NET evaluation library — implemented as a custom IEvaluator. The core is a single judge call asking for strict JSON:

var system =
    "You are a strict, fair evaluator of an AI feature's output. " +
    "Score each of three dimensions on an integer scale 1-5 (5 = excellent, 1 = poor):\n" +
    $"- d1 = {rubric.Dim1}\n- d2 = {rubric.Dim2}\n- d3 = {rubric.Dim3}\n" +
    "Return ONLY strict JSON: {\"d1\": int, \"d2\": int, \"d3\": int, \"rationale\": \"...\"}";

The rubric is a parameter, not a hardcode — three named axes passed in per feature. That's what lets one judge score Explain, Translate, distractors, and book metadata, each on the dimensions its own error analysis surfaced (Explain → accuracy / conciseness / usefulness; Translate → accuracy / fluency / register; and so on). One judge, many rubrics.

Three things that separate a toy judge from a production one

Parse defensively. Judges wrap their JSON in prose or code fences no matter how firmly you forbid it. Don't trust the whole string; take the span from the first { to the last }:

var start = raw.IndexOf('{');
var end = raw.LastIndexOf('}');
if (start < 0 || end <= start)
    return new JudgeScore(0, 0, 0, "unparseable: no JSON object");

Fail to a number, not an exception. An unparseable or failed judge call returns a zero score with the reason attached, which drags the run's mean down instead of crashing it. A judge that silently throws is worse than one that scores zero — the zero is a visible signal you can investigate.
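
Putting the two rules together, a fuller version of the parse step might look like the sketch below. The property names on `JudgeScore` and the range check are assumptions; only the four-argument constructor and the "unparseable" convention come from the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

Console.WriteLine(Parse("Verdict: {\"d1\":5,\"d2\":4,\"d3\":3,\"rationale\":\"ok\"} Hope that helps!").D1); // 5
Console.WriteLine(Parse("I cannot score this.").Rationale); // unparseable: no JSON object

static JudgeScore Parse(string raw)
{
    var start = raw.IndexOf('{');
    var end = raw.LastIndexOf('}');
    if (start < 0 || end <= start)
        return new JudgeScore(0, 0, 0, "unparseable: no JSON object");

    try
    {
        using var doc = JsonDocument.Parse(raw[start..(end + 1)]);
        var root = doc.RootElement;
        var score = new JudgeScore(
            root.GetProperty("d1").GetInt32(),
            root.GetProperty("d2").GetInt32(),
            root.GetProperty("d3").GetInt32(),
            root.TryGetProperty("rationale", out var r) ? r.GetString() ?? "" : "");

        // Scores outside the 1-5 scale are treated like any other bad verdict.
        return score is { D1: >= 1 and <= 5, D2: >= 1 and <= 5, D3: >= 1 and <= 5 }
            ? score
            : new JudgeScore(0, 0, 0, "unparseable: score out of range");
    }
    catch (Exception ex) when (ex is JsonException or KeyNotFoundException
                                   or InvalidOperationException or FormatException)
    {
        // Malformed JSON, a missing axis, or a non-integer score: fail to zero, keep the reason.
        return new JudgeScore(0, 0, 0, $"unparseable: {ex.GetType().Name}");
    }
}

record JudgeScore(int D1, int D2, int D3, string Rationale);
```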

Use a dedicated, stronger judge — and route it like everything else. The model that judges should be more capable than the models that generate. TextStack generates features on small, cheap models but judges with a gpt-4.1-class model. And the judge call carries the same eval.judge feature tag and flows through the same gateway as production traffic, so it's traced and cost-accounted like any other call. Evaluating is itself an AI feature; treat it like one.

The biases that quietly wreck your judge

This is the part that separates people who use an LLM judge from people who can trust one. A judge is a language model, and it brings model-shaped biases to grading. Ignore them and your scores are precise and wrong.

Position bias. In pairwise comparisons ("is A or B better?"), judges favour whichever answer appears first (sometimes second) regardless of content. Mitigation: run each comparison both ways and average, or randomise order and watch the swap rate.
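
A sketch of the "run it both ways" check, using fake judges so it runs without a model. The `Compare` helper, the `Verdict` enum, and the delegate shape (true means "prefers the answer shown first") are all hypothetical.

```csharp
using System;

// Fake judge with pure position bias: always prefers whatever is shown first.
Func<string, string, bool> biased = (first, second) => true;
// Fake judge that looks at content (here: the shorter answer wins).
Func<string, string, bool> contentBased = (first, second) => first.Length <= second.Length;

Console.WriteLine(Compare(biased, "short", "a much longer answer"));       // Inconsistent
Console.WriteLine(Compare(contentBased, "short", "a much longer answer")); // AWins

// judge(first, second) returns true when it prefers the answer shown first.
static Verdict Compare(Func<string, string, bool> judge, string a, string b)
{
    bool aWinsShownFirst = judge(a, b);
    bool aWinsShownSecond = !judge(b, a);
    if (aWinsShownFirst != aWinsShownSecond) return Verdict.Inconsistent;
    return aWinsShownFirst ? Verdict.AWins : Verdict.BWins;
}

enum Verdict { AWins, BWins, Inconsistent }
```

The share of `Inconsistent` verdicts across a dataset is the swap rate: the higher it is, the more the judge is reacting to position rather than content.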

Verbosity bias. Judges reliably prefer longer, more elaborate answers even when the extra words add nothing — actively harmful for a feature like Explain whose rubric demands conciseness. Mitigation: name length explicitly in the rubric and watch for score creeping up with token count.

Self-preference bias. A judge scores text from its own model family higher. I'll be concrete about where TextStack sits here: features generated on a local model (distractors, book metadata) are judged cross-family by OpenAI — good, that's independent. But Explain and Translate are generated and judged within the OpenAI family (different sizes — gpt-4.1-nano to generate, gpt-4.1 to judge — but the same lineage), so some self-preference is still in play. The honest read: the absolute number is treated as soft; the deltas between runs are what we trust. A fully independent second judge is on the roadmap.

Sycophancy and scale compression. Judges drift toward agreeable, middling scores, clustering around 3–4 on a 1–5 scale and flattening your signal. Mitigation: anchor each dimension with a concrete description (not just a one-word label), always give the judge the reference answer as a yardstick, and consider a coarser scale if the judge can't use the full range reliably.

Your judge needs its own eval

Here's the step almost everyone skips: validate the judge against humans. You wouldn't ship a feature on an unvalidated model, and a judge is a model — so prove it agrees with human judgement before you trust its scores.

Hand-label a sample of outputs yourself, then measure agreement between you and the judge. The right metric is inter-rater agreement: Cohen's κ (kappa), which corrects for the agreement you'd get by chance. Raw percent-agreement flatters you when scores cluster. As a rule of thumb, κ of about 0.6 or higher against human labels is usable; near zero means it's rolling dice and your whole pipeline is theatre. Re-check it whenever you change the judge model or the rubric.
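
Cohen's κ is (observed agreement − chance agreement) / (1 − chance agreement), and it is short enough to compute by hand. The sketch below assumes labels are plain integers (pass/fail here, but 1-5 scores treated as categories work the same way); the `CohensKappa` helper and the 50-case sample are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Linq;

// 50 cases: 20 both pass, 5 human-only pass, 10 judge-only pass, 15 both fail.
int[] human = [.. Enumerable.Repeat(1, 25), .. Enumerable.Repeat(0, 25)];
int[] judge = [.. Enumerable.Repeat(1, 20), .. Enumerable.Repeat(0, 5),
               .. Enumerable.Repeat(1, 10), .. Enumerable.Repeat(0, 15)];

// Raw agreement is 70%, but half of that is expected by chance.
Console.WriteLine(CohensKappa(human, judge).ToString("0.00", CultureInfo.InvariantCulture)); // 0.40

static double CohensKappa(int[] a, int[] b)
{
    if (a.Length != b.Length || a.Length == 0)
        throw new ArgumentException("Need two non-empty label arrays of equal length.");

    double n = a.Length;
    // Observed agreement: share of cases where both raters gave the same label.
    double observed = a.Zip(b).Count(p => p.First == p.Second) / n;
    // Chance agreement: for each label, the product of the two raters' marginal rates.
    double expected = a.Concat(b).Distinct()
        .Sum(label => (a.Count(x => x == label) / n) * (b.Count(x => x == label) / n));

    // Undefined when both raters only ever use one and the same label.
    return expected == 1.0 ? double.NaN : (observed - expected) / (1 - expected);
}
```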

There's a design subtlety worth applying here: treat the judge prompt itself as something you iterate on against a labelled split. Tune the judge prompt on one slice of human-labelled cases, validate κ on a held-out slice — exactly the train/test discipline from the last post, applied one level up. The judge is software; it deserves the same rigour as the feature it grades.

This closes a loop people miss. The golden set evaluates the feature; a human-labelled slice evaluates the judge. Skip the second and you've just moved your trust problem one level up and hidden it from yourself.

The pitfalls

Trusting an unvalidated judge — measure κ against human labels or it's theatre.
Same model generating and judging — self-preference inflates the score; prefer a different (ideally cross-family) judge.
A weak judge model — the judge should be more capable than the generator, not the same one.
Ignoring position/verbosity bias — randomise order, penalise padding, anchor the rubric.
One-word rubric axes — "accuracy" alone means different things to the model each run; describe it concretely.
Throwing on a bad verdict — score it zero and surface it; don't let one parse failure kill the run.

The takeaway

LLM-as-judge is the only practical way to score prose at scale, but a judge is a model with a model's biases — so build it like production code (defensive parsing, a dedicated stronger model, routed and traced) and validate it like a model (human labels, Cohen's κ, a tuned-and-tested judge prompt). Do that and your scores mean something. Skip it and you've automated the production of confident nonsense.

Next, and last in the series: from a number to a gate — wiring evals into CI and online monitoring so quality regressions turn the build red, on Microsoft.Extensions.AI.Evaluation, without bankrupting your pipeline.

AI Evals, Part 3: Golden Datasets That Don't Lie

Vasyl — Tue, 16 Jun 2026 21:28:24 +0000

Part 3 of a series on building production AI on .NET. Part 1 was the overview; Part 2 was error analysis. Now we turn the failure taxonomy you built into something you can measure against — without quietly fooling yourself.

A golden dataset is a set of representative inputs, each paired with a reference answer a knowledgeable human would accept. It's the ruler you hold every model output against. And it is, in my experience, the single most important and most neglected asset in an eval pipeline — because a sloppy ruler doesn't announce itself. Your scores still come out green. They're just measuring the wrong thing.

This post is about building a golden set that tells the truth.

What it looks like in practice

In TextStack, each AI feature has ~30 hand-curated cases stored as plain JSON, loaded at runtime into a typed record that mirrors exactly what the production endpoint receives:

public record ExplainGolden(
    string Word,
    string Sentence,
    string? Genre,
    string TargetLang,
    string ExpectedExplanation);

Plain JSON on disk, deserialised case-insensitively. No database, no platform lock-in — the dataset is a checked-in artifact you can diff in code review:

var goldens = GoldenData.Load<ExplainGolden>("explain.json");
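
The post does not show `GoldenData` itself, so the following is only a guess at what such a loader could look like with `System.Text.Json`. The `Goldens` folder, the `Parse` overload, and the sample case are assumptions. It also assumes a regular project where reflection-based serialization is enabled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

var json = """[{"word":"idempotent","sentence":"PUT is idempotent.","genre":null,"targetLang":"en","expectedExplanation":"Repeating the request leaves the same result."}]""";

var goldens = GoldenData.Parse<ExplainGolden>(json);
Console.WriteLine(goldens[0].Word); // idempotent

static class GoldenData
{
    // camelCase keys in the file bind to PascalCase record parameters.
    private static readonly JsonSerializerOptions Options = new() { PropertyNameCaseInsensitive = true };

    // Hypothetical location: a Goldens folder copied next to the test binaries.
    public static IReadOnlyList<T> Load<T>(string fileName) =>
        Parse<T>(File.ReadAllText(Path.Combine(AppContext.BaseDirectory, "Goldens", fileName)));

    public static IReadOnlyList<T> Parse<T>(string json) =>
        JsonSerializer.Deserialize<List<T>>(json, Options)
        ?? throw new InvalidDataException("Golden file is empty or null.");
}

public record ExplainGolden(
    string Word,
    string Sentence,
    string? Genre,
    string TargetLang,
    string ExpectedExplanation);
```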

The format is the easy part. The honesty is in four properties of the content.

1. Representativeness — mirror reality, not the demo

Your set should reflect the real distribution of inputs your feature meets in production, including the hard, weird, and adversarial cases. This is where Part 2 pays off: the failure taxonomy tells you which kinds of input break things, so you deliberately stock the set with them.

The opposite — a set of only easy, happy-path cases — is the most common way an eval lies. The model aces them, your average climbs, and meanwhile the inputs that actually matter never get measured. Stratify on purpose: domains, lengths, languages, edge cases. For TextStack's Explain set that means technical passages and casual prose, common words and rare ones, several target languages — not thirty variations of the same easy lookup.

2. Reference quality — the ceiling you measure against

The reference answer defines what "good" means for that case, so a lazy reference caps the meaning of your whole score. If the reference for explaining idempotent is a paraphrased dictionary entry, your judge will happily reward dictionary entries — the exact failure mode you were trying to eliminate.

References should be written or vetted by someone who understands the domain. For Explain, that means genuinely good in-context explanations: what the word means here, in this sentence, the way you'd want it explained to you. The reference is the bar; set it where you actually want the product.

3. Leakage — keep a real train/test split

Here's the subtle statistical sin. If you tune your prompt against the same cases you score against, you're overfitting to the test, and your number is fiction — you've optimised for those thirty examples, not for the feature. It's the prompt-engineering version of training on your test set.

Keep a slice you never look at while iterating. Tune on one part; report on the held-out part. This feels heavy for thirty cases, but the discipline is what keeps the score meaningful as you iterate. The split is just as real for prompts as it is for model weights.
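One way to keep the split honest is to make it a pure function of the case, so nobody chooses by hand which cases are held out. This is a minimal sketch, assuming each case has a stable key (for Explain, perhaps word plus sentence). GoldenSplit, IsHeldOut, and the 30% default are hypothetical names and values, not part of TextStack.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

var keys = new[] { "idempotent", "latency", "quorum", "shard", "mutex", "backpressure" };
foreach (var key in keys)
    Console.WriteLine($"{key}: {(GoldenSplit.IsHeldOut(key) ? "held-out" : "tuning")}");

public static class GoldenSplit
{
    // Stable across runs and machines: string.GetHashCode() is randomized per process,
    // so the bucket comes from a SHA-256 digest read with a fixed byte order instead.
    public static bool IsHeldOut(string caseKey, int holdoutPercent = 30)
    {
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(caseKey));
        uint bucket = BinaryPrimitives.ReadUInt32LittleEndian(digest) % 100;
        return bucket < (uint)holdoutPercent;
    }
}
```

With only thirty cases the held-out share will not land on exactly 30%. What matters is that a case never moves between the two sides while you iterate on the prompt.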

4. Size and freshness — a floor, and a living asset

Thirty cases is a deliberate floor, not a target: enough to catch gross regressions cheaply, small enough to run often and to keep every reference high quality. (It's statistically thin for detecting small changes — that's the next post's problem.) More important than size is that the set is alive: every new failure mode you find in production should earn a new case. A golden set that never changes slowly stops resembling reality, and a stale ruler is a lying ruler.

When you genuinely lack real examples — a brand-new feature with no traffic — you can bootstrap with synthetic cases (have a strong model generate realistic inputs across your taxonomy's dimensions). It's a legitimate starting point, but treat it as scaffolding: replace synthetic cases with real ones as traffic arrives, because real users are more creative than any generator.

The silent killer: dataset drift from production

Now the trap that quietly invalidates an otherwise perfect golden set, and the one I'd most want a reviewer to check for.

You write your feature's prompt in the API endpoint. You write the eval, and — naturally — you write the prompt again in the test. Two copies. Someone tweaks the production prompt for a hotfix and doesn't touch the test copy. From that moment your eval measures a prompt that no longer exists in production. The score stays green; the product changed underneath it. Nobody notices, because the test reports with total confidence.

The fix is structural, not disciplinary: extract the prompt into one builder that both production and the eval call. There is no second copy to drift.

// Built once, called by BOTH the endpoint and the eval — they cannot disagree.
public static class ExplainPrompt
{
    public static string BuildSystemPrompt(string? genre, string targetLang) => /* ... */;
    public static string BuildUserPrompt(string word, string sentence) => /* ... */;
}

The eval's case-to-request mapping wires that shared builder straight in, and crucially the request goes through the same model gateway production uses, selected by the feature's tag:

private static LlmRequest ToRequest(ExplainGolden g) => new(
    SystemPrompt: ExplainPrompt.BuildSystemPrompt(g.Genre, g.TargetLang),
    Messages: [new LlmMessage("user", ExplainPrompt.BuildUserPrompt(g.Word, g.Sentence))],
    MaxOutputTokens: 500,
    FeatureTag: "explain"); // same routing, same model, same path as prod

If you remember one thing from this post: an eval that runs a copy of the prompt is worse than no eval, because it manufactures false confidence. Same prompt, same gateway, same path — or you're measuring a ghost.

The pitfalls

A happy-path-only set — the score rises while the product falls. Stock it from your failure taxonomy.
Weak reference answers — they cap your score's meaning and can reward the very failure you're chasing.
Train/test leakage — tuning and scoring on the same cases overfits to fiction.
A frozen set — inputs drift; a dataset that never grows slowly measures a product that no longer exists.
Synthetic-forever — fine to bootstrap, dangerous to rely on; real traffic is weirder.
A duplicated prompt — the drift trap. One shared builder, through the real gateway.

The takeaway

A golden dataset is not a formality you generate once and forget. It's a carefully curated, honestly-split, continuously-refreshed ruler — and it has to run the real prompt through the real path or it measures nothing. Get the dataset right and every downstream number means something. Get it wrong and you've built an instrument that lies to you in green.

Next in the series: LLM-as-judge, done right — how to turn a paragraph into a trustworthy number, the biases that wreck judges, and why your judge needs its own eval.

AI Evals, Part 2: Error Analysis, the Unglamorous Superpower Behind Good Evals

Vasyl — Fri, 12 Jun 2026 22:46:23 +0000

Part 2 of a series on building production AI on .NET. Part 1 covered what evals are and the Analyze → Measure → Improve lifecycle. This post is about the step everyone wants to skip: Analyze.

When a team decides to "take evals seriously," the first thing they usually do is wrong. They open a dashboard tool, wire up a generic "correctness" score, and watch a number. It feels productive. It produces a chart. And it tells them almost nothing, because they skipped the step that decides what the chart should even measure.

That step is error analysis: reading your AI's actual outputs and naming, precisely, the ways they go wrong. It's unglamorous — no library, no dashboard, just you and a few dozen real examples. It is also, by a wide margin, the highest-leverage thing you will do in evals: error analysis is where the signal comes from. Everything downstream is just operationalising what you find here.

Why you can't skip straight to metrics

There's a gap between you and your running system that's easy to underestimate. Thousands of inputs flow through your AI feature daily, in shapes you never anticipated, and you have no realistic way to see them at scale. Call it the comprehension gap — the distance between the developer and a true understanding of what the data and the model are actually doing.

Metrics don't bridge that gap; they presuppose it's already bridged. To measure "conciseness" you must first have noticed that verbosity is a failure mode worth caring about. If you pick your metrics before you've read your data, you're measuring your assumptions, not your product. The classic result: a dashboard glowing green while users quietly churn over a problem your metrics were never designed to catch.

Error analysis is how you cross that gap. You trade scale for truth: you can't read everything, so you read a sample, carefully.

How error analysis actually works

It's a three-move loop, and the moves are deliberately low-tech.

1. Get a starting dataset and read it. Pull a sample of real (or realistic) outputs — 50 to 100 is plenty to start. Not the happy-path demo cases; the real distribution, including the weird inputs. Then actually read them. Slowly.

2. Open-code the failures. For each output that's wrong, write a short, free-text note describing what specifically is wrong — in your own words, no fixed categories yet. "Explained the word using a dictionary definition instead of the meaning it has in this sentence." "Translation is correct but the tone is far too formal for a casual chat." "The quiz distractor is so obviously wrong it gives the answer away." This is open coding: you're labelling reality, not forcing it into boxes.

3. Cluster the notes into a taxonomy. Once you have 40–50 notes, patterns emerge. Group them. Those groups are your failure taxonomy — a ranked list of how your feature fails, with rough frequencies. Now you know what to fix first (the common, severe modes) and, crucially, what your metrics should measure.

That's the whole secret. The taxonomy is the output, and it's worth more than any single score, because every later step — the rubric, the golden set, the judge — is downstream of it.
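The clustering is done by a human, but once each note carries a category, ranking them is mechanical. This is a minimal sketch with made-up notes in the spirit of the examples above; CodedNote and Taxonomy are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var notes = new List<CodedNote>
{
    new("Dictionary definition, ignores the sentence", "ignored sentence context"),
    new("Generic gloss, wrong sense for this passage", "ignored sentence context"),
    new("Correct but far too formal for a casual chat", "wrong register"),
    new("Distractor gives the answer away", "implausible distractor"),
    new("Explains the everyday meaning, not the technical one", "ignored sentence context"),
};

foreach (var (category, count, share) in Taxonomy.Rank(notes))
    Console.WriteLine($"{category}: {count} ({share:P0})");

public record CodedNote(string Note, string Category);

public static class Taxonomy
{
    // Groups open-coded notes by the category assigned during clustering and returns
    // them most frequent first, with each category's share of all notes.
    // Ties are broken by name so the ranking is the same on every run.
    public static List<(string Category, int Count, double Share)> Rank(IReadOnlyCollection<CodedNote> notes) =>
        notes.GroupBy(n => n.Category)
             .Select(g => (Category: g.Key, Count: g.Count(), Share: (double)g.Count() / notes.Count))
             .OrderByDescending(x => x.Count)
             .ThenBy(x => x.Category, StringComparer.Ordinal)
             .ToList();
}
```

Frequency is only half of the ranking. Severity still needs a human call, so treat the output as a starting order, not a verdict.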

A mindset note: be a detective, not a judge (yet)

The hard part of error analysis isn't mechanical, it's psychological. You will be tempted to immediately assign a 1–5 score, or to jump to "the fix is to add a line to the prompt." Resist both. Scoring too early collapses rich information ("it's a 2") into a number that hides why. Fixing too early means you patch the first failure you see instead of the most common one.

Stay descriptive for as long as you can. Your only job in this phase is to understand and categorise. Judgement and repair come later.

A second trap is doing it alone. When two people label the same outputs, they disagree — and the disagreements are gold, because they reveal that "good" isn't actually defined yet. A short alignment session to resolve them sharpens your definition of quality before you bake it into a rubric. (Solo founders can approximate this by labelling, sleeping on it, and re-labelling cold.)
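If two people do label the same outputs, the disagreements are easy to surface in code. This sketch also computes Cohen's kappa, a standard chance-corrected agreement score. It is not something this post prescribes, and the labels below are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var first  = new[] { "context", "context", "register", "register" };
var second = new[] { "context", "register", "register", "register" };

Console.WriteLine($"kappa = {Agreement.CohensKappa(first, second):F2}");
foreach (int i in Agreement.Disagreements(first, second))
    Console.WriteLine($"output #{i}: {first[i]} vs {second[i]}");

public static class Agreement
{
    // Indexes of the outputs the two labellers categorised differently.
    public static IEnumerable<int> Disagreements(IReadOnlyList<string> a, IReadOnlyList<string> b) =>
        Enumerable.Range(0, Math.Min(a.Count, b.Count)).Where(i => a[i] != b[i]);

    // kappa = (observed - expected) / (1 - expected), where "expected" is the agreement
    // two labellers would reach by chance given how often each one uses each label.
    public static double CohensKappa(IReadOnlyList<string> a, IReadOnlyList<string> b)
    {
        if (a.Count != b.Count || a.Count == 0)
            throw new ArgumentException("Both labellers must label the same non-empty set.");

        double n = a.Count;
        double observed = a.Zip(b, (x, y) => x == y ? 1.0 : 0.0).Sum() / n;
        double expected = a.Concat(b).Distinct()
            .Sum(label => (a.Count(x => x == label) / n) * (b.Count(y => y == label) / n));

        // Both labellers used one identical label everywhere: treat as full agreement.
        return expected == 1.0 ? 1.0 : (observed - expected) / (1.0 - expected);
    }
}
```

The number is a summary. The list of disagreeing outputs is what you bring to the alignment session.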

How error analysis shaped TextStack's evals

This isn't abstract for us. TextStack has seven AI surfaces, and every rubric we score against came directly out of reading failures, not out of a generic template.

Take Explain (tap a word, get a short in-context explanation). Reading real outputs surfaced a recurring failure: the model would produce a competent dictionary definition while ignoring the sentence the reader was actually looking at — useless for someone trying to understand this passage. That single observation is why the Explain rubric scores accuracy in context and usefulness to a learner as distinct axes, and explicitly penalises dictionary boilerplate under conciseness. The rubric is a direct transcription of the taxonomy.

Other surfaces produced different taxonomies, and therefore different axes:

Translate kept failing on register — accurate but wrong formality — so register became its own scored dimension alongside accuracy and fluency.
Vocabulary distractors (wrong answers in a quiz) failed by being implausible (too obviously wrong) or too similar to the right answer, so the rubric scores plausibility, distinctness, and difficulty.

We didn't invent those dimensions in a meeting. We read outputs until the dimensions were obvious. And because every AI call is traced and viewable on an internal /ai-quality page, error analysis isn't a one-time exercise — new production failures keep feeding new categories back into the taxonomy.

The pitfalls

Scoring before describing. A number erases the why. Open-code in words first.
Vague categories. "Bad output" isn't a category; "ignored the sentence context" is. Specific enough to act on.
Too small a sample, or only the easy cases. If you only read successes, you'll conclude everything is fine.
Fixing during analysis. Note the failure, move on. Triage after you can see the whole picture.
Labelling solo with no calibration. Disagreement is information; surface it before it hardens into a bad rubric.
Doing it once. Inputs drift. The taxonomy is a living document, refreshed from real traffic.

The takeaway

Error analysis is the part of evals with no tooling, no dashboard, and the highest payoff — and that's exactly why it gets skipped. Read your failures, name them in plain language, and cluster them into a taxonomy. That taxonomy tells you what to fix and what to measure. Skip it and you'll build a beautiful measurement system pointed at the wrong target.

Next in the series: golden datasets that don't lie — turning your taxonomy into a curated set of cases you can score against, without quietly fooling yourself.

AI Evals, Explained: How We Actually Know Our AI Is Any Good

Vasyl — Wed, 10 Jun 2026 15:10:15 +0000

Part 1 of a series on building production AI on .NET — drawn from TextStack, a reader with seven shipping AI features.

You can build an AI feature in an afternoon. Wiring up an API call and a prompt is genuinely easy now. The hard part — the part that separates a demo from a product — is answering one deceptively simple question:

Is it any good? And did my last change make it better or worse?

For normal code, that question has a normal answer: a test suite. Add(2, 2) should return 4; if it doesn't, the build goes red. But an AI feature doesn't return 4. Ask it to explain a word and it returns a paragraph — a slightly different paragraph every single time, and "correct" is a whole range of good answers, not one. You cannot write Assert.Equal against prose. The thing software engineering relies on most — a fast, automatic signal that something broke — is gone.

Evals are how you get that signal back. This post is a plain-English introduction to what they are and how we actually run them in production. No hype, no notebooks — just the mental model and a real implementation.

So what is an eval?

Strip away the jargon and an eval is just a systematic way to measure the quality of an AI output. Where a unit test gives you pass/fail by exact match, an eval gives you a graded judgement over a representative sample of inputs. Instead of "is this exactly right?" it asks "across 30 realistic cases, how good is this, on the axes I care about?"

That measurement gets used in three different places, and it helps to keep them separate:

As monitoring — you score a sample of real traffic over time, to catch quality silently drifting downward.
As a guardrail — you score an output before the user sees it, and block or retry if it fails.
As a ruler for improvement — you score before and after a change, so "did this prompt edit help?" finally has an answer.

Most teams want the third one first and never build it. That's the gap this series is about.

The lifecycle: Analyze → Measure → Improve

The most useful framing I've found is to treat evaluation as a loop of Analyze, Measure, Improve. It's worth internalising because it stops you from doing the steps in the wrong order — which is the single most common mistake.

1. Analyze — look at your failures before you measure anything.
The instinct is to jump straight to a metrics dashboard. Resist it. The highest-leverage activity in all of evals is boring: take 50–100 real outputs, read them, and label how each one is wrong. Not a score — a category. "Restated the dictionary definition instead of using the sentence's context." "Translation was accurate but too formal." You cluster these into a failure taxonomy, and that's what tells you which dimensions are even worth measuring. Skip this and you'll confidently measure the wrong things while users churn.

2. Measure — turn those failure modes into a repeatable number.
This is where the golden dataset and the LLM judge come in (the next two posts go deep on each). In short: you assemble a set of representative inputs with reference answers, run your feature over them, and have a second, stronger model score each output against a rubric built from your taxonomy.

3. Improve — change something, re-run, and trust the delta.
Now you can edit a prompt, swap a model, or restructure a pipeline, run the eval, and see whether quality moved. When you wire that comparison into CI, a quality regression turns the build red — the same safety net you have for ordinary code, finally extended to the non-deterministic part.

It's a flywheel: production traffic reveals new failure modes → you analyze them → they become new measured cases → improvements get gated → better output produces cleaner traffic. Round and round.

How we run evals at TextStack

Theory is cheap, so here's the concrete version. TextStack is an ASP.NET Core reading app with seven AI surfaces: Explain (a word in context), Translate, vocabulary quiz distractors, book metadata, an audio podcast, and more. One rule sits above all of them:

Every AI feature ships with its own eval suite from day one. Eval is part of the pull request, not a follow-up.

Concretely, for each feature there's:

A golden dataset. ~30 hand-curated cases per feature, stored as plain JSON, each pairing a realistic input with a reference answer a human would accept.

Generation through the real path. The eval runs each case through the same code production uses — the same prompt, the same model gateway — so the test can never quietly drift away from what users actually get. (That drift is a classic, silent way to make an eval lie; more on it in the golden-dataset post.)

A dedicated judge. A second, stronger model (we use a gpt-4.1-class model, deliberately separate from the small, cheap models that generate the features) scores each output 1–5 on a short, feature-specific rubric — for Explain that's accuracy / conciseness / usefulness.

The judge runs on Microsoft.Extensions.AI.Evaluation — Microsoft's official, open-source evaluation library for .NET. This is a deliberate choice: most of the eval ecosystem assumes you're in Python (Braintrust, Phoenix, LangSmith), but a .NET shop doesn't have to leave the platform to do this properly. Our judge is implemented as a custom IEvaluator, so it slots into the same harness as Microsoft's built-in evaluators and runs as an ordinary dotnet test. The whole pipeline is plain C# — no Python bridge, no LangChain. The library is young and moving fast, which also makes it one of the more approachable corners of the .NET AI stack to contribute back to.

// A custom IEvaluator on Microsoft.Extensions.AI.Evaluation.
// One judge, many features: the rubric is a parameter, not hardcoded.
public sealed record Rubric(string Dim1, string Dim2, string Dim3);

var explain = new Rubric(
    "accuracy: matches the meaning the word carries in THIS sentence",
    "conciseness: 2-3 sentences, no dictionary boilerplate",
    "usefulness: would a learner find it genuinely helpful");

Persistence and a dashboard. Every run is stored, and an internal /ai-quality page shows scores and traces per feature, so quality is something we can actually watch over time — not a number that scrolls past in a CI log.
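To make "the rubric is a parameter" concrete, here is a sketch of how a rubric could be rendered into judge instructions and how the reply could be parsed defensively. The prompt wording, the comma-separated reply format, and the JudgePrompt helper are assumptions for illustration. They are not TextStack's judge and not part of Microsoft.Extensions.AI.Evaluation.

```csharp
using System;

var explain = new Rubric(
    "accuracy: matches the meaning the word carries in THIS sentence",
    "conciseness: 2-3 sentences, no dictionary boilerplate",
    "usefulness: would a learner find it genuinely helpful");

Console.WriteLine(JudgePrompt.Build(explain));
Console.WriteLine(JudgePrompt.TryParseScores("4, 5, 3", out var parsed) ? string.Join("/", parsed) : "unparseable");
Console.WriteLine(JudgePrompt.TryParseScores("Looks great!", out _) ? "parsed" : "unparseable");

public sealed record Rubric(string Dim1, string Dim2, string Dim3);

public static class JudgePrompt
{
    // Renders any feature's rubric into the same judge instructions.
    public static string Build(Rubric r) =>
        "Score the candidate answer against the reference on each dimension, 1 (poor) to 5 (excellent).\n" +
        $"1. {r.Dim1}\n2. {r.Dim2}\n3. {r.Dim3}\n" +
        "Reply with three integers separated by commas and nothing else.";

    // Accepts only exactly three integers in the range 1..5. Anything else is reported
    // as a failure so the caller can retry or flag the case instead of guessing a score.
    public static bool TryParseScores(string reply, out int[] scores)
    {
        scores = Array.Empty<int>();
        var parts = reply.Split(',', StringSplitOptions.TrimEntries | StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 3) return false;

        var result = new int[3];
        for (int i = 0; i < 3; i++)
        {
            if (!int.TryParse(parts[i], out result[i]) || result[i] is < 1 or > 5) return false;
        }
        scores = result;
        return true;
    }
}
```

A judge reply that fails to parse should count as a failed evaluation of that case, never as a silent default score.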

The honest status: we can run the full suite on demand and gate individual features against a quality floor; turning that into an automatic "fail the build if we regress more than X% versus last week" ratchet is the next step. The measuring instrument is built — and building the instrument is the hard 80%.
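Gating a feature against a quality floor can be as small as the sketch below. The scores and the floor of 4.0 are made up, and QualityGate is a hypothetical helper, not the actual TextStack gate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var scores = new[] { 4.0, 5.0, 3.0, 4.0, 5.0, 4.0 }; // made-up judge scores for one feature
var verdict = QualityGate.Check("explain", scores, floor: 4.0);
Console.WriteLine(verdict.Message);
Environment.ExitCode = verdict.Passed ? 0 : 1; // a non-zero exit code is what turns CI red

public record GateResult(bool Passed, double Mean, string Message);

public static class QualityGate
{
    public static GateResult Check(string feature, IReadOnlyCollection<double> scores, double floor)
    {
        // No scores means the eval did not really run: fail closed rather than pass silently.
        if (scores.Count == 0)
            return new(false, double.NaN, $"{feature}: no scores, failing closed");

        double mean = scores.Average();
        bool passed = mean >= floor;
        return new(passed, mean, $"{feature}: mean {mean:F2} vs floor {floor:F2} -> {(passed ? "pass" : "FAIL")}");
    }
}
```

A fixed floor is the simple version. The week-over-week ratchet described above would compare against a stored baseline instead of a constant.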

The traps (so you don't learn them the expensive way)

A quick preview of what the rest of the series unpacks, because these are where eval setups quietly break:

Metrics before error analysis — you measure what was easy to imagine, not what actually fails.
An easy golden set — the score goes up while the product goes down.
A judge you never validated — an LLM grading prose is itself a model; if it doesn't agree with human judgement, your whole pipeline is theatre.
Judge bias — judges quietly prefer longer answers, the first option shown, and text from their own model family.
Shipping on noise — with 30 cases, a 0.1 bump in the average is probably random, not progress.

Each of those is a post of its own.
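To see why a 0.1 bump is hard to trust at this sample size, it helps to compute the standard error of the mean. The thirty scores below are made up (ten each of 3, 4 and 5); a real run will have a different spread, so read this as a sketch of the arithmetic only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up run: thirty judge scores on the 1-5 scale, ten each of 3, 4 and 5.
double[] scores = Enumerable.Repeat(3.0, 10)
    .Concat(Enumerable.Repeat(4.0, 10))
    .Concat(Enumerable.Repeat(5.0, 10))
    .ToArray();

double se = Noise.StandardError(scores);
Console.WriteLine($"mean {scores.Average():F2}, standard error {se:F2}, ~95% interval +/- {1.96 * se:F2}");

public static class Noise
{
    // Sample standard deviation (n - 1 in the denominator).
    public static double SampleStdDev(IReadOnlyCollection<double> xs)
    {
        if (xs.Count < 2) throw new ArgumentException("Need at least two scores.");
        double mean = xs.Average();
        double sumSq = xs.Sum(x => (x - mean) * (x - mean));
        return Math.Sqrt(sumSq / (xs.Count - 1));
    }

    // Standard error of the mean: how far the average is expected to wobble between samples.
    public static double StandardError(IReadOnlyCollection<double> xs) =>
        SampleStdDev(xs) / Math.Sqrt(xs.Count);
}
```

With this spread the standard error is about 0.15, so the rough 95% interval on the mean is about 0.3 either side, and a 0.1 move sits well inside it. Scoring the same cases before and after a change and looking at per-case differences is usually a tighter comparison than comparing two averages.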

Where this is going

Evals are not a dashboard you bolt on at the end. They're the discipline that lets you change an AI product without flying blind — look at your failures, measure them honestly, and gate on the result. Done right, they turn "I think this feature is fine" into "I can prove it, and I'll know the moment it stops being true."

The full series:

This post — what evals are and how we run them.
Error analysis — the unglamorous superpower, and how to build a failure taxonomy.
Golden datasets that don't lie — curation, leakage, and the drift trap.
LLM-as-judge, done right — rubrics, a dedicated judge, and the biases that wreck it.
From a number to a gate — evals in CI and online monitoring.

TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET, not a notebook. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.

I put Ollama on a 4 GB mobile GPU and got 2.5×: here's the VRAM math

Vasyl — Wed, 13 May 2026 12:00:00 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

📎 Companion piece to my earlier post: I shipped local LLM features two months ago — production never ran them once. Same gemma4:e2b, same box — this one is the GPU offload follow-up.

🔬 TL;DR

2.5× faster, 10°C cooler — on a 4 GB laptop GPU that "shouldn't" fit the model.

Metric	CPU only	GPU hybrid
Tokens / sec	17	39
Per-call latency	~5.5 s	~2.0 s
CPU temp under burst	hot	−10 °C
Layers on GPU	0	35 / 36

Same prompt. Same model. Same hardware. The only thing that changed was whether Ollama was allowed to touch the card.

Honest take: I was hoping for more. The math at the end of this post explains exactly why 2.5× is the ceiling on 4 GB of VRAM with Gemma 4, and what it would take to push higher.

⚙️ Setup


Setting	Value
Model	gemma4:e2b (2 B effective params, ~7.2 GB on disk)
CPU	AMD Ryzen 5 4600H, 6 cores / 12 threads
GPU	NVIDIA GTX 1650 Ti Mobile, 4 GB VRAM
OS / runtime	Ubuntu + Docker, Ollama 0.23.1
Prompt	Distractor + hint + explanation generator from my reader app, fixed across runs
Output budget	~60 tokens per call
Control	num_gpu=0 → CPU only · num_gpu=999 → let Ollama auto-split
Warm-up	One throwaway call per mode before the timed samples

Both modes ran after warm-up, so the numbers reflect steady-state inference, not first-load cost. Each /api/generate response came back as NDJSON, so I pulled eval_count, eval_duration, and total_duration straight from the engine — no external timing noise.
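Ollama reports those durations in nanoseconds, so tokens per second is eval_count divided by eval_duration in seconds. Here is a minimal sketch of that arithmetic, assuming a non-streamed response (or the final line of a streamed one). The sample payload is built from the CPU-only averages in the results table further down, not copied from a captured response, and OllamaMetrics is a hypothetical helper.

```csharp
using System;
using System.Text.Json;

// Sample payload assembled from the CPU-only averages, not a real captured response.
var json = """{"eval_count":60,"eval_duration":3506000000,"total_duration":5390000000}""";
var metrics = OllamaMetrics.Parse(json);
Console.WriteLine($"{metrics.TokensPerSecond:F1} tok/s, eval {metrics.EvalMs:F0} ms, total {metrics.TotalMs:F0} ms");

public record OllamaMetrics(long EvalCount, long EvalDurationNs, long TotalDurationNs)
{
    public double EvalMs => EvalDurationNs / 1e6;
    public double TotalMs => TotalDurationNs / 1e6;

    // Generation speed only: prompt processing and model load time are excluded.
    public double TokensPerSecond => EvalCount / (EvalDurationNs / 1e9);

    public static OllamaMetrics Parse(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;
        return new OllamaMetrics(
            root.GetProperty("eval_count").GetInt64(),
            root.GetProperty("eval_duration").GetInt64(),
            root.GetProperty("total_duration").GetInt64());
    }
}
```

Sixty tokens over 3.506 s is roughly 17 tokens per second, which matches the CPU-only figure reported in the results.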

🎯 Why I picked E2B

Gemma 4 ships in three flavours — the small E2B/E4B family, a 31B Dense model, and a 26B MoE. The model that runs in this benchmark is the smallest of those, and that wasn't accidental.

The work is a fire-and-forget enrichment step inside a vocabulary-save flow — distractors plus a hint plus a short explanation, all generated in one call. It has to feel synchronous on a save action, and it has to run on the same commodity laptop as the rest of the app. Anything bigger is the wrong tool.

The 31B Dense doesn't fit. The 26B MoE would, but its VRAM patterns on a 4 GB card are punishing. E4B is the obvious step up in quality from E2B, but its size pushes total memory over the line where Ollama has to keep more on CPU — slower for the same job at the latency profile a save action needs. E2B at Q4 lands the quality where I need it for distractor generation while leaving headroom for the KV cache and everything else.

The framing that matters here isn't "the biggest model I could fit" but "the smallest model that gave me the output I needed." On constrained hardware, that distinction is the whole game — and it's what made the GPU experiment below worth running at all.

📊 Results

Metric	CPU only	GPU hybrid (35/36 layers on GPU)	Δ
Avg output tokens / call	60	55	~same
Avg eval latency (token gen only)	3,506 ms	1,411 ms	2.49× faster
Avg total latency (prompt + gen)	5,390 ms	2,174 ms	2.48× faster
Tokens / sec	17	39	2.29× faster

ollama ps during the GPU run:

NAME          SIZE      PROCESSOR        CONTEXT   UNTIL
gemma4:e2b    7.8 GB    74%/26% CPU/GPU  4096      Forever

nvidia-smi during a generation:

NVIDIA GTX 1650 Ti, used 1998 MiB, free 1909 MiB, util 32 %

⚠️ ollama ps lies to you.
That "74%/26% CPU/GPU" string is a memory split, not a layer split. The Ollama server logs are the only place that tells you which layers actually moved. Mine showed offloaded 35/36 layers to GPU. Almost the whole transformer — minus one layer that matters a lot. More on that in a second.

🧠 Why 2.5× and not 10×

The model has 36 transformer layers. Ollama put 35 of them on the GPU. The lone holdout is the output projection layer — the one that maps the final hidden state back into Gemma's vocabulary.

Gemma 4's vocab is enormous (~256k tokens). That output layer is dense, fat, and would happily swallow what's left of the 4 GB after the rest of the stack moves over. So Ollama leaves it on CPU.

The consequence is brutal in the steady state:

💡 Every single generated token has to round-trip through the CPU at the end. GPU is fast for the 35 layers it owns, then the pipeline stalls on the one layer the GPU couldn't take. Average across thousands of tokens and the CPU side becomes the floor.

That's the whole story of 2.5× instead of 10×. Hybrid inference is gated by the slower of the two devices, and on this card the slower device is doing real work on every token.

The takeaway worth bolding: if you only ever look at ollama ps, you'll get the wrong picture of what your setup is doing. The server load logs are the source of truth for which layers went where.

💡 What 2.5× actually buys you

In the app, a single save — distractors + hint + short explanation, ~60 output tokens — used to take 5.5 s. Now it's just over 2 s.

That moves the action from the "is this hanging?" zone into the "yeah, it's working" zone. That's the threshold that actually matters for a save action.

Five saves in a row:

Before: ~30 seconds of full-tilt CPU
After: ~10 seconds, work split between CPU and GPU
Bonus: peak CPU temperature during that burst dropped ~10 °C

On a thin laptop in a small room, that last number is the difference between a fan you hear and a fan you don't.

🚀 What would push it higher

Three options, in order of how willing I am to do them:

Smaller quant on just the output layer. If that layer fit in the remaining ~1.9 GB, the whole model would run on GPU and you'd see the 10× numbers other writeups quote. The cost is real quality loss on the output distribution — worth measuring on your own prompt set rather than assuming.
A bigger GPU. A 16 GB card holds the whole thing with room to spare. The point of this exercise was specifically "what does a commodity laptop GPU do", so a $500 desktop card isn't really in scope.
Swap engines. llama.cpp direct, vLLM, etc. Two seconds is already inside budget for the action this model powers. Optimising past "fast enough" is how you end up with three benchmarks and zero users.

🛠️ Reproducing this

# 1. Pull the model
ollama pull gemma4:e2b

# 2. Force CPU only
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:e2b",
  "prompt": "Give me 5 distractors for the word \"warehouse\".",
  "stream": false,
  "options": { "num_gpu": 0 }
}' | jq '{tokens: .eval_count, eval_ms: (.eval_duration/1e6), total_ms: (.total_duration/1e6)}'

# 3. Let Ollama use the GPU
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma4:e2b",
  "prompt": "Give me 5 distractors for the word \"warehouse\".",
  "stream": false,
  "options": { "num_gpu": 999 }
}' | jq '{tokens: .eval_count, eval_ms: (.eval_duration/1e6), total_ms: (.total_duration/1e6)}'

# 4. Check what actually landed where
docker logs ollama 2>&1 | grep -E "offloaded|layers"
nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv

Run each curl a handful of times to flush warm-up effects, then average eval_ms and total_ms. The interesting number is the ratio, not the absolute timings — they'll vary with your CPU.

✅ Takeaways

4 GB VRAM is enough to be useful, even on a model that "should" need more. Just don't expect 10×.
Hybrid inference is gated by the slower device. If one critical layer stays on CPU, that's your floor.
Trust the load logs, not ollama ps. The pretty CPU/GPU percentage is a memory split, not a layer count.
2.5× is the difference between a UX that feels broken and one that doesn't. That's enough.
Stop optimising once you're inside budget. "Fast enough" beats "fastest" every time.

📖 Full write-up with all the load-log spelunking on my blog: vasyl.blog — I put Ollama on a 4 GB mobile GPU and got 2.5×

⭐ The reader app this powers is open-source (AGPL-3.0): github.com/mrviduus/textstack

Built with gemma4:e2b for the Gemma 4 Challenge. If you're entering too, drop a link in the comments — happy to read yours.

I shipped local LLM features two months ago. Production never ran them once.

Vasyl — Tue, 12 May 2026 11:23:14 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Two months ago I shipped local-LLM features in TextStack — an open-source reader for developers who want to finish dense English technical books in their native language. Yesterday I noticed something strange about the production server's RAM. 3 GB used out of 30. The model that runs all those features should be ~13 GB resident.

I SSH'd in.

$ docker compose exec ollama ollama list
NAME    ID    SIZE    MODIFIED
$

Nothing. The Ollama container had been running for 60+ days without a single model pulled. Every distractor call had fired, hit the fallback path, and returned random vocabulary words. I never noticed because the failure mode is silent — the user sees distractors, just not LLM-generated ones.
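A silent failure like this can be made loud with a startup check against Ollama's GET /api/tags endpoint, which lists the locally pulled models. This is a sketch under assumptions: OllamaCheck is a hypothetical helper, the base URL is whatever your deployment uses, and it is not necessarily one of the fixes described later in this post.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

// An empty model list is what the `ollama list` output above corresponds to.
Console.WriteLine(OllamaCheck.HasModel("""{"models":[]}""", "gemma4:e2b"));                      // False
Console.WriteLine(OllamaCheck.HasModel("""{"models":[{"name":"gemma4:e2b"}]}""", "gemma4:e2b")); // True

public static class OllamaCheck
{
    // True only if the /api/tags payload lists a model with exactly this name.
    public static bool HasModel(string tagsJson, string model)
    {
        using var doc = JsonDocument.Parse(tagsJson);
        return doc.RootElement.TryGetProperty("models", out var models)
            && models.ValueKind == JsonValueKind.Array
            && models.EnumerateArray().Any(m =>
                m.TryGetProperty("name", out var name) && name.GetString() == model);
    }

    // Intended for startup: throw instead of letting every call fall back silently.
    public static async Task EnsureModelAsync(HttpClient http, string baseUrl, string model)
    {
        string json = await http.GetStringAsync($"{baseUrl}/api/tags");
        if (!HasModel(json, model))
            throw new InvalidOperationException($"Ollama at {baseUrl} has no model '{model}' pulled.");
    }
}
```

The same idea applies to the fallback path itself: counting or logging every fallback would have made sixty days of random distractors visible on day one.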

This is the post-mortem of that, plus the two model swaps that finally got the features working: qwen3:8b → gemma4:e4b on day one to bring local inference up at all, then e4b → e2b once production load showed e4b couldn't keep up on CPU. Six production bugs surfaced along the way. The article ends with a real 63,000-request load test on the e2b deploy: 100% success, p95 = 20.5 ms, total OpenAI cost = $0.002.

What I Built

TextStack is an open-source (AGPL-3.0) reader for developers who keep abandoning English technical books like Designing Data-Intensive Applications. Tap any term → context-aware translation that knows the book's domain ("attention" in an ML chapter gets увага (механізм у нейромережах), not the everyday meaning). Words you save feed a capped weekly SRS queue.

Local Gemma 4 e2b generates the multiple-choice distractors, hints, native-language explanations, and book metadata enrichment — four jobs that previously needed paid OpenAI calls per user. OpenAI gpt-5-mini stays for translation (multilingual quality matters) and for in-reader live explanations (latency-sensitive). Everything else runs on a single-CPU 30 GB-RAM VPS, no GPU.

Demo

🌐 Live: textstack.app — sample chapters open without signup. Tap any word in Designing Data-Intensive Applications, then check the vocabulary review.

🎬 37-second walkthrough — read → save word → MCQ with Gemma-generated distractors → answer feedback:

📸 Single MCQ card — "___ the data from these external systems..." with 4 Gemma-generated distractors (battle / bringing / storm / courage):

Note for judges: Sample chapters are unauthenticated; the vocabulary review needs a free account because progress and SRS state are per-user. Use any throwaway email — there's no email verification gate on read.

Code

📦 Repository: github.com/mrviduus/textstack — AGPL-3.0, 200+ merged PRs, deployed at textstack.app

⭐ Star the repo on GitHub — every star tells me one more developer wants to finish DDIA without giving up

📐 Stack:

Backend: ASP.NET Core 10 (clean architecture: Domain / Application / Infrastructure / Api / Worker)
Database: PostgreSQL 16 with FTS for in-book search
Frontend: React 19 + Vite, React Native 0.83 (Expo) for mobile
LLM: Ollama running gemma4:e2b for local jobs, OpenAI gpt-5-mini for translation
Deployment: docker-compose, Cloudflare Tunnel, single VPS

🔧 Key commits behind the story:

PR #232 — original swap qwen3:8b → gemma4:e4b, image pin, memory bump
3999944 — worker Connection refused fix + the real timeout bump (30s → 90s after measurement)
966b398 — the second model swap, e4b → e2b
c6db540 — 63,000-request load test + full LoadSurge report

Full PR/commit history for the swap arc lives in CHANGELOG.md under [Unreleased]. The Gemma-using code lives in:

backend/src/Vocabulary/TextStack.Vocabulary/DistractorGenerator.cs — prompt template, parser, fallback cascade
backend/src/Worker/Services/BookMetadataGenerator.cs — fire-and-forget metadata enrichment

How I Used Gemma 4

The model selection went through two rounds. Gemma 4 ships in four sizes. The first time I built a trade-off table, I picked the wrong one — for understandable reasons. The second time I had production data and picked correctly. Both decisions live in the same article.

Here's the matrix at the time of the first pick (E4B, day-one swap):

Model	Disk	RAM resident	Fits on my VPS?	First-pick reasoning
E2B (2B effective)	7.2 GB	~5 GiB	✅ trivially	"Too small for nuanced technical-vocab distractors" — I'd find out this was wrong
E4B (4B effective)	9.6 GB	13 GiB	✅ with cgroup bump 4G → 12G	"Sweet spot — strong enough on quality, fits the VPS" — picked first
31B Dense	~18 GB	~24 GiB	⚠️ tight, no headroom for Postgres + .NET	"Overkill, no room for the rest of the stack"
26B MoE	~15 GB	~20 GiB	⚠️ same constraint	"MoE doesn't help short prompts here"

The 31B and 26B MoE models would need either a GPU box or a much bigger VPS, neither of which fits an open-source project that has to remain deployable on a $20/month consumer host. So the real choice was between E2B and E4B. I went with E4B. I was wrong.

What Gemma 4 unlocked vs the cloud alternative. Pre-swap, every distractor generation was a ~5¢ OpenAI call per word saved per user. With ~50 saved words per active reader per book, that's $2.50/book/user — fine for me running the only instance, fatal the moment someone else self-hosts it. Local Gemma 4 makes the marginal cost per distractor ~0 (just CPU on a box already running). Same for hints, explanations, and book metadata enrichment.
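The per-book figure above is simple multiplication; a tiny sketch in Python (not the project's C#) makes it easy to rerun with other assumptions. The function name is hypothetical, and amounts are kept in integer cents to avoid floating-point noise.

```python
def cost_per_book_cents(cents_per_call: int, saved_words_per_book: int) -> int:
    """Cloud cost of distractor generation for one reader finishing one book."""
    return cents_per_call * saved_words_per_book


# Figures quoted in the text: ~5 cents per call, ~50 saved words per book.
per_book = cost_per_book_cents(5, 50)
print(f"${per_book / 100:.2f} per book per user")  # $2.50 per book per user
```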

Local inference changed the economics of the feature completely. That's the real reason the swap mattered — not the model quality, the cost shape.

What surfaced when I actually flipped it on

The bug story isn't decoration — it's how I learned what each Gemma 4 quirk does in production. Six lessons. The first four came from getting e4b to run at all. The last two came from staring at the production stats after it was "running".

Lesson 1: floating image tags lie

Original docker-compose.yml had:

ollama:
  image: ollama/ollama   # no version

Docker had pulled latest two months earlier and cached it. At that moment latest was 0.22.x, and Gemma 4 hadn't been released yet, so the cached binary didn't recognize the model family. From the host's perspective the local image still is "latest": docker image ls shows the cached SHA, not whether upstream has moved.

- image: ollama/ollama
+ image: ollama/ollama:0.23.1

Pull succeeded after pinning. 9.6 GB on disk for e4b.
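A check like this is cheap to automate. The following is a minimal sketch in Python (not part of the TextStack repo; the function name is made up) that scans compose text for image references with no tag or with the floating latest tag. It uses a plain regex rather than a YAML parser, so it only handles simple one-line image: entries.

```python
import re

# Matches "image: <reference>" lines; stops at whitespace, quotes, or a comment.
IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?([^\s'\"#]+)")


def unpinned_images(compose_text: str) -> list[str]:
    """Return image references that have no tag or use the floating 'latest' tag."""
    flagged = []
    for line in compose_text.splitlines():
        match = IMAGE_LINE.match(line)
        if not match:
            continue
        ref = match.group(1)
        if "@sha256:" in ref:
            continue  # pinned by digest, nothing to flag
        # Only look for a tag after the last '/', so a registry port such as
        # localhost:5000/app is not mistaken for a tag.
        last_segment = ref.rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else None
        if tag is None or tag == "latest":
            flagged.append(ref)
    return flagged


sample = """
services:
  ollama:
    image: ollama/ollama   # no version
  db:
    image: postgres:16
"""
print(unpinned_images(sample))  # ['ollama/ollama']
```

Run in CI against docker-compose.yml, a non-empty result could fail the build before a stale cached image ever reaches production.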

Lesson 2: cgroup limits were a guess from the qwen3 era

The container memory cap (4 GB) had been sized for qwen3:8b and never re-evaluated. Gemma 4 e4b weights need 9.8 GiB. Inference returned "model requires more system memory (9.8 GiB) than is available" until I bumped the limit:

  deploy:
    resources:
      limits:
-       memory: 4G
+       memory: 12G

The lesson: every model swap should also re-evaluate the container resource block. Picked-once-and-forgotten limits are a category of silent drift.
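One way to make that re-evaluation routine is a small pre-deploy check. This is a sketch in Python under stated assumptions: the helper names are hypothetical, and it assumes compose-style sizes such as 4G are binary units (1G = 1024³ bytes), which is how Docker's memory settings are normally interpreted.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_limit(value: str) -> int:
    """Convert a compose-style size such as '4G', '12g' or '512M' to bytes."""
    text = value.strip().upper()
    if text.endswith("B"):  # accept '4GB' as well as '4G'
        text = text[:-1]
    if text and text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)  # plain number of bytes


def fits(limit: str, required_gib: float) -> bool:
    """True when the container limit covers the model's stated requirement."""
    return parse_limit(limit) >= required_gib * 1024**3


# 9.8 GiB is the requirement reported in the Ollama error message above.
print(fits("4G", 9.8))   # False
print(fits("12G", 9.8))  # True
```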

Lesson 3: cold load and warm latency both blew past my API timeout

First inference call hung ~60s before the first token. Default Ollama keep_alive is 5 minutes — after that the model unloads and the next cold call burns 60s again. Fix: OLLAMA_KEEP_ALIVE=-1, plus bump the API timeout from 10s → 30s.

I shipped it. Then watched production: 2 distractor generations out of 13 saved words succeeded. The model was resident the entire time. Every miss was a wall-clock timeout. E4B on CPU just takes more than 30 seconds for many prompts.

So 30s wasn't enough either:

- "TimeoutSeconds": 30
+ "TimeoutSeconds": 90

Success rate climbed to ~100%. For CPU-only Gemma 4 on a 6-core consumer VPS, your timeout has to absorb 60–90 s tail latency, not 10 s. That gap between toy-benchmark numbers and production reality is where most local-LLM ship-and-forget bugs live.
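Rather than guessing a timeout twice, it can be derived from measured latencies. The sketch below (Python, hypothetical function name) picks a high quantile of observed call durations and adds headroom. The sample numbers are made up for illustration and are not measurements from TextStack.

```python
import math


def suggested_timeout(latencies_s: list[float], quantile: float = 0.99,
                      headroom: float = 1.2) -> int:
    """Whole-second timeout covering `quantile` of observed latencies, plus headroom."""
    if not latencies_s:
        raise ValueError("need at least one measurement")
    ordered = sorted(latencies_s)
    # Nearest-rank quantile: the smallest sample with at least
    # `quantile` of all samples at or below it.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return math.ceil(ordered[rank - 1] * headroom)


# Made-up per-call durations in seconds, for illustration only.
observed = [12.0, 18.5, 24.0, 31.0, 33.5, 41.0, 47.5, 52.0, 66.0, 74.0]
print(suggested_timeout(observed))
```

With a tail like this, a 30 s timeout would cut off a large share of calls that were about to succeed, which matches the failure pattern described above.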

Lesson 4: the parser silently dropped half my output

DistractorGenerator's prompt asks for 5 wrong-answer words. Smoke test for linearizability:

consistency, atomicity, serialization, concurrency, visibility

Five single-word distractors. Clean. Then I tried eventual consistency:

strong consistency, read-after-write, data loss, causality, serialization

Now look at the parser:

.Where(w => w.Length > 1
    && w.Length < 50
    && w.Any(char.IsLetter)
    && !w.Equals(originalWord, StringComparison.OrdinalIgnoreCase)
    && !w.Contains(' '))      // ← drops "strong consistency", "data loss"

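The filter rejects multi-word entries. In this sample that removes two of the five ("strong consistency" and "data loss"), and responses with more phrasal entries lose more. Whenever fewer than three survived, the distractors.Count >= 3 requirement failed, the call returned null, and the fire-and-forget path fell back to the hardcoded random-word picker.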
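To see the effect concretely, here is a Python port of the filter conditions shown above, run against the eventual consistency sample. The comma split and the function name are assumptions; the original C# tokenizing step is not shown in this post.

```python
def parse_distractors(raw: str, original_word: str) -> list[str]:
    """Apply the same five filter conditions as the C# snippet above."""
    candidates = [w.strip() for w in raw.split(",")]  # assumed splitting step
    return [
        w for w in candidates
        if 1 < len(w) < 50
        and any(ch.isalpha() for ch in w)
        and w.lower() != original_word.lower()
        and " " not in w  # the multi-word filter
    ]


sample = "strong consistency, read-after-write, data loss, causality, serialization"
print(parse_distractors(sample, "eventual consistency"))
# ['read-after-write', 'causality', 'serialization']
```

Hyphenated entries pass because they contain no space, which is why the prompt fix below can say "Hyphens are fine."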

The filter had been there since the original implementation. qwen3 outputs single words by default, so the constraint stayed hidden. Gemma 4 prefers phrasal answers, and output shape like this is the parsing surface most sensitive to a change of model family. The fix was a single rule added to the prompt:

- SINGLE WORD ONLY — no spaces, no multi-word phrases
  (use "linearizability" not "strong consistency"). Hyphens are fine.

After all four fixes, a real production save of warehouse returned:

["storeroom", "depot", "facility", "silo", "loft"]

Five domain-adjacent single-word distractors, exactly the shape the prompt asks for. That's the moment local Gemma 4 was finally doing real work.

Lesson 5: the worker had been silently failing for two months

While collecting production stats for this article, I grepped the worker logs:

$ docker compose logs worker | grep "Connection refused"
... lots of lines ...

docker-compose.yml had set Ollama__BaseUrl on the api service but not on the worker service. The worker fell back to the default (localhost:11434, which inside the worker container points at nothing) and every BookMetadataGenerator call hit Connection refused silently. Every user-uploaded book ended up with genre = NULL, which in turn meant the domain-aware translation prompt had no genre to bias toward.

This was a second silent fallback, completely orthogonal to the original one. Same shape, different surface. Fix:

  worker:
    environment:
+     Ollama__BaseUrl: http://ollama:11434
+     Ollama__Model: gemma4:e2b

Plus a one-shot MetadataBackfillWorker (a small BackgroundService that runs on worker startup) to heal the ~10 user-uploaded books with genre = NULL, idempotently.

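The pattern is the lesson. Anywhere you distribute environment variables via a compose file, ask which services actually need each variable and whether it is set on every one of them. A variable set on one service is not visible to another, and a .env file only feeds ${...} substitution inside the compose file; nothing reaches a container unless that service's own environment (or env_file) block puts it there.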
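That audit can be scripted. The sketch below is Python with a hypothetical function name; it takes an already-parsed compose document (for example the dict a YAML loader would return) and lists the services that lack a given variable. It handles both forms compose allows for environment: a mapping and a list of KEY=value strings.

```python
def services_missing_var(compose: dict, var: str, services: list[str]) -> list[str]:
    """Return the names of the given services that do not set `var`."""
    missing = []
    for name in services:
        env = compose.get("services", {}).get(name, {}).get("environment") or {}
        if isinstance(env, dict):
            keys = set(env)
        else:
            # List form: ["KEY=value", "OTHER_KEY"]
            keys = {str(item).split("=", 1)[0] for item in env}
        if var not in keys:
            missing.append(name)
    return missing


# Shape of the broken setup described above, reduced to the relevant keys.
compose = {
    "services": {
        "api": {"environment": {"Ollama__BaseUrl": "http://ollama:11434"}},
        "worker": {"environment": ["ASPNETCORE_ENVIRONMENT=Production"]},
    }
}
print(services_missing_var(compose, "Ollama__BaseUrl", ["api", "worker"]))  # ['worker']
```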

Lesson 6: turn off thinking mode for structured outputs

Many recent models served through Ollama (Gemma 4 among them) run a chain-of-thought "thinking" pass before the final answer unless told otherwise. For freeform reasoning that's a quality win. For my use case (output a 5-element list of single words) the thinking pass is pure overhead. Every request was generating 50–200 tokens of internal reasoning the parser then threw away.

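In the Ollama call (the Ollama REST API documents think as a top-level request field next to options, so check where your client library actually puts it):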

- options: { "temperature": 0.7 }
+ options: { "temperature": 0.7, "think": false }

Roughly halved the per-request token output. Roughly halved end-to-end latency. The quality of the distractors did not drop in my testing — for "give me 5 plausible wrong-answer words for warehouse", chain-of-thought wasn't doing anything load-bearing.

If you're using Ollama for structured outputs, this is the single biggest perf knob most people don't know about.
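For reference, here is a sketch of what a raw request body could look like when calling Ollama's /api/generate endpoint directly. It is Python, not the project's C# client, and the function name is hypothetical. It places think at the top level of the request, which is where the Ollama API documentation lists it, while temperature stays inside options.

```python
import json


def build_generate_request(model: str, prompt: str, temperature: float = 0.7) -> dict:
    """Body for POST /api/generate with the thinking pass disabled."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,   # return one JSON object instead of a token stream
        "think": False,    # top-level request field, not a model option
        "options": {"temperature": temperature},
    }


body = build_generate_request(
    "gemma4:e2b",
    "Give 5 plausible wrong-answer words for: warehouse",
)
print(json.dumps(body, indent=2))
```

If a wrapper library is in the path, it is worth logging the serialized request once to confirm the flag ends up where the server expects it.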

The second swap: e4b → e2b

After all six lessons above, distractor calls were succeeding at ~100%. But end-to-end save latency was still tail-heavy. Looking at the numbers honestly: most calls landed in the 30–60 s range, and the 90 s timeout was absorbing what should have been a comfortable fit.

Two things were happening at once:

E4B's 13 GiB resident was contesting RAM with Postgres + .NET on a 30 GB box. Not OOM-level, but the working set wasn't always in cache.
Even with think=false, e4b is genuinely slow on a 6-core CPU. I'd been benchmarking on a warm cache and short prompts; longer prompts (explanations, multi-sentence hints) routinely hit 60 s+.

I swapped to e2b:

Metric	e4b (after all fixes)	e2b (current prod)
Disk	9.6 GB	7.2 GB
RAM resident with `KEEP_ALIVE=-1`	13 GiB	7.7 GB
Inference speed on same CPU	baseline	~2–3× faster
Quality on single-word distractor task	reference	comparable for short structured outputs

The first-pick reasoning ("E2B's quality is too weak for technical vocabulary") had been based on a quality benchmark. The real production constraint turned out to be latency. For short structured outputs (distractor lists, single-line hints), the prompt template constrains the answer so tightly that the quality gap between the two sizes all but disappears, and e2b gets there 2–3× faster. The prompt was doing more work than I'd given it credit for.

For longer freeform outputs (the 2–3 sentence native-language explanation), e2b is measurably less polished. Acceptable for the use case (it's a study aid, not a translation). If a future task demands better explanation quality, the path is a fine-tune of e2b on TextStack's domain corpus, not jumping back to e4b. Same hardware envelope, better domain fit.

Numbers (real, post-e2b)

The numbers below are measured on the production server: AMD Ryzen 5 4600H, 6 cores / 12 threads, 30 GiB RAM, no GPU. Same box that serves traffic to textstack.app.

Metric	Value
Disk (`gemma4:e2b`)	7.2 GB
RAM resident with `KEEP_ALIVE=-1`	7.7 GB
Cold load (container restart)	~10 s
Distractor cost per word	~0¢ (CPU on existing box)
Equivalent OpenAI cost	~5¢ per word at gpt-5-mini rates

Load test: 63,000 requests, 100% success, $0.002

After the e2b swap I stress-tested the production deploy with LoadSurge. Three scenarios — GET /health, POST /translate, POST /explain — at 30–50 virtual users for 30–60 seconds each. Headlines:


Metric	Value
Total requests	63,000
Success rate	100% (0 failures)
Worst-case p95 latency	20.5 ms (smoke; translate and explain were lower)
Sustained RPS at 50 VU	500
OpenAI cost during the run	$0.002 (10 cache-prewarm calls; zero during the stress phase)
Peak temperature on the host	42 °C (throttle threshold 95 °C)

The interesting part isn't the throughput — 500 RPS on a $20 box is real but not surprising for cached HTTP. The interesting part is that the expensive path disappeared entirely behind the cache. Translate and Explain are keyed by (input, target_language, genre, sentence); on a hot cache the LLM never enters the request lifecycle.
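A cache keyed on that four-part tuple could be sketched as follows. This is Python with hypothetical names, not the actual TextStack implementation; the hashing scheme in particular is an assumption. The parts are serialized as a JSON array before hashing so that different field boundaries can never collapse into the same key.

```python
import hashlib
import json
from typing import Optional


def cache_key(text: str, target_language: str, genre: Optional[str], sentence: str) -> str:
    """Stable key for a translate/explain result."""
    parts = [text, target_language, genre or "", sentence]
    # JSON keeps the boundaries between fields unambiguous,
    # unlike joining the parts with a delimiter.
    encoded = json.dumps(parts, ensure_ascii=False, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


key_ml = cache_key("attention", "uk", "machine-learning", "Attention lets the model weigh tokens.")
key_none = cache_key("attention", "uk", None, "Attention lets the model weigh tokens.")
print(key_ml != key_none)  # True: a different genre yields a different cache entry
```

Including genre in the key is what lets the same word carry a domain-specific translation in one book and the everyday one in another.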

The auth-gated POST /me/vocabulary/words path that triggers actual Gemma 4 distractor generation wasn't covered by this run — that's the next test, with test-auth tokens and a bounded-concurrency queue in front of Ollama. The full per-scenario breakdown is in docs/loadtest/run-20260511-103451/REPORT.md.

Where OpenAI stays

The split after both swaps:

Task	Provider	Why
Vocabulary distractors	Local Gemma 4 e2b	Tolerable quality, fire-and-forget, no per-user cost
Word hints	Local Gemma 4 e2b	Same
Native-language explanations	Local Gemma 4 e2b	Same; acceptable on long-form quality given the use case
Book metadata enrichment	Local Gemma 4 e2b	Same
Translation (18+ langs, incl. Ukrainian)	OpenAI gpt-5-mini	Small-model multilingual translation is still a weak spot
In-reader term explanation (live)	OpenAI gpt-5-mini	<1 s latency requirement during reading

Local LLMs aren't a wholesale cloud replacement. They're a tool for tasks where quality is tolerant, latency is amortizable, privacy matters, or per-user cost matters. When any of those breaks down — multilingual translation, latency-sensitive UI — cloud still wins.

Lessons (for anyone shipping local LLMs)

Silent fallback is the worst kind of bug. Distractor generation had been failing in production for 60+ days and I had no signal: the fallback was a hardcoded random-word picker, indistinguishable to the user. And it happened twice in the same system, on two different surfaces (Ollama-not-installed, then Worker-can't-reach-Ollama). Next time: emit llm.success and llm.fallback counters per service, alert when fallbacks exceed 5% of calls, and never make fallbacks bit-for-bit indistinguishable from the primary path.
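The counters described above could look roughly like this. It is an in-memory Python sketch with hypothetical class and method names; a real deployment would export the counts to whatever metrics system is already in place rather than keep them in a dict.

```python
from collections import Counter


class LlmOutcomeTracker:
    """Counts llm.success / llm.fallback per service and flags a high fallback share."""

    def __init__(self, threshold: float = 0.05):
        self.counts: Counter = Counter()
        self.threshold = threshold

    def record(self, service: str, used_fallback: bool) -> None:
        metric = "llm.fallback" if used_fallback else "llm.success"
        self.counts[(service, metric)] += 1

    def fallback_ratio(self, service: str) -> float:
        ok = self.counts[(service, "llm.success")]
        fallback = self.counts[(service, "llm.fallback")]
        total = ok + fallback
        return fallback / total if total else 0.0

    def should_alert(self, service: str) -> bool:
        return self.fallback_ratio(service) > self.threshold


tracker = LlmOutcomeTracker()
for _ in range(19):
    tracker.record("worker", used_fallback=False)
tracker.record("worker", used_fallback=True)
print(tracker.should_alert("worker"))  # False: exactly 5% is still within the threshold
tracker.record("worker", used_fallback=True)
print(tracker.should_alert("worker"))  # True: 2 of 21 calls fell back
```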

Floating image tags lie. Pin Ollama, pin Postgres, pin everything. latest freezes the day Docker pulls it; two months later it's lagging upstream and you have no signal until a new model breaks it.

Defend at parse, always, even if your model behaved on the first try. With the same prompt, qwen3 returned single words and Gemma 4 returned phrases. The parser's pre-existing !w.Contains(' ') filter was correct in spirit but invisible to the model. Once the same rule was stated in the prompt, it became explicit and Gemma satisfied it.

Bench with real prompts on real hardware. I tested e4b's quality on warm-cache short prompts and concluded it was the right pick. Real production tail latency on longer prompts was 3× what the smoke test suggested, and that's what forced the e2b downgrade. Toy benchmarks hide both model-family quirks (parsing) and hardware-bound failure modes (CPU latency).

Turn off thinking mode for structured outputs. think: false is the single biggest perf knob on Ollama for short structured tasks. Most documentation doesn't surface it.

Distribute env vars deliberately across services. Docker-compose service blocks don't inherit from each other. Whichever service actually needs a variable — list it explicitly in that service's env block. The day you add a new service, audit every variable.

The interesting part wasn't that the model failed. It was how long the system kept pretending it hadn't.

What's next

Fine-tune Gemma 4 e2b on TextStack's distractor task. A real production corpus is now accumulating (a few hundred (term, distractor-list) pairs per week post-fix). Nothing from before the fix is usable: every distractor served then came from the hardcoded fallback, not the model. The dataset starts fresh.

Add a bounded-concurrency queue in front of Ollama for the write path. From the load test recommendations: a Channels-based worker with MaxConcurrency = 2 plus a per-(word, language) shared cache. Mirrors the translate/explain caches that just held 500 RPS with zero LLM cost.
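The shape of that queue can be sketched with asyncio in Python. The planned version is a .NET Channels worker, so this is an analogue rather than the design itself, and every name here is hypothetical. A semaphore caps concurrent generations at two, and requests for the same (word, language) pair share one in-flight task and its result.

```python
import asyncio


class BoundedGenerator:
    """At most `max_concurrency` generations run at once; equal keys share one result."""

    def __init__(self, generate, max_concurrency: int = 2):
        self._generate = generate  # async callable: (word, language) -> list[str]
        self._slots = asyncio.Semaphore(max_concurrency)
        self._tasks: dict[tuple[str, str], asyncio.Task] = {}

    async def get(self, word: str, language: str) -> list[str]:
        key = (word.lower(), language)
        task = self._tasks.get(key)
        if task is None:
            # First request for this key starts the work; later ones await the same task.
            # A fuller version would evict failed tasks so they can be retried.
            task = asyncio.ensure_future(self._run(word, language))
            self._tasks[key] = task
        return await task

    async def _run(self, word: str, language: str) -> list[str]:
        async with self._slots:
            return await self._generate(word, language)


async def main():
    active = peak = calls = 0

    async def fake_generate(word, language):  # stands in for the Ollama call
        nonlocal active, peak, calls
        calls += 1
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return [f"{word}-distractor"]

    generator = BoundedGenerator(fake_generate, max_concurrency=2)
    words = ["warehouse", "replica", "quorum", "shard", "warehouse"]
    results = await asyncio.gather(*(generator.get(w, "uk") for w in words))
    return peak, calls, results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Five requests arrive, four distinct words are generated, and no more than two generations overlap.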

Run a second load test against the auth-gated write path. The 63k-request test only measured cached reads. Distractor generation is the actual bottleneck, and it sits behind authentication. Need test-auth tokens and 10–20 VU to bound it.

The full TextStack codebase is AGPL-3.0 at github.com/mrviduus/textstack. If you've shipped local-LLM features in production, run ollama list on your server, then docker compose logs worker | grep -i refused. One of those might surprise you. Mine surprised me twice in the same codebase — same shape, different surface, two months apart. That's the part of operating local LLMs that nobody writes about, and the part that takes the longest to learn.

If you found this useful, the strongest signal is a star on the repo. Every star tells me the next person abandoning DDIA mid-way might find this tool — and that's the whole point.

Open-source licenses 101: which one to actually pick

Vasyl — Thu, 07 May 2026 17:24:43 +0000

Sooner or later, every developer runs into The License Question. You shipped something to GitHub, GitHub asked you to pick a license, and you scrolled the dropdown — MIT, Apache, GPL, AGPL, BUSL, MPL, ISC, Unlicense, "Other" — and picked whatever sounded least scary. That's how I did it. That's also how I ended up rewriting my LICENSE file three weeks later.

Licenses are a dark forest for devs. We don't read legal docs, nothing in our day-to-day teaches us when each one matters, and most online advice is either a wall of legalese or someone's religious argument. Here's the version I wish someone had given me: a tour of the five licenses you'll actually meet, the mistakes that bite, and what changing my license did to my project's discoverability in the real world.

What a license actually does

By default, your code is "all rights reserved." That sounds like the default-est thing possible, but it means no one can legally copy, run, or modify your code without your written permission. Putting your project in a public GitHub repo barely changes that: GitHub's terms of service let other users view and fork the repo on the platform, and nothing more. A license is the contract you write with the world that relaxes the default.

The question you're answering when you pick one: how much can people do with this, and what do you get back?

The five you'll actually meet

MIT. "Use my code. Just keep my name in the file. Don't sue me." Three paragraphs long. Maximum adoption, zero protection. Most of the JavaScript ecosystem runs on MIT, and most of those projects don't have a monetization plan, which is exactly why it works for them.

Apache 2.0. Like MIT, but explicitly grants patent rights from contributors to users. That sounds boring until you realize half the tech world is built on patented stuff and silently assumes nobody will sue. Apache is the grown-up version of MIT — same vibe, fewer landmines.

GPL-3.0. "Modify and distribute my code? Your modifications are also GPL." This is copyleft. It infects everything downstream, which is why corporate lawyers hate it and Linux thrives on it (the kernel is GPL-2). Companies can't quietly fold GPL code into their proprietary stack — the license would force the whole stack open.

AGPL-3.0. GPL with a single, brutal addition: §13. If you modify the code and run it as a network service (a SaaS, a hosted dashboard, anything users hit over the network), you have to offer those users the source of your modified version. This closes the loophole that GPL leaves open, where a company can fork, modify privately, and host the modified version without ever distributing it. AGPL says: nope, the source of your fork has to be available the moment users touch it.

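BUSL-1.1. Not actually open source by the OSI's definition; it's "source-available." You can read the code, fork it, and run it within the limits of the license's use grant; what you typically can't do is sell it as a hosted commercial service competing with the original author. No later than four years after a version is released, that version converts to a real open-source license chosen by the author (often Apache 2.0 or a GPL version). Sentry, MariaDB (for MaxScale), and CockroachDB have all shipped under BUSL at some point, though Sentry and CockroachDB have since moved to other source-available licenses. It's a defensive license aimed at the "AWS forks our project and undercuts us on hosting" scenario.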

(There's also MPL-2.0 — file-level copyleft, used by Firefox. A reasonable middle ground if MIT feels too loose and AGPL too aggressive. Not your most-likely first encounter, so I'm leaving it as a footnote.)

Mistakes I see all the time

Picking MIT for a thing you might monetize. The most expensive mistake. MIT lets a competitor fork your work, polish it, host it, and out-market you — with zero recourse. Fine for a library nobody wants to commercialize. Bad for a product.

Copying BUSL because Sentry chose it. Different threat models. Sentry has hyperscaler-competition risk; you have nobody-knows-you-exist risk. BUSL solves a problem you don't have, while costing you contributor goodwill, awesome-list eligibility, and brand clarity. I learned this one personally.

Slapping GPL or AGPL on a library. Copyleft on a library is contagious — anything that links to it inherits your license. Devs see it and walk away because they can't safely use your code in their proprietary or differently-licensed project. Libraries should almost always be MIT or Apache.

No license at all. The silent killer. "All rights reserved" is the default, so a public repo with no LICENSE file is technically a public repo nobody can legally use. You're sending the message: here's my code, but also nobody can touch it. If you want adoption, ship a license.

Picking the most "open" license to look generous. MIT looks generous. It's also the easiest license to regret. The right question isn't "how open should I look" — it's "what business model do I want to keep available?" Be honest with yourself before you optimize for image.

What changing the license actually changed

I shipped TextStack — a reading tool I'm building solo — under BUSL-1.1. My reasoning was the same one MariaDB and Sentry articulated: protect against AWS-style cloning before it happens. Sounded smart. Felt smart. Wasn't.

The first sign was awesome-selfhosted. I went to add my project to the most-trafficked self-hosted directory on GitHub, opened the contributing guide, and saw a rule I hadn't expected: OSI-approved licenses only. BUSL doesn't qualify. The same pattern showed up across every awesome-* list I checked — awesome-react-native, awesome-dotnet-applications, awesome-llm-apps. Most either explicitly require an OSI-approved license or implicitly do. The world of curated, high-traffic developer discovery is gated by the OSI definition, and BUSL sits on the wrong side of the gate.

Then the second-order effects started showing up.

In GitHub search, the license: qualifier is how a lot of devs browse for tools. license:agpl-3.0 has its own discovery surface, while a repo GitHub classifies as license:other is essentially invisible. Switching from BUSL to AGPL moved my repo from one bucket to the other.

On the README itself, the license badge is the first thing a potential contributor reads. "BUSL-1.1" makes most devs hesitate — what is this, can I actually contribute? "AGPL-3.0" is recognized instantly. For a portfolio project where you want stars, forks, contributors, and word-of-mouth, that hesitation is the whole game.

And here's the kicker: AGPL didn't even cost me the protection I was after. The §13 network-copyleft clause makes most cloud-cloning impractical: the moment a competitor runs a hosted fork, they have to share the source of whatever they changed, so their differentiator is public. I kept the defensive moat; I shed the friction. On top of that, AGPL leaves dual-licensing on the table, the same playbook that funds Plausible and Cal.com (AGPL for the community, paid commercial license for clients who can't comply with §13). With BUSL, that revenue path was already closed: BUSL is itself the commercially restricted license, so there's nothing to upgrade away from.

The lesson, if you're building a portfolio project, is uncomfortable: license choice is a discoverability decision, not just a legal one. Awesome lists, GitHub Topics, contributor pipelines — all gated by the OSI definition. Pick the one that opens doors, not closes them.

How to actually decide

Three questions, in order:

1. Library or product? Library → Apache 2.0. Product → keep reading.
2. Will you monetize someday? Yes → AGPL-3.0. No → MIT or Apache.
3. Are your future customers mostly enterprises with strict no-AGPL policies? Some big companies (Google, famously) ban AGPL internally. If your TAM is enterprise, lean Apache.
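The three questions read naturally as a small decision function. This Python sketch (hypothetical name) only encodes the rules of thumb from this post, in the order given; it is not legal advice and ignores everything the post leaves out.

```python
def suggest_license(is_library: bool, may_monetize: bool, enterprise_customers: bool) -> str:
    """Walk the three questions in order and return the suggested license."""
    if is_library:
        return "Apache-2.0"          # question 1: libraries stay permissive
    if not may_monetize:
        return "MIT or Apache-2.0"   # question 2: no monetization planned
    if enterprise_customers:
        return "Apache-2.0"          # question 3: buyers with no-AGPL policies
    return "AGPL-3.0"


# A solo-dev product that might be monetized and is not aimed at enterprises.
print(suggest_license(is_library=False, may_monetize=True, enterprise_customers=False))
# AGPL-3.0
```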

For most solo-dev side projects: AGPL-3.0. It's real open source, qualifies for awesome-list submissions, attracts contributors, and keeps the dual-licensing door open if you ever decide to monetize. That's the honest default.

I picked BUSL-1.1 first, switched to AGPL-3.0 two weeks later, and watched the discovery dynamics flip in the same week. The shorter version of this whole post: pick AGPL, save yourself the relicensing.

Originally published on vasyl.blog.

I Quit Designing Data-Intensive Applications (DDIA) Three Times. Here's What I Built on the Fourth Try.

Vasyl — Wed, 22 Apr 2026 05:14:01 +0000

In 2023 I bought DDIA on Kindle. Opened the replication chapter. Quit after 40 pages and didn't open it for six months.

In 2024 I bought it again, because the book is clearly worth finishing. Got to page 80. Closed it.

In 2025 I tried a third time with ChatGPT open in another tab to explain the hard terms. It got easier. But every lookup was the same loop — alt-tab, paste the sentence, wait, come back, find my place. After three chapters I wasn't really reading the book anymore. I was reading my own habit of switching tabs.

The book still sits in my Kindle library, marked unfinished. If you have a book like that on your shelf, this post is for you. I finally figured out why I kept quitting, and built a tool that fixes it for me. Maybe it fixes it for you too.

What was actually breaking

When I quit for the third time, I sat down and tried to be honest about what was stopping me.

It wasn't that the book was too hard. I understood most of what was on the page. The problem was the rest — the unfamiliar terms.

Every unknown term forced a decision between two bad options.

Option one: stop and look it up. Alt-tab, paste the sentence, wait, come back, find my place. Flow broken. The next paragraph is harder to hold in your head.

Option two: skip it and hope context saves me. Sometimes it does. But after a dozen skips in a chapter, the quality of my reading drops noticeably. And each "I'll figure it out later" turns into debt.

The exhaustion wasn't coming from reading. It was coming from the constant small decisions.

There was a third problem too. Even when I did look something up, a week later I'd forgotten it. ChatGPT doesn't remember you asked. Anki remembers, but making cards by hand is its own pile of friction. I was learning words in order to forget them. And reading books in order to quit them.

What I got wrong about AI and reading

When ChatGPT arrived, a lot of people thought long books were dead. Why read 600 pages of DDIA when you can ask and get a summary in a minute?

I believed that for about a year.

Then I sat in a 2025 interview being asked about replication strategies in distributed systems, and realized I couldn't explain the difference between synchronous and asynchronous replication past surface-level buzzwords. I'd read dozens of summaries, listened to podcasts, watched YouTube breakdowns. I knew things on the surface. I didn't understand any of them deeply.

For staying current, summaries are fine. For real understanding, nothing replaces sitting with a book that someone spent years structuring. Those are exactly the books I kept quitting around page 40.

What I built

In January 2026 I started building what became TextStack — a reader where I could read technical books without the tab switching.

The idea is simple. Tap a word you don't know. An explanation appears inline — not a dictionary entry, but a short concept explanation from Claude that takes into account what the book is about and what the sentence is doing. For everyday words, a short translation. For technical terms like RLHF, attention mechanism, or eventual consistency — two or three sentences on what it is and why it matters, with links to related ideas and common confusions.

The word goes into a personal dictionary automatically. But not the way LingQ does it, where your review queue grows to hundreds of items and you quit the app. I built a filter — only words from roughly the top 15,000 English words by frequency, or technical terms, enter spaced repetition. The rest are saved as reference. The weekly review queue is capped, so it never spirals.
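The filter and the cap can be sketched in a few lines. This is Python with hypothetical names, not TextStack's code. The 15,000 rank cutoff comes from the paragraph above; the weekly cap is passed in as a parameter because the post does not state its value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SavedWord:
    text: str
    frequency_rank: Optional[int]  # None when the word is not in the frequency list
    is_technical: bool = False


def enters_srs(word: SavedWord, max_rank: int = 15_000) -> bool:
    """Common-enough words and technical terms are reviewed; the rest stay as reference."""
    in_top = word.frequency_rank is not None and word.frequency_rank <= max_rank
    return in_top or word.is_technical


def weekly_queue(saved: list[SavedWord], cap: int) -> list[SavedWord]:
    """Eligible words in saved order, truncated to the weekly cap."""
    return [w for w in saved if enters_srs(w)][:cap]


saved = [
    SavedWord("warehouse", 4_200),
    SavedWord("eventual consistency", None, is_technical=True),
    SavedWord("antidisestablishmentarianism", 180_000),
    SavedWord("quorum", 14_900),
]
print([w.text for w in weekly_queue(saved, cap=2)])
# ['warehouse', 'eventual consistency']
```

The frequency ranks in the sample are invented for illustration.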

Over three and a half months I put together a working version on .NET 10, React, and React Native. PostgreSQL, Claude API for explanations, Edge TTS for audio, offline PWA. It ingests EPUB, PDF, and FB2. The catalog started wide, but I'm pruning it hard — I'm realizing focus matters more than I thought.

It lives at textstack.app — full pitch at the end of this post.

What I got wrong for three months

For the first three months I was building for an abstract "non-native English speaker who wants to read books." Nobody needs that.

In April I looked at it honestly and asked who I'd actually built it for. The answer was: a developer trying to read AI engineering books. Because that's what I'd been trying to read for two years. Chip Huyen's AI Engineering. Hands-On Large Language Models. Designing Machine Learning Systems. Building Agentic AI Systems. Prompt Engineering for LLMs. I bought all of them. I finished none.

When I looked at other developers' reading lists online, I saw I wasn't alone. A lot of developers are trying to move into AI engineering right now. We're all reading the same books, and a lot of us aren't finishing them.

This isn't a generic "non-native English" problem. It's a specific problem for a specific group going through a specific career transition.

So I'm pivoting. Not "a reader for everyone." A reader for developers learning AI engineering. A narrow niche where I'm already the user.

The next six months

Four things.

1. Rebuild the product around the AI angle. Trim the catalog to 15–20 AI engineering books. Rewrite the homepage. Shift the framing from translation to explanation. Improve the prompts for technical terms.

2. Actually start reading. Hands-On LLMs in May. AI Engineering in June and July. Building Agentic AI Systems in August. Not as a task — as something I want. I want to work as an AI engineer in two years, and the only way there is through these books. I'll read them inside TextStack, because if it doesn't work for me, it won't work for anyone.

3. Write about the process. This is the first post. If you want to follow along, the blog has RSS.

4. Find the first paying customer.

I'll say it openly: if in six months there's one stranger paying for TextStack, I'll consider this project a success regardless of the other numbers. The first dollar from someone you don't know is a threshold most solo devs never cross. Crossing it is a big part of the work of leaving employment.

Try it

Live at textstack.app: you can open a sample chapter of The Pragmatic Programmer or Hands-On LLMs without signing up.

If you're in a similar spot — non-native dev, bought the AI engineering books, didn't finish them — send me a note. Twitter: @Rexetdeus. Email on the site. I'll give you early access and listen to what works and what doesn't. In exchange I need honest feedback.

If it's not your thing, thanks for reading this far. If someone you know is stuck on Chapter 3 of AI Engineering, maybe forward them this post.

P.S.

One more thing. This problem — quitting hard books at page 40 — isn't really about English and isn't really about AI. It's that reading tools are stuck in the early 2010s while content has gotten much denser.

Kindle Word Wise is from 2014, and it still shows single-word definitions that can't handle eventual consistency or attention mechanism. LingQ has been showing translations and adding words to SRS for close to two decades, and the core experience hasn't really changed. Readlang was a clever browser extension in 2013; development stopped when the founder went to Duolingo.

Modern books need different tools. Not dictionaries — explanations. Not infinite queues — capped ones. Not one experience for everyone — context-aware understanding.

That's the opening I'm walking into. I'll let you know in six months how it went.

First post in a series about building TextStack as an AI engineering books reader. Star the repo if you want to follow along: github.com/mrviduus/textstack · textstack.app