<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jurij Tokarski</title>
    <description>The latest articles on DEV Community by Jurij Tokarski (@jurijtokarski).</description>
    <link>https://dev.to/jurijtokarski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F883676%2F842c240f-62c1-41f4-ac3d-ac3e1a52a6d9.jpeg</url>
      <title>DEV Community: Jurij Tokarski</title>
      <link>https://dev.to/jurijtokarski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jurijtokarski"/>
    <language>en</language>
    <item>
      <title>Null Bytes, Dead Streams, Last Chunk</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Fri, 24 Apr 2026 08:37:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/null-bytes-dead-streams-last-chunk-4d4p</link>
      <guid>https://dev.to/jurijtokarski/null-bytes-dead-streams-last-chunk-4d4p</guid>
      <description>&lt;p&gt;Streaming LLM output to a browser means wiring together SSE, TCP, fetch, and browser lifecycle APIs that weren't designed for this combination. Each one has constraints that only surface when you integrate them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Parser That Choked
&lt;/h2&gt;

&lt;p&gt;Server-Sent Events is the natural choice for streaming. SSE supports multiple event types via the &lt;code&gt;event:&lt;/code&gt; field and handles multiline JSON by splitting across &lt;code&gt;data:&lt;/code&gt; lines. But when every chunk needs an &lt;code&gt;event:&lt;/code&gt; line, one or more &lt;code&gt;data:&lt;/code&gt; lines, and a blank line delimiter — and you're sending hundreds of small text fragments interleaved with structured tool call events — the framing adds up and the parser becomes more complex than the problem requires.&lt;/p&gt;
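&lt;p&gt;To make the framing cost concrete, here is the same illustrative event in both framings. The payload is made up for the comparison:&lt;/p&gt;

```javascript
// The same illustrative event in SSE framing and in a null-byte
// framing: roughly 25 bytes of framing per event vs. one byte.
const payload = JSON.stringify({ type: 'tool_call', name: 'search' });

const sseFrame = 'event: tool_call\n' + 'data: ' + payload + '\n\n';
const nullFrame = payload + '\u0000';
```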

&lt;p&gt;A null byte as the delimiter is simpler. &lt;code&gt;\0&lt;/code&gt; is rare enough in practice — it can appear as &lt;code&gt;\u0000&lt;/code&gt; in JSON but almost never does in LLM output or natural language — that it works as a reliable record separator without escaping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Server: wrap each event&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Client: split and route&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// keep the incomplete trailing segment&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;part&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;part&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;handleEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;part&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each event is a JSON object with a &lt;code&gt;type&lt;/code&gt; field — &lt;code&gt;text_chunk&lt;/code&gt;, &lt;code&gt;tool_call&lt;/code&gt;, &lt;code&gt;tool_result&lt;/code&gt;, &lt;code&gt;done&lt;/code&gt;. The client splits on null bytes, parses each segment, routes by type. Text chunks accumulate in the UI. Tool events trigger loading states or commit structured data.&lt;/p&gt;
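&lt;p&gt;As a sketch, that routing step can be a plain switch on the &lt;code&gt;type&lt;/code&gt; field. The &lt;code&gt;ui&lt;/code&gt; object and its field names here are illustrative, not from the original client:&lt;/p&gt;

```javascript
// Minimal event router over the event shapes described above.
// The `ui` object and its fields are illustrative placeholders.
function routeEvent(event, ui) {
  switch (event.type) {
    case 'text_chunk':
      ui.text += event.content;    // accumulate streamed text in the UI
      break;
    case 'tool_call':
      ui.loading = true;           // trigger a loading state
      break;
    case 'tool_result':
      ui.loading = false;
      ui.results.push(event.data); // commit structured data
      break;
    case 'done':
      ui.finished = true;
      break;
    // unknown types are ignored so new server events don't break old clients
  }
}
```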

&lt;h2&gt;
  
  
  The Stream That Stopped Talking
&lt;/h2&gt;

&lt;p&gt;TCP keepalive keeps a connection open. It doesn't tell you the connection has gone silent at the application level. Occasionally — maybe once every few hundred sessions — a stream stops mid-sentence. No error event. No close event. The connection is alive, the response is still "streaming," and the user is staring at a half-finished message with a spinner that will never resolve.&lt;/p&gt;

&lt;p&gt;The LLM API hasn't errored — it just stopped sending chunks.&lt;/p&gt;

&lt;p&gt;An idle timer catches this. Reset it on every incoming chunk. Fire it if silence crosses a threshold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;idleTimer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resetIdleTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;clearTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;idleTimer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;idleTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;resetIdleTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;controller&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;processChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;clearTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;idleTimer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thirty seconds is generous for interactive chat — users notice after five. The threshold isn't the important part. The pattern is: connection-level timeouts don't catch application-level silence. You need to track it yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chunk That Vanished
&lt;/h2&gt;

&lt;p&gt;Browsers kill in-flight &lt;code&gt;fetch()&lt;/code&gt; calls during page unload. If you stream audio in chunks via POST, the final chunk — whatever is still buffered when the user stops recording or closes the tab — lives in memory until the next flush. That flush never happens. The final segment of every session is silently dropped.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Killed on page close:&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v3/audio/stream_chunk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Survives:&lt;/span&gt;
&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v3/audio/stream_chunk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;keepalive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;await&lt;/code&gt;. No &lt;code&gt;.then()&lt;/code&gt;. You can't await a response during unload — any result is swallowed. Fire and forget. The browser queues the request and completes it even after the page is gone, as long as the total payload is under ~64KB.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;navigator.sendBeacon()&lt;/code&gt; survives unload too, but it doesn't support custom headers. If your backend expects an auth header, &lt;code&gt;fetch({ keepalive: true })&lt;/code&gt; gives you the full request API.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gaps Between Protocols
&lt;/h2&gt;

&lt;p&gt;Every integration has these. You wire together two or three tools that work fine on their own, but nobody tested them together, and no documentation covers the seams. The workarounds aren't published as best practices; they build up as know-how, one project at a time. These are three of mine from LLM streaming.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
    <item>
      <title>200 OK, Data Wrong</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Tue, 21 Apr 2026 12:23:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/200-ok-data-wrong-3pk1</link>
      <guid>https://dev.to/jurijtokarski/200-ok-data-wrong-3pk1</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/jurij/p/production-bugs-that-never-threw-an-error"&gt;The Production Bugs That Never Threw an Error&lt;/a&gt; was about systems that reported success while running the wrong thing — stale tokens, cached artifacts, stripped paths. These five are different. Every one is an API that accepted valid input, returned a clean 200, and delivered the wrong output. The call worked. The result didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Image That Wasn't What I Asked For
&lt;/h2&gt;

&lt;p&gt;Imagen has a prompt rewriter enabled by default — an LLM that rewrites your prompt before generation to "add more detail and deliver higher quality images." The rewritten version is only returned in the API response if your original prompt is under 30 words. Above that threshold, you get an image generated from a prompt you never see.&lt;/p&gt;

&lt;p&gt;The image I got back was valid, well-composed, and completely wrong. The main subject was replaced by something adjacent. The response was 200. No flag, no warning, no indication that the input was rewritten. I assumed the safety filter had intervened — but the &lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/image/responsible-ai-imagen" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; says safety filters either block with an error or omit images entirely. They don't silently substitute. The prompt rewriter does.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;enhancePrompt: false&lt;/code&gt; in the request disables it. After that, the images matched the prompt.&lt;/p&gt;
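&lt;p&gt;For reference, a sketch of the request payload with the rewriter off. The field names follow my reading of the Vertex AI Imagen predict API; verify the exact shape against the current reference before relying on it:&lt;/p&gt;

```javascript
// Sketch of a Vertex AI Imagen predict request body with the prompt
// rewriter disabled. Field names are an assumption from the docs,
// not copied from the original post.
function buildImagenRequest(prompt) {
  return {
    instances: [{ prompt }],
    parameters: {
      sampleCount: 1,
      enhancePrompt: false, // disable the silent LLM prompt rewrite
    },
  };
}
```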

&lt;h2&gt;
  
  
  The File Search That Found Nothing
&lt;/h2&gt;

&lt;p&gt;For retrieval I was uploading source documents via the Files API with file search enabled. One batch worked correctly. Another batch would upload without error but return no results in search.&lt;/p&gt;

&lt;p&gt;The difference was the filename. The batch that failed was uploaded with a generic name — something like &lt;code&gt;upload_1&lt;/code&gt; — with no extension. File search uses the filename to infer content type before indexing. A file without a recognized extension gets indexed as an unknown type, and the failure is silent enough that it looks like a search quality issue rather than an upload problem.&lt;/p&gt;

&lt;p&gt;Adding &lt;code&gt;.pdf&lt;/code&gt;, &lt;code&gt;.txt&lt;/code&gt;, or the correct extension to every filename at upload time fixed retrieval immediately.&lt;/p&gt;
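&lt;p&gt;A small guard at the upload boundary prevents the whole class of bug. This helper is hypothetical, not from the original code; the MIME map is deliberately tiny:&lt;/p&gt;

```javascript
// Hypothetical guard: make sure every uploaded filename carries an
// extension the file search indexer can recognize. Extend the MIME
// map for whatever types your upload path accepts.
const EXT_BY_MIME = {
  'application/pdf': '.pdf',
  'text/plain': '.txt',
  'text/markdown': '.md',
};

function ensureExtension(name, mimeType) {
  if (/\.[A-Za-z0-9]+$/.test(name)) return name; // already has one
  const ext = EXT_BY_MIME[mimeType];
  return ext ? name + ext : name; // unknown type: leave as-is
}
```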

&lt;h2&gt;
  
  
  The Transcription That Came Back as Garbage
&lt;/h2&gt;

&lt;p&gt;A voice dictation feature recorded audio in the browser and sent the blob to a Lambda Function URL behind CloudFront. The Lambda passed it to Whisper. The transcription came back — but it was nonsense. No error, no rejection, just wrong text.&lt;/p&gt;

&lt;p&gt;Lambda Function URLs base64-encode binary request bodies at the HTTP interface layer. The event includes an &lt;code&gt;isBase64Encoded&lt;/code&gt; flag, but if you treat the body as raw bytes in all cases, the buffer is silently corrupted. Whisper doesn't throw on bad audio — it produces garbage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isBase64Encoded&lt;/span&gt;
  &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any Lambda that accepts binary payloads — audio, images, PDFs — needs to check that flag before consuming the body. The cost of missing it is not an error. It's wrong output that looks like a model quality issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Search Console That Had No Traffic
&lt;/h2&gt;

&lt;p&gt;I wired up Google Search Console data fetching for a site with real traffic — I could see it in the GSC web UI. The API call went through, no errors, no 403. It returned zero rows.&lt;/p&gt;

&lt;p&gt;The site was registered as a domain property. Domain properties require &lt;code&gt;sc-domain:example.com&lt;/code&gt; as the &lt;code&gt;siteUrl&lt;/code&gt;, not &lt;code&gt;https://example.com&lt;/code&gt;. The API doesn't say "wrong format" or "property not found." It returns empty data as if the site has zero search traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Returns empty data, no error&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;webmasters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;searchanalytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;siteUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;requestBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;endDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;dimensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;query&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;page&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Returns actual data&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;webmasters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;searchanalytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;siteUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sc-domain:example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;requestBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;endDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;dimensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;query&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;page&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calling &lt;code&gt;sites.list()&lt;/code&gt; shows the exact format the API expects. I spent time checking date ranges and service account permissions before running that call and seeing &lt;code&gt;sc-domain:&lt;/code&gt; staring back at me.&lt;/p&gt;
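&lt;p&gt;The fix is mechanical enough to encode. This normalizer is a hypothetical helper, not part of the googleapis client:&lt;/p&gt;

```javascript
// Hypothetical normalizer: GSC domain properties need the
// 'sc-domain:' prefix; URL-prefix properties use the full origin.
function toSiteUrl(site, propertyType) {
  if (propertyType === 'domain') {
    // strip any scheme and trailing slash, then add the prefix
    return 'sc-domain:' + site.replace(/^https?:\/\//, '').replace(/\/$/, '');
  }
  return site; // URL-prefix property: pass the origin through unchanged
}
```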

&lt;h2&gt;
  
  
  Silent Truncation in Structured Outputs
&lt;/h2&gt;

&lt;p&gt;With &lt;code&gt;json_schema&lt;/code&gt; and &lt;code&gt;strict: true&lt;/code&gt;, OpenAI guarantees valid JSON — except when the response hits &lt;code&gt;max_output_tokens&lt;/code&gt;. When that happens, the stream ends with truncated JSON and &lt;code&gt;response.status&lt;/code&gt; set to &lt;code&gt;'incomplete'&lt;/code&gt;. This is not surfaced as an error. &lt;code&gt;response.completed&lt;/code&gt; still fires normally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;response.completed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;incomplete&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;incomplete_details&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Response truncated&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;onChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;internal.error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The AI response was too long and got cut off.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;onChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;internal.finished&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nf"&gt;captureUsageStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one had a bonus bug that made it harder to find. Before OpenAI supported structured output streaming, I used XML-like tags in the prompt to get parseable responses — &lt;code&gt;&amp;lt;next_action&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;message&amp;gt;&lt;/code&gt;, that kind of thing. When structured outputs shipped, I switched to &lt;code&gt;json_schema&lt;/code&gt; but left the XML parser in the catch branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jsonError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;parseXMLResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// left in "just in case"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a truncated JSON response hit this code, &lt;code&gt;JSON.parse&lt;/code&gt; failed, the catch branch fired, and the XML parser found no tags. It returned &lt;code&gt;nextAction: null&lt;/code&gt; with the entire raw JSON string stuffed into the message field. The failure surfaced as a null-check bug three layers downstream — not as a parser problem. Dead code from a previous architecture, silently eating every truncation error.&lt;/p&gt;

&lt;h2&gt;
  
  
  What These Five Have in Common
&lt;/h2&gt;

&lt;p&gt;Every failure surfaced downstream as something that didn't look like an API problem. Corrupted audio looked like a model quality issue. Empty GSC results looked like a permissions problem. Truncation looked like a null-check bug three layers away. The API boundary said success, and the real problem hid behind that signal.&lt;/p&gt;

&lt;p&gt;The only reliable defense is asserting on the output, not the status code — the kind of thing a &lt;a href="https://dev.to/production/code-audit"&gt;code audit&lt;/a&gt; catches systematically. Check that the image matches the prompt. Check that the buffer is actually binary. Check that the response has rows. Check that the JSON is complete. If you only verify that the call succeeded, you'll find the failure when your users do.&lt;/p&gt;
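&lt;p&gt;The simplest form of that defense is a guard that treats an empty-but-200 result as a failure. Names here are illustrative:&lt;/p&gt;

```javascript
// Minimal output assertion: an empty result from an API that should
// have data is a failure, regardless of the status code.
function assertNonEmpty(rows, context) {
  if (!Array.isArray(rows) || rows.length === 0) {
    throw new Error('Empty result from ' + context + ' despite a 200 response');
  }
  return rows;
}
```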

</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>Filling Forms No Tool Can Template</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Thu, 16 Apr 2026 10:11:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/filling-forms-no-tool-can-template-2dpg</link>
      <guid>https://dev.to/jurijtokarski/filling-forms-no-tool-can-template-2dpg</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/jurij/p/it-works-you-cant-ship-it"&gt;It Works, But You Can't Ship It&lt;/a&gt; covered the compliance wall — code execution sandboxes can fill DOCX forms, but they only exist in regions that don't match every customer's data residency policy. This post covers what I learned building the feature before that discovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Every Form Is Different
&lt;/h2&gt;

&lt;p&gt;Tender response forms have nothing in common with each other. One agency sends a form with merged table cells and numbered question blocks. The next sends checkboxes inside conditional formatting with section breaks in unexpected places. There is no shared structure, no recurring field names, no predictable layout.&lt;/p&gt;

&lt;p&gt;Every DOCX templating tool I evaluated — &lt;code&gt;docx-templates&lt;/code&gt;, &lt;code&gt;easy-template-x&lt;/code&gt;, &lt;code&gt;docxtemplater&lt;/code&gt; — works the same way: you prepare a template with &lt;code&gt;{variable_name}&lt;/code&gt; placeholders, pass in data, get a rendered document. That assumes you control the template. Tender forms come from government agencies. You don't control anything. You can't insert placeholders into a form you receive the day the tender opens.&lt;/p&gt;
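&lt;p&gt;The templating model all three libraries share, reduced to its essence as a toy implementation (this is not any library's actual API):&lt;/p&gt;

```javascript
// Toy placeholder renderer showing the shared model: you control the
// template, and data fills named slots. Unmatched slots are left intact.
function renderTemplate(template, data) {
  return template.replace(/\{(\w+)\}/g, (match, key) => {
    return key in data ? String(data[key]) : match;
  });
}
```

&lt;p&gt;The whole approach presupposes a &lt;code&gt;{name}&lt;/code&gt; slot someone put there in advance, which is exactly what a received tender form never has.&lt;/p&gt;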

&lt;p&gt;Filling these forms requires understanding an arbitrary document's structure, finding the insertion points, and knowing what content goes where. That's not a deterministic templating problem. It's a comprehension problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  DOCX Is a ZIP of XML
&lt;/h2&gt;

&lt;p&gt;My first &lt;a href="https://dev.to/production/poc-in-2-weeks"&gt;PoC&lt;/a&gt; tried the next obvious thing: extract the DOCX to markdown, send it to the model with draft content, get back filled markdown, convert to a new DOCX. Clean pipeline, completely useless output. The regenerated document lost every merged cell, every checkbox, every conditional format. The output was a different document that happened to contain similar text.&lt;/p&gt;

&lt;p&gt;A DOCX file is a ZIP archive of XML. &lt;code&gt;word/document.xml&lt;/code&gt; holds the content in OOXML format. The correct approach is to give the model the original binary, let it read the XML, find the insertion points, write modifications back, and save the modified ZIP. XML surgery on the original file — not regeneration.&lt;/p&gt;
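
&lt;p&gt;You can verify the ZIP claim directly. A sketch that checks a buffer's magic bytes, useful as a sanity check before handing a file to any DOCX tooling:&lt;/p&gt;

```javascript
// Every DOCX, like any ZIP archive, begins with the local-file-header
// signature 0x50 0x4B 0x03 0x04 ('PK\x03\x04').
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04];

function looksLikeZip(buf) {
  if (buf.length === 0) return false;
  // Short buffers fail automatically: buf[i] is undefined past the end.
  return ZIP_MAGIC.every((byte, i) => buf[i] === byte);
}
```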

&lt;p&gt;That's the only reliable way to fill DOCX forms with AI — operate on the XML, not on a lossy text conversion. And the only way to do this through an API, without deploying a separate Python service, is a code execution sandbox. Both OpenAI's &lt;code&gt;code_interpreter&lt;/code&gt; and Anthropic's code execution tool provide a sandboxed Python environment where &lt;code&gt;python-docx&lt;/code&gt; is available and the model can operate on the file directly.&lt;/p&gt;

&lt;p&gt;Once that architecture clicked, the API quirks started.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI: The File Goes in the Container
&lt;/h2&gt;

&lt;p&gt;My first attempt passed the uploaded DOCX as an &lt;code&gt;input_file&lt;/code&gt; content block in the user message — the pattern you'd use for images or PDFs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Expected context stuffing file type to be a supported format... but got .docx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Context stuffing only supports PDFs, images, and plain text. The file has to go into the &lt;code&gt;code_interpreter&lt;/code&gt; container instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code_interpreter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;auto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;file_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;uploadedFile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The message itself is plain text — you tell the model the filename so it knows what to look for in the sandbox. No file reference in the content block at all.&lt;/p&gt;

&lt;p&gt;Getting the filled file back had its own problem. The SDK exposes &lt;code&gt;client.containers.files.content&lt;/code&gt;, which looks callable. It isn't — it's a resource object. The working call is &lt;code&gt;client.containers.files.content.retrieve(containerId, fileId)&lt;/code&gt;. Neither the types nor the error message make this obvious. I found it by running &lt;code&gt;Object.getOwnPropertyNames&lt;/code&gt; on the object at runtime.&lt;/p&gt;
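
&lt;p&gt;The same runtime-introspection trick works on any opaque SDK object. A sketch with a stand-in resource class (not the real OpenAI SDK):&lt;/p&gt;

```javascript
// Stand-in for an SDK resource object whose methods live on the prototype,
// so they do not show up in console.log or Object.keys on the instance.
class ContentResource {
  retrieve(containerId, fileId) {
    return { containerId, fileId };
  }
}

// List the callable surface of an instance by walking its prototype.
function listMethods(obj) {
  return Object.getOwnPropertyNames(Object.getPrototypeOf(obj))
    .filter((name) => name !== 'constructor');
}
```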

&lt;h2&gt;
  
  
  Anthropic: Five Things at Once
&lt;/h2&gt;

&lt;p&gt;Claude has the same capability, but it requires five specific pieces in a single request. Miss any one and you get a cryptic failure.&lt;/p&gt;

&lt;p&gt;The file upload needs an explicit MIME type — not inferred from the extension. The API call needs two beta flags active simultaneously: &lt;code&gt;files-api-2025-04-14&lt;/code&gt; and &lt;code&gt;code-execution-2025-08-25&lt;/code&gt;. The file must be referenced as &lt;code&gt;container_upload&lt;/code&gt; in the content block — not &lt;code&gt;document&lt;/code&gt;, not &lt;code&gt;file&lt;/code&gt;. The tool declaration needs the full versioned type string &lt;code&gt;code_execution_20250825&lt;/code&gt;. And the download call needs the same beta flags passed again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;betas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;files-api-2025-04-14&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code-execution-2025-08-25&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userMessage&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;container_upload&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;uploaded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code_execution_20250825&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code_execution&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No single documentation page covers all five requirements together. Each piece is documented somewhere. The combination isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT-5.1 Broke the Document
&lt;/h2&gt;

&lt;p&gt;I tested the form-filling flow across GPT-5.1, 5.2, and 5.4, on both Azure OpenAI and the public API.&lt;/p&gt;

&lt;p&gt;GPT-5.1 on Azure — our production deployment — wrote code that opened the DOCX but ignored formatting preservation entirely. Merged cells collapsed, checkboxes vanished, section breaks shifted. The output was a broken document. Same result on the public API — not an infrastructure issue, a model capability issue. GPT-5.2 was inconsistent: partially filled on one test, failed on the next. GPT-5.4 was the first in the lineup that reliably understood the OOXML structure, applied targeted modifications with &lt;code&gt;python-docx&lt;/code&gt;, and returned a valid binary with all formatting preserved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Every Claude Model Could Do It
&lt;/h2&gt;

&lt;p&gt;After the GPT results I tested Claude 4.6 — Opus, Sonnet, and Haiku — through Anthropic's code execution sandbox. Opus and Sonnet completed the task cleanly. The OOXML structure stayed intact, insertions landed in the right cells, formatting survived the round trip. Haiku was inconsistent — similar to GPT-5.2, partially filling on some runs and failing on others.&lt;/p&gt;

&lt;p&gt;The gap was stark. GPT-5.1 couldn't preserve the structure at all; Claude Opus and Sonnet preserved it reliably. The model version matters more than the provider for this task. But which model you can actually deploy depends on where your customer's data is allowed to live: a &lt;a href="https://dev.to/discovery/tech-strategy"&gt;tech strategy&lt;/a&gt; decision, not a code decision. &lt;a href="https://dev.to/jurij/p/it-works-you-cant-ship-it"&gt;I didn't get to ship the one that worked&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>devjournal</category>
      <category>softwaredevelopment</category>
      <category>tooling</category>
    </item>
    <item>
      <title>SVG Animation Is Not DOM Animation</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:00:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/svg-animation-is-not-dom-animation-56jj</link>
      <guid>https://dev.to/jurijtokarski/svg-animation-is-not-dom-animation-56jj</guid>
      <description>&lt;p&gt;I had a bar chart race sitting in a private repo for over five years. A coding challenge from 2020 or so — built it, moved on, forgot about it. When I started building the &lt;a href="https://dev.to/toolkit"&gt;toolkit&lt;/a&gt; on varstatt.com — free browser-based dev tools — it seemed like an obvious candidate to resurrect.&lt;/p&gt;

&lt;p&gt;The new version would be React with SVG, part of a suite: bar chart race, line chart race, area chart race, bubble chart race. Same idea, four visualizations. Upload a CSV, watch the data animate.&lt;/p&gt;

&lt;p&gt;Every animation technique I reached for broke in a way I didn't expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  CSS Transitions Do Nothing on Geometric Attributes
&lt;/h2&gt;

&lt;p&gt;First attempt on the line chart: CSS transitions on SVG elements. &lt;code&gt;transition: cx 300ms ease, cy 300ms ease&lt;/code&gt; on the &lt;code&gt;&amp;lt;circle&amp;gt;&lt;/code&gt; dots tracking data points. Expected smooth interpolation between positions.&lt;/p&gt;

&lt;p&gt;The dots snapped. No animation. Chrome, Firefox, same result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="c"&gt;/* This does nothing useful */&lt;/span&gt;
&lt;span class="nt"&gt;circle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;transition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cx&lt;/span&gt; &lt;span class="m"&gt;300ms&lt;/span&gt; &lt;span class="n"&gt;ease&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cy&lt;/span&gt; &lt;span class="m"&gt;300ms&lt;/span&gt; &lt;span class="n"&gt;ease&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CSS transitions animate CSS properties. &lt;code&gt;cx&lt;/code&gt;, &lt;code&gt;cy&lt;/code&gt;, &lt;code&gt;r&lt;/code&gt;, &lt;code&gt;points&lt;/code&gt; are SVG attributes, not CSS properties. (SVG 2 redefines some geometry attributes, including &lt;code&gt;cx&lt;/code&gt; and &lt;code&gt;cy&lt;/code&gt;, as CSS properties, but browser support is inconsistent and &lt;code&gt;points&lt;/code&gt; was never included.) They live in the DOM, but the browser's animation engine doesn't see them. You can change them from JavaScript and the element moves, but there's no interpolation. It jumps.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;transform&lt;/code&gt; and &lt;code&gt;opacity&lt;/code&gt; work because those are actual CSS properties that SVG elements happen to support. Everything that describes SVG geometry — positions, sizes, path data — sits outside that system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Animation Systems on the Same Property
&lt;/h2&gt;

&lt;p&gt;The bar chart race had horizontal bars with CSS transitions on &lt;code&gt;top&lt;/code&gt; and &lt;code&gt;width&lt;/code&gt;. I set &lt;code&gt;transition: top 1000ms ease-out, width 1000ms ease-out&lt;/code&gt; and advanced frames with &lt;code&gt;setInterval&lt;/code&gt;. That worked.&lt;/p&gt;

&lt;p&gt;Then I switched playback to &lt;code&gt;requestAnimationFrame&lt;/code&gt; for continuous interpolation — a float position updating at ~60fps instead of integer jumps every second.&lt;/p&gt;

&lt;p&gt;The bars turned jittery. Every RAF tick (~16ms) set a new &lt;code&gt;top&lt;/code&gt; value. Each value restarted the 1000ms CSS transition before the previous one completed. The browser's transition engine was fighting the RAF loop. Two animation systems controlling the same property, neither finishing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// RAF updates position every ~16ms&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;BAR_HEIGHT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// CSS transition: "top 1000ms ease-out"&lt;/span&gt;
&lt;span class="c1"&gt;// Every 16ms: cancel current transition, start new 1000ms transition&lt;/span&gt;
&lt;span class="c1"&gt;// Result: jittery mess&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix was to remove every CSS transition from every element that RAF touches. Bar positions, widths, SVG coordinates, label positions — all computed directly from the playback float.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// RAF computes position directly — no CSS transition&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;interpolatedRank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;prevRank&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;nextRank&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;prevRank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;frac&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;interpolatedRank&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;BAR_HEIGHT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// style={{ top, transition: 'none' }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CSS transitions and &lt;code&gt;requestAnimationFrame&lt;/code&gt; are competing strategies.&lt;/strong&gt; They solve the same problem differently. Layering both on the same property means neither works. I ended up with zero CSS transitions on animated properties across all four chart types.&lt;/p&gt;
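
&lt;p&gt;Dropping CSS transitions means replicating the easing yourself. A minimal sketch: a standard ease-out cubic applied to the per-frame fraction before interpolating.&lt;/p&gt;

```javascript
// Once CSS transitions are gone, easing moves into the RAF loop.
// A standard ease-out cubic: fast at the start, settling at the end.
function easeOutCubic(t) {
  return 1 - Math.pow(1 - t, 3);
}

// Interpolate between two frame values with easing applied to the fraction.
function interpolate(prev, next, frac) {
  return prev + (next - prev) * easeOutCubic(frac);
}
```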

&lt;h2&gt;
  
  
  Colors That Follow Position Instead of Identity
&lt;/h2&gt;

&lt;p&gt;Bubble chart. Bubbles sorted by value each frame so the largest paints first and the smaller bubbles stay visible in front of it (correct z-order). Colors assigned by array index after sorting.&lt;/p&gt;

&lt;p&gt;Frame 1: Python is biggest, gets index 0, gets blue. Frame 2: JavaScript overtakes Python, gets index 0, gets blue. Python drops to index 1, turns orange. Every frame where the lead changes, half the bubbles swap colors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// before: color by sorted position&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;palette&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;palette&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix: build a color map keyed by series name at parse time, before any sorting happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;colorMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useMemo&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
  &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;seriesNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;palette&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;palette&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;seriesNames&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;palette&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;data.seriesNames&lt;/code&gt; preserves the original CSV column order. It never changes during playback. Sorting for z-order still happens, but it only affects render order, not color. Any visualization where items reorder needs visual properties assigned by identity, never by current array position.&lt;/p&gt;
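
&lt;p&gt;The difference is easy to demonstrate with two frames where the leader changes: identity-keyed colors stay put, index-keyed ones swap. A compressed illustration of both assignments:&lt;/p&gt;

```javascript
const palette = ['blue', 'orange'];

// Identity: color assigned once, keyed by series name. Stable across frames.
function colorsByIdentity(seriesNames) {
  const map = {};
  seriesNames.forEach((name, i) => {
    map[name] = palette[i % palette.length];
  });
  return map;
}

// Position: color re-derived from the sorted index every frame (the bug).
function colorsByPosition(frame) {
  const sorted = Object.keys(frame).sort((a, b) => frame[b] - frame[a]);
  const map = {};
  sorted.forEach((name, i) => {
    map[name] = palette[i % palette.length];
  });
  return map;
}
```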

&lt;h2&gt;
  
  
  Ranks That Re-Sort Every Tick
&lt;/h2&gt;

&lt;p&gt;Same bar chart race. I was re-sorting bars by their interpolated value on every RAF tick. Values cross each other mid-frame constantly — Python at 11.83 overtakes Java at 11.81 for one tick, then Java is back on top the next. The bars flickered between positions 60 times a second.&lt;/p&gt;

&lt;p&gt;The fix: &lt;strong&gt;compute sort order only at whole frame boundaries&lt;/strong&gt;, store it in a pre-computed array, then interpolate rank positions as floats between frames.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;frameRanks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useMemo&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
  &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;frames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;seriesNames&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ranks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="nx"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ranks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;ranks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// interpolate rank as a float&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentRanks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;frac&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;nextRanks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;frac&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;barHeight&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A bar sliding from position 3 to position 1 moves smoothly over the full frame duration instead of jumping. No flickering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Curves That Rewrite Their Own History
&lt;/h2&gt;

&lt;p&gt;The line and area charts used Catmull-Rom splines. The animation draws a line progressively — like a pen moving across the screen. Curves looked great.&lt;/p&gt;

&lt;p&gt;The problem showed up immediately: as the animation advanced and new points entered the spline, the entire line wiggled. Segments already "drawn" shifted into new positions on every frame.&lt;/p&gt;

&lt;p&gt;Catmull-Rom computes each segment's control points from the tangent at its endpoints, and the tangent at any point depends on its neighbors. Add a new neighbor, all the tangents change. Feed completed points into the spline function as the animation progresses and every frame recalculates every segment. &lt;strong&gt;The old part of the curve is never stable.&lt;/strong&gt;&lt;/p&gt;
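
&lt;p&gt;The neighbor dependence is visible in the tangent formula itself. A sketch assuming uniform parameterization, where the tangent at point &lt;code&gt;i&lt;/code&gt; is half the vector between its two neighbors:&lt;/p&gt;

```javascript
// Uniform Catmull-Rom: the tangent at interior point i is half the vector
// from its previous neighbor to its next neighbor.
function tangentAt(points, i) {
  const prev = points[i - 1];
  const next = points[i + 1];
  return { x: (next.x - prev.x) / 2, y: (next.y - prev.y) / 2 };
}

// Appending a point gives the formerly-last point a new "next" neighbor,
// changing its tangent, so the segment that ends there must be redrawn.
```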

&lt;p&gt;The fix split the work into two memos with different dependency arrays.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Phase 1: compute ALL segments from full dataset, once&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stableGeometry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useMemo&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allPoints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;frames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;xScale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="na"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;yScale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;seriesName&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;segments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;precomputeSegments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allPoints&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;allPoints&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;segments&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;height&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Phase 2: reveal progressively, split active segment with de Casteljau&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chartData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useMemo&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;segments&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stableGeometry&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;wholeIdx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;frac&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;wholeIdx&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;wholeIdx&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pathData&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;frac&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;wholeIdx&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;partial&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;splitBezierAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;wholeIdx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;frac&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;d&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pathData&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;stableGeometry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pre-compute the full curve from all data points. Completed segments render byte-identical every frame. The active segment gets split at the exact fractional position using de Casteljau subdivision. Historical geometry never depends on current playback position.&lt;/p&gt;
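
&lt;p&gt;For reference, de Casteljau subdivision fits in a dozen lines. A sketch, assuming each segment carries its four cubic control points in a &lt;code&gt;points&lt;/code&gt; field (illustrative names, not the actual implementation):&lt;/p&gt;

```javascript
// De Casteljau subdivision: repeated linear interpolation at t yields the
// control points of the partial curve from the segment start up to t.
const lerp = (a, b, t) => ({ x: a.x + (b.x - a.x) * t, y: a.y + (b.y - a.y) * t });

function splitBezierAt(segment, t) {
  const [p0, p1, p2, p3] = segment.points;
  const a = lerp(p0, p1, t);
  const b = lerp(p1, p2, t);
  const c = lerp(p2, p3, t);
  const d = lerp(a, b, t);
  const e = lerp(b, c, t);
  const end = lerp(d, e, t); // the exact on-curve point at t
  return {
    points: [p0, a, d, end],
    pathData: ` C ${a.x} ${a.y}, ${d.x} ${d.y}, ${end.x} ${end.y}`,
  };
}
```

&lt;p&gt;Because the interpolation is exact, the partial path at the fractional position lies on the original curve. No resampling, no drift.&lt;/p&gt;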

&lt;h2&gt;
  
  
  The Common Assumption
&lt;/h2&gt;

&lt;p&gt;Each problem came from expecting SVG elements to behave like DOM elements when animated. CSS transitions ignore geometric attributes. RAF and transitions fight over the same values. Array indices aren't stable identifiers when sort order changes. Spline algorithms that look local are global.&lt;/p&gt;

&lt;p&gt;The fix was the same every time: compute everything yourself, from one source of truth. Zero CSS transitions on animated properties, all positions derived from a single playback float.&lt;/p&gt;

&lt;p&gt;The four chart tools are part of &lt;a href="https://dev.to/toolkit"&gt;varstatt.com/toolkit&lt;/a&gt; — free, browser-based, no sign-up: &lt;a href="https://dev.to/toolkit/bar-chart-race"&gt;bar chart race&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/line-chart-race"&gt;line chart race&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/area-chart-race"&gt;area chart race&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/bubble-chart-race"&gt;bubble chart race&lt;/a&gt;. The old repo from 2020 bears no resemblance to what shipped.&lt;/p&gt;

</description>
      <category>frontend</category>
      <category>javascript</category>
      <category>react</category>
      <category>webdev</category>
    </item>
    <item>
      <title>45 Tabs I Stopped Opening</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Thu, 09 Apr 2026 14:45:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/45-tabs-i-stopped-opening-34n5</link>
      <guid>https://dev.to/jurijtokarski/45-tabs-i-stopped-opening-34n5</guid>
      <description>&lt;p&gt;The JWT decoder I used to reach for sent the token to a server. I noticed because I had DevTools open for something else and saw the POST. A JWT often carries user IDs, emails, roles, expiration data. I'd been pasting production tokens into a stranger's endpoint for months.&lt;/p&gt;

&lt;p&gt;That was the first tool I built for the &lt;a href="https://dev.to/toolkit"&gt;toolkit&lt;/a&gt;. The rest followed the same pattern: I needed something, the available options were ad-heavy or required sign-up or made network calls that didn't need to happen. A Base64 encoder doesn't need a backend. Neither does a regex tester, a color converter, or a hash generator.&lt;/p&gt;

&lt;p&gt;There are 45 tools now. No sign-up, no tracking, no data collection. Most run entirely in the browser — a few like DNS Lookup and SSL Checker need a server call by nature.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Catalogue
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Encoding&lt;/strong&gt; — &lt;a href="https://dev.to/toolkit/base64"&gt;Base64&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/jwt"&gt;JWT Decoder&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/image-base64"&gt;Image to Base64&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/encrypt"&gt;Encrypt / Decrypt&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/hash"&gt;Hash Generator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON &amp;amp; YAML&lt;/strong&gt; — &lt;a href="https://dev.to/toolkit/json"&gt;JSON Formatter&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/json-yaml"&gt;JSON ↔ YAML&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/yaml-validate"&gt;YAML Validator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Markdown&lt;/strong&gt; — &lt;a href="https://dev.to/toolkit/markdown"&gt;Markdown Preview&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/diff"&gt;Text Diff&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/html-to-markdown"&gt;HTML ↔ Markdown&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/markdown-pdf"&gt;Markdown to PDF&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/markdown-docx"&gt;Markdown to DOCX&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/csv-editor"&gt;CSV Editor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Images&lt;/strong&gt; — &lt;a href="https://dev.to/toolkit/qr-code"&gt;QR Code&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/barcode"&gt;Barcode&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/image-convert"&gt;Image Converter&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/favicon"&gt;Favicon Generator&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/svg-optimizer"&gt;SVG Optimizer&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/image-placeholder"&gt;Placeholder Images&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/aspect-ratio"&gt;Aspect Ratio&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design&lt;/strong&gt; — &lt;a href="https://dev.to/toolkit/mesh-gradient"&gt;Mesh Gradient&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/css-covers"&gt;CSS Cover Art&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/color"&gt;Color Converter&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/text-gradient"&gt;Text to Gradient&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Charts&lt;/strong&gt; — &lt;a href="https://dev.to/toolkit/bar-chart-race"&gt;Bar Chart Race&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/line-chart-race"&gt;Line Chart Race&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/bubble-chart-race"&gt;Bubble Chart Race&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/area-chart-race"&gt;Area Chart Race&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network&lt;/strong&gt; — &lt;a href="https://dev.to/toolkit/dns-lookup"&gt;DNS Lookup&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/cors-tester"&gt;CORS Tester&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/ssl-checker"&gt;SSL Checker&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/og-preview"&gt;OG Tag Validator&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/http-status"&gt;HTTP Status Codes&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/robots-txt"&gt;Robots.txt Validator&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/sitemap-validator"&gt;Sitemap Validator&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/user-agent"&gt;User Agent Parser&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text&lt;/strong&gt; — &lt;a href="https://dev.to/toolkit/regex"&gt;Regex Tester&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/case-converter"&gt;Case Converter&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/slug-generator"&gt;Slug Generator&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/word-counter"&gt;Word Counter&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/copy-paste-character"&gt;Copy Paste Characters&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generators&lt;/strong&gt; — &lt;a href="https://dev.to/toolkit/uuid"&gt;UUID&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/password"&gt;Password&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/crontab"&gt;Crontab&lt;/a&gt;, &lt;a href="https://dev.to/toolkit/timestamp"&gt;Unix Timestamp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most are straightforward. Three outgrew the toolkit and became standalone npm packages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Text to Gradient
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://dev.to/toolkit/text-gradient"&gt;Text to Gradient&lt;/a&gt; tool and the &lt;a href="https://dev.to/toolkit/mesh-gradient"&gt;Mesh Gradient Generator&lt;/a&gt; both needed the same thing: a way to turn an arbitrary input into a unique, stable visual. Same input, same gradient, every time. No database, no storage.&lt;/p&gt;

&lt;p&gt;A djb2-style 32-bit hash is all it takes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;textHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5381&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charCodeAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything derives from that number. &lt;code&gt;hash % palettes.length&lt;/code&gt; selects the color palette. &lt;code&gt;seededRandom(hash + layerIndex * 1000)&lt;/code&gt; generates position and opacity variation per layer. The same string always produces the same gradient — looks hand-crafted, costs nothing to store.&lt;/p&gt;
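
&lt;p&gt;A sketch of how the derivation fits together, with a mulberry32-style PRNG standing in for &lt;code&gt;seededRandom&lt;/code&gt;'s internals (any small deterministic PRNG works; the field names are illustrative):&lt;/p&gt;

```javascript
// One-shot mulberry32 — same seed in, same number out, no stored state.
function seededRandom(seed) {
  let t = (seed + 0x6d2b79f5) | 0;
  t = Math.imul(t ^ (t >>> 15), t | 1);
  t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
  return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
}

// Derive per-layer gradient parameters from the 32-bit text hash.
function gradientLayers(hash, palette, layerCount) {
  return Array.from({ length: layerCount }, (_, layerIndex) => {
    const r = (n) => seededRandom(hash + layerIndex * 1000 + n);
    return {
      color: palette[(hash + layerIndex) % palette.length],
      x: Math.round(r(0) * 100), // position in %
      y: Math.round(r(1) * 100),
      opacity: 0.6 + r(2) * 0.4,
    };
  });
}
```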

&lt;p&gt;The gradients themselves are layered &lt;code&gt;radial-gradient()&lt;/code&gt; calls. There's no &lt;code&gt;mesh-gradient()&lt;/code&gt; in CSS. What works is stacking 6-8 radial gradients positioned at organic spots — 15%, 37%, 63%, 82% — not pure corners or centers, which look algorithmic. Each one uses a &lt;code&gt;0px&lt;/code&gt; first stop for a crisp center and a &lt;code&gt;transparent&lt;/code&gt; stop around 50% for soft falloff. The browser composites them in layer order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="nt"&gt;background&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="nt"&gt;radial-gradient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;ellipse&lt;/span&gt; &lt;span class="nt"&gt;at&lt;/span&gt; &lt;span class="err"&gt;15&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="err"&gt;20&lt;/span&gt;&lt;span class="o"&gt;%,&lt;/span&gt; &lt;span class="nt"&gt;rgba&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;120&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="err"&gt;40&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="err"&gt;200&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="err"&gt;0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="err"&gt;9&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;0&lt;/span&gt;&lt;span class="nt"&gt;px&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;transparent&lt;/span&gt; &lt;span class="err"&gt;60&lt;/span&gt;&lt;span class="o"&gt;%),&lt;/span&gt;
  &lt;span class="nt"&gt;radial-gradient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;circle&lt;/span&gt; &lt;span class="nt"&gt;at&lt;/span&gt; &lt;span class="err"&gt;80&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="err"&gt;10&lt;/span&gt;&lt;span class="o"&gt;%,&lt;/span&gt; &lt;span class="nt"&gt;rgba&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;40&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="err"&gt;180&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="err"&gt;220&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="err"&gt;0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="err"&gt;8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;0&lt;/span&gt;&lt;span class="nt"&gt;px&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;transparent&lt;/span&gt; &lt;span class="err"&gt;50&lt;/span&gt;&lt;span class="o"&gt;%),&lt;/span&gt;
  &lt;span class="nt"&gt;radial-gradient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;ellipse&lt;/span&gt; &lt;span class="nt"&gt;at&lt;/span&gt; &lt;span class="err"&gt;55&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="err"&gt;75&lt;/span&gt;&lt;span class="o"&gt;%,&lt;/span&gt; &lt;span class="nt"&gt;rgba&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;200&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="err"&gt;60&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="err"&gt;120&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="err"&gt;0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="err"&gt;85&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;0&lt;/span&gt;&lt;span class="nt"&gt;px&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;transparent&lt;/span&gt; &lt;span class="err"&gt;55&lt;/span&gt;&lt;span class="o"&gt;%),&lt;/span&gt;
  &lt;span class="err"&gt;#1&lt;/span&gt;&lt;span class="nt"&gt;a0a2e&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For tinting — hover states, borders, soft fills — &lt;code&gt;color-mix()&lt;/code&gt; handles it without any HSL arithmetic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="nt"&gt;background-color&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nt"&gt;color-mix&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;in&lt;/span&gt; &lt;span class="nt"&gt;srgb&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;var&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;--accent&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;12&lt;/span&gt;&lt;span class="o"&gt;%,&lt;/span&gt; &lt;span class="nt"&gt;white&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nt"&gt;border-color&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nt"&gt;color-mix&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;in&lt;/span&gt; &lt;span class="nt"&gt;srgb&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;var&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;--accent&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="err"&gt;25&lt;/span&gt;&lt;span class="o"&gt;%,&lt;/span&gt; &lt;span class="nt"&gt;transparent&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing that cost me time: making these dynamic in Tailwind. A template literal like &lt;code&gt;bg-[color-mix(in_srgb,${color}_12%,white)]&lt;/code&gt; silently produces nothing. Tailwind's compiler scans source files for complete static strings at build time. A class assembled from a variable doesn't exist as a string when the scanner runs — it gets skipped with no warning. Inline styles are the fallback for truly dynamic values.&lt;/p&gt;
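
&lt;p&gt;The fallback looks like this (a hypothetical helper; any name works):&lt;/p&gt;

```javascript
// Tailwind can't see a class assembled at runtime, but the style prop can
// carry any computed value. toTint is a made-up helper, not toolkit code.
function toTint(color, percent) {
  return { backgroundColor: `color-mix(in srgb, ${color} ${percent}%, white)` };
}

// In JSX, pass the object straight to the style prop: style={toTint(accent, 12)}
```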

&lt;p&gt;Text to Gradient is now an &lt;a href="https://www.npmjs.com/package/text-to-gradient" rel="noopener noreferrer"&gt;npm package&lt;/a&gt;. It powers the default cover images across the site when a page has no custom visual. Those covers are also animated — which is where the next package came from.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loopkit
&lt;/h2&gt;

&lt;p&gt;Every tool, blog post, landing page, and discovery step on varstatt.com has an animated SVG cover — all powered by &lt;a href="https://dev.to/toolkit/loopkit"&gt;Loopkit&lt;/a&gt;. I had ~35 cover designs already in JSX when I started building the engine underneath them. The first decision was whether to keep composable React components or switch to schema-driven JSON.&lt;/p&gt;

&lt;p&gt;JSON won because of output flexibility. A React component locks you into JSX. A schema is data — it can render to HTML for OG images, to SVG for exports, to CSS for emails, or to React for the live site. The core engine has no React dependency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cover&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createCover&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;cover&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;       &lt;span class="c1"&gt;// full HTML with inline styles&lt;/span&gt;
&lt;span class="nx"&gt;cover&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;      &lt;span class="c1"&gt;// React style objects&lt;/span&gt;
&lt;span class="nx"&gt;cover&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerHtml&lt;/span&gt;  &lt;span class="c1"&gt;// just the elements&lt;/span&gt;
&lt;span class="nx"&gt;cover&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hoverCss&lt;/span&gt;   &lt;span class="c1"&gt;// raw CSS rules&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Phase ordering.&lt;/strong&gt; I had the cycle structured as: animate forward, hold final frame, fade out, loop. Loop restarts were smooth, but the first &lt;code&gt;play()&lt;/code&gt; call snapped instantly from the held frame to frame 0. Moving the fade to the beginning of the cycle fixed it — every iteration, including the first, starts with a reverse interpolation from wherever the animation sits, then plays forward.&lt;/p&gt;
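
&lt;p&gt;The reordered cycle reduces to a pure function of elapsed time (a sketch; names and durations are illustrative):&lt;/p&gt;

```javascript
// Fade first, then animate, then hold: every iteration, including the very
// first play(), begins by easing back toward frame 0 from the current state.
function phaseAt(elapsed, fadeDur, animDur) {
  if (elapsed >= fadeDur + animDur) return 'hold';
  if (elapsed >= fadeDur) return 'animate';
  return 'fade';
}
```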

&lt;p&gt;&lt;strong&gt;Hover exits.&lt;/strong&gt; &lt;code&gt;mouseenter&lt;/code&gt; called &lt;code&gt;play()&lt;/code&gt;, &lt;code&gt;mouseleave&lt;/code&gt; called &lt;code&gt;reset()&lt;/code&gt;. The reset snapped to the static frame — functional but mechanical. A &lt;code&gt;settle()&lt;/code&gt; method reads the live position and interpolates smoothly from there to the end state over a capped duration. The key is tracking &lt;code&gt;currentAnimElapsed&lt;/code&gt; during active animation; without it, &lt;code&gt;mouseleave&lt;/code&gt; can only snap.&lt;/p&gt;
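
&lt;p&gt;The settle computation itself is small (a sketch with made-up parameter names):&lt;/p&gt;

```javascript
// Plan a settle: start from the live position (recoverable only because
// elapsed time is tracked during playback) and cap the remaining duration.
function settlePlan(currentAnimElapsed, cycleDuration, maxSettleMs) {
  const progress = Math.min(currentAnimElapsed / cycleDuration, 1); // live 0..1 position
  const remaining = (1 - progress) * cycleDuration;
  return { from: progress, duration: Math.min(remaining, maxSettleMs) };
}
```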

&lt;p&gt;&lt;strong&gt;Stagger math.&lt;/strong&gt; In a staggered loop where each element has its own delay, the cycle duration isn't &lt;code&gt;animDuration&lt;/code&gt;. It's the time until the last element finishes, plus hold time. Using just &lt;code&gt;animDuration&lt;/code&gt; cuts off late-starting elements before they complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;lastFinish&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;computeDelay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;animate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sequence&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stagger&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;animate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;lastFinish&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;lastFinish&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cycleDuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;lastFinish&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;holdDuration&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Re-centering all 48 schemas programmatically surfaced one more problem. The centering script computes a bounding box, then shifts coordinates to align with the canvas center. Loopkit schemas use &lt;code&gt;[from, to]&lt;/code&gt; arrays for animated values — a bar animates with &lt;code&gt;y: [247, 87]&lt;/code&gt;. The bbox script was reading &lt;code&gt;[0]&lt;/code&gt;, the start value. A bar starting at y=247 with height 180 gave a 427px bounding box on a 280px canvas. The fix was one index: read &lt;code&gt;[1]&lt;/code&gt;, the end state, because that's the visual rest position.&lt;/p&gt;
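
&lt;p&gt;As a helper (name mine):&lt;/p&gt;

```javascript
// Resolve a possibly animated schema value to its visual rest position:
// [from, to] arrays rest at their end state; plain values pass through.
function restValue(value) {
  return Array.isArray(value) ? value[1] : value;
}
```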

&lt;p&gt;Loopkit is under 5KB with zero dependencies. It's an &lt;a href="https://www.npmjs.com/package/loopkit" rel="noopener noreferrer"&gt;npm package&lt;/a&gt; now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Markdown Repository
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/toolkit/markdown-repository"&gt;Markdown Repository&lt;/a&gt; began as a utility function inside this site. I query &lt;code&gt;.md&lt;/code&gt; and &lt;code&gt;.mdx&lt;/code&gt; files by frontmatter — filter by tags, sort by date, paginate. The API looks like Firestore's &lt;code&gt;where&lt;/code&gt;/&lt;code&gt;orderBy&lt;/code&gt;/&lt;code&gt;limit&lt;/code&gt; chain. Once three of my projects used the same copy-pasted code, I extracted it into an &lt;a href="https://www.npmjs.com/package/markdown-repository" rel="noopener noreferrer"&gt;npm package&lt;/a&gt;. The publish pipeline — trusted publishing with OIDC, no stored tokens — turned into &lt;a href="https://dev.to/jurij/p/npm-trusted-publishing-from-github-actions"&gt;its own post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full List
&lt;/h2&gt;

&lt;p&gt;45 tools, three npm packages. The full list is at &lt;a href="https://dev.to/toolkit"&gt;varstatt.com/toolkit&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>privacy</category>
      <category>security</category>
      <category>showdev</category>
      <category>tooling</category>
    </item>
    <item>
      <title>npm Publish Without Tokens</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Tue, 07 Apr 2026 10:35:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/npm-publish-without-tokens-4692</link>
      <guid>https://dev.to/jurijtokarski/npm-publish-without-tokens-4692</guid>
      <description>&lt;p&gt;I published an npm package last week — &lt;a href="https://www.npmjs.com/package/markdown-repository" rel="noopener noreferrer"&gt;markdown-repository&lt;/a&gt;, a Firestore-style query builder for markdown files. The code worked. The tests passed. The release pipeline took longer to get right than the package itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Way
&lt;/h2&gt;

&lt;p&gt;The standard npm publishing workflow uses a long-lived access token. You generate it on npmjs.com, store it as a GitHub Actions secret, and reference it in your workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm publish&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;NODE_AUTH_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.NPM_TOKEN }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works, but the token never expires, has write access to your packages, and is handed to every CI run as an environment variable. If it leaks — through a copied workflow file or a careless log — anyone can publish under your name.&lt;/p&gt;

&lt;p&gt;npm's granular tokens improved this slightly. You can scope them to specific packages and set a 90-day expiration. But you still have to rotate them manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trusted Publishing
&lt;/h2&gt;

&lt;p&gt;npm now supports &lt;a href="https://docs.npmjs.com/generating-provenance-statements#publishing-packages-with-provenance-via-trusted-publishing" rel="noopener noreferrer"&gt;trusted publishing with OIDC&lt;/a&gt;. Instead of a stored token, your GitHub Actions workflow proves its identity to npm using a short-lived OpenID Connect credential. npm verifies the credential against the workflow you've authorized, and accepts the publish.&lt;/p&gt;

&lt;p&gt;No token to store. No token to rotate. No token to leak.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Publish Is Manual
&lt;/h2&gt;

&lt;p&gt;Before you can configure trusted publishing, the package must already exist on the registry. npm has no "pending publisher" feature — you can't set up OIDC for a package that doesn't exist yet.&lt;/p&gt;

&lt;p&gt;For the very first version, publish from your machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm login
npm publish &lt;span class="nt"&gt;--access&lt;/span&gt; public
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I spent a while debugging my workflow before realizing trusted publishing only works from the second release onward. Once the package exists on npmjs.com, go to its settings and add a trusted publisher. From that point, the workflow handles everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Workflow
&lt;/h2&gt;

&lt;p&gt;The setup has two parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On npmjs.com&lt;/strong&gt;: go to your package settings, add a trusted publisher. Specify the GitHub org/user, repository, workflow filename, and optionally an environment name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the workflow&lt;/strong&gt;: add &lt;code&gt;id-token: write&lt;/code&gt; permission and an &lt;code&gt;environment&lt;/code&gt; that matches what you configured on npm.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Release&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;publish&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;release&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24.x&lt;/span&gt;
          &lt;span class="na"&gt;registry-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://registry.npmjs.org&lt;/span&gt;
          &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm run build&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm publish --provenance --access public&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Provenance attestation is automatic with trusted publishing. The &lt;code&gt;--provenance&lt;/code&gt; flag is redundant but makes the intent explicit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Misleading 404
&lt;/h2&gt;

&lt;p&gt;My first three releases failed with this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npm error 404 Not Found - PUT https://registry.npmjs.org/markdown-repository
npm error 404 'markdown-repository@1.1.0' is not in this registry.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The package existed. The version was correct. The OIDC token exchange succeeded — I could see the signed provenance statement in &lt;a href="https://search.sigstore.dev" rel="noopener noreferrer"&gt;Rekor's transparency log&lt;/a&gt;. Everything worked except the actual publish.&lt;/p&gt;

&lt;p&gt;The problem: &lt;strong&gt;Node 22 ships with npm 10.x. Trusted publishing requires npm 11.5.1 or later.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;npm's documentation mentions this requirement. The error message doesn't. A 404 on PUT looks like a registry problem or a package name conflict. Nothing points you toward an npm version mismatch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;Use Node 24.x in your workflow. On GitHub Actions, &lt;code&gt;node-version: 24.x&lt;/code&gt; resolves to a recent patch release whose bundled npm is 11.5.1 or later — &lt;a href="https://github.com/varstatt/markdown-repository/blob/main/.github/workflows/publish-package.yaml" rel="noopener noreferrer"&gt;markdown-repository&lt;/a&gt; publishes this way without an explicit npm upgrade.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24.x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're stuck on an older Node version, upgrade npm explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm install -g npm@latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With npm 11.5.1+, the same workflow publishes successfully. No tokens needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Environment Mismatch
&lt;/h2&gt;

&lt;p&gt;The same 404 shows up when the &lt;strong&gt;environment name&lt;/strong&gt; on npmjs.com doesn't match the &lt;code&gt;environment&lt;/code&gt; field in your workflow job. If your workflow says &lt;code&gt;environment: release&lt;/code&gt; but npm has the environment field blank (or vice versa), the OIDC claims don't match and npm rejects the publish — with a 404, not a meaningful error.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Pipeline Looks Like Now
&lt;/h2&gt;

&lt;p&gt;The full workflow for &lt;a href="https://github.com/varstatt/markdown-repository" rel="noopener noreferrer"&gt;markdown-repository&lt;/a&gt; runs lint, tests, and build on every commit. On a GitHub release, it publishes to npm with provenance — no secrets configured anywhere in the repository.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>github</category>
      <category>npm</category>
      <category>security</category>
    </item>
    <item>
      <title>Three Ways the Wrong Value Won</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Tue, 31 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/three-ways-the-wrong-value-won-49o6</link>
      <guid>https://dev.to/jurijtokarski/three-ways-the-wrong-value-won-49o6</guid>
      <description>&lt;p&gt;A user created a tender and immediately couldn't edit it. Not after a day, not after some permission change — immediately. They hit "Create," the page loaded, and the edit button was grayed out.&lt;/p&gt;

&lt;p&gt;That was the first bug. It took three fixes across two projects before I understood what connected them: in each case, the value that reached the client wasn't the value I'd computed. Something else got there first — by being faster, by being stale, or by being last in the object literal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Value That Arrived Too Early
&lt;/h2&gt;

&lt;p&gt;I pulled up the tender document in Firestore. The &lt;code&gt;ai_driver&lt;/code&gt; field was missing entirely. The frontend created tenders like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tenderData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;company_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;companyId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;...(&lt;/span&gt;&lt;span class="nx"&gt;companyData&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ai_driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;companyData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New companies had no &lt;code&gt;ai_driver&lt;/code&gt; set. The conditional spread evaluated to falsy, so the field was never written. That was supposed to be fine — a Cloud Function trigger would set the default after creation.&lt;/p&gt;
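
&lt;p&gt;The mechanism reproduces in isolation. When the guard is falsy, the spread contributes nothing, so the key never exists at all (a standalone sketch; the ternary produces the same value as the short-circuit):&lt;br&gt;
&lt;/p&gt;

```javascript
// Standalone repro: spreading a falsy guard result is a no-op, so the key
// is absent from the object, not merely undefined.
const companyData = {}; // a new company: ai_driver never set
const guard = companyData.ai_driver
  ? { ai_driver: companyData.ai_driver }
  : undefined;          // same value the short-circuit evaluates to
const tenderData = { title: "Example tender", ...guard };

console.log("ai_driver" in tenderData); // false: the field was never written
console.log(Object.keys(tenderData));   // [ 'title' ]
```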

&lt;p&gt;The Firestore snapshot listener had other plans. It fired before the Cloud Function, saw no &lt;code&gt;ai_driver&lt;/code&gt;, and ran this check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isDiscontinuedDriver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;tender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;DISCONTINUED_AI_DRIVERS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Missing field. Falsy. "Discontinued." Read-only. The user just watched their tender lock itself. Every single tender created by a new company since this code shipped had been born locked.&lt;/p&gt;

&lt;p&gt;The fix had two parts. The frontend writes every field it reads immediately after creation — no delegating defaults to triggers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tenderData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;company_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;companyId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;ai_driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;companyData&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;DEFAULT_AI_DRIVER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the discontinuation check had to distinguish "missing" from "actively deprecated":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isDiscontinuedDriver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="nx"&gt;tender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;DISCONTINUED_AI_DRIVERS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deployed both. Bug reports kept coming.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Value That Outlived Its Meaning
&lt;/h2&gt;

&lt;p&gt;Different users, same symptom. Tenders locked on creation. But these companies had &lt;code&gt;ai_driver&lt;/code&gt; explicitly set in Firestore — set to &lt;code&gt;assistants-api-gpt4o&lt;/code&gt;, a driver I'd discontinued months earlier.&lt;/p&gt;

&lt;p&gt;I traced it to the organization settings form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;aiDriver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;company&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistants-api-gpt4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That hardcoded fallback was a leftover from an earlier migration. New companies had no &lt;code&gt;ai_driver&lt;/code&gt; in Firestore, so the form loaded with a dead value nobody could see. The field wasn't even visible on the settings page — it was internal configuration, not a user-facing dropdown.&lt;/p&gt;

&lt;p&gt;The form submitted its entire state on every save. A user enables a jurisdiction toggle, hits save, and the payload includes &lt;code&gt;ai_driver: "assistants-api-gpt4o"&lt;/code&gt;. The backend guard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Truthy string passes. The discontinued driver gets written to Firestore. Every tender created after that inherits it. The user who toggled a jurisdiction setting three weeks ago has no idea they just broke tender creation for their entire organization.&lt;/p&gt;
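
&lt;p&gt;The failure reproduces with nothing but the guard itself: any non-empty string, current or dead, passes a truthiness check (a standalone sketch):&lt;br&gt;
&lt;/p&gt;

```javascript
// Standalone repro: a truthiness guard cannot tell a current driver from a
// discontinued one. Any non-empty string passes.
const payload = { ai_driver: "assistants-api-gpt4o" }; // stale value from a cached bundle
const update = {};

if (payload.ai_driver) {
  // Truthy string: the guard waves the dead value through.
  update.ai_driver = payload.ai_driver;
}

console.log(update.ai_driver); // assistants-api-gpt4o, written back to Firestore
```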

&lt;p&gt;I dropped the hardcoded fallback. Deployed. Reports kept coming — users had the old bundle cached. Every save from a cached session re-wrote the stale value, undoing any Firestore cleanup I ran manually.&lt;/p&gt;

&lt;p&gt;The frontend fix wasn't the real fix. The real fix was backend enum validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;AIDriver&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ai_driver&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend rejects any value not in the current enum. Cached bundles, stale defaults, garbage input — all dropped. The frontend can send whatever it wants; the backend is the last line, and it has to act like it.&lt;/p&gt;
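
&lt;p&gt;As a sanity check, the guard's behavior can be exercised with a plain object standing in for the enum (the driver names below are stand-ins, not the project's real values):&lt;br&gt;
&lt;/p&gt;

```javascript
// Behavior sketch of the enum guard. AIDriver members here are stand-in
// values, not the project's real enum.
const AIDriver = { GEMINI_FLASH: "gemini-2.5-flash", GEMINI_PRO: "gemini-2.5-pro" };

function buildUpdate(payload) {
  const update = {};
  if (payload.ai_driver) {
    if (Object.values(AIDriver).includes(payload.ai_driver)) {
      update.ai_driver = payload.ai_driver;
    }
  }
  return update;
}

console.log(buildUpdate({ ai_driver: "assistants-api-gpt4o" })); // {} (dropped)
console.log(buildUpdate({ ai_driver: "gemini-2.5-flash" }));     // { ai_driver: 'gemini-2.5-flash' }
```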

&lt;p&gt;That stopped the bleeding. But the pattern was already in my head when I opened a different codebase weeks later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Value That Was Always Last
&lt;/h2&gt;

&lt;p&gt;I was reviewing a feature flag called &lt;code&gt;ai_chat_enabled&lt;/code&gt;. The backend computed it from the user's subscription plan — a careful if/else chain that looked up the plan, checked edge cases, and resolved to a boolean. Solid logic. Well-tested in isolation.&lt;/p&gt;

&lt;p&gt;Then I looked at the response builder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;email_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ai_chat_enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ai_chat_enabled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;customerPreferences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;customerPreferences&lt;/code&gt; came from DynamoDB. It contained its own &lt;code&gt;ai_chat_enabled&lt;/code&gt; key — the raw stored preference, not the computed one. The spread came after the explicit assignment.&lt;/p&gt;

&lt;p&gt;JavaScript object literals follow last-writer-wins. The spread silently overwrote the computed value with whatever was sitting in the database. The entire plan-based computation — the lookup, the edge cases, the if/else chain — never reached the client. Not once. Not since the day this code shipped.&lt;/p&gt;

&lt;p&gt;The tests checked that the computation logic returned the right boolean. They never checked that the response builder actually used it.&lt;/p&gt;

&lt;p&gt;The fix was one line — move the spread before the explicit fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;customerPreferences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;email_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ai_chat_enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ai_chat_enabled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Computed values last. Raw data first. The spread provides defaults; the explicit fields override them.&lt;/p&gt;
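
&lt;p&gt;The rule is mechanical enough to verify in a few lines — a standalone sketch of the same collision:&lt;br&gt;
&lt;/p&gt;

```javascript
// Last writer wins: the same two sources, merged in each order.
const computed = { ai_chat_enabled: true };             // plan-based result
const customerPreferences = { ai_chat_enabled: false }; // raw stored value

const buggy = { ai_chat_enabled: computed.ai_chat_enabled, ...customerPreferences };
const fixed = { ...customerPreferences, ai_chat_enabled: computed.ai_chat_enabled };

console.log(buggy.ai_chat_enabled); // false (the spread overwrote the computation)
console.log(fixed.ai_chat_enabled); // true  (the explicit field has the last word)
```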

&lt;h2&gt;
  
  
  The Wrong Value Always Has a Way In
&lt;/h2&gt;

&lt;p&gt;Timing, staleness, ordering. Three mechanisms, same result: the value I intended never made it. If the frontend reads a field, the backend must validate it. If the backend computes a value, nothing downstream should be able to quietly replace it. The wrong value will always find a way in. The only defense is making sure the right value goes last.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>An Empty AI Response Corrupted Chat History</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/an-empty-ai-response-corrupted-chat-history-1ap</link>
      <guid>https://dev.to/jurijtokarski/an-empty-ai-response-corrupted-chat-history-1ap</guid>
      <description>&lt;p&gt;The spinner ran. The stream closed. The chat bubble stayed empty. No error anywhere.&lt;/p&gt;

&lt;p&gt;I was building a conversational discovery tool for founders — a multi-step Gemini-powered flow that walked people through product decisions, collected answers, and built a structured brief. Complex setup: long system prompt, tool definitions, large user messages. Genkit's &lt;code&gt;generateStream&lt;/code&gt; handling each turn.&lt;/p&gt;

&lt;p&gt;Intermittently, a user would send a message and get nothing back. No timeout, no catch block firing, no non-2xx status. Just a clean stream completion with zero content inside.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Logs Said When I Added Them
&lt;/h2&gt;

&lt;p&gt;Standard error handling gives you no signal here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateStream&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// exits immediately — no chunks arrive&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// response.text() returns ''&lt;/span&gt;
  &lt;span class="c1"&gt;// no exception thrown&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// never reached&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding chunk-level logging made it visible. The stream was completing, but the one chunk that arrived looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk #1 has no content.
Keys: [ 'index', 'role', 'content', 'custom', 'previousChunks', 'parser' ]
role: model
content.length: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;content&lt;/code&gt; property existed. It wasn't null. It was an empty array. The keys &lt;code&gt;custom&lt;/code&gt;, &lt;code&gt;previousChunks&lt;/code&gt;, and &lt;code&gt;parser&lt;/code&gt; are Genkit's internal markers for a thinking chunk. The model had spent the entire response budget on internal reasoning and had nothing left to output. HTTP 200. Genkit reported success.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Ways to Get Nothing
&lt;/h2&gt;

&lt;p&gt;Gemini 2.5 Flash ships with thinking mode enabled by default. Under normal inputs that's fine. Under heavy inputs — long system prompt plus tool definitions plus a long user message — it can exhaust the entire token budget on reasoning before producing a single output token.&lt;/p&gt;

&lt;p&gt;There's a second cause that produces the same result: silent rate limiting. Rather than returning a 4xx, Gemini returns a valid, complete, empty stream. The observable symptom is identical. The detection is identical: assert that at least one content chunk arrived after the stream closes.&lt;/p&gt;
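
&lt;p&gt;That assertion is small enough to sketch in full. The generator below is a stand-in that mimics the thinking-only chunk from the logs; in production the stream comes from &lt;code&gt;generateStream&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

```javascript
// Detection sketch: count content-bearing chunks and reject a stream that
// closed without any. The stand-in generator mimics a thinking-only chunk.
async function* emptyStream() {
  yield { index: 0, role: "model", content: [] }; // what the logged chunk looked like
}

async function assertHasContent(stream) {
  let contentChunks = 0;
  for await (const chunk of stream) {
    if (chunk.content?.length) contentChunks += 1;
  }
  if (contentChunks === 0) {
    throw new Error("Stream completed with zero content chunks");
  }
  return contentChunks;
}

assertHasContent(emptyStream()).catch((err) => console.error(err.message));
```

&lt;p&gt;The same check catches both causes, exhausted thinking budget and silent rate limiting, because it only looks at what actually arrived.&lt;/p&gt;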

&lt;p&gt;For the thinking mode case, the fix is one line in the Genkit config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateStream&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;thinkingConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;thinkingBudget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;thinkingBudget: 0&lt;/code&gt; disables extended thinking. For a conversational flow where latency matters more than deep reasoning, there's no reason to let the model spend the budget on internal traces.&lt;/p&gt;

&lt;p&gt;Fix deployed. I moved on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Save That Made It Permanent
&lt;/h2&gt;

&lt;p&gt;What I hadn't checked: the database. Every one of those empty responses had already been saved to Firestore. An empty string is a valid string. The save ran. Nothing flagged it.&lt;/p&gt;

&lt;p&gt;The stream handler read &lt;code&gt;finalResult.text&lt;/code&gt; after &lt;code&gt;generateStream&lt;/code&gt; resolved and wrote it as the AI's message. When thinking mode ate the budget, &lt;code&gt;finalResult.text&lt;/code&gt; was &lt;code&gt;""&lt;/code&gt;. Firestore now held a record of every affected conversation — each one storing a legitimate-looking AI turn with no content.&lt;/p&gt;

&lt;h2&gt;
  
  
  History as Poison
&lt;/h2&gt;

&lt;p&gt;When those users came back and sent new messages, &lt;code&gt;getChatHistory&lt;/code&gt; pulled their messages from Firestore and formatted them for Gemini:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;msg.content&lt;/code&gt; is &lt;code&gt;""&lt;/code&gt;, that produces &lt;code&gt;{ role: "model", content: [{ text: "" }] }&lt;/code&gt;. A valid-looking empty model turn in the middle of a real conversation. Gemini received it, interpreted it as unfinished context, entered thinking mode to reason about it, exhausted the budget, returned nothing — which got saved as another empty message, which poisoned the next turn.&lt;/p&gt;

&lt;p&gt;The conversation was permanently, silently broken. No exception at any layer. No signal the user could act on. Just a chat that would never respond again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix That Requires Two Places
&lt;/h2&gt;

&lt;p&gt;Fixing only the stream detection isn't enough — the database is already corrupted. Fixing only the history filter isn't enough — new empty responses can still arrive and be saved. Both defenses are required.&lt;/p&gt;

&lt;p&gt;Never write an empty AI message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;finalText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;accumulatedText&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;finalResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;finalText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveAIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chatId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;finalText&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;[StreamHandler] Skipping empty AI message save&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And filter empty turns before sending history to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Miss either one and the loop can restart. The stream guard stops new corruption. The history filter handles the records already in the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Retry That Made It Worse
&lt;/h2&gt;

&lt;p&gt;The first instinct after detecting an empty stream was to retry. The naive retry called the same send function — which re-inserted the user's message into the messages array. The model received the question twice. On an already-stressed conversation with heavy context, this accelerated the problem rather than resolving it.&lt;/p&gt;

&lt;p&gt;The fix is an &lt;code&gt;isRetry&lt;/code&gt; flag that skips message insertion on retry calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;streamMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;isRetry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;isRetry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;setChatMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userMsgId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;aiMsgId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;setChatMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;aiMsgId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;aiMsgId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;streamAIResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user message stays in history exactly once. Without this, retry logic breaks an already-broken conversation faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Every Layer Said "Success"
&lt;/h2&gt;

&lt;p&gt;What made this hard to debug: every layer reported success. HTTP 200, no caught exceptions, valid Firestore writes, clean history formatting. The failure was in the semantics, not the mechanics. An empty model turn is not a successful model turn — and asserting that distinction at each boundary is the only thing that stops the loop.&lt;/p&gt;
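&lt;p&gt;As a minimal sketch of that idea — illustrative only; the helper name and boundary labels are hypothetical, not from the codebase above — one assertion can be reused at every boundary a model turn crosses:&lt;br&gt;
&lt;/p&gt;

```javascript
// Illustrative sketch: treat an empty model turn as a failure at every
// boundary (stream end, database save, history load) instead of letting
// "success" propagate. Helper name and boundary labels are hypothetical.
function assertNonEmptyTurn(text, boundary) {
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("Empty model turn rejected at " + boundary);
  }
  return text;
}

// Usage at the save boundary: an empty turn throws here
// instead of silently writing a corrupt record.
// assertNonEmptyTurn(finalText, "saveAIMessage");
```

&lt;p&gt;The specific helper doesn't matter — what matters is that "non-empty" is asserted, not assumed, at each hop.&lt;/p&gt;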

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>javascript</category>
      <category>llm</category>
    </item>
    <item>
      <title>Software Engineering Principles for Startups</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/software-engineering-principles-for-startups-3215</link>
      <guid>https://dev.to/jurijtokarski/software-engineering-principles-for-startups-3215</guid>
      <description>&lt;p&gt;Most software engineering principles are written for teams of 50. Agile ceremonies, sprint retrospectives, quarterly planning — built for organizations, not for founders shipping products.&lt;/p&gt;

&lt;p&gt;I run a solo development studio. I ship to production every week, manage multiple client projects simultaneously, and maintain everything I build. Over the years I wrote down the principles that make this work. There are &lt;a href="https://varstatt.com/principles" rel="noopener noreferrer"&gt;33 of them&lt;/a&gt;, organized across five areas: philosophy, discovery, delivery, partnership, and diligence.&lt;/p&gt;

&lt;p&gt;Here's what actually matters when you're building software for startups.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start With What's Worth Building
&lt;/h2&gt;

&lt;p&gt;The most expensive software is software that shouldn't exist. Before writing any code, I run every project through a simple filter: &lt;a href="https://varstatt.com/principles/discovery/worth-building" rel="noopener noreferrer"&gt;is this worth building?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most ideas aren't. Not because they're bad ideas — but because they solve the wrong problem, or solve it at the wrong time, or solve it for a market that doesn't care enough to pay.&lt;/p&gt;

&lt;p&gt;When something passes that filter, the next step is &lt;a href="https://varstatt.com/principles/discovery/find-the-core" rel="noopener noreferrer"&gt;finding the core&lt;/a&gt; — the one capability that makes this product exist. Not the feature list. Not the competitor parity matrix. The single thing that, if it doesn't work, means nothing else matters.&lt;/p&gt;

&lt;p&gt;Jane's booking app needed staff-to-service matching that handled real salon complexity. Everything else — payment processing, notifications, calendar sync — is infrastructure you can buy. The core is the only part worth building custom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix the Budget, Flex the Scope
&lt;/h2&gt;

&lt;p&gt;Startups don't have unlimited time or money. The traditional approach — estimate everything, add buffer, hope it fits — doesn't work because estimates are wrong.&lt;/p&gt;

&lt;p&gt;I use &lt;a href="https://varstatt.com/principles/discovery/appetite-not-estimates" rel="noopener noreferrer"&gt;appetite, not estimates&lt;/a&gt;. You decide how much time a problem is worth — two weeks, six weeks — and that's your constraint. Then &lt;a href="https://varstatt.com/principles/discovery/scope-shaping" rel="noopener noreferrer"&gt;scope shaping&lt;/a&gt; fits what you build inside that box.&lt;/p&gt;

&lt;p&gt;This sounds backwards but it changes everything. Instead of "how long will this take?" the question becomes "what's the best version we can ship in three weeks?" That question has a useful answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ship Continuously, Not Eventually
&lt;/h2&gt;

&lt;p&gt;Startup velocity comes from short feedback loops. Every principle in my &lt;a href="https://varstatt.com/principles/delivery" rel="noopener noreferrer"&gt;delivery system&lt;/a&gt; optimizes for one thing: getting working software in front of users faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://varstatt.com/principles/delivery/wip-one" rel="noopener noreferrer"&gt;WIP One&lt;/a&gt; means one task in progress at a time. Finish it, deploy it, move on. Context switching kills solo developers faster than bad architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://varstatt.com/principles/delivery/production-is-done" rel="noopener noreferrer"&gt;Production is done&lt;/a&gt; means nothing counts until it's live. Not "done on my machine." Not "ready for review." Live in production with monitoring in place. This sounds obvious but most projects have weeks of "almost done" work that never ships.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://varstatt.com/principles/delivery/continuous-flow" rel="noopener noreferrer"&gt;Continuous flow&lt;/a&gt; replaces sprints with a priority queue. No sprint planning, no velocity tracking, no ceremony. Just: what's most important right now? Do that. Deploy it.&lt;/p&gt;

&lt;p&gt;For startup teams, this means you can change direction on Monday and ship the new thing by Wednesday. No "we'll add it to next sprint."&lt;/p&gt;

&lt;h2&gt;
  
  
  Software Development Is a Cost, Not a Craft
&lt;/h2&gt;

&lt;p&gt;This is the one that makes developers uncomfortable: &lt;a href="https://varstatt.com/principles/philosophy/business-cost" rel="noopener noreferrer"&gt;software development is a business cost&lt;/a&gt;. It's an operational expense, like rent or hosting.&lt;/p&gt;

&lt;p&gt;That doesn't mean quality doesn't matter. It means quality serves the business, not the developer's ego. The &lt;a href="https://varstatt.com/principles/delivery/scout-rule" rel="noopener noreferrer"&gt;scout rule&lt;/a&gt; — leave the codebase better than you found it — keeps quality high without separate "refactoring sprints" that never get prioritized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://varstatt.com/principles/philosophy/consolidation" rel="noopener noreferrer"&gt;Consolidation&lt;/a&gt; means fewer tools, fewer vendors, fewer moving parts. Every additional service is another bill, another dashboard, another thing that breaks at 2 AM. For startups, simplicity is a feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build the Boring Parts Last
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://varstatt.com/principles/discovery/context-over-purity" rel="noopener noreferrer"&gt;Context over purity&lt;/a&gt; means making pragmatic decisions, not architecturally perfect ones. Use the default stack. Buy what you can. Build only what's core.&lt;/p&gt;

&lt;p&gt;I keep a &lt;a href="https://varstatt.com/principles/delivery/default-stack" rel="noopener noreferrer"&gt;default stack&lt;/a&gt; and use it for everything unless there's a specific reason not to. Deep expertise in familiar tools beats starting fresh with the "best" technology for each project.&lt;/p&gt;

&lt;p&gt;When a client asks "should we use microservices?" the answer is almost always no. Not because microservices are bad — because for a startup, a monolith you ship in three weeks beats a distributed system you ship in three months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transparency Over Everything
&lt;/h2&gt;

&lt;p&gt;Startup partnerships fail on misaligned expectations, not technical problems. Every &lt;a href="https://varstatt.com/principles/partnership" rel="noopener noreferrer"&gt;partnership principle&lt;/a&gt; I follow addresses this directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://varstatt.com/principles/partnership/transparency" rel="noopener noreferrer"&gt;Transparency&lt;/a&gt; means full visibility into progress, problems, and decisions. No weekly status reports that hide bad news. When something goes wrong — and it will — the client knows the same day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://varstatt.com/principles/partnership/weekly-accountability" rel="noopener noreferrer"&gt;Weekly accountability&lt;/a&gt; creates a billing cycle that forces honest conversations. If the week didn't produce visible progress, that's a problem we discuss before the next week starts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://varstatt.com/principles/partnership/exit-freedom" rel="noopener noreferrer"&gt;Exit freedom&lt;/a&gt; means clients can leave at any time. No contracts, no lock-in, no hard feelings. If the work isn't valuable, you should be able to stop paying for it immediately. This keeps me accountable in a way that six-month contracts never could.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maintenance Is Not a Phase
&lt;/h2&gt;

&lt;p&gt;The biggest lie in software development: "We'll build it, launch it, then maintain it." As if building and maintaining are separate activities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://varstatt.com/principles/diligence/no-split" rel="noopener noreferrer"&gt;No split&lt;/a&gt; means development and maintenance happen continuously. Every feature I ship includes monitoring. Every deployment includes the ability to roll back. &lt;a href="https://varstatt.com/principles/delivery/quality-gates" rel="noopener noreferrer"&gt;Quality gates&lt;/a&gt; and &lt;a href="https://varstatt.com/principles/delivery/feature-flags" rel="noopener noreferrer"&gt;feature flags&lt;/a&gt; make it safe to fail and fast to fix.&lt;/p&gt;

&lt;p&gt;For startups, this means you don't need a separate "operations team" from day one. The development process IS the operations process. Ship code, watch it run, fix what breaks, improve what works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full System
&lt;/h2&gt;

&lt;p&gt;These principles aren't independent tips — they form a system. Discovery principles prevent you from building the wrong thing. Delivery principles get the right thing shipped fast. Partnership principles keep everyone aligned. Diligence principles make sure it keeps working.&lt;/p&gt;

&lt;p&gt;I documented all &lt;a href="https://varstatt.com/principles" rel="noopener noreferrer"&gt;33 principles&lt;/a&gt; as a reference — not as rules to follow blindly, but as a starting point for founders who want their engineering process to actually work.&lt;/p&gt;

&lt;p&gt;The best engineering principles for your startup are the ones that let you ship every week. Everything else is overhead.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Live: &lt;a href="https://varstatt.com/principles" rel="noopener noreferrer"&gt;varstatt.com/principles&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>productivity</category>
      <category>softwaredevelopment</category>
      <category>softwareengineering</category>
      <category>startup</category>
    </item>
    <item>
      <title>Why Scrum Fails In Small Teams</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/why-scrum-fails-in-small-teams-25a6</link>
      <guid>https://dev.to/jurijtokarski/why-scrum-fails-in-small-teams-25a6</guid>
      <description>&lt;p&gt;A few years ago, my development team of three was sitting through a 90-minute sprint planning ceremony. The feature we planned took two days to build.&lt;/p&gt;

&lt;p&gt;We spent more time estimating and discussing the work than doing it. I was the team lead, and this was the moment I started questioning what we were actually doing here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scrum Solved a Real Problem — Then Became One
&lt;/h2&gt;

&lt;p&gt;Scrum is a project management framework built around fixed-length iterations called sprints — usually two weeks. Each sprint has a planning ceremony, daily standups, a review, and a retrospective. There's a product owner who manages the backlog, a scrum master who facilitates the process, and a development team that executes.&lt;/p&gt;

&lt;p&gt;It was created in the 1990s to bring structure to software projects that were failing under waterfall — the old approach of planning everything upfront, building for months, and hoping the result matched reality. Scrum introduced short feedback cycles. Ship something every two weeks. Inspect and adapt. That was genuinely better than what came before.&lt;/p&gt;

&lt;p&gt;The agile manifesto that underpins scrum development prioritizes individuals over processes, working software over documentation, customer collaboration over contracts, and responding to change over following a plan. Good principles. The problem is what the industry built on top of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sprint Boundaries Are Artificial
&lt;/h2&gt;

&lt;p&gt;Tasks don't fit neatly into two-week boxes. Some take three days. Some take twelve. Forcing them into fixed time boundaries creates two failure modes: you either pad estimates to fill the sprint, or you rush to hit an arbitrary deadline that has nothing to do with the actual complexity.&lt;/p&gt;

&lt;p&gt;When a &lt;a href="https://dev.to/principles/partnership/priorities-not-scope"&gt;priority shifts mid-sprint&lt;/a&gt;, scrum says wait until the next planning ceremony. In a small team, that's absurd. The client calls, explains why Feature B is now urgent, and you should be able to switch today — not in nine days when the sprint ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Velocity Tracking Becomes Theater
&lt;/h2&gt;

&lt;p&gt;Story points were meant to help teams estimate work. In practice, they become a performance metric. Teams optimize for point throughput instead of actual value delivered. A refactoring task that prevents six months of tech debt gets 2 points. A trivial UI change that the PM can demo gets 8.&lt;/p&gt;

&lt;p&gt;When one person does the work, velocity tracking is particularly absurd. You already know your throughput. You lived it yesterday.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ceremonies Replace Communication
&lt;/h2&gt;

&lt;p&gt;Daily standups. Sprint planning. Sprint review. Sprint retrospective. Backlog grooming. For a team of fifteen with cross-functional dependencies, these rituals serve a real purpose — they force information sharing that wouldn't happen naturally.&lt;/p&gt;

&lt;p&gt;For a team of three? Or a solo developer working with a client? These meetings replace the actual communication they were designed to facilitate. You don't need a standup when you can send an async update after each work session. You don't need sprint planning when the priority queue is a shared list that either side can reorder at any time.&lt;/p&gt;

&lt;p&gt;When the framework produces more Jira tickets, Confluence pages, and status updates than actual shipped code, something has gone wrong. The best process is invisible — it stays out of the way while work gets done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Every Time I Switched to Kanban, Delivery Rocketed
&lt;/h2&gt;

&lt;p&gt;I've led dev teams twice. Both times we started with scrum because that's what the organization used. Both times we shifted toward kanban. And both times the same thing happened: delivery rocketed and people became happier.&lt;/p&gt;

&lt;p&gt;The only meeting that survived was a real daily standup — five minutes to talk about blockers and maybe share plans. That's it. The entire status was visible on the Jira board. Anyone could look at it anytime. No ceremony needed to extract information that was already public.&lt;/p&gt;

&lt;p&gt;I've shipped software since 2011. Now I run my own practice based on &lt;a href="https://dev.to/principles/delivery/continuous-flow"&gt;continuous flow&lt;/a&gt; — kanban, not scrum. Here's how it works:&lt;/p&gt;

&lt;h2&gt;
  
  
  A Priority Queue, Not a Sprint Backlog
&lt;/h2&gt;

&lt;p&gt;The client maintains a ranked list. The top item is the highest priority. I work top-down: finish what's in front, then pull the next thing. Priorities shift? The client reorders the list. No replanning ceremony. No negotiating what fits in the sprint. The developer is always working on what matters most right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Thing at a Time, Then Ship It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/principles/delivery/wip-one"&gt;One task at a time&lt;/a&gt;. Finish it. Deploy it. Then move on. This forces honest prioritization and kills context switching. It prevents the trap of being "90% done on five things" while nothing is actually working.&lt;/p&gt;

&lt;p&gt;Code review isn't done. QA passed isn't done. Merged isn't done. &lt;a href="https://dev.to/principles/delivery/production-is-done"&gt;Working in production is done&lt;/a&gt;. This changes how you think about deployment. If deploying is hard, it gets avoided. If it's easy, it happens constantly. Feature flags handle incomplete work — deploy behind the flag, keep building, flip it when it's ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Async Updates Beat Standups
&lt;/h2&gt;

&lt;p&gt;Updates go out after each work session — not at end of day, not at a standup, but when the work is actually done. Meetings happen only for decisions that genuinely need real-time discussion. Everything else is written. This keeps calendars empty and &lt;a href="https://dev.to/principles/partnership/async-first"&gt;focus time protected&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For significant features, I think in six-week cycles — long enough to deliver something end-to-end valuable, short enough to stay honest. A cycle isn't a deadline. It's a planning horizon. "In six weeks, we expect X to be working." The cycle serves orientation, not ceremony.&lt;/p&gt;

&lt;h2&gt;
  
  
  For Big Orgs, Scrum Is Still Revolutionary
&lt;/h2&gt;

&lt;p&gt;I'm not anti-process. I'm anti-unnecessary-process.&lt;/p&gt;

&lt;p&gt;For old-school corporations that have been running waterfall for decades, scrum is genuinely revolutionary. It introduces feedback loops, iterative delivery, and customer involvement where none existed before. That's a massive upgrade. If scrum is moving your 200-person org from annual releases to biweekly ones — keep going. That's real progress.&lt;/p&gt;

&lt;p&gt;Scrum works when you have large teams with cross-functional dependencies, regulated environments where audit trails are compliance requirements, organizations that need guardrails to prevent chaos, or teams coming from waterfall who need a stepping stone.&lt;/p&gt;

&lt;p&gt;But your dev team of four is probably shooting itself in the foot with this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Small Is a Strength, Not a Problem to Fix
&lt;/h2&gt;

&lt;p&gt;Here's what I see constantly: small teams and startups adopting processes designed for organizations ten times their size. Scrum is one of those processes. So are SAFe, detailed PRDs, elaborate RACI matrices, and weekly all-hands with thirty-slide decks.&lt;/p&gt;

&lt;p&gt;It comes from the same instinct — wanting to look and feel like a "real" company. But it's backwards. Being small is not a weakness to compensate for. It's an advantage to exploit.&lt;/p&gt;

&lt;p&gt;A team of four can make a decision in a Slack thread that would take a 40-person team two sprint ceremonies and a steering committee. You can deploy a hotfix in twenty minutes while a large org is still scheduling the incident review. You can pivot your roadmap over lunch.&lt;/p&gt;

&lt;p&gt;My advice: use the strength you actually have. You're small, so act quickly. Don't import the overhead of organizations that would kill to have your agility.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Best Process Disappears
&lt;/h2&gt;

&lt;p&gt;The agile manifesto got it right: individuals and interactions over processes and tools. Somewhere along the way, the industry built an entire certification industry, a tooling ecosystem, and a consulting practice around processes and tools.&lt;/p&gt;

&lt;p&gt;The best development process is the one you don't notice. Work comes in, gets prioritized, gets built, gets shipped. No theater. No rituals that exist to feel productive rather than be productive.&lt;/p&gt;

&lt;p&gt;Build it. Deploy it. Get feedback. Pull the next priority.&lt;/p&gt;

</description>
      <category>agile</category>
      <category>discuss</category>
      <category>management</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Three Bugs That Were Actually My Prompts</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Thu, 19 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/three-bugs-that-were-actually-my-prompts-3bc7</link>
      <guid>https://dev.to/jurijtokarski/three-bugs-that-were-actually-my-prompts-3bc7</guid>
      <description>&lt;p&gt;Three debugging sessions. Three different features. Every investigation eventually landed in the same place: my own prompt files.&lt;/p&gt;

&lt;p&gt;The AI wasn't broken. I was a contradictory author.&lt;/p&gt;

&lt;h2&gt;
  
  
  The STRICT Rule That Was Overriding Itself
&lt;/h2&gt;

&lt;p&gt;I built a structured interview tool — the kind that walks a founder through their idea one question at a time. The system prompt had this near the top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STRICT: Ask only ONE question per message. Never bundle questions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users kept getting messages like "How will you make money? What are the major costs to build and run this?" I read the prompt again. Rule was right there. Added emphasis. Still happened. Moved it higher. Still happened.&lt;/p&gt;

&lt;p&gt;Then I read the interview flow section — the part describing what topics to cover across the session. Step 4 read:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gs"&gt;**Revenue Streams + Cost Structure**&lt;/span&gt; — How will you make money?
What are the major costs to build and run this?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model wasn't defying the STRICT rule. It was following the flow description, which listed two topics as a single step and framed them as two inline questions. That structure implicitly granted permission to bundle. The more specific instruction — a concrete flow item with actual question text — overrode the more abstract one.&lt;/p&gt;

&lt;p&gt;The fix was two things. Unbundle every flow item into separate steps. And add a concrete bad example directly inside the STRICT rule — not just the prohibition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STRICT: Ask only ONE question per message. Never bundle questions.
Example of what NOT to do: "How will you make money? What are your costs?"
is TWO questions — send one, wait for the answer, then ask the next.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Abstract rules lose to specific structural descriptions. The model resolves contradictions by specificity, not by which rule came first or which one you emphasized. If your flow section describes two questions in the same bullet, that description is an instruction — regardless of what you wrote elsewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tool That Read the Prohibition
&lt;/h2&gt;

&lt;p&gt;After a discovery session, users could request a full report by email. The tool was registered. The backend handler existed. Users clicked the button. The AI said it couldn't send emails.&lt;/p&gt;

&lt;p&gt;I checked tool registration — correct. Checked the API call — correct. Checked the backend handler — correct. Everything looked wired up properly at every technical layer.&lt;/p&gt;

&lt;p&gt;The issue was in a place I hadn't thought to look. I grepped the prompt files for "report":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"report"&lt;/span&gt; mod/discovery/steps/&lt;span class="k"&gt;*&lt;/span&gt;/prompt.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every single step prompt had lines like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Do NOT offer to send a report.
Do NOT mention sending a report.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'd written those prohibitions months earlier during a different phase of the project. The tool didn't exist yet when I wrote them. By the time it did, I'd forgotten those lines were there.&lt;/p&gt;

&lt;p&gt;The model wasn't defective. It was obedient to instructions I'd authored and then lost track of. Ten minutes of grepping would have found this immediately. Instead I spent days checking tool registration and API calls.&lt;/p&gt;

&lt;p&gt;When an AI-powered feature does nothing, grep your prompt files for explicit prohibitions against the behavior you're expecting before you touch the code. Search for "do not" and "don't" near the name of the relevant action across your entire prompt corpus. It takes ten seconds, and it would have saved me days on this one.&lt;/p&gt;
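
&lt;p&gt;As a sketch of that sweep (the &lt;code&gt;prompts/&lt;/code&gt; directory is hypothetical; point the command at wherever your prompt files actually live):&lt;/p&gt;

```shell
# Create a throwaway prompt file so the sweep below has something to find.
# In a real project, skip this step and target your actual prompt directory.
mkdir -p prompts
printf 'Do NOT offer to send a report.\n' | tee prompts/step1.md

# Case-insensitive sweep for negative instructions across every prompt file.
grep -rniE "do not|don't|never" prompts/
```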

&lt;h2&gt;
  
  
  The Precondition That Lived Only in Prose
&lt;/h2&gt;

&lt;p&gt;After fixing the prohibitions, a new problem surfaced. The model was supposed to ask for the user's email before calling &lt;code&gt;send_report&lt;/code&gt;. The prompt said:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALWAYS ask for the user's email before calling send_report.
Never call send_report without confirmed contact details.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In testing, the tool got called with &lt;code&gt;founder@example.com&lt;/code&gt;, a placeholder the model had generated rather than asking for a real address. The instruction was clear. The model treated it as a suggestion.&lt;/p&gt;

&lt;p&gt;I made the prompt stronger. Same result — it would comply sometimes, skip the step other times, depending on how the conversation had flowed. Prompt-only enforcement of a precondition is probabilistic.&lt;/p&gt;

&lt;p&gt;The fix was to move validation into the tool handler itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PLACEHOLDER_DOMAINS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;placeholder.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;validateEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;No email provided. Ask the user for their email address before calling this tool.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;PLACEHOLDER_DOMAINS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`"&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" looks like a placeholder. Ask the user for their real email address.`&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to notice. First, the validation returns errors instead of throwing them. A thrown exception terminates the tool call with a runtime error the model can't act on. A returned error lands back in the model's context as a tool result — the model reads it, understands what went wrong, and retries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This crashes. The model gets a runtime error and no useful signal.&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;user_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user_email is required&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// This works. The model reads the error and asks for the real address.&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;user_email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user_email is required. Ask the user for their email address, then call this tool again.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, &lt;code&gt;required&lt;/code&gt; in a tool schema is a hint to the model, not a runtime guarantee. Models will omit required fields — sometimes because the value wasn't extracted yet, sometimes for reasons that aren't obvious from the logs. Treat every parameter as potentially absent at the handler boundary.&lt;/p&gt;
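
&lt;p&gt;A minimal sketch of that boundary, with a hypothetical handler name and return shape (not the project's actual code):&lt;/p&gt;

```javascript
// Hypothetical tool handler: never trust that a schema-"required" field arrived.
function handleSendReport(args) {
  const email = (args || {}).user_email;
  if (!email) {
    // Returned, not thrown: the model sees this as a tool result and can retry.
    return { error: 'user_email is required. Ask the user, then call this tool again.' };
  }
  return { status: 'queued', to: email };
}
```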

&lt;p&gt;&lt;code&gt;ALWAYS X&lt;/code&gt; in a prompt is a suggestion. Enforcing X belongs in code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt Is the Program
&lt;/h2&gt;

&lt;p&gt;All three bugs came from the same misread of what a system prompt is. I was treating it as documentation — a description of intended behavior that the real system (the code) would enforce. For an LLM-powered feature, that's backwards.&lt;/p&gt;

&lt;p&gt;The system prompt isn't documentation. It's source code executed by a natural-language interpreter. Contradictions in it don't fail to compile — they resolve according to specificity and proximity rules you never wrote down. Prohibitions execute. Structure is semantics. A flow description with two inline questions is an instruction to ask two questions, regardless of the STRICT rule above it.&lt;/p&gt;

&lt;p&gt;The debugging instinct to check the API, the tool registration, the network logs — all of that is valid. But it should come after you've read your own prompts as a hostile reader looking for contradictions, prohibitions, and preconditions that only exist in prose.&lt;/p&gt;

&lt;p&gt;The model is rarely the bug. Read your prompts first.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devjournal</category>
      <category>llm</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Nobody Finishes a 15-Minute AI Interview</title>
      <dc:creator>Jurij Tokarski</dc:creator>
      <pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/jurijtokarski/nobody-finishes-a-15-minute-ai-interview-2paf</link>
      <guid>https://dev.to/jurijtokarski/nobody-finishes-a-15-minute-ai-interview-2paf</guid>
      <description>&lt;p&gt;Last year I launched an AI-powered discovery tool for software founders. The idea was simple: instead of paying for a product consultant, sit through a 15-minute AI interview and get a comprehensive development roadmap. Business model, market sizing, personas, competitive analysis, PRD, tech stack, budget, action plan — all in one session, delivered as a PDF report.&lt;/p&gt;

&lt;p&gt;The output was genuinely useful. Founders who completed it got something they could hand to a developer and start building from.&lt;/p&gt;

&lt;p&gt;But most founders didn't complete it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Sessions Died
&lt;/h2&gt;

&lt;p&gt;I didn't need sophisticated analytics to see the pattern. Founders would start, get three or four exchanges in, and disappear. Not because the questions were wrong. Because they'd hit a question they couldn't answer yet.&lt;/p&gt;

&lt;p&gt;"What's your monetization model?" at minute six, right after they'd just gotten excited describing the product idea. Or a market sizing question when they hadn't done that research. The session demanded answers in a fixed order. Real founder thinking doesn't work that way.&lt;/p&gt;

&lt;p&gt;I spent weeks trying to fix the session — better prompts, shorter flows, smarter branching. None of it changed the completion rate. I was solving the wrong problem: "how do I get founders to finish a 15-minute interview" instead of "what does a founder actually need, when they need it."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Insight Came From SEO
&lt;/h2&gt;

&lt;p&gt;While researching keywords for content, I noticed something. "Competitive analysis template for startups" — thousands of monthly searches. "TAM SAM SOM calculator" — same. "PRD generator" — same. Each stage of the founder journey had its own search intent, its own moment of urgency.&lt;/p&gt;

&lt;p&gt;I had been thinking about building a standalone tool around one of these keywords. Then it struck me: my discovery tool already does all of this and more. But a founder searching for "lean canvas generator" doesn't think of it as part of a 15-minute discovery interview. They want the canvas. Right now.&lt;/p&gt;

&lt;p&gt;The monolithic tool was doing eight things well, packaged in a way that required commitment to all eight. The fix wasn't better prompting. It was decomposition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eight Tools, Eight Deliverables
&lt;/h2&gt;

&lt;p&gt;The rebuild started with twelve steps, got trimmed to ten, and settled at eight. One per stage of the founder journey:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/discovery/business-model-canvas"&gt;Business Model Canvas&lt;/a&gt; — lean canvas with revenue streams, cost structure, key partners&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/discovery/competitive-analysis"&gt;Competitive Analysis&lt;/a&gt; — positioning matrix, differentiation signals, competitor tech indicators&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/discovery/market-sizing"&gt;Market Sizing&lt;/a&gt; — TAM/SAM/SOM with growth assumptions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/discovery/user-personas"&gt;User Personas&lt;/a&gt; — typed persona objects with platform preferences and jobs-to-be-done&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/discovery/feature-prioritization"&gt;Feature Prioritization&lt;/a&gt; — domain classification (core / supporting / generic)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/discovery/tech-strategy"&gt;Tech Strategy&lt;/a&gt; — build-vs-buy decisions mapped to domain classification, specific stack recommendations&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/discovery/product-requirements"&gt;Project Requirements&lt;/a&gt; — scoped feature list, acceptance criteria, out-of-scope boundary&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/discovery/build-cost-plan"&gt;Build Cost &amp;amp; Plan&lt;/a&gt; — weekly estimate with a concrete action plan attached&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last two were originally four separate tools: Build vs Buy, Tech Stack Advisor, MVP Cost Estimator, and Action Plan. I merged them in pairs. "Should I build auth?" and "which auth provider?" aren't sequential questions — they're the same question. A cost estimate without an action plan is just a number that makes founders anxious. Eight made more sense than ten or twelve.&lt;/p&gt;

&lt;p&gt;Each tool is fully self-contained: it works with no prior context and no prior steps, but it's designed to hand off cleanly if the founder continues.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Each tool gets its own SEO landing page — keyword-targeted hero, explanation copy, FAQ, and an input form, all server-rendered. The page doubles as the app: before generation it's a landing page Google can crawl, after the founder starts it becomes the chat interface. One URL, two render states.&lt;/p&gt;
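
&lt;p&gt;The route logic can be sketched roughly like this (function and field names are mine, not the production code):&lt;/p&gt;

```javascript
// One URL, two render states: a crawlable landing page until a session
// exists, then the chat interface on the same route.
function renderToolPage(slug, session) {
  if (!session) {
    return { view: 'landing', crawlable: true, slug: slug };
  }
  return { view: 'chat', crawlable: false, slug: slug, sessionId: session.id };
}
```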

&lt;p&gt;The chat itself is a streaming conversation with a constrained AI model. Each tool has its own system prompt scoped to the decisions that step owns — Feature Prioritization scores by business value only, no effort or cost questions (those belong to later steps). The AI drives the conversation, but the scope is narrow: ask the right questions for this deliverable, produce a typed artifact, stop.&lt;/p&gt;

&lt;p&gt;Three server-side tools do the heavy lifting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;update_artifact&lt;/strong&gt; — incrementally builds the step's structured output as the conversation progresses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;complete_step&lt;/strong&gt; — finalizes the artifact, captures analysis and summary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;send_report&lt;/strong&gt; — collects all completed artifacts, generates a consolidated PDF, delivers via email&lt;/li&gt;
&lt;/ul&gt;
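
&lt;p&gt;How those three might compose is sketched below; the state shape and field names are assumptions, not the real implementation:&lt;/p&gt;

```javascript
// Hypothetical in-memory handlers for the three server-side tools.
const handlers = {
  // update_artifact: merge a partial patch into the step's structured output.
  update_artifact: function (state, args) {
    const artifact = Object.assign({}, state.artifact, args.patch);
    return Object.assign({}, state, { artifact: artifact });
  },
  // complete_step: freeze the artifact and record the step's summary.
  complete_step: function (state, args) {
    return Object.assign({}, state, { completed: true, summary: args.summary });
  },
  // send_report: collect the completed artifact for the consolidated report.
  send_report: function (state, args) {
    return { queued: true, to: args.email, artifacts: state.completed ? [state.artifact] : [] };
  }
};
```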

&lt;p&gt;The artifact panel shows the structured output updating in real time as the conversation progresses — the founder sees their canvas or competitive matrix forming, not just chat bubbles.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Context Moves Between Tools
&lt;/h2&gt;

&lt;p&gt;Each tool produces a typed artifact. The Business Model Canvas produces an object with &lt;code&gt;key_partners&lt;/code&gt;, &lt;code&gt;revenue_streams&lt;/code&gt;, &lt;code&gt;cost_structure&lt;/code&gt;. User Personas produces an array of persona objects. Feature Prioritization produces a classification map.&lt;/p&gt;

&lt;p&gt;When a founder continues to the next tool, those artifacts get injected into the new tool's system prompt as structured JSON. Chat history doesn't cross tool boundaries — the back-and-forth of step one is noise inside step six. What crosses is the concluded output.&lt;/p&gt;
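
&lt;p&gt;One way to sketch that injection (function name and section headers are illustrative):&lt;/p&gt;

```javascript
// Concluded artifacts cross the tool boundary as structured JSON;
// chat history stays behind.
function buildSystemPrompt(basePrompt, priorArtifacts) {
  const sections = Object.keys(priorArtifacts).map(function (stepName) {
    return '## ' + stepName + '\n' + JSON.stringify(priorArtifacts[stepName], null, 2);
  });
  return basePrompt + '\n\n# Prior Step Context\n\n' + sections.join('\n\n');
}
```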

&lt;p&gt;Each tool ends with two inline options rendered as suggestion pills on the last AI message: &lt;strong&gt;Continue to [next tool]&lt;/strong&gt; or &lt;strong&gt;Send report via email&lt;/strong&gt;. If the founder requests the report, all completed artifacts get compiled into a PDF and delivered to their inbox. If they continue, the next tool opens with context already loaded. Both outcomes are first-class. Stopping after step two means you have a competitive analysis report — that's a complete deliverable, not an abandoned session.&lt;/p&gt;

&lt;p&gt;Email capture happens at the moment a founder requests their report — after they've gotten value, not before they've seen anything. That single change converted capture from a gate into an offer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt Engineering That Wasn't
&lt;/h2&gt;

&lt;p&gt;Early in the build, I added a line to each tool's system prompt: "Use prior context if available to inform your analysis." Seemed reasonable.&lt;/p&gt;

&lt;p&gt;It didn't work. The model would occasionally reference something from an earlier step, but inconsistently and shallowly. Tech Strategy wasn't connecting its build-vs-buy decisions to the domain classifications Feature Prioritization had already produced. I spent two hours trying different phrasings before accepting the problem wasn't the wording.&lt;/p&gt;

&lt;p&gt;The fix was specificity. Not "use prior context" — enumerate every upstream artifact by name, every relevant field, and exactly how it should influence the current step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Prior Step Context&lt;/span&gt;

If the following steps are complete, use their outputs as described:
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="gs"&gt;**Feature Prioritization**&lt;/span&gt; — use &lt;span class="sb"&gt;`domain_classification`&lt;/span&gt; (core / supporting / generic)
  to anchor build-vs-buy decisions. Core = build custom. Generic = always buy.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**User Personas**&lt;/span&gt; — use &lt;span class="sb"&gt;`technical_proficiency`&lt;/span&gt; and &lt;span class="sb"&gt;`platform_preferences`&lt;/span&gt;
  to shape deployment and integration decisions.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Market Sizing**&lt;/span&gt; — use TAM/SAM/SOM scale to calibrate infrastructure complexity.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model follows explicit field references. It ignores vague instructions to "use context." The more precisely you enumerate the step name, the field name, and how to apply it — the more consistently the output reflects what prior steps actually found.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build Around the Deliverable
&lt;/h2&gt;

&lt;p&gt;The session format is an inherited assumption from chat UIs. It made sense for general-purpose assistants. It doesn't make sense for a process that unfolds across days or weeks, where each stage has its own mental context and its own moment of urgency.&lt;/p&gt;

&lt;p&gt;Decomposing the monolithic tool changed everything downstream. Eight tools means eight landing pages means eight keywords. Each tool is a complete product for someone who needs just that one thing. The full journey still exists for founders who want it — they just don't have to commit to it upfront.&lt;/p&gt;

&lt;p&gt;If your AI tool covers something that spans multiple sittings and mental states, the deliverable is the right unit to build around. Not the conversation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Live: &lt;a href="https://varstatt.com/discovery" rel="noopener noreferrer"&gt;varstatt.com/discovery&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
  </channel>
</rss>
