<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alan West</title>
    <description>The latest articles on DEV Community by Alan West (@alanwest).</description>
    <link>https://dev.to/alanwest</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3834047%2F6413d0cf-9d90-4ccc-80a9-123656fd78ba.png</url>
      <title>DEV Community: Alan West</title>
      <link>https://dev.to/alanwest</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alanwest"/>
    <language>en</language>
    <item>
      <title>How to escape note-taking lock-in with plain markdown and git</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Tue, 19 May 2026 00:32:44 +0000</pubDate>
      <link>https://dev.to/alanwest/how-to-escape-note-taking-lock-in-with-plain-markdown-and-git-3lpk</link>
      <guid>https://dev.to/alanwest/how-to-escape-note-taking-lock-in-with-plain-markdown-and-git-3lpk</guid>
      <description>&lt;h2&gt;
  
  
  When your notes outlive your note-taking app
&lt;/h2&gt;

&lt;p&gt;A few months ago I tried to export 4 years of notes from a popular note-taking app. The export gave me a &lt;code&gt;.zip&lt;/code&gt; of "markdown" files — except every link was rewritten to use the app's proprietary &lt;code&gt;[[uuid-7f3a...]]&lt;/code&gt; syntax, every attachment was renamed to a hash, and frontmatter was packed with app-specific fields nothing else could parse.&lt;/p&gt;

&lt;p&gt;I'd been telling myself "it's just markdown, I can leave whenever." Turns out I couldn't. Not without spending a weekend writing a migration script.&lt;/p&gt;

&lt;p&gt;This isn't a rant about that one app. It's a problem-solving article about a pattern I've watched bite developers over and over: trusting that the "open format" sticker on a tool means your data is portable. Below is how to set up a notes system that's actually portable — and how to verify it stays that way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The root cause: proprietary syntax inside open file extensions
&lt;/h2&gt;

&lt;p&gt;The trick almost every note-taking app pulls is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Files are saved as &lt;code&gt;.md&lt;/code&gt;. Marketing says "your notes are just markdown."&lt;/li&gt;
&lt;li&gt;But the &lt;em&gt;content&lt;/em&gt; uses app-specific extensions: custom block IDs, embeds, callouts, query languages, plugin metadata.&lt;/li&gt;
&lt;li&gt;Open the file in a plain editor and you'll see roughly 60% standard markdown and 40% syntax that &lt;em&gt;looks&lt;/em&gt; like markdown but isn't.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Standard &lt;a href="https://commonmark.org" rel="noopener noreferrer"&gt;CommonMark&lt;/a&gt; and &lt;a href="https://github.github.com/gfm/" rel="noopener noreferrer"&gt;GitHub Flavored Markdown&lt;/a&gt; are well-defined specs. Anything outside those is, technically, just text the app happens to render specially.&lt;/p&gt;

&lt;p&gt;When you try to migrate, the new tool reads the file fine — and silently drops everything that isn't standard markdown. Links break. Embeds disappear. Math blocks lose half their content. The migration looks successful right up until you actually try to &lt;em&gt;use&lt;/em&gt; the imported notes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Set boundaries with a vault structure
&lt;/h2&gt;

&lt;p&gt;The fix is to treat your notes like a small codebase. Plain markdown, folders for organization, git for history. Here's the layout I've used across three migrations now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;notes/
├── .git/
├── .gitignore
├── README.md              # entry point — what's here, how it's organized
├── inbox/                 # quick captures, unprocessed
├── daily/                 # YYYY-MM-DD.md
├── projects/
│   ├── project-a.md
│   └── project-b.md
├── topics/                # long-lived reference notes
│   ├── postgres.md
│   └── linux-networking.md
└── attachments/           # images, PDFs — referenced by relative path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three rules I follow strictly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Links are relative file paths&lt;/strong&gt;, not app-specific wikilinks. &lt;code&gt;[postgres notes](../topics/postgres.md)&lt;/code&gt; works everywhere — on GitHub, in VS Code, on the filesystem.&lt;/li&gt;
&lt;li&gt;
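&lt;strong&gt;Attachments live inside the vault&lt;/strong&gt;, next to the notes that reference them, and are linked by relative path. From a note in &lt;code&gt;projects/&lt;/code&gt;, that looks like &lt;code&gt;![diagram](../attachments/2026-02-pipeline.png)&lt;/code&gt;.&lt;/li&gt;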
&lt;li&gt;
&lt;strong&gt;No plugin-specific frontmatter.&lt;/strong&gt; If a field isn't useful when grep'd as plain text, don't add it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 2: Replace "features" with Unix tools
&lt;/h2&gt;

&lt;p&gt;Most app features developers actually need — search, backlinks, tag listings — can be replaced with command-line tools you already have.&lt;/p&gt;

&lt;p&gt;For full-text search, &lt;a href="https://github.com/BurntSushi/ripgrep" rel="noopener noreferrer"&gt;ripgrep&lt;/a&gt; is faster than any in-app search I've used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search all notes for a phrase, with 2 lines of context&lt;/span&gt;
rg &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"connection pool"&lt;/span&gt; &lt;span class="nt"&gt;-C&lt;/span&gt; 2 notes/

&lt;span class="c"&gt;# Find every note tagged #postgres (tags as inline #hashtags)&lt;/span&gt;
rg &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"#postgres&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; notes/

&lt;span class="c"&gt;# Find broken relative links (./ and ../): files referenced that don't exist on disk&lt;/span&gt;
rg &lt;span class="nt"&gt;-oP&lt;/span&gt; &lt;span class="s1"&gt;'\]\(\.{1,2}\/[^)]+\)'&lt;/span&gt; notes/ | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;: &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; src &lt;span class="nb"&gt;link&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;dirname&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$src&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$link&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/^](//; s/)$//'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$target&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"BROKEN: &lt;/span&gt;&lt;span class="nv"&gt;$src&lt;/span&gt;&lt;span class="s2"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="nv"&gt;$link&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For backlinks — which note mentions which — a one-liner does the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find every note that links to topics/postgres.md&lt;/span&gt;
rg &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"topics/postgres&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;md"&lt;/span&gt; notes/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
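&lt;p&gt;Tag listings, the third feature mentioned at the top of this step, follow the same pattern. Here is a minimal sketch, assuming tags are inline &lt;code&gt;#hashtags&lt;/code&gt; made of lowercase letters, digits, and hyphens. It uses plain &lt;code&gt;grep&lt;/code&gt; so it runs even where ripgrep isn't installed, and the &lt;code&gt;list_tags&lt;/code&gt; name is my own placeholder:&lt;/p&gt;

```shell
# list_tags DIR: print "count<TAB>#tag" for every inline tag under DIR,
# most-used first. A tag must follow start-of-line or whitespace, which
# keeps URL fragments (page.html#section) and headings ("# Title") out.
list_tags() {
  grep -rhoE --include='*.md' '(^|[[:space:]])#[a-z][a-z0-9-]*' "$1" |
    sed 's/^[[:space:]]*//' |
    sort | uniq -c | sort -k1,1nr -k2,2 |
    awk '{ print $1 "\t" $2 }'
}

# Example: list_tags "$HOME/notes"
```

&lt;p&gt;If your tags use uppercase letters or slashes, widen the character class to match; the rest of the pipeline stays the same.&lt;/p&gt;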



&lt;p&gt;Less ergonomic than a sidebar panel in a GUI? Sure. But it works on every machine I'll ever own, in every editor, forever. That's the tradeoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Version control as the safety net
&lt;/h2&gt;

&lt;p&gt;This is the step most "just use markdown" guides skip, and it's the one that actually makes the system durable. Initialize the directory as a git repo and commit anything that survives more than a day in the inbox.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;notes/
git init
git add &lt;span class="nb"&gt;.&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"initial vault"&lt;/span&gt;

&lt;span class="c"&gt;# A tiny pre-commit hook that rejects accidental app-specific syntax&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .git/hooks/pre-commit &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;HOOK&lt;/span&gt;&lt;span class="sh"&gt;'
#!/usr/bin/env bash
# Block wikilink-style references — they don't render outside specific apps
if git diff --cached --name-only -z | xargs -0 grep -lE '&lt;/span&gt;&lt;span class="se"&gt;\[\[&lt;/span&gt;&lt;span class="sh"&gt;[^]]+&lt;/span&gt;&lt;span class="se"&gt;\]\]&lt;/span&gt;&lt;span class="sh"&gt;' 2&amp;gt;/dev/null; then
  echo "Found wikilink syntax. Use relative paths instead." &amp;gt;&amp;amp;2
  exit 1
fi
&lt;/span&gt;&lt;span class="no"&gt;HOOK
&lt;/span&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x .git/hooks/pre-commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hook is the boring-but-critical piece. Without it, you'll absentmindedly type &lt;code&gt;[[some note]]&lt;/code&gt; once a week and slowly recreate the lock-in problem inside your supposedly portable system. Found that out the hard way last year.&lt;/p&gt;
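&lt;p&gt;Before relying on the hook, it's worth seeing what its pattern accepts and rejects. The sketch below reuses the same extended regex in a helper I've called &lt;code&gt;has_wikilink&lt;/code&gt; (a made-up name) and runs it against three throwaway files in a temp directory:&lt;/p&gt;

```shell
# has_wikilink FILE: succeed if FILE contains [[...]] syntax.
# Same pattern as the hook: two literal "[", one or more characters
# that are not "]", then two literal "]".
has_wikilink() { grep -qE '\[\[[^]]+\]\]' "$1"; }

demo_dir=$(mktemp -d)
printf 'See [postgres notes](../topics/postgres.md)\n' > "$demo_dir/ok.md"
printf 'See [[postgres notes]]\n' > "$demo_dir/wikilink.md"
printf 'if [[ -f notes.md ]]; then echo hi; fi\n' > "$demo_dir/bash-snippet.md"

for f in "$demo_dir"/*.md; do
  if has_wikilink "$f"; then
    echo "reject: $(basename "$f")"
  else
    echo "accept: $(basename "$f")"
  fi
done
```

&lt;p&gt;The standard link passes and the wikilink is rejected, but so is the bash &lt;code&gt;[[ ... ]]&lt;/code&gt; test in the third file. If your notes contain shell snippets, expect false positives and consider narrowing the pattern or skipping the hook for those commits.&lt;/p&gt;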

&lt;h2&gt;
  
  
  Step 4: A sync script you actually understand
&lt;/h2&gt;

&lt;p&gt;If you want notes on multiple devices, resist the urge to bolt on a sync service. A git remote is enough for 99% of single-user workflows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="c"&gt;# sync.sh: call from cron or a keybinding&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/notes"&lt;/span&gt;

git add &lt;span class="nt"&gt;-A&lt;/span&gt;
&lt;span class="c"&gt;# Skip empty commits when nothing has changed since last sync&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; git diff &lt;span class="nt"&gt;--cached&lt;/span&gt; &lt;span class="nt"&gt;--quiet&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"sync &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%FT%TZ&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi
&lt;/span&gt;git pull &lt;span class="nt"&gt;--rebase&lt;/span&gt; &lt;span class="nt"&gt;--autostash&lt;/span&gt;
git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've run this exact script across a laptop, a desktop, and a server for about 18 months. Total merge conflicts: maybe a dozen, all resolved in under a minute because the files are plain text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention: how to audit a tool before you commit
&lt;/h2&gt;

&lt;p&gt;Before adopting any new note-taking tool, run this checklist. Took me three migrations to learn it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a test note that uses every feature you care about (links, tags, attachments, embeds, code blocks).&lt;/li&gt;
&lt;li&gt;Print the raw file with &lt;code&gt;cat&lt;/code&gt;. Does it contain only standard markdown? If you see custom block syntax, that's your future lock-in.&lt;/li&gt;
&lt;li&gt;Move that file out of the tool's directory. Open it in a different markdown viewer. Does it still render correctly, with working links?&lt;/li&gt;
&lt;li&gt;Delete the tool entirely. Are your files still useful as plain text in a git repo?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any answer is "no" or "kind of", you're not adopting a markdown editor — you're adopting a database that happens to use &lt;code&gt;.md&lt;/code&gt; as a file extension.&lt;/p&gt;
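&lt;p&gt;The "read the raw file" step can be partly automated. This is a rough sketch, not a complete detector: the three patterns (wikilinks and embeds, trailing &lt;code&gt;^block-id&lt;/code&gt; references, and &lt;code&gt;[!note]&lt;/code&gt;-style callouts) are examples I picked of common app-specific syntax, and &lt;code&gt;audit_note&lt;/code&gt; is a hypothetical name:&lt;/p&gt;

```shell
# audit_note FILE: print the lines of FILE (with line numbers) that use
# common app-specific syntax. The patterns are illustrative, not exhaustive:
#   [[...]]     wikilinks, and ![[...]] embeds
#   ^block-id   block references at the end of a line
#   > [!note]   callout blocks
# Returns 0 when nothing was flagged, 1 when at least one line was.
audit_note() {
  if grep -nE '\[\[[^]]+\]\]|[[:space:]]\^[A-Za-z0-9-]+$|^>[[:space:]]*\[![A-Za-z]+\]' "$1"; then
    return 1
  fi
  return 0
}

# Example: audit_note test-note.md
```

&lt;p&gt;Run it on the test note from the first checklist item. A clean result doesn't prove the file is portable, but any flagged line is syntax another tool is likely to drop.&lt;/p&gt;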

&lt;h2&gt;
  
  
  When you actually need a GUI
&lt;/h2&gt;

&lt;p&gt;To be fair: a folder of markdown plus ripgrep won't replace every workflow. For graph views, daily review templates, or kanban boards on top of notes, you'll want some kind of editor or viewer. The fix isn't to avoid GUIs — it's to pick ones that &lt;em&gt;read&lt;/em&gt; a directory of plain files instead of &lt;em&gt;owning&lt;/em&gt; a vault. If the tool insists on importing your files into its own format, walk away. If it sits on top of the directory and treats your files as the source of truth, you can swap it out next year without losing a thing.&lt;/p&gt;

&lt;p&gt;That single distinction — does the tool own your files, or just read them — is the whole game.&lt;/p&gt;

</description>
      <category>markdown</category>
      <category>productivity</category>
      <category>git</category>
      <category>bash</category>
    </item>
    <item>
      <title>How to boot mainline Debian on a vendor-locked ARM tablet</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Mon, 18 May 2026 23:26:43 +0000</pubDate>
      <link>https://dev.to/alanwest/how-to-boot-mainline-debian-on-a-vendor-locked-arm-tablet-4f6i</link>
      <guid>https://dev.to/alanwest/how-to-boot-mainline-debian-on-a-vendor-locked-arm-tablet-4f6i</guid>
      <description>&lt;h2&gt;
  
  
  The problem: an $80 tablet running a kernel from 2018
&lt;/h2&gt;

&lt;p&gt;Picked up a cheap Rockchip-based Android tablet last month — RK3562 SoC, 4GB RAM, 64GB eMMC, under a hundred bucks. On paper it's perfect for a kiosk, a tiny build agent, or just an ARM dev box on my desk. In practice? It ships with an Android fork running a vendor kernel that's frozen in time. No root, no developer mode, no terminal, and no obvious way to install anything that didn't come from the manufacturer's app store.&lt;/p&gt;

&lt;p&gt;I wanted a Debian shell. Not Termux pretending to be Debian, not a chroot trick, not a VM. Actual Debian, owning the hardware.&lt;/p&gt;

&lt;p&gt;This is a problem you hit constantly with cheap ARM gear: vendor BSPs are a graveyard. Old kernels, no upstream changes, a single security patch on launch day and then silence. If you want a usable Linux machine out of one, you have to bring it yourself.&lt;/p&gt;

&lt;p&gt;Here's how I worked through it, what broke, and what to check before you start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause: the vendor BSP trap
&lt;/h2&gt;

&lt;p&gt;Most ARM SoCs ship with a Board Support Package — a vendor-maintained kernel fork plus a custom bootloader, device trees, and binary blobs for things like GPU, video decode, and Wi-Fi. The vendor uses it to ship a product, then walks away.&lt;/p&gt;

&lt;p&gt;The trap has three layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bootloader&lt;/strong&gt;: the board runs a vendor U-Boot or proprietary loader that expects a specific boot image format, partition layout, and sometimes signed payloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device tree&lt;/strong&gt;: the hardware description (&lt;code&gt;.dts&lt;/code&gt;/&lt;code&gt;.dtb&lt;/code&gt;) is custom per board. Mainline ships device trees for &lt;em&gt;some&lt;/em&gt; reference boards, but the specific touchscreen controller, PMIC, and panel on your tablet are almost certainly not there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drivers&lt;/strong&gt;: GPU (Mali), VPU, Wi-Fi, and audio frequently rely on out-of-tree drivers or firmware blobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So "install Debian" is really four problems stacked: get code to run at boot, get the kernel to recognize the hardware, get userspace to talk to it, and do all of this without bricking a device whose recovery path you don't fully understand yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: find a recovery path before you break anything
&lt;/h3&gt;

&lt;p&gt;Rule one of ARM hacking: know how to unbrick &lt;em&gt;before&lt;/em&gt; you brick.&lt;/p&gt;

&lt;p&gt;Most Rockchip SoCs have a &lt;strong&gt;maskrom&lt;/strong&gt; mode — a hardware-level recovery state where the CPU listens on USB for a loader image, totally independent of whatever's on eMMC. Even if you nuke the bootloader, you can usually recover with &lt;code&gt;rkdeveloptool&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Confirm the device shows up in maskrom mode&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;rkdeveloptool ld
&lt;span class="c"&gt;# Expected: DevNo=1 Vid=0x2207,Pid=0x350a LocationID=... Maskrom&lt;/span&gt;

&lt;span class="c"&gt;# Push a working loader into RAM (not flash)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;rkdeveloptool db rk356x_loader_vX.XX.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact PID and loader filename depend on the SoC family. Rockchip publishes prebuilt loader blobs in the &lt;a href="https://github.com/rockchip-linux/rkbin" rel="noopener noreferrer"&gt;rkbin tree&lt;/a&gt;; verify the binary matches your SoC before flashing anything persistent.&lt;/p&gt;

&lt;p&gt;If your device doesn't have a documented maskrom button combo or test pad, &lt;strong&gt;stop here&lt;/strong&gt;. Recovery without it usually means short-pinning a flash chip on the PCB, and that's a different blog post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: build U-Boot for the SoC, not the board
&lt;/h3&gt;

&lt;p&gt;Mainline U-Boot has reasonable Rockchip support, but it expects you to pick a board config. For an SoC where there's no upstream board file for your exact tablet, the pragmatic path is to start from the closest reference design and override the device tree later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://source.denx.de/u-boot/u-boot.git
&lt;span class="nb"&gt;cd &lt;/span&gt;u-boot
&lt;span class="c"&gt;# Use a nearby supported board as the base config&lt;/span&gt;
make evb-rk3568_defconfig
&lt;span class="c"&gt;# Cross-compile with an aarch64 toolchain&lt;/span&gt;
make &lt;span class="nv"&gt;CROSS_COMPILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;aarch64-linux-gnu- &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nv"&gt;BL31&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bl31.elf u-boot-rockchip.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;BL31&lt;/code&gt; is the runtime stage of ARM Trusted Firmware (TF-A), the secure-world firmware U-Boot hands control to. You can build it yourself from the &lt;a href="https://www.trustedfirmware.org/projects/tf-a/" rel="noopener noreferrer"&gt;TF-A project&lt;/a&gt; or pull a prebuilt blob from &lt;code&gt;rkbin&lt;/code&gt;. Building from source is the right long-term answer; pulling prebuilt is the right answer when you're still bisecting which combination boots at all. On RK356x parts, mainline U-Boot also expects the DDR-init blob from &lt;code&gt;rkbin&lt;/code&gt;, passed as &lt;code&gt;ROCKCHIP_TPL&lt;/code&gt; alongside &lt;code&gt;BL31&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: boot from SD card first, never eMMC
&lt;/h3&gt;

&lt;p&gt;This is the single biggest mistake I see people make: they flash an experimental image straight to internal storage on the first try. Don't.&lt;/p&gt;

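&lt;p&gt;On many Rockchip devices the boot chain tries a bootable SD card before falling back to the image on eMMC, but the exact order depends on the SoC and on the loader already in flash, so confirm it on your unit before relying on it. When it holds, you can iterate on a boot image entirely from an SD card while the original Android partition on eMMC stays untouched. If the image is broken, pull the SD card and the tablet boots Android like nothing happened.&lt;br&gt;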
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Drop U-Boot at the Rockchip-expected offset&lt;/span&gt;
&lt;span class="nb"&gt;sudo dd &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;u-boot-rockchip.bin &lt;span class="nv"&gt;of&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/sdX &lt;span class="nv"&gt;seek&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nv"&gt;conv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;notrunc
&lt;span class="c"&gt;# Partition the rest of the card normally&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;parted /dev/sdX mklabel gpt
&lt;span class="nb"&gt;sudo &lt;/span&gt;parted /dev/sdX mkpart boot fat32 16MiB 256MiB
&lt;span class="nb"&gt;sudo &lt;/span&gt;parted /dev/sdX mkpart root ext4 256MiB 100%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then format both partitions, mount the root partition at &lt;code&gt;/mnt/root&lt;/code&gt;, and drop a Debian arm64 rootfs onto it with &lt;code&gt;debootstrap&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;debootstrap &lt;span class="nt"&gt;--arch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;arm64 &lt;span class="nt"&gt;--foreign&lt;/span&gt; bookworm /mnt/root &lt;span class="se"&gt;\&lt;/span&gt;
    http://deb.debian.org/debian
&lt;span class="c"&gt;# Finish stage 2 inside a qemu-user chroot&lt;/span&gt;
&lt;span class="nb"&gt;sudo cp&lt;/span&gt; /usr/bin/qemu-aarch64-static /mnt/root/usr/bin/
&lt;span class="nb"&gt;sudo chroot&lt;/span&gt; /mnt/root /debootstrap/debootstrap &lt;span class="nt"&gt;--second-stage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two-stage &lt;code&gt;debootstrap&lt;/code&gt; works because &lt;code&gt;qemu-user-static&lt;/code&gt; transparently executes aarch64 binaries on your x86 host. Don't forget to register binfmt handlers (&lt;code&gt;binfmt-support&lt;/code&gt; package on Debian).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: device tree is where you'll lose a weekend
&lt;/h3&gt;

&lt;p&gt;The kernel will boot, panic on PMIC init, and reboot. That's normal. You're missing a working DTB.&lt;/p&gt;

&lt;p&gt;What I do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dump the Android partition's DTB blob and decompile it with &lt;code&gt;dtc -I dtb -O dts&lt;/code&gt; to get a starting point.&lt;/li&gt;
&lt;li&gt;Diff it against the mainline DTS for the closest reference board.&lt;/li&gt;
&lt;li&gt;Strip out anything vendor-specific (Android boot partitions, proprietary properties).&lt;/li&gt;
&lt;li&gt;Iterate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Expect the touchscreen, Wi-Fi, and internal sensors to not work on first boot. Serial console and USB will. Get a USB-to-serial adapter on the debug UART pads — without one, you're flying blind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention: what to check before you buy
&lt;/h2&gt;

&lt;p&gt;If you're shopping for cheap ARM hardware specifically to run mainline Linux, vet it first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search the SoC plus "mainline" or "u-boot defconfig"&lt;/strong&gt;: if the SoC has zero upstream presence, walk away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look for an exposed UART&lt;/strong&gt;: serial console access is non-negotiable for debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check for a maskrom button or documented test point&lt;/strong&gt;: this is your unbrick path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer SoCs with an active community port&lt;/strong&gt; (Pine64, Radxa, Orange Pi families) over no-name tablets — even if the silicon is the same, the upstream work is what saves you.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I haven't tested every Rockchip variant thoroughly, but the RK35xx family in general has a much healthier mainline story than the RK30xx-era parts ever did. Your mileage will vary by exact silicon revision and board.&lt;/p&gt;

&lt;p&gt;The payoff is real though. An $80 chunk of hardware running clean Debian, on a current kernel, that you actually control — that's worth the weekend.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>arm</category>
      <category>debian</category>
      <category>embedded</category>
    </item>
    <item>
      <title>How to fix the 'AI-generated' look in your frontend</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Mon, 18 May 2026 23:04:12 +0000</pubDate>
      <link>https://dev.to/alanwest/how-to-fix-the-ai-generated-look-in-your-frontend-1ahh</link>
      <guid>https://dev.to/alanwest/how-to-fix-the-ai-generated-look-in-your-frontend-1ahh</guid>
      <description>&lt;h2&gt;
  
  
  The problem: every AI site looks like the same AI site
&lt;/h2&gt;

&lt;p&gt;I did a small experiment last month. I asked three different code-gen tools to build me a landing page for a fake SaaS product. Different prompts, different sessions, different models. The output? Practically identical.&lt;/p&gt;

&lt;p&gt;Purple-to-blue gradient hero. Three feature cards in a row with rounded corners and lucide icons. A pricing section with the middle plan slightly elevated. A FAQ accordion at the bottom. CTA button with &lt;code&gt;bg-indigo-600 hover:bg-indigo-700&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you've shipped anything with an LLM lately, you've seen it. There's a specific visual fingerprint to AI-generated frontends, and once you can spot it, you can't unsee it. The frustrating part is when a client or a non-technical stakeholder looks at your work and says "this looks like ChatGPT made it" — even when half of it didn't.&lt;/p&gt;

&lt;p&gt;Let's debug why this happens and walk through fixes that actually move the needle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause: the model is averaging over its training data
&lt;/h2&gt;

&lt;p&gt;LLMs that generate UI code aren't choosing aesthetics. They're predicting the most likely next token given billions of public code samples. Public code samples are overwhelmingly tutorials, starter templates, and component libraries — which all tend to use the same defaults.&lt;/p&gt;

&lt;p&gt;There are three specific failure modes I keep seeing:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The default Tailwind palette
&lt;/h3&gt;

&lt;p&gt;The Tailwind default config uses a specific set of named colors (&lt;code&gt;slate&lt;/code&gt;, &lt;code&gt;indigo&lt;/code&gt;, &lt;code&gt;emerald&lt;/code&gt;, etc.) that are mathematically pleasant but instantly recognizable. When a model can't decide on a color, it reaches for &lt;code&gt;indigo-600&lt;/code&gt; or &lt;code&gt;slate-900&lt;/code&gt; because those tokens appear in roughly a billion tutorials.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The component-library layout vocabulary
&lt;/h3&gt;

&lt;p&gt;Hero → features grid → social proof → pricing → FAQ → footer. This isn't because that's the &lt;em&gt;right&lt;/em&gt; layout for a landing page. It's because it's the layout used in every shadcn/ui example, every Tailwind UI screenshot, every Vercel template. Models pattern-match on structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The "safe" typography pairing
&lt;/h3&gt;

&lt;p&gt;Inter for everything, with the occasional &lt;code&gt;font-bold&lt;/code&gt; for headings. Default line-height. Default tracking. The result is technically readable and entirely forgettable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix, part 1: tear out the default palette
&lt;/h2&gt;

&lt;p&gt;First step is replacing your Tailwind theme with something that doesn't ship by default. Don't just rename &lt;code&gt;indigo&lt;/code&gt; to &lt;code&gt;primary&lt;/code&gt; — actually pick colors that aren't in the default scale.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// tailwind.config.js&lt;/span&gt;
&lt;span class="cm"&gt;/** @type {import('tailwindcss').Config} */&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 'extend' keeps defaults; replacing 'colors' wipes them entirely&lt;/span&gt;
    &lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;transparent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;transparent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;currentColor&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="c1"&gt;// custom palette built from a base hue, not 'indigo'&lt;/span&gt;
      &lt;span class="na"&gt;ink&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#f6f5f1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#3d3a32&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#1a1814&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;ember&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#e8775a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// warm accent, not the usual cool blue&lt;/span&gt;
        &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#c45530&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;fontFamily&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// pair a serif display face with a sans-serif body for an unusual feel&lt;/span&gt;
      &lt;span class="na"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;"Fraunces"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;serif&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;sans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;"IBM Plex Sans"&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sans-serif&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



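&lt;p&gt;Notice I defined &lt;code&gt;colors&lt;/code&gt; directly under &lt;code&gt;theme&lt;/code&gt; instead of under &lt;code&gt;theme.extend&lt;/code&gt;. That replaces the default palette and kills &lt;code&gt;bg-indigo-600&lt;/code&gt; entirely: Tailwind stops generating the class, so if the model (or a junior dev) reaches for it, nothing renders, and any &lt;code&gt;@apply bg-indigo-600&lt;/code&gt; fails the build outright. Forcing the failure is the point. It pushes everyone toward the custom palette.&lt;/p&gt;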

&lt;h2&gt;
  
  
  The fix, part 2: break the layout grammar
&lt;/h2&gt;

&lt;p&gt;AI-generated layouts are almost always vertically stacked, full-width sections with centered content. You can break this pattern with very little code by using CSS Grid for asymmetric layouts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="c"&gt;/* asymmetric hero — content offset to the left, art bleeds right */&lt;/span&gt;
&lt;span class="nc"&gt;.hero&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c"&gt;/* the wider right track keeps the content column left of center on wide screens */&lt;/span&gt;
  &lt;span class="py"&gt;grid-template-columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;minmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2rem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="n"&gt;fr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;minmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;38rem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;minmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="n"&gt;fr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nl"&gt;align-items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;min-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80vh&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nc"&gt;.hero__content&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;/* sit in the second column, not centered across the page */&lt;/span&gt;
  &lt;span class="nl"&gt;grid-column&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="py"&gt;padding-block&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nc"&gt;.hero__art&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;/* let the visual element extend past the content column */&lt;/span&gt;
  &lt;span class="nl"&gt;grid-column&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="p"&gt;/&lt;/span&gt; &lt;span class="m"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;align-self&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stretch&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a five-minute change that immediately signals "a human chose this." Centered hero + three cards is the visual equivalent of beige carpet. Off-center compositions, overlapping elements, and content that breaks the grid all read as intentional design choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix, part 3: kill the rounded-2xl reflex
&lt;/h2&gt;

&lt;p&gt;Every AI-generated component has &lt;code&gt;rounded-2xl shadow-lg p-6&lt;/code&gt; somewhere. Override your component defaults at the source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// components/Card.jsx&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;Card&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;children&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;variant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;default&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// pick ONE radius vocabulary for the whole site, not per-component&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;variants&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;border border-ink-500/20 bg-ink-50&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;inset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;border-l-2 border-ember-600 bg-transparent pl-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;flat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bg-ink-50&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;article&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;variants&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;variant&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="s2"&gt; p-5`&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;children&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;article&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No border radius. No drop shadow. Borders and color contrast do the work instead. This won't fit every brand, but the point is to &lt;em&gt;pick a vocabulary&lt;/em&gt; and stick to it rather than letting each component drift toward generic-AI-card defaults.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix, part 4: replace placeholder copy before showing anyone
&lt;/h2&gt;

&lt;p&gt;This one isn't visual, but it triggers the same uncanny-valley response. "Empower your team to unlock productivity" and "Built for modern teams" are the textual equivalent of the purple gradient. If you ship a draft with that copy, even non-technical people pick up on it — they can't articulate why, but they know.&lt;/p&gt;

&lt;p&gt;I keep a checklist on my second monitor and run through it before any client review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No sentence that starts with "Empower", "Unlock", or "Transform"&lt;/li&gt;
&lt;li&gt;No feature card titled with two abstract nouns ("Seamless Integration")&lt;/li&gt;
&lt;li&gt;At least one specific, concrete claim with a number&lt;/li&gt;
&lt;li&gt;At least one sentence that sounds like a real person wrote it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prevention: catch it in code review
&lt;/h2&gt;

&lt;p&gt;The cheapest fix is a linter rule that fails the build when forbidden class patterns show up. Tailwind's &lt;code&gt;blocklist&lt;/code&gt; option keeps specific classes from being generated at all, and a custom ESLint rule can flag them in source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// eslint custom rule, simplified&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;banned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="sr"&gt;/bg-&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;indigo|violet|purple&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;-600/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sr"&gt;/rounded-&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;2xl|3xl&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sr"&gt;/from-purple-&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="sr"&gt;+ to-&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;blue|pink&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;-&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// the gradient&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nc"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pattern&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;banned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;report&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
              &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Banned default-AI class: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
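&lt;p&gt;The ESLint rule only sees string literals in files ESLint parses. As a rough complement, the same three patterns can be run over any text file (templates, MDX, plain HTML) from a small CI script. This is a minimal sketch, not part of the original setup: &lt;code&gt;find_banned&lt;/code&gt; and the sample markup are hypothetical names and data chosen for illustration.&lt;/p&gt;

```python
import re

# Same three patterns as the ESLint rule above.
BANNED = [
    re.compile(r"bg-(indigo|violet|purple)-600"),
    re.compile(r"rounded-(2xl|3xl)"),
    re.compile(r"from-purple-\d+ to-(blue|pink)-\d+"),  # the gradient
]


def find_banned(source):
    """Return (line_number, matched_text) pairs for every banned class hit."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for pattern in BANNED:
            for match in pattern.finditer(line):
                hits.append((line_number, match.group(0)))
    return hits


# Hypothetical component source, used only to exercise the scanner.
sample = 'button className="bg-indigo-600 rounded-2xl p-6"\ndiv className="bg-ink-50"'
for line_number, text in find_banned(sample):
    print(f"line {line_number}: banned class {text}")
```

&lt;p&gt;In CI you would read each file, call &lt;code&gt;find_banned&lt;/code&gt; on its contents, and exit non-zero if the list is not empty.&lt;/p&gt;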



&lt;p&gt;Is this petty? A little. But I'd rather have CI yell at me than ship something a client describes as "that AI look." After putting this rule in place on two projects, the diffs got noticeably more interesting: people started reaching for the custom tokens instead of the defaults, because the defaults no longer passed CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The "AI look" isn't really about AI. It's about defaults. LLMs amplify defaults because their training data is mostly default-using code. The fix isn't to stop using AI assistance — it's to remove the defaults from your toolchain so neither the model nor your team can fall back on them.&lt;/p&gt;

&lt;p&gt;Replace the palette. Break the layout grammar. Pick a component vocabulary and enforce it. And read the copy out loud before you ship.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>css</category>
      <category>frontend</category>
      <category>design</category>
    </item>
    <item>
      <title>Why MTP doesn't speed up your llama.cpp inference (and how to actually fix it)</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Mon, 18 May 2026 19:33:41 +0000</pubDate>
      <link>https://dev.to/alanwest/why-mtp-doesnt-speed-up-your-llamacpp-inference-and-how-to-actually-fix-it-2m2m</link>
      <guid>https://dev.to/alanwest/why-mtp-doesnt-speed-up-your-llamacpp-inference-and-how-to-actually-fix-it-2m2m</guid>
      <description>&lt;p&gt;Last week, I spent two days banging my head against a wall. I had just spun up a fresh &lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt; build with multi-token prediction (MTP) support, loaded a quantized Qwen3 model, and ran my benchmark suite expecting that sweet 2-3x speedup everyone keeps talking about.&lt;/p&gt;

&lt;p&gt;The result? Roughly the same tokens per second. Sometimes &lt;em&gt;slower&lt;/em&gt;. After a lot of profiling, I figured out what was happening — and it turns out the issue is more common than the celebratory benchmark posts suggest.&lt;/p&gt;

&lt;p&gt;This post is for anyone who's enabled MTP, expected a speedup, and got nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MTP actually does (the short version)
&lt;/h2&gt;

&lt;p&gt;Multi-token prediction is a form of speculative decoding baked into the model itself. Instead of running a separate, smaller draft model to guess the next few tokens, the main model emits multiple candidate tokens per forward pass. The verifier (usually the same model with a slightly different head) accepts or rejects them in one shot.&lt;/p&gt;

&lt;p&gt;The theory is simple. If acceptance rate is high, you get 2-3 tokens per forward pass instead of one, with roughly the same latency per pass. In practice, MTP can make things worse if any of three things go wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three reasons MTP fails to speed things up
&lt;/h2&gt;

&lt;p&gt;Here are the actual root causes I hit, in order of frequency:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Low acceptance rate
&lt;/h3&gt;

&lt;p&gt;This is the big one. MTP only helps if predictions are accepted. If your acceptance rate is below ~60%, you're paying the extra compute cost of generating drafts without getting tokens back. Wall-clock time goes up.&lt;/p&gt;

&lt;p&gt;I see this most often when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The prompt is unusual (specific code style, niche domain)&lt;/li&gt;
&lt;li&gt;Temperature is too high (anything above ~0.7 starts hurting)&lt;/li&gt;
&lt;li&gt;The model was quantized aggressively and the MTP head suffered more than the main weights&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. KV cache thrashing
&lt;/h3&gt;

&lt;p&gt;When you generate multiple candidates per step, you churn the KV cache more aggressively. On consumer GPUs with limited VRAM, this can spill into slower memory or cause re-allocation. The forward pass speedup gets eaten by memory stalls.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. CUDA graph capture failures
&lt;/h3&gt;

&lt;p&gt;This one bit me hard. llama.cpp tries to capture CUDA graphs for the inference loop. If MTP introduces dynamic shapes (variable number of accepted tokens per step), the graph gets re-captured every step. You lose the performance win of graphs entirely, and the per-step overhead actually goes &lt;em&gt;up&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-step: diagnosing your setup
&lt;/h2&gt;

&lt;p&gt;Here's the order I work through now whenever MTP doesn't seem to help.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Measure the actual acceptance rate
&lt;/h3&gt;

&lt;p&gt;llama.cpp surfaces speculation metrics with verbose logging. Build with CUDA support and run with &lt;code&gt;-v&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build llama.cpp with CUDA support&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;

&lt;span class="c"&gt;# Run with verbose stats so we can see acceptance numbers&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; models/qwen3-quantized.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a Python function for binary search"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-predict&lt;/span&gt; 256 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;tee &lt;/span&gt;run.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then grep the log for the speculation stats. You're looking for an &lt;code&gt;n_accept&lt;/code&gt; ratio. Below 0.6 means MTP is actively hurting throughput on your workload.&lt;/p&gt;
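&lt;p&gt;Once you have the accepted and drafted token counts from the log, the check itself is one division. The sketch below assumes you copy the two counts out by hand; the function and parameter names are illustrative, and the 0.6 cutoff is the rule of thumb from this post, not a llama.cpp constant.&lt;/p&gt;

```python
def acceptance_rate(n_accept, n_draft):
    """Fraction of drafted tokens that the verifier accepted."""
    if n_draft == 0:
        return 0.0  # nothing was drafted, so nothing could be accepted
    return n_accept / n_draft


def mtp_verdict(n_accept, n_draft, threshold=0.6):
    """Return the rate plus a label based on the rule-of-thumb threshold."""
    rate = acceptance_rate(n_accept, n_draft)
    verdict = "ok" if rate >= threshold else "hurting"
    return rate, verdict


# Hypothetical counts, standing in for numbers read off a verbose run.
rate, verdict = mtp_verdict(n_accept=150, n_draft=200)
print(f"acceptance {rate:.2f}: {verdict}")
```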

&lt;h3&gt;
  
  
  Step 2: Check VRAM headroom
&lt;/h3&gt;

&lt;p&gt;If acceptance is fine but throughput is still bad, you're probably memory-bound. Watch VRAM usage during inference in a separate terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Poll memory and GPU utilization once per second&lt;/span&gt;
nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;memory.used,memory.total,utilization.gpu &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv &lt;span class="nt"&gt;-l&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're sitting at &amp;gt;95% VRAM utilization while running, MTP's extra KV cache pressure is pushing you over the edge. The fix is usually to reduce context length, drop to a more aggressive quant (Q4_K_M instead of Q5_K_M), or shorten the draft window.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Disable CUDA graphs as a control
&lt;/h3&gt;

&lt;p&gt;To check whether graph re-capture is killing you, force graphs off and re-run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Disable CUDA graphs to test if they're being re-captured each step&lt;/span&gt;
&lt;span class="nv"&gt;GGML_CUDA_DISABLE_GRAPHS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 ./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; models/qwen3-quantized.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a Python function for binary search"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-predict&lt;/span&gt; 256 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If throughput is roughly the same with graphs disabled, capture isn't your problem. If throughput goes &lt;em&gt;up&lt;/em&gt; with this flag set, that's the smoking gun — graphs were being re-captured every step under MTP and the overhead was worse than not using them at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual fix
&lt;/h2&gt;

&lt;p&gt;Once you've identified which of the three issues you're hitting, the fix is usually simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low acceptance&lt;/strong&gt; — shorten the draft window. Most MTP implementations let you set a draft length of 1-4 tokens. Dropping from 4 to 2 often pushes acceptance above 70% because the model has to commit to fewer guesses in a row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM pressure&lt;/strong&gt; — reduce context length or quantize more aggressively. KV cache size scales linearly with context, so cutting &lt;code&gt;--ctx-size&lt;/code&gt; in half buys you real headroom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph capture churn&lt;/strong&gt; — pull the latest llama.cpp. The speculation code path changes frequently and padded graph capture has improved a lot recently.&lt;/li&gt;
&lt;/ul&gt;
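&lt;p&gt;To see why halving the context helps, it can be useful to estimate the KV cache size directly. This is a back-of-the-envelope sketch: it assumes an unquantized f16 cache (2 bytes per value), ignores any extra buffers MTP allocates, and uses a made-up model shape that is not taken from any specific Qwen3 checkpoint.&lt;/p&gt;

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_value=2):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_value


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

full = kv_cache_bytes(ctx_size=16384, **shape)
half = kv_cache_bytes(ctx_size=8192, **shape)
print(f"16384 ctx: {full / 2**30:.1f} GiB, 8192 ctx: {half / 2**30:.1f} GiB")
```

&lt;p&gt;Every term is a plain multiplier, so the estimate drops by exactly half when &lt;code&gt;ctx_size&lt;/code&gt; does.&lt;/p&gt;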

&lt;p&gt;Here's the config that finally worked for me on a quantized Qwen3 model with around 24 GB of VRAM available:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Final working config — moderate draft length, conservative context&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; models/qwen3-quantized.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROMPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-predict&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--draft-max&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--draft-min&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gave me roughly 1.7x throughput over the no-MTP baseline on my workload. Not the magical 3x some posts claim, but a real, repeatable win that I could ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention tips
&lt;/h2&gt;

&lt;p&gt;A few things I now do by default whenever I touch MTP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always benchmark with and without MTP.&lt;/strong&gt; Don't trust that it's helping just because it's enabled. Run both, measure both, save the numbers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin your llama.cpp version.&lt;/strong&gt; The MTP code path changes frequently. A config that works today can regress between commits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match quantization to the head carefully.&lt;/strong&gt; Some MTP heads are sensitive to aggressive quantization. If acceptance rate suddenly tanks after a re-quant, that's usually why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log acceptance rate as a metric, not just throughput.&lt;/strong&gt; Throughput tells you the symptom; acceptance rate tells you the cause. When you can see both side by side, regressions become obvious.&lt;/li&gt;
&lt;/ul&gt;
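&lt;p&gt;A lightweight way to follow the first and last tips is to store one record per benchmark pair. The sketch below is an assumption about how you might log it, not a llama.cpp feature; the field names and the sample numbers are hypothetical.&lt;/p&gt;

```python
import json


def summarize_runs(baseline_tps, mtp_tps, n_accept, n_draft):
    """Bundle throughput and acceptance into one record for later comparison."""
    return {
        "baseline_tps": baseline_tps,
        "mtp_tps": mtp_tps,
        "speedup": round(mtp_tps / baseline_tps, 2),
        # None marks a run where nothing was drafted at all.
        "acceptance_rate": round(n_accept / n_draft, 2) if n_draft else None,
    }


# Hypothetical measurements; substitute the numbers from your own runs.
record = summarize_runs(baseline_tps=40.0, mtp_tps=68.0, n_accept=150, n_draft=200)
print(json.dumps(record))
```

&lt;p&gt;Appending one such JSON line per llama.cpp commit makes it easy to spot the commit where the speedup or the acceptance rate fell off.&lt;/p&gt;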

&lt;p&gt;The honest takeaway is that MTP is a real win when the conditions line up, but it isn't free. If you've enabled it and gotten nothing, you're not doing it wrong — you've just hit one of the failure modes nobody talks about in the benchmark threads. Walk the three steps above and you'll usually find the culprit within an hour.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>performance</category>
      <category>machinelearning</category>
      <category>gpu</category>
    </item>
    <item>
      <title>AI Won't Speed Up Your Processes (And That's OK)</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Mon, 18 May 2026 19:29:25 +0000</pubDate>
      <link>https://dev.to/alanwest/ai-wont-speed-up-your-processes-and-thats-ok-c73</link>
      <guid>https://dev.to/alanwest/ai-wont-speed-up-your-processes-and-thats-ok-c73</guid>
      <description>&lt;h2&gt;
  
  
  The dirty secret of AI productivity claims
&lt;/h2&gt;

&lt;p&gt;Saw a post on HN this week (Frederick Van Brabant's piece) arguing that AI won't make your processes go faster, and honestly... yeah. After two years of integrating Copilot, Cursor, and Claude into my daily flow across four different teams, I've landed in roughly the same place. AI makes &lt;em&gt;tasks&lt;/em&gt; faster. Processes? Not so much.&lt;/p&gt;

&lt;p&gt;The distinction matters more than it sounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tasks vs. processes
&lt;/h2&gt;

&lt;p&gt;A task is the thing you do at your keyboard. Writing a function. Generating boilerplate. Drafting a gnarly regex. AI is genuinely excellent at these — I'd estimate it shaves 30-40% off my pure typing time when I'm in the zone.&lt;/p&gt;

&lt;p&gt;A process is everything &lt;em&gt;around&lt;/em&gt; the task. The Jira ticket sitting in "Ready for Review" for three days. The deploy that requires four approvals. The standup where you find out the requirements changed. The QA cycle. The customer who needs to validate the change before you can close anything.&lt;/p&gt;

&lt;p&gt;Look at where your week actually goes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough breakdown of a typical product dev week (40 hours)
Writing code             ~8h   (20%)
Reviewing PRs            ~6h   (15%)
Meetings / standups      ~8h   (20%)
Waiting (CI, reviews)    ~6h   (15%)
Debugging existing bugs  ~5h   (12.5%)
Planning / refinement    ~4h   (10%)
Context switching tax    ~3h   (7.5%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If "writing code" is 20% of your week, even doubling its speed saves you about 10% total. Amdahl's Law from college shows up uninvited and ruins the pitch deck.&lt;/p&gt;
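&lt;p&gt;The arithmetic is easy to check. Here is a small sketch of Amdahl's Law applied to the breakdown above; the function names are mine, and the 2x coding speedup is an assumed input rather than a measurement.&lt;/p&gt;

```python
def overall_speedup(fraction, task_speedup):
    """Amdahl's Law: speedup of the whole when only one fraction gets faster."""
    return 1 / ((1 - fraction) + fraction / task_speedup)


def time_saved(fraction, task_speedup):
    """Share of the original total time that disappears."""
    return 1 - 1 / overall_speedup(fraction, task_speedup)


# Writing code is 20% of the week; assume AI makes it twice as fast.
print(f"{time_saved(0.20, 2):.0%} of the week saved")
# Even an infinitely fast coder cannot save more than that 20% slice.
print(f"{time_saved(0.20, float('inf')):.0%} ceiling")
```

&lt;p&gt;The first line reproduces the 10% figure, and the second shows the hard ceiling: the other 80% of the week is untouched no matter how fast the typing gets.&lt;/p&gt;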

&lt;h2&gt;
  
  
  What I've actually measured
&lt;/h2&gt;

&lt;p&gt;I migrated three projects to a heavier AI-assisted workflow this year and tracked cycle time (first commit to production). Two of them got &lt;em&gt;slower&lt;/em&gt; in the first month. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More PRs were getting opened (because writing them was easy)&lt;/li&gt;
&lt;li&gt;Reviewers became the new bottleneck&lt;/li&gt;
&lt;li&gt;A handful of AI-generated pieces had subtle bugs that ate days&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By month three things normalized. Cycle time came back to baseline — not better. The team felt more productive (which is a real benefit, don't dismiss it) but the calendar didn't show it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The review tax nobody talks about
&lt;/h2&gt;

&lt;p&gt;Here's what nobody warns you about: AI shifts work from writing to reviewing. And reviewing is harder than writing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Looks fine at a glance, right?
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_discount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;discounts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_discount_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;discounts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# default = no discount
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;multiplier&lt;/span&gt;

&lt;span class="c1"&gt;# Two problems hiding here:
# 1. fetch_discount_table() is called on every invocation — no caching
# 2. If `code` is None (very common from a form), .get(None, 1) silently returns 1
#    instead of raising. Bug that ships happily to prod.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
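&lt;p&gt;For comparison, here is one way a reviewer might ask for the function to be tightened. It is a sketch under assumptions: &lt;code&gt;fetch_discount_table&lt;/code&gt; is stubbed with made-up codes, caching is done with &lt;code&gt;functools.lru_cache&lt;/code&gt;, and a missing code raises while an unknown code still means "no discount". Whether unknown codes should also raise is a product decision.&lt;/p&gt;

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def fetch_discount_table():
    # Hypothetical stand-in for the real lookup (database, API, config file).
    return {"SPRING10": 0.90, "VIP25": 0.75}


def apply_discount(price, code):
    """Apply a discount code, failing loudly when the code is missing."""
    if code is None:
        raise ValueError("discount code is missing")
    # The table is fetched once and served from the cache afterwards.
    multiplier = fetch_discount_table().get(code, 1)  # unknown code = no discount
    return price * multiplier


print(apply_discount(100, "VIP25"))
```

&lt;p&gt;The cache never refreshes within a process, so a real version would need an expiry or an explicit &lt;code&gt;fetch_discount_table.cache_clear()&lt;/code&gt; when the table changes.&lt;/p&gt;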



&lt;p&gt;When you write a function, you build a mental model as you go. When you review one, you reconstruct that model from the outside. With AI-generated code, you can't skip the careful review — sometimes it calls a method that doesn't exist, uses an outdated API pattern, or quietly swallows an error.&lt;/p&gt;

&lt;p&gt;I tell junior devs on my team: treat every AI suggestion like a Stack Overflow answer from 2017. Often useful, never trusted blindly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI does actually compress the process
&lt;/h2&gt;

&lt;p&gt;I don't want to be a total cynic — there are spots where AI shortens the process itself, not just the typing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stack trace → likely cause&lt;/strong&gt;: pasting an error and getting a focused minimal repro is faster than the back-and-forth on Slack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-language fluency&lt;/strong&gt;: when you touch a service in a language you don't write daily, AI noticeably shortens the ramp-up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First-draft docs and ADRs&lt;/strong&gt;: editing is faster than blank-page writing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test scaffolding&lt;/strong&gt;: generating the obvious cases so you can focus on the weird ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What these have in common: they replace a &lt;em&gt;waiting&lt;/em&gt; step, not a &lt;em&gt;typing&lt;/em&gt; step.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to actually measure your process
&lt;/h2&gt;

&lt;p&gt;Stop trusting vibes. Track the numbers.&lt;/p&gt;

&lt;p&gt;Questions worth answering for your team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's your median cycle time (PR opened → merged → deployed)?&lt;/li&gt;
&lt;li&gt;What's the median age of an open PR right now?&lt;/li&gt;
&lt;li&gt;How many PRs are open per dev on your team?&lt;/li&gt;
&lt;li&gt;How often does a PR need a second round of review changes?&lt;/li&gt;
&lt;/ul&gt;
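&lt;p&gt;The first question needs nothing more than two timestamps per PR. A minimal sketch, assuming you can export "opened" and "deployed" times as ISO strings; the sample rows below are invented for illustration.&lt;/p&gt;

```python
from datetime import datetime
from statistics import median

# Hypothetical export: (opened, deployed) timestamps per PR.
prs = [
    ("2026-05-04T09:00", "2026-05-05T15:00"),
    ("2026-05-05T10:00", "2026-05-08T10:00"),
    ("2026-05-06T14:00", "2026-05-07T08:00"),
]


def cycle_hours(opened, deployed):
    """Hours between a PR being opened and its change reaching production."""
    delta = datetime.fromisoformat(deployed) - datetime.fromisoformat(opened)
    return delta.total_seconds() / 3600


def median_cycle_hours(rows):
    """Median cycle time across all PRs, in hours."""
    return median(cycle_hours(opened, deployed) for opened, deployed in rows)


print(f"median cycle time: {median_cycle_hours(prs):.1f}h")
```

&lt;p&gt;The median is used rather than the mean so that one PR stuck in review for two weeks does not hide an otherwise healthy pipeline.&lt;/p&gt;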

&lt;p&gt;For process metrics there's GitHub Insights, LinearB, and Swarmia. For product-side metrics on what users actually do with the features you ship, privacy-focused options like Umami or Plausible give you full data ownership without the GA bloat. The point isn't the specific tool — it's that you need &lt;em&gt;some&lt;/em&gt; number that should move if AI is genuinely helping your pipeline.&lt;/p&gt;

&lt;p&gt;If your AI rollout is real, at least one of these numbers should move. If none of them move, you didn't speed up your process. You just made some tasks feel snappier.&lt;/p&gt;
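
&lt;p&gt;As a rough sketch of the first two questions, here is one way the medians could be computed once you have PR timestamps in hand. The record layout and the field names (&lt;code&gt;opened_at&lt;/code&gt;, &lt;code&gt;merged_at&lt;/code&gt;, &lt;code&gt;deployed_at&lt;/code&gt;) are hypothetical, and the sample data is made up; map them to whatever your tracker exports.&lt;/p&gt;

```python
from datetime import datetime
from statistics import median


def hours_between(start, end):
    """Elapsed hours between two ISO-8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


def median_cycle_time(prs):
    """Median hours from PR opened to deployed, ignoring undeployed PRs."""
    done = [p for p in prs if p.get("deployed_at")]
    return median(hours_between(p["opened_at"], p["deployed_at"]) for p in done)


def median_open_age(prs, now):
    """Median age in hours of PRs that have not been merged yet."""
    still_open = [p for p in prs if not p.get("merged_at")]
    return median(hours_between(p["opened_at"], now) for p in still_open)


# Hypothetical sample data, for illustration only
sample = [
    {"opened_at": "2026-05-01T09:00:00", "merged_at": "2026-05-01T12:00:00",
     "deployed_at": "2026-05-01T15:00:00"},
    {"opened_at": "2026-05-02T09:00:00", "merged_at": "2026-05-03T09:00:00",
     "deployed_at": "2026-05-04T09:00:00"},
    {"opened_at": "2026-05-03T09:00:00", "merged_at": "2026-05-03T18:00:00",
     "deployed_at": "2026-05-03T21:00:00"},
    {"opened_at": "2026-05-05T09:00:00", "merged_at": None, "deployed_at": None},
]

print(median_cycle_time(sample))
print(median_open_age(sample, "2026-05-06T09:00:00"))
```

&lt;p&gt;Medians are used on purpose: one PR that sat for three weeks shouldn't hide the fact that the typical one takes a day. Record the numbers before the rollout so you have something to compare against.&lt;/p&gt;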

&lt;h2&gt;
  
  
  What actually moves the needle
&lt;/h2&gt;

&lt;p&gt;The teams I've seen genuinely ship faster aren't the ones with the fanciest AI setups. They're the ones who fixed the boring stuff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# A boring CI config that saves more time than any AI tool I've used&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ship-it&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;     &lt;span class="c1"&gt;# fail fast — no 45 min stuck builds&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
          &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;npm'&lt;/span&gt;      &lt;span class="c1"&gt;# the cache line that saves ~2 min per run&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test -- --shard=${{ matrix.shard }}/4&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;shard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;3&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# parallelize across 4 runners&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Beyond CI, the cultural moves matter more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set review WIP limits (max 2 open PRs per reviewer)&lt;/li&gt;
&lt;li&gt;Kill approval theater (one human approval, not three)&lt;/li&gt;
&lt;li&gt;Automate deploys (no manual gates outside of regulated environments)&lt;/li&gt;
&lt;li&gt;Write ADRs so decisions don't get re-litigated every sprint&lt;/li&gt;
&lt;li&gt;Trunk-based development, feature flags for the scary stuff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI helps these teams more, because the process around the AI-generated code can actually keep up. AI &lt;em&gt;hurts&lt;/em&gt; a slow team because it dumps more code into an already-clogged review pipe.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest version
&lt;/h2&gt;

&lt;p&gt;I love using these tools. I'd fight someone to keep Cursor in my workflow. I haven't tested every model thoroughly, but the recent ones are clearly a step up. Still, when someone tells me their AI rollout is going to make the team "2x more productive," I ask what number they're going to measure. If they can't name one, I know exactly what's going to happen in six months.&lt;/p&gt;

&lt;p&gt;The AI is faster. The process isn't. Until you fix the process, the AI is just helping you generate code that sits in a review queue with all the other code.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>Debugging DNS leaks: why your VPN isn't hiding what you think it is</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Mon, 18 May 2026 01:21:15 +0000</pubDate>
      <link>https://dev.to/alanwest/debugging-dns-leaks-why-your-vpn-isnt-hiding-what-you-think-it-is-4ecg</link>
      <guid>https://dev.to/alanwest/debugging-dns-leaks-why-your-vpn-isnt-hiding-what-you-think-it-is-4ecg</guid>
      <description>&lt;p&gt;Last month I was setting up a hardened dev environment for a client doing security research. They wanted all traffic from their workstation tunneled through a VPN, no exceptions. Simple, right? Install WireGuard, flip the toggle, done.&lt;/p&gt;

&lt;p&gt;Then I ran a leak test and watched their real ISP-assigned DNS server pop up on the report. The traffic was tunneled. The DNS queries weren't. We'd been working under a false sense of privacy for a week.&lt;/p&gt;

&lt;p&gt;This is one of those bugs that doesn't crash anything, doesn't throw an error, and silently undermines the entire reason you set up the VPN in the first place. Let's walk through what's actually happening and how to fix it for good.&lt;/p&gt;

&lt;h2&gt;
  
  
  The frustrating problem
&lt;/h2&gt;

&lt;p&gt;You've done everything right. You're connected to a VPN. &lt;code&gt;curl ifconfig.me&lt;/code&gt; returns the VPN's exit IP. Your routing table looks clean. And yet, when you visit a DNS leak test site, your ISP's resolver shows up in the results.&lt;/p&gt;

&lt;p&gt;Worse: in some cases your VPN tunnel is fine for HTTP and HTTPS, but DNS is going out of band. Every domain you visit is still visible to your ISP, your coffee shop's network, or whoever else is between you and the resolver you didn't mean to use.&lt;/p&gt;

&lt;p&gt;If you're running this setup on a fleet of dev boxes or CI runners that talk to internal services, the consequences get worse. Internal hostnames can leak to public resolvers, and those names are often as sensitive as the traffic itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause: DNS is not part of your VPN tunnel by default
&lt;/h2&gt;

&lt;p&gt;Here's the thing most VPN tutorials gloss over. A VPN tunnel routes IP packets. DNS resolution happens at the OS level, often &lt;em&gt;before&lt;/em&gt; the packet routing decision, using whatever resolver was configured by your DHCP lease, your &lt;code&gt;/etc/resolv.conf&lt;/code&gt;, or your systemd-resolved stub.&lt;/p&gt;

&lt;p&gt;There are usually three culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;systemd-resolved&lt;/strong&gt; keeps per-link DNS configurations and may continue using the original interface's DNS even when traffic is routed elsewhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browsers with DNS-over-HTTPS&lt;/strong&gt; (Firefox, Chrome) bypass the OS resolver entirely and talk directly to a hardcoded DoH endpoint over HTTPS — which &lt;em&gt;is&lt;/em&gt; tunneled through the VPN, but goes to a third party you may not trust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applications using their own resolvers&lt;/strong&gt; — Go binaries with &lt;code&gt;GODEBUG=netdns=go&lt;/code&gt;, some container runtimes, and language-specific resolver libraries can ignore system settings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The VPN sees the encrypted DoH request and dutifully tunnels it. The OS resolver sends its plaintext UDP/53 query out the wrong interface. Both paths can coexist on the same machine, which is what makes this so confusing to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Confirm the leak
&lt;/h2&gt;

&lt;p&gt;Before fixing anything, prove it's actually leaking. The cheapest reliable test is &lt;code&gt;tcpdump&lt;/code&gt; on the physical interface (not the VPN interface) while you trigger a lookup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In one terminal, watch DNS on your physical NIC&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpdump &lt;span class="nt"&gt;-i&lt;/span&gt; wlan0 &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'udp port 53 or tcp port 53'&lt;/span&gt;

&lt;span class="c"&gt;# In another terminal, trigger a fresh lookup&lt;/span&gt;
&lt;span class="c"&gt;# Use a unique domain so cached answers don't hide the issue&lt;/span&gt;
dig &lt;span class="si"&gt;$(&lt;/span&gt;uuidgen | &lt;span class="nb"&gt;tr &lt;/span&gt;A-Z a-z&lt;span class="si"&gt;)&lt;/span&gt;.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If anything shows up on the first terminal, you're leaking. If the only DNS traffic appears on your VPN interface (&lt;code&gt;wg0&lt;/code&gt;, &lt;code&gt;tun0&lt;/code&gt;, etc.), you're clean.&lt;/p&gt;

&lt;p&gt;You can also check what resolver your system &lt;em&gt;thinks&lt;/em&gt; it's using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# systemd-resolved status, per-interface&lt;/span&gt;
resolvectl status

&lt;span class="c"&gt;# Classic view&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /etc/resolv.conf

&lt;span class="c"&gt;# What's actually being asked, in real time&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;resolvectl monitor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;monitor&lt;/code&gt; subcommand is underrated — it shows every query the stub resolver processes, including which interface it was sent over.&lt;/p&gt;
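
&lt;p&gt;The resolver check can also be scripted. Below is a minimal sketch that reads &lt;code&gt;resolv.conf&lt;/code&gt;-style text and flags any nameserver outside the VPN subnet. The function names, the sample text, and the &lt;code&gt;10.0.0.0/24&lt;/code&gt; range are assumptions for illustration. On a machine using systemd-resolved the file usually lists only the local stub address, so this is only meaningful where the file names the real upstream resolvers.&lt;/p&gt;

```python
import ipaddress


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if parts[:1] == ["nameserver"] and len(parts) != 1:
            servers.append(parts[1])
    return servers


def outside_tunnel(servers, vpn_cidr):
    """Return the servers that do not fall inside the VPN subnet."""
    network = ipaddress.ip_network(vpn_cidr)
    leaks = []
    for server in servers:
        addr = ipaddress.ip_address(server)
        if addr.version != network.version or addr not in network:
            leaks.append(server)
    return leaks


# Hypothetical file contents, for illustration only
sample = """# example contents
nameserver 10.0.0.1
nameserver 192.168.1.1
search lan
"""

print(outside_tunnel(nameservers(sample), "10.0.0.0/24"))
```

&lt;p&gt;To check a real machine, pass in the contents of &lt;code&gt;/etc/resolv.conf&lt;/code&gt; instead of the sample string. An empty list means every configured resolver sits inside the tunnel's address range; it does not prove that no application bypasses the file.&lt;/p&gt;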

&lt;h2&gt;
  
  
  Step 2: Force DNS through the tunnel
&lt;/h2&gt;

&lt;p&gt;The fix depends on your VPN client, but the principle is the same: every DNS query must travel inside the encrypted tunnel and hit a resolver on the other side.&lt;/p&gt;

&lt;p&gt;For a WireGuard config, this is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Interface]&lt;/span&gt;
&lt;span class="py"&gt;PrivateKey&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;your-private-key&amp;gt;&lt;/span&gt;
&lt;span class="py"&gt;Address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10.0.0.2/24&lt;/span&gt;
&lt;span class="c"&gt;# Use a resolver that lives on the VPN side
&lt;/span&gt;&lt;span class="py"&gt;DNS&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10.0.0.1&lt;/span&gt;

&lt;span class="nn"&gt;[Peer]&lt;/span&gt;
&lt;span class="py"&gt;PublicKey&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;peer-public-key&amp;gt;&lt;/span&gt;
&lt;span class="py"&gt;Endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;vpn.example.com:51820&lt;/span&gt;
&lt;span class="c"&gt;# Route everything, including DNS
&lt;/span&gt;&lt;span class="py"&gt;AllowedIPs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0/0, ::/0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;DNS =&lt;/code&gt; line tells &lt;code&gt;wg-quick&lt;/code&gt; to update &lt;code&gt;/etc/resolv.conf&lt;/code&gt; (or talk to systemd-resolved) so queries go to a server reachable only through the tunnel. The &lt;code&gt;AllowedIPs = 0.0.0.0/0&lt;/code&gt; part ensures the packet to that resolver actually enters the tunnel — without it, your route table might still send the DNS query out the default gateway.&lt;/p&gt;
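
&lt;p&gt;The relationship between those two lines can be checked mechanically. Here is a small sketch, assuming a single-peer config and using Python's &lt;code&gt;configparser&lt;/code&gt; as a stand-in parser (WireGuard's format is INI-like, but this is not an official parser). The function name and the placeholder keys are hypothetical.&lt;/p&gt;

```python
import configparser
import ipaddress


def check_full_tunnel_dns(config_text):
    """Return a list of problems found in a single-peer WireGuard config."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    problems = []

    dns_value = parser.get("Interface", "DNS", fallback="")
    allowed_value = parser.get("Peer", "AllowedIPs", fallback="")
    routes = [
        ipaddress.ip_network(item.strip(), strict=False)
        for item in allowed_value.split(",")
        if item.strip()
    ]

    dns_servers = []
    for item in dns_value.split(","):
        try:
            dns_servers.append(ipaddress.ip_address(item.strip()))
        except ValueError:
            continue  # wg-quick also accepts search domains here; skip them
    if not dns_servers:
        problems.append("no DNS server set in [Interface]")

    for server in dns_servers:
        covered = any(
            server.version == net.version and server in net for net in routes
        )
        if not covered:
            problems.append(f"DNS server {server} is not covered by AllowedIPs")
    return problems


# Placeholder keys; a real config carries base64 key material
full_tunnel = """[Interface]
PrivateKey = PLACEHOLDER
Address = 10.0.0.2/24
DNS = 10.0.0.1

[Peer]
PublicKey = PLACEHOLDER
Endpoint = vpn.example.com:51820
AllowedIPs = 0.0.0.0/0, ::/0
"""
split_tunnel = full_tunnel.replace("0.0.0.0/0, ::/0", "192.168.50.0/24")

print(check_full_tunnel_dns(full_tunnel))
print(check_full_tunnel_dns(split_tunnel))
```

&lt;p&gt;The split-tunnel variant is the interesting case: the resolver address looks fine on its own, but no peer is allowed to carry traffic to it, so queries either fail or find another way out.&lt;/p&gt;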

&lt;p&gt;For OpenVPN, the equivalent push options usually come from the server side, but you can force them locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your client config
&lt;/span&gt;&lt;span class="n"&gt;dhcp&lt;/span&gt;-&lt;span class="n"&gt;option&lt;/span&gt; &lt;span class="n"&gt;DNS&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;.&lt;span class="m"&gt;8&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;.&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;block&lt;/span&gt;-&lt;span class="n"&gt;outside&lt;/span&gt;-&lt;span class="n"&gt;dns&lt;/span&gt;       &lt;span class="c"&gt;# Windows-only, blocks leaks aggressively
&lt;/span&gt;&lt;span class="n"&gt;script&lt;/span&gt;-&lt;span class="n"&gt;security&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;up&lt;/span&gt; /&lt;span class="n"&gt;etc&lt;/span&gt;/&lt;span class="n"&gt;openvpn&lt;/span&gt;/&lt;span class="n"&gt;update&lt;/span&gt;-&lt;span class="n"&gt;resolv&lt;/span&gt;-&lt;span class="n"&gt;conf&lt;/span&gt;
&lt;span class="n"&gt;down&lt;/span&gt; /&lt;span class="n"&gt;etc&lt;/span&gt;/&lt;span class="n"&gt;openvpn&lt;/span&gt;/&lt;span class="n"&gt;update&lt;/span&gt;-&lt;span class="n"&gt;resolv&lt;/span&gt;-&lt;span class="n"&gt;conf&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Debian-style Linux systems, that &lt;code&gt;update-resolv-conf&lt;/code&gt; script is the one that actually modifies the system resolver; other platforms and clients use their own hooks. It's worth reading, since it's a useful template for understanding how DNS gets injected at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Tame the browsers and runtimes
&lt;/h2&gt;

&lt;p&gt;This is the step most people skip. Even with a perfect VPN config, Firefox and Chrome can still bypass your OS resolver if DoH is enabled.&lt;/p&gt;

&lt;p&gt;For Firefox, set this in &lt;code&gt;about:config&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;   &lt;span class="c1"&gt;// Off by user choice; do not use DoH&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mode 5 disables DoH entirely. If you want DoH but routed through your VPN's resolver, use mode 3 and set &lt;code&gt;network.trr.uri&lt;/code&gt; to your tunnel-side endpoint. The &lt;a href="https://wiki.mozilla.org/Trusted_Recursive_Resolver" rel="noopener noreferrer"&gt;Mozilla TRR docs&lt;/a&gt; explain the modes in detail.&lt;/p&gt;

&lt;p&gt;For Go programs, force the system resolver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// The resolver is picked at build time or run time, not by an import.&lt;/span&gt;
&lt;span class="c"&gt;// The cgo resolver goes through libc, so it follows the system resolver&lt;/span&gt;
&lt;span class="c"&gt;// setup that the VPN client rewrites. It needs a cgo-enabled build.&lt;/span&gt;
&lt;span class="c"&gt;//&lt;/span&gt;
&lt;span class="c"&gt;// At build time:&lt;/span&gt;
&lt;span class="c"&gt;//   go build -tags netcgo ./...&lt;/span&gt;
&lt;span class="c"&gt;// Or via environment at run time:&lt;/span&gt;
&lt;span class="c"&gt;//   GODEBUG=netdns=cgo+2 ./your-binary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;+2&lt;/code&gt; gives you debug output showing which resolver path was actually taken — invaluable when you're not sure if your fix landed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Block the leak path entirely
&lt;/h2&gt;

&lt;p&gt;Belt and suspenders. Add firewall rules that drop any DNS traffic not going through the tunnel. This way, if a misconfigured app tries to bypass, it fails loudly instead of leaking silently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# nftables: block UDP/53 and TCP/53 on the physical interface&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;nft add table inet vpn_guard
&lt;span class="nb"&gt;sudo &lt;/span&gt;nft add chain inet vpn_guard output &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;type &lt;/span&gt;filter hook output priority 0 &lt;span class="se"&gt;\;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;nft add rule inet vpn_guard output oifname wlan0 udp dport 53 drop
&lt;span class="nb"&gt;sudo &lt;/span&gt;nft add rule inet vpn_guard output oifname wlan0 tcp dport 53 drop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



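&lt;p&gt;If an app tries to leak, its query errors out or times out instead of succeeding against your ISP's resolver (swap &lt;code&gt;drop&lt;/code&gt; for &lt;code&gt;reject&lt;/code&gt; if you want an explicit refusal). That's a much better failure mode: you'll notice it immediately.&lt;/p&gt;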

&lt;h2&gt;
  
  
  Prevention tips for future projects
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test the leak path every time you change network config.&lt;/strong&gt; Don't trust that the previous setup still works after a kernel update or VPN client upgrade.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer kill-switch behavior&lt;/strong&gt; — drop all non-VPN traffic at the firewall when the tunnel is down. Most modern VPN clients support this; if yours doesn't, use nftables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardize DNS at the tunnel exit.&lt;/strong&gt; Run an &lt;code&gt;unbound&lt;/code&gt; or &lt;code&gt;dnsmasq&lt;/code&gt; instance on the VPN server so you control the resolver path end to end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit application-layer resolvers.&lt;/strong&gt; Browsers, container runtimes, and language standard libraries each have their own DNS quirks. Document them per project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run a periodic automated leak test.&lt;/strong&gt; A daily cron job that runs &lt;code&gt;dig&lt;/code&gt; against a unique subdomain and checks your authoritative server's logs for the source IP works well.&lt;/li&gt;
&lt;/ul&gt;
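
&lt;p&gt;The periodic leak test from the last bullet could start as something like the sketch below. It assumes &lt;code&gt;dig&lt;/code&gt; is installed and that you control an authoritative zone whose query logs you can read; &lt;code&gt;leakcheck.example.com&lt;/code&gt; and both function names are hypothetical. The script only sends the probe. Comparing the logged source IP with your VPN exit is left to whatever log access you have.&lt;/p&gt;

```python
import subprocess
import uuid


def probe_name(zone):
    """Build a never-before-seen hostname so no cache can answer it."""
    return f"{uuid.uuid4().hex}.{zone}"


def send_probe(zone):
    """Resolve a unique name under `zone` and return the name that was queried."""
    name = probe_name(zone)
    # check=False: the lookup may well come back empty; we only need it to be sent
    subprocess.run(
        ["dig", "+short", name], capture_output=True, text=True, check=False
    )
    return name


# Example of the kind of name a probe would use
print(probe_name("leakcheck.example.com"))
```

&lt;p&gt;Call &lt;code&gt;send_probe("leakcheck.example.com")&lt;/code&gt; from cron, record the returned name, then look it up in the authoritative server's logs. If the query arrived from anything other than the resolver behind your tunnel, you have a leak.&lt;/p&gt;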

&lt;p&gt;DNS leaks are the kind of bug that hides in plain sight. The fix isn't hard once you know where to look, but the surface area is bigger than most people realize. If you're going to put the work into setting up a VPN, spend the extra hour making sure your name resolution actually respects it.&lt;/p&gt;

</description>
      <category>networking</category>
      <category>security</category>
      <category>devops</category>
      <category>linux</category>
    </item>
    <item>
      <title>Why your local LLM aces benchmarks but fails real terminal tasks</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sun, 17 May 2026 21:00:11 +0000</pubDate>
      <link>https://dev.to/alanwest/why-your-local-llm-aces-benchmarks-but-fails-real-terminal-tasks-1mm3</link>
      <guid>https://dev.to/alanwest/why-your-local-llm-aces-benchmarks-but-fails-real-terminal-tasks-1mm3</guid>
      <description>&lt;p&gt;Last month I spent an entire weekend frustrated by the same pattern. I'd download a shiny new open-weight model, see it crush MMLU and HumanEval, then watch it faceplant the second I handed it a multi-step shell task. "Find the largest log file in /var/log, grep for OOM errors, and write a summary." The model would confidently invent flags that don't exist, forget what it ran two steps ago, or get stuck in a loop running &lt;code&gt;ls&lt;/code&gt; forever.&lt;/p&gt;

&lt;p&gt;If you've tried running local models as terminal agents, you know the feeling. The score on the leaderboard says one thing; your actual workflow says another. With agentic benchmarks like Terminal-Bench 2.0 getting more attention (and newer MoE models like the Qwen3.6 family reportedly landing on the public board), it's worth understanding why this gap exists and what you can do about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The root cause: static benchmarks aren't agentic benchmarks
&lt;/h2&gt;

&lt;p&gt;Most of the scores you see on Hugging Face leaderboards measure single-turn reasoning. The model gets a prompt, produces an answer, done. That tells you almost nothing about how the same model behaves when it has to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decide &lt;em&gt;which&lt;/em&gt; tool to call&lt;/li&gt;
&lt;li&gt;Parse messy stdout from a real shell&lt;/li&gt;
&lt;li&gt;Remember state across 15+ turns&lt;/li&gt;
&lt;li&gt;Recover when a command fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the gap that benchmarks like Terminal-Bench try to close. They put the model in an actual sandbox, give it a real task, and grade it on whether the task got done — not whether the intermediate reasoning looked plausible.&lt;/p&gt;

&lt;p&gt;The problem is that until you run an agentic eval yourself, you have no way to know if the model you're betting your stack on actually works for your use case.&lt;/p&gt;
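
&lt;p&gt;To make "grade it on whether the task got done" concrete, here is a minimal sketch of outcome-based grading. This is not how Terminal-Bench itself is implemented; the task layout, the &lt;code&gt;check&lt;/code&gt; callable, and the file name &lt;code&gt;summary.txt&lt;/code&gt; are all hypothetical. The point is that the grader inspects the final state of the working directory and never reads the model's reasoning.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def grade(task, workdir):
    """Pass or fail a task by inspecting the final state of the sandbox dir."""
    return bool(task["check"](Path(workdir)))


# Hypothetical task: the agent should leave a summary.txt that mentions "OOM"
task = {
    "prompt": "Summarize OOM errors from app.log into summary.txt",
    "check": lambda d: (d / "summary.txt").exists()
    and "OOM" in (d / "summary.txt").read_text(),
}

with tempfile.TemporaryDirectory() as workdir:
    before = grade(task, workdir)  # nothing has been done yet
    # Stand-in for the agent's work; a real run executes model-issued commands
    (Path(workdir) / "summary.txt").write_text("3 OOM kills found\n")
    after = grade(task, workdir)

print(before, after)
```

&lt;p&gt;A checker like this can't be talked into a pass: a confident explanation with no file on disk still fails.&lt;/p&gt;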

&lt;h2&gt;
  
  
  Setting up a local agentic eval harness
&lt;/h2&gt;

&lt;p&gt;Here's the approach I've been using to sanity-check models before committing to one. The core idea: simulate the same loop your production agent would run, but against a fixed task set you control.&lt;/p&gt;

&lt;p&gt;First, a minimal tool-call loop. I'll use the &lt;code&gt;transformers&lt;/code&gt; library since it works with most open-weight models out of the box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-model-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# swap in whatever you're testing
&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# load in the dtype the checkpoint was saved in
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_shell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Always use a sandbox in real evals; this is illustrative&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TimeoutExpired&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Tell the model the command hung instead of crashing the eval&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[command timed out]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, the agent loop itself. The thing that surprised me when I first wrote this: most failures don't happen in the model. They happen at the boundary — bad parsing, dropped context, no recovery path.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Apply the model's chat template — this matters a lot for instruct models
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# deterministic for evals
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Slice off the prompt tokens so we only decode the new output
&lt;/span&gt;    &lt;span class="n"&gt;new_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a shell agent. Reply with a single JSON object: {&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;cmd&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;} or {&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Parsing failures are a HUGE source of false-negative scores
&lt;/span&gt;            &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reply must be valid JSON.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
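        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Valid JSON in the wrong shape (a list, or only an unexpected key) would otherwise raise below
&lt;/span&gt;            &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reply must be a JSON object with a cmd or done key.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;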
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;observation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_shell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;output&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/output&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# ran out of turns
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the skeleton. The interesting part is the failure modes you'll see.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually goes wrong (and how to fix it)
&lt;/h2&gt;

&lt;p&gt;After running this harness against half a dozen open-weight models on the same fixed task set, these are the failure modes I keep hitting:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The model ignores your output format
&lt;/h3&gt;

&lt;p&gt;The most common failure isn't a reasoning failure. It's that the model wraps its JSON in markdown fences, or adds a chatty preamble, or hallucinates a &lt;code&gt;thoughts&lt;/code&gt; field your parser doesn't know about. The fix isn't more prompting — it's constrained decoding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
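&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use a library like `outlines` or `lm-format-enforcer`
# to force the model to emit valid JSON matching your schema
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;outlines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cmd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="c1"&gt;# Sketch against the outlines 0.x API implied by the import above; newer releases
# changed these entry points, so check the docs for the version you install
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transformers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-model-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# This guarantees parseable output, even from smaller models
&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List the files in /tmp.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;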

&lt;/div&gt;



&lt;p&gt;This single change moved one 9B model I tested from ~30% task completion to ~55% on my local set. The model was capable; it just kept tripping the parser.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Context collapse around turn 8–10
&lt;/h3&gt;

&lt;p&gt;Long shell sessions get noisy fast. A single &lt;code&gt;ls -la /usr&lt;/code&gt; can dump thousands of tokens. By turn 10 the model has lost track of the original task.&lt;/p&gt;

&lt;p&gt;The practical fix: truncate or summarize old observations aggressively. Keep the original task and the last 2–3 turns verbatim; collapse everything in between.&lt;/p&gt;
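&lt;p&gt;A minimal sketch of that compaction, assuming the list-of-dicts &lt;code&gt;history&lt;/code&gt; shape from the harness above. The &lt;code&gt;compact_history&lt;/code&gt; name, the &lt;code&gt;keep_turns&lt;/code&gt; and &lt;code&gt;max_chars&lt;/code&gt; parameters, and the truncation marker are placeholders of mine, not part of any library. It treats a turn as roughly one assistant reply plus the observation that follows it, and it clips old observations rather than summarizing them.&lt;/p&gt;

```python
def compact_history(history, keep_turns=3, max_chars=200):
    """Keep the system prompt, the task, and the last `keep_turns` turns verbatim.

    Older user messages (shell observations) are clipped to `max_chars` characters.
    Assumes history[0] is the system prompt and history[1] is the task.
    """
    head, body = history[:2], history[2:]
    # Roughly two messages per turn: the assistant reply and the observation.
    cut = max(len(body) - 2 * keep_turns, 0)
    old, recent = body[:cut], body[cut:]
    compacted = []
    for msg in old:
        content = msg["content"]
        if msg["role"] == "user" and len(content) > max_chars:
            dropped = len(content) - max_chars
            content = content[:max_chars] + f"\n[... {dropped} chars truncated ...]"
        # Build new dicts so the full history stays intact for logging.
        compacted.append({"role": msg["role"], "content": content})
    return head + compacted + recent
```

&lt;p&gt;In the loop, that would mean calling &lt;code&gt;agent_step(compact_history(history))&lt;/code&gt; while still appending to the full &lt;code&gt;history&lt;/code&gt;, so the untruncated transcript is there when you debug a failed run.&lt;/p&gt;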

&lt;h3&gt;
  
  
  3. MoE models need different inference tuning
&lt;/h3&gt;

&lt;p&gt;If you're testing newer mixture-of-experts releases (the "A3B" suffix in some recent Qwen releases reportedly indicates ~3B active parameters per token), the default &lt;code&gt;transformers&lt;/code&gt; settings often leave performance on the table. For these, I've had much better latency with &lt;code&gt;vllm&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
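&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;vllm
&lt;span class="c"&gt;# --tensor-parallel-size 2 shards the model across two GPUs; drop the flag on a single GPU
&lt;/span&gt;vllm serve your-model-here &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 2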
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then point your harness at the OpenAI-compatible endpoint instead of running the model in-process. The throughput difference on multi-turn agent loops is noticeable, because each task makes dozens of generation calls and every call is itself many forward passes.&lt;/p&gt;
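&lt;p&gt;Here is roughly what that swap can look like using only the standard library. It assumes the server is on vLLM's default port (8000) on localhost and exposes the usual &lt;code&gt;/v1/chat/completions&lt;/code&gt; route. &lt;code&gt;build_chat_request&lt;/code&gt; and &lt;code&gt;agent_step_remote&lt;/code&gt; are hypothetical names, and &lt;code&gt;your-model-here&lt;/code&gt; is the same placeholder as in the command above.&lt;/p&gt;

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port
MODEL = "your-model-here"              # placeholder; must match the served model name

def build_chat_request(history, model=MODEL, max_tokens=256):
    """Build the POST request for an OpenAI-style chat completion."""
    payload = {
        "model": model,
        "messages": history,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding keeps eval runs comparable
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def agent_step_remote(history):
    """Stand-in for an in-process agent_step(history): returns the reply text."""
    with urllib.request.urlopen(build_chat_request(history), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

&lt;p&gt;Keeping the request builder separate from the network call means you can check the payload in a unit test without a running server.&lt;/p&gt;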

&lt;h2&gt;
  
  
  Prevention: bake the eval into your workflow
&lt;/h2&gt;

&lt;p&gt;The meta-lesson from all this: don't trust leaderboards for your specific use case. They're a useful filter, but a 5-point gap on Terminal-Bench means almost nothing if the model fails on the specific commands your agent runs.&lt;/p&gt;

&lt;p&gt;A few habits that have saved me time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep a fixed task set of 20–30 representative jobs.&lt;/strong&gt; Re-run them against every model you consider. Same prompts, same scoring, same sandbox.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log every failed turn.&lt;/strong&gt; Most regressions show up as parsing or format issues long before they show up as reasoning issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test the inference stack, not just the weights.&lt;/strong&gt; The same model on &lt;code&gt;transformers&lt;/code&gt; vs &lt;code&gt;vllm&lt;/code&gt; vs &lt;code&gt;llama.cpp&lt;/code&gt; can score differently because of subtle tokenization or sampling defaults.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the official model card and benchmark source before quoting numbers.&lt;/strong&gt; Leaderboard scores get updated; blog posts don't.&lt;/li&gt;
&lt;/ul&gt;
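&lt;p&gt;The first two habits can be sketched as a small runner. Everything here is illustrative: &lt;code&gt;run_suite&lt;/code&gt;, the task dict shape, and the &lt;code&gt;failures.jsonl&lt;/code&gt; file name are assumptions of mine, and it logs failed tasks rather than individual turns (per-turn logging would live inside the harness loop itself).&lt;/p&gt;

```python
import json

def run_suite(tasks, run_task, log_path="failures.jsonl"):
    """Run every task once and return the pass rate.

    `tasks` is a list of {"name", "prompt", "check"} dicts, where `check`
    is a callable that scores the final answer as True or False.
    `run_task` is any callable that takes a prompt and returns an answer or None.
    """
    passed = 0
    with open(log_path, "w", encoding="utf-8") as log:
        for task in tasks:
            answer = run_task(task["prompt"])
            ok = answer is not None and bool(task["check"](answer))
            if ok:
                passed += 1
            else:
                # One JSON object per line, so failures are easy to grep and diff across models.
                log.write(json.dumps({"task": task["name"], "answer": answer}) + "\n")
    return passed / len(tasks) if tasks else 0.0
```

&lt;p&gt;Because the task list and the &lt;code&gt;check&lt;/code&gt; functions stay fixed, the only thing that changes between runs is the &lt;code&gt;run_task&lt;/code&gt; callable you pass in for each model or inference stack.&lt;/p&gt;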

&lt;p&gt;The gap between "this model benchmarks well" and "this model works in my agent" is real, and it's almost always closeable with better tooling around the model rather than a bigger model. Start with the harness, find your actual bottleneck, then decide what to swap.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why prompt engineering fails for tone control — and how steering vectors fix it</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sun, 17 May 2026 20:55:41 +0000</pubDate>
      <link>https://dev.to/alanwest/why-prompt-engineering-fails-for-tone-control-and-how-steering-vectors-fix-it-11h9</link>
      <guid>https://dev.to/alanwest/why-prompt-engineering-fails-for-tone-control-and-how-steering-vectors-fix-it-11h9</guid>
      <description>&lt;h2&gt;
  
  
  The problem: prompts are not a behavior dial
&lt;/h2&gt;

&lt;p&gt;I spent two days last month trying to make a 7B chat model sound less robotic. System prompts. Few-shot examples. Explicit "do not use the word 'utilize'" instructions. The model kept doing exactly what I told it not to do, like a teenager who hears the opposite of every request.&lt;/p&gt;

&lt;p&gt;If you've worked with open-weight models, you've felt this. Prompt engineering looks like a behavior dial but it's really more like shouting suggestions at a trained habit. The model has &lt;em&gt;learned&lt;/em&gt; a tone through fine-tuning, and your runtime instructions are wrestling with that whole training corpus.&lt;/p&gt;

&lt;p&gt;What I needed was a way to nudge the model's internal state directly. Turns out that's been possible for a while — it's called activation steering, or steering vectors — and the recent wave of efficient open-weight releases has made it tractable on a single GPU again, which is why I'm revisiting it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause: behavior lives in the residual stream, not the prompt
&lt;/h2&gt;

&lt;p&gt;Here's the thing prompt engineering can't fix. When a transformer generates a token, the prompt is just one input to a much larger machinery: the residual stream, attention patterns, MLP outputs at each layer. Behavioral traits like "formal vs. casual," "refusal-prone vs. helpful," or "concise vs. verbose" show up as directions in that residual stream.&lt;/p&gt;

&lt;p&gt;If a model has been post-trained into a certain tone, that tone is encoded as a stable direction the residual stream tends to walk toward. Your prompt nudges the inputs. The training-induced direction is doing the heavy lifting.&lt;/p&gt;

&lt;p&gt;The fix is to identify that direction and add it to (or subtract it from) the hidden states during the forward pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  The technique: contrast pairs and mean activations
&lt;/h2&gt;

&lt;p&gt;The basic recipe — documented in the activation-engineering literature; &lt;a href="https://arxiv.org/abs/2308.10248" rel="noopener noreferrer"&gt;Turner et al.&lt;/a&gt; is a reasonable starting point — looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick a behavior you want to steer (say, "formal" vs. "casual").&lt;/li&gt;
&lt;li&gt;Build two small sets of contrasting prompts.&lt;/li&gt;
&lt;li&gt;Run the model on both sets and capture the hidden state at a chosen layer.&lt;/li&gt;
&lt;li&gt;Take the mean activation of each set and subtract — that's your steering vector.&lt;/li&gt;
&lt;li&gt;Add a scaled version of that vector to the residual stream during generation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's how that looks in PyTorch with a &lt;a href="https://huggingface.co/docs/transformers/index" rel="noopener noreferrer"&gt;HuggingFace Transformers&lt;/a&gt; model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-open-weight-model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pick a mid-to-late layer. Earlier = more abstract, later = more surface.
&lt;/span&gt;&lt;span class="n"&gt;LAYER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;
&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LAYER&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;captured&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

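&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;grab_hidden&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# many decoder layers return a tuple with the residual stream tensor first;
&lt;/span&gt;    &lt;span class="c1"&gt;# some transformers versions return the bare tensor, so handle both
&lt;/span&gt;    &lt;span class="n"&gt;hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
    &lt;span class="n"&gt;captured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detach&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# mean over sequence
&lt;/span&gt;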
&lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_forward_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grab_hidden&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;acts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;captured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;acts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;captured&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acts&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;casual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hey, can you walk me through...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yo what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s up with...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok so basically...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;formal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please describe...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Could you elaborate on...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kindly explain...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;casual_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;casual&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;formal_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;formal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;steering&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;casual_mean&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;formal_mean&lt;/span&gt;  &lt;span class="c1"&gt;# direction: formal -&amp;gt; casual
&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



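&lt;p&gt;A few non-obvious bits. The hook grabs &lt;code&gt;out[0]&lt;/code&gt; because most HuggingFace decoder layers return a tuple; if the layer in your &lt;code&gt;transformers&lt;/code&gt; version returns a bare tensor instead, use &lt;code&gt;out&lt;/code&gt; directly, since indexing it would silently select the first batch element. Averaging over the sequence dimension throws away position info but gives you a single direction per prompt, which is usually enough for tone-style traits. A dozen contrast pairs is often plenty.&lt;/p&gt;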

&lt;h2&gt;
  
  
  Applying the vector during generation
&lt;/h2&gt;

&lt;p&gt;Now re-hook the same layer, but this time &lt;em&gt;add&lt;/em&gt; the steering vector to every forward pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SCALE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;4.0&lt;/span&gt;  &lt;span class="c1"&gt;# tune this. Too low = no effect. Too high = the model speaks in tongues.
&lt;/span&gt;
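&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;steer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# the layer may return a tuple or a bare tensor depending on the transformers version
&lt;/span&gt;    &lt;span class="n"&gt;is_tuple&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_tuple&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
    &lt;span class="c1"&gt;# broadcast across batch and sequence dims
&lt;/span&gt;    &lt;span class="n"&gt;steered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;SCALE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;steering&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steered&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_tuple&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;steered&lt;/span&gt;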

&lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_forward_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain how DNS resolution works.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time I ran this with &lt;code&gt;SCALE=10&lt;/code&gt;, it produced fluent-sounding gibberish about "vibing with the resolver." Cranking it down to 3-4 gave me a noticeably more casual register without breaking syntax. That tuning step is unavoidable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;p&gt;A few practical findings from running this across a handful of open-weight models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer choice matters more than vector quality.&lt;/strong&gt; Steering around 60-80% of the way through the network usually works best. Too early and the effect washes out; too late and you damage coherence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subtraction is as useful as addition.&lt;/strong&gt; Want the model to refuse less? Build a contrast pair of refusal vs. compliance and &lt;em&gt;subtract&lt;/em&gt; the refusal direction. Same math, opposite sign.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Effects compose, somewhat.&lt;/strong&gt; You can stack two steering vectors at different layers. Don't expect linearity, but it doesn't immediately collapse the model either.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small models are noisier.&lt;/strong&gt; Sub-3B models have less clean directional structure. I haven't tested this exhaustively across architectures but the pattern is consistent on the ones I've touched.&lt;/li&gt;
&lt;/ul&gt;
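&lt;p&gt;To turn the "60-80% of the way through" rule of thumb into concrete indices to sweep, something like the helper below is enough. &lt;code&gt;candidate_layers&lt;/code&gt; is a made-up name, and the assumption is that you read the layer count from the model config (on most HuggingFace decoder models that is &lt;code&gt;model.config.num_hidden_layers&lt;/code&gt;).&lt;/p&gt;

```python
def candidate_layers(num_layers, lo=0.6, hi=0.8):
    """Return the layer indices that sit between `lo` and `hi` of network depth."""
    first = int(num_layers * lo)
    # Clamp to the last valid index in case `hi` is set to 1.0.
    last = min(int(num_layers * hi), num_layers - 1)
    return list(range(first, last + 1))
```

&lt;p&gt;For a 32-layer model this gives layers 19 through 25, which is a small enough range to try each one with a fixed scale and compare outputs by hand.&lt;/p&gt;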

&lt;h2&gt;
  
  
  A debugging detour: when steering looks like it's working but isn't
&lt;/h2&gt;

&lt;p&gt;The most annoying failure mode I hit: the steered output &lt;em&gt;sounded&lt;/em&gt; right on cherry-picked prompts but had quietly destroyed instruction-following on anything multi-turn. The model would happily chat in the right tone and ignore the actual question.&lt;/p&gt;

&lt;p&gt;What helped was a simple before/after harness — run the same fifty prompts unsteered and steered, then eyeball the diffs. Tone shifts show up everywhere. Capability regressions show up as the model losing track of structure: forgetting JSON schemas, dropping list items, ignoring length constraints.&lt;/p&gt;

&lt;p&gt;If you see that pattern, your scale is too high or your layer is too late.&lt;/p&gt;
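&lt;p&gt;The before/after harness does not need to be fancy. A rough sketch, assuming you wrap generation in two callables that each take a prompt and return text (&lt;code&gt;generate_plain&lt;/code&gt; and &lt;code&gt;generate_steered&lt;/code&gt; are hypothetical names, as is &lt;code&gt;diff_outputs&lt;/code&gt;):&lt;/p&gt;

```python
import difflib

def diff_outputs(prompts, generate_plain, generate_steered):
    """Yield (prompt, unified diff text) for every prompt whose output changed."""
    for prompt in prompts:
        before = generate_plain(prompt).splitlines()
        after = generate_steered(prompt).splitlines()
        diff = list(difflib.unified_diff(
            before, after, fromfile="unsteered", tofile="steered", lineterm=""
        ))
        if diff:
            yield prompt, "\n".join(diff)
```

&lt;p&gt;A line-level diff makes structural regressions easy to spot: a dropped list item or a missing JSON key shows up as a removed line, while a pure tone shift shows up as rewritten lines of similar count.&lt;/p&gt;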

&lt;h2&gt;
  
  
  Prevention tips: don't ship this without guardrails
&lt;/h2&gt;

&lt;p&gt;Steering vectors are a power tool. A few things I'd insist on before putting one anywhere near production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate on a held-out set.&lt;/strong&gt; It's easy to overfit a steering vector to your contrast pairs and miss that it breaks long-form coherence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cap the scale.&lt;/strong&gt; Treat scale as a safety parameter, not a hyperparameter. Hard-cap it in code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log the unsteered output too.&lt;/strong&gt; During rollout, run both and diff them. You'll catch failure modes that pure eval won't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't steer for capabilities you couldn't already coax out with prompting.&lt;/strong&gt; If the model can't do the task at all, steering will produce confident nonsense, not a fix.&lt;/li&gt;
&lt;/ul&gt;
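&lt;p&gt;Hard-capping the scale can be a two-line guard. The &lt;code&gt;MAX_SCALE&lt;/code&gt; value and the &lt;code&gt;safe_scale&lt;/code&gt; name below are assumptions for illustration; the cap should come from your own sweep, somewhere under the point where output starts to degrade. Negative values are allowed because subtracting a direction is a legitimate use.&lt;/p&gt;

```python
MAX_SCALE = 6.0  # assumption: set this from your own sweep, not from this example

def safe_scale(requested, max_scale=MAX_SCALE):
    """Clamp a requested steering scale to the range [-max_scale, max_scale]."""
    return max(-max_scale, min(max_scale, float(requested)))
```

&lt;p&gt;Routing every scale through one function like this means a config typo cannot push the model into the gibberish regime in production.&lt;/p&gt;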

&lt;p&gt;Prompt engineering isn't going anywhere — it's the cheapest tool you've got. But when you hit the wall where the model's training is fighting your instructions, it's worth reaching for the layer where that fight is actually happening.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>llm</category>
    </item>
    <item>
      <title>Arxiv's Moderation Debate: Why Preprint Gatekeeping Is Hard</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sun, 17 May 2026 16:57:07 +0000</pubDate>
      <link>https://dev.to/alanwest/arxivs-moderation-debate-why-preprint-gatekeeping-is-hard-c34</link>
      <guid>https://dev.to/alanwest/arxivs-moderation-debate-why-preprint-gatekeeping-is-hard-c34</guid>
      <description>&lt;p&gt;I've been lurking on r/MachineLearning long enough to know that any thread mentioning Arxiv policy changes will spiral within the hour. The recent discussion about a proposed submission ban — reportedly a one-year restriction tied to certain categories of papers — is no exception. The thread title called the backlash "perplexing," and honestly, I get where the OP is coming from. But I also get why people are mad.&lt;/p&gt;

&lt;p&gt;Let me walk through what I think is actually happening here, what the tradeoffs look like, and why this conversation matters even if you don't publish papers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's going on with Arxiv (as best I can tell)
&lt;/h2&gt;

&lt;p&gt;I want to be upfront: I'm working from the Reddit discussion and the general public chatter, not an official Arxiv announcement I've personally read end-to-end. According to early reports, Arxiv has been tightening moderation in &lt;code&gt;cs.LG&lt;/code&gt; and adjacent categories, and there's been talk of restrictions targeting low-effort or AI-generated submissions. If you want the authoritative version, &lt;a href="https://info.arxiv.org/help/moderation/index.html" rel="noopener noreferrer"&gt;Arxiv's moderation page&lt;/a&gt; is the place to start.&lt;/p&gt;

&lt;p&gt;The specifics matter less than the pattern. Arxiv has been getting flooded. The cs.LG category alone gets a staggering volume of submissions now, and a non-trivial chunk of that is — let's be polite — &lt;em&gt;not great&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why some pushback feels reasonable
&lt;/h2&gt;

&lt;p&gt;There's a legitimate concern under the noise. Arxiv has historically been the great equalizer. A PhD student in Lagos and a researcher at DeepMind upload to the same place, and the work stands on its own merits. Any policy that adds friction risks rebuilding the gatekeeping that preprint servers were meant to bypass.&lt;/p&gt;

&lt;p&gt;The specific worries I keep seeing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Endorsement requirements&lt;/strong&gt; disadvantage researchers without established network connections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Category-specific bans&lt;/strong&gt; could be applied unevenly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Appeal processes&lt;/strong&gt; are notoriously opaque&lt;/li&gt;
&lt;li&gt;The line between "low quality" and "unfashionable but legitimate" is fuzzy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've ever had a paper desk-rejected for reasons that felt arbitrary, you understand the visceral reaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the backlash &lt;em&gt;is&lt;/em&gt; a bit perplexing
&lt;/h2&gt;

&lt;p&gt;Here's the thing though — I sympathize with Arxiv's moderators. I ran a small open-source project for a couple of years, and the volume of low-effort contributions during the LLM boom was honestly demoralizing. Imagine that, but you're responsible for filtering scientific literature.&lt;/p&gt;

&lt;p&gt;A few uncomfortable truths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The signal-to-noise ratio in &lt;code&gt;cs.LG&lt;/code&gt; has visibly degraded&lt;/li&gt;
&lt;li&gt;Survey papers with no original contribution have become a genre unto themselves&lt;/li&gt;
&lt;li&gt;LLM-generated "research" exists and is being submitted in volume&lt;/li&gt;
&lt;li&gt;Moderators are volunteers and academics, not a content moderation army&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're going to have a public scientific record, &lt;em&gt;someone&lt;/em&gt; has to filter it. The alternative is that Arxiv becomes Medium, but for math.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical analogy from the dev world
&lt;/h2&gt;

&lt;p&gt;This whole thing reminds me of when npm started cracking down on typosquatting and spam packages. Every time the registry tightened rules, there was an outcry about "gatekeeping the open ecosystem." Then, six months later, everyone quietly admitted the registry was better.&lt;/p&gt;

&lt;p&gt;Here's a tiny snippet from a moderation pipeline I built for a community submissions tool last year:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple heuristic-based pre-filter before human review
# Not perfect, but cuts the queue by ~60%
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;triage_submission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;submission&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="c1"&gt;# Length sanity check — too short usually means low effort
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;submission&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="c1"&gt;# Repetition check — LLM slop often repeats phrases
&lt;/span&gt;    &lt;span class="n"&gt;unique_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;submission&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;submission&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;unique_ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

    &lt;span class="c1"&gt;# Citation density — academic-style content cites things
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;submission&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;citation_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;auto_reject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;manual_review&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fast_track&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is crude. It's also better than nothing when you're drowning. Arxiv's moderators are doing a version of this, just with way higher stakes and way more pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this actually means for ML developers
&lt;/h2&gt;

&lt;p&gt;If you're building ML systems and not writing papers, why should you care? Because Arxiv is part of your infrastructure whether you realize it or not. The model card you're skimming, the technique you're implementing, the benchmark you're citing — most of that flows through Arxiv.&lt;/p&gt;

&lt;p&gt;Here's a quick utility I use to pull Arxiv metadata for tracking papers I want to reproduce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;arxiv&lt;/span&gt;  &lt;span class="c1"&gt;# pip install arxiv
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_paper_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arxiv_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arxiv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id_list&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;arxiv_id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
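    &lt;span class="c1"&gt;# Client.results() is the supported path; Search.results() is deprecated in recent arxiv releases
&lt;/span&gt;    &lt;span class="n"&gt;paper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arxiv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;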

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;paper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authors&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;authors&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abstract&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;paper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pdf_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;paper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# Useful for tracking which version you reproduced from
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;paper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entry_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;v&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;updated&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;paper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Always pin to a specific version when reproducing results
&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_paper_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2301.00000v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



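&lt;p&gt;The &lt;code&gt;arxiv&lt;/code&gt; package is a wrapper around the public Arxiv API, which is documented at &lt;a href="https://info.arxiv.org/help/api/index.html" rel="noopener noreferrer"&gt;info.arxiv.org/help/api&lt;/a&gt; if you want to integrate this seriously.&lt;/p&gt;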

&lt;h2&gt;
  
  
  The harder question nobody's answering
&lt;/h2&gt;

&lt;p&gt;The debate is framed as "open vs. gatekept," but I think the real question is: &lt;em&gt;what is Arxiv for now?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When it started, it was a way for physicists to share preprints faster than journal cycles allowed. Today it's the primary distribution channel for ML research, a citation graph backbone, and a de facto archive. Those are three different missions with three different optimal moderation policies. Trying to serve all of them with one policy is going to upset somebody no matter what.&lt;/p&gt;

&lt;h3&gt;
  
  
  A side note on platform identity
&lt;/h3&gt;

&lt;p&gt;This stuff isn't unique to academic platforms. Any service that grows past its original scope hits the same wall. I had to migrate a side project's auth a few months back because what started as "just let people log in" turned into account recovery, rate limiting, abuse prevention, and audit logs. Tools like &lt;a href="https://authon.dev" rel="noopener noreferrer"&gt;Authon&lt;/a&gt;, &lt;a href="https://clerk.com" rel="noopener noreferrer"&gt;Clerk&lt;/a&gt;, and &lt;a href="https://auth0.com" rel="noopener noreferrer"&gt;Auth0&lt;/a&gt; exist exactly because that complexity is real — Authon's free tier is unlimited users with no per-seat cost, which made the migration painless for an unfunded side project. The point is: platforms accumulate responsibility whether their maintainers planned for it or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd actually do if I were Arxiv
&lt;/h2&gt;

&lt;p&gt;A few things I'd push for, with the caveat that I'm an outsider with opinions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transparency reports&lt;/strong&gt;: publish moderation stats quarterly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clearer appeal paths&lt;/strong&gt; with stated SLAs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Category-specific policies&lt;/strong&gt; rather than blanket rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better tooling for endorsers&lt;/strong&gt; so the load doesn't fall on the same 50 people&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of that is sexy. None of it generates an r/MachineLearning thread with 800 comments. But it's the boring infrastructure work that keeps shared scientific resources functional.&lt;/p&gt;

&lt;p&gt;The backlash isn't perplexing to me — it's the predictable reaction when a free resource starts having to make tradeoffs that used to be invisible. That doesn't make the tradeoffs wrong. It just means the conversation we're actually having is about scarcity, and we haven't admitted that yet.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>research</category>
      <category>discuss</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why frontier LLMs solve your CTF challenges in minutes (and how to fix it)</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sun, 17 May 2026 15:52:11 +0000</pubDate>
      <link>https://dev.to/alanwest/why-frontier-llms-solve-your-ctf-challenges-in-minutes-and-how-to-fix-it-hd7</link>
      <guid>https://dev.to/alanwest/why-frontier-llms-solve-your-ctf-challenges-in-minutes-and-how-to-fix-it-hd7</guid>
      <description>&lt;p&gt;I ran a small internal CTF for our team last month. Twelve challenges, expected solve time around six hours for a strong player. The first three fell in under ten minutes — not because the players were geniuses, but because they pasted the prompt into an LLM and waited.&lt;/p&gt;

&lt;p&gt;This is not a rant about cheating. The same thing is happening in public CTFs, and it's exposing a real engineering problem: most CTF challenges were designed assuming the solver is a human reading a static artifact. Frontier models are extremely good at reading static artifacts. If you want challenges that still teach something in 2026, you have to design them differently.&lt;/p&gt;

&lt;p&gt;Here's the debugging walkthrough I went through after watching my own event get eaten.&lt;/p&gt;

&lt;h2&gt;
  
  
  The root cause: challenges that are pure pattern recognition
&lt;/h2&gt;

&lt;p&gt;Most "easy" and "medium" CTF problems share a shape. You get a file or an endpoint. You inspect it. You recognize a known scheme — XOR with a short key, a misuse of ECB mode, a path traversal, a weak JWT secret, a pickle deserialization. You apply the known counter and pull the flag.&lt;/p&gt;

&lt;p&gt;That shape is exactly what large language models trained on writeups handle effortlessly. There are tens of thousands of solved CTF writeups indexed on the public web. The model has seen the pattern, and it has seen the canonical exploit. Showing it your toy variant doesn't trip it up — it just fills in the blanks.&lt;/p&gt;

&lt;p&gt;Here's a stripped-down example of a challenge I used to think was clever:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Server side — a 'custom' XOR cipher
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urandom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 8-byte repeating key
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plaintext&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Hand the player a ciphertext of a known-format header + flag
&lt;/span&gt;&lt;span class="n"&gt;ciphertext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FLAG_FORMAT{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;flag_body&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The intended solution is known-plaintext recovery against the header, then decrypt the rest. A first-year security student should get it after some effort. A frontier model writes the solver in one shot because the pattern is famous. The challenge isn't testing what I thought it was testing.&lt;/p&gt;
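&lt;p&gt;For reference, the whole intended solve fits in a few lines. This sketch assumes the player knows the 12-byte &lt;code&gt;FLAG_FORMAT{&lt;/code&gt; header and the 8-byte key length from the challenge above; the helper names are mine.&lt;/p&gt;

```python
from itertools import cycle

KNOWN_HEADER = b"FLAG_FORMAT{"  # 12 known plaintext bytes, more than the key length
KEY_LEN = 8                     # assumed known (or guessed) by the solver


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key; the same call encrypts and decrypts."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def recover_key(ciphertext: bytes) -> bytes:
    """Known-plaintext attack: ciphertext XOR plaintext yields the key bytes."""
    pairs = zip(ciphertext[:KEY_LEN], KNOWN_HEADER[:KEY_LEN])
    return bytes(c ^ p for c, p in pairs)


def solve(ciphertext: bytes) -> bytes:
    """Recover the key from the header, then decrypt the full message."""
    return xor_bytes(ciphertext, recover_key(ciphertext))
```

&lt;p&gt;Nothing here needs insight beyond recognizing the pattern, which is exactly the problem.&lt;/p&gt;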

&lt;h2&gt;
  
  
  Why hardening the artifact doesn't help
&lt;/h2&gt;

&lt;p&gt;My first instinct was to obfuscate. Pack the binary. Strip symbols. Add anti-debugging. None of it works for very long, and worse, it makes the challenge less educational for humans while barely slowing the model down. The model isn't running your binary — it's reading it, and if the underlying algorithm is something it's seen before, it'll recognize it through layers of fluff.&lt;/p&gt;

&lt;p&gt;The issue isn't surface complexity. It's that the &lt;strong&gt;solution space&lt;/strong&gt; is in the training distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-step fix: design around what models are bad at
&lt;/h2&gt;

&lt;p&gt;After rebuilding my challenge set, the patterns that survived had three things in common.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Real-time stateful interaction
&lt;/h3&gt;

&lt;p&gt;If the challenge requires holding a TCP connection open, reacting to server timing, or responding within a window, you've moved out of "read the artifact" territory. The model has to plan and execute, not just generate. Agent harnesses are catching up here, but the failure rate is dramatically higher than for static problems.&lt;/p&gt;

&lt;p&gt;A basic shape that worked well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Challenge sends a nonce, expects a response within 200ms
&lt;/span&gt;    &lt;span class="c1"&gt;# The response must include a hash of (nonce + previous_response)
&lt;/span&gt;    &lt;span class="c1"&gt;# for the last N rounds — so the player must maintain state
&lt;/span&gt;    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;nonce&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;token_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
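        &lt;span class="c1"&gt;# Hex-encode so a raw 0x0a byte in the nonce cannot split the line
&lt;/span&gt;        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;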
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
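        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readline&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Missed the 200ms window: end the session without the flag
&lt;/span&gt;            &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="c1"&gt;# Validate against history chain (details omitted)
&lt;/span&gt;        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c1"&gt;# FLAG is a bytes constant defined elsewhere in the challenge server
&lt;/span&gt;    &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FLAG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;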
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model can write a client for this, but if it gets one round wrong it has to redo the entire session. Latency budget plus state chain catches a lot of one-shot attempts.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Custom protocols with no public writeups
&lt;/h3&gt;

&lt;p&gt;This is the boring answer but it's the most effective one. Invent the format. Don't reuse a well-known one and tweak it. The model's strength is recognizing what it's seen — if it has not seen your binary protocol because you made it up last Tuesday, it has to actually reason about the bytes.&lt;/p&gt;

&lt;p&gt;A pattern I like: define a small VM with three or four opcodes, give the player a program in that bytecode, and embed the bug in the VM semantics rather than in the program. The model can disassemble the program quickly. Figuring out that opcode &lt;code&gt;0x07&lt;/code&gt; has an off-by-one in the bounds check is much harder when there's no Stack Overflow answer about it.&lt;/p&gt;
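&lt;p&gt;To make that concrete, here is a toy version of the idea. Everything in it is hypothetical: the opcode numbers other than &lt;code&gt;0x07&lt;/code&gt;, the memory layout, and the names are invented for illustration, and a real challenge would hide the interpreter behind a service rather than hand it over.&lt;/p&gt;

```python
USER_CELLS = 4  # cells 0..3 are meant to be readable by the player's program
OP_PUSH, OP_LOAD, OP_HALT = 0x01, 0x07, 0xFF


class VMError(Exception):
    """Raised when a program does something the VM rejects."""


def run(program: bytes, secret: int) -> list[int]:
    """Run a program on a toy stack VM and return the final stack."""
    # The secret sits in the cell directly after the user-readable region.
    memory = [10, 20, 30, 40, secret]
    stack: list[int] = []
    pc = 0
    while len(program) > pc:
        op = program[pc]
        if op == OP_HALT:
            break
        arg = program[pc + 1]  # every non-HALT opcode takes one operand byte
        if op == OP_PUSH:
            stack.append(arg)
        elif op == OP_LOAD:
            # Deliberate bug: the check should reject arg >= USER_CELLS,
            # so index 4 slips through and reads the secret cell.
            if arg > USER_CELLS:
                raise VMError(f"load out of bounds: {arg}")
            stack.append(memory[arg])
        else:
            raise VMError(f"unknown opcode: {op:#04x}")
        pc += 2
    return stack
```

&lt;p&gt;The program the player receives can be completely benign. The flag only falls out once they notice that &lt;code&gt;LOAD 4&lt;/code&gt; is accepted when it should not be.&lt;/p&gt;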

&lt;h3&gt;
  
  
  3. Multi-stage chains where each stage gates the next
&lt;/h3&gt;

&lt;p&gt;Single-shot problems are the model's home turf. Chains that require pivoting — get RCE here, find creds, use them to query an internal service, leak a key, sign a token — multiply the chance of a mid-chain failure. Each step needs to feed the next, and the model has to keep its context coherent across all of them.&lt;/p&gt;

&lt;p&gt;The practical trick is making the intermediate outputs noisy. If stage 1 produces a clean string that says &lt;code&gt;next_password: hunter2&lt;/code&gt;, the model marches on. If stage 1 produces a memory dump where the password is one of forty plausible candidates, the model often picks the wrong one and the chain breaks silently.&lt;/p&gt;
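&lt;p&gt;A rough sketch of how a noisy stage output could be generated, assuming the real password is itself a random-looking string (a dictionary word like &lt;code&gt;hunter2&lt;/code&gt; would stand out among random decoys). The function names, the fake offsets, and the dump layout are made up for illustration:&lt;/p&gt;

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits


def random_candidate(length: int) -> str:
    """Generate one decoy with the same length and alphabet as the real password."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def noisy_dump(real_password: str, decoys: int = 39) -> str:
    """Return a fake memory dump hiding the real password among unique decoys."""
    candidates = {real_password}
    while decoys + 1 > len(candidates):
        candidates.add(random_candidate(len(real_password)))
    shuffled = list(candidates)
    secrets.SystemRandom().shuffle(shuffled)
    # Prefix each candidate with a random fake offset so position gives nothing away.
    return "\n".join(f"{secrets.randbelow(2**32):08x}: {c}" for c in shuffled)
```

&lt;p&gt;The solver then needs something from a later stage, such as a checksum or a length hint, to tell the real entry from the decoys. That is the step where one-shot attempts tend to guess.&lt;/p&gt;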

&lt;h2&gt;
  
  
  Prevention: a checklist before you ship a challenge
&lt;/h2&gt;

&lt;p&gt;When I review a new challenge now, I run it past a frontier model myself with a deliberately weak prompt — something like "solve this CTF challenge, here are the files." If it gets the flag on the first or second attempt, the challenge isn't ready. Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the writeup for the &lt;em&gt;intended&lt;/em&gt; solution exist on the public web for a near-identical problem? If yes, redesign.&lt;/li&gt;
&lt;li&gt;Can the entire solution be derived from a single static snapshot? If yes, add interaction or state.&lt;/li&gt;
&lt;li&gt;Does the challenge require any &lt;em&gt;novel&lt;/em&gt; reasoning, or is it pattern-matching a known vuln class? If pattern-matching, you're really testing recall, not skill.&lt;/li&gt;
&lt;li&gt;Is there a tight latency or rate constraint? Even a 500ms response window changes the game.&lt;/li&gt;
&lt;li&gt;Are intermediate stages noisy enough that the wrong answer is plausibly correct?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is bulletproof. Models keep getting better, and harnesses for agentic exploitation are improving fast. But the framing shift matters more than any specific technique: stop designing for the solo human reader, and start designing for an adversary that has memorized every public writeup but struggles to plan across long interactive sessions.&lt;/p&gt;

&lt;p&gt;If you run CTFs, the format isn't dead — but the lazy version of it is. The good news is that the challenges that survive this filter are also the ones that teach the most. Forcing yourself to write something a model hasn't seen tends to push you toward more interesting problems anyway.&lt;/p&gt;

&lt;p&gt;I haven't run a fully model-resistant event yet — six months from now this advice may already be stale. But the direction of travel is clear, and the cost of redesigning a challenge set is much lower than the cost of running an event where half the leaderboard is just whoever pasted fastest.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ctf</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why your AI agent code turns into spaghetti — and how to untangle it</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sun, 17 May 2026 15:39:27 +0000</pubDate>
      <link>https://dev.to/alanwest/why-your-ai-agent-code-turns-into-spaghetti-and-how-to-untangle-it-5c76</link>
      <guid>https://dev.to/alanwest/why-your-ai-agent-code-turns-into-spaghetti-and-how-to-untangle-it-5c76</guid>
      <description>&lt;h2&gt;
  
  
  The 3am pager that changed how I write agents
&lt;/h2&gt;

&lt;p&gt;A few months back, I shipped what I thought was a clean agent for a client. It scraped web pages, summarized them, then routed the results to different downstream tools based on content. Worked great in dev. Worked great for the first week.&lt;/p&gt;

&lt;p&gt;Then I got paged at 3am.&lt;/p&gt;

&lt;p&gt;The agent had gotten into a loop. One of the tools timed out, returned a partial response, the LLM "decided" the task wasn't done, called the same tool again, got another partial response, and so on. By the time I caught it, we'd burned through about 4,000 API calls overnight.&lt;/p&gt;

&lt;p&gt;The fix wasn't fun. The agent logic was scattered across &lt;code&gt;if&lt;/code&gt; statements, retry decorators, prompt templates, and a &lt;code&gt;while&lt;/code&gt; loop that was supposed to terminate when the LLM said "DONE". Spoiler: it sometimes did not say DONE.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause: imperative code + stochastic calls = chaos
&lt;/h2&gt;

&lt;p&gt;The mistake I keep seeing (and keep making) is treating an LLM call like any other function. It's not. A typical function returns the same output for the same input. An LLM call returns &lt;em&gt;probable&lt;/em&gt; output, and in an agent that output drives control flow.&lt;/p&gt;

&lt;p&gt;When you mix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;imperative control flow (&lt;code&gt;if/else&lt;/code&gt;, &lt;code&gt;while&lt;/code&gt;, recursion)&lt;/li&gt;
&lt;li&gt;stochastic decisions (the model "decides" the next step)&lt;/li&gt;
&lt;li&gt;side effects (tool calls, DB writes, API requests)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...without any structural boundary between them, you get code where you can't reason about termination, retries, or partial state.&lt;/p&gt;

&lt;p&gt;Here's the kind of thing I'm talking about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# the footgun
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DONE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="c1"&gt;# if neither branch hits, we loop forever
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is the loop condition &lt;em&gt;and&lt;/em&gt; the body. There's no separation between "what step am I in?" and "what does the model want next?". If the model gets confused, your program gets confused.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: separate the planner from the executor
&lt;/h2&gt;

&lt;p&gt;The first refactor that actually helped: split the model's role into two distinct jobs, and never let them run in the same loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Planner: produces a static plan from the task. One LLM call.
&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;planner_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# returns a list of {step, tool, args}
&lt;/span&gt;
&lt;span class="c1"&gt;# Executor: walks the plan deterministically.
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# bail to a reviewer, don't keep guessing
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the loop is a regular &lt;code&gt;for&lt;/code&gt; over a finite list. The model is no longer driving control flow at runtime — it built the plan once, up front. If something goes wrong, you have a concrete plan you can inspect, edit, or re-run.&lt;/p&gt;

&lt;p&gt;The tradeoff: you lose adaptive replanning. The model can't react to a tool's output mid-flight. For roughly 70% of the agent workloads I've built, this is fine. For the other 30%, you need replanning — which leads to step 2.&lt;/p&gt;
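
&lt;p&gt;A runnable version of the planner/executor split might look like the sketch below. The &lt;code&gt;TOOLS&lt;/code&gt; registry, the &lt;code&gt;StepResult&lt;/code&gt; type, and the canned plan returned by &lt;code&gt;planner_llm&lt;/code&gt; are all stand-ins I made up so the example runs without a model; a real planner would make one LLM call and return the same shape.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StepResult:
    ok: bool
    value: Any = None
    error: str = ""


# Hypothetical tool registry; real tools would hit the network or a database.
TOOLS: dict[str, Callable[..., Any]] = {
    "fetch": lambda url: f"contents of {url}",
    "summarize": lambda text: text[:20],
}


def planner_llm(task: str) -> list[dict]:
    # Stand-in for the single planning call; a real one would ask the model for this list.
    return [
        {"step": 1, "tool": "fetch", "args": {"url": task}},
        {"step": 2, "tool": "summarize", "args": {"text": "placeholder text to shorten"}},
    ]


def run_step(step: dict) -> StepResult:
    tool = TOOLS.get(step["tool"])
    if tool is None:
        return StepResult(ok=False, error=f"unknown tool: {step['tool']}")
    try:
        return StepResult(ok=True, value=tool(**step["args"]))
    except Exception as exc:  # a failed tool becomes data, not a crash
        return StepResult(ok=False, error=str(exc))


def execute(plan: list[dict]) -> list[StepResult]:
    results = []
    for step in plan:  # finite list, so the loop always terminates
        result = run_step(step)
        results.append(result)
        if not result.ok:
            break  # stop and hand off instead of guessing
    return results
```

&lt;p&gt;Because &lt;code&gt;execute&lt;/code&gt; returns every result up to the first failure, whatever reviews the run can see exactly which step broke and why.&lt;/p&gt;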

&lt;h2&gt;
  
  
  Step 2: make the state machine explicit
&lt;/h2&gt;

&lt;p&gt;For the replanning case, the trick is to stop pretending your agent is a chatbot. It's a state machine. Make the states real:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;STATES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;executing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;planner_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# a fresh plan starts at its first step
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;executing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;executing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# let the reviewer decide what to do
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;executing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reviewer_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "done" | "replan" | "fail"
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can:&lt;/p&gt;

&lt;ul&gt;
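&lt;li&gt;Cap total iterations (count transitions in the driver loop and stop at &lt;code&gt;MAX_STEPS&lt;/code&gt;; an explicit check is safer than &lt;code&gt;assert&lt;/code&gt;, which Python strips under &lt;code&gt;-O&lt;/code&gt;)&lt;/li&gt;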
&lt;li&gt;Persist &lt;code&gt;ctx&lt;/code&gt; between steps so you can resume after a crash&lt;/li&gt;
&lt;li&gt;Log every transition, which makes 3am debugging tractable&lt;/li&gt;
&lt;li&gt;Restrict which LLM calls can happen in which state (no surprise tool calls during review)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the pattern I wish someone had shown me two years ago. It's the same idea as Erlang's &lt;code&gt;gen_statem&lt;/code&gt;, or any workflow engine: separate "what state am I in" from "what should the model do here".&lt;/p&gt;
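
&lt;p&gt;The snippet above never shows the loop that calls &lt;code&gt;step&lt;/code&gt;, so here is one way it could look. This is a sketch under my own assumptions: the &lt;code&gt;Ctx&lt;/code&gt; dataclass, the &lt;code&gt;drive&lt;/code&gt; function, and the &lt;code&gt;MAX_TRANSITIONS&lt;/code&gt; budget are hypothetical names, and &lt;code&gt;step&lt;/code&gt; is passed in so any function with the same &lt;code&gt;(state, ctx)&lt;/code&gt; signature works.&lt;/p&gt;

```python
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

TERMINAL = {"done", "failed"}
MAX_TRANSITIONS = 20  # hypothetical budget; tune per workload


@dataclass
class Ctx:
    task: str
    plan: list = field(default_factory=list)
    cursor: int = 0
    transitions: int = 0


def drive(step, ctx: Ctx, state: str = "planning") -> str:
    """Call step(state, ctx) until a terminal state or the budget is spent."""
    while state not in TERMINAL:
        if ctx.transitions >= MAX_TRANSITIONS:
            log.warning("transition budget exhausted in state %s", state)
            return "failed"
        next_state = step(state, ctx)
        ctx.transitions += 1
        # One log line per transition is what makes a run replayable later.
        log.info("%s -> %s (cursor=%d)", state, next_state, ctx.cursor)
        state = next_state
    return state
```

&lt;p&gt;The budget lives in the driver rather than in any single state, so a planner and reviewer that keep bouncing work back and forth still hit a hard stop.&lt;/p&gt;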

&lt;h2&gt;
  
  
  Step 3: constrain the model's output, don't parse it
&lt;/h2&gt;

&lt;p&gt;The other class of bug that ate hours of my life: the model returns something &lt;em&gt;almost&lt;/em&gt; right and the parser silently fails or hallucinates a tool call.&lt;/p&gt;

&lt;p&gt;The fix is structured output. Most providers now support a JSON schema constraint at the API level. Use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finish&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ask_user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# response.action is guaranteed to be one of three strings.
# No more "DONE" / "Done" / "done." / "I am done." branching.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you can't use schema-constrained output (some older models don't support it), at minimum validate with &lt;code&gt;pydantic&lt;/code&gt; or &lt;code&gt;zod&lt;/code&gt; &lt;em&gt;before&lt;/em&gt; doing anything with the result, and treat validation failure as a known state, not an exception.&lt;/p&gt;
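
&lt;p&gt;As a rough illustration of "validation failure as a known state", the sketch below checks raw model text with only the standard library. It is a stand-in for what &lt;code&gt;pydantic&lt;/code&gt; would do more thoroughly; &lt;code&gt;parse_action&lt;/code&gt; and the &lt;code&gt;"invalid"&lt;/code&gt; action are names I invented for the example.&lt;/p&gt;

```python
import json

ALLOWED_ACTIONS = {"call_tool", "finish", "ask_user"}


def parse_action(raw: str) -> dict:
    """Return a validated action dict, or {"action": "invalid", "reason": ...}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"action": "invalid", "reason": f"not JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"action": "invalid", "reason": "top level is not an object"}
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return {"action": "invalid", "reason": f"unknown action: {action!r}"}
    if action == "call_tool":
        # The schema only requires "action", so check the tool fields here.
        if not isinstance(data.get("tool"), str) or not isinstance(data.get("args", {}), dict):
            return {"action": "invalid", "reason": "call_tool needs a string tool and object args"}
    return data
```

&lt;p&gt;The caller can then route &lt;code&gt;"invalid"&lt;/code&gt; like any other outcome, for example by sending the run to the reviewing state, instead of wrapping every model call in &lt;code&gt;try/except&lt;/code&gt;.&lt;/p&gt;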

&lt;h2&gt;
  
  
  Prevention: a checklist I now run before shipping
&lt;/h2&gt;

&lt;p&gt;After getting bitten enough times, I keep this taped to the side of my monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bounded iterations.&lt;/strong&gt; Every loop that contains an LLM call has a hard cap. No &lt;code&gt;while True&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit states.&lt;/strong&gt; If I can't draw the state diagram on a napkin, the agent is too complex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured output.&lt;/strong&gt; Every model response that drives control flow is schema-validated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotent tools.&lt;/strong&gt; Tool calls assume they may be retried. Side effects are keyed by request ID.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability first.&lt;/strong&gt; Every state transition is logged with the input/output of the LLM call. If I can't replay it, I can't debug it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tested failure modes.&lt;/strong&gt; I have integration tests where the model returns garbage, times out, or returns a tool call to a non-existent tool. The agent should fail gracefully, not loop.&lt;/li&gt;
&lt;/ul&gt;
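
&lt;p&gt;The "idempotent tools" item is the one people ask about most, so here is a minimal sketch of keying side effects by request ID. The &lt;code&gt;IdempotentTool&lt;/code&gt; wrapper is hypothetical, and the in-memory dict is an assumption made for brevity; anything that has to survive a restart would need a persistent store.&lt;/p&gt;

```python
from typing import Any, Callable


class IdempotentTool:
    """Wrap a side-effecting function so a retried request ID never runs it twice."""

    def __init__(self, fn: Callable[..., Any]):
        self._fn = fn
        self._results: dict[str, Any] = {}  # in-memory only; persist this in a real system

    def __call__(self, request_id: str, **kwargs: Any) -> Any:
        if request_id in self._results:
            return self._results[request_id]  # replay the stored result, skip the side effect
        result = self._fn(**kwargs)
        self._results[request_id] = result
        return result
```

&lt;p&gt;If the executor derives the request ID from the plan step rather than from the retry attempt, a retried step returns the first result instead of sending a second email or writing a second row.&lt;/p&gt;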

&lt;p&gt;The 3am pager hasn't happened again. The agents look a lot less impressive from the outside — they're boring state machines now instead of dramatic recursive loops — but they actually work. The interesting work moved into the planner and reviewer prompts, which is where it belonged all along.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>debugging</category>
    </item>
    <item>
      <title>Why npm supply chain attacks keep happening and how to harden your installs</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sun, 17 May 2026 15:36:34 +0000</pubDate>
      <link>https://dev.to/alanwest/why-npm-supply-chain-attacks-keep-happening-and-how-to-harden-your-installs-97p</link>
      <guid>https://dev.to/alanwest/why-npm-supply-chain-attacks-keep-happening-and-how-to-harden-your-installs-97p</guid>
      <description>&lt;h2&gt;
  
  
  When &lt;code&gt;npm install&lt;/code&gt; becomes a security event
&lt;/h2&gt;

&lt;p&gt;Look, I love npm. I've been shipping JavaScript for years and the ecosystem is genuinely incredible. But every few months we get another headline: a popular package gets hijacked, a maintainer's token leaks, a typosquatted package siphons environment variables for a week before anyone notices.&lt;/p&gt;

&lt;p&gt;The frustrating part? The advice is always the same — "be careful what you install" — as if you're supposed to audit 1,200 transitive dependencies before every deploy.&lt;/p&gt;

&lt;p&gt;Let me walk through what actually causes these incidents and what you can do at the project level. None of this is bulletproof, but the gap between a default &lt;code&gt;npm install&lt;/code&gt; and a reasonably hardened install is bigger than most people realize.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the npm threat model is so messy
&lt;/h3&gt;

&lt;p&gt;A typical Node project lists maybe 30 dependencies in &lt;code&gt;package.json&lt;/code&gt;. Your &lt;code&gt;node_modules&lt;/code&gt; ends up with 1,500. Every one of those packages can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run arbitrary code at install time via &lt;code&gt;preinstall&lt;/code&gt;, &lt;code&gt;install&lt;/code&gt;, and &lt;code&gt;postinstall&lt;/code&gt; scripts&lt;/li&gt;
&lt;li&gt;Get hijacked if the maintainer's account is phished or their token leaks&lt;/li&gt;
&lt;li&gt;Be replaced with a malicious version when ownership transfers to a new maintainer&lt;/li&gt;
&lt;li&gt;Be typosquatted (&lt;code&gt;lodahs&lt;/code&gt; vs &lt;code&gt;lodash&lt;/code&gt;) and copy-pasted into a Dockerfile at 2am&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The kicker: most of this happens &lt;em&gt;before&lt;/em&gt; your tests run, &lt;em&gt;before&lt;/em&gt; your linter runs, &lt;em&gt;before&lt;/em&gt; any review. The moment you run &lt;code&gt;npm install&lt;/code&gt;, you've already executed whatever code the package author wanted to run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Stop running install scripts by default
&lt;/h3&gt;

&lt;p&gt;This is the highest-leverage change you can make. Drop this in your project &lt;code&gt;.npmrc&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# Disable preinstall/install/postinstall scripts for every install in this project
&lt;/span&gt;&lt;span class="py"&gt;ignore-scripts&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or pass it explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--ignore-scripts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



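&lt;p&gt;Yes, this breaks packages with legitimate install-time build steps, typically native addons such as &lt;code&gt;sharp&lt;/code&gt; or &lt;code&gt;better-sqlite3&lt;/code&gt; that build through &lt;code&gt;node-gyp&lt;/code&gt; or fetch a prebuilt binary. The workaround is to enable scripts only for the packages you actually trust. npm has no built-in allowlist, but you can rebuild specific packages after install. If &lt;code&gt;ignore-scripts=true&lt;/code&gt; lives in your &lt;code&gt;.npmrc&lt;/code&gt;, it likely applies to &lt;code&gt;npm rebuild&lt;/code&gt; as well, so you may need to add &lt;code&gt;--ignore-scripts=false&lt;/code&gt; to the rebuild command; check the behavior on your npm version:&lt;br&gt;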
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--ignore-scripts&lt;/span&gt;
&lt;span class="c"&gt;# Rebuild only the native deps you trust&lt;/span&gt;
npm rebuild sharp better-sqlite3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I started doing this last year on a fintech project and the friction is real — but it's a one-time setup cost per project, and it shuts down the most common payload-delivery path in npm supply chain incidents.&lt;/p&gt;
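
&lt;p&gt;To find out which packages would be affected, you can read the lockfile instead of guessing. The sketch below assumes a &lt;code&gt;package-lock.json&lt;/code&gt; in the npm 7+ layout (lockfileVersion 2 or 3), where entries under &lt;code&gt;packages&lt;/code&gt; carry a &lt;code&gt;hasInstallScript&lt;/code&gt; flag; the function name is my own.&lt;/p&gt;

```python
import json
from pathlib import Path


def packages_with_install_scripts(lock: dict) -> list[str]:
    """Return names of locked packages that npm flagged as having install scripts."""
    flagged = []
    for path, meta in lock.get("packages", {}).items():
        # The empty key is the root project itself, so skip it.
        if path and meta.get("hasInstallScript"):
            # Keys look like "node_modules/a/node_modules/b"; keep the last package name.
            flagged.append(path.split("node_modules/")[-1])
    return sorted(set(flagged))


if __name__ == "__main__":
    lockfile = Path("package-lock.json")
    if lockfile.exists():
        for name in packages_with_install_scripts(json.loads(lockfile.read_text())):
            print(name)
```

&lt;p&gt;The output is a candidate list for your rebuild step, not a trust decision. Each name still deserves a look before it goes on the list.&lt;/p&gt;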

&lt;h3&gt;
  
  
  Step 2: Use &lt;code&gt;npm ci&lt;/code&gt; everywhere except local dev
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;npm install&lt;/code&gt; is allowed to update your lockfile. In CI, that's a footgun. If the lockfile is missing or out of sync with &lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;npm install&lt;/code&gt; re-resolves versions and happily picks up whatever was published most recently, including a compromised patch release.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm ci&lt;/code&gt; does the opposite: it installs strictly from &lt;code&gt;package-lock.json&lt;/code&gt; and errors out if the lockfile is out of sync with &lt;code&gt;package.json&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your Dockerfile, GitHub Actions, etc.&lt;/span&gt;
npm ci &lt;span class="nt"&gt;--ignore-scripts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the &lt;a href="https://docs.npmjs.com/cli/v10/commands/npm-ci" rel="noopener noreferrer"&gt;official docs&lt;/a&gt; for the full behavior. Combine this with pinned deps and you've removed the silent version-drift attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Audit the lockfile, not just package.json
&lt;/h3&gt;

&lt;p&gt;Most code reviews focus on &lt;code&gt;package.json&lt;/code&gt; because it's small. But the lockfile diff tells the real story — a one-line &lt;code&gt;package.json&lt;/code&gt; change can introduce 80 new transitive dependencies.&lt;/p&gt;

&lt;p&gt;When reviewing a PR that touches deps, look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New top-level packages you don't recognize&lt;/li&gt;
&lt;li&gt;Packages with very recent first-publish dates&lt;/li&gt;
&lt;li&gt;Packages with one maintainer and millions of downloads (single-point-of-failure targets)&lt;/li&gt;
&lt;li&gt;Suspicious names (typos, hyphenation tricks like &lt;code&gt;crossenv&lt;/code&gt; vs &lt;code&gt;cross-env&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll be honest: nobody manually does this for every PR. That's what &lt;code&gt;npm audit&lt;/code&gt; and Dependabot are for, but those mostly catch &lt;em&gt;known&lt;/em&gt; vulnerabilities, not a malicious version that was published an hour ago. The human eyeball check on lockfile diffs is still valuable for any load-bearing dep.&lt;/p&gt;
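
&lt;p&gt;A raw lockfile diff is hard to read, so a small summary script can help the eyeball check. This sketch assumes both lockfiles use the npm 7+ &lt;code&gt;packages&lt;/code&gt; layout and have already been loaded with &lt;code&gt;json.load&lt;/code&gt;; the function names are hypothetical.&lt;/p&gt;

```python
def added_packages(old_lock: dict, new_lock: dict) -> dict[str, str]:
    """Map each newly locked path to its version (lockfileVersion 2 or 3 layout)."""
    old_paths = set(old_lock.get("packages", {}))
    added = {}
    for path, meta in new_lock.get("packages", {}).items():
        if path and path not in old_paths:
            added[path] = meta.get("version", "?")
    return added


def changed_versions(old_lock: dict, new_lock: dict) -> dict[str, tuple[str, str]]:
    """Map each path present in both lockfiles to (old, new) when the version moved."""
    changed = {}
    old_packages = old_lock.get("packages", {})
    for path, meta in new_lock.get("packages", {}).items():
        before = old_packages.get(path)
        if path and before is not None and before.get("version") != meta.get("version"):
            changed[path] = (before.get("version", "?"), meta.get("version", "?"))
    return changed
```

&lt;p&gt;Running both functions against the base branch and PR lockfiles gives a short list of new and moved packages, which is the part worth a human look.&lt;/p&gt;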

&lt;h3&gt;
  
  
  Step 4: Verify package provenance
&lt;/h3&gt;

&lt;p&gt;npm added &lt;a href="https://docs.npmjs.com/generating-provenance-statements" rel="noopener noreferrer"&gt;provenance attestations&lt;/a&gt; via Sigstore back in 2023. When a package is published from a CI pipeline with provenance enabled, you can verify which repo and which workflow built it.&lt;/p&gt;

&lt;p&gt;You can inspect provenance from the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm view &amp;lt;package-name&amp;gt; &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;span class="c"&gt;# Look for the "attestations" field in the dist block&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



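&lt;p&gt;Packages with provenance are cryptographically tied to a public source repo and a specific build, which makes the "I phished the maintainer and published from my laptop" attack much harder. To verify rather than just inspect, &lt;code&gt;npm audit signatures&lt;/code&gt; checks the registry signatures and provenance attestations of your installed packages.&lt;/p&gt;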
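
&lt;p&gt;If you want to script the inspection step, the sketch below pulls the provenance block out of the JSON that &lt;code&gt;npm view PKG --json&lt;/code&gt; prints. It assumes the output is a single object with a &lt;code&gt;dist.attestations.provenance&lt;/code&gt; field, which is what I have seen for single-version lookups; &lt;code&gt;provenance_info&lt;/code&gt; is a made-up helper, and finding the field only tells you an attestation was published, not that it was verified.&lt;/p&gt;

```python
import json


def provenance_info(view_json: str):
    """Return the provenance block from `npm view PKG --json` output, or None."""
    meta = json.loads(view_json)
    # Packages published without provenance have no "attestations" entry under "dist".
    attestations = meta.get("dist", {}).get("attestations") or {}
    return attestations.get("provenance")
```

&lt;p&gt;A &lt;code&gt;None&lt;/code&gt; result is a prompt to look closer at that dependency, not proof that anything is wrong with it.&lt;/p&gt;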

&lt;p&gt;It's not universal, and most packages still don't ship with provenance. For your &lt;em&gt;own&lt;/em&gt; publishes, though, enabling it is one flag, as long as you publish from a supported CI provider such as GitHub Actions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm publish &lt;span class="nt"&gt;--provenance&lt;/span&gt; &lt;span class="nt"&gt;--access&lt;/span&gt; public
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Pin, proxy, and contain
&lt;/h3&gt;

&lt;p&gt;A few additional defenses worth setting up once and forgetting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pin exact versions&lt;/strong&gt; in critical projects. Drop the &lt;code&gt;^&lt;/code&gt; and &lt;code&gt;~&lt;/code&gt; in &lt;code&gt;package.json&lt;/code&gt;. You give up automatic patch updates, but you also stop new patch releases from running in prod five minutes after publish. Pinning in &lt;code&gt;package.json&lt;/code&gt; only covers direct dependencies; the lockfile is what holds transitive versions in place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a private registry or proxy.&lt;/strong&gt; &lt;a href="https://verdaccio.org/" rel="noopener noreferrer"&gt;Verdaccio&lt;/a&gt; is a widely used open-source option. It lets you cache, mirror, and gate which versions reach your team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run installs in a sandboxed environment.&lt;/strong&gt; A locked-down container with no network egress except to the registry is a good starting point. If a postinstall script tries to phone home, the connection fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate an SBOM.&lt;/strong&gt; &lt;a href="https://github.com/CycloneDX/cyclonedx-node-npm" rel="noopener noreferrer"&gt;CycloneDX&lt;/a&gt; has a free npm plugin. It won't stop an attack, but it makes the post-incident question "are we exposed to package X at version Y?" answerable in seconds instead of hours.&lt;/li&gt;
&lt;/ul&gt;
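
&lt;p&gt;Pinning is easy to enforce with a small check in CI. The sketch below flags any dependency spec in a parsed &lt;code&gt;package.json&lt;/code&gt; that is not a single exact version. The regex is a simplified take on semver and the function name is hypothetical; note that it also flags tags, git URLs, and &lt;code&gt;file:&lt;/code&gt; specs, which may or may not be what you want.&lt;/p&gt;

```python
import re

# Simplified exact-version pattern: MAJOR.MINOR.PATCH with optional prerelease and build parts.
EXACT = re.compile(r"^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$")


def unpinned_dependencies(package_json: dict) -> dict[str, str]:
    """Return dependencies whose spec is not a single exact semver version."""
    loose = {}
    for section in ("dependencies", "devDependencies", "optionalDependencies"):
        for name, spec in package_json.get(section, {}).items():
            if not EXACT.match(spec):
                loose[name] = spec
    return loose
```

&lt;p&gt;Failing the build when this returns a non-empty dict keeps a stray &lt;code&gt;^&lt;/code&gt; from sneaking back in through a routine &lt;code&gt;npm install some-package&lt;/code&gt;.&lt;/p&gt;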

&lt;h3&gt;
  
  
  What none of this fixes
&lt;/h3&gt;

&lt;p&gt;Let me be straight: there is no configuration that makes &lt;code&gt;npm install&lt;/code&gt; safe in an absolute sense. The trust model is fundamentally "we run code from strangers." Every defense above raises the cost of an attack — it doesn't eliminate it.&lt;/p&gt;

&lt;p&gt;The realistic goal is layered. Reduce the blast radius (no install scripts). Slow down bad updates (lockfile + pinning). Increase visibility (lockfile review, provenance, SBOMs). Contain damage if something gets through (sandboxed installs, no secrets in the build env).&lt;/p&gt;

&lt;p&gt;If you only do one thing this week, set &lt;code&gt;ignore-scripts=true&lt;/code&gt; in your project &lt;code&gt;.npmrc&lt;/code&gt; and figure out which native packages legitimately need to be rebuilt. That single change cuts off the most common payload-delivery path in real-world incidents.&lt;/p&gt;

&lt;p&gt;The "no way to prevent this" framing is funny because it's half true — you can't prevent compromised packages from being &lt;em&gt;published&lt;/em&gt;. But you absolutely can prevent them from &lt;em&gt;executing&lt;/em&gt; in your build environment. The defaults are bad. Your project doesn't have to inherit them.&lt;/p&gt;

</description>
      <category>npm</category>
      <category>security</category>
      <category>javascript</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
