DEV Community: Alan West

How to escape note-taking lock-in with plain markdown and git

Alan West — Tue, 19 May 2026 00:32:44 +0000

When your notes outlive your note-taking app

A few months ago I tried to export 4 years of notes from a popular note-taking app. The export gave me a .zip of "markdown" files — except every link was rewritten to use the app's proprietary [[uuid-7f3a...]] syntax, every attachment was renamed to a hash, and frontmatter was packed with app-specific fields nothing else could parse.

I'd been telling myself "it's just markdown, I can leave whenever." Turns out I couldn't. Not without spending a weekend writing a migration script.

This isn't a rant about that one app. It's a problem-solving article about a pattern I've watched bite developers over and over: trusting that the "open format" sticker on a tool means your data is portable. Below is how to set up a notes system that's actually portable — and how to verify it stays that way.

The root cause: proprietary syntax inside open file extensions

The trick almost every note-taking app pulls is this:

Files are saved as .md. Marketing says "your notes are just markdown."
But the content uses app-specific extensions: custom block IDs, embeds, callouts, query languages, plugin metadata.
Open the file in a plain editor and you'll see roughly 60% standard markdown and 40% syntax that looks like markdown but isn't.

Standard CommonMark and GitHub Flavored Markdown are well-defined specs. Anything outside those is, technically, just text the app happens to render specially.

When you try to migrate, the new tool reads the file fine — and silently drops everything that isn't standard markdown. Links break. Embeds disappear. Math blocks lose half their content. The migration looks successful right up until you actually try to use the imported notes.

Step 1: Set boundaries with a vault structure

The fix is to treat your notes like a small codebase. Plain markdown, folders for organization, git for history. Here's the layout I've used across three migrations now:

notes/
├── .git/
├── .gitignore
├── README.md              # entry point — what's here, how it's organized
├── inbox/                 # quick captures, unprocessed
├── daily/                 # YYYY-MM-DD.md
├── projects/
│   ├── project-a.md
│   └── project-b.md
├── topics/                # long-lived reference notes
│   ├── postgres.md
│   └── linux-networking.md
└── attachments/           # images, PDFs — referenced by relative path

Three rules I follow strictly:

Links are relative file paths, not app-specific wikilinks. [postgres notes](../topics/postgres.md) works everywhere — on GitHub, in VS Code, on the filesystem.
Attachments live in the vault's attachments/ folder and are referenced by relative path. From a note in topics/, that's ![diagram](../attachments/2026-02-pipeline.png).
No plugin-specific frontmatter. If a field isn't useful when grep'd as plain text, don't add it.
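If you're coming from an app that already filled your notes with wikilinks, the first rule implies a one-time rewrite. Here's a minimal sketch of that conversion. The `index` mapping (note name to path from the vault root) and the function name are my own assumptions for illustration; a real export will need its own lookup table.

```python
import os
import re

# Matches [[Name]] and [[Name|label]]
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def convert_wikilinks(text, note_path, index):
    """Rewrite wikilinks in `text` as relative markdown links.

    note_path: path of the note being rewritten, relative to the vault root.
    index: hypothetical mapping of note name -> path from the vault root.
    Names missing from the index are left untouched so nothing is lost silently.
    """
    note_dir = os.path.dirname(note_path) or "."

    def replace(match):
        name = match.group(1).strip()
        label = (match.group(2) or name).strip()
        target = index.get(name)
        if target is None:
            return match.group(0)
        rel = os.path.relpath(target, start=note_dir)
        return f"[{label}]({rel.replace(os.sep, '/')})"

    return WIKILINK.sub(replace, text)

index = {"postgres": "topics/postgres.md"}
print(convert_wikilinks("See [[postgres|pg notes]].", "projects/project-a.md", index))
```

Leaving unknown names alone is deliberate: a leftover wikilink is easy to find later, while a silently dropped one isn't.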

Step 2: Replace "features" with Unix tools

Most app features developers actually need — search, backlinks, tag listings — can be replaced with command-line tools you already have.

For full-text search, ripgrep is faster than any in-app search I've used:

# Search all notes for a phrase, with 2 lines of context
rg -i "connection pool" -C 2 notes/

# Find every note tagged #postgres (tags as inline #hashtags)
rg -l "#postgres\b" notes/

# Find broken relative links: files referenced that don't exist on disk
# (covers both ./ and ../ targets; assumes note paths contain no ':')
rg -oNP '\]\(\K\.{1,2}/[^)#]+' notes/ | while IFS=: read -r src link; do
  [ -e "$(dirname "$src")/$link" ] || echo "BROKEN: $src -> $link"
done

For backlinks — which note mentions which — a one-liner does the job:

# Find every note that links to topics/postgres.md
rg -l "topics/postgres\.md" notes/

Less ergonomic than a sidebar panel in a GUI? Sure. But it works on every machine I'll ever own, in every editor, forever. That's the tradeoff.
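If you want the whole backlink graph at once rather than one target at a time, a short script can walk the vault. This is a rough sketch, assuming links follow the relative-path rule from Step 1; `build_backlinks` is a name I made up, and the regex ignores edge cases like link titles.

```python
import os
import re
from collections import defaultdict

# Captures the target of [text](path.md) or [text](path.md#anchor)
LINK = re.compile(r"\]\(([^)#\s]+\.md)(?:#[^)]*)?\)")

def build_backlinks(vault):
    """Map each linked note (path relative to vault) to the notes linking to it."""
    backlinks = defaultdict(set)
    for root, _dirs, files in os.walk(vault):
        for name in files:
            if not name.endswith(".md"):
                continue
            src = os.path.join(root, name)
            with open(src, encoding="utf-8") as fh:
                text = fh.read()
            for target in LINK.findall(text):
                if "://" in target:
                    continue  # external URL, not a note
                resolved = os.path.normpath(os.path.join(root, target))
                backlinks[os.path.relpath(resolved, vault)].add(
                    os.path.relpath(src, vault)
                )
    return backlinks

if __name__ == "__main__":
    for target, sources in sorted(build_backlinks("notes").items()):
        print(target, "<-", ", ".join(sorted(sources)))
```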

Step 3: Version control as the safety net

This is the step most "just use markdown" guides skip, and it's the one that actually makes the system durable. Initialize the directory as a git repo and commit anything that survives more than a day in the inbox.

cd notes/
git init
git add .
git commit -m "initial vault"

# A tiny pre-commit hook that rejects accidental app-specific syntax
cat > .git/hooks/pre-commit <<'HOOK'
#!/usr/bin/env bash
# Block wikilink-style references — they don't render outside specific apps
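# Only check staged, non-deleted markdown files; the extra /dev/null argument
# keeps grep from waiting on stdin when no files are staged
if git diff --cached --name-only --diff-filter=ACM -z -- '*.md' | xargs -0 grep -lE '\[\[[^]]+\]\]' /dev/null 2>/dev/null; then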
  echo "Found wikilink syntax. Use relative paths instead." >&2
  exit 1
fi
HOOK
chmod +x .git/hooks/pre-commit

The hook is the boring-but-critical piece. Without it, you'll absentmindedly type [[some note]] once a week and slowly recreate the lock-in problem inside your supposedly portable system. Found that out the hard way last year.

Step 4: A sync script you actually understand

If you want notes on multiple devices, resist the urge to bolt on a sync service. A git remote is enough for 99% of single-user workflows:

#!/usr/bin/env bash
# sync.sh: call from cron or a keybinding
set -euo pipefail
cd "$HOME/notes"

git add -A
# Skip empty commits when nothing has changed since last sync
if ! git diff --cached --quiet; then
  git commit -m "sync $(date -u +%FT%TZ)"
fi
git pull --rebase --autostash
git push

I've run this exact script across a laptop, a desktop, and a server for about 18 months. Total merge conflicts: maybe a dozen, all resolved in under a minute because the files are plain text.

Prevention: how to audit a tool before you commit

Before adopting any new note-taking tool, run this checklist. Took me three migrations to learn it:

Create a test note that uses every feature you care about (links, tags, attachments, embeds, code blocks).
Print the raw file with cat. Does it contain only standard markdown? If you see custom block syntax, that's your future lock-in.
Move that file out of the tool's directory. Open it in a different markdown viewer. Does it still render correctly, with working links?
Delete the tool entirely. Are your files still useful as plain text in a git repo?

If any answer is "no" or "kind of", you're not adopting a markdown editor — you're adopting a database that happens to use .md as a file extension.
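The "does it contain only standard markdown" check is easy to automate once you know what to look for. A minimal sketch follows; the patterns are examples of syntax I've seen in exports, not a complete list, so treat `SUSPECT_PATTERNS` as a starting point to extend after reading your own test note.

```python
import re

# Illustrative patterns only; extend after inspecting your own export
SUSPECT_PATTERNS = {
    "wikilink": re.compile(r"\[\[[^\]]+\]\]"),
    "embed": re.compile(r"!\[\[[^\]]+\]\]"),
    "block id": re.compile(r"\s\^[A-Za-z0-9-]+$", re.MULTILINE),
    "callout": re.compile(r"^>\s*\[![A-Za-z]+\]", re.MULTILINE),
}

def audit(text):
    """Return {pattern name: match count} for each suspect pattern that appears."""
    counts = {}
    for name, pattern in SUSPECT_PATTERNS.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[name] = hits
    return counts

sample = "See [[Other Note]]\n> [!note] Heads up\nSome text ^abc123\n"
print(audit(sample))
```

An empty result doesn't prove the file is portable, but a non-empty one tells you exactly which features will need a migration script later.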

When you actually need a GUI

To be fair: a folder of markdown plus ripgrep won't replace every workflow. For graph views, daily review templates, or kanban boards on top of notes, you'll want some kind of editor or viewer. The fix isn't to avoid GUIs — it's to pick ones that read a directory of plain files instead of owning a vault. If the tool insists on importing your files into its own format, walk away. If it sits on top of the directory and treats your files as the source of truth, you can swap it out next year without losing a thing.

That single distinction — does the tool own your files, or just read them — is the whole game.

How to boot mainline Debian on a vendor-locked ARM tablet

Alan West — Mon, 18 May 2026 23:26:43 +0000

The problem: an $80 tablet running a kernel from 2018

Picked up a cheap Rockchip-based Android tablet last month — RK3562 SoC, 4GB RAM, 64GB eMMC, under a hundred bucks. On paper it's perfect for a kiosk, a tiny build agent, or just an ARM dev box on my desk. In practice? It ships with an Android fork running a vendor kernel that's frozen in time. No root, no developer mode, no terminal, and no obvious way to install anything that didn't come from the manufacturer's app store.

I wanted a Debian shell. Not Termux pretending to be Debian, not a chroot trick, not a VM. Actual Debian, owning the hardware.

This is a problem you hit constantly with cheap ARM gear: vendor BSPs are a graveyard. Old kernels, no upstream changes, a single security patch on launch day and then silence. If you want a usable Linux machine out of one, you have to bring it yourself.

Here's how I worked through it, what broke, and what to check before you start.

Root cause: the vendor BSP trap

Most ARM SoCs ship with a Board Support Package — a vendor-maintained kernel fork plus a custom bootloader, device trees, and binary blobs for things like GPU, video decode, and Wi-Fi. The vendor uses it to ship a product, then walks away.

The trap has three layers:

Bootloader: the board runs a vendor U-Boot or proprietary loader that expects a specific boot image format, partition layout, and sometimes signed payloads.
Device tree: the hardware description (.dts/.dtb) is custom per board. Mainline ships device trees for some reference boards, but the specific touchscreen controller, PMIC, and panel on your tablet are almost certainly not there.
Drivers: GPU (Mali), VPU, Wi-Fi, and audio frequently rely on out-of-tree drivers or firmware blobs.

So "install Debian" is really four problems stacked: get code to run at boot, get the kernel to recognize the hardware, get userspace to talk to it, and do all of this without bricking a device whose recovery path you don't fully understand yet.

Step 1: find a recovery path before you break anything

Rule one of ARM hacking: know how to unbrick before you brick.

Most Rockchip SoCs have a maskrom mode — a hardware-level recovery state where the CPU listens on USB for a loader image, totally independent of whatever's on eMMC. Even if you nuke the bootloader, you can usually recover with rkdeveloptool:

# Confirm the device shows up in maskrom mode
sudo rkdeveloptool ld
# Expected: DevNo=1 Vid=0x2207,Pid=0x350a LocationID=... Maskrom

# Push a working loader into RAM (not flash)
sudo rkdeveloptool db rk356x_loader_vX.XX.bin

The exact PID and loader filename depend on the SoC family. Rockchip publishes prebuilt loader blobs in the rkbin tree; verify the binary matches your SoC before flashing anything persistent.

If your device doesn't have a documented maskrom button combo or test pad, stop here. Recovery without it usually means short-pinning a flash chip on the PCB, and that's a different blog post.

Step 2: build U-Boot for the SoC, not the board

Mainline U-Boot has reasonable Rockchip support, but it expects you to pick a board config. For an SoC where there's no upstream board file for your exact tablet, the pragmatic path is to start from the closest reference design and override the device tree later.

git clone https://source.denx.de/u-boot/u-boot.git
cd u-boot
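# Use a nearby supported board as the base config
make evb-rk3568_defconfig
# Cross-compile with an aarch64 toolchain. Recent mainline trees also expect
# the DDR init blob from rkbin via ROCKCHIP_TPL=<path>; see doc/board/rockchip
make CROSS_COMPILE=aarch64-linux-gnu- \
     BL31=bl31.elf u-boot-rockchip.bin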

BL31 is ARM Trusted Firmware — the secure-world runtime U-Boot hands control to. You can build ATF yourself from the TF-A project or pull a prebuilt blob from rkbin. Building from source is the right long-term answer; pulling prebuilt is the right answer when you're still bisecting which combination boots at all.

Step 3: boot from SD card first, never eMMC

This is the single biggest mistake I see people make: they flash an experimental image straight to internal storage on the first try. Don't.

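On most Rockchip devices a bootable SD card takes priority over eMMC, although the exact order depends on the boot ROM and on the loader already sitting on eMMC, so confirm it on your device first. That means you can iterate on a boot image entirely from an SD card while the original Android partition on eMMC stays untouched. If the image is broken, pull the SD card and the tablet boots Android like nothing happened.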

# Drop U-Boot at the Rockchip-expected offset
sudo dd if=u-boot-rockchip.bin of=/dev/sdX seek=64 conv=notrunc
# Partition the rest of the card normally
sudo parted /dev/sdX mklabel gpt
sudo parted /dev/sdX mkpart boot fat32 16MiB 256MiB
sudo parted /dev/sdX mkpart root ext4 256MiB 100%
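The numbers in those commands aren't arbitrary, and it's worth checking them before writing to a card. A quick sketch of the arithmetic, assuming dd's default 512-byte block size and the 16MiB partition start used above (the function name is mine):

```python
SECTOR_BYTES = 512         # dd's default block size
GPT_PRIMARY_SECTORS = 34   # protective MBR + GPT header + partition entries
UBOOT_SEEK_SECTORS = 64    # seek=64 from the dd command
FIRST_PARTITION_MIB = 16   # start of the boot partition

def uboot_fits(image_bytes):
    """True if an image written at sector 64 ends before the first partition."""
    start = UBOOT_SEEK_SECTORS * SECTOR_BYTES
    limit = FIRST_PARTITION_MIB * 1024 * 1024
    return start + image_bytes <= limit

print(UBOOT_SEEK_SECTORS * SECTOR_BYTES)    # 32768 bytes, i.e. 32 KiB
print(GPT_PRIMARY_SECTORS * SECTOR_BYTES)   # 17408 bytes, below the U-Boot offset
```

So the primary GPT ends before sector 64, and U-Boot has just under 16 MiB before the boot partition begins. Run your actual image size through `uboot_fits` if the build grows.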

Then drop a Debian arm64 rootfs onto the root partition with debootstrap:

# Format and mount the root partition first
sudo mkfs.ext4 /dev/sdX2
sudo mount /dev/sdX2 /mnt/root
sudo debootstrap --arch=arm64 --foreign bookworm /mnt/root \
    http://deb.debian.org/debian
# Finish stage 2 inside a qemu-user chroot
sudo cp /usr/bin/qemu-aarch64-static /mnt/root/usr/bin/
sudo chroot /mnt/root /debootstrap/debootstrap --second-stage

The two-stage debootstrap works because qemu-user-static transparently executes aarch64 binaries on your x86 host. Don't forget to register binfmt handlers (binfmt-support package on Debian).

Step 4: device tree is where you'll lose a weekend

The kernel will boot, panic on PMIC init, and reboot. That's normal. You're missing a working DTB.

What I do:

Dump the Android partition's DTB blob and decompile it with dtc -I dtb -O dts to get a starting point.
Diff it against the mainline DTS for the closest reference board.
Strip out anything vendor-specific (Android boot partitions, proprietary properties).
Iterate.
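A raw text diff of two device trees is noisy, so I find it easier to start by comparing just the compatible strings: they tell you which drivers the vendor tree expects. A minimal sketch, assuming both files are already decompiled to DTS text; the function names are made up for illustration.

```python
import re

COMPATIBLE = re.compile(r'compatible\s*=\s*([^;]+);')
QUOTED = re.compile(r'"([^"]+)"')

def compatibles(dts_text):
    """Collect every compatible string in a decompiled device tree source."""
    found = set()
    for value in COMPATIBLE.findall(dts_text):
        found.update(QUOTED.findall(value))
    return found

def missing_in_mainline(vendor_dts, mainline_dts):
    """Compatible strings the vendor tree uses that the mainline tree lacks."""
    return sorted(compatibles(vendor_dts) - compatibles(mainline_dts))

# Hypothetical snippets, not taken from a real board
vendor = 'pmic@20 { compatible = "rockchip,rk809"; };\n' \
         'ts@5d { compatible = "goodix,gt911", "vendor,custom-ts"; };'
mainline = 'pmic@20 { compatible = "rockchip,rk809"; };'
print(missing_in_mainline(vendor, mainline))
```

Anything in the output is a node you'll have to add, port, or live without. Whether a mainline driver exists for each string is a separate check against the kernel tree.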

Expect the touchscreen, Wi-Fi, and internal sensors to not work on first boot. Serial console and USB will. Get a USB-to-serial adapter on the debug UART pads — without one, you're flying blind.

Prevention: what to check before you buy

If you're shopping for cheap ARM hardware specifically to run mainline Linux, vet it first:

Search the SoC plus "mainline" or "u-boot defconfig": if the SoC has zero upstream presence, walk away.
Look for an exposed UART: serial console access is non-negotiable for debugging.
Check for a maskrom button or documented test point: this is your unbrick path.
Prefer SoCs with an active community port (Pine64, Radxa, Orange Pi families) over no-name tablets — even if the silicon is the same, the upstream work is what saves you.

I haven't tested every Rockchip variant thoroughly, but the RK35xx family in general has a much healthier mainline story than the RK30xx-era parts ever did. Your mileage will vary by exact silicon revision and board.

The payoff is real though. An $80 chunk of hardware running clean Debian, on a current kernel, that you actually control — that's worth the weekend.

How to fix the 'AI-generated' look in your frontend

Alan West — Mon, 18 May 2026 23:04:12 +0000

The problem: every AI site looks like the same AI site

I did a small experiment last month. I asked three different code-gen tools to build me a landing page for a fake SaaS product. Different prompts, different sessions, different models. The output? Practically identical.

Purple-to-blue gradient hero. Three feature cards in a row with rounded corners and lucide icons. A pricing section with the middle plan slightly elevated. A FAQ accordion at the bottom. CTA button with bg-indigo-600 hover:bg-indigo-700.

If you've shipped anything with an LLM lately, you've seen it. There's a specific visual fingerprint to AI-generated frontends, and once you can spot it, you can't unsee it. The frustrating part is when a client or a non-technical stakeholder looks at your work and says "this looks like ChatGPT made it" — even when half of it didn't.

Let's debug why this happens and walk through fixes that actually move the needle.

Root cause: the model is averaging over its training data

LLMs that generate UI code aren't choosing aesthetics. They're predicting the most likely next token given billions of public code samples. Public code samples are overwhelmingly tutorials, starter templates, and component libraries — which all tend to use the same defaults.

There are three specific failure modes I keep seeing:

1. The default Tailwind palette

The Tailwind default config uses a specific set of named colors (slate, indigo, emerald, etc.) that are mathematically pleasant but instantly recognizable. When a model can't decide on a color, it reaches for indigo-600 or slate-900 because those tokens appear in roughly a billion tutorials.

2. The component-library layout vocabulary

Hero → features grid → social proof → pricing → FAQ → footer. This isn't because that's the right layout for a landing page. It's because it's the layout used in every shadcn/ui example, every Tailwind UI screenshot, every Vercel template. Models pattern-match on structure.

3. The "safe" typography pairing

Inter for everything, with the occasional font-bold for headings. Default line-height. Default tracking. The result is technically readable and entirely forgettable.

The fix, part 1: tear out the default palette

First step is replacing your Tailwind theme with something that doesn't ship by default. Don't just rename indigo to primary — actually pick colors that aren't in the default scale.

// tailwind.config.js
/** @type {import('tailwindcss').Config} */
export default {
  theme: {
    // 'extend' keeps defaults; replacing 'colors' wipes them entirely
    colors: {
      transparent: 'transparent',
      current: 'currentColor',
      // custom palette built from a base hue, not 'indigo'
      ink: {
        50:  '#f6f5f1',
        500: '#3d3a32',
        900: '#1a1814',
      },
      ember: {
        400: '#e8775a', // warm accent, not the usual cool blue
        600: '#c45530',
      },
    },
    fontFamily: {
      // pair a serif display face with a plain sans body for an unusual feel
      display: ['"Fraunces"', 'serif'],
      sans: ['"IBM Plex Sans"', 'sans-serif'],
    },
  },
}

Notice I replaced colors instead of extending it. That kills bg-indigo-600 entirely: if the model (or a junior dev) reaches for it, Tailwind generates no CSS for the class and the element renders visibly unstyled, and inside an @apply it's a hard build error. Forcing the failure is the point. It pushes everyone toward the custom palette.
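If "built from a base hue" sounds hand-wavy, here's one way to do it: fix a hue and saturation, then vary lightness per step. This is a rough sketch using Python's standard colorsys module; the hue, saturation, and lightness values are placeholders I picked for illustration, not the ones behind the hex codes above.

```python
import colorsys

def shade(hue_degrees, saturation, lightness):
    """Return a #rrggbb string for an HSL color (hue in degrees, s and l in 0..1)."""
    # colorsys uses HLS argument order: hue, lightness, saturation
    r, g, b = colorsys.hls_to_rgb(hue_degrees / 360.0, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))

def scale(hue_degrees, saturation, steps):
    """Build a {step: hex} scale; steps maps a Tailwind-style key to a lightness."""
    return {key: shade(hue_degrees, saturation, light) for key, light in steps.items()}

# Placeholder values: a warm orange-red hue at two lightness levels
ember = scale(14, 0.65, {400: 0.63, 600: 0.48})
for step, hex_value in ember.items():
    print(step, hex_value)
```

Paste the printed values into the colors block and adjust by eye. Evenly spaced lightness rarely looks even, so expect to nudge a few steps.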

The fix, part 2: break the layout grammar

AI-generated layouts are almost always vertically stacked, full-width sections with centered content. You can break this pattern with very little code by using CSS Grid for asymmetric layouts.

/* asymmetric hero — content offset to the left, art bleeds right */
.hero {
  display: grid;
  /* unequal gutters (1fr left, 2fr right) keep the content column off-center */
  grid-template-columns: minmax(2rem, 1fr) minmax(0, 38rem) minmax(0, 2fr);
  align-items: end;
  min-height: 80vh;
}

.hero__content {
  /* sit in the second column, not centered across the page */
  grid-column: 2;
  padding-block: 4rem;
}

.hero__art {
  /* let the visual element extend past the content column */
  grid-column: 2 / -1;
  align-self: stretch;
}

This is a five-minute change that immediately signals "a human chose this." Centered hero + three cards is the visual equivalent of beige carpet. Off-center compositions, overlapping elements, and content that breaks the grid all read as intentional design choices.

The fix, part 3: kill the rounded-2xl reflex

Every AI-generated component has rounded-2xl shadow-lg p-6 somewhere. Override your component defaults at the source.

// components/Card.jsx
export function Card({ children, variant = 'default' }) {
  // pick ONE radius vocabulary for the whole site, not per-component
  const variants = {
    default: 'border border-ink-500/20 bg-ink-50',
    inset:   'border-l-2 border-ember-600 bg-transparent pl-6',
    flat:    'bg-ink-50',
  }

  return (
    <article className={`${variants[variant]} p-5`}>
      {children}
    </article>
  )
}

No border radius. No drop shadow. Borders and color contrast do the work instead. This won't fit every brand, but the point is to pick a vocabulary and stick to it rather than letting each component drift toward generic-AI-card defaults.

The fix, part 4: replace placeholder copy before showing anyone

This one isn't visual, but it triggers the same uncanny-valley response. "Empower your team to unlock productivity" and "Built for modern teams" are the textual equivalent of the purple gradient. If you ship a draft with that copy, even non-technical people pick up on it — they can't articulate why, but they know.

I keep a checklist on my second monitor before any client review:

No sentence that starts with "Empower", "Unlock", or "Transform"
No feature card titled with two abstract nouns ("Seamless Integration")
At least one specific, concrete claim with a number
At least one sentence that sounds like a real person wrote it

Prevention: catch it in code review

The cheapest fix is a linter rule that fails the build when forbidden class patterns show up. Tailwind's blocklist option stops specific classes from being generated, and a custom ESLint rule can enforce the rest in CI:

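// eslint custom rule, simplified
const banned = [
  /bg-(indigo|violet|purple)-600/,
  /rounded-(2xl|3xl)/,
  /from-purple-\d+ to-(blue|pink)-\d+/, // the gradient
]

module.exports = {
  meta: { type: 'suggestion', schema: [] },
  create(context) {
    // report any class string that matches a banned pattern
    function check(node, text) {
      if (typeof text !== 'string') return
      for (const pattern of banned) {
        if (pattern.test(text)) {
          context.report({ node, message: `Banned default-AI class: ${text}` })
        }
      }
    }
    return {
      // plain strings, including JSX className="..." values
      Literal(node) { check(node, node.value) },
      // static parts of template literals such as className={`... p-5`}
      TemplateElement(node) { check(node, node.value.cooked) },
    }
  },
}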

Is this petty? A little. But I'd rather have CI yell at me than ship something a client describes as "that AI look." After putting this rule in place on two projects, the diffs got noticeably more interesting: people started reaching for the custom tokens instead of the defaults, because the defaults no longer passed CI.
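ESLint only sees JavaScript, so plain HTML or server-side templates slip past the rule. The same patterns port to a small standalone check you could run in CI. A minimal sketch, with a function name of my own choosing:

```python
import re

# Same patterns as the ESLint rule
BANNED = [
    re.compile(r"bg-(indigo|violet|purple)-600"),
    re.compile(r"rounded-(2xl|3xl)"),
    re.compile(r"from-purple-\d+ to-(blue|pink)-\d+"),
]

def find_banned(text):
    """Return (line number, matched class) pairs for every banned pattern hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in BANNED:
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(0)))
    return hits

template = '<div class="rounded-2xl p-6">\n<a class="bg-ink-50">ok</a>'
for lineno, cls in find_banned(template):
    print(f"line {lineno}: banned class {cls}")
```

Wire it to exit non-zero when the list is non-empty and it behaves like the lint rule for every file type in the repo.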

The takeaway

The "AI look" isn't really about AI. It's about defaults. LLMs amplify defaults because their training data is mostly default-using code. The fix isn't to stop using AI assistance — it's to remove the defaults from your toolchain so neither the model nor your team can fall back on them.

Replace the palette. Break the layout grammar. Pick a component vocabulary and enforce it. And read the copy out loud before you ship.

Why MTP doesn't speed up your llama.cpp inference (and how to actually fix it)

Alan West — Mon, 18 May 2026 19:33:41 +0000

Last week, I spent two days banging my head against a wall. I had just spun up a fresh llama.cpp build with multi-token prediction (MTP) support, loaded a quantized Qwen3 model, and ran my benchmark suite expecting that sweet 2-3x speedup everyone keeps talking about.

The result? Roughly the same tokens per second. Sometimes slower. After a lot of profiling, I figured out what was happening — and it turns out the issue is more common than the celebratory benchmark posts suggest.

This post is for anyone who's enabled MTP, expected a speedup, and got nothing.

What MTP actually does (the short version)

Multi-token prediction is a form of speculative decoding baked into the model itself. Instead of running a separate, smaller draft model to guess the next few tokens, the model carries extra prediction heads that propose several candidate tokens per forward pass. The main model then verifies those candidates in a single batched pass, keeping the prefix that matches what it would have generated and discarding the rest.

The theory is simple. If acceptance rate is high, you get 2-3 tokens per forward pass instead of one, with roughly the same latency per pass. In practice, MTP can make things worse if any of three things go wrong.

The three reasons MTP fails to speed things up

Here are the actual root causes I hit, in order of frequency:

1. Low acceptance rate

This is the big one. MTP only helps if predictions are accepted. If your acceptance rate is below ~60%, you're paying the extra compute cost of generating drafts without getting tokens back. Wall-clock time goes up.

I see this most often when:

The prompt is unusual (specific code style, niche domain)
Temperature is too high (anything above ~0.7 starts hurting)
The model was quantized aggressively and the MTP head suffered more than the main weights
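To see why acceptance rate dominates, it helps to put rough numbers on it. The sketch below uses a simplified model: each drafted token is accepted independently with the same probability, and the verifier always contributes one token of its own. The `draft_cost` figure is a made-up placeholder for how expensive one drafted token is relative to a full forward pass, so measure your own before trusting the output.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate, and that the verifier always yields one token itself.
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

def estimated_speedup(accept_rate, draft_len, draft_cost=0.1):
    """Rough speedup over plain decoding; draft_cost is a fraction of one pass."""
    step_cost = 1.0 + draft_cost * draft_len
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost

for rate in (0.4, 0.6, 0.8):
    print(rate, round(estimated_speedup(rate, draft_len=4), 2))
```

Real acceptance isn't independent per token, and this ignores memory effects entirely, so read it as intuition for the trend rather than a prediction.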

2. KV cache thrashing

When you generate multiple candidates per step, you churn the KV cache more aggressively. On consumer GPUs with limited VRAM, this can spill into slower memory or cause re-allocation. The forward pass speedup gets eaten by memory stalls.

3. CUDA graph capture failures

This one bit me hard. llama.cpp tries to capture CUDA graphs for the inference loop. If MTP introduces dynamic shapes (variable number of accepted tokens per step), the graph gets re-captured every step. You lose the performance win of graphs entirely, and the per-step overhead actually goes up.

Step-by-step: diagnosing your setup

Here's the order I work through now whenever MTP doesn't seem to help.

Step 1: Measure the actual acceptance rate

llama.cpp surfaces speculation metrics with verbose logging. Build with CUDA support and run with -v:

# Build llama.cpp with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Run with verbose stats so we can see acceptance numbers
./build/bin/llama-cli \
  -m models/qwen3-quantized.gguf \
  -p "Write a Python function for binary search" \
  --n-predict 256 \
  -ngl 99 \
  -v 2>&1 | tee run.log

Then grep the log for the speculation stats. You're looking for an n_accept ratio. Below 0.6 means MTP is actively hurting throughput on your workload.

Step 2: Check VRAM headroom

If acceptance is fine but throughput is still bad, you're probably memory-bound. Watch VRAM usage during inference in a separate terminal:

# Poll memory and GPU utilization once per second
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
  --format=csv -l 1

If you're sitting at >95% VRAM utilization while running, MTP's extra KV cache pressure is pushing you over the edge. The fix is usually to reduce context length, drop to a more aggressive quant (Q4_K_M instead of Q5_K_M), or shorten the draft window.

Step 3: Disable CUDA graphs as a control

To check whether graph re-capture is killing you, force graphs off and re-run:

# Disable CUDA graphs to test if they're being re-captured each step
GGML_CUDA_DISABLE_GRAPHS=1 ./build/bin/llama-cli \
  -m models/qwen3-quantized.gguf \
  -p "Write a Python function for binary search" \
  --n-predict 256 \
  -ngl 99

If throughput is roughly the same with graphs disabled, capture isn't your problem. If throughput goes up with this flag set, that's the smoking gun — graphs were being re-captured every step under MTP and the overhead was worse than not using them at all.

The actual fix

Once you've identified which of the three issues you're hitting, the fix is usually simple:

Low acceptance — shorten the draft window. Most MTP implementations let you set a draft length of 1-4 tokens. Dropping from 4 to 2 often pushes acceptance above 70% because the model has to commit to fewer guesses in a row.
VRAM pressure — reduce context length or quantize more aggressively. KV cache size scales linearly with context, so cutting --ctx-size in half buys you real headroom.
Graph capture churn — pull the latest llama.cpp. The speculation code path changes frequently and padded graph capture has improved a lot recently.
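The linear scaling is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, assuming a standard attention layout with one key and one value vector per layer per token and 2-byte cache entries; the model dimensions are hypothetical, not read from any specific GGUF file.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_value=2):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_value

# Hypothetical dimensions for illustration only
full = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, ctx_size=16384)
half = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, ctx_size=8192)
print(full / 2**30, half / 2**30)  # 3.0 1.5 (GiB)
```

Swap in the layer and head counts from your model's metadata to get a number you can compare against what nvidia-smi reports.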

Here's the config that finally worked for me on a quantized Qwen3 model with around 24 GB of VRAM available:

# Final working config — moderate draft length, conservative context
./build/bin/llama-cli \
  -m models/qwen3-quantized.gguf \
  -p "$PROMPT" \
  --n-predict 512 \
  --ctx-size 8192 \
  --draft-max 2 \
  --draft-min 1 \
  -ngl 99

That gave me roughly 1.7x throughput over the no-MTP baseline on my workload. Not the magical 3x some posts claim, but a real, repeatable win that I could ship.

Prevention tips

A few things I now do by default whenever I touch MTP:

Always benchmark with and without MTP. Don't trust that it's helping just because it's enabled. Run both, measure both, save the numbers.
Pin your llama.cpp version. The MTP code path changes frequently. A config that works today can regress between commits.
Match quantization to the head carefully. Some MTP heads are sensitive to aggressive quantization. If acceptance rate suddenly tanks after a re-quant, that's usually why.
Log acceptance rate as a metric, not just throughput. Throughput tells you the symptom; acceptance rate tells you the cause. When you can see both side by side, regressions become obvious.

The honest takeaway is that MTP is a real win when the conditions line up, but it isn't free. If you've enabled it and gotten nothing, you're not doing it wrong — you've just hit one of the failure modes nobody talks about in the benchmark threads. Walk the three steps above and you'll usually find the culprit within an hour.

AI Won't Speed Up Your Processes (And That's OK)

Alan West — Mon, 18 May 2026 19:29:25 +0000

The dirty secret of AI productivity claims

Saw a post on HN this week (Frederick Van Brabant's piece) arguing that AI won't make your processes go faster, and honestly... yeah. After two years of integrating Copilot, Cursor, and Claude into my daily flow across four different teams, I've landed in roughly the same place. AI makes tasks faster. Processes? Not so much.

The distinction matters more than it sounds.

Tasks vs. processes

A task is the thing you do at your keyboard. Writing a function. Generating boilerplate. Drafting a gnarly regex. AI is genuinely excellent at these — I'd estimate it shaves 30-40% off my pure typing time when I'm in the zone.

A process is everything around the task. The Jira ticket sitting in "Ready for Review" for three days. The deploy that requires four approvals. The standup where you find out the requirements changed. The QA cycle. The customer who needs to validate the change before you can close anything.

Look at where your week actually goes:

# Rough breakdown of a typical product dev week (40 hours)
Writing code             ~8h   (20%)
Reviewing PRs            ~6h   (15%)
Meetings / standups      ~8h   (20%)
Waiting (CI, reviews)    ~6h   (15%)
Debugging existing bugs  ~5h   (12.5%)
Planning / refinement    ~4h   (10%)
Context switching tax    ~3h   (7.5%)

If "writing code" is 20% of your week, even doubling its speed saves you about 10% total. Amdahl's Law from college shows up uninvited and ruins the pitch deck.
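The arithmetic is short enough to check yourself. A small sketch using the 20% figure from the breakdown above; the function names are mine.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the work gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

def hours_saved(total_hours, fraction, factor):
    """Hours freed per week under the same assumption."""
    return total_hours * fraction * (1.0 - 1.0 / factor)

print(hours_saved(40, 0.20, 2))            # 4.0 hours, i.e. 10% of the week
print(round(amdahl_speedup(0.20, 2), 3))   # 1.111
```

Push `factor` as high as you like and the overall speedup still tops out at 1.25x, because the other 80% of the week hasn't changed.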

What I've actually measured

I migrated three projects to a heavier AI-assisted workflow this year and tracked cycle time (first commit to production). Two of them got slower in the first month. Why?

More PRs were getting opened (because writing them was easy)
Reviewers became the new bottleneck
A handful of AI-generated pieces had subtle bugs that ate days

By month three things normalized. Cycle time came back to baseline — not better. The team felt more productive (which is a real benefit, don't dismiss it) but the calendar didn't show it.

The review tax nobody talks about

Here's what nobody warns you about: AI shifts work from writing to reviewing. And reviewing is harder than writing.

# Looks fine at a glance, right?
def apply_discount(price, code):
    discounts = fetch_discount_table()
    multiplier = discounts.get(code, 1)  # default = no discount
    return price * multiplier

# Two problems hiding here:
# 1. fetch_discount_table() is called on every invocation — no caching
# 2. If `code` is None (very common from a form), .get(None, 1) silently returns 1
#    instead of raising. Bug that ships happily to prod.

When you write a function, you build a mental model as you go. When you review one, you reconstruct that model from the outside. With AI-generated code, you can't skip the careful review — sometimes it calls a method that doesn't exist, uses an outdated API pattern, or quietly swallows an error.

I tell junior devs on my team: treat every AI suggestion like a Stack Overflow answer from 2017. Often useful, never trusted blindly.

Where AI does actually compress the process

I don't want to be a total cynic — there are spots where AI shortens the process itself, not just the typing:

Stack trace → likely cause: pasting an error and getting a focused minimal repro is faster than the back-and-forth on Slack
Cross-language fluency: touching a service in a language you don't write daily, the ramp-up is real
First-draft docs and ADRs: editing is faster than blank-page writing
Test scaffolding: generating the obvious cases so you can focus on the weird ones

What these have in common: they replace a waiting step, not a typing step.

How to actually measure your process

Stop trusting vibes. Track the numbers.

Questions worth answering for your team:

What's your median cycle time (PR opened → merged → deployed)?
What's the median age of an open PR right now?
How many PRs are open per dev on your team?
How often does a PR need a second round of review changes?
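The first question is the one I'd automate first. Here's a minimal sketch of the calculation, assuming you've already pulled opened and deployed timestamps from your Git host; the records below are invented, and the field names are placeholders.

```python
from datetime import datetime
from statistics import median

def median_cycle_hours(prs):
    """Median hours from PR opened to deployed; skips PRs not yet deployed."""
    durations = [
        (pr["deployed"] - pr["opened"]).total_seconds() / 3600
        for pr in prs
        if pr.get("deployed") is not None
    ]
    return median(durations) if durations else None

# Hypothetical records for illustration
prs = [
    {"opened": datetime(2026, 5, 4, 9), "deployed": datetime(2026, 5, 4, 15)},
    {"opened": datetime(2026, 5, 5, 9), "deployed": datetime(2026, 5, 7, 9)},
    {"opened": datetime(2026, 5, 6, 9), "deployed": datetime(2026, 5, 6, 21)},
    {"opened": datetime(2026, 5, 8, 9), "deployed": None},
]
print(median_cycle_hours(prs))  # 12.0
```

Median beats mean here because one PR that sat for two weeks would otherwise swamp the number. Track it weekly, before and after the rollout.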

For process metrics there's GitHub Insights, LinearB, and Swarmia. For product-side metrics on what users actually do with the features you ship, privacy-focused options like Umami or Plausible give you full data ownership without the GA bloat. The point isn't the specific tool — it's that you need some number that should move if AI is genuinely helping your pipeline.

If your AI rollout is real, at least one of these numbers should move. If none of them move, you didn't speed up your process. You just made some tasks feel snappier.
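As a rough sketch of what tracking the first two questions could look like, here is a small Python helper. The record shape (`opened_at`, `merged_at`, `deployed_at` as ISO-8601 strings, `None` when a step hasn't happened) is an assumption for illustration, not the export format of any particular tool.

```python
from datetime import datetime
from statistics import median

# Hypothetical record shape: ISO-8601 timestamps, None when a step hasn't happened yet.
def _hours_between(start: str, end: datetime) -> float:
    return (end - datetime.fromisoformat(start)).total_seconds() / 3600

def median_cycle_hours(prs):
    """Median hours from PR opened to deployed, over PRs that actually shipped."""
    durations = [
        _hours_between(pr["opened_at"], datetime.fromisoformat(pr["deployed_at"]))
        for pr in prs
        if pr.get("deployed_at")
    ]
    return median(durations) if durations else None

def median_open_age_hours(prs, now: datetime):
    """Median age in hours of PRs that are still unmerged at `now`."""
    ages = [_hours_between(pr["opened_at"], now) for pr in prs if not pr.get("merged_at")]
    return median(ages) if ages else None
```

Run it on the same window before and after the rollout. If neither median shifts, that is your answer.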

What actually moves the needle

The teams I've seen genuinely ship faster aren't the ones with the fanciest AI setups. They're the ones who fixed the boring stuff:

# A boring CI config that saves more time than any AI tool I've used
name: ship-it
on:
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 8     # fail fast — no 45 min stuck builds
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'      # the cache line that saves ~2 min per run
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4
    strategy:
      matrix:
        shard: [1, 2, 3, 4] # parallelize across 4 runners

Beyond CI, the cultural moves matter more:

Set review WIP limits (max 2 open PRs per reviewer)
Kill approval theater (one human approval, not three)
Automate deploys (no manual gates outside of regulated environments)
Write ADRs so decisions don't get re-litigated every sprint
Trunk-based development, feature flags for the scary stuff

AI helps these teams more, because the process around the AI-generated code can actually keep up. AI hurts a slow team because it dumps more code into an already-clogged review pipe.

The honest version

I love using these tools. I'd fight someone to keep Cursor in my workflow. I haven't tested every model thoroughly, but the recent ones are clearly a step up. Still, when someone tells me their AI rollout is going to make the team "2x more productive," I ask what number they're going to measure. If they can't name one, I know exactly what's going to happen in six months.

The AI is faster. The process isn't. Until you fix the process, the AI is just helping you generate code that sits in a review queue with all the other code.

Debugging DNS leaks: why your VPN isn't hiding what you think it is

Alan West — Mon, 18 May 2026 01:21:15 +0000

Last month I was setting up a hardened dev environment for a client doing security research. They wanted all traffic from their workstation tunneled through a VPN, no exceptions. Simple, right? Install WireGuard, flip the toggle, done.

Then I ran a leak test and watched their real ISP-assigned DNS server pop up on the report. The traffic was tunneled. The DNS queries weren't. We'd been working under a false sense of privacy for a week.

This is one of those bugs that doesn't crash anything, doesn't throw an error, and silently undermines the entire reason you set up the VPN in the first place. Let's walk through what's actually happening and how to fix it for good.

The frustrating problem

You've done everything right. You're connected to a VPN. curl ifconfig.me returns the VPN's exit IP. Your routing table looks clean. And yet, when you visit a DNS leak test site, your ISP's resolver shows up in the results.

Worse: in some cases your VPN tunnel is fine for HTTP and HTTPS, but DNS is going out of band. Every domain you visit is still visible to your ISP, your coffee shop's network, or whoever else is between you and the resolver you didn't mean to use.

If you're running this setup on a fleet of dev boxes or CI runners that talk to internal services, the consequences get worse. Internal hostnames can leak to public resolvers, and those names are often as sensitive as the traffic itself.

Root cause: DNS is not part of your VPN tunnel by default

Here's the thing most VPN tutorials gloss over. A VPN tunnel routes IP packets. DNS resolution happens at the OS level, often before the packet routing decision, using whatever resolver was configured by your DHCP lease, your /etc/resolv.conf, or your systemd-resolved stub.

There are usually three culprits:

systemd-resolved keeps per-link DNS configurations and may continue using the original interface's DNS even when traffic is routed elsewhere.
Browsers with DNS-over-HTTPS (Firefox, Chrome) bypass the OS resolver entirely and talk directly to a hardcoded DoH endpoint over HTTPS — which is tunneled through the VPN, but goes to a third party you may not trust.
Applications using their own resolvers — Go binaries with GODEBUG=netdns=go, some container runtimes, and language-specific resolver libraries can ignore system settings.

The VPN sees the encrypted DoH request and dutifully tunnels it. The OS resolver sends its plaintext UDP/53 query out the wrong interface. Both paths can coexist on the same machine, which is what makes this so confusing to debug.

Step 1: Confirm the leak

Before fixing anything, prove it's actually leaking. The cheapest reliable test is tcpdump on the physical interface (not the VPN interface) while you trigger a lookup.

# In one terminal, watch DNS on your physical NIC
sudo tcpdump -i wlan0 -n 'udp port 53 or tcp port 53'

# In another terminal, trigger a fresh lookup
# Use a unique domain so cached answers don't hide the issue
dig $(uuidgen | tr A-Z a-z).example.com

If anything shows up on the first terminal, you're leaking. If the only DNS traffic appears on your VPN interface (wg0, tun0, etc.), you're clean.

You can also check what resolver your system thinks it's using:

# systemd-resolved status, per-interface
resolvectl status

# Classic view
cat /etc/resolv.conf

# What's actually being asked, in real time
sudo resolvectl monitor

The monitor subcommand is underrated — it shows every query the stub resolver processes, including which interface it was sent over.
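If you want to script the /etc/resolv.conf check, a minimal Python sketch might look like the following. The tunnel CIDR list is an assumption you supply yourself. On systemd-resolved machines the file usually lists only the local stub (127.0.0.53), so there you would still rely on resolvectl status for the per-link view.

```python
import ipaddress

def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers

def outside_tunnel(servers, tunnel_cidrs):
    """Return the resolvers that do not fall inside any tunnel-side network."""
    networks = [ipaddress.ip_network(cidr) for cidr in tunnel_cidrs]
    leaks = []
    for server in servers:
        # Drop an IPv6 zone suffix such as %eth0 before parsing
        addr = ipaddress.ip_address(server.split("%")[0])
        if not any(addr.version == net.version and addr in net for net in networks):
            leaks.append(server)
    return leaks
```

Feeding it the contents of /etc/resolv.conf and your tunnel subnet (for example 10.0.0.0/24 from the WireGuard config) gives you a list that should be empty on a clean setup.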

Step 2: Force DNS through the tunnel

The fix depends on your VPN client, but the principle is the same: every DNS query must travel inside the encrypted tunnel and hit a resolver on the other side.

For a WireGuard config, this is one line:

[Interface]
PrivateKey = <your-private-key>
Address = 10.0.0.2/24
# Use a resolver that lives on the VPN side
DNS = 10.0.0.1

[Peer]
PublicKey = <peer-public-key>
Endpoint = vpn.example.com:51820
# Route everything, including DNS
AllowedIPs = 0.0.0.0/0, ::/0

The DNS = line tells wg-quick to update /etc/resolv.conf (or talk to systemd-resolved) so queries go to a server reachable only through the tunnel. The AllowedIPs = 0.0.0.0/0 part ensures the packet to that resolver actually enters the tunnel — without it, your route table might still send the DNS query out the default gateway.

For OpenVPN, the equivalent push options usually come from the server side, but you can force them locally:

# In your client config
dhcp-option DNS 10.8.0.1
block-outside-dns       # Windows-only, blocks leaks aggressively
script-security 2
up /etc/openvpn/update-resolv-conf
down /etc/openvpn/update-resolv-conf

On Linux, that update-resolv-conf script (Debian and Ubuntu packages ship it) is the one that actually modifies the system resolver; macOS clients usually handle DNS through their own helper instead. It's worth reading, since it's a useful template for understanding how DNS gets injected at runtime.

Step 3: Tame the browsers and runtimes

This is the step most people skip. Even with a perfect VPN config, Firefox and Chrome can still bypass your OS resolver if DoH is enabled.

For Firefox, set this in about:config:

network.trr.mode = 5   // Off by user choice; do not use DoH

Mode 5 disables DoH entirely. If you want DoH but routed through your VPN's resolver, use mode 3 and set network.trr.uri to your tunnel-side endpoint. The Mozilla TRR docs explain the modes in detail.

For Go programs, force the system resolver:

// Importing "net" by itself does not pick a resolver. To force the
// cgo-based resolver, which goes through the system's libc lookup path
// and so follows what the VPN client configured, build with the tag
// (cgo must be enabled):
//   go build -tags netcgo ./...

// Or via environment
// GODEBUG=netdns=cgo+2

The +2 gives you debug output showing which resolver path was actually taken — invaluable when you're not sure if your fix landed.

Step 4: Block the leak path entirely

Belt and suspenders. Add firewall rules that drop any DNS traffic not going through the tunnel. This way, if a misconfigured app tries to bypass, it fails loudly instead of leaking silently.

# nftables: block UDP/53 and TCP/53 on the physical interface
sudo nft add table inet vpn_guard
sudo nft add chain inet vpn_guard output { type filter hook output priority 0 \; }
sudo nft add rule inet vpn_guard output oifname wlan0 udp dport 53 drop
sudo nft add rule inet vpn_guard output oifname wlan0 tcp dport 53 drop

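If an app tries to leak, the dropped query times out instead of quietly succeeding against your ISP's resolver. Swap drop for reject in the rules above if you'd rather the app get an immediate error than a timeout. Either way it's a much better failure mode than a silent leak, because lookups visibly break and you'll notice.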

Prevention tips for future projects

Test the leak path every time you change network config. Don't trust that the previous setup still works after a kernel update or VPN client upgrade.
Prefer kill-switch behavior — drop all non-VPN traffic at the firewall when the tunnel is down. Most modern VPN clients support this; if yours doesn't, use nftables.
Standardize DNS at the tunnel exit. Run an unbound or dnsmasq instance on the VPN server so you control the resolver path end to end.
Audit application-layer resolvers. Browsers, container runtimes, and language standard libraries each have their own DNS quirks. Document them per project.
Run a periodic automated leak test. A daily cron job that runs dig against a unique subdomain and checks your authoritative server's logs for the source IP works well.

DNS leaks are the kind of bug that hides in plain sight. The fix isn't hard once you know where to look, but the surface area is bigger than most people realize. If you're going to put the work into setting up a VPN, spend the extra hour making sure your name resolution actually respects it.

Why your local LLM aces benchmarks but fails real terminal tasks

Alan West — Sun, 17 May 2026 21:00:11 +0000

Last month I spent an entire weekend frustrated by the same pattern. I'd download a shiny new open-weight model, see it crush MMLU and HumanEval, then watch it faceplant the second I handed it a multi-step shell task. "Find the largest log file in /var/log, grep for OOM errors, and write a summary." The model would confidently invent flags that don't exist, forget what it ran two steps ago, or get stuck in a loop running ls forever.

If you've tried running local models as terminal agents, you know the feeling. The score on the leaderboard says one thing; your actual workflow says another. With agentic benchmarks like Terminal-Bench 2.0 getting more attention (and newer MoE models like the Qwen3.6 family reportedly landing on the public board), it's worth understanding why this gap exists and what you can do about it.

The root cause: static benchmarks aren't agentic benchmarks

Most of the scores you see on Hugging Face leaderboards measure single-turn reasoning. The model gets a prompt, produces an answer, done. That tells you almost nothing about how the same model behaves when it has to:

Decide which tool to call
Parse messy stdout from a real shell
Remember state across 15+ turns
Recover when a command fails

This is the gap that benchmarks like Terminal-Bench try to close. They put the model in an actual sandbox, give it a real task, and grade it on whether the task got done — not whether the intermediate reasoning looked plausible.

The problem is that until you run an agentic eval yourself, you have no way to know if the model you're betting your stack on actually works for your use case.

Setting up a local agentic eval harness

Here's the approach I've been using to sanity-check models before committing to one. The core idea: simulate the same loop your production agent would run, but against a fixed task set you control.

First, a minimal tool-call loop. I'll use the transformers library since it works with most open-weight models out of the box.

from transformers import AutoModelForCausalLM, AutoTokenizer
import subprocess, json

MODEL_ID = "your-model-here"  # swap in whatever you're testing
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",  # let HF pick bf16/fp16 based on hardware
)

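def run_shell(cmd: str, timeout: int = 10) -> str:
    # Always use a sandbox in real evals; this is illustrative
    try:
        result = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Report the timeout as an observation so the agent can recover
        # instead of crashing the whole eval run
        return f"[command timed out after {timeout}s]"
    return result.stdout + result.stderr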

Next, the agent loop itself. The thing that surprised me when I first wrote this: most failures don't happen in the model. They happen at the boundary — bad parsing, dropped context, no recovery path.

def agent_step(history, max_new_tokens=512):
    # Apply the model's chat template — this matters a lot for instruct models
    prompt = tokenizer.apply_chat_template(
        history, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # deterministic for evals
    )
    # Slice off the prompt tokens so we only decode the new output
    new_tokens = out[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

def run_task(task: str, max_turns: int = 20):
    history = [
        {"role": "system", "content": "You are a shell agent. Reply with a single JSON object: {\"cmd\": \"...\"} or {\"done\": \"summary\"}."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        reply = agent_step(history)
        history.append({"role": "assistant", "content": reply})
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            # Parsing failures are a HUGE source of false-negative scores
            history.append({"role": "user", "content": "Reply must be valid JSON."})
            continue
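        if not isinstance(action, dict) or not ("cmd" in action or "done" in action):
            # Valid JSON but the wrong shape: ask again rather than crash on a KeyError
            history.append({"role": "user", "content": "Reply must be a JSON object with a \"cmd\" or \"done\" key."})
            continue
        if "done" in action:
            return action["done"]
        observation = run_shell(action["cmd"])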
        history.append({"role": "user", "content": f"<output>\n{observation}\n</output>"})
    return None  # ran out of turns

That's the skeleton. The interesting part is the failure modes you'll see.

What actually goes wrong (and how to fix it)

After running this harness against half a dozen open-weight models on the same fixed task set, here's the pattern I keep hitting:

1. The model ignores your output format

The most common failure isn't a reasoning failure. It's that the model wraps its JSON in markdown fences, or adds a chatty preamble, or hallucinates a thoughts field your parser doesn't know about. The fix isn't more prompting — it's constrained decoding.

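# Use a library like `outlines` or `lm-format-enforcer`
# to force the model to emit valid JSON matching your schema
from outlines import models, generate

schema = '{"type": "object", "properties": {"cmd": {"type": "string"}}}'

# Sketch against the pre-1.0 outlines API (an assumption; newer releases
# renamed these entry points, so check the docs for your installed version)
constrained = models.transformers(MODEL_ID)
generator = generate.json(constrained, schema)
# action = generator(prompt)  # prompt: the chat-templated string from agent_step
# This guarantees parseable output, even from smaller models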

This single change moved one 9B model I tested from ~30% task completion to ~55% on my local set. The model was capable; it just kept tripping the parser.

2. Context collapse around turn 8–10

Long shell sessions get noisy fast. A single ls -la /usr can dump thousands of tokens. By turn 10 the model has lost track of the original task.

The practical fix: truncate or summarize old observations aggressively. Keep the original task and the last 2–3 turns verbatim; collapse everything in between.
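A minimal sketch of that truncation, assuming the history layout from run_task (system prompt and task in the first two slots, then alternating assistant and observation messages). The function name and the default limits are illustrative choices, not tuned values.

```python
def compact_history(history, keep_last_turns=3, max_chars=200):
    """Keep the system prompt, the task, and recent turns verbatim;
    shorten older shell observations to a fixed prefix."""
    head = history[:2]   # system prompt + original task
    body = history[2:]
    # One turn is roughly an assistant message plus the observation that follows it
    keep_from = max(len(body) - 2 * keep_last_turns, 0)
    compacted = []
    for i, msg in enumerate(body):
        is_old_observation = i < keep_from and msg["role"] == "user"
        if is_old_observation and len(msg["content"]) > max_chars:
            dropped = len(msg["content"]) - max_chars
            content = msg["content"][:max_chars] + f"\n[... truncated {dropped} chars]"
            compacted.append({"role": "user", "content": content})
        else:
            compacted.append(msg)
    return head + compacted
```

One way to use it is to call agent_step(compact_history(history)) while still appending to the full history, so your logs keep the complete transcript. Swapping the truncation for a model-written summary is the obvious next step if prefixes lose too much.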

3. MoE models need different inference tuning

If you're testing newer mixture-of-experts releases (the "A3B" suffix in some recent Qwen releases reportedly indicates ~3B active parameters per token), the default transformers settings often leave performance on the table. For these, I've had much better latency with vllm:

pip install vllm
vllm serve your-model-here --tensor-parallel-size 2

Then point your harness at the OpenAI-compatible endpoint instead of running the model in-process. The throughput difference on multi-turn agent loops is noticeable — you're doing dozens of forward passes per task.
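Here is one way the swap could look, using only the standard library. The base URL assumes vLLM's default port on localhost with no API key, and agent_step_remote is a hypothetical drop-in for agent_step; adjust both to your setup.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default port, no auth

def build_chat_request(history, model, max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": history,
        "max_tokens": max_tokens,
        "temperature": 0,  # keep evals as deterministic as the server allows
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def agent_step_remote(history, model="your-model-here"):
    """Send the running history to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(history, model), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The server applies the chat template itself, so the harness passes the raw message list instead of a templated prompt string.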

Prevention: bake the eval into your workflow

The meta-lesson from all this: don't trust leaderboards for your specific use case. They're a useful filter, but a 5-point gap on Terminal-Bench means almost nothing if the model fails on the specific commands your agent runs.

A few habits that have saved me time:

Keep a fixed task set of 20–30 representative jobs. Re-run them against every model you consider. Same prompts, same scoring, same sandbox.
Log every failed turn. Most regressions show up as parsing or format issues long before they show up as reasoning issues.
Test the inference stack, not just the weights. The same model on transformers vs vllm vs llama.cpp can score differently because of subtle tokenization or sampling defaults.
Check the official model card and benchmark source before quoting numbers. Leaderboard scores get updated; blog posts don't.
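To make the first two habits concrete, here is a small scoring sketch. The task shape (a name, a prompt, and a check callable that judges the final answer) is an assumption of mine; run_task is whatever single-task runner your harness exposes.

```python
def score_task_set(tasks, run_task):
    """Run every task once and report the completion rate plus the failures."""
    failures = []
    for task in tasks:
        try:
            answer = run_task(task["prompt"])
            passed = answer is not None and task["check"](answer)
        except Exception as exc:  # a crash counts as a failed task, not a failed run
            answer, passed = repr(exc), False
        if not passed:
            # Keep what the model actually produced so failures can be triaged later
            failures.append({"name": task["name"], "answer": answer})
    completed = len(tasks) - len(failures)
    return {"completion_rate": completed / max(len(tasks), 1), "failures": failures}
```

Keeping the failure list alongside the rate matters more than the rate itself, because it tells you whether a model lost points on parsing or on reasoning.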

The gap between "this model benchmarks well" and "this model works in my agent" is real, and it's almost always closeable with better tooling around the model rather than a bigger model. Start with the harness, find your actual bottleneck, then decide what to swap.

Why prompt engineering fails for tone control — and how steering vectors fix it

Alan West — Sun, 17 May 2026 20:55:41 +0000

The problem: prompts are not a behavior dial

I spent two days last month trying to make a 7B chat model sound less robotic. System prompts. Few-shot examples. Explicit "do not use the word 'utilize'" instructions. The model kept doing exactly what I told it not to do, like a teenager who hears the opposite of every request.

If you've worked with open-weight models, you've felt this. Prompt engineering looks like a behavior dial but it's really more like shouting suggestions at a trained habit. The model has learned a tone through fine-tuning, and your runtime instructions are wrestling with that whole training corpus.

What I needed was a way to nudge the model's internal state directly. Turns out that's been possible for a while — it's called activation steering, or steering vectors — and the recent wave of efficient open-weight releases has made it tractable on a single GPU again, which is why I'm revisiting it.

Root cause: behavior lives in the residual stream, not the prompt

Here's the thing prompt engineering can't fix. When a transformer generates a token, the prompt is just one input to a much larger machinery: the residual stream, attention patterns, MLP outputs at each layer. Behavioral traits like "formal vs. casual," "refusal-prone vs. helpful," or "concise vs. verbose" show up as directions in that residual stream.

If a model has been post-trained into a certain tone, that tone is encoded as a stable direction the residual stream tends to walk toward. Your prompt nudges the inputs. The training-induced direction is doing the heavy lifting.

The fix is to identify that direction and add (or subtract) it directly to the hidden states during the forward pass.

The technique: contrast pairs and mean activations

The basic recipe — documented in the activation-engineering literature; Turner et al. is a reasonable starting point — looks like this:

Pick a behavior you want to steer (say, "formal" vs. "casual").
Build two small sets of contrasting prompts.
Run the model on both sets and capture the hidden state at a chosen layer.
Take the mean activation of each set and subtract — that's your steering vector.
Add a scaled version of that vector to the residual stream during generation.

Here's how that looks in PyTorch with a HuggingFace Transformers model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-open-weight-model"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Pick a mid-to-late layer. Earlier = more abstract, later = more surface.
LAYER = 18
target = model.model.layers[LAYER]

captured = []

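def grab_hidden(module, inp, out):
    # decoder layers return a tuple in many transformers versions (out[0] is
    # the residual stream tensor); some newer releases return the tensor itself
    hidden = out[0] if isinstance(out, tuple) else out
    captured.append(hidden.detach().mean(dim=1))  # mean over sequence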

handle = target.register_forward_hook(grab_hidden)

def collect(prompts):
    acts = []
    for p in prompts:
        captured.clear()
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**ids)
        acts.append(captured[0])
    return torch.cat(acts).mean(dim=0)

casual = ["hey, can you walk me through...", "yo what's up with...", "ok so basically..."]
formal = ["Please describe...", "Could you elaborate on...", "Kindly explain..."]

casual_mean = collect(casual)
formal_mean = collect(formal)

steering = casual_mean - formal_mean  # direction: formal -> casual
handle.remove()

A few non-obvious bits. The hook grabs out[0] because most HuggingFace decoder layers return a tuple. Averaging over the sequence dimension throws away position info but gives you a single direction per prompt, which is usually enough for tone-style traits. The three prompts per side above are only for illustration; a dozen contrast pairs is often plenty.

Applying the vector during generation

Now re-hook the same layer, but this time add the steering vector to every forward pass:

SCALE = 4.0  # tune this. Too low = no effect. Too high = the model speaks in tongues.

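def steer(module, inp, out):
    # handle both tuple-returning and tensor-returning decoder layers
    is_tuple = isinstance(out, tuple)
    hidden = out[0] if is_tuple else out
    # broadcast across batch and sequence dims
    steered = hidden + SCALE * steering.to(device=hidden.device, dtype=hidden.dtype)
    return (steered,) + out[1:] if is_tuple else steered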

handle = target.register_forward_hook(steer)

prompt = "Explain how DNS resolution works."
ids = tok(prompt, return_tensors="pt").to(model.device)
output = model.generate(**ids, max_new_tokens=200, do_sample=False)
print(tok.decode(output[0], skip_special_tokens=True))

handle.remove()

The first time I ran this with SCALE=10, it produced fluent-sounding gibberish about "vibing with the resolver." Cranking it down to 3-4 gave me a noticeably more casual register without breaking syntax. That tuning step is unavoidable.

What surprised me

A few practical findings from running this across a handful of open-weight models:

Layer choice matters more than vector quality. Steering around 60-80% of the way through the network usually works best. Too early and the effect washes out; too late and you damage coherence.
Subtraction is as useful as addition. Want the model to refuse less? Build a contrast pair of refusal vs. compliance and subtract the refusal direction. Same math, opposite sign.
Effects compose, somewhat. You can stack two steering vectors at different layers. Don't expect linearity, but it doesn't immediately collapse the model either.
Small models are noisier. Sub-3B models have less clean directional structure. I haven't tested this exhaustively across architectures but the pattern is consistent on the ones I've touched.
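For the layer-choice point, a tiny helper can turn the 60-80% rule of thumb into a list of layers to sweep. The bounds are just the heuristic from the list above, not a guarantee for any given architecture.

```python
def candidate_layers(num_layers, lo=0.6, hi=0.8):
    """Layer indices between lo and hi of the network depth, inclusive."""
    first = int(num_layers * lo)
    last = min(int(num_layers * hi), num_layers - 1)
    return list(range(first, last + 1))
```

With a HuggingFace model you would typically pass model.config.num_hidden_layers, then run the same contrast pairs at each candidate and compare the outputs by hand.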

A debugging detour: when steering looks like it's working but isn't

The most annoying failure mode I hit: the steered output sounded right on cherry-picked prompts but had quietly destroyed instruction-following on anything multi-turn. The model would happily chat in the right tone and ignore the actual question.

What helped was a simple before/after harness — run the same fifty prompts unsteered and steered, then eyeball the diffs. Tone shifts show up everywhere. Capability regressions show up as the model losing track of structure: forgetting JSON schemas, dropping list items, ignoring length constraints.

If you see that pattern, your scale is too high or your layer is too late.
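Eyeballing fifty diffs gets old, so here is a rough sketch of automating the structural part of that comparison. The three checks (JSON validity, list-item count, length ratio) are my own stand-ins for the regressions described above, and generate_plain and generate_steered are hypothetical callables that wrap your unsteered and steered generation.

```python
import json
import re

# Lines that start with "-", "*", or "1." style markers
LIST_ITEM = re.compile(r"^\s*(?:[-*]|\d+\.)\s+", re.MULTILINE)

def is_json_container(text):
    """True if the text parses as a JSON object or array."""
    try:
        return isinstance(json.loads(text), (dict, list))
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

def structural_regressions(prompts, generate_plain, generate_steered, max_len_ratio=2.0):
    """Flag prompts where steering broke structure the unsteered model kept."""
    flagged = []
    for prompt in prompts:
        base, steered = generate_plain(prompt), generate_steered(prompt)
        reasons = []
        if is_json_container(base) and not is_json_container(steered):
            reasons.append("lost valid JSON")
        if len(LIST_ITEM.findall(steered)) < len(LIST_ITEM.findall(base)):
            reasons.append("dropped list items")
        if len(steered) > max_len_ratio * max(len(base), 1):
            reasons.append("length blew up")
        if reasons:
            flagged.append((prompt, reasons))
    return flagged
```

Tone still needs a human read. This only catches the cases where the model stopped following structure.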

Prevention tips: don't ship this without guardrails

Steering vectors are a power tool. A few things I'd insist on before putting one anywhere near production:

Evaluate on a held-out set. It's easy to overfit a steering vector to your contrast pairs and miss that it breaks long-form coherence.
Cap the scale. Treat scale as a safety parameter, not a hyperparameter. Hard-cap it in code.
Log the unsteered output too. During rollout, run both and diff them. You'll catch failure modes that pure eval won't.
Don't steer for capabilities you couldn't already coax out with prompting. If the model can't do the task at all, steering will produce confident nonsense, not a fix.
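Hard-capping the scale can be as small as the sketch below. MAX_SCALE is a hypothetical ceiling; you would pick yours from the held-out eval rather than copy this number. Negative values are allowed because subtracting a direction is a legitimate use.

```python
import math

MAX_SCALE = 6.0  # hypothetical ceiling; derive the real one from your eval results

def capped_scale(requested: float, max_scale: float = MAX_SCALE) -> float:
    """Clamp a requested steering scale into [-max_scale, max_scale]."""
    if not math.isfinite(requested):
        raise ValueError(f"steering scale must be finite, got {requested!r}")
    return max(-max_scale, min(max_scale, requested))
```

Routing every configured scale through one function like this means a bad config value degrades to the cap instead of to gibberish.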

Prompt engineering isn't going anywhere — it's the cheapest tool you've got. But when you hit the wall where the model's training is fighting your instructions, it's worth reaching for the layer where that fight is actually happening.

Arxiv's Moderation Debate: Why Preprint Gatekeeping Is Hard

Alan West — Sun, 17 May 2026 16:57:07 +0000

I've been lurking on r/MachineLearning long enough to know that any thread mentioning Arxiv policy changes will spiral within the hour. The recent discussion about a proposed submission ban — reportedly a one-year restriction tied to certain categories of papers — is no exception. The thread title called the backlash "perplexing," and honestly, I get where the OP is coming from. But I also get why people are mad.

Let me walk through what I think is actually happening here, what the tradeoffs look like, and why this conversation matters even if you don't publish papers.

What's going on with Arxiv (as best I can tell)

I want to be upfront: I'm working from the Reddit discussion and the general public chatter, not an official Arxiv announcement I've personally read end-to-end. According to early reports, Arxiv has been tightening moderation in cs.LG and adjacent categories, and there's been talk of restrictions targeting low-effort or AI-generated submissions. If you want the authoritative version, Arxiv's moderation page is the place to start.

The specifics matter less than the pattern. Arxiv has been getting flooded. The cs.LG category alone gets a staggering volume of submissions now, and a non-trivial chunk of that is — let's be polite — not great.

Why some pushback feels reasonable

There's a legitimate concern under the noise. Arxiv has historically been the great equalizer. A PhD student in Lagos and a researcher at DeepMind upload to the same place, and the work stands on its own merits. Any policy that adds friction risks rebuilding the gatekeeping that preprint servers were meant to bypass.

The specific worries I keep seeing:

Endorsement requirements disadvantage researchers without established network connections
Category-specific bans could be applied unevenly
Appeal processes are notoriously opaque
The line between "low quality" and "unfashionable but legitimate" is fuzzy

If you've ever had a paper desk-rejected for reasons that felt arbitrary, you understand the visceral reaction.

Why the backlash is a bit perplexing

Here's the thing though — I sympathize with Arxiv's moderators. I ran a small open-source project for a couple of years, and the volume of low-effort contributions during the LLM boom was honestly demoralizing. Imagine that, but you're responsible for filtering scientific literature.

A few uncomfortable truths:

The signal-to-noise ratio in cs.LG has visibly degraded
Survey papers with no original contribution have become a genre unto themselves
LLM-generated "research" exists and is being submitted in volume
Moderators are volunteers and academics, not a content moderation army

If you're going to have a public scientific record, someone has to filter it. The alternative is that Arxiv becomes Medium, but for math.

A practical analogy from the dev world

This whole thing reminds me of when npm started cracking down on typosquatting and spam packages. Every time the registry tightened rules, there was an outcry about "gatekeeping the open ecosystem." Then, six months later, everyone quietly admitted the registry was better.

Here's a tiny snippet from a moderation pipeline I built for a community submissions tool last year:

# Simple heuristic-based pre-filter before human review
# Not perfect, but cuts the queue by ~60%

def triage_submission(submission: dict) -> str:
    score = 0

    # Length sanity check — too short usually means low effort
    if len(submission['body']) < 500:
        score -= 2

    # Repetition check — LLM slop often repeats phrases
    words = submission['body'].split()
    unique_ratio = len(set(words)) / max(len(words), 1)
    if unique_ratio < 0.35:
        score -= 3

    # Citation density — academic-style content cites things
    if submission.get('citation_count', 0) == 0:
        score -= 1

    if score <= -3:
        return 'auto_reject'
    elif score <= -1:
        return 'manual_review'
    return 'fast_track'

This is crude. It's also better than nothing when you're drowning. Arxiv's moderators are doing a version of this, just with way higher stakes and way more pressure.

What this actually means for ML developers

If you're building ML systems and not writing papers, why should you care? Because Arxiv is part of your infrastructure whether you realize it or not. The model card you're skimming, the technique you're implementing, the benchmark you're citing — most of that flows through Arxiv.

Here's a quick utility I use to pull Arxiv metadata for tracking papers I want to reproduce:

import arxiv  # pip install arxiv

def fetch_paper_metadata(arxiv_id: str) -> dict:
    search = arxiv.Search(id_list=[arxiv_id])
    # Client.results() replaces the deprecated Search.results()
    paper = next(arxiv.Client().results(search))

    return {
        'title': paper.title,
        'authors': [a.name for a in paper.authors],
        'abstract': paper.summary,
        'pdf_url': paper.pdf_url,
        # Useful for tracking which version you reproduced from
        'version': paper.entry_id.split('v')[-1],
        'updated': paper.updated.isoformat(),
    }

# Always pin to a specific version when reproducing results
metadata = fetch_paper_metadata('2301.00000v2')

Docs at arxiv.org/help/api if you want to integrate this seriously.

The harder question nobody's answering

The debate is framed as "open vs. gatekept," but I think the real question is: what is Arxiv for now?

When it started, it was a way for physicists to share preprints faster than journal cycles allowed. Today it's the primary distribution channel for ML research, a citation graph backbone, and a de facto archive. Those are three different missions with three different optimal moderation policies. Trying to serve all of them with one policy is going to upset somebody no matter what.

A side note on platform identity

This stuff isn't unique to academic platforms. Any service that grows past its original scope hits the same wall. I had to migrate a side project's auth a few months back because what started as "just let people log in" turned into account recovery, rate limiting, abuse prevention, and audit logs. Tools like Authon, Clerk, and Auth0 exist exactly because that complexity is real — Authon's free tier is unlimited users with no per-seat cost, which made the migration painless for an unfunded side project. The point is: platforms accumulate responsibility whether their maintainers planned for it or not.

What I'd actually do if I were Arxiv

A few things I'd push for, with the caveat that I'm an outsider with opinions:

Transparency reports: publish moderation stats quarterly
Clearer appeal paths with stated SLAs
Category-specific policies rather than blanket rules
Better tooling for endorsers so the load doesn't fall on the same 50 people

None of that is sexy. None of it generates an r/MachineLearning thread with 800 comments. But it's the boring infrastructure work that keeps shared scientific resources functional.

The backlash isn't perplexing to me — it's the predictable reaction when a free resource starts having to make tradeoffs that used to be invisible. That doesn't make the tradeoffs wrong. It just means the conversation we're actually having is about scarcity, and we haven't admitted that yet.

Why frontier LLMs solve your CTF challenges in minutes (and how to fix it)

Alan West — Sun, 17 May 2026 15:52:11 +0000

I ran a small internal CTF for our team last month. Twelve challenges, expected solve time around six hours for a strong player. The first three fell in under ten minutes — not because the players were geniuses, but because they pasted the prompt into an LLM and waited.

This is not a rant about cheating. The same thing is happening in public CTFs, and it's exposing a real engineering problem: most CTF challenges were designed assuming the solver is a human reading a static artifact. Frontier models are extremely good at reading static artifacts. If you want challenges that still teach something in 2026, you have to design them differently.

Here's the debugging walkthrough I went through after watching my own event get eaten.

The root cause: challenges that are pure pattern recognition

Most "easy" and "medium" CTF problems share a shape. You get a file or an endpoint. You inspect it. You recognize a known scheme — XOR with a short key, a misuse of ECB mode, a path traversal, a weak JWT secret, a pickle deserialization. You apply the known counter and pull the flag.

That shape is exactly what large language models trained on writeups handle effortlessly. There are tens of thousands of solved CTF writeups indexed on the public web. The model has seen the pattern, and it has seen the canonical exploit. Showing it your toy variant doesn't trip it up — it just fills in the blanks.

Here's a stripped-down example of a challenge I used to think was clever:

# Server side: a 'custom' XOR cipher
import os

KEY = os.urandom(8)  # 8-byte repeating key
flag_body = b"example_flag_body"  # placeholder; the real flag body is not shown

def encrypt(plaintext: bytes) -> bytes:
    # XOR each plaintext byte with the key, repeating the key as needed
    return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(plaintext))

# Hand the player a ciphertext of a known-format header + flag
ciphertext = encrypt(b"FLAG_FORMAT{" + flag_body + b"}")

The intended solution is known-plaintext recovery against the header, then decrypt the rest. A first-year security student should get it after some effort. A frontier model writes the solver in one shot because the pattern is famous. The challenge isn't testing what I thought it was testing.
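For reference, here is roughly what that one-shot solver looks like. This is a minimal sketch, assuming the player knows (or guesses) the 8-byte key length and the FLAG_FORMAT{ header; the names xor_repeat, recover_key, and solve are mine, not part of the original challenge.

```python
import os

HEADER = b"FLAG_FORMAT{"  # known plaintext prefix (12 bytes, longer than the key)
KEY_LEN = 8               # assumed known or guessed by the solver


def xor_repeat(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key; encryption and decryption are the same."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def recover_key(ciphertext: bytes) -> bytes:
    """Ciphertext XOR known plaintext gives the key bytes back directly."""
    return bytes(c ^ p for c, p in zip(ciphertext[:KEY_LEN], HEADER[:KEY_LEN]))


def solve(ciphertext: bytes) -> bytes:
    """Recover the key from the header, then decrypt the whole message."""
    return xor_repeat(ciphertext, recover_key(ciphertext))


if __name__ == "__main__":
    key = os.urandom(KEY_LEN)
    ct = xor_repeat(HEADER + b"demo_flag_body" + b"}", key)
    print(solve(ct))
```

Because the header is longer than the key, the first eight ciphertext bytes are enough to recover the whole key. That is why the pattern is so easy to recognize and automate.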

Why hardening the artifact doesn't help

My first instinct was to obfuscate. Pack the binary. Strip symbols. Add anti-debugging. None of it works for very long, and worse, it makes the challenge less educational for humans while barely slowing the model down. The model isn't running your binary — it's reading it, and if the underlying algorithm is something it's seen before, it'll recognize it through layers of fluff.

The issue isn't surface complexity. It's that the solution space is in the training distribution.

Step-by-step fix: design around what models are bad at

After rebuilding my challenge set, the patterns that survived had three things in common.

1. Real-time stateful interaction

If the challenge requires holding a TCP connection open, reacting to server timing, or responding within a window, you've moved out of "read the artifact" territory. The model has to plan and execute, not just generate. Agent harnesses are catching up here, but the failure rate is dramatically higher than for static problems.

A basic shape that worked well:

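import asyncio, secrets

FLAG = b"FLAG_FORMAT{placeholder}\n"  # placeholder; the real flag is not shown

async def handle(reader, writer):
    # Challenge sends a nonce, expects a response within 200ms
    # The response must include a hash of (nonce + previous_response)
    # for the last N rounds, so the player must maintain state
    history = []
    try:
        for _ in range(64):
            # Hex-encode the nonce so it can never contain a newline byte
            nonce = secrets.token_hex(16).encode()
            writer.write(nonce + b"\n")
            await writer.drain()
            # A slow or missing reply raises TimeoutError and ends the session
            line = await asyncio.wait_for(reader.readline(), timeout=0.2)
            # Validate against history chain (details omitted)
            history.append(line.strip())
        writer.write(FLAG)
        await writer.drain()
    except asyncio.TimeoutError:
        pass  # too slow: no flag
    finally:
        writer.close()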

The model can write a client for this, but if it gets one round wrong it has to redo the entire session. Latency budget plus state chain catches a lot of one-shot attempts.
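The validation step is elided in the handler above. One way it could look is sketched here, under my own assumptions: the chain only covers the single previous response rather than the last N rounds, the hash is SHA-256, and responses are hex-encoded. The helper names expected_response and validate_round are hypothetical.

```python
import hashlib


def expected_response(nonce: bytes, previous: bytes) -> bytes:
    """Hex SHA-256 of nonce + previous response (assumed chaining rule)."""
    return hashlib.sha256(nonce + previous).hexdigest().encode()


def validate_round(nonce: bytes, line: bytes, history: list) -> bool:
    """Check one reply against the chain; the first round chains from b''."""
    previous = history[-1] if history else b""
    return line == expected_response(nonce, previous)


if __name__ == "__main__":
    history = []
    first = expected_response(b"nonce-1", b"")
    print(validate_round(b"nonce-1", first, history))
    history.append(first)
    # Replaying the first answer for the second nonce fails the chain check.
    print(validate_round(b"nonce-2", first, history))
```

The server would call validate_round before appending to history and drop the connection on the first False, which is what forces a full restart after a single bad round.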

2. Custom protocols with no public writeups

This is the boring answer but it's the most effective one. Invent the format. Don't reuse a well-known one and tweak it. The model's strength is recognizing what it's seen — if it has not seen your binary protocol because you made it up last Tuesday, it has to actually reason about the bytes.

A pattern I like: define a small VM with three or four opcodes, give the player a program in that bytecode, and embed the bug in the VM semantics rather than in the program. The model can disassemble the program quickly. Figuring out that opcode 0x07 has an off-by-one in the bounds check is much harder when there's no Stack Overflow answer about it.
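To make that concrete, here is a toy version of the idea. Everything in it is invented for illustration: the opcode numbers, the memory layout, and the TinyVM class are not from a real challenge. The bug lives in the LOAD bounds check, not in any program the player is given.

```python
# Hypothetical four-opcode VM; opcode values are made up for this sketch.
PUSH, ADD, LOAD, HALT = 0x01, 0x02, 0x07, 0xFF


class TinyVM:
    def __init__(self, user_cells, secret):
        # The secret cell sits directly after the user-visible cells.
        self.mem = list(user_cells) + [secret]
        self.user_len = len(user_cells)
        self.stack = []

    def run(self, program):
        pc = 0
        while pc < len(program):
            op = program[pc]
            if op == PUSH:
                self.stack.append(program[pc + 1])
                pc += 2
            elif op == LOAD:
                idx = program[pc + 1]
                # Intentional off-by-one: a correct check is idx >= self.user_len.
                if idx > self.user_len:
                    raise IndexError("out of bounds")
                self.stack.append(self.mem[idx])
                pc += 2
            elif op == ADD:
                b, a = self.stack.pop(), self.stack.pop()
                self.stack.append(a + b)
                pc += 1
            elif op == HALT:
                break
            else:
                raise ValueError(f"unknown opcode {op:#x}")
        return self.stack


if __name__ == "__main__":
    vm = TinyVM([10, 20, 30], secret=1337)
    print(vm.run([LOAD, 3, HALT]))  # index 3 is one past the user cells
```

A player who only disassembles the supplied program never sees the flaw. They have to read the interpreter and notice that index user_len slips through the check.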

3. Multi-stage chains where each stage gates the next

Single-shot problems are the model's home turf. Chains that require pivoting — get RCE here, find creds, use them to query an internal service, leak a key, sign a token — multiply the chance of a mid-chain failure. Each step needs to feed the next, and the model has to keep its context coherent across all of them.

The practical trick is making the intermediate outputs noisy. If stage 1 produces a clean string that says next_password: hunter2, the model marches on. If stage 1 produces a memory dump where the password is one of forty plausible candidates, the model often picks the wrong one and the chain breaks silently.
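A rough sketch of how a noisy stage output could be generated follows. The dump format, the make_noisy_dump name, and the choice of 39 decoys are my assumptions; a real challenge would also need some in-world clue that lets a careful human tell the real entry apart.

```python
import random
import string


def make_noisy_dump(real_password: str, decoys: int = 39, seed: int = 0) -> list:
    """Return shuffled dump lines holding the real password plus unique decoys."""
    rng = random.Random(seed)  # seeded so every player gets the same dump
    alphabet = string.ascii_lowercase + string.digits
    candidates = {real_password}
    while len(candidates) < decoys + 1:
        # Decoys share the real password's length so none stands out.
        candidates.add(
            "".join(rng.choice(alphabet) for _ in range(len(real_password)))
        )
    shuffled = rng.sample(sorted(candidates), len(candidates))
    return [f"cred_cache[{i:02d}] = {c}" for i, c in enumerate(shuffled)]


if __name__ == "__main__":
    for line in make_noisy_dump("hunter2")[:5]:
        print(line)
```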

Prevention: a checklist before you ship a challenge

When I review a new challenge now, I run it past a frontier model myself with a deliberately weak prompt — something like "solve this CTF challenge, here are the files." If it gets the flag on the first or second attempt, the challenge isn't ready. Concretely:

Does the writeup for the intended solution exist on the public web for a near-identical problem? If yes, redesign.
Can the entire solution be derived from a single static snapshot? If yes, add interaction or state.
Does the challenge require any novel reasoning, or is it pattern-matching a known vuln class? If pattern-matching, you're really testing recall, not skill.
Is there a tight latency or rate constraint? Even a 500ms response window changes the game.
Are intermediate stages noisy enough that the wrong answer is plausibly correct?

None of this is bulletproof. Models keep getting better, and harnesses for agentic exploitation are improving fast. But the framing shift matters more than any specific technique: stop designing for the solo human reader, and start designing for an adversary that has memorized every public writeup but struggles to plan across long interactive sessions.

If you run CTFs, the format isn't dead — but the lazy version of it is. The good news is that the challenges that survive this filter are also the ones that teach the most. Forcing yourself to write something a model hasn't seen tends to push you toward more interesting problems anyway.

I haven't run a fully model-resistant event yet — six months from now this advice may already be stale. But the direction of travel is clear, and the cost of redesigning a challenge set is much lower than the cost of running an event where half the leaderboard is just whoever pasted fastest.

Why your AI agent code turns into spaghetti — and how to untangle it

Alan West — Sun, 17 May 2026 15:39:27 +0000

The 3am pager that changed how I write agents

A few months back, I shipped what I thought was a clean agent for a client. It scraped web pages, summarized them, then routed the results to different downstream tools based on content. Worked great in dev. Worked great for the first week.

Then I got paged at 3am.

The agent had gotten into a loop. One of the tools timed out, returned a partial response, the LLM "decided" the task wasn't done, called the same tool again, got another partial response, and so on. By the time I caught it, we'd burned through about 4,000 API calls overnight.

The fix wasn't fun. The agent logic was scattered across if statements, retry decorators, prompt templates, and a while loop that was supposed to terminate when the LLM said "DONE". Spoiler: it sometimes did not say DONE.

Root cause: imperative code + stochastic calls = chaos

The mistake I keep seeing (and keep making) is treating an LLM call like any other function. It's not. A regular function returns deterministic output for given input. An LLM call returns probable output, and that output drives control flow.

When you mix:

imperative control flow (if/else, while, recursion)
stochastic decisions (the model "decides" the next step)
side effects (tool calls, DB writes, API requests)

...without any structural boundary between them, you get code where you can't reason about termination, retries, or partial state.

Here's the kind of thing I'm talking about:

def run_agent(task):
    history = [{"role": "user", "content": task}]
    while True:  # the footgun
        response = call_llm(history)
        history.append(response)
        if "DONE" in response["content"]:
            return response
        if response.get("tool_call"):
            result = execute_tool(response["tool_call"])
            history.append({"role": "tool", "content": result})
        # if neither branch hits, we loop forever

The model is the loop variant and the body. There's no separation between "what step am I in?" and "what does the model want next?". If the model gets confused, your program gets confused.

Step 1: separate the planner from the executor

The first refactor that actually helped: split the model's role into two distinct jobs, and never let them run in the same loop.

# Planner: produces a static plan from the task. One LLM call.
plan = planner_llm(task)  # returns a list of {step, tool, args}

# Executor: walks the plan deterministically.
for step in plan:
    result = run_step(step)
    if not result.ok:
        break  # bail to a reviewer, don't keep guessing

Now the loop is a regular for over a finite list. The model is no longer driving control flow at runtime — it built the plan once, up front. If something goes wrong, you have a concrete plan you can inspect, edit, or re-run.

The tradeoff: you lose adaptive replanning. The model can't react to a tool's output mid-flight. For roughly 70% of the agent workloads I've built, this is fine. For the other 30%, you need replanning — which leads to step 2.

Step 2: make the state machine explicit

For the replanning case, the trick is to stop pretending your agent is a chatbot. It's a state machine. Make the states real:

STATES = ["planning", "executing", "reviewing", "done", "failed"]

def step(state, ctx):
    if state == "planning":
        ctx.plan = planner_llm(ctx.task)
        ctx.cursor = 0  # a fresh plan starts from its first step
        return "executing"
    if state == "executing":
        if ctx.cursor >= len(ctx.plan):
            return "reviewing"
        result = run_step(ctx.plan[ctx.cursor])
        ctx.cursor += 1
        if not result.ok:
            return "reviewing"  # let the reviewer decide what to do
        return "executing"
    if state == "reviewing":
        decision = reviewer_llm(ctx)  # "done" | "replan" | "fail"
        return {"done": "done",
                "replan": "planning",
                "fail": "failed"}[decision]
    # "done" and "failed" are terminal; the caller should stop before stepping them
    raise ValueError(f"no transition defined for state {state!r}")

Now you can:

Cap steps per plan (assert ctx.cursor < MAX_STEPS) and cap total transitions in whatever loop drives step()
Persist ctx between steps so you can resume after a crash
Log every transition, which makes 3am debugging tractable
Restrict which LLM calls can happen in which state (no surprise tool calls during review)
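Here is a minimal sketch of a loop that drives a step function like the one above, with a hard cap and transition logging. The run function, MAX_TRANSITIONS, and the demo_step stand-in are my own names; demo_step replaces the real step() so the example runs without any LLM calls.

```python
import logging
from types import SimpleNamespace

log = logging.getLogger("agent")
MAX_TRANSITIONS = 50           # assumed cap; tune per workload
TERMINAL = {"done", "failed"}


def run(step_fn, ctx, start="planning", max_transitions=MAX_TRANSITIONS):
    """Drive the state machine until a terminal state or the cap is hit."""
    state = start
    for n in range(max_transitions):
        if state in TERMINAL:
            return state
        next_state = step_fn(state, ctx)
        log.info("transition %d: %s -> %s", n, state, next_state)
        state = next_state
    # Out of budget: anything that is not terminal counts as a failure.
    return state if state in TERMINAL else "failed"


def demo_step(state, ctx):
    """Stand-in for the real step(); walks straight to 'done'."""
    return {"planning": "executing",
            "executing": "reviewing",
            "reviewing": "done"}[state]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(run(demo_step, SimpleNamespace()))
```

Persisting ctx would slot in right after the log line. The point is that the only loop in the program is a bounded for, no matter what the model returns.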

This is the pattern I wish someone had shown me two years ago. It's the same idea as Erlang's gen_statem, or any workflow engine: separate "what state am I in" from "what should the model do here".

Step 3: constrain the model's output, don't parse it

The other class of bug that ate hours of my life: the model returns something almost right and the parser silently fails or hallucinates a tool call.

The fix is structured output. Most providers now support a JSON schema constraint at the API level. Use it:

schema = {
    "type": "object",
    "properties": {
        "action": {"enum": ["call_tool", "finish", "ask_user"]},
        "tool": {"type": "string"},
        "args": {"type": "object"},
    },
    "required": ["action"],
}

response = call_llm(
    history,
    response_format={"type": "json_schema", "json_schema": schema},
)

# response.action is guaranteed to be one of three strings.
# No more "DONE" / "Done" / "done." / "I am done." branching.

If you can't use schema-constrained output (some older models don't support it), at minimum validate with pydantic or zod before doing anything with the result, and treat validation failure as a known state, not an exception.
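As a dependency-free illustration of "validation failure as a known state", here is a sketch using only the standard library. pydantic or zod would give you the same thing with less code. The parse_action name and the "invalid" action value are assumptions of mine, not part of any provider API.

```python
import json

ALLOWED_ACTIONS = {"call_tool", "finish", "ask_user"}


def parse_action(raw: str) -> dict:
    """Return the validated action dict, or an explicit 'invalid' result."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"action": "invalid", "reason": f"not JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"action": "invalid", "reason": "top level is not an object"}
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return {"action": "invalid", "reason": "missing or unknown action"}
    if action == "call_tool":
        # A tool call needs a tool name; args default to an empty object.
        if not isinstance(data.get("tool"), str):
            return {"action": "invalid", "reason": "call_tool needs a tool name"}
        if not isinstance(data.get("args", {}), dict):
            return {"action": "invalid", "reason": "args must be an object"}
    return data


if __name__ == "__main__":
    print(parse_action('{"action": "finish"}'))
    print(parse_action("I am done."))
```

The caller branches on "invalid" like any other action, for example by routing to the reviewer state, instead of catching exceptions somewhere up the stack.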

Prevention: a checklist I now run before shipping

After getting bitten enough times, I keep this taped to the side of my monitor:

Bounded iterations. Every loop that contains an LLM call has a hard cap. No while True.
Explicit states. If I can't draw the state diagram on a napkin, the agent is too complex.
Structured output. Every model response that drives control flow is schema-validated.
Idempotent tools. Tool calls assume they may be retried. Side effects are keyed by request ID.
Observability first. Every state transition is logged with the input/output of the LLM call. If I can't replay it, I can't debug it.
Tested failure modes. I have integration tests where the model returns garbage, times out, or returns a tool call to a non-existent tool. The agent should fail gracefully, not loop.
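The "idempotent tools" item is the one people ask me about most, so here is a small sketch. It assumes an in-memory store and a caller-supplied request ID; the IdempotentTool class is a hypothetical wrapper, and a production version would persist results somewhere durable.

```python
class IdempotentTool:
    """Wrap a side-effecting function so retries with the same ID are no-ops."""

    def __init__(self, fn):
        self._fn = fn
        self._results = {}  # request_id -> result; in-memory only in this sketch

    def call(self, request_id: str, **args):
        if request_id in self._results:
            # Retry of a completed request: return the stored result, no side effect.
            return self._results[request_id]
        result = self._fn(**args)
        self._results[request_id] = result
        return result


if __name__ == "__main__":
    sent = []

    def send_email(to):
        sent.append(to)
        return f"sent:{to}"

    tool = IdempotentTool(send_email)
    tool.call("req-1", to="a@example.com")
    tool.call("req-1", to="a@example.com")  # retried by the agent
    print(sent)
```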

The 3am pager hasn't happened again. The agents look a lot less impressive from the outside — they're boring state machines now instead of dramatic recursive loops — but they actually work. The interesting work moved into the planner and reviewer prompts, which is where it belonged all along.

Why npm supply chain attacks keep happening and how to harden your installs

Alan West — Sun, 17 May 2026 15:36:34 +0000

When `npm install` becomes a security event

Look, I love npm. I've been shipping JavaScript for years and the ecosystem is genuinely incredible. But every few months we get another headline: a popular package gets hijacked, a maintainer's token leaks, a typosquatted package siphons environment variables for a week before anyone notices.

The frustrating part? The advice is always the same — "be careful what you install" — as if you're supposed to audit 1,200 transitive dependencies before every deploy.

Let me walk through what actually causes these incidents and what you can do at the project level. None of this is bulletproof, but the gap between a default npm install and a reasonably hardened install is bigger than most people realize.

Why the npm threat model is so messy

A typical Node project lists maybe 30 dependencies in package.json. Your node_modules ends up with 1,500. Every one of those packages can:

Run arbitrary code at install time via preinstall, install, and postinstall scripts
Get hijacked if the maintainer's account is phished or their token leaks
Be replaced with a malicious version when ownership transfers to a new maintainer
Be typosquatted (lodahs vs lodash) and copy-pasted into a Dockerfile at 2am

The kicker: most of this happens before your tests run, before your linter runs, before any review. The moment you run npm install, you've already executed whatever code the package author wanted to run.

Step 1: Stop running install scripts by default

This is the highest-leverage change you can make. Drop this in your project .npmrc:

# Disable preinstall/install/postinstall scripts globally for this project
ignore-scripts=true

Or pass it explicitly:

npm install --ignore-scripts

Yes, this breaks packages with legitimate native-build steps — node-gyp, sharp, better-sqlite3. The workaround is to enable scripts only for the packages you actually trust. There's no built-in allowlist, but you can rebuild specific packages after install:

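npm install --ignore-scripts
# Rebuild only the native deps you trust.
# If ignore-scripts=true is set in .npmrc, rebuild inherits it, so override it here.
npm rebuild sharp better-sqlite3 --ignore-scripts=false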

I started doing this last year on a fintech project and the friction is real — but it's a one-time setup cost per project, and it shuts down the most common payload-delivery path in npm supply chain incidents.
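To see which packages declare install scripts before deciding what to rebuild, the lockfile is a reasonable place to look. This sketch assumes a lockfileVersion 2 or 3 package-lock.json, where entries under "packages" carry a hasInstallScript flag; the function name is mine.

```python
import json


def packages_with_install_scripts(lock: dict) -> list:
    """List package names whose lockfile entry is flagged hasInstallScript."""
    names = set()
    for path, meta in lock.get("packages", {}).items():
        # The empty path "" is the root project itself; skip it.
        if path and meta.get("hasInstallScript"):
            # "node_modules/a/node_modules/b" -> "b"
            names.add(path.split("node_modules/")[-1])
    return sorted(names)


if __name__ == "__main__":
    with open("package-lock.json", encoding="utf-8") as fh:
        for name in packages_with_install_scripts(json.load(fh)):
            print(name)
```

Anything on that list that you don't recognize as a native build dependency deserves a closer look before it goes into the rebuild command.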

Step 2: Use `npm ci` everywhere except local dev

npm install is allowed to update your lockfile. In CI, that's a footgun. If package.json and the lockfile disagree, or the lockfile is missing, npm install resolves versions fresh and can pick up a compromised patch release without anyone noticing.

npm ci does the opposite: it installs strictly from package-lock.json and errors out if the lockfile is out of sync with package.json.

# In your Dockerfile, GitHub Actions, etc.
npm ci --ignore-scripts

See the official docs for the full behavior. Combine this with pinned deps and you've removed the silent version-drift attack surface.

Step 3: Audit the lockfile, not just package.json

Most code reviews focus on package.json because it's small. But the lockfile diff tells the real story — a one-line package.json change can introduce 80 new transitive dependencies.

When reviewing a PR that touches deps, look for:

New top-level packages you don't recognize
Packages with very recent first-publish dates
Packages with one maintainer and millions of downloads (single-point-of-failure targets)
Suspicious names (typos, hyphenation tricks like crossenv vs cross-env)

I'll be honest: nobody manually does this for every PR. That's what npm audit and Dependabot are for — but those mostly catch known CVEs, not zero-day supply chain stuff. The human eyeball check on lockfile diffs is still valuable for any load-bearing dep.
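For the eyeball check, it helps to reduce the lockfile diff to just "what is new". This sketch compares two parsed lockfiles, again assuming the lockfileVersion 2 or 3 "packages" layout; new_packages is a hypothetical helper, not an npm feature.

```python
def new_packages(old_lock: dict, new_lock: dict) -> dict:
    """Map package name to version for entries present only in the new lockfile."""
    old_paths = set(old_lock.get("packages", {}))
    added = {}
    for path, meta in new_lock.get("packages", {}).items():
        if path and path not in old_paths:
            name = path.split("node_modules/")[-1]
            added[name] = meta.get("version", "?")
    return added


if __name__ == "__main__":
    before = {"packages": {"": {}, "node_modules/lodash": {"version": "4.17.21"}}}
    after = {"packages": {"": {},
                          "node_modules/lodash": {"version": "4.17.21"},
                          "node_modules/left-pad": {"version": "1.3.0"}}}
    print(new_packages(before, after))
```

In practice you would load the two versions from the base and head commits of the PR. The output is short enough to read even when the raw diff is thousands of lines.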

Step 4: Verify package provenance

npm added provenance attestations via Sigstore back in 2023. When a package is published from a CI pipeline with provenance enabled, you can verify which repo and which workflow built it.

You can inspect provenance from the CLI:

npm view <package-name> --json
# Look for the "attestations" field in the dist block

Packages with provenance are cryptographically tied to a public source repo and a specific build, which makes the "I phished the maintainer and published from my laptop" attack much harder.

It's not universal — most packages still don't ship with provenance — but for your own publishes, enabling it is one flag:

npm publish --provenance --access public

Step 5: Pin, proxy, and contain

A few additional defenses worth setting up once and forgetting:

Pin exact versions in critical projects. Drop the ^ and ~ in package.json. You give up automatic patch updates — you also stop new patch releases from running in prod five minutes after publish.
Use a private registry or proxy. Verdaccio is a widely used open-source option. It lets you cache, mirror, and gate which versions reach your team.
Run installs in a sandboxed environment. A locked-down container with no network egress except to the registry is a good starting point. If a postinstall script tries to phone home, the connection fails.
Generate an SBOM. CycloneDX has a free npm plugin. It won't stop an attack, but it makes the post-incident question "are we exposed to package X at version Y?" answerable in seconds instead of hours.
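The pinning item is easy to enforce mechanically. Below is a small sketch that flags any dependency in package.json that is not an exact version. The regex is a simplified semver match and the unpinned function is my own naming, so treat it as a starting point rather than a complete range parser.

```python
import json
import re

# Exact versions such as 1.2.3, 1.2.3-beta.1, or 1.2.3+build.5 (simplified semver).
EXACT = re.compile(r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$")


def unpinned(package_json: dict) -> dict:
    """Return name -> spec for dependencies that are not exact versions."""
    loose = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in package_json.get(section, {}).items():
            if not EXACT.match(spec):
                loose[name] = spec
    return loose


if __name__ == "__main__":
    with open("package.json", encoding="utf-8") as fh:
        for name, spec in unpinned(json.load(fh)).items():
            print(f"{name}: {spec}")
```

Run as a CI check, a non-empty result can fail the build. Note that this only covers direct dependencies; transitive versions are still held in place by the lockfile.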

What none of this fixes

Let me be straight: there is no configuration that makes npm install safe in an absolute sense. The trust model is fundamentally "we run code from strangers." Every defense above raises the cost of an attack — it doesn't eliminate it.

The realistic goal is layered. Reduce the blast radius (no install scripts). Slow down bad updates (lockfile + pinning). Increase visibility (lockfile review, provenance, SBOMs). Contain damage if something gets through (sandboxed installs, no secrets in the build env).

If you only do one thing this week, set ignore-scripts=true in your project .npmrc and figure out which native packages legitimately need to be rebuilt. That single change cuts off the most common payload-delivery path in real-world incidents.

The "no way to prevent this" framing is funny because it's half true — you can't prevent compromised packages from being published. But you absolutely can prevent them from executing in your build environment. The defaults are bad. Your project doesn't have to inherit them.