DEV Community: Matthew

I delete code on Fridays

Matthew — Thu, 14 May 2026 09:43:53 +0000

Most weeks, the last thing I do before logging off Friday is open a file and delete something. Not refactor it. Delete it. A dead feature flag, a commented-out block somebody left "just in case" in 2022, a helper function with one caller that could just be inlined. Small stuff, usually. The point isn't the size. The point is the habit.

I started doing this maybe six years ago, after a particularly grim quarter on a service nobody fully understood anymore. We had a config loader that supported three formats. Two of them had no live consumers. I know that now because I eventually checked. At the time, everyone assumed someone, somewhere, depended on the YAML path and the INI path, so we kept carrying both through every change. Every bug fix had to be made three times. Every test had to cover three branches. The code wasn't complicated because the problem was complicated. It was complicated because we were afraid.

So now I delete things on purpose, on a schedule, when I'm calm and not under deadline pressure. Friday afternoon is good for this. The risky work is done, I'm not going to start anything new, and my brain is in a "tidy the workshop" mood instead of a "ship the thing" mood.

The rules I follow are boring, which is the idea. I only delete code I can prove is dead. That means I actually grep, I check the call sites, I look at the dashboards if it's a runtime path. If I can't prove it's dead within about ten minutes, I leave it and write down why I'm suspicious. The note matters as much as the deletion. Half the value here is building a paper trail of "this looked unused on 2026-03-14, here's what I checked." Future me trusts past me more when past me showed the work.

I also keep the deletions in their own commits. One commit, one removal, a message that says what it was and how I confirmed it was safe. If something breaks, the revert is trivial and obvious. Nobody has to untangle a deletion from a feature change. I've had exactly one of these deletions bite me in six years, and the rollback took under a minute because the commit was clean.

Here's what surprised me. The hard part was never the technical risk. Git remembers everything. If I delete something that turns out to matter, it's right there in history, and the build will usually tell me within the hour. The hard part was the feeling. Deleting working code feels like throwing away effort, even when the code does nothing. Somebody wrote that. It compiled. It passed review once. Removing it feels disrespectful in a way that's hard to articulate and completely irrational.

I've made peace with that feeling by reframing what the code costs. Every line in the repo is something a teammate has to read past to find the line they actually need. It's a branch a new hire has to mentally evaluate. It's surface area for a security scan to flag. Dead code isn't neutral. It's a small tax everyone pays forever, and nobody put it on the roadmap.

Twenty-some years in, I've stopped measuring a good week by what I added. Some of my favorite diffs are almost entirely red. A pull request that removes a whole module and the tests still pass is a genuinely satisfying thing to send, and it usually means I understood the system better at the end of the week than I did at the start.

It doesn't have to be Friday. It doesn't have to be weekly. Pick a rhythm you'll actually keep. But find a regular, low-stakes window to remove something you've proven is dead, and write down what you checked. The codebase gets lighter. So, weirdly, do you.

For a lot of libraries, I read the source before I read the docs

Matthew — Mon, 11 May 2026 07:57:16 +0000

I'm going to say something that has gotten me side-eye in every code review I've ever done.

For a lot of libraries, I read the source before I read the docs.

Not because I think I'm too cool for docs. I'm 46 and pretty tired. I'm optimizing for "stop debugging at 11pm on a Thursday," which is a different thing than optimizing for being clever. Docs answer the question the author thought you'd ask. Source answers the question you actually have.

Some specifics so this isn't just vibes.

The retry library that lied about jitter

Last quarter we had a job that was hammering an upstream API on backoff. The docs for the retry library said "exponential backoff with jitter, configurable." Cool. We set jitter to 0.3 expecting standard decorrelated jitter, the kind you'd write yourself if you read the AWS blog post from 2015.

The library's source had about forty lines of retry logic. The "jitter" parameter was multiplied against a single random number generated once at construction. Once. So every retry from a given worker used the same jitter ratio. We had two hundred workers all using the same seed strategy and they were stampeding in lockstep.

I would have found that in twenty minutes of reading source. I lost a day and a half reading docs, posting on a forum, and convincing myself I was holding it wrong.

What the source actually tells you

When I open a file in node_modules or site-packages, I'm usually looking for a few things.

First, how big is this, really. If the "engine" is 300 lines and half of them are TypeScript ceremony, my confidence in the abstraction is different than if it's 12,000 lines of clever metaprogramming. Both are fine. I just want to know which one I'm signing up for.

Then I look at error paths. Docs love the happy path. Source tells you what happens when the input is null, when the network blips, when the config key is missing. I scan for throw, catch, raise, and for any function with the word "default" in the name.

Defaults are honestly the biggest one. Half the bugs I chase are because some library defaults to "be helpful" in a way that's wrong for my context. Pooling, timeout, retry, "auto-reconnect" — the README never quite spells these out and the API reference makes you click through six pages to assemble the picture. The source file has them in one place.

I'm not anti-docs

I'm anti the assumption that docs are the ground truth. Code is the ground truth. Docs are a translation. Sometimes a very good one. The Postgres docs are scripture. The Go stdlib docs are honestly better than the source for understanding intent. But for the random Express middleware your coworker installed in 2021, the source is faster than spelunking through a doc site that hasn't been updated since the rewrite.

The trick that made this practical for me is just installing the editor plugin that lets me jump to the actual file with one keystroke. In VS Code it's "Go to Definition" but pointed at the real installed file, not the type stub. In neovim I've got it bound to gd. Once jumping into node_modules costs me zero friction, I do it constantly. If it costs me three clicks I'll keep reading the doc site instead.

When I do read docs first

I'll be honest about the asymmetry. I read docs first for anything where the wire protocol matters. HTTP clients, database drivers, queue clients. The behavior at the network boundary is what I actually need, and source won't tell me what the server expects.

Same for new languages or runtimes. Source reading without context is just noise. And for anything with a real config story like Webpack or Kubernetes or Terraform, the source is too sprawling to start there.

For everything else, especially the 400 transitive dependencies you didn't choose, the file is right there. Just open it.

The actual ask

If you're new-ish, try this once. Pick a library you use every day and don't really understand. Open the entry point. Read the main file. Don't try to understand everything. Just see how big it is and what the shape is. You'll probably learn more in 15 minutes than from a week of skimming README.

I'd rather know the thing than know what someone wrote about the thing.

What's a library where reading the source surprised you?

Log every dependency version at startup. That's it. That's the post.

Matthew — Sat, 09 May 2026 03:40:59 +0000

I have one piece of advice that has paid for itself, in debugging hours saved, more times than I can count.

When your service starts up, log the version of every external dependency you can identify.

That's it. That's the whole tip.

Database client version. HTTP client version. Major library versions. The runtime version. The OS kernel version, if you can get it. Whatever specific cloud SDK version is loaded. Print it all to your structured log under an application_startup event the first time the process boots.

Why?

Because every single time something works in staging and breaks in production, the very first question I want answered is "what is different about the running binary in those two environments?" And nine times out of ten, the answer is buried in a transitive dependency that resolved to a different version under different lockfiles, base images, or build environments.

I once spent two days chasing a flaky test in a CI environment that turned out to be a glibc version mismatch between the runner image and our app's Docker base. The fix was twenty seconds. The investigation was forty-eight hours.

Another time, a payment integration started silently failing for 0.4% of charges in production while passing every staging test. Eventually we traced it to a Stripe SDK that had been silently bumped from 12.x to 13.x two deploys earlier, and the new version had stricter idempotency-key validation. The startup log of the production process said nothing because we hadn't been logging the version. We had to git blame our way to it.

After that incident I added the dependency-version dump to every service I've shipped. It costs nothing. The log line is generated once per process start. The version string is whatever the package manager already knows. There is no runtime overhead.

What it gives you, the first time something behaves differently between two environments, is the ability to do a literal text diff of the two startup logs and read your own answer back.

Concretely

In a Node service it's something like:

import pkg from "./package.json" with { type: "json" };

log.info({
  event: "application_startup",
  service_version: pkg.version,
  node_version: process.version,
  platform: process.platform,
  arch: process.arch,
  // include whatever else: ENV, REGION, etc.
});

In Python, walk pkg_resources.working_set once at boot. In Go, runtime/debug.ReadBuildInfo(). In every language there is a one-liner that gives you a snapshot of the universe the process woke up in.

If you take exactly one piece of advice from any blog post you read this week, take this one.

The runbook step I always add: "what does normal look like right now?"

Matthew — Thu, 07 May 2026 14:16:19 +0000

Most runbook steps are about what to do when something is broken. The most useful one I've added in years is about what to do before anything is broken: a step that captures what the system looks like when it's healthy, in the moment you're reading the runbook.

It sounds trivial. It is not.

The story

Got paged at 3am. Symptoms: queue depth high, some user reports of slow responses. Opened the runbook. The runbook said:

Check if the queue depth is elevated
Check if any consumers are crashlooping
Check error logs for patterns
Restart consumers if needed

Queue depth was at 12,000. Was that elevated? I had no idea. The runbook didn't say. I went looking through dashboards for a baseline and couldn't find a clean one (the time-series only went back two weeks; anything before that had been rolled up). I burned 25 minutes figuring out that 12,000 was actually slightly below normal for that hour of day, and the real problem was a single slow consumer that happened to be consuming the most expensive job type.

The runbook had me looking at the wrong number with no frame of reference.

The step I add now

The first step in every runbook I write or edit:

Before touching anything: capture "normal" in the same dashboard you're about to use. Open the dashboard. Take a screenshot of the last 7 days. Note the band that "healthy" sits in for the metric you care about. Then compare current to that.

The goal is not the screenshot. The goal is forcing the responder to anchor on a baseline before they look at the live number, because the live number has no meaning without one.

Why it works

It's the engineering equivalent of "before you adjust the thermostat, look at what temperature it currently is." Most pages where the responder makes things worse start with the responder treating an in-band number as out-of-band, or vice versa.

A few related habits that came out of this:

The runbook screenshots get committed alongside the runbook. When the dashboard URL eventually rots (and it will), the screenshot of "what 'healthy' looked like in 2024" is the last surviving baseline.
Every alert threshold in the runbook gets its rationale next to it. "Page if queue > 50,000 (rationale: 99th percentile of last 30 days was 38,000; saw a real incident at 47,000 in March)." When I'm woken at 3am the rationale matters more than the threshold.
The dashboard panel order matches the runbook step order. If step 2 says check consumer health, the consumer health panel is the second one on the dashboard, not buried under nine other panels. This sounds obvious. Almost no team I've worked with does it.

What it changed

My mean-time-to-actually-doing-something went down by maybe 60%. The page-but-no-action-needed rate went up (correctly — I was no longer treating in-band numbers as incidents). The number of times I've made a system worse during an on-call dropped to almost zero.

The expensive failure mode of on-call isn't acting too slowly. It's acting on the wrong information. The "what does normal look like right now" step is cheap insurance against that.

The git bisect run habit I should have learned ten years sooner

Matthew — Mon, 04 May 2026 06:18:18 +0000

Last Tuesday I lost about three hours to a regression in our checkout service. The cart total was off by a cent on certain promo combinations, and the only signal was a Slack ping from finance with a screenshot. No stack trace. No exception. Just wrong numbers.

I did what I always do first. I opened the diff for the last deploy, scrolled, squinted, and tried to feel my way to the bug. Forty minutes in, I had a mental shortlist of suspects and zero proof. The code I was sure caused it had been merged a week earlier and was already reverted in a different branch for unrelated reasons. I was, as my old tech lead used to say, debugging the file I wanted the bug to be in.

Then I remembered git bisect run.

I keep relearning this tool. It's been around forever. I've recommended it to juniors more times than I can count. And yet whenever I'm in the thick of a real outage, I forget it exists for the first hour, because my brain wants to read the diff like a detective novel instead of running an experiment.

Here's the workflow that finally stuck for me. I write a tiny script that exits 0 when the bug is absent and non-zero when it's present. For this regression, the script was twelve lines:

#!/usr/bin/env bash
set -e
go build ./cmd/checkout
./checkout-test --promo=BACK2SCHOOL --items=3 > /tmp/out
grep -q '"total":4297' /tmp/out

Then git bisect start HEAD v4.18.2, git bisect run ./check.sh, and walk to the kitchen for coffee. By the time I came back, git had narrowed it to a single commit. Not the one I'd been staring at. Not even close. It was a four-line change to how we round per-line tax that someone had landed quietly two weeks ago, with a benign-looking commit message about "consistency."

What surprised me wasn't the bug. The bug was small. What surprised me was how much faster the machine was at this than I was. I had been pattern-matching on author, recency, and which file looked spicy. Bisect doesn't care about any of that. It just keeps halving the search space until there's one commit left.

A few things I've learned about making this a habit instead of a heroic last resort:

The check script needs to be cheap. If your build is eight minutes, bisect across two hundred commits is a coffee break and a meeting and lunch. Mock the slow stuff. Build a smaller binary. I keep a bisect/ directory in some repos with pre-baked check scripts for known classes of bugs, so when something feels familiar I'm not writing the harness from scratch under pressure.

Skip commits that don't build. git bisect skip is your friend. You'll hit a few broken middle-of-refactor commits and that's fine; bisect handles them.

Don't trust your gut on the "good" commit. I once spent forty minutes bisecting before realizing my "known good" reference also had the bug, just less visibly. Now I always run the check script against the supposedly-good commit first, before bisect start. Two minutes of paranoia saves an hour of confusion.

Bisect across squash-merged PRs and you might land on a commit that touches forty files. That's still useful. The PR title alone usually points you somewhere. I've also started linking PR numbers in commit bodies more aggressively for exactly this reason.

The reason I keep forgetting this tool, I think, is that it feels like cheating. I want to understand the bug by reading the code, because that's the part that feels like real engineering. Bisect skips the understanding and goes straight to the answer. But the answer is what unblocks the team. Understanding can come after, in a calmer hour, with the offending commit already pinned to the wall.

The Tuesday bug ended up being a one-line fix. The git bisect run took eleven minutes including the coffee. The forty minutes I'd spent reading the diff before remembering the tool existed is the part I want back.

Next time I feel the urge to scroll through a deploy diff while a fire is burning, I'm going to write the check script first. Even if I'm sure I know where the bug is. Especially then.

20 years in, the agent stack finally feels like plumbing

Matthew — Fri, 01 May 2026 14:34:31 +0000

Twenty-something years ago I wrote my first SOAP client. The XML was nested four deep, the WSDL was wrong, the vendor's docs lied, and I still got paged at 2am when something a Java app I didn't write decided the namespace had moved. I think about that morning a lot whenever someone tries to sell me a new "agent framework."

A year ago I'd have rolled my eyes at MCP too. Every cycle has Its Protocol. Most don't outlive their conference talk. But MCP crossed 97 million installs in March, the Linux Foundation just took it under open governance, and OpenAI's newest Responses API speaks it natively. Around when that happened I noticed something a little embarrassing. I'd quietly stopped writing one-off integration scripts. They just sort of fell off the to-do list.

So what's it doing right that the last forty things didn't?

It's boring. That's the whole answer.

There's no clever DSL. No SDK lock-in. No "first-class agent abstractions" that turn out to mean three layers of decorator soup. A server exposes some tools, an agent calls them, the transport doesn't care which model you're paying this month. I run a Postgres MCP server, a Linear one, and a Playwright one. They have absolutely no idea about each other and they don't need to. It's the same dumb composability we got from pipes and from HTTP.

The interesting bit isn't the install count. It's that OpenAI ships compatibility, Google ships compatibility, and now LF owns the spec. We're past the line where any one company can yank it. That is rare. The last protocol I can remember crossing it was OAuth 2.0, and that took years longer than people remember.

A few things I'm still not sold on:

Auth gets reinvented per server and most of them get it wrong
Discovery is hand-wavy. We're going to need something better than "paste this URL"
Tool permissioning is closer to "the user clicks yes" than to anything you'd accept in production

None of that kills it. It just means the next two years probably look like 2010-era REST. Everyone agrees on the wire format and we slowly figure out the tooling around it. CORS will get reinvented. Someone will write a bad spec for tool versioning. We'll all live.

If you've been on the fence, this is a good week to ship a small server. Pick one annoying internal tool you wish your editor knew about. Wrap it. Plug it in. The leverage is genuinely silly once you stop fighting the model into pretending to know things it never could.

Curious what other graybeards on here are using MCP for day to day. The thing I'd most like to see: a battle-tested auth pattern that isn't "static bearer token in env."