I once lost an entire afternoon to fixing a race condition that wasn’t there.
The symptom looked like a race condition: two requests landing in the wrong order, the second one quietly overwriting the first. I read the code, formed a theory, threw in a mutex, ran it locally, watched the test pass. Shipped it. The same bug showed up in production the next day, on a different code path, in a way the mutex couldn’t possibly fix.
I had been so confident I was looking at a race that I never tested the theory. I jumped to a conclusion, picked a fix that matched the conclusion, confirmed the conclusion by running the fix, and called it done. That entire afternoon was a four-step ritual with no actual investigation in the middle.
That afternoon is the difference in miniature. It was not engineering. It was development.
What we call development
Most software work in the industry is development, and most of the time we call it engineering. The two are not the same. Development is the practice of moving a system from “doesn’t work” to “works enough to ship.” It’s debugging by intuition, fixing by pattern-matching, writing code that satisfies a Jira ticket, then moving on.
It is not bad work. Most teams need it. Most teams only do it. But it has a specific shape.
Development is reactive. A ticket comes in, you read the description, you form a guess about what’s wrong, you change some code, you check whether the symptom went away. If it did, you ship. If it didn’t, you guess again. The loop runs on intuition and confirmation bias. The unit of progress is “feature looks right” or “bug appears fixed,” and both are judged by eyeball.
Development optimizes for what’s visible: tickets closed, PRs merged, features demoed, business value reported. The work is cosmetic in the literal sense. It is measured by appearance.
This isn’t a slight on developers. Pure development is fine for problems that are well-understood and low-stakes. The trouble is what happens when you apply development thinking to problems that aren’t well-understood, and most production software is exactly that.
What engineering actually means
Engineering applies the scientific method. That’s the whole definition. Anything else is careful development with a better job title.
The scientific method has four parts: a hypothesis you can disprove, an experiment that tests the hypothesis, an observation of the result, and an iteration based on what you learned. Every engineering discipline that has ever earned the name (civil, electrical, mechanical, aerospace) runs on this loop. We borrowed the word engineer from those disciplines. We should at least borrow the method.
What does this look like in code? Take the bug from the opening. Engineering would have started with “I think two requests are racing, here’s how I’d prove it.” Maybe I add timestamps and log them. Maybe I write a test that puts two requests in flight at the same instant and watches for the failure. Maybe I check whether the symptom can show up without a race, which would falsify my theory immediately. Only after the hypothesis survives an actual experiment do I write the fix.
This is slower than the development loop. By a lot, the first few times. It is also the thing that stops you from shipping confident wrong answers.
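Here is a minimal sketch of the "two requests in flight at the same instant" experiment, in Python. Everything in it is an assumption made for illustration, not the code from the incident: the `Counter` class stands in for whatever record the two requests were updating, and the `between_read_and_write` hook exists only so the experiment can force both reads to happen before either write.

```python
import threading


class Counter:
    """Hypothetical stand-in for the shared record both requests update."""

    def __init__(self):
        self.value = 0

    def increment(self, between_read_and_write=lambda: None):
        current = self.value          # read
        between_read_and_write()      # hook the experiment uses to force an interleaving
        self.value = current + 1      # write


def run_overlapping(counter):
    """Experiment: both requests read before either one writes."""
    barrier = threading.Barrier(2, timeout=5)
    threads = [
        threading.Thread(target=counter.increment, args=(barrier.wait,))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


def run_sequential(counter):
    """Control: the same two requests with no overlap at all."""
    counter.increment()
    counter.increment()
    return counter.value


overlapping = run_overlapping(Counter())
sequential = run_sequential(Counter())
print(overlapping, sequential)  # 1 2
```

In a system shaped like this one, the overlapping run loses an update and the sequential control does not, which is what the race theory predicts. If the symptom shows up in the control as well, the race theory is falsified and a mutex was never going to help.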
Engineering measures progress in evidence, not in tickets. The closed ticket is a side effect. The thing you actually delivered is a body of small, reproducible facts about how the system behaves: facts the next person on the team can stand on instead of re-discovering.
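Put the two loops side by side and the difference is hard to miss. Development: guess, change the code, check whether the symptom went away, ship. Engineering: hypothesis, experiment, observation, iteration.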
The development loop only asks whether the symptom went away. The engineering loop asks whether the hypothesis survived contact with reality.
How the words got tangled
For most of my career, “software engineer” and “software developer” have been used interchangeably. Job titles, conference badges, LinkedIn headlines. It doesn’t matter much in conversation, but it has done damage to the discipline. Once two words mean the same thing, the harder of the two stops being a goal anyone aims for.
In the 2000s, the craftsmanship movement tried to push back. The Manifesto for Software Craftsmanship, published in 2009, talked about “well-crafted software,” “steadily adding value,” “a community of professionals.” All good things. None of them are engineering. Craftsmanship is about pride and skill. Engineering is about method.
A medieval blacksmith was a craftsman. A modern metallurgist is an engineer. The blacksmith makes a better sword by working at it for thirty years. The metallurgist makes a better alloy by running experiments and writing down what happened. Both deserve respect. They are not the same job.
Craftsmanship as a movement crowded out the conversation we should have been having about method. It said: care more, write cleaner code, take pride. That is fine advice and exactly what a tradesperson says. It is not what an engineer says. An engineer says: how would you know if you were wrong?
That question is the whole game.
Dave Farley and the scientific method
The clearest articulation of this I’ve read is Dave Farley’s Modern Software Engineering. Farley argues that software has spent decades calling itself engineering without doing engineering, and the field reinvents the same mistakes every five years as a result.
His definition is straightforward. Engineering is the application of empirical, scientific thinking to solving practical problems. Empirical means based on evidence. Scientific means hypothesis-driven. Practical means it has to work in the real world, not just on a whiteboard.
Farley breaks the practice into two skills. The first is managing complexity: modularity, separation of concerns, abstraction, things most working programmers can name. The second, which gets less attention, is optimizing for learning. Building systems and processes so that you find out you are wrong as quickly as possible.
That second skill is where development and engineering separate. Most developers I have worked with are good at managing complexity within a single feature. Almost none of them are deliberate about how their work generates feedback. Their CI doesn’t tell them what changed. Their tests don’t pin down the behavior they care about. Their deploys don’t reveal which release broke things. They are doing careful work in a system that is structurally bad at learning.
When Farley talks about continuous delivery, he is not talking about a deployment pipeline. He is talking about a feedback loop tight enough that you find out you were wrong before you have forgotten what you were trying to do. The pipeline is a means. The end is fast, honest learning.
If you take one idea from his book, take that one. The job of an engineer is to set up a system in which mistakes become visible quickly, cheaply, and unambiguously. Everything else follows from there.
Agile without data is just guessing
Agile falls into the same trap. The original Agile Manifesto valued working software over comprehensive documentation, responding to change over following a plan. Nothing in there says anything against engineering. In practice though, most of what gets called agile in the industry is a planning ritual with no measurement attached.
You estimate the work in story points. You commit to a sprint. You run a standup. You demo something on Friday. You retrospect, which usually means three people complain about meetings and one person promises to add more tests. None of this is engineering. None of it is even particularly empirical.
Story points are a guess about how long something will take. The sprint commitment is a guess about how much of that guessable work fits in two weeks. The retro is a guess about why the previous sprint went the way it did. Layers of guesses stacked on each other, decorated with the language of method.
You can do agile in an engineering way. You can also do it in a development way. The difference is whether you measure anything. If you track cycle time, defect rate, and percentage of work pulled from the top of the backlog versus interrupts, you have data. You can form hypotheses about why the data looks the way it does, change one thing, and watch the data move. That is engineering inside an agile shell.
If you don’t, you have a calendar with rituals on it. The calendar might make people feel productive. It is not generating knowledge. It is not making the team’s claims about itself falsifiable. Without falsifiability you cannot improve. You can only churn.
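To make "track cycle time and watch the data move" concrete, here is a small sketch in Python. It assumes you can export a start and finish timestamp per ticket from your tracker. The dates below are made-up sample data, and `summarize` is a hypothetical helper, not part of any tool.

```python
from datetime import datetime
from statistics import median


def cycle_times_days(tickets):
    """Days from work started to work finished, one value per ticket."""
    return [(done - started).total_seconds() / 86400 for started, done in tickets]


def summarize(tickets):
    """Summary numbers that are worth comparing across sprints."""
    days = sorted(cycle_times_days(tickets))
    return {"count": len(days), "median_days": median(days), "max_days": days[-1]}


# Made-up sample data: (started, finished) pairs before and after one process change.
before = [
    (datetime(2024, 3, 1), datetime(2024, 3, 5)),
    (datetime(2024, 3, 2), datetime(2024, 3, 9)),
    (datetime(2024, 3, 3), datetime(2024, 3, 6)),
]
after = [
    (datetime(2024, 4, 1), datetime(2024, 4, 3)),
    (datetime(2024, 4, 2), datetime(2024, 4, 5)),
    (datetime(2024, 4, 4), datetime(2024, 4, 5)),
]

print(summarize(before))  # {'count': 3, 'median_days': 4.0, 'max_days': 7.0}
print(summarize(after))   # {'count': 3, 'median_days': 2.0, 'max_days': 3.0}
```

Three tickets per side prove nothing, of course. The point is the shape: a number measured the same way before and after a single change, so that a claim like "the change helped" has a way to be wrong.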
The phrase I come back to: if your decisions are made on intuition and your outcomes are judged by feeling, you are not engineering. You might be doing it carefully, but you are guessing.
Verifiable, falsifiable, or it isn’t engineering
Here’s the test I apply to my own work, and to architectures I’m asked to review. Can a claim in this system be proven wrong?
“This service handles 1,000 requests per second.” Verifiable. Measure it.
“This refactor improved performance.” Falsifiable. Compare before and after.
“This pattern makes the code cleaner.” Not falsifiable. Cleaner has no measurement. The claim is aesthetic. It might be true; it cannot be tested.
“Our team is delivering faster after the reorg.” Falsifiable if you measured cycle time before and after. Untestable if you didn’t.
The pattern is not subtle. Engineering claims are claims about the world that the world can refute. If a claim has no condition under which it would be wrong, it isn’t an engineering claim. It is a preference, an opinion, or a vibe. There is nothing wrong with preferences, but treating preferences as engineering decisions is how teams end up with religious wars over things a benchmark could have settled in twenty minutes.
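As a sketch of what "compare before and after" can look like, here is a tiny benchmark in Python using the standard `timeit` module. The two lookup functions are a hypothetical refactor invented for the example (a list scan replaced by a set lookup), and the `min_speedup` threshold is an assumption you would pick yourself before running anything.

```python
import timeit


def best_seconds(fn, number=200, repeat=5):
    """Best-of-`repeat` wall time for `number` calls of `fn`."""
    return min(timeit.repeat(fn, number=number, repeat=repeat))


def evaluate_claim(before, after, min_speedup=1.0):
    """Measure both versions and say whether the speedup claim survived."""
    t_before = best_seconds(before)
    t_after = best_seconds(after)
    speedup = t_before / t_after
    return {
        "before_s": t_before,
        "after_s": t_after,
        "speedup": speedup,
        "claim_survives": speedup >= min_speedup,
    }


# Hypothetical refactor: membership check on a list replaced by a set.
ids_list = list(range(5000))
ids_set = set(ids_list)


def lookup_before():
    return 4999 in ids_list


def lookup_after():
    return 4999 in ids_set


# The claim, written down before measuring: "the refactor is at least 2x faster."
result = evaluate_claim(lookup_before, lookup_after, min_speedup=2.0)
print(result)
```

The numbers will differ from machine to machine, so none are shown here. What matters is that the threshold was stated first and the measurement is allowed to contradict it.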
Building the discipline to ask “how would I know if I were wrong” before reaching for the keyboard is the single highest-leverage habit I have ever picked up. It rules out about eighty percent of the things I used to do, including most of my best-loved arguments. The remaining twenty percent gets better, faster.
When I started treating my work like experiments
For the first eight years of my career I thought I was an engineer because my job title said so. I had techniques I trusted. I had taste. I had opinions. I shipped a lot of code I believed in.
Then, around year nine, two things happened in close succession. I read Farley’s book. And I broke production in a way I couldn’t immediately explain.
The diagnosis had been my best guess. I had been confident. The fix had passed local testing. None of that mattered, because the production traffic pattern was nothing like my local one, and I had never tested the assumption that they were similar. The thing I was sure of had never been examined.
After that I changed the way I work. Every non-trivial change starts with a written claim: “I think the system is doing X for reason Y, and if I do Z, the behavior should be A.” Then I figure out the smallest test that would distinguish the world where the claim is right from the world where it isn’t. The test gets written first. The fix gets written second. If the test surprises me, I rewrite the claim, not the test.
This is slow at first. Painfully slow. The instinct to fix it now is strong, and the experiment-first habit fights that instinct every time. Within a few months the loop got faster than my old approach, because I stopped chasing wrong theories for three days at a time. The first cost was steep. The amortized cost is lower than the way I used to work.
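A small sketch of the claim-then-test habit, in Python. The scenario is hypothetical and chosen to echo the opening story. The written claim is: "I think stale writes win because updates are applied in arrival order, and if I apply them by version instead, the newest value should survive." The names `Update`, `apply_in_arrival_order`, and `apply_by_version` are invented for the example.

```python
from typing import NamedTuple


class Update(NamedTuple):
    version: int
    value: str


def apply_in_arrival_order(updates):
    """Current behavior (hypothetical): whichever update arrives last wins."""
    state = None
    for update in updates:
        state = update
    return state.value


def apply_by_version(updates):
    """Proposed fix (hypothetical): the highest version wins."""
    return max(updates, key=lambda u: u.version).value


def newest_value_survives(apply):
    """Written first: the newer update arrives BEFORE the older one."""
    arrivals = [Update(2, "new"), Update(1, "old")]
    return apply(arrivals) == "new"


def in_order_control(apply):
    """Control: updates arrive in version order, so both versions should agree."""
    arrivals = [Update(1, "old"), Update(2, "new")]
    return apply(arrivals) == "new"


print(newest_value_survives(apply_in_arrival_order))  # False
print(newest_value_survives(apply_by_version))        # True
print(in_order_control(apply_in_arrival_order))       # True
```

The test fails against the current behavior and passes against the fix, and the control shows the failure depends on arrival order rather than on something else. Had the first line printed `True`, the claim would be the thing to rewrite.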
The other thing I noticed was that this discipline made my code better in ways I hadn’t expected. When a change has to start with a falsifiable claim, you naturally build hooks for observing it. You add logging. You write tests that pin down behavior rather than coverage. You design interfaces that can be exercised in isolation. The code becomes more legible because you needed it to be legible to run your experiments. Engineering practice produces better artifacts as a side effect.
Everything changed for me after that. Not in a self-help way. In an “I stopped being wrong as often, and when I was wrong, I found out faster” way.
Engineering principles in an agentic workflow
The shift worth flagging is that this discipline matters more, not less, in an agentic coding setup. The agent will happily write a hundred lines of plausible code in response to a vague request. That code will look right. It might even pass the tests that already exist. None of that means the change does what you intended.
The way I run agentic work now: every task starts with a written spec. What is the change. What is the success criterion. What test would fail if the change were wrong. The agent reads the spec, writes the change, and runs the test. If the test passes, the change ships. If it fails, the agent iterates, but the spec is the fixed point.
The agent iterates against the spec, not the other way around. The spec is human-authored and human-checked; the agent is the substrate the change runs on, not the source of truth.
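The shape of that loop can be sketched in a few lines of Python. This is not my actual harness. `Spec`, `iterate_against_spec`, and `fake_agent` are hypothetical names, and the "agent" here is a lookup table of canned attempts so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Candidate = Callable[[str], str]


@dataclass(frozen=True)  # frozen: the loop cannot edit the spec to get a pass
class Spec:
    change: str
    success_criterion: str
    check: Callable[[Candidate], bool]  # the test that fails if the change is wrong


def iterate_against_spec(
    spec: Spec, propose: Callable[[int], Candidate], max_attempts: int = 3
) -> Optional[int]:
    """Return the attempt number that satisfied the spec, or None if none did."""
    for attempt in range(1, max_attempts + 1):
        candidate = propose(attempt)
        if spec.check(candidate):
            return attempt
    return None


spec = Spec(
    change="Add a slugify helper for post titles",
    success_criterion="'Hello World' becomes 'hello-world'",
    check=lambda slugify: slugify("Hello World") == "hello-world",
)

# Stand-in for the agent: canned attempts, the first one plausible but wrong.
attempts = {
    1: lambda s: s.lower(),
    2: lambda s: s.lower().replace(" ", "-"),
}


def fake_agent(attempt: int) -> Candidate:
    return attempts[attempt]


passed_on = iterate_against_spec(spec, fake_agent, max_attempts=2)
print(passed_on)  # 2
```

The first attempt looks reasonable and fails the check. The loop's only options are to try again or give up, because changing the spec is not on the menu.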
This sits in my project harness: the configuration files, prompts, and tooling that surround the agent. The harness is where the discipline lives. The agent isn’t going to remember to be empirical for me. It will do whatever the surrounding system makes easy. If the surrounding system rewards “make the test pass and exit,” that is what you get. If the surrounding system rewards “state your hypothesis, run the experiment, report the result,” you get that instead.
My harness includes a few things that exist specifically to enforce the loop. A CLAUDE.md that names the principles. A check skill that runs every time the agent claims to be done. A verify skill that exercises the change at the boundary the user actually touches. The agent is held to the same standard I hold myself to: how do you know this is right? Show me the evidence.
The result is that agentic work produces engineering output, not development output. The agent isn’t smarter than it was; the system around the agent is structurally biased toward the scientific method. That is the same trick that works for human teams. Tools and process shape behavior. If you want engineering, the system has to reward it.
Why agents reward explicit expectations
There is a popular phrase right now: vibe coding. Open the editor, describe the vibe, let the model generate, accept what looks right, move on. It works for prototypes. It does not work for production, for the same reason intuition-driven development never worked. The model is guessing. You are guessing. Nobody is checking.
The cure for vibe coding is the same as the cure for intuition-driven development: write the claim down before the code. State what the change should do. State what would falsify the claim that the change is correct. Then write the code. In my experience the model performs noticeably better when the spec is explicit, because the spec carries the constraints the model otherwise has to invent. You shrink the space of plausible-but-wrong answers.
In my own work, the quality of agent output has tracked, almost one-to-one, the precision of the prompt and the strength of the verifier. Vague prompt with no test: garbage in, garbage out. Precise prompt with a real test: the agent ships work I would have been proud to ship myself, in a fraction of the time.
The principle is older than the agent. It was true when the only programmer in the room was me. The agent makes the principle louder, because the agent will not push back on a vague claim the way a thoughtful coworker would. It will dutifully produce a plausible answer to a question you asked badly. The discipline of asking the question well moves from optional to mandatory.
This is why I think agentic coding rewards engineers and punishes developers. The development habit of “just get it working” hands too much trust to the model. The engineering habit of “state the claim, write the test, observe the result” treats the model as one more component in a feedback loop. The loop is what is doing the work. The model is a node in it.
Five engineering moves you can make this week
If any of this resonates and you want to try it, here are five moves to start with. Small enough to fit in a normal week, concrete enough to do.
Pick one bug this week and reproduce it before you fix it. Not “I think I see why this is broken.” A failing test, or a minimal script, or a curl command that produces the bug on demand. If you can’t reproduce it, you can’t claim to have fixed it. This is the cheapest engineering habit to build.
On your next non-trivial change, write the success criterion before the code. One sentence. “When this is done, X will happen instead of Y, and I will know because Z.” Put it at the top of the PR description. You will catch yourself, sometimes, unable to write the sentence. That is the signal: the change isn’t well-defined yet, and writing code now would be guessing.
Pick one team metric to measure honestly for two sprints. Cycle time is the easiest place to start. Don’t try to improve it yet. Just measure it. The act of measuring will tell you things about your process that no retro ever will.
If you use an agent, put your principles in writing where the agent can read them. A CLAUDE.md, an AGENTS.md, a project README, whatever your tooling supports. Name the loop you want. State that work isn’t done until the success criterion is verified. The agent will follow the system you build for it. Build a system that biases toward evidence.
Read Modern Software Engineering by Dave Farley. It is the clearest treatment of this material in print. You don’t have to agree with every chapter. You do have to grapple with the central argument. If you finish it and still think software engineering means what most teams currently call engineering, fine. You will have taken a position instead of defaulting to ignorance.
If you are going to call yourself a software engineer, do the thing the word means. Form claims. Test them. Be willing to be wrong, and set up your work so you find out fast when you are. Everything else comes downstream of that one habit.

