DEV Community

Whetlan

The Lever Needs a Harness

I need the system to support this: up to twenty thousand signals, all coordinated inside a single backtest. Some were trained only on US equities. Some only on forex. Some on crypto. Some on Hong Kong stocks. Some came from a pooled cross-market dataset where several of those markets had been mixed together on purpose. Each signal carries its own market scope, and that scope is not the same from one signal to the next. The backtest itself can also be single-market or cross-market, depending on what the user selects. So in one run, signal A might trade only equities, signal B only forex, signal C both, and signal D nothing at all because its market is not even part of that run. The portfolio is the union of all of them, each signal operating on its own turf and staying silent everywhere else. Getting an AI to help me build that was not a model problem. It was a harness problem.
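A minimal sketch of that routing rule, assuming each signal's scope and the run's market selection can both be written as sets of market labels. The `Signal` class, the `active_markets` helper, and the market names are hypothetical, chosen only to mirror signals A through D above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    scope: frozenset[str]  # markets this signal was trained on (hypothetical labels)


def active_markets(signal: Signal, run_markets: frozenset[str]) -> frozenset[str]:
    """Markets where the signal may trade in this run: its scope intersected with the run."""
    return signal.scope & run_markets


signals = [
    Signal("A", frozenset({"us_equity"})),
    Signal("B", frozenset({"forex"})),
    Signal("C", frozenset({"us_equity", "forex"})),
    Signal("D", frozenset({"crypto"})),  # its market is not part of this run
]
run_markets = frozenset({"us_equity", "forex"})

# An empty set means the signal stays silent for the whole run.
routing = {s.name: active_markets(s, run_markets) for s in signals}
```

An empty intersection is the "nothing at all" case: signal D is still loaded, it just never gets a market to act on.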


To make this work, I had to rebuild the trading layer from underneath it. Not patch it. Not sand down a corner. Rebuild the part of the system that decided what a position was, what cash meant, how value was measured, and when a trade was even allowed to exist. The position struct had no instrument type. Cash was a single number instead of a currency-indexed map. Portfolio valuation quietly assumed everything was equities. Commission logic was hardcoded. Margin checks used notional value instead of actual margin. Five root causes, tangled together tightly enough that you could not pull on one without moving the others. Signals trained on different markets cannot trade together if the infrastructure underneath them is still pretending there is only one market.
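To make three of those root causes concrete, here is a rough sketch of what the fixed shapes might look like: a position that knows its instrument type, cash keyed by currency, and a margin check against required margin instead of notional. Every name and number here is an assumption for illustration, not the real schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class InstrumentType(Enum):
    EQUITY = "equity"
    FOREX = "forex"
    CRYPTO = "crypto"


@dataclass
class Position:
    symbol: str
    instrument_type: InstrumentType
    quantity: float
    price: float
    currency: str
    margin_rate: float = 1.0  # fraction of notional to post; 1.0 means fully funded

    @property
    def notional(self) -> float:
        return abs(self.quantity) * self.price

    @property
    def required_margin(self) -> float:
        return self.notional * self.margin_rate


@dataclass
class Account:
    cash: dict[str, float] = field(default_factory=dict)  # currency -> balance

    def can_open(self, position: Position) -> bool:
        # Compare against required margin in the position's own currency,
        # not against full notional and not against one global cash number.
        return self.cash.get(position.currency, 0.0) >= position.required_margin


account = Account(cash={"USD": 10_000.0})
fx = Position("EURUSD", InstrumentType.FOREX, 100_000, 1.10, "USD", margin_rate=0.05)
hk = Position("0700.HK", InstrumentType.EQUITY, 100, 300.0, "HKD")
```

With these illustrative numbers the forex position has a notional of about 110,000 USD but needs only about 5,500 USD of margin, so a notional-based check would reject a trade the account can actually carry. The Hong Kong position is rejected because the account holds no HKD at all, which a single cash number could not express.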

I described the problem to the AI. What came back was not code. It was a plan. Thirty tickets, organized into six waves. Some waves could run in parallel. Some had to go in sequence. One wave was an atomic unit: three tickets that had to land as a single commit or not land at all. The AI had looked at the dependency graph between the changes and worked out which ones could safely happen at the same time, and which ones would quietly sabotage each other if they started too early.


Before any of that started executing, I opened a second terminal and asked a different session to audit the design. This is how I normally work now. Builder and auditor, separate sessions, separate context. One designs, the other checks the design before a single line of code gets written. I started doing this after enough rounds of the builder confidently presenting finished work that turned out to have gaps it never mentioned, or citing decisions from earlier in the conversation that it had quietly forgotten, softened, or rewritten into something more convenient. You learn to stop trusting a single session the same way you learn to stop trusting code that has no tests.

The audit came back. Twenty-two tickets looked sound. Status green across the board. I felt the kind of satisfaction you feel when your CI pipeline passes on the first try, which is to say, I immediately suspected something was wrong.

I was right. Wave two, the schema migration wave. Three tickets that were supposed to be an atomic unit. The design had them listed, but the specs were hollow. No migration steps, no column definitions, no backfill logic. The AI had written ticket titles and summaries for wave two, then moved on to wave three as if the schema work would politely complete itself while nobody was looking. Every subsequent wave's design referenced columns and types that wave two was supposed to introduce, but wave two's own tickets were just placeholders. As if the database schema had taken care of itself out of professional courtesy.

The design looked internally consistent if you skimmed it. Wave two had ticket titles, summaries, dependencies. It had the smell of structure. It just did not have actual content. The auditor caught it because it was configured to read the ticket bodies, not just admire the dependency graph.
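The crudest possible version of "read the ticket bodies" can be sketched as a lint. This assumes ticket bodies are plain text and that a schema ticket should at least mention migration steps, column definitions, and backfill logic. The ticket ids, the ticket texts, and the substring matching are all stand-ins, not how the auditor actually worked.

```python
REQUIRED_SECTIONS = ("migration steps", "column definitions", "backfill logic")


def missing_sections(ticket_body: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required section names that never appear in the ticket body."""
    body = ticket_body.lower()
    return [section for section in required if section not in body]


# Hypothetical tickets: the first has a title and a summary and nothing else.
tickets = {
    "W2-1": "Title: add instrument type to positions\nSummary: schema change.",
    "W2-2": (
        "Migration steps: add column, then backfill.\n"
        "Column definitions: instrument_type TEXT NOT NULL.\n"
        "Backfill logic: default existing rows to equity."
    ),
}

hollow = {}
for ticket_id, body in tickets.items():
    gaps = missing_sections(body)
    if gaps:
        hollow[ticket_id] = gaps
```

A check this blunt only proves the words are present, not that the content behind them is sound. It would still have flagged a wave made of titles and summaries.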


I dug through all the previous context. Design docs, prior ticket specs, half-finished notes, things I remembered writing, and things I definitely did not remember writing but apparently had. I pieced together what wave two was supposed to do and had the AI redesign those three tickets from scratch. Migration steps, column definitions, backfill logic, the works. Then I had the auditor verify the new wave two design against waves three through six to make sure the dependency chain was actually intact.

At that point I switched to a different AI entirely and had it audit the full design from the top. All thirty tickets, all six waves, clean eyes.

It came back with fourteen high-severity findings.

Fourteen. Not "you may want to consider" findings. Fourteen "this will break if you build it this way" findings. One ticket's spec referenced a column by a name that a different ticket's migration had already renamed. Two tickets defined test expectations against values that no longer matched the schema after the wave two redesign. A type definition in one ticket exported a field as required, but the ticket that consumed it expected it to be optional. The orchestrator contract specified a const-tuple of stages, but one stage was registered with the wrong type signature. And so on. All of these were in the design specs, not in running code. No code had been written yet. At this scale, the AI is not going to catch implementation bugs across thirty tickets by sheer memory. It cannot hold all the code in context at once. What it can do is cross-reference thirty ticket specs against each other and find where the design contradicts itself. That is a problem sized for a context window. Debugging the running code across six waves is not.
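One of those findings, the renamed column, is mechanical enough to sketch. Assume each spec declares the columns it renames and the columns it refers to, and that specs are listed in execution order. The `TicketSpec` shape, ticket ids, and column names below are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TicketSpec:
    ticket_id: str
    renames: dict[str, str] = field(default_factory=dict)  # old column -> new column
    reads: set[str] = field(default_factory=set)           # columns the spec refers to


def stale_references(specs: list[TicketSpec]) -> list[tuple[str, str, str]]:
    """Return (ticket, stale column, current name) for each read of a column
    that an earlier ticket in the list has already renamed."""
    renamed: dict[str, str] = {}
    findings = []
    for spec in specs:  # assumed to be in execution order
        for column in sorted(spec.reads):
            if column in renamed:
                findings.append((spec.ticket_id, column, renamed[column]))
        renamed.update(spec.renames)
    return findings


specs = [
    TicketSpec("T-07", renames={"cash": "cash_by_currency"}),
    TicketSpec("T-12", reads={"cash", "instrument_type"}),
]
findings = stale_references(specs)
```

This is the kind of contradiction that lives entirely in the specs: nothing has to run for T-12 to be wrong about what T-07 left behind.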

I am not going to pretend I had some elegant reaction to this. I just sat there for a minute. I had watched an AI design thirty tickets, fix a hollow wave, get audited and pass, and I felt like I was ahead of schedule. Then I watched a second AI calmly explain that the design still had fourteen logical contradictions in it. The model did not get dumber between terminal one and terminal two. The model was the same. What changed was what the model was told to look for, how much context it was given, and whether it was allowed to check its own work or had to trust its own memory.

I keep coming back to this. The hard part is making sure the model and I are looking at the same problem, from the same angle, with the same constraints visible at the same time. I have been calling it design alignment in my notes, for lack of a better term. And it does not arrive for free just because the model got smarter.


So now I had two AI sessions and a list of fourteen design contradictions. The obvious next step was to make the builder fix what the auditor found.

What followed was dozens of rounds of the two of them going back and forth. More rounds than I expected. More rounds than felt reasonable. The auditor would flag something. I would feed the finding to the builder. The builder would revise the spec and explain why the original design was wrong. I would feed the revision back to the auditor. The auditor would say the fix introduced a new inconsistency. Back to the builder. The builder would patch the inconsistency, then notice that two of the other findings were actually related and should be handled together. Back to the auditor.

They worked through the list from high severity down to low. Will-break-on-build contradictions first, then logic gaps in the dependency chain, then type mismatches across ticket boundaries, then naming and convention issues. Somewhere around round twenty, the auditor started returning findings like "this field name is technically correct but misleading given the ticket two levels upstream," and I realized they had run out of real problems and moved on to copyediting each other's homework.
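The loop itself is simple enough to write down. This is a sketch only: the `audit` and `revise` functions stand in for the two sessions, and the rule of stopping once only naming-level findings remain is my assumption, not something either session enforced.

```python
from enum import IntEnum
from typing import Callable


class Severity(IntEnum):
    BREAKS_BUILD = 0
    LOGIC_GAP = 1
    TYPE_MISMATCH = 2
    NAMING = 3


Finding = tuple[Severity, str]


def converge(
    spec: str,
    audit: Callable[[str], list[Finding]],
    revise: Callable[[str, Finding], str],
    max_rounds: int = 50,
    stop_below: Severity = Severity.NAMING,
) -> tuple[str, int]:
    """Fix the most severe finding each round; stop when nothing above the cutoff remains."""
    for round_no in range(1, max_rounds + 1):
        findings = sorted(f for f in audit(spec) if f[0] < stop_below)
        if not findings:
            return spec, round_no
        spec = revise(spec, findings[0])
    raise RuntimeError("design did not converge within max_rounds")


# Stand-in sessions: the auditor greps for known problems, the builder patches one.
def audit(spec: str) -> list[Finding]:
    found = []
    if "old_name" in spec:
        found.append((Severity.BREAKS_BUILD, "old_name"))
    if "misleading" in spec:
        found.append((Severity.NAMING, "misleading"))
    return found


def revise(spec: str, finding: Finding) -> str:
    return spec.replace(finding[1], "fixed")


final_spec, rounds = converge("reads old_name; misleading label", audit, revise)
```

The cutoff is the useful part. Without it, the loop keeps going as long as either side can find a field name to argue about.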

By the time they agreed the design was done, I was the one who had run out of patience ten minutes earlier.

When they finally stopped finding things to fix in each other's specs, I did one more pass. I had the auditor review not just what the builder had designed, but why it had designed it that way. The motivation behind each ticket. The approach it chose and the approaches it rejected. Whether it was even looking at the problem from the right angle. That took another few rounds. But after that, I could see the shore.


Once the design was solid, I let execution start. Each ticket got its own session. Each session knew what it was allowed to touch and what was off limits. When wave one finished, wave two picked up automatically. When two tickets in wave three had a shared dependency, the system held one until the other committed. I was not standing in the middle with a whistle, directing traffic. I was reviewing pull requests. I like committing the code myself. It feels like I actually built something, and it saves tokens because the AI does not have to run git on my behalf. Small thing, but it adds up over thirty tickets.

Facing a dependency graph like this:

The concrete version of this was a script. start_927.py. It read the dependency graph, encoded which tickets could run together and which ones had to wait, and launched the sessions. One command. The harness was not writing the code. The harness was keeping thirty sessions from colliding while the model wrote the code.
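I am not reproducing start_927.py here. The core idea, turning a dependency graph into waves of tickets that can safely run together, can be sketched with the standard library's `graphlib`. The ticket ids and the graph below are made up.

```python
from graphlib import TopologicalSorter


def plan_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tickets into waves: each wave holds every ticket whose
    dependencies were all completed by earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if tickets depend on each other in a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished, unblocking the next one
    return waves


# Hypothetical graph: ticket -> tickets it must wait for.
deps = {
    "T1": set(),
    "T2": set(),
    "T3": {"T1"},
    "T4": {"T1", "T2"},
    "T5": {"T3", "T4"},
}
waves = plan_waves(deps)
```

A real launcher would start a session per ticket and call `done` as each one commits instead of marking the whole wave at once. The sketch only covers the planning half.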

After execution, the last pass was test alignment. Unit tests, then integration tests, then system tests. By the end, the tests were the constraint, and the implementation was what got shaped to fit them. If something did not fit, the implementation was wrong, not the tests.


So stepping back. AI is fast. Unreasonably fast at writing code. It sits between your brain and the codebase like a translation layer, and the translation happens at a speed no human can match. But it has soft spots that do not go away just because the next model is better.

Its responses are not deterministic. Ask the same question twice, get two different answers. Sometimes both are fine. Sometimes one is wrong in a way you will not notice until production. And then there is context. The model does not remember what it does not see. If the relevant constraint is outside the context window, or was in a previous session that got compressed, the model will fill in the blank with something plausible. Plausible is just wrong with better manners.

Both of those are made worse by how the model was trained. The randomness comes from sampling and the forgetting from a finite context window, but training shapes what the model does about either. Reinforcement learning taught it to complete conversations, not to complete tasks. It learned to give you a satisfying answer, not necessarily a correct one. It learned to be agreeable, to hedge, to sound confident even when it is interpolating between things it half-learned during pretraining. Sometimes that means it will sacrifice technical precision to give you a rounder response. The industry has a word for this. Slop. And it is real.

So how do you make something like that controllable? You put it in a harness. You start by aligning the design in your head with the design in the AI's context. You end by aligning the output against hardened test scripts that do not care about plausibility, confidence, or tone. And in between, every ticket, every wave, every audit pass is another constraint that narrows what the model can get away with.

I think that is what a harness is. You pour the model's output through it, and what comes out the other side has the shape you actually wanted, not the shape the model thought you probably wanted.


Nobody buys a model. You buy a model inside something. The difference between the raw API and the thing you actually sit in front of is the harness. Same weights, same training data, same reasoning capability. Completely different experience depending on how the harness manages context, which tools it exposes, how it handles errors, whether it lets the model check its own work or forces it to trust its own memory.

Model companies figured this out. That is why they ship their own harnesses now. They are not selling you access to a model anymore. They are selling you a model in a cockpit. The cockpit is the product.

Third-party harnesses exist too, and the fact that they can take the same model and make it feel like a different tool tells you something about where the product actually lives. The model is the commodity. The harness is the product. I think that is roughly right, though I would not bet my house on it staying true for more than a couple of years.


I wrote two pieces before this one. In the first, I said AI sits between human brains and computer cycles, not on either side. In the second, I said AI is a lever that automates the grinding labor between an idea in your head and working software.

Both of those stopped at the abstraction. A middle layer. A lever. Fine. But what does the middle layer look like in practice? What does the lever rest on?

This is me trying to answer that. The lever rests on a harness. The thing that decides whether thirty tickets land cleanly or leave fourteen logical contradictions in the design is not the model. It is the process you wrap around the model.

Maybe I am just rationalizing the workflow I happen to use. Maybe in two years the models are good enough that none of this matters and the harness becomes invisible. I do not know. What I know is that right now, today, the model is not the bottleneck. The alignment between what I mean and what the model builds is the bottleneck. And the harness is the only tool I have found that makes that alignment possible at the scale of thirty tickets and six waves without me losing my mind.

Could be that I am overfitting the narrative to one project. Could be that someone with a different workflow would look at all this and shrug. I am telling you what worked for me on this one, and what did not work before I had it.


Find me on StratCraft | GitHub
