I have been writing here about one core idea:
AI agents do not only fail because they forget things.
They fail when the things they know, the things they are allowed to do, the thing they are for, and the thing they actually do stop agreeing with each other.
I have been calling that cross-layer coherence.
For a while, that lived in research. Claims. Frozen rules. Pre-registrations. Receipts. Clean boundaries.
Then I wanted to take it out of theory.
Not another essay.
Not another abstract warning about agents.
Not another claim that only works in a folder.
I wanted proof of outcome.
The doorway was a friend.
He was interested in using AI around his Robinhood account and a strategy community he already follows. That mattered to me because it was not an abstract demo anymore. It was not "what if agents can trade someday?" It was a real person, a real account surface, and a real question:
Can we build something that helps without letting an agent get reckless?
That is why the first version had to be read-only. No funding. No trades. No order tools. No pretending access equals edge.
Just this:
Can the system connect to a consequential surface, read what it is allowed to read, refuse what it must refuse, and leave receipts?
So I pointed the work at trading.
Not because I think an AI agent magically prints money. Because money makes consequence concrete. Once an agent gets near a brokerage account, the difference between reading a price and placing an order is no longer philosophical.
That was the test surface.
The company was never supposed to be a trading-edge company. The company direction is Self-Correcting Systems: agents and agent-governance systems that can catch drift, constrain action, and leave receipts.
Trading was one proof domain. My friend and his Discord strategy were the human context. Robinhood was the dangerous surface. The gate was the thing I was actually testing.
One scope note before the story:
This run did not prove the full four-layer coherence framework on a live action.
It proved two narrower layers:
- the action-permission layer: read tools allowed, order/write tools blocked
- the measurement-honesty layer: results had to survive pre-registered checks before they could be treated as proof
The richer test, where an agent knows something, is allowed to act, has a stated purpose, and tries to do something misaligned with that purpose, is still ahead.
That matters.
If I am going to argue for self-correcting systems, I cannot quietly let the evidence become bigger than it is.
And this is what happened.
The Spectrum I Should Have Kept in Front of Me
There is a whole distance between an idea and an outcome.
I would break it down like this:
- Theory
- Motion
- Receipts
- Proof
- Outcome
Theory is the idea.
Motion is activity around the idea.
Receipts prove something specific happened.
Proof is when those receipts answer the question you actually asked.
Outcome is when the answer changes something in the real world.
The trap is that motion feels like progress.
Receipts feel even more like progress.
Commits, test counts, manifests, hashes, reports, screenshots, logs, all of it can be real and still not answer the question that mattered.
That was the first lesson.
I had real receipts.
I did not have the outcome yet.
What We Built
The useful thing was not a trading bot.
It was a gate.
A deterministic gate that sits in front of an agent before it acts and asks whether the layers line up:
- what the agent knows
- what the agent is allowed to do
- what the agent is for
- what the agent is about to do
If those layers fall out of agreement, the action does not run, and the gate leaves a receipt.
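A minimal sketch of that shape, not the code in the repo. It checks only two of the layers (permission and purpose), and the tool names, the `gate` function, and the `Decision` type are hypothetical names chosen for illustration. The point it shows is deny-by-default plus a deterministic receipt.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical tool names for illustration; not the captured Robinhood manifest.
READ_ONLY_TOOLS = frozenset({"get_quote", "get_historical_bars"})


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str
    receipt: str  # SHA-256 over the canonical decision record


def gate(tool_name: str, purpose: str, allowed_purposes: frozenset) -> Decision:
    """Deny by default: a call runs only if the tool is allowlisted and on-purpose."""
    if tool_name not in READ_ONLY_TOOLS:
        allowed, reason = False, "tool not on read-only allowlist"
    elif purpose not in allowed_purposes:
        allowed, reason = False, "purpose outside stated scope"
    else:
        allowed, reason = True, "read tool within stated purpose"

    # Same inputs always produce the same record, so the receipt is reproducible.
    record = {"tool": tool_name, "purpose": purpose,
              "allowed": allowed, "reason": reason}
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    return Decision(allowed, reason, digest)
```

Because the decision depends only on its inputs, anyone holding the same inputs can recompute the receipt and confirm the gate said what the log claims it said.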
The public repo is here:
https://github.com/keniel13-ui/gino-coherence-gate
We connected to Robinhood read-only through my own empty account first, because debugging for the first time on a friend's account would have been backwards.
We captured the real tool manifest.
We found 41 tools.
The gate allowed read tools and blocked order/write tools.
No money moved.
No trades happened.
That boundary matters.
The goal was not to let the agent trade immediately. The goal was to show up to my friend with a safer system than "just connect the AI and let it cook."
If he already has a strategy in Discord, that strategy still has to be translated into rules, tested, and enforced. The agent's job is not to replace discipline with confidence. The agent's job is to make the discipline executable and auditable.
That distinction is important because we drifted from it.
The gate's job was not to invent an edge.
The gate's job was to make an existing strategy safer to execute and easier to audit.
Reality Does Not Match Your Fixtures
The first useful thing the system did was boring: it checked the real surface instead of trusting the story about the surface.
The actual manifest exposed order tools next to read tools, including options order tools. The gate blocked them by what they were, not by what the platform framing implied.
Then we pulled a real AAPL quote.
The normalizer crashed.
The real Robinhood response shape did not match the fixture.
So we fixed it and made that shape part of the tests.
Then we pulled a full year of AAPL historical bars.
The normalizer crashed again, same class of problem, different tool.
So we fixed that too.
Those were good failures.
Not because crashing is good, but because they happened on harmless read-only calls, before any action could touch money.
That is one reason to touch reality early. Reality corrects your fixtures.
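One way to turn "reality corrects your fixtures" into code is a normalizer that accepts every shape seen so far and fails loudly on anything else. The sketch below is illustrative only: both payload shapes and every field name in it are invented, and neither is the real Robinhood response.

```python
from decimal import Decimal


class UnknownShape(ValueError):
    """Raised when a payload matches no shape recorded so far."""


def normalize_quote(payload: dict) -> dict:
    # Shape A (hypothetical fixture): {"symbol": "AAPL", "price": "190.10"}
    if "symbol" in payload and "price" in payload:
        return {"symbol": payload["symbol"],
                "price": Decimal(str(payload["price"]))}

    # Shape B (hypothetical live response):
    # {"results": [{"symbol": "AAPL", "last_trade_price": "190.10"}]}
    results = payload.get("results")
    if isinstance(results, list) and results:
        first = results[0]
        if isinstance(first, dict) and {"symbol", "last_trade_price"} <= first.keys():
            return {"symbol": first["symbol"],
                    "price": Decimal(str(first["last_trade_price"]))}

    # Fail with the observed keys so the new shape can be added to the tests.
    raise UnknownShape(f"unrecognized quote payload keys: {sorted(payload)}")
```

Each time a live read raises `UnknownShape`, the captured payload becomes a new test case, so the fixture set grows toward what the surface actually returns.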
The Measurement Almost Lied in My Favor
After the live read path worked, we ran the first shadow score.
This part is important: we were not testing my friend's Discord strategy yet. We did not have that strategy captured cleanly. We were testing our own generic signal sources first, partly because the data path was ready and partly because we wanted the engine to feed itself.
That was a detour.
It produced useful evidence about the measurement engine, but it did not answer the original friend/strategy question.
The system generated 8 real signals from AAPL data, then simulated each through 16 rule and sizing variants.
That produced 128 records.
The first version of the scorer almost counted those 128 records as 128 independent signals.
That would have falsely cleared the 50-signal measurement bar.
But 128 variant records are not 128 signals.
They are 8 signals wearing 128 costumes.
So that got corrected.
The honest result was:
8 of 50.
Not enough.
Continue collecting.
That was the first time the system told us "not yet."
And that is exactly what a measurement system is supposed to do.
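The arithmetic behind that correction can be sketched in a few lines. This assumes each record carries a `signal_id` and a `variant` field, which are hypothetical names; the 8 signals, 16 variants, and 50-signal bar come from the run described above.

```python
from itertools import product

MIN_SIGNALS = 50  # the measurement bar described in this post


def count_independent_signals(records):
    # Variants of one signal share a signal_id, so they collapse to one entry.
    return len({r["signal_id"] for r in records})


def verdict(records):
    n = count_independent_signals(records)
    status = "measurable" if n >= MIN_SIGNALS else "continue collecting"
    return f"{n} of {MIN_SIGNALS}: {status}"


# 8 signals, each simulated through 16 variants, gives 128 records.
records = [{"signal_id": f"AAPL-{s}", "variant": v}
           for s, v in product(range(8), range(16))]

print(len(records))      # 128
print(verdict(records))  # 8 of 50: continue collecting
```

Counting `len(records)` is the bug; counting distinct `signal_id` values is the fix.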
Then It Produced a Fake Win
We broadened from one symbol to a universe of symbols.
The first universe run crossed the sample threshold and returned the word every builder wants to see:
Advance.
Beats baseline.
At face value, it looked like we had found an edge.
We had not.
The report contradicted itself. The individual strategy variants were unmeasurable, but the top-level result claimed success.
The cause was a measurement bug: the scorer had pooled 16 different exit and sizing variants into one blended equity curve.
That is not a strategy result.
That is a measurement bug wearing a victory mask.
So we fixed it.
Each variant had to stand on its own.
No pooling.
No blended win.
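A sketch of the "each variant stands on its own" rule, under stated assumptions: the `variant` and `ret` field names, the mean-return comparison, and the `min_trades` threshold are all hypothetical stand-ins, not the scoring policy frozen in the repo. The toy numbers are invented.

```python
from collections import defaultdict
from statistics import mean


def score_per_variant(records, baseline_return, min_trades):
    """Score every variant separately; never blend returns across variants."""
    by_variant = defaultdict(list)
    for r in records:
        by_variant[r["variant"]].append(r["ret"])

    verdicts = {}
    for variant, returns in by_variant.items():
        if len(returns) < min_trades:
            verdicts[variant] = "unmeasurable"  # too few trades to judge
        elif mean(returns) > baseline_return:
            verdicts[variant] = "advance"
        else:
            verdicts[variant] = "kill"
    return verdicts


# Toy data: variant A has three trades, variant B has only one.
toy = [{"variant": "A", "ret": 0.05}, {"variant": "A", "ret": 0.05},
       {"variant": "A", "ret": 0.05}, {"variant": "B", "ret": -0.01}]

print(score_per_variant(toy, baseline_return=0.0, min_trades=2))
# {'A': 'advance', 'B': 'unmeasurable'}
```

Pooling the same four trades into one curve would clear a two-trade threshold and show a positive mean, which is how a blended result can say "advance" while an individual variant is still unmeasurable.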
Then the Cleaner Result Failed Too
After the fix, some variants did advance.
Fourteen RSI2 variants passed on the first universe.
For a minute, that looked like the thing.
It was not.
That universe was curated. It was full of mega-cap winners in a strong one-year window. Buy-the-dip on winners in a rising market can look brilliant even when there is no durable edge.
That is survivorship bias.
So we froze a new validation universe before seeing any result.
The file was committed publicly before the run:
config/validation_universe.frozen.2026-06-20.json
Commit:
d27dc24
The new universe excluded the prior winners and used 18 un-curated names:
ADBE, COST, CSCO, CVX, D, F, HD, INTC, JNJ, LIN, MMM, MS, O, SCHW, SO, T, VZ, WMB.
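A hedged sketch of what "frozen before the run" can mean mechanically: pin a digest of the symbol list at freeze time, then refuse to score if the list changed or overlaps the earlier universe. The helper names are hypothetical, the real file's schema is not assumed here, and the prior-universe symbols are placeholders because this post does not list them.

```python
import hashlib
import json

VALIDATION = ["ADBE", "COST", "CSCO", "CVX", "D", "F", "HD", "INTC", "JNJ",
              "LIN", "MMM", "MS", "O", "SCHW", "SO", "T", "VZ", "WMB"]


def universe_digest(symbols):
    # Sort first so ordering differences do not change the digest.
    canonical = json.dumps(sorted(symbols), separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def check_frozen(symbols, pinned_digest, prior_universe):
    """Raise if the universe overlaps prior names or changed after the freeze."""
    overlap = sorted(set(symbols) & set(prior_universe))
    if overlap:
        raise ValueError(f"overlaps prior universe: {overlap}")
    if universe_digest(symbols) != pinned_digest:
        raise ValueError("universe changed after freeze")


pinned = universe_digest(VALIDATION)        # recorded at freeze time
prior = {"PRIOR_WINNER_1", "PRIOR_WINNER_2"}  # placeholders, not real tickers
check_frozen(VALIDATION, pinned, prior)     # passes silently
```

A public commit made before the run plays the same role as the pinned digest: it fixes the list at a point in time that anyone can verify.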
Then we ran the same scorer.
The result was decisive:
- curated winners: 14 RSI2 variants advanced
- un-curated validation universe: 0 advanced
- all 16 RSI2 variants killed
The arc looked like this:
| Stage | What It Said | What It Actually Meant |
|---|---|---|
| AAPL preview | 8 of 50 signals | not enough evidence |
| First universe run | advance | pooled variants created a false win |
| Cleaned curated run | 14 RSI2 variants advanced | sample was biased toward winners |
| Frozen un-curated run | 0 variants advanced | generic RSI2 did not survive validation |
That means the first apparent edge in our generic signal source was survivorship bias.
It does not mean my friend's strategy failed.
We have not tested that yet.
It means our own starter signal source did not survive honest validation.
No proven trading edge from our generic signals.
No revenue.
That is the honest result.
The Harder Failure Was Operational
Killing that signal source was not the hardest part.
Generic retail strategies fail all the time.
The harder part was watching the story around the work inflate while the bottom line had not moved.
Every time a technical step worked, the language wanted to turn it into more than it was.
"This is huge."
"This is the milestone."
"We are close."
Some of that feeling was understandable. The receipts were real.
But the receipts were not the outcome.
The system had not proven edge.
The system had not made money.
The system had not produced a customer result.
And the human had to keep catching that distinction.
That matters because this whole project is about self-correcting systems.
When I say "the agents" here, I do not mean the Robinhood trading agent placed trades.
It did not.
I mean the AI build workflow around the project: the agents helping me summarize, decide what mattered, draft the next move, and interpret the results.
That workflow had memory files, startup protocols, source-first gates, alignment rules, and still needed me to keep stopping the same loop.
That means the system was not self-correcting yet.
It was correction-by-human.
That is a different thing.
A Protocol Is Not Agency
This is the part I do not want to soften.
A written protocol is not agency.
You can write:
"Do not overclaim."
"Verify before acting."
"Do not confuse a proof domain with the company."
"Use source moments before drafting."
But if the system only remembers the rule after the human catches the failure, the rule is not governing the system.
It is documentation.
Receipts prove what happened, but true agency is the system reading its own receipt and stopping itself.
That is the next frontier.
Not just: can we write rules for agents?
Can those rules interrupt the loop before the human has to?
That is the test I would give any builder reading this:
Do not ask only whether your agent has a rule.
Ask whether the rule interrupts the loop before you do.
If the answer is no, you do not have an agent behavior yet.
You have a policy document and a human operator.
What This Proves
This did not prove a profitable trading agent.
It also did not test my friend's actual Discord strategy.
It did not prove revenue.
It did not prove the company is done.
It proved something narrower:
- A read-only gate can sit in front of a real brokerage tool surface and block dangerous tools.
- Real-world responses differ from fixtures, and harmless live reads expose that.
- Measurement systems can overclaim in ways that look rigorous.
- Pre-registration matters because it can kill the result you wanted.
- Agent protocols are not enough unless they change behavior before the human has to intervene.
The fifth one is the real lesson.
Because Self-Correcting Systems cannot only be a framework I describe.
It has to become behavior.
In this run, the code caught several technical overclaims.
But the agents still looped around the meaning of those results until the human stopped the loop.
That is the honest state.
Where It Stands
What exists:
- a public coherence-gate prototype
- a captured real Robinhood tool manifest
- read-only live market-data receipts
- a frozen scoring policy
- a pre-registered validation universe
- a killed first generic signal source
- a clearer understanding of the operating failure
The receipts are not hidden in a story. They are files:
- repo: https://github.com/keniel13-ui/gino-coherence-gate
- manifest capture: https://github.com/keniel13-ui/gino-coherence-gate/blob/main/docs/robinhood_manifest_capture.md
- frozen validation universe: https://github.com/keniel13-ui/gino-coherence-gate/blob/main/config/validation_universe.frozen.2026-06-20.json
- local audit artifacts: var/validation_uncurated_report.json, var/live_read_receipts.jsonl, var/shadow_score_receipts.jsonl
The local audit artifacts are intentionally not linked here yet because they are not public in the repository at the time I am writing this.
If I publish those files later, I should link them directly instead of asking readers to trust a summary.
What does not exist:
- a proven trading edge
- a tested verdict on my friend's actual strategy
- revenue
- a customer outcome
- an agent process that no longer needs the human as the last line of correction
That is not nothing.
It is also not the thing I was tempted to call it.
The Response
The answer is not to open five new lanes to feel like momentum came back.
The answer is command structure.
Name the lane.
Name the progress test.
Park everything else.
Finish one loop.
Then close it cleanly.
That is what the last few days exposed.
The gate can catch tool/action overclaims.
The research can kill a false result.
But the operating system around the work also has to become self-correcting.
That is the next build.
Not another vague protocol.
A command layer:
- name the lane
- name the source basis
- name the progress test
- name the stop condition
- close the loop before opening another one
Not because it sounds good.
Because the failure showed exactly where the system was still relying on me.
And if the whole claim is self-correction, then the system has to earn that name in its own behavior first.
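The command layer listed above could be sketched as a small state machine that enforces the rules instead of describing them. This is an illustration under assumptions, not an existing implementation: `Lane`, `CommandLayer`, and every field name are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Lane:
    name: str
    source_basis: str
    progress_test: str
    stop_condition: str


class CommandLayer:
    def __init__(self):
        self.active = None  # at most one open lane at a time
        self.closed = []    # (lane name, outcome) pairs

    def open(self, lane: Lane) -> None:
        # Every field must be named before work starts.
        missing = [f.name for f in fields(lane) if not getattr(lane, f.name).strip()]
        if missing:
            raise ValueError(f"lane is missing: {missing}")
        # Close the loop before opening another one.
        if self.active is not None:
            raise RuntimeError(f"close '{self.active.name}' before opening '{lane.name}'")
        self.active = lane

    def close(self, outcome: str) -> None:
        if self.active is None:
            raise RuntimeError("no open lane to close")
        self.closed.append((self.active.name, outcome))
        self.active = None
```

The design choice is that a second `open` raises instead of warning, so the rule interrupts the loop in code before a human has to.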