My bot logged hundreds of trades it never made — so I built something to check if it was lying

#python #algotrading #webdev #buildinpublic

I have a rule for new strategies: observe before you bet. Before a single dollar (paper or otherwise) moves, the strategy runs in "would-have-traded" mode — every time it thinks it sees an edge, it writes a row to a log instead of placing an order. Decision, timestamp, the side it would have taken, and the edge it believed it had. You let that run, then you go back and check whether the bot was right.

This is the story of going back to check, and finding out the bot was lying to me in two different ways at once.

The setup

The strategy prices short-duration crypto "up or down" binary markets: will the price be higher at the end of the hour than it was at the start? It builds a fair-value probability from a volatility model and compares it to what the market is charging. When the gap clears fees, it logs a decision.
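For a concrete picture of "the gap clears fees," here's a minimal sketch. Everything specific in it is my illustration rather than the bot's actual code: the zero-mean normal model for log returns, the function names (`fair_prob_up`, `edge_after_fees`, `decide`), and the prices and fee are all hypothetical stand-ins.

```python
import math
from statistics import NormalDist


def fair_prob_up(spot, strike, sigma_annual, hours_left):
    """P(close > strike), assuming zero-mean, normally distributed log returns."""
    stdev = sigma_annual * math.sqrt(hours_left / (24 * 365))
    if stdev == 0:
        return 1.0 if spot > strike else 0.0
    return NormalDist().cdf(math.log(spot / strike) / stdev)


def edge_after_fees(fair, ask, fee):
    """Edge per $1 of payout: model fair value minus the ask and the fee."""
    return fair - ask - fee


def decide(fair_yes, ask_yes, ask_no, fee):
    """Return (side, edge) for the better side if it clears fees, else None."""
    candidates = [
        ("buy_yes", edge_after_fees(fair_yes, ask_yes, fee)),
        ("buy_no", edge_after_fees(1.0 - fair_yes, ask_no, fee)),
    ]
    side, edge = max(candidates, key=lambda c: c[1])
    return (side, edge) if edge > 0 else None


# Hypothetical numbers: price sitting exactly at the strike with an hour left.
fair_yes = fair_prob_up(spot=100.0, strike=100.0, sigma_annual=0.60, hours_left=1.0)
print(decide(fair_yes, ask_yes=0.43, ask_no=0.55, fee=0.02))
```

In observe mode, whatever `decide` returns gets written to the log instead of becoming an order.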

After a day of observing, the feed looked busy — lots of green, lots of "+5.2¢ edge" rows. And one number jumped out when I tallied it up: the bot was choosing "NO" over "YES" about 4 to 1.

I immediately had a story. My volatility estimate, sourced from one exchange's recent prints, probably runs a little hot — and an overestimate of volatility makes the unlikely side of a binary look underpriced. So the bot keeps "buying" the cheap tail. Made sense. I was about ten minutes from turning down the volatility input and calling it a fix.

That would have been a mistake. The 4:1 number was a hypothesis built on raw counts, and I hadn't checked a single one of those decisions against what actually happened.

The harness

So I built the thing I should have built first: a script that takes each logged decision, looks up the actual outcome of that market (did it close up or down?), and scores it. Win or loss. Then it aggregates — realized win rate vs. the win rate the model predicted, broken out by side and by confidence bucket.
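The scoring core is small. This is a sketch of the idea, not the actual script: the `Decision` fields, the `outcomes` mapping (market id to "did it close up?"), and the flat per-contract fee are all assumed shapes for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    market_id: str
    side: str           # "buy_yes" or "buy_no"
    price: float        # hypothetical cost of the side, per $1 of payout
    fair: float         # model's probability that the chosen side wins
    stake: float = 1.0  # hypothetical number of contracts


def score(d, closed_up, fee=0.0):
    """Return (won, pnl) for one logged decision given the real outcome."""
    won = closed_up if d.side == "buy_yes" else not closed_up
    payout = 1.0 if won else 0.0
    return won, d.stake * (payout - d.price - fee)


def summarize(decisions, outcomes, fee=0.0):
    """Realized vs. predicted win rate and net PnL over resolved markets only."""
    resolved = [d for d in decisions if d.market_id in outcomes]
    if not resolved:
        return {"n": 0, "wins": 0, "win_rate": None, "predicted": None, "net": 0.0}
    scored = [score(d, outcomes[d.market_id], fee) for d in resolved]
    wins = sum(1 for won, _ in scored if won)
    return {
        "n": len(resolved),
        "wins": wins,
        "win_rate": wins / len(resolved),
        "predicted": sum(d.fair for d in resolved) / len(resolved),
        "net": sum(pnl for _, pnl in scored),
    }


# Made-up decisions and outcomes, purely to show the shape of the output.
decisions = [
    Decision("a", "buy_yes", price=0.40, fair=0.55),
    Decision("b", "buy_no", price=0.30, fair=0.35),
    Decision("c", "buy_yes", price=0.45, fair=0.50),  # not resolved yet, skipped
]
outcomes = {"a": True, "b": True}  # True means the market closed up
print(summarize(decisions, outcomes))
```

Grouping the same summary by `side` or by a bucket of `fair` gives the breakdowns.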

The first run covered 35 resolved decisions. Here's what came back (all paper, all hypothetical — don't @ me about the size):

OVERALL   win 45.7% (16/35)   predicted 49.1%   net -$31.91

Net negative. The strategy I'd been admiring in the feed would have lost money. That alone was worth knowing before risking anything. But the two breakdowns underneath are where it got interesting.

Lie #1: the 4:1 skew was a measurement artifact

I split the decisions by side, deduped to one per opportunity:

buy_no    18
buy_yes   17

Even. Basically a coin flip.

So where did 4:1 come from? The bot re-evaluates every market on every scan, and in observe mode it was logging a decision each time a market still qualified — not once per opportunity. A market that sat in "NO looks cheap" territory for twenty minutes got logged dozens of times; a market that flickered into "YES" for one scan got logged once. The raw feed wasn't measuring my model's bias. It was measuring how long each opportunity lingered.

The "overestimated volatility → buy NO" story was a confident explanation for a number that was pure logging noise. Dedup first, then analyze. I'd skipped the first step and nearly tuned a real model parameter to chase a histogram artifact.
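Here's a toy version of the artifact and the dedup that removes it. I'm assuming an "opportunity" means one (market, side) pair and that the earliest row is the one to keep; the row fields and the sample data are made up for illustration.

```python
from collections import Counter


def dedupe(rows):
    """Keep the earliest logged row per (market_id, side) opportunity."""
    seen = set()
    unique = []
    for row in sorted(rows, key=lambda r: r["ts"]):
        key = (row["market_id"], row["side"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Market A qualifies as "buy_no" on four consecutive scans;
# market B qualifies as "buy_yes" on a single scan.
raw = [
    {"ts": 1, "market_id": "A", "side": "buy_no"},
    {"ts": 2, "market_id": "A", "side": "buy_no"},
    {"ts": 2, "market_id": "B", "side": "buy_yes"},
    {"ts": 3, "market_id": "A", "side": "buy_no"},
    {"ts": 4, "market_id": "A", "side": "buy_no"},
]

raw_counts = Counter(r["side"] for r in raw)
deduped_counts = Counter(r["side"] for r in dedupe(raw))
print("raw:    ", dict(raw_counts))
print("deduped:", dict(deduped_counts))
```

Two opportunities, one on each side, show up as 4:1 in the raw feed only because one of them lingered.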

Lie #2: the losses were hiding in the longshots

The other breakdown bucketed every decision by the model's own predicted probability for the side it took:

predicted fair < 0.40   ->  0 wins out of 12
predicted fair 0.4-0.6  ->  64.7% win  (model said 50.4%)

There it is. Every single bet where the model itself rated the chosen side a longshot, taken purely because the asking price was even cheaper than the model's longshot odds, lost. Zero for twelve. Meanwhile the coin-flip-ish bets in the middle were actually fine, even good.

That's a different bug from "volatility too high everywhere." It's specifically: don't take a side your own model thinks will probably lose, just because it's on sale. The cheap-tail edge was an illusion produced by the pricing model on exactly the bets where the model is least trustworthy.
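The bucketing is a plain calibration table: group by what the model predicted, then compare against what happened. A rough sketch follows, using the 0.40 and 0.60 edges from the output above. The `>0.60` label, the function names, and the sample pairs are my own placeholders.

```python
def bucket_label(fair):
    """Bucket a decision by the model's predicted probability for its side."""
    if fair < 0.40:
        return "<0.40"
    if fair <= 0.60:
        return "0.40-0.60"
    return ">0.60"


def calibration(scored):
    """scored: iterable of (fair, won) pairs. Returns per-bucket stats."""
    acc = {}
    for fair, won in scored:
        b = acc.setdefault(bucket_label(fair), {"n": 0, "wins": 0, "fair_sum": 0.0})
        b["n"] += 1
        b["wins"] += int(won)
        b["fair_sum"] += fair
    return {
        label: {
            "n": b["n"],
            "win_rate": b["wins"] / b["n"],
            "predicted": b["fair_sum"] / b["n"],
        }
        for label, b in acc.items()
    }


# Made-up (fair, won) pairs, only to show the output shape.
sample = [(0.30, False), (0.35, False), (0.50, True), (0.55, False), (0.70, True)]
table = calibration(sample)
for label, stats in table.items():
    print(label, stats)
```

A bucket where `win_rate` sits far below `predicted` is where the model is overconfident in the side it picked.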

The fix (and the part where I don't trust my own fix)

The change wasn't a volatility knob. It was a floor: don't bet a side the model rates below 40% to win, no matter how cheap. Surgical — it removes the 0-for-12 segment and leaves the working middle alone.
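In code, the floor is one extra condition in front of the edge check. A sketch with hypothetical names (`MIN_FAIR`, `passes_floor`) and made-up prices:

```python
MIN_FAIR = 0.40  # the floor; chosen by looking at the data, so treat it as provisional


def passes_floor(fair, ask, fee, min_fair=MIN_FAIR):
    """Qualify only if the model rates the side at or above the floor
    AND the edge still clears fees."""
    return fair >= min_fair and (fair - ask - fee) > 0


# A cheap longshot: positive edge on paper, but the model says it probably loses.
print(passes_floor(fair=0.30, ask=0.20, fee=0.02))
# A near coin flip with real edge after fees.
print(passes_floor(fair=0.52, ask=0.45, fee=0.02))
```

The edge rule itself is untouched, so anything in the middle that qualified before still qualifies.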

Re-scored with the floor applied, the same data goes from -$31.91 to +$89.61, with a 69.6% win rate: the same 16 wins, now out of 23 decisions instead of 35. Which sounds great, and which I am deliberately not celebrating, because that number is in-sample: I picked the 0.40 threshold by looking at this exact dataset. Of course it improves the dataset it was fit to. That's not evidence the floor works. It's evidence I can draw a line through points I already have.

The real test is fresh data the threshold has never seen. So the bot keeps observing — now with the floor live and the logging deduped — and in a few days I re-run the harness on decisions it couldn't have been tuned against. If it's still positive and balanced out of sample, the strategy earns a shot at paper execution. If not, back to the volatility model. Either way, I'll have measured it instead of guessed.
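Mechanically, "data the threshold has never seen" just means a hard time cutoff: only decisions logged after the floor was frozen count toward the verdict. A minimal sketch, where the `ts` field, the cutoff date, and the rows are all placeholders:

```python
from datetime import datetime, timezone


def split_by_freeze(decisions, frozen_at):
    """Split decisions into (in_sample, out_of_sample) around the moment
    the threshold was fixed. Only the second list is a fair test."""
    in_sample = [d for d in decisions if d["ts"] <= frozen_at]
    out_of_sample = [d for d in decisions if d["ts"] > frozen_at]
    return in_sample, out_of_sample


# Placeholder cutoff and rows; real ones would come from the observe log.
frozen_at = datetime(2024, 1, 10, tzinfo=timezone.utc)
log = [
    {"market_id": "a", "ts": datetime(2024, 1, 9, 12, tzinfo=timezone.utc)},
    {"market_id": "b", "ts": datetime(2024, 1, 10, tzinfo=timezone.utc)},
    {"market_id": "c", "ts": datetime(2024, 1, 11, 8, tzinfo=timezone.utc)},
]
seen, unseen = split_by_freeze(log, frozen_at)
print(len(seen), "in-sample,", len(unseen), "out-of-sample")
```

If the threshold ever gets re-tuned, the cutoff moves forward with it and the clock restarts.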

What I'd tell past me

Two things, and they're really the same thing.

A raw count is not a measurement. Before you explain a number, make sure the number is counting what you think it's counting. My "4:1 bias" was a logging cadence in a trench coat.

A result you fit your parameter to is not a result. In-sample improvement is the easiest thing in the world to manufacture and the easiest thing to fool yourself with. The only honest verdict comes from data the decision never touched.

The strategy might still be a dud. I genuinely don't know yet — and that "I don't know yet, here's how I'll find out" is the whole point. Observe before you bet. Then actually check the observations. Then check them again on data you can't have cheated on.