<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cameron Meese</title>
    <description>The latest articles on DEV Community by Cameron Meese (@cameronmeese).</description>
    <link>https://dev.to/cameronmeese</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3947992%2Fcfdd654c-c27f-41fd-ae59-0076a9cc499c.png</url>
      <title>DEV Community: Cameron Meese</title>
      <link>https://dev.to/cameronmeese</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cameronmeese"/>
    <language>en</language>
    <item>
      <title>My bot rejected every trade for being 'too wide' — and the gate was measuring the wrong thing</title>
      <dc:creator>Cameron Meese</dc:creator>
      <pubDate>Sun, 21 Jun 2026 18:12:31 +0000</pubDate>
      <link>https://dev.to/cameronmeese/my-bot-rejected-every-trade-for-being-too-wide-and-the-gate-was-measuring-the-wrong-thing-3da2</link>
      <guid>https://dev.to/cameronmeese/my-bot-rejected-every-trade-for-being-too-wide-and-the-gate-was-measuring-the-wrong-thing-3da2</guid>
      <description>&lt;p&gt;This is the third post in an accidental series about my paper-trading bot finding new and creative ways to do nothing. First it ran for 48 hours and rejected every trade. Then it logged hundreds of trades it never made. This time it had a number to hit — 100 completed trades before I'd even consider real money — and it had crawled to 87 and stalled. Days passed. The counter didn't move.&lt;/p&gt;

&lt;p&gt;So, again: I dumped the rejection log and bucketed by reason.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'/rejected/'&lt;/span&gt; state/decisions.jsonl | jq &lt;span class="nt"&gt;-r&lt;/span&gt; .reason | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;
    246 wide_spread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not "mostly." &lt;em&gt;Everything.&lt;/em&gt; Every single rejected entry, one reason: the spread was too wide. The bot has a rule — if a market's bid-ask spread is wider than a threshold, don't trade it, because crossing a wide spread is expensive. Reasonable rule. It was now vetoing 100% of opportunities.&lt;/p&gt;

&lt;p&gt;My first assumption was the obvious one: the markets I was watching had simply gotten illiquid. Wide spreads are real. Maybe the bot was right to refuse.&lt;/p&gt;

&lt;p&gt;It was not right. And the reason why is the most useful thing I've learned about market microstructure all month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Touch-spread is not fill cost
&lt;/h2&gt;

&lt;p&gt;Here's the thing I had quietly conflated. The "spread" my gate measured was the &lt;strong&gt;touch&lt;/strong&gt; — the gap between the best bid and the best ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spread_bps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best_ask&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;best_bid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;mid&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10_000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That math is correct. I checked it three times. And the values were real — these markets genuinely showed a wide touch. So where was the bug?&lt;/p&gt;

&lt;p&gt;The bug was in believing that a wide touch means an expensive fill. It doesn't. &lt;strong&gt;What you actually pay is the slippage you pick up walking the book, measured from the price you cross at&lt;/strong&gt;, not the distance between the two best quotes. A market can have a wide touch and still fill you cheaply if there's real size sitting right at the best ask.&lt;/p&gt;
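
&lt;p&gt;To make the distinction concrete, here is a minimal sketch of pricing a market buy by walking the ask side. The function name, the &lt;code&gt;(price, size)&lt;/code&gt; level format, and both example books are assumptions for illustration, not the bot's actual code or real market data.&lt;/p&gt;

```python
def buy_slippage_bps(asks, qty):
    """Slippage in bps of the average fill price versus the best ask.

    Hypothetical helper. `asks` is a list of (price, size) levels,
    best (lowest) price first; `qty` is the size to buy.
    """
    if not asks:
        raise ValueError("empty ask side")
    if not qty > 0:
        raise ValueError("qty must be positive")
    best_ask = asks[0][0]
    remaining = qty
    cost = 0.0
    for price, size in asks:
        take = min(remaining, size)   # consume this level, or what is left to fill
        cost += take * price
        remaining -= take
        if remaining == 0:
            break
    if remaining > 0:
        raise ValueError("book too thin to fill qty")
    avg_price = cost / qty
    return (avg_price - best_ask) / best_ask * 10_000


# Two made-up books with the same best ask, so the same touch against any bid.
deep_book = [(0.0300, 50_000), (0.0301, 20_000)]   # real size at the best ask
thin_book = [(0.0300, 1_000), (0.0310, 20_000)]    # almost nothing at the best ask

deep_cost = buy_slippage_bps(deep_book, 10_000)
thin_cost = buy_slippage_bps(thin_book, 10_000)
print(f"deep: {deep_cost:.1f} bps, thin: {thin_cost:.1f} bps")
```

&lt;p&gt;Both books show an identical touch. Only the depth behind the best ask differs, and only the thin one is expensive to cross.&lt;/p&gt;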

&lt;p&gt;And then there was the kicker: most of the symbols tripping this gate were &lt;em&gt;cheap&lt;/em&gt; tokens, sub-dollar and some sub-penny. When a token trades at $0.03 and the exchange's minimum price increment (the tick) is $0.0001, a &lt;strong&gt;one-tick spread is already ~33 basis points&lt;/strong&gt;. Two ticks, ~67. Not because the book is thin, but because the price is small and bps is a ratio. My gate was set at a level that, for a three-cent coin, flagged a perfectly healthy book as "too wide" on tick granularity alone.&lt;/p&gt;
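
&lt;p&gt;The arithmetic is quick to check. A minimal sketch, with a hypothetical helper name; the $0.03 price and $0.0001 tick are the figures from the paragraph above, while the $30,000 asset with a $0.01 tick is an invented comparison row.&lt;/p&gt;

```python
def tick_spread_bps(price: float, tick: float, ticks: int = 1) -> float:
    """Touch spread in bps when the book is `ticks` ticks wide.

    Hypothetical helper for illustration; treats `price` as the mid.
    """
    return ticks * tick / price * 10_000


cheap_one = tick_spread_bps(0.03, 0.0001)        # three-cent token, one tick
cheap_two = tick_spread_bps(0.03, 0.0001, 2)     # same token, two ticks
pricey_one = tick_spread_bps(30_000.0, 0.01)     # invented high-priced asset

print(f"{cheap_one:.1f} bps, {cheap_two:.1f} bps, {pricey_one:.4f} bps")
```

&lt;p&gt;The same one-tick book reads as roughly 33 bps on the cheap token and a small fraction of a basis point on the expensive one, which is why a single bps threshold treats them so differently.&lt;/p&gt;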

&lt;p&gt;I was, in effect, refusing to trade any cheap asset on the grounds that cheap assets have coarse ticks. That's not a liquidity filter. That's a unit-of-measure bug wearing a liquidity filter's clothes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gate that already knew the answer
&lt;/h2&gt;

&lt;p&gt;Here's the part that stung. The bot &lt;strong&gt;already&lt;/strong&gt; had a real liquidity check — a good one. When it builds its watchlist, it does a depth probe: it walks the actual order book and confirms it can fill a target size within a tight slippage budget. Every symbol on the list had &lt;em&gt;passed that probe.&lt;/em&gt; The bot had walked the book, proven these markets fill cheaply, put them on the watchlist... and then refused to trade them at entry because the touch looked wide.&lt;/p&gt;

&lt;p&gt;Two checks, measuring two different things, disagreeing — and I'd let the cruder one (the touch) overrule the one that actually walks the book.&lt;/p&gt;

&lt;p&gt;Worse, the wide-spread gate was firing &lt;em&gt;before&lt;/em&gt; the bot's real economic check — the one that weighs the genuine round-trip cost against the expected edge and rejects a trade only if it can't pay for itself. By bailing out early on a bps threshold, the crude gate never let the smart, money-aware check have an opinion.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: demote the proxy, trust the real guards
&lt;/h2&gt;

&lt;p&gt;I didn't delete the gate — a genuinely broken, one-sided, gapped book &lt;em&gt;is&lt;/em&gt; a real thing worth refusing on sight. I demoted it. I moved the threshold out to a level that only catches pathological books, and let the checks that measure cost &lt;em&gt;correctly&lt;/em&gt; make the marginal calls:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;economic check&lt;/strong&gt; prices the real round-trip cost against the expected edge.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;fill-time slippage budget&lt;/strong&gt; walks the actual book and rejects any fill whose real slippage blows the budget — the true protection, at the moment it matters.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;build-time depth probe&lt;/strong&gt; keeps structurally thin markets off the list entirely.&lt;/li&gt;
&lt;/ol&gt;
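
&lt;p&gt;One way the demoted gate and the economic check could sit together at entry time is sketched below. Everything here is an assumption for illustration: the &lt;code&gt;Candidate&lt;/code&gt; fields, the 500 bps cutoff, and the &lt;code&gt;edge_below_cost&lt;/code&gt; reason string are made up, and the fill-time slippage budget and build-time depth probe are left out.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical cutoff: far enough out that only broken books trip it.
PATHOLOGICAL_TOUCH_BPS = 500.0


@dataclass(frozen=True)
class Candidate:
    touch_bps: float            # (best ask - best bid) / mid, in bps
    round_trip_cost_bps: float  # fees plus expected slippage, both legs
    expected_edge_bps: float


def entry_verdict(c: Candidate) -> str:
    """Crude gate first, but only for pathological books; economics decide the rest."""
    if c.touch_bps > PATHOLOGICAL_TOUCH_BPS:
        return "wide_spread"
    if c.round_trip_cost_bps >= c.expected_edge_bps:
        return "edge_below_cost"
    return "accept"


print(entry_verdict(Candidate(touch_bps=67.0, round_trip_cost_bps=20.0, expected_edge_bps=45.0)))
print(entry_verdict(Candidate(touch_bps=900.0, round_trip_cost_bps=20.0, expected_edge_bps=45.0)))
print(entry_verdict(Candidate(touch_bps=67.0, round_trip_cost_bps=50.0, expected_edge_bps=45.0)))
```

&lt;p&gt;In this arrangement a two-tick touch on a cheap token no longer decides anything by itself. It only gets rejected if the measured cost actually exceeds the edge.&lt;/p&gt;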

&lt;p&gt;The validation was immediate. I restarted and watched the first decision on a symbol that had been rejected hundreds of times for &lt;code&gt;wide_spread&lt;/code&gt;. New verdict: it cleared the spread gate and moved on to the next, real check. The wall was gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The lesson isn't "set the threshold higher." It's that a guard is only as good as the thing it measures, and it's dangerously easy to ship a proxy that &lt;em&gt;looks&lt;/em&gt; like the real quantity. Touch-spread looks like cost — it's shaped like cost, denominated in bps like cost. It is not cost. The bot even had the correct measurement sitting right next to the wrong one.&lt;/p&gt;

&lt;p&gt;When a gate is rejecting everything, the bug isn't always the threshold. Sometimes it's that you're measuring the wrong number with great precision.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Part of an ongoing build-in-public log about building a small algorithmic trading bot the slow, paper-first way.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>algotrading</category>
      <category>python</category>
      <category>debugging</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>My bot logged hundreds of trades it never made — so I built something to check if it was lying</title>
      <dc:creator>Cameron Meese</dc:creator>
      <pubDate>Sun, 31 May 2026 01:39:49 +0000</pubDate>
      <link>https://dev.to/cameronmeese/my-bot-logged-hundreds-of-trades-it-never-made-so-i-built-something-to-check-if-it-was-lying-81j</link>
      <guid>https://dev.to/cameronmeese/my-bot-logged-hundreds-of-trades-it-never-made-so-i-built-something-to-check-if-it-was-lying-81j</guid>
      <description>&lt;p&gt;I have a rule for new strategies: &lt;strong&gt;observe before you bet.&lt;/strong&gt; Before a single dollar (paper or otherwise) moves, the strategy runs in "would-have-traded" mode — every time it thinks it sees an edge, it writes a row to a log instead of placing an order. Decision, timestamp, the side it would have taken, and the edge it believed it had. You let that run, then you go back and check whether the bot was right.&lt;/p&gt;

&lt;p&gt;This is the story of going back to check, and finding out the bot was lying to me in two different ways at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The strategy prices short-duration crypto "up or down" binary markets — will the price be higher at the top of the hour than it was at the start? It builds a fair-value probability from a volatility model and compares it to what the market is charging. When the gap clears fees, it logs a decision.&lt;/p&gt;

&lt;p&gt;After a day of observing, the feed looked &lt;em&gt;busy&lt;/em&gt; — lots of green, lots of "+5.2¢ edge" rows. And one number jumped out when I tallied it up: the bot was choosing &lt;strong&gt;"NO" over "YES" about 4 to 1.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I immediately had a story. My volatility estimate, sourced from one exchange's recent prints, probably runs a little hot — and an overestimate of volatility makes the &lt;em&gt;unlikely&lt;/em&gt; side of a binary look underpriced. So the bot keeps "buying" the cheap tail. Made sense. I was about ten minutes from turning down the volatility input and calling it a fix.&lt;/p&gt;

&lt;p&gt;That would have been a mistake. The 4:1 number was a hypothesis built on raw counts, and I hadn't checked a single one of those decisions against what actually happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The harness
&lt;/h2&gt;

&lt;p&gt;So I built the thing I should have built first: a script that takes each logged decision, looks up the &lt;strong&gt;actual outcome&lt;/strong&gt; of that market (did it close up or down?), and scores it. Win or loss. Then it aggregates — realized win rate vs. the win rate the model &lt;em&gt;predicted&lt;/em&gt;, broken out by side and by confidence bucket.&lt;/p&gt;
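
&lt;p&gt;The core of such a harness can be very small. The sketch below is an assumption about its shape, not the actual script: the &lt;code&gt;Decision&lt;/code&gt; fields, the &lt;code&gt;score&lt;/code&gt; function, and the four sample rows are all invented for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    market_id: str
    side: str         # "buy_yes" or "buy_no"
    predicted: float  # model's win probability for the side it took


def score(decisions, closed_up):
    """Return (realized win rate, mean predicted win rate).

    `closed_up` maps market_id to True if that market resolved up (YES).
    """
    if not decisions:
        raise ValueError("no resolved decisions to score")
    wins = 0
    for d in decisions:
        up = closed_up[d.market_id]
        # A YES bet wins when the market closed up, a NO bet when it did not.
        wins += up if d.side == "buy_yes" else not up
    n = len(decisions)
    return wins / n, sum(d.predicted for d in decisions) / n


# Made-up rows, only to show the shape of the comparison.
log = [
    Decision("m1", "buy_yes", 0.55),
    Decision("m2", "buy_no", 0.30),
    Decision("m3", "buy_no", 0.52),
    Decision("m4", "buy_yes", 0.47),
]
outcomes = {"m1": True, "m2": True, "m3": False, "m4": False}

realized, predicted = score(log, outcomes)
print(f"realized {realized:.1%} vs predicted {predicted:.1%}")
```

&lt;p&gt;Breaking the result out by side or by confidence bucket is then a matter of filtering the list before calling the scorer.&lt;/p&gt;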

&lt;p&gt;The first run covered 35 resolved decisions. Here's what came back (all paper, all hypothetical — don't @ me about the size):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OVERALL   win 45.7% (16/35)   predicted 49.1%   net -$31.91
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Net negative. The strategy I'd been admiring in the feed would have &lt;strong&gt;lost money&lt;/strong&gt;. That alone was worth knowing before risking anything. But the two breakdowns underneath are where it got interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lie #1: the 4:1 skew was a measurement artifact
&lt;/h2&gt;

&lt;p&gt;I split the decisions by side, deduped to one per opportunity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;buy_no    18
buy_yes   17
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even. Basically a coin flip.&lt;/p&gt;

&lt;p&gt;So where did 4:1 come from? The bot re-evaluates every market on every scan, and in observe mode it was logging a decision &lt;em&gt;each time&lt;/em&gt; a market still qualified — not once per opportunity. A market that sat in "NO looks cheap" territory for twenty minutes got logged dozens of times; a market that flickered into "YES" for one scan got logged once. The raw feed wasn't measuring my model's &lt;em&gt;bias&lt;/em&gt;. It was measuring how long each opportunity &lt;em&gt;lingered.&lt;/em&gt;&lt;/p&gt;
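
&lt;p&gt;The effect is easy to reproduce. In the sketch below, an "opportunity" is assumed to be a unique &lt;code&gt;(market_id, side)&lt;/code&gt; pair; that key, the row format, and the sample rows are illustrative assumptions rather than the bot's real log schema.&lt;/p&gt;

```python
from collections import Counter


def dedupe(rows):
    """Keep the first logged row per (market_id, side) opportunity."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["market_id"], row["side"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# One NO opportunity that lingered for four scans, one YES that flickered once.
rows = [{"market_id": "m1", "side": "buy_no"} for _ in range(4)]
rows.append({"market_id": "m2", "side": "buy_yes"})

raw_counts = Counter(r["side"] for r in rows)
deduped_counts = Counter(r["side"] for r in dedupe(rows))
print(dict(raw_counts))
print(dict(deduped_counts))
```

&lt;p&gt;The raw tally shows four NO rows to one YES row, while the deduped tally shows one of each: the skew comes entirely from how long the NO opportunity stayed qualified.&lt;/p&gt;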

&lt;p&gt;The "overestimated volatility → buy NO" story was a confident explanation for a number that was pure logging noise. Dedup first, &lt;em&gt;then&lt;/em&gt; analyze. I'd skipped the first step and nearly tuned a real model parameter to chase a histogram artifact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lie #2: the losses were hiding in the longshots
&lt;/h2&gt;

&lt;p&gt;The other breakdown bucketed every decision by the model's own predicted probability for the side it took:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;predicted fair &amp;lt; 0.40   -&amp;gt;  0 wins out of 12
predicted fair 0.4-0.6  -&amp;gt;  64.7% win  (model said 50.4%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There it is. Every single bet where the model itself rated the chosen side a &lt;em&gt;longshot&lt;/em&gt;, taken purely because the asking price was even cheaper than that longshot, &lt;strong&gt;lost.&lt;/strong&gt; Zero for twelve. Meanwhile the coin-flip-ish bets in the middle were actually fine, even good.&lt;/p&gt;

&lt;p&gt;That's a different bug than "volatility too high everywhere." It's specifically: &lt;em&gt;don't take a side your own model thinks will probably lose, just because it's on sale.&lt;/em&gt; The cheap-tail edge was an illusion of the pricing model on exactly the bets where the model is least trustworthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix (and the part where I don't trust my own fix)
&lt;/h2&gt;

&lt;p&gt;The change wasn't a volatility knob. It was a floor: &lt;strong&gt;don't bet a side the model rates below 40% to win&lt;/strong&gt;, no matter how cheap. Surgical — it removes the 0-for-12 segment and leaves the working middle alone.&lt;/p&gt;
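
&lt;p&gt;As a rule, the floor is one extra condition on top of the edge test. The sketch below assumes a simplified edge test (fair value above the ask, fees ignored) and a hypothetical function name; the win-rate arithmetic at the end uses only the counts reported earlier in this post.&lt;/p&gt;

```python
MIN_FAIR_PROB = 0.40  # the floor described above


def would_take(predicted_fair: float, ask_price: float) -> bool:
    """Hypothetical entry rule: an edge AND a side the model rates at 40% or better.

    Fees are ignored to keep the sketch short.
    """
    has_edge = predicted_fair > ask_price
    return has_edge and predicted_fair >= MIN_FAIR_PROB


print(would_take(0.30, 0.22))  # cheap longshot: has an edge, but below the floor
print(would_take(0.55, 0.48))  # mid-range bet with an edge

# In-sample arithmetic: the sub-0.40 bucket held 12 decisions and 0 wins,
# so removing it leaves 35 - 12 decisions and all 16 wins.
kept = 35 - 12
win_rate = 16 / kept
print(f"{kept} decisions, {win_rate:.1%} win rate")
```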

&lt;p&gt;Re-scored with the floor applied, the same data goes from −$31.91 to +$89.61, 69.6% win rate. Which sounds great, and which I am &lt;em&gt;deliberately not celebrating&lt;/em&gt;, because that number is &lt;strong&gt;in-sample&lt;/strong&gt;: I picked the 0.40 threshold &lt;em&gt;by looking at this exact dataset&lt;/em&gt;. Of course it improves the dataset it was fit to. That's not evidence the floor works. It's evidence I can draw a line through points I already have.&lt;/p&gt;

&lt;p&gt;The real test is fresh data the threshold has never seen. So the bot keeps observing — now with the floor live and the logging deduped — and in a few days I re-run the harness on decisions it couldn't have been tuned against. If it's still positive and balanced out of sample, the strategy earns a shot at paper execution. If not, back to the volatility model. Either way, I'll have measured it instead of guessed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past me
&lt;/h2&gt;

&lt;p&gt;Two things, and they're really the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A raw count is not a measurement.&lt;/strong&gt; Before you explain a number, make sure the number is counting what you think it's counting. My "4:1 bias" was a logging cadence in a trench coat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A result you fit your parameter to is not a result.&lt;/strong&gt; In-sample improvement is the easiest thing in the world to manufacture and the easiest thing to fool yourself with. The only honest verdict comes from data the decision never touched.&lt;/p&gt;

&lt;p&gt;The strategy might still be a dud. I genuinely don't know yet — and that "I don't know yet, here's how I'll find out" is the whole point. Observe before you bet. Then actually check the observations. Then check them again on data you can't have cheated on.&lt;/p&gt;

</description>
      <category>python</category>
      <category>algotrading</category>
      <category>webdev</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>My bot ran for 48 hours and didn't do a thing</title>
      <dc:creator>Cameron Meese</dc:creator>
      <pubDate>Sat, 23 May 2026 17:17:19 +0000</pubDate>
      <link>https://dev.to/cameronmeese/my-bot-ran-for-48-hours-and-didnt-do-a-thing-1bh</link>
      <guid>https://dev.to/cameronmeese/my-bot-ran-for-48-hours-and-didnt-do-a-thing-1bh</guid>
      <description>&lt;p&gt;I'd been watching a paper-trading bot I've been building for two days. Just paper — no real money at stake — but the silence was getting loud. Zero trades. Not "no opportunities" zero — &lt;em&gt;actively rejected every single one&lt;/em&gt; zero. The bot logged 1,262 entry attempts in 24 hours. Every one bounced.&lt;/p&gt;

&lt;p&gt;This is the post-mortem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hook: why was nothing happening?
&lt;/h2&gt;

&lt;p&gt;The bot's job is to spot setups across a handful of trading pairs and open positions when conditions line up. It had been working. Then I changed the universe of symbols it watched, adding some thinner, more volatile candidates I wanted to test against, and from that moment, nothing.&lt;/p&gt;

&lt;p&gt;First instinct: market regime. Maybe nothing was qualifying. So I dumped the rejection log and bucketed by reason.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'/rejected/'&lt;/span&gt; state/decisions.jsonl | jq &lt;span class="nt"&gt;-r&lt;/span&gt; .reason | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;uniq&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;
   1015 stale_quote
    176 insufficient_inventory
     71 max_concurrent_reached
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stale quotes? On 14 actively-watched symbols, across three exchanges, in the middle of a normal trading day? That number didn't pass the sniff test.&lt;/p&gt;

&lt;h2&gt;
  
  
  The investigation: chasing a lying number
&lt;/h2&gt;

&lt;p&gt;Two things were happening, and they were stacking.&lt;/p&gt;

&lt;p&gt;First: the bot tracks "freshness" of price quotes per symbol — if the last quote from an exchange is older than ~60 seconds, you don't trust it for sizing. Reasonable rule.&lt;/p&gt;

&lt;p&gt;But to &lt;em&gt;get&lt;/em&gt; fresh quotes, the bot polls the exchange's orderbook (via the wonderful but occasionally-temperamental &lt;a href="https://github.com/ccxt/ccxt" rel="noopener noreferrer"&gt;ccxt&lt;/a&gt; library). And those polls were timing out — silently, in batches. Five-minute window: 215 orderbook timeouts. Same five-minute window: zero successful quote refreshes.&lt;/p&gt;

&lt;p&gt;OK, so the bot has bad quotes. Why doesn't it just… wait and retry?&lt;/p&gt;

&lt;p&gt;It does. Sort of. Here's the part that had been working fine for weeks and quietly became the bomb:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After 3 consecutive orderbook timeouts on (venue, symbol),
# stop scheduling that pair until the bot restarts.
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;failure_count&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;venue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;quarantine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;venue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A reasonable defensive measure. If a &lt;code&gt;(venue, symbol)&lt;/code&gt; is wedged, stop wasting cycles trying it. Restart-only recovery means a human is paying attention before it retries.&lt;/p&gt;

&lt;p&gt;The bug isn't in the code. The bug is in the &lt;em&gt;assumption&lt;/em&gt; the code encodes: "the only way this fails 3 times in a row is if something is permanently broken." That's true 99% of the time. The 1% is when an exchange has a 30-second session warmup on a cold start, and three consecutive 15-second timeouts trip every symbol you're trying to load.&lt;/p&gt;

&lt;p&gt;9 of my 14 symbols got quarantined inside the first 65 seconds of boot. They stayed quarantined for the next 10 hours, until I noticed and restarted the bot.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second bug, which lied to me about the first
&lt;/h2&gt;

&lt;p&gt;While I was in there, I noticed something else weird. A lot of the rejections were tagged &lt;code&gt;stale_quote&lt;/code&gt;, but they shouldn't have been — for some of those candidates, the bot didn't even have inventory available. The "do you have inventory?" check should have rejected first.&lt;/p&gt;

&lt;p&gt;It &lt;em&gt;was&lt;/em&gt; checking. In the wrong order. The freshness check ran before the inventory check, and a stale quote (which, we now know, was caused by the quarantine) was masking the real reason. So the rejection log was &lt;em&gt;lying&lt;/em&gt; to me — over a thousand &lt;code&gt;stale_quote&lt;/code&gt; entries were really &lt;code&gt;insufficient_inventory&lt;/code&gt; events I couldn't see.&lt;/p&gt;

&lt;p&gt;This is the part of debugging nobody writes about: you find one bug, and it was hiding two more. Reorder the gate stack, surface the truth, suddenly the histogram tells a different story.&lt;/p&gt;
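
&lt;p&gt;The masking effect comes from any gate stack that reports only the first failure. A minimal sketch, assuming gates are evaluated as an ordered list of &lt;code&gt;(reason, check)&lt;/code&gt; pairs; the function names, the candidate fields, and the sample candidate are hypothetical.&lt;/p&gt;

```python
def first_rejection(candidate, gates):
    """Return the reason of the first failing gate, or None if all pass."""
    for reason, passes in gates:
        if not passes(candidate):
            return reason
    return None


def is_fresh(c):
    return 60.0 >= c["quote_age_s"]   # quote no older than roughly 60 seconds


def has_inventory(c):
    return c["inventory"] > 0


# A made-up candidate that fails both checks at once.
candidate = {"quote_age_s": 300.0, "inventory": 0.0}

old_order = [("stale_quote", is_fresh), ("insufficient_inventory", has_inventory)]
new_order = [("insufficient_inventory", has_inventory), ("stale_quote", is_fresh)]

print(first_rejection(candidate, old_order))
print(first_rejection(candidate, new_order))
```

&lt;p&gt;The candidate is identical in both calls. Only the order changes, and with it the reason that ends up in the log.&lt;/p&gt;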

&lt;h2&gt;
  
  
  The fix: auto-recovering quarantine
&lt;/h2&gt;

&lt;p&gt;The real fix was conceptual. Permanent-until-restart is the wrong shape. What I wanted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sideline a flaky &lt;code&gt;(venue, symbol)&lt;/code&gt; after N consecutive failures (keep this part)&lt;/li&gt;
&lt;li&gt;After a cooldown, &lt;strong&gt;carefully retry&lt;/strong&gt; (the new part)&lt;/li&gt;
&lt;li&gt;If still broken, re-quarantine &lt;em&gt;immediately&lt;/em&gt; — not after another N failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last invariant turned out to be the one that mattered. If you reset the failure counter on cooldown expiry, a permanently-broken resource costs &lt;code&gt;N × cycles&lt;/code&gt; failures over the lifetime of your process. If you &lt;em&gt;preserve&lt;/em&gt; the count, it costs exactly one failure per cycle.&lt;/p&gt;

&lt;p&gt;Here's the whole thing, about 50 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
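&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Generic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Hashable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;

&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"K"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Hashable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AutoRecoveringQuarantine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Generic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;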
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recovery_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[],&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_recovery_seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;recovery_seconds&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_clock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clock&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_failure_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skip_until&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_quarantined&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;until&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skip_until&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_clock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="c1"&gt;# Window expired — drop the deadline, KEEP the failure count.
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skip_until&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_failure_count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skip_until&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_failure_count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_failure_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skip_until&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_clock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_recovery_seconds&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;K&lt;/code&gt; is whatever hashable key identifies your "thing that flakes" — &lt;code&gt;(venue, symbol)&lt;/code&gt;, a tenant ID, a customer hash, whatever.&lt;/p&gt;

&lt;p&gt;I pulled it out into its own repo: &lt;a href="https://github.com/CR8C0NT1NUM/ccxt-auto-recovering-quarantine" rel="noopener noreferrer"&gt;&lt;code&gt;ccxt-auto-recovering-quarantine&lt;/code&gt;&lt;/a&gt;. Stdlib only. Drop it into any project where one flaky key shouldn't take out the rest.&lt;/p&gt;
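
&lt;p&gt;The cost difference between resetting and preserving the failure count can be checked with a toy model. This is a standalone simulation under stated assumptions, not code from the repo: the function name is hypothetical, the key never recovers, it has already been quarantined once, and it is probed until it is quarantined again after each cooldown.&lt;/p&gt;

```python
def failures_on_dead_key(threshold: int, cycles: int, preserve_count: bool) -> int:
    """Count failed probes against a key that never recovers.

    Toy model: the key starts out already quarantined (count == threshold)
    and then goes through `cycles` cooldown expiries. Assumes threshold >= 1.
    """
    count = threshold
    failures = 0
    for _ in range(cycles):
        if not preserve_count:
            count = 0          # counter reset when the cooldown expires
        while True:
            failures += 1      # one more failed probe
            count += 1
            if count >= threshold:
                break          # quarantined again
    return failures


with_reset = failures_on_dead_key(threshold=3, cycles=10, preserve_count=False)
with_preserve = failures_on_dead_key(threshold=3, cycles=10, preserve_count=True)
print(with_reset, with_preserve)
```

&lt;p&gt;With a threshold of 3 and 10 cooldown cycles, the reset policy spends 30 failed probes on the dead key and the preserve policy spends 10, matching the &lt;code&gt;N × cycles&lt;/code&gt; versus one-per-cycle figures above.&lt;/p&gt;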

&lt;h2&gt;
  
  
  The validation: the numbers don't lie this time
&lt;/h2&gt;

&lt;p&gt;Before the fix, on the worst day of the cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;130 orderbook timeouts (one exchange)&lt;/li&gt;
&lt;li&gt;1,286 &lt;code&gt;stale_quote&lt;/code&gt; rejections (mostly lies, as we now know)&lt;/li&gt;
&lt;li&gt;0 successful trades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After deploying the fix, over the next five days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0 orderbook timeouts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stale_quote&lt;/code&gt; rejections collapsed: 464 → 56 → 0 → 0 → 0&lt;/li&gt;
&lt;li&gt;First profitable trade closed: +$0.041 net of fees, held 15h 51min. (Paper money. Don't @ me about the size.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five days, four successful trades, 100% win rate on paper. The pattern is doing what it should, and the bot is no longer pretending to work while quietly doing nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past me
&lt;/h2&gt;

&lt;p&gt;If you're writing defensive code that says "after N failures, give up," ask one more question: &lt;em&gt;what is the expected lifetime of "broken"?&lt;/em&gt; If it's "forever," your defense is correct. If it's "until something transient clears" — and most things are — you need a way back.&lt;/p&gt;

&lt;p&gt;Permanent isn't always the right kind of safe.&lt;/p&gt;

</description>
      <category>python</category>
      <category>algotrading</category>
      <category>discuss</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
