Mike Czerwinski

Posted on Jun 28

A published win rate is the actor auditing itself

#python #crypto #trading #llmops

A signal channel that publishes its own win rate is grading its own homework. The number it advertises comes from the part of the record that survived being shown. That does not prove fraud. It proves a measurement problem: the actor writing the record is also the actor being audited. I built the instrument that could see around it, pointed it at the channels everyone screenshots, and this is what it found.

The setup

I build autonomous crypto trading systems in Python. The one running today is live on its own strategies, and has been since June 4, 2026. But before any source earns real capital it has to clear shadow mode: the full pipeline runs on live market data with realistic frictions (8 bps fees, 5 bps slippage), every signal is logged as "would have entered at X" and tracked to its outcome, and no real order is placed.

Shadow mode is the whole trick. It lets you measure a source against outcomes it does not control, instead of against the receipts it chooses to post.
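
A minimal sketch of how a shadow fill might be scored with the frictions stated above. The names `ShadowFill` and `net_return_pct` are hypothetical, and charging the fee and the slippage once on entry and once on exit is an assumption, not something the post specifies.

```python
from dataclasses import dataclass

FEE_BPS = 8.0       # fee from the post; assumed to apply per side
SLIPPAGE_BPS = 5.0  # slippage from the post; assumed to apply per fill


@dataclass(frozen=True)
class ShadowFill:
    side: str      # "LONG" or "SHORT"
    entry: float   # price the signal "would have entered at"
    exit: float    # price at TP, SL, or timeout


def net_return_pct(fill: ShadowFill) -> float:
    """Return on notional, in percent, after assumed round-trip frictions."""
    direction = 1.0 if fill.side == "LONG" else -1.0
    gross = direction * (fill.exit - fill.entry) / fill.entry
    # Two legs (entry and exit), each paying fee plus slippage.
    friction = 2 * (FEE_BPS + SLIPPAGE_BPS) / 10_000
    return (gross - friction) * 100


long_fill = ShadowFill(side="LONG", entry=100.0, exit=101.0)
short_fill = ShadowFill(side="SHORT", entry=100.0, exit=101.0)
print(net_return_pct(long_fill), net_return_pct(short_fill))
```

Under these assumptions a 1 percent gross move nets less than 1 percent, and a flat trade is a small loss, which is why a source has to beat the frictions and not just the coin flip.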

Telegram was one of the first sources I wired up. Dozens of crypto signal channels, some with hundreds of thousands of subscribers, many claiming 70 to 80 percent win rates. When the bot connected it pulled in the channel history along with the live feed, so the record reaches back well before the bot existed: 9,312 messages spanning 17 months, February 2025 to June 2026.

I wanted to measure these channels properly rather than trust the screenshots. I measured them, then I dropped them. This post is the measurement that made that an easy call.

The pipeline

Most signals never reach evaluation, and where they die is itself the finding.

Telegram message received
   -> LLM parsing (DeepSeek): extract pair, side, entry, TP, SL
   -> Staleness check: is the entry still reachable?
   -> Veto filter: RSI sanity, news, Fear and Greed, regime gates
   -> Risk budget: daily loss limit, cooldown, correlation
   -> Shadow execution: log "would have entered at X", track to TP/SL/timeout
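
The stages above can be sketched as an ordered list of gates, where each signal records the stage at which it died. This is an illustrative skeleton, not the production pipeline: the gate functions, the dictionary fields, and the 0.5 percent staleness threshold are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional

# A gate returns None to let the signal pass, or a rejection code to stop it.
Gate = Callable[[dict], Optional[str]]


def levels_gate(signal: dict) -> Optional[str]:
    # Without a target and a stop the signal cannot be tracked to an outcome.
    return None if signal.get("tp") and signal.get("sl") else "missing_levels"


def staleness_gate(signal: dict) -> Optional[str]:
    # Hypothetical rule: stale once price has drifted more than 0.5% from entry.
    drift = abs(signal["last_price"] - signal["entry"]) / signal["entry"]
    return "stale" if drift > 0.005 else None


def run_pipeline(signal: dict, gates: list[tuple[str, Gate]]) -> tuple[str, Optional[str]]:
    """Run gates in order and return (stage reached, rejection code or None)."""
    for stage, gate in gates:
        reason = gate(signal)
        if reason is not None:
            return stage, reason
    return "shadow_execution", None


GATES = [("parse", levels_gate), ("staleness", staleness_gate)]

signals = [
    {"entry": 100.0, "tp": 104.0, "sl": 98.0, "last_price": 100.2},
    {"entry": 100.0, "tp": 104.0, "sl": 98.0, "last_price": 103.0},
    {"entry": 100.0, "tp": None, "sl": None, "last_price": 100.0},
]

# Count where each signal ended up; the funnel is this counter at scale.
deaths = Counter(run_pipeline(s, GATES) for s in signals)
print(deaths)
```

Returning the stage together with the reason is the design choice that matters: the funnel is then a by-product of normal operation instead of something reconstructed afterward.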

The system tracked 7 channels. Full collection, queried live from the production DB on Jun 27, 2026:

Channel	Messages	Parsed	Parse fail	Period
Crypto_Whales_Pumps_Guide	2,643	513	122	Feb 2025 - Jun 2026
Binance_Futures_Trades	2,445	164	1,852	May - Jun 2026
Trading_Crypto_Signals_Bitcoin	1,808	164	1,619	May - Jun 2026
cryptoninjas_trading_anm	1,351	241	273	Jul 2025 - Jun 2026
Tofan_Trade	1,008	222	750	May - Jun 2026
claycryp	34	8	8	Feb - Jun 2026
rarecryptosignals	23	6	4	Feb - Jun 2026
Total	9,312	1,318	4,628	Feb 2025 - Jun 2026

The gap between Messages and Parsed + Parse fail (3,366 messages) is mostly non-signal content filtered before extraction: chatter, announcements, result posts, teasers, and price updates without tradeable levels.

The funnel

Here is what happened to those 9,312 messages:

9,312   raw messages received
1,318   parseable (a valid trade idea)        <- 14.2% of raw
  109   timely (still actionable)             <- 8.3% of parseable
   17   reached a trade decision
    0   actually executed                     <- 0%

Only 14.2 percent of messages contained a parseable trade idea. The rest was noise: memes, "GM", price alerts without levels, result updates, locked teasers. And of the trade ideas that did parse, only 109 of 1,318 were still actionable by the time my pipeline could act. That is 91.7 percent stale.

A word on that number, because staleness depends entirely on what you put under the line. The 91.7 percent is timeliness measured against parseable signals: 109 of 1,318. Measured instead against the broader set of candidate messages the pipeline actually ran a staleness check on, it is 97.4 percent: 4,007 of 4,116. Both are real. They answer different questions.

The number that is wrong is 43 percent, which you get by dividing the 4,007 stale count by all 9,312 raw messages, quietly swapping a staleness denominator for a raw-volume one. I am showing all three on purpose. The moment you let a single denominator go unstated, you are back to grading your own homework.
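
The three figures can be reproduced from the counts quoted in this section. The sketch below only restates that arithmetic with each denominator named; the variable names are illustrative.

```python
RAW_MESSAGES = 9_312        # everything received
PARSEABLE = 1_318           # valid trade ideas
TIMELY = 109                # parseable and still actionable
STALENESS_CHECKED = 4_116   # candidates that got a staleness check
STALE = 4_007               # candidates that failed it


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


rates = {
    "stale / parseable signals": pct(PARSEABLE - TIMELY, PARSEABLE),
    "stale / staleness-checked candidates": pct(STALE, STALENESS_CHECKED),
    "stale / raw messages (mismatched denominator)": pct(STALE, RAW_MESSAGES),
}

for label, value in rates.items():
    print(f"{label}: {value}%")
```

Keeping the denominator in the label is the point: the same stale count yields 97.4 or 43.0 depending only on what sits under the line.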

The reason for the staleness is not slow code. It is that a broadcast channel posts a signal as the move starts, and tens of thousands of people see it at the same instant. By the time anything is parseable and checked, the information is already in the price. Staleness is not a bug in my pipeline. It is the defining property of the product.

What is actually inside the surviving signals

Of the 109 timely signals, 17 reached a trade decision and the router skipped the other 92. The rejection codes for those 92 tell the story:

Rejection reason	Count	What it means
`result_message`	45	Post-trade update ("TP1 hit") not a new signal
`locked_teaser`	28	Levels hidden behind a paywall
(no reason)	19	Router skipped without classifying

Roughly 79 percent of those skipped signals (73 of 92) were not signals at all. They were either announcements of trades already closed or advertisements for the paid tier. I left the unclassified bucket in the table because hiding unknowns would reproduce the exact reporting problem this post is about.

A locked teaser looks like this:

SIGNAL: ETHUSDT SHORT
Entry: [Unlock in Premium]
TP:    [Unlock in Premium]
SL:    [Unlock in Premium]

The model can read the pair and the direction. Without levels it is not tradeable. The free tier exists to show you that signals exist, not what they are.
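
To make the tradeable-versus-teaser distinction concrete, here is a small rule-based check. It is a stand-in for illustration only: the pipeline in this post uses LLM parsing, and the function name, the field list, and the `no_levels` code are hypothetical.

```python
import re

LEVEL_FIELDS = ("Entry", "TP", "SL")
NUMERIC = re.compile(r"\d+(\.\d+)?")


def classify_levels(message: str) -> str:
    """Return 'tradeable', 'locked_teaser', or 'no_levels' for a message."""
    for field in LEVEL_FIELDS:
        # Match "Field: value" on its own line; [ \t]* avoids spilling onto the next line.
        match = re.search(rf"^{field}:[ \t]*(.+)$", message, flags=re.MULTILINE)
        if match is None:
            return "no_levels"
        if not NUMERIC.fullmatch(match.group(1).strip()):
            # The field is present but its value is not a price.
            return "locked_teaser"
    return "tradeable"


teaser = (
    "SIGNAL: ETHUSDT SHORT\n"
    "Entry: [Unlock in Premium]\n"
    "TP:    [Unlock in Premium]\n"
    "SL:    [Unlock in Premium]"
)
open_signal = "SIGNAL: ETHUSDT SHORT\nEntry: 3000\nTP: 2900.5\nSL: 3050"

print(classify_levels(teaser), classify_levels(open_signal))
```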

The `result_message` half is the same trick from the other side: flood the feed with win announcements to manufacture social proof while the entries stay paywalled. This is the mechanism kenielzep97 described as receipts that are not outcomes, caught in the act. The channel is curating its own track record in real time, and the feed makes the curation read like live flow.

The scorecard, measured against price

The live router executed zero trades. That is the timeliness funnel talking: nothing survived staleness and the veto filters in time to act. Whether the channels had any edge at all is a separate question, so I backtested the parseable signals against historical klines with the same frictions. Only 846 of the 1,318 had klines available to score against, so that is the sample.

Zero executed is about my pipeline. The scorecard below is about the source. This is the number the channels cannot post, because it comes from outside their reporting loop.

Channel	n	Win%	Avg PnL	Note
Crypto_Whales_Pumps_Guide	646	46.6%	+0.52%	Only statistically meaningful sample
cryptoninjas_trading_anm	155	45.2%	+0.11%	Marginal edge, low confidence
Binance_Futures_Trades	27	40.7%	-0.22%	Insufficient sample
claycryp	7	85.7%	+2.70%	Too small
rarecryptosignals	6	50.0%	+0.15%	Too small
Tofan_Trade	3	0%	-212%	One RIVERUSDT at -636%
Trading_Crypto_Signals_Bitcoin	2	0%	0.0%	Empty signals

PnL here is measured against each signal's stop and target model, not a spot buy-and-hold return, so a single bad move on a volatile pair can print below -100 percent. Tofan's -212 percent is one RIVERUSDT trade at -636 percent over n=3, which is a degenerate sample, not a measurement. Only the top two rows have enough trades to mean anything.
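
One way to make "too small" concrete is a confidence interval on the win rate. The sketch below uses the Wilson score interval, which is an added illustration and not a calculation the backtest performed. The win counts are inferred by rounding from the reported percentages (46.6 percent of 646 is about 301 wins, 85.7 percent of 7 is 6 wins), so treat them as assumptions.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate, as fractions (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Win counts inferred from the scorecard percentages (assumption).
whales_low, whales_high = wilson_interval(wins=301, n=646)
clay_low, clay_high = wilson_interval(wins=6, n=7)

print(f"Crypto_Whales_Pumps_Guide: {whales_low:.3f} to {whales_high:.3f}")
print(f"claycryp:                  {clay_low:.3f} to {clay_high:.3f}")
```

With these inputs the 646-trade interval is only a few points wide, while the 7-trade interval runs from roughly 49 to 97 percent and still contains a coin flip, which is why an 85.7 percent headline on n=7 carries no information.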

Now put the advertised number next to the measured one, for the two channels where I have both. The advertised figures are the channels' own parsed win rates from an earlier audit; the measured figures are from the backtest above.

Channel	Advertised	Measured	n (measured)
Crypto_Whales_Pumps_Guide	78.9%	46.6%	646
cryptoninjas_trading_anm	76.3%	45.2%	155

I want to be precise about what this gap is and is not. It is not a fabricated win rate. Crypto_Whales actually cleared a +0.52 percent average after fees and slippage. The gap is survivorship plus staleness: the advertised number is computed over the trades the channel chose to show, after the fact, on a record it authored. The measured number is computed over everything, against prices it did not control.

Same source, two different records, because two different parties held the pen.

The finding the channel cannot see about itself

For Crypto_Whales, the only channel with enough data, breaking down by direction and year:

Year	Side	n	Win%	Avg PnL
2025	LONG	365	46.3%	+1.06%
2025	SHORT	86	54.7%	+1.83%
2026	LONG	120	28.3%	-2.23%
2026	SHORT	75	68.0%	+0.77%

SHORTs beat LONGs in both years, and the 2026 LONG collapse tracks a regime shift where altcoin longs got crushed. The edge in the data was on the short side. The channel brands itself as a "whale pump" tracker, which points its readers at longs. The free tier was advertising the opposite direction from where the measured edge was.
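
As a consistency check, the four rows of the breakdown should roll back up to the channel's headline scorecard figures when weighted by trade count. The sketch below does that with the table values as printed, so results are subject to rounding in the inputs; the helper names are illustrative.

```python
# (year, side, n, win_pct, avg_pnl_pct) copied from the breakdown table
rows = [
    (2025, "LONG", 365, 46.3, 1.06),
    (2025, "SHORT", 86, 54.7, 1.83),
    (2026, "LONG", 120, 28.3, -2.23),
    (2026, "SHORT", 75, 68.0, 0.77),
]


def weighted(subset, index):
    """Trade-count-weighted mean of column `index` over the given rows."""
    total_n = sum(r[2] for r in subset)
    return sum(r[2] * r[index] for r in subset) / total_n


def by_side(side):
    return [r for r in rows if r[1] == side]


total_n = sum(r[2] for r in rows)
overall_win = weighted(rows, 3)
overall_pnl = weighted(rows, 4)
long_pnl = weighted(by_side("LONG"), 4)
short_pnl = weighted(by_side("SHORT"), 4)

print(total_n, round(overall_win, 1), round(overall_pnl, 2))
print(round(long_pnl, 2), round(short_pnl, 2))
```

The weighted totals come back to 646 trades, 46.6 percent, and +0.52 percent, matching the scorecard row, and pooling both years still leaves the short side ahead of the long side on average PnL.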

Not out of malice. The channel has no way to know this, because it never measures its own outcomes against price. It only sees the trades it posted.

This is the whole point. Without tagging BTC regime at the moment each signal arrived, the 2026 collapse would have looked like the channel getting worse. With it, you can see it was a regime effect that any long-biased source would have suffered. Regime context only exists if you stamp it at signal time. Reconstruct it afterward and you inherit the same blind spot as the channel.
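
A sketch of what stamping regime at signal time could look like. Everything here is hypothetical: the `StampedSignal` record, the `stamp` helper, and especially the regime rule (price versus a long moving average), which stands in for whatever classifier the real system uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


def classify_regime(btc_price: float, btc_long_average: float) -> str:
    # Hypothetical rule: BTC at or above its long average counts as "risk_on".
    return "risk_on" if btc_price >= btc_long_average else "risk_off"


@dataclass(frozen=True)
class StampedSignal:
    pair: str
    side: str
    received_at: datetime
    btc_regime: str  # written once at receipt and never recomputed


def stamp(pair: str, side: str, btc_price: float, btc_long_average: float,
          now: Optional[datetime] = None) -> StampedSignal:
    """Attach the regime as observed at the moment the signal arrived."""
    return StampedSignal(
        pair=pair,
        side=side,
        received_at=now or datetime.now(timezone.utc),
        btc_regime=classify_regime(btc_price, btc_long_average),
    )


signal = stamp("ETHUSDT", "LONG", btc_price=90_000.0, btc_long_average=95_000.0)
print(signal.btc_regime)
```

The frozen dataclass is the design choice worth noting: the regime label cannot be edited once outcomes are known, so later analysis groups by what was observable at signal time.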

Why a published win rate cannot audit itself

Every layer here is the same shape. The channel decides which trades to announce and also reports on how those trades did. The decider and the reporter are the same party, so the record is flattering by construction, the same way a compliance checker that keeps signing off on its own work looks clean to everything downstream.

Arpit Gupta put the general version of it well: any system where the component that decides to act is also the component that reports on whether it should have is structurally blind to this exact failure.

The only reason I could see any of it is that the measurement lived somewhere the channel could not write to. Shadow mode against real prices is the external observer. Pull that out and you are left grading the channel on the channel's own receipts, which is no measurement at all.

Why I moved on

In May 2026 I deprecated Telegram as a source and pivoted to bot-footprint signals: liquidation cascades, open-interest surges, funding divergence, on-chain whale tape.

The intuition is to stop following what channels say and start following what large traders actually do, as revealed by their market footprint. A footprint is a consequence the actor cannot author. A win-rate screenshot is a record the actor authors completely.

The 97.4 percent staleness rate across staleness-checked candidates (91.7 percent across parseable signals) is empirical evidence that by the time a broadcast reaches you, the information is usually already priced in.

The honest claim

I did not prove the channels lie. I proved that the record I was allowed to check was incomplete in exactly the direction that makes the source look safer than it is. The advertised win rate is real, in the same way a green screenshot is real. It is a true record of the moments someone chose to write down.

The outcome is what happens after the last update, and that is the part nobody posts.

If you publish the win rate, you do not get to be the audit of it.

Top comments (1)

Mike Czerwinski • Jun 28

Postscript on what replaced it. The system runs on Hyperliquid perps now, live and shadow at the same time. Live puts real capital on a narrow set of selected signals from a few strategies. Shadow runs everything else as what-if: every candidate move is logged and tracked to an outcome it does not control. Shadow is the measurement layer, the part that can see what live's selection leaves on the table. The sources changed too: bot-footprint signals instead of broadcasts, meaning liquidation cascades, open-interest surges, funding divergence, and on-chain whale tape. The move is to stop reading what channels say and start reading what large traders cannot help but reveal. A footprint is a consequence the actor cannot author. A win-rate screenshot is a record the actor authors completely. I will write that one up when I have 17 months of it too.