I Killed My Crypto Signal Service Before Launching It — Walk-Forward Said Every Strategy Failed

Disclaimer: This article is for educational purposes only. It is not investment advice. Crypto trading carries significant risk. Do your own research and comply with your local regulations.

I was about to launch a paid crypto signal service. Three strategies, Telegram delivery, the whole thing.

I killed it. Here's why.

The Plan

I'd been writing publicly about a "3-strategy consensus" bot for months. The idea was simple. Pick three strategies with solid individual Sharpe ratios, fire a signal only when at least two agree, push that to a paid Telegram channel. Clean, conservative, explainable.

The three strategies:

| Strategy | Sharpe | What it does |
| --- | --- | --- |
| EMA Crossover (12/26) | 1.30 | Trend following |
| Parabolic SAR | 1.25 | Trend reversal |
| MACD (12/26/9) | 1.17 | Momentum |

All three numbers came from three years of BTC/USDT daily backtests. Paid launch was scheduled. Infrastructure was ready. I was maybe a week away from flipping a switch.

The First Red Flag

From April 1 to April 8, I watched the free channel to sanity-check the signals. Three pairs (BTC, ETH, SOL) times eight days equals 24 opportunities.

Buy: 0. Sell: 0. Hold: 24.

Eight straight days of nothing. Not a single signal across three pairs.

I assumed quiet markets. Checked the code. It was working as designed. Each strategy only fires on the exact day a crossover happens, and requiring two strategies to cross on the same day turns out to be statistically rare — roughly once or twice per quarter. Not a market issue. A design issue.
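
To make the design issue concrete, here's a minimal sketch of that event-only logic (pandas-based, with illustrative names; this is not the actual bot code). Each strategy's signal is non-zero only on the bar where its crossover happens, and the consensus needs two of those events to land on the same bar:

```python
# A minimal sketch of the event-only signal logic (illustrative, not the bot's code).
# Assumes a pandas Series of daily closes; each signal is +1/-1 only on the crossover bar.
import pandas as pd

def cross_events(fast_line: pd.Series, slow_line: pd.Series) -> pd.Series:
    above = fast_line > slow_line
    cross_up = above & ~above.shift(1, fill_value=True)      # +1 only on the bar of the upward cross
    cross_down = ~above & above.shift(1, fill_value=False)   # -1 only on the bar of the downward cross
    return cross_up.astype(int) - cross_down.astype(int)

def ema_signal(close: pd.Series, fast: int = 12, slow: int = 26) -> pd.Series:
    return cross_events(close.ewm(span=fast, adjust=False).mean(),
                        close.ewm(span=slow, adjust=False).mean())

def macd_signal(close: pd.Series) -> pd.Series:
    macd = close.ewm(span=12, adjust=False).mean() - close.ewm(span=26, adjust=False).mean()
    return cross_events(macd, macd.ewm(span=9, adjust=False).mean())

def consensus(close: pd.Series) -> pd.Series:
    # A SAR reversal signal would be the third column in the real thing.
    signals = pd.DataFrame({"ema": ema_signal(close), "macd": macd_signal(close)})
    buys = (signals == 1).sum(axis=1)
    sells = (signals == -1).sum(axis=1)
    verdict = pd.Series("HOLD", index=close.index)
    return verdict.mask(buys >= 2, "BUY").mask(sells >= 2, "SELL")
```

Sparse event streams like that rarely land on the same bar, which is exactly what an empty week of HOLDs looks like from the outside.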

That was the moment I stopped and asked a harder question.

The Question I Should Have Asked Earlier

I had individual backtests for each of the three strategies. I had never actually backtested the consensus rule itself. "Two strategies agreeing" is a new composite strategy, and I had no out-of-sample evidence that it worked.

So I ran Walk-Forward Optimization.

Walk-Forward Setup

Rolling windows, 365-day in-sample, 90-day out-of-sample, 90-day step. Three pairs (BTC/USDT, ETH/USDT, SOL/USDT). Three candidate strategies:

  1. Current consensus — fire when buy_count >= 2
  2. Variant A: EMA only — drop the consensus, use the best individual strategy
  3. Variant B: Sharpe-weighted — weight each strategy by its historical Sharpe, threshold ±1.0
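
For Variant B, my reading of the weighted vote is roughly this (a sketch, using the in-sample Sharpes from the table above as weights; the names are illustrative):

```python
# Sketch of the Variant B rule: Sharpe-weighted vote with a +-1.0 threshold (illustrative).
WEIGHTS = {"ema": 1.30, "sar": 1.25, "macd": 1.17}  # historical in-sample Sharpes

def weighted_vote(signals: dict[str, int], threshold: float = 1.0) -> str:
    """signals maps strategy name to its +1/0/-1 signal for the current bar."""
    score = sum(WEIGHTS[name] * sig for name, sig in signals.items())
    if score >= threshold:
        return "BUY"
    if score <= -threshold:
        return "SELL"
    return "HOLD"

print(weighted_vote({"ema": 1, "sar": 0, "macd": 0}))  # "BUY": a single vote already clears +-1.0
```

With these weights a single strategy firing already clears the threshold, which is presumably why Variant B trades roughly three times as often as the consensus in the results below.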

Data: Binance daily, Jan 2023 to Mar 2026. A bit over 1,100 bars per pair.

For significance I used the Deflated Sharpe Ratio (Bailey & López de Prado, 2014). DSR corrects for the selection bias that happens when you try multiple strategies and pick the best one; roughly, it is the probability that the observed Sharpe is genuinely positive once the number of trials, the sample length, and non-normal returns are accounted for. My threshold was DSR ≥ 0.95.

Here's the WFO config, in case it's useful:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WfoConfig:
    in_sample_days: int = 365
    out_of_sample_days: int = 90
    rolling_step_days: int = 90
    data_start: str = "2023-01-01"
    data_end: str = "2026-03-21"
```

Pretty standard. Nothing clever.
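
For completeness, here's roughly how the windows roll forward from that config (a sketch reusing the WfoConfig above, not the actual engine):

```python
from datetime import date, timedelta

def rolling_windows(cfg: WfoConfig):
    """Yield (is_start, is_end, oos_start, oos_end) until the OOS window runs past the data."""
    start = date.fromisoformat(cfg.data_start)
    data_end = date.fromisoformat(cfg.data_end)
    while True:
        is_end = start + timedelta(days=cfg.in_sample_days)
        oos_end = is_end + timedelta(days=cfg.out_of_sample_days)
        if oos_end > data_end:
            break
        yield start, is_end, is_end, oos_end
        start += timedelta(days=cfg.rolling_step_days)

for is_start, is_end, oos_start, oos_end in rolling_windows(WfoConfig()):
    print(f"IS {is_start} -> {is_end} | OOS {oos_start} -> {oos_end}")
```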

The Results

I'm going to show all nine cells. No cherry-picking.

| Strategy | Pair | OOS Sharpe | Trades | DSR | IS→OOS Decay | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| Current consensus | BTC | -0.812 | 14 | 0.005 | 2.05x | FAIL |
| Current consensus | ETH | -1.736 | 11 | 0.000 | 1.89x | FAIL |
| Current consensus | SOL | -1.974 | 10 | 0.000 | 3.12x | FAIL |
| Variant A: EMA | BTC | -10.409 | 18 | 0.000 | 11.38x | FAIL-FAIL |
| Variant A: EMA | ETH | -0.029 | 15 | 0.064 | 0.40x | FAIL |
| Variant A: EMA | SOL | -2.157 | 16 | 0.000 | 2.91x | FAIL |
| Variant B: Sharpe-weighted | BTC | 0.066 | 42 | 0.082 | 0.69x | MARGINAL |
| Variant B: Sharpe-weighted | ETH | -1.599 | 38 | 0.000 | 1.92x | FAIL |
| Variant B: Sharpe-weighted | SOL | 0.612 | 37 | 0.279 | 0.57x | MARGINAL |

Seven out of nine cells have negative out-of-sample Sharpe. The two positive cells don't clear the DSR significance threshold. Zero cells pass.

The worst cell is Variant A EMA on BTC: OOS Sharpe of -10.409 with an IS→OOS decay of 11.38x. That's the textbook shape of "the model memorized the in-sample period and the out-of-sample period walked directly into the opposite wall."

The Gate Review

I have a tool called gate-reviewer that runs a 4-Pitfall / 3-Gate framework against strategy results. I ran this one through it. The 3-Gate summary:

```
Gate 1 (Explainability):      FAIL
  "Multiple indicators agreeing = signal" is not empirically justified.
  There is no structural reason the edge should exist.

Gate 2 (Tail Safety):         FAIL
  Max drawdown threshold is 15%. Estimated DD across variants: -25% to -70%.
  No strategy-level circuit breaker.

Gate 3 (OOS Reproducibility): FAIL
  Sharpe min > 0: 7 out of 9 cells negative.
  Trade count per window: 10-42. Sample too small.

TOTAL VERDICT: DISCARD
```

All three gates fail. The verdict is blunt: discard the strategy.
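
If you want to sanity-check the Gate 2 number on your own equity curves, max drawdown is just the worst peak-to-trough drop. A minimal sketch (not the gate-reviewer code):

```python
import numpy as np

def max_drawdown(equity: np.ndarray) -> float:
    """Worst peak-to-trough drop of an equity curve, returned as a negative fraction."""
    running_peak = np.maximum.accumulate(equity)
    return float((equity / running_peak - 1.0).min())

print(max_drawdown(np.array([100.0, 120.0, 90.0, 110.0])))  # -0.25
```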

Why This Failed (Honest Post-Mortem)

Four things, in rough order of blame.

1. The data window was a bull market.
2023 through early 2026 was a structurally bullish period for BTC. All three strategies are trend-following. You'd expect them to look fine in-sample. The second the walk-forward window shifts to even a modestly choppier period, the model collapses. That's the signature of regime dependence.

2. The three strategies were not independent.
EMA, SAR, and MACD are all moving-average / momentum family. They see similar information. "Three strategies agreeing" sounded like three independent votes. It was closer to the same vote counted three times. There was no real ensemble diversity.

3. The design fires only on crossover instants.
Signals only trigger on the exact day a crossover happens, so the bot is constantly hunting for trend starting points. When the starting point turns out to be a reversal, the bot takes the wrong side. Inside a sustained trend, it fires nothing.

4. I ignored my own multiple-testing warning.
I wrote a whole article about the Deflated Sharpe Ratio and how trying N strategies and picking the best one inflates the apparent edge. Then I built a signal service that tries three strategies and picks the best-looking combination. I did not apply DSR correction to my own design. I wrote the warning label and then ate the poison.

That last one is the one that actually bothers me.
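
Point 2, at least, was checkable in five minutes before any of this. A quick diversity check on the raw signal streams would have shown it (a sketch; assumes each strategy's +1/0/-1 signals are columns of a DataFrame):

```python
import itertools
import pandas as pd

def diversity_report(signals: pd.DataFrame) -> pd.DataFrame:
    """How independent are the votes, really?

    `signals` holds one +1/0/-1 column per strategy. Reports pairwise correlation
    and the share of bars where both strategies take the same non-zero side;
    values near 1.0 mean the same vote is being counted twice.
    """
    rows = []
    for a, b in itertools.combinations(signals.columns, 2):
        both_active = (signals[a] != 0) & (signals[b] != 0)
        agree = (signals[a] == signals[b]) & both_active
        rows.append({
            "pair": f"{a}/{b}",
            "corr": signals[a].corr(signals[b]),
            "same_side_when_both_fire": agree.sum() / max(both_active.sum(), 1),
        })
    return pd.DataFrame(rows)
```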

What I'm Doing Instead

Three options were on the table:

  • A. Rebuild strategies, rerun Phase 0.
  • B. Kill the signal service. Sell the verification process instead.
  • C. Shut the whole thing down.

I ran this through an internal AI council I use for hard calls (three personas judging independently). Unanimous for B.

My own reasoning landed in the same place. Option A would be me rerunning the same multiple-testing loop I just failed, which is exactly the trap DSR exists to warn you about. Option C wastes a working backtest engine, a Telegram audience, and a 13-article public trail. Option B converts the Phase 0 failure from a dead end into the most honest piece of content I've ever written.

So from now on, I'm publishing:

  • A weekly "verification journal" on the free Telegram channel — every new strategy I test, pass or fail, with the numbers attached.
  • Technical deep-dives on Walk-Forward Optimization and DSR.
  • A paid report ("Phase 0: The Discard Record") covering the full WFO logs, all gate-reviewer outputs, and every reason the strategies failed. Coming soon.

Selling process instead of signals. Process is the thing I actually have.

What You Can Take From This

If you're building a signal service or any paid quantitative product, three takeaways.

Walk-forward your composite rules, not just your components. A consensus of good strategies is not automatically a good strategy. The composition has its own degrees of freedom and needs its own out-of-sample test.

Use DSR even for small N. Three strategies is already enough for selection bias to inflate your apparent Sharpe. Apply the correction. It takes ten lines of Python.
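
Here's roughly what those ten lines look like, following Bailey & López de Prado (2014). Treat it as a sketch and check it against the paper: the Sharpe inputs are per-bar (not annualized), and kurtosis is raw, not excess.

```python
import numpy as np
from scipy.stats import norm

def deflated_sharpe(sr_obs: float, n_trials: int, var_trial_sr: float,
                    n_returns: int, skew: float, kurt: float) -> float:
    """Probability the observed Sharpe is real after correcting for selection across n_trials."""
    emc = 0.5772156649  # Euler-Mascheroni constant
    # Expected maximum Sharpe if all n_trials strategies were pure noise
    sr0 = np.sqrt(var_trial_sr) * ((1 - emc) * norm.ppf(1 - 1 / n_trials)
                                   + emc * norm.ppf(1 - 1 / (n_trials * np.e)))
    denom = np.sqrt(1 - skew * sr_obs + (kurt - 1) / 4 * sr_obs ** 2)
    return float(norm.cdf((sr_obs - sr0) * np.sqrt(n_returns - 1) / denom))
```

For this project n_trials would be at least three, and honestly higher once you count every variant and parameter tweak tried along the way; var_trial_sr is the variance of the Sharpe ratios across those trials.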

Ship the failure. A public "here's the strategy I was going to sell and here's why I didn't" is worth more than another confident-sounding bot tutorial. There are a lot of the second kind. There are almost zero of the first.

I'll take "honest" over "impressive" every time now. Especially after this.


Reference

  • Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality. Journal of Portfolio Management.
