Paarthurnax

Why Your Crypto Bot Keeps Failing: The Data Quality Problem (And How to Fix It)

Your crypto bot keeps failing and you've already tried everything: tweaking the strategy, adjusting parameters, switching indicators. But the problem might not be your strategy at all. The most overlooked cause of crypto bot failures is data quality — garbage in, garbage out. This guide breaks down the five most common data quality problems that kill crypto bots, and exactly how to fix each one.

The Uncomfortable Truth About Crypto Data

Here's what nobody tells you when you start building a crypto bot: the data is terrible. Unlike equity markets with regulated exchanges and standardized data formats, crypto data is:

  • Fragmented across 500+ exchanges with different APIs
  • Riddled with gaps during high-volatility periods, exactly when you need it most
  • Contaminated with wash trading and fake volume
  • Inconsistently timestamped across time zones
  • Subject to exchange outages (exactly when you'd want to trade)

Your bot can have perfect logic and still fail because it's making decisions based on corrupted inputs.

Problem #1: Missing Candles

The most common data quality issue: missing OHLC candles during high-volume periods. When BTC makes a 10% move, exchanges get hammered with traffic. Their APIs start dropping requests. Your bot receives a stream with holes in it.

What it looks like:

2024-01-15 14:00: BTC = $43,200
2024-01-15 14:05: BTC = $43,800
2024-01-15 14:10: [MISSING]
2024-01-15 14:15: [MISSING]  
2024-01-15 14:20: BTC = $41,100  ← Looks like a crash, was actually a gap

How it breaks your bot:
Your RSI calculation now has gaps. The rolling average gets confused. Your bot sees an apparent crash and fires a sell signal that shouldn't exist.

The fix:

import pandas as pd
import numpy as np

def validate_and_fill_gaps(df, expected_interval_minutes=5, max_gap_minutes=30):
    """
    Detect and fill missing candles in OHLC data.

    Args:
        df: DataFrame with DatetimeIndex and OHLC columns
        expected_interval_minutes: Expected time between candles
        max_gap_minutes: Max gap to fill (larger gaps = data outage, skip)
    """
    df = df.copy()
    df.index = pd.to_datetime(df.index)
    df = df.sort_index()

    # Create expected full timeline
    full_index = pd.date_range(
        start=df.index[0],
        end=df.index[-1],
        freq=f"{expected_interval_minutes}min"
    )

    # Find missing candles
    missing = full_index.difference(df.index)
    if len(missing) > 0:
        print(f"⚠️ Found {len(missing)} missing candles")

        # Only fill small gaps (API hiccups)
        # Don't fill large gaps (exchange outages) - mark as unreliable
        gaps = []
        for ts in missing:
            nearest_before = df.index[df.index < ts]
            nearest_after = df.index[df.index > ts]

            if len(nearest_before) > 0 and len(nearest_after) > 0:
                gap_size = (nearest_after[0] - nearest_before[-1]).total_seconds() / 60
                gaps.append({"timestamp": ts, "gap_minutes": gap_size})

        gaps_df = pd.DataFrame(gaps)

        # Fill small gaps with forward fill (acceptable approximation)
        small_gaps = gaps_df[gaps_df["gap_minutes"] <= max_gap_minutes]["timestamp"]
        large_gaps = gaps_df[gaps_df["gap_minutes"] > max_gap_minutes]["timestamp"]

        # Track reliability up front so the column exists even when there
        # are no large gaps (otherwise the fillna below raises KeyError)
        df["reliable"] = True

        if len(large_gaps) > 0:
            print(f"🔴 {len(large_gaps)} large gaps detected (>{max_gap_minutes} min). "
                  "Data unreliable during these periods.")
            # Mark ±5 candles around each large gap as unreliable
            for gap_ts in large_gaps:
                window_start = gap_ts - pd.Timedelta(minutes=expected_interval_minutes * 5)
                window_end = gap_ts + pd.Timedelta(minutes=expected_interval_minutes * 5)
                df.loc[window_start:window_end, "reliable"] = False

        # Reindex against the full timeline, then fill
        df = df.reindex(full_index)
        df["reliable"] = df["reliable"].fillna(True).astype(bool)  # new candles assumed reliable

        # Forward fill prices for small gaps (large-gap rows get filled too,
        # but are already flagged unreliable above). A missing candle means
        # no recorded trades, so volume gets 0 rather than a carried-forward value.
        for col in ["open", "high", "low", "close"]:
            if col in df.columns:
                df[col] = df[col].ffill()
        if "volume" in df.columns:
            df["volume"] = df["volume"].fillna(0)

        print(f"✅ Filled {len(small_gaps)} small gaps with forward fill")

    else:
        print("✅ No missing candles detected")
        df["reliable"] = True

    return df

# Usage
raw_data = get_binance_klines("BTCUSDT", "5m", 1000)
clean_data = validate_and_fill_gaps(raw_data, expected_interval_minutes=5)

# Filter to only reliable candles for strategy decisions
reliable_data = clean_data[clean_data["reliable"]]

Problem #2: Timestamp Inconsistency

Different exchanges use different timestamp formats. Some return Unix milliseconds, some Unix seconds, some ISO 8601 strings. Your bot processes them all and doesn't notice when one is off by 1000x (milliseconds vs seconds).

The horror:

# Exchange A returns: 1705320000000 (milliseconds)
# Exchange B returns: 1705320000 (seconds)
# Your bot doesn't notice... until:
pd.to_datetime(1705320000000, unit="s")  # raises OutOfBoundsDatetime — that's year ~56,000

The fix:

def normalize_timestamp(ts):
    """Safely convert any timestamp format to pandas Timestamp."""
    if isinstance(ts, str):
        # ISO 8601 string
        return pd.to_datetime(ts, utc=True)
    elif isinstance(ts, (int, float)):
        # Unix timestamp
        if ts > 1e12:  # Milliseconds (> year 2001 in ms)
            return pd.to_datetime(ts, unit="ms", utc=True)
        else:  # Seconds
            return pd.to_datetime(ts, unit="s", utc=True)
    else:
        return pd.to_datetime(ts, utc=True)

def validate_timestamps(df):
    """Check for obvious timestamp problems."""
    now = pd.Timestamp.now(tz="UTC")

    # Sort first, so the endpoint checks below actually inspect
    # the earliest and latest timestamps
    if not df.index.is_monotonic_increasing:
        print("⚠️ Out-of-order timestamps detected. Sorting...")
        df = df.sort_index()

    if df.index[-1] > now:
        raise ValueError(f"Future timestamps detected: {df.index[-1]}")

    if df.index[0] < pd.Timestamp("2009-01-01", tz="UTC"):
        raise ValueError(f"Pre-Bitcoin timestamps detected: {df.index[0]}")

    return df
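As a sanity check, all three formats should land on the same instant. The Unix values below both correspond to 2024-01-15 12:00 UTC; the helper is repeated from above so the snippet runs standalone:

```python
import pandas as pd

def normalize_timestamp(ts):
    """Safely convert any timestamp format to a pandas Timestamp (as above)."""
    if isinstance(ts, str):
        return pd.to_datetime(ts, utc=True)  # ISO 8601 string
    elif isinstance(ts, (int, float)):
        if ts > 1e12:  # milliseconds
            return pd.to_datetime(ts, unit="ms", utc=True)
        return pd.to_datetime(ts, unit="s", utc=True)  # seconds
    return pd.to_datetime(ts, utc=True)

a = normalize_timestamp(1705320000000)           # Exchange A: milliseconds
b = normalize_timestamp(1705320000)              # Exchange B: seconds
c = normalize_timestamp("2024-01-15T12:00:00Z")  # Exchange C: ISO 8601

assert a == b == c  # all three resolve to 2024-01-15 12:00:00 UTC
```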

Problem #3: Wash Trading and Fake Volume

An estimated 50-80% of reported crypto volume is fake. Exchanges wash-trade to appear more liquid. This matters for your bot because volume signals (like "volume surge = momentum") become meaningless.

The symptoms:

  • Your volume-based strategy fires constantly on low-liquidity altcoins
  • Volume spikes that don't correlate with price movement
  • Identical volume numbers repeated across consecutive candles
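That last symptom is cheap to screen for mechanically. Here's a minimal sketch; `repeated_volume_ratio` is a hypothetical helper, and a `volume` column is assumed:

```python
import pandas as pd

def repeated_volume_ratio(df):
    """Fraction of candles whose volume exactly equals the previous candle's.

    Real order flow almost never produces identical consecutive volumes,
    so a high ratio is a wash-trading red flag.
    """
    repeats = (df["volume"].diff() == 0)
    return repeats.mean()

# Toy example: a suspicious feed repeating the same volume (made-up numbers)
df = pd.DataFrame({"volume": [120.5, 120.5, 120.5, 98.1, 120.5, 120.5]})
print(f"{repeated_volume_ratio(df):.0%} of candles repeat the prior volume")
```

Anything much above a few percent on a liquid pair deserves a closer look.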

The fix: Use CMC/CoinGecko's adjusted volume instead of raw exchange data:

import requests

def get_adjusted_volume(coin_id="bitcoin"):
    """
    Use CoinGecko's aggregated market data, which applies its Trust Score
    methodology to discount wash trading and fake volume.
    """
    url = f"https://api.coingecko.com/api/v3/coins/{coin_id}"
    params = {"localization": "false", "tickers": "false", "community_data": "false"}

    data = requests.get(url, params=params, timeout=10).json()

    market_data = data.get("market_data", {})

    return {
        "total_volume_usd": market_data.get("total_volume", {}).get("usd", 0),
        "market_cap": market_data.get("market_cap", {}).get("usd", 0),
        "volume_to_mcap_ratio": (
            market_data.get("total_volume", {}).get("usd", 0) /
            max(market_data.get("market_cap", {}).get("usd", 1), 1)
        )
    }

def is_volume_suspicious(volume_to_mcap_ratio):
    """
    A volume/mcap ratio > 1.0 daily is extremely suspicious.
    Most legitimate coins: 0.01 - 0.3
    Suspicious: > 0.5
    Almost certainly wash trading: > 1.0
    """
    if volume_to_mcap_ratio > 1.0:
        return True, "🚨 Volume > Market Cap — almost certainly wash trading"
    elif volume_to_mcap_ratio > 0.5:
        return True, "⚠️ High volume/mcap ratio — possible wash trading"
    else:
        return False, "✅ Volume looks legitimate"

Problem #4: Stale Data in Production

Your backtest uses perfectly clean historical data. Your live bot uses whatever the API returns right now — which might be cached, delayed, or from a different trading pair than you expect.

The fix: Always validate freshness:

from datetime import datetime, timezone

class DataFreshnessError(Exception):
    pass

def validate_data_freshness(df, max_age_minutes=5):
    """Ensure data is fresh enough for live trading signals.

    Assumes a tz-aware UTC DatetimeIndex.
    """
    now = datetime.now(timezone.utc)
    latest_candle = df.index[-1].to_pydatetime()
    age_minutes = (now - latest_candle).total_seconds() / 60

    if age_minutes > max_age_minutes:
        raise DataFreshnessError(
            f"Data is {age_minutes:.1f} minutes old. Max allowed: {max_age_minutes}. "
            f"Possible API outage or network issue. Skipping signal generation."
        )

    return True

# Usage in your bot loop
try:
    df = get_live_data()
    validate_data_freshness(df, max_age_minutes=5)
    signal = generate_signal(df)
except DataFreshnessError as e:
    send_telegram_alert(f"⚠️ Data quality issue: {e}")
    # Don't generate signals on stale data

Problem #5: Survivorship Bias in Backtesting

Your backtest only uses coins that exist today. The hundreds of coins that went to zero in 2022 aren't in your dataset. So your "diversified portfolio" backtest looks great because it only includes survivors.

The fix: Include delisted coins in your backtest universe. This is harder, but CoinGecko includes some historical data for delisted coins. At minimum, be aware that your backtest portfolio is optimistic.
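To see how strongly this skews results, here's a toy example with made-up returns for a hypothetical five-coin universe, two of which went to zero and got delisted:

```python
import pandas as pd

# Hypothetical 2022 returns for a five-coin universe (made-up numbers).
# coin_d and coin_e went to zero and were delisted — they vanish
# from today's data sources.
universe = pd.Series({
    "coin_a": 0.40,   # survivor
    "coin_b": 0.10,   # survivor
    "coin_c": -0.20,  # survivor
    "coin_d": -1.00,  # delisted, went to zero
    "coin_e": -1.00,  # delisted, went to zero
})

survivors_only = universe[universe > -1.0].mean()  # what your backtest sees
full_universe = universe.mean()                    # what actually happened

print(f"survivors: {survivors_only:+.1%}, full universe: {full_universe:+.1%}")
```

The survivor-only average is +10%, while the full universe averaged -34%. Same strategy, same year, wildly different conclusion.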

Building a Data Quality Dashboard

Combine all these checks into a daily data quality report:

def run_data_quality_check(coins=["bitcoin", "ethereum", "solana"]):
    """Run comprehensive data quality check and send Telegram report."""
    results = []

    for coin in coins:
        try:
            # Check volume legitimacy
            vol_data = get_adjusted_volume(coin)
            suspicious, reason = is_volume_suspicious(vol_data["volume_to_mcap_ratio"])

            # Check data freshness (for live systems)
            # ... additional checks

            results.append({
                "coin": coin,
                "volume_mcap_ratio": vol_data["volume_to_mcap_ratio"],
                "suspicious": suspicious,
                "reason": reason
            })
        except Exception as e:
            results.append({"coin": coin, "error": str(e)})

    # Format and send report
    report = "📊 Data Quality Report\n\n"
    for r in results:
        if "error" in r:
            report += f"{r['coin']}: Error - {r['error']}\n"
        else:
            emoji = "⚠️" if r["suspicious"] else ""
            report += f"{emoji} {r['coin']}: V/MC={r['volume_mcap_ratio']:.3f}\n"

    send_telegram_alert(report)

run_data_quality_check()

Next Steps

Data quality is the unglamorous foundation that every successful trading system is built on. Fix these five problems and your bot will immediately fail less — not because the strategy improved, but because it's making decisions based on reality instead of corrupted inputs.

The OpenClaw skills marketplace includes the TechAnalyzer and CryptoScanner skills which both include basic data validation: https://paarthurnax970-debug.github.io/cryptoclawskills/


Get the full data validation toolkit included with the Home AI Agent Kit.


Disclaimer: Cryptocurrency trading involves substantial risk of loss. Data quality improvements do not guarantee profitable trading. This article is for educational purposes only and does not constitute financial or investment advice.
