Amos Augo

Real-Time Crypto Data Pipeline with Change Data Capture (CDC) Using PostgreSQL, Kafka, Cassandra, and Grafana

Introduction

In this project, I built a complete real-time cryptocurrency analytics system from the ground up, capable of ingesting, storing, and visualizing live crypto market data.

The system collects price and volume data from the Binance Exchange, streams it through Kafka (with Debezium CDC), stores it in Cassandra, and visualizes it live in Grafana.

This setup simulates a lightweight version of the real-time infrastructure used by trading platforms, financial dashboards, and risk monitoring systems, with an emphasis on scalability, fault tolerance, and live data analysis.


System Architecture Overview

Here’s the high-level flow of data through the pipeline:

Binance API → PostgreSQL → Debezium (CDC) → Kafka → Cassandra → Grafana

Components Breakdown

Component      | Technology       | Function
-------------- | ---------------- | -----------------------------------------
Data Source    | Binance API      | Provides live cryptocurrency data
CDC Engine     | Debezium         | Captures real-time changes in PostgreSQL
Message Broker | Apache Kafka     | Streams change events
Database       | Apache Cassandra | Stores processed market data
Visualization  | Grafana          | Real-time dashboard and analytics

Design Choice: No Sink Connector

In a typical Kafka-based architecture, a sink connector is used to automatically write streamed data into a destination like Cassandra.

However, in this project, I did not use a sink connector. Instead, I wrote a custom consumer script that listens to the Kafka topics and inserts the data into Cassandra itself (sketched after the list below).

Why I Chose This Approach

  1. Full Control Over Data Transformation: We could clean, transform, and enrich messages before writing them into Cassandra (for example, filtering noise, flattening JSON payloads, or converting timestamps).
  2. Better Error Handling and Debugging: We could log failed inserts, handle schema mismatches gracefully, and replay specific batches.
  3. Simpler to Debug During Development: For educational and experimental purposes, it was easier to see what was flowing through each stage of the pipeline.
  4. Future Extensibility: This approach lets us modify the consumer to perform real-time computations or alerts before persisting the data.
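
To make the trade-off concrete, here is a minimal sketch of the consumer pattern, not the project's exact code. It assumes the kafka-python and cassandra-driver packages, a Debezium topic named pg.public.latest_prices (the topic prefix is hypothetical), and Debezium's default timestamp encoding:

import json
from datetime import datetime, timezone

from kafka import KafkaConsumer          # kafka-python
from cassandra.cluster import Cluster    # cassandra-driver

# Hypothetical topic name; Debezium names topics <prefix>.<schema>.<table>.
consumer = KafkaConsumer(
    "pg.public.latest_prices",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
    auto_offset_reset="earliest",
)

session = Cluster(["localhost"]).connect("binance_keyspace")
insert = session.prepare(
    "INSERT INTO latest_prices (symbol, price, updated_at) VALUES (?, ?, ?)"
)

for message in consumer:
    if message.value is None:                        # skip tombstone records
        continue
    after = message.value.get("payload", {}).get("after")
    if after is None:                                # deletes carry no "after" state
        continue
    try:
        # Assumption: Debezium's default encoding serializes timestamps
        # as epoch microseconds.
        updated_at = datetime.fromtimestamp(after["updated_at"] / 1_000_000, tz=timezone.utc)
        session.execute(insert, (after["symbol"], float(after["price"]), updated_at))
    except Exception as exc:
        # Point 2 above: log the failed insert and keep consuming instead of crashing.
        print(f"Insert failed for {after.get('symbol')}: {exc}")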

How Data Flows Through the System

Binance WebSocket feeds continuously push ticker, order book, and kline (candlestick) updates.
These updates are first written into PostgreSQL tables for structured storage.

[Screenshot: Producer inserting data into Postgres]
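
The producer side can be approximated as follows (a simplified sketch assuming the websockets and psycopg2 packages, and a Postgres latest_prices table that mirrors the Cassandra schema, which is an assumption):

import asyncio
import json

import psycopg2
import websockets

async def stream_ticker(symbol: str = "ethusdt") -> None:
    # Placeholder connection settings for the local Docker setup.
    conn = psycopg2.connect(host="localhost", dbname="binance",
                            user="postgres", password="postgres")
    conn.autocommit = True
    url = f"wss://stream.binance.com:9443/ws/{symbol}@ticker"
    async with websockets.connect(url) as ws:
        async for raw in ws:
            tick = json.loads(raw)  # "s" = symbol, "c" = last price in the @ticker payload
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO latest_prices (symbol, price, updated_at) "
                    "VALUES (%s, %s, now())",
                    (tick["s"], float(tick["c"])),
                )

asyncio.run(stream_ticker())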

Debezium monitors PostgreSQL using Change Data Capture (CDC) and streams every row change into Kafka topics.
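
Registering the connector happens through the Kafka Connect REST API. Here is a sketch of that call; the host names, credentials, and topic prefix are illustrative assumptions:

import requests

connector = {
    "name": "binance-postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",        # logical decoding plugin built into Postgres 10+
        "database.hostname": "postgres",  # service name in the Docker network (assumed)
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "binance",
        "topic.prefix": "pg",             # topics become pg.<schema>.<table> (Debezium 2.x)
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()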

[Screenshot: Kafka topics]

Our Kafka consumer (a Python script) reads from those topics and writes structured data into Cassandra tables.

[Screenshot: Consumer streaming changes into Cassandra]

[Screenshot: Data in a Cassandra table]

Finally, Grafana queries either the Flask API (via the Infinity plugin) or Cassandra directly to visualize the live data.


Database Design

Cassandra Keyspace: binance_keyspace

1. crypto_24h_stats

Stores 24-hour performance metrics for each asset.

CREATE TABLE crypto_24h_stats (
    symbol text PRIMARY KEY,
    price_change_percent double,
    last_price double,
    high_price double,
    low_price double,
    volume double,
    quote_volume double,
    updated_at timestamp
);

2. latest_prices

Holds the latest streaming prices for every cryptocurrency.

CREATE TABLE latest_prices (
    symbol text,
    price double,
    updated_at timestamp,
    PRIMARY KEY (symbol, updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);
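Because updated_at is a descending clustering column, the newest price for a symbol is a single-row read. For example, with a cassandra-driver session (illustrative, not from the project):

# Newest-first clustering makes "latest price" a cheap LIMIT 1 query.
row = session.execute(
    "SELECT price, updated_at FROM latest_prices WHERE symbol = %s LIMIT 1",
    ("ETHUSDT",),
).one()
print(row.price, row.updated_at)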

3. klines

Stores candlestick (OHLCV) data for financial charting.

CREATE TABLE klines (
    symbol text,
    open_time timestamp,
    open_price double,
    high_price double,
    low_price double,
    close_price double,
    volume double,
    number_of_trades int,
    PRIMARY KEY ((symbol), open_time)
);

4. order_book

Captures market depth snapshots, showing the distribution of buy (bids) and sell (asks) orders for each asset.

CREATE TABLE order_book (
    symbol text,
    bids text,
    asks text,
    updated_at timestamp,
    PRIMARY KEY ((symbol), updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);
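Binance depth payloads carry bids and asks as arrays of [price, quantity] pairs, so the consumer serializes them to JSON text before inserting. A sketch, assuming a cassandra-driver session:

import json
from datetime import datetime, timezone

def insert_order_book(session, symbol: str, depth: dict) -> None:
    # "bids" and "asks" are lists of [price, qty] pairs in Binance's depth payload.
    session.execute(
        "INSERT INTO order_book (symbol, bids, asks, updated_at) "
        "VALUES (%s, %s, %s, %s)",
        (symbol, json.dumps(depth["bids"]), json.dumps(depth["asks"]),
         datetime.now(timezone.utc)),
    )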

5. recent_trades

Logs the most recent individual trades for each symbol, including their price, quantity, and timestamps.

CREATE TABLE recent_trades (
    symbol text,
    trade_id bigint,
    price double,
    quantity double,
    trade_time timestamp,
    is_buyer_maker boolean,
    PRIMARY KEY ((symbol), trade_id)
);

Grafana Integration

Grafana was connected in two ways:

  1. Cassandra Data Source Plugin – for direct database queries.
  2. Infinity Plugin (HTTP/JSON) – for the Flask API endpoints I experimented with.

This dual approach allowed us to:

  • Serve pre-aggregated data through the API as a backup connection (a sketch of such an endpoint follows this list).
  • Query Cassandra directly for time-series or historical analysis.
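
As an example of the API path, a Flask endpoint for the Infinity plugin might look like this (the route name and aggregation are illustrative, not the project's exact API):

from flask import Flask, jsonify
from cassandra.cluster import Cluster

app = Flask(__name__)
session = Cluster(["localhost"]).connect("binance_keyspace")

@app.route("/api/top-volume")
def top_volume():
    # Cassandra cannot ORDER BY a non-clustering column, so sort in the API layer.
    rows = session.execute("SELECT symbol, volume FROM crypto_24h_stats")
    top = sorted(rows, key=lambda r: r.volume, reverse=True)[:10]
    return jsonify([{"symbol": r.symbol, "volume": r.volume} for r in top])

if __name__ == "__main__":
    # Bind to 0.0.0.0 so Grafana inside Docker can reach it via host.docker.internal.
    app.run(host="0.0.0.0", port=5000)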

Key Visualizations

1. Top 10 Most Traded Cryptocurrencies

Displays the highest trading volume across all assets over the last 24 hours.

Query:

SELECT symbol, volume FROM crypto_24h_stats;

Visualization: Bar Chart
Insight: Quickly identifies the most active markets.

Example output: [bar chart screenshot]

2. Trade Count Overview (ETHUSDT)

Shows the number of trades executed for ETHUSDT over time.

Query:

SELECT open_time, number_of_trades 
FROM klines 
WHERE symbol = 'ETHUSDT' 
ALLOW FILTERING;

Visualization: Time Series / Line Chart
Insight: Useful for spotting spikes in market activity.

Example output: [time series screenshot]

Advantages of This Architecture

Advantage           | Description
------------------- | --------------------------------------------------------------------------
Modularity          | Each component can scale independently.
Transparency        | Every transformation stage (PostgreSQL → Kafka → Cassandra) is observable.
Custom Logic        | The manual consumer allows you to filter or compute before persisting.
Grafana Flexibility | Supports both direct and API-based data sources.
Resilience          | Cassandra ensures high availability and fast queries for time-series data.

Challenges Faced & How I Solved Them

Challenge                                                      | Solution
-------------------------------------------------------------- | ----------------------------------------------------
ORDER BY not supported for non-clustering columns in Cassandra | Redesigned tables with clustering keys (updated_at)
JSON array fields (order book) hard to query                   | Flattened data during Kafka consumer insertion
Grafana timeouts when using localhost                          | Switched to host.docker.internal for API access

Results

  • Real-time data ingestion from Binance API
  • Fully automated CDC from PostgreSQL to Cassandra
  • Dynamic dashboards for price, volume, and trade count analytics
  • Live financial-style visualizations similar to exchange UIs

The final Grafana dashboard effectively shows market volatility, trading activity, and price trends, all in real time.


Future Enhancements

  • Use Kafka Streams for real-time computations before writing to Cassandra.
  • Add alerting rules in Grafana (e.g., trigger if ETHUSDT price drops 5%).
  • Introduce machine learning models for predictive analytics.

Conclusion

This project demonstrates how open-source tools can be combined to build a real-time financial data analytics system with high performance and flexibility.

By avoiding a sink connector, I gained greater transparency, data control, and adaptability, all while maintaining a clean and observable data pipeline.

GitHub Repo
