Amos Augo

Real-Time Crypto Data Pipeline with Change Data Capture (CDC) Using PostgreSQL, Kafka, Cassandra, and Grafana

Introduction

In this project, I built a complete real-time cryptocurrency analytics system from the ground up, capable of ingesting, storing, and visualizing live crypto market data.

The system collects price and volume data from the Binance Exchange, streams it through Kafka (with Debezium CDC), stores it in Cassandra, and visualizes it live in Grafana.

This setup simulates a lightweight version of the real-time infrastructure used by trading platforms, financial dashboards, and risk monitoring systems, with an emphasis on scalability, fault tolerance, and live data analysis.


System Architecture Overview

Here’s the high-level flow of data through the pipeline:

Binance API → PostgreSQL → Debezium (CDC) → Kafka → Cassandra → Grafana

Components Breakdown

Component      | Technology       | Function
-------------- | ---------------- | -----------------------------------------
Data Source    | Binance API      | Provides live cryptocurrency data
CDC Engine     | Debezium         | Captures real-time changes in PostgreSQL
Message Broker | Apache Kafka     | Streams change events
Database       | Apache Cassandra | Stores processed market data
Visualization  | Grafana          | Real-time dashboard and analytics

Design Choice: No Sink Connector

In a typical Kafka-based architecture, a sink connector is used to automatically write streamed data into a destination like Cassandra.

However, in this project, I did not use a sink connector. Instead, I wrote a custom consumer script that listens to the Kafka topics and inserts the data into Cassandra itself (sketched after the list below).

Why I Chose This Approach

  1. Full Control Over Data Transformation: We could clean, transform, and enrich messages before writing them into Cassandra (for example, filtering noise, flattening JSON payloads, or converting timestamps).
  2. Better Error Handling and Debugging: We could log failed inserts, handle schema mismatches gracefully, and replay specific batches.
  3. Simpler to Debug During Development: For educational and experimental purposes, it was easier to see what was flowing through each stage of the pipeline.
  4. Future Extensibility: This approach lets us modify the consumer to perform real-time computations or alerts before persisting the data.
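
To make the trade-off concrete, here is a minimal sketch of the consumer pattern, not the project's exact code. It assumes the kafka-python and cassandra-driver packages, a Debezium topic named pg.public.latest_prices (the topic prefix is hypothetical), and Debezium's default timestamp encoding:

import json
from datetime import datetime, timezone

from kafka import KafkaConsumer          # kafka-python
from cassandra.cluster import Cluster    # cassandra-driver

# Hypothetical topic name; Debezium names topics <prefix>.<schema>.<table>.
consumer = KafkaConsumer(
    "pg.public.latest_prices",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
    auto_offset_reset="earliest",
)

session = Cluster(["localhost"]).connect("binance_keyspace")
insert = session.prepare(
    "INSERT INTO latest_prices (symbol, price, updated_at) VALUES (?, ?, ?)"
)

for message in consumer:
    if message.value is None:                        # skip tombstone records
        continue
    after = message.value.get("payload", {}).get("after")
    if after is None:                                # deletes carry no "after" state
        continue
    try:
        # Assumption: Debezium's default encoding serializes timestamps
        # as epoch microseconds.
        updated_at = datetime.fromtimestamp(after["updated_at"] / 1_000_000, tz=timezone.utc)
        session.execute(insert, (after["symbol"], float(after["price"]), updated_at))
    except Exception as exc:
        # Point 2 above: log the failed insert and keep consuming instead of crashing.
        print(f"Insert failed for {after.get('symbol')}: {exc}")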

How Data Flows Through the System

Binance WebSocket feeds continuously push ticker, order book, and kline (candlestick) updates.
These updates are first written into PostgreSQL tables for structured storage.

[Screenshot: Producer inserting data into Postgres]
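
The producer side can be approximated as follows (a simplified sketch assuming the websockets and psycopg2 packages, and a Postgres latest_prices table that mirrors the Cassandra schema, which is an assumption):

import asyncio
import json

import psycopg2
import websockets

async def stream_ticker(symbol: str = "ethusdt") -> None:
    # Placeholder connection settings for the local Docker setup.
    conn = psycopg2.connect(host="localhost", dbname="binance",
                            user="postgres", password="postgres")
    conn.autocommit = True
    url = f"wss://stream.binance.com:9443/ws/{symbol}@ticker"
    async with websockets.connect(url) as ws:
        async for raw in ws:
            tick = json.loads(raw)  # "s" = symbol, "c" = last price in the @ticker payload
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO latest_prices (symbol, price, updated_at) "
                    "VALUES (%s, %s, now())",
                    (tick["s"], float(tick["c"])),
                )

asyncio.run(stream_ticker())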

Debezium monitors PostgreSQL using Change Data Capture (CDC) and streams every row change into Kafka topics.
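
Registering the connector happens through the Kafka Connect REST API. Here is a sketch of that call; the host names, credentials, and topic prefix are illustrative assumptions:

import requests

connector = {
    "name": "binance-postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",        # logical decoding plugin built into Postgres 10+
        "database.hostname": "postgres",  # service name in the Docker network (assumed)
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "binance",
        "topic.prefix": "pg",             # topics become pg.<schema>.<table> (Debezium 2.x)
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()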

[Screenshot: Kafka topics]

Our Kafka consumer (a Python script) reads from those topics and writes structured data into Cassandra tables.

[Screenshot: Consumer streaming changes into Cassandra]

[Screenshot: Data in a Cassandra table]

Finally, Grafana queries either the Flask API (via the Infinity plugin) or Cassandra directly to visualize the live data.


Database Design

Cassandra Keyspace: binance_keyspace

1. crypto_24h_stats

Stores 24-hour performance metrics for each asset.

CREATE TABLE crypto_24h_stats (
    symbol text PRIMARY KEY,
    price_change_percent double,
    last_price double,
    high_price double,
    low_price double,
    volume double,
    quote_volume double,
    updated_at timestamp
);

2. latest_prices

Holds the latest streaming prices for every cryptocurrency.

CREATE TABLE latest_prices (
    symbol text,
    price double,
    updated_at timestamp,
    PRIMARY KEY (symbol, updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);
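Because updated_at is a descending clustering column, the newest price for a symbol is a single-row read. For example, with a cassandra-driver session (illustrative, not from the project):

# Newest-first clustering makes "latest price" a cheap LIMIT 1 query.
row = session.execute(
    "SELECT price, updated_at FROM latest_prices WHERE symbol = %s LIMIT 1",
    ("ETHUSDT",),
).one()
print(row.price, row.updated_at)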

3. klines

Stores candlestick (OHLCV) data for financial charting.

CREATE TABLE klines (
    symbol text,
    open_time timestamp,
    open_price double,
    high_price double,
    low_price double,
    close_price double,
    volume double,
    number_of_trades int,
    PRIMARY KEY ((symbol), open_time)
);

4. order_book

Captures market depth snapshots, showing the distribution of buy (bids) and sell (asks) orders for each asset.

CREATE TABLE order_book (
    symbol text,
    bids text,
    asks text,
    updated_at timestamp,
    PRIMARY KEY ((symbol), updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);
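Binance depth payloads carry bids and asks as arrays of [price, quantity] pairs, so the consumer serializes them to JSON text before inserting. A sketch, assuming a cassandra-driver session:

import json
from datetime import datetime, timezone

def insert_order_book(session, symbol: str, depth: dict) -> None:
    # "bids" and "asks" are lists of [price, qty] pairs in Binance's depth payload.
    session.execute(
        "INSERT INTO order_book (symbol, bids, asks, updated_at) "
        "VALUES (%s, %s, %s, %s)",
        (symbol, json.dumps(depth["bids"]), json.dumps(depth["asks"]),
         datetime.now(timezone.utc)),
    )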

5. recent_trades

Logs the most recent individual trades for each symbol, including their price, quantity, and timestamps.

CREATE TABLE recent_trades (
    symbol text,
    trade_id bigint,
    price double,
    quantity double,
    trade_time timestamp,
    is_buyer_maker boolean,
    PRIMARY KEY ((symbol), trade_id)
);

Grafana Integration

Grafana was connected in two ways:

  1. Cassandra Data Source Plugin – for direct database queries.
  2. Infinity Plugin (HTTP/JSON) – for the Flask API endpoints I experimented with.

This dual approach allowed us to:

  • Serve pre-aggregated data through the API as a backup connection (a sketch of such an endpoint follows this list).
  • Query Cassandra directly for time-series or historical analysis.
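
As an example of the API path, a Flask endpoint for the Infinity plugin might look like this (the route name and aggregation are illustrative, not the project's exact API):

from flask import Flask, jsonify
from cassandra.cluster import Cluster

app = Flask(__name__)
session = Cluster(["localhost"]).connect("binance_keyspace")

@app.route("/api/top-volume")
def top_volume():
    # Cassandra cannot ORDER BY a non-clustering column, so sort in the API layer.
    rows = session.execute("SELECT symbol, volume FROM crypto_24h_stats")
    top = sorted(rows, key=lambda r: r.volume, reverse=True)[:10]
    return jsonify([{"symbol": r.symbol, "volume": r.volume} for r in top])

if __name__ == "__main__":
    # Bind to 0.0.0.0 so Grafana inside Docker can reach it via host.docker.internal.
    app.run(host="0.0.0.0", port=5000)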

Key Visualizations

1. Top 10 Most Traded Cryptocurrencies

Displays the highest trading volume across all assets over the last 24 hours.

Query:

SELECT symbol, volume FROM crypto_24h_stats;

Visualization: Bar Chart
Insight: Quickly identifies the most active markets.

Example output: [bar chart screenshot]

2. Trade Count Overview (ETHUSDT)

Shows the number of trades executed for ETHUSDT over time.

Query:

SELECT open_time, number_of_trades 
FROM klines 
WHERE symbol = 'ETHUSDT' 
ALLOW FILTERING;

Visualization: Time Series / Line Chart
Insight: Useful for spotting spikes in market activity.

Example output: [time series screenshot]

Advantages of This Architecture

Advantage           | Description
------------------- | --------------------------------------------------------------------------
Modularity          | Each component can scale independently.
Transparency        | Every transformation stage (PostgreSQL → Kafka → Cassandra) is observable.
Custom Logic        | The manual consumer allows you to filter or compute before persisting.
Grafana Flexibility | Supports both direct and API-based data sources.
Resilience          | Cassandra ensures high availability and fast queries for time-series data.

Challenges Faced & How I Solved Them

Challenge                                                      | Solution
-------------------------------------------------------------- | ----------------------------------------------------
ORDER BY not supported for non-clustering columns in Cassandra | Redesigned tables with clustering keys (updated_at)
JSON array fields (order book) hard to query                   | Flattened data during Kafka consumer insertion
Grafana timeouts when using localhost                          | Switched to host.docker.internal for API access

Results

  • Real-time data ingestion from Binance API
  • Fully automated CDC from PostgreSQL to Cassandra
  • Dynamic dashboards for price, volume, and trade count analytics
  • Live financial-style visualizations similar to exchange UIs

The final Grafana dashboard effectively shows market volatility, trading activity, and price trends, all in real time.


Future Enhancements

  • Use Kafka Streams for real-time computations before writing to Cassandra.
  • Add alerting rules in Grafana (e.g., trigger if ETHUSDT price drops 5%).
  • Introduce machine learning models for predictive analytics.

Conclusion

This project demonstrates how open-source tools can be combined to build a real-time financial data analytics system with high performance and flexibility.

By avoiding a sink connector, I gained greater transparency, data control, and adaptability, all while maintaining a clean and observable data pipeline.

GitHub Repo
