<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Siraj Syed</title>
    <description>The latest articles on DEV Community by Siraj Syed (@siraj_syed_a122e4986ce967).</description>
    <link>https://dev.to/siraj_syed_a122e4986ce967</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2822318%2F7f921fdb-6760-4cae-8d6e-1a115086b994.jpg</url>
      <title>DEV Community: Siraj Syed</title>
      <link>https://dev.to/siraj_syed_a122e4986ce967</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siraj_syed_a122e4986ce967"/>
    <language>en</language>
    <item>
      <title>Part 2: Syncing Normalized PostgreSQL Data to Denormalized ClickHouse Using Airbyte + DBT</title>
      <dc:creator>Siraj Syed</dc:creator>
      <pubDate>Sat, 10 May 2025 15:02:37 +0000</pubDate>
      <link>https://dev.to/siraj_syed_a122e4986ce967/part-2-syncing-normalized-postgresql-data-to-denormalized-clickhouse-using-airbyte-dbt-2hic</link>
      <guid>https://dev.to/siraj_syed_a122e4986ce967/part-2-syncing-normalized-postgresql-data-to-denormalized-clickhouse-using-airbyte-dbt-2hic</guid>
      <description>&lt;h2&gt;
  
  
  From Transactional Trenches to Analytical Ascent: PostgreSQL to ClickHouse with Airbyte and DBT
&lt;/h2&gt;

&lt;p&gt;In Part 1, we delved into the fundamental reasons why shoehorning your PostgreSQL data model directly into ClickHouse is a recipe for analytical sluggishness. We highlighted the contrasting strengths of row-oriented OLTP databases like PostgreSQL and column-oriented OLAP powerhouses like ClickHouse.&lt;/p&gt;

&lt;p&gt;Now, let's roll up our sleeves and translate that theory into a tangible, real-world solution. In this article, we'll embark on a journey to build a robust data pipeline that seamlessly syncs your normalized Online Transaction Processing (OLTP) data residing in PostgreSQL into a highly performant, denormalized schema optimized for Online Analytical Processing (OLAP) within ClickHouse.&lt;/p&gt;

&lt;p&gt;Our trusty companions on this expedition will be Airbyte for Change Data Capture (CDC) based ingestion and dbt (data build tool) for elegant transformations and nimble schema evolution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Architectural Blueprint
&lt;/h3&gt;

&lt;p&gt;Here's a visual representation of the data flow we'll be constructing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl8eqtjnko8og6i6w8oq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl8eqtjnko8og6i6w8oq.png" alt="Image description" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Laying the Foundation - Defining Your Source Schema in PostgreSQL
&lt;/h3&gt;

&lt;p&gt;Let's consider a common scenario: a basic e-commerce application. Our transactional data in PostgreSQL is structured in a normalized fashion, ensuring data integrity and minimizing redundancy for efficient writes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- users&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- orders&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This normalized structure, with separate &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables linked by foreign keys, is ideal for handling transactional operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Setting Sail with Airbyte - Ingesting Data from Postgres&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Airbyte steps in as our reliable vessel for data ingestion. Its robust support for CDC (Change Data Capture) via the PostgreSQL Write-Ahead Log (WAL) allows us to stream changes in near real-time into ClickHouse. This approach ensures low latency and captures every modification made to our source data.&lt;/p&gt;

&lt;p&gt;To get this working, you'll need to configure Airbyte with the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source:&lt;/strong&gt; Connect to your PostgreSQL instance. Ensure you've enabled logical replication and created a replication slot, as these are prerequisites for CDC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination:&lt;/strong&gt; Configure your ClickHouse instance as the destination. Leverage the HTTP destination with compression for efficient data transfer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sync Mode:&lt;/strong&gt; Choose a sync mode that supports incremental updates with change tracking. "Incremental + Append" or a dedicated "CDC" mode (if available for the Postgres connector) are suitable options.&lt;/li&gt;
&lt;/ul&gt;
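&lt;p&gt;On the PostgreSQL side, those CDC prerequisites look roughly like this (the slot and publication names are placeholders, not anything Airbyte mandates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- enable logical decoding; requires a server restart
ALTER SYSTEM SET wal_level = logical;

-- create a replication slot using the built-in pgoutput plugin
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');

-- publish the tables you want Airbyte to track
CREATE PUBLICATION airbyte_publication FOR TABLE users, orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;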

&lt;p&gt;Upon successful configuration, Airbyte will land the raw data in ClickHouse within tables named something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;_airbyte_raw_users&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_airbyte_raw_orders&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data within these tables will typically be structured as raw JSON blobs, with each row containing metadata and the actual data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_ab_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a unique identifier for the Airbyte record"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_ab_emitted_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timestamp of when Airbyte processed the record"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Crafting Insights with DBT - Transformation and Denormalization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now comes the crucial step of shaping this raw data into an analytical powerhouse. This is where dbt shines. By connecting dbt to your ClickHouse instance (using adapters like &lt;code&gt;dbt-clickhouse&lt;/code&gt;), you can write SQL-based models to extract, transform, and load the data into your desired denormalized schema.&lt;/p&gt;
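&lt;p&gt;As a sketch, a minimal &lt;code&gt;profiles.yml&lt;/code&gt; for the &lt;code&gt;dbt-clickhouse&lt;/code&gt; adapter might look like this (host, credentials, and schema are placeholder values; check the adapter docs for the full option list):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;clickhouse_analytics:
  target: dev
  outputs:
    dev:
      type: clickhouse
      host: localhost
      port: 8123          # ClickHouse HTTP interface
      user: default
      password: ""
      schema: analytics   # target database for dbt models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;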

&lt;p&gt;Let's look at an example dbt model, &lt;code&gt;orders_flat.sql&lt;/code&gt;, that denormalizes the &lt;code&gt;orders&lt;/code&gt; data by joining it with relevant information from the &lt;code&gt;users&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;raw_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;JSONExtractString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_airbyte_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;JSONExtractString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_airbyte_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'user_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;toDecimal128OrZero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JSONExtractString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_airbyte_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'total_amount'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parseDateTimeBestEffort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;JSONExtractString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_airbyte_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'created_at'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;_airbyte_raw_orders&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;enriched_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
      &lt;span class="n"&gt;JSONExtractString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_airbyte_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;JSONExtractString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_airbyte_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;JSONExtractString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_airbyte_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'email'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;_airbyte_raw_users&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;enriched_orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We first extract the relevant fields from the raw JSON data ingested by Airbyte using ClickHouse's JSON functions like &lt;code&gt;JSONExtractString&lt;/code&gt;. We also perform basic type casting.&lt;/li&gt;
&lt;li&gt;Then, we join the extracted &lt;code&gt;orders&lt;/code&gt; data with the relevant fields from the &lt;code&gt;users&lt;/code&gt; data based on the &lt;code&gt;user_id&lt;/code&gt;. This denormalizes the data, bringing related information into a single table.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Optimizing for Speed in ClickHouse - Partitioning and Materialization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To truly unlock ClickHouse's analytical prowess, we need to structure our tables for optimal query performance. Partitioning and ordering are key techniques. Let's materialize our &lt;code&gt;enriched_orders&lt;/code&gt; model into a ClickHouse table with these optimizations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders_flat&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toYYYYMM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;enriched_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's why this is important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ENGINE = MergeTree&lt;/code&gt;:&lt;/strong&gt; This is a family of powerful table engines in ClickHouse designed for high-performance data processing and analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY toYYYYMM(created_at)&lt;/code&gt;:&lt;/strong&gt; Partitioning the data by year and month of the &lt;code&gt;created_at&lt;/code&gt; column allows ClickHouse to efficiently skip irrelevant data during queries that filter by date ranges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY (user_id, created_at)&lt;/code&gt;:&lt;/strong&gt; Specifying an order key helps ClickHouse organize the data within each partition, enabling faster data retrieval for queries that filter or sort by these columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AS SELECT * FROM enriched_orders&lt;/code&gt;:&lt;/strong&gt; This creates the table and populates it with the results of our dbt transformation.&lt;/li&gt;
&lt;/ul&gt;
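&lt;p&gt;If you manage this table through dbt rather than hand-written DDL, the &lt;code&gt;dbt-clickhouse&lt;/code&gt; adapter exposes the same settings as model config. A sketch of the header of &lt;code&gt;orders_flat.sql&lt;/code&gt; (option names as commonly used by the adapter; verify against your version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;{{ config(
    materialized = 'table',
    engine       = 'MergeTree()',
    partition_by = 'toYYYYMM(created_at)',
    order_by     = '(user_id, created_at)'
) }}

-- ...the CTEs from Step 3 follow here...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;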

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Unleashing Analytical Power - Querying Your Denormalized Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With our denormalized and optimized data now residing in ClickHouse, we can execute analytical queries that would be prohibitively slow on our normalized PostgreSQL database, especially on large datasets.&lt;/p&gt;

&lt;p&gt;For example, to count daily orders and calculate total revenue over the last 30 days:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_flat&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 day'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query, leveraging ClickHouse's columnar storage and indexing capabilities, will execute almost instantly, providing valuable business insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automating Your Data Pipeline with CI/CD&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To ensure a smooth and reliable data flow, consider automating your dbt transformations. You can achieve this by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Using DBT Cloud:&lt;/strong&gt; This managed service provides a web-based interface for developing, scheduling, and monitoring your dbt projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementing CI/CD Pipelines:&lt;/strong&gt; Integrate your dbt runs into your Continuous Integration/Continuous Deployment (CI/CD) pipelines (e.g., using GitLab CI, GitHub Actions) to automatically trigger transformations whenever new code is merged or on a scheduled basis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Contracts and Schema Registry (Optional):&lt;/strong&gt; For more complex environments, consider implementing data contracts or using a schema registry to track and manage schema changes across your PostgreSQL and ClickHouse systems, preventing breaking changes.&lt;/li&gt;
&lt;/ul&gt;
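&lt;p&gt;For the CI/CD option, a minimal GitHub Actions workflow that runs dbt on merge and on a schedule might look like this (the workflow name, secret name, and profile location are illustrative assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: dbt-transformations
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 * * * *"   # hourly
jobs:
  dbt-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-clickhouse
      - run: dbt run --profiles-dir .
        env:
          CLICKHOUSE_PASSWORD: ${{ secrets.CLICKHOUSE_PASSWORD }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;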

&lt;h3&gt;
  
  
  &lt;strong&gt;Gotchas to Navigate&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While this pattern is powerful, here are some common pitfalls and their solutions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ClickHouse schema drift&lt;/td&gt;
&lt;td&gt;Use DBT to re-materialize views and avoid dynamic columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timestamp mismatches&lt;/td&gt;
&lt;td&gt;Normalize to UTC early in the pipeline (ideally within dbt)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NULL vs empty string in JSON&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;ifNull()&lt;/code&gt; / &lt;code&gt;coalesce()&lt;/code&gt; or &lt;code&gt;assumeNotNull()&lt;/code&gt; carefully in ClickHouse queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data bloat in Airbyte raw tables&lt;/td&gt;
&lt;td&gt;Apply retention policies or configure auto-dropping of raw staging tables&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
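&lt;p&gt;For the NULL-vs-empty-string and timestamp rows in particular, the fix can live directly in the extraction layer of your dbt model; an illustrative snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
  -- JSONExtractString returns '' for missing keys; convert that back to NULL
  nullIf(JSONExtractString(_airbyte_data, 'email'), '') AS email,
  -- normalize timestamps to UTC as early as possible
  toTimeZone(
    parseDateTimeBestEffort(JSONExtractString(_airbyte_data, 'created_at')),
    'UTC'
  ) AS created_at_utc
FROM _airbyte_raw_users;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;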

&lt;h3&gt;
  
  
  Final Thoughts: A Powerful Paradigm
&lt;/h3&gt;

&lt;p&gt;This architecture, leveraging the strengths of PostgreSQL for transactional integrity and ClickHouse for analytical speed, orchestrated by Airbyte for seamless ingestion and dbt for elegant transformation, offers a compelling solution for modern data pipelines.&lt;/p&gt;

&lt;p&gt;It allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Safeguard your OLTP workloads&lt;/strong&gt; in PostgreSQL without compromising analytical performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offload demanding analytics&lt;/strong&gt; to the lightning-fast columnar engine of ClickHouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain a clear separation of concerns&lt;/strong&gt;, avoiding the complexities of trying to fit an analytical workload onto a transactional database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future-proof your data infrastructure&lt;/strong&gt; by adopting decoupled and specialized tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Airbyte and dbt as your allies, your data becomes a fluid asset, readily transformed and analyzed to drive meaningful insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Up: Sharing Your Analytical Treasures
&lt;/h3&gt;

&lt;p&gt;In Part 3, we'll explore how to expose your meticulously crafted ClickHouse OLAP layer as an API or embeddable dashboard for your customers, all while implementing robust access controls and cost attribution strategies. Stay tuned!&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>dbt</category>
      <category>airbyte</category>
      <category>clickhouse</category>
    </item>
    <item>
      <title>Designing Data Models That Work for Both PostgreSQL and ClickHouse: A Developer’s Guide</title>
      <dc:creator>Siraj Syed</dc:creator>
      <pubDate>Fri, 09 May 2025 02:27:58 +0000</pubDate>
      <link>https://dev.to/siraj_syed_a122e4986ce967/designing-data-models-that-work-for-both-postgresql-and-clickhouse-a-developers-guide-3e4d</link>
      <guid>https://dev.to/siraj_syed_a122e4986ce967/designing-data-models-that-work-for-both-postgresql-and-clickhouse-a-developers-guide-3e4d</guid>
      <description>&lt;p&gt;Modern applications are increasingly architected with &lt;strong&gt;PostgreSQL&lt;/strong&gt; for OLTP (transactions) and &lt;strong&gt;ClickHouse&lt;/strong&gt; for OLAP (analytics). This hybrid design gives you the best of both worlds: reliable writes and blazing-fast reads.&lt;/p&gt;

&lt;p&gt;But here’s the catch—you can’t model data the same way in both. A normalized model that’s perfect for Postgres could kill performance in ClickHouse. And a denormalized, flattened schema for ClickHouse might break constraints and business logic in Postgres.&lt;/p&gt;

&lt;p&gt;So how do you design a model that works well enough across both?&lt;/p&gt;

&lt;p&gt;Let’s walk through the &lt;strong&gt;key principles, trade-offs, and best practices&lt;/strong&gt; for dual-target data modeling that won’t leave you regretting schema decisions six months later.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Use PostgreSQL + ClickHouse?
&lt;/h2&gt;

&lt;p&gt;Before diving into data modeling, here’s the high-level architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL:&lt;/strong&gt; Primary source of truth. Handles transactions, constraints, and app-level logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse:&lt;/strong&gt; Secondary analytical store. Optimized for fast aggregates, filtering, time-series analysis, and dashboards.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common patterns for data syncing:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sync via Debezium / Kafka&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Periodic ETL using Airflow or DBT&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Event-based architecture using CDC (Change Data Capture)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Normalize in PostgreSQL, Denormalize in ClickHouse
&lt;/h2&gt;

&lt;p&gt;PostgreSQL loves third normal form. ClickHouse doesn’t.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;ClickHouse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Join performance&lt;/td&gt;
&lt;td&gt;Efficient with indexes&lt;/td&gt;
&lt;td&gt;Costly, especially over large tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normalization&lt;/td&gt;
&lt;td&gt;Encouraged (FKs, constraints)&lt;/td&gt;
&lt;td&gt;Discouraged (flatten your data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write latency&lt;/td&gt;
&lt;td&gt;ACID-compliant, slower but reliable&lt;/td&gt;
&lt;td&gt;Fast inserts, optimized for batches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analytics&lt;/td&gt;
&lt;td&gt;Slow on large joins&lt;/td&gt;
&lt;td&gt;Optimized for OLAP queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Best Practice:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Model your source data in normalized form in Postgres. When syncing to ClickHouse, flatten your facts and materialize your dimensions.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Watch for Type Compatibility (JSON, UUID, Timestamps)
&lt;/h2&gt;

&lt;p&gt;Some Postgres types don’t map cleanly to ClickHouse. Here are common gotchas:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PostgreSQL Type&lt;/th&gt;
&lt;th&gt;ClickHouse Equivalent&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UUID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UUID&lt;/code&gt; or &lt;code&gt;String&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ClickHouse has a native &lt;code&gt;UUID&lt;/code&gt; type; some connectors still land it as String&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;JSONB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;String&lt;/code&gt; or &lt;code&gt;Nested&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Consider flattening or casting to string&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DateTime64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ensure timezones are handled correctly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NUMERIC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Decimal(18,4)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Match precision explicitly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
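&lt;p&gt;In practice these mappings become explicit casts in the sync layer; an illustrative ClickHouse snippet (the column names are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
  toUUID(id_string)                    AS id,          -- or keep as String
  toDecimal64OrZero(amount_string, 4)  AS amount,      -- precision chosen explicitly
  parseDateTimeBestEffort(ts_string)   AS created_at   -- cast to DateTime64 if sub-second precision matters
FROM raw_rows;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;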

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Use a schema registry or intermediate layer (like &lt;strong&gt;DBT&lt;/strong&gt; or &lt;strong&gt;protobuf&lt;/strong&gt;) to enforce compatibility across both systems.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. Optimize Time-Based Partitioning for ClickHouse
&lt;/h2&gt;

&lt;p&gt;ClickHouse thrives when data is &lt;strong&gt;partitioned&lt;/strong&gt; and &lt;strong&gt;sorted&lt;/strong&gt; effectively.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PostgreSQL:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;created_at&lt;/code&gt; or &lt;code&gt;updated_at&lt;/code&gt; for tracking changes.&lt;/li&gt;
&lt;li&gt;Use indexes on frequently filtered fields.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ClickHouse:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Partition by &lt;code&gt;toYYYYMM(created_at)&lt;/code&gt; or &lt;code&gt;toYYYYMMDD()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Sort key: &lt;code&gt;(user_id, created_at)&lt;/code&gt; so the primary index can &lt;strong&gt;skip irrelevant granules&lt;/strong&gt; (faster filtering).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
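&lt;p&gt;Putting both halves together for a hypothetical &lt;code&gt;events&lt;/code&gt; stream, the ClickHouse table might be declared like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE events_flat (
  user_id    UUID,
  event_type String,
  created_at DateTime64(3, 'UTC')
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)   -- monthly partitions for date-range pruning
ORDER BY (user_id, created_at);     -- sort key matching common filters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;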


&lt;h2&gt;
  
  
  4. Avoid Foreign Keys in ClickHouse
&lt;/h2&gt;

&lt;p&gt;ClickHouse does not support foreign key constraints. This means you need to &lt;strong&gt;flatten joins ahead of time&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example:
&lt;/h3&gt;

&lt;p&gt;PostgreSQL Schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ClickHouse Flattened Table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;orders_flat&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_email&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_name&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="nb"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;DateTime64&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ETL pipelines&lt;/strong&gt; should enrich the data before writing to ClickHouse.&lt;/p&gt;
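&lt;p&gt;As a hedged illustration of that enrichment step (the row dicts and helper name are hypothetical, not Airbyte or DBT APIs), the join from &lt;code&gt;users&lt;/code&gt; into &lt;code&gt;orders&lt;/code&gt; can be flattened ahead of time like this:&lt;/p&gt;

```python
def flatten_orders(orders, users):
    """Join user attributes into each order, producing orders_flat rows.

    orders/users are plain dicts standing in for rows pulled from PostgreSQL;
    a real pipeline would do this in a DBT model or an ETL job before the
    ClickHouse insert.
    """
    users_by_id = {u["id"]: u for u in users}
    flat = []
    for o in orders:
        u = users_by_id[o["user_id"]]
        flat.append({
            "order_id": o["id"],
            "user_id": u["id"],
            "user_email": u["email"],
            "user_name": u["name"],
            "total_amount": o["total_amount"],
            "created_at": o["created_at"],
        })
    return flat

users = [{"id": "u1", "name": "Ada", "email": "ada@example.com"}]
orders = [{"id": "o1", "user_id": "u1", "total_amount": "19.99",
           "created_at": "2025-05-10T15:02:37"}]
print(flatten_orders(orders, users)[0]["user_email"])  # ada@example.com
```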




&lt;h2&gt;
  
  
  5. Model for Read Patterns, Not Write Patterns
&lt;/h2&gt;

&lt;p&gt;ClickHouse thrives on &lt;strong&gt;append-only, query-optimized data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abc123'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pre-aggregate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;daily_event_counts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;daily_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then query from &lt;code&gt;daily_event_counts&lt;/code&gt; instead.&lt;/p&gt;
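&lt;p&gt;The same pre-aggregation can be sketched in Python (an in-memory stand-in for the materialized view, not ClickHouse itself) to show why the per-user lookup becomes cheap:&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime

def daily_event_counts(events):
    """Mirror of the materialized view: count events per (user_id, event_date)."""
    counts = Counter()
    for e in events:
        counts[(e["user_id"], e["event_time"].date())] += 1
    return counts

events = [
    {"user_id": "abc123", "event_time": datetime(2025, 5, 10, 9, 0)},
    {"user_id": "abc123", "event_time": datetime(2025, 5, 10, 17, 30)},
    {"user_id": "xyz789", "event_time": datetime(2025, 5, 11, 8, 15)},
]
agg = daily_event_counts(events)
# A per-user count is now a lookup into the small aggregate, not a full scan.
print(agg[("abc123", datetime(2025, 5, 10).date())])  # 2
```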




&lt;h2&gt;
  
  
  6. Be Cautious with Schema Evolution
&lt;/h2&gt;

&lt;p&gt;PostgreSQL handles schema changes gracefully. ClickHouse… doesn’t (yet).&lt;/p&gt;

&lt;h3&gt;
  
  
  Tips:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Avoid adding columns frequently in ClickHouse.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use wide-table design (predefine a large schema if possible).&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prefer additive changes (append-only, soft deletes).&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To manage &lt;strong&gt;schema drift&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use tools like &lt;strong&gt;DBT, Airbyte, or LakeSoul&lt;/strong&gt; with &lt;strong&gt;versioned schemas&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Log schema changes and sync them across systems in CI/CD.&lt;/li&gt;
&lt;/ul&gt;
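&lt;p&gt;A minimal drift check, assuming you can list column names on both sides (the helper below is hypothetical, not part of DBT or Airbyte), might look like:&lt;/p&gt;

```python
def schema_drift(source_cols, target_cols):
    """Report columns present in the source (PostgreSQL) but missing from the
    target (ClickHouse), and vice versa. Column names are illustrative."""
    src, tgt = set(source_cols), set(target_cols)
    return {
        "missing_in_target": sorted(src.difference(tgt)),
        "extra_in_target": sorted(tgt.difference(src)),
    }

postgres_cols = ["id", "user_id", "total_amount", "created_at", "coupon_code"]
clickhouse_cols = ["id", "user_id", "total_amount", "created_at"]
drift = schema_drift(postgres_cols, clickhouse_cols)
print(drift["missing_in_target"])  # ['coupon_code']
```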




&lt;h2&gt;
  
  
  7. Use Separate ETL Pipelines for OLTP and OLAP
&lt;/h2&gt;

&lt;p&gt;Instead of writing the same data model into both systems directly, maintain &lt;strong&gt;two pipelines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OLTP write → PostgreSQL&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OLTP sync → Enriched + transformed → ClickHouse&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This decouples constraints, allows async processing, and optimizes each layer for its strength.&lt;/p&gt;
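&lt;p&gt;A toy sketch of that decoupling (in-memory lists and a &lt;code&gt;Queue&lt;/code&gt; standing in for PostgreSQL, the sync channel, and ClickHouse) shows the transactional write staying fast while enrichment happens off the write path:&lt;/p&gt;

```python
from queue import Queue

oltp_store = []          # stands in for PostgreSQL
olap_queue = Queue()     # stands in for the CDC/sync channel (e.g. Airbyte)
olap_store = []          # stands in for ClickHouse

def oltp_write(order):
    """OLTP path: commit synchronously, then hand off for async processing."""
    oltp_store.append(order)
    olap_queue.put(order)

def olap_drain(enrich):
    """OLAP path: consume the queue, enrich, and batch into the OLAP store."""
    while not olap_queue.empty():
        olap_store.append(enrich(olap_queue.get()))

oltp_write({"id": "o1", "user_id": "u1", "total_amount": "19.99"})
olap_drain(lambda o: {**o, "user_name": "Ada"})  # enrichment off the write path
print(len(oltp_store), len(olap_store))  # 1 1
```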




&lt;h2&gt;
  
  
  TL;DR: Unified Modeling Principles
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Principle&lt;/th&gt;
&lt;th&gt;PostgreSQL (OLTP)&lt;/th&gt;
&lt;th&gt;ClickHouse (OLAP)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Normalization&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joins&lt;/td&gt;
&lt;td&gt;✅ Fast&lt;/td&gt;
&lt;td&gt;❌ Avoid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Constraints&lt;/td&gt;
&lt;td&gt;✅ Enforced&lt;/td&gt;
&lt;td&gt;❌ Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Types&lt;/td&gt;
&lt;td&gt;Strong, varied&lt;/td&gt;
&lt;td&gt;Simpler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Writes&lt;/td&gt;
&lt;td&gt;Row-based&lt;/td&gt;
&lt;td&gt;Batch-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queries&lt;/td&gt;
&lt;td&gt;Indexed row access&lt;/td&gt;
&lt;td&gt;Columnar, vectorized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evolution&lt;/td&gt;
&lt;td&gt;✅ Flexible&lt;/td&gt;
&lt;td&gt;⚠️ Careful planning needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Designing data models across PostgreSQL and ClickHouse isn’t about picking one approach—it’s about understanding what each engine excels at and designing &lt;strong&gt;your sync + transformations accordingly&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Get this right, and you’ll enjoy the &lt;strong&gt;flexibility of Postgres&lt;/strong&gt; with the &lt;strong&gt;speed and scalability of ClickHouse&lt;/strong&gt;—without constantly fixing pipelines or rewriting queries.&lt;/p&gt;




&lt;h3&gt;
  
  
  What We Covered
&lt;/h3&gt;

&lt;p&gt;💡 &lt;strong&gt;This Part 2 put Part 1’s theory into practice: syncing a normalized Postgres schema into a denormalized ClickHouse table using Airbyte + DBT. Stay tuned for more!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>clickhouse</category>
      <category>data</category>
    </item>
  </channel>
</rss>
