Burnside Project

pg-warehouse - a local-first data warehouse that mirrors PostgreSQL data at scale, without over-engineering - no pipelines needed!

PostgreSQL → DuckDB → SQL Engine → Parquet → and beyond ...

A Local-First Analytics Pipeline

Data teams often spend more time operating infrastructure than actually building features.

To construct an AI feature pipeline, organizations frequently spin up heavy stacks consisting of:

Large cloud VMs
Distributed compute clusters
Streaming infrastructure
Data warehouses
Orchestration systems

These systems consume significant engineering effort and infrastructure cost before a single feature is produced.

The irony is that the core goal of most pipelines is simple: transform operational data into features for analytics or machine learning.

Yet the industry default architecture looks like this:

PostgreSQL → Kafka → Spark/Flink → Data Warehouse → Feature Store

This architecture is powerful — but also massively over-engineered for many workloads.

Most teams simply want to:

  • Mirror production data
  • Run SQL transformations
  • Generate datasets
  • Export them to analytics or AI pipelines

But instead they end up managing:

  • Kafka clusters
  • Spark jobs
  • Cloud warehouses
  • Orchestration systems
  • Infrastructure costs

For many workloads, this complexity isn’t necessary.

That observation led to the creation of pg-warehouse.

The Idea Behind pg-warehouse

pg-warehouse is a local-first pipeline engine that mirrors PostgreSQL OLTP data into a local DuckDB warehouse (analytical database).

It captures both:

  • The initial snapshot of tables
  • Incremental changes from PostgreSQL replication

Then developers can run SQL pipelines on top of the mirrored data.

The result:

PostgreSQL (OLTP)
      ↓
PostgreSQL replication stream
      ↓
pg-warehouse sync engine
      ↓
DuckDB local warehouse
      ↓
SQL feature pipelines
      ↓
Parquet / CSV datasets

Everything runs locally.

No Kafka.

No Spark.

No warehouse cluster.

Just:

PostgreSQL
DuckDB
SQL

Why PostgreSQL Replication?

Instead of polling tables or running ETL queries, pg-warehouse uses PostgreSQL’s native replication capabilities.

PostgreSQL exposes replication metadata through:

  • Write-Ahead Log (WAL)
  • Logical replication slots
  • LSN offsets

This allows pg-warehouse to:

  • Capture a consistent snapshot of selected tables
  • Track WAL changes after the snapshot
  • Apply incremental updates to DuckDB

This approach has several advantages:

  • minimal load on the OLTP database
  • exactly-once incremental progress
  • restart-safe replication

The sync engine tracks replication progress in a local state database.
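As a sketch of what that replication setup looks like on the PostgreSQL side, here is how a publication and logical slot could be created by hand in psql. The `pgwh_pub` and `pgwh_slot` names match the sample configuration later in this post; pg-warehouse may well create these objects for you, so treat this as illustrative:

```sql
-- Publish the tables to be mirrored (standard PostgreSQL logical replication)
CREATE PUBLICATION pgwh_pub FOR TABLE public.orders, public.customers;

-- Create a logical replication slot using the built-in pgoutput plugin;
-- PostgreSQL retains WAL from this point until the slot consumes it
SELECT pg_create_logical_replication_slot('pgwh_slot', 'pgoutput');

-- Inspect the slot's confirmed LSN to monitor replication progress
SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots;
```

Because the slot pins WAL retention, the sync engine can crash and resume from its last confirmed LSN without losing changes.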

Why DuckDB?

DuckDB is an ideal engine for local analytics workloads.

Key properties:

  • columnar storage
  • vectorized execution
  • high-performance SQL
  • embedded runtime

This allows pg-warehouse to transform row-oriented PostgreSQL data into *columnar analytics tables*.

Example workflow:

SELECT
    user_id,
    COUNT(*) AS plays,
    MAX(created_at) AS last_activity
FROM raw.events
GROUP BY user_id

These pipelines generate derived feature tables that can be exported to Parquet.
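In DuckDB, that export step is a single statement. A sketch (the output file name is illustrative):

```sql
-- Export the result of a pipeline query straight to Parquet
COPY (
    SELECT
        user_id,
        COUNT(*) AS plays,
        MAX(created_at) AS last_activity
    FROM raw.events
    GROUP BY user_id
) TO 'user_features.parquet' (FORMAT PARQUET);
```

DuckDB writes the columnar Parquet file directly; there is no separate export service to run.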

Does It Scale?

What Do 90% of AI Data Pipelines Actually Do?

When people talk about PostgreSQL AI data pipelines, the architecture often looks intimidating:

PostgreSQL → some heavy distributed system → some expensive cloud data warehouse → Feature Store → ML Pipeline

But if you examine the actual work performed inside most pipelines, the reality is much simpler.

Most pipelines are just SQL transformations. In practice, 90% of pipelines reduce to a few simple steps:

Typical Pipeline Operations

  • filter rows
  • aggregate events
  • join metadata
  • compute features

These are operations that columnar engines have optimized for decades. You don’t need a distributed compute cluster to do them.
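All four of those operations fit in one ordinary SQL statement. A hedged sketch against hypothetical `raw.events` and `raw.users` tables (the column names are assumptions, not pg-warehouse conventions):

```sql
SELECT
    e.user_id,
    u.country,                          -- join metadata
    COUNT(*)          AS purchases,     -- aggregate events
    SUM(e.amount)     AS total_spent,   -- compute features
    MAX(e.created_at) AS last_purchase_at
FROM raw.events AS e
JOIN raw.users  AS u ON u.id = e.user_id
WHERE e.event_type = 'purchase'         -- filter rows
GROUP BY e.user_id, u.country;
```

Nothing here requires distributed compute; it is exactly the kind of query a columnar engine executes well on a single machine.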

Feature Tables Are Much Smaller Than Raw Data

Another reason pipelines are frequently over-engineered is a misunderstanding of how data volumes evolve across pipeline layers.

Raw event streams can be large, but feature tables are dramatically smaller.

If you ingest 200 GB of raw events per day, your final feature tables might only be 2–10 GB. That is well within the capabilities of a single-node columnar engine like DuckDB.

The 90% Pipeline Design Target

If you optimize a system for the workload profile above (SQL transformations over feature tables that fit comfortably on a single node), you can cover roughly 90% of real-world AI feature pipelines.
Everything beyond that tends to fall into hyperscale edge cases, such as:

  • trillion-event streaming systems
  • global ad-tech platforms
  • massive graph processing
  • real-time distributed ML training

These are important problems — but they are not the common case.

The Over-Engineering Problem

Modern data infrastructure often assumes that every pipeline must be built with distributed systems.

This leads to stacks like:

  • Kafka
  • Spark
  • Flink
  • Airflow
  • Data Warehouse
  • Feature Store

While powerful, these systems introduce significant overhead:

  • infrastructure complexity
  • operational cost
  • latency
  • specialized expertise

For many teams, this complexity is unnecessary.

A simpler architecture is often sufficient:

PostgreSQL
   ↓
DuckDB
   ↓
SQL feature pipelines
   ↓
Parquet datasets

Where Does pg-warehouse Fit?

pg-warehouse is designed specifically for this 90% case.

It mirrors PostgreSQL OLTP data into a local DuckDB warehouse, where developers can run SQL feature pipelines and export datasets to Parquet.

Instead of building complex distributed pipelines, developers can focus on the transformations that actually matter.

PostgreSQL
   ↓
Replication stream
   ↓
pg-warehouse
   ↓
DuckDB
   ↓
SQL feature pipelines
   ↓
Parquet datasets

The result is a local-first analytics stack that removes unnecessary infrastructure while still supporting the core transformations used in most AI pipelines.

Architecture

pg-warehouse is designed around a clean separation of concerns.

Open Core components include:

Sync Engine

Responsible for:

  • PostgreSQL snapshot
  • replication stream processing
  • applying changes to DuckDB

Warehouse Layer

DuckDB acts as the columnar analytics engine.

Tables are stored under the raw schema.

Pipeline Engine

Developers write SQL files describing feature pipelines.

Example:

pipelines/user_features.sql

These pipelines read from raw tables and produce derived datasets.
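As an illustration, `pipelines/user_features.sql` might contain nothing more than a `CREATE TABLE ... AS SELECT` (the table and column names here are hypothetical):

```sql
-- pipelines/user_features.sql
-- Reads from the mirrored raw schema, writes a derived feature table
CREATE OR REPLACE TABLE feature.user_features AS
SELECT
    user_id,
    COUNT(*)        AS event_count,
    MAX(created_at) AS last_seen_at
FROM raw.events
GROUP BY user_id;
```

The pipeline engine only needs to execute such files against DuckDB; the transformation logic stays in plain, reviewable SQL.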

Export Engine

Exports datasets to:

  • Parquet
  • CSV
  • external storage

Design Principles

pg-warehouse follows several design constraints.

Local-First

The entire pipeline runs locally.

No cloud infrastructure required.

Single Binary

pg-warehouse is implemented in Go and compiled as a single executable.

This simplifies deployment dramatically.

Config-Driven

All workflows are configured using a single YAML file.

Example:

project:
  name: my_warehouse

postgres:
  url: postgres://warehouse:password@pg-host:5432/mydb
  schema: public

duckdb:
  raw: ./raw.duckdb
  silver: ./silver.duckdb
  feature: ./feature.duckdb

cdc:
  enabled: true
  publication_name: pgwh_pub
  slot_name: pgwh_slot
  tables:
    - public.orders
    - public.customers

sync:
  mode: incremental
  tables:
    - name: public.orders
      target_schema: raw
      primary_key: [id]
      watermark_column: updated_at
    - name: public.customers
      target_schema: raw
      primary_key: [id]
      watermark_column: updated_at

State Durability

Replication progress is stored in a SQLite state database.

This allows:

  • warehouse rebuilds
  • crash recovery
  • deterministic incremental sync

You can delete and rebuild your DuckDB warehouse without losing sync state.
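One plausible shape for that state database, sketched as SQLite DDL (this schema is illustrative, not pg-warehouse's actual layout):

```sql
-- Hypothetical sync-state table in the SQLite state database
CREATE TABLE IF NOT EXISTS sync_state (
    table_name     TEXT PRIMARY KEY,   -- e.g. 'public.orders'
    snapshot_done  INTEGER NOT NULL DEFAULT 0,
    last_lsn       TEXT,               -- last WAL position applied to DuckDB
    last_watermark TEXT,               -- e.g. max(updated_at) applied
    updated_at     TEXT NOT NULL
);

-- After each applied batch, advance the checkpoint
UPDATE sync_state
SET last_lsn = '0/16B3748', updated_at = datetime('now')
WHERE table_name = 'public.orders';
```

Because the checkpoint lives outside the warehouse files, dropping `raw.duckdb` and replaying from the recorded LSN yields the same deterministic result.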

Hexagonal Architecture

pg-warehouse follows a ports and adapters architecture.

Core logic is isolated from:

  • PostgreSQL adapter
  • DuckDB adapter
  • export adapters

This allows enterprise extensions without modifying core logic.

Open Core Model

pg-warehouse is an open core project.

Open Source includes:

  • PostgreSQL replication-based data sync
  • DuckDB local analytics warehouse
  • SQL feature pipelines
  • Parquet / CSV dataset export
  • Local-first pipeline execution

Enterprise features will include:

  • Cloud storage pipelines (S3 / GCS / Azure)
  • Version-controlled SQL pipelines
  • Distributed synchronization across nodes
  • Production observability and monitoring

Why This Matters

Data engineering is slowly shifting toward simpler architectures.

Modern embedded engines like DuckDB make it possible to run serious analytics workloads without clusters.

pg-warehouse aims to cover the 90% case for building datasets and AI features from PostgreSQL.

For many teams:

PostgreSQL → DuckDB → Parquet

is enough.

Project

pg-warehouse is open core and available on GitHub.

Looking for contributors!

Repository

Quick Start

Development Workflow
