PostgreSQL → DuckDB → SQL Engine → Parquet → and beyond
A Local-First Analytics Pipeline
Data teams often spend more time operating infrastructure than actually building features.
To construct an AI feature pipeline, organizations frequently spin up heavy stacks consisting of:
- Large cloud VMs
- Distributed compute clusters
- Streaming infrastructure
- Data warehouses
- Orchestration systems
These systems consume significant engineering effort and infrastructure cost before a single feature is produced.
The irony is that the core goal of most pipelines is simple: transform operational data into features for analytics or machine learning.
Yet the industry default architecture looks like this:
PostgreSQL → Kafka → Spark/Flink → Data Warehouse → Feature Store
This architecture is powerful — but also massively over-engineered for many workloads.
Most teams simply want to:
- Mirror production data
- Run SQL transformations
- Generate datasets
- Export them to analytics or AI pipelines
But instead they end up managing:
- Kafka clusters
- Spark jobs
- Cloud warehouses
- Orchestration systems
- Infrastructure costs
For many workloads, this complexity isn’t necessary.
That observation led to the creation of pg-warehouse.
The Idea Behind pg-warehouse
pg-warehouse is a local-first pipeline engine that mirrors PostgreSQL OLTP data into a local DuckDB warehouse (analytical database).
It captures both:
- The initial snapshot of tables
- Incremental changes from PostgreSQL replication
Then developers can run SQL pipelines on top of the mirrored data.
The result:
PostgreSQL (OLTP)
↓
PostgreSQL replication stream
↓
pg-warehouse sync engine
↓
DuckDB local warehouse
↓
SQL feature pipelines
↓
Parquet / CSV datasets
Everything runs locally.
No Kafka.
No Spark.
No warehouse cluster.
Just:
PostgreSQL
DuckDB
SQL
Why PostgreSQL Replication?
Instead of polling tables or running ETL queries, pg-warehouse uses PostgreSQL’s native replication capabilities.
PostgreSQL exposes replication metadata through:
- Write-Ahead Log (WAL)
- Logical replication slots
- LSN offsets
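On the PostgreSQL side, logical replication needs only a publication and a replication slot. A minimal setup, reusing the publication and slot names from the sample config later in this post, looks roughly like:

```sql
-- Publish the tables that should be mirrored
CREATE PUBLICATION pgwh_pub FOR TABLE public.orders, public.customers;

-- Create a logical replication slot with the built-in pgoutput plugin;
-- the slot remembers how far the consumer has read in the WAL
SELECT pg_create_logical_replication_slot('pgwh_slot', 'pgoutput');

-- Check the slot's confirmed position (an LSN offset)
SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots;
```

Note that the server must be running with wal_level = logical for any of this to work.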
This allows pg-warehouse to:
- Capture a consistent snapshot of selected tables
- Track WAL changes after the snapshot
- Apply incremental updates to DuckDB
This approach has several advantages:
- minimal load on the OLTP database
- exactly-once incremental progress
- restart-safe replication
The sync engine tracks replication progress in a local state database.
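The exact schema of that state database is internal to pg-warehouse, but conceptually it only needs to map each table to the last WAL position applied. A hypothetical sketch (table and column names here are illustrative, not the actual schema):

```sql
-- Hypothetical sync-state table, one row per mirrored table
CREATE TABLE IF NOT EXISTS sync_state (
    table_name  TEXT PRIMARY KEY,  -- e.g. 'public.orders'
    last_lsn    TEXT NOT NULL,     -- last WAL offset applied to DuckDB
    synced_at   TIMESTAMP
);

-- After a batch of changes lands in DuckDB, advance the watermark
UPDATE sync_state
SET last_lsn = '0/16B3748', synced_at = CURRENT_TIMESTAMP
WHERE table_name = 'public.orders';
```

Advancing the watermark only after a batch has been applied is one way to make the sync restart-safe: a crash between the two steps replays the batch instead of skipping it.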
Why DuckDB?
DuckDB is an ideal engine for local analytics workloads.
Key properties:
- columnar storage
- vectorized execution
- high-performance SQL
- embedded runtime
This allows pg-warehouse to transform row-oriented PostgreSQL data into columnar analytics tables.
Example workflow:
```sql
SELECT
    user_id,
    COUNT(*) AS plays,
    MAX(created_at) AS last_activity
FROM raw.events
GROUP BY user_id;
```
These pipelines generate derived feature tables that can be exported to Parquet.
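In DuckDB, exporting such a derived table to Parquet is a single COPY statement (the output path here is illustrative):

```sql
-- Export the aggregated feature table directly to Parquet
COPY (
    SELECT
        user_id,
        COUNT(*) AS plays,
        MAX(created_at) AS last_activity
    FROM raw.events
    GROUP BY user_id
) TO 'user_features.parquet' (FORMAT PARQUET);
```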
Does It Scale?
What 90% of AI Data Pipelines Actually Do
When people talk about PostgreSQL AI data pipelines, the architecture often looks intimidating:
PostgreSQL → SOME HEAVY DISTRIBUTED SYSTEM → SOME EXPENSIVE CLOUD DATA WAREHOUSE → Feature Store → ML Pipeline
But if you examine the actual work performed inside most pipelines, the reality is much simpler.
Most pipelines are just SQL transformations. In practice, 90% of pipelines reduce to a few simple steps:
Typical Pipeline Operations
- filter rows
- aggregate events
- join metadata
- compute features
These are operations that columnar engines have optimized for decades. You don’t need a distributed compute cluster to do them.
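All four of those operations fit comfortably in one query. The sketch below uses made-up table and column names to show the shape of a typical feature pipeline:

```sql
SELECT
    e.user_id,
    u.plan,                                      -- join metadata
    COUNT(*) AS events,                          -- aggregate events
    COUNT(*) FILTER (WHERE e.kind = 'purchase')
        AS purchases                             -- compute a feature
FROM raw.events e
JOIN raw.users u USING (user_id)
WHERE e.created_at >= DATE '2024-01-01'          -- filter rows
GROUP BY e.user_id, u.plan;
```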
Feature Tables Are Much Smaller Than Raw Data
Another reason pipelines are frequently over-engineered is a misunderstanding of how data volumes evolve across pipeline layers.
Raw event streams can be large, but feature tables are dramatically smaller.
If you ingest 200 GB of raw events per day, your final feature tables might only be 2–10 GB. That is well within the capabilities of a single-node columnar engine like DuckDB.
The 90% Pipeline Design Target
If you optimize a system for this workload profile (SQL transformations over feature tables that fit comfortably on a single node), you can cover roughly 90% of real-world AI feature pipelines.
Everything beyond that tends to fall into hyperscale edge cases, such as:
- trillion-event streaming systems
- global ad-tech platforms
- massive graph processing
- real-time distributed ML training
These are important problems — but they are not the common case.
The Over-Engineering Problem
Modern data infrastructure often assumes that every pipeline must be built with distributed systems.
This leads to stacks like:
- Kafka
- Spark
- Flink
- Airflow
- Data Warehouse
- Feature Store
While powerful, these systems introduce significant overhead:
- infrastructure complexity
- operational cost
- latency
- specialized expertise
For many teams, this complexity is unnecessary.
A simpler architecture is often sufficient:
PostgreSQL
↓
DuckDB
↓
SQL feature pipelines
↓
Parquet datasets
Where pg-warehouse Fits
pg-warehouse is designed specifically for this 90% case.
It mirrors PostgreSQL OLTP data into a local DuckDB warehouse, where developers can run SQL feature pipelines and export datasets to Parquet.
Instead of building complex distributed pipelines, developers can focus on the transformations that actually matter.
PostgreSQL
↓
Replication stream
↓
pg-warehouse
↓
DuckDB
↓
SQL feature pipelines
↓
Parquet datasets
The result is a local-first analytics stack that removes unnecessary infrastructure while still supporting the core transformations used in most AI pipelines.
Architecture
pg-warehouse is designed around a clean separation of concerns.
Open Core components include:
Sync Engine
Responsible for:
- PostgreSQL snapshot
- replication stream processing
- applying changes to DuckDB
Warehouse Layer
DuckDB acts as the columnar analytics engine.
Tables are stored under the raw schema.
Pipeline Engine
Developers write SQL files describing feature pipelines.
Example:
pipelines/user_features.sql
These pipelines read from raw tables and produce derived datasets.
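A pipeline file like pipelines/user_features.sql might contain nothing more than a CREATE TABLE AS statement; the contents below are an illustrative sketch, with the feature schema name taken from the sample config's feature database:

```sql
-- pipelines/user_features.sql (illustrative contents)
CREATE OR REPLACE TABLE feature.user_features AS
SELECT
    user_id,
    COUNT(*) AS plays,
    MAX(created_at) AS last_activity
FROM raw.events
GROUP BY user_id;
```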
Export Engine
Exports datasets to:
- Parquet
- CSV
- external storage
Design Principles
pg-warehouse follows several design constraints.
Local-First
The entire pipeline runs locally.
No cloud infrastructure required.
Single Binary
pg-warehouse is implemented in Go and compiled as a single executable.
This simplifies deployment dramatically.
Config-Driven
All workflows are configured using a single YAML file.
Example:
```yaml
project:
  name: my_warehouse

postgres:
  url: postgres://warehouse:password@pg-host:5432/mydb
  schema: public

duckdb:
  raw: ./raw.duckdb
  silver: ./silver.duckdb
  feature: ./feature.duckdb

cdc:
  enabled: true
  publication_name: pgwh_pub
  slot_name: pgwh_slot
  tables:
    - public.orders
    - public.customers

sync:
  mode: incremental
  tables:
    - name: public.orders
      target_schema: raw
      primary_key: [id]
      watermark_column: updated_at
    - name: public.customers
      target_schema: raw
      primary_key: [id]
      watermark_column: updated_at
```
State Durability
Replication progress is stored in a SQLite state database.
This allows:
- warehouse rebuilds
- crash recovery
- deterministic incremental sync
You can delete and rebuild your DuckDB warehouse without losing sync state.
Hexagonal Architecture
pg-warehouse follows a ports and adapters architecture.
Core logic is isolated from:
- PostgreSQL adapter
- DuckDB adapter
- export adapters
This allows enterprise extensions without modifying core logic.
Open Core Model
pg-warehouse is an open core project.
Open Source includes:
- PostgreSQL replication-based data sync
- DuckDB local analytics warehouse
- SQL feature pipelines
- Parquet / CSV dataset export
- Local-first pipeline execution
Enterprise features will include:
- Cloud storage pipelines (S3 / GCS / Azure)
- Version-controlled SQL pipelines
- Distributed synchronization across nodes
- Production observability and monitoring
Why This Matters
Data engineering is slowly shifting toward simpler architectures.
Modern embedded engines like DuckDB make it possible to run serious analytics workloads without clusters.
pg-warehouse aims to cover the 90% case for building datasets and AI features from PostgreSQL.
For many teams:
PostgreSQL → DuckDB → Parquet
is enough.
Project
pg-warehouse is open core and available on GitHub.
Looking for contributors!



