<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Burnside Project</title>
    <description>The latest articles on DEV Community by Burnside Project (@burnsideproject).</description>
    <link>https://dev.to/burnsideproject</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852611%2F5e130438-8c20-42d5-a212-b8d97eee8ee2.png</url>
      <title>DEV Community: Burnside Project</title>
      <link>https://dev.to/burnsideproject</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/burnsideproject"/>
    <language>en</language>
    <item>
      <title>pg-stress — Stress Testing PostgreSQL with Claude-powered advisory</title>
      <dc:creator>Burnside Project</dc:creator>
      <pubDate>Thu, 02 Apr 2026 15:38:50 +0000</pubDate>
      <link>https://dev.to/burnsideproject/pg-stress-stress-testing-postgresql-with-claude-powered-advisory-278g</link>
      <guid>https://dev.to/burnsideproject/pg-stress-stress-testing-postgresql-with-claude-powered-advisory-278g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwiek9qqbksvku8vkeiw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwiek9qqbksvku8vkeiw9.png" alt=" " width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test It Like It’s a Machine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When I started building pg-collector (another project that depends on heavy stress testing), I ran into a problem very quickly:&lt;/p&gt;

&lt;p&gt;I didn’t have a reliable way to break PostgreSQL on demand. There are plenty of tools for generating synthetic data, but no comprehensive stress-testing tool.&lt;/p&gt;

&lt;p&gt;And that’s when the idea for pg-stress was born.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeecicttz2syawoelf5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeecicttz2syawoelf5v.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built from “eat your own dog food”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg-stress didn’t start as a product idea.&lt;br&gt;
It started as a necessity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv88m2yt4nnfi76lh4zmr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv88m2yt4nnfi76lh4zmr.png" alt=" " width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While building pg-collector, I needed:&lt;br&gt;
    • real workload patterns&lt;br&gt;
    • repeatable failure scenarios&lt;br&gt;
    • controlled environments to observe behavior&lt;br&gt;
    • the ability to run these tests in random patterns, continuously, 24/7, for two weeks&lt;/p&gt;

&lt;p&gt;So I built a tool to stress PostgreSQL intentionally — not just benchmark it.&lt;/p&gt;

&lt;p&gt;That tool became pg-stress.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it like automobile testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a new car is built, it’s not just driven on smooth roads.&lt;/p&gt;

&lt;p&gt;It’s tested in:&lt;br&gt;
    • freezing cold&lt;br&gt;
    • extreme heat&lt;br&gt;
    • rough terrain&lt;br&gt;
    • high-speed stress conditions&lt;/p&gt;

&lt;p&gt;Why? Because you want to find the breaking points before your customers do.&lt;/p&gt;

&lt;p&gt;Before releasing a new query into the wild, inject it into pg-stress first. Stress it with hundreds of connections, test the joints, insert 10M rows (kidding! Not really... I inserted 30M records), then hand the findings to Claude for advisory.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Databases are no different.&lt;/p&gt;

&lt;p&gt;Production issues show up under:&lt;br&gt;
    • traffic spikes&lt;br&gt;
    • bad queries&lt;br&gt;
    • ORM inefficiencies&lt;br&gt;
    • long-running transactions&lt;br&gt;
    • unpredictable write patterns&lt;/p&gt;

&lt;p&gt;pg-stress is built to simulate exactly that.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What pg-stress actually tests&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is not just “run SELECT 1 in a loop.”&lt;/p&gt;

&lt;p&gt;We push real-world chaos into PostgreSQL:&lt;/p&gt;

&lt;h2&gt;
  
  
  ORM-like behavior
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;bursty inserts / updates / deletes&lt;/li&gt;
&lt;li&gt;inefficient query patterns&lt;/li&gt;
&lt;li&gt;transactional noise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Jitter + randomness
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;non-uniform traffic&lt;/li&gt;
&lt;li&gt;unpredictable workloads&lt;/li&gt;
&lt;li&gt;concurrency spikes&lt;/li&gt;
&lt;/ul&gt;
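&lt;p&gt;The jitter above boils down to non-uniform inter-arrival times. Here is a minimal Python sketch of that idea (illustrative only; pg-stress is its own tool, and every function and parameter name below is hypothetical):&lt;/p&gt;

```python
import random

def jittered_delays(n, base_rate=50.0, burst_prob=0.05, burst_factor=10.0, seed=42):
    """Return n inter-arrival delays (seconds) drawn from an exponential
    distribution, with occasional bursts where the arrival rate spikes.
    The result is non-uniform traffic with unpredictable concurrency peaks."""
    rng = random.Random(seed)
    delays = []
    for _ in range(n):
        burst = rng.random() > (1.0 - burst_prob)   # roughly 5% of arrivals land in a spike
        rate = base_rate * burst_factor if burst else base_rate
        delays.append(rng.expovariate(rate))        # exponential inter-arrival time
    return delays

delays = jittered_delays(1000)
```

&lt;p&gt;Each worker would sleep for the next delay before firing its query, so load arrives in clumps rather than a steady drumbeat.&lt;/p&gt;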

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Bloat &amp;amp; connection pressure
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;table/index bloat scenarios&lt;/li&gt;
&lt;li&gt;connection exhaustion&lt;/li&gt;
&lt;li&gt;lock contention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;h2&gt;
  
  
  Query stress
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;slow queries under load&lt;/li&gt;
&lt;li&gt;joins at scale&lt;/li&gt;
&lt;li&gt;degradation over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From stress → context → intelligence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most stress tools stop here:&lt;/p&gt;

&lt;p&gt;“Here are your TPS and latency numbers.”&lt;/p&gt;

&lt;p&gt;That’s not enough.&lt;/p&gt;

&lt;p&gt;pg-stress produces structured output designed for context:&lt;br&gt;
    • query behavior&lt;br&gt;
    • system response&lt;br&gt;
    • degradation patterns&lt;/p&gt;
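&lt;p&gt;As a rough illustration of what “structured output designed for context” can look like, here is a hypothetical run summary serialized into an LLM prompt. Every field name below is invented for the example; the real pg-stress output format is not shown here.&lt;/p&gt;

```python
import json

# Hypothetical shape of a structured stress-run summary.
run_summary = {
    "scenario": "orm_burst",
    "duration_s": 600,
    "connections": 200,
    "tps": {"p50": 1840, "p95": 1210},
    "latency_ms": {"p50": 4.2, "p95": 38.7},
    "degradation": {"tps_drop_pct": 34, "onset_s": 412},
    "top_waits": ["LWLock:WALWrite", "Lock:transactionid"],
}

# Structured JSON gives the model query behavior, system response,
# and degradation patterns in one self-describing blob.
prompt = (
    "Given this PostgreSQL stress-run summary, explain the likely "
    "bottleneck and suggest tuning steps:\n" + json.dumps(run_summary, indent=2)
)
```

&lt;p&gt;Because the summary is machine-readable, the same blob can feed dashboards, regression checks, or an advisory model without reformatting.&lt;/p&gt;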

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built for AI-assisted diagnosis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The output of pg-stress is optimized for modern workflows:&lt;/p&gt;

&lt;p&gt;Feed it directly into a model like Claude, and you can get:&lt;br&gt;
    • query tuning recommendations&lt;br&gt;
    • index suggestions&lt;br&gt;
    • capacity predictions&lt;br&gt;
    • failure explanations&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part of a bigger system&lt;/strong&gt;&lt;br&gt;
    • pg-stress → generates pressure&lt;br&gt;
    • pg-collector → observes + learns&lt;br&gt;
    • (future) → AI layer predicts and advises&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you can’t break your database in a controlled environment…&lt;/p&gt;

&lt;p&gt;It will break in production — on its own terms.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;&lt;a href="https://lnkd.in/gSwrvWyw" rel="noopener noreferrer"&gt;Repo:&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;#DataEngineering #AIInfrastructure #MachineLearning #PostgreSQL #OpenSource&lt;/p&gt;

</description>
      <category>postgres</category>
    </item>
    <item>
      <title>AI-powered PostgreSQL observability</title>
      <dc:creator>Burnside Project</dc:creator>
      <pubDate>Tue, 31 Mar 2026 03:43:07 +0000</pubDate>
      <link>https://dev.to/burnsideproject/know-your-postgresql-health-before-it-breaks-ai-powered-postgresql-observability-36he</link>
      <guid>https://dev.to/burnsideproject/know-your-postgresql-health-before-it-breaks-ai-powered-postgresql-observability-36he</guid>
      <description>&lt;p&gt;pg-collector streams live PostgreSQL telemetry into a 7-dimension state machine that predicts failures, detects query regressions, and answers the 5 questions every DBA asks — automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ogclz92tkqkw5oef1nz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ogclz92tkqkw5oef1nz.png" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is my database healthy?&lt;/strong&gt;&lt;br&gt;
Single-sentence verdict with confidence level, time-in-state, and 7-dimension breakdown. No interpretation needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaxejpl8ghuigxg9k0sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaxejpl8ghuigxg9k0sg.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;What changed?&lt;/strong&gt;&lt;br&gt;
Causal narrative linking state transitions to query regressions, workload shifts, and configuration changes with timestamps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What will break next?&lt;/strong&gt;&lt;br&gt;
Ranked risk register with 'days to breach' projections. Vacuum wraparound in 18 days. Connection exhaustion by April 12.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe956g6g8lbyeojrw57da.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe956g6g8lbyeojrw57da.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;What caused this spike?&lt;/strong&gt;&lt;br&gt;
Root cause attribution chains: query workload change -&amp;gt; cache eviction -&amp;gt; checkpoint storm -&amp;gt; lock cascade. Automatic cross-dimension correlation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How is performance trending?&lt;/strong&gt;&lt;br&gt;
30-day health report with per-dimension trajectories, volatility metrics, week-over-week comparisons, and prediction accuracy tracking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gcp.burnsideproject.ai" rel="noopener noreferrer"&gt;Get an Early Access&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/burnside-project/pg-collector" rel="noopener noreferrer"&gt;Git Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>sql</category>
      <category>go</category>
      <category>gcp</category>
    </item>
    <item>
<title>pg-warehouse - A local-first data warehouse that mirrors PostgreSQL data at scale, without over-engineering - no pipelines needed!</title>
      <dc:creator>Burnside Project</dc:creator>
      <pubDate>Tue, 31 Mar 2026 03:29:14 +0000</pubDate>
      <link>https://dev.to/burnsideproject/pg-warehouse-a-local-first-data-warehouse-at-scale-without-over-engineering-that-mirrors-2i4c</link>
      <guid>https://dev.to/burnsideproject/pg-warehouse-a-local-first-data-warehouse-at-scale-without-over-engineering-that-mirrors-2i4c</guid>
      <description>&lt;p&gt;PostgreSQL → DuckDB → SQL Engine → Parquet →And beyond ....&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Local-First Analytics Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data teams often spend more time operating infrastructure than actually building features.&lt;/p&gt;

&lt;p&gt;To construct an AI feature pipeline, organizations frequently spin up heavy stacks consisting of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Large cloud VMs&lt;br&gt;
Distributed compute clusters&lt;br&gt;
Streaming infrastructure&lt;br&gt;
Data warehouses&lt;br&gt;
Orchestration systems&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These systems consume significant engineering effort and infrastructure cost before a single feature is produced.&lt;/p&gt;

&lt;p&gt;The irony is that the core goal of most pipelines is simple: transform operational data into features for analytics or machine learning.&lt;/p&gt;

&lt;p&gt;Yet the industry default architecture looks like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PostgreSQL → Kafka → Spark/Flink → Data Warehouse → Feature Store&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This architecture is powerful — but also massively over-engineered for many workloads.&lt;/p&gt;

&lt;p&gt;Most teams simply want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mirror production data&lt;/li&gt;
&lt;li&gt;Run SQL transformations&lt;/li&gt;
&lt;li&gt;Generate datasets&lt;/li&gt;
&lt;li&gt;Export them to analytics or AI pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But instead they end up managing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka clusters&lt;/li&gt;
&lt;li&gt;Spark jobs&lt;/li&gt;
&lt;li&gt;Cloud warehouses&lt;/li&gt;
&lt;li&gt;Orchestration systems&lt;/li&gt;
&lt;li&gt;Infrastructure costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many workloads, &lt;strong&gt;this complexity isn’t necessary.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That observation led to the creation of pg-warehouse.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bclpqbu68elh4p52szd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bclpqbu68elh4p52szd.png" alt=" " width="800" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Idea Behind pg-warehouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg-warehouse is a local-first pipeline engine that mirrors PostgreSQL OLTP data into a local DuckDB warehouse (analytical database).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It captures both:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The initial snapshot of tables&lt;/li&gt;
&lt;li&gt;Incremental changes from PostgreSQL replication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then developers can run SQL pipelines on top of the mirrored data.&lt;/p&gt;

&lt;p&gt;The result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PostgreSQL (OLTP)
      ↓
PostgreSQL replication stream
      ↓
pg-warehouse sync engine
      ↓
DuckDB local warehouse
      ↓
SQL feature pipelines
      ↓
Parquet / CSV datasets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything runs locally.&lt;/p&gt;

&lt;p&gt;No Kafka.&lt;/p&gt;

&lt;p&gt;No Spark.&lt;/p&gt;

&lt;p&gt;No warehouse cluster.&lt;/p&gt;

&lt;p&gt;Just:&lt;/p&gt;

&lt;p&gt;PostgreSQL&lt;br&gt;
DuckDB&lt;br&gt;
SQL&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why PostgreSQL Replication?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of polling tables or running ETL queries, pg-warehouse uses PostgreSQL’s native replication capabilities.&lt;/p&gt;

&lt;p&gt;PostgreSQL exposes replication metadata through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write-Ahead Log (WAL)&lt;/li&gt;
&lt;li&gt;Logical replication slots&lt;/li&gt;
&lt;li&gt;LSN offsets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows pg-warehouse to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capture a consistent snapshot of selected tables&lt;/li&gt;
&lt;li&gt;Track WAL changes after the snapshot&lt;/li&gt;
&lt;li&gt;Apply incremental updates to DuckDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach has several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;minimal load on the OLTP database&lt;/li&gt;
&lt;li&gt;exactly-once incremental progress&lt;/li&gt;
&lt;li&gt;restart-safe replication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sync engine tracks replication progress in a local state database.&lt;/p&gt;
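&lt;p&gt;A minimal sketch of that state tracking, assuming a local SQLite file that maps each table to its last applied WAL position (LSN). The schema and function names here are illustrative, not pg-warehouse’s actual ones.&lt;/p&gt;

```python
import sqlite3

def init_state(conn):
    # One row per mirrored table, holding the last applied LSN.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sync_state ("
        "  table_name TEXT PRIMARY KEY,"
        "  last_lsn   TEXT NOT NULL)"
    )

def record_progress(conn, table, lsn):
    # Upsert the watermark; commit only after the change batch
    # has been applied downstream, so restarts never skip data.
    conn.execute(
        "INSERT INTO sync_state (table_name, last_lsn) VALUES (?, ?) "
        "ON CONFLICT(table_name) DO UPDATE SET last_lsn = excluded.last_lsn",
        (table, lsn),
    )
    conn.commit()

def resume_point(conn, table):
    row = conn.execute(
        "SELECT last_lsn FROM sync_state WHERE table_name = ?", (table,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
init_state(conn)
record_progress(conn, "public.orders", "0/16B3748")
```

&lt;p&gt;The idea is that on restart, the engine reads the stored LSN and resumes the replication stream from there instead of re-snapshotting.&lt;/p&gt;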

&lt;p&gt;&lt;strong&gt;Why DuckDB?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DuckDB is an ideal engine for local analytics workloads.&lt;/p&gt;

&lt;p&gt;Key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;columnar storage&lt;/li&gt;
&lt;li&gt;vectorized execution&lt;/li&gt;
&lt;li&gt;high-performance SQL&lt;/li&gt;
&lt;li&gt;embedded runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows pg-warehouse to transform row-oriented PostgreSQL data into &lt;em&gt;columnar analytics tables&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Example workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;plays&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;last_activity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These pipelines generate derived feature tables that can be exported to Parquet.&lt;/p&gt;
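&lt;p&gt;In DuckDB, the export step itself is a single statement. For instance, the aggregation above could be written straight to Parquet like this (the output file name is chosen for the example):&lt;/p&gt;

```sql
-- Run the feature query and write the result directly to a Parquet file.
COPY (
    SELECT user_id, COUNT(*) AS plays, MAX(created_at) AS last_activity
    FROM raw.events
    GROUP BY user_id
) TO 'user_features.parquet' (FORMAT PARQUET);
```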

&lt;p&gt;&lt;strong&gt;Does It Scale?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Do 90% of AI Data Pipelines Actually Do?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When people talk about PostgreSQL AI data pipelines, the architecture often looks intimidating:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL → SOME HEAVY DISTRIBUTED SYSTEMS → SOME EXPENSIVE CLOUD DATA WAREHOUSE → Feature Store → ML Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But if you examine the actual work performed inside most pipelines, the reality is much simpler.&lt;/p&gt;

&lt;p&gt;Most pipelines are just SQL transformations. In practice, 90% of pipelines reduce to a few simple steps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a9pyu954y9mrcxy365h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a9pyu954y9mrcxy365h.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typical Pipeline Operations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;filter rows&lt;/li&gt;
&lt;li&gt;aggregate events&lt;/li&gt;
&lt;li&gt;join metadata&lt;/li&gt;
&lt;li&gt;compute features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are operations that columnar engines have optimized for decades. You don’t need a distributed compute cluster to do them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Tables Are Much Smaller Than Raw Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another reason pipelines are frequently over-engineered is a misunderstanding of how data volumes evolve across pipeline layers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Raw event streams can be large, but feature tables are dramatically smaller.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o7pqhi7n11rvlqhxadk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o7pqhi7n11rvlqhxadk.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you ingest 200 GB of raw events per day, your final feature tables might only be 2–10 GB. That is well within the capabilities of a single-node columnar engine like DuckDB.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The 90% Pipeline Design Target&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you optimize a system for the following workload profile:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2xk93pa0bx8kvq5s5uh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2xk93pa0bx8kvq5s5uh.png" alt=" " width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can cover roughly 90% of real-world AI feature pipelines.&lt;br&gt;
Everything beyond that tends to fall into hyperscale edge cases, such as:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;trillion-event streaming systems&lt;/li&gt;
&lt;li&gt;global ad-tech platforms&lt;/li&gt;
&lt;li&gt;massive graph processing&lt;/li&gt;
&lt;li&gt;real-time distributed ML training&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are important problems — but they are not the common case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Over-Engineering Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern data infrastructure often assumes that every pipeline must be built with distributed systems.&lt;/p&gt;

&lt;p&gt;This leads to stacks like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Kafka&lt;br&gt;
Spark&lt;br&gt;
Flink&lt;br&gt;
Airflow&lt;br&gt;
Data Warehouse&lt;br&gt;
Feature Store&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While powerful, these systems introduce significant overhead:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;infrastructure complexity&lt;br&gt;
operational cost&lt;br&gt;
latency&lt;br&gt;
specialized expertise&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For many teams, this complexity is unnecessary.&lt;/p&gt;

&lt;p&gt;A simpler architecture is often sufficient:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PostgreSQL
   ↓
DuckDB
   ↓
SQL feature pipelines
   ↓
Parquet datasets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where pg-warehouse fits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg-warehouse is designed specifically for this 90% case.&lt;/p&gt;

&lt;p&gt;It mirrors PostgreSQL OLTP data into a local DuckDB warehouse, where developers can run SQL feature pipelines and export datasets to Parquet.&lt;/p&gt;

&lt;p&gt;Instead of building complex distributed pipelines, developers can focus on the transformations that actually matter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PostgreSQL
   ↓
Replication stream
   ↓
pg-warehouse
   ↓
DuckDB
   ↓
SQL feature pipelines
   ↓
Parquet datasets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is a local-first analytics stack that removes unnecessary infrastructure while still supporting the core transformations used in most AI pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg-warehouse is designed around a clean separation of concerns.&lt;/p&gt;

&lt;p&gt;Open Core components include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sync Engine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL snapshot&lt;/li&gt;
&lt;li&gt;replication stream processing&lt;/li&gt;
&lt;li&gt;applying changes to DuckDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Warehouse Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DuckDB acts as the columnar analytics engine.&lt;/p&gt;

&lt;p&gt;Tables are stored under the raw schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline Engine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Developers write SQL files describing feature pipelines.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pipelines/user_features.sql&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;These pipelines read from raw tables and produce derived datasets.&lt;/p&gt;
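&lt;p&gt;An illustrative version of such a file follows; the actual contents of &lt;code&gt;user_features.sql&lt;/code&gt; in the repo may differ.&lt;/p&gt;

```sql
-- pipelines/user_features.sql (illustrative)
-- Reads from the mirrored raw schema and writes a derived feature table.
CREATE OR REPLACE TABLE feature.user_features AS
SELECT
    user_id,
    COUNT(*)        AS plays,
    MAX(created_at) AS last_activity
FROM raw.events
GROUP BY user_id;
```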

&lt;p&gt;&lt;strong&gt;Export Engine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exports datasets to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parquet&lt;/li&gt;
&lt;li&gt;CSV&lt;/li&gt;
&lt;li&gt;external storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Design Principles&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg-warehouse follows several design constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local-First&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The entire pipeline runs locally.&lt;/p&gt;

&lt;p&gt;No cloud infrastructure required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single Binary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg-warehouse is implemented in Go and compiled as a single executable.&lt;/p&gt;

&lt;p&gt;This simplifies deployment dramatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config-Driven&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All workflows are configured using a single YAML file.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my_warehouse&lt;/span&gt;

&lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres://warehouse:password@pg-host:5432/mydb&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;public&lt;/span&gt;

&lt;span class="na"&gt;duckdb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;raw&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./raw.duckdb&lt;/span&gt;
  &lt;span class="na"&gt;silver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./silver.duckdb&lt;/span&gt;
  &lt;span class="na"&gt;feature&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./feature.duckdb&lt;/span&gt;

&lt;span class="na"&gt;cdc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;publication_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgwh_pub&lt;/span&gt;
  &lt;span class="na"&gt;slot_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgwh_slot&lt;/span&gt;
  &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;public.orders&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;public.customers&lt;/span&gt;

&lt;span class="na"&gt;sync&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;incremental&lt;/span&gt;
  &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;public.orders&lt;/span&gt;
      &lt;span class="na"&gt;target_schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;raw&lt;/span&gt;
      &lt;span class="na"&gt;primary_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;watermark_column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;updated_at&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;public.customers&lt;/span&gt;
      &lt;span class="na"&gt;target_schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;raw&lt;/span&gt;
      &lt;span class="na"&gt;primary_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;watermark_column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;updated_at&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
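&lt;p&gt;Conceptually, the &lt;code&gt;watermark_column&lt;/code&gt; in the config above reduces each incremental round to “fetch rows newer than the last high-water mark”. A Python sketch of deriving that query from the config (all names here are hypothetical; the real engine works off the replication stream, with watermarks as incremental bookkeeping):&lt;/p&gt;

```python
def incremental_query(table, columns, watermark_column, last_watermark):
    """Build a SELECT that fetches only rows changed since the last sync.
    Naive string formatting is fine for a sketch; real code would use
    parameterized queries and proper identifier quoting."""
    cols = ", ".join(columns)
    return (
        f"SELECT {cols} FROM {table} "
        f"WHERE {watermark_column} > '{last_watermark}' "
        f"ORDER BY {watermark_column}"
    )

q = incremental_query(
    "public.orders",
    ["id", "total", "updated_at"],
    "updated_at",
    "2026-03-30 00:00:00",
)
```

&lt;p&gt;After applying the fetched rows, the engine would advance the stored watermark to the maximum &lt;code&gt;updated_at&lt;/code&gt; seen, making the next round pick up where this one left off.&lt;/p&gt;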



&lt;p&gt;&lt;strong&gt;State Durability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Replication progress is stored in a SQLite state database.&lt;/p&gt;

&lt;p&gt;This allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;warehouse rebuilds&lt;/li&gt;
&lt;li&gt;crash recovery&lt;/li&gt;
&lt;li&gt;deterministic incremental sync&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can delete and rebuild your DuckDB warehouse without losing sync state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hexagonal Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg-warehouse follows a ports and adapters architecture.&lt;/p&gt;

&lt;p&gt;Core logic is isolated from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL adapter&lt;/li&gt;
&lt;li&gt;DuckDB adapter&lt;/li&gt;
&lt;li&gt;export adapters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows enterprise extensions without modifying core logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open Core Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg-warehouse is an open core project.&lt;/p&gt;

&lt;p&gt;Open Source includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL replication-based data sync&lt;/li&gt;
&lt;li&gt;DuckDB local analytics warehouse&lt;/li&gt;
&lt;li&gt;SQL feature pipelines&lt;/li&gt;
&lt;li&gt;Parquet / CSV dataset export&lt;/li&gt;
&lt;li&gt;Local-first pipeline execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Enterprise features will include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud storage pipelines (S3 / GCS / Azure)&lt;/li&gt;
&lt;li&gt;Version-controlled SQL pipelines&lt;/li&gt;
&lt;li&gt;Distributed synchronization across nodes&lt;/li&gt;
&lt;li&gt;Production observability and monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data engineering is slowly shifting toward simpler architectures.&lt;/p&gt;

&lt;p&gt;Modern embedded engines like DuckDB make it possible to run serious analytics workloads without clusters.&lt;/p&gt;

&lt;p&gt;pg-warehouse aims to cover the 90% case for building datasets and AI features from PostgreSQL.&lt;/p&gt;

&lt;p&gt;For many teams:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PostgreSQL → DuckDB → Parquet&lt;br&gt;
is enough.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Project&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pg-warehouse is open core and available on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Looking for contributors!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/burnside-project/pg-warehouse" rel="noopener noreferrer"&gt;Repository&lt;br&gt;
&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/burnside-project/pg-warehouse/blob/main/docs/02-quickstart.md" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/burnside-project/pg-warehouse/blob/main/docs/08-development-workflow.md" rel="noopener noreferrer"&gt;Development Workflow&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>data</category>
      <category>warehouse</category>
      <category>duckdb</category>
    </item>
  </channel>
</rss>
