DuckDB changed what counts as a "simple query".
Not long ago, joining data from different places usually meant building some awkward temporary workflow first.
A Postgres table here.
Parquet files there.
Some files in S3.
Then came the usual mess: exports, staging tables, notebooks, temporary scripts, local folders with names like final_final_2, and a small amount of shame.
DuckDB made a lot of that feel unnecessary.
It can read Parquet directly, work with object storage through httpfs, and connect to PostgreSQL through the PostgreSQL extension. A lot of work that used to require exports or a small ETL job can now start as plain SQL.
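As a minimal sketch, the setup behind that claim looks roughly like this in DuckDB SQL; the connection string, secret values, and S3 paths below are illustrative placeholders, not real configuration:

```sql
-- Extensions for object storage and PostgreSQL connectivity
INSTALL httpfs;
LOAD httpfs;
INSTALL postgres;
LOAD postgres;

-- Credentials for S3-compatible storage (values are placeholders)
CREATE SECRET analytics_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1'
);

-- Attach a live PostgreSQL database under an alias
ATTACH 'dbname=prod host=db.example.com user=reader' AS pg_prod (TYPE postgres);

-- Parquet on S3 is queryable directly, no import step
SELECT count(*) FROM read_parquet('s3://analytics/events/*.parquet');
```

Once attached, Postgres tables are addressable as pg_prod.public.users, which is exactly the kind of alias the query below leans on.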
That is the good part.
The annoying part starts after the query works.
The question is no longer only:
Can this engine query the data?
Often the harder question is:
Can this workflow be trusted again next week without rebuilding everything around it?
DuckDB solved a large part of the query engine problem.
The workflow around it is still scattered.
DuckDB made cross-source SQL practical
The appeal is simple: query data where it already lives.
A database, a file, and object storage no longer have to mean three separate workflows. Instead of moving data first and asking questions later, DuckDB lets you ask the question closer to the source.
The interesting part is not that you can write a query like this:
```sql
SELECT
    u.id,
    u.email,
    count(e.event_id) AS event_count
FROM pg_prod.public.users u
JOIN read_parquet('s3://analytics/events/*.parquet') e
    ON e.user_id = u.id
GROUP BY u.id, u.email
ORDER BY event_count DESC;
```
The interesting part is that this kind of query does not need to start with:
"first, export everything."
No warehouse staging step.
No manual CSV detour.
No throwaway import table.
No small pipeline just to compare two datasets.
At least, not at first.
When the wrapper starts owning the job
The first version is usually innocent.
A small Python script wraps a DuckDB query, loads credentials, attaches a database, reads a few files, and writes the result somewhere.
That is a perfectly reasonable way to explore. Python is fast to change, DuckDB fits naturally into local scripts, and the first useful result arrives quickly.
The problem starts when the wrapper keeps collecting responsibilities.
Now it owns:
```
quick_test.py
├─ credentials
├─ S3 paths
├─ aliases
├─ export logic
├─ logging
├─ retries
└─ schedule
```
Nobody planned to build a pipeline.
But the script quietly became one.
That is usually the point where the query is no longer the problem. The SQL still works. The fragile part is everything needed to run it again with confidence.
Schema inspection is part of that confidence too.
Before trusting a cross-source query, you still need to know which tables exist, what columns are available, what the Parquet file contains, and whether user_id is an integer, a UUID, a string, or some historical accident with leading zeroes.
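In plain DuckDB, that inspection is a handful of ad-hoc commands the script has to grow around; a sketch, with illustrative paths:

```sql
-- What tables does the attached database expose?
SHOW ALL TABLES;

-- What columns and types does the Parquet data actually carry?
DESCRIBE SELECT * FROM read_parquet('s3://analytics/events/*.parquet');

-- Is user_id an integer, or strings with leading zeroes?
SELECT user_id, typeof(user_id) AS user_id_type
FROM read_parquet('s3://analytics/events/*.parquet')
LIMIT 5;
```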
A raw script can do all of this eventually.
But now the script is not just running SQL.
It is managing context.
What this looks like in DBConvert Streams
This is the part DBConvert Streams takes out of the script.
Instead of hiding the workflow in code, the sources stay visible:
- database connections
- file and S3 sources
- schemas
- rows
- query results
- export or load targets
For example, one query can combine:
- a film catalog in MySQL
- actor data in Parquet files on S3
- rentals and payments in PostgreSQL
The result is a single table with top-grossing films, rating, rental count, revenue, and cast list.
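A rough plain-DuckDB sketch of that query, assuming the MySQL and PostgreSQL databases are attached as mysql_dvd and pg_dvd and a simplified schema; every table, column, and path name here is an assumption for illustration:

```sql
WITH cast_lists AS (
    -- Pre-aggregate the Parquet actor data to one row per film,
    -- so the join below does not multiply payment rows
    SELECT film_id, string_agg(actor_name, ', ') AS cast_list
    FROM read_parquet('s3://analytics/actors/*.parquet')
    GROUP BY film_id
)
SELECT
    f.title,
    f.rating,
    count(DISTINCT r.rental_id) AS rental_count,
    sum(p.amount)               AS revenue,
    c.cast_list
FROM mysql_dvd.sakila.film f                              -- MySQL catalog
JOIN cast_lists c            ON c.film_id   = f.film_id   -- S3 Parquet
JOIN pg_dvd.public.rental r  ON r.film_id   = f.film_id   -- PostgreSQL
JOIN pg_dvd.public.payment p ON p.rental_id = r.rental_id
GROUP BY f.title, f.rating, c.cast_list
ORDER BY revenue DESC
LIMIT 10;
```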
No export from MySQL.
No staging table for Parquet.
No temporary PostgreSQL import.
No separate script just to glue the sources together.
In a script-first workflow, the SQL is only one piece. The rest is hidden in code: connection strings, credentials, S3 paths, aliases, output location, cleanup logic, and whatever gets added after the query starts being reused.
In DBConvert Streams, those parts live in the workspace instead.
The database connection is saved.
The file or S3 source is visible.
Schemas and rows can be inspected before writing the query.
The SQL still runs through DuckDB, but the surrounding workflow is no longer scattered across a script, a database IDE, and a separate export or migration tool.
The result can also become a source for a Stream.
That matters when the query is not just analysis, but preparation: join data from several places, filter it, validate the shape, and then load the result into another database or file target.
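In plain DuckDB terms, that last step is a COPY to a file target or a CREATE TABLE AS against an attached database; a sketch reusing the assumed names from above:

```sql
-- Write a query result straight to Parquet on S3...
COPY (
    SELECT film_id, title, rating
    FROM mysql_dvd.sakila.film
) TO 's3://analytics/out/film_catalog.parquet' (FORMAT parquet);

-- ...or materialize it into the attached PostgreSQL database
CREATE TABLE pg_dvd.public.film_catalog AS
SELECT film_id, title, rating
FROM mysql_dvd.sakila.film;
```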
That is the positioning:
DuckDB is the engine. DBConvert Streams is the workspace around the engine.
The goal is not to replace DuckDB.
The goal is to stop turning every useful DuckDB query into a small custom tool that somebody has to maintain.
A simple rule of thumb
Plain DuckDB is enough when the task is local, temporary, and owned by one person.
A workflow layer starts to make sense when the same sources come back again, credentials matter, schemas need to be inspected, or the query result becomes something people depend on.
That is the line to watch.
Not the moment the query becomes complex.
The moment the workflow becomes worth preserving.
DBConvert Streams uses DuckDB as the query engine for cross-source SQL across databases, local files, and S3-compatible storage.