Every Python developer loading data into PostgreSQL hits the same wall.
`executemany()` with 1M rows? 16 minutes. `df.to_sql()`? Same thing — it generates INSERT statements under the hood. Even `method='multi'` with chunking is slow, because each batch is still a SQL statement the server has to parse.
PostgreSQL has had a faster path since the 7.x releases: the COPY protocol. It bypasses per-row SQL parsing entirely and streams CSV or binary data to the server in a single operation. But wiring it up yourself means dealing with `copy_expert()`, CSV serialization, NULL handling, and type mapping.
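For reference, here is roughly what that manual path looks like with psycopg2 and pandas. This is a minimal sketch, not arrowjet's internals: it assumes a DataFrame whose columns already line up with the target table, and it skips type mapping and quoting edge cases entirely.

```python
import io

def manual_copy(conn, df, table: str) -> None:
    # Serialize the DataFrame to CSV in memory; \N marks NULLs.
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False, na_rep="\\N")
    buf.seek(0)
    with conn.cursor() as cur:
        # Stream the whole buffer through the COPY protocol.
        cur.copy_expert(
            f"COPY {table} FROM STDIN WITH (FORMAT csv, NULL '\\N')",
            buf,
        )
    conn.commit()
```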
I built arrowjet to wrap that in one line.
## The numbers
Benchmarked on RDS PostgreSQL 16.6 from an EC2 instance in the same region, with 1M rows (3 columns: BIGINT, DOUBLE PRECISION, VARCHAR):
Writes:

| Approach | Time | vs Arrowjet |
|---|---|---|
| executemany (batch 1000) | ~16 min | 850x slower |
| Multi-row VALUES (batch 1000) | 8.4s | 7.4x slower |
| Arrowjet | 1.13s | baseline |
Reads:

| Approach | Time | vs Arrowjet |
|---|---|---|
| cursor.fetchall() | 1.00s | 1.5x slower |
| Arrowjet | 0.65s | baseline |
The write speedup is the headline: 850x. That's not a typo. `executemany` sends each row as a separate protocol-level operation; COPY sends the entire dataset in one streaming operation.
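Concretely, the 16-minute baseline is roughly this shape (a sketch, not the exact benchmark harness): psycopg2's `executemany` is a client-side loop, so every row still costs a full parse/plan/execute round trip on the server.

```python
# conn: an open psycopg2 connection; df: the 1M-row DataFrame.
rows = list(df.itertuples(index=False, name=None))
with conn.cursor() as cur:
    # executemany over 1000-row chunks: one INSERT per row regardless.
    for i in range(0, len(rows), 1000):
        cur.executemany(
            "INSERT INTO target_table VALUES (%s, %s, %s)",
            rows[i : i + 1000],
        )
conn.commit()
```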
## How it works
```python
import arrowjet
import psycopg2

conn = psycopg2.connect(
    host="your-host", dbname="mydb",
    user="user", password="...",
)

engine = arrowjet.Engine(provider="postgresql")

# Write — COPY FROM STDIN (850x faster than executemany)
# (df is any pandas DataFrame)
engine.write_dataframe(conn, df, "target_table")

# Read — COPY TO STDOUT (1.5x faster than fetchall)
result = engine.read_bulk(conn, "SELECT * FROM target_table")
df = result.to_pandas()
```
No S3 bucket. No IAM role. No staging config. Just your existing psycopg2 connection.
## Works everywhere
The COPY protocol is core PostgreSQL — not an AWS extension. This works with:
- Aurora PostgreSQL
- RDS PostgreSQL
- Self-hosted PostgreSQL
- Docker PostgreSQL
- Supabase, Neon, CockroachDB
If psycopg2 can connect to it, arrowjet can bulk-load it.
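For example, against a throwaway local container (the credentials below are just the stock `postgres` image defaults, nothing arrowjet-specific):

```python
# docker run -d -e POSTGRES_PASSWORD=pass -p 5432:5432 postgres:16
import arrowjet
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="postgres", user="postgres", password="pass",
)
engine = arrowjet.Engine(provider="postgresql")
engine.write_dataframe(conn, df, "target_table")  # df as above
```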
## Bring your own connection
If you already have connection management (Airflow DAGs, ETL scripts, Django), you don't need to change it:
```python
import arrowjet

# Your existing connection — keep it
conn = your_existing_pg_connection()

engine = arrowjet.Engine(provider="postgresql")
engine.write_dataframe(conn, df, "my_table")
result = engine.read_bulk(conn, "SELECT * FROM my_table")
```
Arrowjet doesn't own the connection. It just does the bulk part.
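That also means transaction control stays with you. One consequence, sketched below, assuming (an assumption about arrowjet's behavior, not documented API) that writes join whatever transaction the connection already has open:

```python
try:
    engine.write_dataframe(conn, df, "my_table")
    conn.commit()    # you decide when the load becomes visible
except Exception:
    conn.rollback()  # and what happens when it fails
    raise
```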
## Also supports Redshift
Arrowjet started as a Redshift bulk engine (COPY/UNLOAD via S3). The PostgreSQL provider is new in v0.3.0. Same API, different execution strategy:
```python
# PostgreSQL — COPY protocol, no staging
pg_engine = arrowjet.Engine(provider="postgresql")

# Redshift — COPY/UNLOAD via S3
rs_engine = arrowjet.Engine(
    provider="redshift",
    staging_bucket="your-bucket",
    staging_iam_role="arn:aws:iam::123:role/RedshiftS3",
    staging_region="us-east-1",
)

# Same API for both
engine = pg_engine  # or rs_engine
engine.write_dataframe(conn, df, "my_table")
result = engine.read_bulk(conn, "SELECT * FROM my_table")
```
Redshift benchmarks: 4x faster reads, 14,000x faster writes.
## CLI
```bash
arrowjet export --provider postgresql \
  --query "SELECT * FROM users" \
  --to ./users.parquet \
  --host your-host --password your-pass
```
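The output is an ordinary Parquet file, so reading it back downstream needs nothing arrowjet-specific (pandas shown; any Arrow-compatible reader works):

```python
import pandas as pd

df = pd.read_parquet("./users.parquet")
```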
## What it doesn't do
- It's PostgreSQL and Redshift only for now. MySQL is next on the roadmap.
- The read speedup (1.5x) is modest compared to writes (850x). The COPY protocol advantage for reads grows with data size and network latency.
- You need `psycopg2` or `psycopg3`; other drivers won't work (no COPY protocol support).
## Try it
```bash
pip install arrowjet
```
- GitHub: github.com/arrowjet/arrowjet
- Docs: configuration, IAM setup
- Redshift benchmarks: the "4x faster reads" blog post
MIT. Feedback welcome — especially from anyone doing bulk PostgreSQL loads at scale.
