abdu masah

850x Faster PostgreSQL Writes With One Line of Python

Every Python developer loading data into PostgreSQL hits the same wall.

executemany() with 1M rows? 16 minutes. df.to_sql()? Same thing — it generates INSERT statements under the hood. Even method='multi' with chunking is slow because each batch is still a SQL statement parsed by the server.
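
For reference, the slow baseline looks roughly like this (the table, columns, and connection details are placeholders):

import pandas as pd
import psycopg2

conn = psycopg2.connect(host="your-host", dbname="mydb", user="user", password="...")
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3], "name": ["a", "b", "c"]})

# One INSERT per row, batched only on the client; at 1M rows this is the ~16-minute path
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO target_table (id, value, name) VALUES (%s, %s, %s)",
        list(df.itertuples(index=False, name=None)),
    )
conn.commit()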

PostgreSQL has had a faster path since version 7.x: the COPY protocol. It bypasses the SQL parser entirely and streams CSV or binary data directly to the storage engine. But wiring it up yourself means dealing with copy_expert(), CSV serialization, NULL handling, and type mapping.
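
If you wire it up yourself, a minimal sketch with plain psycopg2 looks something like this (table and column names are placeholders, and the NULL handling shown is the bare minimum):

import io
import pandas as pd
import psycopg2

conn = psycopg2.connect(host="your-host", dbname="mydb", user="user", password="...")
df = pd.DataFrame({"id": [1, 2], "value": [3.14, None], "name": ["a", "b"]})

# Serialize the frame to CSV in memory, encoding NULLs as \N
buf = io.StringIO()
df.to_csv(buf, index=False, header=False, na_rep="\\N")
buf.seek(0)

# Stream the whole buffer to the server in a single COPY operation
with conn.cursor() as cur:
    cur.copy_expert(
        "COPY target_table (id, value, name) FROM STDIN WITH (FORMAT csv, NULL '\\N')",
        buf,
    )
conn.commit()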

I built arrowjet to wrap that in one line.

The numbers

Benchmarked on RDS PostgreSQL 16.6 from an EC2 instance in the same region, with 1M rows (3 columns: BIGINT, DOUBLE PRECISION, VARCHAR):

Writes:

Approach                        Time      vs Arrowjet
executemany (batch 1000)        ~16 min   850x slower
Multi-row VALUES (batch 1000)   8.4s      7.4x slower
Arrowjet                        1.13s     baseline

Reads:

Approach             Time     vs Arrowjet
cursor.fetchall()    1.00s    1.5x slower
Arrowjet             0.65s    baseline

The write speedup is the headline: 850x. That's not a typo. executemany sends each row as a separate protocol-level operation. COPY sends the entire dataset in one streaming operation.

How it works

import arrowjet
import psycopg2

conn = psycopg2.connect(
    host="your-host", dbname="mydb",
    user="user", password="...",
)

engine = arrowjet.Engine(provider="postgresql")

# Write — COPY FROM STDIN (850x faster than executemany); df is your pandas DataFrame
engine.write_dataframe(conn, df, "target_table")

# Read — COPY TO STDOUT (1.5x faster than fetchall)
result = engine.read_bulk(conn, "SELECT * FROM target_table")
df = result.to_pandas()

No S3 bucket. No IAM role. No staging config. Just your existing psycopg2 connection.

Works everywhere

The COPY protocol is core PostgreSQL — not an AWS extension. This works with:

  • Aurora PostgreSQL
  • RDS PostgreSQL
  • Self-hosted PostgreSQL
  • Docker PostgreSQL
  • Supabase, Neon, CockroachDB

If psycopg2 can connect to it, arrowjet can bulk-load it.

Bring your own connection

If you already have connection management (Airflow DAGs, ETL scripts, Django), you don't need to change it:

import arrowjet

# Your existing connection — keep it
conn = your_existing_pg_connection()

engine = arrowjet.Engine(provider="postgresql")
engine.write_dataframe(conn, df, "my_table")
result = engine.read_bulk(conn, "SELECT * FROM my_table")

Arrowjet doesn't own the connection. It just does the bulk part.
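
If your existing setup hands out SQLAlchemy engines rather than raw psycopg2 connections, you can still peel off the underlying DBAPI connection. This is a sketch under the assumption that arrowjet accepts the connection object raw_connection() returns; the URL and table name are placeholders:

import arrowjet
import pandas as pd
from sqlalchemy import create_engine

sa_engine = create_engine("postgresql+psycopg2://user:pass@your-host/mydb")
df = pd.DataFrame({"id": [1, 2, 3]})

aj_engine = arrowjet.Engine(provider="postgresql")

# raw_connection() exposes the pooled psycopg2 connection underneath SQLAlchemy
raw_conn = sa_engine.raw_connection()
try:
    aj_engine.write_dataframe(raw_conn, df, "my_table")
    raw_conn.commit()
finally:
    raw_conn.close()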

Also supports Redshift

Arrowjet started as a Redshift bulk engine (COPY/UNLOAD via S3). The PostgreSQL provider is new in v0.3.0. Same API, different execution strategy:

# PostgreSQL — COPY protocol, no staging
pg_engine = arrowjet.Engine(provider="postgresql")

# Redshift — COPY/UNLOAD via S3
rs_engine = arrowjet.Engine(
    provider="redshift",
    staging_bucket="your-bucket",
    staging_iam_role="arn:aws:iam::123:role/RedshiftS3",
    staging_region="us-east-1",
)

# Same API for both (shown here with the PostgreSQL engine)
pg_engine.write_dataframe(conn, df, "my_table")
result = pg_engine.read_bulk(conn, "SELECT * FROM my_table")

Redshift benchmarks: 4x faster reads, 14,000x faster writes.

CLI

arrowjet export --provider postgresql \
  --query "SELECT * FROM users" \
  --to ./users.parquet \
  --host your-host --password your-pass

What it doesn't do

  • It's PostgreSQL and Redshift only for now. MySQL is next on the roadmap.
  • The read speedup (1.5x) is modest compared to writes (850x). The COPY protocol advantage for reads grows with data size and network latency.
  • You need psycopg2 or psycopg3. Other PostgreSQL drivers won't work (arrowjet relies on their COPY protocol support).

Try it

pip install arrowjet

MIT. Feedback welcome — especially from anyone doing bulk PostgreSQL loads at scale.
