Every Python developer loading data into PostgreSQL hits the same wall.
`executemany()` with 1M rows? 16 minutes. `df.to_sql()`? Same thing — it generates INSERT statements under the hood. Even `method='multi'` with chunking is slow, because each batch is still a SQL statement the server has to parse.
PostgreSQL has had a faster path since the 7.x releases: the COPY protocol. It bypasses per-row SQL parsing entirely and streams CSV or binary data to the server in a single operation. But wiring it up yourself means dealing with `copy_expert()`, CSV serialization, NULL handling, and type mapping.
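For reference, here is roughly what that manual path looks like with psycopg2 and pandas. This is a minimal sketch, not arrowjet's internals: it assumes a DataFrame whose columns already line up with the target table, and it skips type mapping and quoting edge cases entirely.

```python
import io

def manual_copy(conn, df, table: str) -> None:
    # Serialize the DataFrame to CSV in memory; \N marks NULLs.
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False, na_rep="\\N")
    buf.seek(0)
    with conn.cursor() as cur:
        # Stream the whole buffer through the COPY protocol.
        cur.copy_expert(
            f"COPY {table} FROM STDIN WITH (FORMAT csv, NULL '\\N')",
            buf,
        )
    conn.commit()
```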
I built arrowjet to wrap that in one line.
## The numbers
Benchmarked on RDS PostgreSQL 16.6 from an EC2 instance in the same region, with 1M rows (3 columns: BIGINT, DOUBLE PRECISION, VARCHAR):
Writes:

| Approach | Time | vs Arrowjet |
|---|---|---|
| executemany (batch 1000) | ~16 min | 850x slower |
| Multi-row VALUES (batch 1000) | 8.4s | 7.4x slower |
| Arrowjet | 1.13s | baseline |
Reads:

| Approach | Time | vs Arrowjet |
|---|---|---|
| cursor.fetchall() | 1.00s | 1.5x slower |
| Arrowjet | 0.65s | baseline |
The write speedup is the headline: 850x. That's not a typo. `executemany` sends each row as a separate protocol-level operation; COPY sends the entire dataset in one streaming operation.
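Concretely, the 16-minute baseline is roughly this shape (a sketch, not the exact benchmark harness): psycopg2's `executemany` is a client-side loop, so every row still costs a full parse/plan/execute round trip on the server.

```python
# conn: an open psycopg2 connection; df: the 1M-row DataFrame.
rows = list(df.itertuples(index=False, name=None))
with conn.cursor() as cur:
    # executemany over 1000-row chunks: one INSERT per row regardless.
    for i in range(0, len(rows), 1000):
        cur.executemany(
            "INSERT INTO target_table VALUES (%s, %s, %s)",
            rows[i : i + 1000],
        )
conn.commit()
```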
## How it works
```python
import arrowjet
import psycopg2

conn = psycopg2.connect(
    host="your-host", dbname="mydb",
    user="user", password="...",
)

engine = arrowjet.Engine(provider="postgresql")

# Write — COPY FROM STDIN (850x faster than executemany)
# (df is any pandas DataFrame)
engine.write_dataframe(conn, df, "target_table")

# Read — COPY TO STDOUT (1.5x faster than fetchall)
result = engine.read_bulk(conn, "SELECT * FROM target_table")
df = result.to_pandas()
```
No S3 bucket. No IAM role. No staging config. Just your existing psycopg2 connection.
## Works everywhere
The COPY protocol is core PostgreSQL — not an AWS extension. This works with:
- Aurora PostgreSQL
- RDS PostgreSQL
- Self-hosted PostgreSQL
- Docker PostgreSQL
- Supabase, Neon, CockroachDB
If psycopg2 can connect to it, arrowjet can bulk-load it.
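For example, against a throwaway local container (the credentials below are just the stock `postgres` image defaults, nothing arrowjet-specific):

```python
# docker run -d -e POSTGRES_PASSWORD=pass -p 5432:5432 postgres:16
import arrowjet
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="postgres", user="postgres", password="pass",
)
engine = arrowjet.Engine(provider="postgresql")
engine.write_dataframe(conn, df, "target_table")  # df as above
```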
## Bring your own connection
If you already have connection management (Airflow DAGs, ETL scripts, Django), you don't need to change it:
```python
import arrowjet

# Your existing connection — keep it
conn = your_existing_pg_connection()

engine = arrowjet.Engine(provider="postgresql")
engine.write_dataframe(conn, df, "my_table")
result = engine.read_bulk(conn, "SELECT * FROM my_table")
```
Arrowjet doesn't own the connection. It just does the bulk part.
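That also means transaction control stays with you. One consequence, sketched below, assuming (an assumption about arrowjet's behavior, not documented API) that writes join whatever transaction the connection already has open:

```python
try:
    engine.write_dataframe(conn, df, "my_table")
    conn.commit()    # you decide when the load becomes visible
except Exception:
    conn.rollback()  # and what happens when it fails
    raise
```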
## Also supports Redshift
Arrowjet started as a Redshift bulk engine (COPY/UNLOAD via S3). The PostgreSQL provider is new in v0.3.0. Same API, different execution strategy:
```python
# PostgreSQL — COPY protocol, no staging
pg_engine = arrowjet.Engine(provider="postgresql")

# Redshift — COPY/UNLOAD via S3
rs_engine = arrowjet.Engine(
    provider="redshift",
    staging_bucket="your-bucket",
    staging_iam_role="arn:aws:iam::123:role/RedshiftS3",
    staging_region="us-east-1",
)

# Same API for both
engine = pg_engine  # or rs_engine
engine.write_dataframe(conn, df, "my_table")
result = engine.read_bulk(conn, "SELECT * FROM my_table")
```
Redshift benchmarks: 4x faster reads, 14,000x faster writes.
## CLI
```bash
arrowjet export --provider postgresql \
  --query "SELECT * FROM users" \
  --to ./users.parquet \
  --host your-host --password your-pass
```
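The output is an ordinary Parquet file, so reading it back downstream needs nothing arrowjet-specific (pandas shown; any Arrow-compatible reader works):

```python
import pandas as pd

df = pd.read_parquet("./users.parquet")
```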
## What it doesn't do
- It's PostgreSQL and Redshift only for now. MySQL is next on the roadmap.
- The read speedup (1.5x) is modest compared to writes (850x). The COPY protocol advantage for reads grows with data size and network latency.
- You need `psycopg2` or `psycopg3`; other drivers won't work (no COPY protocol support).
## Try it
```bash
pip install arrowjet
```
- GitHub: github.com/arrowjet/arrowjet
- Docs: configuration, IAM setup
- Redshift benchmarks: the "4x faster reads" blog post
MIT. Feedback welcome — especially from anyone doing bulk PostgreSQL loads at scale.
