abdu masah
4x Faster Redshift Reads With One Line of Python

Arrowjet wraps Redshift's COPY/UNLOAD commands in a simple Python API. Benchmarked at 4x faster reads and 3,000x faster writes than standard drivers.

Standard Redshift drivers fetch data row-by-row over the PostgreSQL wire protocol. For 10M rows, cursor.fetchall() takes about 145 seconds on a 4-node cluster.

That's not a Redshift problem. That's a wire protocol problem.

Redshift has a much faster path: UNLOAD to S3 as Parquet, then read the files. This is how AWS moves data internally. But wiring it up yourself means 30-50 lines of boilerplate every time — S3 paths, IAM roles, Parquet conversion, cleanup, error handling.
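For a sense of what that boilerplate involves, here's a minimal sketch of just the statement-building step — escaping quotes and templating the UNLOAD. The function and parameter names are my own illustration, not Arrowjet's internals:

```python
def build_unload_sql(query: str, s3_prefix: str, iam_role: str) -> str:
    """Template a Redshift UNLOAD statement that writes Parquet to S3.

    Redshift wraps the inner query in single quotes, so any single
    quotes inside it must be doubled.
    """
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET"
    )

sql = build_unload_sql(
    "SELECT * FROM events WHERE date > '2025-01-01'",
    "s3://staging-bucket/job-1234/part_",
    "arn:aws:iam::123456789:role/RedshiftS3Role",
)
```

And that's before the S3 client setup, waiting on the unload, Parquet-to-DataFrame conversion, and cleanup.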

I built Arrowjet to wrap that in one line.

The numbers

I benchmarked on a 4-node ra3.large cluster with an EC2 instance in the same region. Each test ran 5 iterations in randomized order to eliminate ordering bias.

Reads (10M rows):

| Approach | Time | Speedup |
| --- | --- | --- |
| cursor.fetchall() | 144.5s | baseline |
| Manual UNLOAD script | ~58s | 2.5x |
| Arrowjet | 36.3s | 4x |

Writes (1M rows):

| Approach | Time | vs Arrowjet |
| --- | --- | --- |
| write_dataframe() INSERT | ~23 hours | 3,138x slower |
| Manual COPY script | 11.1s | parity |
| Arrowjet | 11.7s | baseline |

The read speedup scales with cluster size — more nodes means more UNLOAD parallelism. On an 8-node cluster, you'd expect ~7x. The write path matches what a competent engineer scripts manually, but in one line with automatic cleanup.

How it works

Reads: your query goes through UNLOAD → S3 → Parquet → Arrow.
Writes: your data goes through Arrow → Parquet → S3 → COPY.

For small queries, Arrowjet uses the standard PostgreSQL wire protocol (safe mode). The bulk path only fires when you explicitly ask for it.

import arrowjet

conn = arrowjet.connect(
    host="your-cluster.region.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
    staging_bucket="your-staging-bucket",
    staging_iam_role="arn:aws:iam::123456789:role/RedshiftS3Role",
    staging_region="us-east-1",
)

# Bulk read — 4x faster for large results
result = conn.read_bulk("SELECT * FROM events WHERE date > '2025-01-01'")
df = result.to_pandas()

# Bulk write — COPY-speed with one line
conn.write_dataframe(my_dataframe, "target_table")

# Small queries still use the normal path
df = conn.fetch_dataframe("SELECT COUNT(*) FROM events")

Bring your own connection

If you already have connection management (Airflow DAGs, ETL scripts, dbt hooks), you don't need to change it. The Engine API takes any DBAPI connection:

import arrowjet
import redshift_connector

conn = redshift_connector.connect(host=..., database=..., ...)

engine = arrowjet.Engine(
    staging_bucket="your-bucket",
    staging_iam_role="arn:aws:iam::123:role/RedshiftS3Role",
    staging_region="us-east-1",
)

result = engine.read_bulk(conn, "SELECT * FROM events")
engine.write_dataframe(conn, df, "target_table")

Works with redshift_connector, psycopg2, ADBC, or anything DBAPI-compatible.

CLI

arrowjet export --query "SELECT * FROM sales" --to s3://bucket/sales/
arrowjet import --from s3://bucket/data/ --to sales_table
arrowjet preview --file ./out.parquet
arrowjet validate --table sales --row-count --schema --sample

The S3-direct export path runs UNLOAD straight to your destination — data goes Redshift → S3 with no client roundtrip.

What it doesn't do

  • It's Redshift-only for now. The provider abstraction is built (adding Snowflake/BigQuery means implementing one interface), but Redshift is the first and only backend today.
  • Bulk mode has S3 overhead. For queries returning 100 rows, the standard wire protocol is faster. Arrowjet's sweet spot is 100K+ rows.
  • You need an S3 bucket in the same region as your cluster, and an IAM role attached to Redshift with S3 access. The IAM setup guide covers three deployment models.
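Given that crossover, a thin dispatch helper is one way to keep call sites uniform: take the bulk path only past a row estimate you supply yourself. This is my own sketch on top of the API shown above, not something Arrowjet ships:

```python
BULK_THRESHOLD = 100_000  # rough crossover from the post: bulk pays off at 100K+ rows

def fetch_smart(conn, query: str, estimated_rows: int):
    """Route a query to the bulk UNLOAD path or the normal wire protocol.

    `conn` is an arrowjet connection; `estimated_rows` is the caller's
    own estimate (e.g. from table stats or a prior COUNT(*)).
    """
    if estimated_rows >= BULK_THRESHOLD:
        return conn.read_bulk(query).to_pandas()
    return conn.fetch_dataframe(query)
```

The threshold is workload-dependent — the S3 round trip is a fixed cost, so tune it against your own cluster and query shapes.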

Try it

pip install arrowjet              # core + CLI
pip install arrowjet[redshift]    # + Redshift drivers

Apache 2.0. Feedback welcome — especially from anyone doing bulk Redshift work at scale.
