abdu masah
4x Faster Redshift Reads With One Line of Python

Arrowjet wraps Redshift's COPY/UNLOAD commands in a simple Python API. Benchmarked at 4x faster reads and 3,000x faster writes than standard drivers.

Standard Redshift drivers fetch data row-by-row over the PostgreSQL wire protocol. For 10M rows, cursor.fetchall() takes about 145 seconds on a 4-node cluster.

That's not a Redshift problem. That's a wire protocol problem.

Redshift has a much faster path: UNLOAD to S3 as Parquet, then read the files. This is how AWS moves data internally. But wiring it up yourself means 30-50 lines of boilerplate every time — S3 paths, IAM roles, Parquet conversion, cleanup, error handling.
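For a sense of what that boilerplate involves, here's a minimal sketch of just the statement-building step — escaping quotes and templating the UNLOAD. The function and parameter names are my own illustration, not Arrowjet's internals:

```python
def build_unload_sql(query: str, s3_prefix: str, iam_role: str) -> str:
    """Template a Redshift UNLOAD statement that writes Parquet to S3.

    Redshift wraps the inner query in single quotes, so any single
    quotes inside it must be doubled.
    """
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET"
    )

sql = build_unload_sql(
    "SELECT * FROM events WHERE date > '2025-01-01'",
    "s3://staging-bucket/job-1234/part_",
    "arn:aws:iam::123456789:role/RedshiftS3Role",
)
```

And that's before the S3 client setup, waiting on the unload, Parquet-to-DataFrame conversion, and cleanup.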

I built Arrowjet to wrap that in one line.

The numbers

I benchmarked on a 4-node ra3.large cluster with an EC2 instance in the same region. Each test ran 5 iterations in randomized order to eliminate ordering bias.

Reads (10M rows):

| Approach | Time | Speedup |
| --- | --- | --- |
| cursor.fetchall() | 144.5s | baseline |
| Manual UNLOAD script | ~58s | 2.5x |
| Arrowjet | 36.3s | 4x |

Writes (1M rows):

| Approach | Time | vs Arrowjet |
| --- | --- | --- |
| write_dataframe() INSERT | ~23 hours | 3,138x slower |
| Manual COPY script | 11.1s | parity |
| Arrowjet | 11.7s | baseline |

The read speedup scales with cluster size — more nodes means more UNLOAD parallelism. On an 8-node cluster, you'd expect ~7x. The write path matches what a competent engineer scripts manually, but in one line with automatic cleanup.

How it works

Reads: your query goes through UNLOAD → S3 → Parquet → Arrow.
Writes: your data goes through Arrow → Parquet → S3 → COPY.

For small queries, Arrowjet uses the standard PostgreSQL wire protocol (safe mode). The bulk path only fires when you explicitly ask for it.

import arrowjet

conn = arrowjet.connect(
    host="your-cluster.region.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
    staging_bucket="your-staging-bucket",
    staging_iam_role="arn:aws:iam::123456789:role/RedshiftS3Role",
    staging_region="us-east-1",
)

# Bulk read — 4x faster for large results
result = conn.read_bulk("SELECT * FROM events WHERE date > '2025-01-01'")
df = result.to_pandas()

# Bulk write — COPY-speed with one line
conn.write_dataframe(my_dataframe, "target_table")

# Small queries still use the normal path
df = conn.fetch_dataframe("SELECT COUNT(*) FROM events")

Bring your own connection

If you already have connection management (Airflow DAGs, ETL scripts, dbt hooks), you don't need to change it. The Engine API takes any DBAPI connection:

import arrowjet
import redshift_connector

conn = redshift_connector.connect(host=..., database=..., ...)

engine = arrowjet.Engine(
    staging_bucket="your-bucket",
    staging_iam_role="arn:aws:iam::123:role/RedshiftS3Role",
    staging_region="us-east-1",
)

result = engine.read_bulk(conn, "SELECT * FROM events")
engine.write_dataframe(conn, df, "target_table")

Works with redshift_connector, psycopg2, ADBC, or anything DBAPI-compatible.

CLI

arrowjet export --query "SELECT * FROM sales" --to s3://bucket/sales/
arrowjet import --from s3://bucket/data/ --to sales_table
arrowjet preview --file ./out.parquet
arrowjet validate --table sales --row-count --schema --sample

The S3-direct export path runs UNLOAD straight to your destination — data goes Redshift → S3 with no client roundtrip.

What it doesn't do

  • It's Redshift-only for now. The provider abstraction is built (adding Snowflake/BigQuery means implementing one interface), but Redshift is the first and only backend today.
  • Bulk mode has S3 overhead. For queries returning 100 rows, the standard wire protocol is faster. Arrowjet's sweet spot is 100K+ rows.
  • You need an S3 bucket in the same region as your cluster, and an IAM role attached to Redshift with S3 access. The IAM setup guide covers three deployment models.
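Given that crossover, a thin dispatch helper is one way to keep call sites uniform: take the bulk path only past a row estimate you supply yourself. This is my own sketch on top of the API shown above, not something Arrowjet ships:

```python
BULK_THRESHOLD = 100_000  # rough crossover from the post: bulk pays off at 100K+ rows

def fetch_smart(conn, query: str, estimated_rows: int):
    """Route a query to the bulk UNLOAD path or the normal wire protocol.

    `conn` is an arrowjet connection; `estimated_rows` is the caller's
    own estimate (e.g. from table stats or a prior COUNT(*)).
    """
    if estimated_rows >= BULK_THRESHOLD:
        return conn.read_bulk(query).to_pandas()
    return conn.fetch_dataframe(query)
```

The threshold is workload-dependent — the S3 round trip is a fixed cost, so tune it against your own cluster and query shapes.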

Try it

pip install arrowjet              # core + CLI
pip install arrowjet[redshift]    # + Redshift drivers

Apache 2.0. Feedback welcome — especially from anyone doing bulk Redshift work at scale.
