Arrowjet wraps Redshift's COPY/UNLOAD commands in a simple Python API. Benchmarked at 4x faster reads and 3,000x faster writes than standard drivers.
Standard Redshift drivers fetch data row-by-row over the PostgreSQL wire protocol. For 10M rows, `cursor.fetchall()` takes about 145 seconds on a 4-node cluster.
That's not a Redshift problem. That's a wire protocol problem.
Redshift has a much faster path: UNLOAD to S3 as Parquet, then read the files. This is how AWS moves data internally. But wiring it up yourself means 30-50 lines of boilerplate every time — S3 paths, IAM roles, Parquet conversion, cleanup, error handling.
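To give a sense of that boilerplate, here is a minimal sketch of just the statement-building piece of a manual UNLOAD (the helper name and S3 paths are illustrative, and a real script still needs execution, staging cleanup, and error handling on top of this):

```python
def build_unload(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift UNLOAD statement that writes Parquet to S3.

    Illustrative only: a real script also wraps this in execute(),
    cleans up the staging prefix, and handles failures.
    """
    # UNLOAD takes the query as a quoted literal, so single quotes double up
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET"
    )

stmt = build_unload(
    "SELECT * FROM events WHERE date > '2025-01-01'",
    "s3://my-staging-bucket/unload/job-42/",
    "arn:aws:iam::123456789:role/RedshiftS3Role",
)
```

And that is only the SQL; the other 30-40 lines are the S3 client, the Parquet reads, and the cleanup.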
I built Arrowjet to wrap that in one line.
## The numbers
I benchmarked on a 4-node ra3.large cluster with an EC2 instance in the same region. Each test ran 5 iterations in randomized order to eliminate ordering bias.
Reads (10M rows):
| Approach | Time | Speedup |
|---|---|---|
| `cursor.fetchall()` | 144.5s | baseline |
| Manual UNLOAD script | ~58s | 2.5x |
| Arrowjet | 36.3s | 4x |
Writes (1M rows):
| Approach | Time | vs Arrowjet |
|---|---|---|
| `write_dataframe()` INSERT | ~23 hours | 3,138x slower |
| Manual COPY script | 11.1s | parity |
| Arrowjet | 11.7s | baseline |
The read speedup scales with cluster size — more nodes means more UNLOAD parallelism. On an 8-node cluster, you'd expect ~7x. The write path matches what a competent engineer scripts manually, but in one line with automatic cleanup.
## How it works
Reads: your query goes through UNLOAD → S3 → Parquet → Arrow.
Writes: your data goes through Arrow → Parquet → S3 → COPY.
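The load step of that write path boils down to a COPY statement over the staged Parquet files. A minimal sketch of that statement (a hypothetical helper, not Arrowjet's actual internals):

```python
def build_copy(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that loads staged Parquet files.

    Illustrative only: the real write path also converts the frame to
    Parquet, uploads it, and deletes the staging files afterwards.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET"
    )

stmt = build_copy(
    "target_table",
    "s3://my-staging-bucket/copy/job-42/",
    "arn:aws:iam::123456789:role/RedshiftS3Role",
)
```

Because COPY ingests all the staged files in parallel across the cluster, this is where the orders-of-magnitude gap over row-by-row INSERTs comes from.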
For small queries, Arrowjet uses the standard PostgreSQL wire protocol (safe mode). The bulk path only fires when you explicitly ask for it.
```python
import arrowjet

conn = arrowjet.connect(
    host="your-cluster.region.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
    staging_bucket="your-staging-bucket",
    staging_iam_role="arn:aws:iam::123456789:role/RedshiftS3Role",
    staging_region="us-east-1",
)

# Bulk read — 4x faster for large results
result = conn.read_bulk("SELECT * FROM events WHERE date > '2025-01-01'")
df = result.to_pandas()

# Bulk write — COPY-speed with one line
conn.write_dataframe(my_dataframe, "target_table")

# Small queries still use the normal path
df = conn.fetch_dataframe("SELECT COUNT(*) FROM events")
```
## Bring your own connection
If you already have connection management (Airflow DAGs, ETL scripts, dbt hooks), you don't need to change it. The Engine API takes any DBAPI connection:
```python
import arrowjet
import redshift_connector

conn = redshift_connector.connect(host=..., database=..., ...)

engine = arrowjet.Engine(
    staging_bucket="your-bucket",
    staging_iam_role="arn:aws:iam::123:role/RedshiftS3Role",
    staging_region="us-east-1",
)

result = engine.read_bulk(conn, "SELECT * FROM events")
engine.write_dataframe(conn, df, "target_table")
```
Works with `redshift_connector`, `psycopg2`, ADBC, or anything DBAPI-compatible.
## CLI
```shell
arrowjet export --query "SELECT * FROM sales" --to s3://bucket/sales/
arrowjet import --from s3://bucket/data/ --to sales_table
arrowjet preview --file ./out.parquet
arrowjet validate --table sales --row-count --schema --sample
```
The S3-direct export path runs UNLOAD straight to your destination — data goes Redshift → S3 with no client roundtrip.
## What it doesn't do
- It's Redshift-only for now. The provider abstraction is built (adding Snowflake/BigQuery means implementing one interface), but Redshift is the first and only backend today.
- Bulk mode has S3 overhead. For queries returning 100 rows, the standard wire protocol is faster. Arrowjet's sweet spot is 100K+ rows.
- You need an S3 bucket in the same region as your cluster, and an IAM role attached to Redshift with S3 access. The IAM setup guide covers three deployment models.
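That sweet spot suggests a simple row-count cutover when deciding which path to call. The heuristic below is an illustrative sketch (`pick_read_path` and the threshold are mine, not part of Arrowjet's API or dispatch logic):

```python
# Rough break-even point suggested by the post: below this,
# S3 staging overhead outweighs the parallel UNLOAD win.
BULK_THRESHOLD = 100_000

def pick_read_path(estimated_rows: int) -> str:
    """Pick the read path that the staging overhead favors."""
    if estimated_rows >= BULK_THRESHOLD:
        return "bulk_unload"   # UNLOAD -> S3 -> Parquet -> Arrow
    return "wire_protocol"     # plain cursor fetch, no staging

path = pick_read_path(10_000_000)
```

In practice you rarely know the row count up front, which is why Arrowjet makes the bulk path explicit (`read_bulk`) rather than guessing for you.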
## Try it
```shell
pip install arrowjet             # core + CLI
pip install arrowjet[redshift]   # + Redshift drivers
```
- GitHub: github.com/arrowjet/arrowjet
- Docs: configuration, CLI reference, IAM setup
- Migration from redshift_connector: migration guide
Apache 2.0. Feedback welcome — especially from anyone doing bulk Redshift work at scale.