Your warehouse isn't expensive. Your full table scans are.

#bigdata #bigquery #redshift #aws

Every few months someone shows me a cloud bill that tripled and asks whether they should switch warehouses — BigQuery to Redshift, Redshift to BigQuery, the grass always cheaper on the other side. They almost never need to switch. They need to stop scanning everything. And to see why, you have to understand that the two warehouses charge you for completely different mistakes.

(Prices below are US list rates verified mid-2026. They drift — confirm current numbers on the vendor pages before you quote them in a meeting.)

Two warehouses, two completely different bills

BigQuery charges you for bytes scanned. It's serverless — no cluster to manage — and on-demand pricing is about $6.25 per terabyte your query reads (the first 1 TiB each month is free). You don't pay for a machine. You pay for how much data each query touches. That sounds harmless right up until someone does this:

-- BigQuery on-demand: this reads EVERY byte of a 10 TB table.
-- At ~$6.25/TB, that's ~$62.50. For one query.
SELECT * FROM events;

Sixty-odd dollars, one query. Now schedule a dashboard to run it hourly and you've built a machine that sets fire to four figures a day. The warehouse didn't do that. The SELECT * did.

Redshift charges you for compute you rent. With provisioned clusters you pay per node, per hour — whether or not a single query runs. The classic Redshift bill isn't a runaway query; it's a cluster humming along at 3% utilization all weekend, billing full price to do nothing:

A provisioned cluster bills 24/7 while it's up.
Idle Saturday 2 a.m., zero queries running? Still on the meter.

Redshift Serverless softens this — it bills per "RPU-hour" only while queries actually run, and stops when idle — but the trap just relocates: now an unbounded, badly-written query scales the compute up and you pay for the burst. Different shape, same root cause.

So: two warehouses, two opposite-looking bills — and one shared villain underneath. Both are downstream of a single question: are you reading only what you need, or dragging the whole table every time?

Fix #1: stop selecting columns you don't use

Because BigQuery is columnar — it stores each column separately — selecting fewer columns reads less data and costs less, directly and proportionally.

-- Scans all 200 columns. You used three of them.
SELECT * FROM events WHERE event_date = '2026-06-01';

-- Scans three columns. Same rows, a fraction of the bytes, a fraction of the cost.
SELECT user_id, event_type, amount
FROM events
WHERE event_date = '2026-06-01';

On a wide table this one change can cut a query's cost by 80–90%. SELECT * isn't a convenience — on a columnar warehouse it's a bill multiplier.

And before you run anything expensive, BigQuery will tell you the cost for free. Use the dry run:

# Estimates bytes scanned WITHOUT running (or charging for) the query.
bq query --dry_run --use_legacy_sql=false 'SELECT * FROM dataset.events'
# -> "Query will process 10.4 TB when run."  (i.e., ~$65. Maybe don't.)

There is no excuse for a surprise five-figure query when a one-line dry run shows you the bill in advance.

Fix #2: partition so you skip data you don't need

A partition splits a table into chunks (usually by date) so a query that filters on that column can skip whole chunks without reading them.

-- BigQuery: partition by day at create time
CREATE TABLE events_partitioned
PARTITION BY DATE(event_timestamp) AS
SELECT * FROM events;

Now the math changes completely. A query filtering to one day of a three-year table:

-- Without partitioning: scans 3 years of data to find one day.
-- With partitioning:    reads ONLY that day's partition.
SELECT user_id, amount
FROM events_partitioned
WHERE DATE(event_timestamp) = '2026-06-01';

On a 10 TB / 3-year table, that single-day query goes from scanning ~10 TB (~$62) to scanning a few GB (a few cents). Same answer. Same SQL shape. Three orders of magnitude cheaper, because you stopped reading data you were only going to throw away.

Clustering (sorting the data within partitions by a column you filter on a lot, like user_id) prunes even further within each partition. Partition by the thing you filter by date; cluster by the thing you filter by identity.

Fix #3: on Redshift, design how data is laid out

Redshift's levers are different because its cost is about how efficiently the cluster does work — which comes down to how the data is physically arranged across nodes.

CREATE TABLE events (
    user_id     BIGINT,
    event_date  DATE,
    amount      DECIMAL(10,2)
)
DISTKEY(user_id)     -- spreads rows across nodes by user_id, so joins on
                     -- user_id happen locally instead of shuffling data
                     -- across the network (network shuffle = slow = $$)
SORTKEY(event_date); -- keeps rows ordered by date, so a date filter can
                     -- skip whole blocks instead of scanning the table

A SORTKEY on event_date is Redshift's version of partition pruning: filter on a sorted column and the engine skips blocks it knows can't match. A bad DISTKEY is the silent Redshift killer — pick the wrong column and every join shuffles gigabytes across the network between nodes, and your "slow warehouse" is really a data-layout problem wearing a warehouse costume.

And if your raw data lives in S3 and you query it through Redshift Spectrum, the same bytes-scanned logic from BigQuery returns: Spectrum bills per TB scanned, so storing that data as Parquet (columnar, compressed) instead of CSV cuts the scan — and the bill — by a large factor, because Spectrum only reads the columns you ask for.

The decision: read less, before you migrate

Here's the part that saves people the migration. Both warehouses, stripped down, reward the exact same discipline:

→ Read only the columns you need. SELECT * is the most expensive habit in analytics, on either platform.
→ Read only the rows you need. Partition (BigQuery) or sort-key (Redshift) on what you filter by, so the engine skips the rest.
→ Estimate before you run. Dry-run on BigQuery; check the query plan on Redshift. Know the cost before you pay it.
→ Match the pricing model to the workload. Bursty and unpredictable → serverless/on-demand. Steady and heavy → provisioned/committed capacity. Paying on-demand prices for a 24/7 workload, or renting an idle cluster for a once-a-day job, is its own quiet waste.

The bill tracks how much data you move, not which logo is on the console. The gap between a well-architected warehouse and a careless one is routinely 10–20x on the same data and the same questions — and that gap is almost entirely query patterns and table design, not the vendor.

So before you migrate, go pull your single most expensive query. How much data does it actually touch — and how much of that do you actually use? The honest answer to that second question is usually where your "expensive warehouse" turns out to live.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. Cutting cloud-warehouse bills without changing what teams can do with their data is a big part of what I do. That work lives at vf-insights.com.