DEV Community: Vinicius Fagundes

46%. 13%. 61%. Same Data, Three Headlines.

Vinicius Fagundes — Thu, 23 Jul 2026 12:22:00 +0000

All three numbers are true. All three describe the same trend. And whichever one you saw first probably decided what you believe.

The story: cheap open-weight models are eating enterprise AI traffic. A routing platform's public data went through the business press this month, and it went viral in three versions:

46% — share of enterprise tokens in the single best week
13% — same models, measured across a full year of traffic
61% — share among only the top-10 models, in one hand-picked week

One dataset. Three denominators. Three completely different headlines. Nobody faked a number. They picked frames.

Let's do what the headlines didn't: compute all three from the same data and watch the trick happen in front of us.

One dataset, three framings

I've reconstructed a year of weekly token-share data that matches the reported figures — the mechanics are what matter here, not the vendor's raw logs:

import numpy as np

rng = np.random.default_rng(7)

# 52 weeks of open-weight share of TOTAL enterprise tokens:
# a modest baseline with a launch-week spike.
share = np.clip(rng.normal(0.11, 0.025, 52), 0.04, None)
share[31] = 0.46          # the week a hot open-weight model launched

print(f"Framing 1 — THE PEAK:    {share.max():.0%}  (best single week)")
print(f"Framing 2 — THE AVERAGE: {share.mean():.0%}  (all 52 weeks)")

# Framing 3 — THE CHERRY-PICK: same spike week, but the denominator
# shrinks to 'tokens among the top-10 models only'.
spike_week_tokens = {"open_weight": 46, "closed_top10": 29, "closed_long_tail": 25}
top10_only = spike_week_tokens["open_weight"] / (
    spike_week_tokens["open_weight"] + spike_week_tokens["closed_top10"])
print(f"Framing 3 — TOP-10 ONLY: {top10_only:.0%}  (spike week, filtered denominator)")

Framing 1 — THE PEAK:    46%  (best single week)
Framing 2 — THE AVERAGE: 13%  (all 52 weeks)
Framing 3 — TOP-10 ONLY: 61%  (spike week, filtered denominator)

Look at framing 3 closely, because it's the most elegant of the three moves. Nothing about the open-weight tokens changed — 46 units either way. The denominator changed: drop the long tail of smaller closed models from the bottom of the fraction, and 46% becomes 61% without touching a single data point. The numerator is honest. The frame is the argument.

This is exactly how your own dashboards lie to you

Here's the part I actually care about, because I don't run a routing platform and neither do you — but we both own dashboards.

Nobody on your team fakes numbers either. They pick frames, usually without noticing:

Peak week when they want momentum ("adoption hit 46%!")
Yearly average when they want calm ("it's really only 13%")
A filtered subset when they want drama ("61% among the models that matter")

Same metric. Same honesty. Three different meetings.

The four questions I drill before trusting any percentage

I make my MBA students interrogate every percentage the same way, and it transfers directly to dashboard review:

1. What's the denominator? A share of what, exactly? "61% of top-10 tokens" and "61% of tokens" are different claims wearing the same number.

2. What's the time window — and who picked it? A window chosen after seeing the data is not a window, it's a selection. The launch week didn't volunteer; someone picked it.

3. Is this a level or a peak? "Reached 46%" and "runs at 46%" get collapsed into the same headline constantly. One is a spike; the other is a state.

4. What happens if I widen the frame? The cheapest robustness check that exists:

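# Widen the window around the spike and watch the number deflate toward reality.
# The window slides back inside the year when it would run past week 52,
# so "52 weeks" really averages all 52 weeks.
for w in [1, 4, 12, 52]:
    lo = max(0, min(31 - w // 2, len(share) - w))
    hi = lo + w
    print(f"window = {w:2d} weeks around the spike -> {share[lo:hi].mean():.0%}")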

window =  1 weeks around the spike -> 46%
window =  4 weeks around the spike -> 20%
window = 12 weeks around the spike -> 14%
window = 52 weeks around the spike -> 13%

One for loop. The 46% headline decays to the 13% reality in four lines of output. If a number can't survive a wider window, the number was never the story — the window was.

The trend underneath is still real

Worth saying plainly: cost pressure genuinely is reshaping AI traffic, and open-weight models genuinely are taking share. The trend survives the de-framing. But you only learn that — the calm, true, 13%-and-climbing version — after you strip the framing. Never from the loudest number.

And notice the resolution, because it's the same as always: this is a data problem before it's a statistics problem. The frame lives in the query — the WHERE clause that filtered to top-10, the date predicate that picked the week. Whoever writes the query picks the argument. Dashboard literacy is reading the SQL, not the chart.
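
To see the frame sitting in the query, here is a minimal sketch in plain Python. The rows and the helper `open_weight_share` are hypothetical; only the spike-week split (46/29/25) comes from the reconstruction above, and the two ordinary weeks are invented for illustration. The `weeks` and `segments` arguments play the role of the date predicate and the WHERE clause.

```python
# Hypothetical rows: (week, segment, tokens). Only week 31's split is from the article.
ROWS = [
    (30, "open_weight", 11), (30, "closed_top10", 54), (30, "closed_long_tail", 35),
    (31, "open_weight", 46), (31, "closed_top10", 29), (31, "closed_long_tail", 25),
    (32, "open_weight", 12), (32, "closed_top10", 53), (32, "closed_long_tail", 35),
]

def open_weight_share(rows, weeks=None, segments=None):
    """Open-weight share of tokens AFTER the filters are applied."""
    kept = [r for r in rows
            if (weeks is None or r[0] in weeks)
            and (segments is None or r[1] in segments)]
    total = sum(tokens for _, _, tokens in kept)
    open_tokens = sum(tokens for _, seg, tokens in kept if seg == "open_weight")
    return open_tokens / total

all_weeks = open_weight_share(ROWS)
peak_week = open_weight_share(ROWS, weeks={31})
top10_peak = open_weight_share(ROWS, weeks={31},
                               segments={"open_weight", "closed_top10"})

print(f"no filter:             {all_weeks:.0%}")   # 23%
print(f"week = 31:             {peak_week:.0%}")   # 46%
print(f"week = 31, top-10 only: {top10_peak:.0%}")  # 61%
```

The numerator logic never changes between the three calls. Only the filter arguments do, which is why reviewing the query tells you more than reviewing the chart.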

The principle: a percentage without its denominator is an opinion wearing a suit.

What's the most misleading metric you've ever caught on a dashboard?

I'm Vinicius Fagundes — principal data engineer, independent consultant, and MBA lecturer in São Paulo. I write about data pipelines and the math that runs on top of them, and take on a few projects per quarter through vf-insights.com.

The Model Didn't Change. A Parameter Disappeared. Thousands of Pipelines Broke.

Vinicius Fagundes — Wed, 22 Jul 2026 03:00:00 +0000

An LLM vendor dropped two sampling parameters this month. Every hard-coded call that passed them started throwing errors. The model was fine. The weights were fine. The contract changed — and thousands of pipelines discovered they never had one.

Here's the uncomfortable symmetry. You pin your packages. You contract-test your APIs. You'd never let a payment provider change a request schema under you without a canary catching it. But the LLM call — the one sitting in the middle of your most visible feature — is a raw dict of kwargs assembled by hand, validated by nothing, tested against production only.

Your LLM is a dependency too. Treat it like one. Three habits, all boring:

1. One choke point, not forty call sites

# llm_client.py — the ONLY file in the repo that talks to the vendor.
PINNED_MODEL = "vendor-model-2026-06-01"   # dated snapshot, never "latest"

def build_request(prompt: str, max_tokens: int = 1024) -> dict:
    return {
        "model": PINNED_MODEL,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
        # sampling params live HERE and nowhere else —
        # when the vendor drops one, you change one line, not forty call sites
        "temperature": 0.2,
    }

When the parameter vanished, teams with a choke point shipped a one-line fix. Teams with kwargs scattered across forty files spent the day grepping.

2. A contract test that fails in CI, not in production

# test_llm_contract.py — runs on every deploy and on a nightly schedule
import jsonschema
from llm_client import build_request

REQUEST_SCHEMA = {
    "type": "object",
    "required": ["model", "max_tokens", "messages"],
    "properties": {
        "model": {"const": "vendor-model-2026-06-01"},
        "temperature": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "additionalProperties": True,
}

def test_request_matches_contract():
    jsonschema.validate(build_request("ping"), REQUEST_SCHEMA)

# `client` is a pytest fixture (for example in conftest.py) that returns the vendor SDK client.
def test_canary_call_against_live_api(client):
    """The nightly canary: one real call. If the vendor drops a param,
    THIS fails at 3am in CI — not at 9am in your product."""
    resp = client.messages.create(**build_request("Reply with the word: ok"))
    assert resp.content[0].text.strip().lower().startswith("ok")

$ pytest test_llm_contract.py -q
..                                                    [100%]
2 passed in 1.84s
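
The schema test checks what you send; it cannot know what the vendor still accepts. One way to close that gap without a live call is an allowlist you update from the vendor's changelog. This is a minimal sketch: `SUPPORTED_PARAMS`, `unsupported_params`, and the stand-in `build_request` are hypothetical names, and `temperature` is used only as an example of a parameter that might be removed.

```python
# Hypothetical allowlist, maintained by hand from the vendor's changelog.
SUPPORTED_PARAMS = {"model", "max_tokens", "messages", "temperature"}

def build_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Stand-in for llm_client.build_request so this sketch runs on its own."""
    return {
        "model": "vendor-model-2026-06-01",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def unsupported_params(request: dict, supported: set[str]) -> set[str]:
    """Return every top-level key the vendor is not known to accept."""
    return set(request) - supported

def test_request_uses_only_supported_params():
    assert unsupported_params(build_request("ping"), SUPPORTED_PARAMS) == set()

# Simulate the vendor dropping a parameter: shrink the allowlist and the
# check names the offending key before any live call is made.
after_removal = SUPPORTED_PARAMS - {"temperature"}
print(unsupported_params(build_request("ping"), after_removal))  # {'temperature'}
```

The live canary still matters, because the allowlist is only as current as the last person who read the changelog.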

3. Pin the snapshot, schedule the upgrade

"latest" in a model string is the same lie as >= in a requirements file: it means "surprise me in production." Pin the dated snapshot, and make upgrading a deliberate event — run the golden evals against the new snapshot, diff the outputs, then move the pin. On your calendar, not the vendor's.

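The "run the golden evals, diff the outputs, then move the pin" step can be sketched as a small gate. Everything here is illustrative: the function names, the 0.95 threshold, and the two stub models standing in for the pinned and candidate snapshots are assumptions, and exact string match is the simplest possible scoring rule, not a recommendation.

```python
from typing import Callable

Model = Callable[[str], str]

def agreement_rate(golden: list[dict], model: Model) -> float:
    """Fraction of golden prompts where the model returns the expected answer."""
    hits = sum(model(case["prompt"]).strip() == case["expected"] for case in golden)
    return hits / len(golden)

def diff_outputs(golden: list[dict], pinned: Model, candidate: Model) -> list[tuple]:
    """Prompts where the two snapshots disagree, for human review."""
    diffs = []
    for case in golden:
        old, new = pinned(case["prompt"]), candidate(case["prompt"])
        if old != new:
            diffs.append((case["prompt"], old, new))
    return diffs

def safe_to_move_pin(golden: list[dict], candidate: Model,
                     min_agreement: float = 0.95) -> bool:
    return agreement_rate(golden, candidate) >= min_agreement

# Stub snapshots so the sketch runs without a vendor SDK.
GOLDEN = [{"prompt": "2+2", "expected": "4"},
          {"prompt": "capital of France", "expected": "Paris"}]
pinned_model = {"2+2": "4", "capital of France": "Paris"}.__getitem__
candidate_model = {"2+2": "4", "capital of France": "paris"}.__getitem__

print(diff_outputs(GOLDEN, pinned_model, candidate_model))
print("move the pin?", safe_to_move_pin(GOLDEN, candidate_model))  # False
```

The candidate's lowercase answer fails the gate, so the pin stays where it is until someone has looked at the diff.
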
None of this is new engineering. It's the same discipline you already apply to every other dependency — the only novelty is admitting the LLM is one.

I'm Vinicius Fagundes — principal data engineer, independent consultant, and MBA lecturer in São Paulo. I write about data pipelines and the practices that keep them boring, through vf-insights.com.

Your AI Bill Is a Routing Problem, Not a Model Problem

Vinicius Fagundes — Tue, 21 Jul 2026 03:00:00 +0000

The price spread right now: open-weight models around $0.14 per million input tokens. Frontier flagships at $5.00. That's a 35x gap — and most pipelines send every request to the expensive one, because routing everything to the best model is the architecture you get when you don't make an architecture decision.

The fix is boring. Which is my favorite kind of fix.

Step one: classify the workload, not the model

Stop asking "which model is best?" and start asking "which tasks are cheap-safe?" The split is almost always the same:

High-volume, low-ambiguity — extraction, classification, formatting, summarization → cheap tier
Multi-step reasoning, anything customer-facing → frontier tier

Same interface in front of both, so callers never know or care which tier answered.

Step two: the router is maybe 200 lines

Here's the skeleton — a task registry, two tiers, one dispatch function:

from dataclasses import dataclass

PRICE_PER_MTOK = {"cheap": 0.14, "frontier": 5.00}  # $/1M input tokens

@dataclass
class Route:
    tier: str            # "cheap" or "frontier"
    eval_threshold: float  # min accuracy on the golden set to KEEP this route

ROUTES = {
    "extract_invoice_fields": Route("cheap",    0.95),
    "classify_ticket":        Route("cheap",    0.92),
    "format_to_json":         Route("cheap",    0.98),
    "summarize_call":         Route("cheap",    0.90),
    "multi_step_analysis":    Route("frontier", 0.00),  # never downgraded
    "customer_reply":         Route("frontier", 0.00),
}

def route(task: str, eval_scores: dict[str, float]) -> str:
    r = ROUTES[task]
    if r.tier == "cheap" and eval_scores.get(task, 0.0) < r.eval_threshold:
        return "frontier"   # automatic fallback: evals gate the cheap tier, not vibes
    return r.tier

# Last eval run on the golden datasets:
eval_scores = {"extract_invoice_fields": 0.97, "classify_ticket": 0.94,
               "format_to_json": 0.99, "summarize_call": 0.86}  # summarize regressed

for task in ROUTES:
    print(f"{task:24s} -> {route(task, eval_scores)}")

extract_invoice_fields   -> cheap
classify_ticket          -> cheap
format_to_json           -> cheap
summarize_call           -> frontier
multi_step_analysis      -> frontier
customer_reply           -> frontier

Notice summarize_call: it wanted the cheap tier, but its eval score dropped below threshold, so the router bounced it back to frontier automatically. Nobody woke up at 3am. That's the whole design — the golden dataset per task and the threshold are the safety net; the router is just an if.

And that's the honest part of this post: the router is trivial. The eval harness is the real work. Building golden datasets per task, scoring them on every model change, wiring the scores into dispatch — that's where the engineering hours go. Teams that skip it end up routing on vibes, and vibes always route to the expensive model, because nobody gets fired for picking the flagship.
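
For a sense of what the harness produces, here is a minimal sketch that turns golden datasets into the `eval_scores` dict the router consumes. The golden cases, the stub model, and the exact-match scoring are all assumptions made for illustration; real tasks such as summarization usually need a softer scoring rule.

```python
from typing import Callable

GoldenSet = list[tuple[str, str]]   # (input, expected output)

def score_task(golden: GoldenSet, model: Callable[[str], str]) -> float:
    """Accuracy of the model on one task's golden set (exact match)."""
    correct = sum(model(text).strip() == expected for text, expected in golden)
    return correct / len(golden)

def run_evals(golden_sets: dict[str, GoldenSet],
              model: Callable[[str], str]) -> dict[str, float]:
    """One score per task, in the shape the router's eval_scores expects."""
    return {task: score_task(cases, model) for task, cases in golden_sets.items()}

# Hypothetical golden data and a stub standing in for the cheap-tier model.
GOLDEN_SETS = {
    "classify_ticket": [("refund please", "billing"), ("app crashes", "bug")],
    "format_to_json":  [("a=1", '{"a": 1}'), ("b=2", '{"b": 2}')],
}
STUB_ANSWERS = {"refund please": "billing", "app crashes": "bug",
                "a=1": '{"a": 1}', "b=2": "{b: 2}"}   # one malformed answer

eval_scores = run_evals(GOLDEN_SETS, STUB_ANSWERS.__getitem__)
print(eval_scores)  # {'classify_ticket': 1.0, 'format_to_json': 0.5}
```

Rerunning this on every model change and writing the result to a table is what lets the router's threshold check do anything useful.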

Step three: the arithmetic that sells it upstairs

monthly_mtok = 100  # 100M input tokens/month

def bill(cheap_share: float) -> float:
    return monthly_mtok * (cheap_share * PRICE_PER_MTOK["cheap"]
                           + (1 - cheap_share) * PRICE_PER_MTOK["frontier"])

for share in [0.0, 0.3, 0.5, 0.7]:
    print(f"{share:.0%} of traffic on cheap tier -> ${bill(share):,.0f}/month per {monthly_mtok}M tokens")

0% of traffic on cheap tier -> $500/month per 100M tokens
30% of traffic on cheap tier -> $354/month per 100M tokens
50% of traffic on cheap tier -> $257/month per 100M tokens
70% of traffic on cheap tier -> $160/month per 100M tokens

Route just half the traffic and the bill drops nearly in half — the cheap tier is so much cheaper that its contribution barely registers. Scale those numbers to your actual token volume; the ratios hold.
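
The reason the ratios hold at any volume is that token volume cancels out of the saving. Dividing the routed bill by the all-frontier bill leaves a saving of cheap_share × (1 − cheap_price / frontier_price). A short sketch using the prices quoted above; the function name is illustrative:

```python
def relative_saving(cheap_share: float,
                    cheap_price: float = 0.14,
                    frontier_price: float = 5.00) -> float:
    """Fraction of the all-frontier bill saved by routing cheap_share of tokens
    to the cheap tier. Token volume cancels out of the ratio."""
    return cheap_share * (1 - cheap_price / frontier_price)

for s in [0.3, 0.5, 0.7]:
    print(f"{s:.0%} routed -> {relative_saving(s):.1%} saved")
# 30% routed -> 29.2% saved
# 50% routed -> 48.6% saved
# 70% routed -> 68.0% saved
```

With a price ratio this lopsided, the saving is almost exactly the share of traffic you manage to move, which puts the leverage in the workload map rather than the price list.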

This isn't theoretical. One large exchange rebuilt exactly this across 1,200 agents and cut its AI spend nearly in half. Not better prompts. Not a new model. A router and a workload map.

The resolution is the usual one: this is a data problem wearing an AI costume. The workload map comes from your usage logs. The golden datasets are labeled data pipelines. The eval scores live in a table. The teams that treat their AI traffic as a dataset get the 35x spread working for them.

Where does your traffic go today — one model for everything, or tiers?

I'm Vinicius Fagundes — principal data engineer, independent consultant, and MBA lecturer in São Paulo. I write about data pipelines and the economics that run on top of them, and take on a few platform projects per quarter through vf-insights.com.

Cloud Bills Took a Decade to Get FinOps. AI Bills Won't Get That Long.

Vinicius Fagundes — Mon, 20 Jul 2026 03:00:00 +0000

Watch what happened this month.

Databricks moved Genie to pay-as-you-go. Every user gets a small free allowance, then the meter runs. And notice what shipped the same day: budgets, alerts, per-user spending limits. Not a follow-up release. Not "coming in Q4." The governance shipped with the meter.

That detail is the whole story, and it took the industry ten years of cloud pain to learn it. Let me show you the pattern — and then let's compute the numbers your CFO is going to ask you for, before they ask.

The three acts (every computing wave repeats them)

I teach my MBA students that every computing wave walks through the same three acts.

Act one: adoption without meters. Everyone experiments, nobody measures. Engineering spins up assistants, agents, RAG pipelines. The spend is small, distributed, and invisible — a rounding error across forty cost centers.

Act two: the invoice. Some team discovers their AI assistant spend rivals their warehouse spend. This is the quarter someone screenshots the bill and posts it in the leadership channel. Every company gets exactly one of these moments, and it sets the tone for everything after.

Act three: governance. Budgets, showback, unit economics. The discipline gets a name, then a foundation, then a job title.

Cloud took ten years to walk that arc. FinOps became a practice around 2015, a foundation in 2019, a line on job postings after that. AI spend is compressing the same arc into quarters — and the platforms know it. That's why Databricks shipped cost controls with the meter, not after it. Vendors learned from the cloud era that metering without governance creates churn, not revenue.

The skill isn't cutting the bill

Here's what I tell the class: the skill isn't cutting the bill. It's knowing your unit economics before the invoice teaches them to you.

Three numbers. If you can't name them today, act two is already on your calendar:

Cost per query
Cost per agent run
Cost per answered question (not the same thing — see below)

They're not hard to compute. They're just not computed. Here's the whole calculation, by hand, so nothing is hidden:

# Unit economics of an AI assistant — the three numbers your CFO will ask for.
# Prices per 1M tokens; swap in your model's actual rates.

PRICE_IN, PRICE_OUT = 3.00, 15.00   # $/1M tokens, input and output

def query_cost(tokens_in: int, tokens_out: int) -> float:
    return tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT

# A typical RAG query: system prompt + retrieved chunks + question in,
# a few hundred tokens of answer out.
tin, tout = 6_000, 500
per_query = query_cost(tin, tout)
print(f"Cost per query:        ${per_query:.4f}")

# An agent run is not one query. It's a loop: plan, call tools, read results, retry.
steps_per_run = 9
per_agent_run = per_query * steps_per_run
print(f"Cost per agent run:    ${per_agent_run:.4f}")

# And 'answered question' carries the failure tax: reruns, abandoned sessions,
# answers wrong enough that a human redid the work.
success_rate = 0.72
per_answered = per_agent_run / success_rate
print(f"Cost per answered q:   ${per_answered:.4f}")

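# Now scale it to a company. Each user question triggers one agent run,
# so cost per answered question is the unit to multiply.
users, questions_per_user_day, workdays = 800, 12, 22
monthly = users * questions_per_user_day * workdays * per_answered
print(f"\nMonthly, 800 users:    ${monthly:,.0f}")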

Cost per query:        $0.0255
Cost per agent run:    $0.2295
Cost per answered q:   $0.3188

Monthly, 800 users:    $67,320

Read that progression again, because it's where every AI budget surprise lives. The per-query number looks like nothing — two and a half cents, who cares. Multiply by the agent loop and the failure tax, scale to the org, and "who cares" is $67K a month. Nobody lied. The meter just compounds faster than intuition does.
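
The compounding can be isolated as a single multiplier: steps per run divided by success rate. The sketch below keeps the simplification used in the calculation above (every agent step costs the same as the typical query); the function name and the extra success rates are illustrative.

```python
def amplification(steps_per_run: int, success_rate: float) -> float:
    """How many per-query costs one answered question really consumes."""
    return steps_per_run / success_rate

for rate in (0.90, 0.72, 0.50):
    print(f"success rate {rate:.0%}: {amplification(9, rate):.1f}x the per-query cost")
# success rate 90%: 10.0x the per-query cost
# success rate 72%: 12.5x the per-query cost
# success rate 50%: 18.0x the per-query cost
```

At the numbers used above, the two-and-a-half-cent query is really a 12.5x unit, and the multiplier moves with reliability as much as with price.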

Put plainly: the gap between "cost per query" and "cost per answered question" is where act two happens.

The number that changes behavior isn't the total

Act-three governance is not a bigger spreadsheet. It's attribution — showback per team, per feature, per user, so the meter changes decisions instead of just documenting them. The minimum viable version is a log line:

import json, time
from datetime import datetime, timezone

def metered_call(client, team: str, feature: str, **kwargs):
    """Wrap every LLM call. Attribution is a logging decision, not a platform purchase."""
    t0 = time.monotonic()
    resp = client.messages.create(**kwargs)
    usage = resp.usage
    cost = usage.input_tokens / 1e6 * PRICE_IN + usage.output_tokens / 1e6 * PRICE_OUT
    print(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "team": team,
        "feature": feature,
        "tokens_in": usage.input_tokens,
        "tokens_out": usage.output_tokens,
        "cost_usd": round(cost, 6),
        "latency_s": round(time.monotonic() - t0, 2),
    }))
    return resp

That's it. Pipe those lines into the warehouse you already have, group by team and feature, and you have showback on day one — before any vendor sells you a FinOps-for-AI platform. When I rebuilt cost visibility on a Databricks AI pipeline this way, the spend went from $14K/month to $5.8K/month — and most of the reduction came not from optimization heroics but from attribution making waste visible: retry loops nobody owned, a dev environment calling the production model, one feature quietly responsible for 60% of the tokens. You can't fix what isn't attributed to anyone.
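
In a warehouse the "group by team and feature" step is one SQL query. For a quick look before the logs land there, the same aggregation fits in a few lines of Python. The sample log lines and the `showback` helper are hypothetical; the field names match the log line emitted by `metered_call`.

```python
import json
from collections import defaultdict

def showback(log_lines):
    """Sum cost per (team, feature) and return rows sorted by spend, largest first.
    Each row is (team, feature, cost_usd, share_of_total)."""
    totals = defaultdict(float)
    for line in log_lines:
        event = json.loads(line)
        totals[(event["team"], event["feature"])] += event["cost_usd"]
    grand_total = sum(totals.values())
    rows = [(team, feature, cost, cost / grand_total)
            for (team, feature), cost in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical log lines in the shape metered_call prints.
SAMPLE_LOG = [
    '{"team": "support", "feature": "reply_draft", "cost_usd": 0.25}',
    '{"team": "search",  "feature": "rag_answer",  "cost_usd": 0.30}',
    '{"team": "support", "feature": "reply_draft", "cost_usd": 0.35}',
    '{"team": "growth",  "feature": "summaries",   "cost_usd": 0.10}',
]

for team, feature, cost, share in showback(SAMPLE_LOG):
    print(f"{team:8s} {feature:12s} ${cost:.2f}  {share:.0%}")
```

Sorting by spend is the point of the exercise: the first row is the conversation to have.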

Notice the resolution here, because it's the same as always: this is a data problem before it's an AI problem. Cost per answered question is a pipeline metric — it lives in your event logs, your usage tables, your warehouse. Teams that already treat their platform telemetry as a first-class dataset will walk into act three early. Teams that don't will get walked there by an invoice.

The principle: every abstraction eventually sends an invoice. AI is just sending it faster.

What was the first AI line item that made someone at your company ask "wait — how much?"

12 Patterns Run Every Data Stack. Seniority Is Knowing When Each One Is a Mistake.

Vinicius Fagundes — Thu, 16 Jul 2026 13:25:17 +0000


You've seen the chart — twelve boxes, twelve patterns, every data stack ever built somewhere inside it. Darshil Parmar's map of the territory is genuinely good. But a map doesn't tell you which roads are closed, which ones flood in the rainy season, and which ones look like highways but bill you like toll roads.

Here's my take on all 12, after 17 years of shipping them in production. For each one: what it's for, and — the part the chart can't show you — when it's a mistake.

Moving data: ETL, ELT, CDC

1. ETL — alive, and not from nostalgia

When PII must be masked before the data lands anywhere, transform-first isn't a preference. It's the only architecture that satisfies the constraint. Compliance keeps ETL employed.

The claim is executable. Here's the shape of it — masking in flight, so the raw value never touches storage:

import hashlib
import json

def mask_pii(record: dict, pii_fields: list[str], salt: str) -> dict:
    """Transform BEFORE load. The raw CPF/email never lands in the lake."""
    out = dict(record)
    for field in pii_fields:
        if field in out and out[field] is not None:
            digest = hashlib.sha256((salt + str(out[field])).encode()).hexdigest()
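            # Stable, joinable pseudonym for as long as the salt is unchanged. Keep the salt
            # secret: low-entropy values such as a CPF can be brute-forced if it leaks.
            out[field] = digest[:16]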
    return out

raw = {"user_id": 42, "email": "maria@example.com", "cpf": "123.456.789-00", "amount": 89.90}
landed = mask_pii(raw, pii_fields=["email", "cpf"], salt="rotate-me-quarterly")
print(json.dumps(landed, indent=2))

{
  "user_id": 42,
  "email": "0a1f2b8c9d3e4f5a",
  "cpf": "7c6d5e4f3a2b1c0d",
  "amount": 89.9
}

Try doing that in ELT. You can't — by definition, the untransformed value already landed. If your regulator cares about where raw PII exists, not just who queries it, the L-before-T order is disqualifying.

When it's a mistake: when there's no landing constraint and you're paying for a transform cluster the warehouse could replace.

2. ELT — the modern default, with a hidden multiplier

Snowflake, BigQuery and Redshift turned the warehouse into the compute engine. Cheap storage flipped the letters. Load raw, transform in SQL, version it in dbt. It's the right default.

But watch the bill, because "transform later" quietly becomes "transform five times." Marketing derives sessions_clean from raw events. Product derives almost the same table. Finance derives it again with one extra filter. Nobody knows, because raw data is right there and deriving is one CREATE TABLE AS away.

The multiplier is just arithmetic:

raw_scan_tb = 2.0          # TB scanned per full transform over raw events
price_per_tb = 6.25        # on-demand scan pricing, adjust to your warehouse
runs_per_day = 4           # scheduled refreshes
teams_rederiving = 5       # teams independently transforming the same raw data

one_pipeline_month = raw_scan_tb * price_per_tb * runs_per_day * 30
print(f"One team, one month:   ${one_pipeline_month:,.0f}")
print(f"Five teams, same data: ${one_pipeline_month * teams_rederiving:,.0f}")

One team, one month:   $1,500
Five teams, same data: $7,500

Same raw data. Same business question. Five bills. The fix isn't abandoning ELT — it's a shared staging layer so the expensive scan of raw happens once, and everyone derives from that.
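
The effect of a shared staging layer can be estimated with the same arithmetic. The size of the staged table (0.2 TB) is a made-up assumption for illustration; the other numbers repeat the ones above.

```python
raw_scan_tb = 2.0        # TB scanned per full transform over raw events
price_per_tb = 6.25      # on-demand scan pricing
runs_per_day = 4
teams = 5
staged_scan_tb = 0.2     # ASSUMPTION: hypothetical size of the cleaned staging table

def monthly_cost(tb_per_run: float, pipelines: int = 1) -> float:
    """Scan cost for `pipelines` identical pipelines over a 30-day month."""
    return tb_per_run * price_per_tb * runs_per_day * 30 * pipelines

everyone_scans_raw = monthly_cost(raw_scan_tb, teams)
# One pipeline scans raw to build staging; five teams then scan the smaller table.
shared_staging = monthly_cost(raw_scan_tb) + monthly_cost(staged_scan_tb, teams)

print(f"Five teams on raw:        ${everyone_scans_raw:,.0f}")   # $7,500
print(f"One staging + five reads: ${shared_staging:,.0f}")       # $2,250
```

The exact saving depends on how much smaller the staged table is than raw, so measure that ratio before quoting a number.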

When it's a mistake: never as a pattern — only as an unowned one. ELT without a shared model layer is a subscription to redundant compute.

3. CDC — the most misdiagnosed pattern in the stack

Everyone asks for real-time replication. Most need hourly batch. The diagnosis question is not "how fresh is the data?" It's "how fresh is the decision?"

def cdc_or_batch(decision_latency_min: float, change_rate_per_hr: int) -> str:
    """If no decision changes faster than your batch window, CDC is cost, not capability."""
    if decision_latency_min < 15:
        return "CDC — a decision actually changes inside the batch window"
    if change_rate_per_hr > 500_000:
        return "CDC — batch extracts would hammer the source DB"
    return "Hourly batch — you're done, go home"

for case, args in {
    "Fraud hold on a transaction": (0.5, 50_000),
    "Executive sales dashboard":   (1440, 20_000),
    "Inventory sync to the app":   (10, 800_000),
}.items():
    print(f"{case:32s} -> {cdc_or_batch(*args)}")

Fraud hold on a transaction      -> CDC — a decision actually changes inside the batch window
Executive sales dashboard        -> Hourly batch — you're done, go home
Inventory sync to the app        -> CDC — a decision actually changes inside the batch window

And if the answer really is CDC, respect it. A Debezium connector feeding Kafka is production infrastructure: schema evolution handling, snapshot strategy, offset management, dead-letter queues, on-call. Not a weekend project. I've seen more incidents from casually deployed CDC than from any other pattern on this list.

When it's a mistake: when it was prescribed by the word "real-time" in a slide instead of a decision that changes mid-hour.

Storing data: Lakehouse

4. Lakehouse — the truce after a decade of war

Lake vs warehouse was never a fair fight — it was two copies of the same data, two pipelines pretending to agree, and a reconciliation job apologizing between them. The lakehouse (Delta, Iceberg, Hudi — table formats with ACID over object storage) is the truce: one copy, one source of truth, warehouse semantics on lake economics.

I merged 6 data lakes into 1 on this pattern. The before/after:

Metric	Before (6 lakes)	After (1 lakehouse)
Storage	baseline	−40%
Compute	baseline	−55%
Copies of "the same" customer table	6, disagreeing	1
Reconciliation pipelines	many	0

The compute drop is the part people don't predict: most of that spend wasn't analysis. It was six pipelines keeping six copies loosely synchronized — work that produces no insight, only agreement. Delete the duplication and the compute deletes itself.

When it's a mistake: when your data fits comfortably in one warehouse and the "lakehouse migration" is résumé-driven. One Postgres replica and dbt beats a Delta lake you don't need.

Processing cadence: streaming, batch, Lambda, Kappa

5. Real-time streaming — powerful, and over-prescribed

Flink and Spark Structured Streaming earn their complexity when decisions change mid-minute: fraud holds, dynamic pricing, operational alerting. That complexity is real — watermarks, late data, exactly-once semantics, state stores that need care and feeding.

Meanwhile, a dashboard refreshed every 5 minutes is batch wearing a costume. If the consumer of the data is a human glancing at a screen between meetings, a 5-minute micro-batch delivers the identical experience at a fraction of the operational surface. The streaming job isn't serving the user — it's serving the architecture diagram.

When it's a mistake: whenever the staleness the consumer can tolerate is longer than the batch window, so a scheduled micro-batch already delivers data as fresh as anyone needs.

6. Batch — unglamorous, still moves the world

Most of the planet's data still moves at night, in batch, and that's fine. My favorite optimization win ever was pure batch: an 8-hour Spark pipeline tuned down to 47 minutes. No new tools. No migration. The fixes were the boring, compounding kind:

# The class of changes that took 8 hours to 47 minutes; none of them exotic.
# Assumes an existing SparkSession `spark` and a small dimension DataFrame `dim_stores`.
from pyspark.sql.functions import broadcast

# "auto" is a Databricks setting; open-source Spark needs an integer here and
# relies on AQE to coalesce shuffle partitions at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "auto")   # was a hardcoded 200 from a 2018 blog post
spark.conf.set("spark.sql.adaptive.enabled", "true")      # let AQE re-plan joins and partitions at runtime

# 1. Stop reading everything: partition pruning instead of full scans
#    (this prunes only if the dataset is partitioned by event_date)
df = spark.read.parquet("s3://events/").where("event_date = '2026-07-14'")

# 2. Stop shuffling the small side: broadcast the dimension table
joined = df.join(broadcast(dim_stores), "store_id")

# 3. Stop recomputing shared branches: cache what three outputs reuse
base = joined.filter("status = 'valid'").cache()

Boring done well beats exciting. An 8-hour job at 47 minutes reruns inside a business day when it fails — which changes your on-call life more than any architecture choice on this page.

When it's a mistake: only when a decision genuinely can't wait for the window. Which is rarer than the vendor deck says.

9. Lambda — honest about the tension, brutal in practice

(Taking 9 and 10 here because they're one argument.) Lambda architecture admits the real tension: some answers need to be fast, all answers need to be complete. Its solution — a speed layer and a batch layer computing the same metric — means two codebases that must agree forever.

They won't. Here's the same "daily revenue" in both layers; spot the bug that ships twice:

-- Batch layer (SQL): timezone-aware, refunds excluded
SELECT DATE(created_at AT TIME ZONE 'America/Sao_Paulo') AS day,
       SUM(amount) AS revenue
FROM orders
WHERE status != 'refunded'
GROUP BY 1;

# Speed layer (streaming): UTC dates, refunds... forgotten
def on_event(order, state):
    day = order["created_at"][:10]        # UTC slice — already disagrees with batch
    state[day] = state.get(day, 0) + order["amount"]  # refunds counted — bug #2

Every definition change now has to land in two languages, two deploy pipelines, two teams' heads. Every bug ships twice, and the reconciliation meeting is where careers go to age.
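
For comparison, this is roughly what the speed-layer handler needs in order to agree with the batch SQL. It is a sketch under two assumptions: `created_at` is an ISO 8601 string that carries a UTC offset, and refunded orders have `status == "refunded"`. The helper name `revenue_key` is illustrative.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

try:
    SAO_PAULO = ZoneInfo("America/Sao_Paulo")
except Exception:
    # Fallback when the tz database is unavailable; assumes a fixed UTC-3 offset.
    SAO_PAULO = timezone(timedelta(hours=-3))

def revenue_key(order: dict):
    """Return (local_day, amount) under the batch layer's rules, or None if excluded."""
    if order["status"] == "refunded":            # same filter as the SQL WHERE clause
        return None
    created = datetime.fromisoformat(order["created_at"])   # must carry a UTC offset
    return created.astimezone(SAO_PAULO).date().isoformat(), order["amount"]

def on_event(order: dict, state: dict) -> None:
    key = revenue_key(order)
    if key is not None:
        day, amount = key
        state[day] = state.get(day, 0) + amount

state = {}
on_event({"created_at": "2026-07-15T01:30:00+00:00", "status": "paid", "amount": 100.0}, state)
on_event({"created_at": "2026-07-15T02:00:00+00:00", "status": "refunded", "amount": 40.0}, state)
print(state)  # {'2026-07-14': 100.0}
```

An order placed at 01:30 UTC belongs to the previous day in São Paulo, which is exactly the row the UTC slice would have filed under the wrong date. Getting the two layers to agree once is easy; the cost is that the rule still lives in two places.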

When it's a mistake: almost always, today. Lambda made sense before table formats and modern stream processors; now it's mostly historical baggage with a Greek name.

10. Kappa — Lambda's apology

One stream, one codebase, replay the log when you need to reprocess. Elegant — if you can afford the Kafka retention and the reprocessing story:

daily_gb = 400            # ingest volume
retention_days = 90       # long enough to replay after finding a logic bug
replication = 3
storage_per_gb_month = 0.10

retained_gb = daily_gb * retention_days * replication
print(f"Retained in Kafka: {retained_gb/1000:.1f} TB")
print(f"Retention cost:    ${retained_gb * storage_per_gb_month:,.0f}/month")
print(f"Full replay time at 500 MB/s: {retained_gb/3 / 0.5 / 3600:.0f} hours")

Retained in Kafka: 108.0 TB
Retention cost:    $10,800/month
Full replay time at 500 MB/s: 20 hours

Many teams look at that and choose batch with a small streaming edge instead. That's not failure — that's arithmetic.

When it's a mistake: when the retention bill and replay window are bigger than the problem Lambda's dual codebase was causing you.

Knowing your data: catalog, lineage

7. Data cataloging — only works as a living system

A catalog nobody updates is a museum: pleasant to visit, wrong about everything current. The only catalogs I've seen survive are the ones populated by the pipelines themselves — metadata emitted at run time, not typed in by a data steward six weeks later. If updating the catalog is a human task, the catalog is already stale; you just haven't noticed yet.

When it's a mistake: as a standalone documentation project. As an automated side effect of orchestration, it's one of the highest-leverage things on this list.
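
"Populated by the pipelines themselves" can be as small as one call at the end of each load step. In this sketch the in-memory `CATALOG` dict stands in for whatever catalog API you actually use, and the function, table, and owner names are all hypothetical.

```python
from datetime import datetime, timezone

CATALOG: dict[str, dict] = {}   # stand-in for a real catalog's API

def register_in_catalog(table: str, rows: list[dict], owner: str) -> dict:
    """Record what the pipeline just wrote: columns, types, row count, timestamp."""
    columns = {name: type(value).__name__ for name, value in rows[0].items()} if rows else {}
    entry = {
        "table": table,
        "owner": owner,
        "columns": columns,
        "row_count": len(rows),
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    CATALOG[table] = entry
    return entry

def load_orders(rows: list[dict]) -> int:
    # ... the actual write to storage would happen here ...
    register_in_catalog("analytics.orders", rows, owner="sales-data")
    return len(rows)

load_orders([{"order_id": 1, "amount": 89.9}, {"order_id": 2, "amount": 12.0}])
print(CATALOG["analytics.orders"]["columns"])  # {'order_id': 'int', 'amount': 'float'}
```

Because the entry is written by the same code path that writes the data, the catalog cannot fall behind the pipeline without the pipeline itself failing.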

11. Lineage — nobody budgets for it until a board deck is wrong

Then "where did this number come from" is suddenly worth more than the pipeline that produced it. Lineage is insurance: it looks like pure cost right up until the incident where it's the only thing that matters. The trick is the same as cataloging — derive it from execution (query logs, dbt manifests, OpenLineage events), never from documentation.

When it's a mistake: hand-maintained. Automated or absent — those are the only two honest states.
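
Deriving lineage from execution comes down to collecting which job read what and wrote what, then walking the graph backwards. The run events below are a simplified, hypothetical shape (not the OpenLineage schema), and the table names are invented.

```python
from collections import defaultdict

# Hypothetical run events, as a job runner or query-log parser might emit them.
RUN_EVENTS = [
    {"job": "stage_orders",  "inputs": ["raw.orders"],   "outputs": ["staging.orders"]},
    {"job": "stage_fx",      "inputs": ["raw.fx_rates"], "outputs": ["staging.fx"]},
    {"job": "build_revenue", "inputs": ["staging.orders", "staging.fx"],
     "outputs": ["marts.revenue"]},
    {"job": "board_deck",    "inputs": ["marts.revenue"], "outputs": ["reports.board_kpis"]},
]

def build_parents(events):
    """Map each table to the set of tables it was directly built from."""
    parents = defaultdict(set)
    for event in events:
        for out in event["outputs"]:
            parents[out].update(event["inputs"])
    return parents

def upstream(table, parents):
    """Every table that feeds `table`, directly or transitively."""
    seen, stack = set(), [table]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

parents = build_parents(RUN_EVENTS)
print(sorted(upstream("reports.board_kpis", parents)))
# ['marts.revenue', 'raw.fx_rates', 'raw.orders', 'staging.fx', 'staging.orders']
```

When the board deck is wrong, that list is the search space, and it was produced without anyone maintaining a diagram.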

Organizing people: mesh

8. Data mesh — an org decision wearing an architecture costume

Mesh is not a technology. It's a claim about your company: that domain teams are ready to own data as a product — with SLAs, schemas as contracts, and someone on call when their dataset breaks a consumer. If that claim is true, mesh is liberating. If it's false, mesh is decentralized chaos with better naming, and the central data team still gets paged for everything while owning nothing.

When it's a mistake: when it's adopted as an architecture before it's true as an org chart.

Running it all: orchestration

12. Orchestration — the heartbeat nobody monitors

The DAG is your platform's heartbeat and its least monitored component. Everyone watches the pipelines. Few watch the scheduler running them — and a silently degraded scheduler fails every pipeline at once, quietly, by simply not starting them.

# The alert most platforms are missing: not "did the job fail?"
# but "did the scheduler fail to even try?"
# Assumes Airflow 2.6+, where most_recent_job is a module-level helper in airflow.jobs.job.
from datetime import datetime, timedelta, timezone

from airflow.jobs.job import most_recent_job
from airflow.sensors.base import BaseSensorOperator

class SchedulerHeartbeatCheck(BaseSensorOperator):
    def poke(self, context):
        latest = most_recent_job("SchedulerJob")
        if latest is None:
            raise RuntimeError("No SchedulerJob found: the scheduler has never reported in")
        lag = datetime.now(timezone.utc) - latest.latest_heartbeat
        if lag > timedelta(minutes=5):
            raise RuntimeError(f"Scheduler heartbeat is {lag} old: nothing is being scheduled")
        return True

A DAG that fails pages someone. A DAG that never starts pages no one — until the morning dashboard is empty and twelve teams find out together.
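A check for "the scheduler never tried" is most useful when it runs somewhere the scheduler can't take down with it. Here is a hedged sketch of the decision logic for such an outside probe. It assumes a health payload shaped like the one Airflow's webserver exposes at `/health` (a `scheduler` object with a `latest_scheduler_heartbeat` timestamp); the function name and the five-minute threshold are my own choices, and fetching the payload over HTTP is left out so the logic stays testable.

```python
from datetime import datetime, timedelta, timezone

def scheduler_is_stale(health, now, max_lag=timedelta(minutes=5)):
    """True when the reported scheduler heartbeat is missing or older than max_lag."""
    heartbeat = health.get("scheduler", {}).get("latest_scheduler_heartbeat")
    if not heartbeat:
        # No heartbeat at all is the worst case, not a pass.
        return True
    return now - datetime.fromisoformat(heartbeat) > max_lag

# Sample payload with an invented timestamp, checked 12 minutes later.
sample = {
    "scheduler": {
        "status": "healthy",
        "latest_scheduler_heartbeat": "2026-07-06T03:00:00+00:00",
    }
}
now = datetime(2026, 7, 6, 3, 12, tzinfo=timezone.utc)
print(scheduler_is_stale(sample, now))
```

Run something like this from cron or your monitoring system, not from a DAG, and page on `True`.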

When it's a mistake: it isn't. Unmonitored, it's just a mistake waiting for a quiet night.

What's missing from the chart

Notice: not one of these twelve is cloud-specific. The same patterns run on AWS or GCP, on Databricks or on bare Postgres and cron if you're stubborn enough. Tools rotate every five years. Patterns are the career.

The principle: every pattern here solves a problem and creates one. Architecture is choosing which problems you'd rather have.

Which of these did you adopt too early — and what did it cost you?

You saved $80K self-hosting Spark. It cost you $400K.

Vinicius Fagundes — Mon, 06 Jul 2026 17:23:10 +0000

Then he walked me through the team.

Two senior engineers, 30% of their time, on Spark cluster maintenance. One person on-call for jobs that fail at 3am. A fourth spent last quarter upgrading the Kubernetes operator. The lead data engineer quit in April (burnout, the team said).

They didn't save $80K. They spent $400K in engineer-hours and called it freedom.

Nobody runs this math. So let's actually run it — with code you can drop your own numbers into, because the values are illustrative and the model is the point.

The math nobody runs

The pitch for self-hosting is always the invoice: the managed platform costs $80K/year more, so you self-host and "save" $80K. That $80K is real. It's also the only number anyone counts.

Here's the rest of the bill — the part that shows up as people, not line items:

# Fully-loaded annual cost of one senior data engineer.
# Salary + benefits + payroll tax + overhead + equipment. Adjust to your market.
LOADED_SENIOR = 200_000

# What "self-hosting" saves on the invoice — the number that sold the decision.
managed_premium = 80_000   # you pay this much MORE per year for the managed platform

# What self-hosting actually costs in people, from the team we walked through:
ops = {
    "2 seniors @ 30% on cluster maintenance":            2 * 0.30 * LOADED_SENIOR,
    "1 engineer on-call (the 3am / weekend tax)":        0.20 * LOADED_SENIOR,
    "1 engineer, a quarter on the K8s operator upgrade": 0.25 * LOADED_SENIOR,
}

# The lead quit in April. Replacing a senior = recruiting + ~6mo ramp.
attrition = 0.95 * LOADED_SENIOR

hidden = sum(ops.values()) + attrition
net    = managed_premium - hidden   # what self-hosting "saved" minus what it cost

for k, v in ops.items():
    print(f"  {k:<50} ${v:>10,.0f}")
print(f"  {'attrition: backfilling the lead who burned out':<50} ${attrition:>10,.0f}")
print(f"  {'-'*63}")
print(f"  {'hidden cost of self-hosting':<50} ${hidden:>10,.0f}")
print(f"  {'invoice savings that sold the decision':<50} ${managed_premium:>10,.0f}")
print(f"  {'NET':<50} ${net:>10,.0f}")

  2 seniors @ 30% on cluster maintenance             $   120,000
  1 engineer on-call (the 3am / weekend tax)         $    40,000
  1 engineer, a quarter on the K8s operator upgrade  $    50,000
  attrition: backfilling the lead who burned out     $   190,000
  ---------------------------------------------------------------
  hidden cost of self-hosting                        $   400,000
  invoice savings that sold the decision             $    80,000
  NET                                                $  -320,000

$80K saved. $320K underwater. And that's before you count the roadmap you never shipped, because four engineers were quietly moonlighting as a platform team instead of building the product.

"But at scale, self-hosting wins" — fine, so where's the line?

I'm not here to tell you managed always wins. It doesn't. There's a real crossover, and it's worth finding instead of arguing about in the abstract.

Self-hosting wins when the premium a managed vendor charges grows bigger than the cost of a team that runs the platform well — and "well" means a real team whose entire job is the platform, not four people running it on the side of their actual roadmap.

LOADED_SENIOR = 200_000

# A platform team that runs this WELL — a real team, not 4 people moonlighting.
platform_team = 4 * LOADED_SENIOR   # $800,000/yr

# Managed premium scales with your footprint (~35% platform tax on raw compute).
# Team cost is roughly fixed. Find where they cross.
print(f"  a real platform team costs ~${platform_team:,.0f}/yr\n")
for annual_compute in [200_000, 1_000_000, 3_000_000, 8_000_000]:
    premium = 0.35 * annual_compute
    verdict = "self-host wins" if premium > platform_team else "pay for managed"
    print(f"  compute ${annual_compute:>9,}/yr   managed premium ${premium:>10,.0f}   -> {verdict}")

crossover = platform_team / 0.35
print(f"\n  crossover ~= ${crossover:,.0f}/yr in compute")

  a real platform team costs ~$800,000/yr

  compute $  200,000/yr   managed premium $    70,000   -> pay for managed
  compute $1,000,000/yr   managed premium $   350,000   -> pay for managed
  compute $3,000,000/yr   managed premium $ 1,050,000   -> self-host wins
  compute $8,000,000/yr   managed premium $ 2,800,000   -> self-host wins

  crossover ~= $2,285,714/yr in compute

The line sits around $2.3M/year in compute. Below it — where the overwhelming majority of teams actually live — you're paying a platform team to lose money against a bill. Above it, a real platform org starts to pay for itself. The mistake was never self-hosting. It's self-hosting at a tenth of the scale where it makes sense, with a tenth of the team it needs.

The real question isn't "cloud vs self-host"

So the 2026 question isn't build-vs-buy in the abstract. It's narrower and far more useful: which layer is your edge actually in? Build the layer that's yours. Pay for every layer that isn't.

Here's the map I use:

Layer	Commodity or edge?	Call	Why
Object storage (S3/GCS)	commodity	buy	Nobody has ever won a market on a better bucket.
Warehouse (Snowflake/BigQuery)	commodity	buy	Hundreds of vendor engineers make it fast so you don't have to.
Orchestration (Airflow/Prefect)	commodity	buy	A scheduler is undifferentiated heavy lifting.
Compute engine (Spark)	commodity	buy	You will not out-tune the people who wrote it.
Vector DB	commodity	buy	It's an index. Managed until proven otherwise.
Foundation model	commodity	buy	Rent the weights. Your edge is never the base model.
Transformation logic (dbt, business rules)	edge	build	This encodes what your business actually knows.
Features / feature logic	edge	build	The signal is yours; it's the whole game.
Retrieval strategy + eval harness	edge	build	RAG quality lives here, not in the vector DB.
The product on top	edge	build	Obviously. This is the only thing customers pay for.

Read it top to bottom and the pattern falls out. Everything that's undifferentiated heavy lifting — storage, warehouse, orchestration, the engine, the model — is a commodity someone else already runs better than you will. Everything that's yours — the transformation logic, the features, the retrieval strategy, the product — is where engineering time actually compounds.

Most teams do the exact opposite. They build their own warehouse to avoid lock-in, then YOLO their prompts straight into a model API with no eval harness and no retrieval strategy. They protect the commodity and surrender the differentiator.

What you actually inherit when you self-host

Invoice "savings" hide a transfer of work, not an elimination of it. Self-host the orchestrator and you now own: version upgrades, CVE patching, autoscaling, HA and failover, backup and restore, the on-call rotation, and a values file nobody remembers writing.

Managed, the whole ops surface collapses to a config:

# Managed orchestration — upgrades, scaling, HA, patching: not your problem.
orchestration:
  provider: managed
  version: pinned-latest

Self-hosted, it's the part that never makes the pitch slide:

# The same "run Airflow", except now it's yours forever.
# values.prod.yaml: 400 lines you now own and must reason about.
# The image tag: bumped only after you test the DAG-parsing regression.
helm upgrade airflow apache-airflow/airflow \
  --values values.prod.yaml \
  --set images.airflow.tag=2.9.3
# then: patch the CVE, resize the workers, chase KubernetesExecutor pod churn,
#       restore the metadata DB someone truncated, and do it again next quarter.

Neither is wrong. But one of them is a dropdown, and the other is two senior engineers at 30% plus a Slack channel called #pipelines-help. The decision is fine — as long as you price both sides of it, which is exactly the step that got skipped.

The principle

Pay for the commodity. Build the differentiator. Every hour your seniors spend keeping someone else's undifferentiated infrastructure alive is an hour not spent on the one thing your competitor can't just buy — the logic, the features, the product that's actually yours.

And notice this was never really a cloud-vs-self-host argument. It's a data-engineering allocation problem wearing an infrastructure costume: where does your team's finite senior time actually go, and does that map to where your edge actually is?

What's the most expensive thing your team built that you should have just paid for?

The numbers here are illustrative placeholders — the loaded salary, the managed premium, the 35% tax are all knobs. Plug in your own before you make a real build-vs-buy call; the model is the point, not the values.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. I spend my time deciding which layers a team should own and which they should rent, then building the ones that are theirs. That work lives at vf-insights.com.

Your warehouse isn't expensive. Your full table scans are.

Vinicius Fagundes — Tue, 30 Jun 2026 03:00:00 +0000

Every few months someone shows me a cloud bill that tripled and asks whether they should switch warehouses — BigQuery to Redshift, Redshift to BigQuery, the grass always cheaper on the other side. They almost never need to switch. They need to stop scanning everything. And to see why, you have to understand that the two warehouses charge you for completely different mistakes.

(Prices below are US list rates verified mid-2026. They drift — confirm current numbers on the vendor pages before you quote them in a meeting.)

Two warehouses, two completely different bills

BigQuery charges you for bytes scanned. It's serverless — no cluster to manage — and on-demand pricing is about $6.25 per terabyte your query reads (the first 1 TiB each month is free). You don't pay for a machine. You pay for how much data each query touches. That sounds harmless right up until someone does this:

-- BigQuery on-demand: this reads EVERY byte of a 10 TB table.
-- At ~$6.25/TB, that's ~$62.50. For one query.
SELECT * FROM events;

Sixty-odd dollars, one query. Now schedule a dashboard to run it hourly and you've built a machine that sets fire to four figures a day. The warehouse didn't do that. The SELECT * did.

Redshift charges you for compute you rent. With provisioned clusters you pay per node, per hour — whether or not a single query runs. The classic Redshift bill isn't a runaway query; it's a cluster humming along at 3% utilization all weekend, billing full price to do nothing:

A provisioned cluster bills 24/7 while it's up.
Idle Saturday 2 a.m., zero queries running? Still on the meter.

Redshift Serverless softens this — it bills per "RPU-hour" only while queries actually run, and stops when idle — but the trap just relocates: now an unbounded, badly-written query scales the compute up and you pay for the burst. Different shape, same root cause.

So: two warehouses, two opposite-looking bills — and one shared villain underneath. Both are downstream of a single question: are you reading only what you need, or dragging the whole table every time?

Fix #1: stop selecting columns you don't use

Because BigQuery is columnar — it stores each column separately — selecting fewer columns reads less data and costs less, directly and proportionally.

-- Scans all 200 columns. You used three of them.
SELECT * FROM events WHERE event_date = '2026-06-01';

-- Scans three columns. Same rows, a fraction of the bytes, a fraction of the cost.
SELECT user_id, event_type, amount
FROM events
WHERE event_date = '2026-06-01';

On a wide table this one change can cut a query's cost by 80–90%. SELECT * isn't a convenience — on a columnar warehouse it's a bill multiplier.

And before you run anything expensive, BigQuery will tell you the cost for free. Use the dry run:

# Estimates bytes scanned WITHOUT running (or charging for) the query.
bq query --dry_run --use_legacy_sql=false 'SELECT * FROM dataset.events'
# -> reports the bytes the query would process: ~10.4 TB here, i.e., ~$65. Maybe don't.

There is no excuse for a surprise five-figure query when a one-line dry run shows you the bill in advance.

Fix #2: partition so you skip data you don't need

A partition splits a table into chunks (usually by date) so a query that filters on that column can skip whole chunks without reading them.

-- BigQuery: partition by day at create time
CREATE TABLE events_partitioned
PARTITION BY DATE(event_timestamp) AS
SELECT * FROM events;

Now the math changes completely. A query filtering to one day of a three-year table:

-- Without partitioning: scans 3 years of data to find one day.
-- With partitioning:    reads ONLY that day's partition.
SELECT user_id, amount
FROM events_partitioned
WHERE DATE(event_timestamp) = '2026-06-01';

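On a 10 TB / 3-year table, a single-day filter goes from scanning all three years (~10 TB, ~$62, if you read every column) to scanning one day's partition: roughly 10 TB / 1,095 days ≈ 9 GB, a few cents, and less still when you select only two columns as above. Same answer. Same SQL shape. Three orders of magnitude cheaper, because you stopped reading data you were only going to throw away.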
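If you want to redo that arithmetic with your own table, here is a small sketch. It assumes the ~$6.25/TB on-demand rate quoted earlier, evenly sized daily partitions, and full-width rows, and it ignores the monthly free tier; every number is a knob.

```python
PRICE_PER_TB = 6.25  # on-demand list price quoted above; confirm the current rate

def scan_cost(tb_scanned, price_per_tb=PRICE_PER_TB):
    """Dollar cost of one on-demand query that reads tb_scanned terabytes."""
    return tb_scanned * price_per_tb

table_tb = 10.0
days = 3 * 365

full_scan = scan_cost(table_tb)          # no partition filter: every day is read
one_day = scan_cost(table_tb / days)     # one daily partition, evenly sized days assumed
hourly_dashboard = 24 * full_scan        # the full scan scheduled every hour

print(f"full scan:        ${full_scan:.2f}")
print(f"one partition:    ${one_day:.4f}")
print(f"hourly full scan: ${hourly_dashboard:,.2f} per day")
print(f"pruning factor:   {full_scan / one_day:,.0f}x")
```

The pruning factor is just the number of partitions you stopped reading, which is why a three-year daily-partitioned table lands near three orders of magnitude.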

Clustering (sorting the data within partitions by a column you filter on a lot, like user_id) prunes even further within each partition. Partition on the date column you filter by; cluster on the identity column you filter by.

Fix #3: on Redshift, design how data is laid out

Redshift's levers are different because its cost is about how efficiently the cluster does work — which comes down to how the data is physically arranged across nodes.

CREATE TABLE events (
    user_id     BIGINT,
    event_date  DATE,
    amount      DECIMAL(10,2)
)
DISTKEY(user_id)     -- spreads rows across nodes by user_id, so joins on
                     -- user_id happen locally instead of shuffling data
                     -- across the network (network shuffle = slow = $$)
SORTKEY(event_date); -- keeps rows ordered by date, so a date filter can
                     -- skip whole blocks instead of scanning the table

A SORTKEY on event_date is Redshift's version of partition pruning: filter on a sorted column and the engine skips blocks it knows can't match. A bad DISTKEY is the silent Redshift killer — pick the wrong column and every join shuffles gigabytes across the network between nodes, and your "slow warehouse" is really a data-layout problem wearing a warehouse costume.

And if your raw data lives in S3 and you query it through Redshift Spectrum, the same bytes-scanned logic from BigQuery returns: Spectrum bills per TB scanned, so storing that data as Parquet (columnar, compressed) instead of CSV cuts the scan — and the bill — by a large factor, because Spectrum only reads the columns you ask for.

The decision: read less, before you migrate

Here's the part that saves people the migration. Both warehouses, stripped down, reward the exact same discipline:

→ Read only the columns you need. SELECT * is the most expensive habit in analytics, on either platform.
→ Read only the rows you need. Partition (BigQuery) or sort-key (Redshift) on what you filter by, so the engine skips the rest.
→ Estimate before you run. Dry-run on BigQuery; check the query plan on Redshift. Know the cost before you pay it.
→ Match the pricing model to the workload. Bursty and unpredictable → serverless/on-demand. Steady and heavy → provisioned/committed capacity. Paying on-demand prices for a 24/7 workload, or renting an idle cluster for a once-a-day job, is its own quiet waste.

The bill tracks how much data you move, not which logo is on the console. The gap between a well-architected warehouse and a careless one is routinely 10–20x on the same data and the same questions — and that gap is almost entirely query patterns and table design, not the vendor.

So before you migrate, go pull your single most expensive query. How much data does it actually touch — and how much of that do you actually use? The honest answer to that second question is usually where your "expensive warehouse" turns out to live.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. Cutting cloud-warehouse bills without changing what teams can do with their data is a big part of what I do. That work lives at vf-insights.com.

Your p-value answered a question you didn't ask.

Vinicius Fagundes — Mon, 29 Jun 2026 03:00:00 +0000

You ran the A/B test. It came back p = 0.08. "Not significant." So you killed the feature and moved on. You may have just buried something that works — not because the math lied, but because it answered a different question than the one in your head.

This is the most common statistics mistake I see in data teams, and it's not about being bad at math. It's about a quiet mismatch between the question you're asking and the question the test is answering. Let's fix that, with code, and keep the examples generic — we'll use a plain conversion test, the kind every team runs.

What a p-value actually is (read this twice)

Here's the precise definition, and then the plain-English one.

Precise: a p-value is the probability of observing data at least as extreme as yours, assuming there is no real effect at all.

Plain: it answers "if nothing were going on, how surprised should I be by results this big?" A small p-value means "pretty surprised — random noise doesn't usually produce a gap this large."

Now read what it does not say. It does not tell you the probability that your effect is real. It does not tell you the probability you're wrong. Those are the questions you actually care about — and the p-value answers neither. It answers a question about a hypothetical world where the effect is zero. That's the mismatch, right there.

Let's run one.

import numpy as np
from scipy import stats

# Variant A: 120 conversions out of 2400.  Variant B: 145 out of 2400.
a_conv, a_n = 120, 2400
b_conv, b_n = 145, 2400

p_a, p_b = a_conv / a_n, b_conv / b_n          # 5.0% vs ~6.04%
p_pool   = (a_conv + b_conv) / (a_n + b_n)

# two-proportion z-test, computed by hand so nothing is hidden
se = np.sqrt(p_pool * (1 - p_pool) * (1/a_n + 1/b_n))
z  = (p_b - p_a) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

print(f"A: {p_a:.3%}   B: {p_b:.3%}")
print(f"z = {z:.2f},  p = {p_value:.3f}")

You'll get something like z ≈ 1.58 and p ≈ 0.11. By the sacred 0.05 threshold, "not significant." Most teams stop here and conclude B doesn't work.

That conclusion is wrong, or at least unsupported. Watch.

"Not significant" means "not enough evidence," not "no effect"

"Not significant" does not mean "B is the same as A." It means "we don't have enough evidence to rule out luck as one possible explanation." Those are wildly different statements, and conflating them is how good ideas get killed.

Here's the proof that it's about sample size, not reality. Keep the conversion rates identical — 5.0% vs 6.04%, the exact same real-world effect — and just collect more data:

for scale in [1, 4, 10]:
    a_n2, b_n2 = a_n * scale, b_n * scale
    a_c2, b_c2 = round(p_a * a_n2), round(p_b * b_n2)
    p_pool = (a_c2 + b_c2) / (a_n2 + b_n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1/a_n2 + 1/b_n2))
    z  = (p_b - p_a) / se
    p  = 2 * (1 - stats.norm.cdf(abs(z)))
    print(f"sample x{scale:<2}  n={a_n2:>6}  p={p:.4f}")

Same effect every time. But the p-value marches from "not significant" to "highly significant" purely because the sample grew:

sample x1   n=  2400  p=0.1141
sample x4   n=  9600  p=0.0016
sample x10  n= 24000  p=0.0000

The effect never changed. Only the evidence did. So "not significant" was never a verdict about whether B works — it was a verdict about whether you'd collected enough data to tell. And "not enough data" is a collection problem, a sampling problem, a pipeline problem, long before it's a statistics problem. If your experiment is underpowered, no test will save you — the answer was decided when you chose the sample size.

What the Bayesian crowd does instead

This is where the other camp has a genuine point. Instead of asking "how surprised would I be if the effect were zero," the Bayesian approach asks the question you actually wanted answered all along: "given this data, how probable is it that B beats A, and by how much?"

The cleanest version, for conversion rates, uses a Beta distribution — and it's only a few lines because conversions are just successes-out-of-trials:

rng = np.random.default_rng(0)

# Beta(1,1) is a flat "I know nothing yet" prior.
# Update it with what we observed: conversions and non-conversions.
post_a = rng.beta(1 + a_conv, 1 + a_n - a_conv, 200_000)
post_b = rng.beta(1 + b_conv, 1 + b_n - b_conv, 200_000)

prob_b_better = (post_b > post_a).mean()
lift          = (post_b - post_a) / post_a

print(f"P(B beats A)          = {prob_b_better:.1%}")
print(f"median relative lift  = {np.median(lift):.1%}")
print(f"95% credible interval = "
      f"[{np.percentile(lift, 2.5):.1%}, {np.percentile(lift, 97.5):.1%}]")

This might tell you P(B beats A) ≈ 94%. Read that against where we started. The frequentist test said "not significant, p ≈ 0.11" and you killed the feature. The Bayesian view says "there's about a 94% chance B is better", which is a completely different business decision from the same data.

Notice the other gift: a credible interval. A Bayesian 95% credible interval means what everyone wrongly thinks a confidence interval means — "there's a 95% probability the true lift is in this range." It directly describes your uncertainty about the effect size, which is usually the actual thing you're trying to decide on.

You don't have to pick a religion

I'm not here to recruit you to a camp. Frequentist methods are fast, standard, and fine — when you respect what they're telling you. The failure isn't using a p-value. The failure is reading "not significant" as "no effect" and treating a sample-size problem as a verdict from nature.

So, practically:

→ Power your test before you run it. Decide the smallest effect worth caring about and compute the sample size that can actually detect it. Many "inconclusive" tests were doomed at the planning stage: underpowered before a single user saw the variant.

→ Report the effect size and its uncertainty, not just a yes/no. "B is up ~1 point, 95% interval roughly [−0.3, +2.3]" tells a decision-maker far more than "not significant," which tells them almost nothing.

→ When the cost of waiting is high, ask the Bayesian question directly. "How likely is B to win, and by how much" is often the decision you're actually making — so compute that, instead of a threshold on a hypothetical null world.

→ Remember the test only sees the data your pipeline fed it. Garbage sampling, a logging bug, a biased assignment — the math will faithfully analyze bad data and hand you a confident wrong answer. The statistics are downstream of the data engineering, always.
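The first two points in that list can be sketched in a few lines of standard-library Python. This is a teaching approximation, not a planning tool: `sample_size_per_arm` uses the usual normal-approximation formula for a two-sided two-proportion test, `diff_interval` is a plain Wald interval, and the 80% power and 5% alpha defaults are conventional choices I've assumed, not something the test above specified.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_arm(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_target - p_base) ** 2)

def diff_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Normal-approximation interval for the difference in rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Same numbers as the test earlier in this post: 5.0% vs ~6.04%, 2,400 users per arm.
n = sample_size_per_arm(0.05, 145 / 2400)
low, high = diff_interval(120, 2400, 145, 2400)
print(f"users per arm for 80% power: {n:,}")
print(f"95% interval for the lift: [{low * 100:+.2f}, {high * 100:+.2f}] points")
```

Under these assumptions the required sample comes out around 7,500 users per arm, roughly three times the 2,400 the test actually had, which is the "doomed at the planning stage" problem in one number.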

The principle: know which question your number is answering before you let it make your decision. A p-value is a fine tool pointed at one specific question — just rarely the one you were actually asking.

Last time a test came back "not significant" — did you conclude the idea was dead, or that you just didn't have enough data yet? Those lead to opposite roadmaps, and most teams pick the first one without noticing they had a choice.

This touches statistical methods that get misused in high-stakes settings, so treat the code here as a teaching starting point, not a turnkey decision system for anything that actually matters — validate it against your own data and context.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. I build the pipelines and experiment plumbing that feed analyses like these. That work lives at vf-insights.com.

Strong vs eventual consistency: you can't have it always-on and always-right. Pick one.

Vinicius Fagundes — Thu, 25 Jun 2026 03:00:00 +0000

Every distributed system makes you choose, and most teams choose by accident — then act genuinely surprised when the tradeoff shows up at 2 a.m. as a customer screenshot of a wrong number. This post is the choice, stripped of the jargon that usually buries it, with real examples you can run.

The moment the choice gets forced

Here's the setup. Your data lives on more than one machine — a primary and a replica, two nodes, a cluster, whatever. You did that on purpose, for durability and scale. Most of the time those copies agree and nobody thinks about it.

Then the network between them hiccups. Not "goes down forever" — just a few seconds where node A can't reach node B. This is called a network partition, and here's the uncomfortable truth: it is not a question of if, it's when. Networks are not your friend. Packets drop, links saturate, a switch reboots.

In that moment — copies can't talk to each other, and a read request comes in — your system has exactly two options, and it cannot have both:

Refuse to answer until it's sure it has the latest, agreed-upon truth. (It stays correct, but it's briefly unavailable.)
Answer immediately with whatever this node has, and admit the value might be a few seconds stale. (It stays available, but it's briefly inconsistent.)

That's it. That's the whole famous "CAP theorem" in one sentence: when the network partitions (P), you choose between Consistency (C) and Availability (A). You don't get to keep both during the partition. Anyone selling you "always-on AND always-right" is hiding the moment the network forces the choice.

The two answers, in plain language

Strong consistency means every read sees the latest write. The instant a write succeeds, every subsequent read anywhere returns the new value. To guarantee that, the system sometimes has to say "I can't answer right now" — because the only honest alternative during a partition would be to risk handing you a stale value.

Eventual consistency means the system always answers, immediately, from whatever node you reached, and the copies reconcile "eventually," usually in milliseconds to seconds. The price: for that brief window, a read might return the previous value. Given a little time and no new writes, all copies converge to the same answer. Hence "eventual."

Neither is "correct." They're bets about which kind of pain hurts less for this specific data.
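Here is the forced choice as a toy you can run. It is a deliberately tiny sketch, assuming a single replica object with a `partitioned` flag I flip by hand; the class, the mode names, and the error type are all invented for illustration and model no real database.

```python
class Replica:
    """Toy replica: holds a local copy and knows whether it can reach its peer."""

    def __init__(self, mode, value):
        self.mode = mode            # "CP" favors consistency, "AP" favors availability
        self.value = value          # last value this node has seen
        self.partitioned = False    # True while the peer is unreachable

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot confirm this copy is the latest, so refuse instead of guessing.
            raise TimeoutError("cannot confirm latest value during partition")
        # AP mode (or a healthy network): answer from the local copy.
        return self.value

# Both replicas last saw a balance of 100; a newer write is stuck on the other side.
ap, cp = Replica("AP", 100), Replica("CP", 100)
ap.partitioned = cp.partitioned = True

print("AP read:", ap.read())          # answers, possibly stale
try:
    cp.read()
except TimeoutError as err:
    print("CP read refused:", err)    # stays correct by refusing
```

Same partition, same data, two different failures. Which one you can live with depends on what the number is.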

Make it concrete: the same read, two ways

DynamoDB lets you pick per-read, which makes the tradeoff wonderfully literal:

import boto3
table = boto3.resource("dynamodb").Table("accounts")

# Eventually consistent read (the DEFAULT): fast, cheap, might be slightly stale
table.get_item(Key={"id": "user-123"})

# Strongly consistent read: guaranteed latest, costs more, can fail during a partition
table.get_item(Key={"id": "user-123"}, ConsistentRead=True)

One boolean. ConsistentRead=True says "I need the truth, and I'll accept that this read might be slower or unavailable to get it." The default says "answer me fast, stale-by-a-moment is fine."

On the write side, relational databases expose the same dial through replication. In Postgres:

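-- With synchronous_standby_names set (otherwise there is no standby to wait for):
-- synchronous_commit = on           : the primary waits until the standby has flushed
--                                     the write before telling the client "done."
--                                     Stronger guarantee, higher latency.
-- synchronous_commit = remote_apply : also waits until the standby has applied the
--                                     write, so a read on the replica sees it.
-- synchronous_commit = local        : the primary confirms after its own flush and
--                                     ships to the replica in the background.
--                                     Lower latency, a window where the replica is behind.
-- (off goes further: the primary does not even wait for its own flush.)
SET synchronous_commit = on;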

Same fundamental choice, opposite end of the pipe: wait for agreement (consistent, slower), or confirm now and reconcile later (available, faster).

The "read your own write" trap

Here's the bug eventual consistency hands you, and almost everyone hits it once. A user updates their profile, the page reloads, and their change isn't there — because the reload hit a replica that hadn't caught up yet.

update_profile(user_id, {"name": "New Name"})   # writes to the primary
profile = read_profile(user_id)                  # reads from a lagging replica
# profile["name"] is still the OLD name. The user thinks the save failed.

The data wasn't lost. The replica just hadn't received it yet. But your user doesn't know that — they retype it, hit save again, and now you've got a support ticket about a "broken" feature that's working exactly as designed. The fix is to route reads-after-writes to the primary, or use a session-consistency guarantee — but you only know to do that if you understood you chose eventual consistency in the first place.
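One common shape for that fix is to pin a user's reads to the primary for a short window after they write. The sketch below is a toy under loud assumptions: two in-memory dicts stand in for primary and replica, `replicate()` stands in for asynchronous replication, and the five-second pin window is arbitrary. None of these names come from a real library.

```python
import time

class ProfileStore:
    """Toy primary/replica pair with manual replication, to make the lag visible."""

    def __init__(self, pin_seconds=5.0):
        self.primary, self.replica = {}, {}
        self.last_write = {}          # user_id -> time of that user's latest write
        self.pin_seconds = pin_seconds

    def write(self, user_id, profile):
        self.primary[user_id] = dict(profile)
        self.last_write[user_id] = time.monotonic()

    def replicate(self):
        """Stand-in for asynchronous replication catching up."""
        self.replica.update(self.primary)

    def read(self, user_id):
        wrote_at = self.last_write.get(user_id)
        recently_wrote = (
            wrote_at is not None and time.monotonic() - wrote_at < self.pin_seconds
        )
        # Recent writers read from the primary; everyone else takes the cheap replica.
        source = self.primary if recently_wrote else self.replica
        return source.get(user_id)

store = ProfileStore()
store.write("user-123", {"name": "Old Name"})
store.replicate()
store.write("user-123", {"name": "New Name"})   # replica has not caught up yet

print(store.replica["user-123"]["name"])        # what a naive replica read returns
print(store.read("user-123")["name"])           # what the pinned read returns
```

The cost is extra load on the primary for recent writers only, which is usually a small slice of traffic.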

How to actually pick (per piece of data, not per company)

The mistake is picking one religion for your whole system. You choose per data type, by asking what a stale or refused answer costs:

→ Account balance, inventory count, anything money or correctness-critical → lean strong. A stale balance is a double-spend, an overdraft, a lawsuit. Here you'd rather the system say "one moment" than confidently lie. Refuse and retry beats fast and wrong.

→ Like counts, view counts, feed ordering, "last seen" timestamps → lean eventual. Nobody is harmed if the like counter is off by three for two seconds. Availability wins easily; just answer.

→ Quorum systems give you a middle dial. In systems like Cassandra you tune it per query: require how many nodes must agree on a read (R) and a write (W) out of N copies. If R + W > N, any read is guaranteed to see the latest write — you've bought strong consistency by making reads or writes wait for more nodes. Lower the numbers and you trade that guarantee for speed and availability. The knob is right there in the query.

-- Cassandra (cqlsh): CONSISTENCY sets the level for the statements that follow;
-- client drivers let you set it on each individual statement.
CONSISTENCY QUORUM;   -- wait for a majority: stronger, slower
CONSISTENCY ONE;      -- first node to answer wins: faster, possibly stale
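The R + W > N rule is just a counting argument: if the read set and the write set together are bigger than the number of copies, they must share at least one node. A minimal sketch of that check follows; the function names are mine, and it ignores real-world wrinkles like failed writes and repair.

```python
def quorum_overlaps(n, r, w):
    """True when every read set must share at least one replica with every write set."""
    return r + w > n

def majority(n):
    """Smallest count that is more than half of n replicas."""
    return n // 2 + 1

n = 3
for label, r, w in [
    ("QUORUM/QUORUM", majority(n), majority(n)),
    ("ONE/ALL", 1, n),
    ("ONE/ONE", 1, 1),
]:
    verdict = "read sees latest write" if quorum_overlaps(n, r, w) else "read may be stale"
    print(f"N={n} R={r} W={w} ({label}): {verdict}")
```

Note that reading at ONE is still safe if writes go to ALL; the dial is the sum, not either side alone.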

The principle: every "always-on and always-right" architecture is hiding the exact moment it will be forced to break one of those two promises. Your job is to find that moment, decide on purpose which promise you break and for which data — and find out before your users do, not after.

So here's the real question: the system you ship today — which did it actually choose, strong or eventual? And did anyone choose it on purpose, or did it just default its way into a decision you'll meet at 2 a.m.?

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. I help teams make these tradeoffs deliberately instead of accidentally. That work lives at vf-insights.com.

Brute force isn't grit. It's the bill for a plan you skipped.

Vinicius Fagundes — Wed, 24 Jun 2026 03:00:00 +0000

We've turned grinding into a personality trait. The all-nighter, the heroic debugging marathon, the Slack message at 1 a.m. that says "still on it, almost there." It photographs like dedication. Most of the time, it's a planning failure wearing a cape — and someone is paying the bill for it, usually the company, often the engineer's weekend.

I want to make this concrete, because "plan more, communicate better" is the kind of advice that's true and useless at the same time. So here's a real shape of it.

Three days against a locked door

An engineer I worked with lost three days to a pipeline that kept dying mid-run. Day one, he threw a bigger instance at it. Day two, he added retries, then a queue in front of it to smooth the load. Day three, he was tuning JVM memory flags at 11 p.m., exhausted, and — this is the part that matters — completely convinced he was close. Every hour of effort made him more certain the next hour would crack it.

The data was malformed upstream. One bad export from a source system, three days earlier. A single thirty-second question in Monday's standup — "hey, did anything change with the export job over the weekend?" — would have ended it before lunch on day one.

He didn't have an effort problem. He had the opposite. A mountain of effort, aimed with total commitment at a door that was never going to open by force. Because the door wasn't stuck. It was locked. And no amount of pushing opens a locked door — you have to stop, notice it's locked, and go find the key.

That's the whole job, the part that never makes it onto a résumé.

Why we do it anyway

Brute force feels productive in a way that thinking doesn't. When you're grinding, you can see the effort — commits, log lines, instances spun up. It looks like work and it feels like virtue. Stopping to think looks like nothing. Walking to someone's desk to ask a question feels like admitting you couldn't figure it out yourself.

So we optimize for the thing that looks like work instead of the thing that is the work. And the engineering culture cheers it on — we tell war stories about the heroic 3 a.m. fix and stay quiet about the thirty-second question that would have made the fix unnecessary.

Here's the reframe I'd burn into every junior engineer: the question you're avoiding because you'd rather just solve it yourself is usually the fastest path to the answer. Not the weak path. The fast one.

What "planning" actually means here

When I say planning, I don't mean a Gantt chart. I mean three small, unglamorous habits that prevent almost all of the heroic grinding:

→ Measure before you move. Before you optimize anything, find out where the time or the failure actually is. On the pipeline above, ten minutes of looking at which records were failing — instead of three days assuming it was a compute problem — would have pointed straight at the bad export. Profile first. Force later, if at all.

# The cheapest planning tool there is: look before you push.
# Instead of "the job is slow, make the cluster bigger," ask "slow WHERE?"
import time
for name, fn in [("extract", extract), ("transform", transform), ("load", load)]:
    t = time.perf_counter()
    fn()
    print(f"{name:<12} {time.perf_counter() - t:6.2f}s")

Five lines of code. They turn "the pipeline is slow" (an opinion) into "the transform is 90% of the runtime" (a fact you can act on). Most heroic optimization is solving a problem nobody confirmed existed.

→ State the assumption out loud before you build on it. "I'm assuming the upstream export hasn't changed." The second you say that to another human, half the time they go "...actually, it did." Assumptions die fast in conversation and live forever in your head.

→ Define done before you start. "This is working when the job completes under an hour with zero malformed records." Without that line, you don't know if you're three days from done or three minutes — so you grind, because grinding is what you do when you can't see the finish.
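The first and third habits can be sketched in a few lines: count which records are failing and why, then check the result against a written-down definition of done. Everything specific here is an assumption for illustration: the field names order_id and amount, the helper names, and the one-hour budget borrowed from the example sentence above.

```python
from collections import Counter
from typing import Optional

REQUIRED_FIELDS = ("order_id", "amount")  # hypothetical schema, not from a real pipeline

def malformed_reason(record: dict) -> Optional[str]:
    """Return why a record is malformed, or None if it looks fine."""
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            return f"missing {field}"
    try:
        float(record["amount"])
    except (TypeError, ValueError):
        return "amount not numeric"
    return None

def summarize_failures(records) -> Counter:
    """Count malformed records by reason: 'failing WHERE?' instead of 'failing'."""
    reasons = Counter()
    for record in records:
        reason = malformed_reason(record)
        if reason is not None:
            reasons[reason] += 1
    return reasons

def is_done(runtime_s: float, malformed: int, budget_s: float = 3600) -> bool:
    """Done = finished under the time budget with zero malformed records."""
    return runtime_s < budget_s and malformed == 0

sample = [
    {"order_id": "a1", "amount": "10.5"},
    {"order_id": "", "amount": "3"},
    {"order_id": "a3", "amount": "n/a"},
    {"order_id": "a4", "amount": None},
]
report = summarize_failures(sample)
print(dict(report))
print("done:", is_done(runtime_s=1800, malformed=sum(report.values())))
```

A report like this points at the data, not the cluster. Three malformed rows out of four is a question for whoever owns the export, not a reason to resize an instance.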

Communication isn't the soft part

This is the thing the bootcamps skip. We file communication under "soft skills," as if it sits next to the real engineering. On a hard problem, it is the engineering.

The thirty-second standup question wasn't a nicety. It was the single highest-leverage technical move available, and it was free. The engineer who asks "what am I missing here?" on day one looks slightly less impressive in the moment than the one who emerges triumphant on day three — and is worth ten times as much, because they shipped the same outcome and kept three days.

I teach this to MBA students who are mostly not engineers, and it lands the same way for all of them: the senior move was never pushing harder. Seniority isn't measured in how much force you can apply. It's measured in how quickly you can tell whether force is even the right tool — and how willing you are to stop and ask when it isn't.

Nobody remembers the all-nighter you didn't need to pull. They remember the thing shipped on time, by someone who looked at the door, saw the lock, and went and found the key.

When's the last time you ground for days on something a single question would have solved in a minute? Be honest about it — that memory is the most useful teacher you've got.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. Seventeen years of fixing pipelines taught me that the slowest ones are usually a thinking problem before they're a compute problem. That work lives at vf-insights.com.

AWS Glue or Airflow? You're probably paying for both to do one job

Vinicius Fagundes — Tue, 23 Jun 2026 03:00:00 +0000

Glue or Airflow? You're probably paying for both to do one job.

It's the wrong question, and the wrong question quietly doubles your bill. Every couple of months someone asks me whether they should move their pipeline from Airflow to Glue, or the reverse, and the answer is almost always "neither, because you've misunderstood what each one is for." So let's fix that first, because once you see it, the cost mistakes become obvious.

Two different jobs that get confused for one

Picture a restaurant kitchen. There's a head chef calling out the order of dishes — appetizers first, mains when table four is ready, dessert last. And there are the line cooks actually chopping, searing, and plating. The chef coordinates. The cooks do the work. They are not interchangeable, and you wouldn't ask "should I hire a chef or a line cook?" You need the right amount of each.

That's Airflow and Glue.

Airflow is the head chef. It's an orchestrator. It decides what runs, in what order, and when — and then it waits. It does not move your data. It triggers a task, watches whether it succeeded, and triggers the next one. An Airflow DAG ("directed acyclic graph" — just a fancy term for "a list of steps with dependencies") looks like this:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract():  ...
def transform(): ...
def load():     ...

with DAG("daily_sales", start_date=datetime(2026, 1, 1),
         schedule="0 2 * * *", catchup=False) as dag:

    e = PythonOperator(task_id="extract",   python_callable=extract)
    t = PythonOperator(task_id="transform", python_callable=transform)
    l = PythonOperator(task_id="load",      python_callable=load)

    e >> t >> l     # this line is the whole point: order and dependency

Read that last line out loud: extract, then transform, then load. That >> is Airflow's entire reason to exist — managing order and dependencies and retries and schedules. Notice it doesn't say anything about how the data gets transformed. That's not its job.
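If it helps to see what "managing order and dependencies" boils down to, here is a toy sketch using only the Python standard library. It is not how Airflow is implemented; run_in_order and its retry loop are hypothetical names invented for this illustration, and the three tasks are empty stand-ins.

```python
from graphlib import TopologicalSorter

def run_in_order(tasks: dict, deps: dict, max_retries: int = 1) -> list:
    """Run each task after its dependencies, retrying a failed task a few times."""
    completed = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()          # the orchestrator only calls the task
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise              # out of retries: stop the whole run
    return completed

# each task maps to the set of tasks it depends on (same shape as e >> t >> l)
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
tasks = {name: (lambda: None) for name in deps}   # stand-ins that do no real work

print(run_in_order(tasks, deps))
```

Notice that nothing in run_in_order touches data. It decides what runs next and what happens on failure, which is the chef's job in miniature.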

Glue is the line cook. It's managed Spark — actual compute that lifts and reshapes data. A Glue job does the chopping:

import sys
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# read raw data, transform it with Spark, write it back — this MOVES data
df = glueContext.create_dynamic_frame.from_catalog(
        database="raw", table_name="sales").toDF()

clean = (df.filter(df.amount > 0)
           .dropDuplicates(["order_id"])
           .groupBy("region").sum("amount"))

clean.write.mode("overwrite").parquet("s3://warehouse/sales_by_region/")

This code reads gigabytes, filters, dedupes, aggregates, and writes Parquet. That's compute. That's the work.

And here's the thing that resolves the whole "versus": you run Glue jobs from inside Airflow. They're not competitors on the same shelf. The chef tells the cook when to start.

from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

run_glue = GlueJobOperator(task_id="transform_sales", job_name="clean_sales_job")

So where does the bill quietly double?

Now that the roles are clear, the two classic ways teams overpay are easy to see.

Mistake 1: a full orchestrator to babysit three cron jobs. Someone stands up Airflow — which means a scheduler, a webserver, a metadata database, all running 24/7 — to coordinate three independent daily jobs that have no real dependencies between them. Airflow is superb when you have a tangled DAG: forty tasks, branching, backfills, retries, SLAs. It is wild overkill for "run these three scripts every morning." That's a cron line:

# three independent jobs, no dependencies — this is the whole orchestrator you need
0 2 * * *  python extract_sales.py
0 3 * * *  python extract_users.py
0 4 * * *  python build_report.py

If your "DAG" is a straight line with no branching and the failure handling is "email me," you're paying the operational cost of a tool built for a problem you don't have.

Mistake 2: a Spark cluster to transform five gigabytes. This is the more expensive one. Someone spins up Glue — a distributed Spark cluster, billed per DPU-hour — to process a few gigabytes that pandas on a single modest box would crush in under a minute. Spark earns its cost when data genuinely doesn't fit on one machine and needs to be processed in parallel across a cluster. Below that threshold, you're paying cluster prices and cluster cold-start latency to do laptop work.

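# 5 GB that fits in memory? You don't need a cluster.
# Reading and writing s3:// paths from pandas needs the s3fs package installed.
import pandas as pd

df = pd.read_parquet("s3://raw/sales/")
clean = (df[df.amount > 0]
           .drop_duplicates("order_id")
           .groupby("region", as_index=False)["amount"].sum())  # stays a DataFrame; a Series has no to_parquet
# pandas writes a single file, so the path needs a file name (this one is illustrative)
clean.to_parquet("s3://warehouse/sales_by_region/part-0.parquet", index=False)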

Same result as the Glue job above, no cluster, no DPU-hours, no cold start.

The actual decision

Stop asking "Glue or Airflow." Ask two separate questions, because they're answering two separate needs:

→ Do I have real orchestration complexity? Branching, dependencies, backfills, retries across many tasks, schedules that interact. Yes → you want an orchestrator (Airflow, or a lighter one like Prefect, or your cloud's native scheduler). No, it's a few independent jobs → cron or a managed schedule is plenty.

→ How much data is actually moving in the transform? Tens of GB or more, or it genuinely needs parallelism → managed Spark like Glue earns its keep. A few GB that fits in memory → pandas, DuckDB, or plain SQL in your warehouse will be faster and cheaper, with none of the cluster overhead.

These are independent dials. A real pipeline might be: a managed schedule (no Airflow) triggering a small Python job (no Glue), because it's three steps and four gigabytes. Another might be: full Airflow orchestrating a dozen Glue jobs, because it's a forty-task DAG over terabytes. Both are correct. The expensive mistake is reaching for the heavyweight version of either dial when your workload only turned one of them up.
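The two dials can be written down as a tiny function, mostly to make the point that they are evaluated separately. The 10 GB cutoff is an assumption standing in for "tens of GB or more" and should be replaced with whatever fits your machines; the function name and return strings are illustrative.

```python
def recommend(has_real_dependencies: bool,
              data_gb: float,
              needs_parallelism: bool = False,
              spark_threshold_gb: float = 10.0) -> tuple:
    """Answer the orchestration question and the compute question independently."""
    orchestration = (
        "orchestrator (e.g. Airflow)"
        if has_real_dependencies
        else "cron or a managed schedule"
    )
    compute = (
        "managed Spark (e.g. Glue)"
        if data_gb >= spark_threshold_gb or needs_parallelism
        else "pandas, DuckDB, or warehouse SQL"
    )
    return orchestration, compute

# three steps, four gigabytes
print(recommend(has_real_dependencies=False, data_gb=4))
# forty-task DAG over terabytes
print(recommend(has_real_dependencies=True, data_gb=5000))
```

Neither argument influences the other branch. A big DAG over small data still gets the light compute option, and a single huge job still gets the light scheduler.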

Seventeen years in, the single most common thing I see on a cloud bill isn't a slow pipeline. It's a stack architected for a scale the company hasn't reached yet — Spark clusters idling, orchestrators babysitting cron jobs, all of it provisioned for the data volume someone hopes to have in two years.

What's running in your stack right now that's sized for a problem you don't actually have yet? Go look at your least-utilized component first — that's usually where the answer is hiding.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. Right-sizing over-built data stacks is a big part of what I do. If this sounds like yours, that work lives at vf-insights.com.

Your model isn't underfitting. Your features are lazy.

Vinicius Fagundes — Tue, 23 Jun 2026 00:21:32 +0000

Here's the scene I've watched play out on a dozen teams. Accuracy plateaus. Someone rips out the logistic regression, drops in XGBoost, and waits for the jump. It doesn't come — or it comes with two points you can't explain to anyone. So the week disappears into hyperparameter tuning, and you end up with a slower, heavier, less interpretable model that's barely better than where you started.

The model was almost never the bottleneck. The features were.

This post is the long, practical version of that argument. We'll define the two camps in plain language, run real code, look at when boosting genuinely wins, and then walk through the failure mode nobody warns you about — the one where the fancy model is "winning" because it's quietly cheating.

A note before we start: the examples here are deliberately generic. We'll predict a binary target on tabular data, the kind of yes/no label you'd get from "was this order cancelled?" The principles are the same everywhere, and you should validate them on your own data.

The two camps, in plain terms

Linear / logistic regression fits a straight-line relationship: each feature gets a weight (a coefficient), and the prediction is a weighted sum. Logistic regression is the same idea bent for classification — it outputs a probability.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# the whole model, readable in one line per feature:
for name, weight in zip(feature_names, model.coef_[0]):
    print(f"{name:<20} {weight:+.3f}")

That loop is the entire model. A positive weight means "more of this pushes the prediction up," and you can hand that table to a stakeholder and defend every number. The cost: it assumes the relationship is roughly linear (in the log-odds, for logistic regression) and that features contribute additively, with no interactions between them. Real data often isn't that polite.

Gradient boosting (XGBoost, LightGBM, sklearn's GradientBoostingClassifier) builds hundreds of small decision trees, each one correcting the mistakes of the last. It captures nonlinearity and feature interactions for free, and on messy tabular data it usually wins on raw accuracy.

from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_train, y_train)

The cost is the mirror image: it's a black box. You can't read it the way you read coefficients, it will happily overfit if you let it, and — this is the part that bites — it will exploit any leakage in your data with terrifying enthusiasm.

When boosting genuinely wins

Let me be fair to boosting, because it deserves it. Build a dataset with a real interaction effect — where the target depends on two features multiplied together, not added — and watch what happens.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# the signal lives in the INTERACTION: x1 * x2, not x1 + x2
logit = 3 * (x1 * x2)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([x1, x2])

lr  = LogisticRegression()
xgb = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)

print("logreg:", cross_val_score(lr,  X, y, cv=5, scoring="roc_auc").mean())
print("xgb:   ", cross_val_score(xgb, X, y, cv=5, scoring="roc_auc").mean())

Logistic regression will score around chance here — close to 0.5 AUC — because there's no straight-line relationship between either feature alone and the target. Boosting will score much higher, because trees can split on x1 and then split on x2 inside that branch, which is exactly an interaction.

That's the honest case for boosting: when the signal is nonlinear or lives in interactions, and you don't know that ahead of time. Trees find structure you didn't hand-engineer.

But notice the catch in that last sentence — "you didn't hand-engineer." What if you had?

The plot twist: features close the gap

Give the linear model the interaction term explicitly, and it catches right up:

# hand the interaction to the linear model as a feature
X_better = np.column_stack([x1, x2, x1 * x2])

print("logreg + feature:", cross_val_score(lr, X_better, y, cv=5,
                                            scoring="roc_auc").mean())

One engineered column — x1 * x2 — and the "weak" model is now competitive with boosting, while staying fully interpretable. You can look at the coefficient on that interaction term and know what the model learned.

This is the whole thesis in one experiment. Boosting wasn't smarter. It was compensating for a feature you forgot to create. The accuracy gap between a simple model and a complex one is very often just the complex model rediscovering, internally and opaquely, a feature you could have written by hand.

Better features beat a better algorithm, and they cost less to run and far less to trust.
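When you don't know which pair of columns interacts, one option is to let scikit-learn generate every pairwise product and keep the model linear. This is a sketch on the same kind of synthetic data as the experiment above (smaller n to keep it quick), not a claim about your dataset; generating all pairs gets expensive as the column count grows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# same setup as above: the signal lives in the product x1 * x2
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-3 * x1 * x2))).astype(int)
X = np.column_stack([x1, x2])

plain = LogisticRegression()
with_pairs = make_pipeline(
    # interaction_only=True adds x1*x2 but not the squares x1**2, x2**2
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(),
)

auc_plain = cross_val_score(plain, X, y, cv=5, scoring="roc_auc").mean()
auc_pairs = cross_val_score(with_pairs, X, y, cv=5, scoring="roc_auc").mean()
print(f"plain: {auc_plain:.3f}   with pairwise interactions: {auc_pairs:.3f}")
```

The model is still a weighted sum you can read coefficient by coefficient. It just has one more column to put a weight on.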

The failure mode nobody warns you about: leakage

Here's where boosting's enthusiasm turns dangerous. Data leakage is when information sneaks into your features that wouldn't actually be available at prediction time — usually because it's downstream of the very thing you're predicting.

A concrete example. Say you're predicting whether an order will be cancelled. Someone adds a feature refund_amount. It's wildly predictive — accuracy jumps ten points. Ship it!

Except refunds only happen after a cancellation. At the moment you actually need to predict, refund_amount is always zero. You've trained a model to predict cancellations using a column that only exists because of cancellations. In production it's useless, and you won't find out until the numbers quietly fall apart.

# This "feature" is the answer wearing a disguise.
# It is only populated after the event you're trying to predict.
df["refund_amount"]   # leaks the target

Why does this matter more for boosting? Because a linear model spreads its attention across features and a single leaky column produces one suspiciously huge coefficient you might actually notice. Boosting will find the leak, latch onto it, and route most of its trees through it — handing you a gorgeous validation score that's pure fiction. The more powerful the model, the more efficiently it exploits a mistake in your data.
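One cheap screen before training anything: score each column on its own and flag any that nearly separates the classes single-handedly. The sketch below runs on made-up data where refund_amount is constructed to be nonzero only for cancelled orders; the 0.95 threshold and the function name suspicious_features are assumptions, and a flag is a prompt to ask when the column gets populated, not proof of leakage.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def suspicious_features(X, y, names, threshold=0.95):
    """Flag columns whose single-feature AUC is implausibly close to perfect."""
    flagged = []
    for j, name in enumerate(names):
        auc = roc_auc_score(y, X[:, j])
        auc = max(auc, 1 - auc)        # direction doesn't matter, separation does
        if auc >= threshold:
            flagged.append((name, round(float(auc), 3)))
    return flagged

rng = np.random.default_rng(1)
n = 1000
cancelled = rng.integers(0, 2, size=n)                   # synthetic target
order_value = rng.normal(100, 20, size=n)                # unrelated to the target
refund_amount = cancelled * rng.uniform(5, 50, size=n)   # only nonzero after a cancellation
X = np.column_stack([order_value, refund_amount])

print(suspicious_features(X, cancelled, ["order_value", "refund_amount"]))
```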

There's a subtler version too — preprocessing leakage — where you compute something over the whole dataset before splitting:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# WRONG: scaler sees the test set's statistics before you split
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test = train_test_split(X_scaled)

# RIGHT: fit preprocessing on train only, inside a pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)   # scaler refits on each fold's train

A Pipeline isn't a style preference. It's the thing that keeps test information from bleeding into training, and it's the difference between a validation score you can believe and one you can't.

So how do I actually choose?

Here's the decision I'd hand a junior engineer, in order:

→ Start with the simple model. Logistic or linear regression, clean features, a real cross-validation setup. This is your baseline and your sanity check — if it scores absurdly well, you probably have leakage, and the simple model made it easy to spot.

→ Spend your effort on features, not models. Interactions, ratios, time-since-event, sensible encodings. Most of the accuracy you're chasing lives here. Every feature you engineer by hand is one the black box doesn't have to reconstruct opaquely.

→ Reach for boosting when the simple model plateaus and you've ruled out leakage and you've exhausted obvious features. Now you're using boosting for what it's actually good at — nonlinearity you genuinely can't hand-engineer — instead of as a band-aid over lazy features.

→ When you do use boosting, demand interpretability back. Feature importances, SHAP values, partial dependence. If you can't explain why it predicts what it predicts, you can't catch it when it's wrong.
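As one way to "demand interpretability back," here is a sketch using permutation importance from scikit-learn: shuffle one column at a time on held-out data and see how much the score drops. It uses sklearn's GradientBoostingClassifier so the example has no extra dependency, and the two columns (one real signal, one pure noise) are synthetic by construction.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)                 # actually drives the target
noise = rng.normal(size=n)                  # unrelated to the target
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([signal, noise])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# importance = drop in held-out accuracy when a column is shuffled
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
importance = dict(zip(["signal", "noise"], result.importances_mean))
for name, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name:<8} {score:+.3f}")
```

If a column you can't justify sits at the top of that list, treat it the way you'd treat a suspiciously huge coefficient: go find out where it comes from before you trust the score.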

The principle underneath all of it: model choice is a data decision, not a leaderboard contest. A clean regression on good features will beat boosting on dirty ones almost every time, and it'll be cheaper to run and easier to defend. XGBoost won't save you from a pipeline that feeds it lies. Nothing will.

When your accuracy last stalled — did you reach for a new model, or did you go back and interrogate the features first? I'm curious which instinct fired, because it tells you a lot about where you are in this.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. I build and fix the data pipelines that feed models like these. If this is your world, this is the work I do at vf-insights.com.