DEV Community: Pratik Dhanave

Architecture Over Alerts: How We Cut BigQuery Costs by 57% ($12M) for a Fortune 500

Pratik Dhanave — Mon, 08 Jun 2026 06:02:02 +0000

I helped redesign a large BigQuery-based enterprise data warehouse and cut spend by 57%. The biggest savings didn't come from dashboards or one-off query tuning. They came from architecture decisions — partitioning, clustering, incremental MERGE patterns, and a better capacity model.

Here's how I approach cost as an architectural problem in large systems.

The Problem

I joined as the Solution Architect for a Fortune 500 financial company running BigQuery at scale — hundreds of analysts, dozens of automated pipelines, 4+ TB of daily ingestion across finance, risk, and regulatory reporting workloads.

The platform had grown organically. Datasets were added without architectural standards. Queries ran without cost awareness. The pricing model was never revisited as workloads matured. Leadership was asking "how much are we spending?" We needed to answer a different question: "why does our architecture force us to spend this much?"

In BigQuery, cost scales with data scanned — so table design and query patterns directly determine spend. One unoptimized query scanning 10 TB hourly — without caching or partition pruning — burns through a massive annual budget. We had dozens of patterns like this across hundreds of queries.
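
To put a rough number on that pattern, here is a minimal sketch of the arithmetic. The price per TiB is an assumption (substitute whatever on-demand rate applies to your region and contract), and the sketch treats the 10 TB scan as 10 TiB for simplicity.

```python
# Back-of-the-envelope yearly cost of one recurring query under on-demand pricing.
# ASSUMPTION: ASSUMED_PRICE_PER_TIB_USD is illustrative, not a quoted rate.

HOURS_PER_YEAR = 24 * 365
ASSUMED_PRICE_PER_TIB_USD = 6.25


def annual_scan_cost(tib_per_run: float, runs_per_year: int, price_per_tib_usd: float) -> float:
    """Return the yearly cost of a query that scans tib_per_run on every run."""
    return tib_per_run * runs_per_year * price_per_tib_usd


# An hourly query that scans 10 TiB with no caching and no partition pruning.
yearly_tib = 10 * HOURS_PER_YEAR
yearly_cost = annual_scan_cost(10, HOURS_PER_YEAR, ASSUMED_PRICE_PER_TIB_USD)
print(f"{yearly_tib:,} TiB scanned per year, about ${yearly_cost:,.0f}")
```

Under those assumptions a single query of this shape scans 87,600 TiB a year, which is why a handful of such patterns can dominate the bill.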

The cost drivers fell into four areas: compute (unoptimized queries and MERGE operations), storage (no partitioning, clustering, or lifecycle policies), governance (no cost attribution or alerting), and capacity (on-demand pricing for predictable workloads).

What We Found

The first four weeks were pure analysis. No changes. Using BigQuery's INFORMATION_SCHEMA and Jobs logs — no third-party tools — we profiled every query by cost, frequency, and scan volume, mapped every dataset by partitioning state and access patterns, traced pipelines end-to-end to understand what would break, and modeled 90 days of workload behavior for capacity planning.
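
As an illustration of that kind of profiling, the sketch below builds a query against the `INFORMATION_SCHEMA.JOBS` view that ranks query texts by bytes billed. It is not the query used on the engagement: the function name, the region default, and the grouping by raw query text are assumptions, and grouping by text is crude because parameterized variants of the same query will not collapse into one row.

```python
def build_profile_query(region: str = "region-us", lookback_days: int = 30, limit: int = 50) -> str:
    """Return SQL that ranks query texts by total bytes billed in the lookback window."""
    return f"""
SELECT
  query,
  COUNT(*) AS run_count,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
  AVG(total_bytes_processed) / POW(1024, 4) AS avg_tib_scanned
FROM `{region}`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {int(lookback_days)} DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY query
ORDER BY tib_billed DESC
LIMIT {int(limit)}
""".strip()


# A 90-day window, matching the workload-modeling horizon mentioned above.
profile_sql = build_profile_query(lookback_days=90)
print(profile_sql)
```

The resulting string can be pasted into the BigQuery console or passed to a client library; sorting by `run_count` instead surfaces the frequency view of the same data.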

Across hundreds of pipelines, a small number of recurring architectural patterns dominated total spend.

The single largest: what we internally called the MERGE anti-pattern — MERGE operations doing full-table read + write on every execution, even when only a fraction of rows had changed. This one pattern accounted for more waste than any other issue we found.

We explicitly avoided query-level tuning as a primary strategy. At this scale, tuning individual queries reduces cost — but redesigning the architecture that generates those queries changes the cost curve. That decision shaped everything that followed.

Five Architecture Decisions

Every decision was constrained by zero-downtime requirements — hundreds of downstream pipelines, regulatory reporting jobs, and cross-team dependencies all depended on these tables. Nothing could break.

Zero downtime wasn't just a constraint; it was the primary design input.

Partitioning: Ingestion-date partitioning on high-traffic tables, over business-key partitioning. We traded optimal query performance for migration safety — accepting slightly higher scan costs on edge-case query patterns to avoid breaking production pipelines.

Clustering: High-cardinality filter columns, over composite clustering. 80% of the benefit for 20% of the complexity. We accepted the remaining gap for operational simplicity.
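
For readers who have not combined the two, this is a minimal sketch of a table definition with ingestion-time partitioning and a single clustering column. The table and column names are hypothetical, and the sketch covers only the target table definition, not the zero-downtime migration of an existing table.

```python
def build_table_ddl(table: str, columns: dict[str, str], cluster_by: list[str]) -> str:
    """Return CREATE TABLE DDL using daily ingestion-time partitions plus clustering."""
    column_sql = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE TABLE `{table}` (\n  {column_sql}\n)\n"
        "PARTITION BY _PARTITIONDATE\n"
        f"CLUSTER BY {', '.join(cluster_by)}"
    )


# Hypothetical ledger table clustered on one high-cardinality filter column.
ddl = build_table_ddl(
    "finance.ledger_entries",
    {"account_id": "STRING", "amount": "NUMERIC"},
    ["account_id"],
)
print(ddl)
```

Queries that filter on the partition pseudo-column and on `account_id` can then skip both whole partitions and unrelated blocks within a partition.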

The MERGE anti-pattern — the biggest win. The underlying pattern: treating BigQuery like a traditional RDBMS. Every MERGE operation was reading and rewriting the entire target table on every run. We shifted from full-table DML to idempotent, partition-scoped operations — joining only against affected shards using staging tables with predicate filters. Write amplification dropped ~90%. One architectural pattern change eliminated the platform's largest cost driver.
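
A partition-scoped MERGE of that general shape might look like the sketch below. It assumes a DATE partition column (here the hypothetical `event_date`); with ingestion-time partitioning the filter would go on the partition pseudo-column instead. All table and column names are illustrative, and the constant date range in the `ON` clause is what is intended to let BigQuery prune target partitions.

```python
from datetime import date


def build_scoped_merge(target: str, staging: str, key: str, partition_col: str,
                       start: date, end: date, update_cols: list[str]) -> str:
    """Return a MERGE that only touches target partitions between start and end."""
    all_cols = [key, partition_col, *update_cols]
    set_clause = ", ".join(f"T.{c} = S.{c}" for c in update_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"S.{c}" for c in all_cols)
    window = f"BETWEEN DATE '{start}' AND DATE '{end}'"
    return f"""
MERGE `{target}` AS T
USING (
  SELECT * FROM `{staging}`
  WHERE {partition_col} {window}
) AS S
ON T.{key} = S.{key}
  AND T.{partition_col} {window}
WHEN MATCHED THEN
  UPDATE SET {set_clause}
WHEN NOT MATCHED THEN
  INSERT ({insert_cols}) VALUES ({insert_vals})
""".strip()


merge_sql = build_scoped_merge(
    "dw.ledger", "staging.ledger_delta", "entry_id", "event_date",
    date(2026, 6, 1), date(2026, 6, 2), ["amount", "status"],
)
print(merge_sql)
```

Because the statement is keyed and bounded by the same date window on both sides, rerunning it for the same window produces the same end state, which is the idempotence property described above.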

Capacity model: Hybrid — 70/30 split between committed slots for predictable workloads and on-demand burst for peaks. We traded slightly higher peak cost for 40% lower total cost and predictable billing. Finance could finally forecast cloud spend annually — that alone changed the stakeholder conversation from reactive to strategic.

Monitoring: Real-time cost attribution with automated multi-tier alerting, replacing the monthly bill-review cycle. It required instrumentation effort upfront, but it shifted governance from reactive to proactive — and gave every team visibility into their own cost footprint for the first time.

Results

These were structural changes, not temporary fixes. The savings persist because the architecture prevents regression.

57% reduction in data warehouse spend
Significant realised annual savings at multi-petabyte scale with thousands of daily queries
3-5x performance improvement on the highest-cost query patterns
Top pipelines reduced from ~10 TB scans per run to under 500 GB — a 95% reduction through partition pruning and clustering

The engagement evolved from a tactical intervention into a strategic FinOps roadmap. The client established a permanent Architecture Review Board to vet all future high-scale pipelines.

The approach was repeatable: diagnostic → cost driver isolation → architecture intervention → governance. We've since applied the same model across multiple enterprise environments.

What I'd Do Differently: The Trust Gap

I underestimated how long it takes to get stakeholder buy-in without real-time proof. For nine weeks, leadership was trusting my word that the architecture changes were working — I couldn't show them live cost-attribution data until the monitoring layer went live in Phase 3. That was a risk I shouldn't have taken.

If I reran this, I'd deploy cost instrumentation in Week 1, alongside the first architecture changes. Not because dashboards solve cost problems — they don't. But because architects need to show their work in real-time, or the next budget review becomes a trust exercise instead of a data conversation. I'd also implement Cost-as-Code — CI/CD gates that reject queries exceeding a scan threshold — so cost governance becomes automated rather than advisory.
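
A Cost-as-Code gate can be as small as the sketch below. The names and the threshold are hypothetical; `estimated_bytes` is assumed to come from a BigQuery dry run, which reports the bytes a query would process without actually running it.

```python
class ScanBudgetExceeded(Exception):
    """Raised when a query's estimated scan exceeds the configured limit."""


TIB = 1024 ** 4


def enforce_scan_limit(query_name: str, estimated_bytes: int, max_bytes: int) -> None:
    """Fail the pipeline step if the estimated scan is above the limit."""
    if estimated_bytes > max_bytes:
        raise ScanBudgetExceeded(
            f"{query_name}: estimated scan {estimated_bytes / TIB:.2f} TiB "
            f"exceeds limit {max_bytes / TIB:.2f} TiB"
        )


# In CI, a non-zero exit from an uncaught exception blocks the merge.
enforce_scan_limit("daily_risk_rollup", estimated_bytes=200 * 1024 ** 3, max_bytes=1 * TIB)
```

Keeping the check a pure function makes it easy to unit test; the dry-run call that feeds it can live in a thin wrapper around whichever client library the pipeline already uses.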

The Takeaway

At scale, cloud cost is an architectural outcome — not a reporting problem. Monitoring explains spend. Architecture determines it.

What's the most expensive architecture mistake you've seen in a cloud data warehouse?

Pratik Dhanave — Architect, 7+ years. Enterprise FinOps & Cloud Architecture. Google Cloud Next Speaker. GSoC Mentor (2019–present).

Building Production Multi-Agent Systems: Real-World Lessons from Genie

Pratik Dhanave — Thu, 04 Jun 2026 03:18:27 +0000

I shipped a 15-agent financial assistant on Microsoft's Multi-Agent Reference Architecture (MARA). It processes 40M requests per month, orchestrates complex workflows, and costs 40% less than alternatives.

This isn't a "how to build agents" tutorial. This is what I learned when agents broke in production.

The Problem With Single-Agent Systems

You start with one agent: "Give me a financial recommendation."

It works. Then users ask for:

Account analysis + fraud detection + investment advice in the same request
Real-time responses (P99 latency < 2 seconds)
Cost control (don't spend more on LLM calls than we make in profit)

A single agent trying to do all three becomes a bottleneck. Enter: multi-agent orchestration.

The Architecture That Worked

Customer Request
    ↓
Supervisor Agent ← (routes intelligently)
    ├→ Analyst Agent ← (account data)
    ├→ Risk Agent ← (fraud detection)
    ├→ Recommender Agent ← (investment logic)
    └→ Compliance Agent ← (regulatory checks)
    ↓
Aggregator (merge results)
    ↓
Response to customer (< 2 seconds)

Key insight: Each agent is independent. If Risk Agent is slow, it doesn't block Analyst Agent.

Pattern 1: Hierarchical Routing

The Supervisor agent doesn't run LLM logic. It's dumb and fast:

import asyncio

class SupervisorAgent:
    def __init__(self, specialists: list["Agent"], budget: float):
        self.specialists = specialists
        self.budget = budget
        self.spent = 0.0

    async def route(self, request: dict) -> dict:
        # Determine which agents to call based on request type.
        # classify_request() and aggregate() are not shown in this excerpt.
        required_agents = self.classify_request(request)

        # Call them in parallel
        tasks = [
            agent.process(request)
            for agent in required_agents
        ]

        results = await asyncio.gather(*tasks)

        # Aggregate results
        return self.aggregate(results)

Why this works:

Supervisor is lightweight (no LLM call)
Specialists run in parallel (fast)
Each specialist is reusable (easy to test, modify, scale)
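
To see the parallel fan-out in isolation, here is a small runnable sketch with stub agents that sleep instead of calling a model. `StubAgent` and `fan_out` are hypothetical names, and the delays are arbitrary.

```python
import asyncio
import time


class StubAgent:
    """Hypothetical stand-in for a specialist; sleeps instead of calling an LLM."""

    def __init__(self, name: str, delay_s: float):
        self.name = name
        self.delay_s = delay_s

    async def process(self, request: dict) -> dict:
        await asyncio.sleep(self.delay_s)
        return {"agent": self.name, "ok": True}


async def fan_out(agents: list[StubAgent], request: dict) -> list[dict]:
    # gather() runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(agent.process(request) for agent in agents))


async def main() -> tuple[list[dict], float]:
    agents = [StubAgent("analyst", 0.05), StubAgent("risk", 0.10), StubAgent("compliance", 0.05)]
    start = time.perf_counter()
    results = await fan_out(agents, {"type": "advice"})
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print([r["agent"] for r in results], f"{elapsed:.2f}s")
```

Total time tracks the slowest agent rather than the sum of all three. The aggregated response still waits for that slowest agent, which is why the timeout and fallback handling in Pattern 4 matters.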

Pattern 2: Cost-Aware Routing

This is where the 40% savings came from. Route by value, not by capability.

class CostGuardian:
    """Route each agent call to a model tier based on the value of the request."""

    async def execute_with_budget(self, agent: "Agent", request: dict):
        # Estimated cost, available for budget checks (not enforced in this excerpt)
        cost = agent.estimated_cost(request)

        # request is a dict, so the value is read by key
        value = request["value"]

        # High-value decision (customer asking about $50K portfolio)?
        # Use expensive model (GPT-4)
        if value > 10000:
            return await agent.execute_with_model("gpt-4")

        # Low-value decision (quick lookup)?
        # Use cheap model (a small local model served via Ollama)
        if value < 100:
            return await agent.execute_with_model("ollama:3b")

        # Medium value? Tradeoff model
        return await agent.execute_with_model("gpt-3.5-turbo")

Result: 30-40% cost reduction without sacrificing quality.
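
The routing snippet picks a model tier but does not show a budget check. One way such a check could look is sketched below; `BudgetTracker` and `BudgetExceeded` are hypothetical names, and the dollar amounts are made up for the example.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push spend past the configured budget."""


class BudgetTracker:
    """Tracks cumulative spend and refuses calls that would exceed the budget."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def reserve(self, estimated_cost_usd: float) -> None:
        """Record the estimated cost, or raise if it does not fit in the budget."""
        if self.spent_usd + estimated_cost_usd > self.budget_usd:
            raise BudgetExceeded(
                f"call costing ${estimated_cost_usd:.3f} would exceed "
                f"budget ${self.budget_usd:.3f} (spent ${self.spent_usd:.3f})"
            )
        self.spent_usd += estimated_cost_usd


tracker = BudgetTracker(budget_usd=0.03)
tracker.reserve(0.02)      # fits in the budget
try:
    tracker.reserve(0.02)  # would bring the total to 0.04
except BudgetExceeded as exc:
    print(exc)
```

Reserving before the call, rather than recording after it, means an over-budget request is rejected before any model spend happens; a caller could respond by downgrading to a cheaper tier instead of failing.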

Pattern 3: Structured Message Passing

Don't let agents communicate via free-form strings. Use schemas:

from dataclasses import asdict, dataclass

@dataclass
class AnalystResult:
    account_id: str
    balance: float
    monthly_income: float
    recent_transactions: list[dict]
    confidence: float  # 0-1, how confident in this data?

# Supervisor validates before passing to next agent
def validate_and_forward(result: AnalystResult) -> dict:
    if result.confidence < 0.8:
        return {"error": "low confidence, retry"}

    # Dataclasses have no .dict() method; asdict() does the conversion
    return asdict(result)

Why: Catch errors early. Agents can validate input before processing.

Pattern 4: Timeout & Fallback Chains

In production, agents time out. Have a plan:

import asyncio

async def call_with_fallback(primary_agent, fallback_agents, request, timeout=2.0):
    """Try primary. If timeout, try fallback 1. If timeout, try fallback 2."""

    for agent in [primary_agent] + fallback_agents:
        try:
            result = await asyncio.wait_for(
                agent.process(request),
                timeout=timeout
            )
            return result
        except asyncio.TimeoutError:
            continue  # Try next agent

    # All failed
    return {"error": "all agents timed out"}

Real example: the Risk Agent is slow (database query backlog). Fall back to the Risk Agent's cached result. The user gets a response instead of a timeout.
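
The cached-result fallback can be sketched end to end with two stub agents. Both classes and the in-memory cache are hypothetical; a production cache would also need an expiry policy, which is left out here.

```python
import asyncio


class SlowRiskAgent:
    """Hypothetical live agent; the sleep simulates a database backlog."""

    async def process(self, request: dict) -> dict:
        await asyncio.sleep(1.0)
        return {"risk": "low", "source": "live"}


class CachedRiskAgent:
    """Hypothetical fallback that serves the last known result from memory."""

    def __init__(self, cache: dict[str, dict]):
        self.cache = cache

    async def process(self, request: dict) -> dict:
        cached = self.cache.get(request["account_id"])
        if cached is None:
            return {"error": "no cached result"}
        return {**cached, "source": "cache"}


async def risk_with_cache(request: dict, timeout: float = 0.05) -> dict:
    live = SlowRiskAgent()
    cached = CachedRiskAgent({"acct-1": {"risk": "low"}})
    try:
        return await asyncio.wait_for(live.process(request), timeout=timeout)
    except asyncio.TimeoutError:
        # Live call exceeded the deadline, so serve the stale-but-fast answer
        return await cached.process(request)


result = asyncio.run(risk_with_cache({"account_id": "acct-1"}))
print(result)  # {'risk': 'low', 'source': 'cache'}
```

Tagging the response with its source lets downstream agents and dashboards tell a fresh risk score from a cached one.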

Observability: The Game Changer

You need to see every agent call:

import time

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# agent is the specialist instance registered under agent_id
async def agent_execute(agent_id: str, agent, request: dict):
    with tracer.start_as_current_span(f"agent.{agent_id}") as span:
        span.set_attribute("request.type", request["type"])
        span.set_attribute("request.value", request["value"])

        start = time.perf_counter()
        result = await agent.process(request)

        # Elapsed time converted from seconds to milliseconds
        span.set_attribute("latency_ms", (time.perf_counter() - start) * 1000)
        span.set_attribute("cost", agent.last_cost)
        span.set_attribute("success", result.get("error") is None)

        return result

In production, I can see:

Which agent is slow (latency heatmap)
Which agent is expensive (cost per call)
Which agent fails most (error rate by agent)

This data drove our optimization decisions.

The Real Numbers

Before multi-agent:

P99 latency: 5 seconds (single agent doing everything)
Cost per request: $0.032
Error rate: 2.1%

After multi-agent + patterns above:

P99 latency: 1.2 seconds (76% reduction)
Cost per request: $0.018 (about 44% savings)
Error rate: 0.3% (better due to fallbacks)

Scale: 40M requests/month × $0.014 savings = $560K/month saved.
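
The monthly figure can be checked directly from the per-request costs listed above. The sketch uses `Decimal` only to avoid floating-point noise; no new data is introduced.

```python
from decimal import Decimal

cost_before = Decimal("0.032")   # cost per request before multi-agent
cost_after = Decimal("0.018")    # cost per request after
requests_per_month = 40_000_000

saving_per_request = cost_before - cost_after
monthly_saving = saving_per_request * requests_per_month
relative_saving = saving_per_request / cost_before

print(f"saving per request: ${saving_per_request}")
print(f"monthly saving: ${monthly_saving:,.0f}")
print(f"relative saving: {relative_saving:.2%}")
```

The per-request saving is $0.014, the monthly total is $560,000, and the relative saving works out to 43.75%, in line with the roughly 40% headline figure.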

The Trap: Premature Multi-Agent

Don't build this until you need it. Signals you need multi-agent:

Single agent request handling time > 1 second
One request needs >3 different LLM calls
Cost per request > $0.01
Error rate > 0.5%

Before that? Single agent is simpler and faster to build.
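
Those thresholds are easy to encode as a check against whatever metrics you already collect. In the sketch below, `AgentMetrics` and its field names are hypothetical, and the function reports which signals fire rather than deciding how many are enough.

```python
from dataclasses import dataclass


@dataclass
class AgentMetrics:
    handling_time_s: float        # single-agent request handling time, in seconds
    llm_calls_per_request: int
    cost_per_request_usd: float
    error_rate: float             # fraction, so 0.005 means 0.5%


def multi_agent_signals(m: AgentMetrics) -> list[str]:
    """Return the thresholds from the list above that the metrics exceed."""
    signals = []
    if m.handling_time_s > 1.0:
        signals.append("handling time > 1 second")
    if m.llm_calls_per_request > 3:
        signals.append("more than 3 LLM calls per request")
    if m.cost_per_request_usd > 0.01:
        signals.append("cost per request > $0.01")
    if m.error_rate > 0.005:
        signals.append("error rate > 0.5%")
    return signals


print(multi_agent_signals(AgentMetrics(1.5, 2, 0.02, 0.001)))
```

An empty list is the cue to stay with a single agent.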

What I'd Do Differently

Start with one agent, add observability immediately — then you'll see where multi-agent helps
Cost budgets from day 1 — not an afterthought
Test with mocks first — multi-agent systems are hard to test with real LLM calls
Document processor agreements — especially important if agents share data

The Hardest Part

Not the architecture. The hardest part is testing. You can't mock everything. You need:

Unit tests (mock all LLM calls)
Integration tests (real agents, fake data)
Load tests (does it handle 30K RPS?)
Cost audits (are we actually saving money?)
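
For the unit-test layer, the standard library's `unittest.mock.AsyncMock` is enough to replace an LLM client. The `AnalystAgent` below and its `llm.complete` method are hypothetical; the point is the injection pattern, not a specific client API.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


class AnalystAgent:
    """Hypothetical specialist that delegates text generation to an injected client."""

    def __init__(self, llm):
        self.llm = llm

    async def process(self, request: dict) -> dict:
        summary = await self.llm.complete(f"Summarize account {request['account_id']}")
        return {"account_id": request["account_id"], "summary": summary}


def test_analyst_agent_uses_llm_once() -> None:
    # The mock stands in for the real client, so no network call or spend occurs
    llm = MagicMock()
    llm.complete = AsyncMock(return_value="balance is stable")

    result = asyncio.run(AnalystAgent(llm).process({"account_id": "acct-1"}))

    assert result == {"account_id": "acct-1", "summary": "balance is stable"}
    llm.complete.assert_awaited_once_with("Summarize account acct-1")


test_analyst_agent_uses_llm_once()
```

Passing the client in through the constructor is what makes the mock possible; agents that build their own client internally are much harder to test this way.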

Next Steps

Multi-agent systems aren't magic. They're cost-aware, observable, fault-tolerant orchestration.

If you're building agents:

Start with one
Add observability (OpenTelemetry)
Scale to multiple agents only when you hit limits
Enforce budgets from the start

Questions? I detail implementation patterns for compliance-driven multi-agent systems (GDPR, PSD2) on my portfolio.

Tags: #architecture #python #ai #production #observability