Beyond Tagging: A Blueprint for Real-Time Cost Attribution in Data Platforms

The Slack message came at 2 AM: "Did someone just blow our entire monthly Spark budget over the weekend?"

By morning, you're staring at an AWS bill that looks like a phone number, frantically digging through metadata logs, and promising your CFO you'll "improve the tagging strategy." But here's the uncomfortable truth: tagging is a 2015 solution to a 2025 problem.

In a world of shared compute, serverless warehouses, and complex DAGs, a tag only tells you who owns the resource—it doesn't tell you who's wasting it. You can tag your Snowflake warehouse as Owner: Data_Platform, but that won't help you identify which team ran an unoptimized cross-join at 3 AM that cost $2,000.

To truly control costs, we need to stop treating FinOps as a labeling exercise and start approaching it as an infrastructure design challenge. This is the architectural blueprint for building a query-level cost attribution engine that actually works.

The Black Box Problem with Shared Compute

The fundamental issue with modern data platforms is the "shared warehouse" model.

Picture this: Four teams—Marketing, Product, Finance, and Data Science—all query the same Snowflake warehouse or hit the same Spark cluster. At month's end, you get one bill. Sure, you can tag that warehouse as team:data-platform, but that doesn't help you figure out which department is responsible for what portion of the spend.

Here's why tags fail at scale:

Tags are infrastructure-level metadata. Your Spark cluster is tagged "data-eng-team" and costs $500/hr to run. Great. But that cluster is shared by five different teams running hundreds of queries. When the bill comes, everyone points at everyone else. The tag told you the cluster cost $500/hour. It didn't tell you that the marketing analytics team's daily report consumed 60% of those resources.

Tags don't survive query boundaries. A data scientist submits a Trino query that joins three massive tables. Trino spins up workers, reads from S3, shuffles data across nodes. The query finishes in 20 minutes. Which tag captured that cost? The one on the Trino coordinator? The S3 bucket? The VPC? Good luck reconciling that.

Tags require discipline that doesn't scale. You can mandate that every Spark job must include a --conf spark.yarn.tags=team:analytics parameter. You can write wiki pages and send Slack reminders. But the moment someone copy-pastes code from Stack Overflow at 5 PM on Friday, your tagging discipline goes out the window.

The fundamental problem is that tags are metadata about infrastructure, not about usage. Standard cloud billing files (like the AWS Cost and Usage Report) give you the cost of the instance, but they don't know what happened inside the instance.

This is why I'm convinced the future of data platform FinOps isn't about better tagging—it's about building query-level cost attribution systems that treat every query as a billable unit of work.

The Blueprint: A 3-Tier Attribution Architecture

To build a system that actually answers "Who spent what?", you need to bridge the gap between Cloud Billing Data (dollars) and Query Metadata (usage). Here's the architectural framework that makes this possible.

Tier 1: The Contextual Metadata Layer (The Query "Passport")

Before you can attribute costs, every query needs to carry a passport—a set of metadata that identifies who ran it and why.

The challenge? Raw query logs often lack business context. Someone ran SELECT * FROM users WHERE signup_date > '2024-01-01' via a Jupyter notebook. Who should be charged?

The Strategy: Tag the Session, Not Just the Infrastructure

Modern orchestrators like Airflow and dbt allow you to inject context directly into query sessions. Instead of relying on infrastructure tags, you force every query to declare its identity upfront.

For Snowflake:

ALTER SESSION SET QUERY_TAG = '{"project": "marketing_churn", "env": "prod", "team": "analytics"}';

For Spark (there is no built-in query-tag property like Snowflake's; one long-standing option is a job group, optionally alongside the spark.yarn.tags conf shown earlier):

spark.sparkContext.setJobGroup("customer_segmentation", "team=data_science,project=customer_segmentation")

For Trino (there is no settable session tag; instead, pass client tags from the client, e.g. the CLI's --client-tags flag or the JDBC clientTags property):

trino --client-tags=team:analytics,project:marketing-dashboard --execute "SELECT ..."

The key is making this non-negotiable. If a query doesn't carry a valid passport, it either gets rejected or lands in an "unattributed" bucket that you review manually. This is where tooling like admission controllers or query interceptors comes in—enforce the passport at the platform level, not through documentation.
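To make that concrete, here is a minimal Python sketch of the enforcement idea for Snowflake: a submit wrapper that rejects any query without a complete passport and stamps the session QUERY_TAG before execution. The required keys and the submit_query helper are illustrative assumptions, not a standard API.

import json

REQUIRED_KEYS = {"project", "team", "env"}  # assumed passport schema

def submit_query(cursor, sql, passport):
    # Reject queries that do not carry a complete passport.
    missing = REQUIRED_KEYS - passport.keys()
    if missing:
        raise ValueError(f"Query rejected: passport missing {sorted(missing)}")
    # Stamp the session so the tag lands in Snowflake's QUERY_HISTORY.
    cursor.execute(f"ALTER SESSION SET QUERY_TAG = '{json.dumps(passport)}'")
    return cursor.execute(sql)

# Usage with any DB-API cursor (e.g. from snowflake.connector):
# submit_query(cur, "SELECT ...", {"project": "marketing_churn", "team": "analytics", "env": "prod"})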

Metadata Enrichment

Even with session tags, you'll need to enrich query logs with additional context:

  • User → Team mapping (from your HR system or directory service)
  • Project tags (from orchestration platforms like Airflow)
  • Application context (is this from a dashboard? An ETL job? Ad-hoc analysis?)

Suddenly, a raw query becomes a rich record: "Analytics team, marketing project, Airflow DAG, production environment, $12.50 cost."
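In code, the enrichment is a simple join. A sketch, assuming the query log and a user-to-team mapping have already landed as Parquet (file and column names here are made up for illustration):

import pandas as pd

queries = pd.read_parquet("query_log.parquet")         # one row per query: query_id, user, runtime_s, tags, ...
teams = pd.read_parquet("user_team_mapping.parquet")   # user -> team, exported from the directory service

enriched = queries.merge(teams, on="user", how="left")
# Anything still without a team goes to the "unattributed" bucket for manual review.
enriched["team"] = enriched["team"].fillna("unattributed")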

Tier 2: The Ingestion & Normalization Pipeline

You need an ETL process that treats cost data as a first-class citizen. This tier is about collecting and standardizing data from two parallel sources.

Ingest Query Logs:
Every query engine maintains execution history—Spark event logs, Trino's system.runtime.queries, Hive's YARN logs, Snowflake's ACCOUNT_USAGE. Extract these into a structured format (Parquet files in your data lake).

These logs capture who ran the query, runtime, resource consumption (CPU, memory, I/O), and success/failure status.

Ingest Infrastructure Costs:
Pull billing data from AWS Cost Explorer, Snowflake ACCOUNT_USAGE, or amortized on-prem hardware costs.
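As an example of the query-log half of this tier, here is a hedged sketch that pulls yesterday's Snowflake query history into Parquet with pandas (the connection and output path are assumptions; note that ACCOUNT_USAGE views lag by up to a few hours):

import pandas as pd

SQL = """
SELECT query_id, user_name, warehouse_name, query_tag,
       total_elapsed_time, credits_used_cloud_services,
       start_time, end_time
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
"""

# conn = snowflake.connector.connect(...)  # credentials omitted
df = pd.read_sql(SQL, conn)
df.to_parquet("query_history/dt=2025-01-01.parquet")   # or an s3:// path in your lake (s3fs required)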

The Normalization Challenge

Different platforms measure consumption differently:

  • Snowflake uses credits
  • Databricks uses DBUs (Databricks Units)
  • Spark measures executor-hours
  • Trino tracks CPU time and memory-seconds

Your normalization layer must convert everything into a standardized metric: cost per compute-second. This is where you build conversion tables:

1 Snowflake Credit = $X (varies by warehouse size)
1 Databricks DBU = $Y (varies by region and instance type)
Spark executor-hour = (# of executors) × (instance cost/hour)
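A minimal sketch of that conversion layer in Python, with every rate below a placeholder you would replace with your own contract and instance pricing:

# Placeholder rates: substitute your negotiated prices.
CREDIT_PRICE_USD = 3.00          # per Snowflake credit
DBU_PRICE_USD = 0.55             # per Databricks DBU (varies by SKU and region)
EXECUTOR_HOUR_USD = 0.50         # per Spark executor-hour
TRINO_WORKER_HOUR_USD = 0.50     # per Trino worker CPU-hour

def to_usd(record):
    # Normalize one query-log record into dollars, whatever engine produced it.
    engine = record["engine"]
    if engine == "snowflake":
        return record["credits_used"] * CREDIT_PRICE_USD
    if engine == "databricks":
        return record["dbus"] * DBU_PRICE_USD
    if engine == "spark":
        return record["executor_seconds"] / 3600 * EXECUTOR_HOUR_USD
    if engine == "trino":
        return record["cpu_seconds"] / 3600 * TRINO_WORKER_HOUR_USD
    raise ValueError(f"Unknown engine: {engine}")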

Tier 3: The Attribution Logic (The "Join")

This is where the magic happens. You're joining query metadata with infrastructure costs to produce per-query attribution.

The Core Principle: Proportional Allocation

Infrastructure costs are time-based. A Spark cluster costs $X per hour whether it's idle or maxed out. Your job is to figure out which queries consumed those hours.

The formula:

Cost_query = (Infrastructure Cost_period) × (Query Runtime / Total Active Time) × (Resource Weight)

Example: If your Spark cluster cost $500 between 2 and 3 PM, and Team A's queries ran for 48 minutes (80% of the hour) while consuming 90% of executor memory, the resource weight is the memory share divided by the time share (0.90 / 0.80 = 1.125), so they pay: $500 × 0.80 × 1.125 = $450, or effectively 90% of the hour's cost.
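In code, the allocation is a small function. The resource weight here is the query's share of the scarce resource divided by its share of wall-clock time, which reproduces the worked example (all values are illustrative):

def attribute_cost(period_cost_usd, time_share, resource_share):
    # Cost_query = period cost x time proportion x resource weight,
    # where resource weight = resource share / time share.
    resource_weight = resource_share / time_share
    return period_cost_usd * time_share * resource_weight

print(attribute_cost(500.0, 0.80, 0.90))   # ~450.0, matching the example above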

The Pipeline:
Time-slice costs → Join query logs with cost buckets → Calculate proportions → Apply formula → Output attributed costs.

The pipeline doesn't need to be real-time on day one. A daily batch job is often good enough to start.

Engine-Specific Implementation Details

Trino: Attributing Distributed Query Costs

Trino is beautiful from a user's perspective—submit SQL, get results. But from a cost attribution perspective, it's challenging. Queries dynamically spawn tasks across worker nodes. A single query might touch 50 workers for 10 seconds each.

Option 1: Use Trino's Resource Metrics

Trino exposes per-query CPU time, peak memory, and cumulative memory through its event listener (the QueryCompletedEvent statistics) and its query REST API; the system.runtime.queries table adds basic state and timing. Use these metrics as cost proxies.

If your worker nodes cost $0.50 per hour and a query used 60 CPU-seconds:

Cost = (60 seconds / 3600 seconds/hour) × $0.50 = $0.0083

It's not perfect—it doesn't account for I/O or network—but it's directionally correct.

Option 2: Time-Slice Worker Costs

If your Trino cluster is fixed-size (say, 20 workers running 24/7), you know the total hourly cost. Divide that by concurrent queries, weighted by their CPU time. This gets messy with dynamic scaling, but for stable clusters, it's surprisingly accurate.
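Here is a rough sketch of that time-slicing approach for a fixed-size cluster, assuming you already collect per-query CPU seconds (cluster size and hourly rate are placeholders):

WORKERS = 20
WORKER_HOUR_USD = 0.50
HOUR_COST_USD = WORKERS * WORKER_HOUR_USD   # cost of one full cluster-hour

def allocate_hour(queries):
    # queries: list of {"query_id": ..., "cpu_seconds": ...} that ran in this hour.
    # Each query pays for the hour in proportion to its CPU time.
    total_cpu = sum(q["cpu_seconds"] for q in queries) or 1
    return {q["query_id"]: HOUR_COST_USD * q["cpu_seconds"] / total_cpu for q in queries}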

Spark: Event Logs Are Your Friend

Spark gives you too much information. Every job writes an event log—a JSON file packed with details about every stage, task, executor, and shuffle.

The Challenge: These logs are huge. A single large Spark job can produce multi-gigabyte event logs. Parsing them all in real-time isn't realistic.

The Practical Approach:

  1. Sample strategically. Parse event logs for jobs that ran longer than X minutes or consumed more than Y resources. For tiny jobs, use a simplified heuristic (runtime × cluster size). A minimal parsing sketch follows this list.

  2. Pre-aggregate at the application level. Calculate the cost per application first, then drill down into stages and tasks only when needed.

  3. Cache the results. Once you've parsed an event log and calculated a job's cost, cache it. Event logs don't change.
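Here is the parsing sketch mentioned in point 1: summing executor run time per stage from one (decompressed) event log. The JSON field names match what recent Spark versions emit for SparkListenerTaskEnd events, but verify them against your own logs.

import json
from collections import defaultdict

def stage_runtime_ms(event_log_path):
    # Sum "Executor Run Time" (milliseconds) across tasks, grouped by stage.
    totals = defaultdict(int)
    with open(event_log_path) as f:
        for line in f:                                   # event logs are JSON lines
            event = json.loads(line)
            if event.get("Event") == "SparkListenerTaskEnd":
                metrics = event.get("Task Metrics") or {}
                totals[event["Stage ID"]] += metrics.get("Executor Run Time", 0)
    return dict(totals)

# Multiply each stage's share of executor time by the job's attributed cost
# to get the per-stage breakdown ("Stage 3 is 60% of the cost").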

With Spark, the win is being able to tell someone: "Your daily ETL job costs $87.50. Here's the breakdown by stage. Stage 3 accounts for 60% of the cost because of a shuffle—maybe repartition fewer times?"

That's actionable insight, not just a bill.

Hive: The Legacy Challenge

Let's be honest—Hive is the dinosaur in the room. If you're still running Hive on-prem or via EMR, query logs are often incomplete.

Your best bet:

  • Use YARN logs. If Hive runs on YARN, query the YARN ResourceManager API. It tracks applications and their resource consumption (a sketch follows this list).
  • Approximate with runtime. If you can't get detailed metrics, use query runtime × cluster capacity as a rough proxy.
  • Focus on the biggest offenders. Identify the top 10 most expensive queries and optimize those. That's where 80% of the savings come from.
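For the YARN route, a hedged sketch against the ResourceManager REST API, whose /ws/v1/cluster/apps endpoint reports vcoreSeconds and memorySeconds per application (host, rate, and application type are placeholders for your environment):

import requests

RM = "http://resourcemanager:8088"
VCORE_SECOND_USD = 0.000014    # placeholder: derive from your cluster's amortized cost

resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"applicationTypes": "TEZ"})  # Hive-on-Tez; use MAPREDUCE for older setups
for app in resp.json()["apps"]["app"]:
    cost_usd = app["vcoreSeconds"] * VCORE_SECOND_USD
    print(app["id"], app["user"], app["queue"], round(cost_usd, 2))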

The Implementation Reality Check: Gotchas You'll Hit

Building this isn't a weekend project. Here are the landmines every architect encounters:

The "Idle Time" Dilemma: Who pays for the 60 seconds a warehouse stays awake after queries finish? Distribute idle costs proportionally based on that hour's usage.

The "Unattributed" Bucket: You'll never reach 100% attribution. Background services, storage overhead, and untagged queries create noise. Aim for 90%—the last 10% has diminishing returns.

The "Billing Lag" Issue: Cloud billing is delayed 24-48 hours. Calculate preliminary costs immediately, then reconcile with actual billing when it arrives.

Why This Infrastructure Actually Matters

Building an attribution engine isn't just about accounting. It's about fundamentally changing how engineers think about data work.

Unit Economics Become Visible: Calculate "cost-to-serve" for every dashboard and pipeline. If a dashboard costs $500/month but only 2 people use it, you have a data-driven decommissioning case.

Automated Guardrails: Build intelligent alerts—single query > $150? Slack the user with a breakdown. Team spend jumped 60%? Alert with top culprit queries.
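A toy version of that first guardrail, assuming attributed costs land somewhere you can scan and a Slack incoming webhook (the threshold and webhook URL are placeholders):

import requests

THRESHOLD_USD = 150
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder

def alert_expensive_queries(attributed_costs):
    # attributed_costs: iterable of rows with query_id, user, team, and cost_usd.
    for row in attributed_costs:
        if row["cost_usd"] > THRESHOLD_USD:
            requests.post(SLACK_WEBHOOK, json={
                "text": f"{row['user']} ({row['team']}) ran query {row['query_id']} "
                        f"costing ${row['cost_usd']:.2f}"
            })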

True Chargebacks: Provide Finance with 95% accurate department breakdowns. Answer the CFO's question—"Why is our platform so expensive?"—with data, not hand-waving.

Culture Shifts: When engineers see query costs, they care. Teams compete to be efficient. FinOps becomes part of engineering culture, not a top-down mandate.

Making It Actionable: Dashboards and Alerts

A cost attribution system is useless if nobody looks at it.

Build these dashboards:

  • Team Cost Overview: Monthly spend by team, with month-over-month comparison
  • Top Queries by Cost: The 50 most expensive queries with user, runtime, and cost (see the sketch after this list)
  • Cost per User: Rank users by spend—sometimes one notebook burns $500/day
  • Unit Economics: Cost-per-row, cost-per-dashboard-refresh, cost-per-API-call
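As one example, the Top Queries by Cost view is only a few lines once attributed costs sit in the lake (the path and column names are assumptions):

import pandas as pd

costs = pd.read_parquet("attributed_costs/")    # output of the Tier 3 pipeline
top50 = costs.nlargest(50, "cost_usd")[["query_id", "user", "team", "runtime_s", "cost_usd"]]
print(top50.to_string(index=False))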

What I'd Do Differently Next Time

If I were starting from scratch today, here's what I'd prioritize:

Start with one engine. Pick whichever (Spark, Trino, or Hive) burns the most money. Build attribution there first, get wins, then expand.

Don't over-engineer V1. A daily batch job emailing a CSV is fine. Ship something simple, learn, iterate.

Make data self-service. Store attributed costs in a queryable table. Let teams build their own dashboards. You're the enabler, not the gatekeeper.

Enforce the passport early. Make session tags mandatory from day one. Queries without context get rejected or throttled.

Socialize wins. Share cost insights in team meetings. Celebrate savings. Make it positive, not punitive.

Is It Worth It?

Let's do the math.

Say your data platform costs $150k/month. You build a cost attribution system over a quarter (one engineer, part-time). That's roughly $40k in labor.

Now you start optimizing. You find:

  • A dashboard query running every 5 minutes that could run hourly → $8k/month saved
  • An ETL job shuffling 10x more data than needed → $12k/month saved
  • A team running duplicate queries because they didn't know others had the same data → $5k/month saved

In two months, you've paid for the system. Everything after that is pure savings.

But honestly, the ROI isn't even the best part. The best part is the confidence. You can finally answer the CFO's question—"Why is our data platform so expensive?"—with data, not hand-waving.

You can show engineering teams where they're spending money and empower them to make smarter decisions. You can catch runaway workloads before they blow the budget.

And you can stop relying on tags.

Wrapping Up

In the Modern Data Stack era, the most successful data engineers won't just be the ones who move data the fastest. They'll be the ones who can explain the unit economics of every byte they move.

Cost attribution for data platforms isn't a solved problem. There's no open-source tool you can pip install that just works. But the underlying principles are straightforward:

  1. Tag sessions, not just infrastructure (Tier 1: Metadata Layer)
  2. Normalize billing and query data into a unified schema (Tier 2: Ingestion)
  3. Join and attribute proportionally based on resource consumption (Tier 3: Attribution Logic)
  4. Make the data visible and actionable (Dashboards & Alerts)

Whether you're running Spark, Trino, Hive, or all three, the pattern is the same. Instrument your query engines. Collect the data. Build a simple pipeline. Start showing teams what their workloads cost.

It won't be perfect. Your first version will have gaps. Some queries will be hard to attribute. That's okay. Start with 80% coverage and iterate.

Because at the end of the day, the goal isn't perfect cost accounting. It's building a data platform where engineers understand the cost of their work and have the tools to optimize it.

Stop tagging. Start architecting.

Your CFO will thank you.
