Our AWS bill crossed $180K in a single month.
We had Datadog. We had CloudWatch dashboards. We had a PagerDuty integration, three runbooks, and a Grafana board that took an intern two weeks to build. We had observability.
What we didn't have — what I didn't realize we were missing until our CFO sent a Slack message on a Tuesday afternoon — was any idea which product feature was responsible for which dollar.
She didn't ask why our p99 latency was 230ms. She asked:
"Which feature drove the $34K jump in compute costs last month?"
I had no answer.
This is that story. And if you're running a B2B SaaS product with more than a handful of microservices and an AI component or two, I'd bet it's your story too — you just haven't hit the Tuesday Slack message yet.
The false confidence of having "monitoring"
Here's the trap: modern observability tooling is extraordinarily good at answering infrastructure questions. It is almost completely blind to product questions.
When our bill jumped 34% in a month, I went straight to AWS Cost Explorer. I could see:
| Service | Change |
|---|---|
| EC2 | up $8,200 |
| Lambda invocations | up 1.1M calls |
| RDS | up $4,400 |
| OpenAI API (via internal proxy) | up $19,600 |
Clean data. Perfectly granular by service. Completely useless for answering the CFO's question.
Because the business doesn't care that Lambda invocations went up. The business cares whether that jump came from the new AI-assisted onboarding flow we shipped in week 2, or from the document summarization feature we'd been quietly scaling to enterprise accounts.
Those two answers have completely different business implications. One is a bug. One is growth. Our monitoring couldn't tell us which.
What our stack actually looked like
Before I walk through what we did, some context on the architecture — because this problem scales with complexity and I want you to recognize your own system in here.
We had 12 services at the time:
api-gateway → Kong, sitting in front of everything
auth-service → Node.js, JWT issuance + refresh
user-service → Python/FastAPI, profile + preferences
document-service → Python/FastAPI, upload + parsing (heavy S3 + Textract)
ai-service → Python, wraps OpenAI + Anthropic calls
search-service → Go, Elasticsearch-backed
notification-service → Node.js, SES + SNS
billing-service → Node.js, Stripe integration
analytics-service → Python, ClickHouse writes
export-service → Python, async PDF + CSV generation
webhook-service → Go, outbound delivery + retry
admin-service → Node.js, internal tooling
Standard enough setup. The problem wasn't that the architecture was unusual — it's that cost was never a first-class concern in how these services communicated.
Every service had structured logs going to CloudWatch. Every Lambda had a function name. Every ECS task had a cluster and a service tag. We could filter by service. We could not filter by what the user was doing when that service ran.
The document-service processed uploads for three different product features:
- Onboarding — auto-importing documents when a new company signed up
- Manual upload — a user explicitly uploads a file
- Integration sync — documents pulled in from Google Drive / Dropbox via OAuth
From AWS's perspective, it's all the same Lambda. From our perspective, these three features had wildly different unit economics.
The investigation
When I started digging, I pulled Cost Explorer data into a spreadsheet and tried to correlate it with our feature release schedule manually.
This was a mistake. Not because the data wasn't there — it was — but because the correlation surface area was too large. In the month in question, we had shipped:
- A complete overhaul of the onboarding flow (week 1)
- AI-assisted document tagging (week 2)
- Bulk export to PDF for enterprise accounts (week 3)
- A search relevance improvement that doubled our Elasticsearch query volume (week 4)
Trying to attribute cost jumps to feature releases by eyeballing a timeline is how you spend two days producing a narrative that sounds plausible but is mostly wrong.
I needed actual attribution. So I started building it.
Step 1: Get granular cost data out of AWS
Cost Explorer has an API. Most engineers I talk to have never touched it. It's genuinely useful.
import boto3
from datetime import datetime, timedelta

# The Cost Explorer API is exposed through the 'ce' client and served from us-east-1
ce = boto3.client('ce', region_name='us-east-1')

def get_service_costs_by_tag(start_date: str, end_date: str, tag_key: str):
    response = ce.get_cost_and_usage(
        TimePeriod={'Start': start_date, 'End': end_date},
        Granularity='DAILY',
        # Only include costs where the tag is set at all. MatchOptions has no
        # 'PRESENT' value, so express it as "not absent".
        Filter={
            'Not': {
                'Tags': {
                    'Key': tag_key,
                    'MatchOptions': ['ABSENT']
                }
            }
        },
        GroupBy=[
            {'Type': 'TAG', 'Key': tag_key},
            {'Type': 'DIMENSION', 'Key': 'SERVICE'}
        ],
        Metrics=['UnblendedCost']
    )
    return response['ResultsByTime']

# Pull the last 30 days grouped by our 'feature' tag
costs = get_service_costs_by_tag(
    start_date=(datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
    end_date=datetime.now().strftime('%Y-%m-%d'),
    tag_key='feature'
)
The key word in there is tag_key='feature'.
AWS Cost Allocation Tags let you slice your bill by any tag you've applied to your resources. The catch: you have to have applied those tags in the first place.
We had tags. We just hadn't been tagging by feature — only by service, environment, and team. So the API gave me cost by document-service but not by onboarding-flow vs manual-upload vs integration-sync.
This is the core of the visibility problem. Not a tooling limitation. A tagging discipline failure that compounded silently for eighteen months until the bill was large enough for someone to ask the question.
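One practical wrinkle if you go down this road: a tag only becomes usable in Cost Explorer after it's been activated as a cost allocation tag, and it can take up to 24 hours for a newly activated tag to start appearing in reports. A minimal sketch of activating it via the API (the same thing can be done in the Billing console):

import boto3

ce = boto3.client('ce', region_name='us-east-1')

# User-defined tags must be activated as cost allocation tags before
# Cost Explorer can group or filter by them.
ce.update_cost_allocation_tags_status(
    CostAllocationTagsStatus=[
        {'TagKey': 'feature', 'Status': 'Active'},
    ]
)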
Step 2: Retrofit feature context into the request path
I couldn't retroactively fix historical data. But I could fix the attribution model going forward — and then backfill from logs.
Here's the pattern we landed on. Every inbound request to api-gateway now carries a X-Feature-Context header, set by the frontend client:
// In the React app, every API call goes through this wrapper
const apiFetch = (path, options = {}) => {
const featureContext = getCurrentFeatureContext(); // e.g., "onboarding.document_import"
return fetch(`${API_BASE}${path}`, {
...options,
headers: {
...options.headers,
'Authorization': `Bearer ${getToken()}`,
'X-Feature-Context': featureContext,
'X-Session-Id': getSessionId(),
}
});
};
Kong (our gateway) propagates this header downstream. Every service receives it. Every service logs it.
Then, in each service's structured log output:
import functools

import structlog
from fastapi import FastAPI, Request

log = structlog.get_logger()
app = FastAPI()
def with_cost_context(func):
"""Decorator that ensures feature context is present in all log lines."""
@functools.wraps(func)
async def wrapper(request, *args, **kwargs):
feature_ctx = request.headers.get('X-Feature-Context', 'unknown')
session_id = request.headers.get('X-Session-Id', 'unknown')
with structlog.contextvars.bound_contextvars(
feature_context=feature_ctx,
session_id=session_id,
):
return await func(request, *args, **kwargs)
return wrapper
@app.post("/process")
@with_cost_context
async def process_document(request: Request, body: DocumentBody):
log.info("document.processing_started",
doc_size_bytes=body.size,
doc_type=body.mime_type)
result = await run_textract(body)
log.info("document.processing_complete",
pages_processed=result.pages,
textract_units=result.units_consumed,
duration_ms=result.duration_ms)
return result
Now every log line for every Textract call carries feature_context. When I query CloudWatch Logs Insights:
fields @timestamp, feature_context, textract_units, duration_ms
| filter ispresent(textract_units)
| stats
sum(textract_units) as total_textract_units,
count(*) as call_count,
avg(duration_ms) as avg_duration
by feature_context
| sort total_textract_units desc
I get this:
| feature_context | total_textract_units | call_count | avg_duration |
|---|---|---|---|
| onboarding.document_import | 1,847,302 | 12,483 | 4,201ms |
| integration.google_drive_sync | 934,110 | 8,921 | 3,890ms |
| manual_upload.user_initiated | 221,004 | 4,102 | 2,340ms |
The onboarding flow was consuming 61% of our Textract spend while generating roughly 12% of our active users. The unit economics were broken — cost per onboarded user had climbed well above what it was in the previous billing period, and we hadn't noticed because the spend was invisible inside the document-service aggregate.
Step 3: Do the same for AI spend
AI API costs are the fastest-growing line item for most SaaS products right now, and they're also the most opaque. OpenAI doesn't know which feature triggered your completion call. You have to tell it yourself.
We run all AI calls through an internal ai-service. Here's a simplified version of the instrumentation we added:
import time
from dataclasses import dataclass

import anthropic
import openai
import structlog

log = structlog.get_logger()
@dataclass
class AICallRecord:
feature_context: str
model: str
provider: str
input_tokens: int
output_tokens: int
duration_ms: int
estimated_cost_usd: float
request_id: str
# Approximate USD cost per token (input / output) at published rates
PRICING = {
"gpt-4o": {"input": 0.0000025, "output": 0.00001},
"gpt-4o-mini": {"input": 0.00000015, "output": 0.0000006},
"claude-3-5-haiku-20241022": {"input": 0.0000008, "output": 0.000004},
"claude-sonnet-4-20250514": {"input": 0.000003, "output": 0.000015},
}
def tracked_completion(
messages: list,
model: str,
feature_context: str,
max_tokens: int = 1000,
**kwargs
) -> tuple[str, AICallRecord]:
provider = "anthropic" if "claude" in model else "openai"
start = time.time()
if provider == "openai":
client = openai.OpenAI()
response = client.chat.completions.create(
model=model, messages=messages, max_tokens=max_tokens, **kwargs
)
content = response.choices[0].message.content
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
else:
client = anthropic.Anthropic()
response = client.messages.create(
model=model, messages=messages, max_tokens=max_tokens, **kwargs
)
content = response.content[0].text
input_tokens = response.usage.input_tokens
output_tokens = response.usage.output_tokens
duration_ms = int((time.time() - start) * 1000)
pricing = PRICING.get(model, {"input": 0, "output": 0})
cost = (input_tokens * pricing["input"]) + (output_tokens * pricing["output"])
record = AICallRecord(
feature_context=feature_context,
model=model,
provider=provider,
input_tokens=input_tokens,
output_tokens=output_tokens,
duration_ms=duration_ms,
estimated_cost_usd=cost,
request_id=getattr(response, 'id', 'unknown')
)
    log.info("ai.completion",
        feature_context=record.feature_context,
        model=record.model,
        provider=record.provider,
        input_tokens=record.input_tokens,
        output_tokens=record.output_tokens,
        estimated_cost_usd=round(record.estimated_cost_usd, 6),
        duration_ms=record.duration_ms)
return content, record
Every AI call now emits a structured log line with the feature context, model, token counts, and estimated cost. This flows into CloudWatch → Logs Insights → internal cost dashboard, refreshed nightly.
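The nightly aggregation behind that dashboard is not much more than a Logs Insights query over the ai.completion events — roughly this shape, reusing the field names from the log call above:

fields @timestamp, feature_context, model, estimated_cost_usd
| filter ispresent(estimated_cost_usd)
| stats
    sum(estimated_cost_usd) as total_cost_usd,
    count(*) as call_count,
    avg(duration_ms) as avg_duration
  by feature_context, model
| sort total_cost_usd desc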
After two weeks running this in production, we ran the numbers:
AI spend attribution — trailing 14 days
| feature_context | cost_usd | % of total |
|---|---|---|
| ai_tagging.document_auto_tag | $6,841 | 34.9% |
| search.query_expansion | $4,203 | 21.4% |
| onboarding.welcome_email_personalize | $3,112 | 15.9% |
| export.pdf_summary_generation | $2,891 | 14.7% |
| support.ticket_classification | $2,570 | 13.1% |
The document auto-tagging feature — which we had shipped three weeks prior as a "small AI enhancement" — was consuming 35% of our entire AI budget. It was running GPT-4o on every document on ingest, including documents that the user never actually opened.
We switched it to GPT-4o-mini for the initial pass and only escalated to GPT-4o on user-initiated tagging reviews. Monthly AI cost for that feature dropped from ~$13,500 to ~$2,100. Same user-facing quality on the 95% of documents that get the light model. Better quality on the 5% where it actually matters.
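The change itself was small. A sketch of the escalation logic — the function name and prompt here are illustrative, but the shape is the point: the default path never touches the expensive model.

def tag_document(document_text: str, user_initiated_review: bool = False):
    # Every ingested document gets the cheap model by default; only an
    # explicit user review of the tags escalates to the expensive one.
    model = "gpt-4o" if user_initiated_review else "gpt-4o-mini"
    content, record = tracked_completion(
        messages=[{"role": "user", "content": f"Tag this document:\n\n{document_text}"}],
        model=model,
        feature_context="ai_tagging.document_auto_tag",
        max_tokens=400,
    )
    return content, record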
This specific gap — knowing which feature is running which model at what cost — is exactly what I've been building CostReveal to surface automatically, without requiring teams to wire up their own logging pipeline from scratch.
The actual problem was never the cost
Here's what I want you to take away from this — and it's the thing I've said to every CTO I've talked to since we went through this.
The $180K month wasn't a cost problem. It was a visibility problem.
The costs were legitimate. Infrastructure runs money. AI APIs cost money. Textract costs money. None of those bills were wrong in isolation. But without the ability to connect those costs to the product decisions that drove them, we had no mechanism to evaluate whether we were spending intelligently.
We were flying on instruments that told us altitude and airspeed but not fuel consumption per destination.
The fix wasn't to spend less. The fix was to be able to see what we were spending on — at the feature level, not the service level.
Once we had that visibility:
- We made a targeted optimization that saved ~$11K/month on AI spend alone
- We identified that our Google Drive sync was 4× more expensive per document than manual uploads, and adjusted our pricing for integration-heavy enterprise accounts accordingly
- We found two services still running on m5.xlarge instances that had been over-provisioned since a Black Friday scare eighteen months ago — neither team remembered provisioning them
None of that required a cost-cutting initiative. It required knowing where to look.
What I'd do differently from day one
If I were starting the same architecture fresh:
1. Make feature context a first-class concern in your API contract
Don't leave it optional. Every request that enters your system should carry a feature identifier. Enforce it at the gateway level. Log it everywhere.
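If your gateway makes that awkward to enforce, a per-service fallback gets you most of the way. A sketch in FastAPI middleware terms — not the Kong setup described above, just the shape of the check:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.middleware("http")
async def require_feature_context(request: Request, call_next):
    # Reject requests that arrive without a feature identifier. During rollout
    # you may prefer to default to "unknown" and log, then tighten to a 400.
    if "X-Feature-Context" not in request.headers:
        return JSONResponse(
            status_code=400,
            content={"error": "X-Feature-Context header is required"},
        )
    return await call_next(request)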
2. Tag AWS resources by feature, not just by service
Cost Allocation Tags are free. The discipline to apply them is not — but it compounds. A service tag tells you document-service costs $8K. A feature tag tells you onboarding.document_import costs $5K of that $8K.
resource "aws_lambda_function" "document_processor" {
# ... other config
tags = {
Environment = var.environment
Team = "platform"
Service = "document-service"
Feature = "onboarding.document_import" # ← this is the one that matters
CostCenter = "product-engineering"
}
}
3. Build a token budget per feature, not per model
For AI features, define expected token ranges based on input size. Alert when a feature consistently runs 40%+ over budget. This is the AI equivalent of a memory leak — silent until it's expensive.
FEATURE_TOKEN_BUDGETS = {
"ai_tagging.document_auto_tag": {
"input_per_call": 2000,
"output_per_call": 400,
"alert_multiplier": 1.4, # alert if 40% over budget
"model_default": "gpt-4o-mini",
"model_escalate": "gpt-4o",
},
"search.query_expansion": {
"input_per_call": 500,
"output_per_call": 150,
"alert_multiplier": 1.5,
"model_default": "gpt-4o-mini",
"model_escalate": None, # never escalate — cheap by design
}
}
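To give a feel for how those budgets get used, here's a minimal sketch of checking a single call record against them — in practice the alert should fire on sustained overruns rather than one spike, but the comparison is the same:

def check_token_budget(record: AICallRecord) -> None:
    # Flag calls whose token usage exceeds the per-feature budget by the
    # configured multiplier. Sustained overruns, not single calls, are the
    # signal worth paging on.
    budget = FEATURE_TOKEN_BUDGETS.get(record.feature_context)
    if budget is None:
        log.warning("ai.budget_missing", feature_context=record.feature_context)
        return
    m = budget["alert_multiplier"]
    if (record.input_tokens > budget["input_per_call"] * m
            or record.output_tokens > budget["output_per_call"] * m):
        log.warning("ai.token_budget_exceeded",
            feature_context=record.feature_context,
            model=record.model,
            input_tokens=record.input_tokens,
            output_tokens=record.output_tokens,
            estimated_cost_usd=round(record.estimated_cost_usd, 6))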
4. Treat cost spikes like production incidents
We have runbooks for latency spikes and error rate spikes. We now have the same for cost spikes. The runbook isn't "reduce costs" — it's "identify which feature drove the change and assess whether it's expected."
Sometimes the answer is: expected, this is growth.
Sometimes it's: unexpected, this is a bug.
Both are important to know quickly.
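The runbook's first step — figure out which feature moved — is easy to automate once per-feature daily totals exist. A sketch, assuming a nightly job that already produces daily cost per feature_context (the data shape here is hypothetical):

from statistics import mean

def flag_cost_anomalies(daily_costs: dict[str, list[float]], threshold: float = 1.3) -> list[str]:
    # daily_costs maps feature_context -> daily USD totals, oldest first,
    # with yesterday as the last element.
    flagged = []
    for feature, series in daily_costs.items():
        if len(series) < 8:
            continue  # not enough history for a trailing baseline
        baseline = mean(series[-8:-1])  # trailing 7-day average, excluding yesterday
        if baseline > 0 and series[-1] > baseline * threshold:
            flagged.append(f"{feature}: ${series[-1]:,.0f} yesterday vs ${baseline:,.0f} 7-day avg")
    return flagged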
Closing thought
Your observability stack is almost certainly excellent at telling you that something is running. It is probably telling you almost nothing about why specific product decisions cost what they cost.
That gap is manageable when your bill is $10K/month. It becomes genuinely dangerous when you're at $100K+ — not because the absolute numbers are large, but because decisions about AI features, scaling strategies, and enterprise pricing all depend on unit economics you cannot calculate without feature-level attribution.
The good news: the technical lift to get there is real but not enormous. A consistent context header, a tagging discipline, and a few structured log fields get you 80% of the way. The remaining 20% — aggregating it into something queryable and surfacing anomalies automatically — is where it gets harder, but you can start with CloudWatch Logs Insights and a spreadsheet and still get signal that changes your decisions.
Start with the one feature you're most uncertain about. Instrument it this week. Run it for two billing cycles. I'd be genuinely surprised if you didn't find something worth knowing.
If you'd rather skip building the pipeline and go straight to the attribution layer, that's what CostReveal is for.