Trigops for Trigops

Posted on Jun 29

Multi-Account AWS Cost Visibility Without a FinOps Team

#aws #finops #cloud #devops

The Spreadsheet That Lives in Someone's Head

You manage AWS cost for an engineering org with eight accounts: production, per-team staging, a data-science sandbox, a legacy account from the acquisition two years ago that still runs a handful of RDS instances nobody has formally decommissioned. Cost Explorer exists. You open it, switch to the org-level view, and see a number. A large number. It does not tell you whether that number is reasonable, which team owns the spike in us-east-1 last Tuesday, or whether the r6g.2xlarge RDS cluster in the staging account ran all weekend because someone forgot to stop it or because there was actually a load test.

The practical answer for most engineering orgs at this stage is: somebody builds a spreadsheet. It gets stale. Then the cost review meeting is an hour of archaeology.

This article is about building something better - not a platform, not a FinOps tool purchase, but a clear architectural picture of where AWS multi-account cost visibility actually breaks down and what the minimal viable fix looks like. We'll ground it in the resource types that generate the most surprise at the end of the month: EC2, ECS, RDS, and ASG.

Why Cost Explorer Alone Isn't Enough

Cost Explorer is good at billing history. It is less good at answering operational questions:

Which EC2 instances are running right now and are they doing anything?
How long has this ECS service been at zero requests per minute?
Who owns this RDS cluster and does that team know it ran all weekend?

The gap is that Cost Explorer is a billing ledger. Billing accrues by the hour regardless of whether a resource is serving traffic, running tests, or simply existing because no one turned it off. For production, that's fine - production should run. For dev and staging, the relationship between uptime and actual utility is much weaker.

In a single-account org you can build a reasonable picture with Cost Allocation Tags and a few Athena queries over the CUR (Cost and Usage Report). Multi-account orgs hit two compounding problems:

1. Tag discipline is uneven across accounts. The production account has a mature tagging policy enforced by SCPs. The sandbox account the data team set up eighteen months ago has almost nothing tagged, and the data team has since reorganized. Untagged spend is a black hole in Cost Explorer's group-by views.

2. Utilization is invisible in billing data. A stopped EC2 instance still shows an EBS charge. A running RDS instance in a staging account at 1% CPU looks identical in the billing line to a running RDS instance serving real traffic. The bill doesn't carry signal about whether the resource was actively used.

The Architecture: Three Data Planes

Building multi-account visibility that actually answers operational questions requires stitching together three separate data planes:

1. Billing data (what you're paying)

The Cost and Usage Report is the authoritative source for what AWS charged you. Enable CUR at the org management account level with hourly granularity and resource-level line items. Land it in S3, crawl it with a Glue crawler, query it with Athena.

The key table relationship: line_item_resource_id is the ARN of the resource that generated a charge. This is your join key to the other two data planes.

Useful starting queries:

-- Top spenders by account and service, last 30 days
SELECT
  line_item_usage_account_id,
  line_item_product_code,
  SUM(line_item_unblended_cost) AS total_cost
FROM cur_db.cur_table
WHERE line_item_line_item_type = 'Usage'
  AND month = DATE_TRUNC('month', CURRENT_DATE)
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 50;

-- Untagged EC2 spend by account (tag enforcement gap finder)
SELECT
  line_item_usage_account_id,
  SUM(line_item_unblended_cost) AS untagged_cost
FROM cur_db.cur_table
WHERE line_item_product_code = 'AmazonEC2'
  AND resource_tags_user_team IS NULL
  AND month = DATE_TRUNC('month', CURRENT_DATE)
GROUP BY 1
ORDER BY 2 DESC;

2. Inventory data (what's running)

Billing data tells you what you paid. It doesn't tell you the current state of the resource or who's responsible for it. For that you need an inventory pass across accounts.

The standard approach is an AWS Config aggregator at the org level. Enable it from the management account; it federates resource configuration snapshots from all member accounts into a single query interface.

For the resource types that matter most:

EC2: aws:ec2:instance - state (running/stopped), instance type, launch time, attached tags
RDS: aws:rds:dbinstance - status (available/stopped), instance class, MultiAZ, last modified
ECS: aws:ecs:service - desiredCount, runningCount, lastDeployment
ASG: aws:autoscaling:autoScalingGroup - min/max/desired, instance count

You can query this with AWS Config Advanced Queries (which speak a subset of SQL):

-- All running EC2 instances across org, with tags
SELECT
  accountId,
  awsRegion,
  resourceId,
  configuration.instanceType,
  configuration.state.name,
  tags
FROM aws_config_configuration_snapshot
WHERE resourceType = 'AWS::EC2::Instance'
  AND configuration.state.name = 'running'

For ECS services the useful signal is runningCount vs desiredCount. A service sitting at desired=1, running=1 but with zero CloudWatch RequestCount metrics for 72 hours is a candidate for investigation - it may be legitimate (a background worker) or it may be forgotten.

3. Activity data (what's being used)

This is the hardest layer to build and the most valuable. Billing tells you uptime cost; inventory tells you what exists; activity tells you whether any human or system is actually using a resource right now.

For EC2 and ECS, CloudWatch CPUUtilization is the proxy most teams reach for first. It is better than nothing but has known blind spots:

A dev box running at 0.2% CPU might have a developer actively connected over SSH writing code in a terminal - CPU is low because they're thinking, not because the machine is idle.
An ECS service at 0% CPU might be perfectly legitimately idle, waiting for an event that happens once a day.
Conversely, a forgotten RDS instance might spike to 30% CPU periodically from an automated job that nobody thought to cancel when the project ended.

CPU utilization is a workload signal. It's not a human-presence signal. For development and staging resources, what you actually want to know is: is a person actively working on this right now?

That's a meaningfully different question. This is what activity-driven resource management attempts to answer - tracking real user presence and work-tool focus at the desktop level, rather than inferring activity from server-side metrics. Instead of asking "is the CPU busy?", an activity-driven system asks "is the engineer who owns this box actually at their desk, with their IDE open?" The two questions have different answers surprisingly often.

For the purposes of a visibility architecture, whether you're building activity detection yourself or relying on a tool that provides it, the key architectural point is that server-side metrics and billing data alone cannot tell you whether dev/staging resource usage is intentional. You need a signal from the demand side, not just the supply side.

Stitching It Together: The Minimal Viable Dashboard

For an org without a dedicated FinOps team, the goal is a weekly operational artifact that answers four questions:

What did we spend per account and per team last week?
Which resources ran longest with no detectable activity?
Which accounts have the worst untagged spend?
What's the week-over-week trend?

You don't need a commercial platform to answer these. A reasonable minimal stack:

CUR → S3 → Athena for billing queries. Schedule a weekly Athena query via EventBridge + Lambda and write results to another S3 prefix.

Config Aggregator → Advanced Queries for inventory snapshots. The same Lambda can call select_aggregate_resource_config via the SDK.

CloudWatch Metrics Insights for utilization data. A cross-account metrics query (available in CloudWatch Metrics Insights with a metrics stream or via the AWS/EC2 namespace with cross-account observability enabled) can pull CPUUtilization P50 over the last 7 days for all instances.

Join on ARN / resource ID. Export to a simple HTML report or push to a Slack channel. The output doesn't need to be beautiful; it needs to be accurate and regular.

A skeleton Lambda handler for the billing aggregation:

import boto3

ATHENA_DB = "cur_db"
ATHENA_OUTPUT = "s3://your-bucket/athena-results/"
WORKGROUP = "primary"

QUERY = """
SELECT
  line_item_usage_account_id AS account,
  resource_tags_user_team AS team,
  ROUND(SUM(line_item_unblended_cost), 2) AS week_cost_usd
FROM cur_db.cur_table
WHERE line_item_line_item_type = 'Usage'
  AND line_item_usage_start_date >= DATE_ADD('day', -7, CURRENT_DATE)
GROUP BY 1, 2
ORDER BY 3 DESC
"""

def handler(event, context):
    client = boto3.client("athena", region_name="us-east-1")
    response = client.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": ATHENA_DB},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
        WorkGroup=WORKGROUP,
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}

Run this in the management account with a cross-account IAM role that has athena:StartQueryExecution and s3:PutObject. The CUR data is already org-wide if you set it up at the management account level.

The Tag Governance Problem

None of the above is useful if half your spend is untagged. Multi-account tag governance is its own discipline, but the minimum viable practice:

SCPs on create actions. A Service Control Policy that denies ec2:RunInstances, rds:CreateDBInstance, and ecs:CreateService when the required tags (Team, Environment, Owner) are missing prevents untagged resources from being created. Attach it to the OU containing non-production accounts first; production teams will complain less if you start there.

Config Rules for drift detection. required-tags is a managed Config rule. Combined with the aggregator, it gives you a cross-account view of resources that exist but are missing tags - often legacy resources predating the policy.

A remediation queue, not a penalty. The goal is visibility, not compliance theater. Tag violations should generate a ticket in your engineering team's backlog, not an automated termination (until you have very high confidence in the policy). Terminating a production resource because someone forgot a tag is a worse outcome than the tag missing.

What This Won't Solve

A billing + inventory + utilization dashboard tells you that your staging RDS ran all weekend and cost $47. It does not tell you why, and it does not prevent it next weekend.

Prevention requires either a schedule (turn it off at 6 PM Friday, turn it on Monday morning - fragile if the team works irregular hours) or something that responds to actual human activity. A schedule can be built in an afternoon with RDS maintenance windows or AWS Instance Scheduler. It will start misfiring as soon as the engineering team's actual work pattern diverges from the schedule - which is immediately, for most teams.

The harder architectural question is: what signal should drive resource state? Scheduled stop/start is a blunt instrument. Utilization-based autoscaling handles load variation but not the fundamental question of whether anyone is present and working. Activity-driven approaches - where resource state follows the engineer's actual desktop session, IDE focus, and work-tool activity - are a newer category, purpose-built for this gap in dev/test environments.

Starting Points

If you're building this from scratch, the priority order that gives the most signal per hour of work:

Enable CUR with resource-level line items at the org level. Everything else in cost visibility builds on this. Without resource-level data, the billing plane is opaque.
Enable Config Aggregator. Inventory without billing is guesswork; billing without inventory is a black box. Together they let you ask "what is this charge and what resource caused it?"
Build the tag gap report first. Before you can reason about team-level spend, you need to know how much of your spend is attributable. Fix the gaps with SCPs on new resources; chase legacy untagged resources with Config remediation.
Add utilization signals last. CloudWatch Metrics Insights cross-account is the right tool once you have billing and inventory aligned. CPU P50 over 7 days by resource is a reasonable first cut at identifying idle candidates.

The staging RDS that ran all weekend is a $47 problem once. It is a $2,400 problem annually per forgotten instance. Across eight accounts with mixed tag hygiene and teams that move fast, the aggregate is significant - not because any single resource is catastrophically wasteful, but because the pattern repeats undetected at scale.

Visibility is the prerequisite to action. Get the data plane right first.

If this is a problem you're actively working on - staging and dev resources running past their useful window, or cost visibility spread thin across accounts - Trigops is built specifically for this: activity-driven pause and resume for EC2, ECS, RDS, and ASG, with per-account and per-team visibility. Worth a look if you want the detection and the action layer in one place.

DEV Community