Anderson Leite

Posted on Jul 1

Keeping Log Analytics Costs at Bay: Budgets, Alerts and a Kill Switch You Actually Test

#devops #infrastructure #azure

A big shout-out to Giovanna, who brought this challenge!

Log Analytics ingestion is one of those Azure costs that behaves well for months, then doesn't. A noisy diagnostic setting, a new service sending verbose logs, a misconfigured Sentinel connector, and suddenly your monthly bill has a very different shape than the one you budgeted for.

Cost Management budgets and alerts exist for exactly this. But an email at 100% of budget doesn't stop the ingestion that's already happening. If you want something that actually intervenes, you need automation behind the alert, and automation that intervenes in production needs the same rigor you'd apply to any other change: a rollback plan, a real test, and an honest accounting of its blast radius.

This is the story of building that automation: a Logic App that gets triggered by a budget alert and caps daily ingestion on a set of workspaces. It's also the story of the two bugs that almost let it ship broken, because the failure mode for a safety net that silently doesn't work is worse than having no safety net at all. You find out during the next runaway bill, not during testing.

Why Log Analytics Costs Sneak Up on You

Azure bills Log Analytics ingestion by the gigabyte. That's simple until you count the sources actually writing to a workspace: diagnostic settings on every resource type, Microsoft Sentinel and Defender for Cloud connectors, custom application logs, AKS container insights, and whatever verbose debug logging someone forgot to turn off after an incident. Each source looks small on its own. Together, they compound.

There are three layers worth thinking about, in order of how early they intervene:

Prevention: Data Collection Rule transformations that filter or sample before ingestion, table-level retention tuned to what you actually query, and Basic Logs or Auxiliary tables for high-volume, low-query-value data like verbose diagnostics.
Detection: Cost Management budgets with staged alert thresholds, so someone finds out at 50% and 75% of budget, not only at 100%.
Circuit breaking: an automated, blunt intervention that stops ingestion outright once detection has failed to prevent the problem in time.

Most cost control work should live in layer one. Layer three is what this article is about, and it's worth saying upfront: it's a last resort, not a cost management strategy. A daily ingestion cap doesn't care which table or which source is responsible. It stops everything on a workspace, including the security logs you probably want flowing during whatever cost spike triggered the cap in the first place.

The Design: Staged Alerts, Escalating Response

The setup that made sense here uses one Cost Management budget with three thresholds, each pointing at a different Action Group:

Threshold	Action	Purpose
50% of budget	Email to the team	Early warning, still time to investigate
75% of budget	Email to the team	Last chance to act before automation takes over
100% of budget	Logic App trigger	Circuit breaker: stop ingestion now, ask questions after

The staging matters. Two email warnings give a human the chance to catch a runaway cost before it becomes a production incident. Only the last threshold, the one that means prevention already failed, triggers the automated response. If your first alert fires the kill switch, you've skipped the part of the process where a person gets to say "wait, that's expected, we're running a migration this week."
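The escalation logic can be sketched in a few lines of Python. This is only an illustration of the staging idea, not how Cost Management evaluates budgets internally; the `STAGES` list, the action labels, and both function names are made up for the example.

```python
from typing import List, Tuple

# (threshold as percent of budget, response) pairs mirroring the table above.
# The response labels are illustrative names, not Azure identifiers.
STAGES: List[Tuple[int, str]] = [
    (50, "email-team"),
    (75, "email-team"),
    (100, "trigger-logic-app"),
]


def crossed_stages(spend: float, budget: float) -> List[Tuple[int, str]]:
    """Return every stage whose threshold the current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent_used = spend / budget * 100
    return [(pct, action) for pct, action in STAGES if percent_used >= pct]


def kill_switch_armed(spend: float, budget: float) -> bool:
    """True only once the final, automated stage has been reached."""
    return any(
        action == "trigger-logic-app"
        for _, action in crossed_stages(spend, budget)
    )
```

At 80% of budget, `crossed_stages` reports both email stages and `kill_switch_armed` is still false, which is the window where a human can step in.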

Building the Kill Switch

The Trigger

The Logic App starts with an HTTP request trigger. This is what an Action Group calls when it fires, whether the underlying alert is a budget alert, a metric alert, or anything else that supports the common alert schema. The schema needs to be pasted into the trigger's definition so the workflow can actually parse what arrives, not just accept it:

{
    "type": "object",
    "properties": {
        "schemaId": { "type": "string" },
        "data": {
            "type": "object",
            "properties": {
                "essentials": {
                    "type": "object",
                    "properties": {
                        "alertId": { "type": "string" },
                        "alertRule": { "type": "string" },
                        "severity": { "type": "string" },
                        "signalType": { "type": "string" },
                        "monitorCondition": { "type": "string" },
                        "monitoringService": { "type": "string" },
                        "alertTargetIDs": {
                            "type": "array",
                            "items": { "type": "string" }
                        },
                        "originAlertId": { "type": "string" },
                        "firedDateTime": { "type": "string" },
                        "description": { "type": "string" },
                        "essentialsVersion": { "type": "string" },
                        "alertContextVersion": { "type": "string" }
                    }
                },
                "alertContext": { "type": "object" }
            }
        }
    }
}

Getting the schema right is the difference between a trigger that fires and a trigger that fires but can't read anything useful out of the payload.

The Condition

Not every payload the trigger receives means "act now." Azure Monitor sends a payload when an alert fires and another when it resolves, and only one of those should do anything:

@equals(triggerBody()?['data']?['essentials']?['monitorCondition'], 'Fired')

Skip this check and your kill switch will happily run a second time when the alert resolves, which is not what you want from something that's supposed to be careful.
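To reason about what that expression does with different payloads, here is a rough Python equivalent. It is a sketch of the same null-safe lookup, not the Logic Apps engine itself; `should_act` is a hypothetical name, and the comparison here is an exact string match.

```python
from typing import Optional


def should_act(body: Optional[dict]) -> bool:
    """Approximate the workflow condition: act only on a 'Fired' alert.

    Each lookup tolerates a missing or empty level, much like the `?[...]`
    operator in the workflow expression, so an empty body yields False
    instead of raising an error.
    """
    data = (body or {}).get("data") or {}
    essentials = data.get("essentials") or {}
    return essentials.get("monitorCondition") == "Fired"
```

A payload with `monitorCondition` set to `Resolved`, an empty object, or no body at all evaluates to false, so the capping actions are skipped without the run failing.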

The Action

Inside the condition sits one HTTP action per workspace. Each is a PATCH against the workspace's ARM resource, authenticated through the Logic App's system-assigned managed identity:

{
    "method": "PATCH",
    "uri": "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>?api-version=2023-09-01",
    "authentication": {
        "type": "ManagedServiceIdentity"
    },
    "body": {
        "properties": {
            "workspaceCapping": {
                "dailyQuotaGb": 0.023
            }
        }
    }
}

0.023 GB is the practical floor the API accepts for dailyQuotaGb. Setting it there stops new ingestion almost immediately rather than merely reducing it. There's no dedicated "pause ingestion" API. This is the closest real equivalent, and it's blunt on purpose: workspace-wide, not table-by-table.
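Since the workflow needs one such action per workspace, it can help to generate the definitions instead of hand-editing each URI. The sketch below assumes the same API version and body shape as the request above; `cap_action`, `cap_actions`, and the placeholder names in the usage note are illustrative, and nothing here calls Azure.

```python
from typing import Dict, List, Tuple

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2023-09-01"  # same API version as the request shown above


def cap_action(subscription_id: str, resource_group: str, workspace: str,
               daily_quota_gb: float = 0.023) -> Dict:
    """Build one HTTP action definition that caps a single workspace."""
    uri = (
        f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.OperationalInsights/workspaces/{workspace}"
        f"?api-version={API_VERSION}"
    )
    return {
        "method": "PATCH",
        "uri": uri,
        "authentication": {"type": "ManagedServiceIdentity"},
        "body": {
            "properties": {"workspaceCapping": {"dailyQuotaGb": daily_quota_gb}}
        },
    }


def cap_actions(subscription_id: str,
                targets: List[Tuple[str, str]]) -> List[Dict]:
    """Build one action per (resource_group, workspace) pair."""
    return [cap_action(subscription_id, rg, ws) for rg, ws in targets]
```

Calling `cap_actions("<subscription-id>", [("rg-a", "ws-a"), ("rg-b", "ws-b")])` returns two definitions that differ only in their `uri`, which makes a typo in a single workspace path easier to spot in review.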

One mechanical detail worth knowing: on Pay-As-You-Go workspaces, Azure automatically lifts a daily cap's enforcement at the workspace's next daily reset time. That reset hour is fixed per workspace, so the pause can end well short of a full 24 hours after the cap kicked in. Don't rely on that reset as your rollback plan. It resets the ingestion counter, not the quota configuration that existed before your kill switch ran. Plan your own rollback rather than waiting for Azure to quietly undo part of the situation on its own timeline.

The managed identity needs a role that can write to the workspace resource. Log Analytics Contributor is the tightly scoped option. A broader Contributor role also works, since it's a permission superset, but it grants more than this automation needs: write access to saved searches, linked alert rules, and other workspace configuration that has nothing to do with ingestion caps. If anyone audits IAM grants in your environment, scope it down.

What Actually Broke During Testing

Here's where it gets useful, because a walkthrough that only shows the happy path teaches you how to build something, not how to trust it. Two things broke while testing this exact Logic App, and both were the kind of failure that looks fine until you specifically go looking for it.

The IAM Role That Wasn't Actually There

The setup: role assignments were pushed to all four target workspaces in one pass, granting the managed identity write access.

What went wrong: two of the four PATCH actions failed. The other two succeeded. A quick check showed those two workspaces simply hadn't received the role assignment. A networking hiccup during the bulk assignment meant it silently didn't apply everywhere it was supposed to.

The lesson: role propagation isn't instant, and role assignment isn't atomic across multiple resources just because you issued the commands together. If your kill switch fans out to several resources, verify the identity's access on each one individually before you trust the automation to work on all of them. A partial success that looks like it might be a full success is worse than an obvious total failure, because it's the kind of thing that passes a quick glance.
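A per-workspace check along these lines would have flagged the gap before the first test run. It is a minimal sketch: it assumes you have already collected the role names assigned to the managed identity at each workspace scope (for example by listing role assignments per scope), and the `WRITE_ROLES` set and function name are illustrative rather than exhaustive.

```python
from typing import Dict, Iterable, List, Set

# Roles assumed here to be sufficient for writing the workspace resource.
# Adjust to whatever your environment actually grants.
WRITE_ROLES: Set[str] = {"Log Analytics Contributor", "Contributor", "Owner"}


def workspaces_missing_access(
    expected: Iterable[str],
    roles_by_workspace: Dict[str, Set[str]],
) -> List[str]:
    """Return every expected workspace where the identity has no write role.

    A workspace that is absent from roles_by_workspace counts as missing,
    so a lookup that silently returned nothing is reported, not ignored.
    """
    return [
        ws for ws in expected
        if not (roles_by_workspace.get(ws, set()) & WRITE_ROLES)
    ]
```

An empty result is the only outcome that should let the automation proceed; anything else names the workspaces where the PATCH would fail.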

The Trigger That Got Renamed and the Callback That Didn't Notice

The setup: the Logic App's HTTP trigger was renamed from its default internal name to something more descriptive. Purely cosmetic, done in the workflow's code view.

What went wrong: the Action Group's Logic App action had already been configured before the rename, and Azure Monitor's callback URL bakes the trigger's name directly into the invocation path (/triggers/<trigger-name>/paths/invoke). The rename updated the workflow. It did not update the Action Group's stored callback URL. The result: a Logic App that worked perfectly when triggered manually, and would have 404'd silently the moment a real budget alert tried to invoke it, because the URL Azure Monitor had on file pointed at a trigger name that no longer existed.

The lesson: a rename inside the Logic App doesn't propagate to anything that already has a URL pointing at the old name. If you rename a trigger after wiring an Action Group to it, go back and re-select the Logic App and trigger in the Action Group's configuration so it regenerates the callback against the current name. Don't assume the two sides of an integration stay in sync just because one half changed. The only way to catch this is diffing the Action Group's actual stored configuration against the Logic App's actual current trigger name. It's easy to skip that diff entirely if your only test is a manual curl against a URL you copied fresh from the Designer, since that URL always reflects the current name.
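That diff is easy to script once you have both values in hand. The sketch below only parses strings; it assumes the callback URL follows the `/triggers/<trigger-name>/paths/invoke` shape described above, and the function names and the example URL in the usage note are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlsplit


def trigger_name_from_callback(url: str) -> Optional[str]:
    """Extract the trigger name embedded in a callback URL, if any."""
    path = urlsplit(url).path
    match = re.search(r"/triggers/([^/]+)/paths/invoke", path)
    return unquote(match.group(1)) if match else None


def callback_is_stale(stored_callback_url: str, current_trigger_name: str) -> bool:
    """True when the stored URL points at a different trigger name.

    A URL that doesn't match the expected shape is also treated as stale,
    since it can't be confirmed to target the current trigger.
    """
    return trigger_name_from_callback(stored_callback_url) != current_trigger_name
```

Given a stored URL ending in `/triggers/manual/paths/invoke?...` and a workflow whose trigger is now called `budget_alert_received`, `callback_is_stale` returns true, which is the signal to go re-select the trigger in the Action Group.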

Testing It Properly

The Test That Looks Like It Passed But Didn't

Azure Logic Apps has a built-in "Run Trigger" button for HTTP request triggers. It's tempting to use it as a quick smoke test. Don't, at least not for a workflow gated by a condition. That button sends an empty request body. If your condition checks a field inside the body, like monitorCondition, the check evaluates against nothing, the condition fails, and the run reports success because nothing errored. It just also didn't do anything. A green checkmark on a run that skipped every meaningful action is a worse outcome than a red one, because it looks like proof when it's actually silence.

The fix is sending a real payload that matches the schema, with monitorCondition set to Fired:

curl -X POST "<trigger-url>" \
  -H "Content-Type: application/json" \
  -d '{
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertId": "test-alert-id",
            "alertRule": "test-budget-alert",
            "severity": "Sev3",
            "signalType": "Metric",
            "monitorCondition": "Fired",
            "monitoringService": "Platform",
            "alertTargetIDs": [],
            "originAlertId": "test-origin",
            "firedDateTime": "2026-07-01T00:00:00.000Z",
            "description": "Test trigger",
            "essentialsVersion": "1.0",
            "alertContextVersion": "1.0"
        },
        "alertContext": {}
    }
  }'

Send the same payload with monitorCondition set to Resolved too. That's your negative test, confirming the condition correctly does nothing when it should do nothing. A kill switch that fires when it shouldn't is its own kind of incident.
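If you'd rather not hand-edit the JSON between the two runs, a small generator keeps the payloads identical apart from the one field under test. This is a sketch that reuses the placeholder values from the curl example; `make_test_payload` and `TEST_CASES` are made-up names, and the output is meant to be saved to a file or passed to curl's `-d` option.

```python
import json

# (monitorCondition value, whether the capping actions should run)
TEST_CASES = [("Fired", True), ("Resolved", False)]


def make_test_payload(monitor_condition: str) -> str:
    """Return a common-alert-schema-shaped test payload as a JSON string."""
    payload = {
        "schemaId": "azureMonitorCommonAlertSchema",
        "data": {
            "essentials": {
                "alertId": "test-alert-id",
                "alertRule": "test-budget-alert",
                "severity": "Sev3",
                "signalType": "Metric",
                "monitorCondition": monitor_condition,
                "monitoringService": "Platform",
                "alertTargetIDs": [],
                "originAlertId": "test-origin",
                "firedDateTime": "2026-07-01T00:00:00.000Z",
                "description": "Test trigger",
                "essentialsVersion": "1.0",
                "alertContextVersion": "1.0",
            },
            "alertContext": {},
        },
    }
    return json.dumps(payload, indent=2)
```

Looping over `TEST_CASES` gives you both the positive and the negative payload, along with the outcome to look for in the run history for each.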

The Test That Actually Matters

A manual curl against the trigger URL proves the Logic App works. It does not prove the Action Group is correctly wired to it, and the second bug above is exactly why that distinction matters. The manual curl passed every single time, because it used a URL copied fresh from the Designer, which always reflects the current trigger name. The Action Group's stored callback URL was the one still pointing at the old name. No amount of manual curl testing would ever have caught that, because the manual test never touches the Action Group's copy of the URL at all.

The only way to close that gap is a real end-to-end test: let an actual alert fire the actual Action Group. For a budget alert, that means temporarily lowering the threshold below current spend, waiting for the budget's evaluation cycle (it isn't instant: budget alert evaluation happens a few times a day rather than in real time), and confirming the Logic App's run history shows a run you didn't personally initiate. Revert the threshold immediately afterward, and roll back any workspace caps the run applied.

Capture Your Baseline Before You Ever Fire This

One easy mistake with any "emergency remediation" automation: building the intervention without first recording what normal looks like. If your kill switch's job is to set dailyQuotaGb to a near-zero value, you need to know what it was before, on every workspace it touches, before you ever trigger it for real. Otherwise your rollback is a guess, and "set everything back to unlimited" is not automatically correct. A workspace might have had a deliberate cap in place for its own cost-control reasons, and blindly reverting to unlimited silently removes a control that had nothing to do with this incident.

Capture it with a plain read against the workspace resource:

az rest --method GET \
  --url "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>?api-version=2023-09-01" \
  --query "properties.workspaceCapping"

Record the result somewhere durable before touching anything, not somewhere you'll forget to check when you actually need it during rollback.
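Once the baseline is recorded, the rollback bodies can be derived from it mechanically instead of typed from memory during an incident. The sketch below assumes the baseline is a mapping of workspace name to the `workspaceCapping` object captured by the read above, and that only `dailyQuotaGb` needs writing back (my understanding is that a value of -1 means no cap, so confirm that against your own captured output). `rollback_bodies` is a hypothetical helper and makes no Azure calls.

```python
from typing import Dict


def rollback_bodies(baseline: Dict[str, dict]) -> Dict[str, dict]:
    """Build a PATCH body per workspace that restores its recorded quota.

    Only dailyQuotaGb is copied across; any other fields in the captured
    object are treated as status information and left out of the request.
    """
    bodies: Dict[str, dict] = {}
    for workspace, capping in baseline.items():
        if "dailyQuotaGb" not in capping:
            # Refuse to guess: a missing value means the baseline is incomplete.
            raise ValueError(f"no recorded dailyQuotaGb for {workspace}")
        bodies[workspace] = {
            "properties": {
                "workspaceCapping": {"dailyQuotaGb": capping["dailyQuotaGb"]}
            }
        }
    return bodies
```

A workspace recorded with a deliberate 5 GB cap gets 5 back, not "unlimited", and a workspace with no recorded value raises an error instead of being quietly reset to a guess.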

Understand What You're Actually Blunting

A daily ingestion cap is not a scalpel. It stops every table on a workspace. If Microsoft Sentinel or Defender for Cloud writes to that workspace, you've just created a security-monitoring blind spot at exactly the moment something unusual is happening on your cost curve, which is sometimes the same moment something unusual is happening for other reasons entirely. Before wiring this into production, know which workspaces carry security-relevant data, and decide whether those specific ones deserve a different, more targeted response than a blanket cap. Disabling a specific Data Collection Rule, rather than capping the whole workspace, is the gentler option when you can identify the actual noisy source ahead of time.

None of this is really about Log Analytics specifically. It's about what it means to automate a response to a cost problem. The automation needs a rollback plan that matches the actual prior state, not a guess. It needs a test that proves the whole chain works, not just the piece that's easy to curl. And it needs an honest account of what else it touches when it fires.

Before you wire an automated response to any cost alert, ask:

Do you know what "normal" looked like on every resource this touches, recorded somewhere, before you ever trigger it for real?
Have you tested the full chain, alert to action group to automation, not just the automation in isolation?
What else shares this resource, and what happens to it when your circuit breaker trips?
Who finds out when this fires, and how fast?

Get those right, and a budget alert stops being just an email nobody reads until the invoice arrives. It becomes something that actually protects you, on the day you needed it to.

DEV Community