<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Infraforge</title>
    <description>The latest articles on DEV Community by Infraforge (@infraforge).</description>
    <link>https://dev.to/infraforge</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F13346%2Fde839a31-485a-47dd-92b8-2425002f861b.png</url>
      <title>DEV Community: Infraforge</title>
      <link>https://dev.to/infraforge</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/infraforge"/>
    <language>en</language>
    <item>
      <title>ArgoCD CVE-2022-24348: a Secret leak that hid in log volume</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Tue, 02 Jun 2026 02:05:12 +0000</pubDate>
      <link>https://dev.to/infraforge/argocd-cve-2022-24348-a-secret-leak-that-hid-in-log-volume-46ee</link>
      <guid>https://dev.to/infraforge/argocd-cve-2022-24348-a-secret-leak-that-hid-in-log-volume-46ee</guid>
      <description>&lt;p&gt;The first thing we saw in Loki was a fanout service log line that contained the string 'a2V5Y2xvYWstY2xpZW50' repeated about 40 times in a single minute. Base64 decode: 'keycloak-client'. The fanout service had no business reading anything from the keycloak namespace. It had been emitting fragments of another namespace's client-secret for three days, quietly, while Grafana OnCall sat on a low-priority log volume alert that nobody clicked. The vector turned out to be CVE-2022-24348, the ArgoCD directory traversal bug, riding in on a ConfigMap key that an automation script had applied to the cluster without a Git commit or a review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A low-priority alert on log volume spikes that nobody investigated for days&lt;/li&gt;
&lt;li&gt;ConfigMap keys with URL values that contain '../' segments&lt;/li&gt;
&lt;li&gt;Application logs containing base64 strings that decode to credential-shaped prefixes&lt;/li&gt;
&lt;li&gt;ArgoCD Application source.repoURL values that point outside the expected repo root&lt;/li&gt;
&lt;li&gt;ConfigMap changes in the cluster that have no matching Git commit&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why a fanout service was emitting Keycloak credential fragments
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The log line that should not have existed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An on-call engineer was triaging an unrelated paging storm and, out of habit, ran a Loki query against the noisiest service of the previous week. The fanout service had spiked from roughly 200 log lines per minute to 6,400 per minute three days earlier and had stayed there. The lines looked like garbage. They were not garbage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;app=&lt;/span&gt;&lt;span class="s2"&gt;"fanout-service"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;line_format&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{.message}}"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;__error__=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Sample&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;line&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(sanitized):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;level=info&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;msg=&lt;/span&gt;&lt;span class="s2"&gt;"resolved source repo"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;repo=&lt;/span&gt;&lt;span class="s2"&gt;"a2V5Y2xvYWstY2xpZW50LXNlY3JldDovL2NsaWVudC1pZD1ibGVhdGVyLWFwaQ=="&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;component=helm-renderer&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The base64 in the repo field decodes to 'keycloak-client-secret://client-id=bleater-api'. The fanout service was logging the resolved value of a config key that should never have resolved to a Secret.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We pulled the live ConfigMap. The offending key was named ARGOCD_APP_SOURCE_REPO_URL and its value was 'gitea.internal/platform/fanout/../keycloak-secrets'. That single '../' segment is the entire CVE-2022-24348 exposure. The ArgoCD Helm renderer, in vulnerable versions, would normalize the path after resolving it, walk out of the intended repo root, and read whatever Helm values or Secret references it found in the sibling directory. In this case the sibling directory was a Helm chart that templated the Keycloak client-secret Secret into its values. The fanout service's own application code, which logged its resolved configuration on startup and on every reconcile, then dumped fragments of that Secret into Loki as base64.&lt;/p&gt;
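&lt;p&gt;To make the traversal concrete, here is a minimal Python sketch of what normalizing that value does. It is an illustration only, not ArgoCD's code: the function name and the choice of 'gitea.internal/platform/fanout' as the intended root are our assumptions for the example.&lt;/p&gt;

```python
import posixpath


def resolves_outside_root(repo_root, candidate):
    """Return True when the normalized candidate path leaves repo_root."""
    root = posixpath.normpath(repo_root)
    resolved = posixpath.normpath(candidate)
    # Inside the root means the path equals the root or sits below "root/".
    inside = resolved == root or resolved.startswith(root + "/")
    return not inside


# Assumed root for the example; the value is the one from the ConfigMap.
root = "gitea.internal/platform/fanout"
print(resolves_outside_root(root, "gitea.internal/platform/fanout/../keycloak-secrets"))  # True
print(resolves_outside_root(root, "gitea.internal/platform/fanout/charts/app"))  # False
```

&lt;p&gt;After normalization the first value becomes 'gitea.internal/platform/keycloak-secrets', a sibling of the fanout directory, which is why a check has to run on the normalized path rather than on the raw string prefix.&lt;/p&gt;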

&lt;p&gt;Three days. The fanout service itself was healthy the entire time. RabbitMQ consumers were running, distribution was working, the SLO board was green. The exposure was completely silent from a functional standpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Log volume alerts without log content are noise generators
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Why the alert sat for three days&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Grafana OnCall alert that fired three days earlier said, in effect, 'fanout-service log volume is 30x baseline'. It was tagged P3 and routed to a Slack channel that the team treats as a digest. The runbook attached to the alert said to check for retry loops. The on-call engineer at the time did check, saw no retries in the RabbitMQ metrics, and silenced the alert for 24 hours. The silence got renewed twice by the rotation handoff.&lt;/p&gt;

&lt;p&gt;This is the part of the story we keep seeing across client engagements. A log volume alert that does not inspect log content tells you something changed, not what changed. If the alert had matched on the byte pattern of base64 strings longer than 32 characters in a service that does not normally emit base64, the page would have been P1 and would have gone to a human within minutes. Volume alone is not a signal anyone can act on in under an hour, so it gets silenced.&lt;/p&gt;
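&lt;p&gt;As a rough sketch of what content inspection means here, the Python below scans one log line for base64 runs of 32 characters or more and keeps only the ones that decode to printable text. The regex, the function name, and the printable-ASCII filter are our assumptions for illustration; a production rule would live in the log pipeline, not in a script.&lt;/p&gt;

```python
import base64
import binascii
import re

# Runs of 32 or more base64 alphabet characters, with optional padding.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{32,}={0,2}")


def decoded_base64_runs(line):
    """Return the printable ASCII decodings of long base64 runs in one log line."""
    found = []
    for match in BASE64_RUN.finditer(line):
        token = match.group(0)
        try:
            raw = base64.b64decode(token, validate=True)
            text = raw.decode("ascii")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded
        if text.isprintable():
            found.append(text)
    return found


sample = (
    'level=info msg="resolved source repo" '
    'repo="a2V5Y2xvYWstY2xpZW50LXNlY3JldDovL2NsaWVudC1pZD1ibGVhdGVyLWFwaQ==" '
    "component=helm-renderer"
)
print(decoded_base64_runs(sample))
# ['keycloak-client-secret://client-id=bleater-api']
```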

&lt;p&gt;We have written more on this in our &lt;a href="https://infraforge.agency/argocd-gitops-recovery/" rel="noopener noreferrer"&gt;GitOps and ArgoCD recovery cluster&lt;/a&gt;, where the same pattern shows up under different vectors. The constant is that GitOps systems concentrate trust in the manifest pipeline, and any leak in that pipeline tends to surface first as 'weird logs' before it surfaces as anything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pod restart that the ConfigMap patch alone does not give you
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Patching the ConfigMap was not the fix&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The instinct, once we identified the bad key, was to kubectl edit configmap and delete the line. We did not do that, for two reasons. First, the ConfigMap was managed by ArgoCD; a live edit would last until the next sync. Second, even after the ConfigMap was clean, the existing pods would still have the malicious URL in their environment because envFrom only resolves at pod start. The leak would continue until the pods were rolled.&lt;/p&gt;

&lt;p&gt;The correct sequence had four steps and the order mattered.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Commit the fix to Git first&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We removed the ARGOCD_APP_SOURCE_REPO_URL key from the ConfigMap manifest in the platform repo and opened a PR. ArgoCD was the source of truth, so any cluster-side edit would be reverted. The PR also added a comment explaining the CVE so the next person reading the repo would understand the deletion.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Sync ArgoCD with prune disabled&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We forced the sync immediately rather than waiting for the next polling interval. We left prune disabled because we wanted to confirm exactly one diff: the deletion of the bad key. Surprise prunes during a security remediation are how secondary incidents start.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Roll the pods explicitly&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;kubectl rollout restart deployment/fanout-service. The ConfigMap was clean but the pod environments still held the resolved value. Until the pods restarted, every reconcile loop in the running process kept logging the leaked fragment. The rollout took 90 seconds.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Verify in Loki before declaring done&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;We ran the same Loki query that found the leak, scoped to the time window after the rollout completed. Zero matches. Then we ran it across a 30-minute window to be sure we were not just hitting log buffer lag. Still zero. That was the moment we stopped holding our breath.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify the ConfigMap is clean&lt;/span&gt;
kubectl get configmap fanout-service-config &lt;span class="nt"&gt;-n&lt;/span&gt; bleater &lt;span class="nt"&gt;-o&lt;/span&gt; json &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="s1"&gt;'.data | keys[]'&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; repo_url
&lt;span class="c"&gt;# (should return nothing)&lt;/span&gt;

&lt;span class="c"&gt;# Verify pods restarted after the ConfigMap fix timestamp&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; bleater &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;fanout-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'&lt;/span&gt;

&lt;span class="c"&gt;# Verify the ArgoCD Application source has no traversal&lt;/span&gt;
kubectl get application fanout-service &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.spec.source.repoURL}'&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s1"&gt;'../'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'STILL VULNERABLE'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'clean'&lt;/span&gt;

&lt;span class="c"&gt;# Loki check, post-restart window only&lt;/span&gt;
logcli query &lt;span class="nt"&gt;--since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;15m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'{app="fanout-service"} |~ "a2V5Y2xvYWs|keycloak"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The four checks we ran in order. A key listed by the first, a pod start time earlier than the fix, 'STILL VULNERABLE' from the third, or any match from the fourth would have meant the remediation was not complete.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Treating the leak as breached until proven otherwise
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Auditing whether the Secret was actually read&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The harder question was not 'did we stop the leak' but 'did anyone read the leaked data while it was leaking'. The fragments were in Loki, which meant anyone with Loki read access to the bleater namespace logs could have seen them. We pulled the Loki audit log for the three-day window and listed every query that matched fanout-service logs. Twelve queries from four engineers, all of them looking at unrelated debugging work, none of them filtering on the byte pattern that would have exposed the credential.&lt;/p&gt;

&lt;p&gt;That was reassuring but not sufficient. The Keycloak client-secret had to be treated as compromised regardless, because we could not prove the absence of external log exfiltration with high confidence. We rotated the client-secret, redeployed the services that used it, and audited Keycloak's own access log for any token issuance using the old secret from an unexpected source IP in the exposure window. We found none. The rotation took about 25 minutes including service redeployment.&lt;/p&gt;

&lt;p&gt;We then went back to the original question that nobody had asked yet: how did the malicious ConfigMap key get there in the first place. The automation script that applied it was a 'config sync' job that pulled key-value pairs from a shared spreadsheet and wrote them into the ConfigMap. The spreadsheet was editable by a wider group than the cluster was. Somebody had added the URL three days earlier, probably as a copy-paste mistake from a different document, and the sync job had faithfully applied it. There was no Git commit, no PR, no review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three controls that close this class of failure
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we changed so it cannot happen quietly again&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We made three changes after this incident, in priority order.&lt;/p&gt;

&lt;p&gt;The first was an admission webhook that rejects any ConfigMap apply where a string value contains '../' or matches the shape of a URL pointing outside an allowlist of internal domains. The rule is 12 lines of OPA policy. We tested it against six months of historical ConfigMap diffs and it would have caught this exact incident on day zero. It also catches the more common case of someone pasting a localhost URL into a shared config.&lt;/p&gt;
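&lt;p&gt;The real rule is OPA policy, which we are not reproducing here. The Python sketch below only illustrates the two checks it performs; the allowlist contents, the function name, and the sample keys other than ARGOCD_APP_SOURCE_REPO_URL are hypothetical.&lt;/p&gt;

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; the real policy's domains are not listed in this post.
ALLOWED_HOSTS = {"gitea.internal"}


def configmap_violations(data, allowed_hosts=ALLOWED_HOSTS):
    """Return (key, reason) pairs for ConfigMap values the rule would reject."""
    violations = []
    for key, value in sorted(data.items()):
        if "../" in value:
            violations.append((key, "path traversal segment"))
            continue
        parts = urlsplit(value)
        # Only values that parse as scheme://host are treated as URLs.
        if parts.scheme and parts.hostname and parts.hostname not in allowed_hosts:
            violations.append((key, "host outside allowlist"))
    return violations


data = {
    "ARGOCD_APP_SOURCE_REPO_URL": "gitea.internal/platform/fanout/../keycloak-secrets",
    "CALLBACK_URL": "http://localhost:8080/hook",  # hypothetical key
    "LOG_LEVEL": "info",  # hypothetical key
}
print(configmap_violations(data))
# [('ARGOCD_APP_SOURCE_REPO_URL', 'path traversal segment'), ('CALLBACK_URL', 'host outside allowlist')]
```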

&lt;p&gt;The second was retiring the spreadsheet-driven sync job. Every ConfigMap that lands in the cluster now comes from a Git commit, has a commit SHA annotation, and fails admission if the annotation is missing or does not match a real commit in the repo. The work to migrate the existing key-value pairs took about a week. The job is gone and is not coming back.&lt;/p&gt;

&lt;p&gt;The third was rewriting the log volume alert. The new version fires when fanout-service log lines contain base64-encoded strings longer than 24 characters at a rate above 5 per minute, scoped to services that do not normally emit base64. It is a Loki alerting rule with a regex match and it pages a human at P1. In its first week it raised two false positives (both were legitimate JWT logging that we then removed) and zero real incidents. We consider that a healthy signal-to-noise ratio for a security alert.&lt;/p&gt;

&lt;p&gt;We also upgraded ArgoCD past the CVE-2022-24348 fix line. That should have happened a year earlier. The upstream advisory lists 2.1.9, 2.2.4, and 2.3.0 as the patched releases. If you are running anything older than those, stop reading this and go check, because the same vector is sitting in your cluster waiting for an unlucky ConfigMap edit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxlkLFuwkAQRHu-YjoqE0FQSiJj4yaJYjnpHBfLebEvnG8t3wFCOP8enQsaqi3eG2l2DkYuqqXB4zudAXGZiD3o5oN6UN-ba4Uo2mB7i-tOO6fF4sL7VuT4NwO2AY5nMidGSw7zxeJpPqJYlgX_svLV3bGCRvtISddpH7mWQNaKJ6_FjihWD4GgWPHQFmSMXLjGwL2MKJ4fXDIGqmV1dOjJuRFJGYfu8AJlTs7zEORkeiUt46GRJMXASqzShgNLJ7Yrc6nB9oyzpnCzQbqAdxPObu_SQIn1bD068qr9sXty_LLGBqs1wpDuNSyTTcWu7EbkyzKnhiE2UqFpvqzugpURn2_ll2eqr3CePFf_bcOAxA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxlkLFuwkAQRHu-YjoqE0FQSiJj4yaJYjnpHBfLebEvnG8t3wFCOP8enQsaqi3eG2l2DkYuqqXB4zudAXGZiD3o5oN6UN-ba4Uo2mB7i-tOO6fF4sL7VuT4NwO2AY5nMidGSw7zxeJpPqJYlgX_svLV3bGCRvtISddpH7mWQNaKJ6_FjihWD4GgWPHQFmSMXLjGwL2MKJ4fXDIGqmV1dOjJuRFJGYfu8AJlTs7zEORkeiUt46GRJMXASqzShgNLJ7Yrc6nB9oyzpnCzQbqAdxPObu_SQIn1bD068qr9sXty_LLGBqs1wpDuNSyTTcWu7EbkyzKnhiE2UqFpvqzugpURn2_ll2eqr3CePFf_bcOAxA" alt="The control surface after the incident. The two new gates are the admission webhook on apply and the content-aware log alert at runtime." width="809" height="1135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The control surface after the incident. The two new gates are the admission webhook on apply and the content-aware log alert at runtime.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When a CVE has been silently active in your cluster for days
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If you are staring at a similar exposure window&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hard part of this kind of incident is not the patch. The patch is one line. The hard part is reconstructing the exposure window with enough confidence to know what to rotate, what to disclose, and what to audit. That work requires log retention you can query precisely, audit trails for the systems that read those logs, and the discipline to treat any leaked credential as compromised until the access logs say otherwise. Teams that have not rehearsed this work tend to do all three poorly the first time.&lt;/p&gt;

&lt;p&gt;We run GitOps and ArgoCD recovery engagements every month. We have seen the path traversal CVE three times in the last year, all on clusters running ArgoCD versions the operators thought were current, and we have seen the same 'silent ConfigMap injection via shared spreadsheet' antipattern more often than that. The remediation pattern is the same. The audit pattern is the same. The controls that close the gap are the same.&lt;/p&gt;

&lt;p&gt;If you suspect a similar exposure in your cluster right now, &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;request an infrastructure review&lt;/a&gt; and we will start with a 30-minute diagnostic call this week to scope the audit window and the rotation list. If the exposure is active, we will be on a bridge with your team the same day.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/argocd-cve-2022-24348-path-traversal-secret-leak-recovery/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/argocd-cve-2022-24348-path-traversal-secret-leak-recovery/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gitops</category>
      <category>recovery</category>
      <category>gitopsargocd</category>
    </item>
    <item>
      <title>Why Grafana OnCall acknowledgments hang after a Helm upgrade migration</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Mon, 01 Jun 2026 02:59:51 +0000</pubDate>
      <link>https://dev.to/infraforge/why-grafana-oncall-acknowledgments-hang-after-a-helm-upgrade-migration-3khh</link>
      <guid>https://dev.to/infraforge/why-grafana-oncall-acknowledgments-hang-after-a-helm-upgrade-migration-3khh</guid>
      <description>&lt;p&gt;The call did not come from our on-call rotation. It came from a customer who noticed two unrelated degradations on their side and asked why we had not paged. We had not paged because Grafana OnCall had been silently swallowing alerts for roughly 72 hours. Every new firing alert was being deduplicated into the same zombie incident, and every attempt to acknowledge or resolve that incident returned HTTP 500. The on-call engineer who first tried to clear it that morning had assumed the spinner was a UI bug and moved on. The thing meant to wake us up was the thing that was broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OnCall UI Acknowledge and Resolve buttons spin and time out with a generic 500&lt;/li&gt;
&lt;li&gt;New alerts from real degradations get deduplicated into an incident that cannot be cleared&lt;/li&gt;
&lt;li&gt;OnCall pod logs show ORM errors referencing a column that does not exist in the table&lt;/li&gt;
&lt;li&gt;The Helm post-upgrade migration job reported success but Postgres logs show a lock_timeout on one ALTER TABLE&lt;/li&gt;
&lt;li&gt;There is no Prometheus alert on OnCall's own API error rate, so the regression went undetected&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  72 hours of swallowed alerts and one zombie incident absorbing all of them
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The alerting platform was the incident&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When we got on the bridge, OnCall's incident list looked almost healthy. Two incidents in firing state, both from three days earlier, both with zero acknowledgment events. That should have been impossible. The on-call rotation had been live the whole time, and the runbook said any firing incident over 15 minutes old gets escalated. Nothing had been escalated because nothing new had appeared. Every alert fired by Prometheus Alertmanager in those 72 hours had been deduplicated by labels and folded into one of those two zombies.&lt;/p&gt;

&lt;p&gt;The first thing we tried was the obvious one. Click Acknowledge in the UI. The spinner ran for about 20 seconds and the page returned a 500. Same for Resolve. Same for Snooze. Same when we called the API directly with curl. The web pods were up, the database was reachable, Redis was fine. Nothing in any dashboard suggested a problem, because nobody had built a dashboard that watched OnCall itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s -X POST -H "Authorization: Bearer $TOKEN" \
    https://oncall.internal/api/v1/alert_groups/I8KZ.../acknowledge/
{"detail": "Internal server error"}

# from the oncall-engine pod
$ kubectl logs deploy/oncall-engine -c engine --tail=50 | grep -A2 ERROR
DatabaseError: column alerts_alertgroup.acknowledged_by_confirmation_phone does not exist
LINE 1: ...ledged_by_user_id", "alerts_alertgroup"."acknowledged_by_co...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The ORM was reaching for a column the table did not have.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A silent ALTER TABLE timeout the Helm hook never noticed
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Why the migration job exited 0 with a half-finished schema&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our first guess was a bad release. The previous Helm upgrade had bumped OnCall by a minor version, and we assumed the new application code was looking at a field that genuinely had not shipped yet. That was wrong. The release notes said the column had been added in this version, and django_migrations on the OnCall database said the migration had been applied. Both things were true, and the column was still not there.&lt;/p&gt;

&lt;p&gt;The clue was in Postgres logs from three days earlier, exactly when the Helm post-upgrade hook ran the migration job. One line, easy to miss, in the middle of dozens of normal statement logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2024-XX-XX 02:14:07 UTC ERROR:  canceling statement due to lock timeout
2024-XX-XX 02:14:07 UTC STATEMENT:  ALTER TABLE alerts_alertgroup
    ADD COLUMN acknowledged_by_confirmation_phone varchar(20) NULL;
2024-XX-XX 02:14:07 UTC LOG:  duration: 30001.114 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;alerts_alertgroup is one of the highest-write tables in OnCall. At 02:14 a backlog of open insert transactions held locks on the table, the ALTER could not get the exclusive table lock it needs within the lock_timeout we had set globally to 30 seconds (a sensible default we put in years ago to stop one bad migration from wedging the whole database), and Postgres canceled the statement. The migration script caught the exception, logged it to stderr, moved on to the next statement, and finished. The Helm hook checked the job's exit code, saw 0, and marked the release Succeeded. ArgoCD synced. The new pods rolled. And from that moment, every code path that touched the new column returned 500.&lt;/p&gt;
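&lt;p&gt;The failure mode is easy to reproduce in miniature. The sketch below is not the migration script from this incident; it is a hypothetical runner, with an assumed execute callable, that shows the behavior we wanted instead: log each failed statement but still return a non-zero exit code.&lt;/p&gt;

```python
def run_statements(statements, execute):
    """Run every statement, collect failures, and return a process exit code."""
    failures = []
    for sql in statements:
        try:
            execute(sql)
        except Exception as exc:  # record the failure instead of dropping it
            failures.append((sql, str(exc)))
    for sql, error in failures:
        print("migration statement failed:", sql, "|", error)
    # A non-zero code is what lets a Job, and the Helm hook watching it, fail.
    return 1 if failures else 0


def fake_execute(sql):
    """Stand-in for a database cursor; fails the way the ALTER did."""
    if sql.startswith("ALTER"):
        raise RuntimeError("canceling statement due to lock timeout")


exit_code = run_statements(
    ["SELECT 1", "ALTER TABLE alerts_alertgroup ADD COLUMN example varchar(20) NULL"],
    fake_execute,
)
print(exit_code)  # 1
```

&lt;p&gt;In a real job the return value would be passed to sys.exit so the container's exit code reflects it.&lt;/p&gt;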

&lt;p&gt;The migration was also blocked from completing on a retry because the previous attempt had left a trigger in place on alerts_alertgroup, which we only found by checking pg_trigger directly. Without dropping that trigger first, re-running the migration would have hit the same lock window and failed the same way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJw9kN1OAjEQhe95ivMAEEmMtyb8rIBBY4S7DSGlPS6VbrtpZ_lRfHezxXg1ycx3zpyZDxdOeq-iYD3tAaNyTlejbaqoDDcYDB4xLt9CksFfD7WtohIbPD7DbtMDxpmafI-W6-Id69F4WUA5RknbXKoY2uanB0w68OqCPmzF1gyt4H6YrpiWK1HCml6gldd0jqZznmbnolzpaJtuJnrPBJ41my5BHy5UCRKQxDDGTlNkzVP58h8z3dQ8W0kYdsxTZma3W2sVDwmRjioRq1Zr0tz2zzI3L195QhNMgmHjwoWmD54baoHnCTq4tvYdP8_8oiyOjBcofbiLTMEdiUhpo094GOYAiww-l6P8JhiatnFWK6GB9RLwFeqdJazX1tBL2vwC85qMdw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJw9kN1OAjEQhe95ivMAEEmMtyb8rIBBY4S7DSGlPS6VbrtpZ_lRfHezxXg1ycx3zpyZDxdOeq-iYD3tAaNyTlejbaqoDDcYDB4xLt9CksFfD7WtohIbPD7DbtMDxpmafI-W6-Id69F4WUA5RknbXKoY2uanB0w68OqCPmzF1gyt4H6YrpiWK1HCml6gldd0jqZznmbnolzpaJtuJnrPBJ41my5BHy5UCRKQxDDGTlNkzVP58h8z3dQ8W0kYdsxTZma3W2sVDwmRjioRq1Zr0tz2zzI3L195QhNMgmHjwoWmD54baoHnCTq4tvYdP8_8oiyOjBcofbiLTMEdiUhpo094GOYAiww-l6P8JhiatnFWK6GB9RLwFeqdJazX1tBL2vwC85qMdw" alt="diagram" width="294" height="1374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we forward-fixed instead of rolling the Helm release back
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Drop the trigger, add the column, then unstick the zombies&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We considered rolling back to the previous OnCall version. It looked clean on paper: the old image did not need the missing column, so the schema would match again and acks would work. We talked ourselves out of it for two reasons. First, the new pods had been running for three days and had written data shaped for the new version, including new fields in adjacent tables. A rollback would have meant either accepting writes that the old code did not understand or restoring a 72-hour-old database snapshot, which would erase three days of incident history including the zombies we wanted to clean up. Second, the next upgrade would just hit the same lock_timeout the same way. We would be back here in a week.&lt;/p&gt;

&lt;p&gt;Forward-fix it was. The sequence had to be careful, because the table was still taking writes and we were going to ALTER it. We picked a low-write window, paused Celery workers that wrote to alerts_alertgroup (not the web tier, which we wanted up so the API stayed responsive), and ran the work inside one transaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- 1. confirm the column is genuinely missing
SELECT column_name FROM information_schema.columns
WHERE table_name = 'alerts_alertgroup'
  AND column_name = 'acknowledged_by_confirmation_phone';
-- (0 rows)

-- 2. find the blocking trigger left over from the failed attempt
SELECT tgname FROM pg_trigger
WHERE tgrelid = 'alerts_alertgroup'::regclass
  AND tgname LIKE 'pgtrigger_%';

-- 3. drop it inside the same transaction we ALTER in
BEGIN;
SET LOCAL lock_timeout = '5min';
DROP TRIGGER IF EXISTS pgtrigger_oncall_protect_finished
  ON alerts_alertgroup;
ALTER TABLE alerts_alertgroup
  ADD COLUMN acknowledged_by_confirmation_phone varchar(20) NULL;
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Raise lock_timeout for this transaction only; do not touch the global.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We did not change the global lock_timeout. Setting it LOCAL inside the transaction lets this one ALTER wait up to five minutes, and any other migration that runs in normal conditions still gets the 30-second guard. Once the column existed, we unpaused the Celery workers and watched the engine pod logs. The 500s stopped within seconds.&lt;/p&gt;

&lt;p&gt;That left the zombies. Acknowledging them was not enough. An acknowledged incident still sits in the firing state from OnCall's deduplication perspective, so new alerts would still fold into it. We had to mark them resolved. We did it through the API first to make sure the lifecycle hooks fired and downstream integrations got the resolved webhook, and only fell back to a direct UPDATE for two records that the API still refused for an unrelated reason (their integration had been deleted, so the API could not look up the routing). For those, we set resolved=TRUE and resolved_at to the current timestamp in the database directly, with a note in the incident's raw payload explaining the manual close.&lt;/p&gt;

&lt;p&gt;We then fired a synthetic alert from Alertmanager, watched a new incident appear, acknowledged it from the UI in under two seconds, resolved it, and confirmed that a follow-up alert created a fresh incident instead of folding into the resolved one. That was the real all-clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meta-monitoring for the platform that does the monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we wired up so the next silent migration trips an alarm&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The thing that kept us up afterward was not the migration. Migrations fail. Database locks happen. The thing that kept us up was that OnCall had been broken for three days and not one signal in our monitoring stack had told us. We had alerts on Prometheus being down, on Alertmanager being down, on Grafana being down, on every customer-facing service. We had nothing watching the incident management platform itself.&lt;/p&gt;

&lt;p&gt;We added two rules the same week. The first is a straight error-rate alert on OnCall's API. If more than 1% of requests to /api/v1/ return 5xx for five minutes, page the platform team at critical severity. Five minutes is short enough that a real outage gets caught but long enough that a brief burst of errors during a rolling deploy does not page. We picked critical because if OnCall is degraded, no other page matters; alerts get swallowed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
- name: oncall-meta
  rules:
  - alert: OncallApiErrorRateHigh
    expr: |
      sum(rate(django_http_responses_total_by_status_total{job="oncall",status=~"5.."}[5m]))
      /
      sum(rate(django_http_responses_total_by_status_total{job="oncall"}[5m]))
      &amp;gt; 0.01
    for: 5m
    labels:
      severity: critical
      service: oncall
    annotations:
      summary: "OnCall API returning &amp;gt;1% 5xx for 5m"
      runbook: "https://internal/runbooks/oncall-api-errors"

  - alert: OncallMigrationJobStderr
    expr: |
      sum(increase(kube_job_status_failed{namespace="oncall"}[10m])) &amp;gt; 0
      or
      sum(increase(log_messages_total{namespace="oncall",app="migration",level="ERROR"}[10m])) &amp;gt; 0
    for: 1m
    labels:
      severity: critical
      service: oncall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The second rule catches a migration job that logs errors even if its exit code is 0.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The second rule is the lesson from this specific incident. Helm trusts the exit code. Django migrations swallow individual statement errors and continue. The only place the truth lives is in the job's log stream. We now alert on ERROR-level log lines from any pod with the migration label in the oncall namespace, regardless of whether the job reported success. We have caught two real issues with this rule in the months since (neither as bad as this one, both worth knowing about within minutes instead of days).&lt;/p&gt;
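&lt;p&gt;The idea behind the second rule can be sketched in a few lines of Python. This is a minimal illustration, not the check we run: the function name, the log lines, and the assumption that the level appears as a whole word are all made up for the example.&lt;/p&gt;

```python
import re

# Assumption: the log level appears in each line as a whole word, e.g. "ERROR".
ERROR_PATTERN = re.compile(r"\bERROR\b")


def migration_needs_alert(exit_code: int, log_lines: list[str]) -> bool:
    """Return True when the job failed outright or logged any ERROR line."""
    if exit_code != 0:
        return True
    # Exit code 0 is not trusted on its own: scan the log stream as well.
    return any(ERROR_PATTERN.search(line) for line in log_lines)


# Made-up log lines: the job exits 0 but one statement failed along the way.
logs = [
    "INFO Applying migration 0042",
    "ERROR lock timeout on ALTER TABLE",
    "INFO Migrations complete",
]
print(migration_needs_alert(0, logs))
```

&lt;p&gt;The point of the sketch is the ordering: the exit code can only make the result worse, never better.&lt;/p&gt;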

&lt;p&gt;The broader pattern, and one we now apply on every recovery engagement we run, is that any tool you depend on to notice problems needs an independent way to notice when that tool itself is the problem. We have written more about this category of failure in &lt;a href="https://infraforge.agency/migrations/" rel="noopener noreferrer"&gt;our migration recovery work&lt;/a&gt;, because the same shape appears in database cutovers, queue platform upgrades, and identity provider migrations: the system you rely on to tell you the truth is the system that has stopped telling the truth, and you only find out from a customer.&lt;/p&gt;

&lt;h2&gt;
  
  
  When acks are silently 500ing and you cannot tell what data is real
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If your OnCall is doing this right now&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hard part of this incident is not the SQL. The hard part is making the call between forward-fix and rollback when your incident history, your zombie state, and your live alert routing are all entangled in a database that is currently being written to by application code that expects a schema it does not have. Roll back without a plan and you lose three days of incident records. Forward-fix without checking for leftover triggers and migration locks and your second attempt fails the same way as the first. Run an ALTER on a hot table during business hours and you find out what your application's actual timeout tolerance is.&lt;/p&gt;

&lt;p&gt;We do these engagements every few weeks. A partial Django migration on Grafana OnCall is the specific case we have now seen three times this year, twice from lock_timeout and once from a custom trigger that blocked the ALTER outright. Adjacent variants we have handled: Sentry post-deploy migrations that left a column nullable when the code expected NOT NULL, Mattermost upgrades where one index creation timed out, Keycloak realm migrations that completed on the primary but failed on a replica. The pattern is identical and the recovery sequence rhymes.&lt;/p&gt;

&lt;p&gt;If your team is staring at a 500 on every ack and trying to decide whether to roll back the Helm release, &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;book an infrastructure review with our team&lt;/a&gt; and we will be on a bridge with you the same day. We will help you confirm the schema delta, plan the forward-fix or the rollback with the data implications spelled out, and clean up the zombie incidents without losing the history you need for the postmortem.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/grafana-oncall-stuck-incidents-partial-migration/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/grafana-oncall-stuck-incidents-partial-migration/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>migration</category>
      <category>recovery</category>
      <category>migrations</category>
    </item>
    <item>
      <title>Why a deleted backup Lambda kept billing 9,400 EBS snapshots</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Sat, 30 May 2026 22:26:35 +0000</pubDate>
      <link>https://dev.to/infraforge/why-a-deleted-backup-lambda-kept-billing-9400-ebs-snapshots-1hg4</link>
      <guid>https://dev.to/infraforge/why-a-deleted-backup-lambda-kept-billing-9400-ebs-snapshots-1hg4</guid>
      <description>&lt;p&gt;The EBS Snapshot line on the monthly bill was $1,830. There was no active EBS snapshot policy on the account. The backup Lambda that had produced these snapshots had been deleted thirteen months earlier, replaced by AWS Backup, and forgotten. Nobody had deleted what it created. Two volumes' worth of daily snapshots times 400 days came to 9,408 orphans sitting on 14 TB of storage, billed at the EBS Snapshot rate every month since.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EBS Snapshot line is several hundred dollars a month and no active EBS snapshot pipeline is running on the account&lt;/li&gt;
&lt;li&gt;describe-snapshots --owner-ids self returns thousands of entries when you expect dozens&lt;/li&gt;
&lt;li&gt;Sampling a few snapshot IDs shows VolumeId values that no longer resolve in describe-volumes&lt;/li&gt;
&lt;li&gt;A backup Lambda or custom snapshot script was deprecated in the last 12 to 24 months&lt;/li&gt;
&lt;li&gt;AWS Backup is the active tool and its dashboard shows normal counts, but the cost line tells a different story&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  $1,830 a month on a backup product the account no longer used
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The line item that should have been zero&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The EBS Snapshot line had been climbing slowly for thirteen months. Nobody had flagged it. The quarterly cost review surfaced it because the line item ranked sixth on the account, and the team's mental model said it should have ranked nowhere. There was no EBS snapshot policy running. AWS Backup had taken over RDS and EBS backups a year earlier, with the old Lambda plus EventBridge pipeline retired the same week.&lt;/p&gt;

&lt;p&gt;The first instinct in the room was to pull AWS Backup's plan and see if a retention window had widened. The plan was clean. Snapshot counts there were in the low dozens, exactly what the new policy specified. So the snapshots driving the bill were coming from somewhere else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws ec2 describe-snapshots --owner-ids self \
    --query 'length(Snapshots)' --output text
9408
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The number that turned a routine cost review into an incident.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That number was the moment the room got quiet. AWS Backup writes maybe forty snapshots a month on this account. Nine thousand was a different category of problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Backup was clean, so who made these 9,408 snapshots
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Ruling out the obvious suspect&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With AWS Backup ruled out and no other named pipeline running, the question became: who created these 9,408 snapshots, and is anything still creating more? We pulled the StartTime field on the most recent hundred. The newest one was thirteen months old. Whatever pipeline made them had stopped, which meant we were looking at a stable population, not a leak that was still growing. That mattered because it meant the cleanup had a known size.&lt;/p&gt;

&lt;p&gt;The next question was whether the source volumes were still around. We sampled twenty random snapshots and ran describe-volumes against their VolumeId. All twenty came back InvalidVolume.NotFound. The pattern was clear: the snapshots were referencing two specific volume IDs (the daily Lambda backed up two production EBS volumes), both of which had been deleted along with the EC2 instances they served when the application moved to a managed service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
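&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Dump every snapshot the account owns with its source volume and start time.
aws ec2 describe-snapshots --owner-ids self \
    --query 'Snapshots[*].[SnapshotId,VolumeId,StartTime]' \
    --output text &amp;gt; all-snapshots.tsv

# Check each distinct source volume once. Only an explicit
# InvalidVolume.NotFound counts as an orphan, so throttling or
# permission errors are not mistaken for a deleted volume.
awk -F'\t' '{print $2}' all-snapshots.tsv | sort -u \
  | while read -r vid; do
      err=$(aws ec2 describe-volumes --volume-ids "$vid" 2&amp;gt;&amp;amp;1 &amp;gt;/dev/null)
      case "$err" in
        *InvalidVolume.NotFound*) echo "$vid orphan" ;;
      esac
    done &amp;gt; orphan-source-volumes.txt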
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Group snapshots by their source volume, then check which source volumes still exist.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Only two volume IDs appeared in the orphan list. Two volumes, 400 days of daily snapshots each, give or take retries, gave 9,408. The arithmetic lined up. The Lambda that took them was gone, but AWS does not garbage-collect snapshots when their creator disappears. Snapshots are first-class objects with their own lifecycle, and that lifecycle is whatever you set when you create them. The Lambda set nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we sampled twenty before touching the other 9,388
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we did before running delete-snapshot in a loop&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The temptation at this point is to write a one-line loop and delete everything. The cost was real: $1,830 a month to store data that referenced infrastructure that no longer existed. But delete-snapshot is irreversible, so we held off on the loop for two reasons.&lt;/p&gt;

&lt;p&gt;First, orphan is sometimes a transient state. A volume gets deleted on Tuesday during a planned migration. On Wednesday the orphan-finder runs. A snapshot taken two hours before the volume's deletion looks orphaned but is actually the most recent backup of a service that was just migrated. Deleting it would destroy the only remaining copy of that data. We checked the StartTime on every snapshot in our sample against the deletion date of its source volume. Every one was older than the deletion by at least nine months. The cohort was uniformly historical. No active workflow could be depending on any of them.&lt;/p&gt;

&lt;p&gt;Second, we needed to be sure these snapshots were not being referenced as the base for any AMI or any live AWS Backup recovery point. We ran describe-images with a block-device-mapping.snapshot-id filter on the sample, expecting nothing, and got nothing. We checked the AWS Backup recovery point inventory. None of the orphan snapshot IDs appeared there. The deletion was safe.&lt;/p&gt;
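&lt;p&gt;The AMI half of that check can be sketched offline in Python against saved describe-images output. This is an illustration under assumptions, not the script we ran: the function names and the sample IDs are hypothetical, and the only thing taken from AWS is the response shape, where each image lists BlockDeviceMappings and EBS-backed entries carry an Ebs.SnapshotId.&lt;/p&gt;

```python
import json


def snapshots_referenced_by_images(describe_images_json: str) -> set[str]:
    """Collect every snapshot ID that backs a block device of any AMI."""
    referenced = set()
    for image in json.loads(describe_images_json).get("Images", []):
        for mapping in image.get("BlockDeviceMappings", []):
            # Instance-store mappings have no "Ebs" key, so default to {}.
            snapshot_id = mapping.get("Ebs", {}).get("SnapshotId")
            if snapshot_id:
                referenced.add(snapshot_id)
    return referenced


def unsafe_to_delete(candidates: list[str], describe_images_json: str) -> list[str]:
    """Return the candidates that an AMI still depends on, in input order."""
    referenced = snapshots_referenced_by_images(describe_images_json)
    return [sid for sid in candidates if sid in referenced]


# Made-up sample shaped like the output of describe-images.
sample = json.dumps({
    "Images": [
        {"ImageId": "ami-0example",
         "BlockDeviceMappings": [
             {"DeviceName": "/dev/xvda", "Ebs": {"SnapshotId": "snap-0aaa"}},
             {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
         ]},
    ]
})
print(unsafe_to_delete(["snap-0aaa", "snap-0bbb"], sample))
```

&lt;p&gt;An empty result is the outcome you want before deleting anything; any ID it returns needs a human decision first.&lt;/p&gt;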

&lt;p&gt;The actual delete loop took three calendar days. delete-snapshot is rate-limited at roughly 5 requests per second per account, with bursts. At 9,400 deletes, that works out to about 30 wall-clock minutes at perfect throughput, before retries on the occasional 503. We never get perfect throughput. We wrote the loop with a 250ms sleep and an append-only deleted.log that doubles as the checkpoint file, so we could resume after any interruption without retrying deletes that had already succeeded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
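&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch deleted.log   # the checkpoint file must exist before the first grep
while read -r sid; do
  # Skip anything a previous run already deleted.
  if grep -qxF "$sid" deleted.log; then continue; fi
  if aws ec2 delete-snapshot --snapshot-id "$sid"; then
    echo "$sid" &amp;gt;&amp;gt; deleted.log
  else
    echo "$sid" &amp;gt;&amp;gt; failed.log
  fi
  sleep 0.25   # pace requests to stay under the API rate limit
done &amp;lt; orphan-snapshot-ids.txt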
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Resumable, rate-limited delete loop. The checkpoint file is the load-bearing part.&lt;/em&gt;&lt;/p&gt;
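&lt;p&gt;The best-case numbers are easy to check with quick arithmetic. The sketch below uses only the figures already quoted, roughly 5 requests per second and a 250ms sleep; the function names are ours, and real runs will be slower because API call time and retries are ignored.&lt;/p&gt;

```python
def minutes_at_rate(deletes: int, requests_per_second: float) -> float:
    """Wall-clock minutes if every request succeeds at the given rate."""
    return deletes / requests_per_second / 60


def minutes_with_sleep(deletes: int, sleep_seconds: float) -> float:
    """Floor imposed by the sleep alone, ignoring API call time and retries."""
    return deletes * sleep_seconds / 60


print(round(minutes_at_rate(9400, 5), 1))        # 31.3
print(round(minutes_with_sleep(9400, 0.25), 1))  # 39.2
```

&lt;p&gt;Both figures are lower bounds, which is why the resumable design matters more than the pacing.&lt;/p&gt;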

&lt;p&gt;After three days the EBS Snapshot line on the next monthly forecast dropped to under $20. The fourteen terabytes of orphan storage was gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tag at creation, schedule the cleanup, watch the lines that should be zero
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The rule that meant the next deprecated pipeline could not do this&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The deletion fixed the symptom. The interesting part of this engagement was the cause. AWS does not couple a snapshot's lifecycle to the lifecycle of whatever process created it. A Lambda gets deleted, an EventBridge rule gets removed, the IAM role goes with them, and the snapshots they made keep existing and keep being billed, forever, until something explicitly deletes them. There is no warning email. There is no dashboard widget. The only signal is the monthly bill, and the bill takes a year to be loud enough to investigate.&lt;/p&gt;

&lt;p&gt;Two changes went in after the cleanup. The first was a tag-at-creation rule. Every snapshot the account creates now carries three tags applied at creation time: Owner (a team or service name), Retention (an ISO date past which the snapshot is safe to delete), and CreatedBy (the pipeline that made it). AWS Backup applies these automatically through its backup plan. The handful of custom Lambdas that survived the migration were rewritten to apply them. A weekly cleanup Lambda walks the account, deletes anything past its Retention date, and flags anything older than 90 days with no Retention tag. For the first 60 days the Lambda posted a Slack message and waited for a thumbs-up before deleting. After that it ran automatically.&lt;/p&gt;
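&lt;p&gt;The decision the weekly cleanup makes for each snapshot is small enough to sketch. This is a hedged illustration of the rule as described, not the Lambda's code: the function name and return values are hypothetical, and it assumes Retention holds a plain ISO date such as 2026-01-01.&lt;/p&gt;

```python
from datetime import date, timedelta

# Untagged snapshots get flagged once they are older than this.
UNTAGGED_GRACE = timedelta(days=90)


def cleanup_action(tags: dict[str, str], created: date, today: date) -> str:
    """Classify one snapshot as 'delete', 'flag', or 'keep'."""
    retention = tags.get("Retention")
    if retention is None:
        # No Retention tag: flag for a human once it is older than 90 days.
        return "flag" if today - created > UNTAGGED_GRACE else "keep"
    # Retention is an ISO date; delete only once that date has passed.
    return "delete" if today > date.fromisoformat(retention) else "keep"


today = date(2026, 5, 30)
print(cleanup_action({"Retention": "2026-01-01"}, date(2025, 6, 1), today))  # delete
print(cleanup_action({}, date(2025, 6, 1), today))                           # flag
print(cleanup_action({"Retention": "2027-01-01"}, date(2026, 5, 1), today))  # keep
```

&lt;p&gt;A malformed Retention value raises ValueError in this sketch, which is arguably the right default: a snapshot with an unreadable tag should stop the run, not get deleted.&lt;/p&gt;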

&lt;p&gt;The second change was to the quarterly cost review process. It now starts with the line items that should be zero or near zero, not the ones that are already big. The big lines get watched constantly by capacity planners. The lines that should be zero are where deleted infrastructure leaves footprints, and they are the ones least likely to be on anybody's dashboard. EBS Snapshot on a no-EBS-snapshot account. Lambda invocations on a service that was migrated to ECS six months ago. NAT Gateway hours on a workload that should not need cross-AZ egress. These are the lines where deprecated pipelines keep paying rent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJw9jkFqwzAQRfc5xb-AodtuWhK7SVah1IFShBeDNJGFVU2QpgRT9e7FTun6vf_4lyg3O1JWnLsNsDV9omsZRWEzk7Ib0DRP2H0fqeCNlZMGSVDyzz8bYLfQ-sGlojXvzFOcYSNT4gw7sp0KHCkP_-pJKjqzj-Q9O9BFOePxAY7mskjtKr1S0XVX8WK2XyqN48j3TPuXUcysFXvTT-G6gG49ejB9JDtBBXJLIXko0-fCDys_mjN5SMa9CMc2lCBp-AWs91UQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJw9jkFqwzAQRfc5xb-AodtuWhK7SVah1IFShBeDNJGFVU2QpgRT9e7FTun6vf_4lyg3O1JWnLsNsDV9omsZRWEzk7Ib0DRP2H0fqeCNlZMGSVDyzz8bYLfQ-sGlojXvzFOcYSNT4gw7sp0KHCkP_-pJKjqzj-Q9O9BFOePxAY7mskjtKr1S0XVX8WK2XyqN48j3TPuXUcysFXvTT-G6gG49ejB9JDtBBXJLIXko0-fCDys_mjN5SMa9CMc2lCBp-AWs91UQ" alt="The lifecycle every snapshot now goes through. Untagged snapshots cannot live past 90 days without an explicit decision." width="611" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The lifecycle every snapshot now goes through. Untagged snapshots cannot live past 90 days without an explicit decision.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost archaeology on accounts where a deprecated pipeline is still paying rent
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When the bill is the only thing telling you what you forgot&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The shape of this incident is common. A pipeline gets shipped, the engineer who wrote it leaves, the policy gets replaced but the outputs survive, and the bill slowly bends upward. EBS snapshots are the most common shape we see. Detached EIPs are close behind. Idle NAT gateways and orphaned ElastiCache clusters round out the top four. None of these line items alarm on a CloudWatch dashboard because nothing is actively misbehaving. The deprecated pipeline is the misbehavior, and the pipeline no longer exists.&lt;/p&gt;

&lt;p&gt;We run these cost-archaeology engagements regularly. In the last quarter we walked through three accounts where a single deprecated backup pipeline accounted for more than half of the account's EBS Snapshot line. We have an inventory script that finds orphan snapshots, detached volumes, unused EIPs, and idle NAT gateways across an account in about 20 minutes, plus a sample-then-delete workflow we walk the team through live so nothing irreversible happens on autopilot. The deletion is always the easy half. The work is figuring out which orphans are safe and writing the tag-at-creation policy that stops the next one.&lt;/p&gt;

&lt;p&gt;If your bill has a line that does not match anything that should be running, the orphan audit is usually the fastest way to find out where it is going. &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;Request an infrastructure review&lt;/a&gt; and we will run the audit with your team on a 30-minute diagnostic call this week. You can also see the broader pattern in our &lt;a href="https://infraforge.agency/services/" rel="noopener noreferrer"&gt;services overview&lt;/a&gt; for cloud cost spike work.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/orphan-ebs-snapshots-deleted-backup-pipeline-cost-spike/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/orphan-ebs-snapshots-deleted-backup-pipeline-cost-spike/&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>cost</category>
      <category>triage</category>
      <category>costspikes</category>
    </item>
    <item>
      <title>Why one shared Terraform module made every PR a 14-service change</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Tue, 26 May 2026 19:30:02 +0000</pubDate>
      <link>https://dev.to/infraforge/why-one-shared-terraform-module-made-every-pr-a-14-service-change-hn1</link>
      <guid>https://dev.to/infraforge/why-one-shared-terraform-module-made-every-pr-a-14-service-change-hn1</guid>
      <description>&lt;p&gt;The PR that shipped the bug had three approvals and a comment that read "LGTM, plans look normal." The plans were not normal. They were 14 separate terraform plan outputs stacked in the CI log, each touching 80 to 120 resources, totaling around 1,400 resource changes for what the author described as a typo fix in a shared module. Buried somewhere in plan number nine was a change to an IAM policy attachment that broke three services on apply. Nobody had read past plan three. The team had spent six months congratulating themselves on collapsing 8,000 lines of Terraform into 1,200, and the bill for that consolidation had just arrived.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every PR touching a shared module shows N service plans in CI, each with 50+ resource changes&lt;/li&gt;
&lt;li&gt;Reviewers approve with 'plans look normal' without scrolling through them&lt;/li&gt;
&lt;li&gt;A single shared module has accumulated 25 to 40 input variables to handle per-service edge cases&lt;/li&gt;
&lt;li&gt;CI plan time grows superlinearly because each consumer plan loads its own remote state&lt;/li&gt;
&lt;li&gt;A bug in the module breaks multiple unrelated services in the same apply window&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How a 1,400-resource plan output stopped being read
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The PR that broke three services had a clean LGTM&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The original consolidation was, on paper, exactly the refactor every platform team is told to do. Fourteen service-specific Terraform configs, each maintained by a different feature team, each with its own subtle drift from the others. The platform team pulled the common shape out into one service-stack module, parameterized the differences, and pointed all 14 services at it. Eight thousand lines of HCL became twelve hundred. A change to add a shared observability sidecar landed across all 14 services in a single PR. Everyone celebrated.&lt;/p&gt;

&lt;p&gt;The failure mode took six months to surface because the early signals looked like wins. Module changes shipped faster than the per-service changes they replaced. The platform team felt productive. What nobody tracked was that the CI plan output for every module PR had grown from one service's plan to fourteen, and the reviewers had silently adapted by reading the first plan, skimming the second, and rubber-stamping the rest.&lt;/p&gt;

&lt;p&gt;Then a module-level change to how IAM policies were attached introduced a subtle bug: for services that overrode the default policy document, the new code path replaced rather than merged. Three of the 14 services overrode that default. The plan output showed the destruction and recreation of those policy attachments quite clearly, somewhere around line 870 of the GitHub diff view. The PR had three approvals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Title: fix: typo in service-stack variable description

Diff: 1 line changed (a comment)

CI: terraform-plan-all ✓
  - service-a/plan: 84 changes
  - service-b/plan: 91 changes
  - service-c/plan: 102 changes
  - service-d/plan: 88 changes
  ... (10 more)

Total: 1,388 resource changes

Reviews:
  @platform-lead   approved 'LGTM, plans look normal'
  @service-b-eng   approved 'looks fine'
  @service-g-eng   approved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;What the PR description looked like, paraphrased from the post-mortem&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix that was not reviewer discipline
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we thought it was, what it actually was&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first instinct, and the one the team had spent two weeks pursuing before we got involved, was that this was a code review hygiene problem. They had written a PR template that required reviewers to acknowledge they had read each plan. They had a Slack bot that posted a daily "unreviewed plan changes" count. The platform lead had given a brown bag talk titled "Read Your Plans." None of it stuck, because none of it could stick. Asking a human to read 1,400 lines of plan output for a one-character comment fix is asking them to do something nobody should do, and they will not do it for long even if you make them feel guilty about it.&lt;/p&gt;

&lt;p&gt;The actual problem was structural. The module had become a dependency surface that 14 consumers were forced to redeploy together, on every change, whether the change affected them or not. That is not a code review problem. That is the same coupling problem distributed systems people argue about with monoliths and microservices, except it had snuck in through the back door of a Terraform refactor. The cost of coupling does not show up the day you consolidate. It shows up the first time a small change has to ship and the blast radius is the entire fleet.&lt;/p&gt;

&lt;p&gt;We have written more on the broader pattern in &lt;a href="https://infraforge.agency/terraform-iac-debt/" rel="noopener noreferrer"&gt;the Terraform and IaC debt pillar&lt;/a&gt;, but the specific recovery for this shape of problem has three layers, and they have to land in order.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we cut the blast radius from 14 to 1 in an afternoon
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Pinning the module per service was the bleeding stopper&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The immediate move was to stop every consumer from being forced to re-plan on every module change. The mechanism is dumb and effective: pin each service's module reference to an explicit git ref instead of letting them all track main.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Before: every consumer floats on main
module "service" {
  source = "git::https://github.com/org/modules.git//service-stack"
  name   = "auth-api"
  # ...
}

# After: every consumer is pinned to an explicit version
module "service" {
  source = "git::https://github.com/org/modules.git//service-stack?ref=v1.4.2"
  name   = "auth-api"
  # ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Before and after: the module block in each service's Terraform config&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After the pin, a module change ships as a tagged release in the modules repo, then ships to consumers one at a time via a per-service PR that bumps the ref. Each of those PRs shows exactly one service's plan, and that plan is short enough to read. The reviewers can do their job again. The author has to think about which services they actually want this change in, in what order, and on what schedule.&lt;/p&gt;
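&lt;p&gt;Before relying on the pins, it helps to confirm that no consumer is still floating. The Python sketch below is one way to check, assuming you have already collected each service's module source string; the helper name and the billing service are hypothetical.&lt;/p&gt;

```python
from urllib.parse import parse_qs


def pinned_ref(source: str):
    """Return the git ref a module source is pinned to, or None if it floats."""
    # Everything after the first "?" is the query string that carries ref=...
    _, _, query = source.partition("?")
    refs = parse_qs(query).get("ref", [])
    return refs[0] if refs else None


# Hypothetical inventory of module source strings, one per service.
sources = {
    "auth-api": "git::https://github.com/org/modules.git//service-stack?ref=v1.4.2",
    "billing": "git::https://github.com/org/modules.git//service-stack",
}
floating = [name for name, src in sources.items() if pinned_ref(src) is None]
print(floating)
```

&lt;p&gt;Any service the list returns is still tracking the default branch and will re-plan on every module change.&lt;/p&gt;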

&lt;p&gt;There is a real cost to this, and we want to name it honestly: you have given up some of the consolidation win. You can no longer ship an observability change to all 14 services in one PR. You can ship it in one tagged module release plus 14 small bump PRs, which is more clicks. We have not had a client regret the trade once they lived with it for a month. The clicks are cheap; the missed bug in plan number nine is not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJx9j7EOgjAURXe_4u4Gk6KTgwm6SkLqSBgKPISktM2j6O8bSzUMxvWde5LzOm2fTa_Y4yo3wDTXd1aux5k6y7QBgFyUuW1nTRjVYFCzMk1fIUlOuImsnIgfQ0OJgtPKVFGJ-PzF9S98KcUh3Cd465XGYGANoZDvIZl2nZR1nnjx00_RQ-wOuxRe3Zcg-QFMzqKQ2AYWLBkWhVgV1_PoUMgjxKou7tJV-r_dvgyUrdYTHDGiFj94AYEWaSo" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJx9j7EOgjAURXe_4u4Gk6KTgwm6SkLqSBgKPISktM2j6O8bSzUMxvWde5LzOm2fTa_Y4yo3wDTXd1aux5k6y7QBgFyUuW1nTRjVYFCzMk1fIUlOuImsnIgfQ0OJgtPKVFGJ-PzF9S98KcUh3Cd465XGYGANoZDvIZl2nZR1nnjx00_RQ-wOuxRe3Zcg-QFMzqKQ2AYWLBkWhVgV1_PoUMgjxKou7tJV-r_dvgyUrdYTHDGiFj94AYEWaSo" alt="The change-propagation shape before and after pinning" width="902" height="711"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The change-propagation shape before and after pinning&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the 30-input module became 5 inputs plus an advanced object
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Tiering inputs and splitting by change velocity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Pinning bought time. It did not fix the underlying reason the module had become hard to change. We sat with the platform team and looked at all 30 inputs the module had grown. Most services used 5 of them. The other 25 existed because, over six months, individual services had asked for an escape hatch ("can the module take a custom IAM policy document?", "can we override the security group rules?", "can we set a node selector?") and the module owners had said yes, every time, because saying no felt like blocking a teammate. The module had become an everything-bagel.&lt;/p&gt;

&lt;p&gt;We refactored the input surface into two tiers. The common five became first-class top-level inputs. The other 25 went into an optional advanced object with optional() fields, so a normal consumer never sees them and an exotic consumer has to opt in deliberately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Common path: every service uses these
variable "name"        { type = string }
variable "image"       { type = string }
variable "replicas"    { type = number }
variable "environment" { type = string }
variable "port"        { type = number }

# Escape hatch: explicit, optional, and visible in code review
variable "advanced" {
  type = object({
    custom_iam_policy_json = optional(string)
    extra_sg_rules         = optional(list(object({ ... })))
    node_selector          = optional(map(string))
    # ... 22 more rarely-used knobs
  })
  default = {}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;variables.tf after the tiering refactor&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then we did the harder work: splitting the module along change-velocity boundaries. The monitoring submodule was changing roughly weekly, the database submodule once a quarter, and the networking submodule about twice a year. Bundling them together meant every monitoring tweak forced a re-plan of database and networking resources for all 14 consumers. We pulled them apart into separate modules with separate version pins, so a consumer can bump monitoring from v2.1 to v2.2 without touching database at all.&lt;/p&gt;

&lt;p&gt;This is the same argument microservice advocates make about service boundaries, and the rule of thumb is the same: couple things that change together, decouple things that change at different rates. The cost of getting it wrong in Terraform is not latency or distributed-transaction pain. It is plan output nobody reads, and bugs that ship because of it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pin per consumer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each service references the module at an explicit git tag. Module changes ship as releases, then propagate per service.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tier the inputs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Five common inputs stay first-class. The long tail moves into an optional advanced object so escape hatches are explicit.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Split by change velocity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Submodules that change at different rates become separate modules with separate version pins. A weekly change does not drag a quarterly one along.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gate multi-service plans&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An OPA or CI check fails any PR whose plan touches more than three workspaces unless the description includes allow-multi-service: yes.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The OPA gate is worth its own sentence. It is twenty lines of Rego that counts distinct workspaces touched by the plan and fails the PR over a threshold unless the author explicitly opts in. It does not prevent fleet-wide changes; it forces the author to acknowledge they are making one. That single check has caught two accidental fleet-wide PRs at the clients who have adopted it, both of which would have shipped under the old regime.&lt;/p&gt;
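&lt;p&gt;The logic of the gate is simple enough to show outside Rego. The Python below is a stand-in for the policy, not the policy itself: the function name and inputs are assumptions, while the threshold of three workspaces and the allow-multi-service: yes marker come from the table above.&lt;/p&gt;

```python
OPT_IN_MARKER = "allow-multi-service: yes"
MAX_WORKSPACES = 3


def gate_passes(touched_workspaces: list[str], pr_description: str) -> bool:
    """Pass unless the plan touches more than MAX_WORKSPACES without opt-in."""
    # Count each workspace once, however many resources it changes.
    distinct = set(touched_workspaces)
    if len(distinct) > MAX_WORKSPACES:
        return OPT_IN_MARKER in pr_description
    return True


fleet = ["service-a", "service-b", "service-c", "service-d"]
print(gate_passes(fleet, "fix: typo"))                             # False
print(gate_passes(fleet, "fix: typo\nallow-multi-service: yes"))   # True
print(gate_passes(["service-a", "service-a"], "bump monitoring"))  # True
```

&lt;p&gt;The opt-in lives in the PR description on purpose: it is visible to every reviewer and leaves a record of who chose the wide blast radius.&lt;/p&gt;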

&lt;h2&gt;
  
  
  When the consolidation win has become a coupling tax
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If your shared modules feel like this right now&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hard part of this kind of recovery is not the technical work. Pinning a module ref is ten minutes of typing per service. Splitting a module along change-velocity lines is a weekend. The hard part is convincing the team that the consolidation they are proud of has become a liability, and doing the unwind in an order that does not cause an outage. We have done this engagement at four SaaS platforms in the last year, and the pattern of "reviewer fatigue followed by a buried bug" showed up in three of the four. The fourth caught it before the bug shipped, only because their CI plan output had grown past GitHub's diff size limit and forced the conversation.&lt;/p&gt;

&lt;p&gt;We run these recovery engagements every week. If your platform team is shipping module changes that produce thousand-line plan outputs and your reviewers have started writing "plans look normal" without reading them, the next bug is already on its way. &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;Book an infrastructure review with our team&lt;/a&gt; and we will spend a 30-minute diagnostic call this week mapping your module consumer graph and naming the first three pins to put in place.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/terraform-shared-module-coupling-fleet-wide-plans/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/terraform-shared-module-coupling-fleet-wide-plans/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>iac</category>
      <category>debt</category>
      <category>terraformiacdebt</category>
    </item>
    <item>
      <title>When ArgoCD shows Healthy but Keycloak silently strips JWT claims</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Fri, 22 May 2026 22:19:02 +0000</pubDate>
      <link>https://dev.to/infraforge/when-argocd-shows-healthy-but-keycloak-silently-strips-jwt-claims-5f24</link>
      <guid>https://dev.to/infraforge/when-argocd-shows-healthy-but-keycloak-silently-strips-jwt-claims-5f24</guid>
      <description>&lt;p&gt;ArgoCD reported Synced and Healthy. The Keycloak Helm release was green. And the downstream timeline service was returning 401 on every authenticated request. That was the call we got: every dashboard says the platform is fine, and authentication is broken across three services. The JWTs auth-service was issuing had stopped carrying the groups claim and the email_verified claim about 40 minutes earlier, right after an ArgoCD auto-sync rolled out a Keycloak chart bump. Six OIDC clients had silently lost protocol mappers and role mappings during that sync, and we did not yet know it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ArgoCD shows Synced and Healthy on the Keycloak application, but downstream services return 401 on tokens they accepted an hour ago&lt;/li&gt;
&lt;li&gt;JWTs decoded at jwt.io are missing claims that production code depends on (groups, email_verified, audience)&lt;/li&gt;
&lt;li&gt;Engineers have been making emergency fixes directly in the Keycloak admin console during recent incidents and not committing them back&lt;/li&gt;
&lt;li&gt;The realm import ConfigMap in git has not been touched in weeks, yet the live realm has clearly changed&lt;/li&gt;
&lt;li&gt;Helm values for the Keycloak chart set realm import strategy to OVERWRITE or leave it unset (which defaults to OVERWRITE on most charts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The sync that looked clean and quietly stripped six clients
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;ArgoCD said Healthy. Auth said 401.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our first guess was wrong. The team had been staring at auth-service for 25 minutes when we joined the bridge, because the tokens it was issuing were obviously malformed. The groups claim was gone. The email_verified claim was gone on a different client. Surely auth-service had shipped a bad release. Except auth-service had not shipped in nine days, and the failure had started 40 minutes ago, not nine days ago.&lt;/p&gt;

&lt;p&gt;The shape of the failure is what gave it away. Three OIDC clients had each lost a different mapper at the same moment. Auth-service had lost a groups protocol mapper. The profile service had lost an email_verified client scope mapping. The api gateway had lost role mappings for a downstream audience. Three services do not lose three unrelated pieces of OIDC config simultaneously unless something upstream rewrote all of them at once. The only thing that had touched Keycloak in that window was an ArgoCD auto-sync of the Keycloak Helm release.&lt;/p&gt;

&lt;p&gt;We pulled the ArgoCD sync history and found the sync 41 minutes earlier. It was a chart version bump, nothing that should have changed realm content. But the chart ships a realm import ConfigMap, and the realm JSON inside that ConfigMap had not been updated in weeks. Meanwhile the live realm in the Keycloak PostgreSQL database had been edited through the admin console at least a dozen times during recent incidents. None of those console changes had been committed back to git.&lt;/p&gt;

&lt;p&gt;So the chart redeployed the ConfigMap. The Keycloak init container read it. And the realm import ran with the strategy set to OVERWRITE. Every console change made during the previous two weeks of incident response got reverted to the stale git version, silently, with no error and no event surfaced to ArgoCD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diffing live realm state against the ConfigMap before doing anything destructive
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Six clients had drifted and the next sync would make it worse&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first thing we did was not a fix. The first thing we did was freeze. Auto-sync was still enabled on the Keycloak ArgoCD application. If anyone touched a Helm value for any reason in the next hour, another sync would fire and a second OVERWRITE pass would run against whatever state we had managed to reconstruct. We set the sync policy to none and disabled self-heal first, then started the diagnosis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Freeze the ArgoCD app so the next sync cannot fire mid-recovery
argocd app set keycloak --sync-policy none
argocd app set keycloak --self-heal=false

# 2. Pull live realm state from the Keycloak Admin REST API
TOKEN=$(curl -s -X POST "$KC/realms/master/protocol/openid-connect/token" \
  -d "grant_type=password" -d "client_id=admin-cli" \
  -d "username=$ADMIN_USER" -d "password=$ADMIN_PASS" | jq -r .access_token)

curl -s -H "Authorization: Bearer $TOKEN" \
  "$KC/admin/realms/primary/clients" | jq . &amp;gt; live-clients.json

curl -s -H "Authorization: Bearer $TOKEN" \
  "$KC/admin/realms/primary/client-scopes" | jq . &amp;gt; live-scopes.json

# 3. Extract the realm JSON ArgoCD just pushed
kubectl -n keycloak get cm keycloak-realm-import -o jsonpath='{.data.realm\.json}' \
  | jq . &amp;gt; configmap-realm.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Snapshot live state before any reconciliation. The live API is now the source of truth, not the ConfigMap.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Diffing live-clients.json against the clients block in configmap-realm.json showed six clients with material differences. Two were missing protocol mappers entirely. Three had client scopes that had been removed. One had role mappings that were present in the ConfigMap but missing in production, which told us that client had also been changed in the console at some point and the change had been overwritten on a previous sync we had not even noticed. That last finding was the one that mattered most: this was not the first time the OVERWRITE strategy had quietly destroyed live config. It was just the first time the destruction had cascaded far enough to break downstream services.&lt;/p&gt;
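
&lt;p&gt;The comparison itself can be scripted rather than eyeballed. The Python sketch below is a minimal version, not the script used on the bridge. It assumes both inputs are already parsed JSON and that each client object exposes clientId, protocolMappers and defaultClientScopes the way a realm export does. It compares only mapper names and default client scopes, and the function names are illustrative.&lt;/p&gt;

```python
def index_clients(clients):
    """Map clientId to the mapper names and default scopes on that client."""
    indexed = {}
    for client in clients:
        indexed[client["clientId"]] = {
            "mappers": {m["name"] for m in client.get("protocolMappers", [])},
            "scopes": set(client.get("defaultClientScopes", [])),
        }
    return indexed


def diff_clients(live_clients, git_clients):
    """Report, per client, what exists on only one side.

    Clients with no differences are left out of the report.
    """
    live = index_clients(live_clients)
    git = index_clients(git_clients)
    empty = {"mappers": set(), "scopes": set()}
    report = {}
    for client_id in sorted(set(live) | set(git)):
        entry = {}
        for field in ("mappers", "scopes"):
            in_live = live.get(client_id, empty)[field]
            in_git = git.get(client_id, empty)[field]
            if in_live != in_git:
                entry[field] = {
                    "only_live": sorted(in_live - in_git),
                    "only_git": sorted(in_git - in_live),
                }
        if entry:
            report[client_id] = entry
    return report
```

&lt;p&gt;To use it, pass the parsed live-clients.json as the first argument and the clients array from the parsed configmap-realm.json as the second. Anything listed under only_git is config that git still has and production has lost.&lt;/p&gt;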

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxNkNFOwlAMhu95iv8B5BU0wAZOEXQQMVmIOdm60Xh2ztIehov47mZbot416d-vX1taf8lPRgL20QSYZbGr2BEJqOCgEDK2BjuYomaH3Dv1llCchV0FdjkX5MIR0-kt5tniZFxFsNySwjvb9ZOP1OXWmw88ew2V0O5lfZwAi2wmlV9EMOfgp9q5HL78Cw9SIzfK0tGibrwELLwruXoyDYSmpmksU9EDoyEcfyVjTIOYQFV39z0B4r533b7G6SFN9vEVy2zNLf3ekw_iCmVLLtgOQi1JGMHjcLLabNP4PX5Ldvtks7piNSI0mEBohJSkpeLmnx9XzgsVKL2APllD_zMh9WfJSXv0fHBeToDlUN1nkb84DUKmxsNhPwieaxLt90hAadj2FO9Qs2pf5tZwrccfWZ-cNA" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxNkNFOwlAMhu95iv8B5BU0wAZOEXQQMVmIOdm60Xh2ztIehov47mZbot416d-vX1taf8lPRgL20QSYZbGr2BEJqOCgEDK2BjuYomaH3Dv1llCchV0FdjkX5MIR0-kt5tniZFxFsNySwjvb9ZOP1OXWmw88ew2V0O5lfZwAi2wmlV9EMOfgp9q5HL78Cw9SIzfK0tGibrwELLwruXoyDYSmpmksU9EDoyEcfyVjTIOYQFV39z0B4r533b7G6SFN9vEVy2zNLf3ekw_iCmVLLtgOQi1JGMHjcLLabNP4PX5Ldvtks7piNSI0mEBohJSkpeLmnx9XzgsVKL2APllD_zMh9WfJSXv0fHBeToDlUN1nkb84DUKmxsNhPwieaxLt90hAadj2FO9Qs2pf5tZwrccfWZ-cNA" alt="Two write paths to the same realm. OVERWRITE makes one of them silently win." width="679" height="796"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Two write paths to the same realm. OVERWRITE makes one of them silently win.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reconstructing realm state without invalidating active sessions
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Why we did not re-import the ConfigMap&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The obvious recovery path was to fix the realm JSON in git, commit it, and let ArgoCD re-sync. We did not do that, and the reason matters. A full realm re-import, even with the right content, runs through the Keycloak realm import flow on startup. Depending on the chart and the Keycloak version, that can rotate signing keys, drop active sessions, or invalidate refresh tokens. We had roughly 8,000 active user sessions at that moment. Forcing all of them to re-authenticate at 11pm during an active incident was not a recovery; it was a second outage on top of the first.&lt;/p&gt;

&lt;p&gt;So we split the fix into two phases. Phase one was to restore live realm state using the Admin REST API directly, client by client, mapper by mapper. The REST API can add a protocol mapper or attach a client scope to a client without bouncing anything. Phase two was to update the ConfigMap in git to match the now-correct live state AND change the import strategy, so that the next ArgoCD sync would be a no-op rather than another OVERWRITE pass.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Phase 1: restore each missing mapper live via Admin REST API
# Example: re-add the groups protocol mapper to auth-service client
CLIENT_ID=$(jq -r '.[] | select(.clientId=="auth-service") | .id' live-clients.json)

curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  "$KC/admin/realms/primary/clients/$CLIENT_ID/protocol-mappers/models" \
  -d '{
    "name": "groups",
    "protocol": "openid-connect",
    "protocolMapper": "oidc-group-membership-mapper",
    "config": {
      "claim.name": "groups",
      "full.path": "false",
      "id.token.claim": "true",
      "access.token.claim": "true",
      "userinfo.token.claim": "true"
    }
  }'

# Verify a freshly issued token now carries the claim before moving on
# (JWT segments use the base64url alphabet; translate it back before decoding)
curl -s -X POST "$KC/realms/primary/protocol/openid-connect/token" \
  -d 'grant_type=client_credentials' \
  -d "client_id=auth-service" -d "client_secret=$SECRET" \
  | jq -r .access_token | cut -d. -f2 | tr '_-' '/+' | base64 -d 2&amp;gt;/dev/null | jq .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Restore each mapper live, then verify the issued token actually carries the claim before moving to the next client.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We worked through the six clients in dependency order: auth-service first because every other service consumed its tokens, then the api gateway, then profile, then the rest. After each client we curl'd a fresh token and base64-decoded the payload to confirm the claim was present. Twenty-two minutes from the start of restoration, timeline-service was returning 200s again. No sessions dropped. No users re-authenticated. The Keycloak pods were never restarted.&lt;/p&gt;
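
&lt;p&gt;One detail in that verification step is easy to trip on: a JWT payload is base64url-encoded with the padding stripped, which some base64 decoders reject. A small Python helper, sketched here with illustrative names, restores the padding before decoding and then lists any expected claims the token does not carry. It inspects the payload only and does not verify the signature.&lt;/p&gt;

```python
import base64
import json


def jwt_claims(token):
    """Decode the payload segment of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    # base64url payloads drop their trailing "=" padding; put it back.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def missing_claims(token, required):
    """Return the required claim names that the token does not carry."""
    claims = jwt_claims(token)
    return [name for name in required if name not in claims]
```

&lt;p&gt;An empty list from missing_claims for a freshly issued token is the signal to move on to the next client.&lt;/p&gt;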

&lt;h2&gt;
  
  
  What we changed so the next sync becomes a no-op
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The one Helm value that should never be OVERWRITE&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With live state correct, the dangerous artifact in the system was still the stale realm JSON in the ConfigMap and the OVERWRITE strategy that would re-apply it on any future sync. We exported the now-correct realm via the Admin API, ran it through a diff against what was in git, and committed the result. We also patched the Keycloak Helm values to set the realm import strategy to IGNORE_EXISTING.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
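&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# values.yaml for the Keycloak chart
# Variable names differ between charts and Keycloak versions; confirm them against your chart's documentation.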
extraEnv: |
  - name: KEYCLOAK_IMPORT_STRATEGY
    value: IGNORE_EXISTING
  # On Keycloak 22+ via Quarkus distribution:
  - name: KC_SPI_IMPORT_SINGLE_FILE_STRATEGY
    value: IGNORE_EXISTING

# For the operator/CR variant:
# spec:
#   realmImport:
#     strategy: IGNORE_EXISTING   # NOT OVERWRITE_EXISTING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;IGNORE_EXISTING means the ConfigMap seeds a realm on first creation but never overwrites existing resources. This is the correct setting for any realm that humans also edit.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We re-enabled ArgoCD auto-sync and watched it run. The sync diffed clean: ConfigMap content matched live realm, import strategy was IGNORE_EXISTING, no resources were touched. Green for the right reason this time.&lt;/p&gt;

&lt;p&gt;We changed two things in the way the team operates going forward. First, we wrote a small drift detector that runs nightly. It pulls the live realm via the Admin API, diffs it against the realm JSON in git, and posts to a Slack channel if they disagree. It is roughly 80 lines and it has caught two console-edits-not-committed in the six weeks since. Second, we now treat OVERWRITE as a forbidden value for any realm that is also editable in the admin console. If you want OVERWRITE semantics, you must also remove admin console write access for everyone except a break-glass account, because otherwise you are building a system where one of two writers silently destroys the other's work. We have written more about this category of GitOps failure in the &lt;a href="https://infraforge.agency/argocd-gitops-recovery/" rel="noopener noreferrer"&gt;ArgoCD and GitOps recovery cluster&lt;/a&gt;, and the same pattern shows up with Grafana dashboards, Argo Workflows templates, and anything else where humans and a controller both have write access to the same object.&lt;/p&gt;
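
&lt;p&gt;The core of a detector like that is a recursive comparison that reports where two realm documents disagree. The sketch below is not the team's 80-line script. It is a minimal Python version with illustrative names, and it leaves out fetching the live realm and posting to Slack. It assumes both documents are already parsed JSON and treats list order as significant, so lists would need sorting by a stable key first if the export order varies.&lt;/p&gt;

```python
def drift_paths(live, git, path="realm"):
    """Return the paths at which two parsed JSON documents differ."""
    if isinstance(live, dict) and isinstance(git, dict):
        paths = []
        for key in sorted(set(live) | set(git)):
            child = f"{path}.{key}"
            if key not in live or key not in git:
                paths.append(child)  # key exists on one side only
            else:
                paths.extend(drift_paths(live[key], git[key], child))
        return paths
    if isinstance(live, list) and isinstance(git, list) and len(live) == len(git):
        paths = []
        for index, (left, right) in enumerate(zip(live, git)):
            paths.extend(drift_paths(left, right, f"{path}[{index}]"))
        return paths
    # Scalars, mismatched types, or lists of different length.
    return [] if live == git else [path]


def drift_message(live, git):
    """Build the alert text, or return None when the realm matches git."""
    paths = drift_paths(live, git)
    if not paths:
        return None
    return "Keycloak realm drift at: " + ", ".join(paths)
```

&lt;p&gt;Reporting paths rather than a raw textual diff keeps the alert short enough to read in a chat channel and points straight at the client or mapper that changed.&lt;/p&gt;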

&lt;h2&gt;
  
  
  When GitOps is silently rewriting your identity provider
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If your realm config and your cluster disagree&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hard part of this kind of incident is not the Keycloak knowledge. It is recognizing that a green ArgoCD dashboard can coexist with a destroyed production configuration, and knowing which fixes preserve sessions versus which ones lock out every user in the building at midnight. The team we worked with had the Keycloak skills. What they did not have was a recovery sequence that prioritized live state capture over git reconciliation, and a clear rule about when to apply via the Admin API versus when to let ArgoCD do it.&lt;/p&gt;

&lt;p&gt;We run these recovery engagements every week. The OVERWRITE-vs-IGNORE_EXISTING trap has hit two other teams this quarter, both on Keycloak, and we have seen the same shape on Grafana provisioning, Argo Workflows ClusterWorkflowTemplates, and a memorable case with Vault policies. The pattern is always: controller writes, human writes, controller wins on the next reconcile, nobody notices for hours.&lt;/p&gt;

&lt;p&gt;If your identity provider, your dashboards, or any other system with human-editable state is sitting behind ArgoCD and you have ever wondered whether you are quietly losing changes, &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;book an infrastructure review with our team&lt;/a&gt; and we will be on a bridge with you the same day. The first 30 minutes will tell you whether you have a drift problem, and from there we can scope a recovery that does not require kicking your users out.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/keycloak-realm-overwrite-argocd-sync-drift/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/keycloak-realm-overwrite-argocd-sync-drift/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gitops</category>
      <category>recovery</category>
      <category>gitopsargocd</category>
    </item>
    <item>
      <title>Why a Terraform apply hangs 90 minutes on a custom provider with no timeout</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Fri, 22 May 2026 10:14:24 +0000</pubDate>
      <link>https://dev.to/infraforge/why-a-terraform-apply-hangs-90-minutes-on-a-custom-provider-with-no-timeout-13hh</link>
      <guid>https://dev.to/infraforge/why-a-terraform-apply-hangs-90-minutes-on-a-custom-provider-with-no-timeout-13hh</guid>
      <description>&lt;p&gt;Two hundred destroys that needed 40 seconds of real work hung for 90 minutes. The platform team kicked off a terraform apply to remove stale config entries from an internal service, watched the progress bar stop at minute 12, and then stared at a frozen terminal until someone finally ran kill -9. By that point the state file was half-updated, the DynamoDB lock was still held, and nobody was sure which of the 200 entries had actually been deleted. The custom Terraform provider doing the destroys had a synchronous HTTP call with no context timeout, and the backend behind it was rate-limiting at 5 RPS. Neither side was wrong on its own. The contract between them was broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;terraform apply prints no output for 20+ minutes after destroys begin, no progress, no errors&lt;/li&gt;
&lt;li&gt;The backend service is healthy on its dashboard but throttling requests at a low RPS limit&lt;/li&gt;
&lt;li&gt;kill -9 on the terraform process leaves the DynamoDB state lock held forever&lt;/li&gt;
&lt;li&gt;After force-unlock, terraform state list shows resources that no longer exist in the cloud&lt;/li&gt;
&lt;li&gt;The custom provider in use was written internally and has no timeouts {} block support documented&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What the team thought was happening, and what was actually happening
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Forty seconds of work, ninety minutes of silence&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first assumption was that the internal config service was hung. It was not. Its dashboard showed it healthy and serving requests, just slowly. The second assumption was that terraform was making progress and just not printing anything. That one was half true. Terraform was making progress, at exactly 5 deletes per second, which is the rate limit the backend was enforcing. With 200 entries that is 40 seconds of real work. The team waited 90 minutes.&lt;/p&gt;

&lt;p&gt;The reason for the gap was a custom Terraform provider written by a previous platform team. Its DeleteResource function looked roughly like the snippet below. No context. No timeout. No retry-with-backoff. No progress emission back to Terraform's UI layer. When the backend returned a 429, the provider's HTTP client did its own internal retry, swallowed the error, and tried again. Forever. Because the provider never returned from Delete, Terraform's supervisor saw a working call and waited.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func resourceConfigEntryDelete(d *schema.ResourceData, meta interface{}) error {
    client := meta.(*ConfigClient)
    id := d.Id()

    // No context. No timeout. No bound on retries.
    for {
        err := client.DeleteEntry(id)
        if err == nil {
            return nil
        }
        if isRateLimited(err) {
            time.Sleep(1 * time.Second)
            continue
        }
        return err
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The shape of the broken Delete function (reconstructed from the provider source)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What this should have been is below. Declaring a schema.ResourceTimeout on the resource lets users set a timeouts {} block in their configuration. The context carries that deadline. When the deadline expires, the provider returns an error, Terraform fails the apply and keeps the resource in state, and the next plan proposes the destroy again instead of leaving it silently in progress for the rest of human history.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
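&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import (
    "context"

    "github.com/hashicorp/terraform-plugin-sdk/v2/diag"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/retry"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceConfigEntryDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
    client := meta.(*ConfigClient)
    id := d.Id()

    // RetryContext gives up when the Delete timeout elapses or ctx is cancelled.
    err := retry.RetryContext(ctx, d.Timeout(schema.TimeoutDelete), func() *retry.RetryError {
        err := client.DeleteEntryWithContext(ctx, id)
        if err == nil {
            return nil
        }
        if isRateLimited(err) {
            return retry.RetryableError(err)
        }
        return retry.NonRetryableError(err)
    })
    // RetryContext returns a plain error; convert it for the context-aware signature.
    return diag.FromErr(err)
}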
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;What the Delete function should look like&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The half-updated state and the stuck DynamoDB lock
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Why kill -9 left us worse off&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When the engineer finally ran kill -9 on the terraform process, two things happened that compounded the problem. First, the DynamoDB lock entry stayed exactly where it was. Terraform releases its lock on graceful shutdown, not on SIGKILL. So the next person who ran terraform plan got the familiar error and assumed someone else was still working on it. They were not. The lock was a ghost.&lt;/p&gt;

&lt;p&gt;Second, because the destroys had been happening serially at 5 RPS for the 12 minutes before the hang became obvious (the team realized later they had actually waited longer than they thought before noticing the silence), roughly 60 of the 200 entries had actually been deleted from the backend. Terraform had updated its in-memory state as each delete returned, but the process was killed before that state reached the remote backend. So all 60 of those successful deletes were lost from the state file. The cloud was missing 60 entries that tfstate still claimed existed.&lt;/p&gt;

&lt;p&gt;Before doing anything else we confirmed the terraform process was actually dead on the operator's machine. ps aux | grep terraform, on the actual machine, not a tmux pane from yesterday. We have force-unlocked locks that turned out to belong to a process still doing useful work, and the damage is worse than a stuck lock. Once confirmed dead, terraform force-unlock with the lock ID from the error message released DynamoDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Confirm no terraform process is running on the operator's machine
ssh operator-host 'ps aux | grep -v grep | grep terraform'

# 2. Release the lock (lock ID comes from the error message)
terraform force-unlock 7c4a3e22-1b9d-4e8a-b6d7-9f2a8c5e4d11

# 3. See what state thinks vs what the cloud actually has
terraform plan -refresh-only

# 4. Apply the refresh so state matches reality
terraform apply -refresh-only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The recovery sequence after confirming the process is dead&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scripting state rm and import for 200 entries
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reconciling state against a half-finished destroy&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After the refresh-only apply, state and cloud agreed on what existed. But the original goal, deleting all 200 entries, was still only partially done. We now had two populations to handle: entries that still existed both in tfstate and in cloud (the destroy had not gotten to them), and entries that had been removed from cloud during the hung apply and were no longer in tfstate either (the refresh had cleaned them up). The first group we could destroy normally. The second group needed nothing further.&lt;/p&gt;

&lt;p&gt;Where it got annoying was a third population we discovered later: entries that had been deleted from cloud by the hung apply, but where the refresh had failed to notice because the provider's Read function had the same no-timeout bug and was returning stale cached data. Those entries were ghosts in tfstate. For each one we had to run terraform state rm by address. With 47 of them, we scripted it from a diff.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull current tfstate resource list&lt;/span&gt;
terraform state list | &lt;span class="nb"&gt;grep &lt;/span&gt;config_entry &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; tfstate_entries.txt

&lt;span class="c"&gt;# Pull live entries from the backend (one list request, well under the 5 RPS limit)&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CONFIG_API&lt;/span&gt;&lt;span class="s2"&gt;/entries"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.[].id'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; live_entries.txt

&lt;span class="c"&gt;# Entries in tfstate but not in cloud: these are ghosts&lt;/span&gt;
&lt;span class="nb"&gt;comm&lt;/span&gt; &lt;span class="nt"&gt;-23&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sort &lt;/span&gt;tfstate_entries.txt&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sort &lt;/span&gt;live_entries.txt | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s|^|module.config.config_entry.|'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ghosts.txt

&lt;span class="c"&gt;# Remove them from state&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read &lt;/span&gt;addr&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;terraform state &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$addr&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt; &amp;lt; ghosts.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Generating the state rm commands from a diff between tfstate and the live backend&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the inverse case (entry exists in cloud but not in tfstate), the recovery is terraform import. We did not hit this on this incident but we have hit it on similar ones, and the same diff approach works in the other direction. The general pattern for any half-finished Terraform operation against a custom provider is laid out in &lt;a href="https://dev.to/terraform-state-recovery/"&gt;our Terraform state recovery playbook&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contract every custom Terraform provider has to honor
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What the provider should have done&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A custom Terraform provider is a contract. Terraform's whole supervision model assumes the provider plays by it. The contract is short: Create, Read, Update, and Delete each accept a context, each respect the user's timeouts {} block, each emit clear errors when something goes wrong, and each return in bounded time. When a provider violates the contract, Terraform's user-facing behavior degrades in ways that look like Terraform bugs but are not.&lt;/p&gt;

&lt;p&gt;Internal providers skip the contract more often than vendor ones, because the team that writes the provider also runs the backend it talks to, and they convince themselves they have full visibility. They do not. terraform-cli is a separate process. It cannot see your retry loop. It cannot see your in-flight HTTP call. All it sees is a function that has not yet returned. The fix for this provider was four changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Accept context on every CRUD function&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Migrate from the legacy schema.CreateFunc signatures to the context-aware schema.CreateContextFunc variants. terraform-plugin-sdk v2 deprecates the legacy signatures, so treat this as required rather than optional.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Declare and honor timeouts on every resource&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add a Timeouts: &amp;amp;schema.ResourceTimeout{Create: schema.DefaultTimeout(5 * time.Minute), Delete: schema.DefaultTimeout(5 * time.Minute)} block on every resource schema. Use d.Timeout(schema.TimeoutDelete) inside the function.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Replace internal retry loops with retry.RetryContext&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The retry helper respects the context deadline and surfaces retryable vs non-retryable errors cleanly. Hand-rolled for-loops over time.Sleep do not.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Pin the fixed version via .terraform.lock.hcl&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Release a new patch version of the provider, update the lockfile, and remove the old version from your internal registry so nobody can fall back to it.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The apply pattern itself also needed a change. Destroying 200 entries in one shot against a 5 RPS backend is asking for trouble even with a correct provider, because a 5-minute timeout per resource is generous when one resource genuinely takes 200ms but useless when the queue ahead of you is 199 other deletes. We split future bulk operations into batches of 10 using -target, or we push the backend team to expose a bulk delete endpoint. The provider then wraps the bulk endpoint as a single resource operation instead of looping.&lt;/p&gt;
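
&lt;p&gt;Batching by -target is easy to script. The Python sketch below only builds the command lines and runs nothing. It assumes the resource addresses have already been collected, for example from terraform state list, and that one terraform destroy invocation per batch is acceptable. The batch size of 10 matches the text above, and the function names are illustrative.&lt;/p&gt;

```python
def batches(addresses, size=10):
    """Split resource addresses into consecutive batches of at most `size`."""
    return [addresses[start:start + size] for start in range(0, len(addresses), size)]


def destroy_commands(addresses, size=10):
    """Build one `terraform destroy` argument list per batch of targets."""
    return [
        ["terraform", "destroy"] + [f"-target={address}" for address in batch]
        for batch in batches(addresses, size)
    ]
```

&lt;p&gt;Each argument list can be handed to subprocess.run one at a time, checking the exit code before starting the next batch, so a failure stops the run with at most ten resources in an uncertain state.&lt;/p&gt;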

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxNjz1rw0AQRHv_igG3cuEkTVQEbEmQQIoQ1AkXm9OefXh1J1Yrf4B-fJACJu0y782sl3R1J1JDXa6AXWOsSj5pB-p7uR-w2bxNjkQGlCxsjGuwE1yKxjebsG-KcbDUodd0CS3rYQXsF-q9rr9QVp9VXf1BLVMrIfKEovmIxhpJZpMPRwysl-B4pouFfnl6hZIxJHRhLnp4v9n0Tj_ClWpSjNGCwNntn383_7KEU3ycc3SkZxiFaNxmkHQEz4oMysI0MCS584SyKYQpwlOQURmDkS3LBrsLo4QPIvl6659p6zOXJGm-9t7_AsvfcI0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxNjz1rw0AQRHv_igG3cuEkTVQEbEmQQIoQ1AkXm9OefXh1J1Yrf4B-fJACJu0y782sl3R1J1JDXa6AXWOsSj5pB-p7uR-w2bxNjkQGlCxsjGuwE1yKxjebsG-KcbDUodd0CS3rYQXsF-q9rr9QVp9VXf1BLVMrIfKEovmIxhpJZpMPRwysl-B4pouFfnl6hZIxJHRhLnp4v9n0Tj_ClWpSjNGCwNntn383_7KEU3ycc3SkZxiFaNxmkHQEz4oMysI0MCS584SyKYQpwlOQURmDkS3LBrsLo4QPIvl6659p6zOXJGm-9t7_AsvfcI0" alt="The relationship that broke and what fixes each side" width="641" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The relationship that broke and what fixes each side&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When a custom provider has left your state in an unknown shape
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If you are looking at a hung apply right now&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Hung Terraform applies against internal providers are the kind of incident that sounds boring in a postmortem and feels terrifying in the moment. You cannot tell if the apply is still doing useful work or stuck forever. You cannot kill it without risking a half-finished state. You cannot force-unlock until you are certain the process is dead. And once you do recover, you do not actually know which resources got modified and which did not, because the provider did not emit progress and the state file was not flushed.&lt;/p&gt;

&lt;p&gt;We run these recovery engagements often enough that the script above is templated. The no-timeout custom provider pattern shows up in maybe one in five of the Terraform recoveries we have done this year, almost always with internal providers written years ago by an engineer who has since left. The fix is mechanical once you know the shape of the failure: confirm process death, force-unlock, refresh-only plan, diff state against cloud, reconcile with state rm and import, then patch the provider so it cannot happen again.&lt;/p&gt;

&lt;p&gt;If you are staring at a hung apply right now and you are not sure whether to kill it, &lt;a href="https://dev.to/review/"&gt;book an infrastructure review with our team&lt;/a&gt; and we will be on a bridge with you the same day. If the apply is already dead and you are sorting through the wreckage, the same engagement covers the state reconciliation and the provider fix together.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/terraform-apply-hung-custom-provider-no-timeout/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/terraform-apply-hung-custom-provider-no-timeout/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>state</category>
      <category>recovery</category>
      <category>terraformstate</category>
    </item>
    <item>
      <title>Grafana 'No Data' after migration: 7 reconcilers we had to kill first</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Thu, 21 May 2026 21:13:36 +0000</pubDate>
      <link>https://dev.to/infraforge/grafana-no-data-after-migration-7-reconcilers-we-had-to-kill-first-4gfc</link>
      <guid>https://dev.to/infraforge/grafana-no-data-after-migration-7-reconcilers-we-had-to-kill-first-4gfc</guid>
      <description>&lt;p&gt;The first fix lasted 90 seconds. We had corrected the Grafana datasource URL from prometheus:9999 back to prometheus:9090, watched the pod roll, refreshed the dashboard, and seen one panel come alive. By the time we opened a second tab, the ConfigMap was back to 9999. That was the real incident. The 'No Data' dashboards were a symptom of an observability stack that someone, or something, was actively re-corrupting from at least seven places we had not yet found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana dashboards show 'No Data' on every panel after a cluster migration, and kubectl edit fixes revert within 1-3 minutes&lt;/li&gt;
&lt;li&gt;Prometheus targets page is empty or stuck on a namespace that does not exist anymore&lt;/li&gt;
&lt;li&gt;ClusterRoleBindings you just recreated reference a ClusterRole name nobody on the team typed&lt;/li&gt;
&lt;li&gt;ps aux shows kworker-looking processes with elevated CPU that hold open file descriptors to a kubeconfig&lt;/li&gt;
&lt;li&gt;kubectl get cronjobs -A shows entries in namespaces nobody on the platform team remembers creating&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why we stopped fixing config and started looking for what was undoing it
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The fix that lasted 90 seconds&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The team that called us had been at this for nine hours. After a cluster migration, every Grafana dashboard was blank. The on-call had walked through the obvious things. The Prometheus datasource in Grafana pointed at port 9999. The Loki datasource pointed at port 3199. The Prometheus scrape config had annotation keys nobody recognized (prometheus_io_metrics_enabled instead of prometheus_io_scrape) and targeted a namespace that did not exist. The Grafana deployment had a config-validator init container running sleep 3600. Each one of those was a real bug. Each one of those, fixed in isolation, would revert before the next pod rolled out.&lt;/p&gt;

&lt;p&gt;The shape of what they were describing was not a botched migration. A botched migration leaves bad state. This was bad state being re-applied. When manual kubectl edits revert in minutes, the question is no longer 'what is wrong with the manifest', it is 'what process has write access and is reconciling against a corrupt source of truth'. We told them to stop fixing config until we had inventoried every actor that could write to the cluster.&lt;/p&gt;

&lt;p&gt;This sounds obvious written down. In the middle of an incident, with executives asking for an ETA on dashboards, the instinct is to keep patching. We have run this play enough times now to know the patching never converges. You burn three more hours and your changes still revert. The only path out is persistence-first triage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seven places state was being rewritten from
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A kworker thread holding a kubeconfig&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We started on the nodes. ps auxf on each worker showed a process named [kworker/u8:2-events_unbound]. Square brackets usually mean a kernel thread, and you learn early not to touch kernel threads. We almost moved on. The thing that snagged our attention was CPU: a real kernel worker thread on an idle-ish node should not be sitting at 12 percent. We pulled its open file descriptors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -l /proc/$(pgrep -f 'kworker/u8:2')/fd/ 2&amp;gt;/dev/null | head
lr-x------ 1 root root 64 ... 3 -&amp;gt; /root/.kube/config
lrwx------ 1 root root 64 ... 7 -&amp;gt; socket:[884213]
lr-x------ 1 root root 64 ... 9 -&amp;gt; /opt/.reconciler/state.json
$ cat /proc/$(pgrep -f 'kworker/u8:2')/comm
kworker/u8:2-events_unbound
$ readlink /proc/$(pgrep -f 'kworker/u8:2')/exe
/opt/.reconciler/agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Kernel threads do not hold kubeconfigs or have an exe link. This was a userspace binary with a spoofed comm name.&lt;/em&gt;&lt;/p&gt;
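
&lt;p&gt;The same check can be automated across a node. The sketch below is an assumption-laden illustration rather than the tool we used: the function name and the default prefix list are ours, and it relies only on the fact shown above, that a real kernel thread has no resolvable /proc/PID/exe link.&lt;/p&gt;

```python
import os


def find_spoofed_kernel_threads(proc_root="/proc",
                                prefixes=("kworker", "flush-", "mm_percpu_wq")):
    """Return (pid, comm, exe) for processes named like kernel threads
    that nevertheless have an exe link, which real kernel threads lack."""
    prefixes = tuple(prefixes)
    suspects = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "comm")) as handle:
                comm = handle.read().strip()
            exe = os.readlink(os.path.join(base, "exe"))
        except OSError:
            # Real kernel threads fail here; so do exited or unreadable
            # processes. All of them are skipped.
            continue
        if comm.startswith(prefixes):
            suspects.append((int(entry), comm, exe))
    return suspects
```

&lt;p&gt;Run it as root, otherwise exe links owned by other users are unreadable and a spoofed process would be skipped along with the genuine kernel threads.&lt;/p&gt;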

&lt;p&gt;That was reconciler one. The same trick was on every node, with comm names rotating through plausible kernel thread names (kworker variants, flush-dm-0, mm_percpu_wq). We collected the binary, killed every instance, removed the systemd unit that was respawning it, and moved on. Then we did the boring sweep nobody wants to do in the middle of an incident.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kubectl get cronjobs -A surfaced config-audit in kube-system and prometheus-metrics-federation in cattle-monitoring-system. Neither was ours. Both ran every 60 seconds and wrote ConfigMaps.&lt;/li&gt;
&lt;li&gt;systemctl list-timers on each node showed k8s-health-monitor.timer firing every two minutes against the API server with a node-local kubeconfig.&lt;/li&gt;
&lt;li&gt;ls /etc/cron.d/ had a host cron entry running a script under /opt/.reconciler/ once a minute as a belt-and-braces backup to the systemd timer.&lt;/li&gt;
&lt;li&gt;kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations turned up pod-policy-webhook, namespace-policy-webhook, and the one that hurt us most, rbac-policy-enforcer.&lt;/li&gt;
&lt;li&gt;The immutable attribute (chattr +i) was set on /etc/cron.d/k8s-health and on the corrupted ConfigMap manifests staged on disk. Edits failed with 'operation not permitted', which the immutable flag enforces even for root.&lt;/li&gt;
&lt;li&gt;Finalizers on the CronJobs prevented kubectl delete from completing until we patched them off.&lt;/li&gt;
&lt;li&gt;PodSecurity labels on cattle-monitoring-system were set to enforce a baseline that blocked our debug pods from running.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seven places. Any one of them, left running, would have re-corrupted the stack within minutes of our fixes. Some teams have a reconciler. This cluster had a mesh of them, each one a backup for the others. That is not a thing healthy infrastructure does; it is a thing a previous incident or a hostile takeover does. Either way, the response is the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  The order we neutralized things, and why order matters
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Why we deleted the webhooks before touching RBAC&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There is a trap in this kind of cleanup. If you fix the visible problem before you neutralize the actor reverting it, you have wasted a fix and burned credibility with the room. The worst version of this in our case was the RBAC webhook. The Prometheus ClusterRoleBinding had been deleted entirely, and the deployment had been swapped to the default service account. The obvious move was to recreate the CRB and patch the deployment back to a proper SA.&lt;/p&gt;

&lt;p&gt;We tried it once, in a scratch namespace, just to see. The CRB came back with roleRef pointing at a ClusterRole that did not exist. The mutating webhook was matching anything with 'prometheus' or 'monitoring' in the name and silently rewriting the roleRef. If we had run that against the real CRB in production with the team watching, we would have looked like we did not know what we were doing, and the fix would not have worked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxF0Dtuw0AMBNA-p5gDyECQIkWKALbkTz5FIAdpFi5WK8oiLC0Ncm3FOX2gTZGajwMOu0Gm0HtN-KzugKXbcGyhFCQGHkjtgMXiGStX0yhXQsfRD_xDavCxReh9SooFH-6AVaalq2igRPDtyGYsERM1vcjJZlRmVLl9kjNKlfgqjRVIPJJagV4sIajE2VbZrt0bD8Pf5P8wnFUCmVFOXWe5cV-k3N0QJfUcj5iUExk6UTzeZ7jJcOs2_I0gsePjEz5URko9XazAVn3noy_wLqdcapsXdq6moOQToV4ty9x9T3rlQMsQ5BLTbHfZvriaLM0vPUtrBaQx0ivhASNHWPIND5xumDi2Mh1-AY5GfqY" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxF0Dtuw0AMBNA-p5gDyECQIkWKALbkTz5FIAdpFi5WK8oiLC0Ncm3FOX2gTZGajwMOu0Gm0HtN-KzugKXbcGyhFCQGHkjtgMXiGStX0yhXQsfRD_xDavCxReh9SooFH-6AVaalq2igRPDtyGYsERM1vcjJZlRmVLl9kjNKlfgqjRVIPJJagV4sIajE2VbZrt0bD8Pf5P8wnFUCmVFOXWe5cV-k3N0QJfUcj5iUExk6UTzeZ7jJcOs2_I0gsePjEz5URko9XazAVn3noy_wLqdcapsXdq6moOQToV4ty9x9T3rlQMsQ5BLTbHfZvriaLM0vPUtrBaQx0ivhASNHWPIND5xumDi2Mh1-AY5GfqY" alt="Neutralize first, then fix. RBAC and any 'monitoring'-named resource go last because the webhook would mutate them on creation." width="276" height="1094"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Neutralize first, then fix. RBAC and any 'monitoring'-named resource go last because the webhook would mutate them on creation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So the order was: strip finalizers from the CronJobs, chattr -i on the immutable files, delete the three webhook configurations, suspend and delete the CronJobs in kube-system and cattle-monitoring-system, mask the systemd timer, remove the host cron entry, kill the userspace reconciler processes on every node and remove their systemd unit. Then we sat for 60 seconds and watched. No ConfigMap mutations. No Deployment patches. Quiet cluster. That was the first time in nine hours the cluster had been quiet, and you could feel the room exhale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Restoring the observability stack once writes were ours alone
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The order we put it back together&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the reconcilers gone, the config fixes were the easy part. We did them top-down by data flow: scrape config, then service routing, then the consumers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Prometheus ConfigMap: restore annotation keys, fix namespace, drop interval&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring get cm prometheus-config &lt;span class="nt"&gt;-o&lt;/span&gt; yaml &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/prom-cm.yaml
&lt;span class="c"&gt;# edit: prometheus_io_metrics_* -&amp;gt; prometheus.io/scrape, /metrics, port&lt;/span&gt;
&lt;span class="c"&gt;#       namespaces: [bleater-nonexistent] -&amp;gt; the real app namespace&lt;/span&gt;
&lt;span class="c"&gt;#       scrape_interval: 300s -&amp;gt; 30s&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; /tmp/prom-cm.yaml

&lt;span class="c"&gt;# 2. Prometheus Service: targetPort 9099 -&amp;gt; 9090&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring patch svc prometheus &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'[{"op":"replace","path":"/spec/ports/0/targetPort","value":9090}]'&lt;/span&gt;

&lt;span class="c"&gt;# 3. Service account and RBAC (webhooks already deleted)&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring create sa prometheus
kubectl create clusterrolebinding prometheus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--clusterrole&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;prometheus &lt;span class="nt"&gt;--serviceaccount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;monitoring:prometheus
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring &lt;span class="nb"&gt;set &lt;/span&gt;serviceaccount deploy/prometheus prometheus

&lt;span class="c"&gt;# 4. Prometheus readiness probe: port 9099 /-/healthz -&amp;gt; 9090 /-/ready&lt;/span&gt;
&lt;span class="c"&gt;# 5. Loki: drop -server.http-listen-port=3199 arg, fix svc selector loki-server -&amp;gt; loki&lt;/span&gt;
&lt;span class="c"&gt;# 6. Grafana: remove init container, fix probe ports, drop GF_SERVER_HTTP_PORT,&lt;/span&gt;
&lt;span class="c"&gt;#    fix volume refs (-v2 -&amp;gt; base name), reset admin secret&lt;/span&gt;
&lt;span class="c"&gt;# 7. Delete NetworkPolicy grafana-egress-restrict&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring delete networkpolicy grafana-egress-restrict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;We applied these as separate kubectl operations on purpose, not a single helm rollout, so we could verify each one stuck before moving on.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After every step we waited 30 seconds and re-read the resource. Nothing reverted. We rolled the Grafana deployment, watched it come up clean with no init container blocking startup, hit the Prometheus targets page and saw 11 active up series including the application pods, then loaded a dashboard. Data. The two-minute stability window passed with no drift. We held the bridge for another 20 minutes anyway, because the team needed to see it not break more than they needed us to leave.&lt;/p&gt;
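
&lt;p&gt;The wait-and-re-read loop is easy to script. The following is a minimal sketch under our own naming: held_steady and resource_version are hypothetical helpers, and the kubectl call assumes the standard jsonpath output for metadata.resourceVersion.&lt;/p&gt;

```python
import subprocess
import time


def resource_version(namespace, kind, name):
    """Read an object's resourceVersion through kubectl (assumed on PATH)."""
    argv = ["kubectl", "-n", namespace, "get", kind, name,
            "-o", "jsonpath={.metadata.resourceVersion}"]
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def held_steady(read_value, checks=4, interval=30.0, sleep=time.sleep):
    """Return True if read_value() keeps returning the first value it
    returned across `checks` re-reads spaced `interval` seconds apart."""
    baseline = read_value()
    for _ in range(checks):
        sleep(interval)
        if read_value() != baseline:
            return False  # something other than us wrote to the object
    return True


if __name__ == "__main__":
    # Example: hold the two-minute window on the Prometheus ConfigMap.
    steady = held_steady(
        lambda: resource_version("monitoring", "configmap", "prometheus-config"))
    print("stable" if steady else "reverted or modified")
```

&lt;p&gt;resourceVersion changes on any write, so this works best on objects like ConfigMaps that nothing legitimate should be touching during the window.&lt;/p&gt;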

&lt;h2&gt;
  
  
  Persistence-first triage is now the default for post-migration observability failures
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we changed in our own playbook&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We have changed how we open any incident where fixes do not stick. The first 15 minutes are no longer spent on config. They are spent on a sabotage sweep: cronjobs in every namespace (not just the obvious ones; cattle-monitoring-system bit us and we have seen it bite others), systemd timers on every node, /etc/cron.d, validating and mutating webhooks, finalizers on resources we expect to delete, immutable file attributes on staged manifests, and a ps auxf on every node with an eye on anything in square brackets that has an exe link.&lt;/p&gt;

&lt;p&gt;We also changed how we think about kubectl edit during a live incident. If a change has to land and the cluster has any chance of having a reconciler we have not yet found, we apply through git and watch the apply, not edit the live object. It is slower by 90 seconds and saves you from spending an hour wondering why your fix evaporated. We have written more on the same instinct in our notes on &lt;a href="https://dev.to/problems/kubernetes-release-failures/"&gt;Kubernetes release failures&lt;/a&gt; and on &lt;a href="https://dev.to/argocd-gitops-recovery/"&gt;ArgoCD self-heal traps&lt;/a&gt;, which is the friendly version of this same pattern.&lt;/p&gt;

&lt;p&gt;The non-obvious lesson from this incident is that hostile or accidental reconcilers do not announce themselves. The kworker spoof was the cleverest piece; it would have survived a casual ps. The cattle-monitoring-system namespace looked legitimate to anyone who had ever run Rancher. The webhook had a name (rbac-policy-enforcer) that sounded like something a security team would install. In each case the move that surfaced it was boring: enumerate the category exhaustively, then ask which entries the team can account for. Anything they cannot account for is the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  When fixes revert, the problem is not the fix
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If your post-migration monitoring keeps un-fixing itself&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hard part of incidents like this is not the Prometheus annotation key or the Grafana port. Those take 20 minutes once the cluster stops fighting you. The hard part is having the discipline to stop patching and inventory every actor that can write to your cluster, especially when leadership is asking for an ETA and your instinct is to keep typing. The hard part is also knowing what the categories of reconciler are. If you have never had to look for a mutating webhook that rewrites RBAC, or a host process pretending to be a kworker, the search takes hours. If you have seen it before, it takes 15 minutes.&lt;/p&gt;

&lt;p&gt;We run these recovery engagements every week. We have seen the kworker spoof twice this year, the cattle-monitoring-system CronJob trick three times, and the RBAC-mutating webhook in two unrelated post-migration incidents. The playbook is portable; the patience to run it before patching is the part teams in the middle of an outage struggle with, and that is usually why they call us.&lt;/p&gt;

&lt;p&gt;If your dashboards are blank after a migration and your fixes are not sticking, &lt;a href="https://dev.to/review/"&gt;book an infrastructure review with our team&lt;/a&gt; and we will be on a bridge with you the same day. Bring node SSH access, kubectl with cluster-admin, and a list of every namespace you can name. We will handle the rest.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/grafana-no-data-after-migration-reconcilers-reverting-fixes/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/grafana-no-data-after-migration-reconcilers-reverting-fixes/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>k8s</category>
      <category>reliability</category>
      <category>kubernetescicd</category>
    </item>
    <item>
      <title>When MinIO Deny Wins Cause Silent Upload Failure</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Thu, 21 May 2026 01:44:45 +0000</pubDate>
      <link>https://dev.to/infraforge/why-minio-uploads-return-200-and-never-land-a-deny-wins-iam-trap-1eko</link>
      <guid>https://dev.to/infraforge/why-minio-uploads-return-200-and-never-land-a-deny-wins-iam-trap-1eko</guid>
      <description>&lt;p&gt;The dashboards were green. The api-gateway logged 12,400 successful media POSTs over six hours, the storage service SDK reported 200 on every PutObject, and the fanout queue happily processed every notification. The MinIO bucket had gained zero new objects in the same window. Users were seeing broken image tiles in their feeds and the on-call team had spent three hours chasing the fanout service because that was the only place the symptom was visible. The actual problem was an explicit Deny on s3:PutObject sitting inside a bucket policy that had been added during a security hardening sprint two days earlier, and MinIO was doing exactly what S3 IAM semantics say it should do: deny wins, even when the user policy says Allow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload endpoints return HTTP 200 but the object never appears in the bucket&lt;/li&gt;
&lt;li&gt;Bucket notification webhooks fire and downstream consumers process phantom events&lt;/li&gt;
&lt;li&gt;Grafana shows upload throughput as healthy because SDK success metrics dominate the panel&lt;/li&gt;
&lt;li&gt;Users report broken image links while every service-level dashboard is green&lt;/li&gt;
&lt;li&gt;A recent IAM or bucket policy change correlates in time with the start of phantom uploads&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The discrepancy that should have been the first alert
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;12,400 successful uploads, zero new objects&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We came in on the third hour of the incident. The team had been chasing the fanout consumer because user reports were all of the form 'my avatar is broken' and the only service touching media after upload was fanout. Their working theory was that fanout was racing the CDN, or that the notification payload was missing a key, or that signed URLs were expiring early. They had three engineers staring at fanout-service logs and finding nothing wrong, because there was nothing wrong with fanout-service.&lt;/p&gt;

&lt;p&gt;The question we asked, which is the question we always ask first when an upload pipeline misbehaves: how many objects has the bucket actually gained in the last hour? Not how many uploads the API recorded. Not how many notifications fanout received. How many real objects exist now that did not exist sixty minutes ago. We ran the listing against the MinIO admin API and the answer was zero. The bucket had not gained a single object since 02:14 that morning, which lined up almost exactly with the merge time of a security hardening PR the platform team had landed two days prior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# count objects added in the last hour
mc find local/bleater-media --newer-than 1h | wc -l
# 0

# meanwhile the storage-service success counter
curl -s http://prometheus/api/v1/query \
  --data-urlencode 'query=sum(increase(storage_service_put_object_success_total[1h]))'
# {"status":"success","data":{"result":[{"value":[..., "2074"]}]}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Two views of the same hour. The SDK was confident. The bucket was not.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once we had that gap on a shared screen the room changed. The fanout investigation got paused. The new question was: why is the SDK reporting success for writes that never persisted?&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the 200 came from when the object never landed
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What the SDK thought, and what the server actually did&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the part of the story that is worth understanding even if you never touch MinIO. The storage service was using a streaming PutObject path. The client opens a connection, the server accepts headers and begins reading the body, and the bucket notification configuration is wired to fire on the API receipt of the PutObject call. In a healthy run, the server then writes the object, the response is 200, and the notification correctly reflects a real write. In our broken run, the server accepted the headers, fired the notification, evaluated the IAM policies, hit the explicit Deny, and closed the stream. The client SDK saw the connection close after headers were ack'd and treated it as success because the response framing looked clean enough at the transport layer. The notification had already gone out. The audit log recorded the deny. Nobody was reading the audit log.&lt;/p&gt;

&lt;p&gt;Enabling the MinIO audit target was the diagnostic turn. Two commands and the lie unwound itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mc admin config set local audit_webhook:1 \
  endpoint="http://collector:8080/minio-audit" enable=on
mc admin service restart local

# tail the collector for a few seconds
# {"api":{"name":"PutObject","bucket":"bleater-media",
#        "object":"avatars/u-83421.jpg","status":"AccessDenied",
#        "statusCode":403},
#  "requestClaims":{"accessKey":"storage-service"},
#  "error":{"message":"Access Denied.",
#           "source":["cmd/auth-handler.go:checkRequestAuthTypeCredential"]}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Audit log showed 403 AccessDenied on every PutObject from the storage-service identity. The client never saw it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The storage-service identity had a user policy that explicitly granted s3:PutObject on arn:aws:s3:::bleater-media/*. We confirmed this in two seconds. Which meant the deny had to be coming from somewhere else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bucket policy nobody had read since the hardening PR
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Where the explicit Deny was hiding&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MinIO, like S3, evaluates IAM in two layers. The user (or service account) policy attached to the identity is one layer. The bucket policy attached to the resource is the other. An explicit Deny in either layer overrides any Allow in either layer. The hardening PR had added a bucket policy intended to lock down a different identity, an analytics reader that had been overprovisioned, and the author had written it as a Deny with a NotPrincipal exception, which inverted the intended scope. The effective rule said: deny s3:PutObject on this bucket for everyone who is not the analytics-reader identity. That included the storage service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -s -u "$ADMIN:$SECRET" \
  "http://minio:9000/minio/admin/v3/get-bucket-policy?bucket=bleater-media" \
  | jq .

# {
#   "Version": "2012-10-17",
#   "Statement": [
#     {
#       "Sid": "RestrictWritesToAnalyticsReader",
#       "Effect": "Deny",
#       "NotPrincipal": { "AWS": ["arn:aws:iam:::user/analytics-reader"] },
#       "Action": ["s3:PutObject"],
#       "Resource": ["arn:aws:s3:::bleater-media/*"]
#     }
#   ]
# }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The bucket policy that swallowed every write. NotPrincipal with Deny is a footgun in any S3-compatible IAM.&lt;/em&gt;&lt;/p&gt;
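
&lt;p&gt;To make the deny-wins rule concrete, here is a deliberately simplified toy evaluator. It is not MinIO's implementation: it ignores conditions, NotAction, and most of the real policy grammar, and the storage-service ARN in the usage note is an assumption modelled on the analytics-reader ARN above.&lt;/p&gt;

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def _arns(element):
    """Accept either {"AWS": [...]} or a bare string such as "*"."""
    if isinstance(element, dict):
        return _as_list(element["AWS"])
    return _as_list(element)


def _principal_applies(statement, principal_arn):
    if "NotPrincipal" in statement:
        # Applies to everyone EXCEPT the listed identities.
        return principal_arn not in _arns(statement["NotPrincipal"])
    if "Principal" in statement:
        listed = _arns(statement["Principal"])
        return "*" in listed or principal_arn in listed
    return True  # identity policies carry no Principal element


def _applies(statement, principal_arn, action, resource):
    return (
        _principal_applies(statement, principal_arn)
        and any(fnmatchcase(action, p) for p in _as_list(statement["Action"]))
        and any(fnmatchcase(resource, p) for p in _as_list(statement["Resource"]))
    )


def is_allowed(policies, principal_arn, action, resource):
    """Explicit Deny in any policy wins; otherwise an Allow is required."""
    matching = [s for policy in policies for s in policy["Statement"]
                if _applies(s, principal_arn, action, resource)]
    if any(s["Effect"] == "Deny" for s in matching):
        return False
    return any(s["Effect"] == "Allow" for s in matching)
```

&lt;p&gt;Feeding it the user policy alone returns True for arn:aws:iam:::user/storage-service on s3:PutObject; adding the bucket policy above flips the result to False, because the NotPrincipal Deny matches every identity other than analytics-reader.&lt;/p&gt;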

&lt;p&gt;We have seen NotPrincipal misused in three separate engagements this year. It reads as if it means 'apply this rule to everyone except this principal' the same way a NotAction would, but the semantics interact badly with cross-account and service-account identities. If you are writing a Deny that you want scoped to a specific identity, write the Deny with Principal naming the identity you mean to block. Do not invert it. The blast radius of a wrong inversion is the entire bucket.&lt;/p&gt;
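
&lt;p&gt;A lint for this pattern is short enough to run on every policy change. The sketch below is one possible shape for such a check, with a function name of our own choosing; it only inspects the Effect and NotPrincipal keys and makes no attempt to judge whether the inversion was intended.&lt;/p&gt;

```python
import json


def risky_statements(policy):
    """Return the Sid (or position) of every statement that combines
    Effect Deny with NotPrincipal."""
    flagged = []
    for index, statement in enumerate(policy.get("Statement", [])):
        if statement.get("Effect") == "Deny" and "NotPrincipal" in statement:
            flagged.append(statement.get("Sid", "statement[%d]" % index))
    return flagged


if __name__ == "__main__":
    # The statement from the hardening PR, as shown above.
    policy = json.loads("""
    {"Version": "2012-10-17",
     "Statement": [{"Sid": "RestrictWritesToAnalyticsReader",
                    "Effect": "Deny",
                    "NotPrincipal": {"AWS": ["arn:aws:iam:::user/analytics-reader"]},
                    "Action": ["s3:PutObject"],
                    "Resource": ["arn:aws:s3:::bleater-media/*"]}]}
    """)
    print(risky_statements(policy))
```

&lt;p&gt;A non-empty result does not have to block the merge; requiring an explicit reviewer sign-off is usually enough to make someone read the statement out loud.&lt;/p&gt;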

&lt;p&gt;Before we touched anything we wanted to rule out the obvious adjacent causes, because removing a security-hardening policy at 06:00 without confirmation is the kind of fix that becomes its own incident. We checked credential expiry on the storage-service service account (valid for another 47 days), checked network policy for any new egress restrictions from the storage-service namespace (none), and confirmed bucket versioning was off so we were not chasing delete markers. The audit log had already told us the answer; we just wanted the rollback to be unambiguous when we wrote it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four-minute patch and the queue we had to reconcile
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Removing the Deny without re-opening the bucket&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Two questions before patching. First, did we want to fix the bucket policy in place, or revert the hardening PR entirely? We chose patch in place. The hardening PR had also tightened three other identities correctly, and reverting would have undone work that was real. Second, did we want to leave the analytics-reader restriction in some form? Yes, but written correctly. We rewrote the statement as an explicit Deny on the analytics-reader principal for write actions, which is what the author had intended.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt; /tmp/bleater-media-policy.json &amp;lt;&amp;lt;'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BlockAnalyticsReaderWrites",
      "Effect": "Deny",
      "Principal": { "AWS": ["arn:aws:iam:::user/analytics-reader"] },
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::bleater-media/*"]
    }
  ]
}
EOF

curl -s -u "$ADMIN:$SECRET" \
  -X PUT \
  --data-binary @/tmp/bleater-media-policy.json \
  "http://minio:9000/minio/admin/v3/set-bucket-policy?bucket=bleater-media"

# validate with a real write from the storage-service identity
curl -s -X PUT -T /tmp/canary.bin \
  -H "Authorization: ...storage-service-sigv4..." \
  http://minio:9000/bleater-media/canary/$(date +%s).bin

mc ls local/bleater-media/canary/ | tail -1
# [2024-...] 4.0KiB STANDARD 1717420831.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Replace the inverted NotPrincipal with an explicit Principal Deny, then prove with a canary that the storage-service identity can write.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The canary landed. Real uploads from the application resumed within the next minute as new requests came in. That fixed the forward path. It did not fix the past six hours.&lt;/p&gt;

&lt;p&gt;The phantom notification problem was harder to bound. The fanout service had processed roughly 12,400 notification events for objects that did not exist, which meant 12,400 user timelines contained references to media that would 404 forever. We pulled the notification log from the RabbitMQ stream and diffed against the actual object listing in the bucket. The count of phantom references came in at 12,387. We pushed a one-shot reconciliation job that re-emitted upload prompts to the affected users for any media uploaded in that window, because we had no way to recover the original bytes; the storage service had streamed them to a connection that was closed before persistence.&lt;/p&gt;
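
&lt;p&gt;The diff itself is a set difference. As a rough sketch, assuming the notification log and the bucket listing have each been exported to a list of object keys (the function name and the input format are ours, not something MinIO or RabbitMQ provides):&lt;/p&gt;

```python
def phantom_keys(notified_keys, bucket_keys):
    """Return notified object keys that are absent from the bucket,
    de-duplicated and in the order they were first notified."""
    present = set(bucket_keys)
    seen = set()
    phantoms = []
    for key in notified_keys:
        if key not in present and key not in seen:
            phantoms.append(key)
        seen.add(key)
    return phantoms


if __name__ == "__main__":
    # Hypothetical keys for illustration only.
    notified = ["avatars/u-1.jpg", "avatars/u-2.jpg", "avatars/u-1.jpg"]
    listed = ["avatars/u-2.jpg"]
    print(phantom_keys(notified, listed))
```

&lt;p&gt;Keeping first-seen order makes it straightforward to map each phantom key back to the earliest notification, which is the one that tells you which user to prompt.&lt;/p&gt;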

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxdj81uwjAQhO88xRxTqVFR1VMOlipoKlQFaMMLbJwNbCF2am-QePvK_EgtR_v7NLMT-WdkZ3kutA3UT4CBgoqVgZxiBoqYHYSd3pG6TiiqD7TlPHI4iuU7p0pKJW6xugPLBL6oaUSrzztWJtaR86P-iZ3lxtR1gfWq3uCp51YImfS0ZTQn5fgwAeo6N6YqsB511XyzVWRRA1OfYJUbsyzQjHbPCudVOrGk4h2y1_UCgS3LoEld5saUxcU5wV-ibGBSbq9JVQE-0mEkZSxeq0fsRDFndzrz663WO8f2XGEPPnKLrJ5_IDC1MY18nk6vZ-fGzIr0xupjApSXhve3za298wFD8JZjFLe9lZQFXqYvyK6S4yMHDByiROX2PMUrw6fvssCwI6e-_zf-EU3we3Y4iNtDHDrm9hee86q0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxdj81uwjAQhO88xRxTqVFR1VMOlipoKlQFaMMLbJwNbCF2am-QePvK_EgtR_v7NLMT-WdkZ3kutA3UT4CBgoqVgZxiBoqYHYSd3pG6TiiqD7TlPHI4iuU7p0pKJW6xugPLBL6oaUSrzztWJtaR86P-iZ3lxtR1gfWq3uCp51YImfS0ZTQn5fgwAeo6N6YqsB511XyzVWRRA1OfYJUbsyzQjHbPCudVOrGk4h2y1_UCgS3LoEld5saUxcU5wV-ibGBSbq9JVQE-0mEkZSxeq0fsRDFndzrz663WO8f2XGEPPnKLrJ5_IDC1MY18nk6vZ-fGzIr0xupjApSXhve3za298wFD8JZjFLe9lZQFXqYvyK6S4yMHDByiROX2PMUrw6fvssCwI6e-_zf-EU3we3Y4iNtDHDrm9hee86q0" alt="The notification fires before the deny evaluation completes. Every layer below MinIO sees success." width="1460" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The notification fires before the deny evaluation completes. Every layer below MinIO sees success.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed so the next deny-wins conflict is not silent
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The synthetic that would have caught this in 90 seconds&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The deeper lesson here is not about MinIO. It is that SDK success and server persistence are different facts, and most observability stacks conflate them. Every metric on the storage service dashboard came from the SDK return code. Every metric on the fanout dashboard came from notification receipt. Nothing in the stack was sourced from the only ground truth that mattered, which was the count of objects actually present in the bucket. The hardening PR could have done much worse than this and we would still have been blind.&lt;/p&gt;

&lt;p&gt;We made three changes after this incident. First, we added a synthetic that writes a canary object every 60 seconds and then lists the bucket to confirm the canary is there. The metric is the gap between the latest write and the latest confirmed read, and it alerts when that gap exceeds two intervals. This is the kind of probe we now build into every object-storage path we touch. Second, the MinIO audit webhook now ships to the log aggregation pipeline, with a Loki alert rule on any sustained rate of statusCode 403 for PutObject, scoped per identity. Third, we wrote a pre-merge check for bucket policy changes that flags any statement using NotPrincipal with Effect Deny and requires an explicit reviewer sign-off.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Loki alert: deny-wins on PutObject for any service identity
- alert: MinioPutObjectDenied
  expr: |
    sum by (accessKey) (
      rate({job="minio-audit"}
        | json
        | api_name = "PutObject"
        | api_statusCode = "403"
        [5m])
    ) &amp;gt; 0
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "MinIO denying PutObject for {{ $labels.accessKey }}"
    runbook: "Check bucket policy and user policy for explicit Deny statements."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The alert that would have paged the on-call within five minutes of the hardening PR rolling out.&lt;/em&gt;&lt;/p&gt;
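
&lt;p&gt;The canary synthetic is simple enough to sketch. The Python below is an illustration, not our production probe: FakeStore, CanaryProbe and tick are hypothetical names, and the in-memory store stands in for a real bucket client. It assumes a store that can report success on a write that never persists, which is the failure we hit.&lt;/p&gt;

```python
from dataclasses import dataclass, field


class FakeStore:
    """In-memory stand-in for a bucket (hypothetical; swap in a real client).

    When `deny` is True, put() still returns normally but nothing persists,
    mimicking an SDK that reports success for a write the server rejected.
    """

    def __init__(self) -> None:
        self.keys: set[str] = set()
        self.deny = False

    def put(self, key: str, body: bytes) -> None:
        if not self.deny:
            self.keys.add(key)

    def list_keys(self, prefix: str) -> set[str]:
        return {k for k in self.keys if k.startswith(prefix)}


@dataclass
class CanaryProbe:
    store: FakeStore
    started_at: float
    interval_s: float = 60.0
    max_gap_intervals: int = 2
    last_confirmed_at: float = field(init=False)

    def __post_init__(self) -> None:
        # Until the first confirmation, measure the gap from probe start.
        self.last_confirmed_at = self.started_at

    def tick(self, now: float) -> bool:
        """Write one canary, confirm it via a listing, return True to alert."""
        key = f"canary/{int(now)}"
        self.store.put(key, b"canary")  # the write's own result is ignored
        if key in self.store.list_keys("canary/"):
            self.last_confirmed_at = now
        gap = now - self.last_confirmed_at
        return gap > self.max_gap_intervals * self.interval_s
```

&lt;p&gt;The probe ignores the result of the write on purpose and trusts only the listing. With a 60 second interval and a threshold of two intervals, a silent deny trips the alert on the third missed canary.&lt;/p&gt;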

&lt;p&gt;If your bucket notifications drive downstream business logic, you have the same shape of risk we did. The notification path and the persistence path are not the same path, and the IAM evaluation sits between them. Assume nothing about server persistence based on SDK return codes. Read the audit log.&lt;/p&gt;
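
&lt;p&gt;The pre-merge policy check can be equally small. This is a sketch under assumptions: flag_deny_notprincipal is a hypothetical name, and it assumes the policy is a standard IAM-style JSON document whose Statement is either a single object or a list. It only flags statements for a human reviewer; it does not decide whether they are safe.&lt;/p&gt;

```python
import json


def flag_deny_notprincipal(policy_json: str) -> list[int]:
    """Return the indexes of statements that pair Effect Deny with NotPrincipal."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    # IAM-style policies allow a single statement object instead of a list.
    if isinstance(statements, dict):
        statements = [statements]
    return [
        index
        for index, statement in enumerate(statements)
        if statement.get("Effect") == "Deny" and "NotPrincipal" in statement
    ]


# Illustrative policy; the bucket and principal names are made up.
example_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Principal": {"AWS": ["svc-storage"]},
         "Action": ["s3:PutObject"], "Resource": ["arn:aws:s3:::media/*"]},
        {"Effect": "Deny", "NotPrincipal": {"AWS": ["admin"]},
         "Action": ["s3:PutObject"], "Resource": ["arn:aws:s3:::media/*"]},
    ],
})

print(flag_deny_notprincipal(example_policy))
```

&lt;p&gt;Wired into CI, a non-empty result would block the merge until a reviewer signs off on each flagged statement.&lt;/p&gt;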

&lt;h2&gt;
  
  
  When a hardening PR silently revokes write access in production
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If your object store is quietly lying to your monitors&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This class of incident is hard for a specific reason: every monitoring surface a normal team has built reports healthy, because every normal monitoring surface reads from the layer above the failure. The teams we work with that have hit this pattern were not careless. They had dashboards, they had alerts, they had error budgets. None of those instruments were positioned to see a server-side deny that the SDK swallowed. The fix is a small synthetic and an audit log alert, and they take an afternoon to build. Getting to the point of knowing you need them usually takes one bad incident.&lt;/p&gt;

&lt;p&gt;We run object-storage and IAM recovery engagements often enough that this exact shape, a hardening PR introducing a deny-wins conflict against a service account, has come up three times this year on three different stacks (MinIO, Ceph RGW, and AWS S3 with an SCP). The mechanics are the same in all three. If your team is staring at green dashboards and broken user reports, the gap between SDK success and ground-truth persistence is the first place to look. If you want a second set of eyes on a hardening rollout before it lands, or you are inside one of these incidents right now, &lt;a href="https://dev.to/review/"&gt;book an infrastructure review with our team&lt;/a&gt; and we will be on a bridge with you the same day. We also document the audit-log and synthetic patterns in more depth on the &lt;a href="https://dev.to/infrastructure-audit-readiness/"&gt;infrastructure audit readiness&lt;/a&gt; page if you want to read ahead.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/minio-deny-wins-silent-upload-failure/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/minio-deny-wins-silent-upload-failure/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>object</category>
      <category>storage</category>
      <category>recovery</category>
      <category>auditreadiness</category>
    </item>
    <item>
      <title>ArgoCD Drift: Three Namespaces, One JWT Hotfix</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Wed, 20 May 2026 22:09:53 +0000</pubDate>
      <link>https://dev.to/infraforge/argocd-drift-across-3-namespaces-after-a-jwt-hotfix-how-we-reconciled-without-breaking-auth-3g4l</link>
      <guid>https://dev.to/infraforge/argocd-drift-across-3-namespaces-after-a-jwt-hotfix-how-we-reconciled-without-breaking-auth-3g4l</guid>
      <description>&lt;p&gt;The on-call team had been chasing a 30% 401 rate on profile-service for two hours when we got pulled in. Only profile-service, only some pods, only authenticated requests. The shape of that number is what gave it away: a 30% failure rate on a service backed by a 3-pod deployment is close to what you see when one pod out of three is running with a different config. Except it was not a config rollout in flight. It was a week-old JWT key rotation hotfix that had landed in the live cluster and never made it to Git, and ArgoCD auto-sync had been disabled across three applications and quietly left off. By the time we opened a terminal there were four copies of the same ConfigMap floating around: one in Git, three in three namespaces, and no two namespaces in agreement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A service is returning 401s on a fraction of requests that matches a pod count ratio (roughly 33% on 3 pods, 25% on 4 pods)&lt;/li&gt;
&lt;li&gt;ArgoCD shows applications as OutOfSync but auto-sync is disabled and nobody remembers turning it off&lt;/li&gt;
&lt;li&gt;kubectl diff against the rendered Helm or Kustomize output shows changes nobody can attribute to a recent PR&lt;/li&gt;
&lt;li&gt;Multiple namespaces have a propagated copy of the same ConfigMap and the copies disagree&lt;/li&gt;
&lt;li&gt;A recent incident postmortem mentions a manual kubectl edit or kubectl patch that was never followed by a Git commit&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The first 20 minutes: mapping how far the drift had spread
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Four copies of one ConfigMap, three different value pairs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The initial theory from the on-call lead was that a pod had missed the last restart and was still holding the pre-rotation JWT public key. Reasonable theory. It was wrong, but only because it was incomplete.&lt;/p&gt;

&lt;p&gt;We ran the obvious diff first. Pull the ConfigMap from each of the three namespaces, pull the manifest from the Git repo at HEAD, compare. What we expected to find was two values: a correct one in the cluster and a stale one in Git, or the reverse. What we actually found was four copies holding three distinct value pairs, with no two namespaces matching and only profile-service still matching Git.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# auth-service namespace
$ kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-11-rot

# like-service namespace (propagated copy)
$ kubectl -n like get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
RS256 key-2024-09

# profile-service namespace (propagated copy)
$ kubectl -n profile get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM} {.data.JWT_PUBLIC_KEY_ID}'
HS256 key-2024-09

# Git, main branch
$ grep -E 'JWT_(ALGORITHM|PUBLIC_KEY_ID)' deploy/*/auth-config.yaml
deploy/auth/auth-config.yaml:  JWT_ALGORITHM: HS256
deploy/auth/auth-config.yaml:  JWT_PUBLIC_KEY_ID: key-2024-09
# (and the same stale pair in like and profile manifests)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;What the diff actually showed. Four copies of the same ConfigMap, three distinct value pairs.&lt;/em&gt;&lt;/p&gt;
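
&lt;p&gt;When several copies of one ConfigMap are in play, grouping the sources by value makes the disagreement explicit. A minimal Python sketch, using the four pairs from the output above; group_sources_by_value is a hypothetical helper, and in practice the values would come from kubectl and the Git checkout rather than a literal.&lt;/p&gt;

```python
from collections import defaultdict

Pair = tuple[str, str]  # (JWT_ALGORITHM, JWT_PUBLIC_KEY_ID)


def group_sources_by_value(copies: dict[str, Pair]) -> dict[Pair, list[str]]:
    """Map each distinct value pair to the sources that carry it."""
    groups: dict[Pair, list[str]] = defaultdict(list)
    for source, pair in copies.items():
        groups[pair].append(source)
    return dict(groups)


# Values copied from the kubectl and grep output shown above.
copies = {
    "auth": ("RS256", "key-2024-11-rot"),
    "like": ("RS256", "key-2024-09"),
    "profile": ("HS256", "key-2024-09"),
    "git": ("HS256", "key-2024-09"),
}

for pair, sources in group_sources_by_value(copies).items():
    print(pair, sources)
```

&lt;p&gt;On these values the grouping yields three distinct pairs: profile and Git share the stale one, while auth and like each stand alone.&lt;/p&gt;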

&lt;p&gt;The story behind the four copies was quick to reconstruct from the previous week's incident channel. During the rotation, an SRE had patched auth-service's ConfigMap directly with the new RS256 key. They then walked the change into the like-service namespace and got the algorithm right but typo'd the key ID, leaving the old one. They ran out of focus before reaching profile-service, intended to come back to it, and did not. ArgoCD auto-sync had been disabled across all three applications during the incident as a guardrail and never re-enabled, which is the only reason the cluster state had survived a week without ArgoCD reverting it to the stale Git values.&lt;/p&gt;

&lt;p&gt;So the 30% 401 rate had a clean explanation. profile-service's pods had been restarted at some point and picked up the HS256 config from the unpatched ConfigMap. The auth-service was now issuing RS256-signed tokens. profile-service was trying to validate them as HS256 with the wrong key ID. The only requests that did not 401 were the ones that happened to skip the auth path entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decision that almost broke production a second time
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Why Git was the wrong source of truth&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The instinct, when you find drift between Git and a cluster, is to trust Git. That is the whole point of GitOps. The pull request is the source of truth and the cluster is downstream. Run an ArgoCD sync, let it overwrite the live state, move on.&lt;/p&gt;

&lt;p&gt;That instinct would have broken auth-service inside of 30 seconds. Git held the pre-rotation HS256 values. The new private key that auth-service was signing tokens with did not match the public key Git was about to push into the ConfigMap. A sync from Git would have invalidated every token in flight across all three services, not just 30% of them.&lt;/p&gt;

&lt;p&gt;We had to invert the model. For this one incident, the auth-service namespace's live ConfigMap was the canonical truth, and Git was stale. The recovery had to flow live-to-Git first, then Git-to-cluster for the other two namespaces, and only then could auto-sync be turned back on. The order mattered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxl0EFPwkAQBeA7v-Ld7SIt4MFEk9KiIVFM0HhpOazbKWzY7m62C9KkP96UoClynGTyvTdTKvMtttx5fKQDIM743m9ZTe4gBUHJAyExupSbV25zvXqPpncBdtSwaBRNWBgyZ3yuk3j5tlwk8csajD224RDzozXOg-sCwlSV9C1m2bP09yjIKtPcdjnB76DkjoJcnyfrTCkVoeJallT7GntbcE_FegDMTgnRELHbmCRF3WiBfulcQxtmbACuHPGigTDOkfAtksvr3hst-uj4Eu1K9dBSHqlG7bmi7gFYpC3SrL90BU4uwfNdVyZXG-Ok31a4-ZPn2b_tHj4_4dMhPsnJssFkFMJxT3jAqMVTtiJGmn8p6v5i2CnbaHClMAa3tl7_AMw_qTk" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkroki.io%2Fmermaid%2Fpng%2FeJxl0EFPwkAQBeA7v-Ld7SIt4MFEk9KiIVFM0HhpOazbKWzY7m62C9KkP96UoClynGTyvTdTKvMtttx5fKQDIM743m9ZTe4gBUHJAyExupSbV25zvXqPpncBdtSwaBRNWBgyZ3yuk3j5tlwk8csajD224RDzozXOg-sCwlSV9C1m2bP09yjIKtPcdjnB76DkjoJcnyfrTCkVoeJallT7GntbcE_FegDMTgnRELHbmCRF3WiBfulcQxtmbACuHPGigTDOkfAtksvr3hst-uj4Eu1K9dBSHqlG7bmi7gFYpC3SrL90BU4uwfNdVyZXG-Ok31a4-ZPn2b_tHj4_4dMhPsnJssFkFMJxT3jAqMVTtiJGmn8p6v5i2CnbaHClMAa3tl7_AMw_qTk" alt="Recovery flow. Live state was canonical for one application, Git was canonical after the commit for the other two." width="761" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Recovery flow. Live state was canonical for one application, Git was canonical after the commit for the other two.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How we got the canonical values into Git and synced the stragglers
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Committing a live hotfix back to Git without breaking auth&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The commit itself was unremarkable once we had a clear model. We pulled the auth-service ConfigMap, extracted the two fields, and updated all three manifests in the deploy repo in a single PR with a postmortem link in the description. The PR title was 'Hotfix reconcile: commit post-rotation JWT values from live state (incident #INC-441)' because future-us was going to want to know why these values arrived without an upstream change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Export canonical values from auth-service namespace
KID=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_PUBLIC_KEY_ID}')
ALG=$(kubectl -n auth get cm auth-config -o jsonpath='{.data.JWT_ALGORITHM}')

# 2. Patch the three manifests in the Git checkout, commit, push
for d in deploy/auth deploy/like deploy/profile; do
  yq -i ".data.JWT_PUBLIC_KEY_ID = \"$KID\" | .data.JWT_ALGORITHM = \"$ALG\"" "$d/auth-config.yaml"
done
git add deploy/auth deploy/like deploy/profile
git commit -m 'Reconcile JWT config from live auth-service (post-rotation hotfix, INC-441)'
git push

# 3. Trigger ArgoCD sync per application, in order
for app in auth-service like-service profile-service; do
  # Stop at the first failed sync or unhealthy app instead of moving on
  argocd app sync "$app" --prune=false || break
  argocd app wait "$app" --health --timeout 180 || break
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The commit and the sync sequence. auth-service syncs first as a no-op safety check before we touch the broken ones.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We synced auth-service first deliberately. It was already correct, so the sync should be a no-op. If it had shown a diff we did not expect, that was our signal to stop and re-audit before touching like-service or profile-service. It came back clean, which told us our commit matched the live state exactly. Then like-service synced and went healthy. Then profile-service synced and within 40 seconds the 401 rate in Prometheus went from 31% to 0.&lt;/p&gt;

&lt;p&gt;Auto-sync we left off until the 401 rate had been at zero for ten minutes and we had eyes on the Jaeger traces showing fresh successful auth flows end to end. Only then did we re-enable auto-sync on all three applications, in the same order as the sync. We have written more about the order-of-operations on multi-app reconciles in &lt;a href="https://dev.to/argocd-gitops-recovery/"&gt;the ArgoCD and GitOps recovery playbook&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two cheap controls that prevent the next split-state week
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we changed about hotfix discipline after this one&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The technical recovery was straightforward once the model was right. The interesting part of this incident was how a one-hour rotation hotfix turned into a week of latent drift. Two things had to go wrong together: a manual change that did not get committed, and an auto-sync toggle that did not get turned back on. Either one of those failing alone would have been caught within an hour by ArgoCD's reconciliation loop.&lt;/p&gt;

&lt;p&gt;We made two changes to the platform after this. The first was a scheduled job that lists ArgoCD applications with auto-sync disabled and posts to a channel if any of them have been in that state for more than four hours. It is twelve lines of bash around argocd app list -o json. It has caught the same pattern twice in the last quarter, both times within the same incident as the original change instead of a week later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Posted to platform-alerts when auto-sync has been off for &amp;gt;4h on any app
argocd app list -o json \
  | jq -r '.[] | select(.spec.syncPolicy.automated == null)
            | [.metadata.name, .status.operationState.finishedAt] | @tsv' \
  | awk -v cutoff="$(date -u -d '4 hours ago' +%FT%TZ)" '$2 &amp;lt; cutoff'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The auto-sync watchdog. The cheapest control with the highest ROI we shipped this year.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The second change was a rule we now apply to every incident we run: if a hotfix lands in the cluster via kubectl, the same incident does not close until the change is in a merged PR. Not the next day. Not 'we'll get to it'. The incident commander treats the Git commit as a recovery step, not a follow-up. That sounds like a process rule, and it is, but it has a sharp version: the on-call's runbook for manual ConfigMap patches now includes the export-and-PR commands at the bottom of the same page. The friction to do it right is now lower than the friction to defer it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the cluster and Git disagree and you cannot just sync your way out
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If your GitOps is in a split state right now&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hard part of this kind of incident is not the kubectl or the argocd CLI. The hard part is figuring out which system is the source of truth for which field right now, when the answer is not 'Git, always'. Get that wrong and an ArgoCD sync will take production down a second time on top of whatever is already broken. We have seen the same shape of failure four times this year: a rotation, a migration, an emergency schema change, and a CRD upgrade, each of which left some subset of clusters carrying values that Git did not yet know about.&lt;/p&gt;

&lt;p&gt;InfraForge runs these reconciles every week. We know the order to commit, the order to sync, the checks that catch a propagated copy you forgot about, and the questions to ask before you trust Git over the live state. If your auto-sync has been off for a week and you are not sure what would happen when you turn it back on, &lt;a href="https://dev.to/review/"&gt;book an infrastructure review with our team&lt;/a&gt; and we will be on a bridge with you the same day to walk the drift before you touch anything.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/argocd-drift-three-namespaces-jwt-configmap-hotfix/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/argocd-drift-three-namespaces-jwt-configmap-hotfix/&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>gitops</category>
      <category>recovery</category>
      <category>gitopsargocd</category>
    </item>
    <item>
      <title>How we recovered tfstate after force-unlock raced a CI apply</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Tue, 19 May 2026 22:37:05 +0000</pubDate>
      <link>https://dev.to/infraforge/how-we-recovered-tfstate-after-force-unlock-raced-a-ci-apply-52mj</link>
      <guid>https://dev.to/infraforge/how-we-recovered-tfstate-after-force-unlock-raced-a-ci-apply-52mj</guid>
      <description>&lt;p&gt;The engineer pinged us at 4:48 pm on a Thursday. They had been trying to push a small IAM change to staging, terraform apply had failed with Error acquiring the state lock, and they did what most of us have done at least once: they ran terraform force-unlock with the ID from the error message and re-ran apply. The apply went through. Ten minutes later a teammate on a different branch ran terraform plan and the plan output wanted to destroy and recreate 38 resources that were sitting healthy in AWS, returning 200s, serving traffic. By the time we joined the bridge, the original engineer was halfway convinced they needed to let Terraform rebuild the whole staging environment. They did not. The cloud was fine. The state file was the thing that was broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;terraform plan shows -/+ destroy and recreate for resources nobody touched and that are healthy in the cloud&lt;/li&gt;
&lt;li&gt;Teammates see Error: state snapshot was created by Terraform v1.5.7, which is newer than current v1.5.4&lt;/li&gt;
&lt;li&gt;S3 bucket versioning shows two or three tfstate writes inside a 60 to 90 second window&lt;/li&gt;
&lt;li&gt;The DynamoDB lock table is empty but the state file timestamps do not line up with anyone's apply log&lt;/li&gt;
&lt;li&gt;Someone on the team ran terraform force-unlock in the last hour&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A stale lock from a dead CI job
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What the engineer thought it was&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first wrong model was reasonable. The engineer saw Error acquiring the state lock, looked at the lock ID, did not recognize it, and assumed it was a leftover from a CI job that had crashed earlier in the week. They had seen stale locks before. The fix last time was force-unlock. So they ran it again.&lt;/p&gt;

&lt;p&gt;What they did not check was whether the lock holder was actually still alive. The CI job that held the lock was a scheduled terraform plan cycle running on a 15-minute cadence, and that particular run was on the slow side because the workspace had grown to about 600 resources. It was not stuck. It was just working. The force-unlock removed the lock entry from DynamoDB while the CI process was still very much holding an in-memory version of the state file, mid-refresh. Two writers, no coordination.&lt;/p&gt;

&lt;p&gt;When the engineer's apply finished, it wrote its version of the state to S3. About forty seconds later, the CI run finished its refresh and wrote its version of the state to S3 on top of that. Two non-linear writes, each thinking it had the latest state, each clobbering parts of the other. S3 versioning preserved both, but the live state pointer was pointing at a Frankenstein.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three S3 versions in 90 seconds, and a plan that wanted to destroy healthy infrastructure
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The moment the real cause became visible&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We pulled the S3 object versions for the state file first. That is the single most useful command in a Terraform state incident, and most teams do not run it until someone external suggests it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3api list-object-versions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bucket&lt;/span&gt; acme-tfstate-staging &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prefix&lt;/span&gt; &lt;span class="nb"&gt;env&lt;/span&gt;/staging/terraform.tfstate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Versions[?LastModified&amp;gt;=`2024-01-18T16:45:00Z`].[VersionId,LastModified,Size]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table

&lt;span class="c"&gt;# Output (abridged):&lt;/span&gt;
&lt;span class="c"&gt;# VersionId                          LastModified               Size&lt;/span&gt;
&lt;span class="c"&gt;# 9f3aV2.JqL...                      2024-01-18T16:51:12Z       412847&lt;/span&gt;
&lt;span class="c"&gt;# 8h2nB1.KpM...                      2024-01-18T16:50:31Z       408992&lt;/span&gt;
&lt;span class="c"&gt;# 7g1mA0.LoN...                      2024-01-18T16:49:48Z       411203&lt;/span&gt;
&lt;span class="c"&gt;# 6f0lZ9.MnO...                      2024-01-18T16:42:15Z       411198   &amp;lt;-- last known good&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Three writes inside 84 seconds. The 16:42 version was the last clean write before the collision.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Three writes in 84 seconds was the smoking gun. A healthy workspace writes state once per apply, and the next write is usually hours away. Three writes that close together meant at least two processes had been racing. We cross-checked against the CI logs and the engineer's shell history and confirmed: the CI plan cycle had been refreshing state from 16:49:48 onwards, the engineer's force-unlock landed at 16:50:18, the engineer's apply wrote state at 16:50:31, and the CI refresh wrote its stale view back at 16:51:12. The 16:51 write was the one Terraform was now reading, and it had been built from a refresh that started before half the engineer's changes existed.&lt;/p&gt;

&lt;p&gt;That explained the plan output. The state Terraform was reading said the resources had attributes that did not match reality. Plan diffed state against the cloud, saw the mismatch, and proposed the only thing it knows how to propose: destroy and recreate. The cloud was correct. The state was lying. If we had let the apply run, we would have taken a healthy staging environment offline for somewhere between 40 minutes and two hours to rebuild things that did not need rebuilding.&lt;/p&gt;
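
&lt;p&gt;The check for writes landing too close together is easy to automate against the version listing. The sketch below is illustrative: largest_burst and its 90 second default are assumptions of ours, and the timestamps are the LastModified values from the listing above.&lt;/p&gt;

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an S3-style UTC timestamp such as 2024-01-18T16:51:12Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def largest_burst(timestamps: list[str], window_s: float = 90.0) -> list[datetime]:
    """Return the largest group of writes that all fall within window_s seconds."""
    times = sorted(parse_ts(t) for t in timestamps)
    best: list[datetime] = []
    for i, start in enumerate(times):
        # Every write from `start` onwards that is still inside the window.
        group = [t for t in times[i:] if window_s >= (t - start).total_seconds()]
        if len(group) > len(best):
            best = group
    return best


# LastModified values from the list-object-versions output above.
writes = [
    "2024-01-18T16:51:12Z",
    "2024-01-18T16:50:31Z",
    "2024-01-18T16:49:48Z",
    "2024-01-18T16:42:15Z",
]

burst = largest_burst(writes)
span_s = (burst[-1] - burst[0]).total_seconds()
print(len(burst), "writes within", span_s, "seconds")
```

&lt;p&gt;A burst of more than one write inside the window is a prompt to look for racing processes, not proof of one. The threshold should reflect how often the workspace normally applies.&lt;/p&gt;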

&lt;h2&gt;
  
  
  Restore the pre-collision state version, then import only what actually drifted
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;How we worked through it&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The recovery had two parts and an order that mattered. First, replace the corrupted live state with the last clean S3 version. Second, figure out which resources genuinely changed during the collision window and re-import only those. Skipping the second step is how teams end up with the same incident a week later, because real changes from the engineer's apply have been silently rolled back.&lt;/p&gt;

&lt;p&gt;Before touching anything we pulled a local backup of the current (broken) state. If our restore went wrong, we wanted a way back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Backup the current broken state to local disk&lt;/span&gt;
aws s3api get-object &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bucket&lt;/span&gt; acme-tfstate-staging &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="nb"&gt;env&lt;/span&gt;/staging/terraform.tfstate &lt;span class="se"&gt;\&lt;/span&gt;
  ./tfstate.broken.&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;.json

&lt;span class="c"&gt;# 2. Restore the last known good version in place&lt;/span&gt;
aws s3api copy-object &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bucket&lt;/span&gt; acme-tfstate-staging &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="nb"&gt;env&lt;/span&gt;/staging/terraform.tfstate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--copy-source&lt;/span&gt; &lt;span class="s1"&gt;'acme-tfstate-staging/env/staging/terraform.tfstate?versionId=6f0lZ9.MnO...'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metadata-directive&lt;/span&gt; REPLACE

&lt;span class="c"&gt;# 3. Confirm the active version is now the restored one&lt;/span&gt;
aws s3api head-object &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bucket&lt;/span&gt; acme-tfstate-staging &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="nb"&gt;env&lt;/span&gt;/staging/terraform.tfstate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'VersionId'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The copy-object call writes the old version as a new current version. Do not delete versions; you want the audit trail intact.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the state restored, we ran terraform plan. The output was much shorter, around six resources, and they were the ones the engineer had actually changed in their apply. That was the divergence window: changes that had been made for real in AWS but that the restored state did not know about. Each of those needed a terraform import to reattach the live resource to the state. We did them one at a time, ran plan between each, and watched the diff shrink.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example: the engineer had created a new IAM role during their apply.&lt;/span&gt;
&lt;span class="c"&gt;# The restored state predates it, but the role exists in AWS.&lt;/span&gt;

terraform import &lt;span class="se"&gt;\&lt;/span&gt;
  module.platform.aws_iam_role.svc_runner &lt;span class="se"&gt;\&lt;/span&gt;
  acme-staging-svc-runner

&lt;span class="c"&gt;# After each import, re-run plan and confirm the resource is no longer in the diff.&lt;/span&gt;
terraform plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/plan.out

&lt;span class="c"&gt;# Repeat for each resource genuinely changed during the divergence window:&lt;/span&gt;
&lt;span class="c"&gt;# - 1 IAM role&lt;/span&gt;
&lt;span class="c"&gt;# - 1 IAM role policy attachment&lt;/span&gt;
&lt;span class="c"&gt;# - 2 security group rules&lt;/span&gt;
&lt;span class="c"&gt;# - 1 SSM parameter&lt;/span&gt;
&lt;span class="c"&gt;# - 1 Lambda permission&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Import surgically. Do not bulk-import; you want a clean plan after each step so you can spot collateral damage.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After the sixth import, terraform plan returned No changes. That was the success signal. The state matched the cloud, the engineer's intended changes were preserved, and nothing healthy had been destroyed. Total time on the bridge from first page to clean plan was 2 hours 40 minutes. About 45 minutes of that was the investigation; the rest was careful, slow imports with verification between each one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
  A[terraform plan shows mass destroy/recreate] --&amp;gt; B{Are the resources actually broken in cloud?}
  B -- No, healthy --&amp;gt; C[State file is the problem, not cloud]
  B -- Yes, broken --&amp;gt; Z[Different incident; investigate cloud-side]
  C --&amp;gt; D[list-object-versions on tfstate]
  D --&amp;gt; E{Multiple writes in short window?}
  E -- Yes --&amp;gt; F[Identify last clean version pre-collision]
  E -- No --&amp;gt; Y[Investigate other corruption causes]
  F --&amp;gt; G[Backup current broken state locally]
  G --&amp;gt; H[copy-object to restore clean version]
  H --&amp;gt; I[terraform plan: short diff = divergence window]
  I --&amp;gt; J[terraform import each drifted resource]
  J --&amp;gt; K{Plan empty?}
  K -- No --&amp;gt; J
  K -- Yes --&amp;gt; L[Recovery complete; write postmortem]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Decision flow we use for any state-collision incident. The first branch matters most: confirm the cloud is healthy before touching state.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Diagram renders at the &lt;a href="https://infraforge.agency/insights/terraform-force-unlock-state-divergence-recovery/#diagram" rel="noopener noreferrer"&gt;canonical version&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Two tempting shortcuts that would have made it worse
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we tried that we will not try again&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Two shortcuts came up on the bridge that we ruled out. They are worth naming because both of them sound reasonable when you are tired.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1. Let terraform apply rebuild everything.&lt;/strong&gt; The plan was already there. Just type yes. This would have caused 30 to 90 minutes of staging downtime for resources that did not need rebuilding, broken any data-layer resources with state of their own, and lost the audit trail of what had actually changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2. Run terraform refresh to fix the state.&lt;/strong&gt; Refresh updates state from the live infrastructure for known resources. It does not learn about resources the state has forgotten, and it cannot undo a structurally corrupted state. Refresh on a Frankenstein state can deepen the damage by writing the merged view back as the new truth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have written about the broader pattern in &lt;a href="https://dev.to/terraform-state-recovery/"&gt;the Terraform state recovery playbook&lt;/a&gt;, specifically the rule we now apply on every state incident: the state file is the suspect until proven otherwise. Cloud is healthy until you have evidence it is not. That ordering keeps you from running destructive applies under time pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  A pre-apply lock check that prints the holder's age
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we changed afterwards&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The team made two changes the week after the incident. Both are small. Both have already paid for themselves.&lt;/p&gt;

&lt;p&gt;The first change is a pre-apply wrapper script that reads the DynamoDB lock table before terraform apply runs. If a lock exists, the script prints the lock holder, when the lock was acquired, and how long ago that was. If the lock is younger than the workspace's typical apply duration plus a safety margin, the script refuses to run and tells the engineer to wait. If the lock is genuinely old (older than any plausible live process), the script still does not force-unlock automatically; it prints the exact force-unlock command and makes the engineer paste it. The friction is the point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="c"&gt;# pre-apply-lock-check.sh&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;WORKSPACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;:?workspace&lt;span class="p"&gt; name required&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;LOCK_TABLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"acme-tfstate-locks"&lt;/span&gt;
&lt;span class="nv"&gt;MAX_PLAUSIBLE_APPLY_SECONDS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1800  &lt;span class="c"&gt;# 30 minutes&lt;/span&gt;

&lt;span class="nv"&gt;LOCK_ITEM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws dynamodb get-item &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCK_TABLE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;LockID&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;S&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;acme-tfstate-staging/env/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;WORKSPACE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/terraform.tfstate&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; json&lt;span class="si"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# fail closed: an AWS error must not read as "no lock"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCK_ITEM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Item // empty'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"No lock. Safe to proceed."&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nv"&gt;HOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCK_ITEM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Item.Info.S'&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Who + " @ " + .Operation'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;CREATED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCK_ITEM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Item.Info.S'&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Created'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;AGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CREATED&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="k"&gt;))&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Lock present."&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  Holder:  &lt;/span&gt;&lt;span class="nv"&gt;$HOLDER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  Created: &lt;/span&gt;&lt;span class="nv"&gt;$CREATED&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  Age:     &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;AGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt; AGE &amp;lt; MAX_PLAUSIBLE_APPLY_SECONDS &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo
  echo&lt;/span&gt; &lt;span class="s2"&gt;"REFUSING TO PROCEED. Lock is younger than max plausible apply duration."&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Wait for the current holder to finish, or confirm out-of-band that it is dead."&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;echo
echo&lt;/span&gt; &lt;span class="s2"&gt;"Lock is older than &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MAX_PLAUSIBLE_APPLY_SECONDS&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s. It may be stale."&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"To force-unlock, run manually (do NOT automate this):"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  terraform force-unlock &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCK_ITEM&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.Item.Info.S'&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.ID'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;exit &lt;/span&gt;2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;We run this from CI and from the same pre-apply wrapper on engineer laptops. Same script, same rules, both places.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The second change is operational. The team's runbook now says: if you ever run force-unlock, page the on-call channel immediately with the lock ID and the reason. That single message would have caught this incident before it became one. The CI job would have replied within seconds that it was still running, and the engineer would have known to wait the eight minutes instead of clobbering the state.&lt;/p&gt;

&lt;p&gt;We have stopped recommending that teams treat force-unlock as a routine command. It is a recovery command. It belongs in the same mental category as DROP TABLE: technically available, occasionally necessary, never the first thing you reach for. Terraform never expires a lock on its own, and the age threshold in the wrapper is generous on purpose. Wait it out, or confirm the holder is dead. Those are the only two paths.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the state file is the suspect and the clock is running
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If you are looking at a destroy plan you do not trust&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hard part of state-collision incidents is not the recovery commands. The commands are mechanical once you know the shape of the problem. The hard part is the 20 minutes before that, when an apply plan is sitting in your terminal showing 30+ destroys, someone senior is asking on Slack whether you can just run it, and you have to decide whether the cloud is broken or the state is. Get that wrong under pressure and you cause the outage you were trying to prevent.&lt;/p&gt;

&lt;p&gt;We run these recovery engagements every week. The force-unlock-collision pattern has shown up four times this quarter alone, in three different shapes: a CI plan racing an engineer apply (this one), two engineers applying simultaneously after a Slack misunderstanding, and a long-running import operation that an engineer killed because they thought it had hung. The recovery shape is the same. The diagnostic discipline of confirming the cloud is healthy before touching state is the same. The thing that changes is which version of state is the right one to restore to, and that takes practice to spot quickly.&lt;/p&gt;

&lt;p&gt;If you are staring at a terraform plan that wants to destroy resources you know are healthy, do not run apply. &lt;a href="https://dev.to/review/"&gt;Book an infrastructure review with our team&lt;/a&gt; and we will be on a bridge with you the same day to work through the state restore and the surgical imports. We have done this enough times that we can usually have you back to an empty plan inside three hours.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/terraform-force-unlock-state-divergence-recovery/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/terraform-force-unlock-state-divergence-recovery/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>state</category>
      <category>recovery</category>
      <category>terraformstate</category>
    </item>
    <item>
      <title>Why a forgotten RDS replica added $8,600 to one AWS bill</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Tue, 19 May 2026 17:23:31 +0000</pubDate>
      <link>https://dev.to/infraforge/why-a-forgotten-rds-replica-added-8600-to-one-aws-bill-2k4d</link>
      <guid>https://dev.to/infraforge/why-a-forgotten-rds-replica-added-8600-to-one-aws-bill-2k4d</guid>
      <description>&lt;p&gt;The finance lead forwarded the AWS bill on a Monday morning with three question marks in the subject line. The number had gone from a steady $3,200/month to $11,800 in six days. The on-call engineer's first guess, sensible enough, was that a data scientist had left a cross-region Athena job running over the weekend. It was not. It was an RDS read replica in a different AZ from its primary, provisioned a month earlier for a one-off load test, never decommissioned, retrying a replication-stream write every 50 milliseconds because somebody had flipped the primary's binlog format mid-stream. Nobody had read from the replica in three weeks. It had been quietly burning cross-AZ data transfer the whole time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS bill jumped 2-4x in under a week with no traffic or feature change&lt;/li&gt;
&lt;li&gt;Cost Explorer concentrates the spike on DataTransfer-Regional-Bytes and RDSInstance line items&lt;/li&gt;
&lt;li&gt;An RDS read replica sits in a different AZ than its primary and shows jagged ReplicaLag (spikes to 30s, drops to 0.5s, repeats)&lt;/li&gt;
&lt;li&gt;No application config or BI tool actually points at the replica's endpoint&lt;/li&gt;
&lt;li&gt;Recent schema or replication change on the primary that nobody coordinated with replica owners&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Chasing the analytics query that did not exist
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we thought it was first&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Almost every cost spike I have seen in the last three years gets blamed on analytics first. There is usually a junior data person, a notebook, a forgotten SELECT *, and a story everyone tells themselves. So we did the natural thing. We pulled the Athena query history for the previous ten days. Nothing unusual. We checked Redshift, which the team barely uses. Idle. We checked the data warehouse cluster's autoscaling history. Flat.&lt;/p&gt;

&lt;p&gt;The clue was in Cost Explorer, but only when we grouped by usage type instead of by service. The RDS line item was up, sure, but the line item that had really moved was DataTransfer-Regional-Bytes. That is the meter for cross-AZ traffic inside a single region. Analytics queries do not typically light that meter up unless somebody has put a compute node in one AZ and the data in another, which would have been a much weirder problem.&lt;/p&gt;

&lt;p&gt;Cross-AZ data transfer at that volume meant something was constantly shipping bytes between two availability zones. The shape of the bill said: find the thing that talks to itself across AZs at high frequency.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we found the orphan replica
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The diagnostic turn&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We listed every RDS instance in the account and compared the AZ of each replica to its primary. One read replica was in us-east-1b while its primary was in us-east-1a. That alone is not a problem; cross-AZ replicas exist for legitimate HA reasons. What was odd was that this replica carried no ownership tags. No Owner. No Purpose. No Environment. Just a Name tag, which read load-test-replica-temp.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List replicas with their AZ and their primary's AZ&lt;/span&gt;
aws rds describe-db-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBInstances[?ReadReplicaSourceDBInstanceIdentifier!=`null`].[DBInstanceIdentifier,AvailabilityZone,ReadReplicaSourceDBInstanceIdentifier,DBInstanceStatus]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table

&lt;span class="c"&gt;# Then for each primary, get its AZ&lt;/span&gt;
aws rds describe-db-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; &amp;lt;primary-id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBInstances[0].AvailabilityZone'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The two commands that surfaced the orphan in about 30 seconds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The replica's CloudWatch ReplicaLag metric was the giveaway that this was not a healthy idle replica. It would spike to 30 seconds, drop to 0.5 seconds, spike again, every minute or so. That sawtooth pattern means the replication thread is failing and retrying. We pulled the replica's error log and found the same line repeating roughly every 50 milliseconds: a binlog format mismatch. Someone had changed the primary from MIXED to ROW format three weeks earlier, and the replica had been retrying the broken stream ever since.&lt;/p&gt;

&lt;p&gt;Every retry shipped a chunk of binlog across the AZ boundary. At 50ms intervals, 24 hours a day, for three weeks. That was the bill.&lt;/p&gt;
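
&lt;p&gt;To put a number on that retry rate, here is a short Python sketch of the arithmetic. It counts retries only: the bytes shipped per retry are not stated above, so no dollar figure is derived. The function name is illustrative.&lt;/p&gt;

```python
def retry_count(interval_ms, days):
    """Number of retries at a fixed interval over a number of whole days.

    Assumes the interval divides one second evenly (true for 50 ms).
    """
    retries_per_second = 1000 // interval_ms
    seconds_per_day = 24 * 60 * 60
    return retries_per_second * seconds_per_day * days


# 50 ms interval, three weeks (21 days), as described above.
print(retry_count(50, 21))  # 36288000
```

&lt;p&gt;That is about 1.7 million cross-AZ attempts per day, and a little over 36 million across the three weeks.&lt;/p&gt;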

&lt;h2&gt;
  
  
  The five-minute check that prevents the worse outcome
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we did before deleting anything&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The instinct, when you have found the thing burning money, is to kill it immediately. We did not. The worse outcome here is not 'replica costs another hour of cross-AZ transfer'. The worse outcome is 'replica gets deleted, a quarterly BI dashboard breaks on Friday, and finance is back in your inbox with a different question'.&lt;/p&gt;

&lt;p&gt;So we did the cheap verification first. We grepped the application monorepo for the replica's endpoint hostname. Zero hits. We checked the BI tool's data sources (Metabase in this case). Nothing pointed at it. We checked the data team's Airflow DAGs. Clean. We checked Terraform state to see how it had been created. It was in a workspace tagged load-test that had not been touched in a month, and the engineer who created it had left the company three weeks earlier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If something had pointed at it:&lt;/strong&gt; The right move would have been to keep the replica, fix the binlog format, and decide whether the read pattern actually justified cross-AZ. Deletion would have caused a worse incident than the cost spike.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nothing pointed at it:&lt;/strong&gt; Delete with --skip-final-snapshot. The replica was already corrupted by the binlog mismatch; a final snapshot was worthless. Cost stopped accruing within minutes.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws rds delete-db-instance &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; load-test-replica-temp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--skip-final-snapshot&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The actual delete, once we were confident nothing depended on the replica.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tag hygiene, expiration sweeps, and an anomaly budget that would have caught this on day 2
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we changed afterwards&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Forgotten resources are the largest single category of cloud waste I see in client accounts. Bigger than oversized instances. Bigger than reserved-instance gaps. The fix is mechanical. Every cost-generating resource needs three tags: Owner, Purpose, ExpiresAt. ExpiresAt is the one most teams skip and the one that does the work.&lt;/p&gt;

&lt;p&gt;We deployed a small Lambda on a weekly schedule that walks RDS, EC2, ELB, ElastiCache, and OpenSearch, finds resources past their ExpiresAt date or missing tags entirely, and posts to a Slack channel pinging the Owner. The owner has two weeks to either re-tag with a new ExpiresAt or delete. Resources with no Owner go to the platform team's queue. The first sweep flagged 47 resources across the account. Six of them were costing real money.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
  A[Weekly Lambda runs] --&amp;gt; B{Resource has&amp;lt;br/&amp;gt;Owner, Purpose,&amp;lt;br/&amp;gt;ExpiresAt tags?}
  B -- no --&amp;gt; C[Post to platform team queue]
  B -- yes --&amp;gt; D{ExpiresAt&amp;lt;br/&amp;gt;in past?}
  D -- no --&amp;gt; E[Skip]
  D -- yes --&amp;gt; F[DM the Owner in Slack]
  F --&amp;gt; G{Owner responds&amp;lt;br/&amp;gt;within 14 days?}
  G -- extends --&amp;gt; H[Update ExpiresAt]
  G -- no response --&amp;gt; I[Auto-tag for deletion&amp;lt;br/&amp;gt;review next sweep]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The sweep logic. About 180 lines of Python in practice.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Diagram renders at the &lt;a href="https://infraforge.agency/insights/forgotten-rds-replica-cross-az-cost-spike/#diagram" rel="noopener noreferrer"&gt;canonical version&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
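
&lt;p&gt;The branch logic in the diagram can be sketched in a few lines of Python. This is not the Lambda itself: &lt;code&gt;classify&lt;/code&gt;, the action names, and the ISO &lt;code&gt;YYYY-MM-DD&lt;/code&gt; format for &lt;code&gt;ExpiresAt&lt;/code&gt; are assumptions made for illustration, and the AWS and Slack calls are left out.&lt;/p&gt;

```python
from datetime import date

REQUIRED_TAGS = ("Owner", "Purpose", "ExpiresAt")


def classify(tags, today):
    """Return an (action, detail) pair for one resource's tag dict."""
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing:
        return ("platform_queue", "missing tags: " + ", ".join(missing))
    try:
        # Assumption: ExpiresAt is stored as an ISO date such as 2026-05-01.
        expires_at = date.fromisoformat(tags["ExpiresAt"])
    except ValueError:
        # Assumption: an unreadable date is handled like a missing tag.
        return ("platform_queue", "unparseable ExpiresAt: " + tags["ExpiresAt"])
    days_overdue = max(0, (today - expires_at).days)
    if days_overdue:
        return ("notify_owner", f"{tags['Owner']}: expired {days_overdue} days ago")
    return ("skip", "not expired")


# The orphan replica carried only a Name tag, so it lands in the platform queue.
print(classify({"Name": "load-test-replica-temp"}, date(2026, 5, 19)))
```

&lt;p&gt;Keeping the decision in a pure function like this makes it testable without touching an AWS account; the code that lists resources and posts to Slack can then stay thin.&lt;/p&gt;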

&lt;p&gt;The second change was AWS Budgets with anomaly detection scoped per service. The team had a single account-wide budget set at $5,000/month, which is useless for catching this kind of incident because the spike was concentrated in one service and the account total only crossed $5,000 on day five. A per-service budget on RDS set at $4,000 with a 20% variance threshold would have fired on day 2. The alert that matters is the one that fires before you have spent the money, not after.&lt;/p&gt;

&lt;p&gt;The third change was a process one. The original binlog format change had been an uncoordinated database tweak from a senior engineer who had not realized a replica existed. Schema and replication changes now require a checklist that includes 'list all replicas of this primary and confirm they support the new config' as a pre-flight step. It is not glamorous. It would have prevented the entire incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where cost spike triage gets stuck
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If your AWS bill just jumped and you do not know why&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The hard part of a cost spike is not finding the resource. It is being confident enough to delete it. Most teams we work with have at least one orphan RDS, ElastiCache, or NAT gateway they are afraid to touch because nobody remembers what depends on it. The triage takes a day; the courage to act takes a week of meetings. By then the bill has run another $2,000.&lt;/p&gt;

&lt;p&gt;We run cost spike triage engagements every month. We have seen the orphan-replica case four times this year, the NAT-gateway-in-the-wrong-AZ case more often than that, and a half dozen variants of 'load test that never got cleaned up' across CloudWatch Logs, OpenSearch, and Aurora Serverless. The pattern is almost always the same: a resource that nobody owns, a tag policy that was never enforced, and a budget alert tuned too coarse to catch concentration in a single service. We have written more on the underlying patterns in &lt;a href="https://dev.to/problems/cloud-cost-spikes/"&gt;the cloud cost spikes problem brief&lt;/a&gt; and across &lt;a href="https://dev.to/services/"&gt;our services&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your AWS bill jumped this month and you cannot point at the resource with confidence, &lt;a href="https://dev.to/review/"&gt;book an infrastructure review with our team&lt;/a&gt; and we will start with a 30-minute diagnostic call this week. Cost stops accruing the day we find the orphan.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/forgotten-rds-replica-cross-az-cost-spike/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/forgotten-rds-replica-cross-az-cost-spike/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cost</category>
      <category>spike</category>
      <category>triage</category>
      <category>costspikes</category>
    </item>
    <item>
      <title>Why terraform apply fails when plan passes: the map(any) trap</title>
      <dc:creator>Muhammad Hassaan Javed</dc:creator>
      <pubDate>Tue, 19 May 2026 17:14:53 +0000</pubDate>
      <link>https://dev.to/infraforge/why-terraform-apply-fails-when-plan-passes-the-mapany-trap-50dg</link>
      <guid>https://dev.to/infraforge/why-terraform-apply-fails-when-plan-passes-the-mapany-trap-50dg</guid>
      <description>&lt;p&gt;The on-call engineer pinged me at 4:42pm on a Friday with the release window open until 5:30. terraform apply against the staging workspace had failed with &lt;code&gt;Error: Unsupported argument&lt;/code&gt; deep inside a child module nobody on the team had touched in seven months. terraform plan against the same workspace ran clean. They had already re-run plan twice and gotten fresh no-op output both times. The shape of the failure was off. A plan that passes and an apply that fails in the way they were describing is rare, and you mostly see it on data sources that resolve at apply time, not on a static &lt;code&gt;merge()&lt;/code&gt; call inside a module whose code had not changed in seven months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;terraform plan succeeds locally but terraform apply fails on a specific environment&lt;/li&gt;
&lt;li&gt;The error is &lt;code&gt;Error: Unsupported argument&lt;/code&gt; or &lt;code&gt;Inappropriate value&lt;/code&gt; deep inside a child module&lt;/li&gt;
&lt;li&gt;The traceback points at a &lt;code&gt;merge()&lt;/code&gt; or &lt;code&gt;lookup()&lt;/code&gt; call inside a module that has not been edited in months&lt;/li&gt;
&lt;li&gt;Your root module input list has crossed 20 variables and several are typed &lt;code&gt;any&lt;/code&gt; or &lt;code&gt;map(any)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;There is no CI job that runs terraform plan against every environment on every PR&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Three hypotheses, three dead ends, twenty-two minutes left in the release window
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we ruled out in the first 18 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first thing the on-call lead suggested was state drift. Someone, somewhere, had &lt;code&gt;terraform import&lt;/code&gt;-ed a resource by hand. We checked the audit log. No &lt;code&gt;import&lt;/code&gt; events in the past 30 days. We checked the lock table in DynamoDB. The lock had been released cleanly by the previous successful apply at 2:11pm.&lt;/p&gt;

&lt;p&gt;The second hypothesis was provider version drift. The team had recently bumped &lt;code&gt;hashicorp/aws&lt;/code&gt; from 5.62 to 5.71 in &lt;code&gt;versions.tf&lt;/code&gt;. A breaking change in a resource schema can absolutely cause an &lt;code&gt;Unsupported argument&lt;/code&gt; error if apply pulls a newer provider than plan resolved against. We pinned both runs to 5.71 explicitly, deleted &lt;code&gt;.terraform/&lt;/code&gt;, re-ran &lt;code&gt;init&lt;/code&gt;, then &lt;code&gt;plan&lt;/code&gt;, then &lt;code&gt;apply&lt;/code&gt;. Same error, same module, same line.&lt;/p&gt;

&lt;p&gt;The third hypothesis was a stale workspace. Terraform workspaces sometimes diverge from the configuration if &lt;code&gt;workspace select&lt;/code&gt; was bypassed by an engineer who exported &lt;code&gt;TF_WORKSPACE&lt;/code&gt; and forgot. We ran &lt;code&gt;terraform workspace show&lt;/code&gt; and verified it matched the intended target. The plan output even confirmed the right resource addresses.&lt;/p&gt;

&lt;p&gt;Three explanations, three dead ends, twenty-six minutes burned since the 4:42pm ping. The release window was now twenty-two minutes wide and shrinking. The on-call lead asked whether we should just roll back the deploy and figure it out Monday. I asked one more question first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 15th map(any) input that had been silently incubating for three weeks
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Where the collision actually lived&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I asked the on-call lead to walk me through what had merged into the workspace in the past three weeks. There were six commits. Five were obvious changes (image tags, a new IAM policy, a security group port). The sixth was a feature flag, added as a 15th &lt;code&gt;map(any)&lt;/code&gt; input on the root module by an engineer who had joined six weeks earlier.&lt;/p&gt;

&lt;p&gt;That was the lead.&lt;/p&gt;

&lt;p&gt;The root module had 28 input variables. Fourteen of them were typed &lt;code&gt;any&lt;/code&gt; or &lt;code&gt;map(any)&lt;/code&gt; to absorb per-environment overrides accumulated over six years of feature additions. The new feature flag added a 15th &lt;code&gt;map(any)&lt;/code&gt; input named &lt;code&gt;feature_overrides&lt;/code&gt;. Its values flowed through a &lt;code&gt;merge()&lt;/code&gt; chain down to the database child module, which did its own &lt;code&gt;merge(var.feature_overrides, local.legacy_db_flags)&lt;/code&gt; inside &lt;code&gt;modules/services/database/locals.tf&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The two maps had a key collision. Both contained a key named &lt;code&gt;read_replica_routing&lt;/code&gt;. The new input's value was a &lt;code&gt;string&lt;/code&gt;. The legacy local's value was a &lt;code&gt;map(object({ host = string, weight = number }))&lt;/code&gt;. &lt;code&gt;merge()&lt;/code&gt; resolves collisions by taking the last argument's value, but the argument order in this case depended on which input was non-empty at apply time, and the new feature flag was only non-empty in staging.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
  participant Op as Operator
  participant Plan as terraform plan
  participant Apply as terraform apply
  participant Child as child module
  Op-&amp;gt;&amp;gt;Plan: feature_overrides (map(any))
  Plan-&amp;gt;&amp;gt;Child: merge(map(any), map(any))
  Child--&amp;gt;&amp;gt;Plan: any (type-check deferred)
  Plan--&amp;gt;&amp;gt;Op: 0 to add, 0 to change (PASS)
  Op-&amp;gt;&amp;gt;Apply: same input
  Apply-&amp;gt;&amp;gt;Child: merge resolved to concrete value
  Child--&amp;gt;&amp;gt;Apply: Error: Unsupported argument
  Apply--&amp;gt;&amp;gt;Op: FAIL at 4:42pm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;How &lt;code&gt;map(any)&lt;/code&gt; defers type-checking past plan and surfaces it at apply&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Diagram renders at the &lt;a href="https://infraforge.agency/insights/terraform-apply-fails-map-any-trap/#diagram" rel="noopener noreferrer"&gt;canonical version&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The collision had been latent for three weeks. &lt;code&gt;plan&lt;/code&gt; succeeded because Terraform's planner walked the call graph with both maps' element types collapsed to &lt;code&gt;any&lt;/code&gt;. The merged value passed type-check as &lt;code&gt;any&lt;/code&gt;, which type-checks against anything. &lt;code&gt;apply&lt;/code&gt;, which actually constructs the resource, evaluated the merged value against the receiving attribute's concrete type signature and discovered the value was a string where an object was required.&lt;/p&gt;

&lt;p&gt;That is the part that hurts. Terraform's &lt;code&gt;any&lt;/code&gt; type defers all type-checking until apply. Every &lt;code&gt;map(any)&lt;/code&gt; input on a root module is a future apply-time failure waiting on a contributor who does not know the implicit shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three options, one open release window, seven minutes to pick
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we did before running apply again&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We had three options and one open release window. I walked the on-call lead through them on the bridge call.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1. Delete the legacy key.&lt;/strong&gt; Fastest. Also the riskiest: the legacy &lt;code&gt;read_replica_routing&lt;/code&gt; key was referenced by three modules-of-modules three layers down. Deleting it would have moved the failure from staging to production an hour later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2. Rename the new key.&lt;/strong&gt; Safe-feeling. Left the underlying &lt;code&gt;any&lt;/code&gt;-typed contract intact. Two months later a different contributor would add another &lt;code&gt;map(any)&lt;/code&gt; input and we would be back on a Friday afternoon with the same shape of failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3. Rename plus add validation.&lt;/strong&gt; Slower. Renamed the new key to &lt;code&gt;feature_routing_overrides&lt;/code&gt; and added a &lt;code&gt;validation&lt;/code&gt; block on the input that explicitly rejects the colliding shape at plan time going forward. Stopped the immediate recurrence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Option three carried the day. The rename took seven minutes. The validation block took twelve. &lt;code&gt;apply&lt;/code&gt; succeeded at 5:14pm with sixteen minutes to spare on the release window. The release shipped on time.&lt;/p&gt;
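
&lt;p&gt;The validation block we wrote that night is not reproduced here. One way such a guard could look, assuming it simply rejects the colliding key name on the still-untyped input (the condition and error message below are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: a guard on the existing map(any) input.
variable "feature_overrides" {
  type        = map(any)
  default     = {}
  description = "Per-environment feature flag overrides"

  validation {
    # Fail if a caller passes the key that the legacy
    # database flags already use.
    condition     = !contains(keys(var.feature_overrides), "read_replica_routing")
    error_message = "Use feature_routing_overrides instead of read_replica_routing; that key is reserved by the legacy database flags."
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because variable validation is evaluated during plan, a contributor who reintroduces the key gets the error before anything is applied.&lt;/p&gt;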

&lt;p&gt;The audit work behind option one (the one we did NOT take) is what stuck with me. The next morning, we grepped the entire &lt;code&gt;terraform/&lt;/code&gt; tree for &lt;code&gt;read_replica_routing&lt;/code&gt; to map every consumer. Seven references across four modules. Three in &lt;code&gt;modules/services/database/locals.tf&lt;/code&gt; itself. One in &lt;code&gt;modules/monitoring/cloudwatch.tf&lt;/code&gt;. One in &lt;code&gt;modules/services/cache/lookups.tf&lt;/code&gt;, which read the value to construct its own routing decision and would have broken silently if we had deleted the legacy key the night before. The remaining two were in a state-recovery helper module the team had forgotten existed. We had nearly shot ourselves in the foot a second time.&lt;/p&gt;

&lt;p&gt;We left a tombstone comment on the legacy key and an open PR that would, the following week, replace its &lt;code&gt;map(any)&lt;/code&gt; type with a proper &lt;code&gt;object({ ... })&lt;/code&gt; schema. That work landed five days later. The downstream consumers caught the change at plan time, and three of them needed minor patches before the type tightening could merge. None of those patches would have caught the original collision. They all caught real existing bugs the &lt;code&gt;any&lt;/code&gt; type had been hiding.&lt;/p&gt;
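
&lt;p&gt;As a rough illustration of that tightening, assuming the legacy routing data were exposed as its own input: the variable name and description below are hypothetical, while the element shape is the one the legacy value already had.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: hypothetical variable name, real element shape.
variable "read_replica_routing" {
  type = map(object({
    host   = string
    weight = number
  }))
  default     = {}
  description = "Replica routing entries (hypothetical description)"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a constraint like this, a plain string passed for a routing entry is rejected when the variable is evaluated at plan time instead of surfacing at apply.&lt;/p&gt;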

&lt;h2&gt;
  
  
  Two policy changes and one structural fix
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What we changed afterwards&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Two policy changes came out of that night, and one structural fix took longer.&lt;/p&gt;

&lt;p&gt;The first policy: no new &lt;code&gt;map(any)&lt;/code&gt; or &lt;code&gt;any&lt;/code&gt;-typed inputs on root modules. The team's &lt;code&gt;terraform/&lt;/code&gt; directory has a pre-commit hook (8 lines of grep) that fails the commit if any new &lt;code&gt;variable&lt;/code&gt; block contains &lt;code&gt;type = any&lt;/code&gt; or &lt;code&gt;type = map(any)&lt;/code&gt;. Existing instances are grandfathered, with a TODO list tracked against each module. Three of the original 14 have been converted to typed objects so far. The hook has fired four times in the six weeks since.&lt;/p&gt;

&lt;p&gt;The second policy: every PR runs &lt;code&gt;terraform plan&lt;/code&gt; against every environment, not just the one the contributor cares about. A matrix job in CI runs &lt;code&gt;plan -var-file=envs/&amp;lt;env&amp;gt;.tfvars&lt;/code&gt; across all four environments and fails the PR if any of them errors. This would not have caught the original collision (plan succeeded everywhere), but it catches a different class of failure where one environment's tfvars hits an unwritten code path.
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: latent any-typed input&lt;/span&gt;
&lt;span class="k"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"feature_overrides"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Per-environment feature flag overrides"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# In modules/services/database/locals.tf&lt;/span&gt;
&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;merged_flags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="kd"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;legacy_db_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;feature_overrides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Above passes plan even when the two maps have a key&lt;/span&gt;
&lt;span class="c1"&gt;# whose value types disagree. The mismatch surfaces only&lt;/span&gt;
&lt;span class="c1"&gt;# at apply, when the receiving attribute is evaluated.&lt;/span&gt;

&lt;span class="c1"&gt;# After: typed, explicit, errors at plan time&lt;/span&gt;
&lt;span class="k"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"feature_overrides"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;enabled&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bool&lt;/span&gt;
    &lt;span class="nx"&gt;rollout_pct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;routing&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Per-environment feature flag overrides"&lt;/span&gt;

  &lt;span class="nx"&gt;validation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;alltrue&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;feature_overrides&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rollout_pct&lt;/span&gt; &lt;span class="err"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rollout_pct&lt;/span&gt; &lt;span class="err"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nx"&gt;error_message&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"rollout_pct must be between 0 and 100."&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The same variable, before and after. The lower form fails plan, not apply, when a contributor passes the wrong shape.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The structural fix took longer. A 28-input root module is not a configuration problem; it is a service-boundary problem. The team running the database stack should own a &lt;code&gt;database/&lt;/code&gt; root module with four inputs, not a 14-input subtree of a shared 28-input root. We split the original root into three roots along ownership boundaries (network, services, observability) using a thin Terragrunt overlay for the cross-cutting variables. The split took six weeks of careful &lt;code&gt;terraform state mv&lt;/code&gt; work to land without downtime. We have written more on the structural fix in &lt;a href="https://dev.to/terraform-iac-debt/"&gt;the Terraform and IaC debt playbook&lt;/a&gt;, which covers when a shared root module starts costing more than the consistency it buys.&lt;/p&gt;

&lt;p&gt;What we tell every team now: strong types in Terraform are not bureaucracy, they are the documentation. The half-day cost to write &lt;code&gt;object({ name = string, enabled = bool, ... })&lt;/code&gt; instead of &lt;code&gt;map(any)&lt;/code&gt; buys you a plan-time failure instead of an apply-time failure, and apply-time failures land at 4:42pm on Fridays. We have stopped accepting &lt;code&gt;map(any)&lt;/code&gt; inputs in any client engagement that involves an IaC audit, and we have not had a single contributor push back once they saw the cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you are looking at a 28-input root with map(any) sprinkled through it
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;When your own root module is past 20 inputs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you are reading this and your &lt;code&gt;terraform/&lt;/code&gt; directory has a root module past 20 inputs with several &lt;code&gt;map(any)&lt;/code&gt; types in the input list, the failure you are heading toward is not a surprise. It is a scheduled event. The trigger will be a new contributor who does not know the implicit contract, plus one bad-enough Friday. The hardest part of cleaning it up is not the typing work itself; it is the audit of downstream consumers that have been silently depending on the loose contract for years. Two layers of modules-of-modules can hide a reference that breaks the moment you tighten the type, and your CI will not warn you because plan will keep passing right up to the apply that surfaces it.&lt;/p&gt;

&lt;p&gt;We run these recovery and audit engagements every week. The &lt;code&gt;map(any)&lt;/code&gt; collision pattern is the third-most-common shape we see in seed-to-Series-B SaaS Terraform repos, right after stale state lock holders and provider-version-drift cascades. It is one variant of the broader &lt;a href="https://dev.to/problems/terraform-apply-fear/"&gt;terraform apply fear&lt;/a&gt; problem we engage on most weeks. On a typical engagement we map every &lt;code&gt;any&lt;/code&gt;-typed input in your root modules within the first day, prioritize them by blast radius, and either convert them in-place or split the root if the input count is the real problem. If you are looking at a Terraform root with &lt;code&gt;map(any)&lt;/code&gt; sprinkled through it and a release window that does not forgive a 4pm apply failure, &lt;a href="https://dev.to/review/"&gt;book an infrastructure review with our team&lt;/a&gt; and we will start with a 30-minute diagnostic call this week.&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://infraforge.agency/insights/terraform-apply-fails-map-any-trap/" rel="noopener noreferrer"&gt;https://infraforge.agency/insights/terraform-apply-fails-map-any-trap/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements — &lt;a href="https://infraforge.agency/review/" rel="noopener noreferrer"&gt;see /review&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>iac</category>
      <category>recovery</category>
      <category>terraformiacdebt</category>
    </item>
  </channel>
</rss>
