<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amaan Ul Haq Siddiqui</title>
    <description>The latest articles on DEV Community by Amaan Ul Haq Siddiqui (@amaanx86).</description>
    <link>https://dev.to/amaanx86</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3440630%2Fe95bc793-ebf6-4193-b6d3-822ff6f20586.jpeg</url>
      <title>DEV Community: Amaan Ul Haq Siddiqui</title>
      <link>https://dev.to/amaanx86</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amaanx86"/>
    <language>en</language>
    <item>
      <title>End-to-End Supply Chain Security for a Go Project: TUF on CI, cosign, and SLSA L3</title>
      <dc:creator>Amaan Ul Haq Siddiqui</dc:creator>
      <pubDate>Sat, 11 Apr 2026 21:25:52 +0000</pubDate>
      <link>https://dev.to/amaanx86/end-to-end-supply-chain-security-for-a-go-project-tuf-on-ci-cosign-and-slsa-l3-1575</link>
      <guid>https://dev.to/amaanx86/end-to-end-supply-chain-security-for-a-go-project-tuf-on-ci-cosign-and-slsa-l3-1575</guid>
      <description>&lt;p&gt;Adding &lt;code&gt;cosign sign&lt;/code&gt; to a CI pipeline and calling it "signed releases" is a bit like putting a lock on a glass door. The lock works. The glass does not. Signing the image proves a specific digest was signed by a specific identity at a specific time. It says nothing about whether the source commit matches what was built, whether the build environment was clean, or whether someone replaced the release asset after the fact.&lt;/p&gt;

&lt;p&gt;I had been going deep on supply chain security for a while - reading through TUF specs, the Sigstore design docs, how Fulcio issues short-lived certificates, how Rekor works as an append-only transparency log. At some point I came across the OpenSSF Best Practices requirements and saw the full picture laid out as a checklist. I could have just signed the image and moved on. Instead I used &lt;a href="https://github.com/amaanx86/oci-prometheus-sd-proxy" rel="noopener noreferrer"&gt;oci-prometheus-sd-proxy&lt;/a&gt; - a project that does OCI Prometheus service discovery - as the thing to actually build it on. I wanted to understand each layer well enough to explain it, not just wire it up. What I ended up with: cosign keyless signing, a CycloneDX SBOM attestation, SLSA L3 build provenance, and TUF metadata distribution via &lt;a href="https://github.com/amaanx86/oci-prometheus-sd-proxy-tuf-on-ci" rel="noopener noreferrer"&gt;tuf-on-ci&lt;/a&gt; published to GitHub Pages. No long-lived keys anywhere in the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why each layer exists
&lt;/h2&gt;

&lt;p&gt;This is the question worth answering properly, because you can absolutely ship just cosign and be in a better position than 95% of projects. So why go further?&lt;/p&gt;

&lt;p&gt;HTTPS protects against network-level tampering. It does not protect against a compromised GitHub account pushing a backdoored release, or a build that was modified after it completed, or an asset quietly swapped post-publication.&lt;/p&gt;

&lt;p&gt;cosign on a container image proves that a specific digest was signed by a specific OIDC identity with the event recorded in Rekor's transparency log. That is genuinely useful - but it only covers the image. It says nothing about the source ref, the build environment, or the workflow that produced it. Someone could sign a backdoored binary and the cosign verification would pass.&lt;/p&gt;

&lt;p&gt;SLSA L3 provenance fills that gap, and it is the layer I found most interesting to wire up. The provenance is generated in a separate, isolated signing job with its own OIDC identity, not in the main build job. That isolation is what makes L3 meaningful: an attacker who compromises the main build job cannot forge L3 provenance because they do not control the signing job's OIDC token. The provenance attests to the exact source ref, the exact workflow, and the exact runner environment. You can look at a SLSA L3 attestation and know that the image you are running came from that commit in that repo via that verified builder.&lt;/p&gt;

&lt;p&gt;TUF adds something orthogonal to all of the above - it is about &lt;em&gt;distribution trust&lt;/em&gt;, not just signing trust. It adds a role-based metadata layer where clients can verify that a release target was authorised by the project's key holders, that the metadata has not been rolled back to a previous version, and that the metadata is actually fresh. The design difference that matters: TUF survives key compromise in a way that a single cosign keypair does not. If my cosign key leaked tomorrow, every past signature would be under a cloud. With TUF, key rotation is a defined protocol. The damage is bounded and recoverable.&lt;/p&gt;
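&lt;p&gt;The rollback and freshness rules are concrete enough to sketch. This toy check is illustrative only - real clients such as python-tuf's ngclient implement the full TUF client workflow, and the function below is hypothetical:&lt;/p&gt;

```python
from datetime import datetime, timezone

def check_metadata(trusted_version: int, new_version: int, expires: str) -> None:
    """Toy version of two TUF client rules: rollback protection and freshness."""
    # Rollback protection: never accept metadata older than what we already trust.
    if trusted_version > new_version:
        raise ValueError(f"rollback: got version {new_version}, trusted {trusted_version}")
    # Freshness: timestamp.json carries a short expiry window; stale metadata is rejected.
    if datetime.now(timezone.utc) >= datetime.fromisoformat(expires):
        raise ValueError(f"metadata expired at {expires}")

check_metadata(trusted_version=4, new_version=5, expires="2099-01-01T00:00:00+00:00")  # accepted
```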

&lt;p&gt;The point of having all four is that verification is fully independent. Verifying the image signature, the attestations, the build provenance, and the TUF metadata chain are separate operations against different data sources. Compromising any single one of them is not enough to ship a malicious release undetected. You need to compromise all of them simultaneously - and the transparency logs make doing that silently very hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two repos, one trust boundary
&lt;/h2&gt;

&lt;p&gt;The implementation lives across two repositories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/amaanx86/oci-prometheus-sd-proxy" rel="noopener noreferrer"&gt;amaanx86/oci-prometheus-sd-proxy&lt;/a&gt;&lt;/strong&gt; - the application, the build pipeline, the release workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/amaanx86/oci-prometheus-sd-proxy-tuf-on-ci" rel="noopener noreferrer"&gt;amaanx86/oci-prometheus-sd-proxy-tuf-on-ci&lt;/a&gt;&lt;/strong&gt; - TUF metadata, managed by tuf-on-ci, published to GitHub Pages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keeping them separate was a deliberate trust-boundary decision, not just tidy organisation. The app CI can push a signing branch to the TUF repo, but it cannot merge that branch or sign &lt;code&gt;targets.json&lt;/code&gt;. That step requires my OIDC identity authenticating via Sigstore in a browser - not the CI system's token. An attacker who steals the app repo's CI tokens hits a hard wall at the TUF signing step. They can push a branch, but they cannot authorise the release. The human is the last gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App repo CI (run #24290258383)
  └── builds image (linux/amd64, linux/arm64)
  └── pushes to GHCR
  └── cosign attest --type cyclonedx sbom.cyclonedx.json (SBOM)
  └── cosign sign (image signature, OIDC -&amp;gt; Fulcio cert -&amp;gt; Rekor)
  └── cosign attest --type .../release-metadata release-metadata.json
  └── slsa-github-generator -&amp;gt; SLSA L3 provenance (isolated job)
  └── pushes sign/release-1-4-2-rc-24290258383 branch -&amp;gt; TUF repo

TUF repo (tuf-on-ci) [PR #8]
  └── signing-event.yml detects branch push -&amp;gt; opens signing PR
  └── Maintainer: tuf-on-ci-sign (browser OIDC -&amp;gt; @amaanx86 identity)
  └── PR merged -&amp;gt; online-sign.yml refreshes snapshot + timestamp
  └── publish.yml -&amp;gt; GitHub Pages
  └── test.yml -&amp;gt; smoke-tests TUF client (scheduled)

Users / Policy Engines
  └── cosign verify - checks image signature against Rekor
  └── cosign verify-attestation --type cyclonedx - checks SBOM
  └── cosign verify-attestation --type .../release-metadata - checks release-metadata
  └── slsa-verifier verify-image - checks SLSA L3 provenance
  └── TUF client (python-tuf ngclient) - fetches metadata from GitHub Pages, verifies chain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The build pipeline
&lt;/h2&gt;

&lt;p&gt;The workflow (&lt;code&gt;docker-build-push.yml&lt;/code&gt;) triggers on release publication. After pushing the multi-arch image (linux/amd64 and linux/arm64) to GHCR it does five more things.&lt;/p&gt;

&lt;h3&gt;
  
  
  SBOM generation and attestation
&lt;/h3&gt;

&lt;p&gt;First, Syft generates a CycloneDX SBOM for the pushed image, which gets attached as a cosign attestation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign attest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--yes&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--predicate&lt;/span&gt; sbom.cyclonedx.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; cyclonedx &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SBOM is also uploaded as a release asset (the 80.7 KB &lt;code&gt;*.cyclonedx.json&lt;/code&gt; artifact visible on the workflow run).&lt;/p&gt;

&lt;h3&gt;
  
  
  cosign image signing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign sign &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--yes&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;--key&lt;/code&gt; flag. cosign uses the GitHub Actions OIDC token to get an ephemeral certificate from Fulcio, signs the digest, and records the operation in Rekor. The private key is generated in memory and never stored. The Rekor entry pins the workflow identity to the specific digest at a specific timestamp.&lt;/p&gt;

&lt;p&gt;Note the image is signed by digest, not by tag. Tags are mutable; the digest is what the signature actually covers.&lt;/p&gt;

&lt;p&gt;No secret to rotate. No key to leak.&lt;/p&gt;

&lt;h3&gt;
  
  
  Release-metadata attestation
&lt;/h3&gt;

&lt;p&gt;The workflow generates a &lt;code&gt;release-metadata.json&lt;/code&gt; with the image digest, source commit, release tag, and build timestamp, then attaches it as an attestation under a custom predicate type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign attest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--yes&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--predicate&lt;/span&gt; release-metadata.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; https://github.com/amaanx86/oci-prometheus-sd-proxy/release-metadata &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using a project-specific type URI keeps the attestation namespaced and lets &lt;code&gt;cosign verify-attestation --type &amp;lt;uri&amp;gt;&lt;/code&gt; fetch exactly this attestation rather than every in-toto statement on the image.&lt;/p&gt;
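&lt;p&gt;For reference, the predicate is a small JSON document along these lines. The field names are illustrative - the release workflow, not any standard, defines the actual schema:&lt;/p&gt;

```python
import json

# Illustrative shape of release-metadata.json; the field names and timestamp are
# assumptions, but the digest, commit, and tag match the v1.4.2-rc release.
release_metadata = {
    "image": "ghcr.io/amaanx86/oci-prometheus-sd-proxy",
    "digest": "sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd",
    "sourceCommit": "a68d4cd5d29cc6b865c6804fe63adff14ac74b27",
    "releaseTag": "v1.4.2-rc",
    "buildTimestamp": "2026-04-11T21:25:52Z",
}
print(json.dumps(release_metadata, indent=2))
```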

&lt;h3&gt;
  
  
  SLSA L3 provenance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v2.0.0&lt;/span&gt;
&lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/amaanx86/oci-prometheus-sd-proxy&lt;/span&gt;
  &lt;span class="na"&gt;digest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ needs.build-and-push.outputs.digest }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;slsa-github-generator runs as a reusable workflow with its own isolated OIDC identity. The provenance attestation is generated and signed there, not in the main build job. L3 specifically requires this isolation - the verified builder identity in the provenance is &lt;code&gt;https://github.com/slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@refs/tags/v2.0.0&lt;/code&gt;, distinct from the app workflow's identity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Triggering the TUF signing event
&lt;/h3&gt;

&lt;p&gt;Last step: clone the TUF repo and push a signing branch with a new target file at &lt;code&gt;targets/oci-prometheus-sd-proxy/releases/v1.4.2-rc.json&lt;/code&gt;. The branch name embeds the run ID to avoid collisions: &lt;code&gt;sign/release-1-4-2-rc-24290258383&lt;/code&gt; (dots in the version become hyphens, run ID appended). This branch push is what kicks off the TUF side of the pipeline.&lt;/p&gt;
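&lt;p&gt;The naming scheme is simple enough to sketch as a function (a reconstruction from the observed branch name - the real workflow may derive it differently):&lt;/p&gt;

```python
def signing_branch(version: str, run_id: int) -> str:
    # "v1.4.2-rc" becomes "1-4-2-rc": drop the "v" prefix, dots become hyphens.
    # Appending the workflow run ID keeps concurrent releases from colliding.
    slug = version.removeprefix("v").replace(".", "-")
    return f"sign/release-{slug}-{run_id}"

print(signing_branch("v1.4.2-rc", 24290258383))  # sign/release-1-4-2-rc-24290258383
```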

&lt;p&gt;The Python tuf library (v5+) is used inline to update &lt;code&gt;metadata/targets.json&lt;/code&gt; with the new target entry before committing - the targets metadata version is incremented and signatures cleared, ready for the human signing step.&lt;/p&gt;
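&lt;p&gt;What that update changes can be sketched with plain dictionaries (the actual code uses python-tuf's Metadata API; the helper below is hypothetical and only shows the shape of the change):&lt;/p&gt;

```python
import copy

def prepare_targets_for_signing(targets_md: dict, target_path: str, entry: dict) -> dict:
    """Sketch: add a target entry, bump the metadata version, clear signatures."""
    md = copy.deepcopy(targets_md)
    md["signed"]["targets"][target_path] = entry  # new release target
    md["signed"]["version"] += 1                  # monotonic version bump
    md["signatures"] = []                         # cleared; the human signing step re-signs
    return md

before = {"signed": {"version": 7, "targets": {}}, "signatures": [{"keyid": "abc", "sig": "..."}]}
after = prepare_targets_for_signing(
    before,
    "oci-prometheus-sd-proxy/releases/v1.4.2-rc.json",
    {"length": 512, "hashes": {"sha256": "..."}},
)
```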

&lt;h2&gt;
  
  
  tuf-on-ci
&lt;/h2&gt;

&lt;p&gt;tuf-on-ci manages the TUF metadata lifecycle entirely within a GitHub repository. All online signing uses GitHub Actions OIDC. All offline signing uses Sigstore OIDC (browser-based). No key files anywhere.&lt;/p&gt;

&lt;p&gt;The TUF repo has four workflows. &lt;code&gt;signing-event.yml&lt;/code&gt; fires on any &lt;code&gt;sign/**&lt;/code&gt; branch push, opens a PR, and annotates it with what needs signing. &lt;code&gt;online-sign.yml&lt;/code&gt; runs after a signing PR is merged and refreshes &lt;code&gt;snapshot.json&lt;/code&gt; and &lt;code&gt;timestamp.json&lt;/code&gt; using the Actions OIDC token. &lt;code&gt;publish.yml&lt;/code&gt; deploys everything to GitHub Pages. &lt;code&gt;test.yml&lt;/code&gt; runs on a schedule and verifies the full metadata chain with a real TUF client to catch expiry or breakage before any user does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Signing a release
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;signing-event.yml&lt;/code&gt; opens PR #8 with title "Signing event: sign/release-1-4-2-rc-24290258383", I run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tuf-on-ci-sign sign/release-1-4-2-rc-24290258383
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens a browser window for Sigstore OIDC. I authenticate as &lt;a class="mentioned-user" href="https://dev.to/amaanx86"&gt;@amaanx86&lt;/a&gt; via GitHub. Fulcio issues an ephemeral certificate tied to that identity, the tool signs &lt;code&gt;targets.json&lt;/code&gt;, and pushes the signature to the branch. Rekor gets an entry proving the person at &lt;a class="mentioned-user" href="https://dev.to/amaanx86"&gt;@amaanx86&lt;/a&gt; authorised this specific targets update at this specific time.&lt;/p&gt;

&lt;p&gt;Worth being clear on what "offline" means here: it requires a human with a verified identity, not a CI token. It does not require an air-gapped machine. The private key is still ephemeral.&lt;/p&gt;

&lt;p&gt;After merge, &lt;code&gt;online-sign.yml&lt;/code&gt; takes over and refreshes snapshot and timestamp automatically using the Actions OIDC token. No human needed for that part.&lt;/p&gt;

&lt;h3&gt;
  
  
  What gets published
&lt;/h3&gt;

&lt;p&gt;GitHub Pages at &lt;code&gt;amaanx86.github.io/oci-prometheus-sd-proxy-tuf-on-ci/metadata/&lt;/code&gt; serves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;root.json&lt;/code&gt; - signed by &lt;a class="mentioned-user" href="https://dev.to/amaanx86"&gt;@amaanx86&lt;/a&gt; via Sigstore OIDC; defines trusted key holders for all roles&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;targets.json&lt;/code&gt; - signed by &lt;a class="mentioned-user" href="https://dev.to/amaanx86"&gt;@amaanx86&lt;/a&gt;; lists all authorised release targets with digests&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;snapshot.json&lt;/code&gt; - signed by GitHub Actions OIDC; prevents any metadata file from being swapped with an older version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp.json&lt;/code&gt; - signed by GitHub Actions OIDC; has a short validity window to prevent freeze attacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each release gets a target file at &lt;code&gt;targets/oci-prometheus-sd-proxy/releases/v1.4.2-rc.json&lt;/code&gt;. Inside &lt;code&gt;targets.json&lt;/code&gt; it is keyed as &lt;code&gt;oci-prometheus-sd-proxy/releases/v1.4.2-rc.json&lt;/code&gt; - the path is relative to the targets directory, not the repo root.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verifying v1.4.2-rc
&lt;/h2&gt;

&lt;p&gt;All verification uses the digest, not the tag, since the tag is a mutable pointer.&lt;/p&gt;

&lt;p&gt;Image signature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign verify &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-identity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/amaanx86/oci-prometheus-sd-proxy/.github/workflows/docker-build-push.yml@refs/tags/v1.4.2-rc"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-oidc-issuer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://token.actions.githubusercontent.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SBOM attestation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign verify-attestation &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-identity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/amaanx86/oci-prometheus-sd-proxy/.github/workflows/docker-build-push.yml@refs/tags/v1.4.2-rc"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-oidc-issuer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://token.actions.githubusercontent.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; cyclonedx &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Release-metadata attestation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cosign verify-attestation &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-identity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/amaanx86/oci-prometheus-sd-proxy/.github/workflows/docker-build-push.yml@refs/tags/v1.4.2-rc"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--certificate-oidc-issuer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://token.actions.githubusercontent.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; &lt;span class="s2"&gt;"https://github.com/amaanx86/oci-prometheus-sd-proxy/release-metadata"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SLSA L3 provenance (requires digest reference - slsa-verifier rejects mutable tags):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;slsa-verifier verify-image &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source-uri&lt;/span&gt; &lt;span class="s2"&gt;"github.com/amaanx86/oci-prometheus-sd-proxy"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source-tag&lt;/span&gt; &lt;span class="s2"&gt;"v1.4.2-rc"&lt;/span&gt;
&lt;span class="c"&gt;# Verified build using builder "...generator_container_slsa3.yml@refs/tags/v2.0.0"&lt;/span&gt;
&lt;span class="c"&gt;# at commit a68d4cd5d29cc6b865c6804fe63adff14ac74b27&lt;/span&gt;
&lt;span class="c"&gt;# PASSED: SLSA verification passed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TUF metadata chain verification is documented in detail - including the full client walkthrough and what each step validates - in the &lt;a href="https://oci-prometheus-sd-proxy.readthedocs.io/en/latest/releasing.html" rel="noopener noreferrer"&gt;release verification docs&lt;/a&gt;. The short version: a compliant TUF client walks root trust, snapshot consistency, and timestamp freshness before fetching the target. If the metadata is expired, rolled back, or the signature chain is invalid, the client raises before returning anything. That freshness check is what separates TUF from static signature verification.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this covers and where the gaps are
&lt;/h2&gt;

&lt;p&gt;This pipeline secures the release artifact and its provenance chain. An operator fetching the image can independently verify who built it, from what source, in what environment, and that the release was authorised by a human identity. That is a lot more than most projects ship with.&lt;/p&gt;

&lt;p&gt;But supply chain security has layers, and the release artifact is only one of them. A few honest gaps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go module dependencies.&lt;/strong&gt; The SBOM shows what modules are in the binary, and &lt;code&gt;go.sum&lt;/code&gt; pins their hashes. But &lt;code&gt;govulncheck&lt;/code&gt; and a periodic dependency audit are what actually catch known-vulnerable transitive dependencies. The attestation proves the SBOM is authentic; it does not tell you the SBOM is safe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime enforcement.&lt;/strong&gt; Signing and provenance only matter if someone checks them at deploy time. Right now verification is a manual step. The more interesting direction is integrating the cosign attestations and SLSA provenance into a Kyverno or OPA/Gatekeeper policy engine that enforces admission control in Kubernetes. A policy engine like Kyverno can query Sigstore and reject any image that lacks a valid SLSA L3 attestation from the correct workflow identity - automatically, at admission time, rather than as a manual verification step. That closes the loop between what we proved at build time and what is allowed to run in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TUF root key compromise.&lt;/strong&gt; If the &lt;a class="mentioned-user" href="https://dev.to/amaanx86"&gt;@amaanx86&lt;/a&gt; GitHub account itself was compromised, an attacker could rotate the root TUF key in a way that would pass client verification. TUF supports threshold signatures across multiple root key holders to mitigate this, which becomes relevant as the project scales to multiple maintainers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The dependency of your dependencies.&lt;/strong&gt; None of this solves a compromised &lt;code&gt;slsa-github-generator&lt;/code&gt; or a backdoored Syft release. That is a solved problem in theory (pin action hashes, verify the tools themselves) but it is worth naming.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am building toward
&lt;/h2&gt;

&lt;p&gt;The next step that interests me most is closing the loop at runtime.&lt;/p&gt;

&lt;p&gt;Kyverno ClusterPolicies that require verified SLSA provenance before admission, OPA rules that check SBOM attestations against a known-safe package policy, and Sigstore-aware image admission are all achievable with what the pipeline already produces. The attestations are already there. The policy layer that consumes them is the missing piece.&lt;/p&gt;

&lt;p&gt;After that, expanding to multi-maintainer TUF with hardware-backed root keys and threshold signatures would make the trust model genuinely robust at scale.&lt;/p&gt;




&lt;p&gt;Full implementation details and the verification workflow are documented at &lt;a href="https://oci-prometheus-sd-proxy.readthedocs.io/en/latest/releasing.html" rel="noopener noreferrer"&gt;oci-prometheus-sd-proxy.readthedocs.io/en/latest/releasing.html&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>opensource</category>
      <category>go</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Built My Own Prometheus Service Discovery for Oracle Cloud Because a 3-Year-Old PR Never Got Merged</title>
      <dc:creator>Amaan Ul Haq Siddiqui</dc:creator>
      <pubDate>Wed, 11 Mar 2026 23:20:32 +0000</pubDate>
      <link>https://dev.to/amaanx86/built-my-own-prometheus-service-discovery-for-oracle-cloud-because-a-3-year-old-pr-never-got-merged-2fme</link>
      <guid>https://dev.to/amaanx86/built-my-own-prometheus-service-discovery-for-oracle-cloud-because-a-3-year-old-pr-never-got-merged-2fme</guid>
      <description>&lt;p&gt;There is a specific kind of frustration reserved for when you know a problem is solved, you can see the solution, and you still cannot use it. That is how this project started.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Context: Setting Up Observability From Scratch
&lt;/h2&gt;

&lt;p&gt;I was building out observability from scratch across Oracle Cloud Infrastructure - multiple tenancies, multiple regions, a decent number of compute instances spread across compartments. The goal was full coverage: every VM enrolled in monitoring, no gaps, no guesswork.&lt;/p&gt;

&lt;p&gt;When you are starting from zero, one of the first things you ask is how Prometheus is going to discover what it needs to scrape. For AWS you have EC2 service discovery built right in. Same for GCP, Azure, Hetzner, DigitalOcean. You configure credentials, set some filters, and Prometheus takes care of the rest.&lt;/p&gt;

&lt;p&gt;OCI is not on that list.&lt;/p&gt;

&lt;p&gt;I searched. I found a pull request in the Prometheus repository opened by an engineer at Oracle. It was exactly what I needed. It was also three years old and had never been merged. Comments, reviews, back and forth, and then silence. The PR is still open today. If you have gone looking for OCI service discovery in Prometheus you have probably landed on that same page, felt a brief moment of hope, and then noticed the date.&lt;/p&gt;

&lt;p&gt;So I built it myself.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Needed
&lt;/h2&gt;

&lt;p&gt;The requirement was simple: new VM comes up, Prometheus finds it and starts scraping. No manual steps in that loop. Not because I was fixing a broken process - there was no process yet. I was designing this from scratch and I was not going to design it with a gap where a human has to remember to update a config file.&lt;/p&gt;

&lt;p&gt;The concern was blind spots. Infrastructure grows, VMs get provisioned, things change. If enrollment is manual, coverage is only as good as the last person who remembered to do it. I wanted observability that was structurally complete, not best-effort.&lt;/p&gt;

&lt;p&gt;The workflow I landed on:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag the VM in OCI.&lt;/strong&gt;&lt;br&gt;
When provisioning a new instance, add a tag - something like &lt;code&gt;prometheus:scrape = true&lt;/code&gt;. That is the only enrollment step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open the Prometheus port on the network security group.&lt;/strong&gt;&lt;br&gt;
Allow the Prometheus server IP to reach port 9100. One rule, specific source, no broad exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run the node exporter playbook.&lt;/strong&gt;&lt;br&gt;
An Ansible playbook installs and starts node exporter. Done.&lt;/p&gt;

&lt;p&gt;That is the full enrollment flow. The VM is in monitoring. No touching the Prometheus config. No reloading Prometheus. No rollout restart of a Kubernetes deployment. No SSH back into the server to verify anything.&lt;/p&gt;

&lt;p&gt;The proxy polls OCI, finds every instance with the right tag across all configured tenancies and compartments, and hands the target list to Prometheus via the HTTP service discovery API. Prometheus picks it up on its next refresh cycle. The whole thing is automatic by design, not patched in after the fact.&lt;/p&gt;
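&lt;p&gt;The payload the proxy serves follows the Prometheus HTTP SD format: a JSON array of target groups, each with a &lt;code&gt;targets&lt;/code&gt; list and a &lt;code&gt;labels&lt;/code&gt; map. The values below are invented, but the label names match the &lt;code&gt;__meta_oci_*&lt;/code&gt; labels the proxy exposes:&lt;/p&gt;

```python
import json

# Illustrative HTTP SD response; addresses and names are made up.
target_groups = [
    {
        "targets": ["10.0.1.14:9100"],
        "labels": {
            "__meta_oci_instance_name": "app-vm-1",
            "__meta_oci_tenancy_name": "prod-tenancy",
            "__meta_oci_compartment_name": "observability",
            "__meta_oci_region": "eu-frankfurt-1",
        },
    }
]
print(json.dumps(target_groups, indent=2))
```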


&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;oci-prometheus-sd-proxy&lt;/code&gt; is a lightweight Go service that implements the &lt;a href="https://prometheus.io/docs/prometheus/latest/http_sd/" rel="noopener noreferrer"&gt;Prometheus HTTP Service Discovery&lt;/a&gt; API for OCI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkzq34x8dgpt2eubp6fh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkzq34x8dgpt2eubp6fh.png" alt="SD Proxy Architecture" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You point Prometheus at it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oci_instances&lt;/span&gt;
    &lt;span class="na"&gt;http_sd_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://oci-sd-proxy:8080/v1/targets'&lt;/span&gt;
        &lt;span class="na"&gt;authorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bearer&lt;/span&gt;
          &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_TOKEN'&lt;/span&gt;
        &lt;span class="na"&gt;refresh_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
    &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
    &lt;span class="na"&gt;scrape_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;metrics_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/metrics&lt;/span&gt;
    &lt;span class="na"&gt;relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_oci_instance_name&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;instance&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_oci_tenancy_name&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenancy&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_oci_compartment_name&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compartment&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_oci_region&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;region&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_oci_availability_domain&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;availability_domain&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_oci_instance_shape&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy handles the rest. It scans OCI, filters by tag, and returns targets with rich metadata: tenancy, compartment, region, availability domain, shape, fault domain, private IP, and all your custom OCI tags as Prometheus labels. Use them for relabeling, alerting rules, dashboards - whatever you need.&lt;/p&gt;
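&lt;p&gt;For reference, the contract behind all of this is Prometheus's HTTP SD JSON: a list of target groups, each with a &lt;code&gt;targets&lt;/code&gt; array and a &lt;code&gt;labels&lt;/code&gt; map. A minimal Python sketch of that shape (the label values here are made up; the proxy's actual label set is in its docs):&lt;/p&gt;

```python
# Sketch of the JSON shape an http_sd_configs endpoint returns: a list of
# target groups, each with "targets" and a "labels" map. The __meta_oci_*
# names mirror the relabel_configs above; the sample values are hypothetical.
import json

def build_target_group(ip, port, meta):
    """Wrap one discovered instance in the HTTP SD target-group shape."""
    labels = {f"__meta_oci_{k}": v for k, v in meta.items()}
    return {"targets": [f"{ip}:{port}"], "labels": labels}

group = build_target_group("10.0.1.5", 9100, {
    "instance_name": "web-01",
    "tenancy_name": "acme",
    "region": "eu-frankfurt-1",
})
payload = json.dumps([group])
```

&lt;p&gt;Prometheus treats any label starting with &lt;code&gt;__meta_&lt;/code&gt; as discovery metadata that is only visible during relabeling, which is why the relabel rules above promote the interesting ones to stable label names.&lt;/p&gt;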

&lt;p&gt;A few things I cared about when building it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tenancy from day one.&lt;/strong&gt; The proxy handles all tenancies in parallel from a single config file. One deployment, full coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting.&lt;/strong&gt; OCI's API will return 429s if you push it too hard. The proxy uses a token bucket proactively and a retry policy reactively. Discovery does not silently fail under load.&lt;/p&gt;
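&lt;p&gt;The proxy itself is Go, but the token bucket idea fits in a few lines of Python. This is a hedged sketch of the technique, not the proxy's actual implementation, and the rate and capacity values are hypothetical:&lt;/p&gt;

```python
import time

class TokenBucket:
    """Proactive rate limiter sketch: allow() spends one token if available,
    refilling at `rate` tokens per second up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=5)
results = [bucket.allow() for _ in range(6)]  # burst of 6 against capacity 5
```

&lt;p&gt;The reactive half is a retry with backoff on 429 responses; the bucket just makes sure you rarely hit them in the first place.&lt;/p&gt;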

&lt;p&gt;&lt;strong&gt;Security.&lt;/strong&gt; Bearer token auth, distroless container, read-only config mounts. In an HA setup it sits on the local network, only reachable by Prometheus. It does not need to be internet-facing and it should not be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching.&lt;/strong&gt; Discovery results are cached so Prometheus always gets a response, even if the OCI API is momentarily slow or returning rate-limit errors.&lt;/p&gt;
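&lt;p&gt;The caching pattern here is serve-stale-on-error. A rough Python sketch of the idea (again an illustration, not the proxy's code; &lt;code&gt;flaky_loader&lt;/code&gt; stands in for the OCI API call):&lt;/p&gt;

```python
import time

class StaleOkCache:
    """Serve-stale cache sketch: fetch() refreshes when the TTL allows and
    the upstream call succeeds, otherwise falls back to the last good value
    instead of failing the scrape."""
    def __init__(self, loader, ttl):
        self.loader = loader
        self.ttl = ttl
        self.value = None
        self.stamp = 0.0

    def fetch(self):
        now = time.monotonic()
        if self.value is None or now - self.stamp >= self.ttl:
            try:
                self.value = self.loader()
                self.stamp = now
            except Exception:
                if self.value is None:
                    raise  # nothing cached yet, propagate
        return self.value

calls = {"n": 0}
def flaky_loader():
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("upstream throttled")
    return ["target-%d" % calls["n"]]

cache = StaleOkCache(flaky_loader, ttl=0.0)  # ttl 0 forces a refresh attempt each call
first = cache.fetch()   # loads fresh
second = cache.fetch()  # loader fails, stale value served instead
```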




&lt;h2&gt;
  
  
  Battle Tested
&lt;/h2&gt;

&lt;p&gt;This is running in production across 10+ Oracle Cloud tenancies - different regions, different compartments, different team setups. It has handled OCI API slowness, tenancy permission edge cases, and the general entropy that comes with real infrastructure at scale.&lt;/p&gt;

&lt;p&gt;The thing that still satisfies me is watching a new VM appear in Grafana within a minute of it booting, with nobody doing anything to make that happen. That is what good observability infrastructure should feel like. You build it right once and it stays right.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Did Not Exist Already
&lt;/h2&gt;

&lt;p&gt;OCI is a smaller player compared to AWS or GCP and Prometheus contributors naturally prioritize the platforms most of their users are on. Oracle engineers clearly wanted to solve this - that PR is evidence of that - but getting a first-party integration merged upstream is a long road with no guarantees.&lt;/p&gt;

&lt;p&gt;The HTTP SD API that Prometheus exposes is actually the right answer for this situation. It lets any platform plug in without touching the Prometheus codebase. I just had to build the other end of that interface.&lt;/p&gt;
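&lt;p&gt;To make that concrete: the entire interface Prometheus needs is a GET endpoint that returns JSON. A toy Python sketch with hardcoded targets (this is obviously not the proxy, just the shape of the contract):&lt;/p&gt;

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# One hardcoded target group; a real implementation would build this
# from whatever platform API it fronts.
TARGETS = [{"targets": ["10.0.1.5:9100"], "labels": {"region": "eu-frankfurt-1"}}]

class SDHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(TARGETS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve():
    # Blocks forever; point http_sd_configs at this host and port.
    HTTPServer(("", 8080), SDHandler).serve_forever()
```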




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/amaanx86/oci-prometheus-sd-proxy" rel="noopener noreferrer"&gt;https://github.com/amaanx86/oci-prometheus-sd-proxy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://oci-prometheus-sd-proxy.readthedocs.io/" rel="noopener noreferrer"&gt;https://oci-prometheus-sd-proxy.readthedocs.io/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The docs cover installation, the full configuration reference, OCI IAM permissions, and Docker Compose deployment examples. If you are building observability on OCI and want automatic enrollment from day one, this is what I use.&lt;/p&gt;

&lt;p&gt;And if you were one of the people who commented on that PR hoping it would eventually land - same. This is what I built instead.&lt;/p&gt;

&lt;p&gt;Original Post : &lt;a href="https://amaanx86.github.io/blog/oci-prometheus-service-discovery" rel="noopener noreferrer"&gt;https://amaanx86.github.io/blog/oci-prometheus-service-discovery&lt;/a&gt;&lt;/p&gt;

</description>
      <category>oracle</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>What Broke During Our AWS DMS Migration (And How We Fixed It)</title>
      <dc:creator>Amaan Ul Haq Siddiqui</dc:creator>
      <pubDate>Sat, 24 Jan 2026 09:46:30 +0000</pubDate>
      <link>https://dev.to/amaanx86/what-broke-during-our-aws-dms-migration-and-how-we-fixed-it-198p</link>
      <guid>https://dev.to/amaanx86/what-broke-during-our-aws-dms-migration-and-how-we-fixed-it-198p</guid>
      <description>&lt;p&gt;Let me tell you about the time I thought migrating a database would be straightforward. Spoiler alert: it wasn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I was tasked with migrating our MySQL database from DigitalOcean's managed service to AWS RDS. Armed with confidence and the AWS DMS documentation, I dove in headfirst.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hc9abobdecrmsalx3y8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hc9abobdecrmsalx3y8.png" alt="DMS-Arch" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The First Roadblock: Serverless Seemed Like a Good Idea
&lt;/h2&gt;

&lt;p&gt;Creating the source and target endpoints went surprisingly smoothly. I felt like I was on a roll. Then came the task creation, and I thought, "Hey, let's use serverless DMS. Modern, scalable, perfect."&lt;/p&gt;

&lt;p&gt;That's when everything came to a grinding halt.&lt;/p&gt;

&lt;p&gt;The networking configuration for serverless DMS had me completely stumped. I spent way too long trying to figure out the right VPC setup, subnet configurations, and security group rules. Nothing seemed to work the way I expected. The documentation made sense in theory, but practice was a different story.&lt;/p&gt;

&lt;p&gt;Eventually, I gave up on serverless and pivoted to the traditional EC2-based replication instance. Sometimes the old way is the right way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvktygp6dk0r5xnjou2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvktygp6dk0r5xnjou2n.png" alt="DMS Console" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Excluding the Noise
&lt;/h2&gt;

&lt;p&gt;With my shiny new replication instance ready, I created the migration task. But I needed to make sure I wasn't migrating MySQL's system databases along with my actual data. Nobody needs that mess.&lt;/p&gt;

&lt;p&gt;I configured table mappings to include all user databases while explicitly excluding the system ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"selection"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"include-all-user-dbs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"object-locator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"%"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"include"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"selection"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exclude-mysql"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"object-locator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exclude"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"selection"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exclude-sys"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"object-locator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sys"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exclude"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"selection"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exclude-information_schema"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"object-locator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"information_schema"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exclude"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"selection"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exclude-performance_schema"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"object-locator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"performance_schema"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table-name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"%"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"rule-action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exclude"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean and specific. I felt good about this.&lt;/p&gt;
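&lt;p&gt;If you are scripting this instead of clicking through the console, note that DMS takes the mapping as a JSON string, not an object. A hedged boto3-style sketch (ARNs are placeholders, and the rules are abbreviated to two of the five above):&lt;/p&gt;

```python
import json

# Abbreviated version of the selection rules above: one include plus one
# exclude. DMS expects TableMappings serialized as a JSON string.
table_mappings = {
    "rules": [
        {"rule-type": "selection", "rule-id": "1", "rule-name": "include-all-user-dbs",
         "object-locator": {"schema-name": "%", "table-name": "%"},
         "rule-action": "include"},
        {"rule-type": "selection", "rule-id": "2", "rule-name": "exclude-mysql",
         "object-locator": {"schema-name": "mysql", "table-name": "%"},
         "rule-action": "exclude"},
    ]
}

# Placeholder ARNs; with a boto3 DMS client these kwargs would go to
# dms_client.create_replication_task(**task_kwargs).
task_kwargs = {
    "ReplicationTaskIdentifier": "staging-migration",
    "SourceEndpointArn": "arn:aws:dms:REGION:ACCOUNT:endpoint:SOURCE",
    "TargetEndpointArn": "arn:aws:dms:REGION:ACCOUNT:endpoint:TARGET",
    "ReplicationInstanceArn": "arn:aws:dms:REGION:ACCOUNT:rep:INSTANCE",
    "MigrationType": "full-load",
    "TableMappings": json.dumps(table_mappings),
}
```

&lt;p&gt;The console does the same serialization for you behind the scenes.&lt;/p&gt;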

&lt;h2&gt;
  
  
  The Premigration Checks Humbled Me
&lt;/h2&gt;

&lt;p&gt;I ran the premigration assessment checks, expecting maybe a warning or two. Instead, I was greeted with a wall of failures. Major ones. The kind that make you question your life choices.&lt;/p&gt;

&lt;p&gt;I spent the next few hours going through each failed check, cross-referencing with AWS documentation, and fixing issues one by one. Most of the critical warnings got resolved, though some minor ones remained. I figured those were acceptable and proceeded with the migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration Itself: A False Sense of Security
&lt;/h2&gt;

&lt;p&gt;This was our staging database, and it was massive. The migration kicked off, and surprisingly, it ran smoothly. Hours passed, data transferred, progress bars filled. Everything looked perfect.&lt;/p&gt;

&lt;p&gt;We switched the application endpoint to the new RDS instance, deployed the changes, and waited for the green light.&lt;/p&gt;

&lt;p&gt;Then the login feature stopped working entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Investigation That Nearly Broke Me
&lt;/h2&gt;

&lt;p&gt;Cue several hours of frantic debugging. We checked connection strings, verified credentials, tested queries manually, checked network rules, and compared database structures. Everything looked identical.&lt;/p&gt;

&lt;p&gt;Until we looked closer at the tables themselves.&lt;/p&gt;

&lt;p&gt;Our foreign keys were gone. Primary keys were missing. Auto-increment sequences had reset. DMS had essentially eaten the structural integrity of our database.&lt;/p&gt;

&lt;p&gt;Turns out, this is a known behavior. DMS focuses on moving data efficiently, and in doing so, it doesn't always preserve things like constraints and keys perfectly during the initial load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual Solution
&lt;/h2&gt;

&lt;p&gt;After more digging through documentation and forums, we found the recommended approach: manually dump and restore the schema first, then let DMS handle just the data migration.&lt;/p&gt;

&lt;p&gt;We also discovered that you can pass additional connection parameters to the DMS endpoint configuration to better preserve database objects during migration. We updated our endpoint settings with these parameters and ran the migration again.&lt;/p&gt;
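&lt;p&gt;Those parameters end up in the endpoint's &lt;code&gt;ExtraConnectionAttributes&lt;/code&gt; field, a semicolon-separated string. The two attributes below are examples documented for MySQL targets (disable foreign key checks during load, single-threaded load); verify the exact names against the current DMS docs for your engine before relying on them:&lt;/p&gt;

```python
# Build the ExtraConnectionAttributes string for a DMS endpoint.
# Attribute names are engine-specific; these two are MySQL-target examples.
attrs = {
    "initstmt": "SET FOREIGN_KEY_CHECKS=0",
    "parallelLoadThreads": "1",
}
extra_connection_attributes = ";".join(f"{k}={v}" for k, v in attrs.items())
# Passed as ExtraConnectionAttributes when creating or modifying the endpoint.
```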

&lt;p&gt;This time, everything worked. Foreign keys intact, primary keys preserved, auto-increment sequences functioning as expected. The application came back to life, and logins worked perfectly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;First, serverless isn't always the answer, especially when you're still figuring out the networking intricacies of a new service.&lt;/p&gt;

&lt;p&gt;Second, premigration checks exist for a reason. Those warnings are trying to save you from pain later.&lt;/p&gt;

&lt;p&gt;Third, and most importantly, when migrating databases with DMS, take the time to migrate your schema separately. Don't rely on DMS to handle everything. It's a data migration service, not a complete database cloning tool.&lt;/p&gt;

&lt;p&gt;Fourth, if you're planning to use CDC (Change Data Capture) for ongoing replication, make sure binary logging is enabled on your source database with the correct format. MySQL requires &lt;code&gt;binlog_format&lt;/code&gt; set to &lt;code&gt;ROW&lt;/code&gt; for DMS to capture changes properly. Without this, your CDC tasks will fail silently or miss updates entirely.&lt;/p&gt;
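&lt;p&gt;Checking those prerequisites is cheap enough to automate. A small hypothetical helper that takes &lt;code&gt;SHOW VARIABLES&lt;/code&gt; output as a dict and flags the settings DMS CDC depends on:&lt;/p&gt;

```python
def cdc_ready(variables):
    """Return a list of problems with the MySQL settings DMS CDC needs,
    given a dict of variable name to value (e.g. from SHOW VARIABLES)."""
    problems = []
    if variables.get("log_bin", "OFF").upper() != "ON":
        problems.append("binary logging is disabled (log_bin is not ON)")
    if variables.get("binlog_format", "").upper() != "ROW":
        problems.append("binlog_format must be ROW")
    return problems

# Example: a source configured with STATEMENT logging fails the check.
issues = cdc_ready({"log_bin": "ON", "binlog_format": "STATEMENT"})
```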

&lt;p&gt;The whole experience was frustrating, time-consuming, and honestly a bit embarrassing. But it taught me more about AWS DMS in a few days than I would have learned in weeks of casual reading.&lt;/p&gt;

&lt;p&gt;If you're planning a DMS migration, learn from my mistakes. Your future self will thank you.&lt;/p&gt;

</description>
      <category>database</category>
      <category>devops</category>
      <category>aws</category>
      <category>automation</category>
    </item>
    <item>
      <title>Designing a Secure AWS Landing Zone with Control Tower (What Most Blogs Don’t Tell You)</title>
      <dc:creator>Amaan Ul Haq Siddiqui</dc:creator>
      <pubDate>Sun, 04 Jan 2026 10:50:10 +0000</pubDate>
      <link>https://dev.to/amaanx86/designing-a-secure-aws-landing-zone-with-control-tower-what-most-blogs-dont-tell-you-20oh</link>
      <guid>https://dev.to/amaanx86/designing-a-secure-aws-landing-zone-with-control-tower-what-most-blogs-dont-tell-you-20oh</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;imagine you're the architect of a massive, growing org, and your job is basically to design a secure, compliant, scalable aws environment that won't fall apart when business needs change (which they always do). you have to make sure the right governance is in place, sensitive data is locked down, and everything can scale up. sounds like a headache honestly, but with aws control tower you actually have a shot at making this work without losing your mind&lt;/p&gt;

&lt;p&gt;let's go on a trip through the cloud, setting up a landing zone using control tower and seeing how organizational units (aka OUs) can be the backbone of your security game&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Building the foundation: the org management account
&lt;/h2&gt;

&lt;p&gt;so every solid landing zone starts somewhere, and that somewhere is the organization management account. think of this as the heart of your aws world, where policies, access, and the basic structure live. it's where you define global security rules and where control tower actually does its magic&lt;/p&gt;

&lt;p&gt;for me it all starts with making this org management account the master key to everything. the first step i take is locking this thing down tight (access control, everything) because if this account gets compromised, it's game over&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpqjwwskpjg6xg345g8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpqjwwskpjg6xg345g8y.png" alt="MGMT Account Control Tower Org" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The structure takes shape: OUs and governance
&lt;/h2&gt;

&lt;p&gt;now that the management account is chillin, i start carving out the structure. this is where control tower is actually super useful. using organizational units, i make a hierarchy that actually makes sense for the business&lt;/p&gt;

&lt;p&gt;i always gotta balance letting devs do their thing while keeping control. for big companies i usually set up separate OUs for security, production, and staging, just to keep sanity&lt;/p&gt;

&lt;p&gt;so i generally define some high-level OUs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;security ou&lt;/strong&gt; – the gatekeeper, making sure audits happen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;production ou&lt;/strong&gt; – where the money is made; live services live here&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;staging ou&lt;/strong&gt; – the playground where we break stuff before prod&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy7dwazrrc3pq8cd65cv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmy7dwazrrc3pq8cd65cv.png" alt="Control Tower Arch" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Nested OUs: a growing ecosystem
&lt;/h2&gt;

&lt;p&gt;as the org gets bigger, the aws environment gets messy, so to keep it clean i start nesting OUs. this is basically putting folders inside folders, but for cloud accounts&lt;/p&gt;

&lt;p&gt;now i gotta support multiple teams, so i typically make groups like app-1 and app-2, each with their own prod and staging accounts. this ensures app-1 can't mess with app-2's stuff. nice and isolated&lt;/p&gt;

&lt;p&gt;then the internal ops need love too, so i make an internal operations OU with sub-OUs for the finance, hr, and it departments, so hr doesn't accidentally delete the finance database&lt;/p&gt;
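&lt;p&gt;before building any of this for real, i like to write the hierarchy down as data and eyeball the paths. a tiny python sketch (the OU names are hypothetical, matching the examples above):&lt;/p&gt;

```python
def flatten_ous(tree, prefix=""):
    """Turn a nested OU dict into flat 'workloads/app-1/prod' style paths,
    handy for reviewing the hierarchy before building it."""
    paths = []
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        paths.append(path)
        paths.extend(flatten_ous(children, path))
    return paths

# Hypothetical hierarchy matching the structure described above.
ou_tree = {
    "workloads": {
        "app-1": {"prod": {}, "staging": {}},
        "app-2": {"prod": {}, "staging": {}},
    },
    "internal-operations": {"finance": {}, "hr": {}, "it": {}},
}
paths = flatten_ous(ou_tree)
```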

&lt;h2&gt;
  
  
  4. Audit and compliance: the log archive
&lt;/h2&gt;

&lt;p&gt;the structure is done, but you need visibility. audit logs, compliance, retention policies: all that boring but super critical stuff needs to be set up so you don't fail an audit&lt;/p&gt;

&lt;p&gt;i set up audit and log archive accounts immediately. these are like the black box of an airplane: immutable records of everything that happens. logs go in, they don't come out (unless you need to investigate something), and automated backups keep everything safe&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The real magic: terraform and automation with AFT
&lt;/h2&gt;

&lt;p&gt;ok, so the structure is cool and all, but clicking buttons in the console is for amateurs. we want speed and consistency. enter the aws control tower account factory for terraform, or AFT for short. this is where things get super interesting, because now we are treating our account vending machine as code&lt;/p&gt;

&lt;p&gt;check it out: you can use this module here&lt;br&gt;
&lt;a href="https://registry.terraform.io/modules/aws-ia/control_tower_account_factory/aws/latest" rel="noopener noreferrer"&gt;https://registry.terraform.io/modules/aws-ia/control_tower_account_factory/aws/latest&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and the code lives here&lt;br&gt;
&lt;a href="https://github.com/aws-ia/terraform-aws-control_tower_account_factory" rel="noopener noreferrer"&gt;https://github.com/aws-ia/terraform-aws-control_tower_account_factory&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;so instead of manually provisioning accounts, i deploy AFT. now when a new team needs an account, they just push a change to a terraform repo. AFT sees the change, spins up the account using control tower, and then (this is the best part) automatically applies all the baseline customization. we are talking security groups, iam roles, and vpc connectivity, all set up automatically. no human error, just pure automation pipelines running smooth&lt;/p&gt;

&lt;p&gt;it basically lets you maintain a global account customization repo, so every single account that gets birthed by control tower comes out pre-configured with your specific tooling and security baselines right out of the box. massive time saver&lt;/p&gt;
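&lt;p&gt;an AFT account request is just a terraform module call in the account request repo. roughly this shape (all values are placeholders; check the AFT docs for the full schema):&lt;/p&gt;

```hcl
module "app1_prod" {
  source = "./modules/aft-account-request"

  control_tower_parameters = {
    AccountEmail              = "app1-prod@example.com"
    AccountName               = "app1-prod"
    ManagedOrganizationalUnit = "app-1-prod"
    SSOUserEmail              = "platform-team@example.com"
    SSOUserFirstName          = "Platform"
    SSOUserLastName           = "Team"
  }

  account_tags = {
    "team" = "app-1"
    "env"  = "prod"
  }

  account_customizations_name = "baseline"
}
```

&lt;p&gt;merge that, and AFT vends the account into the right OU and runs the named customizations against it&lt;/p&gt;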

&lt;h2&gt;
  
  
  6. Conclusion: a secure scalable future
&lt;/h2&gt;

&lt;p&gt;so i step back, drink some coffee, and look at the dashboard. i didn't just build some servers; i built a foundation. the org is secure, scalable, and aligned with business goals. thanks to control tower and that sweet AFT automation, the journey from a messy handful of accounts to a compliant, enterprise-grade environment was actually kinda smooth. access and control are set up with IAM Identity Center SSO &amp;amp; Directory Service, which we can integrate with Microsoft Entra ID as well :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllku1pct1bk23d4s6ll7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllku1pct1bk23d4s6ll7.png" alt="IAM" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Why We Didn’t Move to EKS (Yet): Choosing ECS Over Kubernetes in Production</title>
      <dc:creator>Amaan Ul Haq Siddiqui</dc:creator>
      <pubDate>Sun, 28 Dec 2025 09:47:19 +0000</pubDate>
      <link>https://dev.to/amaanx86/why-we-didnt-move-to-eks-yet-choosing-ecs-over-kubernetes-in-production-1hbo</link>
      <guid>https://dev.to/amaanx86/why-we-didnt-move-to-eks-yet-choosing-ecs-over-kubernetes-in-production-1hbo</guid>
      <description>&lt;p&gt;In the cloud-native world, Kubernetes (EKS) is often treated as the default destination for container orchestration. It’s powerful, flexible, and industry-standard. But for many engineering teams, it’s also overkill.&lt;/p&gt;

&lt;p&gt;We recently faced the classic "build vs. buy" decision for our infrastructure. The pressure to adopt EKS was there, but after evaluating our actual needs, we made a conscious choice to stick with &lt;strong&gt;Amazon ECS&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The result? We saved a handsome amount of money, months of engineering time, and avoided the operational tax that comes with managing Kubernetes clusters. Here is how we architected a robust, scalable production environment on ECS without the K8s complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Kubernetes Tax" We Wanted to Avoid
&lt;/h3&gt;

&lt;p&gt;Kubernetes is amazing, but it requires a significant investment in tooling and maintenance. To run EKS properly in production, you aren't just managing containers; you're managing a platform. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitOps tools:&lt;/strong&gt; ArgoCD or FluxCD for deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Fluentd or similar for log shipping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingress Controllers:&lt;/strong&gt; NGINX or ALB controllers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Constant patching of the control plane and worker nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our team wanted to focus 100% on &lt;strong&gt;shipping application code&lt;/strong&gt;, not managing infrastructure plumbing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fulddb1ipfh1uebh25jv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fulddb1ipfh1uebh25jv2.png" alt="confused k8s engineer" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Hybrid ECS Architecture
&lt;/h3&gt;

&lt;p&gt;We designed a hybrid ECS strategy that leverages the best of both serverless and provisioned compute.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Fargate for Stateless Workloads
&lt;/h4&gt;

&lt;p&gt;For our main application servers and Sidekiq background workers, we used &lt;strong&gt;ECS Fargate&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Servers to Manage:&lt;/strong&gt; We don't worry about OS patching or scaling instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-Sizing:&lt;/strong&gt; We pay only for the vCPU and RAM the tasks actually use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Fargate handles the heavy lifting of launching thousands of containers if needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. EC2 Launch Type for Cron Jobs
&lt;/h4&gt;

&lt;p&gt;Interestingly, we didn't go 100% Fargate. For our scheduled Cron jobs, we stuck with the &lt;strong&gt;EC2 Launch Type&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why?&lt;/strong&gt; Cron jobs run frequently and often use the same base images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Cost Hack:&lt;/strong&gt; By running these on EC2 instances, we can cache Docker layers locally on the host. This drastically reduces data transfer costs from ECR (Elastic Container Registry) and speeds up start times. Fargate can't match this for frequent, short-lived tasks, since each Fargate task pulls its image fresh.&lt;/li&gt;
&lt;/ul&gt;
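&lt;p&gt;That caching behavior comes down to a single ECS agent setting on the EC2 hosts. A minimal sketch of &lt;code&gt;/etc/ecs/ecs.config&lt;/code&gt; (the cluster name here is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# /etc/ecs/ecs.config on each container instance
ECS_CLUSTER=cron-cluster
# Reuse a locally cached image instead of pulling from ECR on every run
ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that &lt;code&gt;ECS_IMAGE_PULL_BEHAVIOR&lt;/code&gt; only applies to the EC2 launch type, which is exactly why it pairs well with these cron workloads.&lt;/p&gt;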

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd5eu7qaqacp2tz1jvar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqd5eu7qaqacp2tz1jvar.png" alt="AWS ECS Console" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Stack: Simple and Managed
&lt;/h3&gt;

&lt;p&gt;We offloaded state management to AWS managed services to keep the compute layer purely ephemeral:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Database:&lt;/strong&gt; Amazon RDS for PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching:&lt;/strong&gt; Amazon ElastiCache (Redis).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  CI/CD: Skipping the Complexity
&lt;/h3&gt;

&lt;p&gt;One of the biggest wins was avoiding the "GitOps" complexity of ArgoCD or Flux. Our pipeline is a straightforward &lt;strong&gt;GitHub Actions&lt;/strong&gt; workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Build:&lt;/strong&gt; Create the Docker image.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Scan:&lt;/strong&gt; Run security vulnerability scans.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Push:&lt;/strong&gt; Upload to ECR.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Deploy:&lt;/strong&gt; Update the ECS Task Definition and force a new deployment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it. No separate synchronization server, no complex CRDs (Custom Resource Definitions), and no managing Helm charts. The pipeline is robust, easy to debug, and requires zero maintenance.&lt;/p&gt;
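&lt;p&gt;As a rough sketch, the four steps map to a single workflow file like the one below. The repo, role ARN, and task definition names are placeholders, and the official &lt;code&gt;aws-actions&lt;/code&gt; steps are one common way to wire this up (the scan step is omitted for brevity):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC role assumption, no long-lived keys
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/deploy   # placeholder
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      # Build and push the image
      - run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/app:${{ github.sha }} .
          docker push ${{ steps.ecr.outputs.registry }}/app:${{ github.sha }}
      # Update the task definition and force a new deployment
      - uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: task-def.json   # placeholder
          service: app
          cluster: production
          force-new-deployment: true
&lt;/code&gt;&lt;/pre&gt;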

&lt;h3&gt;
  
  
  The Verdict: Time is Money
&lt;/h3&gt;

&lt;p&gt;By choosing ECS, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skipped the Learning Curve:&lt;/strong&gt; No need to train the team on &lt;code&gt;kubectl&lt;/code&gt;, manifests, or cluster networking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Operational Overhead:&lt;/strong&gt; No node patching, no control plane upgrades.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lowered Bill:&lt;/strong&gt; We aren't paying for EKS control plane fees ($73/month per cluster) or the overhead of system pods running on worker nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We might move to EKS one day if our requirements for custom networking or service mesh become complex enough to warrant it. But for now, ECS allows us to run a stable, high-performance production environment where the only thing we have to take care of is our application code.&lt;/p&gt;

&lt;p&gt;Sometimes, the best engineering decision is the boring one.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>Memory-Based Auto Scaling: Saving Our Sidekiq Jobs When CPU Metrics Lied to Us</title>
      <dc:creator>Amaan Ul Haq Siddiqui</dc:creator>
      <pubDate>Fri, 26 Dec 2025 15:13:41 +0000</pubDate>
      <link>https://dev.to/amaanx86/memory-based-auto-scaling-saving-our-sidekiq-jobs-when-cpu-metrics-lied-to-us-234k</link>
      <guid>https://dev.to/amaanx86/memory-based-auto-scaling-saving-our-sidekiq-jobs-when-cpu-metrics-lied-to-us-234k</guid>
      <description>&lt;p&gt;We usually just default to &lt;strong&gt;CPU-based scaling&lt;/strong&gt; for our Auto Scaling Groups (ASGs). It’s the standard move. It’s easy, it’s familiar, and for web servers? It usually works fine.&lt;/p&gt;

&lt;p&gt;But sometimes, CPU utilization lies.&lt;/p&gt;

&lt;p&gt;We recently hit a wall where &lt;strong&gt;CPU scaling completely failed us&lt;/strong&gt;. This is the story of how a critical background job kept crashing even though our dashboards said everything was "healthy," and how switching to &lt;strong&gt;memory-based metrics&lt;/strong&gt; saved the day.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Silent Failure
&lt;/h3&gt;

&lt;p&gt;We run a Ruby on Rails app. It relies heavily on Sidekiq for background work. These workers run on EC2 instances in an Auto Scaling Group.&lt;/p&gt;

&lt;p&gt;On paper, everything looked great.&lt;br&gt;
CPU usage? A comfortable 20–30%.&lt;br&gt;
Network? Normal.&lt;br&gt;
Disk? Fine.&lt;br&gt;
AWS said we were green.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But the app was on fire.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Critical jobs were timing out. The queues were piling up. Retries were spiking. Worst of all? Workers were just... vanishing. Processes were dying, but since CPU was low, the auto-scaler didn't care. It didn't launch new instances. It just let them die.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Culprit: The OOM Killer
&lt;/h3&gt;

&lt;p&gt;I dug into the logs, and the answer was right there. Memory.&lt;/p&gt;

&lt;p&gt;Our Sidekiq jobs are hungry. As the Ruby processes chewed through heavy tasks, they ate up more and more RAM. The instances were running out of memory, and the Linux &lt;strong&gt;OOM (Out-of-Memory) Killer&lt;/strong&gt; stepped in to save the server by killing our Sidekiq process.&lt;/p&gt;

&lt;p&gt;The problem? &lt;strong&gt;EC2 doesn't send memory metrics to CloudWatch by default.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, while our RAM was screaming for help, CloudWatch saw low CPU and thought, "Everything is chill."&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Making Memory Visible
&lt;/h3&gt;

&lt;p&gt;You can't fix what you can't see.&lt;/p&gt;

&lt;p&gt;First thing I did was install the &lt;strong&gt;CloudWatch Agent&lt;/strong&gt; on our instances. I needed it to ship custom metrics—specifically &lt;code&gt;mem_used_percent&lt;/code&gt;—to AWS.&lt;/p&gt;
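&lt;p&gt;For reference, the relevant slice of the agent config is tiny. A minimal sketch (&lt;code&gt;CWAgent&lt;/code&gt; is the agent's default namespace):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;append_dimensions&lt;/code&gt; block tags each datapoint with the ASG name, which is what lets a scaling policy aggregate memory across the whole group.&lt;/p&gt;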

&lt;p&gt;As soon as we turned it on, the graphs confirmed it.&lt;br&gt;
CPU was bored at 20%.&lt;br&gt;
Memory? &lt;strong&gt;It was spiking over 85%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi36gymq0zlwi3mswa0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi36gymq0zlwi3mswa0h.png" alt="Custom CloudWatch Memory Metrics" width="800" height="217"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Above: Finally seeing the truth. CPU was low, but RAM was maxed out.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Changing the Rules
&lt;/h3&gt;

&lt;p&gt;We stopped listening to CPU. I set up a &lt;strong&gt;Target Tracking scaling policy&lt;/strong&gt; that looks strictly at memory.&lt;/p&gt;

&lt;p&gt;I told the ASG: &lt;strong&gt;"Keep average memory at 40%."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sounds low, right? But background workers are unpredictable. They need breathing room for sudden spikes. This setup does two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; It adds new servers &lt;strong&gt;before&lt;/strong&gt; we hit the danger zone (75%+).&lt;/li&gt;
&lt;li&gt; It doesn't kill servers too fast, so we avoid "thrashing" (booting up and shutting down constantly).&lt;/li&gt;
&lt;/ol&gt;
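&lt;p&gt;A sketch of that policy via the CLI (the ASG name is hypothetical, and the JSON goes in a local file):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# memory-policy.json
{
  "TargetValue": 40.0,
  "CustomizedMetricSpecification": {
    "MetricName": "mem_used_percent",
    "Namespace": "CWAgent",
    "Dimensions": [
      { "Name": "AutoScalingGroupName", "Value": "sidekiq-workers" }
    ],
    "Statistic": "Average"
  }
}

# Attach it to the ASG as a target tracking policy
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name sidekiq-workers \
  --policy-name keep-memory-at-40 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration file://memory-policy.json
&lt;/code&gt;&lt;/pre&gt;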

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslb95n429ujvaoqx3nc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslb95n429ujvaoqx3nc9.png" alt="ASG Dynamic Scaling Policy" width="800" height="473"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Above: The new policy. If RAM goes up, we scale out.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Result
&lt;/h3&gt;

&lt;p&gt;Instant fix.&lt;/p&gt;

&lt;p&gt;The scaling became predictive. Instead of waiting for a crash, the cluster sees the memory pressure building and adds more power &lt;em&gt;before&lt;/em&gt; things break.&lt;/p&gt;

&lt;p&gt;We haven't seen a single OOM kill since. The Sidekiq service is happy, the queues are empty, and I can finally sleep.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvizpssa0dve070ju5s2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvizpssa0dve070ju5s2u.png" alt="Sidekiq Service Status" width="800" height="177"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Above: Stable, boring, and running perfectly.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why CPU Scaling Sucks for Workers
&lt;/h3&gt;

&lt;p&gt;Here's the takeaway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web traffic&lt;/strong&gt; is usually CPU-heavy. You get a request, you process it, you send a response. CPU spikes, you scale. Simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Background workers&lt;/strong&gt; (like Sidekiq, Celery, Bull) are different. They load big files. They process heavy data objects. They eat RAM. Your CPU can be totally asleep while your memory is completely full.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;If you're running background jobs in an ASG and you're only watching CPU... you might be one heavy job away from a silent outage.&lt;/p&gt;

&lt;p&gt;Don't just use the default settings. &lt;strong&gt;Scale on what actually hurts your application.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Note: The screenshots above are from a test setup. I’ve hidden sensitive stuff like Account IDs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>rails</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>My First AWS Community Day Dubai &amp; that too as a Speaker :)</title>
      <dc:creator>Amaan Ul Haq Siddiqui</dc:creator>
      <pubDate>Thu, 25 Dec 2025 08:21:38 +0000</pubDate>
      <link>https://dev.to/amaanx86/my-first-aws-community-day-dubai-that-too-as-a-speaker--4hnl</link>
      <guid>https://dev.to/amaanx86/my-first-aws-community-day-dubai-that-too-as-a-speaker--4hnl</guid>
      <description>&lt;h1&gt;
  
  
  My First AWS Community Day Dubai (And That Too as a Speaker!)
&lt;/h1&gt;

&lt;p&gt;Speaking at an AWS event was always on my "someday" list.&lt;br&gt;
But "someday" arrived a lot sooner than I expected.&lt;/p&gt;

&lt;p&gt;This past Sunday, I attended &lt;strong&gt;AWS Community Day UAE 2025&lt;/strong&gt; in Dubai.&lt;br&gt;
And I didn't just attend.&lt;br&gt;
&lt;strong&gt;I gave my first-ever talk.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Announcement: Is This Real?
&lt;/h2&gt;

&lt;p&gt;When the speaker list dropped, seeing my name next to industry veterans felt surreal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session:&lt;/strong&gt; The Winning Stack: Lessons from a DevOps Hackathon Winner&lt;br&gt;
&lt;strong&gt;Who:&lt;/strong&gt; Me (DevOps Engineer at SUDO Consultants)&lt;/p&gt;

&lt;p&gt;Representing my team and sharing my own messy, real-world journey in front of the AWS community? Humbling. Terrifying. But mostly, exciting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wnkl97wep0hnbnjn31k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wnkl97wep0hnbnjn31k.png" alt="Speaker Announcement" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Talk: No Fluff, Just Real DevOps
&lt;/h2&gt;

&lt;p&gt;I didn't want to give a lecture on textbook theory. We have documentation for that.&lt;br&gt;
I wanted to talk about what actually happens when things are on fire.&lt;/p&gt;

&lt;p&gt;I focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How winning a DevOps Hackathon changed my perspective.&lt;/li&gt;
&lt;li&gt;Why time pressure forces you to build better architectures.&lt;/li&gt;
&lt;li&gt;How to build a secure CI/CD pipeline that actually works in production.&lt;/li&gt;
&lt;li&gt;The massive gap between "Best Practices" and "Real Reality."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My goal was simple:&lt;br&gt;
&lt;strong&gt;Share what actually works.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Sunday Well Spent
&lt;/h2&gt;

&lt;p&gt;The event was on a Sunday, and the turnout was incredible.&lt;br&gt;
The best part wasn't even the sessions—it was what happened in the hallways.&lt;/p&gt;

&lt;p&gt;I saw students asking about cloud careers for the first time.&lt;br&gt;
I saw senior engineers swapping war stories about production outages.&lt;br&gt;
I saw a community that actually wants to help each other.&lt;/p&gt;

&lt;p&gt;One attendee told me later that hearing my story—going from a student to a DevOps pro—helped them understand what skills actually matter.&lt;br&gt;
That feedback alone? Worth all the prep time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/posts/muaaz-syed-4367242b7_awscommunityday-firsttimers-cloudcommunity-ugcPost-7386031858746060800-jZPo" rel="noopener noreferrer"&gt;Check out this post from the event&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why You Should Go
&lt;/h2&gt;

&lt;p&gt;Events like this aren't just about the tech.&lt;br&gt;
They are about the reality check.&lt;/p&gt;

&lt;p&gt;You get honest career stories you won't find in a whitepaper.&lt;br&gt;
You get networking that feels human, not transactional.&lt;br&gt;
For anyone starting in Cloud or DevOps, this is how you accelerate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92cjydancd5ne5lxzqk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92cjydancd5ne5lxzqk2.png" alt="Community Backlight" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Big Thanks
&lt;/h2&gt;

&lt;p&gt;I have to give a shoutout to the people who made this happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SUDO Consultants&lt;/strong&gt; for trusting me and pushing me forward.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;AWS User Group UAE&lt;/strong&gt; volunteers—you guys run a world-class show.&lt;/li&gt;
&lt;li&gt;The audience for the questions and the energy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Being part of a group that values sharing over competition is rare. I don't take it for granted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lk9ddx0755vywec7ohy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lk9ddx0755vywec7ohy.png" alt="Event Image" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  To Students &amp;amp; First-Timers
&lt;/h2&gt;

&lt;p&gt;If you're a student or a junior engineer wondering if you should go to these events... or if you're "ready" to speak at one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't wait until you feel ready.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You will never feel 100% ready.&lt;br&gt;
These events don't reward perfection. They reward showing up.&lt;br&gt;
Every expert on that stage started exactly where you are right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;This was my &lt;strong&gt;first&lt;/strong&gt; AWS Community Day Dubai.&lt;br&gt;
It definitely won't be my last.&lt;/p&gt;

&lt;p&gt;Thank you to the UAE AWS community for the platform and the inspiration.&lt;br&gt;
Here’s to more learning, more sharing, and better clouds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpoe6bm4dn6fjocv42ah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpoe6bm4dn6fjocv42ah.png" alt="Receiving the speaker certificate" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloudnative</category>
      <category>aws</category>
      <category>devjournal</category>
    </item>
    <item>
      <title>Why AWS CodeBuild Can Replace Self-Hosted GitHub Actions Runners</title>
      <dc:creator>Amaan Ul Haq Siddiqui</dc:creator>
      <pubDate>Sat, 25 Oct 2025 10:14:10 +0000</pubDate>
      <link>https://dev.to/amaanx86/why-aws-codebuild-can-replace-self-hosted-github-actions-runners-3m0m</link>
      <guid>https://dev.to/amaanx86/why-aws-codebuild-can-replace-self-hosted-github-actions-runners-3m0m</guid>
      <description>&lt;p&gt;Building CI/CD pipelines with GitHub Actions is usually pretty smooth. But the moment you decide to manage your own runners? That is where the headache starts.&lt;/p&gt;

&lt;p&gt;Recently I was trying to deploy a self-hosted runner on &lt;strong&gt;ECS Fargate&lt;/strong&gt; and honestly... it was a pain. I ran into so many issues with Docker-in-Docker (DinD) and realized I was just burning money on idle resources.&lt;/p&gt;

&lt;p&gt;So I switched to &lt;strong&gt;AWS CodeBuild&lt;/strong&gt;. Here is why.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem with Fargate Runners
&lt;/h3&gt;

&lt;p&gt;I thought putting runners on Fargate would be "serverless" and easy. I was wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Docker-in-Docker Nightmare&lt;/strong&gt;&lt;br&gt;
Most of my workflows need to build Docker images. But Fargate doesn't support DinD natively. You have to use messy workarounds to get it running, and it adds a lot of complexity to something that should be simple.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zw0tfj37rbv2r4orcya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zw0tfj37rbv2r4orcya.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Paying for Air&lt;/strong&gt;&lt;br&gt;
A self-hosted runner consumes resources even when it's doing absolutely nothing. You are paying for CPU and RAM just to wait for a job. For a small team or a side project, that creates a bill you don't need.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88tgqnrzw550qjpw1g40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88tgqnrzw550qjpw1g40.png" alt=" " width="800" height="733"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why CodeBuild is Better
&lt;/h3&gt;

&lt;p&gt;I decided to try running my GitHub Actions workflows directly through &lt;strong&gt;AWS CodeBuild&lt;/strong&gt; as a Proof of Concept.&lt;/p&gt;

&lt;p&gt;It just worked. Seamlessly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native GitHub Support:&lt;/strong&gt; CodeBuild can run your GitHub Actions jobs directly. You don't need complex connectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pay-Per-Use:&lt;/strong&gt; This is the biggest one. No idle costs. You only pay when a build is actually running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private Access:&lt;/strong&gt; Since it's in your AWS account, it connects to your VPCs and private subnets without extra hassle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; You can run multiple builds in parallel and never worry about queueing or adding more runner instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;The setup was surprisingly simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a CodeBuild project&lt;/li&gt;
&lt;li&gt;Connect your GitHub repo (OIDC or Access Token)&lt;/li&gt;
&lt;li&gt;Point your workflow to the CodeBuild project&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it.&lt;/p&gt;
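&lt;p&gt;Concretely, step 3 is just a &lt;code&gt;runs-on&lt;/code&gt; label in your workflow. CodeBuild picks up any job whose label matches the reserved format of the project name plus the run ID and attempt (the project name below is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;jobs:
  build:
    # Reserved label format: codebuild-PROJECTNAME-RUNID-RUNATTEMPT
    runs-on: codebuild-my-runner-project-${{ github.run_id }}-${{ github.run_attempt }}
    steps:
      - uses: actions/checkout@v4
      # Docker builds work here without any DinD workarounds
      - run: docker build -t myapp .
&lt;/code&gt;&lt;/pre&gt;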

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1oe5disyfxio3imzdiq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1oe5disyfxio3imzdiq.png" alt="AWS CodeBuild Integration GitHub" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I deployed my apps directly from there and it skipped all the drama I had with ECS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8skruaqajfxzdrrcyivb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8skruaqajfxzdrrcyivb.png" alt="GitHub Image" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cfqgv67088jeimdgoyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cfqgv67088jeimdgoyk.png" alt="GitHub Image" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Self-hosted runners give you control but they also bring operational overhead that most of us just don't have time for.&lt;/p&gt;

&lt;p&gt;If you are already on AWS and struggling with DinD on Fargate or just tired of managing runner fleets... check out CodeBuild. It’s cleaner, cheaper and it just gets out of your way so you can ship code.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>github</category>
      <category>cloudcomputing</category>
    </item>
  </channel>
</rss>
