DEV Community: Amaan Ul Haq Siddiqui

I Stopped Clicking Through the AWS Pricing Calculator. Now I Just Describe the Architecture.

Amaan Ul Haq Siddiqui — Sun, 28 Jun 2026 09:51:13 +0000

If you have built an estimate in the AWS Pricing Calculator by hand, you know the drill. Open calculator.aws, search a service, click in, stare at twenty fields half of which you do not need, guess at the ones the form does not explain, pick a region, repeat for every service. Then redo the whole thing next week when the customer asks what it looks like in Frankfurt.

For presales that is not a small annoyance. It is the gap between giving a number on the call and saying "let me get back to you." I wired the AWS Pricing Calculator MCP into Claude, and the first real estimate I built took one sentence.

What it is

An MCP server - an AWS Samples project - that exposes the Pricing Calculator as tools an agent can call. You describe the workload, the agent assembles the estimate, the server saves it to the real calculator, and you get a shareable calculator.aws URL back. Same link you would have built by hand, minus the form.

Three things make it usable in front of a customer:

No AWS credentials. It hits the public, unauthenticated calculator.aws endpoints. You are not pointing it at an account or assuming a role. There is no blast radius.
Live definitions. It pulls the calculator manifest at runtime - about 436 services - so it is current, not a snapshot from six months ago.
Real, editable estimates. The URL it returns opens in the actual calculator. Tweak it, send it, whatever. The agent just did the boring part.

It runs over stdio for local clients like Claude Desktop, Kiro, and Cursor, or over HTTP (MCP_TRANSPORT=http) if you want it hosted. It also handles the aws-iso and aws-eusc partitions, which matters for sovereign and regulated work.

Context is the whole job

The honest part: it is amazing when you feed it the right context. Ask for "an estimate for a web app" and you get back a web app someone else imagined. The calculator never knew your traffic - you did. The MCP does not change that.

What it changes is the translation. Once you know the shape - two m5.large instances, an ALB, 500 GB of S3, daily backups - turning that into a saved, priced, shareable estimate is instant instead of twenty minutes of clicking. You bring the requirements, the agent does the assembly. That split is the right one.

A real estimate, start to finish

Here is one I just built. I asked for a small web tier in us-east-1: two m5.large instances on demand with gp3 storage, one Application Load Balancer, 500 GB of S3 Standard, and AWS Backup holding 30 days of daily EBS recovery points. I named it "Presales Demo."

The agent searched the services, pulled the fields, filled them, ran a preflight check, and saved it:

{
  "name": "EC2 + ALB + S3 + Backup - Presales Demo",
  "sharable_url": "https://calculator.aws/#/estimate?id=59b31cbdf4a0a628b10748930fe53c8c023fd080",
  "services": [
    { "success": true, "service": "ec2Enhancement",         "group": "Prod" },
    { "success": true, "service": "applicationLoadBalancer", "group": "Prod" },
    { "success": true, "service": "amazonS3Standard",        "group": "Prod" },
    { "success": true, "service": "ebsBackup",               "group": "Prod" }
  ]
}

That link is live - it opens in the calculator with all four services priced and grouped under "Prod," ready to edit or hand off. I never opened a browser to make it.

The preflight is the part I lean on. Some calculator fields are optional to the save API but required by the pricing engine - leave one out and the estimate saves at $0. On my first pass it caught that EC2 needed tenancy set and flagged it instead of saving a wrong number. A confidently wrong estimate is worse than no estimate, so a tool that refuses to ship one is doing me a favor.

Import, swap region, re-export

The other half of presales is reworking last quarter's estimate, not starting fresh. The MCP imports any estimate by URL or ID - as JSON to edit (swap the region, bump the counts, re-export) or as Markdown to hand an LLM for analysis. The region-swap loop drops from "rebuild it in the UI" to import, change one field, export.

Why it matters for presales

The old loop: gather requirements, spend twenty minutes clicking them in, fat-finger a field, redo it, reply by end of week. The estimate was the bottleneck.

The new loop: describe the architecture while the customer is still on the call, get a shareable URL before it ends. "Actually it is closer to 80 million requests" - fine, update it, the link is still warm. The estimate moves to where the requirements live, which is the conversation. That is the real change, more than the speed.

Running it

It is an AWS Samples repo, so install follows the README: clone it, install deps, register it as an MCP server in your client. After that you do not call tools by name - you describe the architecture and let the agent reach for them in order. Source, the verified configs, and setup are in the repo on GitHub.

The summary is simple. It does not know AWS pricing for you and it does not invent your requirements. It deletes the twenty minutes between knowing what the customer needs and having a link to send. For anyone building these all day, that twenty minutes was the job.

Originally published at amaanx86.github.io.

DNS as Code: Managing Cloudflare Records with DNSControl

Amaan Ul Haq Siddiqui — Sat, 23 May 2026 14:16:44 +0000

Every DNS change I have made through a dashboard felt wrong. Click the right zone, add the right record, hope you did not fat-finger the value. No diff. No review. No way to know what the zone looked like six months ago without digging through provider audit logs that may or may not exist.

DNS controls traffic. It determines where your mail goes, which servers handle your HTTPS, which CAs are allowed to issue certificates for your domain. It deserves the same discipline as application code: version control, peer review, automated deployment.

DNSControl is how you get there. It is an open-source tool from Stack Exchange that treats your DNS zone as code. You define records in a config file, run a preview to see what would change, and push to apply. It supports Cloudflare, Route53, and about 30 other providers.

This post walks through my working POC at amaanx86/cloudflare-dnscontrol, built on runonaws.dev on Cloudflare DNS. Everything here is forkable.

The model

The repo has a simple structure:

dnsconfig.js           # Main DNSControl config
domains/               # One JSON file per subdomain
  @.json               # Apex / root records
  amaan.json           # amaan.runonaws.dev
  stg.json             # stg.runonaws.dev
  ...
utils/
  reserved.json        # Subdomain names that are protected
  internal.json        # Subdomains fully ignored by DNSControl
.github/workflows/
  preview-records.yml  # PR: dnscontrol preview → PR comment
  publish-records.yml  # Push to main: dnscontrol push → live

dnsconfig.js is the single source of truth for the zone. domains/ is where contributors make changes. utils/ holds policy that applies zone-wide. Adding a subdomain means dropping a JSON file in domains/ and opening a PR. Nothing else.

Declaring a subdomain

Each subdomain is a JSON file in domains/. The filename is the subdomain label, so amaan.json becomes amaan.runonaws.dev:

{
    "owner": {
        "username": "amaanx86",
        "email": "[email protected]"
    },
    "records": {
        "CNAME": "amaanx86.github.io",
        "TXT": "google-site-verification=aXFvR-FIaRAyOQDhDkfHac3qIX56pg9TwSaJpnq-S3A"
    },
    "proxied": true
}

The owner field is not interpreted by DNSControl. It is accountability metadata that answers the question every ops team eventually asks: whose record is this and who do you contact when it breaks? In a shared zone it is the difference between a record you can quietly delete and one that needs a conversation first.

proxied: true enables Cloudflare's orange-cloud proxy. false is DNS-only. Every subdomain declares this explicitly so there is no zone-wide default to accidentally inherit. NS delegations, mail servers, and internal services should not be proxied, and an explicit field per record makes that impossible to get wrong silently.

The records object supports A, AAAA, CAA, CNAME, DS, MX, NS, SRV, TLSA, and TXT. Multiple values per type where the record type allows it.

What to ignore

DNSControl's default behavior is to reconcile your zone to exactly match the code. That is dangerous at the apex. Root-domain records like NS and DNSKEY are auto-managed by Cloudflare. If DNSControl deletes them because they are not in your code, you lose DNS for the entire domain.

The IGNORE directive tells DNSControl to leave certain records alone:

IGNORE("@", "NS"),
IGNORE("@", "DNSKEY"),
IGNORE("@", "MX"),
IGNORE("@", "TXT"),
IGNORE("*._domainkey", "TXT"),
IGNORE("_acme-challenge", "TXT"),
IGNORE("_dmarc", "TXT"),

In this setup, CAA records are the only root-domain records managed as code. CAA controls which certificate authorities can issue certs for your domain, so having it in version control with a review process is worth the overhead. Everything else at the apex is either auto-managed by Cloudflare or changes rarely enough that the dashboard is fine.

The _acme-challenge and *._domainkey ignores keep ACME certificate issuance and DKIM from getting blown away by a push. utils/internal.json holds subdomains that DNSControl should not touch at all, useful for anything managed by a separate system.

Reserving subdomains

If anyone can add a subdomain via PR, nothing stops someone from claiming admin, api, billing, or auth before the platform team does.

Reserved subdomains solve this by pre-populating those names with placeholder A records pointing to 203.0.113.0. That address is from TEST-NET-3 (RFC 5737), a documentation-reserved block that will never route anywhere real. The subdomain exists in DNS, Cloudflare proxies it, requests to it return a Cloudflare error. Claiming it requires removing the placeholder, which requires a PR, which requires approval.

utils/reserved.json covers about 80 names: admin, api, app, auth, billing, blog, cdn, mail, and the rest of the usual candidates.

The zone timestamp

Every push writes a TXT record at _zone-updated with the current Unix timestamp. You can check it any time:

dig TXT _zone-updated.runonaws.dev +short

Two uses: confirming a push actually landed, and spotting records that look stale. If the timestamp predates your last push, something went wrong in the pipeline.

The CI/CD pipeline

Two GitHub Actions workflows handle the deployment lifecycle.

On pull request, the preview workflow runs dnscontrol preview against Cloudflare and posts the output as a PR comment. Reviewers see exactly which records will be added, modified, or deleted before anything merges. The comment updates on force-push so it always reflects the current branch state. No dashboard access needed to understand what a change does.

On push to main, the publish workflow runs dnscontrol push and the changes go live.

Both workflows only fire on changes to domains/*, dnsconfig.js, utils/reserved.json, or the workflow files themselves. Unrelated repo commits do not trigger a DNS push.

The Cloudflare API token needs Zone.DNS write permission scoped to the specific zone, not the whole account.

What the workflow looks like in practice

Someone needs to add staging.runonaws.dev pointing to an ALB. They:

Create a branch: dns/add-staging
Add domains/staging.json:

{
    "owner": {
        "username": "amaanx86",
        "email": "[email protected]"
    },
    "records": {
        "CNAME": "my-alb-1234567890.us-east-1.elb.amazonaws.com"
    },
    "proxied": false
}

Open a PR
Preview workflow posts the diff as a comment: one CNAME being added
Reviewer confirms the target and approves
Merge to main, Cloudflare updates within seconds

The contributor never opened the Cloudflare dashboard. The reviewer saw the exact change before it applied. The commit history shows who added the record, when, and why.

The sync catches what the dashboard quietly added

Here is where the reconciliation model earns its keep. This is the actual preview comment the GitHub Actions bot posted on PR #10:

******************** Domain: runonaws.dev
4 corrections (cloudflare)

± MODIFY _zone-updated.runonaws.dev TXT ("1758268077330" ttl=300) -> ("1779036943050" ttl=300)
- DELETE argo.runonaws.dev A 192.168.1.57 proxy=false ttl=1
- DELETE test.runonaws.dev CNAME e73da4f6-a287-41eb-bfe9-563f73037d68.cfargotunnel.com. proxy=true ttl=1
+ CREATE test.runonaws.dev A 203.0.113.0 proxy=true ttl=1

argo.runonaws.dev and test.runonaws.dev were added directly in the Cloudflare dashboard, outside of the repo. DNSControl has no knowledge of them. From its perspective, those records do not belong in the zone and will be deleted on the next push. The preview comment is the warning before that happens.

test.runonaws.dev shows up as both a DELETE and a CREATE. The CNAME pointing to a Cloudflare Tunnel was manually added; the replacement A record pointing to 203.0.113.0 comes from reserved.json. So the sync removes the manual record and installs the reserved placeholder in its place.

This is exactly the behavior you want, but it means any record added through the dashboard will eventually get wiped. Before merging a PR, check the preview output for unexpected DELETEs. If a record that should stay is listed for deletion, port it into a domains/ JSON file first. Once it is in the repo, it is safe.

The alternative is to merge and let the sync clean up the manual additions. Either way, the zone ends up matching the code.

Provider portability

DNSControl abstracts the provider. Switching from Cloudflare to Route53 is a one-line change in dnsconfig.js. The domain JSON files, reserved and internal lists, and CI pipelines stay the same. Zone definitions are not coupled to the DNS provider.

The POC uses Cloudflare because that is where runonaws.dev lives, but the same setup runs against any of the 30+ supported providers.

Getting started

The repo is at amaanx86/cloudflare-dnscontrol. Fork it, swap in your domain name and Cloudflare API token, and populate domains/ with your existing records.

The hardest part is the initial import. DNSControl's get-zones command can export an existing zone to its DSL format, giving you a starting point for the JSON conversion. Do it once to get in sync, and every change after that is a PR.

DNS is infrastructure. Treat it like infrastructure.

Originally published at amaanx86.github.io.

I Got Tired of Passing --profile on Every OCI CLI Command

Amaan Ul Haq Siddiqui — Fri, 22 May 2026 18:12:00 +0000

This is a problem you only run into if you work for Oracle directly or you work at a Cloud/DevOps MSP managing OCI for multiple clients. Most people will never touch it. If you are reading this, you are probably not most people.

The OCI CLI already demands a --compartment-id on almost every command. Compartment IDs are long, they are not memorable, and you need a different one depending on what you are looking at. That was already a nightmare before you add multiple tenancies to the picture. Once you are juggling prod, staging, dev, and a handful of client accounts, passing --profile on top of --compartment-id on every single command becomes the kind of friction that breaks you.

So I wrote ocswitch.

What the OCI CLI Actually Gives You

The OCI CLI reads config from ~/.oci/config by default. You can point it at a different file with OCI_CONFIG_FILE, and select a named profile within that file with OCI_CLI_PROFILE. That is the full surface area you have to work with.

The multi-profile story in the official tooling is basically: put multiple [PROFILE_NAME] sections in one config file and pass --profile PROFILE_NAME every time. If you forget the flag, you hit whichever profile is in [DEFAULT]. This works but it is tedious, and it is session-scoped - export the env var in one terminal, open another, you are back to default.

ocswitch fixes the persistence problem by writing the active profile to ~/.oci/active_profile and restoring it at shell startup.

How It Works

Profiles are separate config files in ~/.oci/:

~/.oci/config_prod
~/.oci/config_staging
~/.oci/config_dev

One file per tenancy, named config_<profile>. When you switch:

ocswitch prod

Two things happen: ~/.oci/active_profile gets written with the string prod, and OCI_CONFIG_FILE plus OCI_CLI_PROFILE are exported in the current shell. Every oci command you run after that hits the prod tenancy.

The persistence part lives at the top of the script, outside the function, so it runs every time the shell sources the file:

_oci_state="${HOME}/.oci/active_profile"
if [[ -f "$_oci_state" ]]; then
  _oci_profile=$(cat "$_oci_state")
  _oci_cfg="${HOME}/.oci/config_${_oci_profile}"
  if [[ -f "$_oci_cfg" ]]; then
    export OCI_CONFIG_FILE="$_oci_cfg"
    export OCI_CLI_PROFILE="config_${_oci_profile}"
  fi
fi

Open a new terminal, source kicks in, reads the state file, exports the right vars. The active profile follows you across sessions without any extra steps.

What It Looks Like

ocswitch list shows all detected profiles (anything matching ~/.oci/config_*) with the active one highlighted in Oracle red. ocswitch clear removes the state file and unsets both env vars.

Tab completion is included for both bash and zsh. Hit tab after ocswitch and you get list, clear, and all your profile names.

The ocid Bonus

There is a second function in the script: ocid. It reads the active config, pulls the tenancy OCID and user OCID out of it, and calls the OCI IAM API to resolve the human-readable names:

Tenant Name: mycompany-production
Tenant ID:   ocid1.tenancy.oc1..aaaa...
User Email:  user@example.com

Useful when you have switched between a few profiles and want to confirm exactly which tenancy you are currently pointing at before running something destructive.

Install

zsh

curl -fsSL https://gist.githubusercontent.com/amaanx86/2621a181e2f4b572a1904d65748d66bd/raw/ocswitch.zsh \
  -o ~/.oci/ocswitch.zsh
echo 'source ~/.oci/ocswitch.zsh' >> ~/.zshrc
source ~/.zshrc

bash

curl -fsSL https://gist.githubusercontent.com/amaanx86/2621a181e2f4b572a1904d65748d66bd/raw/ocswitch.bash \
  -o ~/.oci/ocswitch.bash
echo 'source ~/.oci/ocswitch.bash' >> ~/.bashrc
source ~/.bashrc

Usage

ocswitch               # show help + logo
ocswitch list          # list profiles (active highlighted)
ocswitch <profile>     # switch profile (persists across all terminals)
ocswitch clear         # clear active profile

Profile Setup

Each profile is a standard OCI config file at ~/.oci/config_<profile>. The section header inside the file uses the same name as the file itself, not [DEFAULT]:

[config_prod]
user=ocid1.user.oc1..aaaa...
fingerprint=xx:xx:xx:...
tenancy=ocid1.tenancy.oc1..aaaa...
region=us-ashburn-1
key_file=~/.oci/oci_api_key.pem

So ~/.oci/config_staging has [config_staging] inside, ~/.oci/config_dev has [config_dev], and so on.

The full source is in this gist. Two files, one for bash and one for zsh, no external dependencies beyond the OCI CLI itself.

Originally published at amaanx86.github.io.

Bypassing Cloudflare for True Origin Health Checks with Blackbox Exporter and CoreDNS

Amaan Ul Haq Siddiqui — Tue, 28 Apr 2026 19:02:00 +0000

You have 100+ web endpoints behind Cloudflare. You need to know if the origin is healthy. Not the edge. Not the CDN cache. The actual server behind it.

Prometheus Blackbox Exporter is the standard answer for HTTP probing. Point it at a URL, it makes a request, you get probe_success, probe_duration_seconds, probe_http_status_code. Clean, simple, works.

Until you put Cloudflare in front of everything.

The problem

When Blackbox probes https://app.example.com, DNS resolves to Cloudflare's edge. The request hits their network, gets proxied, maybe served from cache. Your probe comes back 200 OK. Meanwhile the origin server is on fire and has been returning 502s for twenty minutes. Cloudflare is masking the failure by serving a stale response or its own error page with a 200-ish status.

Even when it does proxy through to origin, you are measuring Cloudflare-to-origin latency plus edge processing time. Your probe_duration_seconds metric is useless for understanding actual server response time.

The health check is lying to you. It is monitoring Cloudflare's availability, not yours.

Naive approaches that do not work

"Just use the origin IP directly."

https://198.51.100.10 - sure, but now TLS fails. The certificate is issued for app.example.com, not for an IP address. You could skip TLS verification (insecure_skip_verify: true), but then you are not testing what your users actually experience. And many applications inspect the Host header for routing. Hitting a raw IP with no SNI or wrong Host header gets you a 404 or default backend, not a real health check.

"Use /etc/hosts on the monitoring server."

Map app.example.com to the origin IP in /etc/hosts. This works for exactly one server. Then you have 80 endpoints across a fleet of monitoring nodes and you are managing hosts files by hand. You update an origin IP and forget to update the hosts file. You add a new domain and forget to add the entry. It is a maintenance problem disguised as a solution.

Also, depending on your deployment model, you might not even control /etc/hosts. And even if you do, the change is system-wide. Every process on that host now bypasses Cloudflare for those domains, which is probably not what you want.

"Set a custom DNS server in Prometheus."

Prometheus does not probe endpoints itself. It tells Blackbox Exporter to probe them. Changing DNS settings on the Prometheus host does nothing - Blackbox is the one resolving the domain. You need to control DNS resolution at the Blackbox process level.

"Curl with --resolve."

Works great for ad-hoc debugging. Not a thing you can configure in Blackbox Exporter. The prober does not expose a per-target DNS override. It resolves using whatever DNS the process has access to.

Architecture

                         Public DNS (Cloudflare)
                         Returns edge IPs -- NOT what we want
                                  |
                                  X  (bypassed)
                                  |
+----------------------------+    |    +----------------------------+
|      Prometheus            |    |    |      CoreDNS Sidecar       |
|                            |    |    |                            |
|  "probe app.example.com"   |    |    |  hosts.override:           |
|         |                  |    |    |    198.51.100.10            |
|         v                  |    |    |        app.example.com     |
|  +------------------+      |    |    |        api.example.com     |
|  | Blackbox Exporter |------+---------->  203.0.113.50            |
|  |                  |  DNS query |    |        docs.example.org   |
|  |  dns: 172.20.0.53|<----------+    |                            |
|  +------------------+      |    |    |  fallthrough -> 8.8.8.8    |
|         |                  |    |    +----------------------------+
+---------|------------------+    |
          |                       |
          | TLS + SNI: app.example.com
          | Host: app.example.com
          | IP: 198.51.100.10
          |
          v
+----------------------------+
|   Origin Load Balancer     |
|   198.51.100.10:443        |
|                            |
|   Serves real certificate  |
|   Routes by Host header    |
+----------------------------+
          |
          v
+----------------------------+
|   Application Server       |
|   Actual origin response   |
+----------------------------+

The trick is one line in the Blackbox Exporter config: point its DNS at the CoreDNS sidecar instead of the system resolver. CoreDNS returns origin IPs for overridden domains, falls through to public DNS for everything else. Blackbox never knows the difference - it just resolves and probes like normal.

The actual solution: a DNS sidecar

The insight is simple once you see it: if you control what DNS server Blackbox Exporter uses, you control where its probes land. You do not need to touch Blackbox's config at all. You do not need to modify how Prometheus scrapes. You just need a DNS resolver that lies about specific domains - returns your origin load balancer IP instead of Cloudflare's edge IP.

CoreDNS is perfect for this. It is a lightweight, pluggable DNS server. The relevant config is about six lines:

. {
    hosts /etc/coredns/hosts.override {
        fallthrough
    }

    forward . 8.8.8.8

    log
    errors
}

That is it. CoreDNS loads a hosts file for overridden domains. Anything not in the file falls through to 8.8.8.8 for normal resolution.

The hosts override file is a plain hosts-format file. Nothing exotic:

# Origin LB for production
198.51.100.10  app.example.com
198.51.100.10  api.example.com
198.51.100.10  cdn.example.com

# Origin LB for staging
198.51.100.20  staging.example.com
198.51.100.20  staging-api.example.com

# Origin LB for partner sites
203.0.113.50   docs.example.org
203.0.113.50   status.example.org

203.0.113.71   shop.example.net
203.0.113.72   blog.example.net

Each line maps a domain to the public IP of the actual load balancer in front of the origin servers. Multiple domains can point to the same LB. Commented lines for decommissioned endpoints. Straightforward.

Then you point Blackbox Exporter's DNS resolution at this CoreDNS instance. If CoreDNS is running at 172.20.0.53, you set Blackbox's DNS to that address. The mechanism depends on your runtime - a dns directive, a resolv.conf mount, whatever your environment supports. The point is that Blackbox resolves domains through CoreDNS, not through the system default.

What happens at probe time

The flow:

Prometheus tells Blackbox: probe https://app.example.com
Blackbox resolves app.example.com via CoreDNS at 172.20.0.53
CoreDNS checks its hosts override file, finds the entry, returns 198.51.100.10
Blackbox connects to 198.51.100.10:443, sends TLS ClientHello with SNI app.example.com
The load balancer presents its certificate for app.example.com, TLS completes normally
Blackbox sends the HTTP request with Host: app.example.com
The application responds directly - no Cloudflare in the path

TLS works because the domain name is preserved. The origin LB has the real certificate. SNI and Host header are correct because Blackbox thinks it is talking to app.example.com - it just happens to have resolved to a different IP than Cloudflare's.

This is the key: you are not changing what Blackbox probes, you are changing where the name resolves to. Everything else - TLS negotiation, Host header, path, response validation - stays exactly the same.

The Prometheus side

Nothing changes in how you write your scrape config. Blackbox probing uses the standard multi-target pattern:

- job_name: blackbox_https_prod
  metrics_path: /probe
  scrape_interval: 60s
  scrape_timeout: 15s
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - app.example.com
        - api.example.com
        - shop.example.net
        - blog.example.net
        - docs.example.org
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
      replacement: https://$1
    - source_labels: [__address__]
      target_label: instance
    - target_label: __address__
      replacement: <blackbox-host>:9115

Targets are just domain names. The relabel chain rewrites them into probe parameters and sets the instance label to the human-readable domain. Nothing here is aware of Cloudflare or DNS overrides. That is all handled one layer down.

This separation matters. The person adding a new endpoint to monitoring does not need to know about the DNS plumbing. They add the domain to the target list. If the domain needs a DNS override, someone adds it to the hosts file. Two different concerns, two different files.

Alerting on actual origin health

With Cloudflare out of the path, your alerts are honest. A probe_success == 0 means the origin is actually down, not that Cloudflare is having a bad day.

- alert: EndpointDown
  expr: probe_success{job=~"blackbox_https_(prod|nonprod)"} == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Endpoint DOWN: {{ $labels.instance }}"

- alert: EndpointHighLatency
  expr: probe_duration_seconds{job=~"blackbox_https_(prod|nonprod)"} > 3
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High latency: {{ $labels.instance }} ({{ $value | printf \"%.2f\" }}s)"

for: 0m on the down alert - no waiting. If the origin is unreachable, you want to know immediately. The latency alert uses for: 10m because transient spikes happen and you do not want noise. These thresholds are tunable, but the point is that the data feeding them is real. You are measuring origin behavior, not CDN behavior.

Why not just use /etc/hosts at scale

It is worth expanding on why the hosts-file-on-the-monitoring-host approach breaks down, because it is the first thing most people try.

Scope creep. You start with five domains. Then twenty. Then eighty. Each load balancer IP change requires updating the file and restarting the service. Miss one and you are probing the wrong backend for weeks without knowing.

Blast radius. System-wide DNS overrides affect everything on the host. Your monitoring stack bypasses Cloudflare - great. But so does every other process. If you are running anything else on that host that legitimately needs to hit the CDN edge, you have broken it.

Reproducibility. The hosts file is stateful, manual, and easy to drift. CoreDNS with a hosts file is also just a flat file, but it is scoped to a single process (Blackbox), versioned in your repo, and trivially reproducible. Rebuild the monitoring stack from scratch and the DNS overrides come with it.

Multiple monitoring replicas. If you run more than one Prometheus instance for redundancy (and you should), each one has its own Blackbox Exporter. Managing hosts files across multiple hosts is the kind of thing that works until it doesn't. A dedicated DNS sidecar per replica is self-contained.

Operational considerations

Adding a new endpoint. Two steps: add the domain to Prometheus targets, add the domain-to-IP mapping in the CoreDNS hosts file. Reload both. Five-minute task.

IP changes. When a load balancer gets a new IP, update one line in the hosts file, restart CoreDNS. Blackbox automatically starts resolving to the new address on its next probe.

Domains not behind Cloudflare. The fallthrough directive in CoreDNS means anything not in the hosts file gets resolved normally via 8.8.8.8. So you can mix Cloudflare-proxied and direct endpoints in the same Blackbox instance. No special handling needed.

Debugging. dig @172.20.0.53 app.example.com tells you exactly what Blackbox will resolve. If the probe is failing, check whether CoreDNS returns the right IP. If it does, the problem is downstream. If it does not, your hosts file is wrong. Clean separation of concerns.

The broader pattern

This is not really about Cloudflare specifically. The same approach works for any situation where public DNS does not resolve to where you want your probes to land:

Services behind any CDN or reverse proxy (Akamai, Fastly, AWS CloudFront)
Internal services with split-horizon DNS where monitoring runs in a different network zone
Migration scenarios where you want to probe the new backend before cutting DNS over
Canary deployments where you want to health-check a new origin before adding it to the pool

The pattern is always the same: run a lightweight DNS resolver as a sidecar, override the specific records you need, let everything else pass through. Your probing tool does not need to know about any of this. It just resolves names and makes requests, which is exactly what it should do.

The setup is simple. CoreDNS is maybe 30MB of memory. The hosts file is trivially maintainable. And the result is that when your pager goes off at 3am, you know it is because the origin is actually down - not because Cloudflare rotated an edge node.

That is the kind of signal worth waking up for.

Originally published at amaanx86.github.io

The Git habits that made reviewers trust me before they read a single line of code

Amaan Ul Haq Siddiqui — Wed, 15 Apr 2026 09:01:28 +0000

Most developers treat Git as a save button. Commit when something works. Push before you go to sleep. That's it. And honestly, it gets the job done — until you start sending PRs to projects with real reviewers, or working on teams where the commit history actually needs to make sense to someone else.

Here's the thing: reviewers form an opinion about you before they open a single file. The branch name, the commit messages, the PR description — all of that is visible before they click into the diff. I figured this out after sending contributions to projects across GitHub, GitLab, Azure DevOps, and AWS CodeCommit. The feedback loop was fast. Clean work got reviewed fast and merged. Noisy work got asked follow-up questions, or sat in the queue.

So here's what I actually do.

Three permanent branches, clear purpose for each

I keep three long-lived branches. That's it.

main — production only. Nothing goes here without a PR that passed UAT. This branch is protected.
uat — staging and pre-production testing. Business users validate here before anything touches main.
dev — where everything starts. Features, fixes, experiments — all of it starts from here.

The reason for UAT as a dedicated branch (not just a tag or a deploy config) is that it gives you a stable checkpoint. When you find a bug in UAT, you know it's code that passed dev but failed real-world testing. That's a different category of bug than something that never left the developer's machine, and it deserves a different branch to fix it.

Short-lived branches with a naming pattern that explains itself

When I start any new piece of work I create a branch from the right source and name it properly:

# Feature from dev
git checkout dev && git pull origin dev
git checkout -b feat/user-authentication

# Bug found during UAT testing
git checkout uat
git checkout -b bugfix/cart-crash-fix

# Production emergency
git checkout main
git checkout -b hotfix/prod-db-connection

The prefixes matter. feat/, bugfix/, hotfix/, test/ — any reviewer or CI system that sees these knows what category the work is before they look at anything else. And because the branch name describes the problem, not the implementation, it reads the same to a human and to a ticket system.

Hotfixes are the one case where you merge in three directions: back into main, uat, and dev. Missing one creates drift and you'll feel it later.

Conventional commits — the part that changed how I write history

This is the single highest-leverage change I made to how I use Git. The format is simple:

<type>(<scope>): <description>

Types I use:

feat — new feature
fix — bug fix
docs — documentation only
refactor — restructured code, no behavior change
perf — performance improvement
test — tests added or updated
chore — build tooling, CI, dependencies

A real example from one of my projects:

feat(auth): add JWT refresh token rotation

Implements sliding expiration using a short-lived access token (15m)
and a rotating refresh token (7d). Old refresh tokens are invalidated
immediately on use to prevent token replay.

Closes: #142

When I send a PR with a history like this, reviewers can read the commit log like a changelog. They don't have to reverse-engineer what I did from the diff. And when something breaks in production six months later, the person doing the post-mortem can git log --oneline and actually understand the sequence of events.

Semantic versioning falls out of this naturally too. If all your commits since the last tag are fix:, you ship a patch. If there's a feat:, it's a minor. If anything has a BREAKING CHANGE in the footer, it's a major. Your CHANGELOG.md can be generated automatically. Your release tags follow v1.2.3. No guessing.

git tag -a v1.0.0 -m "Release version 1.0.0"
git push origin --tags

Cherry-pick: when you need one thing from a branch, not the whole branch

This is the tool most developers don't reach for until they're stuck, and then they use it wrong. Here's the actual use case.

I had a project where uat and main had diverged significantly. Both branches had been accumulating changes. We needed the azure-pipelines.yml from uat in main — just that file, because getting the pipeline consistent across environments was blocking the team. A full merge would have pulled in unvalidated changes. That's not acceptable for a protected branch.

So instead of a merge request, I used cherry-pick to move exactly the commits that touched that file.

First, find the commits that changed the file in the source branch:

git checkout uat
git log uat -- azure-pipelines.yml --oneline

Output:

b5df122 feat: update Azure Pipelines configuration to specify custom agent pool and demands
44cbc1e feat: update image tag for crmiskuat container to use 'uat' version
8fee9b4 feat: add Azure Pipelines configuration for OCI image build and push

Switch to the target branch and apply them in order (oldest first):

git checkout main
git cherry-pick 8fee9b4 44cbc1e b5df122

If there's a conflict on just that file:

# Resolve the conflict in your editor, then:
git add azure-pipelines.yml
git cherry-pick --continue

This creates new commits on main with the same changes but different hashes. The original commits stay on uat. You get a clean, traceable history on main without dragging in anything that wasn't ready.

The key thing to understand: cherry-pick is not a shortcut for avoiding code review. You still open a PR with these commits. The difference is the PR is surgical — it does one thing and reviewers can verify exactly what that thing is.

The PR description template I actually fill out

A PR is a document. If the description is empty or says "fix stuff," you're forcing the reviewer to read the entire diff to understand context that you already have in your head.

## Linked Issue / Ticket
<!-- GitHub/GitLab: Closes #123 | Jira: PROJECT-456 | Azure DevOps: AB#789 -->
Closes #

## Summary
Brief description of what this MR accomplishes.

## Changes
- Specific change with context
- Another change and why it was necessary

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed in dev/UAT

## Screenshots (if applicable)

## Checklist
- [ ] Code follows project standards
- [ ] Self-review done
- [ ] No new warnings
- [ ] Documentation updated

This is not bureaucracy. It's a forcing function for you to think through what you did before someone else has to. Half the time, filling out the testing checklist is when I catch something I missed.

Post-merge cleanup

After a branch merges, delete it. Both locally and remotely.

git branch -d feat/S-CR1234-new-feature
git push origin --delete feat/S-CR1234-new-feature

A repository with 200 stale branches is a repository nobody trusts. Clean branches signal that someone is actually maintaining this.

Why this matters more now, not less

The argument I used to hear against structured commits was that it slows you down. Write the code, get it working, the history is just metadata.

That argument has fully inverted. AI code review tools, automated changelog generation, semantic release pipelines — all of these work significantly better when the commit history is structured. When I use an AI assistant to review a PR now, it can read the conventional commits and understand the intent without me explaining it. When I ask it to generate release notes, it has something to work from.

But more fundamentally: reviewers at AWS, GitHub, GitLab, Azure — the pattern I noticed was that the PRs that got attention were the ones that looked like the contributor had thought it through. Clean branch name, useful commits, filled-out description. It signals that you respect the reviewer's time. That's the actual thing that builds trust.

The technical content still has to be right. But a well-structured contribution gets a genuine review. A sloppy one gets skimmed.

The actual workflow in one place

# Start a feature
git checkout dev && git pull origin dev
git checkout -b feat/your-feature

# Work in small, meaningful commits
git add src/specific-file.ts
git commit -m "feat(auth): add token expiry validation"

# Push and open PR to dev
git push origin feat/your-feature

# After merge, clean up
git branch -d feat/your-feature
git push origin --delete feat/your-feature

That's it. No magic. Just consistency applied over enough time that it becomes automatic.

Originally published at amaanx86.github.io

End-to-End Supply Chain Security for a Go Project: TUF on CI, cosign, and SLSA L3

Amaan Ul Haq Siddiqui — Sat, 11 Apr 2026 21:25:52 +0000

Adding cosign sign to a CI pipeline and calling it "signed releases" is a bit like putting a lock on a glass door. The lock works. The glass does not. Signing the image proves a specific digest was signed by a specific identity at a specific time. It says nothing about whether the source commit matches what was built, whether the build environment was clean, or whether someone replaced the release asset after the fact.

I had been going deep on supply chain security for a while - reading through TUF specs, the Sigstore design docs, how Fulcio issues short-lived certificates, how Rekor works as an append-only transparency log. At some point I came across the OpenSSF Best Practices requirements and saw the full picture laid out as a checklist. I could have just signed the image and moved on. Instead I used oci-prometheus-sd-proxy - a project that does OCI Prometheus service discovery - as the thing to actually build it on. I wanted to understand each layer well enough to explain it, not just wire it up. What I ended up with: cosign keyless signing, a CycloneDX SBOM attestation, SLSA L3 build provenance, and TUF metadata distribution via tuf-on-ci published to GitHub Pages. No long-lived keys anywhere in the pipeline.

Why each layer exists

This is the question worth answering properly, because you can absolutely ship just cosign and be in a better position than 95% of projects. So why go further?

HTTPS protects against network-level tampering. It does not protect against a compromised GitHub account pushing a backdoored release, or a build that was modified after it completed, or an asset quietly swapped post-publication.

cosign on a container image proves that a specific digest was signed by a specific OIDC identity with the event recorded in Rekor's transparency log. That is genuinely useful - but it only covers the image. It says nothing about the source ref, the build environment, or the workflow that produced it. Someone could sign a backdoored binary and the cosign verification would pass.

SLSA L3 provenance fills that gap, and it is the layer I found most interesting to wire up. The provenance is generated in a separate, isolated signing job with its own OIDC identity, not in the main build job. That isolation is what makes L3 meaningful: an attacker who compromises the main build job cannot forge L3 provenance because they do not control the signing job's OIDC token. The provenance attests to the exact source ref, the exact workflow, and the exact runner environment. You can look at a SLSA L3 attestation and know that the image you are running came from that commit in that repo via that verified builder.

TUF adds something orthogonal to all of the above - it is about distribution trust, not just signing trust. It adds a role-based metadata layer where clients can verify that a release target was authorised by the project's key holders, that the metadata has not been rolled back to a previous version, and that the metadata is actually fresh. The design difference that matters: TUF survives key compromise in a way that a single cosign keypair does not. If my cosign key leaked tomorrow, every past signature would be under a cloud. With TUF, key rotation is a defined protocol. The damage is bounded and recoverable.

The point of having all four is that verification is fully independent. Verifying the image signature, the build provenance, and the TUF metadata chain are three separate operations against different data sources. Compromising any single one of them is not enough to ship a malicious release undetected. You need to compromise all of them simultaneously - and the transparency logs make doing that silently very hard.

Two repos, one trust boundary

The implementation lives across two repositories:

amaanx86/oci-prometheus-sd-proxy - the application, the build pipeline, the release workflow
amaanx86/oci-prometheus-sd-proxy-tuf-on-ci - TUF metadata, managed by tuf-on-ci, published to GitHub Pages

Keeping them separate was a deliberate trust boundary decision, not just organisation. The app CI can push a signing branch to the TUF repo, but it cannot merge that branch or sign targets.json. That step requires my OIDC identity authenticating via Sigstore in a browser - not the CI system's token. An attacker who steals the app repo's CI tokens hits a hard wall at the TUF signing step. They can push a branch, but they cannot authorise the release. The human is the last gate.

App repo CI (run #24290258383)
  └── builds image (linux/amd64, linux/arm64)
  └── pushes to GHCR
  └── cosign attest --type cyclonedx sbom.cyclonedx.json (SBOM)
  └── cosign sign (image signature, OIDC -> Fulcio cert -> Rekor)
  └── cosign attest --type  release-metadata.json
  └── slsa-github-generator -> SLSA L3 provenance (isolated job)
  └── pushes sign/release-1-4-2-rc-24290258383 branch -> TUF repo

TUF repo (tuf-on-ci) [PR #8]
  └── signing-event.yml detects branch push -> opens signing PR
  └── Maintainer: tuf-on-ci-sign (browser OIDC -> @amaanx86 identity)
  └── PR merged -> online-sign.yml refreshes snapshot + timestamp
  └── publish.yml -> GitHub Pages
  └── test.yml -> smoke-tests TUF client (scheduled)

Users / Policy Engines
  └── cosign verify - checks image signature against Rekor
  └── cosign verify-attestation --type cyclonedx - checks SBOM
  └── cosign verify-attestation --type  - checks release-metadata
  └── slsa-verifier verify-image - checks SLSA L3 provenance
  └── TUF client (python-tuf ngclient) - fetches metadata from GitHub Pages, verifies chain

The build pipeline

The workflow (docker-build-push.yml) triggers on release publication. After pushing the multi-arch image (linux/amd64 and linux/arm64) to GHCR it does five more things.

SBOM generation and attestation

First, Syft generates a CycloneDX SBOM for the pushed image, which gets attached as a cosign attestation:

cosign attest \
  --yes \
  --predicate sbom.cyclonedx.json \
  --type cyclonedx \
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd

The SBOM is also uploaded as a release asset (the 80.7 KB *.cyclonedx.json artifact visible on the workflow run).

cosign image signing

cosign sign \
  --yes \
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd

No --key flag. cosign uses the GitHub Actions OIDC token to get an ephemeral certificate from Fulcio, signs the digest, and records the operation in Rekor. The private key is generated in memory and never stored. The Rekor entry pins the workflow identity to the specific digest at a specific timestamp.

Note the image is signed by digest, not by tag. Tags are mutable; the digest is what the signature actually covers.

No secret to rotate. No key to leak.

Release-metadata attestation

The workflow generates a release-metadata.json with the image digest, source commit, release tag, and build timestamp, then attaches it as an attestation under a custom predicate type:

cosign attest \
  --yes \
  --predicate release-metadata.json \
  --type https://github.com/amaanx86/oci-prometheus-sd-proxy/release-metadata \
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd

Using a project-specific type URI keeps the attestation namespaced and lets cosign verify-attestation --type <uri> fetch exactly this attestation rather than every in-toto statement on the image.

SLSA L3 provenance

uses: slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v2.0.0
with:
  image: ghcr.io/amaanx86/oci-prometheus-sd-proxy
  digest: ${{ needs.build-and-push.outputs.digest }}

slsa-github-generator runs as a reusable workflow with its own isolated OIDC identity. The provenance attestation is generated and signed there, not in the main build job. L3 specifically requires this isolation - the verified builder identity in the provenance is https://github.com/slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@refs/tags/v2.0.0, distinct from the app workflow's identity.

Triggering the TUF signing event

Last step: clone the TUF repo and push a signing branch with a new target file at targets/oci-prometheus-sd-proxy/releases/v1.4.2-rc.json. The branch name embeds the run ID to avoid collisions: sign/release-1-4-2-rc-24290258383 (dots in the version become hyphens, run ID appended). This branch push is what kicks off the TUF side of the pipeline.

The Python tuf library (v5+) is used inline to update metadata/targets.json with the new target entry before committing - the targets metadata version is incremented and signatures cleared, ready for the human signing step.

tuf-on-ci

tuf-on-ci manages the TUF metadata lifecycle entirely within a GitHub repository. All online signing uses GitHub Actions OIDC. All offline signing uses Sigstore OIDC (browser-based). No key files anywhere.

The TUF repo has four workflows. signing-event.yml fires on any sign/** branch push, opens a PR, and annotates it with what needs signing. online-sign.yml runs after a signing PR is merged and refreshes snapshot.json and timestamp.json using the Actions OIDC token. publish.yml deploys everything to GitHub Pages. test.yml runs on a schedule and verifies the full metadata chain with a real TUF client to catch expiry or breakage before any user does.

Signing a release

When signing-event.yml opens PR #8 with title "Signing event: sign/release-1-4-2-rc-24290258383", I run:

tuf-on-ci-sign sign/release-1-4-2-rc-24290258383

This opens a browser window for Sigstore OIDC. I authenticate as @amaanx86 via GitHub. Fulcio issues an ephemeral certificate tied to that identity, the tool signs targets.json, and pushes the signature to the branch. Rekor gets an entry proving the person at @amaanx86 authorised this specific targets update at this specific time.

Worth being clear on what "offline" means here: it requires a human with a verified identity, not a CI token. It does not require an air-gapped machine. The private key is still ephemeral.

After merge, online-sign.yml takes over and refreshes snapshot and timestamp automatically using the Actions OIDC token. No human needed for that part.

What gets published

GitHub Pages at amaanx86.github.io/oci-prometheus-sd-proxy-tuf-on-ci/metadata/ serves:

root.json - signed by @amaanx86 via Sigstore OIDC; defines trusted key holders for all roles
targets.json - signed by @amaanx86; lists all authorised release targets with digests
snapshot.json - signed by GitHub Actions OIDC; prevents any metadata file from being swapped with an older version
timestamp.json - signed by GitHub Actions OIDC; has a short validity window to prevent freeze attacks

Each release gets a target file at targets/oci-prometheus-sd-proxy/releases/v1.4.2-rc.json. The TUF metadata key (used inside targets.json) is oci-prometheus-sd-proxy/releases/v1.4.2-rc.json - the path is relative to the targets directory, not the repo root.

Verifying v1.4.2-rc

All verification uses the digest, not the tag, since the tag is a mutable pointer.

Image signature:

cosign verify \
  --certificate-identity="https://github.com/amaanx86/oci-prometheus-sd-proxy/.github/workflows/docker-build-push.yml@refs/tags/v1.4.2-rc" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd

SBOM attestation:

cosign verify-attestation \
  --certificate-identity="https://github.com/amaanx86/oci-prometheus-sd-proxy/.github/workflows/docker-build-push.yml@refs/tags/v1.4.2-rc" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  --type cyclonedx \
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd

Release-metadata attestation:

cosign verify-attestation \
  --certificate-identity="https://github.com/amaanx86/oci-prometheus-sd-proxy/.github/workflows/docker-build-push.yml@refs/tags/v1.4.2-rc" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  --type "https://github.com/amaanx86/oci-prometheus-sd-proxy/release-metadata" \
  ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd

SLSA L3 provenance (requires digest reference - slsa-verifier rejects mutable tags):

slsa-verifier verify-image \
  "ghcr.io/amaanx86/oci-prometheus-sd-proxy@sha256:759e255e607f623e0b1ee4ea9df02b2aefd89e2c9ec979842ee2e6f8b21772fd" \
  --source-uri "github.com/amaanx86/oci-prometheus-sd-proxy" \
  --source-tag "v1.4.2-rc"
# Verified build using builder "...generator_container_slsa3.yml@refs/tags/v2.0.0"
# at commit a68d4cd5d29cc6b865c6804fe63adff14ac74b27
# PASSED: SLSA verification passed

TUF metadata chain verification is documented in detail - including the full client walkthrough and what each step validates - in the release verification docs. The short version: a compliant TUF client walks root trust, snapshot consistency, and timestamp freshness before fetching the target. If the metadata is expired, rolled back, or the signature chain is invalid, the client raises before returning anything. That freshness check is what separates TUF from static signature verification.

What this covers and where the gaps are

This pipeline secures the release artifact and its provenance chain. An operator fetching the image can independently verify who built it, from what source, in what environment, and that the release was authorised by a human identity. That is a lot more than most projects ship with.

But supply chain security has layers, and the release artifact is only one of them. A few honest gaps:

Go module dependencies. The SBOM shows what modules are in the binary, and go.sum pins their hashes. But govulncheck and a periodic dependency audit are what actually catch known-vulnerable transitive dependencies. The attestation proves the SBOM is authentic; it does not tell you the SBOM is safe.

Runtime enforcement. Signing and provenance only matter if someone checks them at deploy time. Right now verification is a manual step. The more interesting place this is going is integrating the cosign attestations and SLSA provenance into a Kyverno or OPA/Gatekeeper policy engine that enforces admission control in Kubernetes. Policy engines like Kyverno can query Sigstore and reject any image that lacks a valid SLSA L3 attestation from the correct workflow identity - automatically, at admission time, not as a manual verification step. That closes the loop between what we proved at build time and what is allowed to run in production.

TUF root key compromise. If the @amaanx86 GitHub account itself was compromised, an attacker could rotate the root TUF key in a way that would pass client verification. TUF supports threshold signatures across multiple root key holders to mitigate this, which becomes relevant as the project scales to multiple maintainers.

The dependency of your dependencies. None of this solves a compromised slsa-github-generator or a backdoored Syft release. That is a solved problem in theory (pin action hashes, verify the tools themselves) but it is worth naming.

What I am building toward

The next step that interests me most is closing the loop at runtime.

The next meaningful investment is runtime enforcement. Kyverno ClusterPolicies that require verified SLSA provenance before admission, OPA rules that check SBOM attestations against a known-safe package policy, and Sigstore-aware image admission are all achievable with what the pipeline already produces. The attestations are already there. The policy layer that consumes them is the missing piece.

After that, expanding to multi-maintainer TUF with hardware-backed root keys and threshold signatures would make the trust model genuinely robust at scale.

Full implementation details and the verification workflow are documented at oci-prometheus-sd-proxy.readthedocs.io/en/latest/releasing.html.

Originally published at amaanx86.github.io

Built My Own Prometheus Service Discovery for Oracle Cloud Because a 3-Year-Old PR Never Got Merged

Amaan Ul Haq Siddiqui — Wed, 11 Mar 2026 23:20:32 +0000

There is a specific kind of frustration reserved for when you know a problem is solved, you can see the solution, and you still cannot use it. That is how this project started.

The Context: Setting Up Observability From Scratch

I was building out observability from scratch across Oracle Cloud Infrastructure - multiple tenancies, multiple regions, a decent number of compute instances spread across compartments. The goal was full coverage: every VM enrolled in monitoring, no gaps, no guesswork.

When you are starting from zero, one of the first things you ask is how Prometheus is going to discover what it needs to scrape. For AWS you have EC2 service discovery built right in. Same for GCP, Azure, Hetzner, DigitalOcean. You configure credentials, set some filters, and Prometheus takes care of the rest.

OCI is not on that list.

I searched. I found a pull request in the Prometheus repository opened by an engineer at Oracle. It was exactly what I needed. It was also three years old and had never been merged. Comments, reviews, back and forth, and then silence. The PR is still open today. If you have gone looking for OCI service discovery in Prometheus you have probably landed on that same page, felt a brief moment of hope, and then noticed the date.

So I built it myself.

What I Actually Needed

The requirement was simple: new VM comes up, Prometheus finds it and starts scraping. No manual steps in that loop. Not because I was fixing a broken process - there was no process yet. I was designing this from scratch and I was not going to design it with a gap where a human has to remember to update a config file.

The concern was blind spots. Infrastructure grows, VMs get provisioned, things change. If enrollment is manual, coverage is only as good as the last person who remembered to do it. I wanted observability that was structurally complete, not best-effort.

The workflow I landed on:

Tag the VM in OCI.
When provisioning a new instance, add a tag - something like prometheus:scrape = true. That is the only enrollment step.

Open the Prometheus port on the network security group.
Allow the Prometheus server IP to reach port 9100. One rule, specific source, no broad exposure.

Run the node exporter playbook.
An Ansible playbook installs and starts node exporter. Done.

That is the full enrollment flow. The VM is in monitoring. No touching the Prometheus config. No reloading Prometheus. No rollout restart of a Kubernetes deployment. No SSH back into the server to verify anything.

The proxy polls OCI, finds every instance with the right tag across all configured tenancies and compartments, and hands the target list to Prometheus via the HTTP service discovery API. Prometheus picks it up on its next refresh cycle. The whole thing is automatic by design, not patched in after the fact.

What I Built

oci-prometheus-sd-proxy is a lightweight Go service that implements the Prometheus HTTP Service Discovery API for OCI.

You point Prometheus at it like this:

scrape_configs:
  - job_name: oci_instances
    http_sd_configs:
      - url: 'http://oci-sd-proxy:8080/v1/targets'
        authorization:
          type: Bearer
          credentials: 'YOUR_TOKEN'
        refresh_interval: 60s
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__meta_oci_instance_name]
        target_label: instance
      - source_labels: [__meta_oci_tenancy_name]
        target_label: tenancy
      - source_labels: [__meta_oci_compartment_name]
        target_label: compartment
      - source_labels: [__meta_oci_region]
        target_label: region
      - source_labels: [__meta_oci_availability_domain]
        target_label: availability_domain
      - source_labels: [__meta_oci_instance_shape]
        target_label: shape

The proxy handles the rest. It scans OCI, filters by tag, and returns targets with rich metadata: tenancy, compartment, region, availability domain, shape, fault domain, private IP, and all your custom OCI tags as Prometheus labels. Use them for relabeling, alerting rules, dashboards - whatever you need.

A few things I cared about when building it:

Multi-tenancy from day one. The proxy handles all tenancies in parallel from a single config file. One deployment, full coverage.

Rate limiting. OCI's API will return 429s if you push it too hard. The proxy uses a token bucket proactively and a retry policy reactively. Discovery does not silently fail under load.

Security. Bearer token auth, distroless container, read-only config mounts. In an HA setup it sits on the local network, only reachable by Prometheus. It does not need to be internet-facing and it should not be.

Caching. Discovery results are cached so Prometheus always gets a response, even if the OCI API is momentarily slow or rate limiting.

Battle Tested

This is running in production across 10+ Oracle Cloud tenancies - different regions, different compartments, different team setups. It has handled OCI API slowness, tenancy permission edge cases, and the general entropy that comes with real infrastructure at scale.

The thing that still satisfies me is watching a new VM appear in Grafana within a minute of it booting, with nobody doing anything to make that happen. That is what good observability infrastructure should feel like. You build it right once and it stays right.

Why This Did Not Exist Already

OCI is a smaller player compared to AWS or GCP and Prometheus contributors naturally prioritize the platforms most of their users are on. Oracle engineers clearly wanted to solve this - that PR is evidence of that - but getting a first-party integration merged upstream is a long road with no guarantees.

The HTTP SD API that Prometheus exposes is actually the right answer for this situation. It lets any platform plug in without touching the Prometheus codebase. I just had to build the other end of that interface.

Try It

Repository: https://github.com/amaanx86/oci-prometheus-sd-proxy
Documentation: https://oci-prometheus-sd-proxy.readthedocs.io/

The docs cover installation, the full configuration reference, OCI IAM permissions, and Docker Compose deployment examples. If you are building observability on OCI and want automatic enrollment from day one, this is what I use.

And if you were one of the people who commented on that PR hoping it would eventually land - same. This is what I built instead.

Originally published at amaanx86.github.io

What Broke During Our AWS DMS Migration (And How We Fixed It)

Amaan Ul Haq Siddiqui — Sat, 24 Jan 2026 09:46:30 +0000

Let me tell you about the time I thought migrating a database would be straightforward. Spoiler alert: it wasn't.

The Setup

I was tasked with migrating our MySQL database from DigitalOcean's managed service to AWS RDS. Armed with confidence and the AWS DMS documentation, I dove in headfirst.

The First Roadblock: Serverless Seemed Like a Good Idea

Creating the source and target endpoints went surprisingly smooth. I felt like I was on a roll. Then came the task creation, and I thought, "Hey, let's use serverless DMS. Modern, scalable, perfect."

That's when everything came to a grinding halt.

The networking configuration for serverless DMS had me completely stumped. I spent way too long trying to figure out the right VPC setup, subnet configurations, and security group rules. Nothing seemed to work the way I expected. The documentation made sense in theory, but practice was a different story.

Eventually, I gave up on serverless and pivoted to the traditional EC2-based replication instance. Sometimes the old way is the right way.

Excluding the Noise

With my shiny new replication instance ready, I created the migration task. But I needed to make sure I wasn't migrating MySQL's system databases along with my actual data. Nobody needs that mess.

I configured table mappings to include all user databases while explicitly excluding the system ones:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-all-user-dbs",
      "object-locator": {
        "schema-name": "%",
        "table-name": "%"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "selection",
      "rule-id": "2",
      "rule-name": "exclude-mysql",
      "object-locator": {
        "schema-name": "mysql",
        "table-name": "%"
      },
      "rule-action": "exclude"
    },
    {
      "rule-type": "selection",
      "rule-id": "3",
      "rule-name": "exclude-sys",
      "object-locator": {
        "schema-name": "sys",
        "table-name": "%"
      },
      "rule-action": "exclude"
    },
    {
      "rule-type": "selection",
      "rule-id": "4",
      "rule-name": "exclude-information_schema",
      "object-locator": {
        "schema-name": "information_schema",
        "table-name": "%"
      },
      "rule-action": "exclude"
    },
    {
      "rule-type": "selection",
      "rule-id": "5",
      "rule-name": "exclude-performance_schema",
      "object-locator": {
        "schema-name": "performance_schema",
        "table-name": "%"
      },
      "rule-action": "exclude"
    }
  ]
}

Clean and specific. I felt good about this.

The Premigration Checks Humbled Me

I ran the premigration assessment checks, expecting maybe a warning or two. Instead, I was greeted with a wall of failures. Major ones. The kind that make you question your life choices.

I spent the next few hours going through each failed check, cross-referencing with AWS documentation, and fixing issues one by one. Most of the critical warnings got resolved, though some minor ones remained. I figured those were acceptable and proceeded with the migration.

The Migration Itself: A False Sense of Security

This was our staging database, and it was massive. The migration kicked off, and surprisingly, it ran smoothly. Hours passed, data transferred, progress bars filled. Everything looked perfect.

We switched the application endpoint to the new RDS instance, deployed the changes, and waited for the green light.

Then the login feature stopped working entirely.

The Investigation That Nearly Broke Me

Cue several hours of frantic debugging. We checked connection strings, verified credentials, tested queries manually, checked network rules, and compared database structures. Everything looked identical.

Until we looked closer at the tables themselves.

Our foreign keys were gone. Primary keys were missing. Auto-increment sequences had reset. DMS had essentially eaten the structural integrity of our database.

Turns out, this is a known behavior. DMS focuses on moving data efficiently, and in doing so, it doesn't always preserve things like constraints and keys perfectly during the initial load.

The Actual Solution

After more digging through documentation and forums, we found the recommended approach: manually dump and restore the schema first, then let DMS handle just the data migration.

We also discovered that you can pass additional connection parameters to the DMS endpoint configuration to better preserve database objects during migration. We updated our endpoint settings with these parameters and ran the migration again.

This time, everything worked. Foreign keys intact, primary keys preserved, auto-increment sequences functioning as expected. The application came back to life, and logins worked perfectly.

Lessons Learned

First, serverless isn't always the answer, especially when you're still figuring out the networking intricacies of a new service.

Second, premigration checks exist for a reason. Those warnings are trying to save you from pain later.

Third, and most importantly, when migrating databases with DMS, take the time to migrate your schema separately. Don't rely on DMS to handle everything. It's a data migration service, not a complete database cloning tool.

Fourth, if you're planning to use CDC (Change Data Capture) for ongoing replication, make sure binary logging is enabled on your source database with the correct format. MySQL requires binlog_format set to ROW for DMS to capture changes properly. Without this, your CDC tasks will fail silently or miss updates entirely.

The whole experience was frustrating, time-consuming, and honestly a bit embarrassing. But it taught me more about AWS DMS in a few days than I would have learned in weeks of casual reading.

If you're planning a DMS migration, learn from my mistakes. Your future self will thank you.

Designing a Secure AWS Landing Zone with Control Tower (What Most Blogs Don’t Tell You)

Amaan Ul Haq Siddiqui — Sun, 04 Jan 2026 10:50:10 +0000

Intro

imagine ur the architect of this massive growing org right and ur job is basically to design a secure compliant scalable aws environment that wont fall apart when business needs change which they always do. u gotta ensure the right governance is there sensitive data is locked down and everything can scale up. sounds like a headache honestly but with aws control tower u actually have a shot at making this work without losing ur mind

lets go on a trip thru the cloud setting up a landing zone using control tower and seeing how organizational units aka OUs can be the backbone of ur security game

1. Building the foundation: the org management account

so every solid landing zone starts somewhere and that somewhere is the organization management account. think of this as the heart of ur aws world where policies access and the basic structure live. its where u define global security rules and where control tower actually does its magic

for me it all starts by making this org management account the master key to everything. first step i take is locking this thing down tight access control everything because if this account gets compromised its game over

2. The structure takes shape: OUs and governance

now that the management account is chillin i start carving out the structure. this is where control tower is actually super useful. using organizational units i make a hierarchy that actually makes sense for the business

i always gotta balance letting devs do their thing while keeping control. for big companies i usually set up separate OUs for security production and staging just to keep sanity

so i generally define some high level OUs

security ou – the gatekeeper making sure audits happen
production ou – where the money is made live services live here
staging ou – the playground where we break stuff before prod

3. Nested OUs: a growing ecosystem

as the org gets bigger the aws environment gets messy so to keep it clean i start nesting OUs. this is basically putting folders inside folders but for cloud accounts

now i gotta support multiple teams so i typically make groups like app-1 and app-2 each with their own prod and staging accounts. this ensures app-1 cant mess with app-2s stuff. nice and isolated

then the internal ops need love too so i make an internal operations OU with sub-OUs for finance hr and it departments so hr doesnt accidentally delete the finance database

4. Audit and compliance: the log archive

structure is done but u need visibility. audit logs compliance retention policies all that boring but super critical stuff needs to be set up so u dont fail an audit

i set up audit and log archive accounts immediately. these are like the black box of an airplane immutable records of everything that happens. logs go in they dont come out (unless u need to investigate something) and automated backups keep everything safe

5. The real magic: terraform and automation with AFT

ok so the structure is cool and all but clicking buttons in the console is for amateurs. we want speed and consistency. enter the aws control tower account factory for terraform or AFT for short. this is where things get super interesting because now we are treating our account vending machine as code

check this out u can use this module here
https://registry.terraform.io/modules/aws-ia/control_tower_account_factory/aws/latest

and the code lives here
https://github.com/aws-ia/terraform-aws-control_tower_account_factory

so instead of manually provisioning accounts i deploy AFT. now when a new team needs an account they just push a change to a terraform repo. AFT sees the change spins up the account using control tower and then—this is the best part—it automatically applies all the baseline customization. we are talking setting up security groups iam roles and connecting to the vpc automatically. no human error just pure automation pipelines running smooth

it basically lets u maintain a global account customization repo so every single account that gets birthed by control tower comes out pre-configured with my specific tooling and security baselines right out of the box. massive time saver

6. Conclusion: a secure scalable future

so i step back drink some coffee and look at the dashboard. i didnt just build some servers i built a foundation. the org is secure scalable and aligned with business goals. thanks to control tower and that sweet AFT automation the journey from a messy handful of accounts to a compliant enterprise grade environment was actually kinda smooth and easy access and control setup with IAM Identity Centre SSO & Directory Services which we can integrate with Azure EntraID as well :)

Why We Didn’t Move to EKS (Yet): Choosing ECS Over Kubernetes in Production

Amaan Ul Haq Siddiqui — Sun, 28 Dec 2025 09:47:19 +0000

In the cloud-native world, Kubernetes (EKS) is often treated as the default destination for container orchestration. It’s powerful, flexible, and industry-standard. But for many engineering teams, it’s also overkill.

We recently faced the classic "build vs. buy" decision for our infrastructure. The pressure to adopt EKS was there, but after evaluating our actual needs, we made a conscious choice to stick with Amazon ECS.

The result? We saved a handsome amount of money, months of engineering time, and avoided the operational tax that comes with managing Kubernetes clusters. Here is how we architected a robust, scalable production environment on ECS without the K8s complexity.

The "Kubernetes Tax" We Wanted to Avoid

Kubernetes is amazing, but it requires a significant investment in tooling and maintenance. To run EKS properly in production, you aren't just managing containers; you're managing a platform. You need:

GitOps tools: ArgoCD or FluxCD for deployments.
Observability: Fluentd or similar for log shipping.
Ingress Controllers: NGINX or ALB controllers.
Security: Constant patching of the control plane and worker nodes.

For our team, we wanted to focus 100% on shipping application code, not managing infrastructure plumbing.

Our Hybrid ECS Architecture

We designed a hybrid ECS strategy that leverages the best of both serverless and provisioned compute.

1. Fargate for Stateless Workloads

For our main application servers and Sidekiq background workers, we used ECS Fargate.

No Servers to Manage: We don't worry about OS patching or scaling instances.
Right-Sizing: We pay only for the vCPU and RAM the tasks actually use.
Scalability: Fargate handles the heavy lifting of launching thousands of containers if needed.

2. EC2 Launch Type for Cron Jobs

Interestingly, we didn't go 100% Fargate. For our scheduled Cron jobs, we stuck with the EC2 Launch Type.

Why? Cron jobs run frequently and often use the same base images.
The Cost Hack: By running these on EC2 instances, we can cache Docker layers locally on the host. This drastically reduces data transfer costs from ECR (Elastic Container Registry) and speeds up start times, something Fargate doesn't support as efficiently for frequent, short-lived tasks.

The Stack: Simple and Managed

We offloaded state management to AWS managed services to keep the compute layer purely ephemeral:

Database: Amazon RDS for PostgreSQL.
Caching: Amazon ElastiCache (Redis).

CI/CD: Skipping the Complexity

One of the biggest wins was avoiding the "GitOps" complexity of ArgoCD or Flux. Our pipeline is a straightforward GitHub Actions workflow:

Build: Create the Docker image.
Scan: Run security vulnerability scans.
Push: Upload to ECR.
Deploy: Update the ECS Task Definition and force a new deployment.

That’s it. No separate synchronization server, no complex CRDs (Custom Resource Definitions), and no managing Helm charts. The pipeline is robust, easy to debug, and requires zero maintenance.

The Verdict: Time is Money

By choosing ECS, we:

Skipped the Learning Curve: No need to train the team on kubectl, manifests, or cluster networking.
Reduced Operational Overhead: No node patching, no control plane upgrades.
Lowered Bill: We aren't paying for EKS control plane fees ($73/month per cluster) or the overhead of system pods running on worker nodes.

We might move to EKS one day if our requirements for custom networking or service mesh become complex enough to warrant it. But for now, ECS allows us to run a stable, high-performance production environment where the only thing we have to take care of is our application code.

Sometimes, the best engineering decision is the boring one.

Memory-Based Auto Scaling: Saving Our Sidekiq Jobs When CPU Metrics Lied to Us

Amaan Ul Haq Siddiqui — Fri, 26 Dec 2025 15:13:41 +0000

We usually just default to CPU-based scaling for our Auto Scaling Groups (ASGs). It’s the standard move. It’s easy, it’s familiar, and for web servers? It usually works fine.

But sometimes, CPU utilization lies.

We recently hit a wall where CPU scaling completely failed us. This is the story of how a critical background job kept crashing even though our dashboards said everything was "healthy," and how switching to memory-based metrics saved the day.

The Silent Failure

We run a Ruby on Rails app. It relies heavily on Sidekiq for background work. These workers run on EC2 instances in an Auto Scaling Group.

On paper, everything looked great.
CPU usage? A comfortable 20–30%.
Network? Normal.
Disk? Fine.
AWS said we were green.

But the app was on fire.

Critical jobs were timing out. The queues were piling up. Retries were spiking. Worst of all? Workers were just... vanishing. Processes were dying, but since CPU was low, the auto-scaler didn't care. It didn't launch new instances. It just let them die.

The Culprit: The OOM Killer

I dug into the logs, and the answer was right there. Memory.

Our Sidekiq jobs are hungry. As the Ruby processes chewed through heavy tasks, they ate up more and more RAM. The instances were running out of memory, and the Linux OOM (Out-of-Memory) Killer stepped in to save the server by killing our Sidekiq process.

The problem? EC2 doesn't send memory metrics to CloudWatch by default.

So, while our RAM was screaming for help, CloudWatch saw low CPU and thought, "Everything is chill."

Step 1: Making Memory Visible

You can't fix what you can't see.

First thing I did was install the CloudWatch Agent on our instances. I needed it to ship custom metrics—specifically mem_used_percent—to AWS.

As soon as we turned it on, the graphs confirmed it.
CPU was bored at 20%.
Memory? It was spiking over 85%.

Above: Finally seeing the truth. CPU was low, but RAM was maxed out.

Step 2: Changing the Rules

We stopped listening to CPU. I set up a Target Tracking scaling policy that looks strictly at memory.

I told the ASG: "Keep average memory at 40%."

Sounds low, right? But background workers are unpredictable. They need breathing room for sudden spikes. This setup does two things:

It adds new servers before we hit the danger zone (75%+).
It doesn't kill servers too fast, so we avoid "thrashing" (booting up and shutting down constantly).

Above: The new policy. If RAM goes up, we scale out.

The Result

Instant fix.

The scaling became predictive. Instead of waiting for a crash, the cluster sees the memory pressure building and adds more power before things break.

We haven't seen a single OOM kill since. The Sidekiq service is happy, the queues are empty, and I can finally sleep.

Above: Stable, boring, and running perfectly.

Why CPU Scaling Sucks for Workers

Here's the takeaway.

Web traffic is usually CPU-heavy. You get a request, you process it, you send a response. CPU spikes, you scale. Simple.

Background workers (like Sidekiq, Celery, Bull) are different. They load big files. They process heavy data objects. They eat RAM. Your CPU can be totally asleep while your memory is completely full.

Final Thoughts

If you're running background jobs in an ASG and you're only watching CPU... you might be one heavy job away from a silent outage.

Don't just use the default settings. Scale on what actually hurts your application.

Note: The screenshots above are from a test setup. I’ve hidden sensitive stuff like Account IDs.

My First AWS Community Day Dubai & that too as a Speaker :)

Amaan Ul Haq Siddiqui — Thu, 25 Dec 2025 08:21:38 +0000

My First AWS Community Day Dubai (And That Too as a Speaker!)

Speaking at an AWS event was always on my "someday" list.
But "someday" arrived a lot sooner than I expected.

This past Sunday, I attended AWS Community Day UAE 2025 in Dubai.
And I didn't just attend.
I gave my first-ever talk.

The Announcement: Is This Real?

When the speaker list dropped, seeing my name next to industry veterans felt surreal.

Session: The Winning Stack: Lessons from a DevOps Hackathon Winner
Who: Me (DevOps Engineer at SUDO Consultants)

Representing my team and sharing my own messy, real-world journey in front of the AWS community? Humbling. Terrifying. But mostly, exciting.

The Talk: No Fluff, Just Real DevOps

I didn't want to give a lecture on textbook theory. We have documentation for that.
I wanted to talk about what actually happens when things are on fire.

I focused on:

How winning a DevOps Hackathon changed my perspective.
Why time pressure forces you to build better architectures.
How to build a secure CI/CD pipeline that actually works in production.
The massive gap between "Best Practices" and "Real Reality."

My goal was simple:
Share what actually works.

A Sunday Well Spent

The event was on a Sunday, and the turnout was incredible.
The best part wasn't even the sessions—it was what happened in the hallways.

I saw students asking about cloud careers for the first time.
I saw senior engineers swapping war stories about production outages.
I saw a community that actually wants to help each other.

One attendee told me later that hearing my story—going from a student to a DevOps pro—helped them understand what skills actually matter.
That feedback alone? Worth all the prep time.

Check out this post from the event

Why You Should Go

Events like this aren't just about the tech.
They are about the reality check.

You get honest career stories you won't find in a whitepaper.
You get networking that feels human, not transactional.
For anyone starting in Cloud or DevOps, this is how you accelerate.

Big Thanks

I have to give a shoutout to the people who made this happen:

SUDO Consultants for trusting me and pushing me forward.
The AWS User Group UAE volunteers—you guys run a world-class show.
The audience for the questions and the energy.

Being part of a group that values sharing over competition is rare. I don't take it for granted.

To Students & First-Timers

If you're a student or a junior engineer wondering if you should go to these events... or if you're "ready" to speak at one:

Don't wait until you feel ready.

You will never feel 100% ready.
These events don't reward perfection. They reward showing up.
Every expert on that stage started exactly where you are right now.

What's Next?

This was my first AWS Community Day Dubai.
It definitely won't be my last.

Thank you to the UAE AWS community for the platform and the inspiration.
Here’s to more learning, more sharing, and better clouds.