<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Warren Parad</title>
    <description>The latest articles on DEV Community by Warren Parad (@wparad).</description>
    <link>https://dev.to/wparad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F86409%2Fad0e5c54-e76f-4fd9-864e-f04b266ab62f.jpg</url>
      <title>DEV Community: Warren Parad</title>
      <link>https://dev.to/wparad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wparad"/>
    <language>en</language>
    <item>
      <title>Actually Fixing AWS S3</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/actually-fixing-aws-s3-10g3</link>
      <guid>https://dev.to/aws-builders/actually-fixing-aws-s3-10g3</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help understanding this article or how you can implement auth
and similar security architectures in your services, feel free to 
reach out to me via the community server.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://authress.io/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;AWS just released a supposed fix for S3 bucket squatting by utilizing what they are calling &lt;a href="https://aws.amazon.com/blogs/aws/introducing-account-regional-namespaces-for-amazon-s3-general-purpose-buckets/" rel="noopener noreferrer"&gt;Account Regional Namespaces&lt;/a&gt;. I don't understand the hype, and now I'm going to explain why.&lt;/p&gt;

&lt;h2&gt;Broken: S3 Bucket Names are Global&lt;/h2&gt;

&lt;p&gt;S3 bucket names are global. Not global to your account. Not global to your region. Global to the entire AWS partition — every account, every region, every customer who has ever existed on AWS.&lt;/p&gt;

&lt;p&gt;This was not a deliberate design philosophy. It was a default from 2006 that nobody corrected. S3 launched when AWS was essentially a startup with Amazon as its main customer. Global uniqueness was the path of least resistance. Nobody asked whether it would cause problems at scale, because at the time &lt;em&gt;"scale"&lt;/em&gt; meant hundreds or thousands of developers, not millions of accounts and decades of production workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But, that default is still in place today.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma27qo59v4189h74wfbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma27qo59v4189h74wfbv.png" alt='A dog sitting calmly in a room that is on fire, captioned "This is Fine"'&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;AWS's relationship with the S3 naming model, circa every year since 2008.&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;The sad truth is, nobody needs global bucket names. There is no use case that requires your bucket name to be universally unique across every AWS customer on the planet. The value of global uniqueness flows entirely in one direction: it must have simplified the original implementation. The cost of global uniqueness flows in the other direction: two decades of pain for every customer who has ever tried to name a bucket something sensible.&lt;/p&gt;

&lt;p&gt;The abomination lives on because someone probably said "Wouldn't it be cool if you could expose your S3 bucket publicly?" And for that, the bucket name would have to be in the URL, and therefore globally unique (and also lowercase and &lt;a href="https://datatracker.ietf.org/doc/html/rfc7553" rel="noopener noreferrer"&gt;RFC 7553 compliant&lt;/a&gt;). This is true but also irrelevant. S3 doesn't even support TLS for custom domains. So there is no way to actually serve an asset such as &lt;code&gt;https://assets.mycompany.com&lt;/code&gt; directly from your S3 bucket. &lt;strong&gt;None, full stop.&lt;/strong&gt; Let's break that down: there are three parts to that URL — HTTPS, your domain, and something that maps to the S3 bucket. It has always been, and still is, only &lt;strong&gt;PICK 2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Anyone who needs a public URL with a real domain and HTTPS is already using CloudFront as a reverse proxy. As a matter of fact, every SPA out there must be using CloudFront to achieve HTTPS, or it must not be using a custom domain. The only suitable URL is the CloudFront distribution's alias, not the S3 bucket name. The bucket name is internal plumbing that nothing outside your AWS account should ever reference directly. I'm here to tell you that not only are global bucket names a mistake, there is actually an easy way to fix them. One has to wonder why AWS hasn't.&lt;/p&gt;

&lt;p&gt;The people who think they need global bucket names are the people using &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html" rel="noopener noreferrer"&gt;S3 Virtual Hosting&lt;/a&gt; — &lt;code&gt;mybucketname.s3.amazonaws.com&lt;/code&gt; — which does have TLS, but on AWS's domain, not theirs. And of course, there is a sad case for supporting this pattern indefinitely: AWS is much nicer than some other cloud providers that constantly deprecate actually required features, such as DNS zone hosting. Although in recent times that hasn't held up as much, which gives credence to AWS dropping the concept, as it would bring direct security and reliability wins, not to mention an outright improvement from reduced complexity. There is no case for making it the architectural foundation of an object storage service used by billions of production workloads. And as we will see shortly, exposing that endpoint directly comes with its own expensive problem that CloudFront eliminates entirely.&lt;/p&gt;

&lt;p&gt;The reality is that none of the following are tradeoffs you agreed to. They are the consequences of a default, set in 2006, that nobody changed. The cost has landed on you ever since. And it boils down to basically one core concept.&lt;/p&gt;

&lt;h3&gt;Name squatting&lt;/h3&gt;

&lt;p&gt;The boring version: the bucket name you want — &lt;code&gt;mycompany-prod-logs&lt;/code&gt;, &lt;code&gt;myapp-assets&lt;/code&gt;, &lt;code&gt;opentofu-state&lt;/code&gt; — was registered years ago by someone who no longer works at the company that registered it. AWS has no mechanism for name reclamation. That name is gone until the current owner deletes the bucket, which may never happen. So, you might think, just choose a new name, like you would choose a new username or website domain. This isn't a new problem, after all.&lt;/p&gt;

&lt;p&gt;But the reality is: bucket names are predictable, and predictable names are claimable before you need them, and it turns out some bucket names you actually very much need.&lt;/p&gt;

&lt;p&gt;The researchers at Aqua Security demonstrated this at Black Hat USA 2024, calling it &lt;a href="https://www.aquasec.com/blog/bucket-monopoly-breaching-aws-accounts-through-shadow-resources/" rel="noopener noreferrer"&gt;Bucket Monopoly&lt;/a&gt;. AWS services themselves automatically create S3 buckets using naming patterns derived from your account ID. Account IDs are not secret — they appear in IAM role ARNs, error messages, S3 URLs, and CloudTrail logs. And while good hygiene means keeping your AWS account ID obscured, the bucket names themselves are completely public. &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html" rel="noopener noreferrer"&gt;S3 Virtual Hosting&lt;/a&gt; resolves every bucket as a DNS subdomain (&lt;code&gt;mybucket.s3.amazonaws.com&lt;/code&gt;), and &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Security/Defenses/Certificate_Transparency" rel="noopener noreferrer"&gt;Certificate Transparency&lt;/a&gt; logs and &lt;a href="https://www.spamhaus.com/resource-center/what-is-passive-dns-a-beginners-guide/" rel="noopener noreferrer"&gt;passive DNS&lt;/a&gt; collectors observe and index those queries continuously. And while they might not have caught everything, any bucket that has ever received traffic via Virtual Hosting has a name that likely exists in a DNS database outside your control.&lt;/p&gt;

&lt;p&gt;Many naming patterns were vulnerable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Athena: &lt;code&gt;aws-athena-query-results-{account-id}-{region}&lt;/code&gt; — data query results&lt;/li&gt;
&lt;li&gt;Elastic Beanstalk: &lt;code&gt;elasticbeanstalk-{region}-{account-id}&lt;/code&gt; — application build artifacts&lt;/li&gt;
&lt;li&gt;AWS Config: &lt;code&gt;config-bucket-{account-id}&lt;/code&gt; — compliance and configuration records&lt;/li&gt;
&lt;li&gt;CloudFormation, Glue, EMR, SageMaker, ServiceCatalog, and CodeStar have all had similar patterns&lt;/li&gt;
&lt;/ul&gt;
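&lt;p&gt;To make the predictability concrete, here is a minimal Python sketch (the patterns are abbreviated from the list above, and the account ID in the comments is a made-up placeholder) showing that every one of these names is a pure function of two public values:&lt;/p&gt;

```python
# Hypothetical sketch: enumerate predictable service-bucket names an attacker
# could pre-register. Patterns come from the Bucket Monopoly disclosures; any
# account ID used with this helper is just an example value.
PATTERNS = [
    "aws-athena-query-results-{account_id}-{region}",  # Athena query results
    "elasticbeanstalk-{region}-{account_id}",          # Beanstalk artifacts
    "config-bucket-{account_id}",                      # AWS Config records
]

def candidate_buckets(account_id: str, region: str) -> list[str]:
    """Every name is derivable from the (effectively public) account ID."""
    return [p.format(account_id=account_id, region=region) for p in PATTERNS]
```

&lt;p&gt;An attacker only needs to create whichever of these buckets does not exist yet, then wait for the service to write to it.&lt;/p&gt;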

&lt;p&gt;The complete impact ranged from data exfiltration to remote code execution to full-service takeover. AWS patched many of these services after disclosure.&lt;/p&gt;

&lt;p&gt;The CDK case may be the worst. AWS's own infrastructure-as-code hack wrapper (because the CDK isn't actually the IaC tool; CloudFormation is) bootstraps a staging bucket with a name that was never random:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cdk-hnb659fds-assets-{account-id}-{region}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The qualifier &lt;code&gt;hnb659fds&lt;/code&gt; is a &lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/bootstrapping-customizing.html" rel="noopener noreferrer"&gt;hardcoded constant in CDK's bootstrap template&lt;/a&gt;. It has never changed. Anyone who knows your account ID knows your CDK staging bucket name. If that bucket does not exist — because you deleted it, because you have not bootstrapped yet, or because someone cleaned up an old environment — an attacker can claim it. CDK will then use that bucket to store and retrieve CloudFormation templates. The attacker injects a malicious template. CDK deploys it using an IAM role with broad permissions. Full account takeover.&lt;/p&gt;

&lt;p&gt;Aqua Security found over 38,000 accounts susceptible. The vulnerability was present for years before being fixed in CDK &lt;a href="https://github.com/aws/aws-cdk/releases/tag/v2.149.0" rel="noopener noreferrer"&gt;v2.149.0 in July 2024&lt;/a&gt;.&lt;/p&gt;
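&lt;p&gt;The derivation fits in a couple of lines. A sketch (the account ID used in the test of this helper is a placeholder):&lt;/p&gt;

```python
# The CDK bootstrap qualifier is a hardcoded default, so the staging bucket
# name is fully determined by two public values: account ID and region.
CDK_DEFAULT_QUALIFIER = "hnb659fds"

def cdk_staging_bucket(account_id: str, region: str) -> str:
    """Default CDK bootstrap assets bucket name for a given environment."""
    return f"cdk-{CDK_DEFAULT_QUALIFIER}-assets-{account_id}-{region}"
```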

&lt;p&gt;To be clear, an attacker who learns your AWS account ID can register those bucket names before you deploy the service. AWS will see that the bucket exists, trust it, and then route your data into the attacker's bucket. This can happen entirely without your knowledge. Have you actually checked that every bucket AWS is secretly sending data to is owned by your account? Probably not; you probably don't even know which buckets AWS is using.&lt;/p&gt;

&lt;h3&gt;Security Through Obscurity&lt;/h3&gt;

&lt;p&gt;I thought it would go without saying, but I'm sure someone will bring it up: "Keep your bucket name obscure" is not a defense, since you can figure out these buckets just by using AWS services. And worse, the bucket name shows up in website-hosting CNAMEs, presigned URLs, and other places. It is publicly available.&lt;/p&gt;

&lt;p&gt;And of course the inverse is also a problem. S3 bucket names carry implicit trust. When your infrastructure reads configuration from &lt;code&gt;my-config-bucket&lt;/code&gt;, it assumes the content is authoritative because the name is correct. The global namespace means that assumption is structurally unsound — the name and the owner are not bound to each other in any durable way. An attacker who controls a bucket your infrastructure reads from doesn't need to exfiltrate anything. They inject. Your service pulls the configuration, trusts it, and acts on it.&lt;/p&gt;

&lt;p&gt;This is not abstract. Consider the pattern of storing IAM permission mappings in S3 and distributing them via OU StackSets across an AWS organization. &lt;a href="https://authress.io/knowledge-base/articles/2026/03/03/securing-aws-accounts-access" rel="noopener noreferrer"&gt;Something I actually just wrote about doing&lt;/a&gt;. An attacker who controls that bucket — whether by squatting the name, claiming it after a deletion, or exploiting a misconfigured access policy — can inject a permissions map that adds their own identity as a trusted principal. The StackSet propagates the poisoned configuration to every account in the org. Their CICD pipeline assumes the role via OIDC federation. Full organization-wide access, delivered through the normal configuration path, with no credentials created and no anomalous API calls.&lt;/p&gt;

&lt;p&gt;This is the same pattern that made Clownstrike's &lt;a href="https://www.cisa.gov/news-events/alerts/2024/07/19/widespread-it-outage-due-crowdstrike-update" rel="noopener noreferrer"&gt;botched configuration update&lt;/a&gt; in 2024 so severe. A trusted delivery mechanism pushed configuration that every endpoint pulled and acted on without independent verification. The delivery channel was correct. The content was not. Millions of machines followed instructions from a source they had no reason to distrust.&lt;/p&gt;

&lt;p&gt;The difference is that Clownstrike's delivery infrastructure was their own, and the configuration was negligent, not malicious. Whereas the S3 version of this attack does not require compromising the infrastructure owner at all, it only requires claiming a bucket name.&lt;/p&gt;

&lt;p&gt;The global namespace is what makes this entire attack class possible. In a correctly scoped namespace, your bucket names are yours, and an attacker in a different account cannot claim them. AWS built a shared global pool and then built their own services on top of it using predictable names, inheriting the vulnerability they created.&lt;/p&gt;

&lt;h3&gt;Security misconfiguration&lt;/h3&gt;

&lt;p&gt;The public access model exists because bucket names are global. Since any AWS account can reference your bucket by name, making a bucket readable without credentials makes it readable by everyone — which is occasionally intentional and routinely catastrophic.&lt;/p&gt;

&lt;p&gt;The deeper problem: S3's access control system has never cleanly separated "accessible by my AWS account" from "accessible by the public internet." That distinction is not a first-class concept in S3. It has to be constructed from a combination of overlapping controls, each added at a different point in S3's history, each with its own interaction rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bucket policies&lt;/strong&gt; — grant access to specific principals or to &lt;code&gt;*&lt;/code&gt; (everyone)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACLs&lt;/strong&gt; — a separate, older system with its own grantees, including the confusingly named &lt;code&gt;AuthenticatedUsers&lt;/code&gt; grantee, which means any AWS account, not just yours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Block Public Access&lt;/strong&gt; — four separate boolean flags that apply restrictions over policies and ACLs, added only in 2018 as a retroactive guardrail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Object Ownership&lt;/strong&gt; — controls whether ACLs are enforced at all, added later still&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Policies&lt;/strong&gt; — scopes permissions to principals with IAM authority.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer was added to contain the blast radius of the previous one. None of them establish "private to my account" as the starting point. They establish "open to everything" as the starting point and ask you to correctly configure the restrictions. Miss one flag, misread one grantee, inherit one policy from a module you didn't write — and the bucket is likely public.&lt;/p&gt;

&lt;p&gt;I like this article from six years ago that talks a &lt;a href="https://nodramadevops.com/2020/04/why-protecting-data-in-s3-is-hard-and-a-least-privilege-bucket-policy-to-help/" rel="noopener noreferrer"&gt;bit about that&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F708hs2r81ix0a2q0xp6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F708hs2r81ix0a2q0xp6i.png" alt="AWS IAM access pattern"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;IAM access summarized&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;But then you realize: this is just how IAM works; it isn't how S3 works at all. Sure, whether or not IAM grants access is part of the picture, but where's the rest of it? I was trying to find a document in the AWS docs that does a good job of explaining it. There isn't one. There are over &lt;strong&gt;One Hundred Pages&lt;/strong&gt; on access control in S3 alone. Don't believe me? &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-management.html" rel="noopener noreferrer"&gt;Count them&lt;/a&gt;. To be fair, we have more than one page on similar &lt;a href="https://authress.io/knowledge-base/docs/category/authorization" rel="noopener noreferrer"&gt;Authorization concepts&lt;/a&gt; in the Authress KB. However, arguably what we designed has to be significantly more complex, since it has to handle literally every possible authorization scenario.&lt;/p&gt;

&lt;p&gt;This is not a configuration problem. It is an architecture problem. It is a security problem. The controls are layered on top of a model that was never designed to be private.&lt;/p&gt;

&lt;p&gt;And while the likelihood of getting it wrong has gone down significantly, the trade-off has been increased burden on configuration and setup.&lt;/p&gt;




&lt;h2&gt;Historical Hacks&lt;/h2&gt;

&lt;p&gt;Each problem identified by the community attracted a patch from AWS. But no one said they were the right patches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forced random suffixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For buckets operated by AWS Services, you have no recourse, but for buckets you manage for your own platform, you have a small but not very satisfying alternative. Because the global pool is full of names claimed by other accounts, you cannot have the names you want. &lt;code&gt;my-app-assets&lt;/code&gt; is taken. &lt;code&gt;opentofu-state&lt;/code&gt; is taken. &lt;code&gt;prod-logs&lt;/code&gt; is taken. The community's answer to the problem, years before AWS even started to take any approach, is to use the only reliable strategy available — append a random suffix and stop trying to name things sensibly: &lt;code&gt;my-app-assets-8f2a3c&lt;/code&gt;, &lt;code&gt;opentofu-state-a1b2c3&lt;/code&gt;, &lt;code&gt;prod-logs-9e4d71&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A list of your S3 buckets is now a list of opaque identifiers. Understanding which bucket belongs to which service requires either tagging discipline — which degrades over time — or reading OpenTofu state, which is stored in an S3 bucket with a random suffix. Not to mention this only gets around the creation problem, and doesn't remotely address the security angle.&lt;/p&gt;

&lt;p&gt;This is not a novel problem. Discord ran the same experiment with usernames. Their original system appended a four-digit discriminator to every display name: &lt;code&gt;warren#0088&lt;/code&gt;. Globally unique, unambiguous, machine-friendly. I don't remember anyone who could actually remember their discriminator. I can't imagine how many friend requests failed because users entered the wrong tag. With only 10,000 discriminators available per name, popular names of course ran out.&lt;/p&gt;

&lt;p&gt;Discord's fix was not to make the discriminator longer. They separated the unique identifier — the username, used for backend lookups — from the display name, which is human-readable and non-unique. The part that needed global uniqueness was the lookup mechanism. The part humans see and share does not need to be globally unique at all.&lt;/p&gt;

&lt;p&gt;S3 never made this distinction. The bucket name is simultaneously the unique global identifier, the human-readable label, and the public URL component. When all three concerns are collapsed into one string that must be globally unique across every AWS customer, you get &lt;code&gt;my-app-assets-8f2a3c&lt;/code&gt;. That is your discriminator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forced predictable suffixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For us, we've taken a slightly different approach. That's because random suffixes cannot be dynamically derived at read time and are not idempotent, which usually means hard-coding the string in multiple places. Or worse, I've seen many implementations attempt to export the generated S3 name from the infrastructure process to somewhere else, effectively coupling disparate systems that had no business being coupled together.&lt;/p&gt;

&lt;p&gt;Our approach is to add the AWS account ID, the region, and an internally consistent identifier to every bucket we create. Now everyone will understand what a given bucket is. For example, you can imagine choosing something like &lt;code&gt;-${accountId}-${region}-un1que1d&lt;/code&gt;. Is that clever? Not really, but it is far better than every bucket having a random ID.&lt;/p&gt;
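&lt;p&gt;As a sketch (the &lt;code&gt;un1que1d&lt;/code&gt; qualifier is just an example value), the convention is a pure function, which means any service can re-derive the name at read time instead of importing it from IaC state:&lt;/p&gt;

```python
# Deterministic bucket naming: the same inputs always yield the same name,
# unlike a random suffix. "un1que1d" stands in for your internal identifier.
def bucket_name(base: str, account_id: str, region: str,
                qualifier: str = "un1que1d") -> str:
    return f"{base}-{account_id}-{region}-{qualifier}"
```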

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;ExpectedBucketOwner&lt;/code&gt; property&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One hack AWS added was a new parameter integrated into the S3 APIs, which validates ownership on bucket-related actions such as PutObject and GetObject. Released in &lt;a href="https://aws.amazon.com/blogs/aws/amazon-s3-update-three-new-security-access-control-features/" rel="noopener noreferrer"&gt;Oct 2020&lt;/a&gt;, every S3 API call can now include the expected AWS account ID of the bucket owner. If the bucket exists but belongs to a different account, the call fails. You add this header to your SDK calls, your bucket policies, your presigned URL logic. The problem of AWS-created buckets was so bad that AWS needed an internal security fix for it. And this helped a little bit for us users as well. It isn't a real solution though, just something hacked on top.&lt;/p&gt;

&lt;p&gt;The problem with this hack, though, is that it is security you have to opt into, and if you are using some library or reusable module, good luck assuming it made it in.&lt;/p&gt;
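&lt;p&gt;One way to avoid forgetting it is to funnel every S3 call through a thin wrapper. A hedged sketch, assuming a boto3-style client (the account ID is a placeholder; with a real client the call in the last comment would use &lt;code&gt;boto3.client("s3")&lt;/code&gt;):&lt;/p&gt;

```python
# Force ExpectedBucketOwner onto every S3 call so that a request against a
# squatted bucket in another account fails instead of leaking data.
OWN_ACCOUNT_ID = "123456789012"  # placeholder for your real account ID

def guarded_call(s3_client, method: str, **params):
    """Invoke an S3 client method, always asserting bucket ownership."""
    params.setdefault("ExpectedBucketOwner", OWN_ACCOUNT_ID)
    return getattr(s3_client, method)(**params)

# e.g. guarded_call(s3, "get_object", Bucket="my-bucket", Key="config.json")
```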

&lt;p&gt;&lt;strong&gt;CDK v2.149.0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the July 2024 fix for the CDK bootstrap, AWS merged a change that adds a condition to the CDK bootstrap role, preventing the attacker-controlled-bucket scenario. However, the fix still requires teams to re-run &lt;code&gt;cdk bootstrap&lt;/code&gt;. Any environment bootstrapped with CDK v2.148.1 or earlier and not yet re-bootstrapped remains vulnerable. The hack qualifier is still &lt;code&gt;hnb659fds&lt;/code&gt;, but you can change it, &lt;a href="https://github.com/aws/aws-cdk/blob/d16dc7e433c4986f3473b2992ba36bee9fb64f1e/packages/aws-cdk-lib/core/lib/stack-synthesizers/bootstrapless-synthesizer.ts#L10-L18" rel="noopener noreferrer"&gt;if you want to.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block Public Access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By 2018, the pattern was clear: teams were misconfiguring bucket policies and ACLs in what seemed like an on-purpose mission to win an award. Objects were going public, breaches were making headlines, and the individual controls were too granular and too easy to get wrong. AWS's response was to add a meta-level override: Block Public Access — four boolean flags that sit above all bucket policies and ACLs and veto any access grant that would expose objects to the public internet. To be clear, these flags don't affect the bucket at all; they affect your ability to change those other insecure properties on the bucket.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;BlockPublicAcls&lt;/code&gt;, &lt;code&gt;IgnorePublicAcls&lt;/code&gt;, &lt;code&gt;BlockPublicPolicy&lt;/code&gt;, &lt;code&gt;RestrictPublicBuckets&lt;/code&gt;. Each flag is a different angle on the same problem.&lt;/p&gt;

&lt;p&gt;It is a kill switch. It works, for the most part. It was necessary because the model it was bolted onto had no safe default — the access system started out open and required teams to correctly configure the restrictions, which teams reliably failed to do at scale. Block Public Access does not change that model. It adds a blunt override and calls it a fix. AWS enabled it by default for new accounts in 2022.&lt;/p&gt;
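&lt;p&gt;Because the override is four independent booleans, auditing them is worth automating. A minimal sketch over the &lt;code&gt;PublicAccessBlock&lt;/code&gt; configuration shape the S3 API returns:&lt;/p&gt;

```python
# All four Block Public Access flags must be True; a single missing or False
# flag leaves one of the public-exposure paths open.
BPA_FLAGS = ("BlockPublicAcls", "IgnorePublicAcls",
             "BlockPublicPolicy", "RestrictPublicBuckets")

def fully_blocked(public_access_block: dict) -> bool:
    """True only when every Block Public Access flag is enabled."""
    return all(public_access_block.get(flag) is True for flag in BPA_FLAGS)
```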

&lt;p&gt;&lt;strong&gt;Paying for unauthorized access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Did you know that until 2024, if someone attempted to access your AWS S3 bucket, even if it was never public, you would still incur a charge? This massive oversight was fixed under the radar, and you can read more about it in the release &lt;a href="https://aws.amazon.com/about-aws/whats-new/2024/05/amazon-s3-no-charge-http-error-codes/" rel="noopener noreferrer"&gt;Amazon S3 will no longer charge for several HTTP error codes&lt;/a&gt;. How that ever got off the ground in the first place is honestly shocking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OU: Block Public Access&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, only last year did AWS release the ability for AWS Organizations to turn off this incredibly insecure configuration by utilizing one of the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-control-block-public-access.html" rel="noopener noreferrer"&gt;S3 org-level policies&lt;/a&gt;. Now you can actually be sure you don't accidentally get it wrong, or at least find out that you did much sooner than you would have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floeakx0e0kn11po6izjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floeakx0e0kn11po6izjq.png" alt="A doctor telling a patient &amp;quot;Well, don't do that then&amp;quot; in response to &amp;quot;Doctor, it hurts when I do this&amp;quot;"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;The entire history of S3 naming advice, summarized.&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;Are these hacks? Yes, yes they are. That is because the default path for using S3 requires more configuration than the lesser-used strategies. If you want your bucket to be public, you configure less than if you want it to stay private. If you want to be sure you are writing to your own bucket, you need to add properties rather than remove them.&lt;/p&gt;




&lt;h2&gt;What AWS Just Shipped&lt;/h2&gt;

&lt;p&gt;The biggest challenge with all of these hacks is that each new one required every service, product, application, and library to directly integrate the change, because every API, architecture decision, and code path had to account for it. These weren't just hacks AWS made to solve the problem; these were bad hacks that pushed the burden onto customers.&lt;/p&gt;

&lt;p&gt;And so, AWS has watched the community embed account IDs, regions, and random identifiers into bucket names for years. That must have meant we loved it, because they then shipped that exact pattern as a first-class feature: &lt;a href="https://aws.amazon.com/blogs/aws/introducing-account-regional-namespaces-for-amazon-s3-general-purpose-buckets/" rel="noopener noreferrer"&gt;Account Regional Namespaces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbssiathy3u6s0yz90gqy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbssiathy3u6s0yz90gqy.jpg" alt="Success self pat on the back"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The feature works like this: when you create a bucket named &lt;code&gt;myapp-logs&lt;/code&gt; and request it in your account-regional namespace, the resulting name is &lt;code&gt;myapp-logs-123456789012-us-east-1-an&lt;/code&gt;. The &lt;code&gt;-an&lt;/code&gt; suffix signals to the S3 service that this name is scoped to your account and region. Nobody else can register &lt;code&gt;anything-123456789012-us-east-1-an&lt;/code&gt; — the &lt;code&gt;123456789012-us-east-1&lt;/code&gt; segment is reserved for your account. How AWS managed to promise that buckets with an &lt;code&gt;-an&lt;/code&gt; suffix don't already exist, and that none of those buckets were in a cross-account scenario, is beyond me. Maybe they didn't. The likelihood is very small that someone already had a bucket with a suffix of &lt;code&gt;-{accountId}-{region}-an&lt;/code&gt;, but if they did, and they had a cross-account scenario, then that is now broken. Or maybe it isn't; maybe that special bucket, according to the new rules, was created in the correct account, but in reality someone else owns it.&lt;/p&gt;

&lt;p&gt;And so, we can see the same problematic pattern with this one as all the other hacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is opt-in.&lt;/strong&gt; You must set a special header or use a special property on &lt;code&gt;CreateBucket&lt;/code&gt;. Existing buckets are not migrated. Existing tooling does not generate these names. Every piece of infrastructure code that creates S3 buckets needs to be updated to use the new naming convention. And that means every service, SDK, API, library, and product you are using must also make this change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It wastes 26+ characters of your bucket name.&lt;/strong&gt; S3 bucket names have a 63-character limit. You now have at most 37 characters to work with before you hit the wall. If you have a naming convention like &lt;code&gt;{environment}-{team}-{service}-{purpose}&lt;/code&gt;, you are already in trouble. Hopefully each team in your organization has its own AWS account, but I know some of us aren't that lucky. You might be asking yourself, why 63? This limitation almost certainly exists because the bucket name has to be part of the URL as a subdomain, and DNS labels max out at 63 characters according to &lt;a href="https://datatracker.ietf.org/doc/html/rfc1123#section-2" rel="noopener noreferrer"&gt;RFC 1123&lt;/a&gt;.&lt;/p&gt;
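&lt;p&gt;The arithmetic is easy to verify. A sketch of the character budget, assuming a 12-digit account ID:&lt;/p&gt;

```python
# Character budget under Account Regional Namespaces: the generated suffix
# "-{accountId}-{region}-an" is charged against the 63-character DNS label
# limit that bucket names inherit.
MAX_BUCKET_NAME = 63

def remaining_budget(account_id: str, region: str) -> int:
    """Characters left for your own base name after the namespace suffix."""
    suffix = f"-{account_id}-{region}-an"
    return MAX_BUCKET_NAME - len(suffix)
```

&lt;p&gt;For &lt;code&gt;us-east-1&lt;/code&gt; the suffix is 26 characters, leaving 37; a longer region name like &lt;code&gt;ap-southeast-2&lt;/code&gt; leaves only 32.&lt;/p&gt;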

&lt;p&gt;&lt;strong&gt;It does not address the actual architectural problem.&lt;/strong&gt; Your bucket is still globally addressable via &lt;code&gt;s3.amazonaws.com&lt;/code&gt;. The access model is unchanged. The public bucket problem is unchanged.&lt;/p&gt;

&lt;p&gt;And then there is the SDK story.&lt;/p&gt;

&lt;p&gt;Clever engineers will immediately ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If my bucket name no longer explicitly includes my account ID and region, I cannot just pass around the bucket name. How do I write portable infrastructure?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My answer: You don't.&lt;/p&gt;

&lt;p&gt;AWS's obvious answer: pass the account ID and region as a special token that the SDK resolves at runtime from the current execution environment. Instead of hardcoding &lt;code&gt;123456789012&lt;/code&gt;, you reference a variable that CloudFormation or the SDK resolves from the execution context.&lt;/p&gt;

&lt;p&gt;So it's a second hack layered on top of the first one. The question is philosophical but practical, and AWS' answer is technical. That's a weird take.&lt;/p&gt;

&lt;p&gt;You now have infrastructure code that creates bucket names by concatenating a prefix with a runtime-resolved account ID and region. Your IaC state needs to capture the resolved name, not the template. Your references to the bucket in other services need to either embed the same resolution logic or accept the full resolved name as an input. Your cross-account pipelines — CI/CD systems deploying into multiple accounts — need to be aware of this resolution mechanism.&lt;/p&gt;

&lt;p&gt;AWS did not fix the problem. They added an opt-in feature that partially addresses one symptom, then added tooling to work around the limitations of that feature. You'll notice in the same release post, they also include the changes they had to make to CloudFormation S3 Resource. The people celebrating are celebrating a band-aid on a fracture.&lt;/p&gt;




&lt;h2&gt;
  
  
  How S3 Is Actually Used​
&lt;/h2&gt;

&lt;p&gt;But the real goal of this article is actually to talk about a solution. To do that, we need to review the fundamental use cases of S3. In practice it exists for four distinct use cases, which of course have almost nothing in common:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Private object storage&lt;/strong&gt; — build artifacts, backups, data lakes, Lambda packages, database snapshots, OpenTofu/Terraform/IaC state files, and SPA assets served through CloudFront. No direct external access. Internal AWS service-to-service or IAM-authenticated only. I'm going to go out on a limb and say this is 99% of S3 usage, by volume and by bucket count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Event-driven processing&lt;/strong&gt; — S3 event notifications triggering Lambda functions. An object is created or deleted; an event fires; a Lambda processes it. (One caveat: you must NEVER wire S3 directly to Lambda, because S3 event notifications are not durable. Send all S3 events to SQS first, and from there to Lambda.) The bucket name and ARN arrive in the event payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Records"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventSource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws:s3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"awsRegion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-03-01T12:00:00.000Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ObjectCreated:Put"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"userIdentity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"principalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS:AROAEXAMPLEID:session"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"responseElements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"x-amz-request-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EXAMPLE123456789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"s3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"s3SchemaVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"configurationId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"upload-processor-trigger"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-app-uploads"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ownerIdentity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"principalId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AEXAMPLEOWNERID"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"arn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-app-uploads"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uploads/photo.jpg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"eTag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"d41d8cd98f00b204e9800998ecf8427e"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sequencer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0A1B2C3D4E5F678901"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what is not in this payload: a public-facing URL. The &lt;code&gt;bucket.name&lt;/code&gt; and &lt;code&gt;bucket.arn&lt;/code&gt; reference the internal bucket name. S3 ARNs have never included an account ID or region — &lt;code&gt;arn:aws:s3:::my-app-uploads&lt;/code&gt;, not &lt;code&gt;arn:aws:s3:us-east-1:123456789012:my-app-uploads&lt;/code&gt;. The identifier in the event is already the private bucket identifier, not a public one. And it would be easy to add the region and account ID to this ARN and likely not break a single thing.&lt;/p&gt;

&lt;p&gt;And that's the tell. The event-driven use case has always operated on private identifiers. The Lambda function receiving this event doesn't care what the bucket is called publicly, or whether it has a public URL at all. It cares about the object key and the internal bucket reference — both of which are already account-scoped and private by nature. S3's internal event system was already operating on the right model. The global namespace was never part of this path.&lt;/p&gt;
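&lt;p&gt;A minimal sketch of what a consumer of that payload does (the function name is illustrative, not an AWS API). Note that every field it touches is a private identifier:&lt;/p&gt;

```javascript
// Illustrative sketch: a consumer of the S3 event payload shown above.
// It only ever touches private identifiers; no public URL exists in the event.
function extractObjectRefs(event) {
  return event.Records
    .filter((record) => record.eventSource === 'aws:s3')
    .map((record) => ({
      bucket: record.s3.bucket.name, // internal bucket name
      arn: record.s3.bucket.arn,     // account-less S3 ARN
      region: record.awsRegion,
      // Object keys arrive URL-encoded with '+' for spaces; decode them.
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
    }));
}
```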

&lt;p&gt;&lt;strong&gt;3. Presigned URLs&lt;/strong&gt; — assets that could be served over CloudFront because they are cacheable, but which you don't want to be public, such as user-owned data. So you create a strategy to serve that data directly from S3. The same goes in reverse: you allow users to upload data, but rather than funneling it through your service API, you have the client integrate directly with S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Direct public access&lt;/strong&gt; — open buckets, bucket website hosting, ACL-public objects, resolvable by a public DNS. This is the pattern that causes all the breaches, all the confusion, and almost all of the architectural complexity AWS has accumulated in S3 over the years.&lt;/p&gt;

&lt;p&gt;Category 4 is a tiny fraction of actual S3 usage by any metric you choose. It is responsible for a disproportionate fraction of the design surface area, the security incidents, and the policy complexity. And all the fixes so far make usages (1), (2), and (3) more challenging, while increasing the safety of (4). This is not how you solve architectural problems. You want a strategy where the most frequent uses are optimized for security, where the threat model identifies the biggest risk and you subvert that, rather than protecting a screen door or a fence in the middle of the desert.&lt;/p&gt;

&lt;p&gt;The data breaches you read about were almost always S3 misconfiguration involving category 4. A few illustrative examples from 2017 and 2018 alone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.upguard.com/breaches/verizon-cloud-leak" rel="noopener noreferrer"&gt;Verizon&lt;/a&gt;&lt;/strong&gt; — 14 million customer records including names, addresses, and account PINs, left in a publicly accessible bucket by a third-party vendor (NICE Systems). The bucket was open for weeks after Verizon was notified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.upguard.com/breaches/cloud-leak-accenture" rel="noopener noreferrer"&gt;Accenture&lt;/a&gt;&lt;/strong&gt; — Four public buckets containing 137GB of internal data: credentials, decryption keys, the master AWS KMS access key for their cloud platform, and data from clients across the Fortune 500.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://mackeeper.com/blog/data-breach-reports-2017/" rel="noopener noreferrer"&gt;WWE&lt;/a&gt;&lt;/strong&gt; — 3 million fan records including home addresses, ages of children, ethnicity, and account details. Open to anyone with the URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.engadget.com/2018-08-09-amazon-aws-error-exposes-31-000-godaddy-servers.html" rel="noopener noreferrer"&gt;GoDaddy&lt;/a&gt;&lt;/strong&gt; — Configuration data for 31,000 GoDaddy servers exposed in a public bucket. In a detail that should give everyone pause: the bucket was used and misconfigured by an AWS employee.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix in every case should have been "make S3 harder to misconfigure." But the advice and resolution we've seen was instead: "fix your IAM policies", "enable Block Public Access", "audit your bucket ACLs." Patches. Tooling. Guardrails and Security Hub findings around a footgun that should not exist in the first place.&lt;/p&gt;

&lt;p&gt;The reason category 4 exists at all is historical. In 2006, if you wanted to serve a file publicly from the internet, you needed a publicly accessible server. S3 was that server. CloudFront did not launch until 2008. IAM did not launch until 2011. The access model AWS ships with S3 today is the access model from an era when the alternatives did not exist yet. (I'm of course speculating here, because I didn't use AWS until 2008, and couldn't find a great source for this.)&lt;/p&gt;

&lt;p&gt;Yet, some of the hacks to fix this problem have happened much later than 2011, and realistically, none of them even required IAM to make this happen.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Root Cause​
&lt;/h2&gt;

&lt;p&gt;All of that complexity — ACLs, Object Ownership, Block Public Access, website hosting, and the hacks added later — attempts to fix second-order mistakes. It was all piled on top of the one thing nobody touched: &lt;strong&gt;the naming model&lt;/strong&gt;. And it's the real feature everyone wants:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature 1: The same logical bucket name across multiple AWS accounts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Take OpenTofu (or any IaC for that matter) for instance. You need remote state storage. The canonical setup: one S3 bucket per account, typically named something like &lt;code&gt;{org}-opentofu-state&lt;/code&gt; or &lt;code&gt;{account-name}-tfstate&lt;/code&gt;. Simple, readable, deterministic.&lt;/p&gt;

&lt;p&gt;In practice, you have a &lt;code&gt;dev&lt;/code&gt; account, a &lt;code&gt;staging&lt;/code&gt; account, a &lt;code&gt;production&lt;/code&gt; account, a &lt;code&gt;security&lt;/code&gt; account, a &lt;code&gt;shared-services&lt;/code&gt; account. You want &lt;code&gt;123456798012-opentofu-state&lt;/code&gt; in all of them. Under the current global namespace, you cannot have that. You have to name them &lt;code&gt;123456798012-opentofu-state-dev&lt;/code&gt;, &lt;code&gt;123456798012-opentofu-state-prod&lt;/code&gt;, and so on — encoding the account into the name because the namespace doesn't do it for you.&lt;/p&gt;

&lt;p&gt;With the new account-regional namespaces, you can now have &lt;code&gt;opentofu-state&lt;/code&gt; scoped to each account. In theory. In practice, all that changed was the interface for creating buckets; the usage of the buckets and their names is still the same as before this latest feature. Worse, without changing anything about how the service actually works, now everyone needs to make changes. It is the worst of all fates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenTofu's and other IaC's S3 backend configuration needs to be updated to use the new naming scheme&lt;/li&gt;
&lt;li&gt;Any modules that reference this bucket by name need to be updated&lt;/li&gt;
&lt;li&gt;Any existing state files pointing to the old bucket names need to be migrated&lt;/li&gt;
&lt;li&gt;Your bootstrap process — the code that creates the state bucket before OpenTofu can run — needs to support the new &lt;code&gt;CreateBucket&lt;/code&gt; header&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of this is easily managed. And while you can opt out of things like (2) and (3), you all know that there is some "security theater" going on at large enterprises that will claim a migration here "increases security". I'm sure the associated Security Hub finding is going to come out soon with a Critical severity. All of it is work that should not have been necessary if the architecture had been correct from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature 2: The same logical bucket name across multiple regions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-region active-active deployments are increasingly common. You want &lt;code&gt;my-app-assets&lt;/code&gt; in &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;eu-west-1&lt;/code&gt;. Under the account-regional namespace, these would be &lt;code&gt;my-app-assets-123456789012-us-east-1-an&lt;/code&gt; and &lt;code&gt;my-app-assets-123456789012-eu-west-1-an&lt;/code&gt; — different names for logically identical resources. Your infrastructure code must now either parameterize the region or generate the full resolved name in every place that references the bucket.&lt;/p&gt;

&lt;p&gt;This is the same problem that existed before the fix. The namespace is account-regional — it scopes names to an account &lt;em&gt;and&lt;/em&gt; a region. That is correct for preventing name collisions, but it means your logical bucket name is still not portable across regions. The same bucket in a different region is a different name. Your replication configuration, your CDN origin setup, your cross-region failover logic — all of it must carry the full resolved name around. You can have the same DynamoDB Table Name used in every region, but not S3.&lt;/p&gt;

&lt;p&gt;The underlying issue is that S3 conflated four separate concerns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt; — what is this bucket called?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location&lt;/strong&gt; — which account owns it, and which region holds the data?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Addressability&lt;/strong&gt; — how do external clients find it?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility&lt;/strong&gt; — Who should have access to it?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AWS's new feature embeds all four into the name string itself: &lt;code&gt;myapp-123456789012-us-east-1-an&lt;/code&gt;. The account ID is in the name. The region is in the name. The identity is whatever is left over after you subtract those 26+ characters. The &lt;code&gt;-an&lt;/code&gt; suffix limits access. This is not a namespace — it is a naming convention that happens to be enforced by the S3 service on creation only. The four concerns are still coupled; they are just coupled inside the string rather than expressed explicitly as configuration.&lt;/p&gt;
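&lt;p&gt;You can see the coupling by writing a parser for the convention. This is a hypothetical sketch under the format assumed above, &lt;code&gt;{identity}-{accountId}-{region}-an&lt;/code&gt;; all four concerns come back out of the single string:&lt;/p&gt;

```javascript
// Hypothetical parser: all four concerns live inside one name string.
// Format assumed from the article: {identity}-{accountId}-{region}-an
function parseAccountRegionalName(bucketName) {
  const match = bucketName.match(/^(.+)-(\d{12})-([a-z]{2}(?:-[a-z]+)+-\d)-an$/);
  if (!match) return null; // not an account-regional name
  return {
    identity: match[1],   // what the bucket is called
    accountId: match[2],  // who owns it
    region: match[3],     // where the data lives
    accessScoped: true,   // the `-an` marker itself
  };
}
```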




&lt;h2&gt;
  
  
  Intelligent Design​
&lt;/h2&gt;

&lt;p&gt;I want to be clear, AWS S3 is a fantastic service. It is so great in fact that there are no small number of huge businesses built around duplicating the S3 API. &lt;a href="https://aws.amazon.com/blogs/aws/twenty-years-of-amazon-s3-and-building-whats-next/" rel="noopener noreferrer"&gt;There are 20 years of successes&lt;/a&gt; after all. And I don't want to gloss over that:&lt;/p&gt;

&lt;h3&gt;
  
  
  What S3 gets right​
&lt;/h3&gt;

&lt;p&gt;Object storage is the correct primitive. An opaque key — a bucket name and an object path — maps to a sequence of bytes. Durable, versioned, regionally placed, with a consistent API surface across every SDK AWS ships. Lifecycle rules, replication, object tagging, multipart uploads, and locking (but only recently unfortunately). These are the right tools for managing data at scale, and they work.&lt;/p&gt;

&lt;p&gt;Additionally, Presigned URLs are the correct mechanism for temporary access delegation. Credential-scoped, time-limited, no IAM policy change required. The object stays private; the URL grants access for a window. That's also the right design.&lt;/p&gt;

&lt;p&gt;Do I need to mention the high durability of &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html" rel="noopener noreferrer"&gt;99.999999999%&lt;/a&gt;, and the availability of &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html" rel="noopener noreferrer"&gt;99.99%&lt;/a&gt; as well?&lt;/p&gt;

&lt;p&gt;None of this needs to change. The problem isn't storage. It's two things piled on top of storage: the naming model and the access model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secure by default​
&lt;/h3&gt;

&lt;p&gt;Every AWS primitive designed with security in mind starts from the same position: the unconfigured state is safe.&lt;/p&gt;

&lt;p&gt;IAM: default deny on everything. No permission exists until you create one explicitly. The account with no IAM policies grants access to nothing.&lt;/p&gt;

&lt;p&gt;VPC Security Groups: inbound traffic blocked by default. Every allow rule is explicit. The security group you just created, without touching it? It denies everything. (excluding the default VPC, which I'm not going to get into here)&lt;/p&gt;

&lt;p&gt;KMS customer-managed keys: a key with no resource policy grants decryption to nobody — except the account root, which is a recovery mechanism, not an access path. Grants are explicit.&lt;/p&gt;

&lt;p&gt;S3 is the exception.&lt;/p&gt;

&lt;p&gt;Secure by default doesn't mean &lt;em&gt;"safe unless you misconfigure it."&lt;/em&gt; It means safe by construction. The state you reach without doing anything must be the safe state. And for me that also excludes the presence of &lt;code&gt;pits of failure&lt;/code&gt;. If it is easy to do the wrong thing, then this is a dangerous state. Public access, for instance, must require deliberate, explicit, named work. Not the absence of a flag. Not the absence of a policy. Not a default you forgot to change.&lt;/p&gt;

&lt;p&gt;S3 had it backwards. And the fix isn't more flags. The fix is a model where a public bucket cannot exist — because public access isn't a property a bucket can have, it's a property of a feature called "promotion".&lt;/p&gt;

&lt;h3&gt;
  
  
  My Proposal: Private by Default, Public by Promotion
&lt;/h3&gt;

&lt;p&gt;Here is the core insight that AWS released but no one wanted to commit to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket names are global (partly addressed by the new feature, but only for new buckets, only opt-in, only with a 26-character tax)&lt;/li&gt;
&lt;li&gt;Buckets are the unit of access control&lt;/li&gt;
&lt;li&gt;Public access is a property of the bucket&lt;/li&gt;
&lt;li&gt;Anyone with the bucket name and the right IAM permissions (or no permissions required, if it's public) can read objects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right model: &lt;strong&gt;A Private Bucket Service&lt;/strong&gt;. If you tilt your head sideways and squint, you might see that such a thing has been here all along, and I'm sure there is even an already existing AWS primitive that encapsulates this concept internally.&lt;/p&gt;


&lt;p&gt;By &lt;strong&gt;Private&lt;/strong&gt;, I mean that the bucket is private to your account, not merely private in the sense that it isn't publicly accessible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allow the creation of &lt;strong&gt;S3 Private Buckets&lt;/strong&gt; the same way you would the current &lt;strong&gt;S3 Public Buckets&lt;/strong&gt;. Might as well rename the current API to be the &lt;code&gt;Public Buckets Service&lt;/code&gt; instead, although I guess PBS was already taken, not to mention Public and Private both start with &lt;code&gt;P&lt;/code&gt;, a bit of an oversight in the English language.&lt;/li&gt;
&lt;li&gt;Private Buckets only exist in that one region in that one account, and use AWS ARNs correctly, with the AWS account ID and region in the ARN.&lt;/li&gt;
&lt;li&gt;All interactions within the account will assume the private bucket, and never the public bucket. These are your API calls through SDKs, Event Source Mappings for SQS, Event notifications.&lt;/li&gt;
&lt;li&gt;Names follow the same strategy as they do today (although since they aren't public, please let us have uppercase characters)&lt;/li&gt;
&lt;li&gt;Objects are private. Not by default. Always. Without exception.&lt;/li&gt;
&lt;li&gt;Public access is not a property of the bucket. (Want to create a public bucket still? I'll get to that in a moment.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I don't think this is a novel concept. DynamoDB works exactly this way.&lt;/p&gt;

&lt;p&gt;And under this model, &lt;code&gt;my-app-assets&lt;/code&gt; in &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;my-app-assets&lt;/code&gt; in &lt;code&gt;eu-west-1&lt;/code&gt; are two separate buckets, each globally identifiable via the ARN, and accessible via the region-based parameter in the SDK/CLI/API (which, by the way, is already necessary). Your infrastructure code references the bucket name as it always has done.&lt;/p&gt;
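&lt;p&gt;A sketch of what addressing would look like under this proposal. To be clear, the ARN shape &lt;code&gt;arn:aws:s3:REGION:ACCOUNT_ID:name&lt;/code&gt; is the proposed, hypothetical format, not a real S3 ARN today:&lt;/p&gt;

```javascript
// Sketch of the proposed model: the logical name is identical everywhere;
// the ARN, not the name, carries account and region.
// NOTE: this ARN shape is the article's proposal, not a real S3 ARN today.
function privateBucketArn(region, accountId, name) {
  return `arn:aws:s3:${region}:${accountId}:${name}`;
}

const east = privateBucketArn('us-east-1', '123456789012', 'my-app-assets');
const west = privateBucketArn('eu-west-1', '123456789012', 'my-app-assets');
// Same logical name, two distinct globally-unique identifiers:
console.log(east); // arn:aws:s3:us-east-1:123456789012:my-app-assets
console.log(west); // arn:aws:s3:eu-west-1:123456789012:my-app-assets
```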

&lt;p&gt;What's missing you might ask?&lt;/p&gt;

&lt;p&gt;No 26-character suffix. No runtime SDK token substitution. No encoding of internal topology into names that humans have to read and type. No weird public configuration, no ACLs, no URLs associated with the buckets, no pits of failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cornerstone Example
&lt;/h3&gt;

&lt;p&gt;When you create a bucket today, let's call it &lt;code&gt;my-app-assets&lt;/code&gt;, you use what I'll rename &lt;code&gt;s3PublicClient.createPublicBucket()&lt;/code&gt;. It has a ridiculous number of limitations for creation, which I will get to later, as well as the underlying assumption that you will make some part of it public. It comes with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket Policy&lt;/li&gt;
&lt;li&gt;CORS Policy&lt;/li&gt;
&lt;li&gt;DNS Name&lt;/li&gt;
&lt;li&gt;Bucket Website&lt;/li&gt;
&lt;li&gt;Global ARN&lt;/li&gt;
&lt;li&gt;Public Access Block configuration&lt;/li&gt;
&lt;li&gt;63 character lowercase name restriction&lt;/li&gt;
&lt;li&gt;I'm sure there are 20 more things here that also no one needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That bucket is created with the ARN &lt;code&gt;arn:aws:s3:::my-app-assets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This doesn't go away, you can still call that API, if you really wanted to. But the truth is that no one would call that API, because very few people need that API. Instead you would call the &lt;code&gt;s3PrivateClient.createPrivateBucket()&lt;/code&gt;, and you will get a bucket with an ARN &lt;code&gt;arn:aws:s3:REGION:AWS_ACCOUNT_ID:my-app-assets&lt;/code&gt;. That bucket operates with everything you would want in a private bucket:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encryption&lt;/li&gt;
&lt;li&gt;Governance&lt;/li&gt;
&lt;li&gt;Presigned URL support&lt;/li&gt;
&lt;li&gt;Resource Policies&lt;/li&gt;
&lt;li&gt;etc...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it doesn't have any of the public bucket features. If you want those, you would need to call &lt;code&gt;s3PrivateClient.promoteBucket()&lt;/code&gt;. The parameters for that should be something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;s3PrivateClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;promoteBucket&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-app-assets&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;publicBucketName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-app-assets-public&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Doing so at that moment would validate whether that public bucket name already exists. Everything continues to work the same from a public bucket standpoint, but we are also afforded all the benefits of the private bucket without any of the risks.&lt;/p&gt;

&lt;p&gt;This also prevents any backwards compatibility issues as far as infrastructure management and creation go, because the S3 Public API still exists. The only difference is that now there is also the S3 Private API, which can be used to create the local buckets, and which, when desired, promotes them to also be public buckets. Additionally, you'll see later that migration on the AWS side is necessary to support this.&lt;/p&gt;

&lt;p&gt;If I were an S3 Architect, I might ensure that all public bucket names start with &lt;code&gt;public-&lt;/code&gt; or exist in the namespace &lt;code&gt;public/&lt;/code&gt; or &lt;code&gt;public:&lt;/code&gt;, so that someone could not accidentally write &lt;code&gt;arn:aws:s3:::my-app-assets&lt;/code&gt; and get a malicious attacker's promoted S3 private bucket.&lt;/p&gt;

&lt;p&gt;That is, if an attacker created &lt;code&gt;arn:aws:s3:us-east-1:666666666666:my-app-assets&lt;/code&gt; and promoted it to be &lt;code&gt;arn:aws:s3:::my-app-assets&lt;/code&gt;, then you could create &lt;code&gt;arn:aws:s3:us-east-1:000000000000:my-app-assets&lt;/code&gt; and accidentally reference it as &lt;code&gt;arn:aws:s3:::my-app-assets&lt;/code&gt;. In doing so, you would again be using that attacker's bucket. Holistically, this is the same problem that has always existed up until this point, so this strategy isn't worse. It is just not perfect. But that's a mistake AWS might need to live with.&lt;/p&gt;

&lt;p&gt;It would be better if you had to explicitly add the &lt;code&gt;public&lt;/code&gt; prefix and write &lt;code&gt;arn:aws:s3:::aws-public-buckets/my-app-assets&lt;/code&gt; for all public buckets. But that's a breaking change, so it is likely off the table. However, as I mention below, there are great ways AWS can help protect against this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Public Buckets: How promotion works​
&lt;/h3&gt;

&lt;p&gt;A bucket, once created, is private. The bucket's access state never changes. What changes is what you attach to it.&lt;/p&gt;

&lt;p&gt;There are two core public scenarios, which I'll call promotion paths, that must still have solutions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Presigned URLs: Temporary Promotion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You issue a time-limited, credential-signed URL for a specific object. The URL encodes the object path, an expiration, and a signature derived from your IAM credentials. Anyone with that URL can read that object — for the duration you specified. When it expires, access ends. The bucket policy didn't change. The object's access model didn't change. The credential signed the request; the routing table resolved the bucket; S3 validated the signature and served the object.&lt;/p&gt;

&lt;p&gt;A presigned URL today looks like &lt;code&gt;https://mybucket.s3.amazonaws.com/file.png?X-Amz-Credential=AKID123%2F20240101%2Fus-east-1%2Fs3%2Faws4_request&amp;amp;X-Amz-Signature=...&lt;/code&gt;. The &lt;code&gt;X-Amz-Credential&lt;/code&gt; field already contains the account identifier — derived from the access key ID, which maps to an account. S3 extracts that account, consults the routing table for &lt;code&gt;mybucket&lt;/code&gt; in that account, and routes to the right physical bucket. The global uniqueness constraint was never doing the routing work here. The credential was.&lt;/p&gt;
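&lt;p&gt;You can check this yourself by pulling the credential scope out of any presigned URL. A small sketch, with illustrative values:&lt;/p&gt;

```javascript
// Sketch: the credential scope inside a presigned URL already identifies the
// signer (access key ID, date, region, service). All values are illustrative.
function parseCredentialScope(presignedUrl) {
  const credential = new URL(presignedUrl).searchParams.get('X-Amz-Credential');
  if (!credential) return null;
  const [accessKeyId, date, region, service, terminator] = credential.split('/');
  return { accessKeyId, date, region, service, terminator };
}

// Build an example presigned-style URL without hardcoding separators.
const params = new URLSearchParams({
  'X-Amz-Credential': 'AKID123/20240101/us-east-1/s3/aws4_request',
  'X-Amz-Signature': 'abc123',
});
const url = 'https://mybucket.s3.amazonaws.com/file.png?' + params.toString();
console.log(parseCredentialScope(url).region); // us-east-1
```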

&lt;p&gt;I want to say that again: presigned URLs will still absolutely work out of the box without any changes.&lt;/p&gt;

&lt;p&gt;This is because presigned URLs are not an S3 concept. They're an IAM concept that S3 validates. To explain, we need to dive into how AWS IAM actually works. AWS uses its custom SigV4 signature strategy for every request to AWS. And every request to AWS goes over the wire to an AWS-owned DNS URL for the service, with all the necessary parameters.&lt;/p&gt;

&lt;p&gt;For instance, your SDK computes a SigV4 signature using your IAM credentials — the access key ID and its corresponding secret. No AWS API call is made. The URL is computed entirely locally. This is how it works for &lt;strong&gt;every AWS service API&lt;/strong&gt;. When you call DynamoDB this happens, and the same thing happens when you call S3.&lt;/p&gt;

&lt;p&gt;Presigned S3 is a trick. After constructing the full HTTP payload to send to the service, instead of actually sending it, you give it to someone else, and that person executes the payload. Normally it wouldn't matter who executes it, but what if some part of the payload were allowed to change between the generation of the HTTP payload and its execution, say for instance: &lt;strong&gt;the binary body&lt;/strong&gt;? In this way, you can generate a request that encodes the bucket, the object path, the expiration, and the signature, and hand it to some other user. They present it to S3 with a custom binary body.&lt;/p&gt;

&lt;p&gt;When S3 receives the request, it extracts the access key ID from &lt;code&gt;X-Amz-Credential&lt;/code&gt;, looks up the corresponding IAM entity via STS, re-derives the expected signature, and checks that it matches. Then it checks the expiration. Then it checks that the IAM entity had &lt;code&gt;s3:GetObject&lt;/code&gt; permission at signing time. If all three pass, S3 serves the object (or persists it in the case of &lt;code&gt;s3:PutObject&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;That's all. S3 is just doing IAM validation, the same thing every other service does. It is not checking whether the bucket is public. It is not consulting the access model at all. A fully private bucket — no ACLs, no public access configuration, nothing — can serve objects via presigned URL, because the authorization is credential-based, IAM-based, and AWS-API based; it is not a unique access model built into public S3 buckets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Public Buckets: Permanent Promotion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since public access is not a PrivateBucket property, there has to be some way to still expose the PrivateBucket data publicly. The proposal would allow making a PrivateBucket public by requesting a bucket name from the global authoritative S3 bucket name list, the same process you already go through today when you create a new S3 bucket.&lt;/p&gt;

&lt;p&gt;In the new model, the public properties, the ACLs, and the website configuration aren't properties of the bucket. They're a separate resource: a public access configuration. That resource is, in effect, what is called "S3" today, which is why I'm suggesting a name change. When you create one and attach it to your private bucket, the public S3 URL comes into existence. You remove it, and the S3 URL stops existing. The bucket itself never changes state. The URL is a consequence of the configuration, not a property of the storage. And that URL is the thing that must be globally unique; most importantly, it doesn't even need to match the original bucket name, and it won't.&lt;/p&gt;
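&lt;p&gt;The separation can be sketched as a purely hypothetical resource model. None of these function names exist in any AWS API; they only illustrate that the private bucket and the attachable public configuration are two independent resources.&lt;/p&gt;

```javascript
// Hypothetical model only: no such S3 APIs exist. A private bucket is one
// resource; the public access configuration is a separate, attachable one.
function createPrivateBucket(state, name) {
  // The name only needs to be unique within this account and region.
  state.buckets[name] = { name, publicConfig: null };
}

function attachPublicConfig(state, bucketName, globalName) {
  // Only the public URL needs a globally unique name, claimed on attach.
  if (state.globalNames.has(globalName)) throw new Error('Global name taken');
  state.globalNames.add(globalName);
  state.buckets[bucketName].publicConfig = {
    globalName,
    url: 'https://' + globalName + '.s3.amazonaws.com',
  };
}

function detachPublicConfig(state, bucketName) {
  const cfg = state.buckets[bucketName].publicConfig;
  state.globalNames.delete(cfg.globalName);
  state.buckets[bucketName].publicConfig = null; // the bucket never changed state
}
```

&lt;p&gt;Attach a configuration and the URL exists; detach it and the URL is gone, while the bucket itself never changes.&lt;/p&gt;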

&lt;p&gt;Website configuration lives there too. Index documents, error documents, redirect rules — these move from bucket settings into the public access configuration. The &lt;code&gt;s3-website&lt;/code&gt; endpoint exists because the configuration says it should, not because the bucket was created with a flag set.&lt;/p&gt;

&lt;p&gt;The user-defined string, the bucket name, is preserved through the Public Bucket configuration. What is no longer true is that this string must be globally unique for the private bucket. That constraint was never load-bearing. It was just there because of the expectation of public usage.&lt;/p&gt;

&lt;p&gt;Custom HTTP domains using S3 website hosting, with or without CNAMEs pointing to &lt;code&gt;mybucket.s3-website-us-east-1.amazonaws.com&lt;/code&gt;, continue to work. The website configuration moves into the public access configuration resource; the &lt;code&gt;s3-website&lt;/code&gt; endpoint continues to exist as long as that configuration exists. No customer change is required.&lt;/p&gt;

&lt;p&gt;Because this functionality is separate, AWS can disable (and hopefully dismantle) in one huge swath all of the public features of S3 that are insecure by default, and lead new AWS accounts down the path of CloudFront for public access. If you need custom domains, TLS termination on your own domain, caching, WAF, HTTP/2, geographic restrictions, or edge functions — that's not an S3 question. That's a CDN question. And the answer is CloudFront as the reverse proxy with a private S3 bucket origin granted access via the Origin Access Control configuration.&lt;/p&gt;

&lt;p&gt;The bucket stays private. CloudFront has authorized access to it. Your users get a production-grade delivery layer with every security consideration you need. S3's job is to hold the bytes and serve them to one authenticated caller — the distribution. CloudFront's job is to serve those bytes to the world under your domain, your TLS certificate, your cache rules.&lt;/p&gt;

&lt;p&gt;This is already how every serious production setup works. The new model doesn't change that. It just makes it the only coherent option, instead of one option among several confusing ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  A New CloudFront Opportunity​
&lt;/h3&gt;

&lt;p&gt;Presigned URLs have a structural limitation today that nobody talks about: the SigV4 signature is computed over the canonical request, which includes the Host header. So the URL is signed against &lt;code&gt;mybucket.s3.amazonaws.com&lt;/code&gt;; change the hostname and the signature fails. This is actually a huge problem for CloudFront Functions that reroute requests to a different origin (it only sometimes works). It also means custom domains for presigned URLs are impossible today. Every download link, every document export, every profile photo URL your product generates contains &lt;code&gt;s3.amazonaws.com&lt;/code&gt;. Your customers see your infrastructure provider in every URL. There is no way around it with the current model.&lt;/p&gt;

&lt;p&gt;The right fix is for CloudFront to gain first-class presigned URL support: the ability to validate SigV4 signatures on behalf of S3. If CloudFront can validate the signature, the URL can be generated against your CloudFront custom domain — with your ACM certificate, on your domain — and CloudFront handles the validation and the downstream request to S3. The signing mechanism doesn't change. The client code doesn't change. The SDK &lt;code&gt;GeneratePresignedURL&lt;/code&gt; call works identically, just against a different hostname. Ironically, CloudFront offers some partial functionality for Signed Request URLs and Signed Cookies, but these actually have a security hole because they don't include the same level of control that IAM policies provide. CloudFront + IAM would be a real game changer for Presigned URLs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The S3 team's outstanding task​
&lt;/h2&gt;

&lt;p&gt;Now on to the easy but annoying part. AWS cannot simply remove the public bucket creation path. The issue isn't the millions of buckets in production, but rather all the code paths that create buckets and then make assumptions about them. Some of those code paths were written by teams that no longer exist.&lt;/p&gt;

&lt;p&gt;Any migration strategy that requires customers to take action will fail in the long run. The path forward has to be one where the default behavior improves without requiring every customer to update their infrastructure. That is something the current history of hacks hasn't gotten correct at all. (Although their folly resulted only in decreased security rather than broken configurations.)&lt;/p&gt;

&lt;p&gt;AWS can either trudge along with this currently broken S3 architecture riddled with pits of failures. Or they can admit they made a mistake and default all new accounts' buckets to not contain a public access strategy. This is actually the right thing to do, and they can do this safely as they have deprecated even whole AWS services before.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 — New accounts, new defaults​
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Public S3 Buckets completely disabled by default, no website hosting, no ACLs, no bucket policies. All of these are blocked from usage without a support ticket. We don't need the public configuration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This doesn't break existing buckets. And new infrastructure gets the right defaults. The blast radius is almost zero. There are some AWS organizations out there that are dynamically creating S3 buckets in automatically provisioned new AWS accounts with assumptions based on how buckets work. When creating a new account and then a bucket in that account, they will see a problem. This just needs to be communicated.&lt;/p&gt;

&lt;p&gt;You might be thinking, couldn't there just be a magic flag on bucket creation that specifies that the bucket is account/region bound; call that flag &lt;code&gt;private: true&lt;/code&gt;? The problem is that leaving the private default MUST BE an explicit opt-out. &lt;code&gt;private: true&lt;/code&gt; inverts that: it makes the legacy insecure state the default and keeps &lt;code&gt;public access&lt;/code&gt; the behavior you get unless you remember the flag. And therefore it still allows all the &lt;a href="https://www.lastweekinaws.com/podcast/aws-morning-brief/a-hole-in-the-s3-buckets/" rel="noopener noreferrer"&gt;bucket negligence awards&lt;/a&gt; that &lt;a href="https://www.linkedin.com/in/coquinn/" rel="noopener noreferrer"&gt;Corey&lt;/a&gt; is so keen on giving out. A flag is not sufficient; instead there needs to be a mature approach to the migration. Which is why the recommendation here is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rename S3 everywhere to "S3 Public Bucket Configuration"&lt;/li&gt;
&lt;li&gt;Reintroduce S3 as a Private Bucket concept&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 2 — AWS internal service updates​
&lt;/h3&gt;

&lt;p&gt;AWS has some internal work to do. Luckily most of the mess that was caused is squarely cornered into the S3 Public Bucket Configuration and none of it actually affects our new private bucket creation or usage. That means, after the rename, AWS can go back through all of their services and retarget all interactions with S3 to use the new Private S3 SDKs/API. This is squarely in their control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 Bucket Events + Lambda Event Source Mapping&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One area where there is a bit of a crossover is events, like S3 Events over SQS =&amp;gt; Lambda. But as discussed earlier, that's actually a no-op. Similarly, Lambda Event Source Mapping (Lambda ESM), used for automatically polling SQS, is a non-issue. But the reason why is worth understanding. An ESM configuration is account-scoped in the first place. When you set up a Lambda trigger, you're making an authenticated API call inside your account: "Lambda function X should fire on events from bucket Y." The ESM record lives in your account. The bucket lives in your account. AWS resolves the bucket reference using the account context of that API call — not the public namespace.&lt;/p&gt;

&lt;p&gt;The current ESM ARN looks like &lt;code&gt;arn:aws:s3:::mybucket&lt;/code&gt; — no account ID, no region, because those were implicit in the global uniqueness guarantee. In the new model, &lt;code&gt;mybucket&lt;/code&gt; is a private identifier scoped to your account. The ARN format doesn't change. The resolution just shifts from "global name lookup" to "private identifier lookup within account context" — which AWS handles internally. No customer touches their ESM configuration. No ARN format changes. No trigger reconfiguration. Future ARN formats for the ESM should take the account ID and the bucket region, but AWS needs to maintain the global mapping table they already have, so that the account-less, region-less ESM bucket ARN still resolves to the bucket in the specific region, in the correct specific account. In other words, the ESM resource should accept either the global bucket naming strategy or the region-account local one.&lt;/p&gt;
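&lt;p&gt;A sketch of that dual resolution follows. The legacy global ARN simply has empty region and account segments; the localized format shown here is my assumption, since AWS has not published one for S3 buckets.&lt;/p&gt;

```javascript
// Sketch of dual ARN resolution. The legacy global form omits the region and
// account segments; a localized form (the exact format here is an assumption,
// AWS has not published one) carries them explicitly. Missing fields fall
// back to the caller's account context, which is how ESM resolves today.
function resolveBucketArn(arn, callerAccount, callerRegion) {
  // ARN shape: arn:partition:service:region:account:resource
  const [prefix, partition, service, region, account, bucket] = arn.split(':');
  if (prefix !== 'arn' || service !== 's3') throw new Error('Not an S3 ARN');
  return {
    bucket,
    account: account || callerAccount, // '' in the legacy global form
    region: region || callerRegion,
  };
}
```

&lt;p&gt;Both &lt;code&gt;arn:aws:s3:::mybucket&lt;/code&gt; and an account-region localized ARN resolve to the same answer inside the owning account, which is exactly why no customer-facing change is needed.&lt;/p&gt;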

&lt;p&gt;The message here "Update your Event Source Mappings for Buckets so that you have the account ID or region specificed". This might be the first ever &lt;code&gt;[Action Required]&lt;/code&gt; email, that actually has a required action. Or maybe they'll just update Security Hub to include a finding to fix this, and an AWS Config rule that validates it with an automatic remediation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudFront S3 origin compatibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CloudFront is not part of S3's access model — it's a CDN that sits in front of a private S3 bucket, authorized via OAC. That already works today and obviously must continue to work in the new model. The only S3-specific change AWS needs to make is ensuring that CloudFront's S3 origin configuration resolves bucket references using the private identifier rather than the global name. Again, that is an internal AWS concern. No customer CloudFront configuration changes. I'm sure there is someone out there who is going to request that CloudFront have access to S3 buckets in another account. AWS can easily support a solution similar to the ESM one above: CloudFront accepts either the global S3 ARN or the account-region localized one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3 — Configuration Split​
&lt;/h3&gt;

&lt;p&gt;Every existing S3 bucket is already the private half of the new model. Customers haven't been creating "public buckets" — they've been creating private buckets and then attaching public configuration to them in the form of ACLs, Block Public Access exemptions, Bucket Policies, and website hosting settings. The private bucket has always existed. What hasn't existed is the explicit separation exposed to AWS Account users. That starts now.&lt;/p&gt;

&lt;p&gt;Since the buckets themselves and the public access configuration don't actually change here, the only thing AWS has to do is backpopulate a list of S3 private buckets whose names will be the exact same as the current PublicBucket names. The goal is that all AWS S3 buckets should be referenceable by their account-region localized ARN, and that the relevant console UI exists to display it. That's a script even Kiro could write in an afternoon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Presigned URL configuration handling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As argued above, presigned URLs will already work out of the box, since the exact same problem has already been solved for literally every other resource in AWS. The one caveat is that there will likely need to be a new method, &lt;code&gt;GeneratePresignedBucketUrlForPrivateBucket&lt;/code&gt;, to make sure the account ID and the region are included explicitly, so that the public bucket configuration isn't necessary to continue using this option. That's because the current method doesn't take in the account ID or the region, just the bucket name.&lt;/p&gt;
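&lt;p&gt;To illustrate the shape such a method might take: the following is entirely hypothetical, both the method name and the host format are my invention, not anything any AWS SDK offers.&lt;/p&gt;

```javascript
// Entirely hypothetical: neither this method nor this host format exists in
// any AWS SDK. It only shows the account ID and region becoming explicit
// inputs instead of being inferred from the credential or a global name.
function generatePresignedBucketUrlForPrivateBucket(request) {
  const { bucket, accountId, region, objectKey, expiresInSeconds, signature } = request;
  // Assumed host shape embedding the account and region alongside the bucket name
  const host = bucket + '.' + accountId + '.s3.' + region + '.amazonaws.com';
  const params = new URLSearchParams({
    'X-Amz-Expires': String(expiresInSeconds),
    'X-Amz-Signature': signature, // produced by the normal SigV4 flow
  });
  return 'https://' + host + '/' + objectKey + '?' + params.toString();
}
```

&lt;p&gt;With the account and region baked into the URL, resolution never needs the global namespace at all.&lt;/p&gt;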

&lt;p&gt;The one exception is cross-account presigned URLs — an IAM identity in Account B generating URLs for a bucket that lives in Account A. I personally don't even know if this is possible, but technically I don't see why not. In this case, if we use the &lt;code&gt;X-Amz-Credential&lt;/code&gt; to determine the account, AWS would incorrectly assume the account is B (where the identity is) and not Account A (where the bucket actually lives). But AWS S3 has very competent architects, so I'll leave that challenge for them to solve (I can imagine using the same new GeneratePresigned method I just suggested above).&lt;/p&gt;

&lt;p&gt;It's also worth noting that potentially the presigned URL configuration could be an explicit resource you create when you need it similar to the public access. And by default just create it for all existing buckets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 4 — Deprecation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best part of this design is that regarding deprecations there &lt;strong&gt;are none&lt;/strong&gt;! All we are actually doing is changing the names of some SDKs to improve readability, and really just the text in the UI. The only real change necessary is going through all the docs and updating the content with more appropriate and clear naming.&lt;/p&gt;

&lt;p&gt;Most importantly, over time, the "public bucket" moniker will disappear entirely from the documentation as a concept, from customer usages, and most importantly from the news. And what replaces it? A private bucket with an explicit access configuration attached when needed. Two resources, two concerns, neither coupled to the other by default. The access model that caused two decades of breaches stops being something new engineers get to learn about.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Objections​
&lt;/h2&gt;

&lt;p&gt;Proposing a fundamental redesign of S3's control plane will attract objections. Here are the ones I felt like addressing:&lt;/p&gt;

&lt;h3&gt;
  
  
  What About SPA Websites?​
&lt;/h3&gt;

&lt;p&gt;The most common objection: "But I host my react/vue/solidjs app on S3 with website hosting enabled, and it works fine."&lt;/p&gt;

&lt;p&gt;It works, but it isn't correct architecture. Let's be precise about what is actually happening.&lt;/p&gt;

&lt;p&gt;Your S3 bucket is serving HTTP at &lt;code&gt;http://my-app.s3-website-us-east-1.amazonaws.com&lt;/code&gt;. Your domain is resolved in one of three ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A — CNAME directly to the S3 website endpoint.&lt;/strong&gt; — You have no TLS. S3 website hosting is HTTP only — it has no mechanism to serve HTTPS for a custom domain. Your users therefore must be on HTTP, so this is not a viable production setup. It actually doesn't work at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B — CloudFront in front.&lt;/strong&gt; — CloudFront handles TLS (via ACM), your custom domain, HTTP→HTTPS redirects, the &lt;code&gt;404 → /index.html&lt;/code&gt; behavior for client-side routing, cache headers, compression, and geographic distribution. S3 is behind CloudFront, serving bytes when requested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option C — The website domain is the S3 URL.&lt;/strong&gt; — You are freely passing out your S3 bucket URL to clients and asking them to remember that custom URL. Something is surely going to break some day, but nothing stopped you from doing it.&lt;/p&gt;

&lt;p&gt;A site that only uses S3 website hosting without CloudFront, serving plain HTTP, is not a counterexample. It is a site that is broken and getting more broken by the day. &lt;a href="https://blog.chromium.org/2023/08/towards-https-by-default.html" rel="noopener noreferrer"&gt;Chrome announced in 2023 that it is moving towards HTTPS by default&lt;/a&gt;, automatically upgrading HTTP navigations to HTTPS. An S3 website serving HTTP gets upgraded to HTTPS by the browser, and since S3 cannot serve HTTPS on a custom domain, the request fails. Firefox has had an &lt;a href="https://support.mozilla.org/en-US/kb/https-only-prefs" rel="noopener noreferrer"&gt;HTTPS-Only Mode&lt;/a&gt; available since 2020 that blocks HTTP sites entirely. These are not future concerns. They are not esoteric. They are not nuanced. They are the current state of the web. A site that only works over HTTP is not a production website in 2026. It is a broken website that has not been maintained.&lt;/p&gt;

&lt;p&gt;Which means in every functional production scenario, S3 website hosting is doing nothing useful. CloudFront is handling everything. S3 is holding bytes.&lt;/p&gt;

&lt;p&gt;Therefore, in Option B (every production SPA), S3 website hosting contributes nothing. CloudFront is doing all the work that makes the setup viable. The bucket does not need to be public. Website hosting does not need to be enabled. The only reason engineers enable website hosting is that they are following a tutorial that predates CloudFront's ability to serve private S3 buckets, and nobody told them the tutorial was outdated. Or more likely, someone did, but they didn't listen.&lt;/p&gt;

&lt;p&gt;CloudFront likely has been able to serve our new Private S3 bucket concept since &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-restricting-access-to-s3.html" rel="noopener noreferrer"&gt;Origin Access Control (OAC)&lt;/a&gt; replaced the older Origin Access Identity (OAI) approach. OAC supports server-side encrypted buckets, covers all S3 regions, and signs requests to private S3 using SigV4. Even before OAC, your bucket never needed to be public.&lt;/p&gt;
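&lt;p&gt;For reference, the OAC trust relationship is just a bucket policy. This is the shape CloudFront's console generates today; the account ID, bucket name, and distribution ID below are placeholders.&lt;/p&gt;

```javascript
// The OAC bucket policy shape: the bucket stays private, and only requests
// signed by the CloudFront service principal on behalf of one specific
// distribution are allowed. Account, bucket, and distribution IDs are placeholders.
const oacBucketPolicy = {
  Version: '2012-10-17',
  Statement: [
    {
      Sid: 'AllowCloudFrontServicePrincipal',
      Effect: 'Allow',
      Principal: { Service: 'cloudfront.amazonaws.com' },
      Action: 's3:GetObject',
      Resource: 'arn:aws:s3:::my-app-bucket/*',
      Condition: {
        StringEquals: {
          'AWS:SourceArn': 'arn:aws:cloudfront::111122223333:distribution/EDFDVBD6EXAMPLE',
        },
      },
    },
  ],
};
```

&lt;p&gt;Note that the condition scopes access to a single distribution ARN, so even another CloudFront distribution in another account gets nothing.&lt;/p&gt;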

&lt;p&gt;There could be a concern that CloudFront doesn't know how to talk to anything other than a public S3 bucket or a public URL. But interestingly enough, &lt;a href="https://aws.amazon.com/blogs/aws/introducing-amazon-cloudfront-vpc-origins-enhanced-security-and-streamlined-operations-for-your-applications/" rel="noopener noreferrer"&gt;CloudFront now also supports private origins via ALB with VPC origins&lt;/a&gt;, which closes the last remaining scenario where direct public exposure might have been argued as necessary. You can run your origin entirely inside a VPC, with no public exposure, and serve it through CloudFront. The gap is gone.&lt;/p&gt;

&lt;p&gt;And the "CloudFront costs more" objection doesn't land either. CloudFront has a free tier: 1 TB of data transfer per month, 10 million HTTP requests, and 2 million CloudFront function invocations. A landing page or documentation site that fits in an S3 bucket almost certainly fits within that free tier, and even if it doesn't, at scale you are still getting the benefit of the cost reduction.&lt;/p&gt;

&lt;p&gt;A complexity argument would be more interesting. Setting up a CloudFront distribution requires more steps than enabling S3 website hosting. That is true. But the complexity exists either way; it is just hidden. You still need TLS. You still need the &lt;code&gt;index.html&lt;/code&gt; routing behavior for client-side routing (or a more expensive CloudFront function). You still end up at CloudFront. The engineers who skip it are the ones serving HTTP from a subdomain with no TLS, which screams for a denial-of-wallet attack.&lt;/p&gt;

&lt;p&gt;And for users who genuinely have not set up CloudFront, a la &lt;strong&gt;Option C&lt;/strong&gt; : the AWS S3 migration plan already answers this. The configuration split means existing public buckets keep their public access configuration intact, those sites keep working. The owner does nothing. When they are ready to do it correctly, the options are available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bucket Origin Responses​
&lt;/h3&gt;

&lt;p&gt;There is one thing I left out, and I didn't want to bring this up because it's annoying, but I'm sure someone will call me out on it.&lt;/p&gt;

&lt;p&gt;When you set up S3 as an origin for CloudFront, you might need to control the response headers. Historically, you were not able to configure anything in CloudFront, let alone do it dynamically, and so using S3 to set the CORS policies or other security policies was required. Now, however, CloudFront offers response headers policies, and while that isn't everything, even S3 isn't sufficient for specifying all the relevant headers. While I don't love it, for Authress we have a CloudFront Function attached to every response. There is a performance hit and a cost hit to doing this on literally every S3-related request. But arguably it is a small price to pay to have CloudFront do the thing it should have been doing all along, and not to store this configuration in S3, where it doesn't belong. Maybe AWS could be nice and still offer this configuration in S3, or be nice and add it as an option to CloudFront, or be nice and make CloudFront Functions even cheaper, because why not, API Gateway Velocity templates are free after all!&lt;/p&gt;
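&lt;p&gt;A minimal sketch of such a viewer-response function follows. The handler and event shape follow the documented cloudfront-js runtime; the specific headers are illustrative, not what Authress actually ships.&lt;/p&gt;

```javascript
// A viewer-response CloudFront Function that sets security headers at the
// edge instead of storing them in S3. The handler/event shape follows the
// cloudfront-js runtime; the specific header values are illustrative.
function handler(event) {
  var response = event.response;
  var headers = response.headers;
  headers['strict-transport-security'] = { value: 'max-age=63072000; includeSubDomains; preload' };
  headers['x-content-type-options'] = { value: 'nosniff' };
  headers['cross-origin-resource-policy'] = { value: 'same-site' };
  return response;
}
```

&lt;p&gt;Attach it to the distribution's viewer-response event and every object served from the private bucket picks up the headers, with no configuration stored in S3 at all.&lt;/p&gt;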

&lt;h3&gt;
  
  
  You're asking AWS to blow up a working control plane​
&lt;/h3&gt;

&lt;p&gt;Yes. That is what a migration looks like. The alternative is two more decades of incremental patches, each one adding more surface area and more documentation burden without touching the underlying design, and worst of all, still enables a massive pit of failure.&lt;/p&gt;

&lt;p&gt;The control plane does not need to be blown up for customers. The translation layer proposal in the previous section means existing workloads continue working. What needs to change is the model exposed to new infrastructure — the primitives developers learn, the defaults they encounter, and the architecture that tutorials recommend.&lt;/p&gt;

&lt;p&gt;AWS has done this before. The IAM role model replaced key-based authentication for most AWS-to-AWS access patterns. AWS IAM Identity Center replaces IAM roles for organizations and SSO. CloudFront Origin Access Control replaced Origin Access Identity. None of these replacements were instantaneous, and none broke existing workloads. The old model continued working through a maintained compatibility layer while the new model became the default for anything new.&lt;/p&gt;

&lt;p&gt;The objection treats "existing behavior must never change" and "defaults must never improve" as the same thing. They are not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Better Announcement​
&lt;/h2&gt;

&lt;p&gt;The Account Regional Namespaces announcement solves one real problem, the name collisions, using an opt-in mechanism with a 26-character tax on your bucket names and tooling that requires SDK and CloudFormation updates to remain portable. &lt;strong&gt;But it has zero impact on the access model that causes actual harm.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The right announcement would have looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The best feature ever, Private Buckets: account-regional namespaces are the default&lt;/strong&gt; — for all new bucket creation, no suffix, no opt-in, just the natural behavior that every engineer already wanted is now expected. Change nothing, get all the value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The recommendation for public content: a managed CloudFront promotion layer&lt;/strong&gt; — as the only path to public content, surfaced as a first-class feature with its own console workflow, not a best practice buried in the CloudFront documentation. For some reason AWS likes to improve their console, and it still surprises me how often ClickOps isn't just a migration strategy but a business-critical one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backwards compatibility still works, and always will&lt;/strong&gt; — legacy ACLs and direct public bucket access still exist, but as of today they are deprecated and require a support ticket to activate. The on-ramp is gone. The escape hatch remains, for now.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, we got a feature that requires you to append &lt;code&gt;-123456789012-us-east-1-an&lt;/code&gt; to your bucket names, a second feature that lets your SDK dynamically resolve that suffix from the execution environment, and a wave of blog posts explaining how to wire these two features together. And of course we still have to wait for &lt;strong&gt;your-favorite-tool™&lt;/strong&gt; to implement this functionality.&lt;/p&gt;

&lt;p&gt;This is not a fix. It is a patch on top of a patch, with new documentation for how to apply both patches correctly. AWS has a long history of excellent engineering, but I don't consider this new functionality to be part of it.&lt;/p&gt;

&lt;p&gt;The gap between "what was shipped" and "what would fix the problem" is not subtle. It is not a matter of resources or engineering difficulty. Name collisions, the problem I can only imagine customers have been filing tickets about for years, was partially addressed. But the access model that still will cause actual harm was not.&lt;/p&gt;

&lt;p&gt;Until the access model changes, the endless stream of conflicting advice will remain out there on the internet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help understanding this article or how you can implement auth
and similar security architectures in your services, feel free to 
reach out to me via the community server.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://authress.io/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>cloud</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Securing CI/CD Access to AWS</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/securing-cicd-access-to-aws-1ib7</link>
      <guid>https://dev.to/aws-builders/securing-cicd-access-to-aws-1ib7</guid>
      <description>&lt;p&gt;I've seen a lot of complex tooling in my experience, but by far the worst case is designing just one more tool to do something. Especially in the age where software is free, we become burdened by &lt;em&gt;just one more tool&lt;/em&gt;. We know at Authress that &lt;a href="https://authress.io/knowledge-base/articles/2025/11/01/how-we-prevent-aws-downtime-impacts" rel="noopener noreferrer"&gt;increased complexity =&amp;gt; increased failure rate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The solution is to utilize the tools we already have, just a little bit better. In this case — &lt;em&gt;"just a little bit better"&lt;/em&gt; — is adding a trivial amount to your existing AWS built-in technologies, and doing it in a way that you won't even need to add extra management overhead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help understanding this article or how you can implement auth
 and similar security architectures in your services, feel free to 
reach out to us via the community server.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://authress.io/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  ❌ The Wrong Way​
&lt;/h2&gt;

&lt;p&gt;There are lots of ways this could have gone wrong. In fact, if you ask any of the &lt;em&gt;"Reasoning LLMs"&lt;/em&gt;, and are unlucky enough not to be told &lt;strong&gt;IDK&lt;/strong&gt;, you will find out things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy a Lambda Function to every account is the right option - Don't do that.&lt;/li&gt;
&lt;li&gt;List all the accounts in a CFN template mapping - You will run out of template space, especially if you have more than a couple of AWS accounts or GitHub/GitLab accounts. It often requires a complex &lt;code&gt;Fn::Or&lt;/code&gt; chunked chain to fit into the template in the first place, assuming you don't hit the 200-key mapping limit.&lt;/li&gt;
&lt;li&gt;Using a CloudFormation Parameter - You aren't going to know the AWS account up front anyway; I don't even know how this was supposed to work, assuming you don't hit the 4096-character limit for parameter values.&lt;/li&gt;
&lt;li&gt;Creating a CloudFormation Macro - And for a moment a Macro sounds like a good answer, until you realize that OU Stack Sets aren't allowed to use Transforms which are required.&lt;/li&gt;
&lt;li&gt;Using a CFN Module - I'm actually surprised none of the LLMs came up with this solution, but the problem is that it will still deploy a lambda function into every account.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At least the lambda function in every account would work, but it isn't clean: you'll get a lambda in every account, and potentially every region, which comes with at least one IAM role, a CloudWatch Logs group, and who knows what else.&lt;/p&gt;

&lt;p&gt;Someone out there is probably saying &lt;em&gt;"Why aren't you using OpenTofu for that"&lt;/em&gt;, I'll leave that as a challenge for the reader to answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Complete Design​
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm596o93iymrwyprzf3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxm596o93iymrwyprzf3b.png" alt="Securing Access to AWS via GitLab OU StackSet Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The design is quite straightforward.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy a Lambda Function to the AWS Management Account which contains the list of permissions for each account.&lt;/li&gt;
&lt;li&gt;Deploy an OU StackSet which uses a Custom Resource to call the lambda function in the management account, to fetch the list.&lt;/li&gt;
&lt;li&gt;The list is persisted in a GitLab-assumable IAM Role.&lt;/li&gt;
&lt;li&gt;GitLab assumes the role at deployment.&lt;/li&gt;
&lt;/ol&gt;
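&lt;p&gt;Step 4 never shows up in the templates below, so here is a minimal sketch of what the GitLab side looks like. The &lt;code&gt;id_tokens&lt;/code&gt; keyword and the role name &lt;code&gt;GitLabRunnerRole&lt;/code&gt; match what we set up later in this article; the &lt;code&gt;AWS_ACCOUNT_ID&lt;/code&gt; variable is a placeholder you would supply yourself.&lt;/p&gt;

```yaml
# .gitlab-ci.yml (sketch): exchange the job's OIDC token for the role.
deploy:
  id_tokens:
    AWS_ID_TOKEN:
      aud: https://gitlab.com
  script:
    - >
      aws sts assume-role-with-web-identity
      --role-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:role/GitLabRunnerRole"
      --role-session-name "gitlab-${CI_PIPELINE_ID}"
      --web-identity-token "${AWS_ID_TOKEN}"
```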

&lt;h2&gt;
  
  
  🔒 AWS Account Permissions Lambda Function​
&lt;/h2&gt;

&lt;p&gt;Let's do the easy part first. Of course we want to define the permissions somewhere. Since we are using GitLab, what we actually want to do is define, for each AWS account, which GitLab projects (and their branches) can be used to access that account. At the top, we define the permissions map. At the bottom, we receive the account ID from the caller and use it to pull the correct permissions out of the map.&lt;/p&gt;

&lt;p&gt;Permissioning Lambda Function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;accountPermissionsMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="mi"&gt;000000000000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;project_path:authress/automation/*:ref_type:*:ref:*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="mi"&gt;111111111111&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;project_path:side-projects/*:ref_type:*:ref:*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sendResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;PhysicalResourceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logStreamName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;StackId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;StackId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;RequestId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RequestId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;LogicalResourceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LogicalResourceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ResponseURL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PUT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Length&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nx"&gt;exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RequestType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Delete&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sendResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SUCCESS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;accountId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ResourceProperties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AccountId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;permissions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;accountPermissionsMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;accountId&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sendResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SUCCESS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;GitLabProjects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;permissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Event:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Error:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sendResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;FAILED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
      &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
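&lt;p&gt;One detail worth calling out in the map above: the account IDs must be string keys. AWS account IDs are 12-digit identifiers that can start with a zero, and an unquoted numeric key either throws (a leading-zero literal is a syntax error in strict mode) or silently loses the leading zeros. A quick sketch with a made-up account ID:&lt;/p&gt;

```javascript
// Account IDs as object keys need to be quoted strings.
const accountPermissionsMap = {
  '012345678901': ['project_path:example-group/*:ref_type:*:ref:*']
};

// The string key survives intact:
console.log(Object.keys(accountPermissionsMap)[0]); // prints 012345678901

// A numeric interpretation would lose the leading zero:
console.log(String(Number('012345678901'))); // prints 12345678901
```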



&lt;h2&gt;
  
  
  🟢 Deploying the Lambda Function​
&lt;/h2&gt;

&lt;p&gt;Management Account: CloudFormation Template&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// First load the lambda function from the lambda function const handlerCode = await fs.readFile(path.join(__dirname, './fetchPermissionsLambdaFunction.js'), 'utf8');&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;AWSTemplateFormatVersion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2010-09-09&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;Parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;OrganizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;String&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The organization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;GlobalConfigLookupRole&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::IAM::Role&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;RoleName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OU-StackSet-GlobalConfigLookup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2012-10-17&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lambda.amazonaws.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sts:AssumeRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;ManagedPolicyArns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
          &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:iam::aws:policy/service-role/
           AWSLambdaBasicExecutionRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;GlobalConfigLookupLogGroup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::Logs::LogGroup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;LogGroupName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/aws/lambda/OU-StackSet-GlobalConfigLookup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;RetentionInDays&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;

    &lt;span class="na"&gt;GlobalConfigLookupFunction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::Lambda::Function&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OU-StackSet-GlobalConfigLookup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;nodejs24.x&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;index.handler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Arn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;MemorySize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1769&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;ZipFile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;handlerCode&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;LoggingConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;LogFormat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;LogGroup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupLogGroup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;GlobalConfigLambdaPermission&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::Lambda::Permission&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupFunction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lambda:InvokeFunction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;PrincipalOrgID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OrganizationId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;Outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;GlobalConfigLookupFunction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupFunction&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Arn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;Export&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupLambdaArn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ▶️ Utilize the Lambda Function​
&lt;/h2&gt;

&lt;p&gt;Then we update the member stack to utilize this lambda function, and create the correct IAM Role.&lt;/p&gt;

&lt;p&gt;OU StackSet Member Account: CloudFormation Template&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Pull the values in the Lambda Function&lt;/span&gt;
  &lt;span class="nl"&gt;GlobalConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Custom::GlobalConfiguration&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;ServiceToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;globalConfigurationLambdaArn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;AccountId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::AccountId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="c1"&gt;// The IAM Role for GitHub to utilize&lt;/span&gt;
  &lt;span class="nx"&gt;GitLabRunnerRole&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::IAM::Role&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;RoleName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GitLabRunnerRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;MaxSessionDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2012-10-17&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
          &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;Federated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:iam::${AWS::AccountId}:oidc-provider/gitlab.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sts:AssumeRoleWithWebIdentity&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;StringEquals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gitlab.com:aud&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://gitlab.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;StringLike&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gitlab.com:sub&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::Split&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fn::GetAtt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfiguration&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GitLabProjects&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
              &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="c1"&gt;// Then register the GitLab OIDC Provider to&lt;/span&gt;
  &lt;span class="c1"&gt;//   allow GitLab to actually assume the role&lt;/span&gt;
  &lt;span class="nx"&gt;GitLabOIDCProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::IAM::OIDCProvider&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;ClientIdList&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://gitlab.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="nx"&gt;Url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://gitlab.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🏁 Run the Deployment
&lt;/h2&gt;

&lt;p&gt;One piece of information that might not be so obvious is how we actually deploy the Member Account CloudFormation Template to all the AWS accounts in our AWS Organization. For that, we use an AWS Organization OU StackSet, which automatically deploys the template to every AWS account in the OU, in every region.&lt;/p&gt;

&lt;p&gt;Deploy OU StackSet&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OrganizationsClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DescribeOrganizationCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/client-organizations&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;AwsArchitect&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-architect&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OrganizationsClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;us-east-1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Organization&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DescribeOrganizationCommand&lt;/span&gt;&lt;span class="p"&gt;({}));&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;organizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Organization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Id&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;awsArchitect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AwsArchitect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;packageMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deploymentResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt;
  &lt;span class="nx"&gt;awsArchitect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deployTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;globalConfigurationTemplate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;stackConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;GlobalConfigurationLambdaArn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="nx"&gt;deploymentResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ExportName&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GlobalConfigLookupLambdaArn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OutputValue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;memberParameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  &lt;span class="nx"&gt;GlobalConfigurationLambdaArn&lt;/span&gt;  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;awsArchitect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configureStackSetForAwsOrganization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;memberAccountTemplate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;orgStackConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;memberParameters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
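&lt;p&gt;For reference, here is a rough, hypothetical sketch of what a helper like &lt;code&gt;configureStackSetForAwsOrganization&lt;/code&gt; has to assemble under the hood: the inputs for CloudFormation's &lt;code&gt;CreateStackSetCommand&lt;/code&gt; and &lt;code&gt;CreateStackInstancesCommand&lt;/code&gt; from &lt;code&gt;@aws-sdk/client-cloudformation&lt;/code&gt;. The stack set name and example values are mine, not from aws-architect:&lt;/p&gt;

```javascript
// Hypothetical sketch: build the inputs that a service-managed StackSet
// deployment needs. aws-architect wraps these calls; names here are illustrative.
function buildStackSetInputs(templateBody, parameters, organizationalUnitIds, regions) {
  const createStackSetInput = {
    StackSetName: 'member-account-configuration',
    TemplateBody: templateBody,
    Parameters: Object.entries(parameters)
      .map(([ParameterKey, value]) => ({ ParameterKey, ParameterValue: String(value) })),
    Capabilities: ['CAPABILITY_NAMED_IAM'],
    // SERVICE_MANAGED plus AutoDeployment is what makes new accounts joining
    // the OU receive the template automatically, in every targeted region.
    PermissionModel: 'SERVICE_MANAGED',
    AutoDeployment: { Enabled: true, RetainStacksOnAccountRemoval: false }
  };
  const createStackInstancesInput = {
    StackSetName: createStackSetInput.StackSetName,
    DeploymentTargets: { OrganizationalUnitIds: organizationalUnitIds },
    Regions: regions
  };
  return { createStackSetInput, createStackInstancesInput };
}

// With a CloudFormationClient you would then run:
//   await cfn.send(new CreateStackSetCommand(createStackSetInput));
//   await cfn.send(new CreateStackInstancesCommand(createStackInstancesInput));
```

&lt;p&gt;The important design choice is the &lt;code&gt;SERVICE_MANAGED&lt;/code&gt; permission model: CloudFormation provisions the cross-account execution roles itself, so member accounts need no manual role setup.&lt;/p&gt;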



&lt;p&gt;And the best part is that the Lambda function is extensible, so you can include a full configuration in S3, or anything else that you might want to persist in the management account's git repository.&lt;br&gt;
&lt;/p&gt;


</description>
      <category>aws</category>
      <category>gitlab</category>
      <category>github</category>
      <category>cicd</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 07 Nov 2025 20:41:43 +0000</pubDate>
      <link>https://dev.to/wparad/-5ach</link>
      <guid>https://dev.to/wparad/-5ach</guid>
      <description>&lt;p&gt;Boost of &lt;a href="https://dev.to/aws-builders/how-when-aws-was-down-we-were-not-4nel"&gt;How when AWS was down, we were not&lt;/a&gt;, published with AWS Community Builders.&lt;/p&gt;
</description>
      <category>aws</category>
      <category>reliability</category>
      <category>architecture</category>
      <category>serverless</category>
    </item>
    <item>
      <title>How when AWS was down, we were not</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 07 Nov 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-when-aws-was-down-we-were-not-4nel</link>
      <guid>https://dev.to/aws-builders/how-when-aws-was-down-we-were-not-4nel</guid>
      <description>&lt;h2&gt;
  
  
  🚨 AWS us-east-1 is down!
&lt;/h2&gt;

&lt;p&gt;One of the most massive AWS incidents transpired on &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;October 20th&lt;/a&gt;. The long story short is that the DNS for DynamoDB was impacted in &lt;code&gt;us-east-1&lt;/code&gt;, which created a health event for the entire region. It's the worst incident we've seen in a decade. &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;Disney+&lt;/a&gt;, &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;Lyft&lt;/a&gt;, &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;McDonald's&lt;/a&gt;, &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;New York Times&lt;/a&gt;, &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;, and the &lt;a href="https://www.cnbc.com/2025/10/20/amazon-web-services-outage-takes-down-major-websites.html" rel="noopener noreferrer"&gt;list goes on&lt;/a&gt; were lining up to claim their share of the spotlight too. And we've been watching, because our product is part of our customers' critical infrastructure. This one graph of the event says it all:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvggvlkoss7qldqlcj5is.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvggvlkoss7qldqlcj5is.png" alt="Route 53 Health Check result where us-east-1 is down"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AWS &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;post-incident report&lt;/a&gt; indicates that at 7:48 PM UTC DynamoDB had &lt;em&gt;"increased error rates"&lt;/em&gt;. But this article isn't about AWS, and instead I want to share &lt;strong&gt;how exactly we were still up when AWS was down.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now you might be thinking: &lt;strong&gt;&lt;em&gt;why are you running infra in us-east-1?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And it's true, almost no one should be using us-east-1, unless, well, of course, you are us. And that's because we end up running our infrastructure where our customers are. In theory, practice and theory are the same, but in practice they differ. And if our (or your) customers chose &lt;code&gt;us-east-1&lt;/code&gt; in AWS, then realistically, we (or you) are also choosing us-east-1 😅.&lt;/p&gt;

&lt;p&gt;During this time, us-east-1 was offline, and while we only run a limited amount of infrastructure in the region, we have to run it there because we have customers who want it there. And even without a direct dependency on &lt;code&gt;us-east-1&lt;/code&gt;, there are critical services in AWS — CloudFront, Certificate Manager, Lambda@Edge, and IAM — that all have their control planes in that region. Attempting to create distributions or roles at that time was also met with significant issues.&lt;/p&gt;

&lt;p&gt;Since there are plenty of articles in the wild talking about &lt;a href="https://newsletter.pragmaticengineer.com/p/what-caused-the-large-aws-outage" rel="noopener noreferrer"&gt;what actually happened&lt;/a&gt;, &lt;a href="https://www.crn.com/news/cloud/2025/aws-15-hour-outage-5-big-ai-dns-ec2-and-data-center-keys-to-know" rel="noopener noreferrer"&gt;why it happened&lt;/a&gt;, and &lt;a href="https://www.theregister.com/2025/10/20/aws_outage_amazon_brain_drain_corey_quinn/" rel="noopener noreferrer"&gt;why it will continue to happen&lt;/a&gt;, I don't need to go into it here. Instead, I'm going to share a deep dive into exactly what we've built to avoid these issues, and what you can do for your applications and platforms as well. In this article, I'll review how we maintain a high SLI to match our SLA &lt;strong&gt;reliability&lt;/strong&gt; commitment even when the infrastructure and services we use don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  📖 What is reliability?
&lt;/h2&gt;

&lt;p&gt;Before I get to the part where I share how we built one of the most reliable &lt;a href="https://authress.io/knowledge-base/articles/auth-situation-report" rel="noopener noreferrer"&gt;auth solutions&lt;/a&gt; available, I want to define reliability. And for us, that's an SLA of five nines. I think that's so extraordinary that the question I want you to keep in mind throughout this article is: &lt;strong&gt;is that actually possible?&lt;/strong&gt; Is it really achievable to have a service with a five nines SLA? When I say five nines, I mean that 99.999% of the time, our service is up and running as expected by our customers. And to put this into perspective, the red, in the sea of blue, represents just how much time we can be down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycqnaqlwj191gojou7co.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycqnaqlwj191gojou7co.png" alt="What does 5 nines look like"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you can't see it, it's hiding inside this black dot. It amounts to just five minutes and 15 seconds per year. This pretty much means we have to be up all the time, providing responses and functionality exactly as our customers expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flakzl3rd6nagppzjqfnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flakzl3rd6nagppzjqfnd.png" alt="5 nines on the timescale of a year"&gt;&lt;/a&gt;&lt;/p&gt;
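&lt;p&gt;The arithmetic behind that figure is worth sanity-checking yourself (using a 365-day year):&lt;/p&gt;

```javascript
// Downtime budget allowed per year for a given availability target.
function annualDowntimeSeconds(availability) {
  const secondsPerYear = 365 * 24 * 60 * 60; // 31,536,000
  return secondsPerYear * (1 - availability);
}

const budget = annualDowntimeSeconds(0.99999);
console.log(`${Math.floor(budget / 60)}m ${(budget % 60).toFixed(0)}s`); // → "5m 15s"
```

&lt;p&gt;For comparison, three nines (99.9%) would allow almost nine hours of downtime per year; every extra nine divides the budget by ten.&lt;/p&gt;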

&lt;h2&gt;
  
  
  🤔 But why?
&lt;/h2&gt;

&lt;p&gt;To put it into perspective, it's important to share, for a moment, the specific challenges that we face, why we built what we built, and of course why that's relevant. To do that, I need to include some details about what we're building — what &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress actually does&lt;/a&gt;. Authress provides login and access control for the software applications that you write — it generates JWTs for your applications. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User authentication and authorization&lt;/li&gt;
&lt;li&gt;User identities&lt;/li&gt;
&lt;li&gt;Granular role and resource-based authorization (ReBAC, ABAC, TBAC, RBAC, etc...)&lt;/li&gt;
&lt;li&gt;API keys for your technical customers to interact with your own APIs&lt;/li&gt;
&lt;li&gt;Machine-to-machine authentication between your services, if you have a microservice architecture.&lt;/li&gt;
&lt;li&gt;Audit trails to track the permission changes within your services or expose this to your customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there are of course many more components that help complete a full auth platform, but they aren't totally relevant to this article, so I'm going to skip over them.&lt;/p&gt;

&lt;p&gt;With that, you may already start to be able to see why uptime is so critical for us. &lt;strong&gt;We're on the critical path for our customers&lt;/strong&gt;. It's not inherently true for every single platform, but it is for us. So if our solution is down, then our customer applications are down as well.&lt;/p&gt;

&lt;p&gt;If we put the reliability part in the back corner for one second and just think about the features, we can theorize about a potential initial architecture. That is, an architecture that just focuses on the features: how might you build this out as simply as possible? I want to do this so I can help explain all the issues that we would face with the simple solution.&lt;/p&gt;

&lt;p&gt;Maybe you've got a single region, and in that region you have some sort of HTTP router that handles requests and forwards them to some compute: serverless, a container, a virtual machine, or, and I'm very sorry for the scenario, bare metal. Lastly, you're interacting with some database (NoSQL, SQL, or something else), file storage, and maybe some async components.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1bl7923pf44h0g8xpg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1bl7923pf44h0g8xpg8.png" alt="The simplest auth architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you take a look at this, it's probably obvious to you (and everyone else) that there is no way it is going to meet our reliability needs. But we have to ask, just exactly how often will there actually be a problem with this architecture? Just building out complexity doesn't directly increase reliability; we need to focus on why this architecture would fail. We use AWS, so I look to the Amazon CTO for guidance, who is famously quoted as saying, &lt;em&gt;&lt;strong&gt;Everything fails all the time&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And AWS's own services are no exception to this. Over the last decade, we've seen numerous incidents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2014 - Ireland (Partial) - Hardware - Transformer failed - EC2, EBS, and RDS&lt;/li&gt;
&lt;li&gt;2016 - Sydney (Partial) - Severe Weather - Power Loss - All Services&lt;/li&gt;
&lt;li&gt;2017 - All Regions - Human error - S3 critical servers deleted - S3&lt;/li&gt;
&lt;li&gt;2018 - Seoul Region - Human error - DNS resolvers impacted - EC2&lt;/li&gt;
&lt;li&gt;2021 - Virginia - Traffic Scaling - Network Control Plane outage - All Services&lt;/li&gt;
&lt;li&gt;2021 - California - Traffic Scaling - Network Control Plane outage - All Services&lt;/li&gt;
&lt;li&gt;2021 - Frankfurt (Partial) - Fire - Fire Suppression System issues - All Services&lt;/li&gt;
&lt;li&gt;2023 - Virginia - Kinesis issues - Scheduling Lambda Invocations impact - Lambda&lt;/li&gt;
&lt;li&gt;2023 - Virginia - Networking issues - Operational issue - Lambda, Fargate, API Gateway…&lt;/li&gt;
&lt;li&gt;2023 - Oregon (Partial) - Error rates - Dynamodb + 48 services&lt;/li&gt;
&lt;li&gt;2024 - Singapore (Partial) - EC2 Autoscaling - EC2&lt;/li&gt;
&lt;li&gt;2024 - Virginia (Partial) - Describe API Failures ECS - ECS + 4 services&lt;/li&gt;
&lt;li&gt;2024 - Brazil - ISP issues - CloudFront connectivity - CloudFront&lt;/li&gt;
&lt;li&gt;2024 - Global - Network connectivity - STS Service&lt;/li&gt;
&lt;li&gt;2024 - Virginia - Message size overflow - Kinesis down - Lambda, S3, ECS, CloudWatch, Redshift&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2025 - Virginia - Dynamo DB DNS - DynamoDB down - All Services&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And any one of these would have caused major problems for us and therefore our customers. And the frequency of incidents is actually increasing over time. This shouldn't be a surprise, right? Cloud adoption is increasing over time. The number of services AWS is offering is also increasing. But how impactful are these events? Would a single one of them have been a problem for us in actually reaching our SLA promise? What would happen if we just trusted AWS's SLAs and used them to pass through our commitments? Would that be sufficient to achieve 99.999% uptime? Well, let's take a look.&lt;/p&gt;

&lt;h2&gt;
  
  
  🕰️ AWS SLA Commitments
&lt;/h2&gt;

&lt;h4&gt;
  
  
  The AWS Lambda SLA is below 5 nines
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m3ldwp23hcajpy14nus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m3ldwp23hcajpy14nus.png" alt="Lambda SLA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The API Gateway SLA is below 5 nines
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dpoacaud5b5uk4ejwqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dpoacaud5b5uk4ejwqj.png" alt="API Gateway SLA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The AWS SQS SLA is below 5 nines
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsprsphzuy3m0dkxykhvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsprsphzuy3m0dkxykhvu.png" alt="SQS SLA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, so when it comes to trusting AWS SLAs, it isn't sufficient. At. All.&lt;/p&gt;

&lt;p&gt;We can't just use the components that are offered by AWS, and go from there. We fundamentally need to do something more than that. So the question becomes, what exactly must a dependency's reliability be in order for us to utilize it? To answer that question, it's time for a math lesson. Or more specifically, everyone's favorite topic, &lt;strong&gt;probabilities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's quickly get through this &lt;del&gt;torture&lt;/del&gt; exercise. Fundamentally, you have endpoints in your service, and you get in an HTTP request, and it interacts with some third-party component or API, and then you write the result to a database. For us, this could be an integration such as &lt;strong&gt;logging in with Google&lt;/strong&gt; or with &lt;strong&gt;Okta&lt;/strong&gt; for our customers' enterprise customers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdykoe5fmi96q1y2163eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdykoe5fmi96q1y2163eu.png" alt="Third-party Failure Rate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  💻 Calculating the allowed failure rate
&lt;/h2&gt;

&lt;p&gt;So if we want to meet a 5-nines reliability promise, how unreliable could this third-party component actually be? What happens if this component out of the box is only 90% reliable? We'll design a strategy for getting around that.&lt;/p&gt;

&lt;p&gt;Uptime is a product of all of the individual probabilities:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1bifce31ue0p8vomzy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1bifce31ue0p8vomzy3.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
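&lt;p&gt;To make that product concrete: chain a request through a few AWS services and multiply their availabilities together. The SLA figures below are the commonly published numbers and are my assumption for illustration, not values quoted in this article:&lt;/p&gt;

```javascript
// Composite availability of a request path is the product of the
// availabilities of every component it touches.
const dependencySlas = {
  apiGateway: 0.9995, // assumed published SLA figures, for illustration only
  lambda: 0.9995,
  dynamoDb: 0.9999
};

const compositeAvailability = Object.values(dependencySlas)
  .reduce((product, sla) => product * sla, 1);

console.log((compositeAvailability * 100).toFixed(3) + '%'); // → "99.890%"
```

&lt;p&gt;Even if every dependency hits its SLA exactly, the chain lands roughly two orders of magnitude short of a five nines downtime budget.&lt;/p&gt;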

&lt;p&gt;For the sake of this example, we'll just assume that every other component in our architecture is 100% reliable — That's every line of code, no bugs ever written in our library dependencies, or transitive library dependencies, or the dependencies' dependencies' dependencies, and everything always works exactly as we expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43665iq70ilmu6jiu0ve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43665iq70ilmu6jiu0ve.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we can actually rewrite our uptime promise as a result of the failure rate of that third-party component.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftquucsdsbldbs2ko15nw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftquucsdsbldbs2ko15nw.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the only way we can actually increase uptime in the face of those failures is to retry. So we multiply out the third-party failure rate and retry multiple times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnvwy187nxeuclf8e51p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnvwy187nxeuclf8e51p.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logically that makes a lot of sense. When a component fails, if you retry again, and again, the likelihood it will be down every single time approaches zero. And we can generate a really nasty equation from this to determine exactly how many times we need to retry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy530layozz10o8q0f4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy530layozz10o8q0f4z.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How many retries exactly? Rather than guessing whether we should retry four times or five times, or just put it in a &lt;code&gt;while(true)&lt;/code&gt; loop, we can figure it out precisely. So we take this equation and extend it out a little bit, plugging in our 90% reliable third-party component:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdeb6ooh25hwn5rdcdo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdeb6ooh25hwn5rdcdo6.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We find that our retry count actually must be greater than or equal to five. We can see that this adds up to our uptime expectation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgez8ufisktxqhosi1xv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgez8ufisktxqhosi1xv3.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is this the end of the story? Just retry a bunch of times and you're good? Well, not exactly. Remember this equation?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspvf3hz1j0x0exfkyvgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspvf3hz1j0x0exfkyvgl.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We do really need to consider every single component that we utilize. And specifically when it comes to the third-party component, we had to execute it by utilizing a retry handler. So we need to consider the addition of the retry handler into our equation. Going back to the initial architecture, instead of what we had before, when there's a failure in that third-party component, now we will automatically execute some sort of asynchronous retries or in-process retries. And every time that third-party component fails, we execute the retry handler and retry again.&lt;/p&gt;

&lt;p&gt;This means we need to consider the reliability of that retry handler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb59vhrdjl6bd7f3rwij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb59vhrdjl6bd7f3rwij.png" alt="Retry handler failure rate consideration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's assume we have a really reliable retry handler and that it's even more reliable than our service. I think that's reasonable, and actually required. A retry handler that is less reliable than our stated SLA by default is just as faulty as the third-party component.&lt;/p&gt;

&lt;p&gt;Let's consider one with five and a half nines — that's half a nine more reliable than our own SLA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewmcrbsecwdxs3lww9f5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewmcrbsecwdxs3lww9f5.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But how reliable does it really need to be? Well, we can pull in our original equation and realize that our total uptime is the reliability of the third-party component, after retries, multiplied by the reliability of our retry handler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmyycn3dw1hz80f8lnn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmyycn3dw1hz80f8lnn1.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, we add in the retries to figure out what the result should be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnmckfa9k6czz0su9756.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnmckfa9k6czz0su9756.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have a reliable retry handler, but it's not perfect. And with a retry handler that has a reliability of five and a half nines, we can retry &lt;strong&gt;a maximum of two times&lt;/strong&gt;. Because remember, it has to work every single time we use it, as it is a component which can also fail. Which means we're left with this equation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj72bn313d0kw856wi65h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj72bn313d0kw856wi65h.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I don't think it comes as a surprise to anyone that five is, in fact, greater than two. So what is the implication here?&lt;/p&gt;

&lt;p&gt;The number of retries required for that unreliable third-party component to be utilized by us exceeds the number of retries actually allowed by our retry handler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8oz0zgte2pi4vgwhkuv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8oz0zgte2pi4vgwhkuv2.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's a failure: the retry handler can only retry twice before it itself violates our SLA, but we need to retry five times to raise the third-party component's reliability up to our target. We can actually figure out the minimum reliability a third-party component is allowed to have when used with our retry handler:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjbk4n9dgl5f9pc8lw5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjbk4n9dgl5f9pc8lw5y.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Which in turn confirms that it's actually impossible for us to utilize that component. &lt;code&gt;99.7%&lt;/code&gt;. &lt;code&gt;99.7%&lt;/code&gt; is the minimum allowed reliability for any third-party component if we are to meet our required 5-nines SLA. This third-party component is so unreliable (&lt;code&gt;~90%&lt;/code&gt;) that even with a highly reliable retry handler, we still can't make it reliable enough without the retry handler itself compromising our SLA. We fundamentally need to account for this constraint when we're building out our architecture.&lt;/p&gt;
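&lt;p&gt;One way to sketch that constraint in code (a back-of-the-envelope composition of my own, not the article's exact derivation): the handler spends part of the failure budget on every attempt, and whatever budget remains bounds how unreliable the component may be. Composed this way it lands around 99.8%, in the same neighborhood as the 99.7% figure above; the exact number depends on how you compose the terms.&lt;/p&gt;

```javascript
// With a retry handler of reliability h invoked on each of k attempts,
// the handler alone consumes roughly k * (1 - h) of the failure budget.
// The remainder bounds the component: (1 - p)^k must fit inside it.
function minComponentReliability(handlerReliability, attempts, slaTarget) {
  const budget = 1 - slaTarget;
  const handlerCost = attempts * (1 - handlerReliability);
  const remaining = budget - handlerCost;
  if (remaining <= 0) {
    return null; // the handler alone already blows through the SLA budget
  }
  // Largest per-attempt failure rate f satisfying f^attempts <= remaining:
  const f = Math.pow(remaining, 1 / attempts);
  return 1 - f;
}

// Handler at five and a half nines, two attempts, 5-nines SLA:
const handler = 1 - Math.pow(10, -5.5);
console.log(minComponentReliability(handler, 2, 0.99999)); // ~0.998
```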

&lt;p&gt;That means we drop this third-party component. Done.&lt;/p&gt;

&lt;p&gt;And then, let's assume we get rid of every flaky component, everything that doesn't have a high enough reliability for us. At this point, it's good to ask: is this sufficient to achieve our 5-nines SLA? Well, it isn't just third-party components we have to be concerned about. We also have to worry about AWS infrastructure failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  🌩️ Infrastructure Failures​
&lt;/h2&gt;

&lt;p&gt;So let's flashback to our initial architecture again:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1bl7923pf44h0g8xpg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1bl7923pf44h0g8xpg8.png" alt="The simplest auth architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can have issues at the database layer, right? There could be any number of problems here: maybe it's returning 500s, there are some slow queries, maybe things are timing out. Or there could be a problem with our compute. Maybe it's not scaling up fast enough, and we're not getting new infrastructure resources. Sometimes even AWS is out of bare metal machines when you don't reserve them and instead request them on demand. And the list goes on.&lt;/p&gt;

&lt;p&gt;Additionally, there could be some sort of network issue, where requests aren't making it through to us, or a request from our users fails with a DNS resolution error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F997xjnqi3hlpgxmovfnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F997xjnqi3hlpgxmovfnc.png" alt="AWS Infrastructure Failure locations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In many of these cases, I think the answer is obvious: we just have to declare the whole region as down. And you are probably thinking, well, this is where we fail over to somewhere else. No surprise: yes, this is exactly what we do:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0qml4ivl0v6r1q6fp4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0qml4ivl0v6r1q6fp4l.png" alt="Region failover strategy in AWS"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, this means we have to have all the data and all the infrastructure components duplicated in another region. And since &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; has &lt;strong&gt;six primary regions&lt;/strong&gt; around the world, that also means we need multiple backup regions to support this strategy. But this comes with significant wasted resources and wasted compute that we don't even get to use. Costly! But I'll get to that later.&lt;/p&gt;

&lt;p&gt;Knowing a redundant architecture is required is a great first step, but that leaves us having to solve for: &lt;strong&gt;how do we actually make the failover happen in practice?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🚧 The Failover Routing Strategy​
&lt;/h2&gt;

&lt;p&gt;Simply put, our strategy is to utilize dynamic DNS routing. This means requests come into our DNS, which automatically selects between one of two target regions: the primary region we're utilizing, or the failover region in case there's an issue. The critical capability of this infrastructure is switching regions during an incident:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdhg1xaro8vguni8ymqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdhg1xaro8vguni8ymqn.png" alt="Utilizing Route 53 health checks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our case, when using AWS, this means using the Route 53 health checks and the &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html" rel="noopener noreferrer"&gt;Route 53 failover routing policy&lt;/a&gt;.&lt;/p&gt;
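&lt;p&gt;As an illustration (hostnames, identifiers, and health check IDs here are placeholders, not our actual configuration), the failover routing policy pairs a PRIMARY and a SECONDARY record, each tied to a health check, in the shape Route 53's ChangeResourceRecordSets expects:&lt;/p&gt;

```javascript
// Sketch of a Route 53 failover record pair, as passed to
// ChangeResourceRecordSets. Hostnames and health check IDs are placeholders.
function failoverRecord(role, target, healthCheckId) {
  return {
    Action: 'UPSERT',
    ResourceRecordSet: {
      Name: 'api.example.com',
      Type: 'CNAME',
      TTL: 60,
      SetIdentifier: role.toLowerCase(),
      Failover: role,               // 'PRIMARY' or 'SECONDARY'
      HealthCheckId: healthCheckId, // a custom health check, not the default
      ResourceRecords: [{ Value: target }]
    }
  };
}

const changeBatch = {
  Changes: [
    failoverRecord('PRIMARY', 'eu-west-1.api.example.com', 'primary-hc-id'),
    failoverRecord('SECONDARY', 'us-east-1.api.example.com', 'secondary-hc-id')
  ]
};
```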

&lt;p&gt;We know how we're going to do it, but the long pole in the tent is actually knowing that there is even a problem in the first place. A partial answer is &lt;strong&gt;have a health check&lt;/strong&gt;, and of course there is a health check here. But the full answer is: have a health check that validates both regions, checking whether each region is up or there is an incident, and that reports the results to the DNS router.&lt;/p&gt;

&lt;p&gt;We could utilize the default handler provided by AWS Route 53, or a third-party component which pings our website, but that's not accurate enough to know correctly and with certainty that our services are in fact down.&lt;/p&gt;

&lt;p&gt;It would be devastating for us to fail over when the secondary region is having worse problems than our primary region. Or what if there's an issue with network traffic? We wouldn't know whether that's an issue of communication between AWS's infrastructure services, an issue with the default Route 53 health check endpoint, or some entangled problem with how those specifically interact with the code we're actually running. So it became a requirement to build something ourselves, custom, to execute exactly the checks we need.&lt;/p&gt;

&lt;p&gt;Here is a representation of what we're doing. It's not exactly what we are doing, but it's close enough to be useful. Health check requests come in from the Route 53 health check. They call into our API Gateway or load balancer as a router. The requests are passed to our compute, which can interact with and validate logic, code, access, and data in the database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m8k9awu82gjzggrp9dl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m8k9awu82gjzggrp9dl.png" alt="The health check endpoint architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The health check executes this code on request that allows us to validate if the region is in fact healthy:&lt;/p&gt;

&lt;p&gt;Region HealthCheck validation&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Authorizer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./authorizer.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;ModelValidator&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./modelValidator.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;healthCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dynamoDbCheck&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;accountDatabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getDefaultAccount&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;indexerCheck&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;indexer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizationCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HealthCheck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sqsValidation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;sqsClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LiveCheck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Authorizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;modelValidation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ModelValidator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;dynamoDbCheck&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;indexerCheck&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sqsValidation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;authorizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;modelValidation&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HealthCheck Failed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;profiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;ol&gt;
&lt;li&gt;We start a profiler to know how long our requests are taking.&lt;/li&gt;
&lt;li&gt;Then we interact with our databases, as well as validate some secondary components, such as SQS. While issues with secondary components may not always be a reason to failover, they can cause impacts to response time, and those indicators can be used to predict incoming incidents.&lt;/li&gt;
&lt;li&gt;From there, we check whether or not the most critical business logic is working correctly. In our case, that's interactions with DynamoDB as well as core authorizer logic. Compared to a simple unit test, this accounts for corruption in a deployment package, as well instances where some subtle differences between regions interact with our code base. We can catch those sorts of problems here, and know that the primary region that we're utilizing, one of the six, is having a problem and automatically update the DNS based on this.&lt;/li&gt;
&lt;li&gt;When we're done, we return success or failure so the health check can track changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🌿 Improving the Failover Strategy​
&lt;/h2&gt;

&lt;p&gt;And we don't stop here with our infrastructure failover. The current strategy is good, and in some cases even sufficient, but it isn't great. For starters, we have to fail over completely. If just one component is problematic, we can't easily swap out only that component; it's all or nothing with the Route 53 health check. So when possible, we push for an edge-optimized architecture. In AWS, this means utilizing &lt;a href="https://aws.amazon.com/cloudfront/" rel="noopener noreferrer"&gt;AWS CloudFront&lt;/a&gt; with AWS Lambda@Edge for compute. This not only helps reduce latency for our customers and their end users, depending on where they are around the world; as a secondary benefit, it is fundamentally an improved failover strategy.&lt;/p&gt;

&lt;p&gt;And that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef8jpguja4meqn08b61a.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef8jpguja4meqn08b61a.gif" alt="CloudFront Edge Failover"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using CloudFront gives us a &lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/charting-the-life-of-an-amazon-cloudfront-request/" rel="noopener noreferrer"&gt;highly reliable CDN&lt;/a&gt;, which routes requests to the locally available compute region. From there, we can interact with the local database. When our database in that region experiences a health incident, we automatically fail over and check the database in a second, adjacent region. And when there's a problem there as well, we do it again, to a third region. We can do that because, when utilizing DynamoDB, we have &lt;a href="https://aws.amazon.com/dynamodb/global-tables/" rel="noopener noreferrer"&gt;Global Tables&lt;/a&gt; configured for the authorization configuration. In places where we don't need the data duplicated, we just interact with the table in a different region without replication.&lt;/p&gt;

&lt;p&gt;After a third region with an issue, &lt;strong&gt;we stop.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And maybe you're asking: why three, and not four or five or six? Aren't you glad we did the probabilities exercise earlier? Now you can actually figure out why it's three. But I'll leave that math as an exercise for you.&lt;/p&gt;
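&lt;p&gt;A rough sketch of that bounded fallback (region names and client wiring are illustrative, not our production code):&lt;/p&gt;

```javascript
// Try the local region first, then up to two adjacent regions, and stop
// after the third failure instead of retrying forever.
const FALLBACK_LIMIT = 3;

async function readWithRegionalFallback(clientsByRegion, regions, key) {
  let lastError;
  for (const region of regions.slice(0, FALLBACK_LIMIT)) {
    try {
      // Global Tables replication means each region holds the same data.
      return await clientsByRegion[region].get(key);
    } catch (error) {
      lastError = error; // region unhealthy; move to the next adjacent region
    }
  }
  throw lastError; // three regions down: declare the request failed
}
```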

&lt;p&gt;As a quick recap, this handles the problems at the infrastructure level and with third-party components. And if we solve those, is that sufficient to achieve our goal of a 5-nines SLA?&lt;/p&gt;

&lt;p&gt;For us the answer is &lt;strong&gt;No&lt;/strong&gt;, and you might have guessed, if you peeked at the scrollbar or the table of contents, that there are still quite a few additional components integrated into our solution. One of them comes from knowing that at some point, unfortunately, there's going to be a bug in our code.&lt;/p&gt;

&lt;h2&gt;
  
  
  💻 Application level failures​
&lt;/h2&gt;

&lt;p&gt;And that bug will get committed to production, which means we're going to end up with an application failure. It should be obvious that writing completely bug-free code isn't achievable. Maybe there is someone out there who thinks it is, maybe that's even you, and I believe that you believe that. However, I know it's not me, and realistically, I don't want to sit around and pray that it's true of my fellow team members either. The risk is too high, because if something does get into production, it can impact some of our customers. So instead, let's assume it will happen and design a strategy around it.&lt;/p&gt;

&lt;p&gt;So when it does happen, we of course have to trigger our incident response. For us, we send out an email, we post a message on our community and internal communication workspaces, and start an on-call alert. The technology here isn't so relevant, but tools like AWS SES, SQS, SNS, Discord, and emails are involved.&lt;/p&gt;

&lt;p&gt;Incidents wake an engineer up, so someone can start to take a look at the incident, and most likely the code.&lt;/p&gt;

&lt;p&gt;But by the time they even respond to the alert, let alone actually investigate and fix the cause of the incident, we would have long since violated our SLA. So an alert alone is not sufficient for us. We also need automation to automatically remediate these problems. Now, I'm sure you're thinking, &lt;em&gt;yeah, okay, test automation&lt;/em&gt;. You might even be thinking about an LLM agent that can automatically create PRs. (Side note: LLM code generation doesn't actually work for us, and I'll get to that a little further down.) Instead, we have to rely on having sufficient testing in place. And yes, of course we do. We test before deployment. There is no better time to test.&lt;/p&gt;

&lt;p&gt;This seems like a simple and obvious answer, and I hope that for anyone reading this article it is: untested code never goes to production. Every line of code is completely tested before it is merged, even if it is gated behind some flag. Releasing untested code behind some magic feature flag is far too dangerous, and abusing feature flags to make that happen could not be a worse decision for us. That's because we need to be as confident as possible before changes actually get out in front of our customers. The result is that we don't focus on test coverage percentage, but rather on &lt;strong&gt;test value&lt;/strong&gt;. That is, which areas provide the most value, are the most risky, and matter most for reliability to our customers. Those are the ones we focus on testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Root Cause Analysis (RCA)​
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Every incident could have been prevented if we just had one more test.&lt;/strong&gt; The trick, though, is actually having that right test before the incident.&lt;/p&gt;

&lt;p&gt;And in reality, that's not actually possible. Having every right test for a service that is constantly changing, while new features are being added, is just unmaintainable. Every additional test we write increases the maintenance burden of our service, and attempting to achieve 100% complete test coverage would require an infinite amount of time. This is the &lt;a href="https://en.wikipedia.org/wiki/Pareto_principle" rel="noopener noreferrer"&gt;Pareto Principle&lt;/a&gt;, more commonly the 80-20 rule: if it takes 20% of the time to deliver 80% of the tests, it takes effectively infinite time to achieve all of them, and that assumes the source code isn't changing.&lt;/p&gt;

&lt;p&gt;The result is we'll never be able to catch everything. &lt;strong&gt;So we can't just optimize for prevention. We also need to optimize for recovery.&lt;/strong&gt; This conclusion for us means also implementing tests against our deployed production code. One example of this are validation tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  📋 Validation Tests​
&lt;/h2&gt;

&lt;p&gt;A validation test is one where you have the same data in two different formats and use those two formats to ensure referential consistency. (Side note: there are many different kinds of tests, and I do a deep dive into &lt;a href="https://authress.io/knowledge-base/academy/topics/user-impersonation-risks#solution-b-dom-recording" rel="noopener noreferrer"&gt;the different types of tests&lt;/a&gt; and how they're relevant to building secure and reliable systems.) One concrete example: a request comes in, you log the request data and the response, and then you compare that logged data to what's actually saved in your database.&lt;/p&gt;

&lt;p&gt;In our scenario, which focuses on the authorization and permissions enforcement checks, we have multiple databases with similar data. In one case, there's the storage of permissions as well as the storage of the expected checks and the audit trail tracking the creation of those permissions. So we actually have multiple opportunities to compare the data between our databases asynchronously outside of customer critical path usage.&lt;/p&gt;
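&lt;p&gt;Schematically, such a comparison might look like this (field names are illustrative, not our actual schema):&lt;/p&gt;

```javascript
// Cross-check permission records against the audit trail entries that
// recorded their creation. Any discrepancy becomes an incident.
function findInconsistencies(permissionRecords, auditTrailEntries) {
  const audited = new Map(auditTrailEntries.map(e => [e.permissionId, e]));
  const problems = [];
  for (const record of permissionRecords) {
    const entry = audited.get(record.permissionId);
    if (!entry) {
      problems.push({ permissionId: record.permissionId, issue: 'missing audit entry' });
    } else if (entry.grantedRole !== record.role) {
      problems.push({ permissionId: record.permissionId, issue: 'role mismatch' });
    }
  }
  return problems; // non-empty: fire an incident before customers notice
}
```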

&lt;h3&gt;
  
  
  Running the Validation​
&lt;/h3&gt;

&lt;p&gt;On a schedule, via an AWS CloudWatch Scheduled Rule, we load the data from our different databases and compare them against each other to make sure they are consistent. If there is a discrepancy, this fires off an incident before any of our customers notice, so that we can actually go in and check what's going on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgx7zftaky0c0kg1k1tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgx7zftaky0c0kg1k1tk.png" alt="The architecture flow to trigger the validation tests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the surface it sounds bad that this could ever happen. But the reality is that a discrepancy can show up through any number of mechanisms. For instance, AWS's infrastructure could have corrupted one of the database shards, leaving what's written to the databases inconsistent. We know this can happen because there is no 100% guarantee on database durability, even from AWS. &lt;strong&gt;AWS does not guarantee database durability.&lt;/strong&gt; Are you assuming they do? Because we don't! So actually reading the data back and verifying its internal consistency is something that we must do.&lt;/p&gt;

&lt;p&gt;It might not seem that this reduces the probability of an incident, but consider that a requested user permission check whose result doesn't match our customer's expectation &lt;em&gt;is&lt;/em&gt; an incident. It might not be one that anyone identifies or even becomes aware of, but it is nonetheless a problem. Just like a publicly exposed S3 bucket is technically an issue even if no one has exfiltrated the data yet: the absence of visible damage doesn't mean the bucket is sufficiently secured.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 Incident Impact​
&lt;/h2&gt;

&lt;p&gt;There are two parts to the actual risk of an incident: the probability and the impact. Everything I've discussed in this article until now is about reducing the probability of an incident, that is, the likelihood of it happening. But since we know that we can't avoid ever having an incident, we also have to reduce the impact when it happens.&lt;/p&gt;

&lt;p&gt;One way we do that is by utilizing an &lt;strong&gt;incremental rollout&lt;/strong&gt;. Hopefully everyone knows what an incremental rollout is, so I'll jump straight into how we accomplish it on AWS. And for that we focus again on our solution integrating with CloudFront and our edge architecture.&lt;/p&gt;

&lt;p&gt;The solution for us is what I call &lt;strong&gt;Customer Deployment Buckets&lt;/strong&gt;. We group individual customers into separate buckets and then deploy to each bucket sequentially. If the deployment rolls out without a problem and everything is green, meaning everything works correctly, we go on to the second bucket and deploy our code there, then the third bucket, and so on until every single customer has the new version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7g093ezmn0efir72exnj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7g093ezmn0efir72exnj.gif" alt="Rolling out to customer buckets one at a time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If there is an issue, we stop the rollout and we go and investigate what's actually going on. While we can't prevent the issue from happening to the earlier buckets, we are able to stop that issue from propagating to more customers, having an impact on everyone, and thus reduce the impact of the incident.&lt;/p&gt;
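&lt;p&gt;The bucket-by-bucket rollout can be sketched in a few lines. This is an illustration only; the &lt;code&gt;deploy&lt;/code&gt; and &lt;code&gt;healthy&lt;/code&gt; callbacks are stand-ins for the real deployment pipeline and health checks, not our actual tooling.&lt;/p&gt;

```python
# Deploy to customer buckets sequentially, halting on the first unhealthy one
# so an issue never propagates past the bucket where it was detected.

def incremental_rollout(buckets, deploy, healthy):
    """Deploy to each bucket in order; stop and report on the first failure."""
    deployed = []
    for bucket in buckets:
        deploy(bucket)
        if not healthy(bucket):
            # Earlier buckets are affected, but later ones never see the change.
            return {"deployed": deployed + [bucket], "halted_at": bucket}
        deployed.append(bucket)
    return {"deployed": deployed, "halted_at": None}
```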

&lt;p&gt;As I mentioned before, the biggest recurring issue isn't executing an operations process during an incident, it's identifying that there is a real incident in the first place. So, &lt;strong&gt;how do we actually know that there's an issue?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If it were an easy problem to solve, you would have written a unit test or &lt;a href="https://authress.io/knowledge-base/academy/topics/user-impersonation-risks#solution-b-dom-recording" rel="noopener noreferrer"&gt;integration test or service level test&lt;/a&gt; and thus already discovered it, right? So adding tests can't, by design, help us here. Maybe there's an issue with the deployment itself or during infrastructure creation, but likely that's not what's happening.&lt;/p&gt;

&lt;p&gt;Now, I know you're thinking, &lt;em&gt;&lt;strong&gt;When is he going to get to AI?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Whether or not we'll ever truly have AI is a separate &lt;code&gt;&amp;lt;rant /&amp;gt;&lt;/code&gt; that I won't get into here, so this is the only section on it, I promise. What we actually do is better called &lt;strong&gt;anomaly detection&lt;/strong&gt;. Historically, anomaly detection was what AI actually meant, true AI, rather than an LLM or an agent of any kind.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔎 AI: Anomaly Detection​
&lt;/h2&gt;

&lt;p&gt;This is a graph of our detection analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweyw1uefylwhpc2prxq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweyw1uefylwhpc2prxq6.png" alt="namely detection graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might notice that it's not tracking 400s or 500s. Those are relatively easy to detect, but they don't actually tell us anything meaningful about what's wrong with our service or whether there really is a problem. Impact is measured by business value, not protocol-level technical analytics, so we need a business-focused metric.&lt;/p&gt;

&lt;p&gt;And for us, at Authress, the business-focused metric we use to identify meaningful incidents is what we call &lt;strong&gt;The Authorization Ratio&lt;/strong&gt;: the ratio of successful logins and authorizations to ones that are blocked, rejected, time out, or are never completed for some reason.&lt;/p&gt;
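&lt;p&gt;As a sketch, the ratio itself is a simple computation. The counter names here are illustrative rather than our actual metric dimensions:&lt;/p&gt;

```python
# The Authorization Ratio: successful authorizations divided by all attempts,
# including the ones that were blocked, rejected, timed out, or abandoned.

def authorization_ratio(successes, blocked, rejected, timed_out, abandoned):
    total = successes + blocked + rejected + timed_out + abandoned
    if total == 0:
        return None  # no traffic in this window: nothing to conclude
    return successes / total
```

&lt;p&gt;A drop in this ratio reflects customer impact directly, which a count of 500s alone does not.&lt;/p&gt;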

&lt;p&gt;The above CloudWatch metric display contains this exact ratio, and the timeframe shown represents an instance not too long ago where we came really close to firing off our alert.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoklmvhrrqxccuwumau8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoklmvhrrqxccuwumau8.png" alt="Anomaly Detection allowance bands"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, there was a slight elevation of errors soon after a deployment. The ratio fell outside of our allowance band for a short period of time, though not long enough to trigger an incident. We still investigated, but it wasn't something that required immediate remediation. It's a good reminder that identifying problems in production software isn't straightforward. To achieve high reliability, we've needed anomaly detection, AI in the historical sense, to actually identify additional problems. And realistically, even with this level of sophistication in place, we can never know with 100% certainty that there is actually an incident at any moment. That's because "what is an incident" is actually a philosophical question...&lt;/p&gt;
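&lt;p&gt;A minimal sketch of the allowance-band idea: flag a window when the ratio falls outside a band around recent history, and only open an incident when the excursion persists. The band width and persistence threshold are illustrative assumptions, not our actual CloudWatch configuration.&lt;/p&gt;

```python
# Flag ratio windows outside mean +/- k standard deviations of recent history,
# and alert only when several consecutive windows are anomalous.
from statistics import mean, stdev

def outside_band(history, value, k=3.0):
    if len(history) < 2:
        return False  # not enough history to form a band
    m, s = mean(history), stdev(history)
    return abs(value - m) > k * s

def should_alert(history, recent, k=3.0, persistence=3):
    """Alert only if the last `persistence` windows are all outside the band."""
    windows = recent[-persistence:]
    return len(windows) == persistence and all(
        outside_band(history, v, k) for v in windows
    )
```

&lt;p&gt;A brief dip, like the one pictured, trips &lt;code&gt;outside_band&lt;/code&gt; without tripping &lt;code&gt;should_alert&lt;/code&gt;, which matches how we handled it: investigate, but no incident.&lt;/p&gt;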

&lt;h2&gt;
  
  
  🌹 Does it smell like an incident?​
&lt;/h2&gt;

&lt;p&gt;Our anomaly detection said: almost an incident. We determined the result: no incident. But does that mean there wasn't an incident? What makes an incident? How do I define an incident? And is that exact definition ubiquitous for every system, every engineer, every customer?&lt;/p&gt;

&lt;p&gt;Obviously not, and one look at the &lt;a href="https://health.console.aws.amazon.com/health/home" rel="noopener noreferrer"&gt;AWS Health Status Dashboard&lt;/a&gt; is all you need to determine that the identification of incidents is based on subjective perspective, rather than objective criteria. What's actually more important is the synthesis of our perspective on the situation and what our customers believe. To see what I mean, let's do a comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52a9g5937u786bvzpbnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52a9g5937u786bvzpbnm.png" alt="incident perspective comparison"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm going to use Authress as an example. So I've got the product services perspective on one side and our customer's perspective on the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incident Alignment​
&lt;/h3&gt;

&lt;p&gt;In the top left corner we have alignment. If we believe that our system is up and working and our customers do, too, then success, all good. Everything's working as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi3qzdxcq5rbzcjlm2iy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi3qzdxcq5rbzcjlm2iy.png" alt="incident perspective comparison alignment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conversely, in the opposite corner, maybe there is a problem. We believe that one of our services is having an issue, and we successfully identify it. Most importantly, our customers say: yes, there is an issue for us.&lt;/p&gt;

&lt;p&gt;It's not great that there's an incident, but as I've established, incidents will absolutely happen, and the fact that we and our customers independently aligned on the problem's existence allows us to deploy automation to automatically remediate the issue. That's a success! If it's a new problem that we haven't seen before, we can even design new automation to fix it. Correctly identifying incidents is challenging, so getting that step right lends itself very well to automated remediation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Perspective Mismatch​
&lt;/h3&gt;

&lt;p&gt;One interesting corner is when our customers believe that there's nothing wrong, no incidents have been reported, but all our alerts are screaming &lt;em&gt;RED ALERT&lt;/em&gt;: someone has to go look at this!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famydmdwz08zwrv3amc91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famydmdwz08zwrv3amc91.png" alt="incident perspective mismatch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, our alerts have identified a problem that no one cares about. This often happens when a customer operates in a single region, Switzerland for example, with local users rather than a global audience; a health care, manufacturing, or e-commerce app is a good example. Those users are likely asleep at 2:00 AM. That means an incident at this moment could be an issue affecting some customers, but if they aren't around to experience it, is it actually happening?&lt;/p&gt;

&lt;p&gt;You are probably wincing at that idea. There's a bug, it must be fixed! And sure, it's a problem, it's happening, and we should take note of what's going on. But we don't need to respond in real time. That would be a waste of resources we could be investing in other things. Why wake up our engineers over functionality that no one is currently using?&lt;/p&gt;

&lt;p&gt;I think one of the most interesting categories is in the top right-hand corner where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;our customers say, &lt;em&gt;"hey, your service is down"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;But we say, &lt;em&gt;"Wait, really, is it?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is known as a &lt;strong&gt;gray failure&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gray Failures​
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01m60p4zzkauenyib9a8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01m60p4zzkauenyib9a8.png" alt="Gray failures identified"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it can happen for any number of reasons. Maybe something in our knowledge base tells our customers to do something one way, it's confusing, and they've interpreted it differently. So there's a different expectation, and that expectation can get codified into customer processes and product services.&lt;/p&gt;

&lt;p&gt;Or maybe our customer is running different tests from us, ones that are of course, valuable for their business, but not ones that we consider. Or more likely they are just using a less resilient cloud provider.&lt;/p&gt;

&lt;p&gt;Most fundamentally, there could really be an incident, something that we haven't detected yet, but they have. And if we don't respond to that, it could grow, and left unchecked, escalate, and eventually impact all our customers. This means we need to give our customers an easy way to report incidents to us, which we can immediately follow up with.&lt;/p&gt;

&lt;p&gt;For us, every single incident, every single customer support ticket that comes into our platform is immediately and directly sent to our engineering team. Now, I often get pushback on this from other leaders. I'm sure even you might be thinking something like: &lt;em&gt;I don't want to be on call for customer support incidents.&lt;/em&gt; But if you put additional tiers between your engineering teams and your customers, you're increasing the time it takes to actually start investigating and resolving those problems. If you have two tiers before your engineering team and each tier has its own SLA of 10 minutes to triage the issue, then 20 minutes have already gone by before an engineer even knows about it and can go look. That violates our SLA by fourfold before investigation and remediation can even begin.&lt;/p&gt;

&lt;p&gt;Instead, in those scenarios, what I actually recommend thinking about is how might you reduce the number of support tickets you receive in aggregate? This is the much more appropriate way to look at the problem. If you are getting support tickets that don't make sense, then you've got to investigate, &lt;em&gt;why did we get this ticket?&lt;/em&gt; Do the root cause analysis on the ticket, not just the issue mentioned in it — why the ticket was even created in the first place.&lt;/p&gt;

&lt;p&gt;A ticket means: Something is broken. From there, we can figure out, OK, maybe we need to improve our documentation. Or we need to change what we're doing on one of our endpoints. Or we need to change the response error message we're sending. But you can always go deeper.&lt;/p&gt;

&lt;h3&gt;
  
  
  The customer support advantage​
&lt;/h3&gt;

&lt;p&gt;And going deeper, means customer support is critical for us. We consider customer support to be the lifeline of our service level agreement (SLA). If we didn't have that advantage, then we might not have been able to deliver our commitment at all. So much so that we report some of our own CloudWatch custom metrics to our customers so they can have an aggregate view of both what they know internally and what we believe. We do this through our own internal dashboard in our application management UIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmgmwgjm0hbw1m3ev6je.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmgmwgjm0hbw1m3ev6je.png" alt="Authress metric dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helping our users identify incidents benefits us; because we can't catch everything. It's just not possible.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  💀 Negligence and Malice​
&lt;/h2&gt;

&lt;p&gt;To this point, we've done the math on reliability of third-party components. We've implemented an automatic region failover and added incremental rollout. And we have a core customer support focus. Is that sufficient to achieve 5-nines of reliability?&lt;/p&gt;

&lt;p&gt;If you think yes, then you'd expect the meme pictures now. And, I wish I could say it was enough, but it's not. That's because we also have to deal with negligence and malice.&lt;/p&gt;

&lt;p&gt;We're in a privileged position to have numerous security researchers out there on the internet constantly trying to find vulnerabilities within our service. For transparency, I have some of those reports I want to share:&lt;/p&gt;

&lt;h3&gt;
  
  
  “Real” Vulnerability Reports​
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy2dgulk36qid68k18bc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxy2dgulk36qid68k18bc.png" alt="fake vulnerability disclosure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I am a web security researcher enthusiast. Do you give a monetary reward?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Okay, this isn't starting out that great. What else have we received?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgct2u339n5cak8rf4qox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgct2u339n5cak8rf4qox.png" alt="appeal to ethical hacking rewards"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I found some vulnerabilities in your website. Do you offer rewards for ethical hackers?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Well, maybe, but I think you would actually need to tell us what the problem is. You also might notice this went to our spam folder; it didn't even reach our inbox. So much for the help they might be providing. In fact, we ignore any &lt;em&gt;"security"&lt;/em&gt; email sent from a non-custom domain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclh929kh86ucu0a7m36b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclh929kh86ucu0a7m36b.png" alt="Phishing attempt using our own credentials"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This one was really interesting. We had someone attempting to phish our engineering team by creating a support ticket and putting in some configuration trying to get us to provide them our own credentials to one of our third-party dependencies. Interestingly enough, our teams don't even have access to those credentials directly.&lt;/p&gt;

&lt;p&gt;And we know this was malicious because the credentials they referenced in the support request came from our honeypot, planted in our UI explicitly to catch these sorts of things. The only way to get those credentials is to dig around our UI application and pull them out of the HTML; they aren't available any other way. So it was very easy for us to detect that this "report" was actually a social engineering attack.&lt;/p&gt;

&lt;p&gt;And this is one of my favorites, and I can't make this up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyuecblge6tw7wo0l90a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyuecblge6tw7wo0l90a.png" alt="Bugbounty vulnerability reporting"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I have found many security loophole. How much will you pay if you want to working with me like project?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the exact quote; I don't even know what that means. Unfortunately, for better or worse, LLMs will make future "vulnerability reports" like these sound more appealing to read. At the end of the day, though, these are harmless. And we actually do have a &lt;a href="https://authress.io/app/#/disclosure" rel="noopener noreferrer"&gt;security disclosure program&lt;/a&gt; that anyone can use to submit problems. My message to white-hat hackers: please use that process; the legitimate reports usually do go through it. Do not send us emails, those are going to go into the abyss. You can also follow our &lt;a href="https://authress.io/.well-known/security.txt" rel="noopener noreferrer"&gt;security.txt&lt;/a&gt; public page or go to the disclosure form, but with email, the wrong people are going to get it and we can't triage effectively.&lt;/p&gt;

&lt;p&gt;Vulnerabilities in our services can result in production incidents for our customers. That means security is part of our SLA. Don't believe me? I'll show you how:&lt;/p&gt;

&lt;h3&gt;
  
  
  Multitenant considerations​
&lt;/h3&gt;

&lt;p&gt;It's relevant for us, that Authress is a multitenant solution. So some of the resources within our service are in fact shared between customers.&lt;/p&gt;

&lt;p&gt;Additionally, a customer could have multiple services in a microservice architecture, or multiple components, and one of those services could theoretically consume all of the resources we've allocated for that customer. That would cause an incident for that customer, so we need to protect against &lt;strong&gt;intra-tenant&lt;/strong&gt; resource exhaustion. Likewise, we have multiple customers, and one of them could consume more resources than we've allocated to their entire tenant. That's an &lt;strong&gt;inter-tenant&lt;/strong&gt; problem, and it could cause an incident across our platform and impact other customers.&lt;/p&gt;

&lt;p&gt;Lastly, we have to worry about our customers, our customers' customers, and our customers' customers' customers, because any one of them could be malicious and consume those resources, and so on down the chain, causing a cascading failure. &lt;strong&gt;A failure due to lack of resources is an incident&lt;/strong&gt;. The only solution that makes sense for this is, surprise, rate limiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Helpful Rate Limiting​
&lt;/h3&gt;

&lt;p&gt;So we need to rate-limit these requests at different levels for different kinds of clients and different kinds of users, and we do that at different fundamental layers of our infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc07ujv2eln2wegxckcoc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc07ujv2eln2wegxckcoc.png" alt="CloudFront and Region based rate limiting locations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Primarily there are protections at our compute level, as well as at the region level, and we also place protections at the global level. In AWS, this of course means using a &lt;a href="https://aws.amazon.com/waf/" rel="noopener noreferrer"&gt;web application firewall, or WAF&lt;/a&gt;. I think our WAF configuration is interesting and in some ways novel.&lt;/p&gt;

&lt;p&gt;Fundamentally, one of the things that we love to use is the &lt;a href="https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-ip-rep.html#aws-managed-rule-groups-ip-rep-amazon" rel="noopener noreferrer"&gt;AWS managed IP reputation list&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The reputation list is a list of IP addresses that have been associated with malicious activity detected outside of our service, across other AWS customers and other providers around the world. That means before those attacks even get to our service, or to our customers' instances of Authress, we already know to block them, and the WAF does exactly that. This is great, and most importantly, it has a very low false positive rate.&lt;/p&gt;

&lt;p&gt;However, the false positive rate is an important metric when considering countermeasures against malicious attacks or negligent, accidental abuse of resources, and it's what prevents us from using any of the other managed rules from AWS or external providers. There are two fundamental problems with managed rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The first is the false positive rate. If it is even slightly elevated, it isn't sustainable, and it would result in us blocking legitimate requests coming from a customer's users. That is a problem, and it's an incident for the customer if some of their users can't use their software because of something we did. False positives are customer incidents.&lt;/li&gt;
&lt;li&gt;The second is that managed rules are gratuitously expensive. Lots of companies build these just to charge you lots of money, and the ROI just doesn't seem to be there; we don't see useful blocks from them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But the truth is, we need to do something more than just the reputation list rule.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling Requests at Scale​
&lt;/h3&gt;

&lt;p&gt;And the thing we've decided to do is add blocking for sufficiently high request rates. By default, any Authress account's service client that goes above 2,000 requests per second (RPS) gets immediately terminated. Now, this isn't every customer; there are some that do require such a high load or even higher (as 2k isn't actually that high). But for the majority, if a client reaches this number and the customer hasn't talked to us about their volume, it is probably malicious in some way. You don't magically go from zero to 2,000 one day, unless it's an import job.&lt;/p&gt;

&lt;p&gt;Likewise, we can actually learn about a problem long before it gets to that scale. We have milestones: we start reporting client loads at 100, 200, 500, 1,000, et cetera. If we see clients hitting these load milestones, we can already respond and create an incident to investigate before they reach the point of consuming all of that customer's allocated resources. We do this by adding alerts on the COUNT of requests in the WAF metrics.&lt;/p&gt;
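&lt;p&gt;The escalation ladder might be sketched like this. The 2,000 RPS ceiling and the milestone values come from the description above; the function shape and the allowlist flag are illustrative assumptions:&lt;/p&gt;

```python
# Report milestone crossings for investigation, and hard-block any client
# above the default ceiling unless the customer has pre-arranged the volume.

MILESTONES = [100, 200, 500, 1000]
HARD_LIMIT_RPS = 2000

def evaluate_client_load(rps, allowlisted=False):
    """Return ('block'|'investigate'|'ok', detail) for a client's current RPS."""
    if rps >= HARD_LIMIT_RPS and not allowlisted:
        return ("block", HARD_LIMIT_RPS)
    crossed = [m for m in MILESTONES if rps >= m]
    if crossed:
        # Open a low-urgency incident before resources are actually exhausted.
        return ("investigate", crossed[-1])
    return ("ok", None)
```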

&lt;p&gt;However, we also see attacks at a smaller scale. Just because we aren't being DDoSed doesn't mean there isn't an attack. Those requests still get through because they don't meet our blocking limits. They could be malicious in nature, but only identifiable in aggregate. While a single request might seem fine, if you see the same request 10 times a second, or 100 times a second, something is probably wrong. Or if you have request URLs that end in &lt;code&gt;.php?admin&lt;/code&gt; when no one has run WordPress in decades, you also know there's a problem. We catch these by logging all of the blocked requests.&lt;/p&gt;

&lt;p&gt;We have automation in place to query those results and update our rules, but a picture is worth a thousand words:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ro8l9l1rn6gpkq4alz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ro8l9l1rn6gpkq4alz2.png" alt="WAF COUNT metrics display"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you can see a query based on the client IP addresses being utilized, sorted by frequency. When we get requests that look non-malicious individually, we execute a query such as this one and check whether the results match a pattern. You can match on IP address or, more intelligently, on the JA3 or JA4 fingerprints of those requests. There are actually lots of options available; I'm not going to get into exactly what they are, as there are some &lt;a href="https://ramimac.me/waf-ddos" rel="noopener noreferrer"&gt;great articles on the topic&lt;/a&gt;. There are more mechanisms used throughout the security industry to track these, and utilizing them lets you instantly identify: &lt;em&gt;Hey, you know what? This request violates one of our patterns, maybe we should block all the requests from that client.&lt;/em&gt;&lt;/p&gt;
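&lt;p&gt;In spirit, that aggregate analysis is a grouping problem: count requests per source fingerprint and flag sources that repeat abnormally often or match known-bad URL patterns. This sketch uses hypothetical log fields and thresholds:&lt;/p&gt;

```python
# Individually innocuous requests become suspicious in aggregate: the same
# fingerprint (IP, JA3/JA4, ...) repeating far beyond normal, or probing
# paths no legitimate client would request.
from collections import Counter

SUSPICIOUS_PATHS = (".php", "/wp-admin")

def suspicious_sources(request_log, frequency_threshold=100):
    """Group logged requests by fingerprint; return the set worth blocking."""
    by_source = Counter(r["fingerprint"] for r in request_log)
    flagged = {src for src, n in by_source.items() if n >= frequency_threshold}
    for r in request_log:
        if any(pattern in r["path"] for pattern in SUSPICIOUS_PATHS):
            flagged.add(r["fingerprint"])
    return flagged
```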

&lt;p&gt;And so, rather than waiting until an attacker is consuming 2,000 requests per second worth of resources, you can stop them right away. In the cases where we can't make a conclusive decision, this technology gives us another tool we can use to improve our patterns for the future. It maybe goes without saying, but because we run our technology in many regions around the world, we have to deploy this infrastructure in all of those places and push it out to the edge where possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41hryyea3htsxu8m6frp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41hryyea3htsxu8m6frp.png" alt="Authress AWS Regional and Global locations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🎁 The Conclusion​
&lt;/h2&gt;

&lt;p&gt;I said a lot of things, so I want to quickly summarize the architecture we have in place:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Third-party component reliability reviews&lt;/strong&gt;. I can't stress this enough. Don't just assume that you can utilize something. Sometimes, in order to achieve 5-nines, you actually have to remove components from your infrastructure. Some things just can't be utilized no matter what. Maybe you can move them into some sort of async background process, but they can't be on the critical path for your endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS failover and health checks.&lt;/strong&gt; For places where you have an individual region or availability zone or cluster, having a full backup with a way to conclusively determine what's up and automatically failover is critical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge compute where possible&lt;/strong&gt;. There's a whole network of services out there running on top of the cloud providers, which help guarantee your ability to run as close as possible to where your users are and reduce latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental rollout&lt;/strong&gt; for when you want to reduce the impact as much as possible.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Web Application Firewall&lt;/strong&gt; for handling those malicious requests.&lt;/li&gt;
&lt;li&gt;Having a &lt;strong&gt;Customer Support Focus&lt;/strong&gt; to enable escalating issues that are outside your area of detection.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And throughout the seven years or so that we've been doing this and building up this architecture, there are a couple of things that we've learned:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fychzurqaomgucefd9aiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fychzurqaomgucefd9aiu.png" alt="Unsolvable Problems at scale"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Murphy's Law​
&lt;/h3&gt;

&lt;p&gt;Everything fails all the time. There absolutely will be failures everywhere. Every line of code, every component you pull in, every library: there's guaranteed to be a problem in each and every one of those, and you will for sure have to deal with it at some point. So being prepared to handle that situation is something you have to think through in your design.&lt;/p&gt;

&lt;h3&gt;
  
  
  DNS​
&lt;/h3&gt;

&lt;p&gt;DNS. AWS will say it, everyone out there will say it, and now we get to say it too. The global DNS architecture is pretty good and reliable for a lot of scenarios, but I worry that it's still a single point of failure in a lot of ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure as Code (IAC)​
&lt;/h3&gt;

&lt;p&gt;The last thing is infrastructure-as-code challenges. We deploy primary regions, but then there are also the backup regions, which are slightly different from the primary regions, and then there is edge compute, which is, again, slightly different still. And then sometimes we do this ridiculous thing where we deploy infrastructure dedicated to one customer. In doing so, we're running some sort of IaC to deploy those resources.&lt;/p&gt;

&lt;p&gt;It is almost exactly the same architecture. Almost! Because it isn't exactly the same, there are plenty of opportunities for problems to sneak in. That's a challenge even with OpenTofu or CloudFormation, and often these tools make it more difficult, not less. And good luck to you if you're still using something else that hasn't been modernized. With those, it's even easier to run into problems and not get it exactly correct.&lt;/p&gt;

&lt;p&gt;The last thing I want to leave you with is: &lt;strong&gt;with all of these, is that actually sufficient to achieve five nines?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Our commitment is 5-nines, and what we do is in defense of that; just because you do all these things doesn't automatically mean your promise of 5-nines is guaranteed. And you know what, you too can promise a 5-nines SLA without doing anything. You'll likely break your promise, but for us our promise is important, and so this is our defense.&lt;/p&gt;




&lt;p&gt;For help understanding this article or how you can implement a solution like this one in your services, feel free to reach out to me and join my community:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>aws</category>
      <category>reliability</category>
      <category>architecture</category>
      <category>serverless</category>
    </item>
    <item>
      <title>AWS Auth Caching Strategies</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Tue, 17 Jun 2025 13:10:24 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-auth-caching-strategies-4121</link>
      <guid>https://dev.to/aws-builders/aws-auth-caching-strategies-4121</guid>
      <description>&lt;p&gt;Caching is difficult to get right and often means you need to pull additional frameworks into your code. Fine-tuning the balance between performance and data freshness takes time and experience. In the case of User-Agent integrations (for example, an application UI running in your user’s browser), it is even more crucial, as the User-Agent is rarely under your control and yet demands fast response times. This is why I often opt to provide cache recommendations for the service side. One such example is in the product I work heavily with—&lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That doesn’t mean you can’t cache returned values for longer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm going to use Authress as an example for caching, so a quick summary might make sense. Authress provides login and access control for the applications you write. This means permissions checks. (And yes, because we are a Swiss company, focusing on the EU market is critical.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, in the case that you’re making a lot of the same low-variability permission checks, you may want to build a cache on top of Authress to limit your costs. It is not strictly necessary, though. I'm going to walk through how AWS can be utilized to provide different caching opportunities when interacting with third-party services.&lt;/p&gt;

&lt;h2&gt;
  
  
  General caching strategies
&lt;/h2&gt;

&lt;p&gt;In the context of Authorization, frequently the goal is to cache &lt;a href="https://authress.io/knowledge-base/docs/category/authorization" rel="noopener noreferrer"&gt;Authorization Requests&lt;/a&gt; as much as is useful. The following strategies review the available possibilities. Let's assume that recommendations for cache times are always returned in the Cache-Control header in the response from API Authorization User Permission Requests.&lt;/p&gt;
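&lt;p&gt;Assuming the recommendation arrives as a standard &lt;code&gt;max-age&lt;/code&gt; directive, honoring it can be sketched as:&lt;/p&gt;

```javascript
// Sketch: extract the max-age directive (in seconds) from a
// Cache-Control response header; fall back to zero (no caching)
// when the directive is absent.
function getRecommendedCacheSeconds(cacheControlHeader) {
  const match = (cacheControlHeader || '').match(/max-age=(\d+)/i);
  return match ? Number(match[1]) : 0;
}
```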

&lt;h2&gt;
  
  
  A. API Gateway
&lt;/h2&gt;

&lt;p&gt;If you run an API Gateway, there is an automatic caching strategy that supports caching data for a short period of time. If data can be cached on a per-request basis, then adding details about the user's permissions and authorization into the cache is an option. This is known as "Caching Authorization checks in API Gateway".&lt;/p&gt;

&lt;p&gt;Depending on your API Gateway, this can work better for serverless solutions than for others. API Gateway caching uses the Access Token as the default cache key, which means you must add the &lt;code&gt;Resource URI Path&lt;/code&gt; and the &lt;code&gt;Request HTTP Method&lt;/code&gt; to the cache key to ensure a path-specific authorization is cached.&lt;/p&gt;

&lt;p&gt;The most common and effective cache examples include &lt;code&gt;A list of all the tenants&lt;/code&gt; or &lt;code&gt;customer accounts a user has access to&lt;/code&gt;. Since these lists rarely change, storing this information in the AWS API Gateway cache works well.&lt;/p&gt;

&lt;p&gt;Getting the list of tenants a user has access to in the API Gateway authorizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AuthressClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@authress/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AuthressClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;authressApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://auth.yourdomain.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userResources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt;  
  &lt;span class="nx"&gt;authressClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getUserResources&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`tenants/*`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;CollectionConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;TOP_LEVEL_ONLY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Stringify is because API does not support arrays.&lt;/span&gt;
    &lt;span class="na"&gt;userResources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userResources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Danger!
&lt;/h3&gt;

&lt;p&gt;I'm going to repeat this: &lt;strong&gt;You must ensure that the cache key associated with the API includes the HTTP Method and the full resource URI.&lt;/strong&gt; If you are not sure what this means please consult with your API Gateway documentation. In API Gateway, update the &lt;code&gt;Identity Source&lt;/code&gt; to include both the HTTP Method and the Path, which are both sourced from the context.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://dev.to/aws-builders/api-gateway-vulnerabilities-by-design-be-careful-2094"&gt;API Gateway configuration vulnerabilities&lt;/a&gt; for more information.&lt;/p&gt;
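&lt;p&gt;To make the danger concrete, a safe authorizer cache key conceptually combines all three values. A sketch (the function name and delimiter here are illustrative, not an API Gateway setting):&lt;/p&gt;

```javascript
// Sketch: a cache key for an authorizer decision must include the
// HTTP method and the resource path, not just the access token.
// Otherwise a decision cached for GET on one path could be replayed
// for DELETE on another.
function buildAuthorizerCacheKey(accessToken, httpMethod, resourcePath) {
  return [accessToken, httpMethod.toUpperCase(), resourcePath].join('|');
}
```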

&lt;h2&gt;
  
  
  B. Content Delivery Networks and Edge-based caching
&lt;/h2&gt;

&lt;p&gt;A CDN can often work to proxy all requests to a target provider. Instead of integrating directly with our API target of choice, you can proxy the requests through another solution that sits in front of your Auth provider. Some CDNs work well for this, others might not.&lt;/p&gt;

&lt;p&gt;In the case of AWS, the canonical solution would be using &lt;strong&gt;AWS CloudFront&lt;/strong&gt;. From the experience of my development team, using AWS CloudFront can be a bit finicky when putting CloudFront in front of other services that you don't own. Some of our users say that it has worked, others have run into limitations from CloudFront especially regarding cache times and configuration. Usually in these cases, you might need to use a Lambda@Edge function attached to your CloudFront to interact with the third party.&lt;/p&gt;

&lt;p&gt;Due to this, there might be limited value in the caching that CloudFront could provide. A common corner case I've found is considering this approach to reduce the costs incurred by calling that third-party API. Costs are of course relevant at scale, but at that same scale I tend to think about volume discounts instead, rather than forcing the use of, and therefore additionally paying for, the CDN above and beyond the third party.&lt;/p&gt;

&lt;p&gt;Take for example &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt;: as a company we would much prefer to offer a discount than force you to build complexity. You would get the benefit directly from Authress Billing without having to write or maintain anything yourself or pay for a second technology on top (in price or Total Cost of Ownership). If you are investigating a caching solution primarily to handle costs at scale, please contact your provider. If your provider won't offer alternatives to make your integration seamless, then that might not be a provider worth continuing with. Rather than trying to wrap a bad solution, find a better one!&lt;/p&gt;

&lt;p&gt;Once a request is passed to Lambda@Edge, you have full capability to store and retrieve data through different data stores, such as DynamoDB. The implementation details, though, would be up to you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Troubleshoot AWS CloudFront
&lt;/h3&gt;

&lt;p&gt;I do want to share a quick callout though. One possible error you might see is related to a &lt;a href="https://stackoverflow.com/questions/62811208/daisy-chained-cloudfront-with-host-header-forwarding" rel="noopener noreferrer"&gt;CloudFront stacking issue&lt;/a&gt;. Since Authress itself is using CloudFront, depending on your setup you might run into a stacking problem. At the moment, if you are seeing this issue, there isn't a way for CloudFront to be used in your scenario, so we recommend switching to Lambda@Edge with CloudFront and interacting with Authress through there. This is explored further in the next sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  C. Self-hosted internal proxy
&lt;/h2&gt;

&lt;p&gt;When you are at the point of wanting a proxy to cache authorization requests, a small microservice could be created to proxy all the requests to your provider. This could run as a standalone service. The proxy would pass requests along to Authress after interacting with your cache datastore.&lt;/p&gt;

&lt;p&gt;Hopefully the third party's SDKs support a configurable target endpoint. Instead of setting it to your &lt;a href="https://authress.io/knowledge-base/docs/introduction/getting-started-with-authress#custom-domains" rel="noopener noreferrer"&gt;Custom Domain&lt;/a&gt; such as &lt;a href="https://auth.yourdomain.com" rel="noopener noreferrer"&gt;https://auth.yourdomain.com&lt;/a&gt;, you would set the target endpoint to be your own microservice's URL.&lt;/p&gt;

&lt;p&gt;Proxy service for caching permissions requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AuthressClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@authress/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Switch this to be your cache's URL:&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AuthressClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;authressApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://cache.yourdomain.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;User&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`resources/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resourceId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;READ&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizeUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;UnauthorizedError&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For assistance with creating a proxy, I have to recommend reaching out to the provider with questions. Many products have secret fields and configurations in their SDKs (in the case of our own SDKs, we have additional security configuration in there), and attempting to side-step the SDK to build a custom caching layer will cause you to lose those optimizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  D. SDK configured caching
&lt;/h2&gt;

&lt;p&gt;Recently I've been investing further resources into improving built-in caching for our own SDKs, but in general, each SDK, for each language, for each provider, has varying levels of support for caching.&lt;/p&gt;

&lt;p&gt;Caching in the SDK works well for longer-lived containers. For sustained requests to your API, even with a serverless solution, your function will have this data cached for the lifetime of the container. This works great for balanced, predictable usage; it is less valuable for bursts. For non-serverless solutions, when caching is provided by the SDK in your language, it can work out of the box.&lt;/p&gt;

&lt;p&gt;Some SDKs support caching and caching configuration, and others do not. Whether it is supported depends on the tools available in the language as well as libraries supporting &lt;a href="https://en.wikipedia.org/wiki/Memoization" rel="noopener noreferrer"&gt;memoization&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  In-memory caching
&lt;/h2&gt;

&lt;p&gt;Depending on the sort of caching you are looking for and how your requests look, in-memory caching can often provide the best impact. It gives you full control over how caching is done. There are a bunch of options available, and which levers you pull will be based on your core needs.&lt;/p&gt;

&lt;p&gt;Long term, if the SDK you are using doesn't support the caching configuration you need and you have a solution you have been using effectively, please let us (or your provider) know, and hopefully they'll opt to convert your in-memory caching configuration into a first-class option in the SDK for that language. (Note: a company value of Customer Obsession may be required for this last part to work.)&lt;/p&gt;

&lt;p&gt;Here is an example of how such a cache could work:&lt;/p&gt;

&lt;p&gt;In-memory cache wrapper for javascript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AuthressClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@authress/sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AuthressClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;authressApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://auth.yourdomain.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// create a cache that stores the results for 10 seconds&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;User&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`resources/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resourceId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;READ&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// No value is cached&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizeUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;storeValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;UnauthorizedError&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;storeValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resourceUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;permission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;hasAccess&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
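&lt;p&gt;The &lt;code&gt;Cache&lt;/code&gt; above is not part of the SDK; a minimal in-memory implementation with a TTL might look like this (a sketch, not production code: it only evicts expired entries when they are read):&lt;/p&gt;

```javascript
// Minimal TTL cache sketch backing the example above. The key is the
// joined request parameters; expired or missing entries return null.
class Cache {
  constructor(ttlMilliseconds) {
    this.ttlMilliseconds = ttlMilliseconds;
    this.entries = new Map();
  }

  async getValue(...keyParts) {
    const key = keyParts.join('|');
    const entry = this.entries.get(key);
    if (!entry || Date.now() > entry.expiry) {
      this.entries.delete(key);
      return null;
    }
    return entry.value;
  }

  async storeValue(...keyParts) {
    // The last argument is the value to store; the rest form the key.
    const value = keyParts.pop();
    this.entries.set(keyParts.join('|'), {
      value,
      expiry: Date.now() + this.ttlMilliseconds
    });
  }
}
```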



&lt;h2&gt;
  
  
  Shared internal cache
&lt;/h2&gt;

&lt;p&gt;One strategy that works well with multiple services, when not using serverless or even sometimes when using serverless, is using a server optimized for fast cache lookups. That is, if you have multiple services that all need to interact with the same third party in the same way, and access to that third party isn't necessarily well-secured, or all your services use similar credentials for accessing that third party, you might benefit from a shared cache.&lt;/p&gt;

&lt;p&gt;Back to the authorization example: after an SDK returns a success for an authorization request, you could store the result in a cache-optimized solution. A recommendation for this strategy would be to use Valkey. Most cloud providers either offer a managed Valkey solution or support deploying the open source container to your infrastructure, and AWS is no exception:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/elasticache/what-is-valkey/" rel="noopener noreferrer"&gt;AWS ValKey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/elasticache/redis/" rel="noopener noreferrer"&gt;AWS ElastiCache&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
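&lt;p&gt;Regardless of the store you pick, the flow is the same check-then-populate pattern. A store-agnostic sketch (&lt;code&gt;store&lt;/code&gt; stands in for any Valkey or ElastiCache client exposing async &lt;code&gt;get&lt;/code&gt; and &lt;code&gt;set&lt;/code&gt;, and &lt;code&gt;authorize&lt;/code&gt; for the real permission check; both names are assumptions for illustration):&lt;/p&gt;

```javascript
// Shared-cache sketch: consult the shared store first, and only call
// the authorization provider on a miss. A short TTL keeps revoked
// permissions from lingering too long in the shared cache.
async function checkPermissionWithSharedCache(store, authorize, userId, resourceUri, permission) {
  const key = ['authz', userId, resourceUri, permission].join(':');
  const cached = await store.get(key);
  if (cached !== null) {
    return cached === 'allowed';
  }
  const allowed = await authorize(userId, resourceUri, permission);
  await store.set(key, allowed ? 'allowed' : 'denied', 10);
  return allowed;
}
```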

&lt;h2&gt;
  
  
  Further Caching Support
&lt;/h2&gt;

&lt;p&gt;Have some ideas that aren't listed here, and think I should extend this list? Please let me know so I can extend the recommended caching strategies in this article.&lt;/p&gt;

&lt;p&gt;For help understanding this article or how you can implement a solution like this one in your services, feel free to reach out to me and join my community:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>security</category>
      <category>aws</category>
      <category>cloud</category>
      <category>architecture</category>
    </item>
    <item>
      <title>API Gateway Authorizers: Vulnerable By Design (be careful!)</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 23 May 2025 08:51:49 +0000</pubDate>
      <link>https://dev.to/aws-builders/api-gateway-vulnerabilities-by-design-be-careful-2094</link>
      <guid>https://dev.to/aws-builders/api-gateway-vulnerabilities-by-design-be-careful-2094</guid>
      <description>&lt;p&gt;I had the benefit of joining the &lt;a href="https://www.awsug.ch/" rel="noopener noreferrer"&gt;AWS Community Day in Zürich&lt;/a&gt; this week. Most went as expected, but then an interesting question came up...&lt;code&gt;Does caching in API Gateway create vulnerabilities for products using Authorizer Caching?&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorization
&lt;/h2&gt;

&lt;p&gt;When your users call your API, you have an obvious need to verify that these requests should actually be allowed. I've talked extensively about this in my academy article on &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;what the @#!? is Auth&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Even if you haven't read that article, if you are well versed in the need for users to authenticate and authorize to your specific service API and endpoints, then you get the gist.&lt;/p&gt;

&lt;p&gt;So you have a need to verify the access tokens sent by users on every request. When using AWS this means using API Gateway, and when using API Gateway that likely means you'll be using an API Gateway Authorizer.&lt;/p&gt;

&lt;p&gt;Authorizers in API Gateway exist so that you can more easily verify user access tokens. As a reminder, an authorization token looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"identityProviderId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://authress.io"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TechInternals|test-user-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expires"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1761483600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"signatureKeyId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example-key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"signature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SflKxwRJSMeKKF2Qt4fwpMe"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the process to verify the token looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AuthressClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;authressApiUrl&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userIdentity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authressClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verifyToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, swap in your favorite open source JWT verifier. &lt;a href="https://authress.io/knowledge-base/docs/authentication/validating-jwts" rel="noopener noreferrer"&gt;More extensive details, depending on your identity provider, are available&lt;/a&gt;.&lt;/p&gt;
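
&lt;p&gt;For intuition, here's a minimal sketch of what such a verifier does under the hood. This is HS256-only and purely illustrative; use a maintained library in production:&lt;/p&gt;

```javascript
// Illustrative HS256-only JWT check; real verifiers also validate the
// header, algorithm, audience, issuer, and support asymmetric keys.
import { createHmac, timingSafeEqual } from 'node:crypto';

function verifyHs256(token, secret) {
  const [headerB64, payloadB64, signatureB64] = token.split('.');
  // Recompute the signature over header.payload and compare in constant time
  const expected = createHmac('sha256', secret)
    .update(`${headerB64}.${payloadB64}`)
    .digest('base64url');
  const actual = Buffer.from(signatureB64);
  const wanted = Buffer.from(expected);
  if (actual.length !== wanted.length || !timingSafeEqual(actual, wanted)) {
    throw new Error('Invalid signature');
  }
  const payload = JSON.parse(Buffer.from(payloadB64, 'base64url').toString('utf8'));
  if (payload.exp !== undefined) {
    if (Date.now() >= payload.exp * 1000) {
      throw new Error('Token expired');
    }
  }
  return payload;
}
```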

&lt;p&gt;Now I know what you are thinking&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I'm going to get a lot of requests from the same user to my same API, for different resources. That means they are all going to have the same JWT. Wouldn't it be great to cache those results so that I don't need to verify the same JWT over and over again every time the same user makes a similar request with the same JWT?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And you would be right!&lt;/p&gt;

&lt;h2&gt;
  
  
  Caching
&lt;/h2&gt;

&lt;p&gt;However, if you wrote the above code and cached its result, you might start to see a problem with it...&lt;/p&gt;

&lt;p&gt;Caching in API Gateway is, by default, keyed on the authorization token and nothing else. This means that the result from one request will interfere with the next one.&lt;/p&gt;

&lt;p&gt;Let's take for example the policy result from an AWS API Gateway Authorizer. It might look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;principalId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userIdentity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;policyDocument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2012-10-17&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;execute-api:Invoke&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;methodArn&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;principalId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userIdentity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a problem with this, however. The cache key by default is only the JWT, but this policy says the user is only allowed to invoke one particular &lt;code&gt;event.methodArn&lt;/code&gt;. As a reminder, a method ARN identifies a single method and resource, such as &lt;code&gt;GET /orders/order_id_123&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That means on a followup request with the same JWT to a different endpoint &lt;code&gt;GET /orders/order_id_456&lt;/code&gt;, even if the user should have access to that resource and their JWT is still valid, API Gateway will deny that request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, that's simple: the result is cached based only on the JWT. The cached result specifies that only the one route &lt;code&gt;GET /orders/order_id_123&lt;/code&gt; has been authorized.&lt;/p&gt;

&lt;p&gt;Worst case scenario, you have a short cache time, and the only thing that happens is a brief but confusing user experience that quickly resolves to the correct behavior.&lt;/p&gt;

&lt;p&gt;But you are smart; you realize there is a fix: instead of passing the &lt;code&gt;event.methodArn&lt;/code&gt; as the resource in the result policy, you specify &lt;code&gt;['arn:aws:execute-api:*:*:*']&lt;/code&gt; instead.&lt;/p&gt;
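
&lt;p&gt;Put together, the updated policy would look like this. This is a sketch; &lt;code&gt;userIdentity&lt;/code&gt; here stands in for the claims from the verified token:&lt;/p&gt;

```javascript
// Sketch of the wildcard policy: the cached result now applies to every
// endpoint on the API, so a JWT-only cache key is safe for JWT validity.
const userIdentity = { sub: 'TechInternals|test-user-001' }; // from the verified token

const policy = {
  principalId: userIdentity.sub,
  policyDocument: {
    Version: '2012-10-17',
    Statement: [{
      Effect: 'Allow',
      Action: 'execute-api:Invoke',
      Resource: ['arn:aws:execute-api:*:*:*'] // wildcard instead of event.methodArn
    }]
  },
  context: {
    principalId: userIdentity.sub
  }
};
```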

&lt;p&gt;Now subsequent requests, as long as the JWT is still valid and irrespective of the endpoint, will let the user through!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎉🎉🎉&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And this works great.&lt;/p&gt;

&lt;p&gt;But you are thinking: why stop there? Can we go further?&lt;/p&gt;

&lt;p&gt;And the answer is also yes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authorization with Granular Resource-Based Access Control
&lt;/h2&gt;

&lt;p&gt;You might be using solutions such as &lt;a href="https://aws.amazon.com/verified-permissions/" rel="noopener noreferrer"&gt;AWS Verified Permissions&lt;/a&gt; hoping to connect it together with Cognito and API Gateway.&lt;/p&gt;

&lt;p&gt;Now I know what you are thinking, why is Warren investigating verified permissions when &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; already solves all these problems? Well sometimes even I have to write an article about how the integration of default resources in AWS can cause security misconfigurations.&lt;/p&gt;

&lt;p&gt;Your decision: not just cache the validity of the JWT, but also cache whether or not the user actually has access to call the endpoint in question. If you take that additional step of verifying the user's authorization and cache the result too, you will have just created a major security vulnerability in your application.&lt;/p&gt;

&lt;p&gt;Do you already see what the problem might be?&lt;/p&gt;

&lt;p&gt;In your authorizer you are likely to write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizeUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;resource&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;resource:read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are checking the user's access inside the authorizer and it is cached, then subsequent requests to the same API will utilize the cached result.&lt;/p&gt;

&lt;p&gt;If the user has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access to &lt;code&gt;orders_123&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;No Access to &lt;code&gt;orders_456&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then calls&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GET &lt;code&gt;orders_123&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;GET &lt;code&gt;orders_456&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;They will incorrectly be allowed to access that second order.&lt;/p&gt;

&lt;p&gt;That's because the authorizer will have returned &lt;code&gt;ALLOW&lt;/code&gt; for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hasAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;authress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userPermissions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorizeUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;orders_123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;orders:read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;ALLOW&lt;/code&gt; is set as the cache result for the user's JWT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;JWT_001 =&amp;gt; ALLOW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cache doesn't contain the orderId. Or, said differently, the cache is &lt;strong&gt;NOT&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[JWT_001, GET, orders_123] =&amp;gt; ALLOW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means when the second request comes in, we go to the cache table, see that a cache entry already exists for &lt;code&gt;JWT_001&lt;/code&gt;, return &lt;code&gt;ALLOW&lt;/code&gt;, and never actually check the authorization for &lt;code&gt;orders_456&lt;/code&gt;.&lt;/p&gt;
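
&lt;p&gt;The broken behavior reduces to a few lines. This is an illustrative in-memory model of the cache lookup, not API Gateway's actual implementation:&lt;/p&gt;

```javascript
// A cache keyed on the JWT alone replays the first decision for every
// later request, regardless of which resource is being requested.
const cache = new Map();

async function naiveAuthorizer(jwt, resource, checkAccess) {
  if (cache.has(jwt)) {
    return cache.get(jwt); // the resource is ignored on a cache hit!
  }
  const decision = (await checkAccess(resource)) ? 'ALLOW' : 'DENY';
  cache.set(jwt, decision);
  return decision;
}
```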

&lt;h2&gt;
  
  
  Removing the security vulnerability
&lt;/h2&gt;

&lt;p&gt;It would be nice if API Gateway were secure by default and required the &lt;code&gt;identity source&lt;/code&gt; cache key to include the resource path and method. But it isn't, so it doesn't. This risk is similar to the ones engineers run into all day long with caching in CloudFront. And given how frequently caching causes issues in CloudFront, which has no comparable security vulnerability, we can see that when AWS created the Verified Permissions service and its related functionality, it opened up a huge potential security misconfiguration in API Gateway.&lt;/p&gt;

&lt;p&gt;This isn't an explicit vulnerability in the service, though, since the vulnerability only exists under improper configuration; the trouble is that the improper configuration is the default. Show me a company using API Gateway and AWS Verified Permissions, and I bet I can show you a security bounty waiting to be collected.&lt;/p&gt;

&lt;p&gt;The resolution here is to configure the API Gateway Authorizer to also key its cache on &lt;code&gt;httpMethod (Context)&lt;/code&gt; and &lt;code&gt;path (Context)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6phmsn7yjnu5657fxa2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6phmsn7yjnu5657fxa2.png" alt="API Gateway Authorizer expected configuration" width="800" height="1265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once that is done, API Gateway closes this security hole, because the cache key matches the inputs of the authorization check performed by your authorization provider.&lt;/p&gt;
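
&lt;p&gt;In the same illustrative in-memory model, the fix amounts to keying the cache on every input the authorization decision depends on, not on the JWT alone:&lt;/p&gt;

```javascript
// Keying on [jwt, httpMethod, path] means a request for a different
// resource can no longer hit a cached decision made for another one.
const cache = new Map();

async function fixedAuthorizer(jwt, httpMethod, path, checkAccess) {
  const cacheKey = JSON.stringify([jwt, httpMethod, path]);
  if (cache.has(cacheKey)) {
    return cache.get(cacheKey);
  }
  const decision = (await checkAccess(httpMethod, path)) ? 'ALLOW' : 'DENY';
  cache.set(cacheKey, decision);
  return decision;
}
```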

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;On your side there is little you can do to remove this pit of failure. Review the documentation, and invest in a deep understanding of the tools you use, especially when security is involved. I guess also keep reading my posts, as I often try to focus on security-related topics.&lt;/p&gt;

&lt;p&gt;On the AWS side, there is absolutely a strategy that would have fixed this by design. The authorizer should not have access to the Path and Method properties of the HTTP request unless the identity source cache key includes them. This would require breaking existing configurations, but it would be in the name of security by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going further
&lt;/h2&gt;

&lt;p&gt;There are actually lots of different ways to cache permission results in AWS, even without using Verified Permissions. For an extensive list of the options and my personal recommendations, check out this &lt;a href="https://authress.io/knowledge-base/docs/advanced/caching" rel="noopener noreferrer"&gt;Auth Academy article&lt;/a&gt; on the topic.&lt;/p&gt;




&lt;p&gt;Come join my &lt;a href="https://authress.io/community/" rel="noopener noreferrer"&gt;Community&lt;/a&gt; and discuss this and other security related topics!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>security</category>
      <category>api</category>
      <category>authentication</category>
    </item>
    <item>
      <title>The Blog Post Release Automation</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Mon, 19 May 2025 13:46:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/the-blog-post-release-automation-3kbd</link>
      <guid>https://dev.to/aws-builders/the-blog-post-release-automation-3kbd</guid>
      <description>&lt;h2&gt;
  
  
  The Blog Post Release Automation
&lt;/h2&gt;

&lt;p&gt;I made the mistake this week of deciding to automate, using an LLM of course, some parts of the painful podcast release cycle.&lt;/p&gt;

&lt;p&gt;Every week I record episodes of the podcast &lt;a href="https://adventuresindevops.com" rel="noopener noreferrer"&gt;Adventures in DevOps&lt;/a&gt; with my awesome co-host. Of course, all the episodes are available on our podcast website as well as on other streaming platforms.&lt;/p&gt;

&lt;p&gt;But! Since we're a technical podcast, we decided to make our infrastructure open source (on &lt;a href="https://github.com/AdventuresInDevops/Website" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, unfortunately), and to go further, it also uses &lt;a href="https://github.com/AdventuresInDevops/Website/blob/main/.github/workflows/build.yml" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt; to publish the &lt;a href="https://github.com/AdventuresInDevops/Website/blob/main/.github/workflows/build.yml" rel="noopener noreferrer"&gt;episodes to our website&lt;/a&gt;. There is of course the nasty bit of actually recording the episodes, editing them, and then downloading and formatting them to make them nice.&lt;/p&gt;

&lt;p&gt;After that is all done though, it is time to create the episode page and, most importantly, the cornerstone of every podcast, &lt;strong&gt;an awesome episode image&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So let's get down to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Execution
&lt;/h2&gt;

&lt;p&gt;Interestingly enough, the Nova Lite model failed completely when I asked it to build the command I needed to execute the model itself. Not very self-aware, you might say.&lt;/p&gt;

&lt;p&gt;However, using other models, I was able to coax out the following recommendation:&lt;/p&gt;

&lt;p&gt;With the episode saved in the transcript.txt file, and the instructions we want to run in the instructions.txt file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cp"&gt;#!/usr/bin/env node
&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BedrockRuntimeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;InvokeModelCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/client-bedrock-runtime&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fs/promises&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;fileURLToPath&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Resolve file paths&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;__dirname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;fileURLToPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;instructionsPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;instructions.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transcriptPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;transcript.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Set up Bedrock client&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BedrockRuntimeClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;eu-west-1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Read both input files&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;instructionsPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transcriptPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="c1"&gt;// Build prompt&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`Instructions:\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\nTranscript:\n---\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n---`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;content&lt;/span&gt;
      &lt;span class="p"&gt;}],&lt;/span&gt;
      &lt;span class="c1"&gt;// Max Token Count and other parameters: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-titan-text.html&lt;/span&gt;
      &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;top_p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="c1"&gt;// Invoke the model&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;InvokeModelCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;modelId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Decode and print response&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responseBody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transformToString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;✅ Model response:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;❌ Failed to invoke model:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it: we can take the output, create a pull request, and then release the episode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;Of course nothing works the first time, and for us the first issue is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed to invoke model: ValidationException: Invocation of model ID amazon.nova-lite-v1:0 with on-demand throughput isn’t supported. Retry your request with the ID or ARN of an inference profile that contains this mode.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, it turns out there is some magic required to run the Nova model in other regions, so instead of trying to get that to work, we'll switch to the &lt;code&gt;us-east-1&lt;/code&gt; region.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Malformed input request: #: required key [messages] not found, please reformat your input and try again.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hmmm, weird. It turns out there have been some serious changes to the API, for which the documentation is not really up to date. So figuring out the correct parameters is actually a bit of a problem.&lt;/p&gt;

&lt;p&gt;But setting the payload as just:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
          &lt;span class="c1"&gt;// type: "text",&lt;/span&gt;
          &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;inputText&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="p"&gt;}]&lt;/span&gt;
      &lt;span class="c1"&gt;// Max Token Count and other parameters: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-titan-text.html&lt;/span&gt;
      &lt;span class="c1"&gt;// temperature: 0.7&lt;/span&gt;
      &lt;span class="c1"&gt;// top_p: 0.9,&lt;/span&gt;
      &lt;span class="c1"&gt;// max_tokens: 4096&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solves most of this problem.&lt;/p&gt;
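
&lt;p&gt;Wrapped up as a helper, the request body that finally works is just the messages schema with a single user turn. A minimal sketch based on the payload above, with the optional inference parameters left out exactly as we ran it:&lt;/p&gt;

```javascript
// Builds the minimal Nova request body that got past the validation error:
// a messages array with one user turn containing plain text.
function buildNovaPayload(inputText) {
  return {
    messages: [
      { role: 'user', content: [{ text: inputText }] }
    ]
  };
}
```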

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz9awqlw574tpln2ag2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzz9awqlw574tpln2ag2y.png" alt="Nova Blocks itself" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although one problem we keep running into is that the Nova model's content filter keeps blocking itself. Even sending the very innocuous "hey" to the model to generate three images fails after the first one.&lt;/p&gt;

&lt;p&gt;Success!?&lt;/p&gt;

&lt;h2&gt;
  
  
  The podcast image
&lt;/h2&gt;

&lt;p&gt;The next step is to run the generator a second time, but this time using the output from the first step as the input to generate an image relevant to the podcast.&lt;/p&gt;

&lt;p&gt;There are a couple of changes that have to be made.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We don't need the transcript anymore since we already have a summary.&lt;/li&gt;
&lt;li&gt;We need to pass an input image; we don't want some random picture, we want something that is brand-aware.&lt;/li&gt;
&lt;li&gt;The output will be an image as well.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So instead we'll use the Nova Canvas model: &lt;code&gt;amazon.nova-canvas-v1:0&lt;/code&gt; with the parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;referenceImage1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;referenceImagePath1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;referenceImage2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;referenceImagePath2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;inputText&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jpeg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                  &lt;span class="na"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;referenceImage1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
              &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                  &lt;span class="na"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;referenceImage2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
              &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we can write out the results using the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imageData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`image.png`&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imageData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;Well, I think pictures are way more expressive than words, so check out the latest episode here on &lt;a href="https://adventuresindevops.com/episodes" rel="noopener noreferrer"&gt;Adventures in DevOps&lt;/a&gt; to see exactly how well we did!&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Verdict
&lt;/h2&gt;

&lt;p&gt;Nova is not ready for prime time. For now, we are going to try out some of the other models offered through Bedrock and focus on getting more high-quality content. Quality and reliability are crucial here, as we aim to cut down the time it takes to create the episode releases.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>aws</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 24 Jan 2025 17:59:44 +0000</pubDate>
      <link>https://dev.to/wparad/-44dn</link>
      <guid>https://dev.to/wparad/-44dn</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/authress" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2625%2F18c2bb45-3a91-4fc8-86a0-3006f2b6b93a.png" alt="Authress Engineering Blog" width="512" height="512"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F86409%2Fad0e5c54-e76f-4fd9-864e-f04b266ab62f.jpg" alt="" width="800" height="800"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/authress/the-risks-of-user-impersonation-58nf" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;The Risks of User Impersonation&lt;/h2&gt;
      &lt;h3&gt;Warren Parad for Authress Engineering Blog ・ Jan 24&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#authentication&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#authorization&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#identity&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#security&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>webdev</category>
      <category>security</category>
      <category>api</category>
    </item>
    <item>
      <title>The Risks of User Impersonation</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Fri, 24 Jan 2025 17:58:49 +0000</pubDate>
      <link>https://dev.to/authress/the-risks-of-user-impersonation-58nf</link>
      <guid>https://dev.to/authress/the-risks-of-user-impersonation-58nf</guid>
      <description>&lt;h2&gt;
  
  
  What is user impersonation?
&lt;/h2&gt;

&lt;p&gt;User impersonation is anything that allows your systems to believe the currently logged-in user is someone else. With regards to JWTs and access tokens, this means that one user obtains a JWT that contains another user's &lt;code&gt;User ID&lt;/code&gt;. User impersonation, or logging in as a customer, can be used as a tool to help identify many issues, from user authentication and onboarding to corrupted data in complex multi-service business logic flows.&lt;/p&gt;

&lt;p&gt;However, at first glance it should be obvious that there are major security implications with such an approach. Even if it isn't, this article will extensively review user impersonation and its security implications, as well as offer alternative suggestions to achieve a similar outcome in a software system without compromising security.&lt;/p&gt;

&lt;h2&gt;
  
  
  The impersonation use cases
&lt;/h2&gt;

&lt;p&gt;No solution is relevant in a vacuum, so let's consider the concrete issues that you might actually have, and the reason you've arrived at this &lt;a href="https://authress.io/knowledge-base/academy/topics" rel="noopener noreferrer"&gt;Authress Academy&lt;/a&gt; article. If we were to jump straight into a solution, we would definitely end up sacrificing security or, worse, our users' sensitive data in favor of suboptimal solutions.&lt;/p&gt;

&lt;p&gt;Possible use case user stories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One of your users reports that they are experiencing an issue with a screen in your application portal not showing the correct information. As a support engineer, you want to review the exact display in the application UI that your user sees, so that you can verify the UI is indeed broken and something is actually going wrong.&lt;/li&gt;
&lt;li&gt;Similar to the above, you want to know whether the display issue is the result of a problem with the UI itself or with the data the application UI is fetching, hence a service API issue.&lt;/li&gt;
&lt;li&gt;Sometimes it is a problem with a complex API server flow. A click in your application portal was expected to perform a data change, transformation, or API request to your backend services, but it may not have been sent with the appropriate data. As a product engineer, you would like to verify that the correct data is being sent in the request to your service API.&lt;/li&gt;
&lt;li&gt;As a system admin, multiple third-party systems are interacting with each other and something™ isn't working, and because you are a great collaborator, even though it isn't your responsibility, you want to help out your customers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, this list isn't exhaustive, but focusing on the concrete problems, you can already start to see that while user impersonation might be useful, none of these actually require it to debug. The root causes often fall into at least one of these categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This is a UI component display issue.&lt;/li&gt;
&lt;li&gt;An unexpected request is being sent or isn't sent to your service API from your application portal.&lt;/li&gt;
&lt;li&gt;The wrong data is being sent in the request from your application UI to your API.&lt;/li&gt;
&lt;li&gt;It is a &lt;code&gt;READ&lt;/code&gt; permissions data issue for the user.&lt;/li&gt;
&lt;li&gt;It is a &lt;code&gt;WRITE&lt;/code&gt; permissions data issue for the user.&lt;/li&gt;
&lt;li&gt;It is a multi-system problem and not an access issue, and having a duplicated environment that exactly matches current production is what you need to continue debugging.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note: none of these even comes close to needing user impersonation; they each have straightforward alternatives that are both secure and frequently simpler to implement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported libraries
&lt;/h2&gt;

&lt;p&gt;Fundamentally, &lt;strong&gt;user impersonation&lt;/strong&gt; is insecure by design; we'll see why in a moment. There are much better ways to provide insight into your specific scenario that actually take security into account. But let's assume that we do implement user impersonation. Is there help available for us by utilizing support from our favorite overengineered solution?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ankane/pretender" rel="noopener noreferrer"&gt;Ruby - Rails pretender&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/django-hijack/django-hijack" rel="noopener noreferrer"&gt;Python - Django hijack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/express-user-impersonation" rel="noopener noreferrer"&gt;Nodejs - Express/Passport impersonate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insert your favorite monolithic HTTP Framework here&lt;/strong&gt; ➤ Deprecated Solution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What's interesting is that in doing the research to actually find existing implementations, 86% of the repos and links I found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No longer exist, and haven't existed for quite some time&lt;/li&gt;
&lt;li&gt;Were archived over 5 years ago&lt;/li&gt;
&lt;li&gt;Have fewer than 10 stars on GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if people are trying to make this happen, the tools don't even exist to ensure that we are doing it correctly and safely. The results of this search tell us something. Even more surprising is that most of the Auth SaaS solutions don't offer this either. As it turns out, either no one really cares that much, or it is next to impossible to get it right such that no solution can exist. Well, that can't be right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangers of user impersonation
&lt;/h2&gt;

&lt;p&gt;Let's assume for a moment that the collective wisdom is correct, and no solutions exist because it is dangerous. What exactly are those dangers? To help convey these issues, say that we managed to get one of the legacy packages above actually working with our system. The first problem that we'll run into is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who actually has access to perform this User Impersonation in the first place? Who are our admins?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. Defining the admins
&lt;/h3&gt;

&lt;p&gt;Of course, allowing everyone to impersonate one another basically means our authentication provides no value. We might as well let users enter whatever username they like on every post they make. Realistically, we want to restrict this list to those for whom it actually makes sense to have the ultimate &lt;code&gt;su&lt;/code&gt; privilege.&lt;/p&gt;

&lt;p&gt;Figuring out who the admins should be, and maintaining access to that closely guarded endpoint that grants user impersonation, is a common problem that eludes even the most sophisticated companies. The most notorious examples of getting this wrong were the &lt;a href="https://en.wikipedia.org/wiki/2020_Twitter_account_hijacking" rel="noopener noreferrer"&gt;Twitter 2020 admin tools hack&lt;/a&gt; and the &lt;a href="https://msrc.microsoft.com/blog/2023/07/microsoft-mitigates-china-based-threat-actor-storm-0558-targeting-of-customer-email/" rel="noopener noreferrer"&gt;Microsoft Storm-0558&lt;/a&gt; breaches. Attackers were able to compromise admin-level account tools and use them to steal and impersonate actual users. Historically, one of these companies had paid significant attention to its own internal security, was, if not the first, one of the first to introduce the notion of public social logins, and was no stranger to the issues at hand; the other was Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Maintaining both the admin list, and correctly securing the endpoint to allow impersonation in the first place.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The implementation
&lt;/h3&gt;

&lt;p&gt;The next issue regarding impersonation becomes transparent when we start to question how it can even work in practice. &lt;em&gt;In theory, practice is the same as theory, in practice it is not.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once an admin is authorized to impersonate a user, what exactly is happening in our platform? Let's flash back to &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;Authentication&lt;/a&gt;. In order to secure your system, to ensure the right users have access to the right data at the right time, your users must send a session cookie or session token on every request, from which your API can verify that the user is logged in. This could be a completely opaque GUID that represents some data in your database (a reference token) or a more secure, stateless JWT. In any case, your system identifies users via your &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;Authentication Strategy&lt;/a&gt;, and at the end of the day identification comes down to a single property in a single object somewhere. An example could be the JWT &lt;code&gt;subject claim&lt;/code&gt; property:&lt;/p&gt;

&lt;p&gt;User user_001 JWT access token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In OAuth/OpenID, the &lt;code&gt;sub&lt;/code&gt; claim in a JWT represents the &lt;strong&gt;User ID&lt;/strong&gt;. Thus this particular token represents a verified user with the identity &lt;code&gt;user_001&lt;/code&gt;. Anyone that holds this token now has the ability to impersonate this user. Hopefully, you have some logging in place to identify when a user is being impersonated and who actually started the impersonation process. But how do we actually impersonate this user?&lt;/p&gt;
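
&lt;p&gt;To make that concrete, here is a sketch of pulling the &lt;code&gt;sub&lt;/code&gt; out of a JWT without verifying it. It shows why whoever mints the token controls the identity; in a real service, signature verification is mandatory before trusting any claim:&lt;/p&gt;

```javascript
// Decodes the middle (payload) segment of a JWT. No signature check is done
// here -- this only illustrates where the identity lives inside the token.
function decodeJwtPayload(token) {
  const payloadSegment = token.split('.')[1];
  const json = Buffer.from(payloadSegment, 'base64url').toString('utf8');
  return JSON.parse(json);
}

// decodeJwtPayload(accessToken).sub is the User ID your API would trust.
```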

&lt;p&gt;Well, of course, I need to convert a token that represents my admin user into a token that represents the user I want to impersonate. This is an example of the token that I have right now.&lt;/p&gt;

&lt;p&gt;My admin user token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;me_admin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since our system in this scenario uses the &lt;code&gt;sub&lt;/code&gt; property to determine which user is accessing the system, I need a token that replaces the current value of the &lt;code&gt;sub&lt;/code&gt;, which is &lt;code&gt;me_admin&lt;/code&gt; for me, with one that contains the &lt;code&gt;sub&lt;/code&gt; of &lt;code&gt;user_001&lt;/code&gt;. So when I impersonate the user, the result &lt;strong&gt;must be a token&lt;/strong&gt; that looks exactly like the user token:&lt;/p&gt;

&lt;p&gt;User token generated by the admin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some http/auth frameworks have thought a whole two seconds longer than the rest and might have decided to add an additional property to indicate that the token was created through the process of impersonation by an admin, instead of directly by the user:&lt;/p&gt;

&lt;p&gt;User token generated by the admin with magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://login.authress.io&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sub&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user_001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="c1"&gt;// highlight&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;generated_by&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;me_admin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;// highlight&lt;/span&gt;

        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;iat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685021390&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;exp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1685107790&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;scope&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openid profile email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this might even seem like a good idea; however, in practice it creates a &lt;a href="https://authress.io/knowledge-base/articles/2025/01/03/bliss-security-framework" rel="noopener noreferrer"&gt;Pit of Failure&lt;/a&gt;. Enabling an admin to create new tokens that contain another user's identity causes two distinct problems.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The first issue is that one admin user can impersonate another admin user, and that second admin user might be one that potentially has more access and is authorized for more sensitive information. This means that it isn't so straightforward to just add in impersonation and assume that everything will work out. Our &lt;strong&gt;List of Admins&lt;/strong&gt; can no longer just be a list of admins; it now must also contain some hierarchical order of who can impersonate whom. If you've been following along, this looks a lot like what &lt;a href="https://authress.io/knowledge-base/docs/category/authorization" rel="noopener noreferrer"&gt;Authress Authorization&lt;/a&gt; provides. Of course you don't absolutely have to have that, but if you don't, then you've sacrificed some security.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second issue is that not every application you have might be interested in allowing users to be impersonated. Any mature system, and even most early software ventures, has some data that you are even less interested in exposing than the rest. Sensitive-by-nature or regulated data fits this picture. This could be Personally Identifiable Information (PII), credit cards (PCI-DSS), or really anything that has been regulated in your locality by its governing bodies. You might breach this through user impersonation if, for instance, your support engineer is in a different &lt;a href="https://authress.io/knowledge-base/docs/authentication/user-authentication/selecting-data-residencies" rel="noopener noreferrer"&gt;Data Residency&lt;/a&gt; than the user. For example, when attempting to debug issues in a UI, the &lt;strong&gt;Date of Birth (DOB)&lt;/strong&gt; of the user is almost never absolutely necessary to show on the screen. Sure, it is relevant in some user use cases, but in most debugging scenarios it is not.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;If your authentication depends on the property &lt;code&gt;sub&lt;/code&gt; in the JWT, then an application cannot opt out of user impersonation. Since you are changing the &lt;code&gt;sub&lt;/code&gt; to be the impersonated user, every application will see the new &lt;code&gt;sub&lt;/code&gt; value, even if they do not want to support user impersonation. &lt;strong&gt;Strike 1.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;All applications are forcibly opted in. If an application wants to opt out, then the second claim &lt;code&gt;generated_by&lt;/code&gt; or its respective implementation is required. But still, all applications start opted in. That means when you design a new application, you have to remember that you might want to block admins from accessing user data in this application: "data is insecure by default, unless explicitly designed otherwise". This is the pit of failure; a pit of success would be opt-in: data is secure by default, unless otherwise excluded. &lt;strong&gt;Strike 2.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
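To make the opt-in problem concrete, here is a minimal sketch of why overwriting `sub` opts every service in. The claim names follow the text above; the sessions and IDs are illustrative assumptions:

```javascript
// Sketch: a typical JWT-consuming service resolves identity from the
// standard `sub` claim. If the auth server rewrites `sub` during
// impersonation, this service cannot tell an impersonated session from a
// real one -- it is opted in whether it wants to be or not.
function getEffectiveUserId(claims) {
  // `claims` is the verified JWT payload; downstream checks key off `sub`.
  return claims.sub;
}

const normalSession = { sub: 'user_123' };
// Impersonation rewrote `sub`; `generated_by` records the real admin.
const impersonatedSession = { sub: 'user_123', generated_by: 'admin_456' };

// Both resolve to the same user: the service would have to know about the
// non-standard `generated_by` claim in order to opt out.
console.log(getEffectiveUserId(normalSession));       // user_123
console.log(getEffectiveUserId(impersonatedSession)); // user_123
```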

&lt;blockquote&gt;
&lt;p&gt;A quick call-out is worthwhile on how to secure data like a user's DOB. UIs don't need to know this information in most cases. The screens and activities where the DOB is valuable actually care that the user &lt;code&gt;isBornInJanuary&lt;/code&gt; or &lt;code&gt;isOlderThan18&lt;/code&gt;, not about the actual date of birth of the user. Unless of course this is the user's DOB selection screen, in which case this component rarely needs to be validated by a support engineer; and if you believe that user impersonation is necessary to help validate the user's DOB entry screen, this article isn't going to be of any help to you.&lt;/p&gt;
&lt;/blockquote&gt;
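The derived-claims idea in that call-out can be sketched as a pure function. The claim names come from the call-out; the computation itself is an assumed illustration, not a prescribed API:

```javascript
// Sketch: expose derived booleans to the UI instead of the raw DOB.
// UTC accessors are used so the result does not depend on server timezone.
function deriveDobClaims(dateOfBirth, now = new Date()) {
  const eighteenthBirthday = new Date(dateOfBirth.getTime());
  eighteenthBirthday.setUTCFullYear(eighteenthBirthday.getUTCFullYear() + 18);
  return {
    // getUTCMonth() is zero-based, so January is 0.
    isBornInJanuary: dateOfBirth.getUTCMonth() === 0,
    // true once the 18th birthday has passed
    isOlderThan18: eighteenthBirthday <= now
  };
}

// The UI or support tool receives only the booleans, never the DOB itself.
const claims = deriveDobClaims(new Date('1990-01-15'), new Date('2025-06-01'));
console.log(claims); // { isBornInJanuary: true, isOlderThan18: true }
```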

&lt;h3&gt;
  
  
  3. Secondary system data leakage
&lt;/h3&gt;

&lt;p&gt;Not only do we need to worry about vulnerabilities in our primary user applications and about leaking the data associated with them, now we also need to worry about protecting the secondary systems used to impersonate users AND the data associated with them. Internal systems, by their very design, usually end up having worse security measures in place because fewer people use them. Fewer users and lower volume mean less attention is given to such an app, which in turn makes it easier to hack. In practice, these applications are rarely changed but frequently break, and most importantly have low priority when it comes to innovation and implementing necessary improvements. They don't end up in your OKR objectives for this quarter, and no one is getting promoted over them.&lt;/p&gt;

&lt;p&gt;We are so concerned that someone is abusing these tools that we ourselves leak user access tokens and data to logging systems. We log so zealously, to ensure we have captured the usage of these tools, that we end up logging that which we should not. And when we log, that means we've probably also exported these logs to some third-party reporting tools. It is a Catch-22: we know we need to log and report on actions taken as an admin impersonating a user, yet that very logging captures data we would not normally be logging. The goal of preventing security issues creates a new attack surface.&lt;/p&gt;

&lt;p&gt;The result is that these systems will likely end up logging usage of user tokens. That's the introduction of a new attack surface, and because fixing them is such a low priority, these systems are actually &lt;strong&gt;twice as likely to leak user data&lt;/strong&gt; compared to our primary user applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Corrupted audit trails
&lt;/h3&gt;

&lt;p&gt;Frequently we can conclude a priori that user impersonation is actually wrong. In debugging scenarios, the last thing you want is the ability to modify the user's data. If you actually needed to modify a user's private data, or one of your customer's account information, you would definitely want a dedicated system to handle that. This means you actually don't want to fully impersonate the user; you just want to see what they see, with the explicit caveat of &lt;strong&gt;read-only permissions&lt;/strong&gt;, without the ability to modify their data. Accidentally modifying user data is guaranteed to happen if the only way to verify a user-facing UX problem is to completely impersonate a user and get full write access to their account.&lt;/p&gt;

&lt;p&gt;Without even thinking hard about it, we can list the following issues associated with impersonating the user in this context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit trails incorrectly say the user changed data when they did not. ➤ An admin impersonating the user did it.&lt;/li&gt;
&lt;li&gt;The user's sessions may start to include the one generated by the admin. ➤ To say a user would be concerned to see a session on a sensitive account, modifying data, from a location they have never been in, would be an understatement.&lt;/li&gt;
&lt;li&gt;Logging data in the applications is incorrectly recorded, or may not be recorded at all. ➤ You may be tempted to hide these admin interactions.&lt;/li&gt;
&lt;li&gt;And lastly, in every case, we now need to alter our systems to be aware not only of how to process data affected by impersonation, but also of how to log it. ➤ Impersonation is a virus that starts to infect all of our systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The practical-ish solutions
&lt;/h2&gt;

&lt;p&gt;If generating a new token that contains the impersonated &lt;strong&gt;User ID&lt;/strong&gt; is so bad, there must be better solutions out there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution A: Additional token claim property
&lt;/h3&gt;

&lt;p&gt;What if we don't change the subject &lt;code&gt;sub&lt;/code&gt; claim, but instead add a new claim? That way, only the services that understand this claim, and actually want to use it, would choose to use it. Services that don't know about it keep using the unmodified &lt;code&gt;sub&lt;/code&gt; claim. Admins would still look like admins. Only services that care about a new &lt;code&gt;adminIsImpersonatingUserId&lt;/code&gt; claim property would know to use it and how to handle it. This would give you security by default, and only expose the services that have already explicitly designed support for it. You would have to opt in. Success, finally!&lt;/p&gt;

&lt;p&gt;Theoretically this is great, and while it is a bit more secure than altering the subject, in practice, we start to write code that looks like this:&lt;/p&gt;

&lt;p&gt;Resolve User Identity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resolveUserIdentity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jwtToken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;adminIsImpersonatingUserId&lt;/span&gt;
          &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;jwtToken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then that code ends up in a shared library which all our services use. So while our intentions were good, the reinforcing system loops cause this to be no better than the alternatives. The reason is that we often feel the need to optimize our usage across even a small number of services, because some believe any code duplication is a bad thing. So the &lt;code&gt;resolveUserIdentity&lt;/code&gt; method leads us to the following pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We change our Auth solution to add the new claim to the JWT during impersonation.&lt;/li&gt;
&lt;li&gt;Only those services that need to care about this add support for it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point we are still 100% secure. But then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We update some shared libraries that support JWT verification and add the method &lt;code&gt;resolveUserIdentity&lt;/code&gt; to it.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;resolveUserIdentity&lt;/code&gt; method replaces all the existing identity checks, so every caller now consumes the new claim.&lt;/li&gt;
&lt;li&gt;All existing services get updated to use this shared library, and are exposed to the dangers of impersonation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A new claim won't help us. This means that now we are back to the same problem, and arguably the &lt;strong&gt;situation is worse&lt;/strong&gt;. Instead of all the services in the platform trusting the standardized &lt;code&gt;sub&lt;/code&gt;, we now maintain a bespoke solution just for our system. This is especially important: the &lt;code&gt;sub&lt;/code&gt; claim is an &lt;code&gt;OAuth&lt;/code&gt; and &lt;code&gt;OpenID&lt;/code&gt; industry standard (&lt;a href="https://datatracker.ietf.org/doc/html/rfc9068" rel="noopener noreferrer"&gt;RFC 9068&lt;/a&gt;), and everyone in the industry is familiar with it. However, just for your system, there is now a new claim which ends up being treated as the canonical &lt;code&gt;sub&lt;/code&gt;, but it is not standard, not self-documenting, unexpected, and unique to you. Complexity reduces security. &lt;strong&gt;Strike 3.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For more about the systemic issues with a JWT or session token based permission system, permission attenuation is discussed in depth in the &lt;a href="https://authress.io/knowledge-base/academy/topics/offline-attenuation" rel="noopener noreferrer"&gt;token scoping academy topic&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution B: DOM Recording
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;See earlier impersonation use cases.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If we flash back to the original user stories that drove us to implement user impersonation in the first place, we might start to see a pattern emerge. Most of the time the issue is that something is wrong with the User Experience. The user is stuck in some way, the data isn't being displayed correctly, some component is broken.&lt;/p&gt;

&lt;p&gt;All of these are user-facing issues, purely in the UI. The source of the data, and the security therein, has near-zero value to us in validating the user experience. Attempting to use &lt;strong&gt;expensive&lt;/strong&gt; full user impersonation instead of simple &lt;strong&gt;UI component&lt;/strong&gt; tests is the exact same problem we see when tests are incorrectly implemented at the wrong level.&lt;/p&gt;

&lt;p&gt;Let's use the Testing Pyramid as an analogy. The canonical testing pyramid is this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ljalpcn93qnjyxhi4xc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ljalpcn93qnjyxhi4xc.png" alt="The Testing Pyramid" width="584" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the bottom is our &lt;strong&gt;unit tests&lt;/strong&gt;, those tests are cheap and easy to write, find the most issues, and ensure our system is working without much effort.&lt;/li&gt;
&lt;li&gt;Then come the &lt;strong&gt;service level tests&lt;/strong&gt;. Or in the case of UIs these are our screen tests. Multiple pieces of functionality and components are combined together in these tests. We don't want many of them; perhaps 10% max of all our tests test full screens or services. Most of the functionality of the service or screen is already validated in the unit tests, i.e. we know that our core functions, as well as buttons, sliders, pickers, etc., all work correctly.&lt;/li&gt;
&lt;li&gt;Now come the 1% &lt;strong&gt;integration or end-to-end tests&lt;/strong&gt;. You almost never want these; only the most critical flows of your application should be validated. When they report a failure, you have no idea what might have caused that particular failure, you just know there is a problem. In the case of an application like a social media platform, the integration test you want is making a new post. (Obviously there is no reason to test the login flow, since your auth provider has you already covered there!)&lt;/li&gt;
&lt;li&gt;At the top of the pyramid is &lt;strong&gt;manual exploratory testing&lt;/strong&gt;. That which cannot be automated, and most importantly needs the intelligence and creativity of a human to identify potential problems in your software application. This is the most expensive and you rarely have an interest in squandering this effort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only difference between this and a support case is the context — &lt;strong&gt;the why&lt;/strong&gt;. The services, applications, business logic, and tools that we have at our disposal are all the same. We need to trust that our tests exist to validate the problems we could have. It is always a mistake to invest effort in the top of the pyramid when we lack the assets at the bottom. Likewise, our support pyramid is this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y7k3oxbzl126nwv8wjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y7k3oxbzl126nwv8wjx.png" alt="The Support Pyramid" width="584" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the bottom is &lt;strong&gt;application logs&lt;/strong&gt;. There is no sense in attempting to tackle any of the higher layers until you have sufficient application logs that exactly report incoming requests, outgoing responses, unexpected data scenarios, edge cases that aren't completely implemented, and systemic issues.&lt;/li&gt;
&lt;li&gt;Just above that is &lt;strong&gt;documentation&lt;/strong&gt;. This includes expected common flows, uncommon flows, and demos of the more-complex-to-use aspects of our application. The biggest benefit of this documentation is that it lets us help out users. I want to repeat that it is more for us than it is for our users. The pyramid exists to inform us what we should do, not how our users should operate.&lt;/li&gt;
&lt;li&gt;The next rung up is &lt;strong&gt;User recordings&lt;/strong&gt;. For users that are having issues, we have concrete recorded data for their flow. The flows would include anything relevant to the application: how they used it, what actions they took. All so we can actually see what happened, in context, when there is a problem. No one wants to spend any time looking at recordings if they don't have to, and it is also very difficult to identify the root cause of problems by reviewing a recording, but having recordings is indispensable to your support engineers when a user has reported an issue. Solutions include &lt;a href="https://posthog.com/" rel="noopener noreferrer"&gt;PostHog&lt;/a&gt;, &lt;a href="https://www.fullstory.com/" rel="noopener noreferrer"&gt;FullStory&lt;/a&gt;, and &lt;a href="https://sentry.io/welcome/" rel="noopener noreferrer"&gt;Sentry&lt;/a&gt;. If you don't have these recordings, then the next best alternative (which is very far away) is getting a live screencast from the user. These are less useful and more expensive to obtain. Worst of all, they can be and &lt;a href="https://blog.1password.com/okta-incident/" rel="noopener noreferrer"&gt;have been used&lt;/a&gt; to breach sensitive systems.&lt;/li&gt;
&lt;li&gt;At the very top, is of course the thing you never want to have to do, and the topic of this article: &lt;strong&gt;Full user impersonation&lt;/strong&gt;. If everything else fails then at least we have user impersonation left in our toolkit. But this must only be used after we have significantly invested in all the other strategies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Assuming we have tackled the bottom two rungs of the pyramid, the missing next component is the &lt;strong&gt;User recordings&lt;/strong&gt;. If you have those, with the ability to sanitize the data coming from users, then you've got the solution to 99% of all support cases. Having people jump in and impersonate users is just not necessary. And most importantly, if we look at who most often needs to impersonate users, it isn't even the people who should have access to do so.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sf805fd1q1cce8mql4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sf805fd1q1cce8mql4d.png" alt="Danger of impersontation" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Revisiting user impersonation
&lt;/h2&gt;

&lt;p&gt;Do you want to see the data, or do you want to see what the user sees? In almost every case it is the former, and seeing the data can be done through an admin app. In the rare case that it is the latter, we would need the exact permissions the user has, or some safer strict subset of them. So what's the right way to handle user impersonation in the case that we just can't live without it?&lt;/p&gt;

&lt;p&gt;The most important principle here is &lt;strong&gt;Secure by Default&lt;/strong&gt;. So far a blanket implementation is wrong, and there are too many &lt;a href="https://authress.io/knowledge-base/articles/2025/01/03/bliss-security-framework" rel="noopener noreferrer"&gt;pits of failure&lt;/a&gt; with the JWT, auth session, or reference token based approach.&lt;/p&gt;

&lt;p&gt;Looking at the support engineer use case, our needs would be satisfied if we were to explicitly hand the support staff just the &lt;code&gt;read:logs&lt;/code&gt; permission to handle that specific support case. It is quite something else to generate whole valid tokens whose subject differs from the user requesting them and give those out to specific people. So as long as we have a system that allows us to provide our team members with explicit permissions to only the exact resources they need, we have the capability to ensure we have a secure system that also solves all our use cases.&lt;/p&gt;
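As a sketch of what an explicit, scoped grant looks like in practice: the `read:logs` permission name comes from the text above, while the in-memory store and helper names are illustrative assumptions standing in for a real authorization service.

```javascript
// Sketch: grant a support engineer an explicit, scoped permission for one
// support case, instead of minting a token carrying someone else's identity.
const grants = new Map(); // userId -> Set of granted permissions (assumed store)

function grantPermission(userId, permission) {
  if (!grants.has(userId)) {
    grants.set(userId, new Set());
  }
  grants.get(userId).add(permission);
}

function isAuthorized(userId, permission) {
  const userGrants = grants.get(userId);
  return Boolean(userGrants && userGrants.has(permission));
}

// The support engineer acts as themselves -- identity never changes,
// only their access does.
grantPermission('support_engineer_1', 'read:logs');
console.log(isAuthorized('support_engineer_1', 'read:logs'));    // true
console.log(isAuthorized('support_engineer_1', 'update:users')); // false
```

Because the identity in the token is untouched, every audit trail and log keeps recording the support engineer as themselves.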

&lt;h2&gt;
  
  
  How Authress supports user impersonation
&lt;/h2&gt;

&lt;p&gt;I want to end this article with a discussion of how Authress solves the top-of-the-pyramid user impersonation story. The caveat here is that it is sometimes a trade-off some companies really want: they absolutely want to sacrifice security, increasing vulnerabilities as well as their attack surface, by introducing full user impersonation functionality. However, from experience, very few of our customers have anything implemented in this space at all, and those that do have hooked their process into &lt;strong&gt;easy to grant permissions&lt;/strong&gt; through Authress, rather than &lt;strong&gt;full user identity impersonation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The real solution is to actually consider your support team persona when designing features. And this is what Authress optimizes for.&lt;/p&gt;

&lt;p&gt;The flow that we consider the most secure is to explicitly and &lt;strong&gt;temporarily grant your support user persona exactly one small additional set of permissions&lt;/strong&gt; relevant to the support case. When we do this, we don't change how we determine identity; we only change how we determine access. Authress supports this by allowing quick cloning of &lt;a href="https://authress.io/knowledge-base/docs/authorization/access-records" rel="noopener noreferrer"&gt;User Based Access Records&lt;/a&gt;, which represent the permissions a user has. Since cloning is dynamic, a temporary access record can be created that only contains the &lt;code&gt;READ&lt;/code&gt; equivalents of the roles that the user has. And in most cases, you can just directly assign your support engineers to an &lt;a href="https://authress.io/app/#/settings?focus=groups" rel="noopener noreferrer"&gt;Authress Permission Group&lt;/a&gt; with &lt;code&gt;READ ✶&lt;/code&gt; access, and never need to touch permissions again.&lt;/p&gt;

&lt;p&gt;Here is an example cloned access record, where the support engineer received just the &lt;strong&gt;Viewer&lt;/strong&gt; Role to all organizations so that documents and users could be &lt;code&gt;Read&lt;/code&gt; not &lt;code&gt;Updated&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7no9dlsxhjp6ymi8uh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7no9dlsxhjp6ymi8uh0.png" alt="Access record example" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;
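The cloning step described above can be sketched as a pure transformation. To be clear, the record shape, role names, and helper below are illustrative assumptions for this article, not the actual Authress data model or SDK:

```javascript
// Sketch: clone a user's access record, downgrading every role to a
// read-only equivalent and attaching an expiry so the grant is temporary.
function cloneAsReadOnly(accessRecord, supportEngineerId, ttlMs) {
  // Assumed role mapping: every role collapses to the read-only 'Viewer'.
  const readOnlyRole = { Admin: 'Viewer', Editor: 'Viewer', Viewer: 'Viewer' };
  return {
    users: [supportEngineerId],
    expires: new Date(Date.now() + ttlMs).toISOString(),
    statements: accessRecord.statements.map(function (statement) {
      return {
        resources: statement.resources, // same resources the user can reach
        roles: statement.roles.map(function (role) {
          return readOnlyRole[role] || 'Viewer'; // unknown roles degrade safely
        })
      };
    })
  };
}

const userRecord = {
  statements: [{ resources: ['/organizations/org_1/documents'], roles: ['Editor'] }]
};
// One hour of read-only access for the support engineer, nothing more.
const supportRecord = cloneAsReadOnly(userRecord, 'support_engineer_1', 3600000);
console.log(supportRecord.statements[0].roles); // [ 'Viewer' ]
```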

&lt;h2&gt;
  
  
  The firehose of recommendations
&lt;/h2&gt;

&lt;p&gt;In case you want to ignore the advice of this academy article, and not use Authress permissions to drive access control as recommended, I do want to include recommendations that will help reduce the impact of security and compliance issues related to user impersonation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not hide user impersonation. It will be tempting to obscure its usage from your customers; instead make sure it is visible and clear for everyone, especially your customers. I know you don't want them to know, but they should know, and they may even need to know, especially if something goes wrong.&lt;/li&gt;
&lt;li&gt;Make sure all actions are recorded in an audit trail, both those by the admin who impersonated the user and those by the application user. &lt;strong&gt;Especially the admin&lt;/strong&gt;. There will definitely be questions related to the "last person that touched this" and of course "it was working before your team looked at it". You will need a way to be confident in your response to your customers when it wasn't an admin that touched it last.&lt;/li&gt;
&lt;li&gt;If you're operating in any high-security environment, FedRAMP, ITAR, or the like, always require explicit customer action before the support engineer has access to the account data. Some prominent cloud providers believe having an email with the user agreeing is sufficient for this. I'm here to say it is not sufficient, because often the people who can create support cases do not and should not have admin access to the customer account to view all the data. Someone without the customer admin role should not be able to grant your support engineering staff access to sensitive data in the account. &lt;strong&gt;You need an admin to click a button.&lt;/strong&gt; This is usually done through a &lt;a href="https://authress.io/knowledge-base/docs/advanced/step-up-authorization#3-make-the-authorization-request" rel="noopener noreferrer"&gt;Step-Up Authorization Request&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Impersonation can be valuable in some environments, however it is often completely useless in others. Especially in spaces with regulatory requirements, it's much better to diagnose issues from outside the impacted account, either through data replication or a permissions-based approach.&lt;/li&gt;
&lt;li&gt;Ensure your impersonation logic is completely tested. There should be no better tested piece of functionality in your software system.&lt;/li&gt;
&lt;li&gt;Audit trails should always keep a "This was run-by User X" annotation on audit records, not just the user ID, but any additional information from the admin. Our recommendation is both the &lt;code&gt;Admin User ID&lt;/code&gt; and the &lt;code&gt;Support Ticket ID&lt;/code&gt;, on every log statement.&lt;/li&gt;
&lt;li&gt;Start with your customer expectations. What sort of transparency do they explicitly expect? Do not guess. Err on the side of overcommunicating, rather than under.&lt;/li&gt;
&lt;li&gt;Please revisit doing this in the first place if you don't have the capacity to have a dedicated team accountable for this functionality. Often this will involve your legal team when it doesn't go right.&lt;/li&gt;
&lt;li&gt;When (not if) credentials leak, who leaked those credentials? Was it your customer, or was it through your admin application, or by one of your support engineers? Always be able to tell where those credentials came from, so that you can respond to the compromise as effectively as possible.&lt;/li&gt;
&lt;li&gt;If you want to start anywhere, go back and invest in your admin/support tools so that they can expose the data that you need, rather than focusing on user impersonation. If those tools are insufficient check back at the Support Engineer Pyramid again.&lt;/li&gt;
&lt;/ul&gt;
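The audit-trail annotation recommendation above can be sketched like this. `Admin User ID` and `Support Ticket ID` come from the recommendations; the record shape and field names are illustrative assumptions:

```javascript
// Sketch: every audit record produced during impersonation carries the
// acting admin and the support ticket, so "who touched this last" is
// always answerable without rewriting who the record belongs to.
function buildAuditRecord(action, context) {
  const record = {
    action: action,
    userId: context.userId, // the account the action applied to
    timestamp: new Date().toISOString()
  };
  if (context.runByAdminUserId) {
    // Annotate rather than overwrite: the user did NOT perform this action.
    record.runByAdminUserId = context.runByAdminUserId;
    record.supportTicketId = context.supportTicketId;
  }
  return record;
}

const record = buildAuditRecord('document.update', {
  userId: 'user_123',
  runByAdminUserId: 'admin_456',
  supportTicketId: 'TICKET-789'
});
console.log(record.runByAdminUserId); // admin_456
console.log(record.supportTicketId);  // TICKET-789
```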

&lt;p&gt;For help understanding this article or how you can implement a solution like this one in your services, feel free to reach out to the &lt;a href="https://authress.io/app/#/support" rel="noopener noreferrer"&gt;Authress development team&lt;/a&gt; or follow along in the &lt;a href="https://authress.io/knowledge-base" rel="noopener noreferrer"&gt;Authress documentation&lt;/a&gt; and join our community:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://authress.io/community" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>authentication</category>
      <category>authorization</category>
      <category>identity</category>
      <category>security</category>
    </item>
    <item>
      <title>Migrating CloudFormation to TF</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Tue, 21 Jan 2025 13:56:46 +0000</pubDate>
      <link>https://dev.to/aws-builders/migrating-cloudformation-to-tf-bo9</link>
      <guid>https://dev.to/aws-builders/migrating-cloudformation-to-tf-bo9</guid>
      <description>&lt;p&gt;One day you might find yourself in the unfortunate position of wanting to migrate away from &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html" rel="noopener noreferrer"&gt;CloudFormation (CFN)&lt;/a&gt;. While some may say that CFN is bad and should never be used. I can confirm that it is still better than:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudFormation CDK&lt;/li&gt;
&lt;li&gt;AWS SAM&lt;/li&gt;
&lt;li&gt;Serverless - Not "serverless", but the company that is abusing this name.&lt;/li&gt;
&lt;li&gt;SST&lt;/li&gt;
&lt;li&gt;And many others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The truth is: CloudFormation isn't bad, however like most things, it is bad when you find out your current solution doesn't support the thing that you want it to support.&lt;/p&gt;

&lt;p&gt;So back to the problem... You want to migrate from CloudFormation to OpenTofu (since no one uses Terraform anymore after their legal scandal), and part of that process involves an actual migration of live resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration
&lt;/h2&gt;

&lt;p&gt;Migrations are &lt;strong&gt;technically&lt;/strong&gt; easy. Monolith to microservices, event buses to REST, MSSQL to NoSQL DynamoDB. The hard part is always the non-technical part. The part where you figure out what you want, now that's the problem. Unless of course you have a monolith, because you should just give up now. No one successfully converts from a Monolith to microservices. They write some code, complain a lot, then apply for a new job at a new company telling their would-be manager "Look how I helped this company migrate to microservices. I'm Great!"&lt;/p&gt;

&lt;p&gt;But this isn't a story about how monoliths are bad, it is about how to migrate your &lt;strong&gt;Infrastructure as Code&lt;/strong&gt; (IaC) solution.&lt;/p&gt;

&lt;p&gt;Realistically, you have to painstakingly generate the new IaC HCL files for OpenTofu. You have existing CloudFormation as well as the real live version of your infrastructure currently supporting a massive business. And if you are like us at &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt;, you might also have a &lt;a href="https://authress.io/knowledge-base/articles/2024/09/04/aws-ensuring-reliability-of-authress" rel="noopener noreferrer"&gt;99.999% uptime SLA&lt;/a&gt; you need to account for.&lt;/p&gt;

&lt;p&gt;If you have 100+ CFN stacks, you probably don't want to import these resources in OpenTofu by hand. Instead, you'll want some sort of tool to do this, and there are a bunch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://former2.com/" rel="noopener noreferrer"&gt;Former2&lt;/a&gt; - Export from AWS to HCL.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.firefly.ai/blog/cloudformation-to-terraform-migration" rel="noopener noreferrer"&gt;Firefly.ai&lt;/a&gt; - AI in the company name, yuck&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/DontShaveTheYak/cf2tf" rel="noopener noreferrer"&gt;CF2TF&lt;/a&gt; - Open source converter&lt;/li&gt;
&lt;li&gt;Doing it by hand to verify you have everything you need.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there are still more... You could even try one of the LLMs out there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating the configuration
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/import/generating-configuration" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; and &lt;a href="https://opentofu.org/docs/language/import/generating-configuration/" rel="noopener noreferrer"&gt;OpenTofu&lt;/a&gt; actually support configuration generation out of the gate as well, so we will use their strategy here, and if you want to use one of the less great ones from above, you do you!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Add the &lt;code&gt;import&lt;/code&gt; block:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;example&lt;/span&gt;
  &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"foo"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Run the configuration generation command:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tf plan -generate-config-out=generated.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:ec2:eu-west-1:1234567890:instance/i-00deadc0de"&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-000a4d9c6067d5d0d"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.micro"&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;   
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Commit the new configuration:&lt;/strong&gt;&lt;br&gt;
Add the configuration to your files, and &lt;code&gt;git commit&lt;/code&gt; to your IaC repository.&lt;/p&gt;
&lt;h2&gt;
  
  
  Running the migration
&lt;/h2&gt;

&lt;p&gt;Once we have all of those generated, we just need to run &lt;code&gt;tf plan&lt;/code&gt;, then &lt;code&gt;tf apply&lt;/code&gt;, and finally delete the &lt;code&gt;import&lt;/code&gt; blocks.&lt;/p&gt;
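&lt;p&gt;Concretely, assuming &lt;code&gt;tf&lt;/code&gt; is your alias for &lt;code&gt;terraform&lt;/code&gt; or &lt;code&gt;tofu&lt;/code&gt;, the whole migration is roughly this sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tf plan -generate-config-out=generated.tf   # review the plan and the generated HCL
tf apply                                    # imports the resources into state
# delete the import blocks, then verify a clean state:
tf plan                                     # should now report no changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;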

&lt;p&gt;And you are done!&lt;/p&gt;
&lt;h2&gt;
  
  
  Cleanup
&lt;/h2&gt;

&lt;p&gt;The one thing that no one tells you at this point is that you aren't done. Importing the resources and having the committed IaC HCL does not mean you are done. If you are like me, then you care that you still have 100s of CFN stacks deployed in your AWS accounts. Maybe these stacks all have &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/detect-drift-stack.html" rel="noopener noreferrer"&gt;CFN Drift&lt;/a&gt; and don't even represent the current state of the world anymore.&lt;/p&gt;

&lt;p&gt;However, even if they do represent the current state, you probably don't want someone going into your account and accidentally updating or deleting those. Or your desire to have a pristine account compels you to delete these stacks. You probably wouldn't be someone working on this problem in the first place if you didn't care that these old stacks are still here.&lt;/p&gt;

&lt;p&gt;The problem is that there is no way to delete a stack without also deleting the resources in that stack. And of course, you want to keep the resources in those stacks, so that's a conundrum. Thankfully, I've figured out a hack to get around this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqy41ryjsvieb4b6tdax.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqy41ryjsvieb4b6tdax.gif" alt="Warren disappears due to the creation of a hack" width="276" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The hack involves utilizing three features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;DELETE_FAILED&lt;/code&gt; stack status&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;FORCE_DELETE_STACK&lt;/code&gt; deletion mode&lt;/li&gt;
&lt;li&gt;the CloudFormation execution role ARN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;DELETE_FAILED&lt;/code&gt; status occurs whenever CFN tries to delete a resource that it believes is no longer necessary, but the resource is either in use &lt;strong&gt;OR CFN doesn't have access to delete the resource.&lt;/strong&gt; Take note of that second condition.&lt;/p&gt;

&lt;p&gt;Second, when a stack is in the &lt;code&gt;DELETE_FAILED&lt;/code&gt; status, you are allowed to force delete the stack and retain explicit resources that you might still be using.&lt;/p&gt;

&lt;p&gt;So all we need to do is get the stack into the &lt;code&gt;DELETE_FAILED&lt;/code&gt; state, and then ask CFN to retain all the resources.&lt;/p&gt;

&lt;p&gt;CloudFormation allows you, for "security reasons", to specify a role ARN to execute CFN with. When you do that, the CFN stack changes will only be executed with that role. So we'll define a new role that does &lt;strong&gt;not have access&lt;/strong&gt; to anything, and abuse the Role ARN property to force CFN to fail to delete any resources, and thus fail to delete the stack.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cleanup Execution
&lt;/h2&gt;

&lt;p&gt;Create the Role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;CfnDeleteStackRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::IAM::Role&lt;/span&gt;
     &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;RoleName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cfn-delete-stack-role&lt;/span&gt;
       &lt;span class="na"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2012-10-17'&lt;/span&gt;
         &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
           &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
             &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
               &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloudformation.amazonaws.com&lt;/span&gt;
             &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sts:AssumeRole&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that role, we'll call the Delete Stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudformation delete-stack \
    --stack-name my-stack \
    --role-arn arn:aws:iam::account:role/cfn-delete-stack-role
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This execution call &lt;strong&gt;will fail&lt;/strong&gt;, but we knew that was going to happen: it puts the stack into the &lt;code&gt;DELETE_FAILED&lt;/code&gt; status.&lt;/p&gt;

&lt;p&gt;Finally, we can execute the delete again, utilizing the force deletion parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudformation delete-stack \
    --stack-name my-stack \
    --role-arn arn:aws:iam::account:role/cfn-delete-stack-role \
    --deletion-mode FORCE_DELETE_STACK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the resources in your stack, or if you want extra insurance against deleting your precious resources, you can add the &lt;code&gt;--retain-resources&lt;/code&gt; flag to the CLI command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws cloudformation delete-stack \
    --stack-name my-stack \
    --role-arn arn:aws:iam::account:role/cfn-delete-stack-role \
    --deletion-mode FORCE_DELETE_STACK \
    --retain-resources $LOGICAL_RESOURCES_LIST
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;$LOGICAL_RESOURCES_LIST&lt;/code&gt; is the list of logical resource IDs from the stack's template, which you can generate like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cfnTemplateFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./cfn-template.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cfnTemplate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cfnTemplateFile&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resourceKeys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cfnTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Resources&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;resourceKeys&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Use resourceKeys as $LOGICAL_RESOURCES_LIST&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repeat this for every CFN stack in every region in every AWS account in your org, and everything will be cleaned up, just the way you wanted it to.&lt;/p&gt;
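&lt;p&gt;To avoid doing that by hand, a per-account, per-region sweep could look roughly like this (a hypothetical sketch; verify the status filters and the role name against your own setup before running anything):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for stack in $(aws cloudformation list-stacks \
    --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
    --query 'StackSummaries[].StackName' --output text); do
  # The first delete fails by design, landing the stack in DELETE_FAILED
  aws cloudformation delete-stack --stack-name "$stack" \
      --role-arn arn:aws:iam::account:role/cfn-delete-stack-role
  # Wait for the stack to reach DELETE_FAILED, then force delete it
  aws cloudformation delete-stack --stack-name "$stack" \
      --role-arn arn:aws:iam::account:role/cfn-delete-stack-role \
      --deletion-mode FORCE_DELETE_STACK
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;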




&lt;p&gt;Curious about this or think it's worth discussing more? Join my community and chat with me:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>aws</category>
      <category>infrastructureascode</category>
      <category>devops</category>
    </item>
    <item>
      <title>Are millions of accounts vulnerable due to Google's OAuth Flaw?</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Wed, 15 Jan 2025 17:01:30 +0000</pubDate>
      <link>https://dev.to/authress/are-millions-of-accounts-vulnerable-due-to-googles-oauth-flaw-33f</link>
      <guid>https://dev.to/authress/are-millions-of-accounts-vulnerable-due-to-googles-oauth-flaw-33f</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article is a rebuttal to &lt;a href="https://trufflesecurity.com/blog/millions-at-risk-due-to-google-s-oauth-flaw" rel="noopener noreferrer"&gt;Truffle Security's&lt;/a&gt; post on &lt;a href="https://trufflesecurity.com/blog/millions-at-risk-due-to-google-s-oauth-flaw" rel="noopener noreferrer"&gt;Millions of Accounts Vulnerable due to Google's OAuth Flaw&lt;/a&gt;. (&lt;em&gt;&lt;a href="https://authress.io/knowledge-base/assets/files/truffle-security-google-oauth-vulnerability-19b387e9c84f8ccfe621c0301c2a19d8.pdf" rel="noopener noreferrer"&gt;Alt link&lt;/a&gt;&lt;/em&gt;) Even more ridiculous might be that their post got picked up by no small number of news outlets that all should be ashamed of themselves, far too many to actually link in this post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Are millions of accounts vulnerable due to Google's OAuth Flaw?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a true &lt;a href="https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines" rel="noopener noreferrer"&gt;Betteridge's law of headlines&lt;/a&gt; fashion, the answer is a resounding &lt;strong&gt;No&lt;/strong&gt;. Which explains why Google ignored this vulnerability in the first place:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8shcza2985fajtll95vh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8shcza2985fajtll95vh.png" alt="Google Workspace response" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The TL;DR of the source article claims that due to the nature of how Google OAuth works, &lt;strong&gt;"Millions of Americans' data and accounts remain vulnerable"&lt;/strong&gt;. It relies on the nature of Domain Ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Claim
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Google’s OAuth login doesn’t protect against someone purchasing a failed startup’s domain and using it to re-create email accounts for former employees.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Domains are the root of trust* for many businesses. At Authress we rely on &lt;code&gt;authress.io&lt;/code&gt; to establish trust with our customers, just as at your business you rely on your domains for your customers. This is "Root of Trust" with an asterisk because in reality the root of trust lies with the domain authority, the domain registrar, and the issuer of your TLS certificates for HTTPS encryption. But that is outside of the scope of this article.&lt;/p&gt;

&lt;p&gt;The claim in the original article is that it is OAuth and specifically Google's OAuth that is at fault and nothing else. And that somehow domain ownership is linked to the exposure of customer data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Gaining access to your trusted domain is one way in which attackers attempt to circumvent your security strategy and compromise your users. If malicious attackers can utilize your domain to trick your users, then they can impersonate your business and steal their personal information, bank accounts, and credit card numbers. This is the basis for phishing today. As a matter of fact, phishing is so popular precisely because compromising a domain outright is incredibly hard, and is usually executed through a &lt;a href="https://www.cloudflare.com/learning/dns/dns-cache-poisoning/" rel="noopener noreferrer"&gt;DNS Poisoning attack&lt;/a&gt;. The next best strategy is to purchase alternative domains that look and feel like the valid domain (&lt;a href="https://www.zscaler.com/blogs/security-research/phishing-typosquatting-and-brand-impersonation-trends-and-tactics" rel="noopener noreferrer"&gt;Typosquatting&lt;/a&gt;). These facsimiles exist for exactly that reason.&lt;/p&gt;

&lt;p&gt;Besides using separate domains, attackers will often also attempt &lt;a href="https://developer.mozilla.org/en-US/docs/Web/Security/Subdomain_takeovers" rel="noopener noreferrer"&gt;Subdomain takeovers&lt;/a&gt;, which are a blend of domain compromise and using an alternative domain.&lt;/p&gt;

&lt;p&gt;However, in this case, attackers cleverly will attempt to use your existing corporate domain after you believe you are done with it. The expected flow involving Google Workspace's OAuth looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You buy a domain for your company, let's call it &lt;code&gt;yourcompany.com&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Sign up for an Employee Identity Solution (IdP) that provides OAuth; there are many solutions here: Google Workspace, &lt;a href="https://okta.com/" rel="noopener noreferrer"&gt;Okta&lt;/a&gt;, &lt;a href="https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id" rel="noopener noreferrer"&gt;Microsoft Entra ID&lt;/a&gt;, &lt;a href="https://www.pingidentity.com/en/resources/blog/post/okta-vs-ping-best-iam-digital-security.html" rel="noopener noreferrer"&gt;Ping Identity&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Then your employees use that identity solution to sign in to a third party product such as Stripe, AWS, PostHog, etc...&lt;/li&gt;
&lt;li&gt;Lastly, you give critical, business-sensitive data to that product, like your pets' birthdays.&lt;/li&gt;
&lt;li&gt;That third party application saves that data, because they like data very much.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrev26ltr376vgehegjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzrev26ltr376vgehegjt.png" alt="Corporate Login Flow" width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Identity
&lt;/h2&gt;

&lt;p&gt;When you log into your favorite third party application, there needs to be an identifier sent from the Employee Identity Solution to that third party. The Third Party trusts your chosen identity solution as well as that identifier. Here is an example token generated by Google Workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://accounts.google.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"210169484474386"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1736946817"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"exp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1736996817"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warren@yourcompany.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"yourcompany.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Warren Parad"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"given_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Warren"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"family_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Parad"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"locale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The identifier in the token is the &lt;code&gt;sub&lt;/code&gt; claim with the value &lt;code&gt;210169484474386&lt;/code&gt;. This is my User ID (Note: this is not actually my user ID, feel free to do with it as you wish, but I made it up for the purposes of this post.)&lt;/p&gt;

&lt;p&gt;Your third party application uses this &lt;code&gt;sub&lt;/code&gt; property to uniquely identify you, and then authorize you to your company's sensitive cat photos.&lt;/p&gt;
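<p>As a sketch (hypothetical code, not taken from any particular SDK), a third party application should key user records on the issuer plus the <code>sub</code> claim, never on the email:<br>
</p>

<div class="highlight js-code-highlight">
<pre class="highlight javascript"><code>// `claims` is the payload of an ID token whose signature was already verified.
const claims = {
  iss: 'https://accounts.google.com',
  sub: '210169484474386',
  email: 'warren@yourcompany.com'
};

// Good: stable across domain transfers and Workspace re-creation.
const userId = `${claims.iss}|${claims.sub}`;

// Bad: anyone who re-registers yourcompany.com can mint this same email.
// const userId = claims.email;
</code></pre>

</div>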

&lt;h2&gt;
  
  
  The Vulnerability
&lt;/h2&gt;

&lt;p&gt;Now, imagine that you close your Google Workspace account because your company goes bankrupt (This frequently happens because as much as we want to believe companies are successful through hard work, the &lt;a href="https://www.youtube.com/watch?v=3LopI4YeC4I" rel="noopener noreferrer"&gt;truth is that it is actually luck&lt;/a&gt;). Along with your Google Workspace account, your domain &lt;code&gt;yourcompany.com&lt;/code&gt; will likely expire too, unless you harbor secret prayers that one day you will be able to sell it instead of letting it expire worthless. Let's assume the yourcompany.com domain is now available for anyone to purchase. By purchasing that domain, an attacker can create a new Google Workspace account, in hopes of gaining access to those exact same third parties you had used for your business.&lt;/p&gt;

&lt;p&gt;This actually isn't even the first time something like this has been attempted, and frequently it works due to hard-coded assumptions in many applications. In a cruel twist of fate, here is a great example of being able to compromise the attackers themselves, because they had used an application which relied on &lt;a href="https://labs.watchtowr.com/more-governments-backdoors-in-your-backdoors/" rel="noopener noreferrer"&gt;expired trusted malicious domains&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This doesn't happen with Google OAuth though. When you close the Google Workspace account, the &lt;code&gt;User ID&lt;/code&gt; with the value &lt;code&gt;210169484474386&lt;/code&gt; ceases to exist. This is what Google is confirming by closing the original bug report. An attacker recreating the Google Workspace account is unable to generate the same sub again. So even if an attacker created a new Google Workspace from the expired and unclaimed domain &lt;code&gt;yourcompany.com&lt;/code&gt;, the sub would be different and your third party application would reject access.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the problem?
&lt;/h3&gt;

&lt;p&gt;The issue is that some third party applications decided not to use the &lt;code&gt;sub&lt;/code&gt; claim. The author of the Truffle Security post suggests that this is due to some bug in the Google OAuth implementation, but the reality is OAuth has nothing to do with this problem. The failure to use the &lt;code&gt;sub&lt;/code&gt; claim stems from that shiny property in the identity token called &lt;code&gt;email&lt;/code&gt;. In the original token above you can see the user's email: &lt;code&gt;warren@yourcompany.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A third party that utilizes this email address to uniquely identify users is allowing malicious attackers, who compromise employee identity providers through expired domains, to take over your account. There are lots of reasons they do this, but primarily it is because they like the way the &lt;code&gt;@&lt;/code&gt; looks in their database.&lt;/p&gt;

&lt;p&gt;That means this is actually &lt;strong&gt;a vulnerability on the third party application side&lt;/strong&gt;. Any third party application that allows users to log in with just an email is inherently creating a vulnerability in its own platform and setting itself up to expose its (ex-)users' data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vulnerability review
&lt;/h2&gt;

&lt;p&gt;So this actually has nothing to do with Google Workspace at all, and an attacker can use any email provider to perpetrate this attack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Buy an expired domain and register your domain in a new email provider&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;li&gt;Profit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although in this case the &lt;code&gt;...&lt;/code&gt; is simply: &lt;strong&gt;Attempt a password reset or magic-link authentication for that third party application.&lt;/strong&gt; &lt;em&gt;In a similar attack a vulnerability was utilized by attackers through an &lt;a href="https://www.rescana.com/post/critical-zendesk-email-spoofing-vulnerability-cve-2024-49193-risks-and-mitigation-strategies" rel="noopener noreferrer"&gt;email support system&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The real vulnerability
&lt;/h3&gt;

&lt;p&gt;This shows us that OAuth and Google Workspace aren't actually the source of the issue here; it's the third party application. I've frequently condemned &lt;a href="https://authress.io/knowledge-base/articles/magic-links-passwordless-login" rel="noopener noreferrer"&gt;Magic-Link based Authentication&lt;/a&gt;, and while there are some areas where it unfortunately still provides value, it isn't worth it if you care about security. The fact that the email is provided by Google is just unfortunate. Emails are helpful for identifying where to send messages to users who want emails, but they should never be used anywhere related to security.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dismantling the solution
&lt;/h3&gt;

&lt;p&gt;The original article suggests that adding two more claims/properties to the User Identity Token will solve the problem. One claim isn't good enough, let's have three!&lt;/p&gt;

&lt;p&gt;Given that the problem is that third party applications are ignoring the already existing &lt;code&gt;sub&lt;/code&gt; claim, I find this to be quite the naïve suggestion. No amount of additional claims will prevent third parties from incorrectly substituting in their beliefs where actual security is necessary. This is just an unfortunate truth. We see this every day, and it is one of the reasons we built &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt; in the first place. The defaults that exist in SDKs, frameworks, protocols, and standards are just not enough for people to do the right thing; explicit investment has to be made in preventing the wrong thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Third Party Application responsibility
&lt;/h3&gt;

&lt;p&gt;The last part of the problem is that the author of the original article claims:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What can Downstream Providers do to mitigate this? At the time of writing, there is no fix&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which just isn't true. Third party applications that allow email-based authentication must delete user data after account deactivation. Once you stop paying for a third party application, that data must be deleted and never exposed again, unless you resume access and the third party re-verifies identity. I prefer taking guidance from &lt;a href="https://pages.nist.gov/800-63-3-Implementation-Resources/63A/verification/" rel="noopener noreferrer"&gt;NIST 800-63A&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a user, you too can do something. If you have sensitive data, you could decide not to use any third party applications, unless of course you actually pay for them, and ensure that you delete your account before your company stops using that application. If you give someone your data, they have it; assume the worst. We can and should put more responsibility onto these third party application services that are utilizing unsafe email addresses, and often SMS numbers, for authentication. As long as you treat email auth as a valid solution, everyone will forever be just as culpable as the third parties who rely on it. Use &lt;a href="https://authress.io/knowledge-base/docs/authentication/user-authentication" rel="noopener noreferrer"&gt;OAuth and SAML&lt;/a&gt; for your &lt;a href="https://authress.io/knowledge-base/academy/topics/implementating-user-login" rel="noopener noreferrer"&gt;business authentication&lt;/a&gt; and make sure to provide sufficiently &lt;a href="https://authress.io/knowledge-base/docs/authentication/user-authentication" rel="noopener noreferrer"&gt;secure options&lt;/a&gt; to the users of the products and services you build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consumer exposure
&lt;/h2&gt;

&lt;p&gt;The original article also seems to suggest that consumers are directly at risk. There is nothing about this vulnerability that directly affects consumers. Sure, there are impacts to consumers regarding data privacy, but the vulnerability discussed in this article doesn't include them.&lt;/p&gt;

&lt;p&gt;That's because as a consumer, when you use an application, that application stores data in their primary databases. When the company that manages that application fails, both their databases and their bank accounts are empty. You don't have to worry about that data. But you do have to worry about who they gave your data to, and you have to worry about that irrespective of the company or its state. Many companies out there have come under investigation for just that. This is the whole premise of &lt;a href="https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal" rel="noopener noreferrer"&gt;Facebook's Cambridge Analytica scandal&lt;/a&gt;. Facebook gave user personal data to Cambridge Analytica when they should not have had access to it. Facebook didn't even need to be bankrupt for there to be a problem.&lt;/p&gt;

&lt;p&gt;The core of the issue isn't the data you have given to the company; the problem is the data they have shared with others. But no amount of praying or technological solutions is going to fix that. The problems proposed in the original article regarding the domain vulnerability are related to the data given to third party applications secured by the company's corporate domain. The data most vulnerable in these circumstances is the business-to-business relationships: billing information, strategic partnerships, invoices, business strategies, these are at risk.&lt;/p&gt;

&lt;p&gt;For example, at Authress we use Stripe, sometimes. In Stripe we have customer account information, including customer emails for sending invoices. If you are using Stripe or another payment provider, then chances are you too are storing some sort of customer data in Stripe. If your company goes bankrupt, and an attacker uses the domain vulnerability to do a password reset on your Stripe account, they will then have access to your old company's customer invoice and email data. You probably don't care, but you should.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So I think we can say definitively: &lt;strong&gt;no, there aren't millions of people at risk with this vulnerability&lt;/strong&gt;. Sure, your data is at risk, it always has been at risk, and it always will be at risk, but Google's OAuth implementation, while problematic, honestly doesn't change anything at all. You can continue to file your data deletion requests with your third party application providers when you don't think they are doing too well. But if they aren't doing that well, I sincerely doubt they are deleting your data, let alone deleting your data from their third party providers. I don't know what will become of the originally published articles or Google's response, but I felt strongly about first educating regarding the problem rather than lambasting Google Workspace over their responses. The claim by the original author that &lt;strong&gt;millions of accounts are vulnerable due to Google's OAuth Flaw&lt;/strong&gt; is just irresponsible.&lt;/p&gt;

&lt;p&gt;Curious about this or think it's worth discussing more?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>security</category>
      <category>startup</category>
      <category>oauth</category>
    </item>
    <item>
      <title>AWS Advanced: The Quota Monitor Review</title>
      <dc:creator>Warren Parad</dc:creator>
      <pubDate>Thu, 09 Jan 2025 15:56:17 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-advanced-improving-the-quota-monitor-3e6l</link>
      <guid>https://dev.to/aws-builders/aws-advanced-improving-the-quota-monitor-3e6l</guid>
      <description>&lt;h2&gt;
  
  
  $78,641.25
&lt;/h2&gt;

&lt;p&gt;Per Month.&lt;/p&gt;

&lt;p&gt;That's the predicted cost of running the &lt;a href="https://docs.aws.amazon.com/solutions/latest/quota-monitor-for-aws/cost.html" rel="noopener noreferrer"&gt;official quota monitor&lt;/a&gt; released by AWS for ~1000 AWS accounts in your organization. For us at &lt;a href="https://authress.io" rel="noopener noreferrer"&gt;Authress&lt;/a&gt;, and for anyone else who spins up AWS accounts per customer or per team/product/service, this cost could actually be significantly higher.&lt;/p&gt;

&lt;p&gt;Obviously this isn't an entirely fair interpretation of the cost analysis. But it is pretty ridiculous that the cost scales like this. We would fully expect a near-zero cost for running a quota monitor.&lt;/p&gt;

&lt;p&gt;That is, when we need additional resources, AWS should provide a free way for us to find out that we want to pay AWS more. Running into production problems by default is always a &lt;code&gt;pit of failure&lt;/code&gt;. I call it a pit of failure because, by default, we won't know that we are about to run out of a critical resource our product needs to work. The right solution is for AWS to alert the contacts on file for the account when there is a problem, let us burst above the limit, and then continue the conversation. (At least, that is what we do for our customers.)&lt;/p&gt;
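&lt;p&gt;The alert-before-you-hit-the-wall behavior described above boils down to a threshold comparison. A minimal sketch (the function name and the 80% threshold are illustrative choices, not anything AWS provides):&lt;/p&gt;

```python
# Decide whether current usage is close enough to a quota to warrant an alert.
# threshold=0.8 means: alert once usage reaches 80% of the quota value.
def should_alert(usage: float, quota: float, threshold: float = 0.8) -> bool:
    if quota <= 0:  # unknown or unlimited quota: nothing to compare against
        return False
    return usage / quota >= threshold

# Example: 850 concurrent Lambda executions against the default quota of 1000.
print(should_alert(850, 1000))  # True: past the 80% mark
print(should_alert(500, 1000))  # False: plenty of headroom
```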

&lt;p&gt;But AWS doesn't offer that, so let's attempt the next best thing: building a better version.&lt;/p&gt;

&lt;h2&gt;
  
  
  The baseline
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/solutions/latest/quota-monitor-for-aws/architecture-overview.html" rel="noopener noreferrer"&gt;original proposed architecture&lt;/a&gt; looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9ps484sc716x3bm6z46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9ps484sc716x3bm6z46.png" alt="AWS Quota Monitor" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So let's dive into that. It looks a bit complicated. If you have been looking at architecture diagrams for as long as I have, then you know to trust this instinct: it is indeed far too complicated. And most of our calculated &lt;strong&gt;$78,641.25&lt;/strong&gt; comes from the use of CloudWatch metrics. Metrics should be avoided at all costs, since they have exorbitant pricing. This is something I previously reviewed in depth in our quest for the perfect &lt;a href="https://dev.to/authress/aws-advanced-serverless-prometheus-in-action-j1h"&gt;serverless metrics solution&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The core requirements of what we need are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call the Service Quota API and check where the quotas are at.&lt;/li&gt;
&lt;li&gt;Alert when usage approaches a quota.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One of the problems with the model proposed by the AWS Quota Monitor is that it operates in isolation. That is, it doesn't rely on any alerting or monitoring infrastructure you might already be utilizing. For explaining new concepts, this is helpful. But to actually build your AWS account up into a perfectly secure and well-oiled machine, let's utilize our existing infrastructure. Having followed the &lt;a href="https://dev.to/authress/aws-cloudwatch-how-to-scale-your-logging-infrastructure-8j0"&gt;logging infrastructure already put in place&lt;/a&gt;, we have a strong strategy for cross-account and cross-region alerting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1mvont6myykw6wtz131.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1mvont6myykw6wtz131.png" alt="Cross Account and Region Alerting" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That means all we need to do is deploy some piece of technology which will call the Service Quota API, log a message to CloudWatch Logs, and have a CloudWatch subscription set up to forward those messages to the appropriate location. &lt;strong&gt;Simple, right?&lt;/strong&gt;&lt;/p&gt;
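&lt;p&gt;That "piece of technology" can be as small as a scheduled Lambda function: anything a Lambda writes to stdout lands in CloudWatch Logs, so emitting one structured line per breached quota is all a subscription filter needs to forward on. A sketch, assuming boto3's real &lt;code&gt;service-quotas&lt;/code&gt; client; the prefix, threshold, and usage placeholder are hypothetical wiring for your own setup:&lt;/p&gt;

```python
import json

def quota_alert_line(service_code, quota_name, usage, limit):
    """Render one structured log line that a CloudWatch Logs
    subscription filter can match on (e.g. filter on "QUOTA_ALERT")."""
    return "QUOTA_ALERT " + json.dumps({
        "serviceCode": service_code,
        "quotaName": quota_name,
        "usage": usage,
        "limit": limit,
        "percentUsed": round(100.0 * usage / limit, 1),
    }, sort_keys=True)

def handler(event, context):
    # Hypothetical wiring: fetch quotas via boto3 and compare against usage
    # you measure yourself (the rest of this article shows why AWS-provided
    # usage metrics are mostly absent). Pagination omitted for brevity.
    import boto3  # lazy import: keeps the pure helper above dependency-free
    sq = boto3.client("service-quotas")
    for quota in sq.list_service_quotas(ServiceCode="lambda")["Quotas"]:
        usage = 0  # placeholder: plug in your own usage measurement here
        if quota["Value"] and usage / quota["Value"] >= 0.8:
            print(quota_alert_line("lambda", quota["QuotaName"], usage, quota["Value"]))
```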

&lt;h2&gt;
  
  
  Step Zero: Architecture
&lt;/h2&gt;

&lt;p&gt;Rather than the original architecture, which requires passing all the usage data from the spoke AWS accounts to the hub, we will convert this so that the spoke AWS accounts in the organization self-report. Applications in the spoke AWS accounts already self-report whenever there is a production problem, and quota management isn't something that has security implications or requires oversight from our AWS Monitoring or AWS Logging org-level accounts.&lt;/p&gt;

&lt;p&gt;By pushing the responsibility of alerting into the spoke AWS accounts, we can control the thresholds for when to alert, make decisions in real time about potential issues with the current quotas, and utilize the shared infrastructure AWS accounts in our org only when necessary.&lt;/p&gt;

&lt;p&gt;This also has the huge benefit of self-cleanup. When an AWS account is deactivated and finally deleted, there is no historical data that is left anywhere else, no existing cross account connections, no unnecessary configuration. Every account cleans up its own configuration and data automatically. The historical information is saved, but if the account isn't being used anymore we don't need to care about whether we are hitting a quota in it.&lt;/p&gt;

&lt;p&gt;A few details of importance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Usually we would use &lt;a href="https://aws.amazon.com/premiumsupport/technology/trusted-advisor/" rel="noopener noreferrer"&gt;AWS Trusted Advisor&lt;/a&gt; to monitor quota limits, but not everyone can benefit from this service, as it requires an AWS Business Support plan. If you spend ~$100k per month, that costs about $6,900 per month, or roughly 7% of your spend. That's a lot.&lt;/li&gt;
&lt;li&gt;Not all the checks are there, and sometimes we aren't interested in exactly the values that have been set, but instead want thresholds of our own.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The major problem here is that the AWS Solutions' Quota Monitor only utilizes data from CloudWatch Metrics, which means only quotas that log to CloudWatch can be monitored. For example, I'm looking at one of our AWS accounts, and there are only 194 metrics in the region.&lt;/p&gt;

&lt;p&gt;Many interesting quotas don't have a metric at all. Take Route 53. Wouldn't it be great if you knew how close you were to the limit for creating new Route 53 records?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ServiceCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"route53"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ServiceName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Amazon Route 53"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"QuotaArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:servicequotas:::route53/L-E209CC9F"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"QuotaCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"L-E209CC9F"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"QuotaName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Records per hosted zone"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10000.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"None"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Adjustable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"GlobalQuota"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the problem is that Route53 doesn't log a metric for record count.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws service-quotas list-aws-default-service-quotas &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-code&lt;/span&gt; route53 &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Quotas[?UsageMetric.MetricNamespace==`AWS/Usage`]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this CLI command will tell you which quotas are actually being tracked in the region. The result from this command is only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ServiceCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"route53"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ServiceName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Amazon Route 53"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QuotaArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:servicequotas:::route53/L-F767CB15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QuotaCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"L-F767CB15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QuotaName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Domain count limit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;20.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"None"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Adjustable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GlobalQuota"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"UsageMetric"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"MetricNamespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS/Usage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"MetricName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ResourceCount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"MetricDimensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"Class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"None"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DomainCount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"Service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Route 53 Domains"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Resource"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"MetricStatisticRecommendation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Maximum"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Period"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"PeriodValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"PeriodUnit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MINUTE"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
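&lt;p&gt;The same filter the &lt;code&gt;--query&lt;/code&gt; JMESPath expression applies can be reproduced client-side, which is handy once you iterate over quota pages from code rather than the CLI. A sketch over the response shape shown above, with the two Route 53 quotas trimmed down to their identifying fields:&lt;/p&gt;

```python
def quotas_with_usage_metric(quotas):
    """Keep only quotas that publish a usage metric into AWS/Usage,
    mirroring: Quotas[?UsageMetric.MetricNamespace==`AWS/Usage`]"""
    return [
        q for q in quotas
        if q.get("UsageMetric", {}).get("MetricNamespace") == "AWS/Usage"
    ]

# Trimmed-down versions of the two Route 53 quotas shown above:
# "Records per hosted zone" has no UsageMetric, "Domain count limit" does.
route53_quotas = [
    {"QuotaCode": "L-E209CC9F", "QuotaName": "Records per hosted zone"},
    {"QuotaCode": "L-F767CB15", "QuotaName": "Domain count limit",
     "UsageMetric": {"MetricNamespace": "AWS/Usage", "MetricName": "ResourceCount"}},
]
tracked = quotas_with_usage_metric(route53_quotas)
print([q["QuotaCode"] for q in tracked])  # ['L-F767CB15']: only domain count is tracked
```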



&lt;p&gt;We recently found we had to request an RPS increase for our API Gateways at Authress. So let's check what the Quota Monitor would have found:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws service-quotas list-aws-default-service-quotas &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-code&lt;/span&gt; apigateway &lt;span class="nt"&gt;--region&lt;/span&gt; eu-west-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Quotas[?UsageMetric.MetricNamespace==`AWS/Usage`]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Well, RIP 🪦. The Quota Monitor appears to be completely useless here. Even if we attempted to utilize different metric namespaces ourselves, the only Lambda-related one is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ServiceCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lambda"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ServiceName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS Lambda"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QuotaArn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:servicequotas:eu-west-1::lambda/L-B99A9384"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QuotaCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"L-B99A9384"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"QuotaName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Concurrent executions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1000.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Unit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"None"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Adjustable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"GlobalQuota"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"UsageMetric"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"MetricNamespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS/Lambda"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"MetricName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ConcurrentExecutions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"MetricDimensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"MetricStatisticRecommendation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Maximum"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Well, that's at least a bit better, but we are still missing the other 19 Lambda quotas. Admittedly, many of these are not adjustable, so let's limit our query to just the ones which in fact are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws service-quotas list-aws-default-service-quotas &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-code&lt;/span&gt; lambda &lt;span class="nt"&gt;--region&lt;/span&gt; eu-west-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Quotas[?Adjustable==`true`].QuotaName'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Concurrent executions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Function and layer storage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Elastic network interfaces per VPC"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So three, and only one more than we found with a usage metric.&lt;/p&gt;

&lt;p&gt;So right now we are in a pretty bad place. The Quota Monitor is dependent on data that will never exist in most situations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The disappointment
&lt;/h2&gt;

&lt;p&gt;Okay, sorry for the lack of any real solution. While this started out as a journey to define a quota monitor that anyone can use, based on how the original monitor was built, we know that there isn't an easy out-of-the-box solution. That's because the &lt;strong&gt;Quota Monitor AWS Solution is broken by design&lt;/strong&gt;. It only reports on things that are being tracked, and AWS is only tracking usage when it is easy. When it is hard, say &lt;code&gt;Max(RPS)&lt;/code&gt; for API Gateway in the last 5 minutes, there is no unified usage data in the Service Quotas service. And without that data being tracked, there is no way to alert on it. If we want to know whether we are going to hit a limit in one of the services we actually use, we need to explicitly build a solution that tracks that specific thing.&lt;/p&gt;
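&lt;p&gt;As a taste of what "explicitly build a solution that tracks that specific thing" looks like, here's a sketch for the Route 53 &lt;code&gt;Records per hosted zone&lt;/code&gt; quota from earlier: since AWS publishes no usage metric for it, count the records yourself and compare against the quota value. The boto3 paginator is a real API; the zone id and helper names are illustrative:&lt;/p&gt;

```python
def headroom(usage: int, quota: float):
    """Return (percent_used, remaining) for a quota we track ourselves."""
    return (round(100.0 * usage / quota, 1), int(quota) - usage)

def count_records(hosted_zone_id: str) -> int:
    # Route 53 publishes no usage metric for record count, so measure it
    # directly by paging through the zone's record sets.
    import boto3  # lazy import: the pure helper above needs no AWS access
    r53 = boto3.client("route53")
    paginator = r53.get_paginator("list_resource_record_sets")
    return sum(
        len(page["ResourceRecordSets"])
        for page in paginator.paginate(HostedZoneId=hosted_zone_id)
    )

# Example against the default quota of 10,000 records per hosted zone:
print(headroom(8500, 10000.0))  # (85.0, 1500) -> time to request an increase
```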

&lt;p&gt;I'm currently working through setting up real monitors for our infrastructure, and for each one we'll have a dedicated post walking through how it's done. So stay tuned!&lt;/p&gt;




&lt;p&gt;Curious about this, or want to chat about the things I've built? Message me in the community:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rhosys.ch/community" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Join the community&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>cloud</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
